Data Sources

Primary Data Source

Historia Para Gandules collects all data from a single Instagram account:

@historiaparagandules

The official Historia Para Gandules Instagram account containing historical educational content about Puerto Rico.

Account Overview

The @historiaparagandules Instagram account serves as the sole data source for this project. It publishes historical educational content focused on Puerto Rican history and culture.

Content Focus

Historical Education

Educational videos explaining historical events, figures, and cultural aspects of Puerto Rico.

Video Format

Content is primarily delivered through Instagram Reels: short-form vertical videos optimized for mobile viewing.

Accessible Language

Content uses colloquial and accessible language to make history engaging for a broad audience (“para gandules” = “for lazy people” in a playful, approachable way).

Content Types

The scraper specifically targets video posts (reels) from the account:

Video Posts (Reels)

Collected: Short-form videos containing historical narratives. The scraper filters for post.is_video == True to capture only video content.

Image Posts

Not Collected: Static image posts are filtered out. The current implementation focuses exclusively on video content.

Why Video-Only?

The project focuses on video content because:

Rich engagement data: Videos provide more metrics (views, duration)
Primary content format: Reels are the main content type for the account
Geospatial visualization: Video content is more suitable for the interactive timeline visualization
Consistent structure: Videos have standardized metadata fields

Data Collection Method

Data is collected using the Instaloader Python library, which reads publicly visible profile data from Instagram's web endpoints rather than through an official public API:

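import instaloader

# Create a loader instance; no login is performed
L = instaloader.Instaloader()
profile_name = "historiaparagandules"
profile = instaloader.Profile.from_username(L.context, profile_name)

# Filter for video posts only
for post in profile.get_posts():
    if post.is_video:
        # Extract video metadata; printing a few fields here as a placeholder
        print(post.date_utc, post.shortcode, post.likes)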

The scraper only accesses publicly available data. No authentication or login is required.

Available Metrics

For each video post, the scraper collects:

Temporal Data

Publication date and time: When the content was posted

Content Data

Caption/text: The description or narrative accompanying the video
Video URL: Direct link to the video file
Post URL: Permalink to the Instagram post

Engagement Metrics

Likes: Number of likes received
Comments: Number of comments
Views: Total video view count

Technical Metadata

Duration: Video length in seconds
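
As a rough illustration of how the fields above might be gathered into one record, the sketch below reads them from a post object. The attributes it reads (date_utc, caption, video_url, shortcode, likes, comments, video_view_count, video_duration) are those of Instaloader's Post class. The function name extract_video_record, the dictionary keys, the permalink format built from the shortcode, and the stand-in post values are assumptions made for illustration and may not match what scraping5.py produces.

```python
from datetime import datetime
from types import SimpleNamespace


def extract_video_record(post):
    """Collect the documented fields from a post-like object into a dict.

    The dictionary keys are hypothetical; the attributes read from `post`
    follow Instaloader's Post class.
    """
    return {
        "published_at": post.date_utc.isoformat(),
        "caption": post.caption or "",
        "video_url": post.video_url,
        # Assumed permalink format, built from the post shortcode.
        "post_url": f"https://www.instagram.com/p/{post.shortcode}/",
        "likes": post.likes,
        "comments": post.comments,
        "views": post.video_view_count,
        "duration_seconds": post.video_duration,
    }


# Stand-in object with made-up values so the sketch runs without network access.
example_post = SimpleNamespace(
    date_utc=datetime(2024, 1, 15, 18, 30),
    caption="Example caption",
    video_url="https://example.com/video.mp4",
    shortcode="ABC123",
    likes=120,
    comments=8,
    video_view_count=3400,
    video_duration=42.5,
)

record = extract_video_record(example_post)
```

With a real profile, the same function could be called on each post that passes the is_video check.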

Data Freshness

The scraping script collects every video post published since the account's inception. To update the dataset:

Run Scraper

Execute scraping5.py to collect the latest posts

Incremental Updates

The scraper iterates through all posts each time. For large accounts, consider implementing date-based filtering to collect only new posts.

Merge Data

Combine new data with existing datasets, removing duplicates based on post URLs
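
The incremental-update and merge steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes each record is a dictionary with a post_url key and an ISO-formatted published_at key (both key names are hypothetical), and the helper names filter_new_posts and merge_by_post_url are invented for this example.

```python
from datetime import datetime


def filter_new_posts(records, last_seen):
    """Keep only records published after `last_seen` (a datetime)."""
    return [
        r for r in records
        if datetime.fromisoformat(r["published_at"]) > last_seen
    ]


def merge_by_post_url(existing, new):
    """Merge two record lists, keeping one record per post URL.

    Records from `new` replace older copies, so refreshed metrics win.
    """
    merged = {r["post_url"]: r for r in existing}
    for r in new:
        merged[r["post_url"]] = r
    return list(merged.values())


# Made-up records for demonstration only.
existing = [
    {"post_url": "https://www.instagram.com/p/A/", "published_at": "2024-01-01T00:00:00", "likes": 10},
    {"post_url": "https://www.instagram.com/p/B/", "published_at": "2024-02-01T00:00:00", "likes": 20},
]
scraped = [
    {"post_url": "https://www.instagram.com/p/B/", "published_at": "2024-02-01T00:00:00", "likes": 25},
    {"post_url": "https://www.instagram.com/p/C/", "published_at": "2024-03-01T00:00:00", "likes": 5},
]

new_only = filter_new_posts(scraped, datetime(2024, 2, 15))
merged = merge_by_post_url(existing, scraped)
```

Letting the newer record win on a duplicate URL is a design choice: it keeps engagement metrics current, at the cost of overwriting the earlier snapshot.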

Data Limitations

Instagram’s API and data availability may change over time, affecting which fields are accessible.

Current Limitations

Public data only: Only publicly visible information is collected
No historical edits: Caption edits after publication are not tracked
Deleted posts: Posts deleted from Instagram will not appear in subsequent scrapes
Rate limiting: Instagram may throttle frequent scraping requests
View counts: May show as “No disponible” (Spanish for “not available”) for older posts or due to API restrictions
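
When analyzing the dataset, the missing view counts noted above need to be handled before any arithmetic. The sketch below shows one possible approach, assuming the views field may hold an integer, a numeric string, or a placeholder string such as "No disponible". The helper name normalize_view_count is hypothetical.

```python
def normalize_view_count(value):
    """Return the view count as an int, or None when it is unavailable."""
    if value is None:
        return None
    if isinstance(value, int):
        return value
    text = str(value).strip()
    if text.isdigit():
        return int(text)
    # Covers the "No disponible" placeholder and any other non-numeric text.
    return None


# Made-up values for demonstration only.
raw_views = [1500, "2300", "No disponible", None]
clean_views = [normalize_view_count(v) for v in raw_views]
```

Mapping unavailable counts to None rather than 0 keeps them from skewing averages or totals.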

Future Data Sources

Potential expansions for the data collection:

Additional Accounts

Incorporate related Puerto Rican historical education accounts

Cross-Platform

Expand to TikTok, YouTube, or other platforms where similar content exists

Manual Curation

Add manually curated historical data not available on social media

Geolocation Tags

Extract location data from posts that include geotags

Data Ethics

This project collects only publicly available data and respects Instagram’s terms of service. The data is used for educational and research purposes.

Ethical Considerations

Public content: Only publicly accessible posts are scraped
No personal data: User comments and personal information are not collected
Attribution: Content is attributed to the original creator
Non-commercial: Data is used for educational visualization purposes
Rate limiting: Scraping is performed responsibly to avoid overloading Instagram’s servers
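
One simple way to space out requests, in the spirit of the rate-limiting point above, is to pause between posts. The sketch below is an assumption about how that could look, not a description of what scraping5.py does. The helper name throttled and the 2-second default delay are illustrative choices.

```python
import time


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


# Hypothetical usage with Instaloader:
# for post in throttled(profile.get_posts()):
#     if post.is_video:
#         ...
```

Because the generator wraps any iterable, it can be added to an existing loop without changing the loop body.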

Next Steps

Scraping Guide

Data Schema
