Primary Data Source

Historia Para Gandules collects all data from a single Instagram account:

@historiaparagandules

The official Historia Para Gandules Instagram account containing historical educational content about Puerto Rico.

Account Overview

The @historiaparagandules Instagram account serves as the sole data source for this project. It publishes historical educational content focused on Puerto Rican history and culture.

Content Focus

Educational videos explaining historical events, figures, and cultural aspects of Puerto Rico.
Content is primarily delivered through Instagram Reels - short-form vertical videos optimized for mobile viewing.
Content uses colloquial, accessible language to make history engaging for a broad audience (“para gandules” roughly means “for lazy people”, used in a playful, approachable way).

Content Types

The scraper specifically targets video posts (reels) from the account:

Video Posts (Reels)

Collected: Short-form videos containing historical narratives. The scraper filters for post.is_video == True to capture only video content.

Image Posts

Not Collected: Static image posts are filtered out. The current implementation focuses exclusively on video content.

Why Video-Only?

The project focuses on video content because:
  • Rich engagement data: Videos expose metrics that image posts lack, such as view count and duration
  • Primary content format: Reels are the main content type for the account
  • Geospatial visualization: Video content is more suitable for the interactive timeline visualization
  • Consistent structure: Videos have standardized metadata fields

Data Collection Method

Data is collected using the Instaloader Python library, which reads publicly visible profile and post data from Instagram:
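import instaloader

# Create an anonymous (not logged-in) Instaloader session
L = instaloader.Instaloader()
profile_name = "historiaparagandules"
profile = instaloader.Profile.from_username(L.context, profile_name)

# Filter for video posts only
for post in profile.get_posts():
    if post.is_video:
        # Extract video metadata; the two fields printed here are only an example
        print(post.date_utc, post.shortcode)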
The scraper only accesses publicly available data. No authentication or login is required.

Available Metrics

For each video post, the scraper collects:

Temporal Data

  • Publication date and time: When the content was posted

Content Data

  • Caption/text: The description or narrative accompanying the video
  • Video URL: Direct link to the video file
  • Post URL: Permalink to the Instagram post

Engagement Metrics

  • Likes: Number of likes received
  • Comments: Number of comments
  • Views: Total video view count

Technical Metadata

  • Duration: Video length in seconds
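The fields listed above could be gathered into one flat record per video. The sketch below is an assumption about how that might look, not the project's actual scraping5.py code: the function name `post_to_record` and the dictionary keys are hypothetical, while the attribute names (`date_utc`, `caption`, `video_url`, `shortcode`, `likes`, `comments`, `video_view_count`, `video_duration`) follow Instaloader's `Post` class. It accepts any object exposing those attributes, so it can be tried without network access.

```python
from datetime import datetime
from types import SimpleNamespace


def post_to_record(post):
    """Flatten one video post into a dict (hypothetical key names)."""
    views = post.video_view_count
    return {
        "date_utc": post.date_utc.isoformat(),
        "caption": post.caption or "",
        "video_url": post.video_url,
        # Permalink built from the post's shortcode
        "post_url": f"https://www.instagram.com/p/{post.shortcode}/",
        "likes": post.likes,
        "comments": post.comments,
        # Fall back to the "No disponible" placeholder when no count is available
        "views": views if views is not None else "No disponible",
        "duration_seconds": post.video_duration,
    }


# Stand-in object for illustration; a real run would pass Instaloader Post objects
example = SimpleNamespace(
    date_utc=datetime(2024, 1, 15, 12, 30),
    caption="Ejemplo",
    video_url="https://example.com/video.mp4",
    shortcode="ABC123",
    likes=120,
    comments=8,
    video_view_count=None,
    video_duration=42.5,
)
record = post_to_record(example)
```

Keeping the function independent of the network makes it straightforward to check the record shape before running a full scrape.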

Data Freshness

The scraping script collects all historical posts from the account’s inception. To update the dataset:
  1. Run Scraper: Execute scraping5.py to collect the latest posts.
  2. Incremental Updates: The scraper iterates through all posts each time. For large accounts, consider implementing date-based filtering to collect only new posts.
  3. Merge Data: Combine new data with existing datasets, removing duplicates based on post URLs.
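The date-based filtering and URL-based merge described in these steps can be sketched as follows. This is a minimal illustration under stated assumptions: records are plain dicts with hypothetical `post_url` and `date_utc` (ISO 8601 string) keys, the helper names `only_newer` and `merge_by_url` are invented here, and letting the newer scrape win on a duplicate URL is a design choice rather than something the project specifies.

```python
from datetime import datetime


def only_newer(records, cutoff):
    """Keep records published strictly after `cutoff` (a datetime)."""
    return [r for r in records if datetime.fromisoformat(r["date_utc"]) > cutoff]


def merge_by_url(existing, new):
    """Combine two record lists, keeping one record per post URL.

    When a URL appears in both lists, the record from `new` replaces the
    old one so that engagement counts are refreshed.
    """
    merged = {r["post_url"]: r for r in existing}
    merged.update({r["post_url"]: r for r in new})
    return list(merged.values())


existing = [
    {"post_url": "https://www.instagram.com/p/AAA/", "date_utc": "2024-01-10T09:00:00", "likes": 10},
    {"post_url": "https://www.instagram.com/p/BBB/", "date_utc": "2024-02-01T09:00:00", "likes": 20},
]
scraped = [
    {"post_url": "https://www.instagram.com/p/BBB/", "date_utc": "2024-02-01T09:00:00", "likes": 25},
    {"post_url": "https://www.instagram.com/p/CCC/", "date_utc": "2024-03-05T09:00:00", "likes": 5},
]

cutoff = datetime(2024, 1, 31)
fresh = only_newer(scraped, cutoff)
dataset = merge_by_url(existing, fresh)
```

Choosing a cutoff slightly earlier than the last scrape re-collects a few recent posts, which lets the merge refresh their like and view counts.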

Data Limitations

Instagram’s API and data availability may change over time, affecting which fields are accessible.

Current Limitations

  1. Public data only: Only publicly visible information is collected
  2. No historical edits: Caption edits after publication are not tracked
  3. Deleted posts: Posts deleted from Instagram will not appear in subsequent scrapes
  4. Rate limiting: Instagram may throttle frequent scraping requests
  5. View counts: May show as “No disponible” (“Not available”) for older posts or due to API restrictions
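Because view counts may arrive as the “No disponible” placeholder instead of a number, analysis code probably needs to normalize them first. The sketch below shows one way to do that; the helper name `parse_views` is hypothetical, and treating every non-numeric value as missing is an assumption.

```python
def parse_views(value):
    """Return the view count as an int, or None when it is unavailable."""
    if value is None:
        return None
    if isinstance(value, int):
        return value
    text = str(value).strip()
    # Anything that is not a plain run of digits (e.g. "No disponible") is missing
    return int(text) if text.isdigit() else None


raw_views = [1500, "No disponible", "320", None]
clean_views = [v for v in map(parse_views, raw_views) if v is not None]
average_views = sum(clean_views) / len(clean_views)
```

Dropping missing values before averaging avoids counting unavailable posts as zero views.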

Future Data Sources

Potential expansions for the data collection:

Additional Accounts

Incorporate related Puerto Rican historical education accounts

Cross-Platform

Expand to TikTok, YouTube, or other platforms where similar content exists

Manual Curation

Add manually curated historical data not available on social media

Geolocation Tags

Extract location data from posts that include geotags

Data Ethics

This project collects only publicly available data and respects Instagram’s terms of service. The data is used for educational and research purposes.

Ethical Considerations

  • Public content: Only publicly accessible posts are scraped
  • No personal data: User comments and personal information are not collected
  • Attribution: Content is attributed to the original creator
  • Non-commercial: Data is used for educational visualization purposes
  • Rate limiting: Scraping is performed responsibly to avoid overloading Instagram’s servers
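One simple way to pace requests is to pause between items while iterating. The generator below only illustrates that idea and is not taken from the project; the name `throttled`, the two-second default, and the injectable `sleep` parameter are assumptions. Instaloader also has its own rate handling, so an extra delay like this may not be necessary.

```python
import time


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    first = True
    for item in items:
        if not first:
            sleep(delay_seconds)
        first = False
        yield item


# Record the pauses instead of actually sleeping, to show the behaviour
pauses = []
seen = list(throttled(["p1", "p2", "p3"], delay_seconds=2.0, sleep=pauses.append))
```

In a real run the default `time.sleep` would be used, for example `for post in throttled(profile.get_posts()): ...`.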

Next Steps

Scraping Guide

Learn how to run the Instagram scraper

Data Schema

Explore the complete data structure and field specifications
