
Overview

The Historia Para Gandules dataset follows a structured schema with 8 fields per video post. Data is collected from Instagram and stored in CSV format with UTF-8 encoding.

Schema Structure

The dataset is exported as a CSV file with the following headers:
Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,Duración del video (s),URL del Post
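Before processing a file, it can help to confirm that the header row matches the schema above. The following is a minimal sketch using only the standard library; the names EXPECTED_COLUMNS and missing_columns are illustrative and not part of the project.

```python
import csv
import io
from typing import List

# The eight headers listed in the schema above.
EXPECTED_COLUMNS = [
    "Fecha",
    "Texto del reel",
    "Likes",
    "Comentarios",
    "URL del video",
    "Visualizaciones",
    "Duración del video (s)",
    "URL del Post",
]

def missing_columns(csv_text: str) -> List[str]:
    """Return the expected columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in EXPECTED_COLUMNS if name not in header]

# A header with only two of the eight columns reports the other six.
print(missing_columns("Fecha,Likes\n"))
```

When reading from disk rather than from a string, opening the file with encoding="utf-8" and newline="" matches the documented encoding and lets the csv module handle line breaks inside captions.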

Field Specifications

Temporal Fields

Fecha
DateTime
required
Publication timestamp - Date and time when the video was posted to Instagram
  • Format: YYYY-MM-DD HH:MM:SS
  • Example: 2024-01-15 14:30:00
  • Source: post.date.strftime('%Y-%m-%d %H:%M:%S')
  • Timezone: UTC (from Instagram API)
fecha = post.date.strftime('%Y-%m-%d %H:%M:%S')
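Because the stored string carries no UTC offset, a reader has to reattach the timezone when loading the data. A minimal sketch, assuming the documented format and UTC timezone hold for every row (parse_fecha is an illustrative helper, not part of the scraper):

```python
from datetime import datetime, timezone

FECHA_FORMAT = "%Y-%m-%d %H:%M:%S"  # format documented for the Fecha column

def parse_fecha(value: str) -> datetime:
    """Parse a Fecha string and mark it as UTC, since the CSV stores no offset."""
    return datetime.strptime(value, FECHA_FORMAT).replace(tzinfo=timezone.utc)

parsed = parse_fecha("2024-01-15 14:30:00")
print(parsed.isoformat())  # 2024-01-15T14:30:00+00:00
```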

Content Fields

Texto del reel
String
Caption/description text - The text content accompanying the video
  • Max length: Variable (Instagram allows up to 2,200 characters)
  • Default value: "Sin texto" if no caption provided
  • Encoding: UTF-8
  • May contain: Hashtags, mentions, emojis, line breaks
texto = post.caption or "Sin texto"
This field often contains the historical narrative or educational content describing the video’s topic.
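Since captions may carry hashtags and the "Sin texto" placeholder, a small helper can separate the two cases before text analysis. This is a sketch under the assumption that a hashtag is "#" followed by word characters; caption_or_none and extract_hashtags are illustrative names.

```python
import re
from typing import List, Optional

SIN_TEXTO = "Sin texto"  # placeholder documented for posts without a caption
HASHTAG_RE = re.compile(r"#(\w+)")  # assumption: hashtags are "#" plus word characters

def caption_or_none(texto: str) -> Optional[str]:
    """Map the placeholder to None so it is not analysed as real text."""
    return None if texto == SIN_TEXTO else texto

def extract_hashtags(texto: str) -> List[str]:
    """Return hashtags without the leading '#', or an empty list for no caption."""
    caption = caption_or_none(texto)
    return HASHTAG_RE.findall(caption) if caption else []

print(extract_hashtags("La historia del Grito de Lares #HistoriaPR #PuertoRico"))
# ['HistoriaPR', 'PuertoRico']
```

A real caption that consists exactly of the words "Sin texto" cannot be told apart from the placeholder with this approach.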
URL del video
String (URL)
required
Direct video file URL - CDN link to the actual video file
  • Format: Full HTTPS URL to Instagram’s CDN
  • Default value: "Sin URL" if unavailable
  • Expiration: URLs may expire after a period of time
  • Usage: Download or stream the video content
url_video = post.video_url or "Sin URL"
Video URLs from Instagram CDN may expire. Download and store videos locally if long-term access is needed.
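One way to keep long-term access is to save each file under a name derived from the post permalink. The sketch below uses only the standard library; the ".mp4" extension, the folder layout, and the helper names are assumptions, and the download will fail if the CDN link has already expired.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional

SIN_URL = "Sin URL"  # placeholder documented for unavailable video URLs

def target_path(url_post: str, folder: Path) -> Path:
    """Build a local file name from the shortcode at the end of the permalink."""
    shortcode = url_post.rstrip("/").rsplit("/", 1)[-1]
    return folder / f"{shortcode}.mp4"  # assumption: videos are stored as .mp4

def download_video(url_video: str, url_post: str, folder: Path) -> Optional[Path]:
    """Download one video, or return None when the row has no usable URL."""
    if url_video == SIN_URL:
        return None
    folder.mkdir(parents=True, exist_ok=True)
    destination = target_path(url_post, folder)
    with urllib.request.urlopen(url_video, timeout=30) as response:
        with open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
    return destination
```

Calling download_video(row["URL del video"], row["URL del Post"], Path("videos")) for each row soon after scraping reduces the chance of hitting an expired link.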
URL del Post
String (URL)
required
Instagram post permalink - Permanent link to the Instagram post
  • Format: https://www.instagram.com/p/{shortcode}/
  • Example: https://www.instagram.com/p/ABC123xyz/
  • Uniqueness: Unique identifier for each post
  • Permanence: Stable link unless post is deleted
url_post = f"https://www.instagram.com/p/{post.shortcode}/"
Use this field as the primary key to identify unique posts and avoid duplicates.
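As a rough illustration of using the permalink as a key, the sketch below extracts the shortcode and drops repeated rows. It assumes rows are dictionaries keyed by the CSV headers and keeps the first occurrence of each post; both choices are illustrative.

```python
import re
from typing import Dict, List, Optional

# Matches the documented permalink format and captures the shortcode.
POST_URL_RE = re.compile(r"^https://www\.instagram\.com/p/([^/]+)/$")

def shortcode_from_url(url: str) -> Optional[str]:
    """Return the shortcode, or None if the URL does not match the format."""
    match = POST_URL_RE.match(url)
    return match.group(1) if match else None

def deduplicate(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row seen for each URL del Post."""
    seen = set()
    unique = []
    for row in rows:
        key = row["URL del Post"]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

print(shortcode_from_url("https://www.instagram.com/p/ABC123xyz/"))  # ABC123xyz
```

If the same post is scraped more than once, keeping the last occurrence instead would preserve the most recent like and comment counts.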

Engagement Fields

Likes
Integer
default:"0"
Like count - Number of likes the video has received
  • Range: 0 to unlimited
  • Default value: 0 if unavailable
  • Type: Non-negative integer
  • Note: Count is at the time of scraping (may increase over time)
likes = post.likes or 0
Comentarios
Integer
default:"0"
Comment count - Number of comments on the video
  • Range: 0 to unlimited
  • Default value: 0 if unavailable
  • Type: Non-negative integer
  • Note: Count reflects total comments at scraping time
  • Limitation: Individual comment text is not collected
comentarios = post.comments or 0
Visualizaciones
Integer or String
View count - Total number of times the video has been viewed
  • Range: 0 to unlimited when available
  • Default value: "No disponible" if Instagram doesn’t provide the data
  • Type: Integer (or string for unavailable cases)
  • Accuracy: Reflects Instagram’s own view-counting methodology
visualizaciones = post.video_view_count or "No disponible"
View counts may be unavailable for some posts because of Instagram’s API restrictions or account settings. Because the expression above uses or, a genuine count of 0 is also stored as "No disponible".

Technical Fields

Duración del video (s)
Float or String
Video duration - Length of the video in seconds
  • Unit: Seconds (with decimal precision)
  • Example: 45.5 for a 45.5-second video
  • Default value: "No disponible" if unavailable
  • Type: Float (or string for unavailable cases)
  • Typical range: 0 to 90 seconds (Instagram Reels limit)
duracion_video = post.video_duration or "No disponible"
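Visualizaciones and Duración del video (s) both mix numbers with the "No disponible" placeholder, so values read from the CSV arrive as strings and need explicit conversion. A minimal sketch, with parse_optional as an illustrative helper name:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
NO_DISPONIBLE = "No disponible"  # placeholder documented for unavailable values

def parse_optional(value: str, cast: Callable[[str], T]) -> Optional[T]:
    """Return None for the placeholder, otherwise the value converted with cast."""
    if value.strip() == NO_DISPONIBLE:
        return None
    return cast(value)

views = parse_optional("15000", int)
duration = parse_optional("45.5", float)
missing = parse_optional("No disponible", float)
print(views, duration, missing)  # 15000 45.5 None
```

Returning None keeps unavailable values out of sums and averages instead of letting them count as zero.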

Complete Example

Here’s a complete record with all fields populated:
Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,Duración del video (s),URL del Post
2024-01-15 14:30:00,"🇵🇷 La historia del Grito de Lares - el primer intento de independencia de Puerto Rico en 1868. #HistoriaPR #PuertoRico #Historia",1250,45,https://instagram.com/cdn/video.mp4,15000,45.5,https://www.instagram.com/p/ABC123xyz/
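The record above can be read with the standard csv module, which respects the quotes around the caption. This sketch embeds the example as a string for illustration; note that every value comes back as text until it is converted.

```python
import csv
import io

# The header and record from the example above, embedded for illustration.
SAMPLE = (
    "Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,"
    "Duración del video (s),URL del Post\n"
    '2024-01-15 14:30:00,"🇵🇷 La historia del Grito de Lares - el primer intento '
    'de independencia de Puerto Rico en 1868. #HistoriaPR #PuertoRico #Historia",'
    "1250,45,https://instagram.com/cdn/video.mp4,15000,45.5,"
    "https://www.instagram.com/p/ABC123xyz/\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE))
record = next(reader)

print(len(reader.fieldnames))   # 8
print(record["Likes"])          # 1250 (still a string at this point)
print(record["URL del Post"])   # https://www.instagram.com/p/ABC123xyz/
```

Quoting matters most for captions that contain commas or line breaks, which would otherwise split a record across columns or rows.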

Data Types Reference

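Field Type Summary
  • Fecha: DateTime (YYYY-MM-DD HH:MM:SS, UTC)
  • Texto del reel: String
  • Likes: Integer
  • Comentarios: Integer
  • URL del video: String (URL)
  • Visualizaciones: Integer or String
  • Duración del video (s): Float or String
  • URL del Post: String (URL)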

Data Validation

When working with the collected data, apply these validation rules:
import pandas as pd
from datetime import datetime

def is_non_negative_number(value):
    """Return True if value can be read as a number >= 0.

    Columns that mix numbers with "No disponible" are loaded as strings,
    so the value is converted with float() instead of checked with isinstance().
    """
    try:
        return float(value) >= 0
    except (TypeError, ValueError):
        return False

def validate_record(row):
    """Validate a single data record and return a list of error messages"""
    errors = []

    # Validate Fecha
    try:
        datetime.strptime(str(row['Fecha']), '%Y-%m-%d %H:%M:%S')
    except ValueError:
        errors.append("Invalid date format")

    # Validate numeric fields
    if not is_non_negative_number(row['Likes']):
        errors.append("Invalid likes count")

    if not is_non_negative_number(row['Comentarios']):
        errors.append("Invalid comments count")

    # Validate URLs (str() guards against NaN in empty cells)
    if not str(row['URL del Post']).startswith('https://www.instagram.com/p/'):
        errors.append("Invalid post URL")

    # Validate duration unless it is marked as unavailable
    if row['Duración del video (s)'] != "No disponible":
        if not is_non_negative_number(row['Duración del video (s)']):
            errors.append("Invalid duration")

    return errors

# Load and validate CSV
df = pd.read_csv('informacion_reels_simple.csv')
for idx, row in df.iterrows():
    errors = validate_record(row)
    if errors:
        print(f"Row {idx}: {errors}")

Missing Data Handling

Some fields may contain default values when data is unavailable:
"Sin texto" (Texto del reel)
  • Reason: Post was published without a caption
  • Handling: Treat as empty string or null in analysis
  • Frequency: Rare (most posts have captions)
"Sin URL" (URL del video)
  • Reason: Video URL is not accessible
  • Handling: Skip video download, use only metadata
  • Frequency: Very rare
"No disponible" (Visualizaciones)
  • Reason: Instagram API doesn’t provide view count for this post
  • Handling: Exclude from view-based analysis or impute based on likes
  • Frequency: Can occur for older posts or due to API changes
"No disponible" (Duración del video (s))
  • Reason: Video duration metadata is missing
  • Handling: Download video to calculate duration locally
  • Frequency: Rare
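With pandas, the placeholders can be turned into missing values at load time through the na_values parameter of read_csv. The sketch below uses two illustrative rows (not real data) held in a string; the PLACEHOLDERS mapping follows the default values documented for each field.

```python
import io
import pandas as pd

# Two illustrative rows (not real data); the second uses every documented placeholder.
CSV_TEXT = """Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,Duración del video (s),URL del Post
2024-01-15 14:30:00,Ejemplo,1250,45,https://instagram.com/cdn/video.mp4,15000,45.5,https://www.instagram.com/p/ABC123xyz/
2024-01-16 09:00:00,Sin texto,10,0,Sin URL,No disponible,No disponible,https://www.instagram.com/p/DEF456uvw/
"""

# Per-column placeholders that should be read as missing values.
PLACEHOLDERS = {
    "Texto del reel": ["Sin texto"],
    "URL del video": ["Sin URL"],
    "Visualizaciones": ["No disponible"],
    "Duración del video (s)": ["No disponible"],
}

df = pd.read_csv(io.StringIO(CSV_TEXT), na_values=PLACEHOLDERS)

# Count missing values per column after the placeholders are converted.
print(df.isna().sum())
```

Loading this way also makes Visualizaciones and Duración del video (s) numeric columns, so rows with missing views can be excluded with dropna before view-based analysis.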

Data Processing Pipeline

Typical workflow for working with the collected data:
1. Collection: Run scraping5.py to generate informacion_reels_simple.csv
2. Validation: Validate data types, check for missing values, verify URL formats
3. Cleaning: Handle missing data, remove duplicates, normalize text encoding
4. Enrichment: Add calculated fields (e.g., engagement rate, posting day/time)
5. Transformation: Convert to desired format (JSON, database, etc.) for visualization
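The enrichment and transformation steps could look roughly like the sketch below, which adds posting day and hour to each row and serialises the result as JSON. The field names weekday and hour and the helpers enrich and to_json are illustrative assumptions, not part of the project.

```python
import json
from datetime import datetime
from typing import Dict, List

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def enrich(row: Dict[str, str]) -> Dict[str, object]:
    """Copy a row and add the posting weekday and hour derived from Fecha."""
    moment = datetime.strptime(row["Fecha"], "%Y-%m-%d %H:%M:%S")
    enriched: Dict[str, object] = dict(row)
    enriched["weekday"] = WEEKDAYS[moment.weekday()]  # weekday() is 0 for Monday
    enriched["hour"] = moment.hour
    return enriched

def to_json(rows: List[Dict[str, str]]) -> str:
    """Serialise enriched rows; ensure_ascii=False keeps accents and emojis readable."""
    return json.dumps([enrich(row) for row in rows], ensure_ascii=False, indent=2)

print(to_json([{"Fecha": "2024-01-15 14:30:00", "Likes": "1250"}]))
```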

Example Data Transformations

import pandas as pd

df = pd.read_csv('informacion_reels_simple.csv')

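# Calculate engagement rate
# Visualizaciones loads as text when any row holds "No disponible",
# so convert it to numbers first; unavailable values become 0
views = pd.to_numeric(df['Visualizaciones'], errors='coerce').fillna(0)
df['engagement_rate'] = (df['Likes'] + df['Comentarios']) / views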

# Handle division by zero
df['engagement_rate'] = df['engagement_rate'].replace([float('inf'), -float('inf')], 0)

Schema Evolution

As the project evolves, the schema may be extended with additional fields:

Potential Future Fields

  • Geolocation data (if posts are tagged)
  • Mentioned historical periods
  • Topics/categories (manual or AI tagging)
  • Sentiment analysis scores

Backward Compatibility

  • New fields will be added as optional columns
  • Existing fields will maintain their format
  • Legacy CSV files will remain importable
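A reader that tolerates both legacy and extended files might treat the eight current columns as mandatory and any later additions as optional. In the sketch below, "Categoria" is a hypothetical future column used only for illustration, and the sample row is not real data.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional

CORE_COLUMNS = [
    "Fecha", "Texto del reel", "Likes", "Comentarios",
    "URL del video", "Visualizaciones", "Duración del video (s)", "URL del Post",
]

def read_rows(csv_text: str,
              optional_columns: Iterable[str] = ()) -> List[Dict[str, Optional[str]]]:
    """Read core columns strictly and optional columns leniently."""
    optional = list(optional_columns)
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # A missing core column raises KeyError, which flags a broken file early.
        row: Dict[str, Optional[str]] = {name: raw[name] for name in CORE_COLUMNS}
        for name in optional:
            row[name] = raw.get(name)  # None when a legacy file lacks the column
        rows.append(row)
    return rows

# Illustrative legacy file with only the eight current columns.
LEGACY_CSV = (
    ",".join(CORE_COLUMNS) + "\n"
    "2024-01-15 14:30:00,Ejemplo,1250,45,https://instagram.com/cdn/video.mp4,"
    "15000,45.5,https://www.instagram.com/p/ABC123xyz/\n"
)

rows = read_rows(LEGACY_CSV, optional_columns=["Categoria"])
print(rows[0]["Categoria"])  # None
```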

Next Steps

Scraping Guide

Learn how to collect data using the scraper

Data Sources

Understand where the data comes from
