Data Schema

Overview

The Historia Para Gandules dataset follows a structured schema with 8 fields per video post. Data is collected from Instagram and stored in CSV format with UTF-8 encoding.

Schema Structure

The dataset is exported as a CSV file with the following headers:

Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,Duración del video (s),URL del Post

Field Specifications

Temporal Fields

Fecha

DateTime

required

Publication timestamp - Date and time when the video was posted to Instagram

Format: YYYY-MM-DD HH:MM:SS
Example: 2024-01-15 14:30:00
Source: post.date.strftime('%Y-%m-%d %H:%M:%S')
Timezone: UTC (from Instagram API)

fecha = post.date.strftime('%Y-%m-%d %H:%M:%S')
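
When reading the CSV back, the stored string carries no timezone marker, so it can help to reattach UTC explicitly. The following is a minimal sketch, assuming the values are UTC as documented above; `parse_fecha` and `FECHA_FORMAT` are hypothetical names used only for illustration.

```python
from datetime import datetime, timezone

# Same format string the scraper uses when writing Fecha
FECHA_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_fecha(value: str) -> datetime:
    """Parse a Fecha string and mark it as UTC (assumption: stored values are UTC)."""
    naive = datetime.strptime(value, FECHA_FORMAT)
    return naive.replace(tzinfo=timezone.utc)

parsed = parse_fecha("2024-01-15 14:30:00")
print(parsed.isoformat())  # 2024-01-15T14:30:00+00:00
```

A timezone-aware value can then be converted to a local zone with `astimezone()` without guessing the original offset.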

Content Fields

Texto del reel

String

Caption/description text - The text content accompanying the video

Max length: Variable (Instagram allows up to 2,200 characters)
Default value: "Sin texto" if no caption provided
Encoding: UTF-8
May contain: Hashtags, mentions, emojis, line breaks

texto = post.caption or "Sin texto"

This field often contains the historical narrative or educational content describing the video’s topic.

URL del video

String (URL)

required

Direct video file URL - CDN link to the actual video file

Format: Full HTTPS URL to Instagram’s CDN
Default value: "Sin URL" if unavailable
Expiration: URLs may expire after a period of time
Usage: Download or stream the video content

url_video = post.video_url or "Sin URL"

Video URLs from Instagram CDN may expire. Download and store videos locally if long-term access is needed.
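
One possible way to keep a local copy is sketched below. It is not part of the scraper: the helper names, the `videos` directory, and the `.mp4` extension are all assumptions made for illustration. The file is named after the post shortcode so it stays linked to the `URL del Post` field.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def shortcode_from_post_url(post_url: str) -> str:
    """Extract the shortcode from a permalink of the form .../p/{shortcode}/."""
    parts = [part for part in urlparse(post_url).path.split("/") if part]
    if len(parts) != 2 or parts[0] != "p":
        raise ValueError(f"Not a post permalink: {post_url}")
    return parts[1]

def local_video_path(post_url: str, directory: str = "videos") -> Path:
    """Build the local file path for a post (the .mp4 extension is an assumption)."""
    return Path(directory) / f"{shortcode_from_post_url(post_url)}.mp4"

def download_video(video_url: str, post_url: str, directory: str = "videos"):
    """Download the video once; return its path, or None when no URL was recorded."""
    if video_url == "Sin URL":
        return None
    target = local_video_path(post_url, directory)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Raises an error from urllib if the CDN link has already expired
        urlretrieve(video_url, str(target))
    return target
```

Calling `download_video(row['URL del video'], row['URL del Post'])` soon after scraping reduces the chance of hitting an expired link.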

URL del Post

String (URL)

required

Instagram post permalink - Permanent link to the Instagram post

Format: https://www.instagram.com/p/{shortcode}/
Example: https://www.instagram.com/p/ABC123xyz/
Uniqueness: Unique identifier for each post
Permanence: Stable link unless post is deleted

url_post = f"https://www.instagram.com/p/{post.shortcode}/"

Use this field as the primary key to identify unique posts and avoid duplicates.
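
As a rough sketch of that deduplication, assuming the data is loaded into pandas, `drop_duplicates` can be keyed on the permalink. The function name and the sample rows (including the second shortcode and the counts) are made up for illustration.

```python
import pandas as pd

def drop_duplicate_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the last row seen for each post permalink."""
    return df.drop_duplicates(subset="URL del Post", keep="last").reset_index(drop=True)

# Illustrative rows: the first and third refer to the same post
sample = pd.DataFrame({
    "URL del Post": [
        "https://www.instagram.com/p/ABC123xyz/",
        "https://www.instagram.com/p/DEF456uvw/",
        "https://www.instagram.com/p/ABC123xyz/",
    ],
    "Likes": [1250, 300, 1300],
})

deduped = drop_duplicate_posts(sample)
print(deduped)
```

Keeping the last occurrence favors the most recent scrape when the same post appears more than once, which suits counts that grow over time.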

Engagement Fields

Likes

Integer

default: 0

Like count - Number of likes the video has received

Range: 0 to unlimited
Default value: 0 if unavailable
Type: Non-negative integer
Note: Count is at the time of scraping (may increase over time)

likes = post.likes or 0

Comentarios

Integer

default: 0

Comment count - Number of comments on the video

Range: 0 to unlimited
Default value: 0 if unavailable
Type: Non-negative integer
Note: Count reflects total comments at scraping time
Limitation: Individual comment text is not collected

comentarios = post.comments or 0

Visualizaciones

Integer or String

View count - Total number of times the video has been viewed

Range: 0 to unlimited when available
Default value: "No disponible" if Instagram doesn’t provide the data
Type: Integer (or string for unavailable cases)
Accuracy: Reflects Instagram’s own view-counting methodology, which is not documented here

visualizaciones = post.video_view_count or "No disponible"

View counts may not be available for all posts, depending on Instagram’s API restrictions or account settings. Because the expression above uses `or`, a view count of 0 is also stored as "No disponible".

Technical Fields

Duración del video (s)

Float or String

Video duration - Length of the video in seconds

Unit: Seconds (with decimal precision)
Example: 45.5 for a 45.5-second video
Default value: "No disponible" if unavailable
Type: Float (or string for unavailable cases)
Typical range: 0 to 90 seconds (the Reels limit assumed here; Instagram has changed this limit over time)

duracion_video = post.video_duration or "No disponible"

Complete Example

Here’s a complete record with all fields populated:

Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,Duración del video (s),URL del Post
2024-01-15 14:30:00,"🇵🇷 La historia del Grito de Lares - el primer intento de independencia de Puerto Rico en 1868. #HistoriaPR #PuertoRico #Historia",1250,45,https://instagram.com/cdn/video.mp4,15000,45.5,https://www.instagram.com/p/ABC123xyz/
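
To illustrate how this record is read back, here is a small sketch using Python's standard `csv` module. The header and row are copied from the example above. Every value arrives as a string, so numeric fields are converted explicitly.

```python
import csv
import io

# Header and record copied from the complete example
csv_text = (
    "Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,"
    "Duración del video (s),URL del Post\n"
    '2024-01-15 14:30:00,"🇵🇷 La historia del Grito de Lares - el primer intento de '
    'independencia de Puerto Rico en 1868. #HistoriaPR #PuertoRico #Historia",'
    "1250,45,https://instagram.com/cdn/video.mp4,15000,45.5,"
    "https://www.instagram.com/p/ABC123xyz/\n"
)

# DictReader maps each value to its column header and strips the CSV quoting
record = next(csv.DictReader(io.StringIO(csv_text)))

# The csv module returns strings only; convert numeric fields by hand
likes = int(record["Likes"])
duration = float(record["Duración del video (s)"])

for field, value in record.items():
    print(f"{field}: {value!r}")
```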

Data Types Reference

Field Type Summary

| Field | Python Type | CSV Storage | Notes |
|---|---|---|---|
| Fecha | `datetime` | `string` | Formatted as YYYY-MM-DD HH:MM:SS |
| Texto del reel | `str` | `string` | UTF-8 encoded, quoted if it contains commas |
| Likes | `int` | `integer` | Always numeric |
| Comentarios | `int` | `integer` | Always numeric |
| URL del video | `str` | `string` | Full URL or “Sin URL” |
| Visualizaciones | `int` or `str` | `mixed` | Integer or “No disponible” |
| Duración del video (s) | `float` or `str` | `mixed` | Float or “No disponible” |
| URL del Post | `str` | `string` | Always formatted as permalink |
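
The mapping above can be applied after loading with pandas. The following is a sketch under stated assumptions, not the project's own loader: `apply_schema_types` is a hypothetical helper, the second sample row is invented, and nullable `Int64` is one possible choice for representing unavailable counts.

```python
import io
import pandas as pd

def apply_schema_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with columns converted to the documented types."""
    out = df.copy()
    out["Fecha"] = pd.to_datetime(out["Fecha"], format="%Y-%m-%d %H:%M:%S", utc=True)
    # errors="coerce" turns "No disponible" into a missing value
    for column in ("Likes", "Comentarios", "Visualizaciones"):
        out[column] = pd.to_numeric(out[column], errors="coerce").astype("Int64")
    out["Duración del video (s)"] = pd.to_numeric(
        out["Duración del video (s)"], errors="coerce"
    )
    return out

# Illustrative data: the second row is invented to show the unavailable case
csv_text = (
    "Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,"
    "Duración del video (s),URL del Post\n"
    "2024-01-15 14:30:00,Sin texto,1250,45,Sin URL,15000,45.5,"
    "https://www.instagram.com/p/ABC123xyz/\n"
    "2024-01-16 09:00:00,Sin texto,10,0,Sin URL,No disponible,No disponible,"
    "https://www.instagram.com/p/DEF456uvw/\n"
)

typed = apply_schema_types(pd.read_csv(io.StringIO(csv_text)))
print(typed.dtypes)
```

Converting the mixed columns once at load time means later arithmetic does not need to special-case the placeholder strings.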

Data Validation

When working with the collected data, apply these validation rules:

import pandas as pd
from datetime import datetime

def validate_record(row):
    """Validate a single record and return a list of error messages."""
    errors = []

    # Validate Fecha (must match YYYY-MM-DD HH:MM:SS)
    try:
        datetime.strptime(str(row['Fecha']), '%Y-%m-%d %H:%M:%S')
    except ValueError:
        errors.append("Invalid date format")

    # Validate numeric fields. pd.to_numeric accepts NumPy numbers and
    # numeric strings, and returns NaN for anything else.
    for field, label in (('Likes', 'likes'), ('Comentarios', 'comments')):
        value = pd.to_numeric(row[field], errors='coerce')
        if pd.isna(value) or value < 0:
            errors.append(f"Invalid {label} count")

    # Validate URLs
    if not str(row['URL del Post']).startswith('https://www.instagram.com/p/'):
        errors.append("Invalid post URL")

    # Validate duration unless it is marked unavailable. When any row holds
    # "No disponible", pandas reads the whole column as strings.
    duration = row['Duración del video (s)']
    if duration != "No disponible":
        value = pd.to_numeric(duration, errors='coerce')
        if pd.isna(value) or value < 0:
            errors.append("Invalid duration")

    return errors

# Load and validate CSV
df = pd.read_csv('informacion_reels_simple.csv')
for idx, row in df.iterrows():
    errors = validate_record(row)
    if errors:
        print(f"Row {idx}: {errors}")

Missing Data Handling

Some fields may contain default values when data is unavailable:

Texto del reel: 'Sin texto'

Reason: Post was published without a caption
Handling: Treat as empty string or null in analysis
Frequency: Rare (most posts have captions)

URL del video: 'Sin URL'

Reason: Video URL is not accessible (rare)
Handling: Skip video download, use only metadata
Frequency: Very rare

Visualizaciones: 'No disponible'

Reason: Instagram API doesn’t provide view count for this post
Handling: Exclude from view-based analysis or impute based on likes
Frequency: Can occur for older posts or due to API changes

Duración del video (s): 'No disponible'

Reason: Video duration metadata is missing
Handling: Download video to calculate duration locally
Frequency: Rare
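
To check how often these placeholders actually occur in a given export, a small count per column can help. The sketch below is illustrative only: `PLACEHOLDERS`, `count_placeholders`, and the sample rows are not part of the project.

```python
import pandas as pd

# Placeholder value documented for each field
PLACEHOLDERS = {
    "Texto del reel": "Sin texto",
    "URL del video": "Sin URL",
    "Visualizaciones": "No disponible",
    "Duración del video (s)": "No disponible",
}

def count_placeholders(df: pd.DataFrame) -> dict:
    """Return {column: number of rows holding that column's placeholder}."""
    return {
        column: int((df[column].astype(str) == marker).sum())
        for column, marker in PLACEHOLDERS.items()
    }

# Made-up rows covering each placeholder at least once
sample = pd.DataFrame({
    "Texto del reel": ["Sin texto", "Caption A", "Caption B"],
    "URL del video": ["https://instagram.com/cdn/video.mp4", "Sin URL",
                      "https://instagram.com/cdn/video.mp4"],
    "Visualizaciones": [15000, "No disponible", "No disponible"],
    "Duración del video (s)": [45.5, 30.0, "No disponible"],
})

print(count_placeholders(sample))
```

Comparing the strings after `astype(str)` keeps the check working whether pandas read a column as numbers or as text.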

Data Processing Pipeline

Typical workflow for working with the collected data:

Collection

Run scraping5.py to generate informacion_reels_simple.csv

Validation

Validate data types, check for missing values, verify URL formats

Cleaning

Handle missing data, remove duplicates, normalize text encoding

Enrichment

Add calculated fields (e.g., engagement rate, posting day/time)

Transformation

Convert to desired format (JSON, database, etc.) for visualization

Example Data Transformations

Calculate Engagement Rate

import pandas as pd

df = pd.read_csv('informacion_reels_simple.csv')

# "No disponible" makes the column non-numeric, so coerce it to NaN first
views = pd.to_numeric(df['Visualizaciones'], errors='coerce')

# Calculate engagement rate; masking zero views avoids division by zero
df['engagement_rate'] = (df['Likes'] + df['Comentarios']) / views.where(views > 0)

# Rows without a usable view count get an engagement rate of 0
df['engagement_rate'] = df['engagement_rate'].fillna(0)

Extract Hashtags

import re

def extract_hashtags(text):
    """Extract hashtags from caption text"""
    # Missing captions may be read as NaN (a float) or as the placeholder
    if not isinstance(text, str) or text == "Sin texto":
        return []
    return re.findall(r'#\w+', text)

df['hashtags'] = df['Texto del reel'].apply(extract_hashtags)

Convert to JSON

import pandas as pd
import json

df = pd.read_csv('informacion_reels_simple.csv')

# Convert to records format
records = df.to_dict('records')

# Save as JSON
with open('posts.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

Time Series Analysis

import pandas as pd

df = pd.read_csv('informacion_reels_simple.csv')

# Convert to datetime
df['Fecha'] = pd.to_datetime(df['Fecha'])

# Extract temporal features
df['year'] = df['Fecha'].dt.year
df['month'] = df['Fecha'].dt.month
df['day_of_week'] = df['Fecha'].dt.day_name()
df['hour'] = df['Fecha'].dt.hour

# Group by month
monthly_posts = df.groupby(['year', 'month']).size()

Schema Evolution

As the project evolves, the schema may be extended with additional fields:

Potential Future Fields

Geolocation data (if posts are tagged)
Mentioned historical periods
Topics/categories (manual or AI tagging)
Sentiment analysis scores

Backward Compatibility

New fields will be added as optional columns
Existing fields will maintain their format
Legacy CSV files will remain importable
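
One way a reader could stay compatible with both legacy and extended files is sketched below. It is an illustration under assumptions, not a committed design: `load_posts` is a hypothetical helper and `Tema` is an invented name for a possible future optional column.

```python
import io
import pandas as pd

# The eight columns of the current schema
REQUIRED_COLUMNS = [
    "Fecha", "Texto del reel", "Likes", "Comentarios", "URL del video",
    "Visualizaciones", "Duración del video (s)", "URL del Post",
]

def load_posts(source, optional_columns=("Tema",)):
    """Load a CSV, require the current columns, and add absent optional ones as empty."""
    df = pd.read_csv(source)
    missing = [column for column in REQUIRED_COLUMNS if column not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    extra = [column for column in optional_columns if column not in df.columns]
    # reindex adds the new columns filled with NaN and keeps existing data intact
    return df.reindex(columns=list(df.columns) + extra)

# A legacy file that predates the hypothetical "Tema" column
legacy_csv = io.StringIO(
    "Fecha,Texto del reel,Likes,Comentarios,URL del video,Visualizaciones,"
    "Duración del video (s),URL del Post\n"
    "2024-01-15 14:30:00,Sin texto,1250,45,Sin URL,15000,45.5,"
    "https://www.instagram.com/p/ABC123xyz/\n"
)

posts = load_posts(legacy_csv)
print(posts.columns.tolist())
```

Treating new fields as optional at load time lets the same analysis code run on files produced before and after a schema extension.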

Next Steps

Scraping Guide

Learn how to collect data using the scraper

Data Sources

Understand where the data comes from
