Overview
The Historia Para Gandules data processing pipeline transforms raw scraped Instagram data into enriched, analyzable datasets. The pipeline consists of three main stages:- Data Collection - Scraping Instagram content
- Data Enrichment - Categorization and title generation using LLMs
- Data Analysis - Statistical analysis and visualization
Pipeline Architecture
Data Flow
1. Raw Data Collection
The initial dataset contains scraped Instagram reel data with the following fields:- Fecha - Publication date
- Texto del reel - Post caption text
- Likes - Number of likes
- Comentarios - Number of comments
- Visualizaciones - View count
- Duración del video (s) - Video duration in seconds
- Localización - Geographic coordinates (latitude, longitude)
- URL del Post - Instagram post URL
- URL del video - Video file URL
- URL de imagen - Thumbnail image URL
2. Text Preprocessing
Before LLM enrichment, the text data undergoes cleaning:3. LLM-Powered Enrichment
The cleaned text is processed through OpenAI’s GPT-3.5-turbo for:- Categorization - Classifying content into historical themes
- Title Generation - Creating engaging social media titles
4. Analysis & Visualization
The enriched dataset enables:- Statistical analysis of engagement metrics
- Category-wise performance comparison
- Geographic distribution analysis
- Temporal trend identification
Data Format
Input Format
Raw data is stored in Excel format (excel_info_1.xlsx, excel26deenero.xlsx) with 121 rows representing historical content posts.
Output Format
Enriched data includes additional fields:- Texto limpio - Cleaned text without emojis
- Hashtags - Extracted hashtag list
- Categoria - AI-assigned category
- Titulo - AI-generated engaging title
Processing Statistics
Dataset Size: 121 Instagram reelsDate Range: February 2024 - December 2024Processing Time: ~2-3 minutes for full enrichment
Key Metrics
| Metric | Mean | Std | Min | Max |
|---|---|---|---|---|
| Likes | 1,316 | 1,930 | 304 | 14,659 |
| Comentarios | 39 | 49 | 3 | 361 |
| Visualizaciones | 15,392 | 39,250 | 2,277 | 337,001 |
| Duración (s) | 50.1 | 18.2 | 26.0 | 133.5 |
Usage Example
Here’s how the complete pipeline is executed:Next Steps
Geolocation Processing
Learn how coordinates are extracted and parsed
Data Enrichment
Understand the LLM categorization process