
Overview

This page provides a detailed statistical analysis of the Historia Para Gandules dataset, examining correlations between metrics, distribution patterns, and statistical significance of findings.

Descriptive Statistics

Summary Statistics for 121 Videos

import pandas as pd

# Load data
df = pd.read_excel('excel26deenero.xlsx')

# Calculate descriptive statistics
descriptive_stats = df[[
    'Likes', 
    'Comentarios', 
    'Visualizaciones', 
    'Duración del video (s)'
]].describe()

print("Estadísticas Descriptivas:")
print(descriptive_stats)

Results

| Statistic | Likes | Comments | Views | Duration (s) |
|---|---|---|---|---|
| Count | 121 | 121 | 121 | 121 |
| Mean | 1,316.37 | 39.01 | 15,391.88 | 50.08 |
| Std Dev | 1,930.32 | 48.72 | 39,250.14 | 18.22 |
| Min | 304 | 3 | 2,277 | 26 |
| 25% | 640 | 18 | 4,926 | 38.13 |
| 50% (Median) | 828 | 27 | 6,294 | 45.90 |
| 75% | 1,250 | 39 | 10,244 | 56.20 |
| Max | 14,659 | 361 | 337,001 | 133.49 |

The high standard deviation relative to mean values indicates significant variability in engagement, with some videos achieving viral status while others maintain steady baseline performance.
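
The variability claim can be made precise with the coefficient of variation (standard deviation divided by the mean); a minimal sketch using illustrative values drawn from the summary table above, not the full dataset:

```python
import pandas as pd

# Illustrative sample only -- a handful of values in the range reported above
sample = pd.DataFrame({
    'Likes': [304, 640, 828, 1250, 14659],
    'Visualizaciones': [2277, 4926, 6294, 10244, 337001],
})

# Coefficient of variation: values above 1 signal heavy-tailed engagement
cv = sample.std() / sample.mean()
print(cv.round(2))
```

A coefficient above 1 (std dev exceeding the mean) is exactly the pattern reported for likes, comments, and views.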

Correlation Analysis

Computing Correlations

import seaborn as sns
import matplotlib.pyplot as plt

# Select numeric columns
numeric_cols = ['Likes', 'Comentarios', 'Visualizaciones', 'Duración del video (s)']

# Calculate correlation matrix
correlation_matrix = df[numeric_cols].corr()

print(correlation_matrix)

Correlation Heatmap

# Create heatmap visualization
fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(
    df[numeric_cols].corr(), 
    annot=True, 
    cmap='coolwarm', 
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={"shrink": 0.8},
    ax=ax
)

ax.set_title('Mapa de Calor de Correlaciones', fontsize=16, pad=20)
plt.tight_layout()
plt.show()

Interpretation of Correlations

  • Likes ↔ Comments — strong positive correlation: videos with more likes tend to receive more comments, indicating consistent engagement across metrics.
  • Views ↔ Likes — positive correlation: higher view counts generally lead to more likes, though the relationship shows some variance, indicating quality matters.
  • Views ↔ Comments — moderate positive correlation: comments increase with views but at a lower rate than likes, suggesting comments require deeper engagement.
  • Duration ↔ Engagement — weak correlation: video duration shows minimal correlation with engagement metrics, indicating content quality trumps length.
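
To attach significance levels to these relationships, `scipy.stats.pearsonr` returns the coefficient together with a p-value. A sketch on synthetic data shaped like the dataset (SciPy and the simulated relationship are assumptions, not part of the original analysis):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
likes = rng.gamma(shape=2, scale=600, size=121)        # right-skewed, like the real likes
comments = 0.03 * likes + rng.normal(0, 10, size=121)  # ~3 comments per 100 likes, plus noise

r, p = pearsonr(likes, comments)
print(f"r = {r:.3f}, p = {p:.2e}")  # p < 0.05 suggests the correlation is unlikely to be chance
```

Running the same test on each metric pair in the real data would show which of the correlations above are statistically significant at n = 121.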

Engagement Metrics Distribution

Likes Distribution

# Analyze likes distribution
print(f"Mean Likes: {df['Likes'].mean():.2f}")
print(f"Median Likes: {df['Likes'].median():.2f}")
print("Median < Mean indicates a right-skewed distribution")

# Visualize distribution
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['Likes'], bins=30, edgecolor='black', alpha=0.7)
ax.axvline(df['Likes'].mean(), color='red', linestyle='--', label='Mean')
ax.axvline(df['Likes'].median(), color='green', linestyle='--', label='Median')
ax.set_xlabel('Likes')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Likes across Videos')
ax.legend()
plt.show()
Key Finding: The distribution is right-skewed, meaning most videos cluster around the median (828 likes) with a few high-performing outliers pulling the mean higher (1,316 likes).
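
The skew can be confirmed numerically with pandas' built-in `skew()`; a sketch on synthetic right-skewed data (the gamma shape is an assumption chosen to mimic the reported mean/median gap):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
likes = pd.Series(rng.gamma(shape=2, scale=600, size=121), name='Likes')

# Positive skewness confirms the right tail: the mean sits above the median
print(f"Skewness: {likes.skew():.2f}")
print(f"Mean: {likes.mean():.0f}  Median: {likes.median():.0f}")
```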

Comments Distribution

# Analyze comments distribution
print(f"Mean Comments: {df['Comentarios'].mean():.2f}")
print(f"Median Comments: {df['Comentarios'].median():.2f}")
print(f"Comments per Like Ratio: {df['Comentarios'].sum() / df['Likes'].sum():.4f}")
Insights:
  • Average of 39 comments per video
  • Median of 27 comments (also right-skewed)
  • Comment-to-like ratio: approximately 0.03 (3 comments per 100 likes)
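
The aggregate ratio can mask per-video variation; computing it row-wise shows the spread (a sketch on illustrative values, not the full dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    'Likes': [640, 828, 1250, 14659],
    'Comentarios': [18, 27, 39, 361],
})

# Per-video ratio versus the aggregate ratio reported above
sample['comments_per_like'] = sample['Comentarios'] / sample['Likes']
print(sample['comments_per_like'].round(4))
print(f"Aggregate: {sample['Comentarios'].sum() / sample['Likes'].sum():.4f}")
```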

Views Distribution

# Views analysis
print(f"Mean Views: {df['Visualizaciones'].mean():.2f}")
print(f"Median Views: {df['Visualizaciones'].median():.2f}")
print(f"View-to-Like Conversion: {(df['Likes'].sum() / df['Visualizaciones'].sum()) * 100:.2f}%")
Metrics:
  • Mean views: 15,392
  • Median views: 6,294 (significant skew)
  • View-to-like conversion rate: ~8.5% (mean likes ÷ mean views)

Statistical Insights by Category

Category-Level Statistics

# Group by category and calculate statistics
category_stats = df.groupby('Categoria').agg({
    'Likes': ['count', 'mean', 'sum'],
    'Comentarios': ['mean', 'sum'],
    'Visualizaciones': ['mean', 'sum']
}).round(2)

print(category_stats)

Performance Variability

# Calculate coefficient of variation by category
cv_by_category = df.groupby('Categoria').agg({
    'Likes': lambda x: (x.std() / x.mean()) * 100,
    'Comentarios': lambda x: (x.std() / x.mean()) * 100
}).round(2)

print("Coefficient of Variation (%)")
print(cv_by_category)
Coefficient of Variation measures relative variability. Higher values indicate less predictable performance within a category.

Duration Analysis

Video Length Statistics

# Analyze video duration
print(f"Average Duration: {df['Duración del video (s)'].mean():.2f} seconds")
print(f"Median Duration: {df['Duración del video (s)'].median():.2f} seconds")
print(f"Range: {df['Duración del video (s)'].min():.0f}s - {df['Duración del video (s)'].max():.0f}s")
Results:
  • Average: 50.08 seconds
  • Median: 45.90 seconds
  • Range: 26 to 133.49 seconds
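
One way to look for a duration "sweet spot" is to bin the durations with `pd.cut` and compare mean likes per bin; a sketch on synthetic data (the bin edges and simulated values are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Duración del video (s)': rng.uniform(26, 134, size=121),
    'Likes': rng.gamma(shape=2, scale=600, size=121),
})

bins = [0, 30, 45, 60, 90, 140]  # assumed cut points, in seconds
demo['duration_bin'] = pd.cut(demo['Duración del video (s)'], bins=bins)
print(demo.groupby('duration_bin', observed=True)['Likes'].agg(['count', 'mean']).round(1))
```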

Duration vs Engagement

# Correlation between duration and engagement
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Duration vs Likes
ax1.scatter(df['Duración del video (s)'], df['Likes'], alpha=0.6)
ax1.set_xlabel('Duration (seconds)')
ax1.set_ylabel('Likes')
ax1.set_title('Duration vs Likes')

# Duration vs Views
ax2.scatter(df['Duración del video (s)'], df['Visualizaciones'], alpha=0.6, color='orange')
ax2.set_xlabel('Duration (seconds)')
ax2.set_ylabel('Views')
ax2.set_title('Duration vs Views')

plt.tight_layout()
plt.show()
Duration shows weak correlation with engagement, suggesting that content quality and topic relevance are more important than video length for this educational content.
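
Because the engagement metrics are heavily right-skewed, Pearson correlations can be dominated by a few viral outliers. A rank-based Spearman correlation is a common robustness check (a sketch on synthetic data; Spearman was not part of the original analysis):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'Duración del video (s)': rng.uniform(26, 134, size=121),
    'Likes': rng.gamma(shape=2, scale=600, size=121),  # simulated as independent of duration
})

# Spearman works on ranks, so viral outliers cannot inflate the coefficient
print(demo.corr(method='spearman').round(3))
```

If the Spearman and Pearson coefficients agree, the weak duration–engagement relationship is not an artifact of the skewed distributions.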

Key Statistical Findings

High Variability

Standard deviations exceed means for all engagement metrics (likes, comments, and views), indicating diverse performance across content

Strong Engagement

Likes and comments show strong positive correlation (visible in heatmap)

Viral Potential

Top performers exceed the mean by more than 10×, demonstrating viral content potential

Consistent Duration

~45-50 second sweet spot for short-form historical content
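
The viral outliers noted above can be flagged programmatically, for example with a mean-plus-two-standard-deviations cutoff (the 2σ threshold is an assumption, not from the original analysis); a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
likes = pd.Series(rng.gamma(shape=2, scale=600, size=121), name='Likes')

# Flag videos whose likes sit more than two standard deviations above the mean
threshold = likes.mean() + 2 * likes.std()
viral = likes[likes > threshold]
print(f"Threshold: {threshold:.0f} likes")
print(f"Flagged: {len(viral)} of {len(likes)} videos")
```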

References

All statistical analyses are reproducible using the code provided in the project’s EDA.ipynb notebook.
