Master data analysis fundamentals using Pandas, from loading datasets to computing statistics and creating visualizations.
Exploratory Data Analysis (EDA) is the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions. This guide introduces you to EDA using Pandas, one of the most widely used Python libraries for data analysis.
In this tutorial, you’ll learn essential Pandas operations using the World Happiness Report dataset. The dataset contains 2,199 rows with happiness-related metrics for different countries across multiple years.
This is not a comprehensive Pandas guide; it focuses on the functions you’ll use most frequently in data analysis. For detailed documentation, see the official Pandas tutorial.
Column names with spaces can be problematic. Replace spaces with underscores:
# Automatic rename: replace spaces and lowercase
columns_to_rename = {i: "_".join(i.split(" ")).lower() for i in df.columns}
df = df.rename(columns=columns_to_rename)
df.head()
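The same renaming can also be written with the vectorized string methods on the column index. The sketch below uses a tiny stand-in DataFrame (the values are placeholders, not taken from the dataset) so that it runs on its own:

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are placeholders.
df = pd.DataFrame({
    "Country name": ["A", "B"],
    "Life Ladder": [5.0, 6.0],
    "Log GDP per capita": [9.0, 10.0],
})

# Replace spaces with underscores and lowercase every column name.
df.columns = df.columns.str.replace(" ", "_").str.lower()

print(list(df.columns))
```

Both approaches should give the same column names; the dictionary version is handy when only some columns need renaming.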
Why Rename Columns?
Column names without spaces allow cleaner syntax:
df.life_ladder     # Easy access
df["Life Ladder"]  # Required with spaces
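Attribute access is a convenience rather than a full replacement for brackets. As a small illustration on a made-up DataFrame (the column name `count` is hypothetical and not part of the dataset), a column whose name matches an existing DataFrame method is only reachable with brackets, and new columns should also be created with brackets:

```python
import pandas as pd

# Toy data; "count" is a hypothetical column name chosen to show the clash.
df = pd.DataFrame({"life_ladder": [5.0, 6.0], "count": [10, 20]})

# Attribute access works for names that are valid Python identifiers...
ladder = df.life_ladder

# ...but "count" collides with the DataFrame.count method,
# so df.count is the method, not the column.
is_method = callable(df.count)

# Bracket access always returns the column.
count_column = df["count"]

# New columns should be created with brackets as well.
df["ladder_rounded"] = df["life_ladder"].round()
```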
Unlike a plain NumPy array, which holds a single data type, each DataFrame column can have its own data type. This makes DataFrames ideal for mixed-type data.
Change data types when needed:
# List columns that should be floats
float_columns = [i for i in df.columns if i not in ["country_name", "year"]]
# Convert to float type
df = df.astype({i: float for i in float_columns})
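To confirm that a conversion worked, inspect `df.dtypes`. The sketch below uses a toy DataFrame with made-up values. It also shows `pd.to_numeric(..., errors="coerce")`, which the original code does not use, as one option for columns containing entries that cannot be parsed as numbers, since `astype(float)` raises an error on those:

```python
import pandas as pd

# Toy data with placeholder values.
df = pd.DataFrame({
    "country_name": ["A", "B", "C"],
    "year": [2020, 2021, 2022],
    "life_ladder": ["5.1", "6.2", "n/a"],  # strings, one of them unparseable
    "social_support": [1, 0, 1],           # integers
})

# astype works when every value converts cleanly.
df = df.astype({"social_support": float})

# to_numeric with errors="coerce" turns unparseable entries into NaN.
df["life_ladder"] = pd.to_numeric(df["life_ladder"], errors="coerce")

print(df.dtypes)
```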
When aggregating data across countries and years, consider whether simple averages make sense. Do all countries have equal data points? Should countries be weighted by population?
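The first question can be checked by counting rows per country, and the second explored with a weighted mean. The sketch below runs on a made-up DataFrame; the `population` column is a hypothetical addition, since the text does not say the dataset contains one:

```python
import numpy as np
import pandas as pd

# Toy data: country A has three rows, country B only one.
df = pd.DataFrame({
    "country_name": ["A", "A", "A", "B"],
    "year": [2020, 2021, 2022, 2022],
    "life_ladder": [4.0, 4.0, 4.0, 8.0],
    "population": [100, 100, 100, 300],  # hypothetical column
})

# How many rows does each country contribute?
rows_per_country = df["country_name"].value_counts()

# Simple mean over all rows: country A dominates because it has more rows.
simple_mean = df["life_ladder"].mean()

# Mean of per-country means: every country counts once.
mean_of_country_means = df.groupby("country_name")["life_ladder"].mean().mean()

# Population-weighted mean for a single year.
latest = df[df["year"] == 2022]
weighted_mean = np.average(latest["life_ladder"], weights=latest["population"])

print(simple_mean, mean_of_country_means, weighted_mean)
```

The three numbers differ on this toy data, which is the point: the choice of aggregation is part of the analysis, not a detail.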
# Create color mapping
cmap = {
    'Brazil': 'Green',
    'Slovenia': 'Orange',
    'India': 'purple'
}
df.plot(
    kind='scatter',
    x='log_gdp_per_capita',
    y='life_ladder',
    c=[cmap.get(c, 'yellow') for c in df.country_name],
    s=2  # Point size
)
Color coding reveals patterns that aggregate statistics might miss. For example, Brazil shows higher happiness than Slovenia despite lower GDP per capita.
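A visual impression like this can be checked numerically by comparing per-country averages. A minimal sketch, using placeholder country labels and numbers rather than values from the World Happiness Report:

```python
import pandas as pd

# Placeholder values for illustration only; not real report data.
df = pd.DataFrame({
    "country_name": ["X", "X", "Y", "Y", "Z"],
    "log_gdp_per_capita": [9.0, 9.2, 10.0, 10.2, 8.0],
    "life_ladder": [6.5, 6.7, 6.0, 6.2, 4.0],
})

# Keep only the countries of interest, then average each metric per country.
selected = ["X", "Y"]
summary = (
    df[df["country_name"].isin(selected)]
    .groupby("country_name")[["log_gdp_per_capita", "life_ladder"]]
    .mean()
)

print(summary)
```

With the real data, `selected` would hold the country names highlighted in the plot, such as 'Brazil' and 'Slovenia'.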