Exploratory Data Analysis (EDA) is the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions. This guide introduces you to EDA using Pandas, the most popular Python library for data analysis.

Introduction

In this tutorial, you’ll learn essential Pandas operations using the World Happiness Report dataset. The dataset contains 2,199 rows with happiness-related metrics for different countries across multiple years.
This is not a comprehensive Pandas guide, but rather focuses on the functions you’ll use most frequently in data analysis. For detailed documentation, see the official Pandas tutorial.

Getting Started

Import Required Libraries

Begin by importing Pandas and Seaborn for data manipulation and visualization:
import pandas as pd
import seaborn as sns
If Seaborn is not installed, run !pip install seaborn in a notebook cell first.
Load Your Dataset

Use pd.read_csv() to load CSV files:
df = pd.read_csv('data/world_happiness.csv')
This creates a DataFrame - Pandas’ primary data structure for working with tabular data.
Pandas supports various file formats. Check the I/O documentation for other data sources.
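As a self-contained sketch of how pd.read_csv() works, note that it accepts any file-like object, not just a path on disk. The CSV text and column names below are illustrative stand-ins for the real file:

```python
import io
import pandas as pd

# CSV text standing in for a file on disk (illustrative data)
csv_text = """country_name,year,life_ladder
Brazil,2022,6.47
Slovenia,2022,6.63
"""

# read_csv accepts any file-like object, not just paths
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```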

Viewing Your Data

Display First and Last Rows

Use head() and tail() to preview your DataFrame:
# Display first 5 rows (default)
df.head()

# Display last 2 rows
df.tail(2)
In Jupyter notebooks, the last line of a cell automatically displays its output:
df.head()  # Displays formatted table

Understanding DataFrame Structure

Index and Column Names

Every DataFrame has an index (row labels) and column names:
# View index
df.index  # RangeIndex(start=0, stop=2199, step=1)

# View column names
df.columns

Renaming Columns

Column names with spaces can be problematic. Replace spaces with underscores:
# Replace spaces with underscores and lowercase every column name
columns_to_rename = {col: col.replace(" ", "_").lower() for col in df.columns}
df = df.rename(columns=columns_to_rename)

df.head()
Column names without spaces allow cleaner syntax:
df.life_ladder  # Dot notation works when names have no spaces
df["Life Ladder"]  # Bracket notation is required when a name contains spaces

Data Types

Check column data types with dtypes:
df.dtypes
Unlike NumPy arrays, DataFrame columns can have different data types. This makes DataFrames ideal for mixed-type data.
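A small illustration of per-column dtypes, using made-up values:

```python
import pandas as pd

# Each column keeps its own dtype
df = pd.DataFrame({
    "country_name": ["Brazil", "India"],   # object (strings)
    "year": [2022, 2022],                  # int64
    "life_ladder": [6.47, 4.04],           # float64
})
print(df.dtypes)
```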
Change data types when needed:
# List columns that should be floats
float_columns = [i for i in df.columns if i not in ["country_name", "year"]]

# Convert to float type
df = df.astype({i: float for i in float_columns})
Get comprehensive information:
df.info()  # Shows types, non-null counts, memory usage

Selecting Data

Selecting Columns

# Dot notation - returns Series
x = df.life_ladder

# Bracket notation - returns Series
x = df["life_ladder"]

print(type(x))  # <class 'pandas.core.series.Series'>
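Passing a list of names selects several columns at once and returns a DataFrame rather than a Series. A toy frame stands in for the happiness data here:

```python
import pandas as pd

# Illustrative frame standing in for the happiness data
df = pd.DataFrame({
    "country_name": ["Brazil", "India"],
    "life_ladder": [6.47, 4.04],
    "year": [2022, 2022],
})

# Double brackets: the inner list names the columns to keep
subset = df[["country_name", "life_ladder"]]
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>
```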

Selecting Rows

Use slicing to select rows:
# Select rows 2, 3, and 4
df[2:5]

Iterating Over Rows

Use .iterrows() for row-by-row iteration:
for index, row in df.iterrows():
    print(f"Country: {row['country_name']}, Year: {row['year']}")
    break  # Just show first row
.iterrows() returns each row as a single Series, so it does not preserve dtypes across the row: mixed values are upcast to a common type. Within the DataFrame itself, dtypes are preserved per column.
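To see this upcasting in action with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"year": [2022], "life_ladder": [6.47]})
print(df["year"].dtype)        # int64 inside the DataFrame

for _, row in df.iterrows():
    # The row is a single Series, so int and float are upcast to float64
    print(row.dtype)           # float64
    print(row["year"])         # 2022.0, no longer an int
```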

Boolean Indexing

Filter data based on conditions:
# Select data from 2022 only
df_2022 = df[df["year"] == 2022]

# Select high happiness scores
high_happiness = df[df["life_ladder"] > 7]

# Combine conditions
recent_happy = df[(df["year"] >= 2020) & (df["life_ladder"] > 7)]

Resetting Index

After filtering, reset the index for cleaner row numbering:
new_df = df[df["year"] == 2022]
new_df = new_df.reset_index(drop=True)
print(new_df.head())
Use drop=True to discard the old index. Without it, the old index becomes a new column.
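A toy frame makes the difference concrete:

```python
import pandas as pd

df = pd.DataFrame({"year": [2021, 2022, 2022], "life_ladder": [6.3, 6.5, 4.0]})
filtered = df[df["year"] == 2022]   # keeps the original labels 1 and 2

# Without drop=True, the old labels survive as an 'index' column
with_old = filtered.reset_index()
print(with_old.columns.tolist())    # ['index', 'year', 'life_ladder']

# With drop=True, the old labels are discarded
clean = filtered.reset_index(drop=True)
print(clean.index.tolist())         # [0, 1]
```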

Summary Statistics

Compute statistical summaries with describe():
df.describe()
This returns:
  • count: Number of non-null values
  • mean: Average value
  • std: Standard deviation
  • min/max: Minimum and maximum values
  • 25%, 50%, 75%: Quartiles (50% is the median)
When aggregating data across countries and years, consider whether simple averages make sense. Do all countries have equal data points? Should countries be weighted by population?
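One way to see why this matters: in a made-up unbalanced panel where one country has more yearly rows than another, a raw mean and a per-country-then-global mean disagree:

```python
import pandas as pd

# Country A has three rows, country B only one (unbalanced panel)
df = pd.DataFrame({
    "country_name": ["A", "A", "A", "B"],
    "life_ladder":  [1.0, 1.0, 1.0, 5.0],
})

raw_mean = df["life_ladder"].mean()                            # 2.0: A dominates
per_country = df.groupby("country_name")["life_ladder"].mean()
balanced_mean = per_country.mean()                             # 3.0: each country counts once

print(raw_mean, balanced_mean)
```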

Data Visualization

Basic Plotting

Pandas integrates with Matplotlib for quick visualizations:
# Plot all numeric columns
df.plot()
This default plot often isn’t useful for mixed datasets. Specify columns and plot types for better results.
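For example, restricting the plot to one explicit x/y pair over time gives a far more readable chart. Toy data here; the column names follow the renaming done above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "life_ladder": [6.3, 6.1, 6.4, 6.5],
})

# One explicit x/y pair instead of every numeric column at once
ax = df.plot(kind="line", x="year", y="life_ladder")
print(ax.get_xlabel())  # 'year'
```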

Scatter Plots

Visualize relationships between variables:
# GDP vs Happiness
df.plot(kind='scatter', x='log_gdp_per_capita', y='life_ladder')
This reveals a positive correlation: wealthier countries tend to have happier populations.
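To put a number on such a relationship, you can compute the Pearson correlation with .corr(). The values below are toy data (chosen to be perfectly linear); the real dataset gives its own coefficient:

```python
import pandas as pd

df = pd.DataFrame({
    "log_gdp_per_capita": [9.0, 10.0, 11.0, 11.5],
    "life_ladder":        [4.5, 5.5, 6.5, 7.0],
})

# Pearson correlation between the two columns, in [-1, 1]
r = df["log_gdp_per_capita"].corr(df["life_ladder"])
print(round(r, 2))  # 1.0 for this perfectly linear toy data
```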

Color-Coded Scatter Plots

Highlight specific data points:
# Create color mapping
cmap = {
    'Brazil': 'green',
    'Slovenia': 'orange',
    'India': 'purple'
}

df.plot(
    kind='scatter',
    x='log_gdp_per_capita',
    y='life_ladder',
    c=[cmap.get(c, 'yellow') for c in df.country_name],
    s=2  # Point size
)
Color coding reveals patterns that aggregate statistics might miss. For example, Brazil shows higher happiness than Slovenia despite lower GDP.

Histograms

Visualize value distributions:
df.hist("life_ladder")
Histograms show the distribution shape, helping identify skewness, outliers, and central tendencies.

Pairplot for Multiple Variables

Seaborn’s pairplot shows all pairwise relationships:
sns.pairplot(df)
This creates:
  • Scatter plots for each variable pair
  • Histograms on the diagonal
Pairplots can take time with many columns. Consider selecting a subset of columns for faster rendering.

Column Operations

Creating New Columns

Perform arithmetic operations:
# Create net affect difference
df["net_affect_difference"] = df["positive_affect"] - df["negative_affect"]

df.head()

Applying Functions

Use apply() for custom transformations:
# Rescale to 0-1 range
df['life_ladder_rescaled'] = df['life_ladder'].apply(lambda x: x / 10)
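For simple arithmetic like this, a vectorized expression gives the same result and avoids the Python-level loop that apply() runs, so it is typically much faster:

```python
import pandas as pd

df = pd.DataFrame({"life_ladder": [6.5, 4.0, 7.2]})

via_apply = df["life_ladder"].apply(lambda x: x / 10)
vectorized = df["life_ladder"] / 10   # same result, no Python-level loop

print(vectorized.tolist())
```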

Key Takeaways

Data Loading

Use pd.read_csv() to load data. Pandas supports many formats.

Data Selection

Use boolean indexing to filter data based on conditions.

Statistics

Use .describe() for quick statistical summaries.

Visualization

Combine Pandas with Matplotlib and Seaborn for powerful visualizations.
If you need a refresher on these concepts later, return to this guide. These are the foundational skills for all data analysis work.
