Overview

NumPy and Pandas are the cornerstone libraries for data manipulation and analysis in Python. This module covers array operations, DataFrame manipulation, and efficient data processing techniques.

NumPy Arrays

Multi-dimensional arrays and vectorized operations

Pandas DataFrames

Structured data manipulation and analysis

Broadcasting

Automatic array dimension matching

Indexing

Advanced selection and filtering techniques

NumPy: Vectorized Operations

NumPy provides powerful array operations that are orders of magnitude faster than native Python loops.

Creating Arrays

import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Generate 5x5 matrix of stock prices (5 stocks, 5 days)
precios = np.random.uniform(10, 100, size=(5, 5))
# Rows = stocks A1..A5, columns = days D1..D5
print("Prices:\n", precios)

Statistical Operations

NumPy enables efficient statistical calculations across arrays without explicit loops:
# Calculate statistics per stock (across days)
promedios = precios.mean(axis=1)
maximos = precios.max(axis=1)
minimos = precios.min(axis=1)

print("Average per stock:", promedios)
print("Maximum per stock:", maximos)
print("Minimum per stock:", minimos)
Use axis=1 to aggregate across the columns of each row (one result per stock), and axis=0 to aggregate down the rows of each column (one result per day).
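The axis rule is easy to check on a tiny matrix, where every value can be verified by hand:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# axis=0 collapses the rows: one sum per column
print(m.sum(axis=0))  # [3 5 7]

# axis=1 collapses the columns: one sum per row
print(m.sum(axis=1))  # [ 3 12]
```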

Percentage Change Calculation

Vectorized operations make it easy to calculate daily variations:
# Daily percentage change: (price_t - price_(t-1)) / price_(t-1)
# Result: 5x4 matrix (loses first day in calculation)
variacion_pct = (precios[:, 1:] - precios[:, :-1]) / precios[:, :-1]

print("Daily percentage change:\n", variacion_pct)

Mathematical Functions and Broadcasting

Broadcasting automatically adjusts array dimensions for operations:
# Natural logarithm of prices
log_precios = np.log(precios)

# Min-max normalization per stock using broadcasting
min_accion = precios.min(axis=1).reshape(-1, 1)  # Shape: (5, 1)
max_accion = precios.max(axis=1).reshape(-1, 1)  # Shape: (5, 1)
precios_norm = (precios - min_accion) / (max_accion - min_accion)

# Exponential of normalized prices
exp_precios_norm = np.exp(precios_norm)

print("Log prices:\n", log_precios)
print("Normalized prices:\n", precios_norm)
Broadcasting eliminates the need for explicit loops when working with arrays of different shapes.
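The manual reshape(-1, 1) step above can also be avoided with the keepdims=True argument to the reduction, which preserves the reduced axis as length 1 so the result broadcasts directly. A minimal self-contained sketch:

```python
import numpy as np

np.random.seed(42)
precios = np.random.uniform(10, 100, size=(5, 5))

# keepdims=True returns shape (5, 1) instead of (5,), so no reshape is needed
min_a = precios.min(axis=1, keepdims=True)
max_a = precios.max(axis=1, keepdims=True)
norm = (precios - min_a) / (max_a - min_a)

print(norm.shape)  # (5, 5), each row scaled to [0, 1]
```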

Advanced Indexing

Extract specific data points or subsets efficiently:
# Get specific stock price on specific day
accion_idx = 1  # Second row (stock 2)
dia_idx = 3     # Fourth day
precio_especifico = precios[accion_idx, dia_idx]
print("Stock 2 price on day 4:", precio_especifico)

# Extract all stocks on day 3
precios_dia3 = precios[:, 2]
print("Prices on day 3:", precios_dia3)

# Extract subset: stocks 1, 3, and 5 on days 2 and 5
acciones_idx = [0, 2, 4]
dias_idx = [1, 4]
subconjunto = precios[np.ix_(acciones_idx, dias_idx)]
print("Subset stocks [1,3,5] days [2,5]:\n", subconjunto)

Conditional Operations

Boolean masks select and replace values that meet a condition. This example works on a separate array of temperature readings:
# Sample data (X is not defined above, so create it here)
np.random.seed(0)
X = np.random.uniform(0, 45, size=(4, 7))  # temperature readings in °C

# Create mask for temperatures above 30°C
mask_mayor_30 = X > 30
valores_mayor_30 = X[mask_mayor_30]

# Replace values below threshold
X_clean = X.copy()
X_clean[X_clean < 15] = 15  # Set minimum to 15°C
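The mask-and-assign pattern has vectorized one-call equivalents in np.clip and np.where. A self-contained sketch, seeding a sample temperature array for illustration:

```python
import numpy as np

np.random.seed(0)
X = np.random.uniform(0, 45, size=(4, 7))  # sample temperature readings (°C)

# np.clip caps every value below 15 at 15 without an explicit mask
X_clean = np.clip(X, 15, None)

# np.where selects elementwise: keep the reading if above 30°C, else 0
solo_calor = np.where(X > 30, X, 0)

print(X_clean.min(), solo_calor.max())
```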

Pandas: Structured Data Analysis

Pandas builds on NumPy to provide labeled data structures and powerful data analysis tools.

Creating DataFrames from NumPy

import pandas as pd
import numpy as np

# Load NumPy array and convert to DataFrame
clientes_numpy_cargado = np.load("clientes_numpy.npy")

df_numpy = pd.DataFrame(
    clientes_numpy_cargado,
    columns=["ID", "Edad", "Total_Compras", "Monto_Total"]
)

print(df_numpy.head())

Loading Data from Multiple Sources

import pandas as pd

# Load CSV with automatic type inference
df_csv = pd.read_csv('empleados.csv')
print(f"Shape: {df_csv.shape[0]} rows, {df_csv.shape[1]} columns")

Data Exploration

import pandas as pd

df = pd.read_csv('empleados.csv')

# Overview of DataFrame
print(df.info())

# Statistical summary
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print(f"Duplicated rows: {df.duplicated().sum()}")

Data Type Conversion

# Convert date column to datetime
df['Fecha_Ingreso'] = pd.to_datetime(df['Fecha_Ingreso'])

# Ensure numeric type
df['Salario'] = pd.to_numeric(df['Salario'], errors='coerce')

# Convert to categorical for memory efficiency
df['Departamento'] = df['Departamento'].astype('category')

print(df.dtypes)

Filtering and Selection

# Conditional filtering
mayores_30 = df_numpy[df_numpy["Edad"] > 30]
con_mas_de_5_compras = df_numpy[df_numpy["Total_Compras"] > 5]

# Select specific columns
columnas_relevantes = ['ID', 'Nombre', 'Departamento', 'Salario']
df_selected = df[columnas_relevantes]

# Boolean indexing with multiple conditions
df_filtered = df[(df['Edad'] > 25) & (df['Salario'] > 40000)]

Aggregations and Grouping

# Group by department and calculate statistics
resumen_depto = df.groupby('Departamento').agg({
    'ID': 'count',
    'Salario': ['mean', 'min', 'max']
}).round(2)

resumen_depto.columns = ['Cantidad', 'Salario_Promedio', 
                          'Salario_Mínimo', 'Salario_Máximo']
print(resumen_depto)
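Renaming the flattened MultiIndex columns by position works but is fragile if the agg spec changes. Pandas named aggregation assigns the output column names directly; a sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Departamento": ["Ventas", "Ventas", "TI"],
    "ID": [1, 2, 3],
    "Salario": [30000, 50000, 60000],
})

# Named aggregation: output_name=(input_column, function)
resumen = df.groupby("Departamento").agg(
    Cantidad=("ID", "count"),
    Salario_Promedio=("Salario", "mean"),
    Salario_Minimo=("Salario", "min"),
    Salario_Maximo=("Salario", "max"),
)
print(resumen)
```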

Performance Comparison: NumPy vs Pure Python

# Calculate averages using NumPy (vectorized)
promedios = precios.mean(axis=1)
print("Averages with NumPy:", promedios)
Result: fast and concise, because the loop runs in optimized C code rather than in the Python interpreter.
Key Benefits:
  • 10-100x faster execution for large datasets
  • More readable and maintainable code
  • Leverages optimized C/Fortran implementations
  • Reduced memory consumption
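To see the comparison concretely, here is a minimal timing sketch pitting a pure-Python loop against the vectorized call on a synthetic matrix (exact speedups vary by machine and array size):

```python
import time
import numpy as np

np.random.seed(42)
data = np.random.uniform(10, 100, size=(500, 500))

# Pure Python: iterate row by row over plain lists
t0 = time.perf_counter()
medias_py = [sum(fila) / len(fila) for fila in data.tolist()]
t_py = time.perf_counter() - t0

# NumPy: one vectorized reduction
t0 = time.perf_counter()
medias_np = data.mean(axis=1)
t_np = time.perf_counter() - t0

print(f"Python loop: {t_py:.4f}s  NumPy: {t_np:.4f}s")
```

Both paths compute the same averages; only the execution strategy differs.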

Merging Data Sources

Combine multiple DataFrames for integrated analysis:
# Ensure matching types before merge
df_ecom_csv["ID"] = df_ecom_csv["ID"].astype(int)
df_base["ID"] = df_base["ID"].astype(int)

# Left join to unify data sources
df_unificado = pd.merge(
    df_base,
    df_ecom_csv[['ID', 'Nombre', 'Ciudad']],
    on='ID',
    how='left'
)

df_unificado.to_csv('dataset_consolidado.csv', index=False)
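After a left join it is worth checking how many rows actually matched. The indicator=True option of pd.merge adds a _merge column for exactly this; a self-contained sketch with toy frames standing in for df_base and df_ecom_csv:

```python
import pandas as pd

df_base = pd.DataFrame({"ID": [1, 2, 3], "Monto_Total": [100, 250, 80]})
df_ecom = pd.DataFrame({"ID": [1, 2], "Ciudad": ["Lima", "Quito"]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
df_unido = pd.merge(df_base, df_ecom, on="ID", how="left", indicator=True)

sin_match = (df_unido["_merge"] == "left_only").sum()
print(f"Rows without a match: {sin_match}")
```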

Best Practices

  • Use NumPy arrays for homogeneous numerical data
  • Use Pandas DataFrames for heterogeneous, labeled data
  • Consider memory efficiency with large datasets
  • Avoid explicit Python loops when possible
  • Use built-in NumPy/Pandas functions
  • Leverage broadcasting for array operations
  • Convert data types appropriately before analysis
  • Use categorical types for repeated string values
  • Parse dates as datetime objects
  • Use .copy() before modifying DataFrames
  • Keep raw data files unchanged
  • Document all transformations

Next Steps

Data Wrangling

Learn to clean and transform messy data

Exploratory Analysis

Discover patterns through visualization
