This guide demonstrates how to ingest data into MotherDuck using Python. We’ll cover two approaches: a simple method using pandas DataFrames and a production-ready method using PyArrow for efficient chunked ingestion.

Prerequisites

Before you begin, ensure you have:

  - A working Python installation
  - A MotherDuck account and an access token

Installation

Install the required Python packages:
pip install "duckdb>=1.1.3" "pandas>=2.2.3" "pyarrow>=19.0.0" "python-dotenv>=1.0.1" "requests>=2.32.3"
Create a .env file in your project directory and add your MotherDuck token:
MOTHERDUCK_TOKEN=your_token_here

Approach 1: Simple Ingestion with Pandas

This method is ideal for smaller datasets and quick prototyping. It fetches data from an API, processes it into a pandas DataFrame, and loads it into MotherDuck.

Complete Example

load_to_motherduck_small.py
import requests
import pandas as pd
import duckdb
import logging
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def fetch_github_data():
    url = "https://api.github.com/repos/duckdb/duckdb/stats/contributors"
    logger.info(f"Fetching data from {url}")
    response = requests.get(url)
    response.raise_for_status()
    logger.info("Data fetched successfully")
    return response.json()


def process_data(data):
    logger.info("Processing data")
    records = []
    for author in data:
        total_commits = sum(week["c"] for week in author["weeks"])
        records.append(
            {"login": author["author"]["login"], "total_commits": total_commits}
        )
    df = pd.DataFrame(records)
    logger.info("Data processed into DataFrame")
    return df


def main():
    logger.info("Starting main process")
    data = fetch_github_data()
    df = process_data(data)

    # Connect to MotherDuck and create a table
    logger.info("Connecting to MotherDuck")
    con = duckdb.connect()
    con.sql("ATTACH 'md:'")
    con.sql("CREATE DATABASE IF NOT EXISTS github")
    # Loading data into MotherDuck based on the DataFrame
    con.sql("CREATE TABLE IF NOT EXISTS github.github_commits AS SELECT * FROM df")
    logger.info("Data loaded into MotherDuck successfully")


if __name__ == "__main__":
    main()

Key Steps

  1. Fetch Data: Retrieve data from an external source (in this example, GitHub's API)
  2. Process into DataFrame: Transform the JSON data into a pandas DataFrame with the desired schema
  3. Connect to MotherDuck: Use duckdb.connect() and attach to MotherDuck with ATTACH 'md:'
  4. Load Data: Create a table directly from the DataFrame using CREATE TABLE ... AS SELECT * FROM df
DuckDB can automatically read pandas DataFrames by referencing the variable name in SQL queries.

Approach 2: Production-Ready Ingestion with PyArrow

For larger datasets, use PyArrow tables with chunked insertion. This approach provides better memory management, type safety, and performance optimization.

Complete Example

load_to_motherduck_large.py
import requests
import pyarrow as pa
import duckdb
import logging
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ArrowTableLoadingBuffer:
    def __init__(
        self,
        conn: duckdb.DuckDBPyConnection,
        pyarrow_schema: pa.Schema,
        table_name: str,
        chunk_size: int = 100000,  # typical good chunk size for DuckDB
    ):
        self.conn = conn
        self.pyarrow_schema = pyarrow_schema
        self.table_name = table_name
        self.total_inserted = 0
        self.chunk_size = chunk_size

    def insert(self, table: pa.Table):
        total_rows = table.num_rows
        for batch_start in range(0, total_rows, self.chunk_size):
            batch_end = min(batch_start + self.chunk_size, total_rows)
            chunk = table.slice(batch_start, batch_end - batch_start)
            self.insert_chunk(chunk)
            logger.info(f"Inserted chunk {batch_start} to {batch_end}")
        self.total_inserted += total_rows
        logger.info(f"Total inserted: {self.total_inserted} rows")

    def insert_chunk(self, chunk: pa.Table):
        self.conn.register("buffer_table", chunk)
        insert_query = f"INSERT INTO {self.table_name} SELECT * FROM buffer_table"
        self.conn.execute(insert_query)
        self.conn.unregister("buffer_table")


def fetch_github_data():
    url = "https://api.github.com/repos/duckdb/duckdb/stats/contributors"
    logger.info(f"Fetching data from {url}")
    response = requests.get(url)
    response.raise_for_status()
    logger.info("Data fetched successfully")
    return response.json()


def process_data(data):
    logger.info("Processing data")
    records = []
    for author in data:
        total_commits = sum(week["c"] for week in author["weeks"])
        records.append(
            {"login": author["author"]["login"], "total_commits": total_commits}
        )
    schema = pa.schema([("login", pa.string()), ("total_commits", pa.int64())])
    table = pa.Table.from_pylist(records, schema=schema)
    logger.info("Data processed into Arrow Table")
    return table


def main():
    logger.info("Starting main process")
    data = fetch_github_data()
    table = process_data(data)

    # Connect to MotherDuck and create a table
    logger.info("Connecting to MotherDuck")
    con = duckdb.connect()
    con.execute("ATTACH 'md:'")
    con.execute("CREATE DATABASE IF NOT EXISTS github")

    con.execute("""
        CREATE TABLE IF NOT EXISTS github.github_commits_large (
            login VARCHAR,
            total_commits BIGINT
        )
    """)

    # Insert data using ArrowTableLoadingBuffer
    buffer = ArrowTableLoadingBuffer(
        conn=con,
        pyarrow_schema=table.schema,
        table_name="github.github_commits_large",
        chunk_size=100000,
    )
    buffer.insert(table)

    logger.info("Data loaded into MotherDuck successfully")


if __name__ == "__main__":
    main()

Key Advantages

  - Type Safety: Explicit PyArrow schema definition prevents type inference issues
  - Memory Efficiency: Chunked insertion handles large datasets without memory overflow
  - Performance: Arrow format is optimized for columnar data processing
  - Scalability: Configurable chunk size (default 100,000 rows) balances speed and memory

The ArrowTableLoadingBuffer Class

The ArrowTableLoadingBuffer class provides efficient batch insertion:
  1. Initialization: Sets up the connection, schema, and chunk size
  2. Slicing: Divides the PyArrow table into manageable chunks
  3. Registration: Temporarily registers each chunk as a DuckDB table
  4. Insertion: Uses INSERT INTO ... SELECT * FROM for efficient loading
  5. Cleanup: Unregisters the temporary table after each chunk
Always define explicit table schemas when using PyArrow to avoid data type mismatches. The DuckDB table schema should match your PyArrow schema.

Comparison: When to Use Each Approach

Feature          | Simple (Pandas)           | Production (PyArrow)
-----------------|---------------------------|---------------------
Dataset Size     | Small to medium           | Large datasets
Memory Usage     | Higher                    | Optimized
Type Safety      | Inferred                  | Explicit
Setup Complexity | Minimal                   | Moderate
Best For         | Prototyping, exploration  | Production pipelines

Next Steps

  - Loading Data Guide: Official MotherDuck documentation on Python data loading
  - DuckDB Python API: Complete reference for the DuckDB Python client
