This guide demonstrates how to ingest data into MotherDuck using Python. We’ll cover two approaches: a simple method using pandas DataFrames and a production-ready method using PyArrow for efficient chunked ingestion.

Prerequisites

Before you begin, ensure you have:

  - A working Python installation
  - A MotherDuck account and an access token

Installation

Install the required Python packages:
pip install "duckdb>=1.1.3" "pandas>=2.2.3" "pyarrow>=19.0.0" "python-dotenv>=1.0.1" "requests>=2.32.3"
Create a .env file in your project directory and add your MotherDuck token:
MOTHERDUCK_TOKEN=your_token_here

Approach 1: Simple Ingestion with Pandas

This method is ideal for smaller datasets and quick prototyping. It fetches data from an API, processes it into a pandas DataFrame, and loads it into MotherDuck.

Complete Example

load_to_motherduck_small.py
import requests
import pandas as pd
import duckdb
import logging
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def fetch_github_data():
    url = "https://api.github.com/repos/duckdb/duckdb/stats/contributors"
    logger.info(f"Fetching data from {url}")
    response = requests.get(url)
    response.raise_for_status()
    logger.info("Data fetched successfully")
    return response.json()


def process_data(data):
    logger.info("Processing data")
    records = []
    for author in data:
        total_commits = sum(week["c"] for week in author["weeks"])
        records.append(
            {"login": author["author"]["login"], "total_commits": total_commits}
        )
    df = pd.DataFrame(records)
    logger.info("Data processed into DataFrame")
    return df


def main():
    logger.info("Starting main process")
    data = fetch_github_data()
    df = process_data(data)

    # Connect to MotherDuck and create a table
    logger.info("Connecting to MotherDuck")
    con = duckdb.connect()
    con.sql("ATTACH 'md:'")
    con.sql("CREATE DATABASE IF NOT EXISTS github")
    # Loading data into MotherDuck based on the DataFrame
    con.sql("CREATE TABLE IF NOT EXISTS github.github_commits AS SELECT * FROM df")
    logger.info("Data loaded into MotherDuck successfully")


if __name__ == "__main__":
    main()

Key Steps

  1. Fetch Data: Retrieve data from an external source (in this example, GitHub's API)
  2. Process into DataFrame: Transform the JSON data into a pandas DataFrame with the desired schema
  3. Connect to MotherDuck: Use duckdb.connect() and attach to MotherDuck with ATTACH 'md:'
  4. Load Data: Create a table directly from the DataFrame using CREATE TABLE ... AS SELECT * FROM df
DuckDB can automatically read pandas DataFrames by referencing the variable name in SQL queries.

Approach 2: Production-Ready Ingestion with PyArrow

For larger datasets, use PyArrow tables with chunked insertion. This approach provides better memory management, type safety, and performance optimization.

Complete Example

load_to_motherduck_large.py
import requests
import pyarrow as pa
import duckdb
import logging
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ArrowTableLoadingBuffer:
    def __init__(
        self,
        conn: duckdb.DuckDBPyConnection,
        pyarrow_schema: pa.Schema,
        table_name: str,
        chunk_size: int = 100000,  # typical good chunk size for DuckDB
    ):
        self.conn = conn
        self.pyarrow_schema = pyarrow_schema
        self.table_name = table_name
        self.total_inserted = 0
        self.chunk_size = chunk_size

    def insert(self, table: pa.Table):
        total_rows = table.num_rows
        for batch_start in range(0, total_rows, self.chunk_size):
            batch_end = min(batch_start + self.chunk_size, total_rows)
            chunk = table.slice(batch_start, batch_end - batch_start)
            self.insert_chunk(chunk)
            logger.info(f"Inserted chunk {batch_start} to {batch_end}")
        self.total_inserted += total_rows
        logger.info(f"Total inserted: {self.total_inserted} rows")

    def insert_chunk(self, chunk: pa.Table):
        self.conn.register("buffer_table", chunk)
        insert_query = f"INSERT INTO {self.table_name} SELECT * FROM buffer_table"
        self.conn.execute(insert_query)
        self.conn.unregister("buffer_table")


def fetch_github_data():
    url = "https://api.github.com/repos/duckdb/duckdb/stats/contributors"
    logger.info(f"Fetching data from {url}")
    response = requests.get(url)
    response.raise_for_status()
    logger.info("Data fetched successfully")
    return response.json()


def process_data(data):
    logger.info("Processing data")
    records = []
    for author in data:
        total_commits = sum(week["c"] for week in author["weeks"])
        records.append(
            {"login": author["author"]["login"], "total_commits": total_commits}
        )
    schema = pa.schema([("login", pa.string()), ("total_commits", pa.int64())])
    table = pa.Table.from_pylist(records, schema=schema)
    logger.info("Data processed into Arrow Table")
    return table


def main():
    logger.info("Starting main process")
    data = fetch_github_data()
    table = process_data(data)

    # Connect to MotherDuck and create a table
    logger.info("Connecting to MotherDuck")
    con = duckdb.connect()
    con.execute("ATTACH 'md:'")
    con.execute("CREATE DATABASE IF NOT EXISTS github")

    con.execute("""
        CREATE TABLE IF NOT EXISTS github.github_commits_large (
            login VARCHAR,
            total_commits BIGINT
        )
    """)

    # Insert data using ArrowTableLoadingBuffer
    buffer = ArrowTableLoadingBuffer(
        conn=con,
        pyarrow_schema=table.schema,
        table_name="github.github_commits_large",
        chunk_size=100000,
    )
    buffer.insert(table)

    logger.info("Data loaded into MotherDuck successfully")


if __name__ == "__main__":
    main()

Key Advantages

  - Type Safety: Explicit PyArrow schema definition prevents type inference issues
  - Memory Efficiency: Chunked insertion handles large datasets without memory overflow
  - Performance: Arrow format is optimized for columnar data processing
  - Scalability: Configurable chunk size (default 100,000 rows) balances speed and memory

The ArrowTableLoadingBuffer Class

The ArrowTableLoadingBuffer class provides efficient batch insertion:
  1. Initialization: Sets up the connection, schema, and chunk size
  2. Slicing: Divides the PyArrow table into manageable chunks
  3. Registration: Temporarily registers each chunk as a DuckDB table
  4. Insertion: Uses INSERT INTO ... SELECT * FROM for efficient loading
  5. Cleanup: Unregisters the temporary table after each chunk
Always define explicit table schemas when using PyArrow to avoid data type mismatches. The DuckDB table schema should match your PyArrow schema.

Comparison: When to Use Each Approach

Feature          | Simple (Pandas)           | Production (PyArrow)
-----------------|---------------------------|---------------------
Dataset Size     | Small to medium           | Large datasets
Memory Usage     | Higher                    | Optimized
Type Safety      | Inferred                  | Explicit
Setup Complexity | Minimal                   | Moderate
Best For         | Prototyping, exploration  | Production pipelines

Next Steps

  - Loading Data Guide: Official MotherDuck documentation on Python data loading
  - DuckDB Python API: Complete reference for the DuckDB Python client
