Load data into MotherDuck using Python with pandas and PyArrow
This guide demonstrates how to ingest data into MotherDuck using Python. We’ll cover two approaches: a simple method using pandas DataFrames and a production-ready method using PyArrow for efficient chunked ingestion.
Approach 1: Simple Ingestion with pandas

This method is ideal for smaller datasets and quick prototyping. It fetches data from an API, processes it into a pandas DataFrame, and loads it into MotherDuck.
```python
import logging

import duckdb
import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def fetch_github_data():
    url = "https://api.github.com/repos/duckdb/duckdb/stats/contributors"
    logger.info(f"Fetching data from {url}")
    response = requests.get(url)
    response.raise_for_status()
    logger.info("Data fetched successfully")
    return response.json()


def process_data(data):
    logger.info("Processing data")
    records = []
    for author in data:
        total_commits = sum(week["c"] for week in author["weeks"])
        records.append(
            {"login": author["author"]["login"], "total_commits": total_commits}
        )
    df = pd.DataFrame(records)
    logger.info("Data processed into DataFrame")
    return df


def main():
    logger.info("Starting main process")
    data = fetch_github_data()
    df = process_data(data)

    # Connect to MotherDuck and create a database
    con = duckdb.connect()
    logger.info("Connecting to MotherDuck")
    con.sql("ATTACH 'md:'")
    con.sql("CREATE DATABASE IF NOT EXISTS github")

    # Load the DataFrame into a MotherDuck table
    con.sql("CREATE TABLE IF NOT EXISTS github.github_commits AS SELECT * FROM df")
    logger.info("Data loaded into MotherDuck successfully")


if __name__ == "__main__":
    main()
```
Approach 2: Production-Ready Ingestion with PyArrow
For larger datasets, use PyArrow tables with chunked insertion. This approach provides better memory management, type safety, and performance optimization.