The DataFrameReader interface provides methods to load DataFrames from external storage systems, including local files, Amazon S3, and Hugging Face datasets.
Access
Access the reader through a session's `read` property:
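For instance (a minimal sketch, assuming a session object has already been created; session construction is outside the scope of this page):

```python
# The reader is not constructed directly; it is obtained from an
# existing session via its `read` property, as described above.
def get_reader(session):
    return session.read
```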
Supported Storage Schemes
Amazon S3
Format: `s3://{bucket_name}/{path_to_file}`
- Uses boto3 to acquire AWS credentials
- Supports glob patterns
Hugging Face Datasets
Format: `hf://{repo_type}/{repo_id}/{path_to_file}`
- Supports glob patterns (`*`, `**`)
- Supports dataset revisions and branch aliases (e.g., `@refs/convert/parquet`, `@~parquet`)
- Requires the `HF_TOKEN` environment variable for private datasets
Local Files
Format: `file://{absolute_or_relative_path}` or implicit
- Paths without a scheme are treated as local files
- Supports relative and absolute paths
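Putting the three schemes together, path strings like the following are accepted (bucket, repo, and file names below are illustrative placeholders):

```python
# Example path strings for each supported scheme (names are placeholders).
s3_path = "s3://my-bucket/data/users.csv"               # Amazon S3
hf_path = "hf://datasets/my-org/my-dataset/*.parquet"   # Hugging Face dataset
local_explicit = "file:///tmp/data/users.csv"           # local file, explicit scheme
local_implicit = "data/users.csv"                       # no scheme: treated as local
```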
Methods
csv()
Load a DataFrame from one or more CSV files.

Parameters:
- Paths: A single file path, a glob pattern (e.g., `"data/*.csv"`), or a list of paths.
- `schema`: Complete schema definition with column names and types. Only primitive types are supported. If provided, all files must match this schema exactly.
- `merge_schemas`: Whether to merge schemas across all files. If `True`, column names are unified and missing columns are filled with nulls. If `False`, all files must match the schema of the first file.

Notes
- The first row in each file is assumed to be a header row
- Delimiters (comma, tab, etc.) are automatically inferred
- Cannot specify both `schema` and `merge_schemas=True`
- Date/datetime columns are cast to strings during ingestion
- The “first file” is defined as:
  - First file in lexicographic order (for glob patterns)
  - First file in the provided list (for lists of paths)
Examples
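The original examples were not preserved here; the sketch below shows the call patterns implied by the parameter descriptions above. It assumes a previously created `session` object, and the exact positional/keyword signature is an assumption:

```python
# Hedged sketch: `session` is assumed to be a session from this library.
def csv_examples(session):
    # Single file
    users = session.read.csv("data/users.csv")
    # Glob pattern: the "first file" is the lexicographically smallest match
    logs = session.read.csv("logs/*.csv")
    # List of paths, unifying column names across files;
    # missing columns are filled with nulls
    merged = session.read.csv(
        ["2023.csv", "2024.csv"],
        merge_schemas=True,
    )
    return users, logs, merged
```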
parquet()
Load a DataFrame from one or more Parquet files.

Parameters:
- Paths: A single file path, a glob pattern (e.g., `"data/*.parquet"`), or a list of paths.
- `merge_schemas`: If `True`, infers and merges schemas across all files. Missing columns are filled with nulls, and differing types are widened to a common supertype.

Notes
- If `merge_schemas=False` (default), all files must match the schema of the first file
- Date and datetime columns are cast to strings during ingestion
- The “first file” is defined as:
  - First file in lexicographic order (for glob patterns)
  - First file in the provided list (for lists of paths)
Examples
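A hedged usage sketch of the call patterns implied by the parameter descriptions above (a previously created `session` object and the exact signature are assumptions):

```python
# Hedged sketch: `session` is assumed to be a session from this library.
def parquet_examples(session):
    # Single file
    events = session.read.parquet("data/events.parquet")
    # Glob pattern; by default all files must match the first file's schema
    daily = session.read.parquet("daily/*.parquet")
    # Merge schemas: missing columns become nulls, differing types
    # are widened to a common supertype
    combined = session.read.parquet(
        ["old.parquet", "new.parquet"],
        merge_schemas=True,
    )
    return events, daily, combined
```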
docs()
Load a DataFrame with document contents from Markdown or JSON files.

Parameters:
- Paths: Glob pattern (or list of glob patterns) for the folder(s) to load.
- Content type: Content type of the files.
- Exclude: A regex pattern to exclude files. If not provided, no files are excluded.
- Recursive: Whether to recursively load files from folders.
Notes
- Each row in the DataFrame corresponds to one file
- The DataFrame has these columns:
  - `file_path`: The path to the file
  - `error`: Error message if the file failed to load
  - `content`: The file content, cast to the `content_type`
- Recursive loading works with the `**` glob pattern when `recursive=True`
- Without `recursive=True`, `**` behaves like a single `*` pattern
Examples
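A hedged usage sketch (the `content_type` and `recursive` names come from the notes above; the `exclude` keyword name and a previously created `session` object are assumptions):

```python
# Hedged sketch: `session` is assumed to be a session from this library;
# the `exclude` keyword name is an assumption.
def docs_examples(session):
    # Load markdown files recursively; `**` only recurses with recursive=True
    notes = session.read.docs(
        "notes/**/*.md",
        content_type="markdown",
        recursive=True,
    )
    # Load JSON files, excluding anything under a "drafts" directory
    configs = session.read.docs(
        "configs/**/*.json",
        content_type="json",
        exclude=r"drafts/",
        recursive=True,
    )
    return notes, configs
```

Each row of the result holds one file, with `file_path`, `error`, and `content` columns as listed above.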
pdf_metadata()
Load a DataFrame with metadata from PDF files.

Parameters:
- Paths: Glob pattern (or list of glob patterns) for the folder(s) to load.
- Exclude: A regex pattern to exclude files.
- Recursive: Whether to recursively load files from folders.
Metadata Columns
The resulting DataFrame contains these columns:
- `file_path`: Path to the document
- `error`: Error message if the file failed to load
- `size`: Size of the PDF file in bytes
- `title`: Title of the PDF document
- `author`: Author of the PDF document
- `creation_date`: Creation date of the PDF
- `mod_date`: Modification date of the PDF
- `page_count`: Number of pages in the PDF
- `has_forms`: Whether the PDF contains form fields
- `has_signature_fields`: Whether the PDF contains signature fields
- `image_count`: Number of images in the PDF
- `is_encrypted`: Whether the PDF is encrypted
Examples
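A hedged usage sketch (a previously created `session` object, and the exact keyword names, are assumptions based on the parameter descriptions above):

```python
# Hedged sketch: `session` is assumed to be a session from this library.
def pdf_metadata_examples(session):
    # All PDFs under a folder tree
    reports = session.read.pdf_metadata("reports/**/*.pdf", recursive=True)
    # The result can then be filtered on the metadata columns listed
    # above, e.g. `page_count` or `is_encrypted`.
    return reports
```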
See Also
- DataFrameWriter - Write DataFrames to files
- Catalog - Manage tables and views
- Data Types - Available data types for schemas
