The DataFrameReader interface provides methods to load DataFrames from external storage systems, including local files, S3, and Hugging Face datasets.

Access

Access the reader through a session’s read property:
session = fc.Session.get_or_create()
df = session.read.csv("data.csv")

Supported Storage Schemes

Amazon S3

Format: s3://{bucket_name}/{path_to_file}
  • Uses boto3 to acquire AWS credentials
  • Supports glob patterns
df = session.read.csv("s3://my-bucket/data.csv")
df = session.read.parquet("s3://my-bucket/data/*.parquet")

Hugging Face Datasets

Format: hf://{repo_type}/{repo_id}/{path_to_file}
  • Supports glob patterns (*, **)
  • Supports dataset revisions and branch aliases (e.g., @refs/convert/parquet, @~parquet)
  • Requires HF_TOKEN environment variable for private datasets
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

Local Files

Format: file://{absolute_or_relative_path} or implicit
  • Paths without a scheme are treated as local files
  • Supports relative and absolute paths
df = session.read.csv("./data.csv")
df = session.read.parquet("file:///home/user/data.parquet")

Methods

csv()

Load a DataFrame from one or more CSV files.
  • paths (str | Path | list[str | Path], required): A single file path, a glob pattern (e.g., "data/*.csv"), or a list of paths.
  • schema (Schema, default: None): Complete schema definition with column names and types. Only primitive types are supported. If provided, all files must match this schema exactly.
  • merge_schemas (bool, default: False): Whether to merge schemas across all files. If True, column names are unified and missing columns are filled with nulls. If False, all files must match the schema of the first file.

Notes

  • The first row in each file is assumed to be a header row
  • Delimiters (comma, tab, etc.) are automatically inferred
  • Cannot specify both schema and merge_schemas=True
  • Date/datetime columns are cast to strings during ingestion
  • The “first file” is defined as:
    • First file in lexicographic order (for glob patterns)
    • First file in the provided list (for lists of paths)
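The lexicographic rule above can be illustrated with a small standard-library sketch (file names are hypothetical; the reader does this ordering internally):

```python
# Sketch: which file defines the schema when merge_schemas=False.
# For glob patterns, matches are considered in lexicographic order,
# so the "first file" is the smallest path under sorted().
matches = ["data/2024-02.csv", "data/2024-01.csv", "data/2024-03.csv"]

first_file = sorted(matches)[0]  # lexicographic order
print(first_file)  # data/2024-01.csv
```

For a list of paths, no sorting happens: the first element of the list is the schema-defining file.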

Examples

df = session.read.csv("file.csv")

parquet()

Load a DataFrame from one or more Parquet files.
  • paths (str | Path | list[str | Path], required): A single file path, a glob pattern (e.g., "data/*.parquet"), or a list of paths.
  • merge_schemas (bool, default: False): If True, infers and merges schemas across all files. Missing columns are filled with nulls, and differing types are widened to a common supertype.

Notes

  • If merge_schemas=False (default), all files must match the schema of the first file
  • Date and datetime columns are cast to strings during ingestion
  • The “first file” is defined as:
    • First file in lexicographic order (for glob patterns)
    • First file in the provided list (for lists of paths)

Examples

df = session.read.parquet("file.parquet")
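The merge behavior described above can be pictured as a column union with null fill. The following is a simplified stand-in using plain dicts, not the reader's actual implementation (type widening is omitted for brevity):

```python
# Sketch: schema merging as a column union. Rows from files that lack a
# column get None (null) for that column.
file_a = [{"id": 1, "name": "a"}]      # hypothetical rows from file A
file_b = [{"id": 2, "score": 0.5}]     # hypothetical rows from file B

columns = sorted({k for row in file_a + file_b for k in row})
merged = [{c: row.get(c) for c in columns} for row in file_a + file_b]
print(merged)
# [{'id': 1, 'name': 'a', 'score': None}, {'id': 2, 'name': None, 'score': 0.5}]
```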

docs()

Load a DataFrame with document contents from markdown or JSON files.
  • paths (str | Path | list[str | Path], required): Glob pattern (or list of glob patterns) for the folder(s) to load.
  • content_type ('markdown' | 'json', required): Content type of the files.
  • exclude (str, default: None): A regex pattern to exclude files. If not provided, no files are excluded.
  • recursive (bool, default: False): Whether to recursively load files from folders.

Notes

  • Each row in the DataFrame corresponds to one file
  • The DataFrame has these columns:
    • file_path: The path to the file
    • error: Error message if the file failed to load
    • content: The file content cast to the content_type
  • Recursive loading works with the ** glob pattern when recursive=True
  • Without recursive=True, ** behaves like a single * pattern

Examples

df = session.read.docs(
    "data/docs/**/*.md", 
    content_type="markdown", 
    recursive=True
)
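The ** behavior in the notes above mirrors Python's pathlib globbing, which can be demonstrated with the standard library (temporary files, hypothetical layout):

```python
import tempfile
from pathlib import Path

# Sketch: with recursive globbing, "**" matches across directory levels;
# a flat pattern only sees one level (like a single "*").
root = Path(tempfile.mkdtemp())
(root / "a").mkdir()
(root / "a" / "deep.md").write_text("# deep")
(root / "top.md").write_text("# top")

recursive = sorted(p.name for p in root.glob("**/*.md"))
flat = sorted(p.name for p in root.glob("*.md"))
print(recursive)  # ['deep.md', 'top.md']
print(flat)       # ['top.md']
```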

pdf_metadata()

Load a DataFrame with metadata from PDF files.
  • paths (str | Path | list[str | Path], required): Glob pattern (or list of glob patterns) for the folder(s) to load.
  • exclude (str, default: None): A regex pattern to exclude files.
  • recursive (bool, default: False): Whether to recursively load files from folders.

Metadata Columns

The resulting DataFrame contains these columns:
  • file_path: Path to the document
  • error: Error message if the file failed to load
  • size: Size of the PDF file in bytes
  • title: Title of the PDF document
  • author: Author of the PDF document
  • creation_date: Creation date of the PDF
  • mod_date: Modification date of the PDF
  • page_count: Number of pages in the PDF
  • has_forms: Whether the PDF contains form fields
  • has_signature_fields: Whether the PDF contains signature fields
  • image_count: Number of images in the PDF
  • is_encrypted: Whether the PDF is encrypted

Examples

df = session.read.pdf_metadata("data/docs/**/*.pdf", recursive=True)
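The exclude parameter takes a regex. How such a pattern filters a set of candidate paths can be sketched with the re module (the exact matching semantics used by the reader are an assumption here; the paths are hypothetical):

```python
import re

# Sketch: filtering candidate paths with an exclude regex. This pattern
# drops anything under a "drafts" directory.
paths = ["docs/report.pdf", "docs/drafts/wip.pdf", "docs/final.pdf"]
exclude = re.compile(r"/drafts/")

kept = [p for p in paths if not exclude.search(p)]
print(kept)  # ['docs/report.pdf', 'docs/final.pdf']
```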
