Overview
Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files). A Dataset contains one or more Fragments, such as files, of potentially differing type and partitioning.
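As a sketch of the workflow this enables, a partitioned dataset can be written, reopened, and queried lazily with dplyr (the temporary path is illustrative):

```r
library(arrow)
library(dplyr)

# Write a data frame as a Parquet dataset partitioned by `cyl`
path <- tempfile()
write_dataset(mtcars, path, partitioning = "cyl")

# Reopen it: partition directories become a dataset column
ds <- open_dataset(path)

# Queries are lazy; only the cyl == 6 files are read at collect()
ds |>
  filter(cyl == 6) |>
  select(mpg, hp) |>
  collect()
```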
Dataset Classes
Dataset
The base Dataset class.

Factory Method
Dataset$create()
Alias for open_dataset(). See Opening Datasets below.
Methods
$NewScan()
Returns a ScannerBuilder for building a query.

$WithSchema(schema)
Returns a new Dataset with the specified schema. This method currently supports only adding, removing, or reordering fields in the schema: you cannot alter or cast the field types.

schema: The new schema to use
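A hedged sketch of these methods (an InMemoryDataset stands in here for any Dataset):

```r
library(arrow)

ds <- InMemoryDataset$create(arrow_table(x = 1:10, y = letters[1:10]))

# $NewScan(): build a scanner, project to one column, materialize a Table
tab <- ds$NewScan()$Project("x")$Finish()$ToTable()

# $WithSchema(): drop a field; the field types themselves cannot be changed
ds2 <- ds$WithSchema(schema(x = int32()))
ds2$schema
```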
Active Bindings
$schema
Returns the Schema of the Dataset. You may also replace the dataset's schema.

$metadata
Returns the schema metadata.

$num_rows
Number of rows in the dataset.

$num_cols
Number of columns in the dataset.

$type
Returns the Dataset's type.

FileSystemDataset
A Dataset backed by files in a file system.

Active Bindings
$files
Returns the files contained in this FileSystemDataset.

$format
Returns the FileFormat of the files in this Dataset.

$filesystem
Returns the FileSystem of the files in this Dataset.

UnionDataset
A Dataset composed of multiple child Datasets.

Active Bindings
$children
Returns the UnionDataset's child Datasets.

InMemoryDataset
A Dataset backed by an in-memory Table.

Factory Method
InMemoryDataset$create()
Create an InMemoryDataset from a Table. Its argument is a Table, or any object that can be converted to a Table.
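For example (a minimal sketch):

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
ds <- InMemoryDataset$create(df)  # the data.frame is converted to a Table
ds$num_rows
```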
Opening Datasets
open_dataset()
Open a multi-file dataset and return a Dataset object.

sources: One of:
- A string path or URI to a directory containing data files
- A FileSystem that references a directory (such as from s3_bucket())
- A string path or URI to a single file
- A character vector of paths or URIs to individual data files
- A list of Dataset objects
- A list of DatasetFactory objects
schema: Schema for the Dataset. If NULL (the default), the schema is inferred from the data sources
partitioning: When sources is a directory path/URI, one of:
- A Schema, in which case the file paths will be parsed and path segments matched with schema fields
- A character vector defining field names for the path segments (types will be autodetected)
- A Partitioning or PartitioningFactory, such as from hive_partition()
- NULL for no partitioning
hive_style: Should partitioning be interpreted as Hive-style? The default, NA, means to inspect the file paths for Hive-style partitioning and behave accordingly
unify_schemas: Should all data fragments be scanned to create a unified schema? If FALSE, only the first fragment is inspected. The default is FALSE for directory paths (scanning every fragment may be slow), but TRUE when sources is a list of Datasets
format: File format identifier. Currently supported:
- "parquet"
- "ipc"/"arrow"/"feather" (Feather v2 only)
- "csv"/"text"
- "tsv"
- "json" (newline-delimited JSON only)
...: Additional format-specific arguments (see read_csv_arrow(), read_parquet(), etc.)

open_csv_dataset()
Open a multi-file CSV dataset. A wrapper around open_dataset() with CSV-specific parameters.
sources: Same as for open_dataset()
schema: Schema for the Dataset
partitioning: Partitioning specification
col_names: Whether the first row contains column names, or a character vector of column names
col_types: A Schema, partial schema, or compact string representation of column types
na: Character vector of strings to interpret as missing values
skip: Number of lines to skip before reading data
skip_empty_rows: Whether to skip empty rows
delim: Single-character field delimiter (CSV: ",", TSV: "\t")
quote: Single character used to quote strings
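A sketch of opening a directory of CSV files as one Dataset (the file names are illustrative):

```r
library(arrow)

d <- tempfile()
dir.create(d)
write.csv(data.frame(x = 1:2), file.path(d, "part-0.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(d, "part-1.csv"), row.names = FALSE)

ds <- open_csv_dataset(d)  # both files are combined into one Dataset
dplyr::collect(ds)         # a tibble containing rows from both files
```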
open_delim_dataset()
Open a multi-file dataset of delimiter-separated files.

delim: Single-character field delimiter
Other parameters are the same as for open_csv_dataset().
open_tsv_dataset()
Open a multi-file TSV (tab-separated values) dataset. Automatically sets delim = "\t".
Parameters are the same as for open_csv_dataset(), except that the delimiter is fixed.
Writing Datasets
write_dataset()
Write a dataset to disk in partitioned files.

dataset: Dataset, Table, RecordBatch, arrow_dplyr_query, or data.frame to write. If an arrow_dplyr_query, the query is evaluated first
path: String path, URI, or SubTreeFileSystem referencing a directory to write to (created if it does not exist)
format: File format identifier: "parquet", "feather", "arrow", "ipc", "csv", "tsv", "text", or "json"
partitioning: Character vector of columns to use as partition keys, or a Partitioning object. Defaults to the current group_by() columns
basename_template: String template for file names. Must contain "{i}", which is replaced with an autoincremented integer
hive_style: Whether to write partition segments as Hive-style (key1=value1/key2=value2/file.ext) or as bare values
existing_data_behavior: One of:
- “overwrite” - new files overwrite existing files
- “error” - fail if destination is not empty
- “delete_matching” - delete existing partitions that will be written to
max_partitions: Maximum number of partitions any batch may be written into
max_open_files: Maximum number of files that can be left open during the write. If too low, data may be fragmented into many small files
max_rows_per_file: Maximum number of rows per file; 0 means unlimited
min_rows_per_group: Write row groups to disk when this many rows have accumulated
max_rows_per_group: Maximum rows allowed in a single group. Must be greater than min_rows_per_group
create_dir: Whether to create directories. Requires appropriate permissions
preserve_order: Whether to preserve the order of rows
...: Additional format-specific arguments (see write_parquet() for Parquet options)

write_csv_dataset()
Write a dataset as CSV files.

col_names: Whether to write column names as the first row
batch_size: Maximum number of rows processed at a time
delim: Delimiter character (cannot be changed for write_csv_dataset())
na: String to use for missing values
eol: End-of-line character
quoting_style: One of:
- “needed” - Quote strings/binary values that need quotes
- “all_valid” - Quote all valid values
- “none” - Do not quote any values
Other parameters are the same as for write_dataset().
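Putting the writer parameters together (a hedged sketch using a temporary directory):

```r
library(arrow)
library(dplyr)

out <- tempfile()
mtcars |>
  group_by(cyl) |>    # group_by() columns become the partition keys
  write_csv_dataset(out, col_names = TRUE, na = "")

list.files(out, recursive = TRUE)  # one subdirectory per value of cyl
```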
write_tsv_dataset()
Write a dataset as TSV (tab-separated values) files. Automatically sets delim = "\t".
Parameters are the same as for write_csv_dataset(), except that the delimiter is fixed.
write_delim_dataset()
Write a dataset as delimited files with a custom delimiter.

delim: Single-character field delimiter
Other parameters are the same as for write_csv_dataset().
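A round-trip sketch with a custom delimiter:

```r
library(arrow)

out <- tempfile()
write_delim_dataset(data.frame(x = 1:3, y = c("a", "b", "c")), out, delim = "|")

# Read it back with the matching delimiter
ds <- open_delim_dataset(out, delim = "|")
dplyr::collect(ds)
```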