Profile File Paths
The Delta Sharing Python Connector supports multiple ways to specify profile file paths, enabling flexible deployment scenarios.
FSSPEC Support
Profile paths for SharingClient and load_as_pandas support any URL format recognized by FSSPEC (Filesystem Spec).
FSSPEC is a Python library that provides a unified interface to access different filesystems. It allows you to use the same code to work with local files, cloud storage (S3, Azure, GCS), HTTP endpoints, and more.
Local File System
import delta_sharing
# Absolute path
client = delta_sharing.SharingClient("/home/user/profiles/production.share")
# Relative path
client = delta_sharing.SharingClient("./profiles/production.share")
Amazon S3
# S3 path (requires s3fs: pip install delta-sharing[s3])
profile_path = "s3://my-bucket/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)
# Load table with S3 profile
table_url = f"{profile_path}#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
Install S3 support with: pip install delta-sharing[s3]
Azure Blob Storage
# Azure Blob Storage (requires adlfs: pip install delta-sharing[abfs])
profile_path = "abfs://[email protected] /profiles/production.share"
client = delta_sharing.SharingClient(profile_path)
Install Azure Blob Storage support with: pip install delta-sharing[abfs]
Google Cloud Storage
# GCS path (requires gcsfs: pip install delta-sharing[gcs])
profile_path = "gs://my-bucket/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)
Install GCS support with: pip install delta-sharing[gcs]
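HTTP(S) Endpoints
FSSPEC also recognizes plain HTTP(S) URLs, so a profile can be served from a web endpoint. A minimal sketch, using a hypothetical URL; fsspec's HTTP filesystem requires aiohttp:
# HTTP(S) path (hypothetical URL; requires aiohttp: pip install aiohttp)
profile_path = "https://config.example.com/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)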
DBFS Support
When using Databricks File System (DBFS), you can access profile files using the /dbfs/ prefix to treat them as local files:
import delta_sharing
# DBFS path accessed as local file
profile_path = "/dbfs/mnt/profiles/production.share"
client = delta_sharing.SharingClient(profile_path)
# Construct table URL
table_url = f"{profile_path}#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
The /dbfs/ prefix works for SharingClient and load_as_pandas, but load_as_spark requires Hadoop FileSystem-compatible paths (e.g., dbfs:/mnt/profiles/production.share).
Spark Profile Paths
For load_as_spark and load_table_changes_as_spark, profile paths must be compatible with Hadoop FileSystem:
import delta_sharing
# Hadoop-compatible paths for Spark
table_url = "s3a://my-bucket/profile.share#my_share.my_schema.my_table"
df = delta_sharing.load_as_spark(table_url)
# DBFS path for Spark
table_url = "dbfs:/mnt/profiles/production.share#my_share.my_schema.my_table"
df = delta_sharing.load_as_spark(table_url)
Use s3a:// (not s3://) for S3 paths with Spark, and dbfs:/ (not /dbfs/) for DBFS paths.
JSON Predicate Hints
Predicate hints allow you to push down filters to the Delta Sharing server, reducing the amount of data transferred and improving query performance.
Predicates are specified as JSON strings following the Delta Sharing Protocol specification.
Basic Predicates
Equality
import delta_sharing
# Filter where country = 'USA'
predicate = '''{
  "op": "equal",
  "children": [
    {"op": "column", "name": "country", "valueType": "string"},
    {"op": "literal", "value": "USA", "valueType": "string"}
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)
Less Than / Greater Than
# Filter where age > 21
predicate = '''{
  "op": "greaterThan",
  "children": [
    {"op": "column", "name": "age", "valueType": "int"},
    {"op": "literal", "value": "21", "valueType": "int"}
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)
IS NULL
# Filter where email is null
predicate = '''{
  "op": "isNull",
  "children": [
    {"op": "column", "name": "email", "valueType": "string"}
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)
Composite Predicates
AND
# Filter where country = 'USA' AND age > 21
predicate = '''{
  "op": "and",
  "children": [
    {
      "op": "equal",
      "children": [
        {"op": "column", "name": "country", "valueType": "string"},
        {"op": "literal", "value": "USA", "valueType": "string"}
      ]
    },
    {
      "op": "greaterThan",
      "children": [
        {"op": "column", "name": "age", "valueType": "int"},
        {"op": "literal", "value": "21", "valueType": "int"}
      ]
    }
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)
OR
# Filter where country = 'USA' OR country = 'Canada'
predicate = '''{
  "op": "or",
  "children": [
    {
      "op": "equal",
      "children": [
        {"op": "column", "name": "country", "valueType": "string"},
        {"op": "literal", "value": "USA", "valueType": "string"}
      ]
    },
    {
      "op": "equal",
      "children": [
        {"op": "column", "name": "country", "valueType": "string"},
        {"op": "literal", "value": "Canada", "valueType": "string"}
      ]
    }
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)
Date and Timestamp Predicates
# Filter where hire_date = '2021-04-29'
predicate = '''{
  "op": "equal",
  "children": [
    {"op": "column", "name": "hire_date", "valueType": "date"},
    {"op": "literal", "value": "2021-04-29", "valueType": "date"}
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)
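A timestamp comparison follows the same pattern; a minimal sketch, assuming a hypothetical event_time column:
# Filter where event_time >= '2024-01-01T00:00:00Z' (event_time is a hypothetical column)
predicate = '''{
  "op": "greaterThanOrEqual",
  "children": [
    {"op": "column", "name": "event_time", "valueType": "timestamp"},
    {"op": "literal", "value": "2024-01-01T00:00:00Z", "valueType": "timestamp"}
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)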
Supported Operators
equal: Equality comparison
lessThan: Less than
lessThanOrEqual: Less than or equal
greaterThan: Greater than
greaterThanOrEqual: Greater than or equal
and: Logical AND
or: Logical OR
not: Logical NOT (see the sketch after this list)
isNull: Check if value is NULL
isNotNull: Check if value is NOT NULL
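The not operator wraps a single child predicate. A minimal sketch negating the IS NULL example above (equivalent to isNotNull):
# Filter where email is NOT null, expressed with "not"
predicate = '''{
  "op": "not",
  "children": [
    {
      "op": "isNull",
      "children": [
        {"op": "column", "name": "email", "valueType": "string"}
      ]
    }
  ]
}'''
df = delta_sharing.load_as_pandas(
    table_url,
    jsonPredicateHints=predicate
)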
Supported Value Types
bool: Boolean values
int: Integer values
long: Long integer values
float: Floating-point values
double: Double-precision floating-point values
string: String values
date: Date values (format: YYYY-MM-DD)
timestamp: Timestamp values (format: YYYY-MM-DDThh:mm:ss[.SSS]Z)
Predicate hints are optimization hints only. The server may return more data than specified by the predicate, so additional client-side filtering may be necessary for correctness.
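Because the server may legitimately return a superset of matching rows, re-apply the filter on the client after loading. A minimal sketch, reusing the country/age predicate from the AND example above:
import delta_sharing
# The hint reduces data transfer; the pandas filter guarantees correctness
df = delta_sharing.load_as_pandas(table_url, jsonPredicateHints=predicate)
df = df[(df["country"] == "USA") & (df["age"] > 21)]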
Memory Optimization with convert_in_batches
For large tables, the convert_in_batches parameter can significantly reduce memory consumption by converting Parquet files to pandas DataFrames one batch at a time instead of one file at a time.
When to Use
Use convert_in_batches=True when:
Working with large tables that don’t fit comfortably in memory
Individual Parquet files are large
You want to reduce peak memory usage
Query performance is less critical than memory usage
Trade-offs
Advantages:
Lower peak memory usage
Prevents out-of-memory errors on constrained systems
Especially beneficial for Parquet-format queries
Disadvantages:
May take longer to complete
May download more data due to batch overhead
Slightly more CPU overhead for batch processing
Example Usage
import delta_sharing
# Load large table with batch conversion
df = delta_sharing.load_as_pandas(
    table_url,
    convert_in_batches=True
)
# Combine with other optimizations
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    convert_in_batches=True,
    jsonPredicateHints=my_predicate
)
With Change Data Feed
# Load large CDF query with batch conversion
changes = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    ending_version=1000,
    convert_in_batches=True,
    use_delta_format=True
)
Delta Format vs. Parquet Format
Delta Sharing supports two data formats for transferring table data: Delta format and Parquet format.

| Feature            | Delta Format                  | Parquet Format           |
|--------------------|-------------------------------|--------------------------|
| Performance        | Faster for most queries       | Slower for large tables  |
| Predicate Pushdown | Excellent                     | Limited                  |
| Server Support     | Requires Delta format support | Universal                |
| Data Transfer      | More efficient                | Less efficient           |
| Compatibility      | Newer feature                 | Widely supported         |
import delta_sharing
# Explicitly use Delta format
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True
)
# Delta format with predicates (best performance)
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    jsonPredicateHints=my_predicate
)
If use_delta_format is not specified, the connector automatically selects the best format based on table metadata:
# Connector chooses format automatically
df = delta_sharing.load_as_pandas(table_url)
For Change Data Feed queries, Delta format is recommended:
# CDF with Delta format for best performance
changes = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    ending_version=100,
    use_delta_format=True
)
When using Delta format for CDF, the connector properly handles metadata changes by replaying the Delta log, ensuring data consistency across versions.
Working with DeltaSharingProfile Objects
Instead of using profile file paths, you can create and use DeltaSharingProfile objects directly for more control over authentication.
Reading from File
import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# Read profile from file
profile = DeltaSharingProfile.read_from_file("/path/to/profile.share")

# Use with SharingClient
client = delta_sharing.SharingClient(profile)

# Use with load_as_spark
df = delta_sharing.load_as_spark(
    "my_share.my_schema.my_table",
    delta_sharing_profile=profile
)
Profile Authentication Types
The connector supports multiple authentication types:
Bearer Token (Version 1)
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "your-token-here",
  "expirationTime": "2024-12-31T23:59:59Z"
}
Bearer Token (Version 2)
{
  "shareCredentialsVersion": 2,
  "type": "bearer_token",
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "your-token-here",
  "expirationTime": "2024-12-31T23:59:59Z"
}
OAuth Client Credentials
{
  "shareCredentialsVersion": 2,
  "type": "oauth_client_credentials",
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "tokenEndpoint": "https://auth.example.com/oauth/token",
  "clientId": "your-client-id",
  "clientSecret": "your-client-secret",
  "scope": "delta-sharing"
}
Basic Authentication
{
  "shareCredentialsVersion": 2,
  "type": "basic",
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "username": "your-username",
  "password": "your-password"
}
Programmatic Profile Creation
import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# Create profile programmatically
profile = DeltaSharingProfile(
    share_credentials_version=2,
    type="bearer_token",
    endpoint="https://sharing.example.com/delta-sharing",
    bearer_token="your-token-here",
    expiration_time="2024-12-31T23:59:59Z"
)

# Use with connector
client = delta_sharing.SharingClient(profile)
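An OAuth profile can be constructed the same way. A sketch, assuming the constructor accepts the snake_case counterparts of the JSON fields shown earlier (token_endpoint, client_id, client_secret, scope):
# OAuth client credentials profile (field names assumed from the JSON schema above)
profile = DeltaSharingProfile(
    share_credentials_version=2,
    type="oauth_client_credentials",
    endpoint="https://sharing.example.com/delta-sharing",
    token_endpoint="https://auth.example.com/oauth/token",
    client_id="your-client-id",
    client_secret="your-client-secret",
    scope="delta-sharing"
)
client = delta_sharing.SharingClient(profile)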
Performance Best Practices
1. Use Delta Format
# Best performance for most queries
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True
)
2. Apply Predicate Hints
# Reduce data transfer with predicates
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    jsonPredicateHints=my_predicate
)
3. Use Limits for Exploration
# Sample data before loading full table
sample = delta_sharing.load_as_pandas(table_url, limit=1000)
print(sample.describe())

# Load full table only if needed; looks_good() is a placeholder for your own validation
if looks_good(sample):
    df = delta_sharing.load_as_pandas(table_url)
4. Enable Batch Conversion for Large Tables
# Reduce memory usage for large tables
df = delta_sharing.load_as_pandas(
    table_url,
    convert_in_batches=True
)
5. Cache Profile Files Locally
# Avoid repeated S3 reads of profile files
import shutil
import fsspec
import delta_sharing

# Download profile once
with fsspec.open("s3://bucket/profile.share", "rb") as src:
    with open("/tmp/profile.share", "wb") as dst:
        shutil.copyfileobj(src, dst)

# Use local copy
profile = delta_sharing.DeltaSharingProfile.read_from_file("/tmp/profile.share")
client = delta_sharing.SharingClient(profile)
6. Reuse SharingClient Instances
# Create client once
client = delta_sharing.SharingClient(profile_file)
# Reuse for multiple operations
tables = client.list_all_tables()
for table in tables:
# Client connection is reused
print (table.name)
Error Handling
Common Errors and Solutions
import delta_sharing
from requests.exceptions import HTTPError
try:
    df = delta_sharing.load_as_pandas(table_url)
except HTTPError as e:
    if e.response.status_code == 401:
        print("Authentication failed. Check your bearer token.")
    elif e.response.status_code == 403:
        print("Access denied. You don't have permission for this table.")
    elif e.response.status_code == 404:
        print("Table not found. Check the table URL.")
    else:
        print(f"HTTP error: {e}")
except ValueError as e:
    print(f"Invalid URL or parameters: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Validating Profile Files
from delta_sharing.protocol import DeltaSharingProfile
try:
    profile = DeltaSharingProfile.read_from_file(profile_path)
    print(f"Profile valid. Endpoint: {profile.endpoint}")
except FileNotFoundError:
    print(f"Profile file not found: {profile_path}")
except ValueError as e:
    print(f"Invalid profile format: {e}")
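Beyond structural validation, you may also want to check whether a bearer token has expired. A minimal sketch, assuming the profile's expiration_time is an ISO-8601 string as in the examples above:
from datetime import datetime, timezone

# expiration_time is optional; treat a missing value as non-expiring
if profile.expiration_time is not None:
    # fromisoformat() on Python < 3.11 does not accept a trailing "Z"
    expires = datetime.fromisoformat(profile.expiration_time.replace("Z", "+00:00"))
    if expires <= datetime.now(timezone.utc):
        print("Warning: bearer token has expired; request a new profile file.")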
Complete Advanced Example
import json

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# Load profile from S3
profile_path = "s3://my-bucket/profiles/production.share"
profile = DeltaSharingProfile.read_from_file(profile_path)

# Create client
client = delta_sharing.SharingClient(profile)

# Explore available tables
print("Available tables:")
tables = client.list_all_tables()
for table in tables:
    print(f"  {table.share}.{table.schema}.{table.name}")

# Select a table
table_url = f"{profile_path}#{tables[0].share}.{tables[0].schema}.{tables[0].name}"

# Get table metadata
metadata = delta_sharing.get_table_metadata(table_url)
print(f"\nTable: {metadata.name}")
print(f"Format: {metadata.format}")
print(f"Partition columns: {metadata.partition_columns}")

# Build predicate
predicate = json.dumps({
    "op": "and",
    "children": [
        {
            "op": "greaterThan",
            "children": [
                {"op": "column", "name": "timestamp", "valueType": "timestamp"},
                {"op": "literal", "value": "2024-01-01T00:00:00Z", "valueType": "timestamp"}
            ]
        },
        {
            "op": "equal",
            "children": [
                {"op": "column", "name": "status", "valueType": "string"},
                {"op": "literal", "value": "active", "valueType": "string"}
            ]
        }
    ]
})

# Load with optimizations
df = delta_sharing.load_as_pandas(
    table_url,
    use_delta_format=True,
    jsonPredicateHints=predicate,
    convert_in_batches=True
)
print(f"\nLoaded {len(df)} rows matching predicate")
print(df.head())

# Query change data feed
changes = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_version=0,
    use_delta_format=True,
    convert_in_batches=True
)
print(f"\nTotal changes: {len(changes)}")
print(changes['_change_type'].value_counts())