REST API Source

The rest_api source provides a declarative way to load data from REST APIs. It handles common patterns like pagination, authentication, and incremental loading automatically.

Quick Start

Load data from a REST API with minimal configuration:

import dlt
from dlt.sources.rest_api import rest_api_source

# Create a source for the Pokemon API
pokemon_source = rest_api_source({
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
    },
    "resources": [
        {
            "name": "pokemon",
            "endpoint": {
                "path": "pokemon",
                "params": {
                    "limit": 100,
                },
            },
        },
    ],
})

# Load the data
pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",
    destination="duckdb",
    dataset_name="pokemon_data"
)

load_info = pipeline.run(pokemon_source)
print(load_info)

Configuration Structure

The REST API source uses a configuration dictionary with three main sections:

Client Configuration

Define base URL, authentication, and default pagination:

"client": {
    "base_url": "https://api.example.com/",
    "auth": {
        "type": "bearer",
        "token": dlt.secrets["api_token"],
    },
    "paginator": "json_link",
}

Resource Defaults

Set common configurations for all resources:

"resource_defaults": {
    "primary_key": "id",
    "write_disposition": "merge",
    "endpoint": {
        "params": {
            "per_page": 100,
        },
    },
}

Resources

Define individual API endpoints to load:

"resources": [
    {
        "name": "users",
        "endpoint": {
            "path": "users",
        },
    },
]
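
Put together, the three sections form one dictionary. The sketch below assembles them in a hypothetical helper, `build_config` (not part of dlt), so the token is passed in rather than hardcoded. The base URL, resource name, and placeholder token are illustrative assumptions taken from the fragments above.

```python
def build_config(api_token: str) -> dict:
    """Assemble client, resource_defaults, and resources into one config dict."""
    return {
        # Shared connection settings: base URL, auth, default paginator
        "client": {
            "base_url": "https://api.example.com/",
            "auth": {"type": "bearer", "token": api_token},
            "paginator": "json_link",
        },
        # Settings applied to every resource unless a resource overrides them
        "resource_defaults": {
            "primary_key": "id",
            "write_disposition": "merge",
            "endpoint": {"params": {"per_page": 100}},
        },
        # One entry per API endpoint to load
        "resources": [
            {"name": "users", "endpoint": {"path": "users"}},
        ],
    }


# Placeholder token for illustration only; in a pipeline, pass dlt.secrets["api_token"]
# and hand the resulting dictionary to rest_api_source().
config = build_config(api_token="placeholder-token")
print(sorted(config))
```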

Authentication

The REST API source supports multiple authentication methods:

Bearer Token

"auth": {
    "type": "bearer",
    "token": dlt.secrets["github_token"],
}

API Key

"auth": {
    "type": "api_key",
    "api_key": dlt.secrets["api_key"],
    "location": "header",  # or "query"
    "name": "X-API-Key",
}

HTTP Basic

"auth": {
    "type": "http_basic",
    "username": dlt.secrets["username"],
    "password": dlt.secrets["password"],
}

OAuth 2.0 Client Credentials

"auth": {
    "type": "oauth2_client_credentials",
    "client_id": dlt.secrets["client_id"],
    "client_secret": dlt.secrets["client_secret"],
    "access_token_url": "https://api.example.com/oauth/token",
}
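
To make the difference between these methods concrete, here is a conceptual sketch of what each configuration typically adds to an outgoing request. The helper `auth_to_request_parts` is hypothetical and is not how dlt implements authentication. OAuth 2.0 is left out because it requires a token exchange with the server first.

```python
import base64


def auth_to_request_parts(auth: dict) -> tuple:
    """Return (headers, query_params) that an auth config would typically add."""
    headers, params = {}, {}
    kind = auth.get("type")
    if kind == "bearer":
        headers["Authorization"] = f"Bearer {auth['token']}"
    elif kind == "api_key":
        # The key goes into a header or a query parameter, depending on "location"
        target = headers if auth.get("location", "header") == "header" else params
        target[auth["name"]] = auth["api_key"]
    elif kind == "http_basic":
        raw = f"{auth['username']}:{auth['password']}".encode()
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode()
    else:
        raise ValueError(f"auth type not covered by this sketch: {kind!r}")
    return headers, params


bearer_headers, _ = auth_to_request_parts({"type": "bearer", "token": "abc"})
_, key_params = auth_to_request_parts(
    {"type": "api_key", "api_key": "k1", "location": "query", "name": "api_key"}
)
basic_headers, _ = auth_to_request_parts(
    {"type": "http_basic", "username": "user", "password": "pass"}
)
print(bearer_headers, key_params, basic_headers)
```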

Pagination

Automatic pagination is configured at the client or endpoint level:

# Follows 'next' links in JSON responses
"paginator": "json_link"

# Custom JSON path
"paginator": {
    "type": "json_link",
    "next_url_path": "pagination.next",
}
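
The idea behind a JSON-link paginator can be sketched without any network access. The example below simulates an API with an in-memory dictionary and follows the URL found at `pagination.next` until it is empty. `FAKE_PAGES`, `get_by_path`, and `paginate` are hypothetical names used for illustration; the real paginator works on HTTP responses and supports full JSONPath.

```python
from typing import Iterator

# Fake API: URL -> JSON body. Stands in for real HTTP responses.
FAKE_PAGES = {
    "/items?page=1": {"data": [1, 2], "pagination": {"next": "/items?page=2"}},
    "/items?page=2": {"data": [3, 4], "pagination": {"next": "/items?page=3"}},
    "/items?page=3": {"data": [5], "pagination": {"next": None}},
}


def get_by_path(body: dict, dotted_path: str):
    """Walk a dotted path such as 'pagination.next'; return None if a key is missing."""
    current = body
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def paginate(first_url: str, next_url_path: str) -> Iterator[list]:
    """Yield the data of each page, following the next link until none is returned."""
    url = first_url
    while url:
        body = FAKE_PAGES[url]
        yield body["data"]
        url = get_by_path(body, next_url_path)


pages = list(paginate("/items?page=1", "pagination.next"))
print(pages)  # [[1, 2], [3, 4], [5]]
```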

Incremental Loading

Load only new or updated records using incremental cursors:

github_source = rest_api_source({
    "client": {
        "base_url": "https://api.github.com/repos/dlt-hub/dlt/",
        "auth": {
            "token": dlt.secrets["github_token"],
        },
    },
    "resources": [
        {
            "name": "issues",
            "endpoint": {
                "path": "issues",
                "params": {
                    "state": "all",
                    "sort": "updated",
                    "direction": "desc",
                    "since": {
                        "type": "incremental",
                        "cursor_path": "updated_at",
                        "initial_value": "2024-01-01T00:00:00Z",
                    },
                },
            },
        },
    ],
})

The cursor_path is a JSONPath expression that selects the cursor value from each record. dlt stores the highest value seen and, on subsequent runs, passes it as the since parameter so that only records updated after that point are requested.
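
The cursor logic can be sketched in plain Python. This is a simplified illustration, not dlt's implementation: the hypothetical helper `load_incrementally` filters records on the client side and returns the new cursor state, whereas a real run also sends the cursor to the API and handles deduplication. It relies on ISO 8601 timestamps in the same format and time zone, which compare correctly as strings.

```python
def load_incrementally(records, cursor_field, last_value):
    """Return (new_records, new_last_value), keeping only records past the cursor."""
    new_records = [r for r in records if r[cursor_field] > last_value]
    # If nothing new arrived, the stored cursor stays where it was
    new_last = max((r[cursor_field] for r in new_records), default=last_value)
    return new_records, new_last


issues = [
    {"id": 1, "updated_at": "2023-12-31T10:00:00Z"},
    {"id": 2, "updated_at": "2024-02-01T08:30:00Z"},
    {"id": 3, "updated_at": "2024-03-15T12:00:00Z"},
]

# First run starts from the configured initial_value
first_run, state = load_incrementally(issues, "updated_at", "2024-01-01T00:00:00Z")

# A new issue appears before the second run, which starts from the stored state
issues.append({"id": 4, "updated_at": "2024-04-01T00:00:00Z"})
second_run, state = load_incrementally(issues, "updated_at", state)

print([r["id"] for r in first_run], [r["id"] for r in second_run], state)
```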

Dependent Resources

Create resources that depend on data from other resources:

{
    "resources": [
        {
            "name": "users",
            "endpoint": {
                "path": "users",
            },
        },
        {
            "name": "user_posts",
            "endpoint": {
                # Use the id of each record from the 'users' resource
                "path": "users/{resources.users.id}/posts",
            },
        },
    ],
}

The {resources.users.id} placeholder is resolved for each user, creating one request per user.
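
The fan-out from parent rows to child requests can be sketched as follows. The helper `expand_paths` is hypothetical and only illustrates the idea. The placeholder is written here as `{resources.users.id}`, the form dlt documents for referencing a field of a parent resource; the helper reads only the final dotted segment as the field name.

```python
import re

# Matches a placeholder such as {resources.users.id}
PLACEHOLDER = re.compile(r"\{([\w.]+)\}")


def expand_paths(path_template: str, parent_rows: list) -> list:
    """Build one child path per parent row by filling the placeholder from that row."""

    def fill(row: dict) -> str:
        # The field name is the last dotted segment, e.g. "id" in "resources.users.id"
        return PLACEHOLDER.sub(
            lambda match: str(row[match.group(1).split(".")[-1]]), path_template
        )

    return [fill(row) for row in parent_rows]


users = [{"id": 1}, {"id": 2}, {"id": 3}]
paths = expand_paths("users/{resources.users.id}/posts", users)
print(paths)  # ['users/1/posts', 'users/2/posts', 'users/3/posts']
```

Because every parent record triggers its own request, the number of child requests grows with the size of the parent resource.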

Complete Example: GitHub API

Here’s a comprehensive example loading GitHub repository data:

import dlt
from dlt.sources.rest_api import rest_api_source

# Configure the source
github_source = rest_api_source({
    "client": {
        "base_url": "https://api.github.com/repos/dlt-hub/dlt/",
        "auth": {
            "token": dlt.secrets["github_token"],
        },
    },
    "resource_defaults": {
        "primary_key": "id",
        "write_disposition": "merge",
        "endpoint": {
            "params": {
                "per_page": 100,
            },
        },
    },
    "resources": [
        {
            "name": "issues",
            "endpoint": {
                "path": "issues",
                "params": {
                    "state": "all",
                    "sort": "updated",
                    "direction": "desc",
                    "since": {
                        "type": "incremental",
                        "cursor_path": "updated_at",
                        "initial_value": "2024-01-01T00:00:00Z",
                    },
                },
            },
        },
        {
            "name": "issue_comments",
            "endpoint": {
                "path": "issues/{resources.issues.number}/comments",
                "params": {
                    "per_page": 100,
                },
            },
            "include_from_parent": ["id"],  # Include issue ID in comments
        },
        {
            "name": "pulls",
            "endpoint": {
                "path": "pulls",
                "params": {
                    "state": "all",
                },
            },
        },
    ],
})

# Create and run pipeline
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data"
)

load_info = pipeline.run(github_source)
print(load_info)

Advanced Features

Data Selector

Extract data from nested JSON responses:

"endpoint": {
    "path": "search/repositories",
    "params": {"q": "dlt"},
    "data_selector": "items",  # Extract from response["items"]
}
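
As a rough illustration of what a selector does, the hypothetical helper below walks a dotted path through a parsed response and returns the list found there. The actual data_selector accepts JSONPath; this sketch covers only plain dotted keys, and the sample response is made up.

```python
def select_data(response: dict, selector: str):
    """Return the value found by following each dotted key of the selector in turn."""
    current = response
    for key in selector.split("."):
        current = current[key]
    return current


response = {"total_count": 2, "items": [{"name": "dlt"}, {"name": "dlt-demo"}]}
rows = select_data(response, "items")
print([row["name"] for row in rows])  # ['dlt', 'dlt-demo']
```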

Custom Headers

Add custom headers to requests:

"endpoint": {
    "path": "data",
    "headers": {
        "X-Custom-Header": "value",
        "Accept": "application/vnd.github.v3+json",
    },
}

POST Requests

Make POST requests with JSON body:

"endpoint": {
    "path": "search",
    "method": "POST",
    "json": {
        "query": "dlt",
        "filters": {"language": "python"},
    },
}

Processing Steps

Transform data before loading:

{
    "name": "users",
    "endpoint": {"path": "users"},
    "processing_steps": [
        {
            "filter": lambda item: item["active"] is True,
        },
        {
            "map": lambda item: {**item, "processed": True},
        },
    ],
}
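
The effect of such a step list can be sketched as below, assuming the steps run in the order given and a failed filter drops the item. `apply_steps` is a hypothetical helper written for illustration, not a dlt function, and the sample users are made up.

```python
def apply_steps(items, steps):
    """Run each item through the steps in order; a failed filter drops the item."""
    for item in items:
        keep = True
        for step in steps:
            if "filter" in step:
                if not step["filter"](item):
                    keep = False
                    break
            elif "map" in step:
                item = step["map"](item)
        if keep:
            yield item


steps = [
    {"filter": lambda item: item["active"] is True},
    {"map": lambda item: {**item, "processed": True}},
]
users = [
    {"id": 1, "active": True},
    {"id": 2, "active": False},
    {"id": 3, "active": True},
]
processed = list(apply_steps(users, steps))
print(processed)
```

Filtering before mapping means inactive users are never passed to the map function.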

Using the REST Client Directly

For more control, use the RESTClient helper:

import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import JSONLinkPaginator

@dlt.resource
def fetch_api_data():
    # Client with a shared base URL and bearer-token authentication
    client = RESTClient(
        base_url="https://api.example.com",
        auth=BearerTokenAuth(token=dlt.secrets["api_token"]),
    )

    # Follow the URL found under "next_page" in each response until it is absent
    for page in client.paginate(
        "/data",
        paginator=JSONLinkPaginator(next_url_path="next_page"),
    ):
        yield page

Best Practices

Use Incremental Loading

Always configure incremental loading for large datasets to avoid reprocessing all data on each run.

Set Appropriate Limits

Configure per_page or limit parameters to balance between API rate limits and performance.
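
A quick calculation shows the trade-off. The figures are illustrative assumptions (10,000 records, page sizes of 30 and 100), and `requests_needed` is a hypothetical helper.

```python
import math


def requests_needed(total_records: int, per_page: int) -> int:
    """Number of paginated requests required to fetch all records."""
    return math.ceil(total_records / per_page)


for per_page in (30, 100):
    print(per_page, requests_needed(10_000, per_page))
# 30 334
# 100 100
```

Larger pages mean fewer requests counted against a rate limit, at the cost of bigger individual responses.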

Handle Rate Limits

Implement retry logic and respect API rate limits to ensure reliable data extraction.
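
One common way to do this is exponential backoff that honors a server-provided wait time when one is given. The sketch below is generic Python rather than a dlt API: `RateLimited` and `call_with_retry` are hypothetical names, and the fake fetch function stands in for a real HTTP call that raises on a 429 response.

```python
import time


class RateLimited(Exception):
    """Raised by a fetch function when the API answers with a rate-limit error."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait and retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # Prefer the server's hint; otherwise wait 1s, 2s, 4s, ...
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)


# Demo: the fake fetch fails twice, then succeeds. Waits are recorded, not slept.
failures = [RateLimited(), RateLimited(retry_after=3)]
waits = []


def fake_fetch():
    if failures:
        raise failures.pop(0)
    return "ok"


result = call_with_retry(fake_fetch, sleep=waits.append)
print(result, waits)  # ok [1.0, 3]
```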

Secure Credentials

Always use dlt.secrets for API keys and tokens. Never hardcode credentials in your source code.

Common Issues

Authentication errors

Ensure your token or API key is correctly set in secrets.toml:

[sources.my_api]
api_token = "your_secret_token_here"

Reference it in your source:

"auth": {
    "token": dlt.secrets["sources.my_api.api_token"],
}

Pagination not working

Check that the paginator type matches your API’s pagination style. Inspect API responses to identify the correct pagination pattern:

import requests

# Debug: Print response to see pagination structure
response = requests.get("https://api.example.com/data")
print(response.json())

Missing data in responses

Use data_selector to extract data from nested JSON:

"endpoint": {
    "path": "data",
    "data_selector": "results.items",  # Navigate nested structure
}
