extract-file

The extract-file command (the default when struktur is run without a command) reads an input file, raw text, or an artifact and uses an AI model to extract structured data that conforms to a JSON schema.

Usage

struktur extract-file [options]
struktur [options]  # extract-file is the default command

Input options

You must specify exactly one input source:

--input

string

Path to an input file to parse and extract from

--text

string

Raw text input to extract from

--stdin

boolean

Read raw text from stdin (auto-detected when piped)

--artifact

string

Path to an artifact JSON file, or - to read the artifact from stdin

--artifact-json

string

Artifact JSON as an inline string

Schema options

You must specify a JSON schema for extraction, using either --schema or --schema-json:

--schema

string

required

Path to a JSON schema file, or an HTTP(S) URL. Remote schemas are fetched with an Accept: application/schema+json header.

--schema-json

string

required

JSON schema as an inline string
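As a rough illustration of the two schema options, the sketch below writes a schema to a file and then reuses the same content as an inline string. The file names schema.json and document.txt are placeholders, and treating the two invocations as equivalent is an assumption, not something this page states.

```shell
# Write the schema to a file; the fields mirror the raw-text example on this page.
cat > schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "price": { "type": "number" },
    "shipping_days": { "type": "integer" }
  },
  "required": ["price"]
}
EOF

# The same schema as an inline string, suitable for --schema-json.
inline_schema=$(cat schema.json)

# Assumed to be equivalent; run only when struktur is installed.
# document.txt is a placeholder input file.
if command -v struktur > /dev/null 2>&1; then
  struktur --input document.txt --schema schema.json
  struktur --input document.txt --schema-json "$inline_schema"
fi
```

Keeping the schema in a file avoids shell-quoting problems with long inline JSON; the inline form is handy for one-off commands.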

Model and strategy options

--model

string

Model identifier in the format provider/model (e.g., openai/gpt-4o, anthropic/claude-4-sonnet). If not specified, struktur uses the configured default model, or else the cheapest model from the first configured provider.

--strategy

string

default:"simple"

Extraction strategy to use. Available strategies:

simple - Single-pass extraction (default)
parallel - Parallel chunked extraction with merge
sequential - Sequential chunked extraction
parallelAutoMerge - Parallel chunks with automatic deduplication
sequentialAutoMerge - Sequential chunks with automatic deduplication
doublePass - Two-pass extraction with merge
doublePassAutoMerge - Two-pass extraction with automatic deduplication

--chunk-size

number

default:"10000"

Token budget per batch. Applies only to strategies that chunk the input.
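One way to choose between the single-pass and chunked strategies is to estimate the input size first. The sketch below is only a heuristic: pick_strategy is a hypothetical helper, and the 4-bytes-per-token ratio is a rule of thumb for plain English text, not how struktur counts tokens. It is not meaningful for binary formats such as PDF.

```shell
# Hypothetical helper: choose a strategy from a rough token estimate.
# Assumes ~4 bytes per token, a rule of thumb rather than struktur's own count.
pick_strategy() {
  file=$1
  chunk_size=${2:-10000}              # same default as --chunk-size
  bytes=$(wc -c < "$file")
  est_tokens=$((bytes / 4))
  if [ "$est_tokens" -le "$chunk_size" ]; then
    echo simple                       # fits in one batch
  else
    echo parallelAutoMerge            # needs chunking
  fi
}

# Demo with a tiny sample file.
printf 'A short note.\n' > sample.txt
pick_strategy sample.txt 10000        # prints: simple

# Possible use with the CLI (not run here):
#   struktur --input sample.txt --schema schema.json \
#     --strategy "$(pick_strategy sample.txt 10000)" --chunk-size 10000
```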

Output options

--output

string

default:"-"

Output path for extracted JSON. Use - for stdout (default).

Examples

Extract from a text file

struktur --input document.txt \
  --schema schema.json \
  --model openai/gpt-4o \
  --output results.json

Extract from stdin with inline schema

cat document.txt | struktur \
  --schema-json '{"type":"object","properties":{"title":{"type":"string"}}}' \
  --model anthropic/claude-4-sonnet

Output:

{
  "title": "Example Document Title"
}

Use a chunking strategy for large documents

struktur --input large-document.pdf \
  --schema schema.json \
  --model openai/gpt-4o \
  --strategy parallelAutoMerge \
  --chunk-size 15000

Extract from raw text

struktur --text 'The product costs $99 and ships in 3 days' \
  --schema-json '{"type":"object","properties":{"price":{"type":"number"},"shipping_days":{"type":"integer"}}}'

Output:

{
  "price": 99,
  "shipping_days": 3
}
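Text passed to --text goes through the shell first, so quoting matters when it contains a dollar sign. The short demonstration below uses only standard shell behavior: inside double quotes, $99 is read as the ninth positional parameter followed by a literal 9, while single quotes pass the text through unchanged.

```shell
set --   # clear positional parameters so the demo is deterministic

# Double quotes: the shell expands $9 (empty here) and keeps the trailing 9.
double_quoted="The product costs $99 and ships in 3 days"
# Single quotes: the text is passed through unchanged.
single_quoted='The product costs $99 and ships in 3 days'

printf '%s\n' "$double_quoted"   # The product costs 9 and ships in 3 days
printf '%s\n' "$single_quoted"   # The product costs $99 and ships in 3 days
```

With the double-quoted form, the model would see a price of 9 rather than 99.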

Fetch schema from URL

struktur --input invoice.pdf \
  --schema https://example.com/schemas/invoice.json \
  --model google/gemini-2.0-flash-exp

Progress output

When stderr is a TTY, extraction progress is displayed:

◈ ▰▰▰▰▰▰▰▰▰▰▱▱▱▱▱▱▱▱▱▱ 50% | processing 5/10

Progress bars are automatically hidden when piping or redirecting output.

Error handling

Schema validation errors

If the extracted data fails schema validation, the error details are displayed:

struktur --input data.txt --schema schema.json

Output:

Schema validation failed:
[
  {
    "instancePath": "/price",
    "schemaPath": "#/properties/price/type",
    "keyword": "type",
    "message": "must be number"
  }
]
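For quick triage, the error list can be condensed to one line per failure. The sketch below works on a saved copy of the output shown above. The file name validation-error.txt and the summarize_errors helper are hypothetical, and this page does not say which stream the error is printed to, so how the text gets captured is left open. The line-based parsing assumes the pretty-printed layout above; a JSON-aware tool would be more robust.

```shell
# Sample error text, copied from the validation failure shown above.
cat > validation-error.txt <<'EOF'
Schema validation failed:
[
  {
    "instancePath": "/price",
    "schemaPath": "#/properties/price/type",
    "keyword": "type",
    "message": "must be number"
  }
]
EOF

# Hypothetical helper: print "path message" for each error by pulling out
# the instancePath and message fields and joining them pairwise.
summarize_errors() {
  sed -n \
    -e 's/^ *"instancePath": "\(.*\)",$/\1/p' \
    -e 's/^ *"message": "\(.*\)"$/\1/p' "$1" | paste -d ' ' - -
}

summarize_errors validation-error.txt   # /price must be number
```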

Environment variables

Provider API keys and two general settings can be set via environment variables:

OPENAI_API_KEY - OpenAI API key
ANTHROPIC_API_KEY - Anthropic API key
GOOGLE_GENERATIVE_AI_API_KEY - Google Generative AI API key
OPENCODE_API_KEY - OpenCode API key
OPENROUTER_API_KEY - OpenRouter API key
STRUKTUR_CONFIG_DIR - Override config directory (default: ~/.config/struktur)
AI_SDK_LOG_WARNINGS - Enable AI SDK warnings (default: false)

Tokens stored via struktur auth set take precedence over environment variables.
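A pre-flight check can report whether the key variable for a given model is set before running an extraction. In the sketch below, key_var_for_model and check_key are hypothetical helpers, and the mapping from provider prefix to variable name is an assumption based on the names listed above (the opencode and openrouter prefixes in particular are guesses). Because stored tokens take precedence, an unset variable does not necessarily mean the extraction will fail.

```shell
# Hypothetical mapping from the provider prefix of --model to its key variable.
key_var_for_model() {
  case "${1%%/*}" in
    openai)     echo OPENAI_API_KEY ;;
    anthropic)  echo ANTHROPIC_API_KEY ;;
    google)     echo GOOGLE_GENERATIVE_AI_API_KEY ;;
    opencode)   echo OPENCODE_API_KEY ;;
    openrouter) echo OPENROUTER_API_KEY ;;
    *)          return 1 ;;
  esac
}

# Report whether the variable for the given model identifier is set.
check_key() {
  var=$(key_var_for_model "$1") || { echo "unknown provider in $1" >&2; return 2; }
  # Indirect lookup; $var can only be one of the fixed names above.
  eval "value=\${$var-}"
  if [ -n "$value" ]; then
    echo "$var is set"
  else
    echo "$var is not set"
    return 1
  fi
}

check_key openai/gpt-4o || true
```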
