The extract() function transforms pre-parsed artifacts into validated JSON using AI models. This guide covers how to run extractions and when to use each strategy.
Basic extraction
Every extraction requires three components: artifacts, a JSON schema, and a strategy.
Extraction strategies
Strukt provides seven strategies optimized for different scenarios.
Simple strategy
Best for small inputs that fit within a single model context window.
- Small documents (< 10,000 tokens)
- Single-page PDFs or short articles
- Fast responses are critical
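As a rough guide, a quick token estimate can tell you whether an input fits the simple strategy's budget. The sketch below assumes the common ~4-characters-per-token heuristic; `estimateTokens` and `fitsSimpleStrategy` are illustrative helpers, not part of the Strukt API.

```typescript
// Rough token estimate (~4 characters per token is a common heuristic);
// the real count depends on the model's tokenizer, so treat this as an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Check the input against the guide's 10,000-token guideline for the simple strategy.
function fitsSimpleStrategy(text: string, budget = 10_000): boolean {
  return estimateTokens(text) < budget;
}
```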
Parallel strategy
Processes large inputs by splitting them into concurrent batches, then merges results using an LLM.
- Large documents that exceed context limits
- Speed is important (processes batches concurrently)
- Willing to use a merge model to consolidate results
- chunkSize: Token budget per batch (default: 10,000)
- concurrency: Max parallel batches (default: all)
- maxImages: Optional image limit per batch
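The batching step can be sketched as follows. `splitIntoBatches` and the ~4-characters-per-token estimate are illustrative assumptions; Strukt's actual chunking may differ.

```typescript
// Assumed heuristic: ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Group text segments into batches whose estimated token count stays
// within the chunkSize budget, mirroring the option described above.
function splitIntoBatches(segments: string[], chunkSize = 10_000): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const seg of segments) {
    const cost = estimateTokens(seg);
    if (current.length > 0 && used + cost > chunkSize) {
      batches.push(current); // budget exceeded: start a new batch
      current = [];
      used = 0;
    }
    current.push(seg);
    used += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

In the real strategy, each batch would then be sent to the model concurrently (capped by the `concurrency` option) and the per-batch results consolidated by the merge model.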
Sequential strategy
Processes batches in order, passing previous results as context to each batch.
- Documents where order matters (narratives, timelines)
- Extracting data that builds on previous sections
- Need to maintain context across chunks
- chunkSize: Token budget per batch
- maxImages: Optional image limit per batch
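The context carryover can be sketched as a fold over the batches. `Extracted`, `processBatch`, and `runSequential` are hypothetical stand-ins; in practice the model call would receive the accumulated result as part of its prompt.

```typescript
// Hypothetical result shape for illustration.
type Extracted = { events: string[] };

// Stand-in for a model call: a real implementation would include
// `context` in the prompt; here we just append to show the carryover shape.
function processBatch(batch: string, context: Extracted): Extracted {
  return { events: [...context.events, batch.toUpperCase()] };
}

// Each batch sees the result accumulated from all earlier batches.
function runSequential(batches: string[]): Extracted {
  let result: Extracted = { events: [] };
  for (const batch of batches) {
    result = processBatch(batch, result); // previous results become context
  }
  return result;
}
```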
Parallel auto-merge strategy
Like parallel, but uses schema-aware merging and deduplication instead of an LLM merge.
- Extracting lists or arrays with potential duplicates
- Need precise control over merge behavior
- Schema has clear merge/dedupe semantics
- Processes batches in parallel
- Merges results using schema rules (concatenate arrays, pick first for primitives)
- Deduplicates using CRC32 hashing
- LLM dedupe pass for semantic duplicates
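The merge-and-dedupe rules above can be sketched like this. `mergeResults` is illustrative only: it uses JSON serialization as a stand-in for the CRC32 hashing the docs describe, and it omits the LLM pass for semantic duplicates.

```typescript
// Schema-aware merge sketch: concatenate arrays, pick the first value for
// primitives, then drop duplicate array entries by hashed identity.
function mergeResults(results: Record<string, unknown>[]): Record<string, unknown> {
  const merged: Record<string, unknown> = {};
  for (const result of results) {
    for (const [key, value] of Object.entries(result)) {
      if (Array.isArray(value)) {
        const existing = (merged[key] as unknown[]) ?? [];
        merged[key] = existing.concat(value); // concatenate arrays
      } else if (merged[key] === undefined) {
        merged[key] = value; // pick-first semantics for primitives
      }
    }
  }
  // Deduplicate array entries (JSON string as a stand-in for a CRC32 hash).
  for (const [key, value] of Object.entries(merged)) {
    if (Array.isArray(value)) {
      const seen = new Set<string>();
      merged[key] = value.filter((item) => {
        const hash = JSON.stringify(item);
        if (seen.has(hash)) return false;
        seen.add(hash);
        return true;
      });
    }
  }
  return merged;
}
```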
Sequential auto-merge strategy
Sequential processing with schema-aware merge and dedupe.
- Ordered documents with potential duplicates
- Need both context carryover and deduplication
Double-pass strategy
Runs parallel extraction, then refines with a sequential pass.
- Need highest accuracy
- Large documents with complex structure
- Can afford two passes over the data
Double-pass auto-merge strategy
Combines parallel auto-merge with sequential refinement.
- Maximum accuracy with deduplication
- Complex documents with duplicates
- Quality is more important than speed
Custom output instructions
All strategies support custom output instructions to guide the model.
Strategy comparison
| Strategy | Speed | Accuracy | Context | Deduplication | Use case |
|---|---|---|---|---|---|
| Simple | Fastest | Good | Single pass | No | Small documents |
| Parallel | Fast | Good | Independent chunks | No | Large documents, speed priority |
| Sequential | Medium | Better | Cumulative | No | Ordered content |
| Parallel auto-merge | Fast | Good | Independent | Yes | Lists with duplicates |
| Sequential auto-merge | Medium | Better | Cumulative | Yes | Ordered lists |
| Double-pass | Slow | Best | Two passes | No | Complex documents |
| Double-pass auto-merge | Slowest | Best | Two passes | Yes | Complex lists |
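The table can also be read as a decision procedure. The helper below is purely illustrative; the requirement flags and strategy names mirror the table, not a Strukt API.

```typescript
// Requirement flags derived from the comparison table above.
interface Requirements {
  fitsOneContext: boolean; // small enough for a single model call
  orderMatters: boolean;   // content builds on earlier sections
  hasDuplicates: boolean;  // lists that need deduplication
  maxAccuracy: boolean;    // worth two passes over the data
}

// Map requirements to a strategy name, following the table's use cases.
function chooseStrategy(req: Requirements): string {
  if (req.fitsOneContext) return "simple";
  if (req.maxAccuracy) return req.hasDuplicates ? "double-pass-auto-merge" : "double-pass";
  if (req.orderMatters) return req.hasDuplicates ? "sequential-auto-merge" : "sequential";
  return req.hasDuplicates ? "parallel-auto-merge" : "parallel";
}
```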
Working with results
The extract function returns an ExtractionResult with typed data and token usage:
- data: Extracted data matching your schema type
- usage: Token usage statistics
  - inputTokens: Tokens sent to the model
  - outputTokens: Tokens generated by the model
  - totalTokens: Sum of input and output
- error: Error object if extraction failed
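A minimal sketch of consuming that result, assuming the field names above map directly onto a TypeScript interface (the exact types are an assumption):

```typescript
// Illustrative shape of the result described above; field names follow
// the docs, but the precise TypeScript types are assumed.
interface ExtractionResult<T> {
  data?: T;
  usage: { inputTokens: number; outputTokens: number; totalTokens: number };
  error?: Error;
}

// Check for failure before touching data, then report token usage.
function summarize<T>(result: ExtractionResult<T>): string {
  if (result.error) return `failed: ${result.error.message}`;
  return `ok (${result.usage.totalTokens} tokens)`;
}
```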