The `parallelAutoMerge()` strategy processes document chunks in parallel, automatically merges the results using schema-aware logic, removes exact duplicates, and then uses an LLM to identify and remove semantic duplicates.
Usage
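The original usage snippet is not shown in this excerpt. As a rough sketch only — the import, the exported function name, and every option name below are assumptions inferred from the configuration descriptions, not the library's confirmed API — creating the strategy might look like:

```typescript
// Hypothetical option names throughout — check the package's own types.
import { openai } from '@ai-sdk/openai';

const strategy = parallelAutoMerge({
  model: openai('gpt-4o'),            // extraction model (AI SDK)
  maxTokensPerChunk: 8_000,           // chunk size limit
  concurrency: 4,                     // defaults to all chunks at once
  maxImagesPerChunk: 2,               // caps vision API cost per chunk
  instructions: 'Prefer ISO 8601 dates.',
  dedupeModel: openai('gpt-4o-mini'), // defaults to the extraction model
  strict: false,                      // structured-output validation
});
```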
Configuration
The AI SDK language model to use for extracting from each chunk.
Maximum tokens per chunk. Documents are split into batches that fit within this limit.
Maximum number of concurrent extraction tasks. Defaults to processing all chunks in parallel.
Maximum number of images per chunk. Useful for controlling vision API costs.
Additional instructions to guide the model’s output format or behavior.
The AI SDK language model to use for semantic deduplication. Defaults to the extraction model.
Custom retry executor function for extraction. Defaults to `runWithRetries`.
Custom retry executor function for deduplication. Defaults to `runWithRetries`.
Enable strict mode for structured output validation. Defaults to `false`.
When to use
- You have large documents with potential duplicate data
- You want fast parallel processing
- You don’t want to write custom merge logic
- You want automatic deduplication
How it works
- Parallel extraction: Processes all chunks concurrently
- Smart merge: Uses `SmartDataMerger` with schema-aware logic to combine results
- Hash-based deduplication: Removes exact duplicates using hash comparison
- LLM deduplication: Uses an LLM to identify semantic duplicates and returns paths to remove
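The parallel-extraction, merge, and hash-deduplication steps above can be sketched as follows. This is a simplified stand-in, not the library's actual `SmartDataMerger`; it assumes each chunk's extraction yields an array of objects and that "merge" means concatenation:

```typescript
type Item = Record<string, unknown>;

// Stable stringify so key order doesn't change an item's hash key.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return '[' + value.map(canonical).join(',') + ']';
  if (value && typeof value === 'object') {
    const entries = Object.keys(value as Item)
      .sort()
      .map((k) => JSON.stringify(k) + ':' + canonical((value as Item)[k]));
    return '{' + entries.join(',') + '}';
  }
  return JSON.stringify(value);
}

// Steps 1-2: extract from every chunk concurrently, then merge the arrays.
async function extractAndMerge(
  chunks: string[],
  extract: (chunk: string) => Promise<Item[]>,
): Promise<Item[]> {
  const perChunk = await Promise.all(chunks.map(extract)); // parallel
  return perChunk.flat();                                  // naive merge
}

// Step 3: remove exact duplicates via hash (canonical-form) comparison.
function dedupeExact(items: Item[]): Item[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const key = canonical(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Step 4, the LLM pass, would then see only the hash-deduplicated items and return paths of semantic duplicates to drop.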
Trade-offs
Advantages:
- Fast parallel processing
- No custom merge logic needed
- Automatic duplicate removal
- Schema-aware merging
Disadvantages:
- Higher token usage (extractions + dedupe)
- Dedupe quality depends on model capability
- Less control over merge strategy
Performance characteristics
The strategy estimates `batches.length + 3` steps:
- Prepare
- Extract from batch 1 through N (parallel)
- Dedupe
- Complete
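The step count follows directly from that list: one prepare step, one extraction step per batch, one dedupe step, and one completion step. A minimal sketch (the helper name is mine, not the library's):

```typescript
// batchCount extractions run in parallel but each still counts as a step,
// plus prepare, dedupe, and complete — hence batches.length + 3.
function estimatedSteps(batchCount: number): number {
  return batchCount + 3;
}
```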