The sequentialAutoMerge() strategy processes document chunks one at a time, automatically merging results with schema-aware logic as it goes. It removes exact duplicates by hash comparison, then uses an LLM to identify and remove semantic duplicates.

Usage

import { extract, sequentialAutoMerge } from 'struktur';
import { openai } from '@ai-sdk/openai';

const result = await extract({
  artifacts,
  schema,
  strategy: sequentialAutoMerge({
    model: openai('gpt-4o'),
    chunkSize: 100000,
  }),
});

Configuration

model (LanguageModel, required)
  The AI SDK language model to use for extraction.

chunkSize (number, required)
  Maximum tokens per chunk. Documents are split into batches that fit within this limit.

maxImages (number)
  Maximum number of images per chunk. Useful for controlling vision API costs.

outputInstructions (string)
  Additional instructions to guide the model’s output format or behavior.

dedupeModel (LanguageModel)
  The AI SDK language model to use for semantic deduplication. Defaults to the extraction model.

execute (function)
  Custom retry executor function for extraction. Defaults to runWithRetries.

dedupeExecute (function)
  Custom retry executor function for deduplication. Defaults to runWithRetries.

strict (boolean)
  Enable strict mode for structured output validation. Defaults to false.
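For reference, the fields above can be summarized as a single TypeScript shape. This is a sketch inferred from the field list, not struktur's published type; `LanguageModel` and `Executor` here are stand-ins for the AI SDK's types, and `withDefaults` is a hypothetical helper illustrating the documented fallbacks.

```typescript
type LanguageModel = unknown; // stand-in for the AI SDK's LanguageModel type
type Executor = (fn: () => Promise<unknown>) => Promise<unknown>;

interface SequentialAutoMergeOptions {
  model: LanguageModel;          // required: extraction model
  chunkSize: number;             // required: max tokens per chunk
  maxImages?: number;            // cap images per chunk (vision cost control)
  outputInstructions?: string;   // extra guidance for output format/behavior
  dedupeModel?: LanguageModel;   // defaults to the extraction model
  execute?: Executor;            // defaults to runWithRetries
  dedupeExecute?: Executor;      // defaults to runWithRetries
  strict?: boolean;              // defaults to false
}

// Illustrative defaulting, mirroring the fallbacks documented above.
function withDefaults(opts: SequentialAutoMergeOptions) {
  return { strict: false, dedupeModel: opts.model, ...opts };
}
```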

When to use

  • You have large documents with potential duplicate data
  • Sequential processing is important for your use case
  • You don’t want to write custom merge logic
  • You want automatic deduplication

How it works

  1. Sequential extraction: Processes chunks one at a time
  2. Incremental merge: Uses SmartDataMerger to combine each result as it arrives
  3. Hash-based deduplication: Removes exact duplicates using hash comparison
  4. LLM deduplication: Uses an LLM to identify semantic duplicates and returns paths to remove
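Step 3 can be sketched generically. This is not struktur's actual hash routine, just a minimal illustration of exact-duplicate removal by comparing a serialized form of each item:

```typescript
// Drop items whose serialized form has been seen before. struktur's real
// implementation may hash differently (e.g. canonicalizing key order).
function dedupeExact<T>(items: T[]): T[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const key = JSON.stringify(item); // assumes consistent key order across items
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

After this pass, only semantic near-duplicates remain for the LLM deduplication step to catch.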

Trade-offs

Advantages:
  • No custom merge logic needed
  • Automatic duplicate removal
  • Schema-aware merging
  • Lower peak memory usage than parallel strategies
Limitations:
  • Slower than parallel strategies (no concurrency)
  • Higher token usage than basic sequential (dedupe step)
  • Less control over merge strategy

Performance characteristics

The strategy estimates batches.length + 3 steps:
  1. Prepare
  2. Extract from batch 1 through N (sequential)
  3. Dedupe
  4. Complete
Processing is sequential, so total time = sum of all chunk processing times + dedupe time.
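The step estimate and the sequential timing model above can be expressed directly (illustrative helpers, not part of struktur's API):

```typescript
// One prepare step + one extract step per batch + dedupe + complete.
function estimateSteps(batchCount: number): number {
  return batchCount + 3;
}

// Sequential processing: total time is the sum of per-chunk times plus dedupe.
function estimateTotalMs(chunkMs: number[], dedupeMs: number): number {
  return chunkMs.reduce((sum, ms) => sum + ms, 0) + dedupeMs;
}
```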

Example with custom dedupe model

import { extract, sequentialAutoMerge } from 'struktur';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';

const result = await extract({
  artifacts: transactionHistoryArtifacts,
  schema: transactionsSchema,
  strategy: sequentialAutoMerge({
    model: anthropic('claude-3-5-sonnet-20241022'),
    dedupeModel: openai('gpt-4o'), // Use different model for deduplication
    chunkSize: 150000,
    maxImages: 3,
    outputInstructions: 'Process chronologically, maintaining transaction order',
  }),
  events: {
    onStep: ({ step, total, label }) => {
      console.log(`Step ${step}/${total}: ${label}`);
    },
  },
});

console.log(`Extracted ${result.data.transactions.length} unique transactions`);
