
Overview

LLM Magic’s extraction feature pulls structured, validated data out of documents such as PDFs, images, and Word files. It processes each document and returns JSON data matching the schema you provide.

Quick Start

Basic Extraction

Extract data from a document using a JSON schema:
use Mateffy\Magic;

$data = Magic::extract()
    ->schema([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age' => ['type' => 'integer'],
        ],
        'required' => ['name', 'age'],
    ])
    ->file('/path/to/document.pdf')
    ->send();

Working with Files

Single File

Process a single document:
Magic::extract()
    ->schema([/* ... */])
    ->file('/path/to/invoice.pdf')
    ->send();

Multiple Files

Process multiple documents at once:
Magic::extract()
    ->schema([/* ... */])
    ->files([
        '/path/to/document1.pdf',
        '/path/to/document2.pdf',
        '/path/to/image.jpg',
    ])
    ->send();

Using Artifacts

For more control, such as reading from a specific Laravel disk, use an artifact class like DiskArtifact:
use Mateffy\Magic\Extraction\Artifacts\DiskArtifact;

$artifact = DiskArtifact::from(
    path: '/path/to/document.pdf',
    disk: 's3' // Optional: Laravel disk
);

Magic::extract()
    ->schema([/* ... */])
    ->artifact($artifact)
    ->send();

Defining Schemas

JSON Schema Format

Define the structure of data you want to extract using JSON Schema:
$schema = [
    'type' => 'object',
    'properties' => [
        'invoice_number' => [
            'type' => 'string',
            'description' => 'The invoice reference number',
        ],
        'date' => [
            'type' => 'string',
            'format' => 'date',
        ],
        'total_amount' => [
            'type' => 'number',
            'description' => 'Total amount in USD',
        ],
        'items' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'description' => ['type' => 'string'],
                    'quantity' => ['type' => 'integer'],
                    'price' => ['type' => 'number'],
                ],
                'required' => ['description', 'quantity', 'price'],
            ],
        ],
    ],
    'required' => ['invoice_number', 'date', 'total_amount', 'items'],
];

Magic::extract()
    ->schema($schema)
    ->file('/path/to/invoice.pdf')
    ->send();
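
The `required` list in a schema can also be used after extraction as a quick sanity check. The following is a minimal sketch in plain PHP, not part of LLM Magic: the helper `missingRequiredFields` and the sample record are hypothetical, and only top-level fields are checked.

```php
<?php

/**
 * Return the names of required top-level fields that are absent or null in $record.
 * Only the top-level 'required' list is checked; nested schemas are not traversed.
 */
function missingRequiredFields(array $schema, array $record): array
{
    $missing = [];

    foreach ($schema['required'] ?? [] as $field) {
        // isset() is false both for a missing key and for a null value.
        if (!isset($record[$field])) {
            $missing[] = $field;
        }
    }

    return $missing;
}

$schema = [
    'type' => 'object',
    'properties' => [
        'invoice_number' => ['type' => 'string'],
        'total_amount' => ['type' => 'number'],
    ],
    'required' => ['invoice_number', 'total_amount'],
];

// Hypothetical extracted record, written by hand for illustration.
$record = ['invoice_number' => 'INV-001'];

$missing = missingRequiredFields($schema, $record);
```

An empty return value means every required top-level field is present; anything else is a candidate for a retry or manual review.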

Complex Schemas

Extract nested data structures:
$schema = [
    'type' => 'object',
    'properties' => [
        'patient' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'date_of_birth' => ['type' => 'string', 'format' => 'date'],
                'address' => [
                    'type' => 'object',
                    'properties' => [
                        'street' => ['type' => 'string'],
                        'city' => ['type' => 'string'],
                        'zip' => ['type' => 'string'],
                    ],
                ],
            ],
        ],
        'medications' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'name' => ['type' => 'string'],
                    'dosage' => ['type' => 'string'],
                    'frequency' => ['type' => 'string'],
                ],
            ],
        ],
    ],
];
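
Deeply nested schema arrays get hard to read. One option is to assemble them from small helper functions. The helpers below (`objectSchema`, `arraySchema`) are hypothetical and live in your own code, not in LLM Magic; they only produce the same plain arrays shown above.

```php
<?php

/** Build an object schema from a property map and an optional required list. */
function objectSchema(array $properties, array $required = []): array
{
    $schema = ['type' => 'object', 'properties' => $properties];

    if ($required !== []) {
        $schema['required'] = $required;
    }

    return $schema;
}

/** Build an array schema whose elements all follow $items. */
function arraySchema(array $items): array
{
    return ['type' => 'array', 'items' => $items];
}

$string = ['type' => 'string'];

// Same structure as the patient/medications schema, built from the helpers.
$schema = objectSchema([
    'patient' => objectSchema([
        'name' => $string,
        'date_of_birth' => ['type' => 'string', 'format' => 'date'],
        'address' => objectSchema([
            'street' => $string,
            'city' => $string,
            'zip' => $string,
        ]),
    ]),
    'medications' => arraySchema(objectSchema([
        'name' => $string,
        'dosage' => $string,
        'frequency' => $string,
    ])),
]);
```

The resulting `$schema` is an ordinary array, so it can be passed to `->schema($schema)` like any hand-written one.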

Extraction Strategies

LLM Magic provides multiple strategies for processing documents, each optimized for different use cases.

Available Strategies

use Mateffy\Magic;

// Simple: Process entire document at once (default)
Magic::extract()
    ->strategy('simple')
    ->file('/path/to/document.pdf')
    ->send();

// Sequential: Process in chunks, one after another
Magic::extract()
    ->strategy('sequential')
    ->file('/path/to/large-document.pdf')
    ->send();

// Sequential with auto-merge: Combines results automatically
Magic::extract()
    ->strategy('sequential-auto-merge')
    ->file('/path/to/document.pdf')
    ->send();

// Parallel: Process chunks concurrently for speed
Magic::extract()
    ->strategy('parallel')
    ->file('/path/to/document.pdf')
    ->send();

// Parallel with auto-merge
Magic::extract()
    ->strategy('parallel-auto-merge')
    ->file('/path/to/document.pdf')
    ->send();

// Double-pass: Extract then verify for accuracy
Magic::extract()
    ->strategy('double-pass')
    ->file('/path/to/document.pdf')
    ->send();

// Double-pass with auto-merge
Magic::extract()
    ->strategy('double-pass-auto-merge')
    ->file('/path/to/document.pdf')
    ->send();

Strategy Comparison

Simple
Best for: Small documents (1-5 pages)
Processes the entire document in a single LLM call. Fastest for small documents but may hit context limits on larger files.

Sequential
Best for: Large documents where order matters
Splits the document into chunks and processes them one by one. Good for documents where context from earlier pages affects later pages.

Parallel
Best for: Large documents needing speed
Processes multiple chunks simultaneously. Much faster than sequential but uses more resources.

Double-pass
Best for: Critical data requiring high accuracy
Makes two passes: first to extract, second to verify. Highest accuracy but slower and more expensive.
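
The comparison above can be turned into a small decision helper. This is a sketch under stated assumptions: the function `chooseStrategy`, the 5-page threshold (taken from the "1-5 pages" guidance), and the priority order are illustrative choices, not behavior of the library.

```php
<?php

/**
 * Pick a strategy name from rough document traits.
 * Thresholds and priorities are illustrative assumptions, not library behavior.
 */
function chooseStrategy(int $pageCount, bool $accuracyCritical = false, bool $orderMatters = false): string
{
    // Accuracy wins over speed when the data is critical.
    if ($accuracyCritical) {
        return 'double-pass';
    }

    // Small documents fit in a single call.
    if ($pageCount <= 5) {
        return 'simple';
    }

    // Large documents: keep page order when later pages depend on earlier ones.
    return $orderMatters ? 'sequential' : 'parallel';
}

$strategy = chooseStrategy(40, orderMatters: true);
```

The returned string can be passed straight to `->strategy($strategy)`.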

Custom Chunk Size

Control how documents are split:
Magic::extract()
    ->strategy('sequential')
    ->chunkSize(2) // Process 2 pages at a time
    ->file('/path/to/document.pdf')
    ->send();
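
To get a feel for how a chunk size affects the number of LLM calls, the sketch below computes page ranges for a given page count. It assumes chunks are consecutive, non-overlapping page ranges; the library's actual splitting logic may differ, and `chunkPageRanges` is a hypothetical helper.

```php
<?php

/**
 * Split pages 1..$pageCount into consecutive ranges of at most $chunkSize pages.
 * Returns a list of [firstPage, lastPage] pairs.
 */
function chunkPageRanges(int $pageCount, int $chunkSize): array
{
    if ($pageCount < 0 || $chunkSize < 1) {
        throw new InvalidArgumentException('pageCount must be >= 0 and chunkSize >= 1');
    }

    $ranges = [];
    for ($start = 1; $start <= $pageCount; $start += $chunkSize) {
        // The last chunk may be shorter than $chunkSize.
        $ranges[] = [$start, min($start + $chunkSize - 1, $pageCount)];
    }

    return $ranges;
}

$ranges = chunkPageRanges(5, 2);
// $ranges is [[1, 2], [3, 4], [5, 5]]
```

Under this assumption, a 5-page document with a chunk size of 2 yields three chunks, so roughly three extraction calls per pass.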

Context Options

Control what content from documents is included in the extraction:
use Mateffy\Magic\Extraction\ContextOptions;

// Include everything (text, embedded images, and page images)
Magic::extract()
    ->contextOptions(ContextOptions::all())
    ->file('/path/to/document.pdf')
    ->send();

// Custom context options
Magic::extract()
    ->contextOptions(new ContextOptions(
        includeText: true,
        includeEmbeddedImages: true,
        markEmbeddedImages: true,
        includePageImages: false,
        markPageImages: false,
    ))
    ->file('/path/to/document.pdf')
    ->send();

// Default context (text + embedded images)
Magic::extract()
    ->contextOptions(ContextOptions::default())
    ->file('/path/to/document.pdf')
    ->send();

Context Options Explained

includeText (boolean, default: true)
Include text content from the document.

includeEmbeddedImages (boolean, default: true)
Include images embedded within the document content.

markEmbeddedImages (boolean, default: true)
Add markers in the text where embedded images appear.

includePageImages (boolean, default: false)
Include full-page screenshot images.

markPageImages (boolean, default: false)
Add markers for page images.

Output Instructions

Provide additional guidance to improve extraction accuracy:
Magic::extract()
    ->schema([/* ... */])
    ->outputInstructions(
        'Extract all financial data in USD. ' .
        'Convert dates to YYYY-MM-DD format. ' .
        'If a field is not present, use null.'
    )
    ->file('/path/to/invoice.pdf')
    ->send();
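
Output instructions guide the model but do not guarantee the format, so it can be worth checking values afterwards. The sketch below validates the YYYY-MM-DD date format requested above using PHP's built-in `DateTimeImmutable`; the helper name `isIsoDate` is hypothetical.

```php
<?php

/** True when $value is a real calendar date written as YYYY-MM-DD. */
function isIsoDate(mixed $value): bool
{
    if (!is_string($value)) {
        return false;
    }

    // The leading "!" resets unparsed fields (time) to zero.
    $date = DateTimeImmutable::createFromFormat('!Y-m-d', $value);

    // createFromFormat rolls over invalid dates such as 2024-02-30,
    // so require the formatted result to match the input exactly.
    return $date !== false && $date->format('Y-m-d') === $value;
}

$valid = isIsoDate('2024-03-15');
$invalid = isIsoDate('15/03/2024');
```

A null value is reported as invalid here; if your instructions allow null for missing fields, check for null before calling the helper.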

Model Selection

Choose the AI model for extraction:
Magic::extract()
    ->model('google/gemini-2.0-flash-lite')
    ->schema([/* ... */])
    ->file('/path/to/document.pdf')
    ->send();

Working with Results

The extraction returns an Illuminate Collection of results:
$results = Magic::extract()
    ->schema([/* ... */])
    ->files([
        '/path/to/doc1.pdf',
        '/path/to/doc2.pdf',
    ])
    ->send();

// Iterate over results
foreach ($results as $result) {
    echo $result['invoice_number'];
}

// Use collection methods
$totalAmount = $results->sum('total_amount');

// Convert to array
$data = $results->toArray();

// Get first result
$firstResult = $results->first();
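
Once converted with `toArray()`, the results are plain PHP arrays and can be processed with built-in array functions. A small sketch with hand-written sample records (the values are illustrative, not real extraction output):

```php
<?php

// Hypothetical extracted records, shaped like the output of toArray().
$data = [
    ['invoice_number' => 'INV-001', 'total_amount' => 120.50],
    ['invoice_number' => 'INV-002', 'total_amount' => 79.50],
];

// Sum one column across all records.
$totalAmount = array_sum(array_column($data, 'total_amount'));

// Index records by invoice number for direct lookup.
$byNumber = array_column($data, null, 'invoice_number');

$second = $byNumber['INV-002'];
```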

Callbacks

Token Statistics

Track token usage during extraction:
Magic::extract()
    ->schema([/* ... */])
    ->file('/path/to/document.pdf')
    ->onTokenStats(function ($stats) {
        logger()->info('Token usage', $stats);
    })
    ->send();
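
If the callback fires more than once per extraction (for example once per chunk), you may want running totals rather than individual log lines. The sketch below assumes `$stats` arrives as an array of numeric values, which the `logger()->info('Token usage', $stats)` call above suggests but does not confirm. The class `TokenTally` and the key names `input_tokens` and `output_tokens` are hypothetical.

```php
<?php

/** Accumulates numeric entries from successive stats arrays. */
final class TokenTally
{
    /** @var array<string, int|float> */
    private array $totals = [];

    /** Invokable, so an instance can be used wherever a callable is expected. */
    public function __invoke(array $stats): void
    {
        foreach ($stats as $key => $value) {
            // Skip non-numeric entries such as model names.
            if (is_int($value) || is_float($value)) {
                $this->totals[$key] = ($this->totals[$key] ?? 0) + $value;
            }
        }
    }

    public function totals(): array
    {
        return $this->totals;
    }
}

$tally = new TokenTally();

// Simulated callback invocations; the key names are hypothetical.
$tally(['input_tokens' => 1200, 'output_tokens' => 300]);
$tally(['input_tokens' => 800, 'output_tokens' => 150, 'model' => 'example']);

$totals = $tally->totals();
// $totals is ['input_tokens' => 2000, 'output_tokens' => 450]
```

Because the object is invokable, it could be passed as the callback, e.g. `->onTokenStats($tally)`, provided the callback argument really is an array.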

Extraction Progress

Monitor extraction progress:
Magic::extract()
    ->schema([/* ... */])
    ->file('/path/to/document.pdf')
    ->onChunkExtracted(function ($chunk, $result) {
        echo "Processed chunk {$chunk} of document\n";
    })
    ->send();

Complete Example

Here’s a complete example extracting invoice data:
use Mateffy\Magic;
use Mateffy\Magic\Extraction\ContextOptions;

$schema = [
    'type' => 'object',
    'properties' => [
        'invoice_number' => [
            'type' => 'string',
            'description' => 'Invoice reference number',
        ],
        'invoice_date' => [
            'type' => 'string',
            'format' => 'date',
        ],
        'due_date' => [
            'type' => 'string',
            'format' => 'date',
        ],
        'vendor' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'address' => ['type' => 'string'],
                'tax_id' => ['type' => 'string'],
            ],
        ],
        'line_items' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'description' => ['type' => 'string'],
                    'quantity' => ['type' => 'integer'],
                    'unit_price' => ['type' => 'number'],
                    'total' => ['type' => 'number'],
                ],
            ],
        ],
        'subtotal' => ['type' => 'number'],
        'tax' => ['type' => 'number'],
        'total' => ['type' => 'number'],
    ],
    'required' => [
        'invoice_number',
        'invoice_date',
        'vendor',
        'line_items',
        'total',
    ],
];

$results = Magic::extract()
    ->model('google/gemini-2.0-flash-lite')
    ->schema($schema)
    ->strategy('double-pass-auto-merge')
    ->contextOptions(ContextOptions::all())
    ->outputInstructions(
        'Extract all amounts as numbers (no currency symbols). ' .
        'Convert all dates to YYYY-MM-DD format. ' .
        'Ensure line item totals are calculated correctly.'
    )
    ->files([
        storage_path('invoices/invoice1.pdf'),
        storage_path('invoices/invoice2.pdf'),
    ])
    ->onTokenStats(function ($stats) {
        logger()->info('Extraction tokens', $stats);
    })
    ->send();

foreach ($results as $invoice) {
    // Escape the dollar sign: "${...}" would otherwise be parsed as a variable variable.
    echo "Invoice {$invoice['invoice_number']}: \${$invoice['total']}\n";
}
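
The output instructions ask the model to keep line item totals correct, but the result can still be verified in code. The following sketch checks each line item against `quantity * unit_price`, using the field names from the schema above. The helper `inconsistentLineItems`, the tolerance, and the sample invoice are illustrative assumptions.

```php
<?php

/**
 * Return the indexes of line items whose total differs from
 * quantity * unit_price by more than $tolerance.
 */
function inconsistentLineItems(array $invoice, float $tolerance = 0.01): array
{
    $bad = [];

    foreach ($invoice['line_items'] ?? [] as $index => $item) {
        $expected = ($item['quantity'] ?? 0) * ($item['unit_price'] ?? 0);

        // Compare with a tolerance to absorb floating-point rounding.
        if (abs($expected - ($item['total'] ?? 0)) > $tolerance) {
            $bad[] = $index;
        }
    }

    return $bad;
}

// Hand-written sample invoice; the second line item is deliberately wrong.
$invoice = [
    'line_items' => [
        ['description' => 'Widget', 'quantity' => 2, 'unit_price' => 10.0, 'total' => 20.0],
        ['description' => 'Gadget', 'quantity' => 3, 'unit_price' => 5.0, 'total' => 16.0],
    ],
];

$bad = inconsistentLineItems($invoice);
```

A non-empty result is a reasonable trigger for re-running the extraction or flagging the invoice for manual review.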

Supported File Types

PDF

Portable Document Format files

Images

JPG, PNG, GIF, WebP

Word

DOCX documents

Best Practices

1. Start with Simple Strategy

Begin with the simple strategy and upgrade to more complex strategies only if needed.

2. Provide Clear Schemas

Use descriptive property names and include description fields to guide extraction.

3. Test with Sample Documents

Always test your schema with sample documents before processing in production.

4. Use Output Instructions

Provide specific instructions about formats, units, and handling missing data.

5. Choose the Right Strategy

Use double-pass for critical data, parallel for speed, simple for small documents.

Next Steps

Chat API

Build conversational AI with multi-turn conversations

Embeddings

Generate embeddings for semantic search

Streaming

Learn about real-time streaming responses

API Reference

Explore the complete API documentation
