
Overview

Magic::extract() extracts structured, validated JSON data from documents such as PDFs, Word files, and images. Extraction strategies control how a document is split into chunks and processed, so that documents too large for a single LLM call can still be handled.

Basic Extraction

1. Define your schema

Create a JSON Schema describing the data you want to extract:
$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string'],
        'age' => ['type' => 'integer'],
        'email' => ['type' => 'string', 'format' => 'email'],
    ],
    'required' => ['name', 'age'],
];
2. Prepare your artifacts

Artifacts are the documents to extract from (PDFs, images, text, etc.):
use Mateffy\Magic\Extraction\Artifacts\Artifact;

$artifact = Artifact::fromPath('/path/to/document.pdf');
3. Extract the data

use Mateffy\Magic;

$data = Magic::extract()
    ->schema($schema)
    ->artifacts([$artifact])
    ->send();

// Returns: ['name' => 'John Doe', 'age' => 25, 'email' => 'john.doe@example.com']

Complete Example

From the README.md:
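use Mateffy\Magic;
use Mateffy\Magic\Extraction\Artifacts\Artifact;

// Load the document to extract from
$artifact = Artifact::fromPath('/path/to/document.pdf');
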
$data = Magic::extract()
    ->schema([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age' => ['type' => 'integer'],
        ],
        'required' => ['name', 'age'],
    ])
    ->artifacts([$artifact])
    ->send();
The extracted data is automatically validated against your JSON Schema. Invalid data will trigger validation errors.
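
To make "validated against your JSON Schema" concrete, here is a minimal sketch in plain PHP that checks only the required keys and a few primitive types of a flat object schema. The helper checkAgainstSchema() is hypothetical and not part of LLM Magic, which performs its own validation; the sketch only illustrates the kind of rule being enforced.

```php
<?php
// Hypothetical helper, NOT part of LLM Magic: checks required keys and
// primitive types for a flat object schema and returns a list of errors.
function checkAgainstSchema(array $data, array $schema): array
{
    $errors = [];

    // Every key listed under 'required' must be present in the data.
    foreach ($schema['required'] ?? [] as $key) {
        if (!array_key_exists($key, $data)) {
            $errors[] = "Missing required field: {$key}";
        }
    }

    // Map JSON Schema primitive types to PHP type checks.
    $checks = [
        'string'  => 'is_string',
        'integer' => 'is_int',
        'number'  => fn ($value) => is_int($value) || is_float($value),
        'boolean' => 'is_bool',
    ];

    // Present fields must match their declared type.
    foreach ($schema['properties'] ?? [] as $key => $definition) {
        $type = $definition['type'] ?? '';
        if (array_key_exists($key, $data) && isset($checks[$type]) && !$checks[$type]($data[$key])) {
            $errors[] = "Field {$key} should be of type {$type}";
        }
    }

    return $errors;
}

$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string'],
        'age' => ['type' => 'integer'],
    ],
    'required' => ['name', 'age'],
];

// 'age' is a string here, so one type error is reported.
$errors = checkAgainstSchema(['name' => 'John Doe', 'age' => '25'], $schema);
print_r($errors);
```

A full validator would also cover nested objects, arrays, and formats such as email; this sketch deliberately stops at the two rules used in the basic example.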

Extraction Strategies

Strategies determine how LLM Magic processes your documents. Each strategy implements the Strategy interface from src/Magic/Extraction/Strategies/Strategy.php.

Available Strategies

SimpleStrategy - Processes only the first chunk of artifacts.
$data = Magic::extract()
    ->strategy('simple')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
Best for:
  • Single-page documents
  • Small files that fit in one LLM call
  • Quick prototyping
From src/Magic/Extraction/Strategies/SimpleStrategy.php:14
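
The practical effect of "only the first chunk" can be sketched in plain PHP. The code below does not use LLM Magic: $extractChunk is a hypothetical stand-in for a single LLM call, and the merging loop is an assumption about how results from several chunks might be combined, not the library's actual implementation.

```php
<?php
// Two chunks of a document; in this toy example each chunk is a list of names.
$chunks = [
    ['Alice'],
    ['Bob', 'Carol'],
];

// Hypothetical stand-in for one LLM call on one chunk.
$extractChunk = fn (array $chunk): array => ['employees' => $chunk];

// 'simple'-style processing: only the first chunk is looked at.
$simple = $extractChunk($chunks[0]);

// Processing that covers all content: every chunk is visited in order
// and the partial results are combined (illustrative merge only).
$all = ['employees' => []];
foreach ($chunks as $chunk) {
    $partial = $extractChunk($chunk);
    $all['employees'] = array_merge($all['employees'], $partial['employees']);
}

echo count($simple['employees']), PHP_EOL; // 1
echo count($all['employees']), PHP_EOL;    // 3
```

If a document spans more than one chunk, anything after the first chunk never reaches the model under the first approach, which is why it suits single-page documents.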

Strategy Selection

All built-in strategies are registered in src/Magic.php:206 under the following names:
'simple' => SimpleStrategy::class,
'sequential' => SequentialStrategy::class,
'sequential-auto-merge' => SequentialAutoMergeStrategy::class,
'parallel' => ParallelStrategy::class,
'parallel-auto-merge' => ParallelAutoMergeStrategy::class,
'double-pass' => DoublePassStrategy::class,
'double-pass-auto-merge' => DoublePassStrategyAutoMerging::class,

Customizing Extraction

Chunk Size

Control how many pages/sections are processed in each batch:
$data = Magic::extract()
    ->chunkSize(5)  // Process 5 pages at a time
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
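
As a rough illustration of what the chunk size controls, the sketch below uses PHP's built-in array_chunk() to group a 12-page document into batches of 5. This is not LLM Magic's chunking code, and $pages is a hypothetical stand-in for a document's pages; it only shows how the number of batches follows from the chunk size.

```php
<?php
// Hypothetical stand-in for the pages of a 12-page document.
$pages = range(1, 12);
$chunkSize = 5;

// Group the pages into batches of at most $chunkSize pages each.
$batches = array_chunk($pages, $chunkSize);

foreach ($batches as $index => $batch) {
    printf("Batch %d: pages %d-%d\n", $index + 1, $batch[0], $batch[count($batch) - 1]);
}
// Batch 1: pages 1-5
// Batch 2: pages 6-10
// Batch 3: pages 11-12
```

A smaller chunk size means more batches with less context in each; a larger one means fewer batches that each carry more content.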

Model Selection

$data = Magic::extract()
    ->model('google/gemini-2.0-flash-lite')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();

Output Instructions

Provide custom instructions for the extraction:
$data = Magic::extract()
    ->outputInstructions('Focus on extracting contact information. Ignore marketing content.')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();

Progress Tracking

Token Usage

use Mateffy\Magic\Chat\TokenStats;

$data = Magic::extract()
    ->schema($schema)
    ->artifacts($artifacts)
    ->onTokenStats(function (TokenStats $stats) {
        echo "Tokens used: {$stats->totalTokens}";
    })
    ->send();

Data Progress

$data = Magic::extract()
    ->schema($schema)
    ->artifacts($artifacts)
    ->onDataProgress(function (array $partialData) {
        // Called as data is extracted
        print_r($partialData);
    })
    ->send();

Message Progress

use Mateffy\Magic\Chat\Messages\Message;

$data = Magic::extract()
    ->schema($schema)
    ->artifacts($artifacts)
    ->onMessageProgress(function (Message $message) {
        // Called for each message
        echo $message->text();
    })
    ->send();

Complex Schemas

Nested Objects

$schema = [
    'type' => 'object',
    'properties' => [
        'company' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'founded' => ['type' => 'integer'],
                'address' => [
                    'type' => 'object',
                    'properties' => [
                        'street' => ['type' => 'string'],
                        'city' => ['type' => 'string'],
                        'country' => ['type' => 'string'],
                    ],
                ],
            ],
        ],
    ],
];

Arrays of Items

$schema = [
    'type' => 'object',
    'properties' => [
        'employees' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'name' => ['type' => 'string'],
                    'role' => ['type' => 'string'],
                    'salary' => ['type' => 'number'],
                ],
            ],
        ],
    ],
];

Best Practices

  • Make schemas as specific as possible
  • Use ‘required’ for critical fields
  • Add ‘description’ fields to guide extraction
  • Use appropriate JSON Schema types
  • Use ‘simple’ strategy for single-page documents
  • Use ‘parallel’ strategies for large documents
  • Adjust chunk size based on document structure
  • Monitor token usage with callbacks
  • Provide clear output instructions
  • Use double-pass strategies for complex data
  • Test with sample documents first
  • Validate extracted data
Extraction strategies process documents differently. The ‘simple’ strategy only processes the first chunk, while ‘sequential’ and ‘parallel’ process all content.
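
The schema-related practices above (specific types, required fields, descriptions) might look like the following sketch. The field names are hypothetical examples for an invoice; description and enum are standard JSON Schema keywords, but how strongly LLM Magic uses descriptions when prompting the model is not stated on this page.

```php
<?php
// Hypothetical invoice schema illustrating specific types, 'required',
// and 'description' annotations.
$schema = [
    'type' => 'object',
    'properties' => [
        'invoice_number' => [
            'type' => 'string',
            'description' => 'The invoice identifier exactly as printed, e.g. "INV-2024-001"',
        ],
        'total' => [
            'type' => 'number',
            'description' => 'Grand total including tax, without a currency symbol',
        ],
        'currency' => [
            'type' => 'string',
            'enum' => ['EUR', 'USD', 'GBP'],
            'description' => 'ISO 4217 currency code of the total',
        ],
    ],
    // Only the fields the application cannot work without are required.
    'required' => ['invoice_number', 'total'],
];

// Print the schema as JSON to inspect what would be sent along.
echo json_encode($schema, JSON_PRETTY_PRINT), PHP_EOL;
```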

Next Steps

Custom Strategies

Learn to create custom extraction strategies

Multimodal Chat

Extract data from images in conversations
