Overview
Magic::extract() enables you to extract structured, validated JSON data from documents like PDFs, Word files, and images. It uses extraction strategies to efficiently process documents of any size.
Define your schema
Create a JSON Schema describing the data you want to extract:
$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string'],
        'age' => ['type' => 'integer'],
        'email' => ['type' => 'string', 'format' => 'email'],
    ],
    'required' => ['name', 'age'],
];
Prepare your artifacts
Artifacts are the documents to extract from (PDFs, images, text, etc.):
use Mateffy\Magic\Extraction\Artifacts\Artifact;

$artifact = Artifact::fromPath('/path/to/document.pdf');
Extract the data
use Mateffy\Magic;

$data = Magic::extract()
    ->schema($schema)
    ->artifacts([$artifact])
    ->send();
// Returns: ['name' => 'John Doe', 'age' => 25, 'email' => 'john.doe@example.com']
Complete Example
From the README.md:
use Mateffy\Magic;

$data = Magic::extract()
    ->schema([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age' => ['type' => 'integer'],
        ],
        'required' => ['name', 'age'],
    ])
    ->artifacts([$artifact])
    ->send();
The extracted data is automatically validated against your JSON Schema. Invalid data will trigger validation errors.
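To make the idea of validating against a schema concrete, here is a minimal sketch of what such a check involves. The function findSchemaProblems() is hypothetical and is not the validator LLM Magic uses internally; it only covers required keys and a few primitive types.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of LLM Magic): checks required keys and
 * primitive types only, ignoring formats and nested structures.
 *
 * @return string[] human-readable problems, empty when the data passes
 */
function findSchemaProblems(array $data, array $schema): array
{
    $problems = [];

    // Every field listed under "required" must be present.
    foreach ($schema['required'] ?? [] as $field) {
        if (!array_key_exists($field, $data)) {
            $problems[] = "Missing required field: {$field}";
        }
    }

    // Map JSON Schema primitive types to PHP type checks.
    $checks = [
        'string'  => 'is_string',
        'integer' => 'is_int',
        'number'  => static fn ($value): bool => is_int($value) || is_float($value),
        'boolean' => 'is_bool',
    ];

    foreach ($schema['properties'] ?? [] as $field => $definition) {
        $type = $definition['type'] ?? null;
        // Skip absent fields and types this sketch does not cover.
        if (!array_key_exists($field, $data) || !is_string($type) || !isset($checks[$type])) {
            continue;
        }
        if (!$checks[$type]($data[$field])) {
            $problems[] = "Field {$field} should be of type {$type}";
        }
    }

    return $problems;
}

$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string'],
        'age' => ['type' => 'integer'],
    ],
    'required' => ['name', 'age'],
];

// 'age' arrives as a string here, so one problem is reported.
print_r(findSchemaProblems(['name' => 'John Doe', 'age' => '25'], $schema));
```

A check like this can also be run on the returned array in your own code as an extra safety net, independent of whatever validation the library performs.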
Strategies determine how LLM Magic processes your documents. Each strategy implements the Strategy interface from src/Magic/Extraction/Strategies/Strategy.php.
Available Strategies
Simple
Sequential
Parallel
Auto-Merge Variants
SimpleStrategy - Processes only the first chunk of artifacts.
$data = Magic::extract()
    ->strategy('simple')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
Best for:
Single-page documents
Small files that fit in one LLM call
Quick prototyping
From src/Magic/Extraction/Strategies/SimpleStrategy.php:14
SequentialStrategy - Processes documents in batches sequentially, building up the result.
$data = Magic::extract()
    ->strategy('sequential')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
Best for:
Multi-page documents
Documents where order matters
Building cumulative results
From src/Magic/Extraction/Strategies/SequentialStrategy.php:16
ParallelStrategy - Processes batches concurrently, then merges results with an LLM.
$data = Magic::extract()
    ->strategy('parallel')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
Best for:
Large documents
Independent sections
Maximum speed
From src/Magic/Extraction/Strategies/ParallelStrategy.php:20
Auto-merge strategies automatically combine partial results:
sequential-auto-merge - Sequential processing with automatic merging
parallel-auto-merge - Parallel processing with automatic merging
double-pass - Two-pass extraction for complex documents
double-pass-auto-merge - Two-pass with automatic merging
$data = Magic::extract()
    ->strategy('parallel-auto-merge')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
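The difference between the simple and sequential strategies can be sketched without calling any model. The functions runSimple() and runSequential() and the $extract callback are hypothetical stand-ins for illustration, not the library's implementation: the callback plays the role of one LLM call that receives a batch plus the result built so far.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for one LLM call: takes a batch and the result so far,
// returns the updated result. Here it simply appends the batch items.
$extract = static fn (array $batch, array $previous): array => array_merge($previous, $batch);

/** Simple: only the first batch is ever looked at. */
function runSimple(array $batches, callable $extract): array
{
    return $batches === [] ? [] : $extract($batches[0], []);
}

/** Sequential: every batch is processed in order, building on the previous result. */
function runSequential(array $batches, callable $extract): array
{
    $result = [];
    foreach ($batches as $batch) {
        $result = $extract($batch, $result);
    }

    return $result;
}

$batches = [['page 1', 'page 2'], ['page 3']];

print_r(runSimple($batches, $extract));     // contains only page 1 and page 2
print_r(runSequential($batches, $extract)); // contains all three pages
```

A parallel variant would run the same callback on each batch independently and merge the partial results afterwards, which is why it suits documents with independent sections.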
Strategy Selection
From src/Magic.php:206, all built-in strategies are registered:
'simple' => SimpleStrategy::class,
'sequential' => SequentialStrategy::class,
'sequential-auto-merge' => SequentialAutoMergeStrategy::class,
'parallel' => ParallelStrategy::class,
'parallel-auto-merge' => ParallelAutoMergeStrategy::class,
'double-pass' => DoublePassStrategy::class,
'double-pass-auto-merge' => DoublePassStrategyAutoMerging::class,
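Because strategies are selected by string name, a typo may only surface when the extraction runs. One possible guard is sketched below; assertKnownStrategy() is a hypothetical helper, not part of LLM Magic, and it only knows the built-in names listed above, so it would reject custom strategies registered elsewhere.

```php
<?php
declare(strict_types=1);

// Names copied from the list of built-in strategies.
const BUILT_IN_STRATEGIES = [
    'simple',
    'sequential',
    'sequential-auto-merge',
    'parallel',
    'parallel-auto-merge',
    'double-pass',
    'double-pass-auto-merge',
];

/** Hypothetical guard: returns the name unchanged or throws on an unknown one. */
function assertKnownStrategy(string $name): string
{
    if (!in_array($name, BUILT_IN_STRATEGIES, true)) {
        throw new InvalidArgumentException(
            "Unknown strategy '{$name}'. Expected one of: " . implode(', ', BUILT_IN_STRATEGIES)
        );
    }

    return $name;
}

echo assertKnownStrategy('parallel-auto-merge'), PHP_EOL;
```

The returned name could then be passed on, for example as ->strategy(assertKnownStrategy($name)).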
Chunk Size
Control how many pages/sections are processed in each batch:
$data = Magic::extract()
    ->chunkSize(5) // Process 5 pages at a time
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
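As a rough illustration of how a chunk size translates into batches, the sketch below splits a hypothetical 12-page document into groups of five. It assumes the chunk size counts pages, as the comment in the example suggests; how LLM Magic actually measures and splits content may differ.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a 12-page document.
$pages = range(1, 12);
$chunkSize = 5;

// array_chunk() splits the list into consecutive groups of at most $chunkSize items.
$batches = array_chunk($pages, $chunkSize);

foreach ($batches as $index => $batch) {
    printf("Batch %d: pages %d-%d\n", $index + 1, $batch[0], $batch[count($batch) - 1]);
}
// Batch 1: pages 1-5
// Batch 2: pages 6-10
// Batch 3: pages 11-12
```

A smaller chunk size means more, smaller model calls; a larger one means fewer calls that each carry more context.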
Model Selection
$data = Magic::extract()
    ->model('google/gemini-2.0-flash-lite')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
Output Instructions
Provide custom instructions for the extraction:
$data = Magic::extract()
    ->outputInstructions('Focus on extracting contact information. Ignore marketing content.')
    ->schema($schema)
    ->artifacts($artifacts)
    ->send();
Progress Tracking
Token Usage
use Mateffy\Magic\Chat\TokenStats;

$data = Magic::extract()
    ->schema($schema)
    ->artifacts($artifacts)
    ->onTokenStats(function (TokenStats $stats) {
        echo "Tokens used: {$stats->totalTokens}";
    })
    ->send();
Data Progress
$data = Magic::extract()
    ->schema($schema)
    ->artifacts($artifacts)
    ->onDataProgress(function (array $partialData) {
        // Called as data is extracted
        print_r($partialData);
    })
    ->send();
Message Progress
use Mateffy\Magic\Chat\Messages\Message;

$data = Magic::extract()
    ->schema($schema)
    ->artifacts($artifacts)
    ->onMessageProgress(function (Message $message) {
        // Called for each message
        echo $message->text();
    })
    ->send();
Complex Schemas
Nested Objects
$schema = [
    'type' => 'object',
    'properties' => [
        'company' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'founded' => ['type' => 'integer'],
                'address' => [
                    'type' => 'object',
                    'properties' => [
                        'street' => ['type' => 'string'],
                        'city' => ['type' => 'string'],
                        'country' => ['type' => 'string'],
                    ],
                ],
            ],
        ],
    ],
];
Arrays of Items
$schema = [
    'type' => 'object',
    'properties' => [
        'employees' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'name' => ['type' => 'string'],
                    'role' => ['type' => 'string'],
                    'salary' => ['type' => 'number'],
                ],
            ],
        ],
    ],
];
Best Practices
Make schemas as specific as possible
Use required for critical fields
Add description fields to guide extraction
Use appropriate JSON Schema types
Provide clear output instructions
Use double-pass strategies for complex data
Test with sample documents first
Validate extracted data
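Several of these practices can be combined in a single schema. The sketch below extends the earlier name/age/email schema with description fields, a required list, and a minimum constraint. The description texts and the minimum value are illustrative assumptions, and how strongly they influence extraction depends on the model and on how LLM Magic passes the schema along.

```php
<?php
declare(strict_types=1);

$schema = [
    'type' => 'object',
    'properties' => [
        'name' => [
            'type' => 'string',
            // Illustrative wording: descriptions tell the model what to look for.
            'description' => 'Full name of the person as written in the document',
        ],
        'age' => [
            'type' => 'integer',
            'description' => 'Age in whole years',
            'minimum' => 0,
        ],
        'email' => [
            'type' => 'string',
            'format' => 'email',
            'description' => 'Primary contact email address',
        ],
    ],
    // Only fields that must always be present belong here.
    'required' => ['name', 'age'],
];

// Print the schema as JSON to inspect what would be handed to the extractor.
echo json_encode($schema, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```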
Extraction strategies process documents differently. The 'simple' strategy only processes the first chunk, while 'sequential' and 'parallel' process all content.
Next Steps
Custom Strategies: Learn to create custom extraction strategies
Multimodal Chat: Extract data from images in conversations