Overview
LLM Magic's extraction feature pulls structured, validated data out of documents such as PDFs, images, and Word documents. It processes each document and returns JSON data that matches your schema.
Quick Start
Extract data from a document using a JSON schema:
use Mateffy\Magic;

$data = Magic::extract()
    ->schema([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age' => ['type' => 'integer'],
        ],
        'required' => ['name', 'age'],
    ])
    ->file('/path/to/document.pdf')
    ->send();
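The schema's `required` list can also be checked on the PHP side once a result comes back. The following is a minimal sketch, assuming a result can be treated as a plain associative array; `missingRequiredFields` is a hypothetical helper written for this illustration and is not part of LLM Magic.

```php
<?php

/**
 * Hypothetical helper (not part of LLM Magic): returns the names of
 * required top-level fields that are absent or null in $data.
 */
function missingRequiredFields(array $schema, array $data): array
{
    $missing = [];

    foreach ($schema['required'] ?? [] as $field) {
        // isset() is false for both missing keys and null values.
        if (!isset($data[$field])) {
            $missing[] = $field;
        }
    }

    return $missing;
}

$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string'],
        'age' => ['type' => 'integer'],
    ],
    'required' => ['name', 'age'],
];

// Sample data standing in for an extraction result.
$missing = missingRequiredFields($schema, ['name' => 'Ada']);
```

Here `$missing` lists the required fields the sample data lacks, which can be logged or used to trigger a retry.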
Working with Files
Single File
Process a single document:
Magic::extract()
    ->schema([/* ... */])
    ->file('/path/to/invoice.pdf')
    ->send();
Multiple Files
Process multiple documents at once:
Magic::extract()
    ->schema([/* ... */])
    ->files([
        '/path/to/document1.pdf',
        '/path/to/document2.pdf',
        '/path/to/image.jpg',
    ])
    ->send();
Using Artifacts
For more control, wrap the file in an artifact class such as DiskArtifact:
use Mateffy\Magic;
use Mateffy\Magic\Extraction\Artifacts\DiskArtifact;

$artifact = DiskArtifact::from(
    path: '/path/to/document.pdf',
    disk: 's3' // Optional: Laravel filesystem disk
);

Magic::extract()
    ->schema([/* ... */])
    ->artifact($artifact)
    ->send();
Defining Schemas
Define the structure of data you want to extract using JSON Schema:
$schema = [
    'type' => 'object',
    'properties' => [
        'invoice_number' => [
            'type' => 'string',
            'description' => 'The invoice reference number',
        ],
        'date' => [
            'type' => 'string',
            'format' => 'date',
        ],
        'total_amount' => [
            'type' => 'number',
            'description' => 'Total amount in USD',
        ],
        'items' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'description' => ['type' => 'string'],
                    'quantity' => ['type' => 'integer'],
                    'price' => ['type' => 'number'],
                ],
                'required' => ['description', 'quantity', 'price'],
            ],
        ],
    ],
    'required' => ['invoice_number', 'date', 'total_amount', 'items'],
];

Magic::extract()
    ->schema($schema)
    ->file('/path/to/invoice.pdf')
    ->send();
Complex Schemas
Extract nested data structures:
$schema = [
    'type' => 'object',
    'properties' => [
        'patient' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'date_of_birth' => ['type' => 'string', 'format' => 'date'],
                'address' => [
                    'type' => 'object',
                    'properties' => [
                        'street' => ['type' => 'string'],
                        'city' => ['type' => 'string'],
                        'zip' => ['type' => 'string'],
                    ],
                ],
            ],
        ],
        'medications' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'name' => ['type' => 'string'],
                    'dosage' => ['type' => 'string'],
                    'frequency' => ['type' => 'string'],
                ],
            ],
        ],
    ],
];
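Because the nested schema above declares no `required` fields, any level of a result may be missing. The sketch below shows one way to read nested values defensively, assuming the result is a plain associative array; `valueAt` is a hypothetical helper for illustration, not an LLM Magic function.

```php
<?php

/**
 * Hypothetical helper: reads a nested value by dot-separated path and
 * returns $default as soon as any segment is missing.
 */
function valueAt(array $data, string $path, mixed $default = null): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return $default;
        }
        $current = $current[$segment];
    }

    return $current;
}

// Sample data standing in for an extraction result; the zip is absent.
$record = [
    'patient' => [
        'name' => 'Jane Doe',
        'address' => ['city' => 'Berlin'],
    ],
    'medications' => [],
];

$city = valueAt($record, 'patient.address.city');
$zip = valueAt($record, 'patient.address.zip', 'unknown');
```

In a Laravel application, the built-in `data_get()` helper serves the same purpose.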
LLM Magic provides multiple strategies for processing documents, each optimized for different use cases.
Available Strategies
use Mateffy\Magic;

// Simple: Process entire document at once (default)
Magic::extract()
    ->strategy('simple')
    ->file('/path/to/document.pdf')
    ->send();

// Sequential: Process in chunks, one after another
Magic::extract()
    ->strategy('sequential')
    ->file('/path/to/large-document.pdf')
    ->send();

// Sequential with auto-merge: Combines results automatically
Magic::extract()
    ->strategy('sequential-auto-merge')
    ->file('/path/to/document.pdf')
    ->send();

// Parallel: Process chunks concurrently for speed
Magic::extract()
    ->strategy('parallel')
    ->file('/path/to/document.pdf')
    ->send();

// Parallel with auto-merge
Magic::extract()
    ->strategy('parallel-auto-merge')
    ->file('/path/to/document.pdf')
    ->send();

// Double-pass: Extract then verify for accuracy
Magic::extract()
    ->strategy('double-pass')
    ->file('/path/to/document.pdf')
    ->send();

// Double-pass with auto-merge
Magic::extract()
    ->strategy('double-pass-auto-merge')
    ->file('/path/to/document.pdf')
    ->send();
Strategy Comparison
Simple
Best for: small documents (1-5 pages). Processes the entire document in a single LLM call. Fastest for small documents, but may hit context limits on larger files.
Sequential
Best for: large documents where order matters. Splits the document into chunks and processes them one by one. Suited to documents where context from earlier pages affects later pages.
Parallel
Best for: large documents that need speed. Processes multiple chunks simultaneously. Faster than sequential, but uses more resources.
Double-pass
Best for: critical data that requires high accuracy. Makes two passes: the first extracts, the second verifies. Highest accuracy, but slower and more expensive.
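The comparison above can be turned into a small selection rule. This is a sketch under stated assumptions: `chooseStrategy` is a hypothetical helper, and the 5-page cutoff is taken from the "1-5 pages" guidance rather than from any limit enforced by the library.

```php
<?php

/**
 * Hypothetical helper: maps document traits to one of the strategy
 * names listed in this guide.
 */
function chooseStrategy(int $pageCount, bool $critical = false, bool $orderMatters = false): string
{
    // Accuracy-critical data takes priority over speed.
    if ($critical) {
        return 'double-pass';
    }

    // Small documents fit in a single LLM call.
    if ($pageCount <= 5) {
        return 'simple';
    }

    // Larger documents are chunked; keep order only when it matters.
    return $orderMatters ? 'sequential' : 'parallel';
}

$strategy = chooseStrategy(40, orderMatters: true);
```

The returned string can be passed to `->strategy()`; append `-auto-merge` to the chunked strategies when combined results are wanted.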
Custom Chunk Size
Control how documents are split:
Magic::extract()
    ->strategy('sequential')
    ->chunkSize(2) // Process 2 pages at a time
    ->file('/path/to/document.pdf')
    ->send();
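To get a feel for how a chunk size translates into the number of chunks, the grouping can be modelled with plain PHP. This only illustrates the arithmetic, assuming pages are grouped in order; how LLM Magic splits documents internally is not described here.

```php
<?php

// A 5-page document, represented by its page numbers.
$pages = range(1, 5);

// Group pages two at a time, mirroring a chunk size of 2.
$chunks = array_chunk($pages, 2);

// The number of chunks is the page count divided by the chunk size, rounded up.
$chunkCount = (int) ceil(count($pages) / 2);
```

With five pages and a chunk size of 2, the last chunk holds a single page, so a chunked strategy would make three passes over the document.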
Context Options
Control what content from documents is included in the extraction:
use Mateffy\Magic\Extraction\ContextOptions;

// Include everything
Magic::extract()
    ->contextOptions(ContextOptions::all())
    ->file('/path/to/document.pdf')
    ->send();

// Custom context options
Magic::extract()
    ->contextOptions(new ContextOptions(
        includeText: true,
        includeEmbeddedImages: true,
        markEmbeddedImages: true,
        includePageImages: false,
        markPageImages: false,
    ))
    ->file('/path/to/document.pdf')
    ->send();

// Default context (text + embedded images)
Magic::extract()
    ->contextOptions(ContextOptions::default())
    ->file('/path/to/document.pdf')
    ->send();
Context Options Explained
includeText: Include text content from the document
includeEmbeddedImages: Include images embedded within the document content
markEmbeddedImages: Add markers in the text where embedded images appear
includePageImages: Include full-page screenshot images
markPageImages: Add markers for page images
Output Instructions
Provide additional guidance to improve extraction accuracy:
Magic::extract()
    ->schema([/* ... */])
    ->outputInstructions(
        'Extract all financial data in USD. ' .
        'Convert dates to YYYY-MM-DD format. ' .
        'If a field is not present, use null.'
    )
    ->file('/path/to/invoice.pdf')
    ->send();
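Output instructions guide the model but do not guarantee the format, so a post-extraction check can act as a safety net. The sketch below normalizes a date string to YYYY-MM-DD; `normalizeDate` and its list of accepted input formats are assumptions for illustration (day-first formats are assumed for ambiguous input), not part of LLM Magic.

```php
<?php

/**
 * Hypothetical helper: returns $value as YYYY-MM-DD, or null when it
 * matches none of the accepted formats.
 */
function normalizeDate(string $value, array $formats = ['Y-m-d', 'd/m/Y', 'd.m.Y']): ?string
{
    foreach ($formats as $format) {
        $date = DateTimeImmutable::createFromFormat($format, $value);

        // The round-trip comparison rejects dates that PHP silently
        // rolls over, such as 31 February.
        if ($date !== false && $date->format($format) === $value) {
            return $date->format('Y-m-d');
        }
    }

    return null;
}

$normalized = normalizeDate('05/03/2024');
$invalid = normalizeDate('31/02/2024');
```

A null return signals a value that should be reviewed or re-extracted rather than stored.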
Model Selection
Choose the AI model for extraction:
Magic::extract()
    ->model('google/gemini-2.0-flash-lite')
    ->schema([/* ... */])
    ->file('/path/to/document.pdf')
    ->send();
Working with Results
The extraction returns a Laravel Collection (Illuminate\Support\Collection) of results:
$results = Magic::extract()
    ->schema([/* ... */])
    ->files([
        '/path/to/doc1.pdf',
        '/path/to/doc2.pdf',
    ])
    ->send();

// Iterate over results
foreach ($results as $result) {
    echo $result['invoice_number'];
}

// Use collection methods
$totalAmount = $results->sum('total_amount');

// Convert to array
$data = $results->toArray();

// Get first result
$firstResult = $results->first();
Callbacks
Token Statistics
Track token usage during extraction:
Magic::extract()
    ->schema([/* ... */])
    ->file('/path/to/document.pdf')
    ->onTokenStats(function ($stats) {
        logger()->info('Token usage', $stats);
    })
    ->send();
Monitor extraction progress:
Magic::extract()
    ->schema([/* ... */])
    ->file('/path/to/document.pdf')
    ->onChunkExtracted(function ($chunk, $result) {
        echo "Processed chunk {$chunk} of document\n";
    })
    ->send();
Complete Example
Here’s a complete example extracting invoice data:
use Mateffy\Magic;
use Mateffy\Magic\Extraction\ContextOptions;

$schema = [
    'type' => 'object',
    'properties' => [
        'invoice_number' => [
            'type' => 'string',
            'description' => 'Invoice reference number',
        ],
        'invoice_date' => [
            'type' => 'string',
            'format' => 'date',
        ],
        'due_date' => [
            'type' => 'string',
            'format' => 'date',
        ],
        'vendor' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'address' => ['type' => 'string'],
                'tax_id' => ['type' => 'string'],
            ],
        ],
        'line_items' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'description' => ['type' => 'string'],
                    'quantity' => ['type' => 'integer'],
                    'unit_price' => ['type' => 'number'],
                    'total' => ['type' => 'number'],
                ],
            ],
        ],
        'subtotal' => ['type' => 'number'],
        'tax' => ['type' => 'number'],
        'total' => ['type' => 'number'],
    ],
    'required' => [
        'invoice_number',
        'invoice_date',
        'vendor',
        'line_items',
        'total',
    ],
];

$results = Magic::extract()
    ->model('google/gemini-2.0-flash-lite')
    ->schema($schema)
    ->strategy('double-pass-auto-merge')
    ->contextOptions(ContextOptions::all())
    ->outputInstructions(
        'Extract all amounts as numbers (no currency symbols). ' .
        'Convert all dates to YYYY-MM-DD format. ' .
        'Ensure line item totals are calculated correctly.'
    )
    ->files([
        storage_path('invoices/invoice1.pdf'),
        storage_path('invoices/invoice2.pdf'),
    ])
    ->onTokenStats(function ($stats) {
        logger()->info('Extraction tokens', $stats);
    })
    ->send();
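
// The dollar sign is escaped so PHP prints it literally instead of
// treating "${...}" as a variable-variable lookup.
foreach ($results as $invoice) {
    echo "Invoice {$invoice['invoice_number']}: \${$invoice['total']}\n";
}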
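The output instructions in the example ask the model to keep line item totals consistent, and that can be double-checked in PHP. The following is a minimal sketch, assuming each invoice is a plain associative array shaped like the schema above; `lineItemDiscrepancies` and the 0.01 tolerance are hypothetical choices for illustration.

```php
<?php

/**
 * Hypothetical helper: reports line items whose total differs from
 * quantity * unit_price by more than $tolerance.
 */
function lineItemDiscrepancies(array $invoice, float $tolerance = 0.01): array
{
    $issues = [];

    foreach ($invoice['line_items'] ?? [] as $index => $item) {
        // Skip items with missing fields; the schema does not require them.
        if (!isset($item['quantity'], $item['unit_price'], $item['total'])) {
            continue;
        }

        $expected = $item['quantity'] * $item['unit_price'];

        if (abs($expected - $item['total']) > $tolerance) {
            $issues[] = sprintf(
                'line_items[%d]: expected %.2f, got %.2f',
                $index,
                $expected,
                $item['total']
            );
        }
    }

    return $issues;
}

// Sample data standing in for one extracted invoice.
$invoice = [
    'invoice_number' => 'INV-001',
    'line_items' => [
        ['description' => 'Widget', 'quantity' => 2, 'unit_price' => 9.99, 'total' => 19.98],
        ['description' => 'Gadget', 'quantity' => 1, 'unit_price' => 5.00, 'total' => 6.00],
    ],
];

$issues = lineItemDiscrepancies($invoice);
```

An empty result means every complete line item is internally consistent; any entries point at rows worth reviewing by hand.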
Supported File Types
PDF: Portable Document Format files
Images: JPG, PNG, GIF, WebP
Best Practices
Start with Simple Strategy
Begin with the simple strategy and upgrade to more complex strategies only if needed.
Provide Clear Schemas
Use descriptive property names and include description fields to guide extraction.
Test with Sample Documents
Always test your schema with sample documents before processing in production.
Use Output Instructions
Provide specific instructions about formats, units, and handling missing data.
Choose the Right Strategy
Use double-pass for critical data, parallel for speed, simple for small documents.
Next Steps
Chat API: Build conversational AI with multi-turn conversations
Embeddings: Generate embeddings for semantic search
Streaming: Learn about real-time streaming responses
API Reference: Explore the complete API documentation