Overview
Batch processing lets you process hundreds of documents in a single job, with real-time progress tracking over a WebSocket connection. Upload a CSV or Excel file with document metadata, validate paths, and watch each document update its status live as the server works through the batch.
Because updates arrive over WebSocket, you can monitor progress, stop processing, and export results without waiting for the entire batch to complete.
Workflow
Upload CSV or Excel File
Upload a file containing your document metadata.
Supported Formats
- .csv (CSV files)
- .xlsx (Excel 2007+)
- .xls (Legacy Excel)
After upload, the system automatically:
- Parses CSV with PapaParse
- Reads Excel with SheetJS (XLSX library)
- Detects column headers
- Maps system fields (title, file_path, file_source_type, description)
- Generates unique row IDs
Excel Tip: Paste URLs as plain text. Excel hyperlinks (blue underlined links) will not work as expected.
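The post-upload steps can be sketched in a few lines. The page names PapaParse and SheetJS for the real parsing; the Python standard library stands in for them here, and the function name, the row shape, and the UUID-based row IDs are all assumptions made for illustration. Field mapping is left out of this sketch.

```python
import csv
import io
import uuid


def parse_upload(csv_text: str) -> dict:
    """Parse CSV text into headers plus rows, tagging each row with a unique ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = list(reader.fieldnames or [])  # first line is treated as the header row
    rows = []
    for record in reader:
        rows.append({
            "row_id": uuid.uuid4().hex,  # hypothetical ID scheme
            "status": "pending",         # every row starts out pending
            "data": dict(record),
        })
    return {"headers": headers, "rows": rows}
```

Calling parse_upload on the text of an uploaded file returns the detected headers and one record per data row, ready to show in the editor.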
Review & Edit Spreadsheet
An interactive spreadsheet editor displays your data.
Features
- Inline editing: Click any cell to edit
- Column management: Add, remove, rename, reorder columns
- Row actions: Add new rows, delete rows, bulk update
- Column mapping: Map your columns to required system fields
- Visual status: Color-coded row status (pending, processing, success, failed)
Run Pre-flight Path Validation
Before processing, validate all file paths.
What It Checks
- URL paths: HTTP HEAD request (falls back to GET if HEAD not allowed)
- S3 paths: AWS SDK head_object() check (requires S3 config)
- Local paths: File system existence check
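The three checks above might look roughly like the following. This is a sketch under assumptions, not the service's actual validator: the function names are invented, treating only HTTP 405 as "HEAD not allowed" is a guess, and the S3 branch assumes boto3 is installed with credentials configured.

```python
import os
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 10.0) -> bool:
    """HEAD the URL; retry with GET if the server answers 405 Method Not Allowed."""
    for method in ("HEAD", "GET"):
        request = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return 200 <= response.status < 400
        except urllib.error.HTTPError as exc:
            if method == "HEAD" and exc.code == 405:
                continue  # HEAD not allowed, fall back to GET
            return False
        except (urllib.error.URLError, ValueError, OSError):
            return False
    return False


def check_s3(path: str) -> bool:
    """Check an s3://bucket/key path with head_object (needs boto3 and credentials)."""
    import boto3  # imported lazily so URL and local checks work without it
    from botocore.exceptions import BotoCoreError, ClientError

    bucket, _, key = path[len("s3://"):].partition("/")
    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        return True
    except (BotoCoreError, ClientError):
        return False


def validate_path(path: str, source_type: str) -> bool:
    """Dispatch on the row's file_source_type: url, s3, or local."""
    if source_type == "url":
        return check_url(path)
    if source_type == "s3":
        return check_s3(path)
    if source_type == "local":
        return os.path.isfile(path)
    return False  # unknown source types are reported as invalid
```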
Configure Processing Settings
Set global processing options:
- OpenRouter API Key (required)
- Model Name (e.g., openai/gpt-4o-mini, google/gemini-flash-1.5)
- Number of Pages: Pages to extract per document (default: 3)
- Number of Tags: Target tags per document (default: 8)
- Exclusion Words: Optional array of terms to filter
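As a rough picture of these settings, the options could be held in a small config object. The class name and every field name except num_pages are assumptions; the defaults of 3 pages and 8 tags come from the list above, and the checks in problems() are illustrative rather than the application's real validation.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingConfig:
    api_key: str     # OpenRouter API key (required)
    model_name: str  # e.g. "openai/gpt-4o-mini"
    num_pages: int = 3  # pages to extract per document
    num_tags: int = 8   # target tags per document
    exclusion_words: list[str] = field(default_factory=list)  # optional filter terms

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the config looks usable."""
        issues = []
        if not self.api_key.strip():
            issues.append("api_key is required")
        if self.num_pages < 1:
            issues.append("num_pages must be at least 1")
        if self.num_tags < 1:
            issues.append("num_tags must be at least 1")
        return issues
```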
Start Real-Time Processing
Click Start Processing to begin:
- System generates a unique job_id
- WebSocket connection opens: wss://api.example.com/api/batch/ws/{job_id}
- Client sends batch start request with documents and config
- Server processes documents sequentially
- Each document sends progress update via WebSocket
- Client updates UI in real-time
- Completion message sent when all documents finish
- Click Stop Processing to gracefully halt
- Already-processed documents are preserved
- Partial results can be exported
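The sequence above can be modeled without any networking as a generator that yields the messages a server would push over the socket. The message field names ("type", "processed", "stopped", and so on), the process callback, and the should_stop hook are hypothetical; the sketch shows only the ordering (started, one progress message per document, then completion) and the stop behavior.

```python
import uuid
from typing import Callable, Iterable, Iterator


def run_batch(
    documents: Iterable[dict],
    process: Callable[[dict], dict],
    should_stop: Callable[[], bool] = lambda: False,
) -> Iterator[dict]:
    """Yield the messages a server might push for one batch job."""
    docs = list(documents)
    job_id = uuid.uuid4().hex
    yield {"type": "started", "job_id": job_id, "total": len(docs)}

    done = 0
    for doc in docs:  # documents are handled one at a time, in order
        if should_stop():
            break  # results already yielded stay with the client
        try:
            result = {"status": "success", **process(doc)}
        except Exception as exc:  # one bad document should not end the batch
            result = {"status": "failed", "error": str(exc)}
        done += 1
        yield {"type": "progress", "job_id": job_id, "row_id": doc["row_id"],
               "processed": done, "total": len(docs), **result}

    yield {"type": "completed", "job_id": job_id, "processed": done,
           "stopped": done < len(docs)}
```

In a real deployment each yielded dictionary would be serialized to JSON and sent on the job's WebSocket.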
Export Results
Export processed data in multiple formats:
- CSV: Comma-separated values
- Excel: .xlsx with formatted columns
- JSON: Machine-readable format
- Select which columns to include
- Rename columns for export
- Save export presets for reuse
- Filter rows before export (e.g., only successful)
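A minimal sketch of these export options, covering column selection, renaming, and status filtering for the CSV and JSON formats. The function and parameter names are assumptions, joining list values such as tags with "; " in CSV output is an arbitrary choice, and Excel output is omitted because it needs a third-party library.

```python
import csv
import io
import json


def export_rows(rows, columns, rename=None, only_status=None, fmt="csv"):
    """Export selected columns, optionally renamed and filtered by row status."""
    rename = rename or {}
    selected = [r for r in rows if only_status is None or r.get("status") == only_status]
    records = [{rename.get(c, c): r.get(c, "") for c in columns} for r in selected]

    if fmt == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=[rename.get(c, c) for c in columns],
            quoting=csv.QUOTE_ALL,  # quote every field
            lineterminator="\n",
        )
        writer.writeheader()
        for record in records:
            # Flatten list values such as tags for the CSV form.
            writer.writerow({k: "; ".join(v) if isinstance(v, list) else v
                             for k, v in record.items()})
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```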
Real-Time WebSocket Updates
Connection Flow
Message Types
- Started Message: Confirms job initialization
- Progress Update
- Completion Message
- Error Message
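The exact payloads for these four message types are not reproduced on this page, so the following is only a guess at how a client could fold them into its local state. Every field name ("type", "row_id", "processed", "error", and so on) is hypothetical, and the real client keeps this state in its own store rather than in Python.

```python
def apply_message(state: dict, message: dict) -> dict:
    """Return a new state with one server message folded in."""
    kind = message.get("type")
    rows = dict(state.get("rows", {}))  # copy so the previous state is not mutated
    if kind == "started":
        return {**state, "job_id": message["job_id"], "total": message["total"],
                "processed": 0, "running": True, "rows": rows}
    if kind == "progress":
        rows[message["row_id"]] = {
            "status": message["status"],
            "tags": message.get("tags", []),
            "error": message.get("error"),
        }
        return {**state, "processed": message["processed"], "rows": rows}
    if kind == "completed":
        return {**state, "running": False}
    if kind == "error":
        return {**state, "running": False, "last_error": message.get("error")}
    return state  # ignore message types this sketch does not know
```

Returning a fresh dictionary for each message keeps every update easy to inspect and lets the UI re-render from a single value.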
Client-Side State Management
The batch store (Zustand) manages WebSocket state.
Interactive Spreadsheet Editor
Column Management
Add Column
Add custom columns for additional metadata. New columns appear in the editor and can be included in exports.
Remove Column
Delete columns you don’t need. System columns (file_path, title) cannot be removed.
Rename Column
Change display names. The original CSV column names are preserved for reference.
Reorder Columns
Drag columns to reorder. Position updates are instant.
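The four column operations can be illustrated with a small data model. The Column class, its fields, and the function names are assumptions for the sake of the example; the behavior shown (system columns protected from removal, renames keeping the original header) follows the descriptions above.

```python
from dataclasses import dataclass

PROTECTED = {"file_path", "title"}  # system columns that cannot be removed


@dataclass
class Column:
    id: str             # stable key used in row data
    name: str           # display name shown in the editor
    original_name: str  # header as it appeared in the uploaded file


def add_column(columns, name):
    return columns + [Column(id=name, name=name, original_name=name)]


def remove_column(columns, column_id):
    if column_id in PROTECTED:
        raise ValueError(f"{column_id} is a system column and cannot be removed")
    return [c for c in columns if c.id != column_id]


def rename_column(columns, column_id, new_name):
    # Only the display name changes; original_name is kept for reference.
    return [Column(c.id, new_name, c.original_name) if c.id == column_id else c
            for c in columns]


def move_column(columns, column_id, new_index):
    moving = next(c for c in columns if c.id == column_id)
    rest = [c for c in columns if c.id != column_id]
    return rest[:new_index] + [moving] + rest[new_index:]
```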
Row Operations
Update Single Cell
Column Mapping Panel
Maps your CSV columns to system fields.
Column Mapping Details
System Fields
- title → Document title
- file_path → URL or file location (required)
- file_source_type → Source type: url, s3, local
- description → Document description
Manual Mapping
Use dropdowns to map columns if auto-detection fails.
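How auto-detection decides on a mapping is not described here. One plausible approach, sketched below, is to normalize each header and compare it against a list of known aliases. The alias sets and function names are invented for illustration and are not the application's actual rules.

```python
import re

# Hypothetical aliases; the real auto-detection rules are not documented here.
ALIASES = {
    "title": {"title", "name", "document_title"},
    "file_path": {"file_path", "path", "url", "link"},
    "file_source_type": {"file_source_type", "source_type", "source"},
    "description": {"description", "summary", "abstract"},
}


def normalize(header: str) -> str:
    """Lowercase the header and collapse punctuation and spaces into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")


def auto_map(headers):
    """Map each system field to the first header whose normalized form matches."""
    mapping = {}
    for field_name, names in ALIASES.items():
        for header in headers:
            if normalize(header) in names:
                mapping[field_name] = header
                break
    return mapping


def missing_required(mapping):
    """file_path is the only required system field."""
    return [f for f in ("file_path",) if f not in mapping]
```

Any field left out of the result is one the user would need to map by hand.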
Processing Controls
Pre-flight Validation
Validate Paths Button
- Checks all file_path values before processing
- Displays summary: X valid, Y invalid
- Highlights invalid paths in red in spreadsheet
- Disabled during processing
Start/Stop Processing
Start Processing Button
- Validates API key before starting
- Confirms required columns are mapped
- Opens WebSocket connection
- Sends initial batch request
- Shows processing stats in real-time
Stop Processing Button
- Gracefully closes WebSocket
- Preserves partial results
- Allows resuming later (manual CSV re-upload)
Processing Stats Display
CSV Template
Download Template
Endpoint: GET /api/batch/template
Returns a sample CSV with:
- Column headers with descriptions
- Example rows showing correct format
- Comments explaining each field
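A client could fetch the template and read its column headers along the following lines. The base URL is whatever host serves the API, the function name is made up, and treating lines that start with "#" as comments is an assumption, since the comment syntax used in the template is not stated here.

```python
import csv
import urllib.request


def template_headers(base_url: str, opener=urllib.request.urlopen) -> list[str]:
    """Download the template CSV and return its column headers."""
    url = base_url.rstrip("/") + "/api/batch/template"
    with opener(url) as response:
        text = response.read().decode("utf-8")
    # Assumption: comment lines begin with "#"; skip them before parsing.
    lines = [line for line in text.splitlines() if not line.startswith("#")]
    return next(csv.reader(lines), [])
```

The opener parameter exists only so the function can be exercised without a live server.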
Column Descriptions
- file_path (Required)
- title (Optional)
- file_source_type (Optional)
- description (Optional)
- publishing_date (Optional)
- file_size (Optional)
Description: Path or URL to the PDF file
Examples:
- https://example.com/document.pdf
- https://d1581jr3fp95xu.cloudfront.net/path/to/file.pdf
- s3://bucket-name/folder/file.pdf
- /local/path/to/file.pdf
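Since file_source_type is optional, a value can often be inferred from the path itself. The helper below is a sketch of that idea and not necessarily what the service does; it classifies the example paths above by URL scheme and treats everything else as local.

```python
from urllib.parse import urlparse


def infer_source_type(file_path: str) -> str:
    """Guess url, s3, or local from the scheme at the start of the path."""
    scheme = urlparse(file_path.strip()).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "s3":
        return "s3"
    return "local"  # no recognized scheme: assume a file system path
```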
Export Options
Export Formats
CSV Export
Comma-separated values:
- Standard CSV format
- UTF-8 encoding
- Quoted fields
- Compatible with Excel, Google Sheets
Excel Export
Excel workbook (.xlsx):
- Formatted columns
- Auto-sized widths
- Header row styling
- Opens directly in Excel
JSON Export
Machine-readable JSON:
- Array of objects
- Nested metadata
- Tags as arrays
- Perfect for APIs
Export Presets
Save common export configurations:
- ElasticSearch Format: Export only title, tags, file_path with custom names
- Full Data: Export all columns including metadata
- Summary: Export only title, tags, status
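A preset is essentially a saved column list plus a rename table. The sketch below models the three presets listed above; the class, the preset keys, and in particular the custom names used for the ElasticSearch preset ("keywords", "url") are placeholders, because the actual custom names are not given.

```python
from dataclasses import dataclass, field


@dataclass
class ExportPreset:
    name: str
    columns: tuple = ()  # empty tuple means "all columns"
    rename: dict = field(default_factory=dict)

    def apply(self, row: dict) -> dict:
        """Project one row onto the preset's columns, applying any renames."""
        keys = self.columns or tuple(row)
        return {self.rename.get(k, k): row.get(k) for k in keys}


PRESETS = {
    "elasticsearch": ExportPreset(
        "ElasticSearch Format",
        ("title", "tags", "file_path"),
        {"tags": "keywords", "file_path": "url"},  # placeholder custom names
    ),
    "full": ExportPreset("Full Data"),
    "summary": ExportPreset("Summary", ("title", "tags", "status")),
}
```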
Error Handling
Error Types
Rate Limit Errors
Detection:
Handling:
- Document marked as failed
- Error message stored in error field
- Processing continues to next document
- User can retry failed documents later
Solutions:
- Reduce processing speed
- Use different API key with higher rate limits
- Upgrade OpenRouter plan
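Retrying failed documents at a reduced pace could be sketched as follows. The row shape, the process callback, and the fixed one-second pause are assumptions; the sketch simply re-runs rows whose status is failed and waits between calls.

```python
import time


def retry_failed(rows, process, delay_seconds=1.0, sleep=time.sleep):
    """Re-run rows marked failed, pausing between calls to stay under rate limits."""
    retried = 0
    for row in rows:
        if row.get("status") != "failed":
            continue
        if retried:
            sleep(delay_seconds)  # fixed pause; the service's real limits are unknown
        retried += 1
        try:
            row.update(process(row), status="success", error=None)
        except Exception as exc:
            row["error"] = str(exc)  # still failed; keep the latest message
    return retried
```

The return value is the number of rows that were attempted again, whether or not they succeeded.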
Model Errors
Detection:
Common Causes:
- Invalid model name
- Model doesn’t support requested parameters
- Input too long for model context window
Solutions:
- Verify model name spelling
- Try different model (e.g., switch to Gemini from GPT)
- Reduce num_pages to shorten input
Network Errors
Detection:
Handling:
- WebSocket reconnection attempted
- Document retry logic
- Timeout warnings
Solutions:
- Check network connectivity
- Verify firewall allows WebSocket connections
- Reduce document size
Path Validation Errors
Detection: Pre-flight validation marks paths as invalid
Common Issues:
- URL returns 404 (file not found)
- URL requires authentication
- S3 object doesn’t exist
- Local file missing
Solutions:
- Fix URLs in spreadsheet before processing
- Ensure S3 credentials configured (for S3 paths)
- Verify file permissions (for local paths)
Performance Optimization
Best Practices
Prepare Your CSV
- Use template as starting point
- Ensure file_path column has valid URLs
- Add descriptive titles for better tagging
- Include descriptions for context
Validate Before Processing
- Always run pre-flight validation
- Fix invalid paths before starting
- Review column mapping
Monitor Progress
- Watch real-time updates for errors
- Stop processing if many documents fail
- Check error messages for patterns