
Overview

Batch processing lets you process hundreds of documents in a single run, with real-time progress tracking via a WebSocket connection. Upload a CSV or Excel file with document metadata, validate paths, and watch each document's status update live.
Because updates arrive over WebSocket, you can monitor progress, stop processing, and export results without waiting for the entire batch to complete.

Workflow

1

Upload CSV or Excel File

Upload a file containing your document metadata.

Supported Formats
  • .csv (CSV files)
  • .xlsx (Excel 2007+)
  • .xls (Legacy Excel)
Drag & Drop
// Drop zone accepts multiple formats
<input type="file" accept=".csv,.xlsx,.xls" />
After upload, the system automatically:
  • Parses CSV with PapaParse
  • Reads Excel with SheetJS (XLSX library)
  • Detects column headers
  • Maps system fields (title, file_path, file_source_type, description)
  • Generates unique row IDs
Excel Tip: Paste URLs as plain text. Excel hyperlinks (blue underlined links) will not work as expected.
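The post-upload steps above can be sketched as a small parser. This is an illustrative simplification, not the actual PapaParse/SheetJS pipeline: it handles only plain, unquoted CSV, detects headers from the first line, and generates unique row IDs.

```typescript
// Simplified sketch of the post-upload step: split a small CSV,
// detect column headers, and assign a unique ID to each row.
// The real pipeline uses PapaParse (CSV) and SheetJS (Excel).
interface ParsedRow {
  id: string;
  data: Record<string, string>;
}

function parseCsv(text: string): { headers: string[]; rows: ParsedRow[] } {
  const lines = text.trim().split("\n");
  // First line is treated as the header row
  const headers = lines[0].split(",").map((h) => h.trim());
  const rows = lines.slice(1).map((line, i) => {
    const cells = line.split(",");
    const data: Record<string, string> = {};
    headers.forEach((h, j) => (data[h] = (cells[j] ?? "").trim()));
    // Generate a unique, stable row ID
    return { id: `row-${i + 1}`, data };
  });
  return { headers, rows };
}

const sample = "title,file_path\nDoc A,https://example.com/a.pdf";
const { headers, rows } = parseCsv(sample);
```

A production parser must also handle quoted fields, embedded commas, and BOMs, which is exactly why the system delegates to PapaParse rather than splitting strings by hand.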
2

Review & Edit Spreadsheet

An interactive spreadsheet editor displays your data.

Features
  • Inline editing: Click any cell to edit
  • Column management: Add, remove, rename, reorder columns
  • Row actions: Add new rows, delete rows, bulk update
  • Column mapping: Map your columns to required system fields
  • Visual status: Color-coded row status (pending, processing, success, failed)
Required Fields

Only file_path is required; title, file_source_type, and description are optional system fields.
3

Run Pre-flight Path Validation

Before processing, validate all file paths:
POST /api/batch/validate-paths
What It Checks
  • URL paths: HTTP HEAD request (falls back to GET if HEAD not allowed)
  • S3 paths: AWS SDK head_object() check (requires S3 config)
  • Local paths: File system existence check
Validation Results
{
  "results": [
    {
      "path": "https://example.com/doc.pdf",
      "valid": true,
      "error": null,
      "content_type": "application/pdf",
      "size": 1024000
    },
    {
      "path": "https://example.com/missing.pdf",
      "valid": false,
      "error": "404 Not Found",
      "content_type": null,
      "size": null
    }
  ],
  "total": 2,
  "valid_count": 1,
  "invalid_count": 1
}
Invalid paths are highlighted in red in the spreadsheet. Fix them before starting processing to avoid failures.
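Validation dispatches on the path's source type (URL, S3, or local, as listed above). A minimal sketch of that classification and the result shape, mirroring the JSON response:

```typescript
// Sketch: infer a path's source type so validation can dispatch to
// the right check (URL → HTTP HEAD, s3:// → head_object, else local
// file existence). Illustrative only; not the server implementation.
type SourceType = "url" | "s3" | "local";

interface ValidationResult {
  path: string;
  valid: boolean;
  error: string | null;
  content_type: string | null;
  size: number | null;
}

function detectSourceType(path: string): SourceType {
  if (/^https?:\/\//i.test(path)) return "url";
  if (path.startsWith("s3://")) return "s3";
  return "local";
}
```

The actual URL check issues an HTTP HEAD request and falls back to GET when HEAD is not allowed, so a classifier like this is only the first step.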
4

Configure Processing Settings

Set global processing options:
  • OpenRouter API Key (required)
  • Model Name (e.g., openai/gpt-4o-mini, google/gemini-flash-1.5)
  • Number of Pages: Pages to extract per document (default: 3)
  • Number of Tags: Target tags per document (default: 8)
  • Exclusion Words: Optional array of terms to filter
5

Start Real-Time Processing

Click Start Processing to begin:
  1. System generates a unique job_id
  2. WebSocket connection opens: wss://api.example.com/api/batch/ws/{job_id}
  3. Client sends batch start request with documents and config
  4. Server processes documents sequentially
  5. Each document sends progress update via WebSocket
  6. Client updates UI in real-time
  7. Completion message sent when all documents finish
Stop Anytime
  • Click Stop Processing to gracefully halt
  • Already-processed documents are preserved
  • Partial results can be exported
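Steps 1-3 of the flow above can be sketched on the client. The endpoint pattern comes from the documented URL; the shape of the initial batch request (`documents`, `config`) is an assumption for illustration:

```typescript
// Sketch of the client's connection setup: build the WebSocket URL
// from a generated job_id, following the documented endpoint pattern
// wss://.../api/batch/ws/{job_id}.
function buildWsUrl(baseUrl: string, jobId: string): string {
  return `${baseUrl}/api/batch/ws/${jobId}`;
}

// In the browser, the client would then open the socket and send the
// initial batch request (payload shape is an assumption):
//
// const ws = new WebSocket(buildWsUrl("wss://api.example.com", jobId));
// ws.onopen = () => ws.send(JSON.stringify({ documents, config }));
```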
6

Export Results

Export processed data in multiple formats:
  • CSV: Comma-separated values
  • Excel: .xlsx with formatted columns
  • JSON: Machine-readable format
Export Options
  • Select which columns to include
  • Rename columns for export
  • Save export presets for reuse
  • Filter rows before export (e.g., only successful)

Real-Time WebSocket Updates

Connection Flow

Message Types

started — confirms job initialization:
{
  "type": "started",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "total_documents": 10,
  "message": "Batch processing started"
}
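The messages can be modeled as a discriminated union on `type`. The `started` shape matches the example above and the `progress` fields are taken from the client update handler; the `completed` shape is an assumption based on the completion message in the workflow:

```typescript
// Discriminated union over the WebSocket message types. "started" and
// "progress" fields follow the documented examples; "completed" is an
// assumed shape for the end-of-batch message.
type BatchMessage =
  | { type: "started"; job_id: string; total_documents: number; message: string }
  | { type: "progress"; row_id: string; status: string; tags?: string[]; metadata?: Record<string, unknown> }
  | { type: "completed"; job_id: string; message: string };

// Type guard narrows a message to the "progress" variant
function isProgress(msg: BatchMessage): msg is Extract<BatchMessage, { type: "progress" }> {
  return msg.type === "progress";
}
```

With the union in place, a `switch` on `msg.type` gives exhaustive, type-safe handling in the update handler.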

Client-Side State Management

The batch store (Zustand) manages WebSocket state:
// WebSocket connection
websocket: WebSocket | null

// Update handler receives progress messages
updateProgress: (update: any) => {
  if (update.type === 'progress') {
    // Find the document by row_id and replace it immutably so
    // subscribers re-render (direct mutation would not notify them)
    set((state) => ({
      documents: state.documents.map((d) =>
        d.id === update.row_id
          ? { ...d, status: update.status, tags: update.tags, metadata: update.metadata }
          : d
      ),
    }))
  }
}

Interactive Spreadsheet Editor

Column Management

Add Column

Add custom columns for additional metadata:
addColumn('custom_field', afterColumnId)
New columns appear in the editor and can be included in exports.

Remove Column

Delete columns you don’t need:
removeColumn(columnId)
System columns (file_path, title) cannot be removed.

Rename Column

Change display names:
renameColumn(columnId, 'New Name')
Preserves original CSV column names for reference.

Reorder Columns

Drag columns to reorder:
reorderColumns([colId1, colId3, colId2])
Position updates are instant.
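The column operations above can be sketched as pure functions over a column list; the store's actual signatures may differ:

```typescript
// Pure-function sketch of rename and reorder over column state.
// Illustrative shapes, not the editor's exact API.
interface Column {
  id: string;
  name: string;
}

// Change a column's display name without mutating the input array
function renameColumn(cols: Column[], id: string, name: string): Column[] {
  return cols.map((c) => (c.id === id ? { ...c, name } : c));
}

// Rebuild the column list in the order given by an array of IDs
function reorderColumns(cols: Column[], order: string[]): Column[] {
  return order
    .map((id) => cols.find((c) => c.id === id))
    .filter((c): c is Column => c !== undefined);
}
```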

Row Operations

Update Single Cell
// Click to edit any cell
updateRowData(rowId, columnId, newValue)
Bulk Update
// Update multiple rows at once
bulkUpdateRows(rowIds, 'status', 'pending')
Add New Row
// Append blank row to bottom
addRow()
Delete Rows
// Remove selected rows
deleteRows([rowId1, rowId2, rowId3])
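The single-cell and bulk updates above can likewise be sketched as immutable operations; these shapes are illustrative, not the store's exact API:

```typescript
// Pure-function sketch of the row update operations. Bulk update is
// expressed as repeated single-cell updates.
interface Row {
  id: string;
  data: Record<string, string>;
}

// Set one cell on one row, returning a new array
function updateRowData(rows: Row[], rowId: string, columnId: string, value: string): Row[] {
  return rows.map((r) =>
    r.id === rowId ? { ...r, data: { ...r.data, [columnId]: value } } : r
  );
}

// Apply the same value to one column across many rows
function bulkUpdateRows(rows: Row[], rowIds: string[], columnId: string, value: string): Row[] {
  return rowIds.reduce((acc, id) => updateRowData(acc, id, columnId, value), rows);
}
```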

Column Mapping Panel

Maps your CSV columns to system fields:
System Fields
  • title → Document title
  • file_path → URL or file location (required)
  • file_source_type → Source type: url, s3, local
  • description → Document description
Auto-Detection

The system attempts fuzzy matching:
// Example: "Document Name" maps to "title"
// Example: "PDF Link" maps to "file_path"
// Example: "Type" maps to "file_source_type"

Manual Mapping

Use dropdowns to map columns if auto-detection fails.
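Auto-detection can be sketched as header normalization plus a synonym lookup. The synonym sets below are illustrative assumptions, not the system's actual dictionary:

```typescript
// Sketch of fuzzy header matching: lowercase the header, collapse
// separators, then look it up in per-field synonym lists.
const FIELD_SYNONYMS: Record<string, string[]> = {
  title: ["title", "name", "document name"],
  file_path: ["file path", "path", "url", "link", "pdf link"],
  file_source_type: ["source type", "type", "source"],
  description: ["description", "summary", "notes"],
};

// Returns the matched system field, or null if nothing matches
function autoDetectField(header: string): string | null {
  const normalized = header.toLowerCase().replace(/[_-]+/g, " ").trim();
  for (const [field, synonyms] of Object.entries(FIELD_SYNONYMS)) {
    if (synonyms.includes(normalized)) return field;
  }
  return null;
}
```

Anything that returns null falls back to the manual mapping dropdowns.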

Processing Controls

Pre-flight Validation

Validate Paths Button
  • Checks all file_path values before processing
  • Displays summary: X valid, Y invalid
  • Highlights invalid paths in red in spreadsheet
  • Disabled during processing
Validation Stats
<div className="flex gap-4">
  <span className="text-green-600">{validCount} valid</span>
  <span className="text-red-600">⚠️ {invalidCount} invalid</span>
</div>

Start/Stop Processing

Start Processing Button
  • Validates API key before starting
  • Confirms required columns are mapped
  • Opens WebSocket connection
  • Sends initial batch request
  • Shows processing stats in real-time
Stop Processing Button
  • Gracefully closes WebSocket
  • Preserves partial results
  • Allows resuming later (manual CSV re-upload)

Processing Stats Display

<div className="grid grid-cols-3 gap-3">
  <div className="p-3 bg-gray-50 rounded-lg">
    <p className="text-2xl font-bold">{documents.length}</p>
    <p className="text-xs text-gray-500">Total</p>
  </div>
  <div className="p-3 bg-green-50 rounded-lg">
    <p className="text-2xl font-bold text-green-700">{successCount}</p>
    <p className="text-xs text-green-600">Success</p>
  </div>
  <div className="p-3 bg-red-50 rounded-lg">
    <p className="text-2xl font-bold text-red-700">{failedCount}</p>
    <p className="text-xs text-red-600">Failed</p>
  </div>
</div>

CSV Template

Download Template

Endpoint: GET /api/batch/template

Returns a sample CSV with:
  • Column headers with descriptions
  • Example rows showing correct format
  • Comments explaining each field
Example Template
title,description,file_source_type,file_path,publishing_date,file_size
"Training Manual","PMSPECIAL training document",url,https://example.com/doc1.pdf,2025-01-15,1.2MB
"Policy Guidelines","Government policy document",url,https://example.com/doc2.pdf,2025-01-20,850KB
"Annual Report","2024 annual report",s3,s3://my-bucket/reports/2024.pdf,2024-12-31,3.5MB

Column Descriptions

file_path

Description: Path or URL to the PDF file.

Examples:
  • https://example.com/document.pdf
  • https://d1581jr3fp95xu.cloudfront.net/path/to/file.pdf
  • s3://bucket-name/folder/file.pdf
  • /local/path/to/file.pdf
Validation: Must be a non-empty, accessible URL or a valid path

Export Options

Export Formats

CSV Export

Comma-separated values:
  • Standard CSV format
  • UTF-8 encoding
  • Quoted fields
  • Compatible with Excel, Google Sheets

Excel Export

Excel workbook (.xlsx):
  • Formatted columns
  • Auto-sized widths
  • Header row styling
  • Opens directly in Excel

JSON Export

Machine-readable JSON:
  • Array of objects
  • Nested metadata
  • Tags as arrays
  • Perfect for APIs

Export Presets

Save common export configurations:
interface ExportPreset {
  id: string
  name: string
  selectedColumns: string[]           // Column IDs to include
  columnRenames: Record<string, string>  // Column ID → export name
  format: 'csv' | 'excel' | 'json'
}
Use Cases
  • Elasticsearch Format: Export only title, tags, file_path with custom names
  • Full Data: Export all columns including metadata
  • Summary: Export only title, tags, status
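Applying a preset amounts to projecting each row onto the selected columns and renaming keys. A sketch using the `ExportPreset` shape above:

```typescript
// Sketch: apply an export preset to row data by keeping only the
// selected columns and applying any renames. Shapes follow the
// ExportPreset interface in the docs.
interface ExportPreset {
  id: string;
  name: string;
  selectedColumns: string[];
  columnRenames: Record<string, string>;
  format: "csv" | "excel" | "json";
}

function applyPreset(
  rows: Record<string, unknown>[],
  preset: ExportPreset
): Record<string, unknown>[] {
  return rows.map((row) => {
    const out: Record<string, unknown> = {};
    for (const col of preset.selectedColumns) {
      // Use the export name if a rename exists, else the original ID
      out[preset.columnRenames[col] ?? col] = row[col];
    }
    return out;
  });
}
```

The projected rows can then be serialized to CSV, Excel, or JSON according to `preset.format`.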

Error Handling

Error Types

Rate Limit Errors (429)

Detection:
if (msg.includes('429') || msg.includes('rate limit')) {
  errorType = 'rate-limit'
}
Handling:
  • Document marked as failed
  • Error message stored in error field
  • Processing continues to next document
  • User can retry failed documents later
Solutions:
  • Reduce processing speed
  • Use different API key with higher rate limits
  • Upgrade OpenRouter plan
Model Errors (400)

Detection:
if (msg.includes('400') || msg.includes('bad request')) {
  errorType = 'model-error'
}
Common Causes:
  • Invalid model name
  • Model doesn’t support requested parameters
  • Input too long for model context window
Solutions:
  • Verify model name spelling
  • Try different model (e.g., switch to Gemini from GPT)
  • Reduce num_pages to shorten input
Network Errors

Detection:
if (msg.includes('connection') || msg.includes('timeout')) {
  errorType = 'network'
}
Handling:
  • WebSocket reconnection attempted
  • Document retry logic
  • Timeout warnings
Solutions:
  • Check network connectivity
  • Verify firewall allows WebSocket connections
  • Reduce document size
Invalid Path Errors

Detection: Pre-flight validation marks paths as invalid.

Common Issues:
  • URL returns 404 (file not found)
  • URL requires authentication
  • S3 object doesn’t exist
  • Local file missing
Solutions:
  • Fix URLs in spreadsheet before processing
  • Ensure S3 credentials configured (for S3 paths)
  • Verify file permissions (for local paths)
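The three message-based detection snippets above can be combined into one classifier (invalid paths are caught by pre-flight validation rather than message inspection, so that case is omitted here):

```typescript
// Combined error classifier built from the detection rules above.
// Lowercasing first makes the substring checks case-insensitive.
type ErrorType = "rate-limit" | "model-error" | "network" | "unknown";

function classifyError(message: string): ErrorType {
  const msg = message.toLowerCase();
  if (msg.includes("429") || msg.includes("rate limit")) return "rate-limit";
  if (msg.includes("400") || msg.includes("bad request")) return "model-error";
  if (msg.includes("connection") || msg.includes("timeout")) return "network";
  return "unknown";
}
```

The classified type drives the handling described above: mark the document failed, store the message in its error field, and continue to the next document.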

Performance Optimization

Process faster:
  • Use google/gemini-flash-1.5 (fastest model)
  • Reduce num_pages to 1-2
  • Process text-based PDFs (avoid scanned documents)
  • Use URL paths (faster than local file reads)
Handle large batches:
  • Start with small test batch (10-20 documents)
  • Monitor for rate limits
  • Use pre-flight validation to catch errors early
  • Export partial results periodically
Avoid common pitfalls:
  • Don’t include Excel hyperlinks (paste as plain text)
  • Don’t mix file types in file_path column
  • Don’t use relative paths (use absolute URLs)
  • Don’t exceed API rate limits (add delays if needed)

Best Practices

1

Prepare Your CSV

  • Use template as starting point
  • Ensure file_path column has valid URLs
  • Add descriptive titles for better tagging
  • Include descriptions for context
2

Validate Before Processing

  • Always run pre-flight validation
  • Fix invalid paths before starting
  • Review column mapping
3

Monitor Progress

  • Watch real-time updates for errors
  • Stop processing if many documents fail
  • Check error messages for patterns
4

Export Results

  • Export partial results during long batches
  • Save export presets for reuse
  • Choose appropriate format for your use case
