POST /api/batch/process

Process Batch (Legacy)
Legacy Endpoint: This is the legacy synchronous batch processing endpoint. For real-time progress updates and better user experience, use the WebSocket endpoint at /api/batch/ws/{job_id} instead.

Overview

Processes a CSV file containing multiple documents synchronously. The request blocks until all documents are processed, then returns the complete results.

Limitations:
  • No real-time progress updates
  • Long request timeout for large batches
  • No ability to pause/resume processing
  • No partial results on connection loss
Recommended Alternative: Use POST /api/batch/start to create a job, then connect via WebSocket for real-time updates.

Authentication

Requires authentication via Bearer token:
Authorization: Bearer your_access_token

Request

Content-Type: multipart/form-data

csv_file (file, required)
CSV file containing batch documents. Must have a .csv extension.
Required columns:
  • title - Document title
  • file_source_type - Source type: url, s3, or local
  • file_path - Path or URL to the file
Optional columns:
  • description - Document description
  • publishing_date - Publication date
  • file_size - File size

config (string, required)
JSON string containing a TaggingConfig object with the following fields:
  • api_key (string, required) - OpenRouter API key for AI tag generation
  • model_name (string) - AI model to use (default: "openai/gpt-4o-mini")
  • num_pages (integer) - Number of PDF pages to extract (1-10, default: 3)
  • num_tags (integer) - Number of tags to generate (3-15, default: 8)
  • exclusion_words (array) - List of words/phrases to exclude from tags

exclusion_file (file, optional)
Optional exclusion list file (.txt or .pdf) containing words/phrases to exclude from generated tags:
  • One term per line or comma-separated
  • Comments: lines starting with # are ignored
  • Overrides exclusion_words in config if provided
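The exclusion-file rules above can be mirrored client-side to preview which terms will actually be excluded before uploading. A minimal sketch, assuming plain-text input (the function name is ours, not part of the API):

```python
def parse_exclusion_terms(text: str) -> list[str]:
    """Parse exclusion-list text: one term per line or comma-separated;
    lines starting with '#' are treated as comments and skipped."""
    terms = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A single line may hold several comma-separated terms
        terms.extend(t.strip() for t in line.split(",") if t.strip())
    return terms
```

For example, a file containing a `#` comment line followed by `confidential, draft` and `internal-use` yields `["confidential", "draft", "internal-use"]`.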

Response

success (boolean) - Whether the batch processing completed successfully
total_documents (integer) - Total number of documents in the CSV
processed_count (integer) - Number of successfully processed documents
failed_count (integer) - Number of documents that failed processing
output_csv_url (string) - URL or path to the output CSV file with results
summary_report (object) - Detailed summary of processing results
  • documents (array) - Array of document results with titles, tags, and status
  • errors (array) - Array of error details for failed documents
processing_time (number) - Total processing time in seconds

Example Response

{
  "success": true,
  "total_documents": 10,
  "processed_count": 9,
  "failed_count": 1,
  "output_csv_url": "https://storage.example.com/results/batch_20250115_123456.csv",
  "summary_report": {
    "documents": [
      {
        "title": "Training Manual",
        "file_path": "https://example.com/doc1.pdf",
        "success": true,
        "tags": [
          "employee-training",
          "onboarding-procedures",
          "workplace-safety",
          "compliance-guidelines"
        ],
        "error": null
      },
      {
        "title": "Annual Report",
        "file_path": "https://example.com/doc2.pdf",
        "success": false,
        "tags": [],
        "error": "Failed to download file: HTTP 404"
      }
    ],
    "errors": [
      {
        "row": 2,
        "title": "Annual Report",
        "error": "Failed to download file: HTTP 404"
      }
    ]
  },
  "processing_time": 45.2
}
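Because the endpoint reports per-row failures rather than failing the whole batch, callers usually want to pull the failed rows out of summary_report. A small sketch (the helper names are ours):

```python
def failed_rows(result: dict) -> list[int]:
    """Return the CSV row numbers that failed, from summary_report.errors."""
    return [e["row"] for e in result["summary_report"]["errors"]]

def summarize(result: dict) -> str:
    """One-line summary of a batch result."""
    failed = failed_rows(result)
    line = f"{result['processed_count']}/{result['total_documents']} processed"
    if failed:
        line += f"; failed rows: {failed}"
    return line
```

Applied to the example response above, summarize() yields "9/10 processed; failed rows: [2]".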

CSV Format

See GET /api/batch/template for the complete CSV structure and examples.

Minimal Example:
title,file_source_type,file_path
"Document 1",url,https://example.com/doc1.pdf
"Document 2",url,https://example.com/doc2.pdf
Full Example:
title,description,file_source_type,file_path,publishing_date,file_size
"Training Manual","Employee onboarding guide",url,https://example.com/doc1.pdf,2025-01-15,1.2MB
"Q4 Report","Financial results",s3,company-docs/q4-2024.pdf,2024-12-31,2.5MB

Usage Example

cURL

curl -X POST "http://localhost:8000/api/batch/process" \
  -H "Authorization: Bearer your_access_token" \
  -F "csv_file=@batch_documents.csv" \
  -F 'config={
    "api_key": "your_openrouter_key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8
  }' \
  -F "exclusion_file=@exclusion_list.txt"

JavaScript/TypeScript

interface TaggingConfig {
  api_key: string;
  model_name?: string;
  num_pages?: number;
  num_tags?: number;
  exclusion_words?: string[];
}

interface BatchProcessResponse {
  success: boolean;
  total_documents: number;
  processed_count: number;
  failed_count: number;
  output_csv_url: string;
  summary_report: {
    documents: Array<{
      title: string;
      file_path: string;
      success: boolean;
      tags: string[];
      error: string | null;
    }>;
    errors: Array<{
      row: number;
      title: string;
      error: string;
    }>;
  };
  processing_time: number;
}

async function processBatchCSV(
  csvFile: File,
  config: TaggingConfig,
  exclusionFile: File | undefined,  // required `token` may not follow an optional parameter
  token: string
): Promise<BatchProcessResponse> {
  const formData = new FormData();
  formData.append('csv_file', csvFile);
  formData.append('config', JSON.stringify(config));
  
  if (exclusionFile) {
    formData.append('exclusion_file', exclusionFile);
  }
  
  const response = await fetch('http://localhost:8000/api/batch/process', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${token}`,
    },
    body: formData,
  });
  
  if (!response.ok) {
    throw new Error(`Batch processing failed: ${response.statusText}`);
  }
  
  return response.json();
}

// Usage
const csvFile = (document.getElementById('csv-input') as HTMLInputElement).files![0];
const result = await processBatchCSV(
  csvFile,
  {
    api_key: 'your_openrouter_key',
    model_name: 'openai/gpt-4o-mini',
    num_pages: 3,
    num_tags: 8,
  },
  undefined,
  'your_access_token'
);

console.log(`Processed ${result.processed_count}/${result.total_documents} documents`);
console.log(`Processing time: ${result.processing_time}s`);

Python

import json
import requests
from typing import Dict, Optional

def process_batch_csv(
    csv_path: str,
    config: Dict,
    token: str,
    exclusion_file_path: Optional[str] = None,
    base_url: str = "http://localhost:8000"
) -> Dict:
    files = {
        'csv_file': open(csv_path, 'rb'),
    }
    
    if exclusion_file_path:
        files['exclusion_file'] = open(exclusion_file_path, 'rb')
    
    data = {
        'config': json.dumps(config)
    }
    
    try:
        response = requests.post(
            f"{base_url}/api/batch/process",
            headers={"Authorization": f"Bearer {token}"},
            files=files,
            data=data,
            timeout=600  # 10 minute timeout for large batches
        )
        response.raise_for_status()
        return response.json()
    finally:
        for f in files.values():
            f.close()

# Usage
config = {
    "api_key": "your_openrouter_key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8,
}

result = process_batch_csv(
    csv_path="batch_documents.csv",
    config=config,
    token="your_access_token",
    exclusion_file_path="exclusion_list.txt"
)

print(f"Success: {result['success']}")
print(f"Processed: {result['processed_count']}/{result['total_documents']}")
print(f"Failed: {result['failed_count']}")
print(f"Time: {result['processing_time']}s")
print(f"Output: {result['output_csv_url']}")

Error Responses

Invalid File Type

{
  "detail": "Invalid file type"
}
Status Code: 400

Empty CSV File

{
  "detail": "Empty CSV file"
}
Status Code: 400

Invalid Config JSON

{
  "detail": "Invalid config JSON format"
}
Status Code: 400

Exclusion File Parse Error

{
  "detail": "Failed to parse exclusion file: [error details]"
}
Status Code: 400

Internal Server Error

{
  "detail": "Internal server error: [error details]"
}
Status Code: 500
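All error bodies share a single detail field, so client-side handling reduces to reading that field and deciding whether a retry makes sense. A sketch (the helper names are ours; the status codes and detail strings are the documented ones):

```python
def describe_error(status_code: int, body: dict) -> str:
    """Render a documented error response as a short message."""
    detail = body.get("detail", "Unknown error")
    kind = "client error" if 400 <= status_code < 500 else "server error"
    return f"{kind} ({status_code}): {detail}"

def should_retry(status_code: int) -> bool:
    """400-level errors mean the input must be fixed before resubmitting;
    a 500 may be transient and can be worth a single retry."""
    return status_code >= 500
```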

Migration to WebSocket Endpoint

For new implementations, use the WebSocket-based workflow instead:
  1. POST /api/batch/start - Start the job
  2. WebSocket /api/batch/ws/{job_id} - Get real-time updates
  3. POST /api/batch/jobs/{job_id}/cancel - Cancel if needed
This provides better user experience with progress tracking and the ability to disconnect/reconnect without losing the job.
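The three endpoints in this workflow can be derived from one base URL; only the WebSocket one needs the scheme swapped. A sketch (the helper is ours, and the job-id segment in the cancel path is our assumption based on the workflow above):

```python
def batch_endpoints(base_url: str, job_id: str) -> dict:
    """Build the URLs for the WebSocket-based batch workflow."""
    # WebSocket URLs use ws:// or wss:// in place of http:// or https://
    ws_base = base_url.replace("https://", "wss://").replace("http://", "ws://")
    return {
        "start": f"{base_url}/api/batch/start",
        "ws": f"{ws_base}/api/batch/ws/{job_id}",
        "cancel": f"{base_url}/api/batch/jobs/{job_id}/cancel",
    }
```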

Performance Considerations

Timeout Risk: Large batches may exceed request timeout limits. The WebSocket endpoint is recommended for batches with more than 10 documents.
Blocking Request: This endpoint blocks until all documents are processed. The client must maintain the connection for the entire duration.
Rate Limiting: OpenRouter API rate limits apply. Large batches may encounter rate limit errors. The WebSocket endpoint handles rate limiting more gracefully with retry logic.
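One way to reduce timeout risk while staying on this endpoint is to split the input into sub-batches of at most 10 documents (the threshold suggested above) and submit them sequentially. A sketch of the splitting step:

```python
def split_rows(rows: list[dict], chunk_size: int = 10) -> list[list[dict]]:
    """Split parsed CSV rows into sub-batches small enough to keep each
    synchronous request short (10-document threshold from the note above)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
```

Each sub-batch would then be written back to CSV and submitted with its own /api/batch/process call. This trades one long request for several short ones, but still provides no progress updates within a sub-batch.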
