Learn how to structure CSV files for batch processing, including required columns, file source types, and formatting best practices.

CSV Structure

The batch processing system accepts CSV files with metadata for multiple documents. Each row represents one document to process.

Required Columns

title
string
required
Document title. The name or title of the document; used for identification and shown in processing results.
Example: "Training Manual 2024", "Annual Report FY2023-24"
file_source_type
enum
required
Source type of the file. Specifies where the document is located. Must be one of:
  • url - Public HTTP/HTTPS URL
  • s3 - Amazon S3 bucket (requires AWS credentials)
  • local - Local file path on server
Example: url, s3, local
file_path
string
required
Path or URL to the PDF file. The location of the PDF document. The format depends on file_source_type:
  • URL: https://example.com/document.pdf
  • S3: s3://bucket-name/path/to/file.pdf
  • Local: /absolute/path/to/file.pdf
Example: "https://socialjustice.gov.in/writereaddata/UploadFile/66991763713697.pdf"

Optional Columns

description
string
Document description. Additional context about the document; this helps the AI generate more relevant tags.
Example: "Comprehensive training document for new employees covering organizational policies and procedures"
publishing_date
string
Publication date. When the document was published or last updated. Format: YYYY-MM-DD or any standard date format.
Example: "2025-01-15", "2024-12-31"
file_size
string
File size. Size of the PDF file; informational only.
Example: "1.2MB", "2.5MB", "450KB"
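A CSV with these columns can also be assembled programmatically, which avoids quoting mistakes. A minimal sketch using Python's standard csv module (the build_csv helper is illustrative, not part of the API; missing optional fields are written as empty strings):

```python
import csv
import io

# The six columns documented on this page: three required, three optional.
FIELDNAMES = [
    "title", "description", "file_source_type",
    "file_path", "publishing_date", "file_size",
]

def build_csv(rows):
    """Serialize a list of row dicts into a CSV string with the expected header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # absent optional keys become empty fields
    return buf.getvalue()
```

Writing the result to a file (with UTF-8 encoding) then produces a CSV ready for upload.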

CSV Template

Download a template with sample data:
title,description,file_source_type,file_path,publishing_date,file_size
"Training Manual","PMSPECIAL training document",url,https://example.com/doc1.pdf,2025-01-15,1.2MB
"Annual Report 2024","Financial report",url,https://example.com/doc2.pdf,2024-12-31,2.5MB

API Response

Template API Response
{
  "template": "title,description,file_source_type,file_path,publishing_date,file_size\n\"Training Manual\",\"PMSPECIAL training document\",url,https://example.com/doc1.pdf,2025-01-15,1.2MB\n...",
  "columns": [
    {
      "name": "title",
      "required": true,
      "description": "Document title"
    },
    {
      "name": "description",
      "required": false,
      "description": "Document description"
    },
    {
      "name": "file_source_type",
      "required": true,
      "description": "Source type: url, s3, or local"
    },
    {
      "name": "file_path",
      "required": true,
      "description": "Path or URL to the file"
    },
    {
      "name": "publishing_date",
      "required": false,
      "description": "Publication date"
    },
    {
      "name": "file_size",
      "required": false,
      "description": "File size"
    }
  ]
}
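The columns array in this response can drive a client-side header check before upload. A small sketch that extracts the required column names from the JSON shown above (required_columns is a hypothetical helper):

```python
import json

def required_columns(template_response):
    """Return the names of columns marked required in the template API response."""
    return [c["name"] for c in template_response["columns"] if c["required"]]

# The "columns" array from the template API response above.
response = json.loads("""
{
  "columns": [
    {"name": "title", "required": true, "description": "Document title"},
    {"name": "description", "required": false, "description": "Document description"},
    {"name": "file_source_type", "required": true, "description": "Source type: url, s3, or local"},
    {"name": "file_path", "required": true, "description": "Path or URL to the file"},
    {"name": "publishing_date", "required": false, "description": "Publication date"},
    {"name": "file_size", "required": false, "description": "File size"}
  ]
}
""")
```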

File Source Types

URL Sources

Process documents from public HTTP/HTTPS URLs.
title,file_source_type,file_path
"CloudFront Document",url,https://d1581jr3fp95xu.cloudfront.net/path/to/file.pdf
"Government Portal",url,https://socialjustice.gov.in/writereaddata/UploadFile/66991763713697.pdf
"S3 Public URL",url,https://my-bucket.s3.ap-south-1.amazonaws.com/document.pdf
"Direct Link",url,https://example.com/reports/annual-2024.pdf
Requirements:
  • URL must start with http:// or https://
  • Document must be publicly accessible (no authentication)
  • File size limit: 50MB
  • Download timeout: 60 seconds
Supported URL Types:
  • Direct PDF URLs
  • CloudFront CDN URLs
  • S3 public bucket URLs
  • Government/institutional portals
  • Any publicly accessible HTTP/HTTPS endpoint
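The scheme and size requirements can be pre-checked client-side before a row goes into the CSV. A minimal sketch (check_url_requirements is a hypothetical helper; content_length would typically come from a HEAD request's Content-Length header):

```python
from urllib.parse import urlparse

MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50MB limit stated above

def check_url_requirements(url, content_length=None):
    """Check a URL against the stated requirements without downloading it."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        return False, "URL must start with http:// or https://"
    if content_length is not None and content_length > MAX_SIZE_BYTES:
        return False, "file exceeds the 50MB limit"
    return True, "ok"
```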

S3 Sources

Process documents from private S3 buckets (requires AWS credentials).
title,file_source_type,file_path
"Private Document",s3,s3://my-private-bucket/documents/report.pdf
"Archived Report",s3,s3://archive-bucket/2024/annual-report.pdf
"Training Material",s3,my-bucket/training/manual.pdf
Path Formats:
  • Full S3 URI: s3://bucket-name/path/to/file.pdf
  • Short format: bucket-name/path/to/file.pdf
AWS Configuration: The backend must be configured with AWS credentials:
Environment Variables
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=ap-south-1
S3 processing requires valid AWS credentials configured on the server. Public S3 URLs can be processed as url type instead.
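Both path formats resolve to the same bucket/key pair. A sketch of how a client might normalize them before handing off to an S3 SDK (parse_s3_path is illustrative, not part of the API):

```python
def parse_s3_path(path):
    """Split an S3 path into (bucket, key), accepting both formats above."""
    if path.startswith("s3://"):
        path = path[len("s3://"):]   # full S3 URI -> short format
    bucket, _, key = path.partition("/")
    if not bucket or not key:
        raise ValueError(f"invalid S3 path: {path!r}")
    return bucket, key
```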

Local Sources

Process documents from the server’s local file system.
title,file_source_type,file_path
"Server Document",local,/opt/documents/report.pdf
"Shared Storage",local,/mnt/shared/pdfs/manual.pdf
"Archive File",local,/data/archive/2024/report.pdf
Requirements:
  • Paths must be absolute (start with /)
  • File must exist and be readable by the application
  • No relative paths supported
  • Server must have read permissions
Local file processing is typically used for documents already uploaded to the server or available on network-mounted storage.
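The requirements above translate directly into a local pre-check. A minimal sketch, assuming the check runs on the same server that will read the files (check_local_path is a hypothetical helper):

```python
import os

def check_local_path(path):
    """Validate a local file path against the requirements above."""
    if not os.path.isabs(path):
        return False, "path must be absolute (start with /)"
    if not os.path.isfile(path):
        return False, "file does not exist"
    if not os.access(path, os.R_OK):
        return False, "file is not readable by the application"
    return True, "ok"
```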

Complete CSV Examples

Minimal Example (Required Columns Only)

Minimal CSV
title,file_source_type,file_path
"Training Manual",url,https://example.com/training.pdf
"Annual Report 2024",url,https://example.com/annual-2024.pdf
"Policy Document",url,https://example.com/policy.pdf

Complete Example (All Columns)

Complete CSV
title,description,file_source_type,file_path,publishing_date,file_size
"NSFDC Training Manual","Comprehensive training document for NSFDC schemes and procedures",url,https://socialjustice.gov.in/writereaddata/UploadFile/66991763713697.pdf,2024-03-15,2.4MB
"Annual Report FY2023-24","Financial and operational report for fiscal year 2023-2024",url,https://example.com/reports/annual-2024.pdf,2024-04-01,3.1MB
"Employee Handbook","Standard operating procedures and organizational policies",s3,s3://company-docs/hr/handbook.pdf,2024-01-10,1.8MB
"Scheme Guidelines","Detailed guidelines for beneficiary eligibility and application process",url,https://portal.gov.in/schemes/guidelines.pdf,2024-02-20,950KB

Mixed Source Types

Mixed Sources
title,description,file_source_type,file_path,publishing_date
"Public Report","Annual report from website",url,https://example.com/report.pdf,2024-12-31
"Archive Document","Archived policy document",s3,s3://archive/policies/2024.pdf,2024-06-15
"Server File","Local training material",local,/opt/documents/training.pdf,2024-01-01

CSV Formatting Best Practices

Always wrap text fields containing commas, quotes, or newlines in double quotes:
"Training Manual, Version 2","Document for new employees",url,https://example.com/doc.pdf
Escape internal quotes by doubling them:
"Employee ""Best Practices"" Guide","Comprehensive guide",url,https://example.com/guide.pdf
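Rather than hand-escaping, Python's csv module applies both rules automatically, quoting fields that contain commas or quotes and doubling internal quote characters:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL quotes only where needed
writer.writerow([
    'Employee "Best Practices" Guide',  # internal quotes get doubled
    "Training Manual, Version 2",       # internal comma triggers quoting
    "url",
    "https://example.com/guide.pdf",
])
line = buf.getvalue().strip()
```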
Always include column headers as the first row:
title,description,file_source_type,file_path
"Document 1","Description",url,https://example.com/doc1.pdf
Do not skip the header row; without it, the system cannot parse the file correctly.
Save your CSV file with UTF-8 encoding to support international characters:
title,description,file_source_type,file_path
"वार्षिक रिपोर्ट 2024","हिंदी में रिपोर्ट",url,https://example.com/report-hi.pdf
"தமிழ் கையேடு","Tamil training manual",url,https://example.com/tamil.pdf
Each row represents exactly one document to process.
Correct:
title,file_source_type,file_path
"Document 1",url,https://example.com/doc1.pdf
"Document 2",url,https://example.com/doc2.pdf
Incorrect:
title,file_source_type,file_path
"Document 1; Document 2",url,https://example.com/doc1.pdf; https://example.com/doc2.pdf
Ensure all URLs are accessible before batch processing:
import requests

def validate_url(url):
    """Return True if the URL responds with HTTP 200."""
    try:
        # Follow redirects; requests.head does not by default
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.status_code == 200
    except requests.RequestException:
        return False

# Check before adding to CSV
if validate_url("https://example.com/doc.pdf"):
    print("URL is valid")

Column Mapping

If your CSV uses different column names, you can provide a mapping:
Custom Column Mapping
import requests
import json

url = "http://localhost:8000/api/batch/start"

# Your CSV has columns: doc_name, source, location
# Map them to system fields
column_mapping = {
    "title": "doc_name",
    "file_source_type": "source",
    "file_path": "location"
}

payload = {
    "documents": documents,
    "config": config,
    "column_mapping": column_mapping
}

headers = {"Authorization": f"Bearer {access_token}"}
response = requests.post(url, json=payload, headers=headers)

Pre-flight Validation

Validate file paths before starting batch processing:
Validate CSV Paths
import csv
import requests

def validate_csv_paths(csv_file_path, access_token):
    """Validate all file paths in CSV before processing"""
    
    # Read CSV and extract paths
    paths = []
    with open(csv_file_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            paths.append({
                "path": row['file_path'],
                "type": row['file_source_type']
            })
    
    # Validate all paths
    url = "http://localhost:8000/api/batch/validate-paths"
    headers = {"Authorization": f"Bearer {access_token}"}
    payload = {"paths": paths}
    
    response = requests.post(url, json=payload, headers=headers)
    results = response.json()
    
    # Report validation results
    print("Validation Results:")
    print(f"Valid: {results['valid_count']}")
    print(f"Invalid: {results['invalid_count']}")
    
    for result in results['results']:
        if not result['valid']:
            print(f"  ✗ {result['path']}: {result['error']}")
    
    return results['invalid_count'] == 0

# Usage
if validate_csv_paths('batch.csv', access_token):
    print("All paths valid, proceeding with batch processing")
else:
    print("Some paths invalid, please fix and retry")

Common Errors

Error: "Empty CSV file"
Cause: The CSV file has no rows, or only a header row.
Solution: Ensure the CSV has at least one data row after the header.

Error: "Missing required column: title"
Cause: The CSV is missing one of the required columns.
Solution: Ensure the CSV includes the title, file_source_type, and file_path columns.

Error: "Invalid file_source_type: 'http'"
Cause: file_source_type must be exactly url, s3, or local.
Solution: Use only the allowed values (case-sensitive).
Error: "HTTP 404" or "Request timed out"
Cause: The URL is inaccessible or returns an error.
Solution:
  • Verify URL is correct and publicly accessible
  • Check if URL requires authentication (not supported)
  • Ensure URL starts with http:// or https://
Error: "S3 client not configured"
Cause: AWS credentials are not configured on the server.
Solution:
  • Configure AWS credentials on the server
  • Or use public S3 URLs with file_source_type: url
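Most of these errors can be caught locally before calling the API. A sketch of a row-level pre-check (precheck_rows is illustrative; the error strings mirror the messages above):

```python
REQUIRED = ("title", "file_source_type", "file_path")
ALLOWED_SOURCE_TYPES = ("url", "s3", "local")

def precheck_rows(fieldnames, rows):
    """Catch the common CSV errors above locally, before calling the API."""
    errors = []
    for col in REQUIRED:
        if col not in fieldnames:
            errors.append(f"Missing required column: {col}")
    if not rows:
        errors.append("Empty CSV file")
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        if row.get("file_source_type") not in ALLOWED_SOURCE_TYPES:
            errors.append(f"Row {i}: invalid file_source_type: {row.get('file_source_type')!r}")
    return errors
```

Feeding it the fieldnames and rows from a csv.DictReader reports all problems in one pass.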

Performance Tips

Batch Size

Process 50-100 documents per batch for optimal performance; larger batches may time out.
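One way to stay within that range is to split the document list before submission and start one job per chunk. A sketch (batch_size=100 reflects the guidance above):

```python
def chunk(documents, batch_size=100):
    """Split a document list into batches of at most batch_size items."""
    return [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
```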

Use URL Sources

URL sources are fastest because documents are downloaded on demand; local and S3 sources require additional I/O.

Validate First

Always run path validation before batch processing to catch errors early and save API costs.

Monitor Progress

Connect to WebSocket for real-time progress. You can pause/resume or cancel jobs if needed.
