CSV Structure
The batch processing system accepts CSV files with metadata for multiple documents. Each row represents one document to process.
Required Columns
title: Document title
The name or title of the document. This is used for identification and appears in processing results.
Example: "Training Manual 2024", "Annual Report FY2023-24"
file_source_type: Source type of the file
Specifies where the document is located. Must be one of:
- url - Public HTTP/HTTPS URL
- s3 - Amazon S3 bucket (requires AWS credentials)
- local - Local file path on server
Allowed values: url, s3, local
file_path: Path or URL to the PDF file
The location of the PDF document. The format depends on file_source_type:
- URL: https://example.com/document.pdf
- S3: s3://bucket-name/path/to/file.pdf
- Local: /absolute/path/to/file.pdf
Example: "https://socialjustice.gov.in/writereaddata/UploadFile/66991763713697.pdf"
Optional Columns
Document description
Additional context about the document. This helps the AI generate more relevant tags.
Example: "Comprehensive training document for new employees covering organizational policies and procedures"
Publication date
When the document was published or last updated. Format: YYYY-MM-DD or any standard date format.
Example: "2025-01-15", "2024-12-31"
File size
Size of the PDF file. This is informational only.
Example: "1.2MB", "2.5MB", "450KB"
CSV Template
Download a template with sample data:
API Response
Template API Response
File Source Types
URL Sources
Process documents from public HTTP/HTTPS URLs.
Requirements:
- URL must start with http:// or https://
- Document must be publicly accessible (no authentication)
- File size limit: 50MB
- Download timeout: 60 seconds
Supported URL types:
- Direct PDF URLs
- CloudFront CDN URLs
- S3 public bucket URLs
- Government/institutional portals
- Any publicly accessible HTTP/HTTPS endpoint
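
The URL rules above can be checked locally before a file is submitted. The following is a minimal sketch, not the system's own validator: the helper names and constants are hypothetical, and treating 50MB as 50 * 1024 * 1024 bytes is an assumption.

```python
from urllib.parse import urlparse

# Limits quoted in the requirements; the constant names are illustrative and
# the byte interpretation of "50MB" is an assumption.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024
DOWNLOAD_TIMEOUT_SECONDS = 60


def check_url_source(url: str) -> list[str]:
    """Return the problems found in a url-type file_path (empty if none)."""
    problems = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    elif not parsed.netloc:
        problems.append("URL has no host name")
    return problems


def size_within_limit(num_bytes: int) -> bool:
    """True if a known file size fits under the documented limit."""
    return 0 <= num_bytes <= MAX_FILE_SIZE_BYTES


print(check_url_source("https://example.com/document.pdf"))
print(check_url_source("ftp://example.com/document.pdf"))
```

This only inspects the URL string and a known size; it does not confirm that the document is publicly accessible.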
S3 Sources
Process documents from private S3 buckets (requires AWS credentials).
- Full S3 URI: s3://bucket-name/path/to/file.pdf
- Short format: bucket-name/path/to/file.pdf
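
Both S3 formats carry the same two pieces of information, a bucket and an object key. A minimal sketch of how a path might be split, assuming the first path segment is always the bucket; parse_s3_path is a hypothetical helper, not a function of the batch processing system:

```python
def parse_s3_path(file_path: str) -> tuple[str, str]:
    """Split an S3 file_path (full URI or short format) into (bucket, key)."""
    path = file_path.strip()
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    # Everything before the first "/" is taken as the bucket name.
    bucket, separator, key = path.partition("/")
    if not bucket or not separator or not key:
        raise ValueError(f"Not a valid S3 path: {file_path!r}")
    return bucket, key


print(parse_s3_path("s3://bucket-name/path/to/file.pdf"))
print(parse_s3_path("bucket-name/path/to/file.pdf"))
```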
Environment Variables
Local Sources
Process documents from the server’s local file system.
- Paths must be absolute (start with /)
- File must exist and be readable by the application
- No relative paths supported
- Server must have read permissions
Local file processing is typically used for documents already uploaded to the server or available on network-mounted storage.
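
The local path rules can be checked on the server before a batch is submitted. A minimal sketch, assuming a POSIX file system and that the check runs as the same user as the application; check_local_path is a hypothetical helper:

```python
import os
import tempfile


def check_local_path(file_path: str) -> list[str]:
    """Return the problems found in a local-type file_path (empty if none)."""
    problems = []
    if not os.path.isabs(file_path):
        problems.append("path must be absolute")
    elif not os.path.isfile(file_path):
        problems.append("file does not exist or is not a regular file")
    elif not os.access(file_path, os.R_OK):
        problems.append("file is not readable")
    return problems


# Demonstration with a temporary file standing in for an uploaded PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf") as handle:
    print(check_local_path(handle.name))
print(check_local_path("relative/path/to/file.pdf"))
```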
Complete CSV Examples
Minimal Example (Required Columns Only)
Minimal CSV
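
A file with only the three required columns can be produced with Python's csv module. This is a sketch under stated assumptions: the titles reuse the examples given earlier, and the example.com URLs are placeholders rather than real documents.

```python
import csv
import io

REQUIRED_COLUMNS = ["title", "file_source_type", "file_path"]

# Placeholder rows; replace the URLs with real, publicly accessible PDFs.
rows = [
    {
        "title": "Training Manual 2024",
        "file_source_type": "url",
        "file_path": "https://example.com/document.pdf",
    },
    {
        "title": "Annual Report FY2023-24",
        "file_source_type": "url",
        "file_path": "https://example.com/annual-report.pdf",
    },
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS, lineterminator="\n")
writer.writeheader()      # header row comes first
writer.writerows(rows)    # one document per row
minimal_csv = buffer.getvalue()
print(minimal_csv)
```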
Complete Example (All Columns)
Complete CSV
Mixed Source Types
Mixed Sources
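
Rows with different source types can share one file. The sketch below parses an illustrative mixed-source CSV and counts the rows per file_source_type; the titles and paths are placeholders built from the formats described earlier, not real documents.

```python
import csv
import io
from collections import Counter

# Illustrative file mixing all three source types.
mixed_csv = """\
title,file_source_type,file_path
Training Manual 2024,url,https://example.com/document.pdf
Annual Report FY2023-24,s3,s3://bucket-name/path/to/file.pdf
Archived Policy,local,/absolute/path/to/file.pdf
"""

rows = list(csv.DictReader(io.StringIO(mixed_csv)))
counts = Counter(row["file_source_type"] for row in rows)
print(dict(counts))
```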
CSV Formatting Best Practices
Use Quotes for Text Fields
Always wrap text fields containing commas, quotes, or newlines in double quotes. Escape internal quotes by doubling them.
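
Python's csv module applies both rules automatically when it writes a file. A short sketch with an illustrative title that contains a comma and quotes:

```python
import csv
import io

buffer = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only fields that need it and doubles
# any double quotes inside them.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["title", "file_source_type", "file_path"])
writer.writerow([
    'Policies, Procedures and "Best Practices"',
    "url",
    "https://example.com/document.pdf",
])
quoted_csv = buffer.getvalue()
print(quoted_csv)

# Reading the text back recovers the original title unchanged.
parsed = list(csv.reader(io.StringIO(quoted_csv)))
print(parsed[1][0])
```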
Include Header Row
Always include column headers as the first row. Do not skip the header row, or the system will not parse the file correctly.
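
A quick local check can confirm that the first row names the required columns. A minimal sketch; missing_required_columns is a hypothetical helper and does not reproduce the system's own parsing:

```python
import csv
import io

REQUIRED_COLUMNS = ("title", "file_source_type", "file_path")


def missing_required_columns(csv_text: str) -> list[str]:
    """List the required column names absent from the first row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []  # None when the text is empty
    return [name for name in REQUIRED_COLUMNS if name not in header]


with_header = "title,file_source_type,file_path\nTraining Manual 2024,url,https://example.com/document.pdf\n"
without_header = "Training Manual 2024,url,https://example.com/document.pdf\n"

print(missing_required_columns(with_header))
print(missing_required_columns(without_header))
```

When the header is skipped, the first data row is read as the header, so every required column is reported as missing.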
Use UTF-8 Encoding
Save your CSV file with UTF-8 encoding to support international characters.
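
When the file is generated from a script, the encoding can be set explicitly. A minimal sketch that writes and re-reads a file in a temporary directory; the file name and the French title are illustrative:

```python
import csv
import os
import tempfile

rows = [
    ["title", "file_source_type", "file_path"],
    ["Rapport annuel 2024 (édition française)", "url", "https://example.com/document.pdf"],
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "batch.csv")

    # encoding="utf-8" fixes the byte encoding; newline="" lets the csv
    # module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)

    with open(path, "rb") as handle:
        raw_bytes = handle.read()

    with open(path, "r", encoding="utf-8", newline="") as handle:
        round_tripped = list(csv.reader(handle))

print(round_tripped == rows)
```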
One Document Per Row
Each row represents exactly one document to process:
✅ Correct:
❌ Incorrect:
Validate URLs Before Upload
Ensure all URLs are accessible before batch processing.
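
One way to check accessibility is a HEAD request per URL. The sketch below uses only the standard library; check_url_reachable and its 10-second default timeout are assumptions, and some servers reject HEAD requests even though a normal download would succeed.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def check_url_reachable(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (ok, detail) for a single URL without downloading the body."""
    if urlparse(url).scheme not in ("http", "https"):
        return False, "URL must start with http:// or https://"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return False, f"HTTP {error.code}"
    except OSError as error:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"Request failed: {error}"
```

Calling check_url_reachable on each file_path of type url before upload gives an early list of rows that are likely to fail.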
Column Mapping
If your CSV uses different column names, you can provide a mapping:
Custom Column Mapping
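
An alternative to supplying a mapping is to rename the headers locally before upload. The sketch below does that; the source header names in COLUMN_MAPPING are hypothetical, and this is not the mapping format the system itself accepts.

```python
import csv
import io

# Hypothetical headers from an existing spreadsheet, mapped to the
# column names the batch processing system expects.
COLUMN_MAPPING = {
    "Document Name": "title",
    "Source": "file_source_type",
    "Location": "file_path",
}


def rename_columns(csv_text: str, mapping: dict[str, str]) -> str:
    """Return the CSV text with header names replaced according to mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    # Only the first row is touched; unmapped headers are kept as they are.
    rows[0] = [mapping.get(name.strip(), name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


original = "Document Name,Source,Location\nTraining Manual 2024,url,https://example.com/document.pdf\n"
print(rename_columns(original, COLUMN_MAPPING))
```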
Pre-flight Validation
Validate file paths before starting batch processing:
Validate CSV Paths
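
A rough local counterpart to this check can be sketched in Python. It inspects only the format of each row, is not the system's own validation endpoint, and uses a hypothetical preflight helper with illustrative error wording.

```python
import csv
import io

ALLOWED_SOURCE_TYPES = ("url", "s3", "local")  # case-sensitive


def preflight(csv_text: str) -> list[str]:
    """Return one message per data row whose source type or path looks wrong."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        source_type = row.get("file_source_type") or ""
        file_path = (row.get("file_path") or "").strip()
        if source_type not in ALLOWED_SOURCE_TYPES:
            errors.append(f"row {row_number}: Invalid file_source_type: {source_type!r}")
        elif not file_path:
            errors.append(f"row {row_number}: file_path is empty")
        elif source_type == "url" and not file_path.startswith(("http://", "https://")):
            errors.append(f"row {row_number}: URL must start with http:// or https://")
        elif source_type == "local" and not file_path.startswith("/"):
            errors.append(f"row {row_number}: local path must be absolute")
    return errors


sample = """\
title,file_source_type,file_path
Doc A,url,https://example.com/document.pdf
Doc B,http,https://example.com/document.pdf
Doc C,local,relative/file.pdf
Doc D,URL,https://example.com/document.pdf
Doc E,s3,s3://bucket-name/path/to/file.pdf
"""

for message in preflight(sample):
    print(message)
```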
Common Errors
Empty CSV file
Error: "Empty CSV file"
Cause: The CSV file has no rows or only a header row
Solution: Ensure the CSV has at least one data row after the header
Missing required columns
Error: "Missing required column: title"
Cause: The CSV is missing one of the required columns
Solution: Ensure the CSV includes the title, file_source_type, and file_path columns
Invalid file source type
Error: "Invalid file_source_type: 'http'"
Cause: file_source_type must be exactly url, s3, or local
Solution: Use only the allowed values (case-sensitive)
URL validation failed
Error: "HTTP 404" or "Request timed out"
Cause: The URL is inaccessible or returns an error
Solution:
- Verify the URL is correct and publicly accessible
- Check whether the URL requires authentication (not supported)
- Ensure the URL starts with http:// or https://
S3 client not configured
Error: "S3 client not configured"
Cause: AWS credentials are not configured on the server
Solution:
- Configure AWS credentials on the server
- Or use public S3 URLs with file_source_type: url
Performance Tips
Batch Size
Process 50-100 documents per batch for optimal performance. Larger batches may time out.
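
A large CSV can be split into smaller files before upload, each with its own header row. A minimal sketch; split_into_batches is a hypothetical helper, and the default of 50 rows is simply the lower end of the recommended range.

```python
import csv
import io


def split_into_batches(csv_text: str, batch_size: int = 50) -> list[str]:
    """Split CSV text into chunks of at most batch_size data rows each."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    rows = list(reader)
    batches = []
    for start in range(0, len(rows), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)  # every batch repeats the header row
        writer.writerows(rows[start:start + batch_size])
        batches.append(out.getvalue())
    return batches
```

Each returned string can be saved as its own file and submitted as a separate batch.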
Use URL Sources
URL sources are fastest because documents are downloaded on demand. Local and S3 sources require additional I/O.
Validate First
Always run path validation before batch processing to catch errors early and save API costs.
Monitor Progress
Connect to the WebSocket for real-time progress updates. You can pause, resume, or cancel jobs if needed.