Endpoint
Authentication
This endpoint requires authentication. Include a valid JWT in the Authorization header.
Description
Processes a single PDF file and generates AI-powered metadata tags. Supports both file uploads and URL-based document retrieval, with automatic OCR for scanned documents.
Request
Content Type
Parameters
PDF file to process (optional if pdf_url is provided)
- Must have a .pdf extension
- Cannot be empty
- Cannot be used together with pdf_url
URL to download the PDF from (optional if pdf_file is provided)
- Must start with http:// or https://
- Cannot be used together with pdf_file
- Examples: CloudFront URLs, S3 URLs, direct file links
- 60-second timeout, 50MB limit
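The rules for the two PDF sources can be sketched as a small validation helper. This is an illustration only, not the endpoint's actual code: the function name `validate_pdf_source`, its arguments, and the error messages are hypothetical, and treating the `.pdf` check as case-insensitive is an assumption.

```python
from typing import Optional


def validate_pdf_source(
    filename: Optional[str], file_size: int, pdf_url: Optional[str]
) -> str:
    """Return "file" or "url" for a valid request, else raise ValueError.

    Hypothetical helper mirroring the documented parameter rules.
    """
    has_file = filename is not None
    has_url = pdf_url is not None

    # pdf_file and pdf_url are mutually exclusive, and one is required.
    if has_file and has_url:
        raise ValueError("Provide either pdf_file or pdf_url, not both")
    if not has_file and not has_url:
        raise ValueError("Provide one of pdf_file or pdf_url")

    if has_file:
        # Assumption: the extension check ignores case.
        if not filename.lower().endswith(".pdf"):
            raise ValueError("pdf_file must have a .pdf extension")
        if file_size == 0:
            raise ValueError("pdf_file cannot be empty")
        return "file"

    if not pdf_url.startswith(("http://", "https://")):
        raise ValueError("pdf_url must start with http:// or https://")
    return "url"
```

Running the same checks on the client before sending a request avoids a round trip that would end in a 400 response.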
JSON string containing tagging configuration
Schema (TaggingConfig):
File containing words/phrases to exclude from tags
- Supported formats: .txt, .pdf
- One term per line or comma-separated
- Lines starting with # are treated as comments
- Merged with config.exclusion_words if both are provided
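A minimal sketch of how a plain-text exclusion list in this format could be parsed and merged with config.exclusion_words. The function names are hypothetical, and two behaviors are assumptions not stated above: duplicates are removed case-insensitively, and terms from the config come first in the merged list.

```python
from typing import Iterable, List


def parse_exclusion_text(text: str) -> List[str]:
    """Parse terms given one per line or comma-separated; skip # comments."""
    terms: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        for term in line.split(","):
            term = term.strip()
            if term:
                terms.append(term)
    return terms


def merge_exclusions(
    config_words: Iterable[str], file_terms: Iterable[str]
) -> List[str]:
    """Merge both sources, keeping the first spelling of each term."""
    merged: List[str] = []
    seen = set()
    for term in list(config_words) + list(file_terms):
        key = term.casefold()  # assumption: case-insensitive de-duplication
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged
```

For example, a file containing a comment line, the line `budget, finance`, and the line `annual report` yields three terms; merging them with a config list of `["Budget", "policy"]` keeps `Budget` once.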
Example Request Body
Response
Success Response (200 OK)
Indicates if processing was successful
Extracted document title from PDF metadata or filename
Array of generated metadata tags
Example:
["digital-governance", "e-learning", "skill-development"]
First 500 characters of extracted text from the document
Total processing time in seconds (rounded to 2 decimal places)
Indicates if the document was detected as a scanned PDF requiring OCR
Method used for text extraction
Possible values:
- pypdf2 - Text-based extraction
- tesseract_ocr - Fast OCR for scanned documents
- easyocr - Advanced OCR for complex scripts
OCR confidence score (0-100) if OCR was used
Raw response from AI model (for debugging)
Error message if processing failed (always null on success)
Example Response
Error Responses
400 Bad Request
Both file and URL provided:
401 Unauthorized
500 Internal Server Error
Example Usage
Using cURL with File Upload
Using cURL with URL
Using cURL with Exclusion List
Using JavaScript (fetch)
Features
Automatic OCR Detection
The endpoint automatically detects scanned PDFs and applies OCR using a 3-tier strategy:
- PyPDF2 (fastest) - Tries text-based extraction first
- Tesseract OCR (fast) - Fallback for scanned documents, good for Hindi/English
- EasyOCR (most accurate) - Automatic fallback if Tesseract confidence < 60%
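The fallback order above can be sketched as follows. This is not the implementation in the router: the three extractors are passed in as plain callables so the control flow stays independent of any OCR library, and `min_text_chars` is a hypothetical stand-in for however the service decides that a PDF is scanned, which the text above does not specify.

```python
from typing import Callable, Optional, Tuple

TextExtractor = Callable[[bytes], str]
OcrEngine = Callable[[bytes], Tuple[str, float]]  # returns (text, confidence 0-100)

TESSERACT_MIN_CONFIDENCE = 60.0  # threshold stated in the list above


def extract_text_tiered(
    pdf_bytes: bytes,
    extract_text: TextExtractor,
    run_tesseract: OcrEngine,
    run_easyocr: OcrEngine,
    min_text_chars: int = 50,  # hypothetical scanned-PDF heuristic
) -> Tuple[str, str, Optional[float]]:
    """Return (text, method, ocr_confidence) using the 3-tier fallback."""
    # Tier 1: text-based extraction; no OCR confidence applies.
    text = extract_text(pdf_bytes)
    if len(text.strip()) >= min_text_chars:
        return text, "pypdf2", None

    # Tier 2: Tesseract, accepted only at or above the confidence threshold.
    text, confidence = run_tesseract(pdf_bytes)
    if confidence >= TESSERACT_MIN_CONFIDENCE:
        return text, "tesseract_ocr", confidence

    # Tier 3: EasyOCR as the final fallback.
    text, confidence = run_easyocr(pdf_bytes)
    return text, "easyocr", confidence
```

The method strings match the possible values listed for the extraction method in the response.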
Exclusion List Filtering
The system uses a two-layer approach to filter excluded terms:
- Pre-generation: AI is instructed to avoid excluded terms
- Post-processing: Any excluded terms that slip through are filtered out
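The two layers might look like the sketch below. Both function names and the prompt wording are hypothetical, and the matching rule is an assumption: tags and excluded terms are compared after lowercasing and treating hyphens and underscores as spaces, and only whole-tag matches are removed.

```python
from typing import Iterable, List


def build_exclusion_instruction(excluded: Iterable[str]) -> str:
    """Layer 1: text appended to the model prompt (wording is illustrative)."""
    terms = ", ".join(sorted(set(excluded)))
    return f"Do not use any of these terms as tags: {terms}" if terms else ""


def _normalize(term: str) -> str:
    # Assumption: "e-learning", "E Learning", and "e_learning" are equivalent.
    return term.casefold().replace("-", " ").replace("_", " ").strip()


def filter_excluded_tags(tags: Iterable[str], excluded: Iterable[str]) -> List[str]:
    """Layer 2: drop any generated tag that matches an excluded term."""
    blocked = {_normalize(term) for term in excluded}
    return [tag for tag in tags if _normalize(tag) not in blocked]
```

The second layer is what makes the exclusion reliable: the prompt instruction only reduces how often excluded terms appear, while the filter removes whatever still gets through.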
Language Support
Supports 80+ languages including:
- Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi
- English, Spanish, French, German, Chinese, Japanese, Korean
- And many more via EasyOCR
Source Code
Implementation: backend/app/routers/single.py:25
Models: backend/app/models.py:8 (TaggingConfig), backend/app/models.py:24 (SinglePDFResponse)