POST /api/single/process

Endpoint

POST /api/single/process

Authentication

This endpoint requires authentication. Include a valid JWT token in the Authorization header:

Authorization: Bearer <your_token>

Description

Processes a single PDF file and generates AI-powered metadata tags. Supports both file uploads and URL-based document retrieval with automatic OCR for scanned documents.

Request

Content Type

multipart/form-data

Parameters

pdf_file

file

PDF file to process (optional if pdf_url is provided)

Must have .pdf extension
Cannot be empty
Cannot be used together with pdf_url

pdf_url

string

URL to download PDF from (optional if pdf_file is provided)

Must start with http:// or https://
Cannot be used together with pdf_file
Examples: CloudFront URLs, S3 URLs, direct file links
Download times out after 60 seconds; maximum file size is 50 MB
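The rules for pdf_file and pdf_url can be checked on the client before any bytes are uploaded. The following is a minimal sketch, not the server's implementation: validateSource and its argument names are hypothetical, the case-insensitive extension check and the empty-file message are assumptions, and the other messages are copied from the 400 responses documented for this endpoint.

```javascript
// Hypothetical client-side pre-check. The server remains the source of truth.
function validateSource({ fileName = null, fileSize = null, pdfUrl = null } = {}) {
  const hasFile = fileName !== null;
  const hasUrl = pdfUrl !== null && pdfUrl !== '';

  // pdf_file and pdf_url are mutually exclusive, and one of them is required.
  if (hasFile && hasUrl) return 'Please provide either a PDF file OR a URL, not both.';
  if (!hasFile && !hasUrl) return 'Please provide either a PDF file or a PDF URL.';

  if (hasFile) {
    // Assumption: the extension check ignores case.
    if (!fileName.toLowerCase().endsWith('.pdf')) {
      return 'Invalid file type. Please upload a PDF file.';
    }
    // Assumption: this message text is illustrative, not taken from the API.
    if (fileSize === 0) return 'PDF file is empty.';
    return null;
  }

  if (!/^https?:\/\//.test(pdfUrl)) {
    return 'Invalid URL. Must start with http:// or https://';
  }
  return null; // null means the source looks acceptable
}
```

A caller would typically run this with file.name and file.size from a File object and only build the FormData when the result is null.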

config

string

required

JSON string containing the tagging configuration. Schema (TaggingConfig):

api_key

string

required

OpenRouter API key for AI model access

model_name

string

default:"openai/gpt-4o-mini"

AI model to use for tag generation. Examples: openai/gpt-4o-mini, google/gemini-flash-1.5

num_pages

integer

default:3

Number of PDF pages to extract and analyze

Minimum: 1
Maximum: 10

num_tags

integer

default:8

Number of tags to generate

Minimum: 3
Maximum: 15

exclusion_words

array

List of words/phrases to exclude from generated tags. Example: ["government-india", "annual-report", "newsletter"]
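Because config travels as a JSON string inside a multipart form, it can help to apply the documented defaults and range limits in one place before serializing. This is a sketch under stated assumptions: buildConfig is a hypothetical helper, and whether the server rejects or clamps out-of-range values is not documented here, so the sketch simply refuses them on the client.

```javascript
// Defaults and limits as listed in the TaggingConfig schema.
const DEFAULTS = { model_name: 'openai/gpt-4o-mini', num_pages: 3, num_tags: 8 };
const LIMITS = { num_pages: [1, 10], num_tags: [3, 15] };

// Hypothetical helper: returns the JSON string to send in the "config" field.
function buildConfig(options) {
  if (!options || typeof options.api_key !== 'string' || options.api_key === '') {
    throw new Error('api_key is required');
  }
  const config = { ...DEFAULTS, ...options };

  // Assumption: fail early on the client instead of relying on server behavior.
  for (const [field, [min, max]] of Object.entries(LIMITS)) {
    const value = config[field];
    if (!Number.isInteger(value) || value < min || value > max) {
      throw new RangeError(`${field} must be an integer between ${min} and ${max}`);
    }
  }
  return JSON.stringify(config);
}
```

The returned string is what would be passed to formData.append('config', ...).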

exclusion_file

file

File containing words/phrases to exclude from tags

Supported formats: .txt, .pdf
One term per line or comma-separated
Lines starting with # are treated as comments
Merged with config.exclusion_words if both provided
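The .txt format rules above can be illustrated with a small parser. This is only a sketch of one reading of those rules, not the server's code: parseExclusionText and mergeExclusions are hypothetical names, and the whitespace trimming, the handling of indented # lines as comments, and the de-duplication on merge are all assumptions.

```javascript
// Parse exclusion-list text: one term per line or comma-separated,
// with lines starting with # treated as comments.
function parseExclusionText(text) {
  const terms = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) continue; // skip blanks and comments
    for (const part of trimmed.split(',')) {
      const term = part.trim();
      if (term !== '') terms.push(term);
    }
  }
  return terms;
}

// Merge file terms with config.exclusion_words.
// Assumption: exact duplicates are dropped and first-seen order is kept.
function mergeExclusions(configWords, fileTerms) {
  return [...new Set([...configWords, ...fileTerms])];
}
```

A file containing a comment line, then "government-india, annual-report", then "newsletter" would yield three terms under this reading.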

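Example Request Body

The request is sent as multipart/form-data. The JSON below is a logical view of the form fields, not a literal JSON body; note that config is itself a JSON-encoded string.
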
{
  "pdf_file": "<binary data>",
  "config": "{\"api_key\":\"sk-or-v1-...\",\"model_name\":\"openai/gpt-4o-mini\",\"num_pages\":3,\"num_tags\":8,\"exclusion_words\":[\"government\",\"report\"]}"
}

Response

Success Response (200 OK)

success

boolean

required

Indicates if processing was successful

document_title

string

required

Extracted document title from PDF metadata or filename

Example Response

{
  "success": true,
  "document_title": "Training Manual for Digital Governance",
  "tags": [
    "digital-governance",
    "e-learning",
    "skill-development",
    "capacity-building",
    "training-manual",
    "best-practices",
    "implementation-guide",
    "knowledge-sharing"
  ],
  "extracted_text_preview": "Chapter 1: Introduction to Digital Governance\n\nDigital governance encompasses the policies, procedures, and frameworks that guide the use of digital technologies in government operations. This manual provides comprehensive training on implementing digital governance initiatives at the state and local levels...",
  "processing_time": 4.23,
  "is_scanned": false,
  "extraction_method": "pypdf2",
  "ocr_confidence": null,
  "raw_ai_response": "digital-governance, e-learning, skill-development, capacity-building, training-manual, best-practices, implementation-guide, knowledge-sharing",
  "error": null
}

Error Responses

400 Bad Request

Both file and URL provided:

{
  "detail": "Please provide either a PDF file OR a URL, not both."
}

Neither file nor URL provided:

{
  "detail": "Please provide either a PDF file or a PDF URL."
}

Invalid file type:

{
  "detail": "Invalid file type. Please upload a PDF file."
}

Invalid URL format:

{
  "detail": "Invalid URL. Must start with http:// or https://"
}

Invalid config JSON:

{
  "detail": "Invalid config JSON format"
}

Insufficient text extracted:

{
  "detail": "Could not extract sufficient text from PDF. The document might be scanned or image-based without OCR support."
}

401 Unauthorized

{
  "detail": "Not authenticated"
}

500 Internal Server Error

{
  "detail": "Tag generation failed: <error message>"
}
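Every error response above carries its message in a detail field, so a client can map status codes to a next step in one function. The sketch below is illustrative: interpretResponse and the action labels are hypothetical, and treating a 200 response with success set to false as a failure is an assumption based on the success and error fields in the example response.

```javascript
// Assumption: these action labels are illustrative, not part of the API.
const ACTIONS = { 400: 'fix-request', 401: 'reauthenticate', 500: 'retry-or-report' };

// Turn an HTTP status and a parsed JSON body into a uniform result object.
function interpretResponse(status, body) {
  const payload = body ?? {};
  if (status === 200 && payload.success === true) {
    return { ok: true, title: payload.document_title, tags: payload.tags ?? [] };
  }
  // Error responses use "detail"; fall back to "error" for a failed 200.
  const detail =
    typeof payload.detail === 'string' ? payload.detail
    : typeof payload.error === 'string' ? payload.error
    : 'Unknown error';
  return { ok: false, status, detail, action: ACTIONS[status] ?? 'report' };
}
```

With fetch, this could be called as interpretResponse(response.status, await response.json()).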

Example Usage

Using cURL with File Upload

curl -X POST "http://localhost:8000/api/single/process" \
  -H "Authorization: Bearer <your_token>" \
  -F "pdf_file=@/path/to/document.pdf" \
  -F 'config={"api_key":"sk-or-v1-...","model_name":"openai/gpt-4o-mini","num_pages":3,"num_tags":8}'

Using cURL with URL

curl -X POST "http://localhost:8000/api/single/process" \
  -H "Authorization: Bearer <your_token>" \
  -F "pdf_url=https://example.com/document.pdf" \
  -F 'config={"api_key":"sk-or-v1-...","model_name":"openai/gpt-4o-mini","num_pages":3,"num_tags":8}'

Using cURL with Exclusion List

curl -X POST "http://localhost:8000/api/single/process" \
  -H "Authorization: Bearer <your_token>" \
  -F "pdf_file=@/path/to/document.pdf" \
  -F 'config={"api_key":"sk-or-v1-...","model_name":"openai/gpt-4o-mini","num_pages":3,"num_tags":8}' \
  -F "exclusion_file=@/path/to/exclusion-list.txt"

Using JavaScript (fetch)

const formData = new FormData();
formData.append('pdf_file', pdfFile); // File object

const config = {
  api_key: 'sk-or-v1-...',
  model_name: 'openai/gpt-4o-mini',
  num_pages: 3,
  num_tags: 8,
  exclusion_words: ['government', 'report']
};
formData.append('config', JSON.stringify(config));

const response = await fetch('http://localhost:8000/api/single/process', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`
  },
  body: formData
});

const result = await response.json();
console.log('Generated tags:', result.tags);

Features

Automatic OCR Detection

The endpoint automatically detects scanned PDFs and applies OCR using a 3-tier strategy:

1. PyPDF2 (fastest) - Tries text-based extraction first
2. Tesseract OCR (fast) - Fallback for scanned documents, good for Hindi/English
3. EasyOCR (most accurate) - Automatic fallback if Tesseract confidence < 60%
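The fallback order above can be sketched as a function that takes the three extractors as injected dependencies. This illustrates the control flow only, not the backend's implementation: extractWithFallback, the extractors object, and MIN_TEXT_LENGTH are hypothetical, the real threshold for "sufficient text" is not documented, and the extraction_method values other than "pypdf2" are guesses.

```javascript
const TESSERACT_MIN_CONFIDENCE = 60; // percent, from the 3-tier strategy
const MIN_TEXT_LENGTH = 100; // assumption: the real cutoff is not documented

// extractors.pypdf2(pdf) returns a string; the OCR extractors return
// { text, confidence } with confidence on a 0-100 scale (assumption).
function extractWithFallback(pdf, extractors) {
  // Tier 1: plain text extraction, no OCR.
  const direct = extractors.pypdf2(pdf);
  if (direct.trim().length >= MIN_TEXT_LENGTH) {
    return { text: direct, is_scanned: false, extraction_method: 'pypdf2', ocr_confidence: null };
  }
  // Tier 2: Tesseract, accepted only when it is confident enough.
  const tess = extractors.tesseract(pdf);
  if (tess.confidence >= TESSERACT_MIN_CONFIDENCE) {
    return { text: tess.text, is_scanned: true, extraction_method: 'tesseract', ocr_confidence: tess.confidence };
  }
  // Tier 3: EasyOCR as the final fallback.
  const easy = extractors.easyocr(pdf);
  return { text: easy.text, is_scanned: true, extraction_method: 'easyocr', ocr_confidence: easy.confidence };
}
```

Injecting the extractors keeps the ordering logic testable without any OCR engine installed.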

Exclusion List Filtering

The system uses a two-layer approach to filter excluded terms:

1. Pre-generation: AI is instructed to avoid excluded terms
2. Post-processing: Any excluded terms that slip through are filtered out

Guaranteed tag count: If you request 8 tags and 2 are filtered out, you still receive 8 tags, because the system requests extra tags from the AI to compensate.
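The post-processing layer and the over-request idea can be sketched as follows. This is an assumption-laden illustration rather than the actual filter: filterTags and requestCount are hypothetical names, the case-insensitive exact matching is a guess at how terms are compared, and the number of extra tags requested is not documented, so the buffer is a placeholder.

```javascript
// Assumption: terms are compared case-insensitively after trimming.
function normalize(term) {
  return term.trim().toLowerCase();
}

// Hypothetical: how many tags to ask the model for. The real margin is unknown.
function requestCount(numTags, buffer = 3) {
  return numTags + buffer;
}

// Drop excluded or duplicate candidates, then keep the first numTags survivors.
function filterTags(candidates, exclusions, numTags) {
  const blocked = new Set(exclusions.map(normalize));
  const seen = new Set();
  const kept = [];
  for (const tag of candidates) {
    const key = normalize(tag);
    if (key === '' || blocked.has(key) || seen.has(key)) continue;
    seen.add(key);
    kept.push(tag.trim());
    if (kept.length === numTags) break;
  }
  return kept;
}
```

In this sketch, if the candidate list runs out before numTags survivors are found, fewer tags are returned, so the count holds only when the model supplies enough spare candidates.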

Language Support

Supports 80+ languages including:

Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi
English, Spanish, French, German, Chinese, Japanese, Korean
And many more via EasyOCR

Source Code

Implementation: backend/app/routers/single.py:25
Models: backend/app/models.py:8 (TaggingConfig), backend/app/models.py:24 (SinglePDFResponse)
