Overview
The single document processing feature generates metadata tags for individual PDF files. Upload a document or provide a URL, configure your settings, and get AI-generated tags in seconds.
It supports both file uploads (up to 50MB) and URL-based processing from any publicly accessible HTTP/HTTPS source.
How It Works
Upload Your Document
Choose between two input methods:
File Upload
- Drag and drop a PDF file onto the upload zone
- Or click to browse and select from your computer
- Maximum file size: 50MB
URL Input
- Paste a direct link to any publicly accessible PDF
- Supports CloudFront URLs, S3 URLs, government portals, and more
- Examples:
https://example.com/document.pdf
https://d1581jr3fp95xu.cloudfront.net/path/to/file.pdf
https://socialjustice.gov.in/writereaddata/UploadFile/66991763713697.pdf
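The input limits above can be sketched as a pair of checks. This is an illustrative sketch only, not the service's actual validation code: the function names and error messages are hypothetical, the ".pdf" extension check is an assumption, and the 50MB limit is assumed to mean 50 * 1024 * 1024 bytes.

```python
from urllib.parse import urlparse

# 50MB limit from the text; the exact byte count is an assumption.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems with an uploaded file (empty if none)."""
    problems = []
    # Extension check is an assumption; the real service may inspect content.
    if not filename.lower().endswith(".pdf"):
        problems.append("File must be a PDF")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("File exceeds the 50MB limit")
    return problems


def validate_pdf_url(url: str) -> list[str]:
    """Return a list of problems with a PDF URL (empty if none)."""
    problems = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must use HTTP or HTTPS")
    if not parsed.netloc:
        problems.append("URL must include a host")
    return problems
```

The URL check deliberately does not require a ".pdf" suffix, since CDN and portal links do not always end in one.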
Preview Your Document
After uploading or entering a URL, an embedded PDF viewer appears automatically:
- In-page preview: 800px height iframe for quick reference
- Fullscreen mode: Click the fullscreen button to view the entire document
- Keyboard shortcut: Press ESC to exit fullscreen
For URL-based documents, the preview uses a CORS proxy endpoint (/api/single/preview) to bypass browser restrictions.
Configure Processing Settings
In the configuration panel (typically on the right), set:
- OpenRouter API Key (required)
- Model Name (e.g., google/gemini-flash-1.5, openai/gpt-4o-mini)
- Number of Pages: How many pages to extract (default: 3)
- Number of Tags: Target tag count (default: 8)
- Exclusion List (optional): Upload a file to filter out common terms
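The settings above could be modeled as a small configuration object. This is a sketch under assumptions: only the name num_pages and the type name TaggingConfig appear elsewhere in this document, so the other field names (api_key, model_name, num_tags) and the validation rules are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TaggingConfig:
    api_key: str        # OpenRouter API key (required); field name is assumed
    model_name: str     # e.g. "google/gemini-flash-1.5"; field name is assumed
    num_pages: int = 3  # pages to extract (documented default)
    num_tags: int = 8   # target tag count (documented default); name is assumed

    def validate(self) -> None:
        """Raise ValueError if a required setting is missing or invalid."""
        if not self.api_key.strip():
            raise ValueError("OpenRouter API key is required")
        if not self.model_name.strip():
            raise ValueError("Model name is required")
        if self.num_pages < 1 or self.num_tags < 1:
            raise ValueError("num_pages and num_tags must be at least 1")

    def to_json(self) -> str:
        """Validate, then serialize the settings as a JSON string."""
        self.validate()
        return json.dumps(asdict(self))
```

For example, TaggingConfig(api_key="...", model_name="openai/gpt-4o-mini").to_json() produces a JSON string with the default page and tag counts filled in.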
Document Preview Features
Standard Preview
The embedded preview displays your PDF at 800px height with:
- Native browser PDF controls (zoom, download, print)
- Scroll support for multi-page documents
- Automatic loading from file or URL
Fullscreen Mode
Fullscreen Preview Details
When you click the Fullscreen button:
- Immersive View: Dark overlay with centered PDF viewer
- Modal Header: Document icon, title, and close button
- Keyboard Shortcut: Press ESC to close instantly
- Full Browser Window: Maximizes PDF viewing area
- Clean UI: Gradient header with visual polish
Processing Results
Success Banner
When processing completes successfully, you’ll see:
Processing Time
Displays how long the extraction and AI tagging took (e.g., “Processed in 4.2s”)
Extraction Method
Shows which engine was used:
- pypdf2: Text-based PDF (fastest)
- tesseract_ocr: Scanned document with Tesseract
- easyocr: Complex language document with EasyOCR
- pymupdf_tesseract: Fallback for complex embedded images
Document Metadata
Document Title
- Extracted from PDF metadata or first substantive line
- Displayed prominently at top of results
Document Type
- Text PDF: Blue badge for digitally-created documents
- Scanned PDF: Purple badge with confidence percentage (e.g., “Scanned PDF · 87% confidence”)
Generated Tags
Tags are displayed with color-coded categories.
- Click to copy all tags as comma-separated text
- Uses navigator.clipboard.writeText(result.tags.join(', '))
- Convenient for pasting into other systems
Extracted Text Preview
A scrollable preview (max 240px height) shows:
- First few hundred characters of extracted text
- Monospace font for readability
- Helps verify extraction quality
- Useful for debugging OCR issues
Smart Document Detection
Automatic OCR Triggers
The system automatically uses OCR when any of the following conditions is met:
Low Text Density
If PyPDF2 extracts fewer than 150 characters per page, the document is likely scanned.
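A minimal sketch of the density check, assuming the threshold applies to the average character count across the extracted pages. The function name and the handling of an empty page list are hypothetical, not taken from the actual implementation.

```python
MIN_CHARS_PER_PAGE = 150  # threshold stated in the text


def is_low_text_density(page_texts: list[str],
                        min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when the average extracted characters per page is too low."""
    if not page_texts:
        # No pages extracted at all: treated as needing OCR (assumption).
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

For instance, two pages yielding 40 and 60 characters average 50 per page, which falls under the threshold and would flag the document as likely scanned.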
Garbled Encoding Detected
Legacy Indian PDFs (Krutidev, ISM fonts) map Devanagari glyphs to code points in the Latin-1 Supplement and Latin Extended ranges (U+00A0–U+024F). PyPDF2 reads these as garbage like "ºÉÚSÉxÉÉ". The system detects when more than 25% of characters fall in this range and triggers OCR.
Sparse Content
If the total extracted text is under 500 characters, or there are fewer than 50 words per page, the content is treated as sparse and OCR is triggered.
No Text Lines Found
If PyPDF2 returns empty or whitespace-only content, OCR is triggered.
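The garbled-encoding, sparse-content, and empty-text triggers can be sketched together as one decision function. This is an illustration under assumptions: the order of the checks, the returned reason labels, and the choice to ignore whitespace when computing the garbled ratio are all hypothetical.

```python
from typing import Optional

GARBLED_THRESHOLD = 0.25   # more than 25% of characters in U+00A0..U+024F
MIN_TOTAL_CHARS = 500      # sparse if total text is shorter than this
MIN_WORDS_PER_PAGE = 50    # sparse if fewer words per page than this


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in U+00A0..U+024F."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(1 for c in chars if 0x00A0 <= ord(c) <= 0x024F)
    return suspicious / len(chars)


def ocr_trigger(text: str, page_count: int) -> Optional[str]:
    """Return a reason label if OCR should run, or None if the text looks usable."""
    stripped = text.strip()
    if not stripped:
        return "no_text"
    if garbled_ratio(stripped) > GARBLED_THRESHOLD:
        return "garbled_encoding"
    if len(stripped) < MIN_TOTAL_CHARS:
        return "sparse_content"
    words_per_page = len(stripped.split()) / max(page_count, 1)
    if words_per_page < MIN_WORDS_PER_PAGE:
        return "sparse_content"
    return None
```

With the sample string from the text, six of the eight characters fall in the suspicious range, so the ratio is well above the 25% threshold. Ordinary accented text in European languages normally stays far below it.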
Hybrid OCR Strategy
When OCR is needed, the system uses a 3-tier approach:
Tesseract OCR (Fast Path)
- Speed: ~2-5 seconds per page at 300 DPI
- Languages: Hindi, English, Kannada, Tamil, Telugu, Bengali, etc.
- Confidence Threshold: If confidence ≥60%, Tesseract result is used
- Preprocessing: Grayscale conversion + contrast enhancement (2.0x)
EasyOCR (High Accuracy Fallback)
Triggered when:
- Tesseract confidence <60%
- Tesseract extracted <100 chars
- Complex Indian scripts detected
Features:
- 80+ languages including all major Indian languages
- CNN + LSTM neural networks for superior accuracy
- Subprocess isolation: Runs in separate process to prevent OOM crashes
- Image resizing: Max 1500px dimension to prevent memory issues
- Timeout: 120-second limit per document
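The fallback rule and the image size cap can be sketched as follows. This is a hedged illustration rather than the actual implementation: OcrResult, needs_easyocr, and fit_within are hypothetical names, and how a complex script is detected is not described in this document, so it is passed in as a plain flag.

```python
from dataclasses import dataclass

TESSERACT_MIN_CONFIDENCE = 60.0  # Tesseract result is used at or above this
TESSERACT_MIN_CHARS = 100        # fewer extracted characters triggers EasyOCR
MAX_IMAGE_DIMENSION = 1500       # longest side allowed before EasyOCR runs


@dataclass
class OcrResult:
    text: str
    confidence: float  # percentage, 0 to 100


def needs_easyocr(tesseract: OcrResult, complex_script: bool = False) -> bool:
    """Return True when the Tesseract result should be replaced by EasyOCR."""
    return (
        tesseract.confidence < TESSERACT_MIN_CONFIDENCE
        or len(tesseract.text.strip()) < TESSERACT_MIN_CHARS
        or complex_script
    )


def fit_within(width: int, height: int,
               max_dim: int = MAX_IMAGE_DIMENSION) -> tuple[int, int]:
    """Scale dimensions down, keeping aspect ratio, so neither exceeds max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A 3000x2000 page image would be scaled to 1500x1000 before EasyOCR runs, while an 800x600 image is left unchanged.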
Input Validation
Pre-Submission Checks
The Generate Tags button stays disabled until the required inputs are provided. Error messages appear above the button when validation fails.
Reset and Refresh
Refresh Button
Click the Refresh button (top-right) to:
- Clear the current PDF upload
- Remove URL input
- Reset results display
- Clear error messages
- Preserve configuration settings (API key, model, exclusion list)
URL Processing Details
Supported URL Types
Direct PDF URLs
https://example.com/document.pdf
Standard web-hosted PDFs
CloudFront URLs
https://d1581jr3fp95xu.cloudfront.net/path/to/file.pdf
AWS CloudFront distributions
S3 Public URLs
https://bucket.s3.region.amazonaws.com/file.pdf
Publicly accessible S3 objects
Government Sites
https://socialjustice.gov.in/writereaddata/UploadFile/66991763713697.pdf
Institutional document repositories
Backend Download Process
API Endpoint
POST /api/single/process
Form data fields:
- pdf_file: File object (optional)
- pdf_url: String (optional)
- config: JSON string with TaggingConfig
- exclusion_file: File object (optional)
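The form fields listed above could be assembled like this before sending the request. This is a sketch under assumptions: build_request_parts is a hypothetical helper, the keys inside the config JSON are not specified here, and the rule that exactly one of pdf_file or pdf_url must be supplied is an assumption (the field list marks both as optional).

```python
import json
from typing import Optional


def build_request_parts(
    config: dict,
    pdf_url: Optional[str] = None,
    pdf_path: Optional[str] = None,
    exclusion_path: Optional[str] = None,
) -> tuple[dict, dict]:
    """Return (data, file_paths): text form fields and paths of files to attach."""
    # Assumption: the endpoint expects exactly one document source.
    if (pdf_url is None) == (pdf_path is None):
        raise ValueError("Provide exactly one of pdf_url or pdf_path")

    data = {"config": json.dumps(config)}  # config travels as a JSON string
    file_paths = {}
    if pdf_url is not None:
        data["pdf_url"] = pdf_url
    else:
        file_paths["pdf_file"] = pdf_path
    if exclusion_path is not None:
        file_paths["exclusion_file"] = exclusion_path
    return data, file_paths
```

With an HTTP client such as requests, the result could then be sent as a multipart POST to /api/single/process by passing data as the form fields and opening each path in file_paths in binary mode as an attached file. The base URL depends on your deployment.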
Best Practices
Troubleshooting
Preview shows blank page
Cause: The source URL blocks embedding, or CORS restrictions apply
Solution:
- The proxy endpoint (/api/single/preview) should handle most cases
- If preview fails, you can still process the document
- Processing uses server-side download (not affected by CORS)
No tags generated
OCR confidence very low
Cause: Poor scan quality, complex layouts, or unsupported scripts
Solution:
- System automatically tries EasyOCR as fallback
- Check extracted text preview to verify content quality
- Consider preprocessing the PDF (increase contrast, remove noise)
Processing takes too long
Cause: Large file, OCR processing, or slow AI model
Solution:
- Reduce num_pages in configuration
- Use faster models (Gemini Flash instead of GPT-4)
- Ensure good network connection for URL-based documents