Overview
URL-based processing lets you process PDF documents directly from web URLs without downloading them first. Paste a link, and the system handles the rest.
It supports any publicly accessible HTTP/HTTPS URL, including CloudFront distributions, S3 public URLs, government document portals, and institutional repositories.
Supported URL Types
Direct PDF URLs
Standard web-hosted PDF files:
CloudFront URLs
AWS CloudFront CDN distributions:
S3 Public URLs
Publicly accessible S3 bucket objects:
Government Portals
Institutional document repositories:
How It Works
User Provides URL
Enter a PDF URL in the input field.
Validation:
- Must start with http:// or https://
- URL format must be valid
- File extension doesn't need to be .pdf (content type is checked server-side)
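The validation rules above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name is_valid_pdf_url is hypothetical and is not taken from the actual application.

```python
from urllib.parse import urlsplit


def is_valid_pdf_url(url: str) -> bool:
    """Return True if the URL passes the basic format checks listed above."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs
        # (for example, a broken IPv6 literal in the host part).
        return False
    # Only http:// and https:// are accepted, and a host must be present.
    # The path is deliberately not required to end in ".pdf".
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

The check is intentionally limited to format: whether the URL actually returns a PDF is decided server-side once the download has happened.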
Backend Download
The server downloads the PDF.
Security Headers:
- Custom User-Agent to avoid bot blocking
- Follow redirects automatically
- SSL verification enabled
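As a rough sketch of what such a download step might look like, the following uses only Python's standard library. The User-Agent string, the timeout, and the function names are assumptions made for illustration; the real backend's values are not documented here.

```python
import ssl
import urllib.request

# Hypothetical values chosen for this sketch only.
USER_AGENT = "Mozilla/5.0 (compatible; pdf-fetcher/1.0)"
TIMEOUT_SECONDS = 30


def build_request(url: str) -> urllib.request.Request:
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download_pdf(url: str) -> bytes:
    """Download the body at `url` with certificate verification enabled."""
    # create_default_context() verifies certificates and host names by default.
    context = ssl.create_default_context()
    # urlopen follows HTTP redirects automatically through its default handlers.
    with urllib.request.urlopen(
        build_request(url), timeout=TIMEOUT_SECONDS, context=context
    ) as response:
        return response.read()
```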
PDF Preview
CORS Proxy Endpoint
Problem: Browser CORS restrictions prevent direct PDF embedding from external URLs.
Solution: A backend proxy endpoint that:
- Downloads the PDF server-side
- Serves it with appropriate headers
- Allows iframe embedding
GET /api/single/preview?url={pdf_url}
Implementation:
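One way the response side of such a proxy could be structured is sketched below. It is not the application's actual code: the function name build_preview_response and the choice of a 502 status for non-PDF content are assumptions. Because the bytes are served from the application's own origin, the frontend iframe can load them without a cross-origin request to the source server.

```python
from typing import Dict, Tuple


def build_preview_response(pdf_bytes: bytes) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for serving a downloaded PDF inline."""
    # PDF files begin with the "%PDF-" marker; anything else is rejected.
    if not pdf_bytes.startswith(b"%PDF-"):
        # Assumption: report upstream non-PDF content as a gateway error.
        return 502, {"Content-Type": "text/plain"}, b"Upstream content is not a PDF"
    headers = {
        "Content-Type": "application/pdf",
        # "inline" asks the browser to render the file instead of saving it.
        "Content-Disposition": "inline",
        "Content-Length": str(len(pdf_bytes)),
    }
    return 200, headers, pdf_bytes
```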
Frontend Preview
Validation
Pre-Flight Validation (Batch Processing)
Endpoint: POST /api/batch/validate-paths
Request:
Error Handling
404 Not Found
Symptoms:
Causes:
- Incorrect URL
- File moved or deleted
- Typo in URL
Solutions:
- Verify URL in browser
- Check for redirects
- Update URL if file moved
403 Forbidden
Symptoms:
Causes:
- Authentication required
- IP-based blocking
- Referrer checking
- Bot detection
Solutions:
- Check if URL requires login
- Use publicly accessible URL
- Contact document owner for public link
Timeout
Symptoms:
Causes:
- Slow server response
- Large file size
- Network congestion
Solutions:
- Try again later
- Use smaller file
- Download and upload file directly instead
CORS Error (Preview Only)
Symptoms: Preview shows a blank page, but processing works
Cause: Source server blocks cross-origin iframe embedding
Impact: Only affects preview; processing still works (server-side download)
Solution:
- Preview may not work, but you can still process
- Backend proxy should handle most cases
- Download file and upload directly if preview needed
SSL Certificate Error
Symptoms:
Causes:
- Expired certificate
- Self-signed certificate
- Certificate mismatch
Solutions:
- Contact site administrator
- Use HTTPS URL if available
- Download file manually if site trusted
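The download failures described so far (404, 403, timeout, SSL) can be told apart programmatically. The sketch below assumes the download is done with Python's urllib, which may not match the real backend; the function name and the category strings it returns are hypothetical.

```python
import socket
import ssl
import urllib.error


def classify_download_error(exc: Exception) -> str:
    """Map an exception raised during download to a coarse error category."""
    # HTTPError must be checked before URLError because it is a subclass of it.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 404:
            return "not_found"
        if exc.code == 403:
            return "forbidden"
        return "http_error"
    # urlopen wraps lower-level failures in URLError and exposes them as .reason.
    cause = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(cause, ssl.SSLError):
        return "ssl_error"
    if isinstance(cause, (TimeoutError, socket.timeout)):
        return "timeout"
    return "unknown"
```

A caller could use the returned category to pick which of the troubleshooting hints above to show to the user.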
Content-Type Mismatch
Symptoms:
Causes:
- URL points to HTML page, not PDF
- Server misconfigured (wrong Content-Type header)
- URL requires query parameters
Solutions:
- Verify URL ends in .pdf or returns PDF content
- Check that the URL downloads a file in the browser (rather than displaying a page)
- Copy download link, not page URL
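A content check along these lines can be sketched as follows. The fallback to the file signature is an assumption about how a misconfigured Content-Type header might be tolerated, and the function name looks_like_pdf is hypothetical.

```python
from typing import Optional

# PDF files start with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type: Optional[str], first_bytes: bytes) -> bool:
    """Decide whether a response is a PDF from its header and leading bytes."""
    # Strip parameters such as "; charset=binary" and normalise case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Misconfigured servers may send e.g. application/octet-stream,
    # so fall back to the "%PDF-" signature at the start of the body.
    return first_bytes.startswith(PDF_MAGIC)
```

An HTML landing page fails both tests, which corresponds to the first cause listed above.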
Best Practices
Use Cases
Government Document Processing
Scenario: Process 100 government reports from an official portal, with the URLs supplied in a CSV.
Benefits:
- No manual downloads
- Direct processing from source
- Links remain valid (official sources)
CloudFront CDN Processing
Scenario: Process documents from S3 via CloudFront, with the URLs supplied in a CSV.
Benefits:
- Fast downloads (CDN edge servers)
- High availability
- Reduced S3 egress costs
Institutional Repository Processing
Scenario: Process research papers from a university repository, with the URLs supplied in a CSV.
Benefits:
- Direct academic source
- Permanent DOI/handle URLs
- No copyright issues (public repository)
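For the CSV-driven scenarios above, collecting the URLs before submitting them might look like the sketch below. The column name url and the function name read_urls are assumptions; the CSV layout the batch feature actually expects is not shown in this document.

```python
import csv
import io
from typing import List


def read_urls(csv_text: str, column: str = "url") -> List[str]:
    """Collect the non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    urls = []
    for row in reader:
        # row.get() returns None when the column is missing from a short row.
        value = (row.get(column) or "").strip()
        if value:
            urls.append(value)
    return urls
```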
API Reference
Single Document Processing
Endpoint: POST /api/single/process
Form Data:
Preview Endpoint
Endpoint: GET /api/single/preview?url={pdf_url}
Query Parameters:
- url: URL-encoded PDF URL
Response:
- Content-Type: application/pdf
- Content-Disposition: inline
- Body: PDF file bytes
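Because the url parameter must itself be URL-encoded, building the preview request by hand is easy to get wrong. The helper below is a small sketch; build_preview_url and the api_base argument are hypothetical names, and only the /api/single/preview path comes from this reference.

```python
from urllib.parse import quote


def build_preview_url(api_base: str, pdf_url: str) -> str:
    """Build a preview request URL with the PDF URL percent-encoded."""
    # safe="" also encodes "/", ":", "?" and "&", so the PDF URL's own
    # query string cannot be mistaken for parameters of the preview request.
    encoded = quote(pdf_url, safe="")
    return f"{api_base.rstrip('/')}/api/single/preview?url={encoded}"
```

For example, a PDF URL containing a space and its own query string ends up as a single, fully encoded url value.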
Batch Validation
Endpoint: POST /api/batch/validate-paths
Request:
Troubleshooting
Test URL in Browser
Open the URL in your browser:
- Should download a PDF file
- Should NOT display an HTML page
- Should NOT require login
Check Content-Type
Use browser developer tools (Network tab):
- Look for the Content-Type: application/pdf header
- Verify response status is 200 OK
Verify URL Format
- Must start with http:// or https://
- Should be properly URL-encoded
- No special characters without encoding
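If a URL contains spaces or other special characters, it can be percent-encoded before it is submitted. The following is a minimal sketch with the standard library; the function name encode_url is hypothetical, and the choice of which characters to leave untouched is an assumption.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode unsafe characters in the path and query of a URL."""
    parts = urlsplit(url)
    # "%" is kept safe so sequences that are already encoded are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

URLs that are already encoded pass through unchanged, so the helper can be applied to every input without checking first.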