Overview
Silver training data is synthetic training data generated with GPT-4o, which converts PDF pages into clean, structured text. The process uses the OpenAI Batch API for cost-effective processing at scale: the Batch API is 50% cheaper than the standard API with identical output quality, which yields significant savings for large-scale data generation.
Silver Data Generation Strategy
The silver data generation process combines:
- Visual rendering - PDFs rendered to PNG images at 2048px resolution
- Anchor text - Raw text extracted with positional information
- GPT-4o - Vision model to generate clean text output
- Structured outputs - JSON schema for consistent responses
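A request that combines these pieces can be sketched as follows. The helper name, prompt wording, and model snapshot are illustrative assumptions, not the exact code in buildsilver.py:

```python
import base64


def build_page_request(png_bytes: bytes, anchor_text: str) -> dict:
    """Combine a rendered page image and its anchor text into one
    chat-completions request body (helper name and prompt wording are
    assumptions for illustration)."""
    # Images are sent inline as a base64 data URL
    image_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
    return {
        "model": "gpt-4o-2024-08-06",  # assumed model snapshot
        "messages": [
            {
                "role": "user",
                "content": [
                    # Anchor text gives the model a textual scaffold...
                    {"type": "text",
                     "text": "Convert this page to clean text.\n\nRAW TEXT:\n" + anchor_text},
                    # ...while the rendered page supplies layout and handwriting
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
```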
Why “Silver” Data?
Silver data sits between gold (human-annotated) and bronze (raw) data:
- More scalable than gold data (no manual annotation)
- Higher quality than bronze data (AI-cleaned)
- Cost-effective for training OCR models
Using buildsilver.py
Basic Usage
Arguments
- Local or S3 path glob (e.g., `*.pdf` or `s3://bucket/pdfs/*.pdf`)
- Path to a file containing PDF paths, one per line
- Number of PDF documents to sample
- Always sample the first N pages of each PDF
- Max number of pages to sample per PDF
- Disable basic spam/language filtering
- Output directory for JSONL batch files
- Size of the reservoir for sampling paths; defaults to 10x `num_sample_docs`
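The reservoir mentioned in the last argument refers to reservoir sampling (Algorithm R), which keeps a uniform random sample of paths in fixed memory while streaming over an arbitrarily long path list. A minimal sketch; the function name is an assumption:

```python
import random


def reservoir_sample(paths_iter, k, seed=None):
    """Algorithm R: one pass over an iterator of paths, keeping a
    uniform random sample of size k in O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, path in enumerate(paths_iter):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(path)
        else:
            # Item i survives with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = path
    return reservoir
```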
S3 Integration
Process PDFs directly from S3. The script:
- Downloads PDFs from S3 to `/tmp`
- Processes them locally
- Cleans up temporary files
Page Sampling Strategy
The script uses intelligent page sampling (see `sample_pdf_pages` in buildsilver.py:87-94).
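Based on the arguments described above (always keep the first N pages, then cap the total at a per-PDF maximum), the strategy can be sketched as follows; the signature is an assumption, not the exact code:

```python
import random


def sample_pdf_pages(num_pages, first_n_pages, max_sample_pages):
    """Illustrative re-creation of the sampling strategy: always take
    the first N pages, then fill the remaining budget with a uniform
    random sample of later pages (signature is an assumption)."""
    if num_pages <= first_n_pages:
        # Short document: keep every page
        return list(range(1, num_pages + 1))
    pages = list(range(1, first_n_pages + 1))
    remaining = list(range(first_n_pages + 1, num_pages + 1))
    budget = max(0, max_sample_pages - first_n_pages)
    # Random sample of later pages, kept in reading order
    pages += sorted(random.sample(remaining, min(budget, len(remaining))))
    return pages
```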
GPT-4o Prompting Approach
Prompt Structure
The prompt (from prompts.py:7-17) asks GPT-4o to:
- Read the PDF page image naturally
- Convert equations to LaTeX
- Convert tables to markdown
- Remove headers/footers but keep references and footnotes
- Read any handwriting
- Preserve sentences that span pages
- Return null if no readable text exists
- Not hallucinate
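A paraphrase of these instructions, assembled into a single prompt string, might look like the following; the exact wording in prompts.py differs:

```python
def build_silver_prompt(anchor_text: str) -> str:
    """Assemble the page-transcription instructions and the anchor text
    into one prompt (a paraphrase, not the exact prompts.py wording)."""
    return (
        "Below is an image of a PDF page, plus raw text extracted from it. "
        "Transcribe the page into natural reading order. "
        "Convert equations to LaTeX and tables to markdown. "
        "Remove headers and footers, but keep references and footnotes. "
        "Read any handwriting. If a sentence continues from a previous "
        "page, keep it intact. If there is no readable text, return null. "
        "Do not hallucinate content that is not on the page.\n\n"
        "RAW_TEXT_START\n" + anchor_text + "\nRAW_TEXT_END"
    )
```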
Structured Output Schema
The response follows a strict JSON schema (prompts.py:49-96).
Field Descriptions
- primary_language: Two-letter language code or null if no text
- is_rotation_valid: Whether the page is oriented correctly for reading
- rotation_correction: Degrees of clockwise rotation needed (0, 90, 180, or 270)
- is_table: Whether majority of content is tabular
- is_diagram: Whether majority of content is a visual diagram
- natural_text: The cleaned, natural text content
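A `response_format` payload matching these field descriptions might look like the following; the schema name and exact type annotations are assumptions:

```python
# Structured-outputs response_format matching the fields above.
# Strict mode requires every property to be listed in "required"
# and additionalProperties to be false.
page_response_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "page_response",  # assumed name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                # Preparatory fields first: the model answers fields
                # in schema order, so metadata precedes natural_text
                "primary_language": {"type": ["string", "null"]},
                "is_rotation_valid": {"type": "boolean"},
                "rotation_correction": {"type": "integer",
                                        "enum": [0, 90, 180, 270]},
                "is_table": {"type": "boolean"},
                "is_diagram": {"type": "boolean"},
                "natural_text": {"type": ["string", "null"]},
            },
            "required": ["primary_language", "is_rotation_valid",
                         "rotation_correction", "is_table",
                         "is_diagram", "natural_text"],
            "additionalProperties": False,
        },
    },
}
```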
OpenAI Batch API Usage
Output Format
The script generates JSONL files in the OpenAI Batch API request format (buildsilver.py:206-266). Files are automatically split at 99MB to stay under OpenAI’s 100MB batch file limit.
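The per-request line format and the 99MB rollover can be sketched as follows; helper names are illustrative:

```python
import json

MAX_BYTES = 99 * 1024 * 1024  # stay below OpenAI's 100MB batch-file limit


def to_batch_line(custom_id: str, body: dict) -> str:
    """One JSONL line in the Batch API request format."""
    return json.dumps({
        "custom_id": custom_id,          # echoed back in the output file
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,                    # a normal chat-completions request
    }) + "\n"


def split_batches(lines, max_bytes=MAX_BYTES):
    """Group lines into chunks, rolling over to a new chunk before the
    byte threshold would be crossed (mirrors the 99MB split above)."""
    chunks, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(current)
    return chunks
```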
Submitting Batches
After generating batch files, use the `runopenaibatch.py` script to automatically submit, monitor, and download results.
Processing Results
Once you have results, extract the natural text from each response.
Expert Tips on Structured Outputs
1. Use Batch API for Cost Savings
“Use the batch query system, it’s 1/2 the price and exactly the same performance”
The Batch API provides a 50% cost reduction with no performance penalty.
2. Always Use Structured Outputs
“If your application is not an actual chatbot, use structured outputs!”
Even if the last 10 queries returned perfect results, unstructured outputs won’t scale to thousands of queries: you’ll get inconsistent formatting and LLM fluff text.
3. Field Order Matters
“The order in which fields are in your schema is the order in which the model will answer them”
Place preparatory or chain-of-thought fields first, so the model reasons before committing to its final answer fields.
4. Check for Typos
“Check your prompt for typos, it makes a performance difference!”
Typos in prompts degrade model performance. Review prompts carefully.
5. Request Logprobs
“Ask for logprobs, it’s not any more expensive and you can use them later to help identify problematic responses”
Set `"logprobs": True` and `"top_logprobs": 5` in the request to get per-token confidence scores for filtering low-quality responses later.
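Putting result extraction and logprob scoring together, one line of a Batch API output file can be parsed and scored as sketched below; field paths follow the standard chat-completions response shape, and helper names are illustrative:

```python
import json
import math


def mean_confidence(token_logprobs):
    """Average per-token probability; logprobs come back at no extra
    cost, so this is a cheap post-hoc quality signal."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def extract_result(batch_output_line: str):
    """Pull the structured content and a confidence score out of one
    line of a Batch API output file."""
    rec = json.loads(batch_output_line)
    choice = rec["response"]["body"]["choices"][0]
    # Structured outputs put the JSON object in message.content
    parsed = json.loads(choice["message"]["content"])
    lps = [t["logprob"] for t in choice["logprobs"]["content"]]
    return parsed, mean_confidence(lps)
```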
Output Structure
The script writes the split JSONL batch files to the specified output directory.
Filtering Integration
By default, the script uses `PdfFilter` to exclude:
- Non-English documents
- PDF forms
- SEO spam PDFs
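The real `PdfFilter` logic is more involved, but crude stand-ins for the form and spam checks give the flavor; language detection is omitted here because it needs a language-ID model:

```python
def looks_like_pdf_form(raw_pdf_bytes: bytes) -> bool:
    """Crude stand-in for the form check: interactive PDF forms carry
    an /AcroForm entry in the document catalog."""
    return b"/AcroForm" in raw_pdf_bytes


def looks_like_seo_spam(text: str, max_repeat_ratio: float = 0.3) -> bool:
    """Toy spam heuristic (not the real PdfFilter logic): flag pages
    where a single token dominates the extracted text."""
    tokens = text.lower().split()
    if not tokens:
        return False
    most_common = max(tokens.count(t) for t in set(tokens))
    return most_common / len(tokens) > max_repeat_ratio
```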
Performance Considerations
Parallel Processing
The script uses `ProcessPoolExecutor` for parallel PDF processing (buildsilver.py:222-264).
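The pattern can be sketched as follows; `process_pdf` here is a stub standing in for the real per-PDF work (rendering pages and building requests):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def process_pdf(path: str) -> dict:
    """Stub for the per-PDF work (render pages, build requests);
    it must be a top-level function so worker processes can call it."""
    return {"path": path, "ok": True}


def process_all(paths, max_workers=4):
    """Fan the per-PDF work out across processes and collect results
    as they finish."""
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_pdf, p) for p in paths]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```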
Reservoir Sampling
For large datasets, reservoir sampling (buildsilver.py:152-201) ensures uniform random sampling without loading all paths into memory.
Next Steps
- Build Test Sets: create evaluation datasets
- Filter PDFs: configure PDF filtering