Overview
Thebuildsilver module provides functionality for generating silver training data from PDF documents using OpenAI’s GPT-4o model via the Batch API. It processes PDF pages, extracts anchor text, and constructs queries for vision-language model processing.
Core Function
build_page_query
Constructs an OpenAI Batch API request for processing a single PDF page.Path to the local PDF file to process
Display-friendly path used in the custom_id field for tracking
Page number to process (1-indexed)
Example Usage
OpenAI Batch API Format
The function returns a structured dictionary following OpenAI’s Batch API format:Request Structure
Format:
{pretty_pdf_path}-{page} - Used to track results back to sourceModel:
gpt-4o-2024-08-06 - Vision-capable model for OCR tasksSet to
0.1 for more deterministic outputsMaximum of
6000 tokens for comprehensive page extractionEnabled to help identify problematic responses later
Structured output schema ensuring consistent JSON responses (see Prompts API)
Prompt Construction
The query usesbuild_openai_silver_data_prompt() to create the text component:
Image Processing
Pages are rendered to base64-encoded PNG images at 2048px resolution:Best Practices
Use Batch API
The Batch API offers 50% cost savings compared to real-time API calls with the same performance.
Structured Outputs
Always use structured outputs for data processing tasks to eliminate LLM “fluff text” and ensure consistent parsing.
Schema Ordering
Fields in your schema are answered in order - place chain-of-thought fields first for better results.
Request Logprobs
Enable logprobs at no extra cost to identify low-confidence or problematic responses during analysis.
Check for Typos
Prompt quality directly impacts performance - review prompts carefully before processing thousands of documents.
Helper Functions
sample_pdf_pages
fetch_s3_file
process_pdf
s3://bucket/key).
Related APIs
- Prompts API - Prompt construction and response schemas
- Filter API - PDF filtering before processing