Overview
The dolmaviewer module generates interactive HTML pages from JSONL files containing document data. It supports both local and S3 paths, renders PDF pages as images, and creates pre-signed URLs for S3 resources.
Main Function
main
List of paths to JSONL files (local or s3://). Supports glob patterns for local paths.
Directory where HTML files will be saved
Path to the Jinja2 template file for HTML generation
AWS profile name to use for accessing S3 resources
Core Functions
read_jsonl
List of file paths to read (local or S3)
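A minimal sketch of the local half of this step, assuming each non-empty line holds one JSON object. The helper name iter_local_jsonl is hypothetical and is not the module's read_jsonl; s3:// paths are skipped here because reading them needs a Boto3 client.

```python
import glob
import json
from typing import Dict, Iterable, Iterator


def iter_local_jsonl(paths: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty line from local JSONL files.

    Hypothetical helper for illustration; not the module's read_jsonl.
    """
    for pattern in paths:
        if pattern.startswith("s3://"):
            # S3 objects need a Boto3 client; out of scope for this sketch.
            continue
        # Expand glob patterns; sorted() keeps the file order stable.
        for filename in sorted(glob.glob(pattern)):
            with open(filename, "r", encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Yielding documents one at a time keeps memory use flat for large files; wrapping the call in list() gives an in-memory list when that is more convenient.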
generate_presigned_url
Boto3 S3 client instance
Name of the S3 bucket
S3 object key
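For orientation, Boto3 S3 clients expose a generate_presigned_url method that takes the operation name, its parameters, and an expiry in seconds. The sketch below wraps that call. The wrapper name presign_get and the one-hour default expiry are assumptions and are not taken from the module.

```python
from typing import Any


def presign_get(s3_client: Any, bucket_name: str, key_name: str,
                expires_in: int = 3600) -> str:
    """Return a time-limited GET URL for the given bucket and key.

    Hypothetical wrapper; the default expiry is an assumption.
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket_name, "Key": key_name},
        ExpiresIn=expires_in,
    )
```

With a real client this could be called as presign_get(boto3.client("s3"), "my-bucket", "docs/file.pdf"), where the bucket and key are placeholders.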
process_document
Dictionary containing document data with keys: id, text, attributes, metadata
Boto3 S3 client for downloading PDFs and generating URLs
Jinja2 template object for rendering HTML
Directory where the HTML file will be saved
- Extracts PDF page information from the document data
- Downloads the source PDF from S3
- Renders each PDF page as a base64-encoded WebP image
- Converts markdown text to HTML
- Generates a pre-signed S3 URL for the source PDF
- Renders the final HTML using the template
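The image-embedding step can be illustrated without a PDF renderer. This sketch assumes a page has already been rendered to WebP bytes by some other tool. It only shows how such bytes can become a string usable as an img src value. Whether the module passes a full data URI or only the base64 payload to the template is not stated in the source, and the function name is hypothetical.

```python
import base64


def webp_bytes_to_data_uri(image_bytes: bytes) -> str:
    """Encode already-rendered WebP bytes as a data URI for an <img> src."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/webp;base64,{encoded}"
```

Embedding images this way keeps each generated HTML file self-contained, at the cost of roughly a one-third size increase from base64 encoding.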
HTML Generation Workflow
The viewer follows this workflow to generate HTML previews:
- Read JSONL files - Loads document data from local or S3 paths
- Process each document - Extracts text, metadata, and PDF page information
- Render PDF pages - Converts PDF pages to base64 WebP images
- Convert markdown - Transforms markdown text to HTML with table support
- Generate S3 links - Creates pre-signed URLs for source PDFs
- Render HTML - Uses Jinja2 template to create the final HTML page
- Save output - Writes HTML files to the output directory
Usage Example
Command Line Usage
Document Data Format
The JSONL files should contain one document per line, each with the keys id, text, attributes, and metadata.
pdf_page_numbers is a list of [start_index, end_index, page_number] tuples mapping text ranges to PDF pages.
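To make the mapping concrete, the sketch below groups a document's text by page. It assumes start_index and end_index are character offsets into text with an exclusive end; the source states neither, so treat both as assumptions. The sample record, including the placement of pdf_page_numbers under attributes, is invented for illustration.

```python
from typing import Dict, Sequence

# Invented sample record; the placement of pdf_page_numbers is an assumption.
sample = {
    "id": "doc-0",
    "text": "First page.\nSecond page.",
    "attributes": {"pdf_page_numbers": [[0, 12, 1], [12, 24, 2]]},
    "metadata": {},
}


def split_text_by_page(text: str, spans: Sequence[Sequence[int]]) -> Dict[int, str]:
    """Group text by page, assuming [start, end) character offsets."""
    pages: Dict[int, str] = {}
    for start, end, page in spans:
        # Concatenate in case one page is covered by several spans.
        pages[page] = pages.get(page, "") + text[start:end]
    return pages


pages = split_text_by_page(sample["text"], sample["attributes"]["pdf_page_numbers"])
```

Here pages maps page 1 to "First page.\n" and page 2 to "Second page.".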