Overview
AI Math Notes uses OpenAI’s GPT-4o vision model to analyze hand-drawn mathematical equations and return calculated results. The integration involves base64 image encoding, structured prompts, and response parsing.OpenAI Client Setup
The OpenAI client is initialized in theDrawingApp constructor (main.py:43):
OPENAI_API_KEY environment variable is set (as per README setup instructions).
Image Encoding
Before sending the drawn equation to the API, the PIL Image must be converted to base64-encoded PNG.encode_image_to_base64()
This helper function (main.py:88-91) handles the conversion:
The PIL Image object containing the drawn equation
- Create an in-memory BytesIO buffer
- Save the PIL Image to the buffer in PNG format
- Encode the buffer contents to base64
- Decode bytes to UTF-8 string for API transmission
API Call Structure
Thecalculate() method (main.py:87-113) orchestrates the API request:
API Parameters
The GPT-4o vision model capable of analyzing images
Array containing a single user message with multimodal content (text + image)
Maximum tokens for the response (answers are typically short numbers)
Prompt Engineering
The prompt (main.py:101) is carefully designed to constrain the model’s output:
Prompt Strategy
- Output Format: “Only respond with numbers. NEVER Words” ensures numeric-only responses
- Task Clarity: “Give the answer to this math equation” defines the objective
- Conditional Logic: Only solve equations with incomplete equals signs (e.g.,
5 + 3 =not5 + 3 = 8) - Brevity: “Only respond with the answer” prevents explanations
5 + 3 =(needs solving) → Returns85 + 3 = 8(already solved) → Returns nothing
Multimodal Content Structure
The API accepts multimodal input via the content array:Image URL Format
The base64 image is embedded using a data URI:Response Parsing
The API response is parsed to extract the answer (main.py:112):
content contains the numeric result as a string.
Integration Flow
- User Action: User draws equation and presses Enter/Return (
main.py:26) - Event Trigger:
command_calculate()callscalculate()(main.py:116-117) - Image Encoding: PIL Image converted to base64 PNG
- API Request: Multimodal request sent to GPT-4o with prompt + image
- Response: Model returns numeric answer
- Display: Answer rendered on canvas via
draw_answer()(main.py:113)
Error Handling
The current implementation (main.py:87-113) does not include explicit error handling. Potential failure points:
- Network connectivity issues
- API authentication errors
- Rate limiting
- Invalid responses from the model
API Requirements
Environment Setup
From the README:Dependencies
Fromrequirements.txt:
Performance Considerations
- Image Size: 1200x800 canvas results in ~50-100KB base64 strings
- Latency: API calls typically complete in 1-3 seconds
- Token Usage: Responses use less than 10 tokens (just the numeric answer)
- Cost: GPT-4o vision pricing applies per API call
Example Request/Response
Request
Response
5 + 3 =, the model returns "8".