OCR Extraction

Sintesis can automatically extract structured data from uploaded PDF documents using an AI-backed OCR pipeline. Templates define which regions of a document contain which data, and the extraction engine uses those regions to prompt a vision model and map results back to tabla columns.

Architecture overview

User uploads PDF to obra folder
         │
         ▼
  PDF → rendered image
         │
         ▼
  Regions drawn on image
  (annotated as base64 data URL)
         │
         ▼
  POST /api/ocr-playground
  or  /api/obras/[id]/tablas/import/ocr-multi
         │
         ▼
  generateObject() → gpt-4o-mini
  (Vercel AI SDK, temperature 0.1)
         │
         ▼
  Structured JSON returned
  per region
         │
         ▼
  Rows inserted into obra_tabla

OCR templates

Templates are stored in the ocr_templates table. Each template belongs to a tenant, defines a reference document, and contains an array of annotated regions.

Schema

CREATE TABLE ocr_templates (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id         UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
  name              TEXT NOT NULL,
  description       TEXT,

  -- Reference document stored in Supabase Storage
  template_bucket   TEXT,
  template_path     TEXT,
  template_file_name TEXT,

  -- Rendered image dimensions for coordinate scaling
  template_width    INTEGER,
  template_height   INTEGER,

  -- Extraction regions as a JSON array
  regions           JSONB NOT NULL DEFAULT '[]',

  -- Column definitions derived from regions
  columns           JSONB NOT NULL DEFAULT '[]',

  is_active         BOOLEAN NOT NULL DEFAULT true,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

Template names must be unique per tenant among active templates. The uniqueness constraint uses a partial index so inactive (soft-deleted) templates can share names:

CREATE UNIQUE INDEX ocr_templates_name_unique
  ON ocr_templates (tenant_id, name)
  WHERE is_active = true;

Deleting a template is a soft delete: is_active is set to false. The record is preserved for historical reference in ocr_document_processing.

Region definition

Each entry in the regions array describes a rectangular bounding box on the reference document image:

type Region = {
  id: string;          // unique region identifier
  x: number;           // left edge, in rendered pixels
  y: number;           // top edge, in rendered pixels
  width: number;       // box width, in rendered pixels
  height: number;      // box height, in rendered pixels
  label: string;       // human-readable field name
  description?: string;
  color: string;       // display colour for the UI overlay
  type: "single" | "table";
  pageNumber?: number; // 1-indexed page (omit for page 1)
  tableColumns?: string[]; // column names for table regions
};

| Region type | Extraction result |
| --- | --- |
| `single` | One text value extracted from the bounding box |
| `table` | An array of row objects, one per visible row inside the box |
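
Region coordinates are expressed in rendered pixels of the reference image, and the template stores template_width and template_height for coordinate scaling. As a minimal sketch of what that scaling could look like, assuming a simple linear mapping per axis (the Box and Size types and the scaleBox helper are hypothetical, not part of the documented API):

```typescript
// Hypothetical helper: linear per-axis scaling is an assumption,
// not the documented implementation.
type Box = { x: number; y: number; width: number; height: number };
type Size = { width: number; height: number };

function scaleBox(box: Box, template: Size, target: Size): Box {
  if (template.width <= 0 || template.height <= 0) {
    throw new Error("template dimensions must be positive");
  }
  const sx = target.width / template.width;   // horizontal scale factor
  const sy = target.height / template.height; // vertical scale factor
  return {
    x: box.x * sx,
    y: box.y * sy,
    width: box.width * sx,
    height: box.height * sy,
  };
}

// A region drawn on a 1240x1754 reference image, mapped onto a
// render that is exactly half the size on both axes.
const scaled = scaleBox(
  { x: 80, y: 110, width: 250, height: 35 },
  { width: 1240, height: 1754 },
  { width: 620, height: 877 },
);
```

Keeping the scale factors separate per axis means the sketch still behaves sensibly if a document is rendered with a slightly different aspect ratio than the reference image.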

Column definitions

When a template is saved, the API derives a columns array from the regions automatically. Single-type regions produce one column with ocrScope: "parent". Table-type regions produce one column per tableColumns entry with ocrScope: "item":

type TemplateColumn = {
  fieldKey: string;           // snake_case key derived from label
  label: string;              // display label
  dataType: string;           // always "text" (extensible)
  ocrScope?: "parent" | "item";
  description?: string;
};
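
The derivation described above could be sketched roughly as follows. The exact rules the API uses to build fieldKey are not documented here, so toFieldKey (strip accents, lowercase, join with underscores) is an assumption, and RegionLike and deriveColumns are hypothetical names used only for illustration:

```typescript
// Hypothetical sketch: the key-derivation rules are assumptions.
type RegionLike = {
  label: string;
  description?: string;
  type: "single" | "table";
  tableColumns?: string[];
};

type TemplateColumn = {
  fieldKey: string;
  label: string;
  dataType: string;
  ocrScope?: "parent" | "item";
  description?: string;
};

// Assumed rule: strip accents, lowercase, collapse anything that is
// not a letter or digit into a single underscore.
function toFieldKey(label: string): string {
  return label
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

function deriveColumns(regions: RegionLike[]): TemplateColumn[] {
  const columns: TemplateColumn[] = [];
  for (const region of regions) {
    if (region.type === "table") {
      // One "item" column per declared table column.
      for (const name of region.tableColumns ?? []) {
        columns.push({ fieldKey: toFieldKey(name), label: name, dataType: "text", ocrScope: "item" });
      }
    } else {
      // One "parent" column per single-value region.
      columns.push({
        fieldKey: toFieldKey(region.label),
        label: region.label,
        dataType: "text",
        ocrScope: "parent",
        ...(region.description ? { description: region.description } : {}),
      });
    }
  }
  return columns;
}

const derived = deriveColumns([
  { label: "Número de contrato", type: "single" },
  { label: "Ítems de certificado", type: "table", tableColumns: ["Código", "Precio unitario"] },
]);
```

Under these assumed rules, a table region's own label does not become a column; only its tableColumns entries do.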

Document processing table

Every document that enters the OCR pipeline gets a record in ocr_document_processing:

CREATE TABLE ocr_document_processing (
  id                      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tabla_id                UUID NOT NULL REFERENCES obra_tablas(id) ON DELETE CASCADE,
  obra_id                 UUID NOT NULL REFERENCES obras(id) ON DELETE CASCADE,

  -- Source document
  source_bucket           TEXT NOT NULL,
  source_path             TEXT NOT NULL,
  source_file_name        TEXT NOT NULL,

  -- Lifecycle
  status                  TEXT NOT NULL DEFAULT 'pending'
                          CHECK (status IN ('pending', 'processing', 'completed', 'failed')),
  error_message           TEXT,
  rows_extracted          INTEGER DEFAULT 0,

  -- Template used
  template_id             UUID REFERENCES ocr_templates(id) ON DELETE SET NULL,

  -- Performance tracking
  processed_at            TIMESTAMPTZ,
  processing_duration_ms  INTEGER,
  retry_count             INTEGER NOT NULL DEFAULT 0,
  created_at              TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at              TIMESTAMPTZ NOT NULL DEFAULT now()
);

| Status | Meaning |
| --- | --- |
| `pending` | Document queued, not yet started |
| `processing` | Vision model call in progress |
| `completed` | Extraction succeeded; rows inserted into tabla |
| `failed` | Extraction failed; see `error_message` |
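
To make the lifecycle concrete, here is a minimal sketch of how a worker might compute the final column values once the vision model call returns. The column names come from the schema above; the Outcome type and the buildFinishPatch helper are hypothetical, and how the real pipeline writes these fields is not documented here:

```typescript
type OcrStatus = "pending" | "processing" | "completed" | "failed";

// Hypothetical patch shape mirroring ocr_document_processing columns.
type ProcessingPatch = {
  status: OcrStatus;
  error_message: string | null;
  rows_extracted: number;
  processed_at: string;
  processing_duration_ms: number;
};

type Outcome =
  | { ok: true; rowsExtracted: number }
  | { ok: false; error: string };

// Builds the terminal update for a record that was in "processing".
function buildFinishPatch(
  startedAtMs: number,
  finishedAtMs: number,
  outcome: Outcome,
): ProcessingPatch {
  return {
    status: outcome.ok ? "completed" : "failed",
    error_message: outcome.ok ? null : outcome.error,
    rows_extracted: outcome.ok ? outcome.rowsExtracted : 0,
    processed_at: new Date(finishedAtMs).toISOString(),
    processing_duration_ms: Math.max(0, Math.round(finishedAtMs - startedAtMs)),
  };
}

const successPatch = buildFinishPatch(1000, 3500, { ok: true, rowsExtracted: 2 });
const failurePatch = buildFinishPatch(1000, 1200, { ok: false, error: "model timeout" });
```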

OCR Playground

The playground lets you test extraction against any annotated image without persisting results. It is available at POST /api/ocr-playground.

Request format

{
  "annotatedImageDataUrl": "data:image/png;base64,...",
  "regions": [
    {
      "id": "r1",
      "x": 120,
      "y": 45,
      "width": 300,
      "height": 40,
      "label": "Número de contrato",
      "color": "#f97316",
      "type": "single"
    },
    {
      "id": "r2",
      "x": 80,
      "y": 200,
      "width": 500,
      "height": 300,
      "label": "Ítems de certificado",
      "color": "#3b82f6",
      "type": "table",
      "tableColumns": ["Código", "Descripción", "Cantidad", "Precio unitario", "Total"]
    }
  ]
}

annotatedImageDataUrl must be a base64 data URL that includes the bounding-box overlays already drawn on the image. The UI renders these overlays client-side before sending the request.
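
A client may want to check the value before sending it. The following is a rough pre-flight check, assuming a plain `data:image/<subtype>;base64,<payload>` form with no extra parameters; isBase64ImageDataUrl is a hypothetical helper, and the validation the endpoint itself performs is not documented here:

```typescript
// Hypothetical client-side check; the server's own validation may differ.
function isBase64ImageDataUrl(value: string): boolean {
  // data:image/<subtype>;base64,<base64 payload with optional padding>
  return /^data:image\/[a-z0-9.+-]+;base64,[A-Za-z0-9+/]+={0,2}$/i.test(value);
}

const looksValid = isBase64ImageDataUrl("data:image/png;base64,iVBORw0KGgo=");
const plainUrl = isBase64ImageDataUrl("https://example.com/page.png");
const wrongMediaType = isBase64ImageDataUrl("data:text/plain;base64,AAAA");
```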

Response format

{
  "ok": true,
  "results": [
    {
      "id": "r1",
      "label": "Número de contrato",
      "type": "single",
      "text": "CONT-2024-00412",
      "color": "#f97316"
    },
    {
      "id": "r2",
      "label": "Ítems de certificado",
      "type": "table",
      "rows": [
        { "Código": "01.01", "Descripción": "Hormigón H30", "Cantidad": "15.5", "Precio unitario": "12500", "Total": "193750" },
        { "Código": "01.02", "Descripción": "Armadura", "Cantidad": "420", "Precio unitario": "850", "Total": "357000" }
      ],
      "color": "#3b82f6"
    }
  ]
}
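
On the client, the two result shapes can be modelled as a discriminated union on type. This is a sketch inferred from the example response above rather than an official type definition; the type names and summarizeResults are hypothetical, and the nullable text and cell values are an assumption based on the prompt returning null for empty or illegible cells:

```typescript
// Types inferred from the example response; names are hypothetical.
type SingleResult = {
  id: string;
  label: string;
  type: "single";
  text: string | null; // assumed nullable
  color: string;
};

type TableResult = {
  id: string;
  label: string;
  type: "table";
  rows: Array<Record<string, string | null>>;
  color: string;
};

type PlaygroundResult = SingleResult | TableResult;

// Produces one human-readable line per region result.
function summarizeResults(results: PlaygroundResult[]): string[] {
  return results.map((r) =>
    r.type === "single"
      ? `${r.label}: ${r.text ?? "(empty)"}`
      : `${r.label}: ${r.rows.length} row(s)`,
  );
}

const summary = summarizeResults([
  { id: "r1", label: "Número de contrato", type: "single", text: "CONT-2024-00412", color: "#f97316" },
  {
    id: "r2",
    label: "Ítems de certificado",
    type: "table",
    rows: [
      { "Código": "01.01", "Total": "193750" },
      { "Código": "01.02", "Total": "357000" },
    ],
    color: "#3b82f6",
  },
]);
```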

AI model and prompting

Extraction uses GPT-4o mini via the Vercel AI SDK generateObject function:

const res = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: extractionSchema,   // Zod schema derived from regions
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: instructions },
        { type: "image", image: annotatedImageDataUrl },
      ],
    },
  ],
  temperature: 0.1,
});

The model receives numbered boxes drawn on the image. Single-value fields are labelled [N] and table regions are labelled [N]📊. The prompt instructs the model to:

- Extract text exactly as it appears, without interpretation.
- Return null for empty or illegible cells.
- For table regions, extract every visible row as a separate object.
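
The instructions string passed to the model is not reproduced in this page. As an illustration only, a prompt following the rules above might be assembled like this; the wording, the 1-based numbering, and the names PromptRegion, regionMarker and buildInstructions are all assumptions:

```typescript
// Hypothetical prompt builder; the real instructions text is not shown here.
type PromptRegion = {
  label: string;
  type: "single" | "table";
  tableColumns?: string[];
};

// [N] for single-value boxes, [N]📊 for table boxes.
function regionMarker(index: number, type: "single" | "table"): string {
  return type === "table" ? `[${index}]📊` : `[${index}]`;
}

function buildInstructions(regions: PromptRegion[]): string {
  const lines = regions.map((region, i) => {
    const marker = regionMarker(i + 1, region.type); // assumed 1-based
    if (region.type === "table") {
      const cols = (region.tableColumns ?? []).join(", ");
      return `${marker} ${region.label} (table; columns: ${cols})`;
    }
    return `${marker} ${region.label} (single value)`;
  });
  return [
    "Extract the content of each numbered box in the image.",
    "Extract text exactly as it appears, without interpretation.",
    "Return null for empty or illegible cells.",
    "For table regions, extract every visible row as a separate object.",
    "",
    ...lines,
  ].join("\n");
}

const instructionsPreview = buildInstructions([
  { label: "Número de contrato", type: "single" },
  { label: "Ítems de certificado", type: "table", tableColumns: ["Código", "Descripción"] },
]);
```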

Token cost estimation

// lib/ai-pricing.ts
export const AI_MODEL_COST_PER_1K_TOKENS: Record<string, number> = {
  "gpt-4o-mini": 0.00015, // USD per 1K tokens
};

export function estimateUsdForTokens(
  model: string | null,
  tokens: number
): number | null {
  const rate = AI_MODEL_COST_PER_1K_TOKENS[model ?? ""];
  if (!rate) return null;
  return (tokens / 1000) * rate;
}

GPT-4o mini is used intentionally over GPT-4o for cost efficiency. The generateObject schema also constrains the response shape, so the model returns only the requested fields rather than free-form text, which keeps the output token count down.
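
As a worked example of the arithmetic, the sketch below prices a small batch of extraction calls. The rate table and estimateUsdForTokens are repeated from lib/ai-pricing.ts so the block is self-contained; estimateBatchUsd is a hypothetical helper, and the token counts are illustrative figures, not measurements:

```typescript
// Repeated from lib/ai-pricing.ts above so the example runs on its own.
const AI_MODEL_COST_PER_1K_TOKENS: Record<string, number> = {
  "gpt-4o-mini": 0.00015, // USD per 1K tokens
};

function estimateUsdForTokens(model: string | null, tokens: number): number | null {
  const rate = AI_MODEL_COST_PER_1K_TOKENS[model ?? ""];
  if (!rate) return null;
  return (tokens / 1000) * rate;
}

// Hypothetical helper: sums token usage across calls, then prices the total.
function estimateBatchUsd(model: string | null, tokenCounts: number[]): number | null {
  const total = tokenCounts.reduce((sum, n) => sum + n, 0);
  return estimateUsdForTokens(model, total);
}

// Three calls of 4,000 tokens each (illustrative): 12,000 tokens in total,
// i.e. (12000 / 1000) * 0.00015, roughly 0.0018 USD.
const batchCost = estimateBatchUsd("gpt-4o-mini", [4000, 4000, 4000]);

// No rate is configured for this model name, so the estimate is null.
const unknownCost = estimateBatchUsd("some-other-model", [4000]);
```

Returning null for an unknown model lets callers distinguish "no estimate available" from a genuine zero cost.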

Managing templates via the API

List active templates

GET /api/ocr-templates

Returns all active templates for the authenticated user’s tenant, ordered by name.

Response

{
  "templates": [
    {
      "id": "<uuid>",
      "name": "Certificado de avance",
      "description": "Extrae número, período y tabla de ítems",
      "template_file_name": "cert-modelo.pdf",
      "regions": [...],
      "columns": [...],
      "is_active": true,
      "created_at": "2025-09-12T14:00:00Z"
    }
  ]
}

Create a template

POST /api/ocr-templates
Content-Type: application/json

{
  "name": "Certificado de avance",
  "description": "Extrae número, período y tabla de ítems",
  "templateBucket": "ocr-templates",
  "templatePath": "tenant-abc/cert-modelo.pdf",
  "templateFileName": "cert-modelo.pdf",
  "templateWidth": 1240,
  "templateHeight": 1754,
  "regions": [
    {
      "id": "r1",
      "x": 80, "y": 110, "width": 250, "height": 35,
      "label": "Número de certificado",
      "color": "#f97316",
      "type": "single",
      "pageNumber": 1
    }
  ]
}

Every region must include id, label, x, y, width, and height. Regions that fail validation are silently dropped. The request is rejected with 400 if no valid regions remain.
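
The drop-then-reject behaviour could look roughly like the sketch below. Only the list of required fields comes from this page; treating "valid" as non-empty strings and finite numbers is an assumption, and isValidRegion, validateRegions and the error text are hypothetical:

```typescript
type ValidRegion = {
  id: string;
  label: string;
  x: number;
  y: number;
  width: number;
  height: number;
};

function isFiniteNumber(v: unknown): v is number {
  return typeof v === "number" && Number.isFinite(v);
}

// Assumed validity rule: required fields present with sensible types.
function isValidRegion(v: unknown): v is ValidRegion {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return (
    typeof r.id === "string" && r.id.length > 0 &&
    typeof r.label === "string" && r.label.length > 0 &&
    isFiniteNumber(r.x) && isFiniteNumber(r.y) &&
    isFiniteNumber(r.width) && isFiniteNumber(r.height)
  );
}

type RegionCheck =
  | { ok: true; regions: ValidRegion[] }
  | { ok: false; status: 400; error: string };

function validateRegions(input: unknown): RegionCheck {
  const candidates: unknown[] = Array.isArray(input) ? input : [];
  const regions = candidates.filter(isValidRegion); // invalid entries are dropped
  if (regions.length === 0) {
    return { ok: false, status: 400, error: "no valid regions" };
  }
  return { ok: true, regions };
}

const mixed = validateRegions([
  { id: "r1", label: "Número de certificado", x: 80, y: 110, width: 250, height: 35 },
  { id: "r2", label: "Missing geometry" },
]);
const noneValid = validateRegions([{ label: "no id or geometry" }]);
```

Because invalid regions are dropped without an error, a client that needs every region to be saved should compare the regions in the stored template against what it sent.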

If a template with the same name already exists (and is active) for the tenant, the API returns 409 with code: "template_name_exists".
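
A client can branch on these documented status codes. The sketch below is one way to do that; CreateOutcome and interpretCreateResponse are hypothetical names, and the shape of a successful response body is not documented here, so it is passed through untyped:

```typescript
type CreateOutcome =
  | { kind: "created"; body: unknown }   // success body shape not documented here
  | { kind: "name_taken" }               // 409 with code "template_name_exists"
  | { kind: "bad_request" }              // 400, e.g. no valid regions remained
  | { kind: "error"; status: number };

function interpretCreateResponse(status: number, body: unknown): CreateOutcome {
  const code =
    typeof body === "object" && body !== null
      ? (body as Record<string, unknown>).code
      : undefined;

  if (status === 409 && code === "template_name_exists") return { kind: "name_taken" };
  if (status === 400) return { kind: "bad_request" };
  if (status >= 200 && status < 300) return { kind: "created", body };
  return { kind: "error", status };
}

const conflict = interpretCreateResponse(409, { code: "template_name_exists" });
const rejected = interpretCreateResponse(400, {});
const unexpected = interpretCreateResponse(500, null);
```

A name_taken outcome is a natural place for the UI to prompt for a different template name instead of showing a generic failure.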

Delete (deactivate) a template

DELETE /api/ocr-templates
Content-Type: application/json

{ "id": "<template-uuid>" }

Sets is_active = false. The template is removed from all listing and assignment UI but its id is preserved in historical ocr_document_processing records.

Assigning templates to default tablas

Templates can be pre-assigned to obra default table configurations:

-- obra_default_tablas.ocr_template_id links to ocr_templates
ALTER TABLE obra_default_tablas
  ADD COLUMN ocr_template_id UUID REFERENCES ocr_templates(id) ON DELETE SET NULL;

When a new obra is created from defaults, the associated template is carried over so document uploads against that tabla are automatically processed with the right template.

Row-level security

-- SELECT policy: tenant members can view templates that belong to their tenant
CREATE POLICY "Users can view OCR templates for their tenant"
  ON ocr_templates FOR SELECT
  USING (
    tenant_id IN (
      SELECT tenant_id FROM memberships WHERE user_id = auth.uid()
    )
  );

Document processing records inherit access control through their parent obra_id, which is scoped to the tenant.
