Sintesis can automatically extract structured data from uploaded PDF documents using an AI-backed OCR pipeline. Templates define which regions of a document contain which data, and the extraction engine uses those regions to prompt a vision model and map results back to tabla columns.

Architecture overview

User uploads PDF to obra folder
  ↓
PDF → rendered image
  ↓
Regions drawn on image
(annotated as base64 data URL)
  ↓
POST /api/ocr-playground
or /api/obras/[id]/tablas/import/ocr-multi
  ↓
generateObject() → gpt-4o-mini
(Vercel AI SDK, temperature 0.1)
  ↓
Structured JSON returned
per region
  ↓
Rows inserted into obra_tabla

OCR templates

Templates are stored in the ocr_templates table. Each template belongs to a tenant, defines a reference document, and contains an array of annotated regions.

Schema

CREATE TABLE ocr_templates (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id         UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
  name              TEXT NOT NULL,
  description       TEXT,

  -- Reference document stored in Supabase Storage
  template_bucket   TEXT,
  template_path     TEXT,
  template_file_name TEXT,

  -- Rendered image dimensions for coordinate scaling
  template_width    INTEGER,
  template_height   INTEGER,

  -- Extraction regions as a JSON array
  regions           JSONB NOT NULL DEFAULT '[]',

  -- Column definitions derived from regions
  columns           JSONB NOT NULL DEFAULT '[]',

  is_active         BOOLEAN NOT NULL DEFAULT true,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);
Template names must be unique per tenant among active templates. The uniqueness constraint uses a partial index so inactive (soft-deleted) templates can share names:
CREATE UNIQUE INDEX ocr_templates_name_unique
  ON ocr_templates (tenant_id, name)
  WHERE is_active = true;
Deleting a template is a soft delete: is_active is set to false. The row itself is kept so that historical records in ocr_document_processing can still reference it.

Region definition

Each entry in the regions array describes a rectangular bounding box on the reference document image:
type Region = {
  id: string;          // unique region identifier
  x: number;           // left edge, in rendered pixels
  y: number;           // top edge, in rendered pixels
  width: number;       // box width, in rendered pixels
  height: number;      // box height, in rendered pixels
  label: string;       // human-readable field name
  description?: string;
  color: string;       // display colour for the UI overlay
  type: "single" | "table";
  pageNumber?: number; // 1-indexed page (omit for page 1)
  tableColumns?: string[]; // column names for table regions
};
| Region type | Extraction result |
| --- | --- |
| single | One text value extracted from the bounding box |
| table | An array of row objects, one per visible row inside the box |
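Region coordinates are expressed in pixels of the rendered reference image, whose size is stored in template_width and template_height. The following is a minimal sketch of how those dimensions could be used to map a region onto a render of a different size. The `Box` type and the `scaleRegion` helper are hypothetical names for illustration, not part of the Sintesis codebase.

```typescript
// Hypothetical helper: rescales a box from template pixel space to a target render.
type Box = { x: number; y: number; width: number; height: number };
type Size = { width: number; height: number };

function scaleRegion(box: Box, template: Size, target: Size): Box {
  if (template.width <= 0 || template.height <= 0) {
    throw new Error("Template dimensions must be positive");
  }
  // Independent horizontal and vertical factors, in case aspect ratios differ.
  const sx = target.width / template.width;
  const sy = target.height / template.height;
  return {
    x: box.x * sx,
    y: box.y * sy,
    width: box.width * sx,
    height: box.height * sy,
  };
}

// A region drawn on a 1240x1754 render, mapped onto a 2480x3508 render.
const scaled = scaleRegion(
  { x: 80, y: 110, width: 250, height: 35 },
  { width: 1240, height: 1754 },
  { width: 2480, height: 3508 }
);
```

Because the target here is exactly twice the template size in both directions, every coordinate doubles.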

Column definitions

When a template is saved, the API derives a columns array from the regions automatically. Single-type regions produce one column with ocrScope: "parent". Table-type regions produce one column per tableColumns entry with ocrScope: "item":
type TemplateColumn = {
  fieldKey: string;           // snake_case key derived from label
  label: string;              // display label
  dataType: string;           // always "text" (extensible)
  ocrScope?: "parent" | "item";
  description?: string;
};
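The derivation described above can be sketched as follows. This is an illustration under stated assumptions, not the API's actual code: `toFieldKey` and `deriveColumns` are hypothetical names, and the exact snake_case rules (here: strip diacritics, lowercase, collapse non-alphanumerics into underscores) are assumed rather than documented.

```typescript
type RegionInput = {
  label: string;
  description?: string;
  type: "single" | "table";
  tableColumns?: string[];
};

type DerivedColumn = {
  fieldKey: string;
  label: string;
  dataType: string;
  ocrScope?: "parent" | "item";
  description?: string;
};

// Assumed normalisation: strip diacritics, lowercase, non-alphanumerics become "_".
function toFieldKey(label: string): string {
  return label
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

function deriveColumns(regions: RegionInput[]): DerivedColumn[] {
  const columns: DerivedColumn[] = [];
  for (const region of regions) {
    if (region.type === "table") {
      // One "item" column per declared table column.
      for (const name of region.tableColumns ?? []) {
        columns.push({ fieldKey: toFieldKey(name), label: name, dataType: "text", ocrScope: "item" });
      }
    } else {
      // A single region maps to exactly one "parent" column.
      columns.push({
        fieldKey: toFieldKey(region.label),
        label: region.label,
        dataType: "text",
        ocrScope: "parent",
        ...(region.description ? { description: region.description } : {}),
      });
    }
  }
  return columns;
}
```

Under these assumed rules, a label such as "Número de contrato" would yield the key numero_de_contrato.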

Document processing table

Every document that enters the OCR pipeline gets a record in ocr_document_processing:
CREATE TABLE ocr_document_processing (
  id                      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tabla_id                UUID NOT NULL REFERENCES obra_tablas(id) ON DELETE CASCADE,
  obra_id                 UUID NOT NULL REFERENCES obras(id) ON DELETE CASCADE,

  -- Source document
  source_bucket           TEXT NOT NULL,
  source_path             TEXT NOT NULL,
  source_file_name        TEXT NOT NULL,

  -- Lifecycle
  status                  TEXT NOT NULL DEFAULT 'pending'
                          CHECK (status IN ('pending', 'processing', 'completed', 'failed')),
  error_message           TEXT,
  rows_extracted          INTEGER DEFAULT 0,

  -- Template used
  template_id             UUID REFERENCES ocr_templates(id) ON DELETE SET NULL,

  -- Performance tracking
  processed_at            TIMESTAMPTZ,
  processing_duration_ms  INTEGER,
  retry_count             INTEGER NOT NULL DEFAULT 0,
  created_at              TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at              TIMESTAMPTZ NOT NULL DEFAULT now()
);
| Status | Meaning |
| --- | --- |
| pending | Document queued, not yet started |
| processing | Vision model call in progress |
| completed | Extraction succeeded; rows inserted into tabla |
| failed | Extraction failed; see error_message |

OCR Playground

The playground lets you test extraction against any annotated image without persisting results. It is available at POST /api/ocr-playground.

Request format

{
  "annotatedImageDataUrl": "data:image/png;base64,...",
  "regions": [
    {
      "id": "r1",
      "x": 120,
      "y": 45,
      "width": 300,
      "height": 40,
      "label": "Número de contrato",
      "color": "#f97316",
      "type": "single"
    },
    {
      "id": "r2",
      "x": 80,
      "y": 200,
      "width": 500,
      "height": 300,
      "label": "Ítems de certificado",
      "color": "#3b82f6",
      "type": "table",
      "tableColumns": ["Código", "Descripción", "Cantidad", "Precio unitario", "Total"]
    }
  ]
}
annotatedImageDataUrl must be a base64 data URL that includes the bounding-box overlays already drawn on the image. The UI renders these overlays client-side before sending the request.
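A client might assemble the request body along these lines. This is a sketch only: `buildPlaygroundRequest` is a hypothetical helper, and the data-URL check is a client-side convenience that makes no claim about how the server validates the field.

```typescript
type PlaygroundRegion = {
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
  label: string;
  color: string;
  type: "single" | "table";
  tableColumns?: string[];
};

type PlaygroundRequest = {
  annotatedImageDataUrl: string;
  regions: PlaygroundRegion[];
};

function buildPlaygroundRequest(
  annotatedImageDataUrl: string,
  regions: PlaygroundRegion[]
): PlaygroundRequest {
  // Assumed sanity check: reject anything that is not a base64 image data URL.
  if (!/^data:image\/[a-z0-9.+-]+;base64,/i.test(annotatedImageDataUrl)) {
    throw new Error("annotatedImageDataUrl must be a base64 image data URL");
  }
  return { annotatedImageDataUrl, regions };
}

// Serialise the body for a POST to /api/ocr-playground.
const playgroundBody = JSON.stringify(
  buildPlaygroundRequest("data:image/png;base64,AAAA", [
    { id: "r1", x: 120, y: 45, width: 300, height: 40, label: "Número de contrato", color: "#f97316", type: "single" },
  ])
);
```

The resulting string can be sent as the body of a POST request with a Content-Type of application/json.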

Response format

{
  "ok": true,
  "results": [
    {
      "id": "r1",
      "label": "Número de contrato",
      "type": "single",
      "text": "CONT-2024-00412",
      "color": "#f97316"
    },
    {
      "id": "r2",
      "label": "Ítems de certificado",
      "type": "table",
      "rows": [
        { "Código": "01.01", "Descripción": "Hormigón H30", "Cantidad": "15.5", "Precio unitario": "12500", "Total": "193750" },
        { "Código": "01.02", "Descripción": "Armadura", "Cantidad": "420", "Precio unitario": "850", "Total": "357000" }
      ],
      "color": "#3b82f6"
    }
  ]
}
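On the consuming side, the two result shapes can be modelled as a discriminated union on type. The sketch below assumes that text and cell values may be null (the prompt asks the model to return null for illegible content); the type names and the `splitResults` helper are hypothetical.

```typescript
type CellValue = string | null; // assumption: null for empty or illegible content

type SingleResult = { id: string; label: string; type: "single"; text: CellValue; color: string };
type TableResult = {
  id: string;
  label: string;
  type: "table";
  rows: Record<string, CellValue>[];
  color: string;
};
type PlaygroundResult = SingleResult | TableResult;

// Separates document-level values from table rows.
function splitResults(results: PlaygroundResult[]): {
  parent: Record<string, CellValue>;
  items: Record<string, CellValue>[];
} {
  const parent: Record<string, CellValue> = {};
  const items: Record<string, CellValue>[] = [];
  for (const result of results) {
    if (result.type === "single") {
      parent[result.label] = result.text;
    } else {
      items.push(...result.rows);
    }
  }
  return { parent, items };
}
```

Checking result.type narrows the union, so text is only accessible on single results and rows only on table results.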

AI model and prompting

Extraction uses GPT-4o mini via the Vercel AI SDK generateObject function:
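import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";

// extractionSchema, instructions and annotatedImageDataUrl are prepared before this call
const res = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: extractionSchema,   // Zod schema derived from regions
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: instructions },
        { type: "image", image: annotatedImageDataUrl },
      ],
    },
  ],
  temperature: 0.1,            // low temperature for near-deterministic output
});
// res.object holds the parsed result, validated against extractionSchema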
The model receives numbered boxes drawn on the image. Single-value fields are labelled [N] and table regions are labelled [N]📊. The prompt instructs the model to:
  • Extract text exactly as it appears, without interpretation.
  • Return null for empty or illegible cells.
  • For table regions, extract every visible row as a separate object.
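The actual prompt text is not reproduced here, but the rules above could be assembled into an instruction string roughly as follows. `buildInstructions` and the exact wording are hypothetical; only the [N] and [N]📊 labelling and the three rules come from the description above.

```typescript
type PromptRegion = {
  label: string;
  type: "single" | "table";
  tableColumns?: string[];
};

function buildInstructions(regions: PromptRegion[]): string {
  // One line per region, numbered to match the boxes drawn on the image.
  const regionLines = regions.map((region, index) => {
    const n = index + 1;
    if (region.type === "table") {
      const columns = (region.tableColumns ?? []).join(", ");
      return `[${n}]📊 ${region.label}: table with columns ${columns}`;
    }
    return `[${n}] ${region.label}: single value`;
  });

  return [
    "Extract the contents of each numbered box in the image.",
    "Extract text exactly as it appears, without interpretation.",
    "Return null for empty or illegible cells.",
    "For table regions, extract every visible row as a separate object.",
    "",
    ...regionLines,
  ].join("\n");
}
```

Keeping the numbering in the prompt aligned with the numbering drawn on the image is what lets the model associate each box with its field.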

Token cost estimation

// lib/ai-pricing.ts
export const AI_MODEL_COST_PER_1K_TOKENS: Record<string, number> = {
  "gpt-4o-mini": 0.00015, // USD per 1K tokens
};

export function estimateUsdForTokens(
  model: string | null,
  tokens: number
): number | null {
  const rate = AI_MODEL_COST_PER_1K_TOKENS[model ?? ""];
  if (!rate) return null;
  return (tokens / 1000) * rate;
}
GPT-4o mini is used intentionally over GPT-4o for cost efficiency. Because generateObject constrains the response to the shape of the schema, the model returns only the requested fields rather than free-form prose, which helps keep the output token count down.
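As a worked example of the arithmetic, the snippet below estimates the cost of a small batch using the rate from the pricing table above. The token counts are invented for illustration, and the calculation assumes a single flat rate per 1K tokens, as the table does.

```typescript
// Rate copied from AI_MODEL_COST_PER_1K_TOKENS (USD per 1K tokens).
const GPT_4O_MINI_RATE_PER_1K = 0.00015;

// Hypothetical token usage for three processed documents.
const tokensPerDocument = [8200, 11500, 9300];

// 8200 + 11500 + 9300 = 29000 tokens in total.
const totalTokens = tokensPerDocument.reduce((sum, tokens) => sum + tokens, 0);

// (29000 / 1000) * 0.00015, i.e. roughly 0.00435 USD for the batch.
const estimatedUsd = (totalTokens / 1000) * GPT_4O_MINI_RATE_PER_1K;
```

At this rate, even a few thousand documents of similar size would cost only a few US dollars.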

Managing templates via the API

List active templates

GET /api/ocr-templates
Returns all active templates for the authenticated user’s tenant, ordered by name.
Response
{
  "templates": [
    {
      "id": "<uuid>",
      "name": "Certificado de avance",
      "description": "Extrae número, período y tabla de ítems",
      "template_file_name": "cert-modelo.pdf",
      "regions": [...],
      "columns": [...],
      "is_active": true,
      "created_at": "2025-09-12T14:00:00Z"
    }
  ]
}

Create a template

POST /api/ocr-templates
Content-Type: application/json

{
  "name": "Certificado de avance",
  "description": "Extrae número, período y tabla de ítems",
  "templateBucket": "ocr-templates",
  "templatePath": "tenant-abc/cert-modelo.pdf",
  "templateFileName": "cert-modelo.pdf",
  "templateWidth": 1240,
  "templateHeight": 1754,
  "regions": [
    {
      "id": "r1",
      "x": 80, "y": 110, "width": 250, "height": 35,
      "label": "Número de certificado",
      "color": "#f97316",
      "type": "single",
      "pageNumber": 1
    }
  ]
}
Every region must include id, label, x, y, width, and height. Regions that fail validation are silently dropped. The request is rejected with 400 if no valid regions remain.
If a template with the same name already exists (and is active) for the tenant, the API returns 409 with code: "template_name_exists".
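The drop-then-reject behaviour for regions can be sketched as below. This is an assumption-laden illustration, not the API's implementation: `isValidRegion`, `validateRegions`, and the error string are hypothetical, and the real validation may apply further rules (for example, positive width and height).

```typescript
type CandidateRegion = Record<string, unknown>;

const hasText = (value: unknown): boolean =>
  typeof value === "string" && value.trim().length > 0;
const isNumber = (value: unknown): boolean =>
  typeof value === "number" && isFinite(value);

// A region is kept only if all six required fields are present and well-typed.
function isValidRegion(region: CandidateRegion): boolean {
  return (
    hasText(region.id) &&
    hasText(region.label) &&
    isNumber(region.x) &&
    isNumber(region.y) &&
    isNumber(region.width) &&
    isNumber(region.height)
  );
}

type ValidationOutcome =
  | { status: 200; regions: CandidateRegion[] }
  | { status: 400; error: string };

function validateRegions(input: unknown): ValidationOutcome {
  const candidates: unknown[] = Array.isArray(input) ? input : [];
  // Invalid entries are dropped silently rather than reported individually.
  const regions = candidates.filter(
    (r): r is CandidateRegion =>
      typeof r === "object" && r !== null && isValidRegion(r as CandidateRegion)
  );
  if (regions.length === 0) {
    return { status: 400, error: "No valid regions" };
  }
  return { status: 200, regions };
}
```

Because invalid regions are dropped without an error, a client that wants to detect them would need to compare the regions it sent with the regions stored on the saved template.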

Delete (deactivate) a template

DELETE /api/ocr-templates
Content-Type: application/json

{ "id": "<template-uuid>" }
Sets is_active = false. The template is removed from all listing and assignment UI but its id is preserved in historical ocr_document_processing records.

Assigning templates to default tablas

Templates can be pre-assigned to obra default table configurations:
-- obra_default_tablas.ocr_template_id links to ocr_templates
ALTER TABLE obra_default_tablas
  ADD COLUMN ocr_template_id UUID REFERENCES ocr_templates(id) ON DELETE SET NULL;
When a new obra is created from defaults, the associated template is carried over so document uploads against that tabla are automatically processed with the right template.

Row-level security

-- Tenant members can view the templates that belong to their tenant
CREATE POLICY "Users can view OCR templates for their tenant"
  ON ocr_templates FOR SELECT
  USING (
    tenant_id IN (
      SELECT tenant_id FROM memberships WHERE user_id = auth.uid()
    )
  );
Document processing records inherit access control through their parent obra_id, which is scoped to the tenant.
