Ingest PDF Policy

Overview

This endpoint accepts a PDF policy document, extracts its text content, and uses Google’s Gemini AI to identify and extract actionable compliance rules. It then creates a new policy and saves the extracted rules to it.

Authentication

Requires a valid session token, sent as the session cookie. Unauthenticated requests return 401 UNAUTHORIZED.

Request

file

required

PDF file to upload. Must be a valid, non-encrypted PDF with extractable text. Scanned image PDFs without OCR will be rejected.

Content Type

multipart/form-data

Example Request

curl -X POST https://yourdomain.com/api/policies/ingest \
  -H "Cookie: session=your_session_token" \
  -F "file=@policy.pdf"
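
The same upload can be sketched in Python with only the standard library. This is a minimal illustration, not the service's client: the helper name build_ingest_request is hypothetical, and the domain and session cookie are the placeholders from the curl example. The function builds the request without sending it.

```python
import uuid
import urllib.request


def build_ingest_request(pdf_bytes: bytes, filename: str, session_token: str,
                         base_url: str = "https://yourdomain.com") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with a single `file` part."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/policies/ingest",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Cookie": f"session={session_token}",  # placeholder token, as in the curl example
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response body is the JSON described under Response.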

Response

policy

object

The created policy with extracted rules

id

string

UUID of the created policy

name

string

Policy name (extracted from PDF or filename)

rules

array

Array of extracted compliance rules

rule_id

string

Unique rule identifier in UPPER_SNAKE_CASE (e.g., DATA_RETENTION_VIOLATION)

name

string

Human-readable rule name

description

string

Detailed description of the rule

type

string

Rule category (e.g., retention, encryption, access_control, consent)

severity

enum

Rule severity level: CRITICAL, HIGH, or MEDIUM

threshold

number | null

Numeric threshold for threshold-based rules (e.g., 10000 for amount limits)

time_window

number | null

Time window in hours for temporal rules

conditions

object

Rule evaluation logic using recursive AND/OR conditions. Each leaf condition has field, operator, and value.

Supported operators: equals, not_equals, greater_than, less_than, greater_than_or_equal, less_than_or_equal, contains, exists, not_exists, IN, BETWEEN, MATCH (regex)

policy_excerpt

string

Exact quote from the PDF justifying this rule

policy_section

string

Section reference from the policy document

requires_clarification

boolean

Whether the rule requires additional clarification

clarification_notes

string

Notes explaining what needs clarification

created_at

string

ISO 8601 timestamp of policy creation
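
A client may want to sanity-check each returned rule against the field descriptions above. The sketch below is one way to do that; the function name validate_rule and the exact regular expression for UPPER_SNAKE_CASE are assumptions, not part of the API.

```python
import re

# Assumed reading of UPPER_SNAKE_CASE: uppercase letters and digits joined by single underscores.
RULE_ID_PATTERN = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM"}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems found in one rule object (empty if none)."""
    problems = []
    if not RULE_ID_PATTERN.fullmatch(str(rule.get("rule_id", ""))):
        problems.append("rule_id is not UPPER_SNAKE_CASE")
    if rule.get("severity") not in SEVERITIES:
        problems.append("severity must be CRITICAL, HIGH, or MEDIUM")
    for key in ("threshold", "time_window"):
        value = rule.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, (int, float))):
            problems.append(f"{key} must be a number or null")
    return problems
```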

Success Response

{
  "policy": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "GDPR Data Protection Policy",
    "rules": [
      {
        "rule_id": "DATA_RETENTION_VIOLATION",
        "name": "Data Retention Limit Exceeded",
        "description": "Personal data retained beyond maximum allowed period",
        "type": "retention",
        "severity": "HIGH",
        "threshold": null,
        "time_window": 8760,
        "conditions": {
          "AND": [
            {
              "field": "data_type",
              "operator": "equals",
              "value": "personal"
            },
            {
              "field": "retention_days",
              "operator": "greater_than",
              "value": 365
            }
          ]
        },
        "policy_excerpt": "Personal data must not be retained for longer than 12 months unless explicitly required by law.",
        "policy_section": "Article 5.1.e",
        "requires_clarification": false
      }
    ],
    "created_at": "2026-02-28T10:30:00Z"
  }
}
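
To illustrate how the recursive conditions object might be consumed, here is a minimal evaluator sketch applied to the conditions from the response above. It is not the service's own engine: the semantics chosen for IN (list membership), BETWEEN (inclusive [low, high] pair), MATCH (regex search), and missing fields are assumptions.

```python
import re
from typing import Any

# Leaf operators named in this reference. IN, BETWEEN, and MATCH semantics are assumed.
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "not_equals": lambda actual, expected: actual != expected,
    "greater_than": lambda actual, expected: actual > expected,
    "less_than": lambda actual, expected: actual < expected,
    "greater_than_or_equal": lambda actual, expected: actual >= expected,
    "less_than_or_equal": lambda actual, expected: actual <= expected,
    "contains": lambda actual, expected: expected in actual,
    "IN": lambda actual, expected: actual in expected,
    "BETWEEN": lambda actual, expected: expected[0] <= actual <= expected[1],
    "MATCH": lambda actual, expected: re.search(expected, str(actual)) is not None,
}


def evaluate(conditions: dict[str, Any], record: dict[str, Any]) -> bool:
    """Recursively evaluate an AND/OR condition tree against one record."""
    if "AND" in conditions:
        return all(evaluate(child, record) for child in conditions["AND"])
    if "OR" in conditions:
        return any(evaluate(child, record) for child in conditions["OR"])
    field, operator = conditions["field"], conditions["operator"]
    if operator == "exists":
        return field in record
    if operator == "not_exists":
        return field not in record
    if field not in record:
        return False  # assumption: a missing field never satisfies a comparison
    return OPERATORS[operator](record[field], conditions.get("value"))


conditions = {
    "AND": [
        {"field": "data_type", "operator": "equals", "value": "personal"},
        {"field": "retention_days", "operator": "greater_than", "value": 365},
    ]
}
print(evaluate(conditions, {"data_type": "personal", "retention_days": 400}))  # True
print(evaluate(conditions, {"data_type": "personal", "retention_days": 30}))   # False
```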

Error Responses

error

string

Error code identifier

message

string

Human-readable error message

400 Bad Request

{
  "error": "VALIDATION_ERROR",
  "message": "No PDF file provided"
}

{
  "error": "VALIDATION_ERROR",
  "message": "Failed to parse PDF. Ensure the file is a valid, non-encrypted PDF."
}

{
  "error": "VALIDATION_ERROR",
  "message": "PDF contains no extractable text. It may be a scanned image."
}

401 Unauthorized

{
  "error": "UNAUTHORIZED",
  "message": "Authentication required"
}

500 Internal Server Error

{
  "error": "INTERNAL_ERROR",
  "message": "An unexpected error occurred"
}
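
Because every error body shares the error and message fields, a client can handle them uniformly. The sketch below shows one possible approach; IngestError and parse_ingest_response are hypothetical names, and treating any 2xx status as success is an assumption, since the success status code is not stated here.

```python
import json


class IngestError(Exception):
    """Raised for a non-success response; carries the documented error fields."""

    def __init__(self, status: int, code: str, message: str) -> None:
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def parse_ingest_response(status: int, body: str) -> dict:
    """Return the `policy` object on success, or raise IngestError otherwise."""
    payload = json.loads(body)
    if 200 <= status < 300:  # assumption: any 2xx status means success
        return payload["policy"]
    raise IngestError(status, payload.get("error", "UNKNOWN"), payload.get("message", ""))
```

A caller can then branch on exc.status or exc.code, for example prompting for a different file on VALIDATION_ERROR and re-authenticating on UNAUTHORIZED.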

Rule Extraction Logic

The Gemini AI model uses a Signal Specificity Framework to ensure high-precision rules:

Weak Signals (0.5): Single thresholds, basic state checks
Medium Signals (1.0): Temporal windows, behavioral shifts
Strong Signals (2.0): Multiple conditions, cross-field comparisons

All extracted rules must achieve a combined specificity score of ≥2.0 to minimize false positives.
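
As a rough illustration of the threshold, the sketch below sums the listed weights and compares the total to 2.0. Summation is an assumption about how the score is "combined", and the names SIGNAL_WEIGHTS, specificity_score, and is_specific_enough are hypothetical; the actual scoring happens inside the model prompt, not in client code.

```python
# Weights as listed above; combining them by simple addition is an assumption.
SIGNAL_WEIGHTS = {"weak": 0.5, "medium": 1.0, "strong": 2.0}
MIN_SPECIFICITY = 2.0


def specificity_score(signals: list[str]) -> float:
    """Sum the weights of the signals a candidate rule relies on."""
    return sum(SIGNAL_WEIGHTS[signal] for signal in signals)


def is_specific_enough(signals: list[str]) -> bool:
    """True when the combined score reaches the 2.0 threshold."""
    return specificity_score(signals) >= MIN_SPECIFICITY


print(is_specific_enough(["weak"]))              # False (0.5)
print(is_specific_enough(["weak", "medium"]))    # False (1.5)
print(is_specific_enough(["medium", "medium"]))  # True (2.0)
print(is_specific_enough(["strong"]))            # True (2.0)
```

Under this reading, a rule built on a single threshold alone would not qualify, while one strong signal or two medium signals would.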

Processing Limits

Maximum PDF text size: 500,000 characters
Text beyond this limit is truncated
Gemini 2.0 Flash supports up to ~1M tokens (~4M characters)
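
At the roughly 4 characters per token implied above, 500,000 characters is about 125,000 tokens, well inside the model's context window. The truncation step can be pictured as follows; this is a sketch that assumes the leading characters are kept, and truncate_policy_text is a hypothetical name.

```python
MAX_TEXT_CHARS = 500_000  # documented maximum PDF text size


def truncate_policy_text(text: str, limit: int = MAX_TEXT_CHARS) -> tuple[str, bool]:
    """Return the text cut to `limit` characters and a flag saying whether it was cut."""
    if len(text) <= limit:
        return text, False
    # Assumption: the beginning of the document is kept and the tail is dropped.
    return text[:limit], True
```

Rules stated only in the dropped portion of a very long document would not be extracted, so splitting such documents before upload may be worthwhile.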

Notes

PDF parsing uses the unpdf library with fallback to UTF-8 text decoding
Rules are automatically assigned UUIDs and associated with the policy
All rules are set to is_active: true by default
The policy type is set to pdf to distinguish from prebuilt policies
