POST /api/policies/ingest
Ingest PDF Policy
curl --request POST \
  --url https://api.example.com/api/policies/ingest \
  --form 'file=@policy.pdf'
{
  "policy": {
    "id": "<string>",
    "name": "<string>",
    "rules": [
      {
        "rule_id": "<string>",
        "name": "<string>",
        "description": "<string>",
        "type": "<string>",
        "severity": {},
        "threshold": {},
        "time_window": {},
        "conditions": {},
        "policy_excerpt": "<string>",
        "policy_section": "<string>",
        "requires_clarification": true,
        "clarification_notes": "<string>"
      }
    ],
    "created_at": "<string>"
  },
  "error": "<string>",
  "message": "<string>"
}

Overview

This endpoint accepts a PDF policy document, extracts its text, and uses Google’s Gemini AI to identify and extract actionable compliance rules. It then creates a new policy and saves the extracted rules to it.

Authentication

Requires a valid session token. Returns 401 UNAUTHORIZED if not authenticated.

Request

file
file
required
PDF file to upload. Must be a valid, non-encrypted PDF with extractable text. Scanned image PDFs without OCR will be rejected.

Content Type

multipart/form-data

Example Request

curl -X POST https://yourdomain.com/api/policies/ingest \
  -H "Cookie: session=your_session_token" \
  -F "file=@policy.pdf"
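
The same upload can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_multipart` and `ingest_policy` are hypothetical, and the `session` cookie name is taken from the curl example above.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def ingest_policy(base_url: str, session_token: str, pdf_path: str) -> dict:
    """POST a PDF to /api/policies/ingest and return the decoded JSON body."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart(
            "file", os.path.basename(pdf_path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/api/policies/ingest",
        data=body,
        method="POST",
        headers={
            "Content-Type": content_type,
            "Cookie": f"session={session_token}",
        },
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `ingest_policy("https://yourdomain.com", "your_session_token", "policy.pdf")` would mirror the curl request above. The form field must be named `file`, as documented in the Request section.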

Response

policy
object
The created policy with extracted rules
id
string
UUID of the created policy
name
string
Policy name (extracted from PDF or filename)
rules
array
Array of extracted compliance rules
rule_id
string
Unique rule identifier in UPPER_SNAKE_CASE (e.g., DATA_RETENTION_VIOLATION)
name
string
Human-readable rule name
description
string
Detailed description of the rule
type
string
Rule category (e.g., retention, encryption, access_control, consent)
severity
enum
Rule severity level: CRITICAL, HIGH, or MEDIUM
threshold
number | null
Numeric threshold for threshold-based rules (e.g., 10000 for amount limits)
time_window
number | null
Time window in hours for temporal rules
conditions
object
Rule evaluation logic using recursive AND/OR conditions. Each leaf condition has field, operator, and value. Supported operators: equals, not_equals, greater_than, less_than, greater_than_or_equal, less_than_or_equal, contains, exists, not_exists, IN, BETWEEN, MATCH (regex).
policy_excerpt
string
Exact quote from the PDF justifying this rule
policy_section
string
Section reference from the policy document
requires_clarification
boolean
Whether the rule requires additional clarification
clarification_notes
string
Notes explaining what needs clarification
created_at
string
ISO 8601 timestamp of policy creation

Success Response

{
  "policy": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "GDPR Data Protection Policy",
    "rules": [
      {
        "rule_id": "DATA_RETENTION_VIOLATION",
        "name": "Data Retention Limit Exceeded",
        "description": "Personal data retained beyond maximum allowed period",
        "type": "retention",
        "severity": "HIGH",
        "threshold": null,
        "time_window": 8760,
        "conditions": {
          "AND": [
            {
              "field": "data_type",
              "operator": "equals",
              "value": "personal"
            },
            {
              "field": "retention_days",
              "operator": "greater_than",
              "value": 365
            }
          ]
        },
        "policy_excerpt": "Personal data must not be retained for longer than 12 months unless explicitly required by law.",
        "policy_section": "Article 5.1.e",
        "requires_clarification": false
      }
    ],
    "created_at": "2026-02-28T10:30:00Z"
  }
}
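
To illustrate how the recursive `conditions` object might be evaluated against a record, here is a minimal sketch. It is not the service's evaluator. Several behaviors are assumptions that this page does not specify: `BETWEEN` is treated as an inclusive `[low, high]` pair, `contains` uses Python's `in`, `MATCH` uses `re.search`, and any comparison on a missing field evaluates to false.

```python
import re

# Leaf operators from the field reference above; semantics are assumed.
OPERATORS = {
    "equals": lambda actual, value: actual == value,
    "not_equals": lambda actual, value: actual != value,
    "greater_than": lambda actual, value: actual > value,
    "less_than": lambda actual, value: actual < value,
    "greater_than_or_equal": lambda actual, value: actual >= value,
    "less_than_or_equal": lambda actual, value: actual <= value,
    "contains": lambda actual, value: value in actual,
    "IN": lambda actual, value: actual in value,
    # Assumption: value is an inclusive [low, high] pair.
    "BETWEEN": lambda actual, value: value[0] <= actual <= value[1],
    "MATCH": lambda actual, value: re.search(value, str(actual)) is not None,
}


def evaluate(conditions: dict, record: dict) -> bool:
    """Recursively evaluate an AND/OR condition tree against a flat record."""
    if "AND" in conditions:
        return all(evaluate(child, record) for child in conditions["AND"])
    if "OR" in conditions:
        return any(evaluate(child, record) for child in conditions["OR"])

    field = conditions["field"]
    operator = conditions["operator"]
    if operator == "exists":
        return field in record
    if operator == "not_exists":
        return field not in record
    if field not in record:
        # Assumption: comparisons on a missing field never match.
        return False
    return OPERATORS[operator](record[field], conditions.get("value"))
```

With the `conditions` from the success response above, a record such as `{"data_type": "personal", "retention_days": 400}` satisfies both leaves of the `AND`, while the same record with `retention_days` of 365 does not.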

Error Responses

error
string
Error code identifier
message
string
Human-readable error message

400 Bad Request

{
  "error": "VALIDATION_ERROR",
  "message": "No PDF file provided"
}
{
  "error": "VALIDATION_ERROR",
  "message": "Failed to parse PDF. Ensure the file is a valid, non-encrypted PDF."
}
{
  "error": "VALIDATION_ERROR",
  "message": "PDF contains no extractable text. It may be a scanned image."
}

401 Unauthorized

{
  "error": "UNAUTHORIZED",
  "message": "Authentication required"
}

500 Internal Server Error

{
  "error": "INTERNAL_ERROR",
  "message": "An unexpected error occurred"
}
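
A client might turn the `error` and `message` fields into an exception so that callers can branch on the error code. The sketch below assumes only the response shapes shown on this page. `PolicyIngestError` and `parse_ingest_response` are hypothetical names, not part of the API.

```python
import json


class PolicyIngestError(Exception):
    """Raised when the ingest endpoint returns an error body."""

    def __init__(self, status: int, code: str, message: str):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def parse_ingest_response(status: int, body: str) -> dict:
    """Return the created policy, or raise PolicyIngestError on an error body."""
    payload = json.loads(body)
    if "error" in payload:
        raise PolicyIngestError(status, payload["error"], payload.get("message", ""))
    return payload["policy"]
```

Callers can then distinguish, for example, `VALIDATION_ERROR` (fix the upload) from `UNAUTHORIZED` (sign in again) by inspecting the exception's `code` attribute.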

Rule Extraction Logic

The Gemini AI model uses a Signal Specificity Framework to ensure high-precision rules:
  • Weak Signals (0.5): Single thresholds, basic state checks
  • Medium Signals (1.0): Temporal windows, behavioral shifts
  • Strong Signals (2.0): Multiple conditions, cross-field comparisons
Each extracted rule must achieve a combined specificity score of ≥2.0 across its signals to minimize false positives.
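
The scoring can be illustrated with a short sketch. It assumes that the combined score is the sum of the weights of a rule's signals, which this page implies but does not state. In the actual system the classification is performed by the model, so the names below are purely illustrative.

```python
# Weights from the Signal Specificity Framework listed above.
SIGNAL_WEIGHTS = {"weak": 0.5, "medium": 1.0, "strong": 2.0}
MIN_SPECIFICITY = 2.0


def specificity_score(signals: list[str]) -> float:
    """Assumption: the combined score is the sum of the signal weights."""
    return sum(SIGNAL_WEIGHTS[signal] for signal in signals)


def is_specific_enough(signals: list[str]) -> bool:
    """True if the rule's signals reach the minimum combined score."""
    return specificity_score(signals) >= MIN_SPECIFICITY
```

Under this reading, a rule with one weak and one medium signal scores 1.5 and would be rejected, while a single strong signal or two medium signals reach the 2.0 threshold.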

Processing Limits

  • Maximum PDF text size: 500,000 characters
  • Documents whose extracted text exceeds this limit are truncated to it
  • Gemini 2.0 Flash supports up to ~1M tokens (~4M characters)
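
As a rough sketch of what these limits imply, the snippet below truncates text to the 500,000-character limit and estimates token usage from the ~4 characters per token ratio implied by the figures above. The function names are hypothetical, and truncating from the start of the text is an assumption.

```python
MAX_POLICY_TEXT_CHARS = 500_000
# Derived from the figures above: ~1M tokens is roughly ~4M characters.
APPROX_CHARS_PER_TOKEN = 4


def truncate_policy_text(text: str, limit: int = MAX_POLICY_TEXT_CHARS) -> tuple[str, bool]:
    """Return the text cut to the limit and a flag saying whether it was cut."""
    if len(text) <= limit:
        return text, False
    # Assumption: the leading characters are kept and the rest is dropped.
    return text[:limit], True


def approx_tokens(text: str) -> int:
    """Rough token estimate based on the characters-per-token ratio."""
    return len(text) // APPROX_CHARS_PER_TOKEN
```

By this estimate, a document at the 500,000-character limit comes to about 125,000 tokens, roughly one eighth of the ~1M-token context.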

Notes

  • PDF parsing uses the unpdf library with fallback to UTF-8 text decoding
  • Rules are automatically assigned UUIDs and associated with the policy
  • All rules are set to is_active: true by default
  • The policy type is set to pdf to distinguish from prebuilt policies
