How It Works
The custom PDF extraction pipeline uses Gemini 2.5 Flash to parse regulatory text and generate rules with compound boolean logic.
Upload PDF
Upload your regulatory document via the audit wizard. Supported formats:
- Standard PDFs (text-based)
- Scanned PDFs (OCR is applied automatically)
- Maximum file size: 10MB
- Maximum pages: 100
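As an illustration of the limits above, a pre-upload check could look like the following sketch. The function and type names are hypothetical and are not part of the Yggdrasil API; the sketch also assumes that "10MB" means 10 × 1024 × 1024 bytes.

```typescript
// Limits taken from the list above; all names here are illustrative.
const MAX_FILE_BYTES = 10 * 1024 * 1024; // assumes binary megabytes
const MAX_PAGES = 100;

interface UploadCandidate {
  fileName: string;
  sizeBytes: number;
  pageCount: number;
}

// Returns a list of problems; an empty list means the file is within limits.
function checkUploadLimits(doc: UploadCandidate): string[] {
  const problems: string[] = [];
  if (!/\.pdf$/i.test(doc.fileName)) {
    problems.push("file is not a PDF");
  }
  if (doc.sizeBytes > MAX_FILE_BYTES) {
    problems.push("file is larger than 10MB");
  }
  if (doc.pageCount > MAX_PAGES) {
    problems.push("document has more than 100 pages");
  }
  return problems;
}

const withinLimits = checkUploadLimits({ fileName: "policy.pdf", sizeBytes: 2500000, pageCount: 42 });
const tooLarge = checkUploadLimits({ fileName: "policy.pdf", sizeBytes: 11 * 1024 * 1024, pageCount: 140 });
console.log(withinLimits.length, tooLarge.length); // 0 2
```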
Text Extraction
Yggdrasil uses unpdf (a serverless-compatible PDF parser) to extract text from the document. Text is chunked into sections to fit within Gemini’s context window.
AI Rule Extraction
Gemini 2.5 Flash analyzes the document and identifies:
- Enforceable clauses — Statements that can be validated against data
- Thresholds — Numeric limits (e.g., “transactions exceeding $10,000”)
- Conditions — Boolean logic (e.g., “if amount > $10K AND type = WIRE”)
- Severity — Risk level (CRITICAL, HIGH, MEDIUM)
- Policy excerpts — Exact quotes from the document
Signal Specificity Validation
Each extracted rule is scored using the Signal Specificity Framework:
- Single-signal rules (e.g., “amount > $10K”) are rejected unless they meet domain-specific thresholds
- Multi-signal rules (e.g., “amount > $10K AND type = WIRE AND dest_country = offshore”) are accepted
- Minimum combined specificity: 2.0
Signal Specificity Framework
The Signal Specificity Framework is a scoring system that evaluates rule quality based on how many independent signals are combined.
Signal Types
| Signal Type | Examples | Specificity Score |
|---|---|---|
| Behavioral | Transaction type, action, event | 1.0 per signal |
| Temporal | Time window, frequency, velocity | 1.0 per signal |
| Relational | Account relationships, cross-entity patterns | 1.0 per signal |
| Threshold | Numeric limits (amount, count, age) | 0.5 per threshold |
Scoring Rules
- Single condition: Score = specificity of that signal
- AND conditions: Score = sum of all signal specificities
- OR conditions: Score = max specificity among branches
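The scoring rules can be sketched as a small recursive function. This is a minimal illustration, not the engine's actual code: the `SignalNode` tree shape and all identifiers are assumptions, while the per-signal scores and the 2.0 minimum come from this page.

```typescript
type SignalType = "behavioral" | "temporal" | "relational" | "threshold";

// Per-signal scores, copied from the Signal Types table.
const SIGNAL_SCORE: { [K in SignalType]: number } = {
  behavioral: 1.0,
  temporal: 1.0,
  relational: 1.0,
  threshold: 0.5,
};

// Minimum combined specificity stated in this document.
const MIN_SPECIFICITY = 2.0;

// Hypothetical tree shape: a leaf signal, or an AND/OR group of child nodes.
type SignalNode =
  | { kind: "signal"; signal: SignalType; label: string }
  | { kind: "AND" | "OR"; children: SignalNode[] };

function specificity(node: SignalNode): number {
  if (node.kind === "signal") return SIGNAL_SCORE[node.signal];
  const scores = node.children.map((child) => specificity(child));
  if (node.kind === "AND") return scores.reduce((sum, s) => sum + s, 0); // AND: sum
  return Math.max(0, ...scores); // OR: max among branches
}

function isAccepted(node: SignalNode): boolean {
  return specificity(node) >= MIN_SPECIFICITY;
}

const amountOver10k: SignalNode = { kind: "signal", signal: "threshold", label: "amount > $10K" };
const wireType: SignalNode = { kind: "signal", signal: "behavioral", label: "type = WIRE" };
const offshoreDest: SignalNode = { kind: "signal", signal: "relational", label: "dest_country = offshore" };

const oneSignal: SignalNode = amountOver10k;
const twoSignals: SignalNode = { kind: "AND", children: [amountOver10k, wireType] };
const threeSignals: SignalNode = { kind: "AND", children: [amountOver10k, wireType, offshoreDest] };

console.log(specificity(oneSignal), specificity(twoSignals), specificity(threeSignals)); // 0.5 1.5 2.5
```

Note that this sketch applies only the fixed 2.0 cutoff; any domain-specific thresholds would need additional logic.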
Examples
Example 1: Single-Signal Rule (Rejected)
Rule: amount > $10K
Specificity Score: 0.5 (threshold only)
Result: Rejected (below minimum threshold of 2.0)
Reason: This rule would fire on any transaction over $10K, producing too many false positives.
Example 2: Two-Signal Rule (Borderline)
Rule: amount > $10K AND type = WIRE
Specificity Score: 0.5 (threshold) + 1.0 (behavioral) = 1.5
Result: Borderline (may be rejected depending on domain)
Reason: Still too broad: any wire transfer over $10K would trigger.
Example 3: Multi-Signal Rule (Accepted)
Rule: amount > $10K AND type = WIRE AND dest_country = offshore
Specificity Score: 0.5 (threshold) + 1.0 (behavioral) + 1.0 (relational) = 2.5
Result: Accepted
Reason: Combines multiple signals (amount + type + destination), reducing false positives.
Example 4: Temporal + Behavioral (Accepted)
Rule type: velocity-based
Specificity Score: 0.5 (threshold) + 1.0 (temporal: 24h window) + 1.0 (behavioral: velocity pattern) = 2.5
Result: Accepted
Reason: Velocity rules inherently combine temporal and behavioral signals.
Supported Rule Types
The extraction engine can generate the following rule types:
| Rule Type | Description | Example |
|---|---|---|
| single_transaction | Evaluate conditions per record | “Flag transactions > $10K” |
| aggregation | Sum values within time window | “Flag accounts with total volume > $25K in 24h” |
| velocity | Count occurrences within time window | “Flag 5+ transactions in 24h” |
| structuring | Detect sub-threshold patterns | “Flag 3+ transactions between 10K in 24h” |
| dormant_reactivation | Detect dormant account activity | “Flag dormant accounts (90d) with transaction > $5K” |
| round_amount | Detect round-dollar patterns | “Flag 3+ round amounts ($X,000) in 30d” |
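To make one of these rule types concrete, here is a hedged sketch of how a velocity rule such as “Flag 5+ transactions in 24h” might be evaluated with a sliding window. The record shape and function name are assumptions for illustration; the page does not describe how the rule engine implements this.

```typescript
// Hypothetical record shape; field names are illustrative.
interface TxRecord {
  accountId: string;
  amount: number;
  timestampMs: number; // epoch milliseconds
}

// Returns account IDs with at least `minCount` transactions inside any
// window of `windowHours` (window bounds are inclusive).
function flagVelocity(records: TxRecord[], minCount: number, windowHours: number): string[] {
  const windowMs = windowHours * 60 * 60 * 1000;

  // Group timestamps by account.
  const byAccount: { [accountId: string]: number[] } = {};
  for (const r of records) {
    if (!(r.accountId in byAccount)) byAccount[r.accountId] = [];
    byAccount[r.accountId].push(r.timestampMs);
  }

  const flagged: string[] = [];
  for (const accountId of Object.keys(byAccount)) {
    const times = byAccount[accountId].slice().sort((a, b) => a - b);
    let start = 0;
    for (let end = 0; end < times.length; end++) {
      // Advance the left edge until the window spans at most windowMs.
      while (times[end] - times[start] > windowMs) start++;
      if (end - start + 1 >= minCount) {
        flagged.push(accountId);
        break;
      }
    }
  }
  return flagged;
}

const HOUR_MS = 60 * 60 * 1000;
const sampleRecords: TxRecord[] = [];
for (let i = 0; i < 5; i++) {
  sampleRecords.push({ accountId: "A", amount: 100, timestampMs: i * HOUR_MS });      // 5 within 4 hours
  sampleRecords.push({ accountId: "B", amount: 100, timestampMs: i * 30 * HOUR_MS }); // 5 spread over 120 hours
}
console.log(flagVelocity(sampleRecords, 5, 24)); // [ 'A' ]
```

An aggregation rule would follow the same windowing pattern but sum `amount` inside the window instead of counting records.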
What Gets Extracted
For each rule, Gemini extracts:
Rule Metadata
- rule_id — Unique identifier (e.g., CUSTOM-001)
- name — Human-readable name (e.g., “High-Value Wire Transfer”)
- type — Rule execution type (single_transaction, aggregation, velocity, etc.)
- severity — Risk level (CRITICAL, HIGH, MEDIUM)
Thresholds & Windows
- threshold — Numeric limit (e.g., 10000 for $10K)
- time_window — Hours for temporal rules (e.g., 24 for “within 24 hours”)
Conditions (Boolean Logic)
- AND — All conditions must match
- OR — Any condition must match
- Leaf conditions — Field, operator, value triples
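A condition tree of this kind can be evaluated recursively. The sketch below assumes a JSON shape with `AND`/`OR` arrays and `field`/`operator`/`value` leaves, plus a small operator set; the exact shape and full operator list used by the extraction engine may differ.

```typescript
// Hypothetical leaf shapes, one per operator family.
type LeafCondition =
  | { field: string; operator: ">" | ">="; value: number }
  | { field: string; operator: "=="; value: string | number }
  | { field: string; operator: "IN"; value: Array<string | number> }
  | { field: string; operator: "BETWEEN"; value: [number, number] };

type ConditionNode = { AND: ConditionNode[] } | { OR: ConditionNode[] } | LeafCondition;

type DataRow = { [field: string]: string | number | undefined };

function matchesLeaf(leaf: LeafCondition, row: DataRow): boolean {
  const actual = row[leaf.field];
  if (actual === undefined) return false; // a missing field never matches
  if (leaf.operator === "==") return actual === leaf.value;
  if (leaf.operator === "IN") return leaf.value.indexOf(actual) !== -1;
  if (typeof actual !== "number") return false; // remaining operators are numeric
  if (leaf.operator === "BETWEEN") return actual >= leaf.value[0] && actual <= leaf.value[1];
  return leaf.operator === ">" ? actual > leaf.value : actual >= leaf.value;
}

function matches(node: ConditionNode, row: DataRow): boolean {
  if ("AND" in node) return node.AND.every((child) => matches(child, row)); // all must match
  if ("OR" in node) return node.OR.some((child) => matches(child, row));    // any may match
  return matchesLeaf(node, row);
}

// "amount > $10K AND type = WIRE", as in the examples on this page.
const wireRule: ConditionNode = {
  AND: [
    { field: "amount", operator: ">", value: 10000 },
    { field: "type", operator: "==", value: "WIRE" },
  ],
};

console.log(matches(wireRule, { amount: 15000, type: "WIRE" })); // true
console.log(matches(wireRule, { amount: 15000, type: "ACH" }));  // false
```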
Policy References
- policy_excerpt — Exact quote from the PDF
- policy_section — Section/article reference (e.g., “Section 3.2”, “Article 15”)
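Putting the fields above together, an extracted rule might look roughly like the object below. The field names come from this page; the TypeScript types, the optionality of `threshold` and `time_window`, the shape of `conditions`, and every sample value are assumptions for illustration only.

```typescript
type RuleType =
  | "single_transaction"
  | "aggregation"
  | "velocity"
  | "structuring"
  | "dormant_reactivation"
  | "round_amount";

type Severity = "CRITICAL" | "HIGH" | "MEDIUM";

// Hypothetical condition shape: AND/OR groups over field/operator/value leaves.
type RuleCondition =
  | { AND: RuleCondition[] }
  | { OR: RuleCondition[] }
  | { field: string; operator: string; value: string | number | Array<string | number> };

interface ExtractedRule {
  rule_id: string;
  name: string;
  type: RuleType;
  severity: Severity;
  threshold?: number;   // numeric limit, e.g. 10000 for $10K (assumed optional)
  time_window?: number; // hours, for temporal rules (assumed optional)
  conditions: RuleCondition;
  policy_excerpt: string;
  policy_section: string;
}

const sampleRule: ExtractedRule = {
  rule_id: "CUSTOM-001",
  name: "High-Value Wire Transfer",
  type: "single_transaction",
  severity: "HIGH", // illustrative value
  threshold: 10000,
  conditions: {
    AND: [
      { field: "amount", operator: ">", value: 10000 },
      { field: "type", operator: "==", value: "WIRE" },
    ],
  },
  policy_excerpt: "<exact quote from the uploaded PDF>", // placeholder, not real policy text
  policy_section: "Section 3.2",
};

console.log(JSON.stringify(sampleRule, null, 2));
```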
Quality Assurance
To ensure high-quality rule extraction:
Validation Against Schema
All extracted rules are validated against Zod schemas to ensure:
- Valid JSON structure
- Required fields present (rule_id, name, type, severity, conditions)
- Supported operators (>=, ==, IN, BETWEEN, etc.)
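The checks listed above can be illustrated without any dependency. The pipeline itself uses Zod; the sketch below is a plain-TypeScript stand-in, not the actual schema. The operator list is an assumed subset, and the condition shape (`AND`/`OR` arrays over `field`/`operator`/`value` leaves) is hypothetical.

```typescript
const REQUIRED_FIELDS = ["rule_id", "name", "type", "severity", "conditions"];
const SEVERITIES = ["CRITICAL", "HIGH", "MEDIUM"];
const SUPPORTED_OPERATORS = [">", ">=", "==", "IN", "BETWEEN"]; // assumed subset

// Walks a condition tree and records any leaf with an unsupported operator.
function checkConditions(node: unknown, errors: string[]): void {
  if (typeof node !== "object" || node === null) {
    errors.push("condition must be an object");
    return;
  }
  const obj = node as { [key: string]: unknown };
  const branch = obj["AND"] !== undefined ? obj["AND"] : obj["OR"];
  if (Array.isArray(branch)) {
    for (const child of branch) checkConditions(child, errors);
    return;
  }
  const operator = obj["operator"];
  if (typeof operator !== "string" || SUPPORTED_OPERATORS.indexOf(operator) === -1) {
    errors.push("unsupported operator: " + String(operator));
  }
}

// Returns validation errors for one rule given as JSON text; empty means valid.
function validateRuleJson(text: string): string[] {
  let raw: unknown;
  try {
    raw = JSON.parse(text);
  } catch (e) {
    return ["invalid JSON"];
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return ["rule must be a JSON object"];
  }
  const rule = raw as { [key: string]: unknown };
  const errors: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    if (rule[field] === undefined) errors.push("missing required field: " + field);
  }
  const severity = rule["severity"];
  if (severity !== undefined && (typeof severity !== "string" || SEVERITIES.indexOf(severity) === -1)) {
    errors.push("invalid severity");
  }
  if (rule["conditions"] !== undefined) checkConditions(rule["conditions"], errors);
  return errors;
}

const goodRule = JSON.stringify({
  rule_id: "CUSTOM-001",
  name: "High-Value Wire Transfer",
  type: "single_transaction",
  severity: "HIGH",
  conditions: { AND: [{ field: "amount", operator: ">=", value: 10000 }] },
});
console.log(validateRuleJson(goodRule)); // []
```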
Specificity Scoring
Rules are scored using the Signal Specificity Framework:
- Minimum combined specificity: 2.0
- Single-threshold rules rejected
- Compound conditions preferred
Example: HIPAA Privacy Rule Extraction
Limitations
When to Use Custom PDF vs Prebuilt
| Scenario | Recommended Approach |
|---|---|
| You need AML, GDPR, or SOC2 compliance | Use prebuilt frameworks (faster, includes historical fines) |
| You have HIPAA, PCI-DSS, GLBA, or other industry regulations | Use custom PDF upload |
| You’re enforcing internal company policies | Use custom PDF upload |
| You need to customize rule thresholds | Use prebuilt + manual editing or custom PDF |
| You want to test a new regulation before production | Use custom PDF with a sample dataset |
Next Steps
Start an Audit
Upload your first PDF and extract rules
Rule Engine
Learn how rules are evaluated
Confidence Scoring
Understand how violations are scored
Explainability
See how violation explanations are generated