POST /api/data/pii-scan
PII Scan
curl --request POST \
  --url https://api.example.com/api/data/pii-scan \
  --header 'Content-Type: application/json' \
  --data '
{
  "upload_id": "<string>",
  "scan_id": "<string>"
}
'
{
  "findings": [
    {}
  ],
  "summary": "<string>",
  "pii_detected": true,
  "error": "<string>",
  "message": "<string>"
}
This endpoint analyzes the uploaded dataset for Personally Identifiable Information (PII) exposure. It uses AI-powered analysis combined with regex-based detection to identify sensitive data across all columns and rows.

Request

upload_id (string, required) - UUID of the uploaded dataset from /api/data/upload.
scan_id (string, optional) - UUID of an existing scan to associate findings with. If not provided, findings are stored with upload_id only.

Example Request

curl -X POST https://your-domain.com/api/data/pii-scan \
  -H "Content-Type: application/json" \
  -d '{
    "upload_id": "a3f12b45-8c7d-4e9f-b1a2-3c4d5e6f7a8b",
    "scan_id": "f7e6d5c4-b3a2-1098-7654-321fedcba098"
  }'
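
The same request can be built from Python. The following is a minimal sketch using only the standard library; BASE_URL is the placeholder host from the curl example, build_pii_scan_request is a hypothetical helper name, and no authentication header is added because the page does not describe one.

```python
import json
import urllib.request

# Placeholder host taken from the curl example; replace with your deployment.
BASE_URL = "https://your-domain.com"


def build_pii_scan_request(upload_id, scan_id=None):
    """Build (but do not send) a POST request for /api/data/pii-scan."""
    payload = {"upload_id": upload_id}
    # scan_id is optional; omit the key entirely when it is not provided.
    if scan_id is not None:
        payload["scan_id"] = scan_id
    return urllib.request.Request(
        url=f"{BASE_URL}/api/data/pii-scan",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_pii_scan_request("a3f12b45-8c7d-4e9f-b1a2-3c4d5e6f7a8b")
# To send it, pass `request` to urllib.request.urlopen() and decode the JSON body.
```

Building the request separately from sending it keeps the payload easy to inspect before anything goes over the network.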

Response

findings (array, required) - Array of PII findings detected in the dataset. Each finding contains:
  • column_name (string) - Name of the column containing PII
  • pii_type (string) - Type of PII detected
  • severity (string) - Risk severity level
  • confidence (number) - AI confidence score (0-100)
  • match_count (number) - Number of rows with PII matches
  • total_rows (number) - Total rows analyzed
  • masked_samples (array) - Sample values with PII masked
  • detection_regex (string) - Regex pattern used for detection
  • violation_text (string) - Description of the privacy risk
  • suggestion (string) - Recommended remediation action
summary (string, required) - Human-readable summary of PII analysis results.
pii_detected (boolean, required) - Indicates whether any PII was found in the dataset.
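
On the client side, the fields listed above map naturally onto a typed record. This is a sketch, not part of the API: the Finding class and parse_findings helper are hypothetical names, and the field types follow the descriptions in this section.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Finding:
    column_name: str
    pii_type: str
    severity: str
    confidence: float      # AI confidence score, 0-100
    match_count: int       # rows with PII matches
    total_rows: int        # rows analyzed
    masked_samples: List[str]
    detection_regex: str
    violation_text: str
    suggestion: str


_FIELD_NAMES = [f.name for f in fields(Finding)]


def parse_findings(response: dict) -> List[Finding]:
    """Convert the `findings` array of a response body into Finding objects.

    Unknown keys are ignored so that extra fields do not break parsing.
    """
    return [
        Finding(**{name: item[name] for name in _FIELD_NAMES})
        for item in response.get("findings", [])
    ]
```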

PII Types

  • email - Email addresses
  • phone - Phone numbers
  • ssn - Social Security Numbers
  • name - Personal names
  • address - Physical addresses
  • date_of_birth - Birth dates
  • credit_card - Credit card numbers
  • ip_address - IP addresses
  • passport - Passport numbers
  • national_id - National ID numbers
  • bank_account - Bank account numbers
  • other - Other PII types

Severity Levels

  • CRITICAL - SSN, credit card, passport (immediate risk)
  • HIGH - Email, phone, bank account (high risk)
  • MEDIUM - Name, address, date of birth, IP address (moderate risk)
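
The severity table above can be expressed as a lookup, which is handy for sorting findings on the client. This is an illustrative sketch; SEVERITY_BY_PII_TYPE, severity_for, and sort_by_severity are hypothetical names. The page assigns no level to national_id or other, so the lookup returns None for them rather than guessing.

```python
# Mapping transcribed from the Severity Levels list.
SEVERITY_BY_PII_TYPE = {
    "ssn": "CRITICAL",
    "credit_card": "CRITICAL",
    "passport": "CRITICAL",
    "email": "HIGH",
    "phone": "HIGH",
    "bank_account": "HIGH",
    "name": "MEDIUM",
    "address": "MEDIUM",
    "date_of_birth": "MEDIUM",
    "ip_address": "MEDIUM",
}

# Lower rank sorts first, so CRITICAL findings come out on top.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2}


def severity_for(pii_type):
    """Return the documented severity for a PII type, or None if not listed."""
    return SEVERITY_BY_PII_TYPE.get(pii_type)


def sort_by_severity(findings):
    """Order finding dicts from most to least severe; unknown levels go last."""
    return sorted(
        findings,
        key=lambda f: SEVERITY_RANK.get(f.get("severity"), len(SEVERITY_RANK)),
    )
```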

Example Response

{
  "findings": [
    {
      "column_name": "customer_email",
      "pii_type": "email",
      "severity": "HIGH",
      "confidence": 95,
      "match_count": 1247,
      "total_rows": 1250,
      "masked_samples": [
        "j***@example.com",
        "s***@company.org",
        "a***@domain.net"
      ],
      "detection_regex": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
      "violation_text": "Column contains email addresses which are considered PII under GDPR and CCPA. Exposure risk: HIGH. This data could be used to identify individuals and may require consent for processing.",
      "suggestion": "Hash email addresses using SHA-256 or remove column if not required for analysis. Consider implementing email masking for non-production environments."
    },
    {
      "column_name": "account_holder",
      "pii_type": "name",
      "severity": "MEDIUM",
      "confidence": 87,
      "match_count": 1250,
      "total_rows": 1250,
      "masked_samples": [
        "J*** D***",
        "S*** W***",
        "A*** M***"
      ],
      "detection_regex": "^[A-Z][a-z]+ [A-Z][a-z]+$",
      "violation_text": "Column contains personal names. Under GDPR Article 4, names are directly identifiable PII. This creates a moderate privacy risk.",
      "suggestion": "Replace names with pseudonymized identifiers (e.g., USER_001) or use tokenization to preserve referential integrity while protecting identity."
    },
    {
      "column_name": "ssn",
      "pii_type": "ssn",
      "severity": "CRITICAL",
      "confidence": 98,
      "match_count": 423,
      "total_rows": 1250,
      "masked_samples": [
        "***-**-1234",
        "***-**-5678",
        "***-**-9012"
      ],
      "detection_regex": "\\b\\d{3}-\\d{2}-\\d{4}\\b",
      "violation_text": "CRITICAL: Column contains Social Security Numbers. This is highly sensitive PII that poses severe identity theft risk if exposed. Violates multiple regulations including GLBA and state privacy laws.",
      "suggestion": "IMMEDIATE ACTION REQUIRED: Encrypt or remove SSN column. If retention is legally required, use field-level encryption with key management system. Never store SSNs in plain text."
    }
  ],
  "summary": "PII analysis complete. Found 3 columns with PII exposure across 1,250 rows. Severity breakdown: 1 CRITICAL, 1 HIGH, 1 MEDIUM. Immediate remediation required for SSN column.",
  "pii_detected": true
}

Error Responses

error (string) - Error code, one of:
  • Bad Request - Missing or invalid upload_id
  • Not Found - Upload not found or expired
  • Internal Server Error - Unexpected server error
message (string) - Human-readable error description.

Example Error Response

{
  "error": "Not Found",
  "message": "Upload not found — may have expired"
}
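
A client can turn an error body like the one above into an exception. The sketch below assumes only what this section states, namely that an error response carries error and message strings. PiiScanError and raise_for_error are hypothetical names.

```python
class PiiScanError(Exception):
    """Raised when a pii-scan response body contains an `error` field."""

    def __init__(self, error, message):
        super().__init__(f"{error}: {message}")
        self.error = error
        self.message = message


def raise_for_error(body: dict) -> dict:
    """Return the body unchanged, or raise if it describes an error."""
    if body.get("error"):
        raise PiiScanError(body["error"], body.get("message", ""))
    return body
```

Keeping the error code on the exception lets callers branch on "Not Found" (for example, to re-upload an expired dataset) without parsing the message text.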

Detection Process

1. Sample Analysis (AI-Powered)

  • Randomly samples 20 rows from the dataset using Fisher-Yates shuffle
  • Sends column names and sample data to Gemini AI
  • AI analyzes patterns and suggests PII types with confidence scores
  • Generates regex patterns for each detected PII type
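
The sampling step can be illustrated with a partial Fisher-Yates shuffle, which stops after the first k positions are fixed instead of shuffling the whole dataset. This is a sketch of the general technique, not the service's actual code; sample_rows and its parameters are hypothetical.

```python
import random


def sample_rows(rows, k=20, rng=None):
    """Pick up to k distinct rows uniformly at random (partial Fisher-Yates)."""
    rng = rng or random.Random()
    pool = list(rows)          # copy so the caller's data is not reordered
    n = len(pool)
    k = min(k, n)              # small datasets are returned in full
    for i in range(k):
        # Swap position i with a random position in the unshuffled tail [i, n).
        j = rng.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]
```

Stopping after k swaps keeps the cost proportional to the sample size (plus the copy) while still giving every row the same chance of selection.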

2. Full Dataset Scan (Regex-Based)

  • For each column identified as containing PII:
    • Applies regex pattern to all rows
    • Counts matches and calculates prevalence
    • Collects masked sample values
  • Only columns with matches are included in final findings
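
The full-scan step can be sketched as follows. The data layout is an assumption: columns are a dict of column name to list of values, and candidates map column names to regex patterns. The sketch uses Python's re module, whereas the service's patterns are described as JavaScript-compatible; the example patterns on this page behave the same in both. scan_column and scan_dataset are hypothetical names.

```python
import re


def scan_column(values, pattern):
    """Count rows matching the pattern and compute prevalence (matches / rows)."""
    regex = re.compile(pattern)
    total = len(values)
    matches = sum(
        1 for value in values
        if value is not None and regex.search(str(value))
    )
    prevalence = matches / total if total else 0.0
    return {"match_count": matches, "total_rows": total, "prevalence": prevalence}


def scan_dataset(columns, candidates):
    """Scan each candidate column; keep only columns with at least one match."""
    findings = []
    for name, pattern in candidates.items():
        result = scan_column(columns.get(name, []), pattern)
        if result["match_count"] > 0:
            findings.append(
                {"column_name": name, "detection_regex": pattern, **result}
            )
    return findings
```

In the example response, the ssn column has 423 matches in 1,250 rows, a prevalence of about 34%.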

3. Persistence

  • Findings are stored in Supabase pii_findings table
  • Associated with scan_id (if provided) and upload_id
  • Initial status set to open for review workflow

Graceful Degradation

If AI analysis fails:
  • Returns empty findings array
  • Sets pii_detected to false
  • Provides explanatory summary message
  • Does not block API response
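
The fallback behavior can be sketched as a wrapper around the two phases. Everything here is hypothetical (run_scan, the analyze and scan callables, and the summary wording); only the shape of the fallback response comes from the list above.

```python
def run_scan(analyze, scan):
    """Run AI analysis then the regex scan, degrading gracefully on AI failure.

    analyze: callable returning candidate columns (may raise).
    scan:    callable taking those candidates and returning a findings list.
    """
    try:
        candidates = analyze()
    except Exception as exc:
        # AI step failed: respond normally with an empty, explanatory result.
        return {
            "findings": [],
            "pii_detected": False,
            "summary": f"AI analysis unavailable ({exc}); no findings generated.",
        }
    findings = scan(candidates)
    return {
        "findings": findings,
        "pii_detected": bool(findings),
        "summary": f"Found {len(findings)} column(s) with PII exposure.",
    }
```

Because a failed analysis also yields pii_detected set to false, callers may want to read the summary before treating an empty result as a clean dataset.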

Performance Characteristics

  • Sample size: 20 random rows (optimal for AI analysis)
  • Full scan: All rows processed with regex
  • Typical processing time: 3-8 seconds for 10k rows
  • Memory efficient: Streams data without full duplication

Privacy & Security

  • Sample values are automatically masked before storage
  • Masking preserves first character and length for verification
  • Full raw values never leave the upload store
  • Regex patterns are JavaScript-compatible and safe to execute
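
The masking rule stated above (keep the first character, preserve the length) can be sketched as below. mask_value is a hypothetical helper. The samples in the example response, such as j***@example.com and ***-**-1234, suggest the real masker is format-aware, so treat this as the generic rule only.

```python
def mask_value(value: str, mask_char: str = "*") -> str:
    """Keep the first character and replace the rest, preserving length."""
    if not value:
        return value
    return value[0] + mask_char * (len(value) - 1)


masked = mask_value("Jane Doe")  # "J*******"
```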

Notes

  • Only columns with an AI confidence score above 60 (on the 0-100 scale) are analyzed
  • Findings include both column-level and value-level evidence
  • Severity levels align with GDPR, CCPA, and GLBA requirements
  • Suggestions are actionable and compliance-focused
  • Scan can be run multiple times on the same upload
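
The confidence cutoff from the first note can be sketched as a filter over AI suggestions. The suggestion shape (a dict with a confidence key on the 0-100 scale) and the select_candidates helper are assumptions; the strict comparison follows the "> 60%" wording.

```python
CONFIDENCE_THRESHOLD = 60  # from the note: confidence must exceed 60


def select_candidates(suggestions, threshold=CONFIDENCE_THRESHOLD):
    """Keep only suggestions whose confidence is strictly above the threshold."""
    return [s for s in suggestions if s["confidence"] > threshold]
```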
