PII Scan

This endpoint analyzes the uploaded dataset for Personally Identifiable Information (PII) exposure. It uses AI-powered analysis combined with regex-based detection to identify sensitive data across all columns and rows.

Request

upload_id

string

required

UUID of the uploaded dataset from /api/data/upload.

scan_id

string

Optional UUID of an existing scan to associate findings with. If not provided, findings are stored with upload_id only.

Example Request

curl -X POST https://your-domain.com/api/data/pii-scan \
  -H "Content-Type: application/json" \
  -d '{
    "upload_id": "a3f12b45-8c7d-4e9f-b1a2-3c4d5e6f7a8b",
    "scan_id": "f7e6d5c4-b3a2-1098-7654-321fedcba098"
  }'
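The same request can be built from Python. This is a minimal sketch using only the standard library; the helper name build_pii_scan_request is hypothetical, and the host is the placeholder from the curl example. It constructs the request without sending it.

```python
import json
import urllib.request
from typing import Optional

# Placeholder host taken from the curl example; replace with your deployment.
BASE_URL = "https://your-domain.com"


def build_pii_scan_request(upload_id: str, scan_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/data/pii-scan."""
    payload = {"upload_id": upload_id}
    if scan_id is not None:
        # scan_id is optional; when omitted, findings are stored with upload_id only.
        payload["scan_id"] = scan_id
    return urllib.request.Request(
        url=f"{BASE_URL}/api/data/pii-scan",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request; the response body can then be decoded with json.load.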

Response

findings

array

required

Array of PII findings detected in the dataset. Each finding contains:

column_name (string) - Name of the column containing PII
pii_type (string) - Type of PII detected
severity (string) - Risk severity level
confidence (number) - AI confidence score (0-100)
match_count (number) - Number of rows with PII matches
total_rows (number) - Total rows analyzed
masked_samples (array) - Sample values with PII masked
detection_regex (string) - Regex pattern used for detection
violation_text (string) - Description of the privacy risk
suggestion (string) - Recommended remediation action

summary

string

required

Human-readable summary of PII analysis results.

pii_detected

boolean

required

Indicates whether any PII was found in the dataset.

PII Types

email - Email addresses
phone - Phone numbers
ssn - Social Security Numbers
name - Personal names
address - Physical addresses
date_of_birth - Birth dates
credit_card - Credit card numbers
ip_address - IP addresses
passport - Passport numbers
national_id - National ID numbers
bank_account - Bank account numbers
other - Other PII types

Severity Levels

CRITICAL - SSN, credit card, passport (immediate risk)
HIGH - Email, phone, bank account (high risk)
MEDIUM - Name, address, date of birth, IP address (moderate risk)
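For client-side checks, the severity list above can be transcribed into a lookup table. This is only a sketch: the names SEVERITY_BY_PII_TYPE and severity_for are hypothetical, and because the list does not assign a level to national_id or other, the lookup returns None for those rather than guessing.

```python
from typing import Optional

# Transcribed from the documented severity list; national_id and other
# are intentionally absent because the list does not assign them a level.
SEVERITY_BY_PII_TYPE = {
    "ssn": "CRITICAL",
    "credit_card": "CRITICAL",
    "passport": "CRITICAL",
    "email": "HIGH",
    "phone": "HIGH",
    "bank_account": "HIGH",
    "name": "MEDIUM",
    "address": "MEDIUM",
    "date_of_birth": "MEDIUM",
    "ip_address": "MEDIUM",
}


def severity_for(pii_type: str) -> Optional[str]:
    """Return the documented severity, or None for types the list does not cover."""
    return SEVERITY_BY_PII_TYPE.get(pii_type)
```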

Example Response

{
  "findings": [
    {
      "column_name": "customer_email",
      "pii_type": "email",
      "severity": "HIGH",
      "confidence": 95,
      "match_count": 1247,
      "total_rows": 1250,
      "masked_samples": [
        "j***@example.com",
        "s***@company.org",
        "a***@domain.net"
      ],
      "detection_regex": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
      "violation_text": "Column contains email addresses which are considered PII under GDPR and CCPA. Exposure risk: HIGH. This data could be used to identify individuals and may require consent for processing.",
      "suggestion": "Hash email addresses using SHA-256 or remove column if not required for analysis. Consider implementing email masking for non-production environments."
    },
    {
      "column_name": "account_holder",
      "pii_type": "name",
      "severity": "MEDIUM",
      "confidence": 87,
      "match_count": 1250,
      "total_rows": 1250,
      "masked_samples": [
        "J*** D***",
        "S*** W***",
        "A*** M***"
      ],
      "detection_regex": "^[A-Z][a-z]+ [A-Z][a-z]+$",
      "violation_text": "Column contains personal names. Under GDPR Article 4, names are directly identifiable PII. This creates a moderate privacy risk.",
      "suggestion": "Replace names with pseudonymized identifiers (e.g., USER_001) or use tokenization to preserve referential integrity while protecting identity."
    },
    {
      "column_name": "ssn",
      "pii_type": "ssn",
      "severity": "CRITICAL",
      "confidence": 98,
      "match_count": 423,
      "total_rows": 1250,
      "masked_samples": [
        "***-**-1234",
        "***-**-5678",
        "***-**-9012"
      ],
      "detection_regex": "\\b\\d{3}-\\d{2}-\\d{4}\\b",
      "violation_text": "CRITICAL: Column contains Social Security Numbers. This is highly sensitive PII that poses severe identity theft risk if exposed. Violates multiple regulations including GLBA and state privacy laws.",
      "suggestion": "IMMEDIATE ACTION REQUIRED: Encrypt or remove SSN column. If retention is legally required, use field-level encryption with key management system. Never store SSNs in plain text."
    }
  ],
  "summary": "PII analysis complete. Found 3 columns with PII exposure across 1,250 rows. Severity breakdown: 1 CRITICAL, 1 HIGH, 1 MEDIUM. Immediate remediation required for SSN column.",
  "pii_detected": true
}
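A client will often want to rank findings and see how much of each column is affected. The sketch below assumes the response has already been parsed into a dict with the fields shown above; summarize_findings and severity_breakdown are hypothetical helper names, and prevalence is computed here as match_count divided by total_rows.

```python
from collections import Counter
from typing import Any, Dict, List

# Most severe first; unknown levels sort last.
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM"]


def summarize_findings(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per finding, sorted by severity, with a computed prevalence."""
    rank = {level: i for i, level in enumerate(SEVERITY_ORDER)}
    rows = []
    for finding in response.get("findings", []):
        total = finding["total_rows"]
        rows.append({
            "column_name": finding["column_name"],
            "pii_type": finding["pii_type"],
            "severity": finding["severity"],
            # Share of analyzed rows that matched the detection regex.
            "prevalence": finding["match_count"] / total if total else 0.0,
        })
    rows.sort(key=lambda row: rank.get(row["severity"], len(rank)))
    return rows


def severity_breakdown(response: Dict[str, Any]) -> Dict[str, int]:
    """Count findings per severity level."""
    return dict(Counter(f["severity"] for f in response.get("findings", [])))
```

Applied to the example response, the ssn column sorts first even though it has the lowest prevalence, since ordering is by severity rather than by match count.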

Error Responses

error

string

Error code:

Bad Request - Missing or invalid upload_id
Not Found - Upload not found or expired
Internal Server Error - Unexpected server error

message

string

Human-readable error description.

Example Error Response

{
  "error": "Not Found",
  "message": "Upload not found — may have expired"
}
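One way a client might surface these errors is to turn the payload into an exception. This is a sketch under a stated assumption: an error payload is recognized by the presence of the error key, which the documented success response does not contain. PiiScanError and raise_for_error are hypothetical names.

```python
from typing import Any, Dict


class PiiScanError(Exception):
    """Raised when the response body is an error payload."""

    def __init__(self, error: str, message: str) -> None:
        super().__init__(f"{error}: {message}")
        self.error = error
        self.message = message


def raise_for_error(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return a success body unchanged; raise PiiScanError for an error body."""
    # Assumption: only error responses carry an "error" key.
    if "error" in body:
        raise PiiScanError(body["error"], body.get("message", ""))
    return body
```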

Detection Process

1. Sample Analysis (AI-Powered)

Randomly samples 20 rows from the dataset using Fisher-Yates shuffle
Sends column names and sample data to Gemini AI
AI analyzes patterns and suggests PII types with confidence scores
Generates regex patterns for each detected PII type
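The sampling step can be illustrated with a partial Fisher-Yates shuffle, which only shuffles as many positions as are needed. This is a sketch of the technique, not the service's actual code; sample_rows and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], k: int = 20, rng: Optional[random.Random] = None) -> List[T]:
    """Pick up to k rows uniformly at random using a partial Fisher-Yates shuffle."""
    rng = rng or random.Random()
    pool = list(rows)  # copy so the caller's data is not reordered
    n = len(pool)
    k = min(k, n)
    for i in range(k):
        # Swap position i with a uniformly chosen position in [i, n - 1].
        j = rng.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]
```

Stopping after k swaps yields a uniform sample without replacement while doing far less work than shuffling the full dataset. Datasets with fewer than k rows are returned in full, in shuffled order.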

2. Full Dataset Scan (Regex-Based)

For each column identified as containing PII:
- Applies regex pattern to all rows
- Counts matches and calculates prevalence
- Collects masked sample values
Only columns with matches are included in final findings
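The full-scan step for a single column might look like the sketch below. It is illustrative only: scan_column and its return shape are hypothetical, the inline masking is a simplified stand-in, and Python's re module is used here even though the service's patterns are described as JavaScript-compatible. The simple patterns in the example response behave the same way in both engines, but that is not guaranteed for every pattern.

```python
import re
from typing import Any, Dict, List, Optional


def scan_column(values: List[Any], pattern: str, max_samples: int = 3) -> Optional[Dict[str, Any]]:
    """Apply a regex to every value in a column; return None if nothing matches."""
    regex = re.compile(pattern)
    match_count = 0
    masked_samples: List[str] = []
    for value in values:
        text = "" if value is None else str(value)
        if regex.search(text):
            match_count += 1
            if len(masked_samples) < max_samples:
                # Simplified masking: keep the first character, hide the rest.
                masked_samples.append(text[:1] + "*" * max(len(text) - 1, 0))
    if match_count == 0:
        return None  # columns without matches are left out of the findings
    return {
        "match_count": match_count,
        "total_rows": len(values),
        "prevalence": match_count / len(values),
        "masked_samples": masked_samples,
    }
```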

3. Persistence

Findings are stored in the Supabase pii_findings table
Each finding is associated with upload_id and, if provided, scan_id
Initial status is set to open for the review workflow

Graceful Degradation

If AI analysis fails:

Returns empty findings array
Sets pii_detected to false
Provides explanatory summary message
Does not block API response
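The fallback behavior described above can be sketched as a wrapper around the AI step. The function name, the callable parameter, and the summary wording are all assumptions made for illustration; only the shape of the returned payload comes from this page.

```python
from typing import Any, Callable, Dict, List


def analyze_with_fallback(run_ai_analysis: Callable[[], List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Run the AI step; on any failure return an empty, well-formed response."""
    try:
        findings = run_ai_analysis()
    except Exception as exc:
        # Degrade gracefully: the caller still receives a valid response body.
        return {
            "findings": [],
            "summary": f"PII analysis could not be completed ({type(exc).__name__}).",
            "pii_detected": False,
        }
    return {
        "findings": findings,
        "summary": f"PII analysis complete. Found {len(findings)} columns with PII exposure.",
        "pii_detected": len(findings) > 0,
    }
```

Because a failed analysis and a clean dataset both produce pii_detected set to false, callers that need to tell them apart should inspect the summary text.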

Performance Characteristics

Sample size: 20 random rows sent for AI analysis
Full scan: All rows processed with regex
Typical processing time: 3-8 seconds for 10k rows
Memory efficient: Streams data without full duplication

Privacy & Security

Sample values are automatically masked before storage
Masking preserves first character and length for verification
Full raw values never leave the upload store
Regex patterns are JavaScript-compatible and safe to execute
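A masking rule that keeps the first character and the original length could be written as follows. This is a sketch with hypothetical function names. The email variant masks only the local part, matching the style of the email samples in the example response; the SSN samples there follow a different, type-specific scheme that this sketch does not cover.

```python
def mask_value(value: str) -> str:
    """Keep the first character and replace every other character with '*'."""
    if not value:
        return value
    return value[0] + "*" * (len(value) - 1)


def mask_email(value: str) -> str:
    """Mask only the local part so the domain stays readable."""
    local, sep, domain = value.partition("@")
    if not sep:
        # Not an email-shaped value; fall back to the generic rule.
        return mask_value(value)
    return mask_value(local) + sep + domain
```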

Notes

Only columns with an AI confidence score above 60 (on the 0-100 scale) are analyzed
Findings include both column-level and value-level evidence
Severity levels align with GDPR, CCPA, and GLBA requirements
Suggestions are actionable and compliance-focused
Scan can be run multiple times on the same upload
