Overview

The Static Analysis package provides automated security vulnerability scanning using Semgrep rules and optional CodeQL integration. It features parallel scanning, policy-based rule selection, and SARIF output for unified vulnerability processing.

Purpose

Scan source code repositories for security vulnerabilities using:
  • Semgrep: Pattern-based static analysis with custom and standard rule packs
  • CodeQL: Advanced semantic analysis (optional)
  • Parallel execution: Multiple scans run concurrently for speed
  • Policy groups: Organized rule categories (crypto, secrets, injection, auth)
  • SARIF output: Standardized format for downstream processing

Main Entry Point

main()

CLI entry point for automated code security scanning.
from packages.static_analysis import main

main()  # Uses command-line arguments

CLI Usage

Basic Scan

# Scan a local repository
python3 -m packages.static_analysis.scanner \
  --repo /path/to/code \
  --policy_groups crypto,secrets,injection

# Scan from Git URL
python3 -m packages.static_analysis.scanner \
  --repo https://github.com/org/repo \
  --policy_groups all

With CodeQL

# Include CodeQL analysis
python3 -m packages.static_analysis.scanner \
  --repo /path/to/code \
  --policy_groups all \
  --codeql

Advanced Options

# Sequential scanning (for debugging)
python3 -m packages.static_analysis.scanner \
  --repo /path/to/code \
  --policy_groups crypto,injection \
  --sequential

# Keep temporary directory
python3 -m packages.static_analysis.scanner \
  --repo /path/to/code \
  --policy_groups all \
  --keep

Python API

Parallel Scanning

from pathlib import Path
from packages.static_analysis.scanner import semgrep_scan_parallel
from core.config import RaptorConfig

repo_path = Path("/path/to/code")
rules_dirs = [
    str(RaptorConfig.SEMGREP_RULES_DIR / "crypto"),
    str(RaptorConfig.SEMGREP_RULES_DIR / "injection"),
]
out_dir = Path("out/scan_results")

# Run parallel scans
sarif_files = semgrep_scan_parallel(
    repo_path=repo_path,
    rules_dirs=rules_dirs,
    out_dir=out_dir,
    timeout=1800,
    progress_callback=lambda msg: print(f"[+] {msg}")
)

print(f"Generated {len(sarif_files)} SARIF files")

Sequential Scanning

from packages.static_analysis.scanner import semgrep_scan_sequential

# Use sequential mode for debugging
sarif_files = semgrep_scan_sequential(
    repo_path=repo_path,
    rules_dirs=rules_dirs,
    out_dir=out_dir,
    timeout=1800
)

Safe Repository Cloning

from packages.static_analysis.scanner import safe_clone
from pathlib import Path
import tempfile

tmp = Path(tempfile.mkdtemp(prefix="raptor_scan_"))

# Clones with URL validation
repo_path = safe_clone(
    url="https://github.com/org/repo",
    workdir=tmp
)

print(f"Cloned to: {repo_path}")

Core Functions

semgrep_scan_parallel()

Run Semgrep scans in parallel for improved performance.
Parameters:
  • repo_path (Path, required): Path to the repository to scan
  • rules_dirs (List[str], required): List of rule directory paths
  • out_dir (Path, required): Output directory for SARIF results
  • timeout (int, default: 1800): Timeout per scan in seconds
  • progress_callback (Optional[Callable]): Optional callback for progress updates

Returns:
  • sarif_paths (List[str]): List of generated SARIF file paths
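Conceptually, the parallel scanner is a bounded fan-out: one scan per rules directory, capped at the configured worker count. A minimal sketch, assuming each scan is independent; `run_one` here is a stand-in for `run_single_semgrep`, and the worker cap mirrors `MAX_SEMGREP_WORKERS`:

```python
# Sketch of bounded parallel fan-out over rule directories.
# `run_one` is a hypothetical stand-in for run_single_semgrep.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional, Tuple

MAX_WORKERS = 4  # mirrors MAX_SEMGREP_WORKERS in core/config.py

def scan_parallel(
    rules_dirs: List[str],
    run_one: Callable[[str], Tuple[str, bool]],
    progress_callback: Optional[Callable[[str], None]] = None,
) -> List[str]:
    """Run one scan per rules directory concurrently; keep successful SARIFs."""
    sarif_paths: List[str] = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(run_one, d): d for d in rules_dirs}
        for fut in as_completed(futures):
            sarif_path, ok = fut.result()
            if progress_callback:
                progress_callback(f"finished {futures[fut]} (ok={ok})")
            if ok:
                sarif_paths.append(sarif_path)
    return sarif_paths
```

Threads suffice here because each scan blocks on an external subprocess, not Python-level CPU work.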

run_single_semgrep()

Run a single Semgrep scan (used internally by parallel scanner).
Parameters:
  • name (str, required): Scan name (e.g., "category_crypto")
  • config (str, required): Semgrep config (path or pack ID)
  • repo_path (Path, required): Repository path
  • out_dir (Path, required): Output directory
  • timeout (int, required): Timeout in seconds

Returns:
  • result (Tuple[str, bool]): Tuple of (sarif_path, success)
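A single scan amounts to shelling out to the semgrep CLI and enforcing the per-scan timeout at the subprocess level. A sketch under that assumption; `--config`, `--sarif`, and `--output` are standard semgrep flags, but the exact invocation used by `run_single_semgrep` is an assumption:

```python
# Hedged sketch of one semgrep invocation; the real command line used by
# run_single_semgrep may differ.
import subprocess
from pathlib import Path
from typing import List, Tuple

def build_semgrep_cmd(config: str, repo_path: Path, out_path: Path) -> List[str]:
    """Assemble the semgrep CLI invocation for one rule pack."""
    return [
        "semgrep", "scan",
        "--config", config,
        "--sarif",
        "--output", str(out_path),
        str(repo_path),
    ]

def run_one(name: str, config: str, repo_path: Path, out_dir: Path,
            timeout: int) -> Tuple[str, bool]:
    """Run one scan; the total timeout is enforced on the subprocess."""
    out_path = out_dir / f"semgrep_{name}.sarif"
    try:
        proc = subprocess.run(build_semgrep_cmd(config, repo_path, out_path),
                              capture_output=True, timeout=timeout)
        return str(out_path), proc.returncode == 0
    except subprocess.TimeoutExpired:
        return str(out_path), False
```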

validate_repo_url()

Validate repository URL against allowed patterns.
from packages.static_analysis.scanner import validate_repo_url

# Returns True for valid URLs
assert validate_repo_url("https://github.com/org/repo")
assert validate_repo_url("git@github.com:org/repo.git")

# Returns False for invalid URLs
assert not validate_repo_url("file:///etc/passwd")
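The actual allowlist is not shown here; a minimal sketch of pattern-based validation, with hypothetical patterns covering HTTPS and SSH remotes:

```python
# Sketch of allowlist-based URL validation. The patterns below are
# illustrative assumptions, not the scanner's real allowlist.
import re

_ALLOWED = [
    re.compile(r"^https://(github\.com|gitlab\.com)/[\w.-]+/[\w.-]+(\.git)?$"),
    re.compile(r"^git@(github\.com|gitlab\.com):[\w.-]+/[\w.-]+(\.git)?$"),
]

def validate_url(url: str) -> bool:
    """Accept a repo URL only if it matches an allowed remote pattern."""
    return any(p.match(url) for p in _ALLOWED)
```

Rejecting everything not explicitly allowed (rather than blocking known-bad schemes like `file://`) keeps local-file and SSRF-style tricks out by default.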

sha256_tree()

Generate SHA256 hash of directory tree for caching.
from packages.static_analysis.scanner import sha256_tree
from pathlib import Path

repo_hash = sha256_tree(Path("/path/to/code"))
print(f"Repository hash: {repo_hash}")
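A stable tree hash needs a deterministic traversal order; hashing relative paths along with file contents makes renames change the digest too. A minimal sketch of the idea (the real `sha256_tree` may differ in which files it includes):

```python
# Sketch of a deterministic directory-tree digest for scan caching.
import hashlib
from pathlib import Path

def tree_hash(root: Path) -> str:
    """Hash relative paths and contents in sorted order for a stable digest."""
    h = hashlib.sha256()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            h.update(str(p.relative_to(root)).encode())
            h.update(p.read_bytes())
    return h.hexdigest()
```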

Configuration

Policy Groups

Available policy groups in RaptorConfig:
  • crypto: Cryptographic vulnerabilities
  • secrets: Hardcoded secrets, API keys
  • injection: SQL injection, command injection, XSS
  • auth: Authentication and authorization flaws
  • all: All available rule categories
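The `--policy_groups` value has to be expanded into rule directories before scanning. A sketch of how that resolution might look; `RULES_DIR` stands in for `RaptorConfig.SEMGREP_RULES_DIR`, and the validation behavior is an assumption:

```python
# Hedged sketch: expand a --policy_groups value into rule directories.
from pathlib import Path
from typing import List

RULES_DIR = Path("rules/semgrep")  # assumed; really RaptorConfig.SEMGREP_RULES_DIR
GROUPS = ["crypto", "secrets", "injection", "auth"]

def resolve_policy_groups(spec: str) -> List[str]:
    """Expand "all" or a comma-separated group list; reject unknown names."""
    names = GROUPS if spec == "all" else [s.strip() for s in spec.split(",")]
    unknown = [n for n in names if n not in GROUPS]
    if unknown:
        raise ValueError(f"unknown policy groups: {unknown}")
    return [str(RULES_DIR / n) for n in names]
```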

Environment Variables

# Semgrep rules directory
export SEMGREP_RULES_DIR=/path/to/custom/rules

# Output directory
export RAPTOR_OUT_DIR=/path/to/output

Timeouts

Configured in core/config.py:
SEMGREP_TIMEOUT = 1800          # Total timeout per scan
SEMGREP_RULE_TIMEOUT = 300      # Timeout per rule
GIT_CLONE_TIMEOUT = 600         # Git clone timeout
MAX_SEMGREP_WORKERS = 4         # Parallel workers

Output Structure

out/scan_{repo}_{timestamp}/
├── semgrep_category_crypto.sarif      # Per-category SARIF
├── semgrep_category_injection.sarif
├── semgrep_baseline_security.sarif    # Baseline packs
├── codeql_java.sarif                  # CodeQL (if --codeql)
├── combined.sarif                     # Merged & deduplicated
├── scan-manifest.json                 # Scan metadata
├── scan_metrics.json                  # Finding metrics
└── verification.json                  # Verification data
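The merged-and-deduplicated `combined.sarif` implies a merge step keyed on something like (rule, file, line). A sketch of that idea, assuming SARIF 2.1.0 documents whose results carry standard `ruleId` and `physicalLocation` fields; the real dedup key is an assumption:

```python
# Hedged sketch: merge SARIF runs, dropping results that repeat the same
# (ruleId, file, startLine) triple across scans.
import json
from pathlib import Path
from typing import Iterable

def merge_sarif(paths: Iterable[Path]) -> dict:
    """Merge runs from several SARIF files into one deduplicated document."""
    merged = {"version": "2.1.0", "runs": []}
    seen = set()
    for path in paths:
        doc = json.loads(Path(path).read_text())
        for run in doc.get("runs", []):
            kept = []
            for res in run.get("results", []):
                loc = res.get("locations", [{}])[0].get("physicalLocation", {})
                key = (
                    res.get("ruleId"),
                    loc.get("artifactLocation", {}).get("uri"),
                    loc.get("region", {}).get("startLine"),
                )
                if key not in seen:
                    seen.add(key)
                    kept.append(res)
            merged["runs"].append({**run, "results": kept})
    return merged
```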

Manifest Example

{
  "agent": "auto_codesec",
  "version": "2.0.0",
  "repo_path": "/path/to/repo",
  "timestamp_utc": "2026-03-04T12:00:00Z",
  "input_hash": "abc123...",
  "policy_version": "1.0",
  "policy_groups": ["crypto", "injection"],
  "parallel_scanning": true
}

Metrics Example

{
  "total_findings": 47,
  "total_files_scanned": 235,
  "by_severity": {
    "error": 12,
    "warning": 28,
    "note": 7
  },
  "by_category": {
    "crypto": 8,
    "injection": 15,
    "secrets": 4
  }
}
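Counts like the ones above can be derived directly from the merged SARIF, since each SARIF result carries a severity `level` (error/warning/note). A sketch of the severity tally; the category breakdown would need the rule-to-category mapping, which isn't shown here:

```python
# Hedged sketch: derive scan_metrics.json-style severity counts from SARIF.
from collections import Counter

def compute_metrics(sarif: dict) -> dict:
    """Count findings overall and per SARIF severity level."""
    total = 0
    by_severity: Counter = Counter()
    for run in sarif.get("runs", []):
        for res in run.get("results", []):
            total += 1
            # SARIF results without an explicit level default to "warning"
            by_severity[res.get("level", "warning")] += 1
    return {"total_findings": total, "by_severity": dict(by_severity)}
```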

Performance

Parallel Scanning

  • Small repos (<1K files): 2-5 minutes
  • Medium repos (1K-10K files): 5-15 minutes
  • Large repos (10K+ files): 15-30 minutes

Sequential vs Parallel

  • Sequential: one scan at a time (slower, but easier to debug)
  • Parallel: up to 4 scans at once (up to ~4x faster, depending on repo size and rule count)

Best Practices

  1. Use parallel scanning for production (default)
  2. Select specific policy groups for targeted analysis
  3. Enable CodeQL for comprehensive coverage
  4. Merge SARIFs for unified downstream processing
  5. Cache results using repository hash for repeat scans
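Practice 5 pairs the tree hash with a results cache: skip the scan when the hash of the tree has been seen before. A minimal sketch of that lookup, assuming a flat directory of `{hash}.json` entries (the real cache layout is not documented here):

```python
# Hedged sketch: hash-keyed results cache for repeat scans.
import json
from pathlib import Path
from typing import Optional

def cached_results(repo_hash: str, cache_dir: Path) -> Optional[dict]:
    """Return cached metrics for this tree hash, or None on a miss."""
    entry = cache_dir / f"{repo_hash}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    return None

def store_results(repo_hash: str, cache_dir: Path, metrics: dict) -> None:
    """Record metrics under the tree hash for future cache hits."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{repo_hash}.json").write_text(json.dumps(metrics))
```

Because the key is a content hash, any edit to the repository invalidates the cache automatically.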