## Overview
The `engine/ingestor.py` module is Chronos-DFIR's universal file parser. It extracts forensic data from 10+ file formats and normalizes everything into Polars DataFrames.
## Supported Formats
| Format | Extensions | Use Case |
|---|---|---|
| CSV | .csv | EVTX exports, MFT, Prefetch, ShimCache |
| XLSX | .xlsx | Excel forensic reports, investigator notes |
| JSON | .json, .jsonl, .ndjson | API logs, cloud telemetry, SOAR exports |
| SQLite | .db, .sqlite, .sqlite3 | Browser history, Windows Timeline, mobile artifacts |
| Plist | .plist | macOS LaunchAgents, LaunchDaemons, preferences |
| PSList | .pslist | Volatility process dumps |
| TXT/LOG | .txt, .log | macOS Unified Logs, syslog, custom formats |
| Parquet | .parquet | Efficient columnar storage (internal exports) |
| TSV | .tsv | Tab-separated forensic data |
| ZIP | .zip | Bulk macOS plist archives |
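The table above amounts to an extension-to-parser dispatch. The sketch below is a hypothetical reconstruction (the map name `EXT_DISPATCH`, the function `detect_parser()`, and the family labels are illustrative; the real module may organize detection differently):

```python
# Hypothetical extension dispatch mirroring the Supported Formats table.
EXT_DISPATCH = {
    ".csv": "csv", ".tsv": "csv",
    ".xlsx": "xlsx",
    ".json": "json", ".jsonl": "json", ".ndjson": "json",
    ".db": "sqlite", ".sqlite": "sqlite", ".sqlite3": "sqlite",
    ".plist": "plist",
    ".parquet": "parquet",
    ".zip": "zip_plist",
    ".pslist": "text", ".txt": "text", ".log": "text",
}

def detect_parser(ext: str) -> str:
    """Return the parser family for an extension (leading dot, any case)."""
    # Unknown extensions fall through to the whitespace/text parser,
    # matching the documented TXT/LOG fallback behavior.
    return EXT_DISPATCH.get(ext.lower(), "text")
```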
Zero pandas dependency. All parsing uses Polars, plistlib, sqlite3, and built-in Python libraries.

## Core Functions
### ingest_file()
Main entry point for file parsing. Auto-detects format and returns a Polars DataFrame or LazyFrame.
Signature:

Parameters:
- `file_path`: Absolute path to the file
- `ext`: File extension (with leading dot, e.g., `".csv"`, `".json"`)

Returns a 3-tuple `(lf, df_eager, file_cat)`:
- `lf`: `pl.LazyFrame` (streaming, for large files) OR `None`
- `df_eager`: `pl.DataFrame` (collected, for small/special formats) OR `None`
- `file_cat`: `str` — Category label (e.g., `"Memory/Volatility_PSList"`, `"macOS/Unified_Logs"`)

Exactly one of `lf` or `df_eager` is set; the other is `None`.

### normalize_and_save()
Normalizes column names, adds an `_id` index, and writes to CSV.
Signature:

Parameters:
- `lf`: Lazy frame to process (if available)
- `df_eager`: Eager DataFrame to process (if lazy not available)
- `dest_path`: Output CSV path (e.g., `"/tmp/normalized.csv"`)

Returns: Number of rows written. Returns `-1` for lazy frames (unknown until the sink completes).

Normalization steps:
- Column name cleaning: Strip leading underscores, capitalize first letter
- Reserved columns: `_time` → `Time`, `_id` → `Original_Id`
- Numeric columns: `123` → `Field_123`
- Add `_id` index: Row numbers starting from 1
- Write CSV: Streaming (`sink_csv`) for lazy, direct write for eager
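The column-name rules above can be expressed as a small pure function. This is a sketch, not the module's actual code; `normalize_column` is a hypothetical helper name:

```python
import re

# Reserved mappings as documented: _time -> Time, _id -> Original_Id.
RESERVED = {"_time": "Time", "_id": "Original_Id"}

def normalize_column(name: str) -> str:
    """Apply the documented normalization rules to one column name."""
    if name in RESERVED:                      # reserved field mapping first
        return RESERVED[name]
    name = name.lstrip("_")                   # strip leading underscores
    if re.fullmatch(r"\d+", name):            # pure numeric -> Field_N
        return f"Field_{name}"
    # Capitalize the first character, leave the rest untouched.
    return name[:1].upper() + name[1:] if name else name
```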
## Format-Specific Parsers

### CSV: _parse_csv_robust()
Handles tricky CSV files with encoding issues and headerless formats.
Features:
- Encoding fallback: UTF-8 → `utf8-lossy` if parsing fails
- Headerless detection: Recognizes CSV files starting with Unix permissions (e.g., `drwxr-xr-x`)
- Auto-generated column names: `Field_0`, `Field_1`, …
- Category: `FileSystem/LS_Triage` for `ls -la` output
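The headerless-detection and auto-naming features can be sketched in stdlib Python. The helper names and the exact permission regex are assumptions; the module's real heuristic may differ:

```python
import re

# A line starting with Unix permission bits (e.g. "drwxr-xr-x")
# signals headerless `ls -la` triage output rather than a CSV header row.
PERM_RE = re.compile(r"^[bcdlps-][rwxsStT-]{9}")

def looks_like_ls_output(first_line: str) -> bool:
    """True if the first CSV cell resembles a Unix permission string."""
    return bool(PERM_RE.match(first_line.split(",")[0].strip()))

def auto_headers(n_cols: int) -> list[str]:
    """Generate Field_0, Field_1, ... names for headerless files."""
    return [f"Field_{i}" for i in range(n_cols)]
```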
### JSON: ingest_json_file()
Safely handles both NDJSON (newline-delimited) and standard JSON arrays.
Features:
- NDJSON-first: Tries `pl.scan_ndjson()` for streaming
- Array detection: Checks the first byte for `[`
- Size-aware: Small arrays (under 100MB) → direct read; large → convert to NDJSON
- OOM prevention: Uses `ijson` streaming for gigabyte-scale JSON

Parsing order:
- Attempt `pl.scan_ndjson()` (optimistic)
- If that fails, read the first byte
- If `[` and file under 100MB → `pl.read_json()`
- If `[` and file over 100MB → stream with `ijson` to temp NDJSON
- Fallback: Try `pl.read_json()` as a single object
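The array-to-NDJSON conversion step can be sketched with the stdlib. This is illustrative only: `json_array_to_ndjson` is a hypothetical helper, and `json.load` stands in for the `ijson` streaming the module reportedly uses at gigabyte scale:

```python
import json

SIZE_LIMIT = 100 * 1024 * 1024  # the documented 100MB threshold

def json_array_to_ndjson(src: str, dest: str) -> int:
    """Rewrite a JSON array file as newline-delimited JSON; return record count."""
    with open(src, "r", encoding="utf-8") as f:
        # For files over SIZE_LIMIT, stream with ijson.items(f, "item")
        # instead of loading the whole array into memory.
        records = json.load(f)
    with open(dest, "w", encoding="utf-8") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")  # one object per line
    return len(records)
```

The resulting NDJSON file can then be scanned lazily (e.g., with `pl.scan_ndjson()`), which is the streaming-friendly path described above.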
### SQLite: Database Table Extraction

Extracts the most relevant table from SQLite databases.

Table priority:
- Preferred names: `events`, `logs`, `timeline`, `entries` (case-insensitive)
- Fallback: First non-system table (not starting with `sqlite_`)
### Plist: macOS Property Lists

Parses binary or XML plist files into tabular format.

Features:
- Nested data sanitization: Converts `bytes`, `dict`, `list` to strings
- Auto-unwrapping: Single-key dicts with list values are unwrapped to the list
- Type safety: All values cast to UTF-8 strings

Value conversion is handled by the `_sanitize_plist_val()` helper.
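A stdlib-only sketch of what `_sanitize_plist_val()` likely does, per the features above. The hex encoding for `bytes` is an assumption (the module may use a different string form):

```python
import plistlib

def sanitize_plist_val(value):
    """Collapse plist values to strings, as the documented features describe."""
    if value is None:
        return None
    if isinstance(value, bytes):
        return value.hex()            # bytes -> hex string (assumed encoding)
    if isinstance(value, (dict, list)):
        return str(value)             # nested containers -> string repr
    return str(value)                 # everything else cast to str

# Demo: parse a tiny XML plist and sanitize each value.
_DEMO = plistlib.loads(
    b'<?xml version="1.0"?><plist version="1.0"><dict>'
    b"<key>Label</key><string>com.example.agent</string>"
    b"<key>RunAtLoad</key><true/>"
    b"</dict></plist>"
)
row = {k: sanitize_plist_val(v) for k, v in _DEMO.items()}
```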
### ZIP Archives: Bulk Plist Parsing

Extracts all `.plist` files from a ZIP archive (common in macOS triage).

Features:
- Recursive extraction: Walks the entire ZIP tree
- Parallel parsing: Processes all plists in memory
- Metadata extraction: `Label`, `ProgramArguments`, `RunAtLoad`, `KeepAlive`
- Category tag: `macOS/Bulk_Plist`
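A minimal sketch of the bulk extraction, assuming in-memory processing with `zipfile` + `plistlib` (the helper name `plists_from_zip` and the `Source_File` column are illustrative):

```python
import io
import plistlib
import zipfile

# The metadata keys the documentation says are extracted.
KEYS = ("Label", "ProgramArguments", "RunAtLoad", "KeepAlive")

def plists_from_zip(zip_bytes: bytes) -> list[dict]:
    """Parse every .plist in a ZIP into one row of sanitized strings."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():            # walks the entire ZIP tree
            if not name.lower().endswith(".plist"):
                continue
            try:
                data = plistlib.loads(zf.read(name))
            except Exception:
                continue                      # skip unparsable entries
            row = {"Source_File": name}
            for key in KEYS:
                row[key] = str(data[key]) if key in data else None
            rows.append(row)
    return rows
```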
### TXT/LOG Files: Multi-Format Parser

Handles three distinct log formats:

1. macOS Unified Logs
2. macOS Persistence Triage (`ls -la`)
3. Whitespace-Separated Logs: Fallback parser for `.pslist`, `.log`, and unknown formats (function: `_read_whitespace_csv()`)
### Volatility PSList Detection

If a DataFrame has the columns `["Offset(V)", "PPID", "Threads"]`, it is automatically categorized as a Volatility memory dump.

Enrichment:
- Rename: `CreateTime` → `Time`
- Add EventID: `"Volatility_RAM_Process"`
- Add Destination_Entity: `"ProcessName [PID]"`
- Add Source_Entity: `"PPID: XXXX"`
- Set category: `"Memory/Volatility_PSList"`
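The detection and enrichment steps can be sketched over plain dict rows (the module does this with Polars expressions; the `Name`/`PID` column names are assumptions based on standard Volatility pslist output):

```python
# Marker columns that identify a Volatility PSList export.
PSLIST_MARKERS = {"Offset(V)", "PPID", "Threads"}

def is_pslist(columns: set[str]) -> bool:
    """True when all marker columns are present."""
    return PSLIST_MARKERS <= columns

def enrich_pslist_row(row: dict) -> dict:
    """Apply the documented enrichment to a single process row."""
    out = dict(row)
    out["Time"] = out.pop("CreateTime", None)           # CreateTime -> Time
    out["EventID"] = "Volatility_RAM_Process"
    out["Destination_Entity"] = f"{row.get('Name')} [{row.get('PID')}]"
    out["Source_Entity"] = f"PPID: {row.get('PPID')}"
    return out
```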
## Data Flow Diagram
## Column Normalization Rules

### Before Normalization

### After Normalization

### Rule Table
| Original | Normalized | Reason |
|---|---|---|
| `_time` | `Time` | Reserved field mapping |
| `_id` | `Original_Id` | Avoid conflict with auto-generated `_id` |
| Leading `_` | Stripped | `_eventid` → `Eventid` |
| Pure numeric | `Field_N` | `123` → `Field_123` |
| First char lowercase | Capitalized | `host` → `Host` |
## Error Handling

### Graceful Degradation

All parsers have fallback strategies.

Example: JSON Parsing
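A stdlib-only sketch of the JSON fallback chain, assuming `json.loads` in place of the module's `pl.scan_ndjson()`/`pl.read_json()` calls (`parse_json_any` is a hypothetical name):

```python
import json

def parse_json_any(text: str) -> list[dict]:
    """Try NDJSON first, then fall back to array or single-object JSON."""
    try:
        # 1. Optimistic NDJSON: one JSON object per line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        if rows and all(isinstance(r, dict) for r in rows):
            return rows
    except json.JSONDecodeError:
        pass  # not NDJSON; degrade gracefully
    # 2./3. Fallback: whole-document parse as an array or a single object.
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```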
## Performance Considerations

### Streaming vs. Eager
| Format | Mode | Reason |
|---|---|---|
| CSV (>50MB) | Lazy (`scan_csv`) | Streaming prevents OOM |
| JSON (>100MB) | Lazy (via NDJSON) | Converted to NDJSON for streaming |
| Parquet | Lazy (`scan_parquet`) | Columnar format is streaming-native |
| SQLite | Eager | Database cursor requires full read |
| Plist | Eager | Small files; in-memory parsing is faster |
| XLSX | Eager | Excel format is not stream-friendly |
### Memory Usage
## Example Workflows

### Workflow 1: EVTX CSV Ingestion

### Workflow 2: macOS Persistence Analysis

### Workflow 3: Volatility Memory Analysis
## API Reference Summary
| Function | Input | Output | Purpose |
|---|---|---|---|
| `ingest_file()` | `file_path, ext` | `(lf, df_eager, file_cat)` | Main entry point |
| `normalize_and_save()` | `lf, df_eager, dest_path` | `int` | Clean + write CSV |
| `_read_whitespace_csv()` | `file_path` | `pl.DataFrame` | Parse space-separated logs |
| `_sanitize_plist_val()` | `value` | `str`/`None` | Convert plist types |
| `_parse_text_file()` | `file_path, ext` | `(df_eager, file_cat)` | TXT/LOG multi-parser |
| `_parse_ls_triage()` | `lines, pattern` | `pl.DataFrame` | Parse `ls -la` output |
| `_parse_zip_plist()` | `file_path` | `(df_eager, file_cat)` | Extract all plists from ZIP |
| `_parse_single_plist()` | `file_path` | `pl.DataFrame` | Parse one plist file |
| `_parse_csv_robust()` | `file_path, file_cat` | `(lf, file_cat)` | CSV with encoding fallback |
## Best Practices

Ingestion guidelines:
- Always use `ingest_file()` — never call format-specific parsers directly
- Check `lf` vs. `df_eager` — only one is set; the other is `None`
- Call `normalize_and_save()` before forensic analysis
- Use `file_cat` for conditional logic (e.g., special handling for `"Memory/Volatility_PSList"`)
- For large files (>500MB), prefer formats that support lazy evaluation (Parquet, CSV, NDJSON)
### Forensic Integrity Tips
## Related Documentation
- Forensic Engine — Post-ingestion analysis
- Sigma Engine — Detection rule evaluation
- Architecture — Data flow overview