The run_full_pipeline.py script includes an automatic cleanup system (lines 69-71) that removes intermediate files after successful pipeline execution:
CLEANUP_INTERMEDIATE Flag
What It Controls
Default: True

Controls whether intermediate files and directories are automatically deleted after the pipeline completes successfully. Only the final compressed output and OHLCV data are preserved.

Behavior
- CLEANUP_INTERMEDIATE = True (default): after Phase 5 (Compression), the cleanup phase runs.
  - Files deleted: 15 intermediate JSON/CSV files
  - Directories deleted: 2 intermediate directories
  - Space freed: ~38 MB
- CLEANUP_INTERMEDIATE = False: all intermediate files and directories are left on disk after the run.
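The flag itself is a single module-level boolean near the top of run_full_pipeline.py (around line 71 per the text above; the exact surrounding code is an assumption):

```python
# run_full_pipeline.py (around line 71)
# Set to False to keep all intermediate files for inspection.
CLEANUP_INTERMEDIATE = True
```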
What Gets Deleted
The cleanup system targets 15 intermediate files and 2 directories (lines 76-98 of run_full_pipeline.py):
Intermediate Files (15 total)
Core Data Files (3)
| File | Size | Purpose |
|---|---|---|
| master_isin_map.json | ~500 KB | ISIN → Symbol mapping |
| dhan_data_response.json | ~2 MB | Raw market data from ScanX |
| fundamental_data.json | ~35 MB | Quarterly results & ratios |
Enrichment Data Files (7)
| File | Size | Purpose |
|---|---|---|
| advanced_indicator_data.json | ~8 MB | Pivot points, EMA/SMA |
| all_company_announcements.json | ~15 MB | Live corporate announcements |
| upcoming_corporate_actions.json | ~200 KB | Future dividends/splits |
| history_corporate_actions.json | ~1 MB | Past 2 years of corporate actions |
| nse_asm_list.json | ~50 KB | ASM surveillance list |
| nse_gsm_list.json | ~30 KB | GSM surveillance list |
| bulk_block_deals.json | ~500 KB | Last 30 days of bulk/block deals |
Price Data Files (4)
| File | Size | Purpose |
|---|---|---|
| upper_circuit_stocks.json | ~100 KB | Upper circuit hits (today) |
| lower_circuit_stocks.json | ~100 KB | Lower circuit hits (today) |
| incremental_price_bands.json | ~50 KB | Daily band changes |
| complete_price_bands.json | ~300 KB | Price bands for all securities |
Auxiliary Files (1)
| File | Size | Purpose |
|---|---|---|
| nse_equity_list.csv | ~200 KB | NSE listing dates CSV |
Uncompressed Output (1)
| File | Size | Purpose |
|---|---|---|
| all_stocks_fundamental_analysis.json | ~45 MB | Raw JSON before compression |
Intermediate Directories (2 total)
- company_filings/: per-stock regulatory filings, ~12-15 MB. Deleted because the top 5 filings per stock are already injected into the Recent Announcements field of the master JSON.
- market_news/

What Gets Preserved
Final Output
Files:
- all_stocks_fundamental_analysis.json.gz (~6-8 MB)
- sector_analytics.json.gz (if generated)
- market_breadth.json.gz (if generated)
OHLCV Data
Directories:
- ohlcv_data/ (~300-400 MB, ~2,775 CSV files)
- indices_ohlcv_data/ (~10-20 MB, ~194 CSV files)
Optional Data
Files (if FETCH_OPTIONAL = True):
- all_indices_list.json (~1.2 MB)
- etf_data_response.json (~2.8 MB)
F&O Data
Files (if manually run):
- fno_stocks_response.json
- fno_lot_sizes_cleaned.json
- fno_expiry_calendar.json
Cleanup Logic (Source Code)
From run_full_pipeline.py, lines 169-192:

- Only deletes files listed in the INTERMEDIATE_FILES array
- Only deletes directories listed in the INTERMEDIATE_DIRS array
- Uses os.path.exists() checks to avoid errors
- Reports freed space for transparency
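The behavior described by these bullets can be sketched roughly as follows. This is a simplified reconstruction, not the actual source; the array contents shown here are a small illustrative subset, and the reporting format is an assumption:

```python
import os
import shutil

# Illustrative subset; the real arrays live in run_full_pipeline.py (lines 76-98).
INTERMEDIATE_FILES = ["master_isin_map.json", "dhan_data_response.json"]
INTERMEDIATE_DIRS = ["company_filings", "market_news"]


def cleanup_intermediate_files():
    """Delete listed files/directories if they exist and report freed space."""
    freed = 0
    for path in INTERMEDIATE_FILES:
        if os.path.exists(path):  # guard: skip files that were never created
            freed += os.path.getsize(path)
            os.remove(path)
    for dirname in INTERMEDIATE_DIRS:
        if os.path.exists(dirname):
            # Sum the sizes of all files inside before removing the tree.
            for root, _, files in os.walk(dirname):
                freed += sum(os.path.getsize(os.path.join(root, f)) for f in files)
            shutil.rmtree(dirname)
    print(f"Cleanup complete: freed {freed / 1024 / 1024:.1f} MB")
    return freed
```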
When to Use Each Setting
Production (CLEANUP_INTERMEDIATE = True, the default). Choose this when:
- Running scheduled daily updates
- Storage space is limited
- Only the final compressed output is needed
- The pipeline is running reliably

Benefits: saves ~38 MB per run, keeps the workspace clean, and speeds up file operations (fewer files). Trade-offs: intermediate data cannot be inspected after the run, and regenerating deleted files requires re-running the entire pipeline.

Development/Debug (CLEANUP_INTERMEDIATE = False). Choose this when you need to inspect intermediate data or re-run individual phases without re-fetching everything.
Storage Comparison
With Cleanup (Default): only the compressed outputs and the OHLCV directories remain; ~38 MB of intermediates are freed each run.
Without Cleanup: the same outputs plus ~38 MB of intermediate files and directories.
How to Change Setting
Step 1: Edit run_full_pipeline.py

Open the file and navigate to line 71, where the flag is defined.

Step 2: Save and Run
Step 3: Verify Behavior
Check the console output after Phase 5.

Manual Cleanup
If you set CLEANUP_INTERMEDIATE = False and later want to remove the intermediate files, you can delete them manually.
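A one-off manual cleanup could look like the sketch below. The file and directory names passed in are a hypothetical subset of the intermediate artifacts listed earlier; adjust them to whatever is left in your working directory:

```python
import os
import shutil


def manual_cleanup(files, dirs):
    """Delete any of the given files/directories that still exist."""
    for path in files:
        if os.path.exists(path):
            os.remove(path)
            print(f"Removed {path}")
    for dirname in dirs:
        if os.path.isdir(dirname):
            shutil.rmtree(dirname)
            print(f"Removed {dirname}/")


# Hypothetical leftovers from a run with CLEANUP_INTERMEDIATE = False:
manual_cleanup(
    ["fundamental_data.json", "advanced_indicator_data.json"],
    ["company_filings", "market_news"],
)
```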
Common Questions
What if I need intermediate files later?
Solution: Re-run the pipeline with CLEANUP_INTERMEDIATE = False.

If you only need specific files (e.g., fundamental_data.json), you can run the individual scripts that produce them.

Does cleanup affect incremental updates?
No. Cleanup only removes intermediate files from the current run. OHLCV data (ohlcv_data/) is preserved because it's required for:

- Future incremental OHLCV updates
- Recalculating ADR, RVOL, and ATH if needed
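As an illustration of why the stored CSVs matter for incremental updates: a fetcher only needs the last recorded date to request the missing range. This is a hypothetical sketch; the actual update script and CSV column layout are assumptions:

```python
import csv
from datetime import date, timedelta


def next_fetch_start(csv_path):
    """Return the day after the last date recorded in an OHLCV CSV.

    Assumes rows like: date,open,high,low,close,volume with ISO dates.
    """
    last = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            last = row["date"]  # rows are assumed chronological
    if last is None:
        return None  # empty file: a full-history fetch is needed
    y, m, d = map(int, last.split("-"))
    return date(y, m, d) + timedelta(days=1)
```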
What if the pipeline fails mid-run?
Cleanup only runs if the pipeline completes successfully. If Phase 3 fails, intermediate files are preserved so you can:
- Inspect the data that caused the failure
- Re-run from the failed phase without re-fetching all data
Can I customize what gets deleted?
Yes. Edit the INTERMEDIATE_FILES and INTERMEDIATE_DIRS arrays in run_full_pipeline.py (lines 76-98).

Best Practices
For Daily Production
- Saves storage
- Keeps workspace clean
- Pipeline runs reliably
For Development
- Easier debugging
- Can inspect data quality
- Can test Phase 4 scripts independently
For CI/CD
- Reduces artifact size
- Faster deployments
- Only upload compressed .json.gz
For One-Time Runs
- Inspect all intermediate data
- Validate transformations
- Compare raw vs enriched data
Next Steps
- Pipeline Flags: configure FETCH_OHLCV and FETCH_OPTIONAL
- Pipeline Architecture: understand the full pipeline workflow
- Output Schema: explore the final JSON structure
- Error Handling: debug common pipeline issues