The EDL Pipeline can automatically clean up intermediate files after successful completion, keeping only the final compressed outputs and essential data.
Overview
Intermediate cleanup is controlled by the CLEANUP_INTERMEDIATE flag in run_full_pipeline.py (line 71):
```python
# Auto-delete intermediate files after pipeline succeeds
# Keeps: all_stocks_fundamental_analysis.json.gz + ohlcv_data/
CLEANUP_INTERMEDIATE = True
```
Default: `True` in production environments, to minimize storage usage.
What Gets Deleted
The cleanup process removes the 16 intermediate files and 2 directories listed below, which are only needed between pipeline stages.
```python
INTERMEDIATE_FILES = [
    "master_isin_map.json",                   # Stock symbol → ISIN mapping
    "dhan_data_response.json",                # Raw Dhan market data
    "fundamental_data.json",                  # Raw fundamental data (35 MB)
    "advanced_indicator_data.json",           # Technical indicators
    "all_company_announcements.json",         # Corporate announcements
    "upcoming_corporate_actions.json",        # Upcoming corp actions
    "history_corporate_actions.json",         # Historical corp actions
    "nse_asm_list.json",                      # ASM surveillance list
    "nse_gsm_list.json",                      # GSM surveillance list
    "bulk_block_deals.json",                  # Bulk/block deals
    "upper_circuit_stocks.json",              # Upper circuit stocks
    "lower_circuit_stocks.json",              # Lower circuit stocks
    "incremental_price_bands.json",           # Daily price band changes
    "complete_price_bands.json",              # All price bands
    "nse_equity_list.csv",                    # NSE listing dates
    "all_stocks_fundamental_analysis.json",   # Raw JSON (before .gz)
]

INTERMEDIATE_DIRS = [
    "company_filings",   # ~2,775 per-stock filing JSON files
    "market_news",       # ~2,775 per-stock news JSON files
]
```
What Gets Preserved
The cleanup process preserves these essential outputs:
✅ all_stocks_fundamental_analysis.json.gz (~2 MB) - Final compressed output
✅ sector_analytics.json.gz - Sector performance data
✅ market_breadth.json.gz - Market breadth metrics
✅ ohlcv_data/ directory - Historical OHLCV CSV files (~200 MB)
✅ indices_ohlcv_data/ directory - Indices OHLCV data
The ohlcv_data/ directory is preserved because re-fetching it takes 25-35 minutes, and the smart incremental updater needs the existing data to calculate which date ranges still have to be fetched.
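The updater's date-range logic is not shown in this document, but a minimal sketch of the idea is below: resume each stock from the day after the last row in its existing CSV. The file layout and the ISO `date` column name are assumptions, not the pipeline's actual schema:

```python
import csv
import os
from datetime import date, datetime, timedelta

def next_fetch_start(csv_path: str, default_lookback_days: int = 365) -> date:
    """Return the first date that still needs fetching for one stock's OHLCV CSV.

    If the CSV exists, resume from the day after its last row's date;
    otherwise fall back to a full lookback window.
    """
    if not os.path.exists(csv_path):
        return date.today() - timedelta(days=default_lookback_days)
    last = None
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            last = row["date"]          # assumes an ISO-formatted "date" column
    if last is None:                    # header-only file: treat as missing
        return date.today() - timedelta(days=default_lookback_days)
    return datetime.strptime(last, "%Y-%m-%d").date() + timedelta(days=1)
```

This is why deleting ohlcv_data/ is costly: with no CSV on disk, every stock falls back to the full lookback window.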
Cleanup Implementation
The cleanup logic is in run_full_pipeline.py (lines 169-192):
```python
import os       # imported at the top of run_full_pipeline.py
import shutil

def cleanup_intermediate():
    """Delete all intermediate files and directories, keeping only .json.gz + ohlcv_data/."""
    removed_files = 0
    removed_dirs = 0
    freed_bytes = 0

    # Remove intermediate files
    for f in INTERMEDIATE_FILES:
        fp = os.path.join(BASE_DIR, f)
        if os.path.exists(fp):
            freed_bytes += os.path.getsize(fp)
            os.remove(fp)
            removed_files += 1

    # Remove intermediate directories
    for d in INTERMEDIATE_DIRS:
        dp = os.path.join(BASE_DIR, d)
        if os.path.exists(dp):
            for root, dirs, files in os.walk(dp):
                for file in files:
                    freed_bytes += os.path.getsize(os.path.join(root, file))
            shutil.rmtree(dp)
            removed_dirs += 1

    freed_mb = freed_bytes / (1024 * 1024)
    print(f"🗑️ Cleaned: {removed_files} files + {removed_dirs} dirs ({freed_mb:.1f} MB freed)")
```
Space Savings
Typical cleanup results:
| Category | Size | Count | Total |
| --- | --- | --- | --- |
| JSON files | ~38 MB | 15 | 38 MB |
| company_filings/ | ~5 KB/file | 2,775 | ~13 MB |
| market_news/ | ~3 KB/file | 2,775 | ~8 MB |
| **Total Freed** | | | **~59 MB** |
fundamental_data.json: ~35 MB (largest file)
dhan_data_response.json: ~2 MB
advanced_indicator_data.json: ~8 MB
all_stocks_fundamental_analysis.json: ~50 MB → deleted after .gz created
Other JSONs: ~1 MB total
Execution Timing
Cleanup happens automatically in the pipeline:
```
PHASE 1-4: Data fetching & processing (3-34 min)
PHASE 5:   Compression (2 sec)
🧹 CLEANUP: Removing intermediate files... (1 sec)
🗑️ Cleaned: 15 files + 2 dirs (59 MB freed)
```
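The exact call site is not shown in this document, but the important property is that cleanup runs only after every phase completes without error. A minimal sketch of such a guard (`phases` and the function names here are hypothetical):

```python
def run_pipeline(phases, cleanup, cleanup_enabled=True):
    """Run each phase in order; clean up only if all of them succeed."""
    for phase in phases:
        phase()              # any exception aborts the run before cleanup
    if cleanup_enabled:      # CLEANUP_INTERMEDIATE in the real script
        cleanup()
```

Because any raised exception skips the cleanup call, a failed run leaves its intermediate files on disk for inspection.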
Configuration Options
Production Mode (Default)
```python
CLEANUP_INTERMEDIATE = True
```
✅ Minimizes disk usage
✅ Keeps only final outputs
❌ Cannot inspect intermediate files for debugging
Development Mode
```python
CLEANUP_INTERMEDIATE = False
```
✅ Preserves all intermediate files for inspection
✅ Easier debugging of individual pipeline stages
❌ Uses ~59 MB extra disk space
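If you switch between the two modes often, an environment-variable override avoids editing the script each time. This is not present in run_full_pipeline.py; the `EDL_CLEANUP` variable name is hypothetical:

```python
import os

def cleanup_enabled(default: bool = True) -> bool:
    """Read a hypothetical EDL_CLEANUP env var; fall back to the default."""
    raw = os.environ.get("EDL_CLEANUP")
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Would replace the hard-coded flag: EDL_CLEANUP=0 python3 run_full_pipeline.py
CLEANUP_INTERMEDIATE = cleanup_enabled()
```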
Manual Cleanup
If you run the pipeline with CLEANUP_INTERMEDIATE = False, you can manually clean up later:
```shell
# Navigate to pipeline directory
cd "DO NOT DELETE EDL PIPELINE/"

# Remove intermediate JSON files
rm master_isin_map.json dhan_data_response.json fundamental_data.json \
   advanced_indicator_data.json all_company_announcements.json \
   upcoming_corporate_actions.json history_corporate_actions.json \
   nse_asm_list.json nse_gsm_list.json bulk_block_deals.json \
   upper_circuit_stocks.json lower_circuit_stocks.json \
   incremental_price_bands.json complete_price_bands.json \
   nse_equity_list.csv all_stocks_fundamental_analysis.json

# Remove intermediate directories
rm -rf company_filings/ market_news/

# Check space freed
du -sh .
```
Selective Preservation
To preserve specific intermediate files for debugging:
Edit run_full_pipeline.py (lines 76-93) and comment out files you want to keep:
```python
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    # "fundamental_data.json",  # Keep for debugging
    "advanced_indicator_data.json",
    # ... rest of files
]
```
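An alternative to commenting entries out is filtering the list against a keep-set just before cleanup runs. This is a sketch, not pipeline code; `KEEP_FOR_DEBUG` is a hypothetical name:

```python
# Trimmed stand-in for the pipeline's full list.
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    "fundamental_data.json",
    "advanced_indicator_data.json",
]

# Files listed here survive cleanup even though they are intermediates.
KEEP_FOR_DEBUG = {"fundamental_data.json"}

files_to_delete = [f for f in INTERMEDIATE_FILES if f not in KEEP_FOR_DEBUG]
```

This keeps the canonical list intact, so nothing has to be un-commented before the next production run.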
Recovery from Accidental Deletion
There is no built-in recovery mechanism for deleted intermediate files. If you delete them accidentally, you have two options:
Re-run the full pipeline:
```shell
python3 run_full_pipeline.py
```
This regenerates all files from scratch (~4-34 min depending on the OHLCV setting).
Restore from backup (only if you made one yourself):
```shell
cp backup/dhan_data_response.json .
```
Best Practices
Production: Enable cleanup
Set CLEANUP_INTERMEDIATE = True for daily automated runs to save disk space.
Development: Disable cleanup
Set CLEANUP_INTERMEDIATE = False when debugging or inspecting pipeline stages.
Archive final outputs
Backup .json.gz files before each run to maintain historical snapshots.
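A small snapshot helper along these lines can automate that backup; the `backups/<date>/` layout and function name are assumptions, not part of the pipeline:

```python
import os
import shutil
from datetime import date

def archive_outputs(src_dir: str, backup_root: str = "backups"):
    """Copy every .json.gz in src_dir into a dated backup folder."""
    dest = os.path.join(backup_root, date.today().isoformat())
    os.makedirs(dest, exist_ok=True)
    copied = []
    for name in os.listdir(src_dir):
        if name.endswith(".json.gz"):
            shutil.copy2(os.path.join(src_dir, name), dest)  # preserves mtime
            copied.append(name)
    return dest, copied
```

Running it before each pipeline run yields one ~2 MB snapshot per day, so a long history stays cheap.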
Monitor disk usage
Check ohlcv_data/ size periodically (~200 MB). This directory is never auto-deleted.
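A scripted equivalent of `du -sh ohlcv_data/`, useful if you want to alert when the directory grows past a threshold (a sketch; the threshold check is up to you):

```python
import os

def dir_size_mb(path: str) -> float:
    """Total size of all files under path, in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)
```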
Next Steps
Compression: Learn how final outputs are compressed to .json.gz
Working with Output: Parse and analyze the compressed output files