The EDL Pipeline can automatically clean up intermediate files after successful completion, keeping only the final compressed outputs and essential data.

Overview

Intermediate cleanup is controlled by the CLEANUP_INTERMEDIATE flag in run_full_pipeline.py (line 71):
# Auto-delete intermediate files after pipeline succeeds
# Keeps: all_stocks_fundamental_analysis.json.gz + ohlcv_data/
CLEANUP_INTERMEDIATE = True
Default: True in production environments to minimize storage usage
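The flag only matters once every phase has finished successfully. A minimal sketch of that gating logic (illustrative only; the actual call site lives in run_full_pipeline.py):

```python
# Sketch of how the flag might gate cleanup at the end of a run.
CLEANUP_INTERMEDIATE = True

def should_cleanup(pipeline_succeeded: bool,
                   cleanup_flag: bool = CLEANUP_INTERMEDIATE) -> bool:
    """Cleanup runs only when every phase succeeded AND the flag is set."""
    return pipeline_succeeded and cleanup_flag
```

A failed run therefore always leaves the intermediate files on disk for inspection, regardless of the flag.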

What Gets Deleted

The cleanup process removes 16 intermediate files and 2 directories that are needed only between pipeline stages.

Intermediate Files (16 files)

INTERMEDIATE_FILES = [
    "master_isin_map.json",              # Stock symbol → ISIN mapping
    "dhan_data_response.json",           # Raw Dhan market data
    "fundamental_data.json",              # Raw fundamental data (35 MB)
    "advanced_indicator_data.json",       # Technical indicators
    "all_company_announcements.json",     # Corporate announcements
    "upcoming_corporate_actions.json",    # Upcoming corp actions
    "history_corporate_actions.json",     # Historical corp actions
    "nse_asm_list.json",                  # ASM surveillance list
    "nse_gsm_list.json",                  # GSM surveillance list
    "bulk_block_deals.json",              # Bulk/block deals
    "upper_circuit_stocks.json",          # Upper circuit stocks
    "lower_circuit_stocks.json",          # Lower circuit stocks
    "incremental_price_bands.json",       # Daily price band changes
    "complete_price_bands.json",          # All price bands
    "nse_equity_list.csv",                # NSE listing dates
    "all_stocks_fundamental_analysis.json", # Raw JSON (before .gz)
]

Intermediate Directories (2 directories)

INTERMEDIATE_DIRS = [
    "company_filings",    # ~2,775 per-stock filing JSON files
    "market_news",        # ~2,775 per-stock news JSON files
]
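Before enabling cleanup, it can be useful to see how much a run would actually remove. A dry-run sketch that mirrors the deletion logic without deleting anything (estimate_cleanup is not part of the pipeline):

```python
import os

def estimate_cleanup(base_dir, files, dirs):
    """Return (n_files, n_dirs, total_bytes) that cleanup would remove,
    without touching anything on disk (a dry run)."""
    n_files = n_dirs = total = 0
    for f in files:
        fp = os.path.join(base_dir, f)
        if os.path.exists(fp):
            total += os.path.getsize(fp)
            n_files += 1
    for d in dirs:
        dp = os.path.join(base_dir, d)
        if os.path.isdir(dp):
            n_dirs += 1
            # Sum every file nested under the directory
            for root, _, names in os.walk(dp):
                for name in names:
                    total += os.path.getsize(os.path.join(root, name))
    return n_files, n_dirs, total
```

Calling it with INTERMEDIATE_FILES and INTERMEDIATE_DIRS before a run gives the same counts and byte total that the real cleanup would report.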

What Gets Preserved

The cleanup process preserves these essential outputs:
  • all_stocks_fundamental_analysis.json.gz (~2 MB) - Final compressed output
  • sector_analytics.json.gz - Sector performance data
  • market_breadth.json.gz - Market breadth metrics
  • ohlcv_data/ directory - Historical OHLCV CSV files (~200 MB)
  • indices_ohlcv_data/ directory - Indices OHLCV data

The ohlcv_data/ directory is preserved because re-fetching it takes 25-35 minutes, and the smart incremental updater needs the existing data to calculate date ranges.
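That date-range calculation is why the CSVs must survive cleanup: the updater only needs to fetch from the day after the last stored row. A hypothetical sketch of that lookup (the column name "date" and ISO date format are assumptions, not the pipeline's actual schema):

```python
import csv
from datetime import date, timedelta

def next_fetch_start(csv_path):
    """Return the day after the last 'date' value in an OHLCV CSV,
    i.e. the start of the incremental fetch window.
    Hypothetical helper; assumes a 'date' column in ISO format."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        return None  # no history yet: a full fetch is needed
    last = date.fromisoformat(rows[-1]["date"])
    return last + timedelta(days=1)
```

If the directory were deleted, every symbol would fall back to a full historical fetch, which is the 25-35 minute cost noted above.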

Cleanup Implementation

The cleanup logic is in run_full_pipeline.py (lines 169-192); it relies on os, shutil, and BASE_DIR, which are imported and defined at the top of the file:
def cleanup_intermediate():
    """Delete all intermediate files and directories, keeping only .json.gz + ohlcv_data/."""
    removed_files = 0
    removed_dirs = 0
    freed_bytes = 0
    
    # Remove intermediate files
    for f in INTERMEDIATE_FILES:
        fp = os.path.join(BASE_DIR, f)
        if os.path.exists(fp):
            freed_bytes += os.path.getsize(fp)
            os.remove(fp)
            removed_files += 1
    
    # Remove intermediate directories
    for d in INTERMEDIATE_DIRS:
        dp = os.path.join(BASE_DIR, d)
        if os.path.exists(dp):
            for root, dirs, files in os.walk(dp):
                for file in files:
                    freed_bytes += os.path.getsize(os.path.join(root, file))
            shutil.rmtree(dp)
            removed_dirs += 1
    
    freed_mb = freed_bytes / (1024 * 1024)
    print(f"🗑️  Cleaned: {removed_files} files + {removed_dirs} dirs ({freed_mb:.1f} MB freed)")

Space Savings

Typical cleanup results:
Category            Size         Count    Total
JSON files          ~38 MB       16       ~38 MB
company_filings/    ~5 KB/file   2,775    ~13 MB
market_news/        ~3 KB/file   2,775    ~8 MB
Total Freed                               ~59 MB
  • fundamental_data.json: ~35 MB (largest file)
  • dhan_data_response.json: ~2 MB
  • advanced_indicator_data.json: ~8 MB
  • all_stocks_fundamental_analysis.json: ~50 MB → deleted after .gz created
  • Other JSONs: ~1 MB total

Execution Timing

Cleanup happens automatically in the pipeline:
PHASE 1-4: Data fetching & processing (3-34 min)
PHASE 5: Compression (2 sec)
🧹 CLEANUP: Removing intermediate files... (1 sec)
  🗑️  Cleaned: 16 files + 2 dirs (59 MB freed)

Configuration Options

Production Mode (Default)

CLEANUP_INTERMEDIATE = True
  • ✅ Minimizes disk usage
  • ✅ Keeps only final outputs
  • ❌ Cannot inspect intermediate files for debugging

Development Mode

CLEANUP_INTERMEDIATE = False
  • ✅ Preserves all intermediate files for inspection
  • ✅ Easier debugging of individual pipeline stages
  • ❌ Uses ~59 MB extra disk space

Manual Cleanup

If you run the pipeline with CLEANUP_INTERMEDIATE = False, you can manually clean up later:
# Navigate to pipeline directory
cd "DO NOT DELETE EDL PIPELINE/"

# Remove intermediate JSON files
rm master_isin_map.json dhan_data_response.json fundamental_data.json \
   advanced_indicator_data.json all_company_announcements.json \
   upcoming_corporate_actions.json history_corporate_actions.json \
   nse_asm_list.json nse_gsm_list.json bulk_block_deals.json \
   upper_circuit_stocks.json lower_circuit_stocks.json \
   incremental_price_bands.json complete_price_bands.json \
   nse_equity_list.csv all_stocks_fundamental_analysis.json

# Remove intermediate directories
rm -rf company_filings/ market_news/

# Check space freed
du -sh .

Selective Preservation

To preserve specific intermediate files for debugging: Edit run_full_pipeline.py (lines 76-93) and comment out files you want to keep:
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    # "fundamental_data.json",  # Keep for debugging
    "advanced_indicator_data.json",
    # ... rest of files
]
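An alternative to commenting lines out is filtering the list against a keep-set just before cleanup runs. This is a sketch, not a pipeline feature: KEEP_FOR_DEBUG does not exist in run_full_pipeline.py, and the list below is abbreviated for the example.

```python
# Hypothetical: filter the deletion list against a debug keep-set.
KEEP_FOR_DEBUG = {"fundamental_data.json", "dhan_data_response.json"}

INTERMEDIATE_FILES = [
    "master_isin_map.json",
    "dhan_data_response.json",
    "fundamental_data.json",
    "advanced_indicator_data.json",
]  # abbreviated for the example

files_to_delete = [f for f in INTERMEDIATE_FILES if f not in KEEP_FOR_DEBUG]
```

This keeps the canonical list intact, so a later diff of run_full_pipeline.py shows only the keep-set changing.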

Recovery from Accidental Deletion

If you accidentally delete intermediate files:
  1. Re-run the full pipeline:
    python3 run_full_pipeline.py
    
    This will regenerate all files from scratch.
  2. Restore from an external backup (if you keep one):
    cp backup/dhan_data_response.json .
    
Beyond your own backups, there is no built-in recovery mechanism for deleted intermediate files. The pipeline must be re-run to regenerate them (~4-34 min depending on the OHLCV setting).

Best Practices

  1. Production: enable cleanup. Set CLEANUP_INTERMEDIATE = True for daily automated runs to save disk space.
  2. Development: disable cleanup. Set CLEANUP_INTERMEDIATE = False when debugging or inspecting pipeline stages.
  3. Archive final outputs. Back up the .json.gz files before each run to maintain historical snapshots.
  4. Monitor disk usage. Check the size of ohlcv_data/ periodically (~200 MB). This directory is never auto-deleted.
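Since ohlcv_data/ is never auto-deleted, a periodic size check is worth scripting. A small sketch (dir_size_mb is a hypothetical helper, not part of the pipeline):

```python
import os

def dir_size_mb(path):
    """Total size of all files under `path`, in megabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / (1024 * 1024)
```

For example, `dir_size_mb("ohlcv_data")` should hover around 200 MB on a healthy installation; a sudden jump may indicate duplicated history files.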

Next Steps

  • Compression - Learn how final outputs are compressed to .json.gz
  • Working with Output - Parse and analyze the compressed output files
