The EDL Pipeline can automatically clean up intermediate files after successful completion, keeping only the final compressed outputs and essential data.

Overview

Intermediate cleanup is controlled by the CLEANUP_INTERMEDIATE flag in run_full_pipeline.py (line 71):
# Auto-delete intermediate files after pipeline succeeds
# Keeps: all_stocks_fundamental_analysis.json.gz + ohlcv_data/
CLEANUP_INTERMEDIATE = True
Default: True in production environments to minimize storage usage
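The flag only matters once every phase has finished successfully. A minimal sketch of that gating logic (illustrative only; the actual call site lives in run_full_pipeline.py):

```python
# Sketch of how the flag might gate cleanup at the end of a run.
CLEANUP_INTERMEDIATE = True

def should_cleanup(pipeline_succeeded: bool,
                   cleanup_flag: bool = CLEANUP_INTERMEDIATE) -> bool:
    """Cleanup runs only when every phase succeeded AND the flag is set."""
    return pipeline_succeeded and cleanup_flag
```

A failed run therefore always leaves the intermediate files on disk for inspection, regardless of the flag.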

What Gets Deleted

The cleanup process removes 16 intermediate files and 2 directories that are needed only between pipeline stages.

Intermediate Files (16 files)

INTERMEDIATE_FILES = [
    "master_isin_map.json",              # Stock symbol → ISIN mapping
    "dhan_data_response.json",           # Raw Dhan market data
    "fundamental_data.json",              # Raw fundamental data (35 MB)
    "advanced_indicator_data.json",       # Technical indicators
    "all_company_announcements.json",     # Corporate announcements
    "upcoming_corporate_actions.json",    # Upcoming corp actions
    "history_corporate_actions.json",     # Historical corp actions
    "nse_asm_list.json",                  # ASM surveillance list
    "nse_gsm_list.json",                  # GSM surveillance list
    "bulk_block_deals.json",              # Bulk/block deals
    "upper_circuit_stocks.json",          # Upper circuit stocks
    "lower_circuit_stocks.json",          # Lower circuit stocks
    "incremental_price_bands.json",       # Daily price band changes
    "complete_price_bands.json",          # All price bands
    "nse_equity_list.csv",                # NSE listing dates
    "all_stocks_fundamental_analysis.json", # Raw JSON (before .gz)
]

Intermediate Directories (2 directories)

INTERMEDIATE_DIRS = [
    "company_filings",    # ~2,775 per-stock filing JSON files
    "market_news",        # ~2,775 per-stock news JSON files
]
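Before enabling cleanup, it can be useful to see how much a run would actually remove. A dry-run sketch that mirrors the deletion logic without deleting anything (estimate_cleanup is not part of the pipeline):

```python
import os

def estimate_cleanup(base_dir, files, dirs):
    """Return (n_files, n_dirs, total_bytes) that cleanup would remove,
    without touching anything on disk (a dry run)."""
    n_files = n_dirs = total = 0
    for f in files:
        fp = os.path.join(base_dir, f)
        if os.path.exists(fp):
            total += os.path.getsize(fp)
            n_files += 1
    for d in dirs:
        dp = os.path.join(base_dir, d)
        if os.path.isdir(dp):
            n_dirs += 1
            # Sum every file nested under the directory
            for root, _, names in os.walk(dp):
                for name in names:
                    total += os.path.getsize(os.path.join(root, name))
    return n_files, n_dirs, total
```

Calling it with INTERMEDIATE_FILES and INTERMEDIATE_DIRS before a run gives the same counts and byte total that the real cleanup would report.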

What Gets Preserved

The cleanup process preserves these essential outputs:
  • all_stocks_fundamental_analysis.json.gz (~2 MB) - Final compressed output
  • sector_analytics.json.gz - Sector performance data
  • market_breadth.json.gz - Market breadth metrics
  • ohlcv_data/ directory - Historical OHLCV CSV files (~200 MB)
  • indices_ohlcv_data/ directory - Indices OHLCV data

The ohlcv_data/ directory is preserved because re-fetching it takes 25-35 minutes, and the smart incremental updater needs the existing data to calculate date ranges.
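That date-range calculation is why the CSVs must survive cleanup: the updater only needs to fetch from the day after the last stored row. A hypothetical sketch of that lookup (the column name "date" and ISO date format are assumptions, not the pipeline's actual schema):

```python
import csv
from datetime import date, timedelta

def next_fetch_start(csv_path):
    """Return the day after the last 'date' value in an OHLCV CSV,
    i.e. the start of the incremental fetch window.
    Hypothetical helper; assumes a 'date' column in ISO format."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        return None  # no history yet: a full fetch is needed
    last = date.fromisoformat(rows[-1]["date"])
    return last + timedelta(days=1)
```

If the directory were deleted, every symbol would fall back to a full historical fetch, which is the 25-35 minute cost noted above.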

Cleanup Implementation

The cleanup logic is in run_full_pipeline.py (lines 169-192); it relies on os, shutil, and BASE_DIR, which are imported and defined at the top of the file:
def cleanup_intermediate():
    """Delete all intermediate files and directories, keeping only .json.gz + ohlcv_data/."""
    removed_files = 0
    removed_dirs = 0
    freed_bytes = 0
    
    # Remove intermediate files
    for f in INTERMEDIATE_FILES:
        fp = os.path.join(BASE_DIR, f)
        if os.path.exists(fp):
            freed_bytes += os.path.getsize(fp)
            os.remove(fp)
            removed_files += 1
    
    # Remove intermediate directories
    for d in INTERMEDIATE_DIRS:
        dp = os.path.join(BASE_DIR, d)
        if os.path.exists(dp):
            for root, dirs, files in os.walk(dp):
                for file in files:
                    freed_bytes += os.path.getsize(os.path.join(root, file))
            shutil.rmtree(dp)
            removed_dirs += 1
    
    freed_mb = freed_bytes / (1024 * 1024)
    print(f"🗑️  Cleaned: {removed_files} files + {removed_dirs} dirs ({freed_mb:.1f} MB freed)")

Space Savings

Typical cleanup results:
Category            Size         Count    Total
JSON files          ~38 MB       16       ~38 MB
company_filings/    ~5 KB/file   2,775    ~13 MB
market_news/        ~3 KB/file   2,775    ~8 MB
Total Freed                               ~59 MB
  • fundamental_data.json: ~35 MB (largest file)
  • dhan_data_response.json: ~2 MB
  • advanced_indicator_data.json: ~8 MB
  • all_stocks_fundamental_analysis.json: ~50 MB → deleted after .gz created
  • Other JSONs: ~1 MB total

Execution Timing

Cleanup happens automatically in the pipeline:
PHASE 1-4: Data fetching & processing (3-34 min)
PHASE 5: Compression (2 sec)
🧹 CLEANUP: Removing intermediate files... (1 sec)
  🗑️  Cleaned: 16 files + 2 dirs (59 MB freed)

Configuration Options

Production Mode (Default)

CLEANUP_INTERMEDIATE = True
  • ✅ Minimizes disk usage
  • ✅ Keeps only final outputs
  • ❌ Cannot inspect intermediate files for debugging

Development Mode

CLEANUP_INTERMEDIATE = False
  • ✅ Preserves all intermediate files for inspection
  • ✅ Easier debugging of individual pipeline stages
  • ❌ Uses ~59 MB extra disk space

Manual Cleanup

If you run the pipeline with CLEANUP_INTERMEDIATE = False, you can manually clean up later:
# Navigate to pipeline directory
cd "DO NOT DELETE EDL PIPELINE/"

# Remove intermediate JSON files
rm master_isin_map.json dhan_data_response.json fundamental_data.json \
   advanced_indicator_data.json all_company_announcements.json \
   upcoming_corporate_actions.json history_corporate_actions.json \
   nse_asm_list.json nse_gsm_list.json bulk_block_deals.json \
   upper_circuit_stocks.json lower_circuit_stocks.json \
   incremental_price_bands.json complete_price_bands.json \
   nse_equity_list.csv all_stocks_fundamental_analysis.json

# Remove intermediate directories
rm -rf company_filings/ market_news/

# Check space freed
du -sh .

Selective Preservation

To preserve specific intermediate files for debugging: Edit run_full_pipeline.py (lines 76-93) and comment out files you want to keep:
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    # "fundamental_data.json",  # Keep for debugging
    "advanced_indicator_data.json",
    # ... rest of files
]
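An alternative to commenting lines out is filtering the list against a keep-set just before cleanup runs. This is a sketch, not a pipeline feature: KEEP_FOR_DEBUG does not exist in run_full_pipeline.py, and the list below is abbreviated for the example.

```python
# Hypothetical: filter the deletion list against a debug keep-set.
KEEP_FOR_DEBUG = {"fundamental_data.json", "dhan_data_response.json"}

INTERMEDIATE_FILES = [
    "master_isin_map.json",
    "dhan_data_response.json",
    "fundamental_data.json",
    "advanced_indicator_data.json",
]  # abbreviated for the example

files_to_delete = [f for f in INTERMEDIATE_FILES if f not in KEEP_FOR_DEBUG]
```

This keeps the canonical list intact, so a later diff of run_full_pipeline.py shows only the keep-set changing.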

Recovery from Accidental Deletion

If you accidentally delete intermediate files:
  1. Re-run the full pipeline:
    python3 run_full_pipeline.py
    
    This will regenerate all files from scratch.
  2. Restore from an external backup (if you keep one):
    cp backup/dhan_data_response.json .
    
Beyond your own backups, there is no built-in recovery mechanism for deleted intermediate files. The pipeline must be re-run to regenerate them (~4-34 min depending on the OHLCV setting).

Best Practices

  1. Production: enable cleanup. Set CLEANUP_INTERMEDIATE = True for daily automated runs to save disk space.
  2. Development: disable cleanup. Set CLEANUP_INTERMEDIATE = False when debugging or inspecting pipeline stages.
  3. Archive final outputs. Back up the .json.gz files before each run to maintain historical snapshots.
  4. Monitor disk usage. Check the size of ohlcv_data/ periodically (~200 MB). This directory is never auto-deleted.
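Since ohlcv_data/ is never auto-deleted, a periodic size check is worth scripting. A small sketch (dir_size_mb is a hypothetical helper, not part of the pipeline):

```python
import os

def dir_size_mb(path):
    """Total size of all files under `path`, in megabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / (1024 * 1024)
```

For example, `dir_size_mb("ohlcv_data")` should hover around 200 MB on a healthy installation; a sudden jump may indicate duplicated history files.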

Next Steps

  • Compression - Learn how final outputs are compressed to .json.gz
  • Working with Output - Parse and analyze the compressed output files
