The run_full_pipeline.py script includes an automatic cleanup system (lines 69-71) that removes intermediate files after successful pipeline execution:
# Auto-delete intermediate files after pipeline succeeds
# Keeps: all_stocks_fundamental_analysis.json.gz + ohlcv_data/
CLEANUP_INTERMEDIATE = True

CLEANUP_INTERMEDIATE Flag

What It Controls

Default: True

Controls whether intermediate files and directories are automatically deleted after the pipeline completes successfully. Only the final compressed output and OHLCV data are preserved.
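The flag simply gates whether the cleanup phase is appended after a successful compression phase. A minimal sketch of that control flow (the wrapper and phase names here are hypothetical; the real call site lives in run_full_pipeline.py):

```python
CLEANUP_INTERMEDIATE = True  # default

def run_pipeline(cleanup: bool = CLEANUP_INTERMEDIATE) -> list:
    # Stand-ins for the real phases; only the gating logic is illustrated.
    phases = ["fetch", "merge", "enrich", "compress"]
    if cleanup:
        phases.append("cleanup")  # runs only after compression succeeds
    return phases

print(run_pipeline())               # cleanup phase included
print(run_pipeline(cleanup=False))  # cleanup phase skipped
```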

Behavior

After Phase 5 (Compression), the cleanup phase runs:
📦 PHASE 5: Compression (.json → .json.gz)
────────────────────────────────────────────
  📦 Compressed: 45.2 MB → 6.8 MB (85% reduction)

🧹 CLEANUP: Removing intermediate files...
────────────────────────────────────────────
  🗑️  Cleaned: 15 files + 2 dirs (38.4 MB freed)

═══════════════════════════════════════════════════════════
  PIPELINE COMPLETE
═══════════════════════════════════════════════════════════
  📄 Output: all_stocks_fundamental_analysis.json.gz (6.8 MB)
  📦 Compression: 45.2 MB → 6.8 MB (85% smaller)
  🧹 Only .json.gz + ohlcv_data/ remain. All intermediate data purged.
Files Deleted: 15 intermediate JSON/CSV files
Directories Deleted: 2 intermediate directories
Space Freed: ~38 MB

What Gets Deleted

The cleanup system targets 15 intermediate files and 2 directories (lines 76-98 of run_full_pipeline.py):

Intermediate Files (15 total)

File                        Size      Purpose
master_isin_map.json        ~500 KB   ISIN → Symbol mapping
dhan_data_response.json     ~2 MB     Raw market data from ScanX
fundamental_data.json       ~35 MB    Quarterly results & ratios

Why deleted: Data is merged into the final JSON
File                              Size      Purpose
advanced_indicator_data.json      ~8 MB     Pivot points, EMA/SMA
all_company_announcements.json    ~15 MB    Live corporate announcements
upcoming_corporate_actions.json   ~200 KB   Future dividends/splits
history_corporate_actions.json    ~1 MB     Past 2 years of corporate actions
nse_asm_list.json                 ~50 KB    ASM surveillance list
nse_gsm_list.json                 ~30 KB    GSM surveillance list
bulk_block_deals.json             ~500 KB   Last 30 days of bulk/block deals

Why deleted: Data is injected into the master JSON
File                           Size      Purpose
upper_circuit_stocks.json      ~100 KB   Upper circuit hits (today)
lower_circuit_stocks.json      ~100 KB   Lower circuit hits (today)
incremental_price_bands.json   ~50 KB    Daily band changes
complete_price_bands.json      ~300 KB   Price bands for all securities

Why deleted: Data is merged into the master JSON
File                  Size      Purpose
nse_equity_list.csv   ~200 KB   NSE listing dates CSV

Why deleted: Data is merged into the master JSON
File                                   Size     Purpose
all_stocks_fundamental_analysis.json   ~45 MB   Raw JSON before compression

Why deleted: The compressed version (.json.gz) is kept
Total Files Deleted: 15 (~38 MB freed)

Intermediate Directories (2 total)

company_filings/

Contents: Per-stock regulatory filings
Structure:
company_filings/
├── RELIANCE_filings.json
├── TCS_filings.json
├── INFY_filings.json
└── ... (~2,775 files)
Size: ~12-15 MB
Why deleted: The top 5 filings per stock are injected into the Recent Announcements field in the master JSON

market_news/

Size: ~10 MB
The second intermediate directory removed by cleanup (listed in INTERMEDIATE_DIRS).

Total Directories Deleted: 2 (~20-25 MB freed)

What Gets Preserved

Final Output

Files:
  • all_stocks_fundamental_analysis.json.gz (~6-8 MB)
  • sector_analytics.json.gz (if generated)
  • market_breadth.json.gz (if generated)
Why preserved: These are the final deliverables

OHLCV Data

Directories:
  • ohlcv_data/ (~300-400 MB, ~2,775 CSV files)
  • indices_ohlcv_data/ (~10-20 MB, ~194 CSV files)
Why preserved: Required for future incremental updates and recalculations

Optional Data

Files (if FETCH_OPTIONAL = True):
  • all_indices_list.json (~1.2 MB)
  • etf_data_response.json (~2.8 MB)
Why preserved: Standalone reference datasets

F&O Data

Files (if manually run):
  • fno_stocks_response.json
  • fno_lot_sizes_cleaned.json
  • fno_expiry_calendar.json
Why preserved: Standalone reference datasets

Cleanup Logic (Source Code)

From run_full_pipeline.py lines 169-192 (the function relies on the os and shutil modules imported earlier in the script):
def cleanup_intermediate():
    """Delete all intermediate files and directories, keeping only .json.gz + ohlcv_data/."""
    removed_files = 0
    removed_dirs = 0
    freed_bytes = 0
    
    # Delete intermediate files
    for f in INTERMEDIATE_FILES:
        fp = os.path.join(BASE_DIR, f)
        if os.path.exists(fp):
            freed_bytes += os.path.getsize(fp)
            os.remove(fp)
            removed_files += 1
    
    # Delete intermediate directories
    for d in INTERMEDIATE_DIRS:
        dp = os.path.join(BASE_DIR, d)
        if os.path.exists(dp):
            for root, dirs, files in os.walk(dp):
                for file in files:
                    freed_bytes += os.path.getsize(os.path.join(root, file))
            shutil.rmtree(dp)
            removed_dirs += 1
    
    freed_mb = freed_bytes / (1024 * 1024)
    print(f"  🗑️  Cleaned: {removed_files} files + {removed_dirs} dirs ({freed_mb:.1f} MB freed)")
Cleanup is safe:
  • Only deletes files listed in INTERMEDIATE_FILES array
  • Only deletes directories listed in INTERMEDIATE_DIRS array
  • Uses os.path.exists() checks to avoid errors
  • Reports freed space for transparency
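If you want to preview what cleanup would remove without deleting anything, a dry-run variant of the same logic is easy to sketch (cleanup_dry_run is a hypothetical helper, not part of run_full_pipeline.py):

```python
import os

def cleanup_dry_run(base_dir, intermediate_files):
    """Report which intermediate files exist and how much space deleting
    them would free, without removing anything."""
    would_remove, freed_bytes = [], 0
    for f in intermediate_files:
        fp = os.path.join(base_dir, f)
        if os.path.exists(fp):
            would_remove.append(f)
            freed_bytes += os.path.getsize(fp)
    return would_remove, freed_bytes / (1024 * 1024)  # names, MB freed
```

Call it with the same BASE_DIR and INTERMEDIATE_FILES values the script uses to see exactly what a real run would delete.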

When to Use Each Setting

CLEANUP_INTERMEDIATE = True
Use When:
  • Running scheduled daily updates
  • Storage space is limited
  • Only need final compressed output
  • Pipeline is running reliably
Benefits:
  • Saves ~38 MB per run
  • Clean workspace
  • Faster file operations (fewer files)
Drawbacks:
  • Cannot inspect intermediate data after run
  • Must re-run entire pipeline to regenerate deleted files

Storage Comparison

With Cleanup (Default)

Directory Structure After Cleanup:

├── all_stocks_fundamental_analysis.json.gz   (6.8 MB)
├── ohlcv_data/                               (320 MB)
│   ├── RELIANCE.csv
│   ├── TCS.csv
│   └── ... (2,775 files)
├── indices_ohlcv_data/                       (15 MB)
│   ├── Nifty_50.csv
│   └── ... (194 files)
└── run_full_pipeline.py

Total: ~342 MB

Without Cleanup

Directory Structure Without Cleanup:

├── all_stocks_fundamental_analysis.json      (45 MB)
├── all_stocks_fundamental_analysis.json.gz   (6.8 MB)
├── master_isin_map.json                      (0.5 MB)
├── dhan_data_response.json                   (2 MB)
├── fundamental_data.json                     (35 MB)
├── advanced_indicator_data.json              (8 MB)
├── all_company_announcements.json            (15 MB)
├── upcoming_corporate_actions.json           (0.2 MB)
├── history_corporate_actions.json            (1 MB)
├── nse_asm_list.json                         (0.05 MB)
├── nse_gsm_list.json                         (0.03 MB)
├── bulk_block_deals.json                     (0.5 MB)
├── upper_circuit_stocks.json                 (0.1 MB)
├── lower_circuit_stocks.json                 (0.1 MB)
├── incremental_price_bands.json              (0.05 MB)
├── complete_price_bands.json                 (0.3 MB)
├── nse_equity_list.csv                       (0.2 MB)
├── company_filings/                          (15 MB)
├── market_news/                              (10 MB)
├── ohlcv_data/                               (320 MB)
└── indices_ohlcv_data/                       (15 MB)

Total: ~380 MB
Space Saved with Cleanup: ~38 MB (10% reduction)
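The space-saved figure is just the difference between the two totals:

```python
without_cleanup_mb = 380   # total from the "Without Cleanup" listing
with_cleanup_mb = 342      # total from the "With Cleanup" listing

saved_mb = without_cleanup_mb - with_cleanup_mb
saved_pct = round(saved_mb / without_cleanup_mb * 100)
print(f"{saved_mb} MB saved ({saved_pct}% reduction)")
```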

How to Change Setting

Step 1: Edit run_full_pipeline.py

Open the file and navigate to line 71:
# Auto-delete intermediate files after pipeline succeeds
# Keeps: all_stocks_fundamental_analysis.json.gz + ohlcv_data/
CLEANUP_INTERMEDIATE = True  # Change to False to preserve all files

Step 2: Save and Run

python3 run_full_pipeline.py

Step 3: Verify Behavior

Check console output after Phase 5:
# If CLEANUP_INTERMEDIATE = True:
🧹 CLEANUP: Removing intermediate files...
────────────────────────────────────────
  🗑️  Cleaned: 15 files + 2 dirs (38.4 MB freed)

# If CLEANUP_INTERMEDIATE = False:
# (No cleanup phase runs)
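The same check can be done programmatically: if cleanup ran, none of the intermediate files should still be on disk. A hypothetical post-run snippet (the file list here is abbreviated; the full one lives in run_full_pipeline.py, lines 76-98):

```python
import os

def leftover_intermediates(base_dir, intermediate_files):
    """Return any intermediate files still present after a run."""
    return [f for f in intermediate_files
            if os.path.exists(os.path.join(base_dir, f))]

checked = ["master_isin_map.json", "dhan_data_response.json",
           "fundamental_data.json"]
leftovers = leftover_intermediates(".", checked)
print("cleanup ran" if not leftovers else f"still present: {leftovers}")
```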

Manual Cleanup

If you set CLEANUP_INTERMEDIATE = False and later want to manually clean up:
# Delete all intermediate JSON files
rm master_isin_map.json \
   dhan_data_response.json \
   fundamental_data.json \
   advanced_indicator_data.json \
   all_company_announcements.json \
   upcoming_corporate_actions.json \
   history_corporate_actions.json \
   nse_asm_list.json \
   nse_gsm_list.json \
   bulk_block_deals.json \
   upper_circuit_stocks.json \
   lower_circuit_stocks.json \
   incremental_price_bands.json \
   complete_price_bands.json \
   nse_equity_list.csv \
   all_stocks_fundamental_analysis.json

# Delete intermediate directories
rm -rf company_filings/ market_news/
Do NOT delete:
  • ohlcv_data/ (needed for incremental updates)
  • indices_ohlcv_data/ (needed for incremental updates)
  • all_stocks_fundamental_analysis.json.gz (final output)
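The same manual cleanup can be scripted in Python with an explicit guard against the preserved paths, which makes the "Do NOT delete" rule mechanical rather than a matter of care (a sketch; manual_cleanup is a hypothetical helper and the file lists should match those in run_full_pipeline.py):

```python
import os
import shutil

PRESERVE = {"ohlcv_data", "indices_ohlcv_data",
            "all_stocks_fundamental_analysis.json.gz"}

def manual_cleanup(base_dir, files, dirs):
    """Delete the given intermediate files and directories, refusing to
    touch anything listed in PRESERVE."""
    for f in files:
        fp = os.path.join(base_dir, f)
        if f not in PRESERVE and os.path.isfile(fp):
            os.remove(fp)
    for d in dirs:
        dp = os.path.join(base_dir, d)
        if d not in PRESERVE and os.path.isdir(dp):
            shutil.rmtree(dp)
```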

Common Questions

How do I recover deleted intermediate files?

Re-run the pipeline with CLEANUP_INTERMEDIATE = False. If you only need specific files (e.g., fundamental_data.json), you can run the individual scripts:
python3 fetch_dhan_data.py
python3 fetch_fundamental_data.py
Does cleanup delete OHLCV data?

No. Cleanup only removes intermediate files from the current run. OHLCV data (ohlcv_data/) is preserved because it is required for:
  • Future incremental OHLCV updates
  • Recalculating ADR, RVOL, ATH if needed
What happens if the pipeline fails partway through?

Cleanup only runs if the pipeline completes successfully. If Phase 3 fails, intermediate files are preserved so you can:
  1. Inspect the data that caused the failure
  2. Re-run from the failed phase without re-fetching all data
Can I customize which files get deleted?

Yes. Edit the INTERMEDIATE_FILES and INTERMEDIATE_DIRS arrays in run_full_pipeline.py (lines 76-98):
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    "dhan_data_response.json",
    # Comment out files you want to preserve
    # "fundamental_data.json",
]

INTERMEDIATE_DIRS = [
    "company_filings",
    # "market_news",  # Preserve this directory
]
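After editing the arrays, a small sanity check guards against accidentally listing a preserved path for deletion (a hypothetical snippet you could add near the arrays; not part of the script):

```python
PRESERVED = {"ohlcv_data", "indices_ohlcv_data",
             "all_stocks_fundamental_analysis.json.gz"}

# Abbreviated versions of the arrays from run_full_pipeline.py.
INTERMEDIATE_FILES = ["master_isin_map.json", "dhan_data_response.json"]
INTERMEDIATE_DIRS = ["company_filings"]

overlap = PRESERVED & set(INTERMEDIATE_FILES + INTERMEDIATE_DIRS)
assert not overlap, f"Refusing to delete preserved paths: {overlap}"
```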

Best Practices

For Daily Production

CLEANUP_INTERMEDIATE = True
  • Saves storage
  • Keeps workspace clean
  • Pipeline runs reliably

For Development

CLEANUP_INTERMEDIATE = False
  • Easier debugging
  • Can inspect data quality
  • Can test Phase 4 scripts independently

For CI/CD

CLEANUP_INTERMEDIATE = True
  • Reduces artifact size
  • Faster deployments
  • Only upload compressed .json.gz

For One-Time Runs

CLEANUP_INTERMEDIATE = False
  • Inspect all intermediate data
  • Validate transformations
  • Compare raw vs enriched data

Next Steps

Pipeline Flags

Configure FETCH_OHLCV and FETCH_OPTIONAL

Pipeline Architecture

Understand the full pipeline workflow

Output Schema

Explore the final JSON structure

Error Handling

Debug common pipeline issues
