The run_full_pipeline.py script includes an automatic cleanup system (lines 69-71) that removes intermediate files after successful pipeline execution:
CLEANUP_INTERMEDIATE Flag
What It Controls
Default: True

Controls whether intermediate files and directories are automatically deleted after the pipeline completes successfully. Only the final compressed output and OHLCV data are preserved.

Behavior
- CLEANUP_INTERMEDIATE = True (default): after Phase 5 (Compression), the cleanup phase runs.
  - Files deleted: 15 intermediate JSON/CSV files
  - Directories deleted: 2 intermediate directories
  - Space freed: ~38 MB
- CLEANUP_INTERMEDIATE = False: all intermediate files and directories are left on disk after the run.
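The flag itself is a single module-level boolean near the top of run_full_pipeline.py (around line 71 per the text above; the exact surrounding code is an assumption):

```python
# run_full_pipeline.py (around line 71)
# Set to False to keep all intermediate files for inspection.
CLEANUP_INTERMEDIATE = True
```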
What Gets Deleted
The cleanup system targets 15 intermediate files and 2 directories (lines 76-98 of run_full_pipeline.py):
Intermediate Files (15 total)
Core Data Files (3)
| File | Size | Purpose |
|---|---|---|
| master_isin_map.json | ~500 KB | ISIN → Symbol mapping |
| dhan_data_response.json | ~2 MB | Raw market data from ScanX |
| fundamental_data.json | ~35 MB | Quarterly results & ratios |
Enrichment Data Files (7)
| File | Size | Purpose |
|---|---|---|
| advanced_indicator_data.json | ~8 MB | Pivot points, EMA/SMA |
| all_company_announcements.json | ~15 MB | Live corporate announcements |
| upcoming_corporate_actions.json | ~200 KB | Future dividends/splits |
| history_corporate_actions.json | ~1 MB | Past 2 years of corporate actions |
| nse_asm_list.json | ~50 KB | ASM surveillance list |
| nse_gsm_list.json | ~30 KB | GSM surveillance list |
| bulk_block_deals.json | ~500 KB | Last 30 days of bulk/block deals |
Price Data Files (4)
| File | Size | Purpose |
|---|---|---|
| upper_circuit_stocks.json | ~100 KB | Upper circuit hits (today) |
| lower_circuit_stocks.json | ~100 KB | Lower circuit hits (today) |
| incremental_price_bands.json | ~50 KB | Daily band changes |
| complete_price_bands.json | ~300 KB | Price bands for all securities |
Auxiliary Files (1)
| File | Size | Purpose |
|---|---|---|
| nse_equity_list.csv | ~200 KB | NSE listing dates CSV |
Uncompressed Output (1)
| File | Size | Purpose |
|---|---|---|
| all_stocks_fundamental_analysis.json | ~45 MB | Raw JSON before compression |
Intermediate Directories (2 total)
- company_filings/: per-stock regulatory filings, ~12-15 MB. Deleted because the top 5 filings per stock are already injected into the Recent Announcements field of the master JSON.
- market_news/

What Gets Preserved
Final Output
Files:
- all_stocks_fundamental_analysis.json.gz (~6-8 MB)
- sector_analytics.json.gz (if generated)
- market_breadth.json.gz (if generated)
OHLCV Data
Directories:
- ohlcv_data/ (~300-400 MB, ~2,775 CSV files)
- indices_ohlcv_data/ (~10-20 MB, ~194 CSV files)
Optional Data
Files (if FETCH_OPTIONAL = True):
- all_indices_list.json (~1.2 MB)
- etf_data_response.json (~2.8 MB)
F&O Data
Files (if manually run):
- fno_stocks_response.json
- fno_lot_sizes_cleaned.json
- fno_expiry_calendar.json
Cleanup Logic (Source Code)
From run_full_pipeline.py, lines 169-192:

- Only deletes files listed in the INTERMEDIATE_FILES array
- Only deletes directories listed in the INTERMEDIATE_DIRS array
- Uses os.path.exists() checks to avoid errors
- Reports freed space for transparency
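The behavior described by these bullets can be sketched roughly as follows. This is a simplified reconstruction, not the actual source; the array contents shown here are a small illustrative subset, and the reporting format is an assumption:

```python
import os
import shutil

# Illustrative subset; the real arrays live in run_full_pipeline.py (lines 76-98).
INTERMEDIATE_FILES = ["master_isin_map.json", "dhan_data_response.json"]
INTERMEDIATE_DIRS = ["company_filings", "market_news"]


def cleanup_intermediate_files():
    """Delete listed files/directories if they exist and report freed space."""
    freed = 0
    for path in INTERMEDIATE_FILES:
        if os.path.exists(path):  # guard: skip files that were never created
            freed += os.path.getsize(path)
            os.remove(path)
    for dirname in INTERMEDIATE_DIRS:
        if os.path.exists(dirname):
            # Sum the sizes of all files inside before removing the tree.
            for root, _, files in os.walk(dirname):
                freed += sum(os.path.getsize(os.path.join(root, f)) for f in files)
            shutil.rmtree(dirname)
    print(f"Cleanup complete: freed {freed / 1024 / 1024:.1f} MB")
    return freed
```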
When to Use Each Setting
Production (CLEANUP_INTERMEDIATE = True, the default). Choose this when:
- Running scheduled daily updates
- Storage space is limited
- Only the final compressed output is needed
- The pipeline is running reliably

Benefits: saves ~38 MB per run, keeps the workspace clean, and speeds up file operations (fewer files). Trade-offs: intermediate data cannot be inspected after the run, and regenerating deleted files requires re-running the entire pipeline.

Development/Debug (CLEANUP_INTERMEDIATE = False). Choose this when you need to inspect intermediate data or re-run individual phases without re-fetching everything.
Storage Comparison
With Cleanup (Default): only the compressed outputs and the OHLCV directories remain; ~38 MB of intermediates are freed each run.
Without Cleanup: the same outputs plus ~38 MB of intermediate files and directories.
How to Change Setting
Step 1: Edit run_full_pipeline.py

Open the file and navigate to line 71, where the flag is defined.

Step 2: Save and Run
Step 3: Verify Behavior
Check the console output after Phase 5.

Manual Cleanup
If you set CLEANUP_INTERMEDIATE = False and later want to remove the intermediate files, you can delete them manually.
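A one-off manual cleanup could look like the sketch below. The file and directory names passed in are a hypothetical subset of the intermediate artifacts listed earlier; adjust them to whatever is left in your working directory:

```python
import os
import shutil


def manual_cleanup(files, dirs):
    """Delete any of the given files/directories that still exist."""
    for path in files:
        if os.path.exists(path):
            os.remove(path)
            print(f"Removed {path}")
    for dirname in dirs:
        if os.path.isdir(dirname):
            shutil.rmtree(dirname)
            print(f"Removed {dirname}/")


# Hypothetical leftovers from a run with CLEANUP_INTERMEDIATE = False:
manual_cleanup(
    ["fundamental_data.json", "advanced_indicator_data.json"],
    ["company_filings", "market_news"],
)
```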
Common Questions
What if I need intermediate files later?
Solution: Re-run the pipeline with CLEANUP_INTERMEDIATE = False.

If you only need specific files (e.g., fundamental_data.json), you can run the individual scripts that produce them.

Does cleanup affect incremental updates?
No. Cleanup only removes intermediate files from the current run. OHLCV data (ohlcv_data/) is preserved because it's required for:

- Future incremental OHLCV updates
- Recalculating ADR, RVOL, and ATH if needed
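As an illustration of why the stored CSVs matter for incremental updates: a fetcher only needs the last recorded date to request the missing range. This is a hypothetical sketch; the actual update script and CSV column layout are assumptions:

```python
import csv
from datetime import date, timedelta


def next_fetch_start(csv_path):
    """Return the day after the last date recorded in an OHLCV CSV.

    Assumes rows like: date,open,high,low,close,volume with ISO dates.
    """
    last = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            last = row["date"]  # rows are assumed chronological
    if last is None:
        return None  # empty file: a full-history fetch is needed
    y, m, d = map(int, last.split("-"))
    return date(y, m, d) + timedelta(days=1)
```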
What if the pipeline fails mid-run?
Cleanup only runs if the pipeline completes successfully. If Phase 3 fails, intermediate files are preserved so you can:
- Inspect the data that caused the failure
- Re-run from the failed phase without re-fetching all data
Can I customize what gets deleted?
Yes. Edit the INTERMEDIATE_FILES and INTERMEDIATE_DIRS arrays in run_full_pipeline.py (lines 76-98).

Best Practices
For Daily Production
- Saves storage
- Keeps workspace clean
- Pipeline runs reliably
For Development
- Easier debugging
- Can inspect data quality
- Can test Phase 4 scripts independently
For CI/CD
- Reduces artifact size
- Faster deployments
- Only upload compressed .json.gz
For One-Time Runs
- Inspect all intermediate data
- Validate transformations
- Compare raw vs enriched data
Next Steps
- Pipeline Flags: configure FETCH_OHLCV and FETCH_OPTIONAL
- Pipeline Architecture: understand the full pipeline workflow
- Output Schema: explore the final JSON structure
- Error Handling: debug common pipeline issues