
Overview

The full pipeline executes 22 scripts in strict dependency order to produce all_stocks_fundamental_analysis.json.gz, a comprehensive dataset of 2,775+ Indian stocks with 86 fields per stock covering fundamentals, technicals, events, and sentiment.

Expected runtime: ~4 minutes (without OHLCV) | ~34 minutes (with first-time OHLCV fetch)

Quick Start

1. Navigate to the pipeline directory

cd ~/workspace/source/DO\ NOT\ DELETE\ EDL\ PIPELINE/
2. Run the pipeline

python3 run_full_pipeline.py
The script will automatically:
  • Fetch data from all sources (Dhan ScanX, NSE)
  • Build the master JSON structure
  • Enrich with technical indicators, events, and news
  • Compress output to .json.gz format
  • Clean up intermediate files
3. Verify the output

Check for the final compressed file:
ls -lh all_stocks_fundamental_analysis.json.gz
Expected size: ~2 MB (compressed from ~50-60 MB raw JSON)

Configuration Options

Edit run_full_pipeline.py to customize behavior:

OHLCV Data Fetching

FETCH_OHLCV = True  # Default: True
When True:
  • First run: Downloads complete OHLCV history (~30 min for all stocks)
  • Subsequent runs: Incremental update only (~2-5 min)
  • Enables: ADR, RVOL, ATH, % from ATH, returns calculations
When False:
  • Skips OHLCV entirely
  • ADR, RVOL, ATH fields will be 0
  • Runtime: ~4 minutes

Optional Standalone Data

FETCH_OPTIONAL = False  # Default: False
When True: additionally fetches these standalone files (not merged into the master JSON):
  • all_indices_list.json - 194 market indices
  • etf_data_response.json - 361 ETFs

Auto-Cleanup

CLEANUP_INTERMEDIATE = True  # Default: True
When True: Removes intermediate files after successful completion, keeping only:
  • all_stocks_fundamental_analysis.json.gz
  • sector_analytics.json.gz
  • market_breadth.json.gz
  • ohlcv_data/ directory (if FETCH_OHLCV=True)
When False: Retains all intermediate JSON files for debugging
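The cleanup step can be pictured as the sketch below. This is an illustrative stand-in, not the actual code in run_full_pipeline.py: it deletes intermediate .json files while leaving the compressed .json.gz outputs and the ohlcv_data/ directory untouched.

```python
import os

def cleanup_intermediates(directory="."):
    """Delete intermediate .json files; .json.gz outputs and ohlcv_data/ survive."""
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # .json.gz names do not end with ".json", so final outputs are kept;
        # directories (such as ohlcv_data/) fail the isfile check and are kept too.
        if os.path.isfile(path) and name.endswith(".json"):
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```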

Pipeline Phases

The pipeline executes in strict order:

Phase 1: Core Data (Foundation)

1. fetch_dhan_data.py          → dhan_data_response.json + master_isin_map.json
2. fetch_fundamental_data.py   → fundamental_data.json
3. NSE CSV download            → nse_equity_list.csv (listing dates)
Critical: fetch_dhan_data.py must succeed - it creates master_isin_map.json which all other scripts need.

Phase 2: Data Enrichment (Fetching)

4.  fetch_company_filings.py         → company_filings/*.json
5.  fetch_new_announcements.py       → all_company_announcements.json
6.  fetch_advanced_indicators.py     → advanced_indicator_data.json
7.  fetch_market_news.py             → market_news/*.json
8.  fetch_corporate_actions.py       → upcoming/history_corporate_actions.json
9.  fetch_surveillance_lists.py      → nse_asm_list.json, nse_gsm_list.json
10. fetch_circuit_stocks.py          → upper/lower_circuit_stocks.json
11. fetch_bulk_block_deals.py        → bulk_block_deals.json
12. fetch_incremental_price_bands.py → incremental_price_bands.json
13. fetch_complete_price_bands.py    → complete_price_bands.json
14. fetch_all_indices.py             → all_indices_list.json

Phase 2.5: OHLCV History (Smart Incremental)

15. fetch_all_ohlcv.py         → ohlcv_data/*.csv
16. fetch_indices_ohlcv.py     → (indices OHLCV)
Smart Incremental Logic:
  • Checks existing CSV files in ohlcv_data/
  • Only fetches missing dates since last update
  • First run: Fetches up to 2 years of history per stock
  • Daily updates: Only fetches 1-2 days of new data
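The incremental logic above can be sketched as follows. This is a hypothetical illustration of the idea (not the actual fetch_all_ohlcv.py code): if a stock's CSV already exists, only the dates after its last row are requested; otherwise up to ~2 years of history are fetched. It assumes the date is the first column in ISO format.

```python
import csv
import datetime as dt
import os

def dates_to_fetch(csv_path, today, full_history_days=730):
    """Return the (start, end) date range still missing for one stock."""
    if not os.path.exists(csv_path):
        # First run: fetch up to ~2 years of history
        return today - dt.timedelta(days=full_history_days), today
    with open(csv_path) as f:
        rows = list(csv.reader(f))
    last = dt.date.fromisoformat(rows[-1][0])  # assumes Date is the first column
    return last + dt.timedelta(days=1), today
```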

Phase 3: Base Analysis

17. bulk_market_analyzer.py    → all_stocks_fundamental_analysis.json (BASE)
Creates the master JSON structure with fundamental data for all stocks.

Phase 4: Enrichment (Order Matters!)

18. advanced_metrics_processor.py        → Adds ADR, RVOL, ATH, Turnover
19. process_earnings_performance.py      → Adds post-earnings returns
20. enrich_fno_data.py                   → Adds F&O flag, Lot Size, Next Expiry
21. process_market_breadth.py            → Generates sector analytics
22. process_historical_market_breadth.py → Generates breadth charts
23. add_corporate_events.py              → Adds Events, Announcements, News (LAST!)
Critical: add_corporate_events.py MUST run last as it performs final JSON injection.
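The injection pattern the Phase 4 scripts share can be sketched as below. This is a hedged illustration, not code from any of the actual scripts: each enrichment step loads the master JSON, indexes records by Symbol, merges its new fields in place, and writes the file back (which is why order matters and the last writer wins).

```python
import json

def inject_fields(master_path, updates):
    """Merge {symbol: {field: value}} into the master JSON array in place."""
    with open(master_path) as f:
        stocks = json.load(f)
    by_symbol = {s["Symbol"]: s for s in stocks}
    for symbol, fields in updates.items():
        if symbol in by_symbol:
            by_symbol[symbol].update(fields)  # dicts are shared, so stocks updates too
    with open(master_path, "w") as f:
        json.dump(stocks, f, ensure_ascii=False)
```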

Phase 5: Compression

Compress all output files:
- all_stocks_fundamental_analysis.json → .json.gz
- sector_analytics.json → .json.gz
- market_breadth.json → .json.gz

Compression ratio: ~90-95% size reduction
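A minimal version of this compression step, using only the standard library (the real pipeline may differ in buffering and error handling):

```python
import gzip
import os
import shutil

def compress_json(path):
    """Gzip `path` to `path + '.gz'` and return the fractional size reduction."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    raw, packed = os.path.getsize(path), os.path.getsize(gz_path)
    return 1 - packed / raw
```

Repetitive JSON (thousands of records sharing the same 86 keys) is exactly the kind of input gzip handles well, which is where the ~90-95% reduction comes from.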

Output Files

Primary Output

Location: ~/workspace/source/DO NOT DELETE EDL PIPELINE/all_stocks_fundamental_analysis.json.gz
Format: Gzip-compressed JSON array
Structure:
[
  {
    "Symbol": "RELIANCE",
    "Name": "Reliance Industries Limited",
    "Market Cap(Cr.)": 1850000,
    "Stock Price(₹)": 2734.50,
    "P/E": 28.5,
    "ROE(%)": 15.2,
    "Latest Quarter": "Dec 2025",
    "Net Profit Latest": 18200,
    "QoQ % Net Profit Latest": 5.3,
    "YoY % Net Profit Latest": 12.7,
    "RSI (14)": 62.5,
    "Event Markers": "💸: Dividend (15-Mar)",
    "Recent Announcements": [...],
    "News Feed": [...]
    // ... 86 total fields
  },
  // ... 2,775+ stocks
]
Decompression:
import gzip
import json

with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rb') as f:
    data = json.load(f)

print(f"Total stocks: {len(data)}")
print(f"Fields per stock: {len(data[0])}")

Secondary Outputs

File                       Size      Description
sector_analytics.json.gz   ~500 KB   Sector-wise aggregated metrics
market_breadth.json.gz     ~8 MB     Historical market breadth data
ohlcv_data/*.csv           ~200 MB   Individual stock OHLCV history
all_indices_list.json      ~85 KB    Market indices data (if FETCH_OPTIONAL=True)

Runtime Breakdown

First-Time Execution (with OHLCV)

Phase 1: Core Data                    ~30s
Phase 2: Data Enrichment              ~90s
Phase 2.5: OHLCV History (first)      ~30 min
Phase 3: Base Analysis                ~20s
Phase 4: Enrichment                   ~45s
Phase 5: Compression                  ~15s
─────────────────────────────────────────
Total:                                ~34 min

Daily Update (with incremental OHLCV)

Phase 1: Core Data                    ~30s
Phase 2: Data Enrichment              ~90s
Phase 2.5: OHLCV Incremental          ~2-5 min
Phase 3: Base Analysis                ~20s
Phase 4: Enrichment                   ~45s
Phase 5: Compression                  ~15s
─────────────────────────────────────────
Total:                                ~6-9 min

Without OHLCV

Phase 1: Core Data                    ~30s
Phase 2: Data Enrichment              ~90s
Phase 3: Base Analysis                ~20s
Phase 4: Enrichment                   ~30s
Phase 5: Compression                  ~15s
─────────────────────────────────────────
Total:                                ~4 min

Console Output Example

════════════════════════════════════════════════════════════
  EDL PIPELINE - FULL DATA REFRESH
════════════════════════════════════════════════════════════

📦 PHASE 1: Core Data (Foundation)
────────────────────────────────────────
  ▶ Running fetch_dhan_data.py...
  ✅ fetch_dhan_data.py (12.3s)
  ▶ Running fetch_fundamental_data.py...
  ✅ fetch_fundamental_data.py (18.7s)
  ▶ Downloading NSE Listing Dates...
  ✅ NSE Listing Dates downloaded.

📡 PHASE 2: Data Enrichment (Fetching)
────────────────────────────────────────
  ▶ Running fetch_company_filings.py...
  ✅ fetch_company_filings.py (45.2s)
  ...

📊 PHASE 2.5: OHLCV History (Smart Incremental)
────────────────────────────────────────
  ▶ Running fetch_all_ohlcv.py...
  ✅ fetch_all_ohlcv.py (142.5s)

🔬 PHASE 3: Base Analysis (Building Master JSON)
────────────────────────────────────────
  ▶ Running bulk_market_analyzer.py...
  ✅ bulk_market_analyzer.py (19.8s)

✨ PHASE 4: Enrichment (Injecting into Master JSON)
────────────────────────────────────────
  ▶ Running advanced_metrics_processor.py...
  ✅ advanced_metrics_processor.py (8.2s)
  ...

📦 PHASE 5: Compression (.json → .json.gz)
────────────────────────────────────────
  📦 Compressed: 58.3 MB → 2.1 MB (96% reduction)

🧹 CLEANUP: Removing intermediate files...
────────────────────────────────────────
  🗑️  Cleaned: 13 files + 2 dirs (56.2 MB freed)

════════════════════════════════════════════════════════════
  PIPELINE COMPLETE
════════════════════════════════════════════════════════════
  Total Time:  245.7s (4.1 min)
  Successful:  22/22
  Failed:      0/22

  📄 Output: all_stocks_fundamental_analysis.json.gz (2.1 MB)
  📦 Compression: 58.3 MB → 2.1 MB (96% smaller)
  🧹 Only .json.gz + ohlcv_data/ remain. All intermediate data purged.
════════════════════════════════════════════════════════════

Troubleshooting

Pipeline Fails at fetch_dhan_data.py

Error: CRITICAL: fetch_dhan_data.py failed. Cannot continue.
Cause: This script fetches the master stock list and creates master_isin_map.json, which all other scripts depend on.
Solutions:
  • Check internet connectivity
  • Verify Dhan API endpoint is accessible
  • Check if rate-limited (wait 5 minutes and retry)
  • Inspect error message in console output

OHLCV Fetch Takes Too Long

Symptom: Phase 2.5 exceeds 30 minutes
Solutions:
  • First run is expected to take ~30 min for full history
  • Reduce thread count: Edit fetch_all_ohlcv.py, set MAX_THREADS = 10 (line 14)
  • For faster daily updates, keep existing ohlcv_data/ directory - it will only fetch new dates
  • If not needed immediately, set FETCH_OHLCV = False and run later

Script Times Out

Error: ⏰ {script_name} TIMED OUT (>30 min)
Cause: The per-script timeout is set to 30 minutes (1800 seconds).
Solutions:
  • Check network stability
  • Increase timeout in run_full_pipeline.py line 117: timeout=3600 (1 hour)
  • Run the individual script manually to see detailed error
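The timeout behavior described above can be reproduced with subprocess. This is an illustrative sketch of how run_full_pipeline.py can enforce a per-script timeout; the actual implementation may differ.

```python
import subprocess
import sys

def run_script(script, timeout_s=1800):
    """Run one pipeline script; return True on success, False on failure or timeout."""
    try:
        result = subprocess.run([sys.executable, script], timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        print(f"⏰ {script} TIMED OUT (>{timeout_s // 60} min)")
        return False
```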

Compression Fails

Error: Files to compress not found
Cause: Phase 3 or Phase 4 failed to produce the expected output files.
Solutions:
  • Check console for which Phase 4 script failed
  • Run pipeline with CLEANUP_INTERMEDIATE = False to inspect intermediate files
  • Verify all_stocks_fundamental_analysis.json exists before compression

Memory Issues

Symptom: Process killed or out-of-memory errors
Solutions:
  • Free up system RAM (close other applications)
  • Reduce parallelization: Lower thread counts in fetcher scripts
  • Process in batches: Set FETCH_OPTIONAL = False
  • Pipeline requires ~2-4 GB RAM for full execution

Partial Data in Output

Symptom: Some stocks are missing fields or have empty values
Cause: Non-critical enrichment scripts failed, but the pipeline continued.
Solutions:
  • Check console output for failed scripts (marked with ❌)
  • Pipeline continues even if enrichment fails (line 126: return True)
  • Re-run pipeline to retry failed fetches
  • Some data sources may be temporarily unavailable (ASM/GSM lists, news feed)
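A quick audit along these lines can spot incomplete enrichment, for example by counting stocks whose OHLCV-derived fields are still 0 after a run. The field names below ("ADR", "RVOL") are assumptions based on this page's description; adjust them to the actual keys in your output.

```python
import gzip
import json

def audit_missing(path, fields=("ADR", "RVOL")):
    """Count stocks where each field is missing, zero, or empty."""
    with gzip.open(path, "rt") as f:
        stocks = json.load(f)
    return {field: sum(1 for s in stocks if not s.get(field)) for field in fields}
```

A non-zero count after an otherwise clean run usually means the corresponding fetcher failed mid-pipeline; re-running typically fills the gaps.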

Manual Script Execution

If you need to run individual scripts for debugging:
cd ~/workspace/source/DO\ NOT\ DELETE\ EDL\ PIPELINE/

# Core data (must run first)
python3 fetch_dhan_data.py
python3 fetch_fundamental_data.py

# Any enrichment script (requires master_isin_map.json)
python3 fetch_company_filings.py
python3 fetch_market_news.py

# OHLCV (requires dhan_data_response.json)
python3 fetch_all_ohlcv.py

# Base analysis (requires all fetched data)
python3 bulk_market_analyzer.py

# Enrichment (requires all_stocks_fundamental_analysis.json to exist)
python3 advanced_metrics_processor.py
python3 add_corporate_events.py  # Must be last!

Best Practices

Daily Updates

  • Run once per day after market close (after 3:30 PM IST)
  • Keep FETCH_OHLCV = True for incremental updates
  • OHLCV incremental fetch only takes 2-5 minutes
  • Set up a cron job for automated daily execution:
# Run at 4 PM on weekdays (Mon-Fri); the schedule assumes the server clock is set to IST
0 16 * * 1-5 cd ~/workspace/source/DO\ NOT\ DELETE\ EDL\ PIPELINE/ && python3 run_full_pipeline.py >> pipeline.log 2>&1

First-Time Setup

  • Allow 30-40 minutes for first run with OHLCV
  • Verify output file exists and is properly formatted
  • Test decompression with a JSON parser
  • Keep intermediate files for first run (CLEANUP_INTERMEDIATE = False)

Production Environment

  • Monitor disk space (OHLCV data grows to ~200 MB)
  • Archive old .json.gz files with timestamps
  • Set up error alerting for pipeline failures
  • Keep logs of each run for debugging
