Overview

The EDL (Exchange Data Layer) Pipeline is a comprehensive data integration system that processes market data from Dhan/NSE endpoints into a unified 86-field schema. The pipeline executes 16 core scripts in strict dependency order and produces all_stocks_fundamental_analysis.json.gz in approximately 4 minutes.

Single Command Execution:
python3 run_full_pipeline.py

Phase 1: Core Data (Foundation)

Purpose

Establishes the foundation by fetching the complete market dataset and fundamental financial data for all stocks.

Scripts Executed

1. fetch_dhan_data.py

  • Output: dhan_data_response.json + master_isin_map.json
  • Records: ~2,775 stocks
  • Critical Dependency: All Phase 2 scripts depend on master_isin_map.json

2. fetch_fundamental_data.py

  • Output: fundamental_data.json (35 MB)
  • Contains: Quarterly results, ratios, P&L statements, balance sheets
  • Timeout: 30s per ISIN

3. NSE Listing Dates CSV Download

  • Source: https://nsearchives.nseindia.com/content/equities/EQUITY_L.csv
  • Output: nse_equity_list.csv
  • Used by: Phase 3 analysis to populate “Listing Date” field
Critical Rule: fetch_dhan_data.py MUST succeed for the pipeline to continue. It produces master_isin_map.json which is required by all subsequent scripts.

Phase 2: Data Enrichment (Fetching)

Purpose

Enriches the core dataset with regulatory filings, news, technical indicators, corporate actions, and market metadata.

Scripts Executed (11 scripts)

| Script | Output | Dependencies |
| --- | --- | --- |
| fetch_company_filings.py | company_filings/{SYMBOL}_filings.json | master_isin_map.json |
| fetch_new_announcements.py | all_company_announcements.json | master_isin_map.json |
| fetch_advanced_indicators.py | advanced_indicator_data.json (8.3 MB) | master_isin_map.json (requires Sid) |
| fetch_market_news.py | market_news/{SYMBOL}_news.json | master_isin_map.json |
| fetch_corporate_actions.py | upcoming/history_corporate_actions.json | None (fetches via date filters) |
| fetch_surveillance_lists.py | nse_asm_list.json, nse_gsm_list.json | None (Google Sheets Gviz) |
| fetch_circuit_stocks.py | upper/lower_circuit_stocks.json | None |
| fetch_bulk_block_deals.py | bulk_block_deals.json | None |
| fetch_incremental_price_bands.py | incremental_price_bands.json | None (NSE CSV) |
| fetch_complete_price_bands.py | complete_price_bands.json | None (NSE CSV) |
| fetch_all_indices.py | all_indices_list.json | None |

Threading & Performance

  • Company Filings: 20 threads
  • Announcements: 40 threads
  • Advanced Indicators: 50 threads
  • Market News: 15 threads
All Phase 2 scripts can partially fail without stopping the pipeline. The pipeline continues and marks failed scripts for review in the final report.
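
The thread counts above could be applied with a standard thread pool. The following is a minimal sketch, assuming each enrichment script fans out one request per symbol; `fetch_all` and the `fetch_one` callable are hypothetical names, not the pipeline's actual functions. Failures are collected per symbol rather than raised, which mirrors the "partial failure does not stop the pipeline" behaviour described here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Thread counts quoted in this section (the dictionary keys are illustrative).
THREADS = {
    "company_filings": 20,
    "announcements": 40,
    "advanced_indicators": 50,
    "market_news": 15,
}


def fetch_all(symbols, fetch_one, max_workers):
    """Run fetch_one(symbol) concurrently; return (results, failures)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, s): s for s in symbols}
        for future in as_completed(futures):
            symbol = futures[future]
            try:
                results[symbol] = future.result()
            except Exception as exc:  # one bad symbol must not abort the batch
                failures[symbol] = repr(exc)
    return results, failures
```

A caller would pass its own request function, for example `fetch_all(symbols, fetch_one, THREADS["market_news"])`, and report the `failures` mapping at the end of the run.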

Phase 2.5: OHLCV History (Smart Incremental)

Purpose

Fetches historical daily OHLCV (Open, High, Low, Close, Volume) data for all stocks and indices. This phase is optional and controlled by the FETCH_OHLCV flag.

Scripts Executed

1. fetch_all_ohlcv.py (Stocks)

  • Output: ohlcv_data/{SYMBOL}.csv
  • Threads: 15
  • Start Date: October 31, 1976, 18:30 UTC, which is midnight IST on November 1, 1976 (timestamp: 215634600)
  • Duration: ~2-5 min (incremental), ~30 min (first run)
  • Interval: Daily candles

2. fetch_indices_ohlcv.py (Indices)

  • Output: ohlcv_data/indices/{INDEX}.csv
  • Optimized: High-speed specialized fetcher
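
One way to picture the "smart incremental" behaviour is a helper that resumes from the newest candle already on disk and falls back to the full-history start otherwise. This is only a sketch: the `timestamp` column name and the `next_fetch_start` helper are assumptions about the CSV layout, not the actual implementation of fetch_all_ohlcv.py.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

FULL_HISTORY_START = 215634600  # start timestamp quoted above (Unix seconds)


def next_fetch_start(csv_path, ts_column="timestamp"):
    """Return the Unix timestamp from which to request daily candles.

    Falls back to the full-history start when no CSV exists yet.
    The column name is an assumption about the CSV layout.
    """
    path = Path(csv_path)
    if not path.exists():
        return FULL_HISTORY_START
    with path.open(newline="") as handle:
        stamps = [int(row[ts_column]) for row in csv.DictReader(handle)]
    # Resume one day (86,400 s) after the newest stored candle.
    return max(stamps) + 86_400 if stamps else FULL_HISTORY_START


# The start timestamp decodes to 18:30 UTC, i.e. midnight IST on 1976-11-01.
print(datetime.fromtimestamp(FULL_HISTORY_START, tz=timezone.utc).isoformat())
# 1976-10-31T18:30:00+00:00
```

On a first run every symbol starts from `FULL_HISTORY_START`, which is consistent with the much longer first-run duration.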

Configuration

# In run_full_pipeline.py
FETCH_OHLCV = True   # Auto-incremental (default)
FETCH_OHLCV = False  # Skip (ADR, RVOL, ATH fields = 0)
Impact of Skipping OHLCV: If FETCH_OHLCV = False, the following fields will be zero/empty:
  • 5/14/20/30 Days MA ADR(%) (Average Daily Range)
  • RVOL (Relative Volume)
  • % from ATH (Distance from All-Time High)
  • Returns since Earnings(%)
  • Max Returns since Earnings(%)

Phase 3: Base Analysis (Building Master JSON)

Purpose

Builds the foundational all_stocks_fundamental_analysis.json structure by combining fundamental data, technical indicators, and market metadata.

Script Executed

bulk_market_analyzer.py

  • Input Files:
    • fundamental_data.json
    • dhan_data_response.json
    • advanced_indicator_data.json
    • nse_equity_list.csv
  • Output: all_stocks_fundamental_analysis.json (base structure)
  • Fields Created: ~60 fields (see Output Schema)

Processing Logic

  1. Load Data Sources
    • Fundamental data (quarterly financials)
    • Technical data (price, returns, RSI)
    • Advanced indicators (Pivots, EMA/SMA)
    • Listing dates from NSE CSV
  2. Calculate Metrics
    • QoQ/YoY % changes (Net Profit, EPS, Sales, OPM)
    • 5-year Sales CAGR
    • Valuation ratios (P/E, PEG, Forward P/E, D/E)
    • Shareholding changes (FII, DII)
    • Free float calculation
  3. Assemble JSON
    • 2,775 stock records
    • 60+ fields per stock
    • ~40 MB uncompressed JSON
Critical Rule: bulk_market_analyzer.py MUST succeed before Phase 4. All Phase 4 scripts modify this file in-place.
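
The "Calculate Metrics" step can be illustrated with two small helpers. This is a sketch using the conventional definitions of percentage change and CAGR; the function names and the sample numbers are illustrative, and bulk_market_analyzer.py may handle edge cases (negative bases, missing quarters) differently.

```python
def pct_change(current, previous):
    """Percentage change from previous to current; None if undefined."""
    if previous in (0, None) or current is None:
        return None
    # abs() keeps the sign meaningful when the base value is negative.
    return (current - previous) / abs(previous) * 100.0


def cagr(first, last, years):
    """Compound annual growth rate in percent; None if undefined."""
    if not first or first <= 0 or last is None or last < 0 or years <= 0:
        return None
    return ((last / first) ** (1.0 / years) - 1.0) * 100.0


# Quarterly net profit, newest first (illustrative numbers, not real data).
net_profit = [120.0, 100.0, 90.0, 85.0, 80.0]
qoq = pct_change(net_profit[0], net_profit[1])  # vs the previous quarter
yoy = pct_change(net_profit[0], net_profit[4])  # vs the same quarter last year
```

With these numbers `qoq` is 20.0 and `yoy` is 50.0; a 5-year Sales CAGR would be `cagr(sales_5y_ago, sales_now, 5)`.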

Phase 4: Field Injection (Order Matters!)

Purpose

Enriches the master JSON with advanced metrics, earnings tracking, F&O data, market breadth, and event markers. Execution order is critical as each script modifies the same JSON file.

Scripts Executed (7 scripts, sequential)

1. advanced_metrics_processor.py

Fields Injected:
  • 5/14/20/30 Days MA ADR(%) — Average Daily Range
  • RVOL — Relative Volume (vs 20-day avg)
  • % from ATH — Distance from All-Time High
  • Daily Rupee Turnover 20/50/100(Cr.)
  • 200 Days EMA Volume
  • % from 52W High 200 Days EMA Volume
Dependencies: ohlcv_data/*.csv
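
The three headline metrics can be sketched from daily candles as follows. The formulas are common conventions and are assumptions here (ADR as the mean high/low ratio minus one, RVOL as the latest volume over the mean of the preceding 20 sessions, ATH taken from daily highs); the source does not state the exact definitions used by advanced_metrics_processor.py.

```python
from statistics import mean


def adr_pct(highs, lows, window):
    """Average Daily Range (%) over the last `window` candles."""
    ratios = [h / l for h, l in zip(highs[-window:], lows[-window:]) if l > 0]
    return (mean(ratios) - 1.0) * 100.0 if ratios else 0.0


def rvol(volumes, window=20):
    """Latest volume relative to the mean of the preceding `window` sessions."""
    history = volumes[-(window + 1):-1]
    base = mean(history) if history else 0
    return volumes[-1] / base if base else 0.0


def pct_from_ath(highs, last_close):
    """Distance of the last close from the all-time high, in percent."""
    ath = max(highs)
    return (last_close / ath - 1.0) * 100.0
```

Returning 0.0 when there is no usable history matches the documented behaviour that these fields are zero when OHLCV data is absent.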

2. process_earnings_performance.py

Fields Injected:
  • Quarterly Results Date
  • Returns since Earnings(%)
  • Max Returns since Earnings(%)
Dependencies: company_filings/*.json, ohlcv_data/*.csv
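
A minimal sketch of the two return fields, assuming both are measured from the close of the first candle on or after the results date, and that "max returns" uses the highest high since then. The tuple layout and the function name are hypothetical; the actual script may pick a different base price.

```python
def returns_since_earnings(candles, results_date):
    """candles: list of (iso_date, high, close) tuples, oldest first.

    Returns (returns_pct, max_returns_pct) measured from the close of the
    first candle on or after results_date, or (0.0, 0.0) if none exists.
    """
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    after = [c for c in candles if c[0] >= results_date]
    if not after:
        return 0.0, 0.0
    base_close = after[0][2]
    last_close = after[-1][2]
    peak_high = max(c[1] for c in after)
    return ((last_close / base_close - 1.0) * 100.0,
            (peak_high / base_close - 1.0) * 100.0)
```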

3. enrich_fno_data.py

Fields Injected:
  • F&O Flag (Yes/No)
  • Lot Size
  • Next Expiry Date
Dependencies: fno_lot_sizes_cleaned.json, fno_expiry_calendar.json
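
In-place field injection follows the same pattern in every Phase 4 script: load the master records, add fields, write them back. The sketch below shows only the injection step for the three F&O fields. The `Symbol` key, the shape of the lot-size mapping, and the single `next_expiry` value are assumptions, since the layout of fno_lot_sizes_cleaned.json and fno_expiry_calendar.json is not described here.

```python
def enrich_fno(records, lot_sizes, next_expiry):
    """Add the three F&O fields to each stock record in place.

    records:     list of dicts from the master JSON (assumed to carry "Symbol")
    lot_sizes:   mapping of symbol -> lot size for F&O-eligible stocks
    next_expiry: expiry date string applied to every eligible stock
    """
    for record in records:
        lot = lot_sizes.get(record.get("Symbol"))
        record["F&O Flag"] = "Yes" if lot else "No"
        record["Lot Size"] = lot or 0
        record["Next Expiry Date"] = next_expiry if lot else ""
    return records
```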

4. process_market_breadth.py

Output: market_breadth.csv
Generates:
  • Sector-wise advance/decline ratios
  • 52-week high/low distribution
  • SMA 200 Above/Below counts
Dependencies: Requires returns and SMA status fields from Phase 3
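
The sector-level aggregation can be sketched as a single pass over the stock records. The keys `sector`, `day_return`, `price`, and `sma200` are hypothetical stand-ins for the Phase 3 return and SMA status fields, which are not named in this section.

```python
from collections import defaultdict


def sector_breadth(stocks):
    """Aggregate advance/decline and SMA-200 counts per sector."""
    table = defaultdict(lambda: {"advances": 0, "declines": 0,
                                 "above_sma200": 0, "below_sma200": 0})
    for stock in stocks:
        row = table[stock["sector"]]
        if stock["day_return"] > 0:
            row["advances"] += 1
        elif stock["day_return"] < 0:
            row["declines"] += 1
        key = "above_sma200" if stock["price"] >= stock["sma200"] else "below_sma200"
        row[key] += 1
    for row in table.values():
        # Ratio is left as None when no stock declined, to avoid dividing by zero.
        row["ad_ratio"] = row["advances"] / row["declines"] if row["declines"] else None
    return dict(table)
```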

5. process_historical_market_breadth.py

Output: Historical market breadth line charts
Purpose: Time-series analysis of market breadth metrics

6. add_corporate_events.py (MUST BE LAST)

Fields Injected:
  • Event Markers — Event icons with triggers
  • Recent Announcements — Top 5 regulatory filings
  • News Feed — Top 5 media news items
  • Circuit Limit — Current price band
Dependencies: All previous Phase 4 outputs + enrichment data
Critical Rule: add_corporate_events.py MUST be the very last script in Phase 4. It reads all existing fields to generate event markers and news aggregations.

Phase 5: Compression

Purpose

Compresses output files to reduce storage and transfer bandwidth.

Process

# Compression targets
files_to_compress = {
    "all_stocks_fundamental_analysis.json": "all_stocks_fundamental_analysis.json.gz",
    "sector_analytics.json": "sector_analytics.json.gz",
    "market_breadth.csv": "market_breadth.json.gz"
}

Compression Stats

  • Algorithm: gzip (level 9)
  • Typical Ratio: 70-80% size reduction
  • Example: 40 MB → 8-10 MB
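
Level-9 gzip compression is available directly from the Python standard library. The helper below is a sketch (the name `gzip_file` is illustrative, not taken from run_full_pipeline.py) that compresses one file and reports the size reduction in percent.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src, dst, level=9):
    """Compress src into dst with gzip; return the size reduction in percent."""
    src, dst = Path(src), Path(dst)
    with src.open("rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large files are fine
    original = src.stat().st_size
    return (1 - dst.stat().st_size / original) * 100 if original else 0.0
```

As a check on the quoted ratio, a 42.1 MB file compressed to 9.2 MB is a reduction of (1 - 9.2 / 42.1) × 100 ≈ 78%.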

Phase 6: Optional Standalone Data

Purpose

Fetches additional market data that is NOT included in the master JSON but available for separate analysis.

Scripts (when FETCH_OPTIONAL = True)

| Script | Output | Records |
| --- | --- | --- |
| fetch_all_indices.py | all_indices_list.json | ~194 indices |
| fetch_etf_data.py | etf_data_response.json | ~361 ETFs |
These outputs are standalone and not consumed by the main pipeline. Use them for custom analysis or separate dashboards.

Cleanup & Intermediate Files

Auto-Cleanup (when CLEANUP_INTERMEDIATE = True)

After pipeline success, the following intermediate files are automatically deleted.
Files Removed:
  • master_isin_map.json
  • dhan_data_response.json
  • fundamental_data.json
  • advanced_indicator_data.json
  • all_company_announcements.json
  • upcoming/history_corporate_actions.json
  • nse_asm_list.json, nse_gsm_list.json
  • bulk_block_deals.json
  • upper/lower_circuit_stocks.json
  • incremental/complete_price_bands.json
  • nse_equity_list.csv
  • all_stocks_fundamental_analysis.json (raw, before compression)
Directories Removed:
  • company_filings/
  • market_news/
Files Kept:
  • all_stocks_fundamental_analysis.json.gz
  • ohlcv_data/ directory ✅

Cleanup Stats

Typically frees 50-100 MB of disk space.
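
The cleanup step amounts to deleting a fixed set of files and directories and totalling the bytes freed. The sketch below lists only the file names that appear unabbreviated above, so it is a partial list; the `cleanup` helper is hypothetical and not the pipeline's own code.

```python
import shutil
from pathlib import Path

# Partial list: only names given in full in the "Files Removed" section.
INTERMEDIATE_FILES = [
    "master_isin_map.json", "dhan_data_response.json", "fundamental_data.json",
    "advanced_indicator_data.json", "all_company_announcements.json",
    "nse_asm_list.json", "nse_gsm_list.json", "bulk_block_deals.json",
    "nse_equity_list.csv", "all_stocks_fundamental_analysis.json",
]
INTERMEDIATE_DIRS = ["company_filings", "market_news"]


def cleanup(workdir, files=INTERMEDIATE_FILES, dirs=INTERMEDIATE_DIRS):
    """Delete intermediate files and directories; return the bytes freed."""
    root, freed = Path(workdir), 0
    for name in files:
        path = root / name
        if path.is_file():
            freed += path.stat().st_size
            path.unlink()
    for name in dirs:
        directory = root / name
        if directory.is_dir():
            freed += sum(f.stat().st_size for f in directory.rglob("*") if f.is_file())
            shutil.rmtree(directory)
    return freed
```

Anything not named in the two lists, such as the .json.gz output and ohlcv_data/, is left untouched.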

Configuration Flags

FETCH_OHLCV (Default: True)

FETCH_OHLCV = True   # Include lifetime OHLCV (~2-30 min extra)
FETCH_OHLCV = False  # Skip OHLCV (ADR, RVOL, ATH fields = 0)

FETCH_OPTIONAL (Default: False)

FETCH_OPTIONAL = True   # Include FNO, ETF, Indices standalone data
FETCH_OPTIONAL = False  # Skip optional data

CLEANUP_INTERMEDIATE (Default: True)

CLEANUP_INTERMEDIATE = True   # Auto-delete intermediate files
CLEANUP_INTERMEDIATE = False  # Keep all files for debugging
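
Taken together, the three flags decide which phases run. The function below is a sketch of that gating with the documented defaults; `planned_phases` is a hypothetical helper, and the position of Phase 6 after compression is assumed from the phase numbering.

```python
def planned_phases(fetch_ohlcv=True, fetch_optional=False, cleanup_intermediate=True):
    """Return the ordered list of phases for a given flag combination."""
    phases = ["Phase 1", "Phase 2"]
    if fetch_ohlcv:
        phases.append("Phase 2.5")  # skipped when FETCH_OHLCV = False
    phases += ["Phase 3", "Phase 4", "Phase 5"]
    if fetch_optional:
        phases.append("Phase 6")    # standalone data, off by default
    if cleanup_intermediate:
        phases.append("Cleanup")
    return phases
```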

Execution Timeline

Typical Runtime (with FETCH_OHLCV = True, incremental)

| Phase | Duration | Scripts |
| --- | --- | --- |
| Phase 1 | ~30s | 2 scripts |
| Phase 2 | ~90s | 11 scripts (parallel threading) |
| Phase 2.5 | ~120s | 2 scripts (incremental OHLCV) |
| Phase 3 | ~20s | 1 script |
| Phase 4 | ~60s | 7 scripts (sequential) |
| Phase 5 | ~5s | Compression |
| Total | ~5 min | 23 scripts |

First-Time Run (with FETCH_OHLCV = True, full download)

  • Phase 2.5 Duration: ~30 minutes (lifetime OHLCV for 2,775 stocks)
  • Total Duration: ~35 minutes

Error Handling & Resilience

Critical Failures (Pipeline Stops)

  1. fetch_dhan_data.py fails → No master_isin_map.json
  2. bulk_market_analyzer.py fails → No base JSON for Phase 4

Non-Critical Failures (Pipeline Continues)

  • Any Phase 2 enrichment script can fail
  • Phase 4 scripts continue on error (fields may be missing)
  • Final report shows all failed scripts

Timeout Policy

  • Script Timeout: 30 minutes per script
  • Request Timeout: Varies by endpoint (10-30s)
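
The failure and timeout rules above can be sketched as a small runner loop. This is an illustration under stated assumptions, not the contents of run_full_pipeline.py: `run_scripts` and `_run_subprocess` are hypothetical names, and a timeout is treated the same as a non-zero exit code.

```python
import subprocess
import sys

SCRIPT_TIMEOUT = 30 * 60  # 30 minutes per script, as stated above
CRITICAL = {"fetch_dhan_data.py", "bulk_market_analyzer.py"}


def run_scripts(scripts, run=None):
    """Run scripts in order and stop at the first critical failure.

    Returns (succeeded, failed, aborted). `run` takes a script name and
    returns True on success; it defaults to launching a subprocess.
    """
    run = run or _run_subprocess
    succeeded, failed = [], []
    for script in scripts:
        if run(script):
            succeeded.append(script)
            continue
        failed.append(script)
        if script in CRITICAL:
            return succeeded, failed, True  # later scripts cannot proceed
    return succeeded, failed, False


def _run_subprocess(script):
    """Execute one script with the per-script timeout; True on exit code 0."""
    try:
        proc = subprocess.run([sys.executable, script], timeout=SCRIPT_TIMEOUT)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Injecting `run` keeps the loop testable without launching real scripts, and the `failed` list is what a final report would print.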

Pipeline Output

Final Artifacts

✅ all_stocks_fundamental_analysis.json.gz (8-10 MB)
✅ ohlcv_data/ (CSV files, ~100-200 MB)
✅ sector_analytics.json.gz (if generated)
✅ market_breadth.json.gz (if generated)
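
The compressed master file can be read back without unpacking it to disk first. A minimal sketch, assuming the archive holds a single UTF-8 JSON document; the `load_master` name is illustrative, and the top-level structure of the document (list or mapping of stock records) is not specified here.

```python
import gzip
import json


def load_master(path="all_stocks_fundamental_analysis.json.gz"):
    """Decompress and parse the master JSON in one step."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```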

Final Report Example

═══════════════════════════════════════════════════════════
  PIPELINE COMPLETE
═══════════════════════════════════════════════════════════
  Total Time:  285.3s (4.8 min)
  Successful:  21/23
  Failed:      2/23

  Failed Scripts:
    ❌ fetch_market_news.py
    ❌ fetch_indices_ohlcv.py

  📄 Output: all_stocks_fundamental_analysis.json.gz (9.2 MB)
  📦 Compression: 42.1 MB → 9.2 MB (78% smaller)
  🧹 Only .json.gz + ohlcv_data/ remain. All intermediate data purged.
═══════════════════════════════════════════════════════════

Next Steps

Data Sources

Explore all 12+ Dhan/NSE endpoints used in the pipeline

Output Schema

View the complete 86-field output schema structure
