Pipeline Architecture - ChartsMaze EDL Pipeline

Overview

The EDL (Exchange Data Layer) Pipeline is a data integration system that processes market data from Dhan/NSE endpoints into a unified 86-field schema. The pipeline executes 16 core scripts in strict dependency order and produces all_stocks_fundamental_analysis.json.gz in approximately 4 to 5 minutes on an incremental run.

Single Command Execution:

python3 run_full_pipeline.py

Phase 1: Core Data (Foundation)

Purpose

Establishes the foundation by fetching the complete market dataset and fundamental financial data for all stocks.

Scripts Executed

1. `fetch_dhan_data.py`

Output: dhan_data_response.json + master_isin_map.json
Records: ~2,775 stocks
Critical Dependency: The per-stock Phase 2 scripts (filings, announcements, advanced indicators, market news) read master_isin_map.json

2. `fetch_fundamental_data.py`

Output: fundamental_data.json (35 MB)
Contains: Quarterly results, ratios, P&L statements, balance sheets
Timeout: 30s per ISIN

3. NSE Listing Dates CSV Download

Source: https://nsearchives.nseindia.com/content/equities/EQUITY_L.csv
Output: nse_equity_list.csv
Used by: Phase 3 analysis to populate “Listing Date” field
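
To illustrate how listing dates might be read from the downloaded CSV, here is a minimal sketch. The column names `SYMBOL` and `DATE OF LISTING`, the `06-OCT-2008` date format, and the helper name `load_listing_dates` are assumptions for illustration, not taken from the pipeline code.

```python
import csv
import io
from datetime import datetime


def load_listing_dates(csv_text, symbol_col="SYMBOL", date_col="DATE OF LISTING"):
    """Map symbol -> ISO listing date from CSV text (column names are assumed)."""
    listing = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Header names may carry stray whitespace, so normalise keys and values.
        clean = {
            (key or "").strip(): value.strip() if isinstance(value, str) else ""
            for key, value in row.items()
        }
        symbol, raw_date = clean.get(symbol_col), clean.get(date_col)
        if not symbol or not raw_date:
            continue
        try:
            parsed = datetime.strptime(raw_date, "%d-%b-%Y")
        except ValueError:
            continue  # skip rows whose date does not match the assumed format
        listing[symbol] = parsed.date().isoformat()
    return listing


# Made-up sample rows; real data would come from nse_equity_list.csv.
sample = "SYMBOL, DATE OF LISTING\nABC,06-OCT-2008\nXYZ,not-a-date\n"
dates = load_listing_dates(sample)
print(dates)
```

Rows with a missing or unparseable date are skipped rather than raising, so one bad row would not block the "Listing Date" field for the others.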

Critical Rule: fetch_dhan_data.py MUST succeed for the pipeline to continue. It produces master_isin_map.json, which the per-stock scripts in later phases require.

Phase 2: Data Enrichment (Fetching)

Purpose

Enriches the core dataset with regulatory filings, news, technical indicators, corporate actions, and market metadata.

Scripts Executed (11 scripts)

| Script | Output | Dependencies |
| --- | --- | --- |
| `fetch_company_filings.py` | `company_filings/{SYMBOL}_filings.json` | master_isin_map.json |
| `fetch_new_announcements.py` | `all_company_announcements.json` | master_isin_map.json |
| `fetch_advanced_indicators.py` | `advanced_indicator_data.json` (8.3 MB) | master_isin_map.json (requires Sid) |
| `fetch_market_news.py` | `market_news/{SYMBOL}_news.json` | master_isin_map.json |
| `fetch_corporate_actions.py` | `upcoming/history_corporate_actions.json` | None (fetches via date filters) |
| `fetch_surveillance_lists.py` | `nse_asm_list.json`, `nse_gsm_list.json` | None (Google Sheets Gviz) |
| `fetch_circuit_stocks.py` | `upper/lower_circuit_stocks.json` | None |
| `fetch_bulk_block_deals.py` | `bulk_block_deals.json` | None |
| `fetch_incremental_price_bands.py` | `incremental_price_bands.json` | None (NSE CSV) |
| `fetch_complete_price_bands.py` | `complete_price_bands.json` | None (NSE CSV) |
| `fetch_all_indices.py` | `all_indices_list.json` | None |

Threading & Performance

Company Filings: 20 threads
Announcements: 40 threads
Advanced Indicators: 50 threads
Market News: 15 threads

Any Phase 2 script may fail, fully or partially, without stopping the pipeline. The pipeline continues and lists the failed scripts in the final report for review.
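
The threaded fetching and failure tolerance described above could be sketched roughly as follows. This is an illustration only: `fetch_all`, `fetch_one`, and `fake_fetch` are hypothetical names, and the real scripts may organise their thread pools differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(symbols, fetch_one, max_workers=20):
    """Run fetch_one(symbol) on a thread pool; collect results and failures."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, symbol): symbol for symbol in symbols}
        for future in as_completed(futures):
            symbol = futures[future]
            try:
                results[symbol] = future.result()
            except Exception:
                # A failed symbol is recorded, not re-raised, so the run continues.
                failed.append(symbol)
    return results, sorted(failed)


def fake_fetch(symbol):
    """Stand-in for a network call; fails for one symbol to show the behaviour."""
    if symbol == "BAD":
        raise RuntimeError("simulated network error")
    return {"symbol": symbol}


results, failed = fetch_all(["AAA", "BAD", "CCC"], fake_fetch, max_workers=20)
print(sorted(results), failed)
```

The `max_workers` value plays the role of the per-script thread counts listed above (for example 20 for company filings).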

Phase 2.5: OHLCV History (Smart Incremental)

Purpose

Fetches historical daily OHLCV (Open, High, Low, Close, Volume) data for all stocks and indices. This phase is optional and controlled by the FETCH_OHLCV flag.

Scripts Executed

1. `fetch_all_ohlcv.py` (Stocks)

Output: ohlcv_data/{SYMBOL}.csv
Threads: 15
Start Date: October 31, 1976 (Unix timestamp 215634600, which is 18:30 UTC on October 31, 1976, or midnight IST on November 1, 1976)
Duration: ~2-5 min (incremental), ~30 min (first run)
Interval: Daily candles
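
The start timestamp can be checked with the standard library. The short sketch below converts 215634600 to UTC and to India Standard Time; the fixed UTC+05:30 offset is defined by hand here and is not taken from the pipeline.

```python
from datetime import datetime, timedelta, timezone

START_TS = 215634600  # start timestamp quoted in the list above
IST = timezone(timedelta(hours=5, minutes=30))  # India Standard Time, UTC+05:30

start_utc = datetime.fromtimestamp(START_TS, tz=timezone.utc)
start_ist = start_utc.astimezone(IST)

print(start_utc.isoformat())  # 1976-10-31T18:30:00+00:00
print(start_ist.isoformat())  # 1976-11-01T00:00:00+05:30
```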

2. `fetch_indices_ohlcv.py` (Indices)

Output: ohlcv_data/indices/{INDEX}.csv
Optimized: High-speed specialized fetcher

Configuration

# In run_full_pipeline.py
FETCH_OHLCV = True   # Auto-incremental (default)
FETCH_OHLCV = False  # Skip (ADR, RVOL, ATH fields = 0)

Impact of Skipping OHLCV: If FETCH_OHLCV = False, the following fields will be zero/empty:

5/14/20/30 Days MA ADR(%) (Average Daily Range)
RVOL (Relative Volume)
% from ATH (Distance from All-Time High)
Returns since Earnings(%)
Max Returns since Earnings(%)

Phase 3: Base Analysis (Building Master JSON)

Purpose

Builds the foundational all_stocks_fundamental_analysis.json structure by combining fundamental data, technical indicators, and market metadata.

Script Executed

`bulk_market_analyzer.py`

Input Files:
- fundamental_data.json
- dhan_data_response.json
- advanced_indicator_data.json
- nse_equity_list.csv
Output: all_stocks_fundamental_analysis.json (base structure)
Fields Created: ~60 fields (see Output Schema)

Processing Logic

1. Load Data Sources
- Fundamental data (quarterly financials)
- Technical data (price, returns, RSI)
- Advanced indicators (Pivots, EMA/SMA)
- Listing dates from NSE CSV

2. Calculate Metrics
- QoQ/YoY % changes (Net Profit, EPS, Sales, OPM)
- 5-year Sales CAGR
- Valuation ratios (P/E, PEG, Forward P/E, D/E)
- Shareholding changes (FII, DII)
- Free float calculation

3. Assemble JSON
- 2,775 stock records
- 60+ fields per stock
- ~40 MB uncompressed JSON
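
The percentage-change and CAGR metrics listed above follow standard formulas, sketched below. The function names and the handling of zero or negative bases are assumptions for illustration; bulk_market_analyzer.py may treat those edge cases differently.

```python
def pct_change(current, previous):
    """Percentage change from previous to current; None when undefined."""
    if current is None or previous is None or previous == 0:
        return None
    # abs() keeps the sign meaningful when the base value is negative.
    return (current - previous) / abs(previous) * 100.0


def cagr(start_value, end_value, years):
    """Compound annual growth rate in percent; None for non-positive inputs."""
    if start_value is None or end_value is None:
        return None
    if start_value <= 0 or end_value <= 0 or years <= 0:
        return None
    return ((end_value / start_value) ** (1.0 / years) - 1.0) * 100.0


# Made-up quarterly sales, oldest first (five quarters).
sales = [100.0, 104.0, 110.0, 115.0, 120.0]
qoq = pct_change(sales[-1], sales[-2])   # latest quarter vs the previous quarter
yoy = pct_change(sales[-1], sales[-5])   # latest quarter vs the same quarter a year ago
five_year = cagr(100.0, 200.0, 5)        # sales doubling over five years
print(qoq, yoy, five_year)
```

QoQ compares against the immediately preceding quarter, while YoY compares against the quarter four periods back.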

Critical Rule: bulk_market_analyzer.py MUST succeed before Phase 4. All Phase 4 scripts modify this file in-place.

Phase 4: Field Injection (Order Matters!)

Purpose

Enriches the master JSON with advanced metrics, earnings tracking, F&O data, market breadth, and event markers. Execution order is critical as each script modifies the same JSON file.

Scripts Executed (7 scripts, sequential)

1. `advanced_metrics_processor.py`

Fields Injected:

5/14/20/30 Days MA ADR(%) — Average Daily Range
RVOL — Relative Volume (vs 20-day avg)
% from ATH — Distance from All-Time High
Daily Rupee Turnover 20/50/100(Cr.)
200 Days EMA Volume
% from 52W High 200 Days EMA Volume

Dependencies: ohlcv_data/*.csv
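
As a rough illustration of the OHLCV-derived fields, the sketch below computes ADR(%), RVOL, and % from ATH from a list of daily candles. The exact definitions are assumptions: ADR is taken as the mean of (high / low - 1) × 100, RVOL as the latest volume over the mean of the preceding 20 sessions, and ATH as the highest recorded high. The function names are hypothetical.

```python
def adr_percent(candles, window):
    """Mean of (high / low - 1) * 100 over the last `window` candles."""
    recent = [c for c in candles[-window:] if c["low"] > 0]
    if not recent:
        return 0.0
    return sum((c["high"] / c["low"] - 1.0) * 100.0 for c in recent) / len(recent)


def rvol(candles, window=20):
    """Latest volume divided by the mean volume of the preceding `window` candles."""
    if not candles:
        return 0.0
    prior = candles[-(window + 1):-1]
    average = sum(c["volume"] for c in prior) / len(prior) if prior else 0.0
    return candles[-1]["volume"] / average if average else 0.0


def pct_from_ath(candles):
    """Distance of the latest close from the highest high, in percent."""
    if not candles:
        return 0.0
    ath = max(c["high"] for c in candles)
    return (candles[-1]["close"] / ath - 1.0) * 100.0 if ath else 0.0


# Two made-up daily candles, oldest first.
candles = [
    {"high": 110.0, "low": 100.0, "close": 105.0, "volume": 1000},
    {"high": 120.0, "low": 100.0, "close": 108.0, "volume": 3000},
]
print(adr_percent(candles, 5), rvol(candles), pct_from_ath(candles))
```

Returning 0.0 when no data is available mirrors the documented behaviour that these fields are zero when OHLCV is skipped.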

2. `process_earnings_performance.py`

Fields Injected:

Quarterly Results Date
Returns since Earnings(%)
Max Returns since Earnings(%)

Dependencies: company_filings/*.json, ohlcv_data/*.csv
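
One plausible way to derive the two return fields is shown below. It assumes the baseline is the first close on or after the results date; the actual script may use a different baseline (for example the prior session's close). The function name is hypothetical.

```python
def returns_since_earnings(closes, results_date):
    """closes: list of (ISO date string, close), sorted ascending by date.

    Returns (return %, max return %) measured from the first close on or
    after results_date. ISO date strings compare correctly as plain strings.
    """
    after = [(day, close) for day, close in closes if day >= results_date]
    if not after or after[0][1] <= 0:
        return 0.0, 0.0
    base = after[0][1]
    latest = after[-1][1]
    peak = max(close for _, close in after)
    return (latest / base - 1.0) * 100.0, (peak / base - 1.0) * 100.0


# Made-up closes around a results date of 2024-01-15.
closes = [
    ("2024-01-10", 90.0),
    ("2024-01-15", 100.0),
    ("2024-01-16", 125.0),
    ("2024-01-17", 110.0),
]
since, best = returns_since_earnings(closes, "2024-01-15")
print(since, best)
```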

3. `enrich_fno_data.py`

Fields Injected:

F&O Flag (Yes/No)
Lot Size
Next Expiry Date

Dependencies: fno_lot_sizes_cleaned.json, fno_expiry_calendar.json

4. `process_market_breadth.py`

Output: market_breadth.csv

Generates:

Sector-wise advance/decline ratios
52-week high/low distribution
SMA 200 Above/Below counts

Dependencies: Requires returns and SMA status fields from Phase 3

5. `process_historical_market_breadth.py`

Output: Historical market breadth line charts
Purpose: Time-series analysis of market breadth metrics

6. `add_corporate_events.py` (MUST BE LAST)

Fields Injected:

Event Markers — Event icons with triggers
Recent Announcements — Top 5 regulatory filings
News Feed — Top 5 media news items
Circuit Limit — Current price band

Dependencies: All previous Phase 4 outputs + enrichment data

Critical Rule: add_corporate_events.py MUST be the very last script in Phase 4. It reads all existing fields to generate event markers and news aggregations.

Phase 5: Compression

Purpose

Compresses output files to reduce storage and transfer bandwidth.

Process

# Compression targets
files_to_compress = {
    "all_stocks_fundamental_analysis.json": "all_stocks_fundamental_analysis.json.gz",
    "sector_analytics.json": "sector_analytics.json.gz",
    "market_breadth.csv": "market_breadth.json.gz"
}

Compression Stats

Algorithm: gzip (level 9)
Typical Ratio: 70-80% size reduction
Example: 40 MB → 8-10 MB
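
A minimal sketch of level-9 gzip compression with the standard library follows. The helper name `gzip_file` is hypothetical, and the pipeline's own compression step may differ in detail.

```python
import gzip
import os
import shutil


def gzip_file(src, dst, level=9):
    """Compress src to dst with gzip and return the size reduction in percent."""
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)
    before, after = os.path.getsize(src), os.path.getsize(dst)
    return (1.0 - after / before) * 100.0 if before else 0.0


# The reduction formula applied to the sizes in the final report example:
reduction = (1.0 - 9.2 / 42.1) * 100.0
print(round(reduction))  # 78
```

Calling `gzip_file("all_stocks_fundamental_analysis.json", "all_stocks_fundamental_analysis.json.gz")` would write the compressed file next to the original and report the reduction.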

Phase 6: Optional Standalone Data

Purpose

Fetches additional market data that is NOT included in the master JSON but available for separate analysis.

Scripts (when `FETCH_OPTIONAL = True`)

| Script | Output | Records |
| --- | --- | --- |
| `fetch_all_indices.py` | `all_indices_list.json` | ~194 indices |
| `fetch_etf_data.py` | `etf_data_response.json` | ~361 ETFs |

These outputs are standalone and not consumed by the main pipeline. Use them for custom analysis or separate dashboards.

Cleanup & Intermediate Files

Auto-Cleanup (when `CLEANUP_INTERMEDIATE = True`)

After pipeline success, the following intermediate files are automatically deleted.

Files Removed:

master_isin_map.json
dhan_data_response.json
fundamental_data.json
advanced_indicator_data.json
all_company_announcements.json
upcoming/history_corporate_actions.json
nse_asm_list.json, nse_gsm_list.json
bulk_block_deals.json
upper/lower_circuit_stocks.json
incremental/complete_price_bands.json
nse_equity_list.csv
all_stocks_fundamental_analysis.json (raw, before compression)

Directories Removed:

company_filings/
market_news/

Files Kept:

all_stocks_fundamental_analysis.json.gz ✅
ohlcv_data/ directory ✅

Cleanup Stats

Typically frees 50-100 MB of disk space.
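
The cleanup step could look roughly like the sketch below. The file list is a subset of the names above, chosen for brevity, and the function name `cleanup` is hypothetical. Anything not explicitly listed, such as the .json.gz output and ohlcv_data/, is left untouched.

```python
import shutil
from pathlib import Path

# Subset of the intermediate files and directories listed above.
INTERMEDIATE_FILES = ["master_isin_map.json", "fundamental_data.json", "nse_equity_list.csv"]
INTERMEDIATE_DIRS = ["company_filings", "market_news"]


def cleanup(base_dir, files=INTERMEDIATE_FILES, dirs=INTERMEDIATE_DIRS):
    """Delete listed intermediate files and directories; return what was removed."""
    base = Path(base_dir)
    removed = []
    for name in files:
        path = base / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    for name in dirs:
        path = base / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name + "/")
    return removed
```

Using an explicit allow-list of names to delete, rather than deleting everything except the outputs, keeps an unexpected file in the working directory from being removed by accident.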

Configuration Flags

`FETCH_OHLCV` (Default: `True`)

FETCH_OHLCV = True   # Include lifetime OHLCV (~2-30 min extra)
FETCH_OHLCV = False  # Skip OHLCV (ADR, RVOL, ATH fields = 0)

`FETCH_OPTIONAL` (Default: `False`)

FETCH_OPTIONAL = True   # Include FNO, ETF, Indices standalone data
FETCH_OPTIONAL = False  # Skip optional data

`CLEANUP_INTERMEDIATE` (Default: `True`)

CLEANUP_INTERMEDIATE = True   # Auto-delete intermediate files
CLEANUP_INTERMEDIATE = False  # Keep all files for debugging

Execution Timeline

Typical Runtime (with `FETCH_OHLCV = True`, incremental)

| Phase | Duration | Scripts |
| --- | --- | --- |
| Phase 1 | ~30s | 2 scripts |
| Phase 2 | ~90s | 11 scripts (parallel threading) |
| Phase 2.5 | ~120s | 2 scripts (incremental OHLCV) |
| Phase 3 | ~20s | 1 script |
| Phase 4 | ~60s | 7 scripts (sequential) |
| Phase 5 | ~5s | Compression |
| Total | ~5 min | 23 scripts |

First-Time Run (with `FETCH_OHLCV = True`, full download)

Phase 2.5 Duration: ~30 minutes (lifetime OHLCV for 2,775 stocks)
Total Duration: ~35 minutes
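
The totals can be reproduced from the approximate per-phase durations quoted in this section. The sketch below sums them for the incremental case and then swaps in the 30-minute first-run figure for Phase 2.5.

```python
# Approximate per-phase durations in seconds, from the timeline above.
phase_seconds = {
    "Phase 1": 30,
    "Phase 2": 90,
    "Phase 2.5": 120,
    "Phase 3": 20,
    "Phase 4": 60,
    "Phase 5": 5,
}

incremental_total = sum(phase_seconds.values())

# First run: Phase 2.5 takes about 30 minutes instead of about 2 minutes.
first_run = {**phase_seconds, "Phase 2.5": 30 * 60}
first_run_total = sum(first_run.values())

print(incremental_total, round(incremental_total / 60, 1))  # 325 5.4
print(first_run_total, round(first_run_total / 60, 1))      # 2005 33.4
```

These sums are consistent with the rounded figures of about 5 minutes and about 35 minutes.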

Error Handling & Resilience

Critical Failures (Pipeline Stops)

fetch_dhan_data.py fails → No master_isin_map.json
bulk_market_analyzer.py fails → No base JSON for Phase 4

Non-Critical Failures (Pipeline Continues)

Any Phase 2 enrichment script can fail
Phase 4 scripts continue on error (fields may be missing)
Final report shows all failed scripts

Timeout Policy

Script Timeout: 30 minutes per script
Request Timeout: Varies by endpoint (10-30s)
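
The failure and timeout rules above might be enforced by a runner along these lines. This is a sketch under assumptions, not the code in run_full_pipeline.py: `run_step` and `run_pipeline` are hypothetical names, and the demo commands simply exit with code 0 or 1.

```python
import subprocess
import sys

SCRIPT_TIMEOUT = 30 * 60  # 30 minutes per script, as stated above


def run_step(cmd, timeout=SCRIPT_TIMEOUT):
    """Run one command; True on exit code 0, False on failure or timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def run_pipeline(steps):
    """steps: list of (name, cmd, critical). Stop at the first critical failure."""
    failed = []
    for name, cmd, critical in steps:
        if run_step(cmd):
            continue
        failed.append(name)
        if critical:
            return False, failed  # critical failure: later steps are skipped
    return True, failed


# Demo with stand-in commands instead of the real scripts.
ok_cmd = [sys.executable, "-c", "pass"]
fail_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
completed, failed = run_pipeline([
    ("fetch_dhan_data.py", ok_cmd, True),
    ("fetch_market_news.py", fail_cmd, False),
    ("bulk_market_analyzer.py", ok_cmd, True),
])
print(completed, failed)
```

In the demo the non-critical failure is recorded and the run still completes, which matches the behaviour described for Phase 2 scripts.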

Pipeline Output

Final Artifacts

✅ all_stocks_fundamental_analysis.json.gz (8-10 MB)
✅ ohlcv_data/ (CSV files, ~100-200 MB)
✅ sector_analytics.json.gz (if generated)
✅ market_breadth.json.gz (if generated)

Final Report Example

═══════════════════════════════════════════════════════════
  PIPELINE COMPLETE
═══════════════════════════════════════════════════════════
  Total Time:  285.3s (4.8 min)
  Successful:  21/23
  Failed:      2/23

  Failed Scripts:
    ❌ fetch_market_news.py
    ❌ fetch_indices_ohlcv.py

  📄 Output: all_stocks_fundamental_analysis.json.gz (9.2 MB)
  📦 Compression: 42.1 MB → 9.2 MB (78% smaller)
  🧹 Only .json.gz + ohlcv_data/ remain. All intermediate data purged.
═══════════════════════════════════════════════════════════
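
To consume the final artifact, the compressed JSON can be read directly without decompressing it to disk first. This is a minimal sketch: the helper name `load_master` is hypothetical, and nothing is assumed here about the top-level shape of the JSON.

```python
import gzip
import json


def load_master(path="all_stocks_fundamental_analysis.json.gz"):
    """Read and parse a gzip-compressed JSON file in one step."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `load_master()` from the pipeline's working directory returns the parsed data, ready for filtering or loading into a dataframe.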

Next Steps

Data Sources

Output Schema
