Overview
The full pipeline executes 16 scripts in strict dependency order to produce `all_stocks_fundamental_analysis.json.gz` - a comprehensive dataset of 2,775+ Indian stocks with 86 fields per stock covering fundamentals, technicals, events, and sentiment.
Expected Runtime: ~4 minutes (without OHLCV) | ~34 minutes (with OHLCV first-time fetch)
Quick Start
Run the pipeline
- Fetch data from all sources (Dhan ScanX, NSE)
- Build the master JSON structure
- Enrich with technical indicators, events, and news
- Compress output to `.json.gz` format
- Clean up intermediate files
Configuration Options
Edit `run_full_pipeline.py` to customize behavior:
OHLCV Data Fetching
With `FETCH_OHLCV = True`:
- First run: downloads complete OHLCV history (~30 min for all stocks)
- Subsequent runs: incremental update only (~2-5 min)
- Enables the ADR, RVOL, ATH, % from ATH, and returns calculations

With `FETCH_OHLCV = False`:
- Skips OHLCV entirely
- ADR, RVOL, and ATH fields will be 0
- Runtime: ~4 minutes
Optional Standalone Data
- `all_indices_list.json` - 194 market indices
- `etf_data_response.json` - 361 ETFs
Auto-Cleanup
- `all_stocks_fundamental_analysis.json.gz`
- `sector_analytics.json.gz`
- `market_breadth.json.gz`
- `ohlcv_data/` directory (if `FETCH_OHLCV = True`)
Pipeline Phases
The pipeline executes in strict order:

Phase 1: Core Data (Foundation)
`fetch_dhan_data.py` must succeed - it creates `master_isin_map.json`, which all other scripts need.
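The fail-fast behavior of this critical first step, versus the keep-going behavior of the later enrichment steps, can be sketched as follows (a simplified illustration, not the actual `run_full_pipeline.py` code; the 1800-second timeout and the continue-on-failure behavior are described in the Troubleshooting section):

```python
import subprocess
import sys

def run_script(cmd, critical=False, timeout=1800):
    """Run one pipeline step. Critical failures abort the pipeline;
    non-critical enrichment failures are logged and skipped."""
    name = cmd[-1]
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        if critical:
            raise SystemExit(f"CRITICAL: {name} failed. Cannot continue.")
        print(f"❌ {name} failed; continuing without it")
    return True  # the pipeline keeps going after non-critical failures

# A critical step must succeed; later enrichment steps may fail without aborting.
ok = run_script([sys.executable, "-c", "print('master_isin_map.json built')"],
                critical=True)
```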
Phase 2: Data Enrichment (Fetching)
Phase 2.5: OHLCV History (Smart Incremental)
- Checks existing CSV files in `ohlcv_data/`
- Only fetches missing dates since the last update
- First run: Fetches up to 2 years of history per stock
- Daily updates: Only fetches 1-2 days of new data
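The incremental logic above can be sketched like this (an illustration that assumes each CSV has a `date` column in ISO format; the actual layout of the files in `ohlcv_data/` may differ):

```python
import csv
from datetime import date, timedelta
from pathlib import Path

def dates_to_fetch(csv_path, today, history_days=730):
    """Return the (start, end) date range still missing for one stock.
    First run: no CSV exists, so fetch up to ~2 years of history.
    Daily run: fetch only the days after the last stored date."""
    path = Path(csv_path)
    if not path.exists():
        return today - timedelta(days=history_days), today
    with path.open() as f:
        last = max(date.fromisoformat(row["date"]) for row in csv.DictReader(f))
    return last + timedelta(days=1), today
```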
Phase 3: Base Analysis
Phase 4: Enrichment (Order Matters!)
`add_corporate_events.py` MUST run last, as it performs the final JSON injection.
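Why ordering matters: a final-injection step rewrites the master JSON in place, so anything that ran after it could overwrite its fields. A minimal sketch of such a step (the `corporate_events` field name and ISIN keying are assumptions for illustration, not the confirmed schema):

```python
import json

def inject_events(analysis_path, events_by_isin):
    """Load the master JSON, attach events to each stock, write it back.
    Must run last so no later step clobbers the injected fields."""
    with open(analysis_path) as f:
        stocks = json.load(f)
    for stock in stocks:
        stock["corporate_events"] = events_by_isin.get(stock.get("isin"), [])
    with open(analysis_path, "w") as f:
        json.dump(stocks, f)
```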
Phase 5: Compression
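Phase 5 amounts to plain gzip compression of the output files; a standard-library sketch:

```python
import gzip
import shutil

def compress_output(src):
    """Write src.gz alongside src using gzip streaming copy."""
    with open(src, "rb") as fin, gzip.open(src + ".gz", "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return src + ".gz"
```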
Output Files
Primary Output
Location: `~/workspace/source/DO NOT DELETE EDL PIPELINE/all_stocks_fundamental_analysis.json.gz`
Format: Gzip-compressed JSON array
Structure:
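An illustrative record is shown below. The field names are examples inferred from the metrics mentioned in this guide (ADR, RVOL, ATH, events, sentiment); the actual 86-field schema may use different names and values:

```json
[
  {
    "symbol": "EXAMPLE",
    "isin": "INE000A00000",
    "sector": "Example Sector",
    "adr": 1.8,
    "rvol": 1.2,
    "ath": 3024.9,
    "pct_from_ath": -12.4,
    "corporate_events": [],
    "news_sentiment": "neutral"
  }
]
```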
Secondary Outputs
| File | Size | Description |
|---|---|---|
| `sector_analytics.json.gz` | ~500 KB | Sector-wise aggregated metrics |
| `market_breadth.json.gz` | ~8 MB | Historical market breadth data |
| `ohlcv_data/*.csv` | ~200 MB | Individual stock OHLCV history |
| `all_indices_list.json` | ~85 KB | Market indices data (if `FETCH_OPTIONAL=True`) |
Runtime Breakdown
First-Time Execution (with OHLCV)
~34 minutes total; the initial OHLCV history fetch dominates at ~30 minutes.
Daily Update (with incremental OHLCV)
~6-9 minutes: the ~4-minute base pipeline plus a 2-5 minute incremental OHLCV fetch.
Without OHLCV
~4 minutes.
Console Output Example
Troubleshooting
Pipeline Fails at fetch_dhan_data.py
Error: `CRITICAL: fetch_dhan_data.py failed. Cannot continue.`

Cause: This script fetches the master stock list and creates `master_isin_map.json`, which all other scripts need.
Solutions:
- Check internet connectivity
- Verify Dhan API endpoint is accessible
- Check if rate-limited (wait 5 minutes and retry)
- Inspect error message in console output
OHLCV Fetch Takes Too Long
Symptom: Phase 2.5 exceeds 30 minutes

Solutions:
- The first run is expected to take ~30 min for the full history
- Reduce the thread count: edit `fetch_all_ohlcv.py` and set `MAX_THREADS = 10` (line 14)
- For faster daily updates, keep the existing `ohlcv_data/` directory - it will only fetch new dates
- If not needed immediately, set `FETCH_OHLCV = False` and run later
Script Times Out
Error: `⏰ {script_name} TIMED OUT (>30 min)`

Cause: The individual script timeout is set to 30 minutes (1800 seconds)

Solutions:
- Check network stability
- Increase the timeout in `run_full_pipeline.py` line 117: `timeout=3600` (1 hour)
- Run the individual script manually to see the detailed error
Compression Fails
Error: Files to compress not found

Cause: Phase 3 or Phase 4 failed to produce the expected output files

Solutions:
- Check the console for which Phase 4 script failed
- Run the pipeline with `CLEANUP_INTERMEDIATE = False` to inspect intermediate files
- Verify `all_stocks_fundamental_analysis.json` exists before compression
Memory Issues
Symptom: Process killed or out-of-memory errors

Solutions:
- Free up system RAM (close other applications)
- Reduce parallelization: lower the thread counts in the fetcher scripts
- Process in batches: set `FETCH_OPTIONAL = False`
- Note that the pipeline requires ~2-4 GB of RAM for a full run
Partial Data in Output
Symptom: Some stocks are missing fields or have empty values

Cause: Non-critical enrichment scripts failed but the pipeline continued

Solutions:
- Check the console output for failed scripts (marked with ❌)
- The pipeline continues even if enrichment fails (line 126: `return True`)
- Re-run the pipeline to retry failed fetches
- Some data sources may be temporarily unavailable (ASM/GSM lists, news feed)
Manual Script Execution
If you need to run individual scripts for debugging, invoke them directly in the dependency order listed above.

Best Practices
Daily Updates
- Run once per day after market close (after 3:30 PM IST)
- Keep `FETCH_OHLCV = True` for incremental updates
- The OHLCV incremental fetch only takes 2-5 minutes
- Set up a cron job for automated daily execution:
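A crontab entry for a weekday evening run might look like the following (the repository path is a placeholder, and the schedule assumes the server's local time is IST):

```
# Run at 18:30 local time Mon-Fri, after market close
30 18 * * 1-5  cd /path/to/pipeline && python run_full_pipeline.py >> pipeline.log 2>&1
```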
First-Time Setup
- Allow 30-40 minutes for first run with OHLCV
- Verify output file exists and is properly formatted
- Test decompression with a JSON parser
- Keep intermediate files for the first run (`CLEANUP_INTERMEDIATE = False`)
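The verification and decompression checks above can be done with the standard library; a sketch:

```python
import gzip
import json

def verify_output(path="all_stocks_fundamental_analysis.json.gz"):
    """Decompress and parse the final output; return the stock count.
    Raises if the file is missing, not gzip, or not valid JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        stocks = json.load(f)
    assert isinstance(stocks, list), "expected a JSON array of stocks"
    return len(stocks)
```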
Production Environment
- Monitor disk space (OHLCV data grows to ~200 MB)
- Archive old `.json.gz` files with timestamps
- Set up error alerting for pipeline failures
- Keep logs of each run for debugging
Next Steps
- Incremental Updates - Run daily updates efficiently
- Single Stock Analysis - Analyze individual stocks
- API Reference - Detailed endpoint documentation