The EDL Pipeline automatically compresses output files to .json.gz format, reducing storage requirements by roughly 90-96% while maintaining full data integrity.

Compression Overview

Compression is performed in Phase 5 of the pipeline using gzip compression level 9 (maximum compression).

Files Compressed

The pipeline compresses three primary output files:

Stock Analysis

all_stocks_fundamental_analysis.json — ~50 MB → ~2 MB (96% reduction)

Sector Analytics

sector_analytics.json — Performance by sector/industry

Market Breadth

market_breadth.csv — Daily breadth metrics

Compression Implementation

The compression logic is implemented in run_full_pipeline.py (lines 136-166):
import gzip
import os

# BASE_DIR (the pipeline's output directory) is defined at module level in run_full_pipeline.py

def compress_output():
    """Compress final JSONs to .json.gz for ultra compression."""
    files_to_compress = {
        "all_stocks_fundamental_analysis.json": "all_stocks_fundamental_analysis.json.gz",
        "sector_analytics.json": "sector_analytics.json.gz",
        "market_breadth.csv": "market_breadth.csv.gz"
    }
    
    total_raw = 0
    total_gz = 0
    
    for filename, output_name in files_to_compress.items():
        json_path = os.path.join(BASE_DIR, filename)
        gz_path = os.path.join(BASE_DIR, output_name)
        
        if os.path.exists(json_path):
            raw_size = os.path.getsize(json_path)
            total_raw += raw_size
            
            # Read raw data
            with open(json_path, "rb") as f_in:
                data = f_in.read()
            
            # Write compressed with level 9 (max compression)
            with gzip.open(gz_path, "wb", compresslevel=9) as f_out:
                f_out.write(data)
                
            total_gz += os.path.getsize(gz_path)
        else:
            print(f"⚠️  {filename} not found to compress.")
            
    ratio = (1 - total_gz / total_raw) * 100 if total_raw > 0 else 0
    print(f"📦 Compressed: {total_raw / (1024*1024):.1f} MB → {total_gz / (1024*1024):.1f} MB ({ratio:.0f}% reduction)")
    return total_raw, total_gz
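Gzip is lossless, so decompressing a .json.gz must reproduce the original file byte-for-byte. A minimal round-trip check (a helper sketch, not part of run_full_pipeline.py) can confirm this after compression:

```python
import gzip

def verify_roundtrip(raw_path: str, gz_path: str) -> bool:
    """Return True if decompressing gz_path reproduces raw_path byte-for-byte."""
    with open(raw_path, "rb") as f_in:
        original = f_in.read()
    with gzip.open(gz_path, "rb") as f_gz:
        restored = f_gz.read()
    return restored == original
```

Running this against each raw/compressed pair before deleting the raw files is a cheap safeguard.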

Compression Statistics

Typical compression results:
File                                   Raw Size   Compressed   Reduction
all_stocks_fundamental_analysis.json   ~50 MB     ~2 MB        96%
sector_analytics.json                  ~500 KB    ~50 KB       90%
market_breadth.csv                     ~100 KB    ~10 KB       90%
Total                                  ~51 MB     ~2 MB        96%
The compression ratio varies based on data structure. JSON with repetitive field names compresses exceptionally well with gzip.
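The effect of repetitive field names is easy to demonstrate. The sketch below builds a synthetic record list (the field names are illustrative, not the pipeline's actual schema) and compresses it in memory:

```python
import gzip
import json

# Records with identical field names, similar in shape to the pipeline's stock JSON
records = [{"Symbol": f"STK{i:04d}", "Sector": "IT", "P/E": 21.5} for i in range(5000)]
raw = json.dumps(records).encode()
compressed = gzip.compress(raw, compresslevel=9)

print(f"{len(raw)} → {len(compressed)} bytes "
      f"({(1 - len(compressed) / len(raw)) * 100:.0f}% reduction)")
```

Because every record repeats the same keys, gzip's dictionary-based matching eliminates nearly all of that redundancy.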

Reading Compressed Files

Python

import gzip
import json

# Read compressed JSON
with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rt') as f:
    data = json.load(f)

# Access stock data
for stock in data:
    print(f"{stock['Symbol']}: ₹{stock['Stock Price(₹)']}")

Command Line

# View compressed file without extracting
zcat all_stocks_fundamental_analysis.json.gz | head -n 50

# Extract to disk
gunzip -k all_stocks_fundamental_analysis.json.gz

# Pipe to jq for filtering
zcat all_stocks_fundamental_analysis.json.gz | jq '.[] | select(.Sector == "IT")'

pandas

import pandas as pd
import gzip
import json

# Load into DataFrame
with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rt') as f:
    data = json.load(f)
    
df = pd.DataFrame(data)
print(df[['Symbol', 'Stock Price(₹)', 'Market Cap(Cr.)', 'P/E']].head())
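Alternatively, pandas can infer gzip compression from the .gz suffix, so the manual gzip/json step can be skipped (this assumes the file is a top-level JSON array, as in the loop example above):

```python
import pandas as pd

def load_stocks(path: str) -> pd.DataFrame:
    """Load a .json.gz array of records; compression='infer' (the default) detects gzip."""
    return pd.read_json(path)
```

For example: `df = load_stocks('all_stocks_fundamental_analysis.json.gz')`.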

Compression Level Comparison

Gzip supports compression levels 1-9:
Level   Speed      Ratio   Use Case
1       Fastest    ~85%    Real-time processing
6       Balanced   ~92%    Default gzip
9       Slowest    ~96%    EDL Pipeline (max compression)
The pipeline uses level 9 because compression happens once per day, and the 96% space savings far outweigh the ~2 second compression time.
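The speed/ratio trade-off can be measured directly with the standard library. This sketch times levels 1, 6, and 9 on a synthetic payload standing in for the pipeline's stock JSON (exact numbers will vary by machine and data):

```python
import gzip
import json
import time

# Synthetic stand-in for the pipeline's output; keys repeat as in real records
payload = json.dumps(
    [{"Symbol": f"STK{i}", "Sector": "IT", "P/E": 21.5} for i in range(20000)]
).encode()

for level in (1, 6, 9):
    start = time.perf_counter()
    out = gzip.compress(payload, compresslevel=level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out):>8} bytes in {elapsed * 1000:.1f} ms")
```

Level 9 is typically several times slower than level 1, but for a once-daily batch job the smaller output wins.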

Disabling Compression

If you need uncompressed output for compatibility:
  1. Edit run_full_pipeline.py (line 297):
    # Comment out compression phase
    # print("\n📦 PHASE 5: Compression (.json → .json.gz)")
    # print("─" * 40)
    # raw_size, gz_size = compress_output()
    
  2. Keep raw JSON files in cleanup: Edit INTERMEDIATE_FILES (line 92) to remove:
    # "all_stocks_fundamental_analysis.json",  # Keep uncompressed
    

Storage Considerations

Disk Space Requirements

  • With compression: ~2 MB (final output)
  • Without compression: ~50 MB (raw JSON)
  • With OHLCV data: +200 MB (CSV files, not compressed)

Backup Strategy

# Daily backup (compressed files only)
cp all_stocks_fundamental_analysis.json.gz backups/stocks_$(date +%Y%m%d).json.gz

# Archive old backups (keep 30 days)
find backups/ -name "stocks_*.json.gz" -mtime +30 -delete
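Before pruning old archives, each backup can be verified without extraction using gzip's built-in test mode (the backups/ path is the one assumed in the snippet above):

```shell
# gzip -t checks archive integrity and exits non-zero on corruption
for f in backups/stocks_*.json.gz; do
    gzip -t "$f" || echo "corrupt: $f"
done
```

Running this as part of the daily backup step catches truncated or damaged archives while the source data is still available.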

Performance Impact

  • Time to compress: ~2 seconds for 50 MB JSON
  • Time to decompress: ~1 second (reading into memory)
  • CPU overhead: ~5% of total pipeline runtime
  • I/O savings: 96% fewer disk writes, 96% faster network transfers

Next Steps

Cleanup Options

Learn about intermediate file cleanup after compression

Working with Output

Parse and analyze compressed output files
