The EDL Pipeline automatically compresses output files to .json.gz format, reducing storage requirements by roughly 90-96% while maintaining full data integrity.

Compression Overview

Compression is performed in Phase 5 of the pipeline using gzip compression level 9 (maximum compression).

Files Compressed

The pipeline compresses three primary output files:

Stock Analysis

all_stocks_fundamental_analysis.json — ~50 MB → ~2 MB (96% reduction)

Sector Analytics

sector_analytics.json — Performance by sector/industry

Market Breadth

market_breadth.csv — Daily breadth metrics

Compression Implementation

The compression logic is implemented in run_full_pipeline.py (lines 136-166):
import gzip
import os

# BASE_DIR (the pipeline's output directory) is defined at module level in run_full_pipeline.py

def compress_output():
    """Compress final JSONs to .json.gz for ultra compression."""
    files_to_compress = {
        "all_stocks_fundamental_analysis.json": "all_stocks_fundamental_analysis.json.gz",
        "sector_analytics.json": "sector_analytics.json.gz",
        "market_breadth.csv": "market_breadth.csv.gz"
    }
    
    total_raw = 0
    total_gz = 0
    
    for filename, output_name in files_to_compress.items():
        json_path = os.path.join(BASE_DIR, filename)
        gz_path = os.path.join(BASE_DIR, output_name)
        
        if os.path.exists(json_path):
            raw_size = os.path.getsize(json_path)
            total_raw += raw_size
            
            # Read raw data
            with open(json_path, "rb") as f_in:
                data = f_in.read()
            
            # Write compressed with level 9 (max compression)
            with gzip.open(gz_path, "wb", compresslevel=9) as f_out:
                f_out.write(data)
                
            total_gz += os.path.getsize(gz_path)
        else:
            print(f"⚠️  {filename} not found to compress.")
            
    ratio = (1 - total_gz / total_raw) * 100 if total_raw > 0 else 0
    print(f"📦 Compressed: {total_raw / (1024*1024):.1f} MB → {total_gz / (1024*1024):.1f} MB ({ratio:.0f}% reduction)")
    return total_raw, total_gz
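Gzip is lossless, so decompressing a .json.gz must reproduce the original file byte-for-byte. A minimal round-trip check (a helper sketch, not part of run_full_pipeline.py) can confirm this after compression:

```python
import gzip

def verify_roundtrip(raw_path: str, gz_path: str) -> bool:
    """Return True if decompressing gz_path reproduces raw_path byte-for-byte."""
    with open(raw_path, "rb") as f_in:
        original = f_in.read()
    with gzip.open(gz_path, "rb") as f_gz:
        restored = f_gz.read()
    return restored == original
```

Running this against each raw/compressed pair before deleting the raw files is a cheap safeguard.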

Compression Statistics

Typical compression results:
File                                   Raw Size   Compressed   Reduction
all_stocks_fundamental_analysis.json   ~50 MB     ~2 MB        96%
sector_analytics.json                  ~500 KB    ~50 KB       90%
market_breadth.csv                     ~100 KB    ~10 KB       90%
Total                                  ~51 MB     ~2 MB        96%
The compression ratio varies based on data structure. JSON with repetitive field names compresses exceptionally well with gzip.
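The effect of repetitive field names is easy to demonstrate. The sketch below builds a synthetic record list (the field names are illustrative, not the pipeline's actual schema) and compresses it in memory:

```python
import gzip
import json

# Records with identical field names, similar in shape to the pipeline's stock JSON
records = [{"Symbol": f"STK{i:04d}", "Sector": "IT", "P/E": 21.5} for i in range(5000)]
raw = json.dumps(records).encode()
compressed = gzip.compress(raw, compresslevel=9)

print(f"{len(raw)} → {len(compressed)} bytes "
      f"({(1 - len(compressed) / len(raw)) * 100:.0f}% reduction)")
```

Because every record repeats the same keys, gzip's dictionary-based matching eliminates nearly all of that redundancy.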

Reading Compressed Files

Python

import gzip
import json

# Read compressed JSON
with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rt') as f:
    data = json.load(f)

# Access stock data
for stock in data:
    print(f"{stock['Symbol']}: ₹{stock['Stock Price(₹)']}")

Command Line

# View compressed file without extracting
zcat all_stocks_fundamental_analysis.json.gz | head -n 50

# Extract to disk
gunzip -k all_stocks_fundamental_analysis.json.gz

# Pipe to jq for filtering
zcat all_stocks_fundamental_analysis.json.gz | jq '.[] | select(.Sector == "IT")'

pandas

import pandas as pd
import gzip
import json

# Load into DataFrame
with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rt') as f:
    data = json.load(f)
    
df = pd.DataFrame(data)
print(df[['Symbol', 'Stock Price(₹)', 'Market Cap(Cr.)', 'P/E']].head())
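Alternatively, pandas can infer gzip compression from the .gz suffix, so the manual gzip/json step can be skipped (this assumes the file is a top-level JSON array, as in the loop example above):

```python
import pandas as pd

def load_stocks(path: str) -> pd.DataFrame:
    """Load a .json.gz array of records; compression='infer' (the default) detects gzip."""
    return pd.read_json(path)
```

For example: `df = load_stocks('all_stocks_fundamental_analysis.json.gz')`.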

Compression Level Comparison

Gzip supports compression levels 1-9:
Level   Speed      Ratio   Use Case
1       Fastest    ~85%    Real-time processing
6       Balanced   ~92%    Default gzip
9       Slowest    ~96%    EDL Pipeline (max compression)
The pipeline uses level 9 because compression happens once per day, and the 96% space savings far outweigh the ~2 second compression time.
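The speed/ratio trade-off can be measured directly with the standard library. This sketch times levels 1, 6, and 9 on a synthetic payload standing in for the pipeline's stock JSON (exact numbers will vary by machine and data):

```python
import gzip
import json
import time

# Synthetic stand-in for the pipeline's output; keys repeat as in real records
payload = json.dumps(
    [{"Symbol": f"STK{i}", "Sector": "IT", "P/E": 21.5} for i in range(20000)]
).encode()

for level in (1, 6, 9):
    start = time.perf_counter()
    out = gzip.compress(payload, compresslevel=level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out):>8} bytes in {elapsed * 1000:.1f} ms")
```

Level 9 is typically several times slower than level 1, but for a once-daily batch job the smaller output wins.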

Disabling Compression

If you need uncompressed output for compatibility:
  1. Edit run_full_pipeline.py (line 297):
    # Comment out compression phase
    # print("\n📦 PHASE 5: Compression (.json → .json.gz)")
    # print("─" * 40)
    # raw_size, gz_size = compress_output()
    
  2. Keep raw JSON files in cleanup: Edit INTERMEDIATE_FILES (line 92) to remove:
    # "all_stocks_fundamental_analysis.json",  # Keep uncompressed
    

Storage Considerations

Disk Space Requirements

  • With compression: ~2 MB (final output)
  • Without compression: ~50 MB (raw JSON)
  • With OHLCV data: +200 MB (CSV files, not compressed)

Backup Strategy

# Daily backup (compressed files only)
cp all_stocks_fundamental_analysis.json.gz backups/stocks_$(date +%Y%m%d).json.gz

# Archive old backups (keep 30 days)
find backups/ -name "stocks_*.json.gz" -mtime +30 -delete
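Before pruning old archives, each backup can be verified without extraction using gzip's built-in test mode (the backups/ path is the one assumed in the snippet above):

```shell
# gzip -t checks archive integrity and exits non-zero on corruption
for f in backups/stocks_*.json.gz; do
    gzip -t "$f" || echo "corrupt: $f"
done
```

Running this as part of the daily backup step catches truncated or damaged archives while the source data is still available.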

Performance Impact

  • Time to compress: ~2 seconds for 50 MB JSON
  • Time to decompress: ~1 second (reading into memory)
  • CPU overhead: ~5% of total pipeline runtime
  • I/O savings: 96% fewer disk writes, 96% faster network transfers

Next Steps

Cleanup Options

Learn about intermediate file cleanup after compression

Working with Output

Parse and analyze compressed output files
