The EDL Pipeline automatically compresses output files to .json.gz format, reducing storage requirements by roughly 90-96% while maintaining full data integrity.
Compression Overview
Compression is performed in Phase 5 of the pipeline using gzip compression level 9 (maximum compression).
Files Compressed
The pipeline compresses three primary output files:
- Stock Analysis: `all_stocks_fundamental_analysis.json` (~50 MB → ~2 MB, 96% reduction)
- Sector Analytics: `sector_analytics.json` (performance by sector/industry)
- Market Breadth: `market_breadth.csv` (daily breadth metrics)
Compression Implementation
The compression logic is implemented in run_full_pipeline.py (lines 136-166):
```python
import gzip
import os

def compress_output():
    """Compress final JSONs to .json.gz for ultra compression."""
    files_to_compress = {
        "all_stocks_fundamental_analysis.json": "all_stocks_fundamental_analysis.json.gz",
        "sector_analytics.json": "sector_analytics.json.gz",
        "market_breadth.csv": "market_breadth.csv.gz",
    }
    total_raw = 0
    total_gz = 0
    for filename, output_name in files_to_compress.items():
        json_path = os.path.join(BASE_DIR, filename)  # BASE_DIR is a module-level constant
        gz_path = os.path.join(BASE_DIR, output_name)
        if os.path.exists(json_path):
            raw_size = os.path.getsize(json_path)
            total_raw += raw_size
            # Read raw data
            with open(json_path, "rb") as f_in:
                data = f_in.read()
            # Write compressed with level 9 (max compression)
            with gzip.open(gz_path, "wb", compresslevel=9) as f_out:
                f_out.write(data)
            total_gz += os.path.getsize(gz_path)
        else:
            print(f"⚠️ {filename} not found to compress.")
    ratio = (1 - total_gz / total_raw) * 100 if total_raw > 0 else 0
    print(f"📦 Compressed: {total_raw / (1024 * 1024):.1f} MB → "
          f"{total_gz / (1024 * 1024):.1f} MB ({ratio:.0f}% reduction)")
    return total_raw, total_gz
```
Compression Statistics
Typical compression results:
| File | Raw Size | Compressed | Reduction |
|------|----------|------------|-----------|
| all_stocks_fundamental_analysis.json | ~50 MB | ~2 MB | 96% |
| sector_analytics.json | ~500 KB | ~50 KB | 90% |
| market_breadth.csv | ~100 KB | ~10 KB | 90% |
| **Total** | ~51 MB | ~2 MB | 96% |
The compression ratio varies based on data structure. JSON with repetitive field names compresses exceptionally well with gzip.
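As a quick illustration (not taken from the pipeline itself), compressing synthetic records with identical field names achieves a far higher reduction than compressing incompressible random bytes:

```python
import gzip
import json
import random

# Build a JSON payload with highly repetitive field names,
# similar in shape to the pipeline's stock records.
records = [
    {"Symbol": f"STK{i}", "Stock Price(₹)": 100.0 + i, "Sector": "IT"}
    for i in range(1000)
]
raw = json.dumps(records).encode("utf-8")

# Compress with level 9, as the pipeline does.
compressed = gzip.compress(raw, compresslevel=9)
ratio = (1 - len(compressed) / len(raw)) * 100
print(f"Repetitive JSON: {len(raw)} B -> {len(compressed)} B ({ratio:.0f}% reduction)")

# Random bytes barely compress at all.
random.seed(0)
noise = bytes(random.getrandbits(8) for _ in range(len(raw)))
noise_gz = gzip.compress(noise, compresslevel=9)
noise_ratio = (1 - len(noise_gz) / len(noise)) * 100
print(f"Random bytes:    {len(noise)} B -> {len(noise_gz)} B ({noise_ratio:.0f}% reduction)")
```

The repeated key strings ("Symbol", "Sector", etc.) give gzip's dictionary coder long back-references to exploit, which is why the pipeline's JSON output compresses so dramatically.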
Reading Compressed Files
Python
```python
import gzip
import json

# Read compressed JSON
with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rt') as f:
    data = json.load(f)

# Access stock data
for stock in data:
    print(f"{stock['Symbol']}: ₹{stock['Stock Price(₹)']}")
```
Command Line
```bash
# View compressed file without extracting
zcat all_stocks_fundamental_analysis.json.gz | head -n 50

# Extract to disk, keeping the .gz file
gunzip -k all_stocks_fundamental_analysis.json.gz

# Pipe to jq for filtering
zcat all_stocks_fundamental_analysis.json.gz | jq '.[] | select(.Sector == "IT")'
```
pandas
```python
import gzip
import json
import pandas as pd

# Load into DataFrame
with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rt') as f:
    data = json.load(f)

df = pd.DataFrame(data)
print(df[['Symbol', 'Stock Price(₹)', 'Market Cap(Cr.)', 'P/E']].head())
```
Compression Level Comparison
Gzip supports compression levels 1-9:
| Level | Speed | Reduction | Use Case |
|-------|-------|-----------|----------|
| 1 | Fastest | ~85% | Real-time processing |
| 6 | Balanced | ~92% | Default gzip |
| 9 | Slowest | ~96% | EDL Pipeline (max compression) |
The pipeline uses level 9 because compression happens once per day, and the 96% space savings far outweigh the ~2 second compression time.
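This trade-off can be checked empirically. The following sketch (a hypothetical benchmark, using a synthetic payload rather than real pipeline output) compresses the same data at levels 1, 6, and 9 and reports size and wall time; exact numbers depend on the data and machine:

```python
import gzip
import json
import time

# A synthetic JSON payload standing in for the pipeline's output.
payload = json.dumps(
    [{"Symbol": f"STK{i}", "P/E": 20.5, "Sector": "IT"} for i in range(20000)]
).encode("utf-8")

for level in (1, 6, 9):
    start = time.perf_counter()
    out = gzip.compress(payload, compresslevel=level)
    elapsed = time.perf_counter() - start
    reduction = (1 - len(out) / len(payload)) * 100
    print(f"level {level}: {len(out):>8} B  {reduction:.1f}% reduction  {elapsed * 1000:.1f} ms")
```

Higher levels spend more CPU searching for longer matches; for a once-daily batch job, that cost is negligible.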
Disabling Compression
If you need uncompressed output for compatibility:
Edit run_full_pipeline.py (line 297):
```python
# Comment out compression phase
# print("\n📦 PHASE 5: Compression (.json → .json.gz)")
# print("─" * 40)
# raw_size, gz_size = compress_output()
```
Keep raw JSON files in cleanup:
Edit INTERMEDIATE_FILES (line 92) to remove:
```python
# "all_stocks_fundamental_analysis.json",  # Keep uncompressed
```
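Alternatively, rather than commenting code out, the compression phase could be guarded by an environment variable. This is a hypothetical sketch, not part of the pipeline; the flag name `EDL_COMPRESS` and the helper `compression_enabled()` are invented for illustration:

```python
import os

def compression_enabled() -> bool:
    """Return False when the (hypothetical) EDL_COMPRESS flag is set to a falsy value."""
    return os.environ.get("EDL_COMPRESS", "1").lower() not in ("0", "false", "no")

# In run_full_pipeline.py, the phase would then be conditional:
if compression_enabled():
    print("\n📦 PHASE 5: Compression (.json → .json.gz)")
    # raw_size, gz_size = compress_output()
else:
    print("Compression disabled via EDL_COMPRESS")
```

This keeps the code path intact, so re-enabling compression is a matter of unsetting the variable rather than reverting an edit.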
Storage Considerations
Disk Space Requirements
- With compression: ~2 MB (final output)
- Without compression: ~50 MB (raw JSON)
- With OHLCV data: +200 MB (CSV files, not compressed)
Backup Strategy
```bash
# Daily backup (compressed files only)
cp all_stocks_fundamental_analysis.json.gz backups/stocks_$(date +%Y%m%d).json.gz

# Archive old backups (keep 30 days)
find backups/ -name "stocks_*.json.gz" -mtime +30 -delete
```
Next Steps
- Cleanup Options: learn about intermediate file cleanup after compression
- Working with Output: parse and analyze compressed output files