## Overview
Traditional RAG systems require centralizing all documents in a single location, which is often impractical or outright prohibited by privacy regulations and competitive concerns. FedRAG solves this by:

- Distributing Document Storage: Each organization maintains its own private document store
- Federated Retrieval: Queries are sent to all datasites, which retrieve relevant documents locally
- Privacy-Preserving Aggregation: Only the most relevant documents are shared (not entire databases)
- Consent-Based Access: Data owners review and approve all computational jobs
## Real-World Applications
- Healthcare: Query medical knowledge across hospitals without sharing patient data
- Legal: Search case law distributed across law firms while maintaining client confidentiality
- Research: Access papers and datasets from multiple institutions without data centralization
- Enterprise: Build AI assistants that can access siloed departmental knowledge
## Architecture

### FedRAG Pipeline
The federated RAG workflow consists of five stages:
- Query Broadcasting: Server sends user query to all participating clients
- Local Retrieval: Each client searches their local document store using FAISS index
- Document Collection: Top-k relevant documents from each client are returned to server
- Re-ranking & Merging: Server aggregates and re-ranks all retrieved documents
- LLM Augmentation: Final top-k documents are used as context for the LLM prompt
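The five stages above can be sketched end-to-end in plain Python. This is a toy stand-in, not the example's code: real clients embed with a transformers model, search a FAISS index, and run on remote datasites; all function names here are illustrative.

```python
# Minimal stand-in for the five-stage federated RAG loop.
# Each "client" is just a local list of documents here; in the real
# pipeline, retrieval runs remotely against a FAISS index per datasite.

def local_retrieval(query, documents, k=2):
    """Stage 2: score documents against the query (toy term-overlap score)."""
    q_terms = set(query.lower().split())
    scored = [(doc, len(q_terms & set(doc.lower().split()))) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def federated_rag(query, client_corpora, k=2):
    # Stage 1: broadcast the query to every client.
    per_client = [local_retrieval(query, docs, k) for docs in client_corpora]
    # Stage 3: collect top-k results from each client.
    collected = [hit for hits in per_client for hit in hits]
    # Stage 4: merge and re-rank by score (Option A, score-based merging).
    collected.sort(key=lambda pair: pair[1], reverse=True)
    top_docs = [doc for doc, _ in collected[:k]]
    # Stage 5: build the augmented LLM prompt from the merged context.
    context = "\n".join(f"- {doc}" for doc in top_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

clients = [
    ["insulin regulates blood glucose", "the heart pumps blood"],
    ["type 2 diabetes affects glucose metabolism", "bones store calcium"],
]
print(federated_rag("how is blood glucose regulated", clients))
```

Note that only the per-client top-k tuples cross the network boundary in stage 3, which is what keeps the bulk of each corpus private.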
### Key Components
#### 1. Local Document Retrieval (Client)
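Each client embeds its documents and answers queries with nearest-neighbor search. As a self-contained sketch, the toy hashing embedding and brute-force inner-product search below stand in for the real transformers model and FAISS index; `LocalDocStore` and `embed` are illustrative names, not the example's API.

```python
import hashlib

import numpy as np

def embed(text, dim=64):
    """Toy bag-of-words embedding; the real pipeline uses a transformers model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        bucket = int.from_bytes(hashlib.md5(token.encode()).digest()[:4], "big") % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class LocalDocStore:
    """Stand-in for a FAISS inner-product index over the client's private corpus."""

    def __init__(self, documents):
        self.documents = documents
        self.index = np.stack([embed(d) for d in documents])

    def search(self, query, k=3):
        scores = self.index @ embed(query)   # inner-product similarity
        top = np.argsort(-scores)[:k]        # best matches first
        return [(self.documents[i], float(scores[i])) for i in top]

store = LocalDocStore([
    "federated learning trains models across silos",
    "faiss performs vector similarity search",
    "bananas are yellow",
])
print(store.search("vector similarity search with faiss", k=1))
```

The `(document, score)` tuples returned by `search` are the only retrieval artifacts a client ships back to the server.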
#### 2. Document Merging (Server)
The server aggregates retrieved documents using either:

- Option A: Score-based Merging: sort all retrieved documents by their raw retrieval scores and keep the global top-k
- Option B: Reciprocal Rank Fusion (RRF): the default, which combines per-client rankings by reciprocal rank rather than raw score

#### 3. LLM Query Augmentation
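The merged top-k documents become context ahead of the user's question. A minimal prompt-builder sketch; the template wording is illustrative, not the example's actual prompt:

```python
def build_prompt(question, documents, max_docs=8):
    """Format the merged top-k snippets as context ahead of the question."""
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents[:max_docs])
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What regulates blood glucose?",
    ["Insulin lowers blood glucose.", "The pancreas secretes insulin."],
)
print(prompt)
```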
## Setup

### Prerequisites
Install system dependencies based on your OS.

### Clone the Example
### Install Dependencies
- `faiss-cpu` or `faiss-gpu`: vector similarity search
- `transformers`: Hugging Face models for embeddings
- `torch`: deep learning framework
- `syft_flwr`: SyftBox integration
### Download & Index Corpus
Before running FedRAG, download and index the document corpora:

| Corpus | Domain | Size | Documents |
|---|---|---|---|
| PubMed | Medical research | ~60 GB | ~33M abstracts |
| StatPearls | Medical textbooks | ~1 GB | ~7K chapters |
| Textbooks | Medical textbooks | ~2 GB | ~18K sections |
| Wikipedia | Medical articles | ~57 GB | ~5M articles |
The default setup uses Textbooks and StatPearls (first 100 chunks) to quickly demonstrate the pipeline. The total disk space for all corpora is ~120 GB.
## Running the Example

### Local Simulation
Run FedRAG with the Flower simulation engine. The simulation will:

- Simulate 2 clients (each with a different corpus)
- Evaluate questions from PubMedQA and BioASQ benchmark datasets
- Retrieve documents using FAISS from distributed corpora
- Aggregate results and query the LLM
- Report accuracy and execution time
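Assuming this example follows the standard Flower app layout (a `pyproject.toml` defining a `local-simulation` federation — check the repository for the actual entry point), the simulation is typically launched from the example directory with:

```shell
# Run the Flower app against the local simulation federation.
flwr run .
```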
### Configuration
Edit `pyproject.toml` to customize the pipeline:
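The real keys live in this example's `pyproject.toml`. Purely as an illustration of the shape of a Flower-style config section (every key name and value below is an assumption, not the example's actual configuration):

```toml
[tool.flwr.app.config]
# Illustrative keys only -- check the example's pyproject.toml for the real names.
retrieval-k = 8                        # top-k documents returned per client
corpus-names = "textbooks,statpearls"  # which corpus each simulated client serves
```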
## Jupyter Notebooks

### Local Setup
- Start with `local/do1.ipynb` (Data Owner 1 with Textbooks corpus)
- Then run `local/do2.ipynb` (Data Owner 2 with StatPearls corpus)
- Finally open `local/ds.ipynb` (Data Scientist who queries the federated system)
### Distributed Setup
The `distributed/` directory contains notebooks for a real distributed deployment using the SyftBox client.
## Example Results
After running the evaluation, you’ll see results like:

| QA Dataset | Questions | Answered | Accuracy | Time (secs) |
|---|---|---|---|---|
| PubMedQA | 10 | 8 | 0.53 | 6.03 |
| BioASQ | 10 | 9 | 0.61 | 5.83 |
- Questions: Total questions evaluated from benchmark dataset
- Answered: Questions the LLM provided an answer for (some may be unanswerable)
- Accuracy: Fraction of correct answers compared to ground truth
- Time: Average wall-clock time per question (including retrieval + LLM inference)
## Advanced Features

### GPU Acceleration
Enable GPU for faster LLM inference.

### Custom Merging Strategies
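One direction for a custom merging strategy is to weight each client's contribution inside the reciprocal rank fusion. A self-contained sketch of weighted RRF (the weighting scheme and all names are illustrative, not the example's API):

```python
def weighted_rrf(client_rankings, weights=None, k=60):
    """Merge per-client ranked document lists via weighted Reciprocal Rank Fusion.

    client_rankings: one ranked list of documents per client.
    weights: optional per-client trust weights (defaults to 1.0 each).
    k: standard RRF damping constant.
    """
    weights = weights or [1.0] * len(client_rankings)
    scores = {}
    for weight, ranking in zip(weights, client_rankings):
        for rank, doc in enumerate(ranking):
            # Documents retrieved by several clients accumulate score.
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked by both clients ("b") outranks single-client hits.
print(weighted_rrf([["a", "b"], ["b", "c"]]))  # ['b', 'a', 'c']
```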
Extend the default RRF merging with custom logic.

### Multi-Corpus Setup
Distribute different corpora across clients.

## Privacy Considerations
### What is Shared
- Top-k retrieved document snippets (typically 8-16 documents)
- Retrieval scores (distances from query)
- Query text (question being asked)
### What Stays Private
- Entire document corpus
- Non-retrieved documents
- FAISS index structure
- Embedding vectors
## Advanced Research Extensions
This example provides building blocks for more sophisticated FedRAG systems.

### 1. Domain-Specific Fine-Tuned LLMs
Combine FedRAG with federated learning to train domain-specific LLMs:

Jung, Jincheol, et al. “Federated Learning and RAG Integration: A Scalable Approach for Medical Large Language Models.” arXiv:2412.13720 (2024).
### 2. Confidential Compute for Re-ranking
Use trusted execution environments (TEEs) for secure document re-ranking:

Addison, Parker, et al. “C-FedRAG: A Confidential Federated Retrieval-Augmented Generation System.” arXiv:2412.13163 (2024).
### 3. Encrypted Vector Search
Apply homomorphic encryption for privacy-preserving similarity search:

Zhao, Dongfang. “FRAG: Toward Federated Vector Database Management for Collaborative and Secure Retrieval-Augmented Generation.” arXiv:2410.13272 (2024).
## Project Structure

## Deployment Options

### Local Simulation
Run on your local machine with simulated distributed corpora.
### SyftBox Network
Deploy across real distributed nodes with separate document stores.
## Next Steps

### Diabetes Prediction
Learn federated learning for model training.
### Federated Analytics
Compute statistics on distributed data.