Overview
The `CouncilAsAJudge` class implements a parallel evaluation system in which multiple specialized judge agents simultaneously evaluate different aspects of a task response. Their findings are then aggregated into a comprehensive technical report.
Installation
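Assuming the standard PyPI distribution of the `swarms` package:

```shell
pip3 install -U swarms
```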
Evaluation Dimensions
The council evaluates responses across six key dimensions:

- Accuracy: Factual correctness, source credibility, logical consistency
- Helpfulness: Practical value, solution feasibility, problem-solving efficacy
- Harmlessness: Safety assessment, ethical considerations, bias detection
- Coherence: Structural integrity, logical flow, organization
- Conciseness: Communication efficiency, precision, information density
- Instruction Adherence: Compliance with requirements, constraint adherence
Attributes

The council is configured through the following attributes:

- Unique identifier for the council
- Display name of the council
- Description of the council's purpose
- `model_name`: Model name for judge agents (if not using random models)
- `output_type`: Type of output to return ("final", "dict", "list", etc.)
- `cache_size`: Size of the LRU cache for prompts
- `random_model_name`: Whether to use random model names for diversity
- `max_loops`: Maximum number of loops for agents
- Model name for the aggregator agent
- Specific model for judge agents (overrides `model_name`)
Methods
run()
Runs the evaluation process using parallel execution.

Parameters:

- `task` (str): Task containing the response to evaluate
Usage Examples
Basic Evaluation
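A minimal sketch of a basic evaluation. The top-level import path and the embedded task format are assumptions; adjust them to your installed version of swarms:

```python
from swarms import CouncilAsAJudge  # assumed import path

# Create a council with default settings
council = CouncilAsAJudge()

# The task string carries both the original request and the response to evaluate
task = (
    "Task: Explain the difference between a list and a tuple in Python.\n"
    "Response: Lists are mutable and can be changed after creation, "
    "while tuples are immutable and hashable."
)

# All six judges run in parallel; their findings are aggregated into one report
report = council.run(task=task)
print(report)
```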
Custom Model Configuration
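A sketch pinning the judges to a specific model; the import path is an assumption, and the model identifier can be any name supported by LiteLLM:

```python
from swarms import CouncilAsAJudge  # assumed import path

# Pin the judge agents to a specific model instead of the default
council = CouncilAsAJudge(
    model_name="gpt-4o-mini",  # any model identifier supported by LiteLLM
    output_type="final",       # return only the aggregated report
)

task = (
    "Task: Summarize TCP vs UDP.\n"
    "Response: TCP is reliable and ordered; UDP is connectionless and faster."
)
report = council.run(task=task)
print(report)
```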
With Random Model Diversity
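A sketch enabling random model selection so each judge draws its own model, giving the six dimensions independent perspectives; the import path is an assumption:

```python
from swarms import CouncilAsAJudge  # assumed import path

# Each judge agent is assigned a randomly chosen model for diversity
council = CouncilAsAJudge(random_model_name=True)

task = (
    "Task: Review this answer for correctness.\n"
    "Response: The capital of Australia is Canberra."
)
report = council.run(task=task)
print(report)
```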
Full Conversation History
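A sketch that returns the full conversation history instead of only the final report. The import path, and the assumption that the "dict" output type carries the complete history, are both unverified here:

```python
from swarms import CouncilAsAJudge  # assumed import path

# "dict" is assumed to return the full conversation history
# rather than just the final aggregated report
council = CouncilAsAJudge(output_type="dict")

task = (
    "Task: Check this claim.\n"
    "Response: Water boils at 100 degrees Celsius at sea level."
)
history = council.run(task=task)
print(history)
```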
Specific Judge Model
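A sketch of overriding the judges' model independently of the default. The import path is an assumption, and the `judge_model_name` parameter name is hypothetical; check the constructor signature in your installed version for the actual attribute described above:

```python
from swarms import CouncilAsAJudge  # assumed import path

council = CouncilAsAJudge(
    model_name="gpt-4o-mini",              # default model for agents
    judge_model_name="claude-3-5-sonnet",  # hypothetical parameter name:
                                           # overrides model_name for the judges
)

task = (
    "Task: Grade this proof sketch.\n"
    "Response: By induction on n, the base case n=1 holds trivially."
)
report = council.run(task=task)
print(report)
```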
Evaluation Report Structure
The final aggregated report includes:

1. Executive Summary
- Key strengths and weaknesses
- Critical issues requiring immediate attention
- Overall assessment
2. Detailed Analysis
- Cross-dimensional patterns
- Specific examples and their implications
- Technical impact assessment
3. Recommendations
- Prioritized improvement areas
- Specific technical suggestions
- Implementation considerations
Dimension-Specific Evaluations
Each judge provides:

- Specific Observations: References exact parts of the response
- Impact Analysis: Explains how issues affect quality
- Concrete Examples: Demonstrates strengths and weaknesses
- Improvement Suggestions: Actionable recommendations
Parallel Execution
The council uses `ThreadPoolExecutor` to evaluate all dimensions simultaneously:
- Workers: Automatically configured based on CPU count (75% of cores)
- Concurrency: All 6 dimensions evaluated in parallel
- Error Handling: Individual dimension failures don’t stop other evaluations
- Performance: Significantly faster than sequential evaluation
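The execution pattern above can be sketched independently of swarms. In this minimal example, the dimension names come from the list earlier in this document, `evaluate_dimension` is a stand-in for a real judge-agent call, and the 75%-of-cores worker heuristic mirrors the description above:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

DIMENSIONS = [
    "accuracy", "helpfulness", "harmlessness",
    "coherence", "conciseness", "instruction_adherence",
]

def evaluate_dimension(dimension: str, response: str) -> str:
    # Stand-in for a judge agent; a real judge would prompt an LLM here.
    if not response:
        raise ValueError("empty response")
    return f"{dimension}: evaluated {len(response)} characters"

def run_council(response: str) -> dict:
    # Use roughly 75% of available cores, as described above
    max_workers = max(1, int((os.cpu_count() or 1) * 0.75))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(evaluate_dimension, dim, response): dim
            for dim in DIMENSIONS
        }
        for future in as_completed(futures):
            dim = futures[future]
            try:
                results[dim] = future.result()
            except Exception as exc:
                # One failing judge must not stop the others
                results[dim] = f"evaluation failed: {exc}"
    return results

print(run_council("The response under review."))
```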
Example Output
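An illustrative skeleton of the aggregated report, with placeholder content following the structure described above (not real model output):

```
# Council Evaluation Report

## Executive Summary
- Key strengths: clear structure, accurate core claims
- Critical issues: one unsupported statistic requires a citation
- Overall assessment: strong response with minor factual gaps

## Detailed Analysis
- Cross-dimensional pattern: verbosity reduces both conciseness and coherence
- Technical impact: the unsupported statistic undermines accuracy and credibility

## Recommendations
1. Cite a source for the statistic in paragraph two
2. Remove the redundant restatement of the conclusion
```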
Features
- Multi-Dimensional Evaluation: Comprehensive assessment across 6 key dimensions
- Parallel Processing: All evaluations run concurrently for maximum speed
- Expert Judge Agents: Each dimension is evaluated by a specialized agent
- Intelligent Aggregation: Senior agent synthesizes all findings
- Technical Analysis: Detailed, actionable feedback for improvement
- Flexible Models: Support for any LLM model via LiteLLM
- Caching: LRU cache for frequently used prompts
- Error Handling: Robust exception handling for each dimension
- Multiple Output Formats: Choose from various output types
Use Cases
- Response Quality Assessment: Evaluate LLM outputs before deployment
- Model Comparison: Compare different models’ responses
- Training Data Evaluation: Assess quality of training examples
- Content Review: Evaluate generated content for publication
- Automated QA: Build quality assurance pipelines
- A/B Testing: Compare different prompt variations
Best Practices
- Output Type Selection:
  - Use `"final"` for quick summary reports
  - Use `"dict"` to analyze individual dimension evaluations
  - Use `"json"` for integration with other systems
- Model Selection:
  - Enable `random_model_name=True` for diverse perspectives
  - Use stronger models (GPT-4, Claude) for critical evaluations
  - Use faster models (GPT-4o-mini) for development/testing
- Performance:
  - The council auto-configures workers based on CPU count
  - Consider increasing `cache_size` for repeated similar evaluations
  - Monitor costs when using multiple premium models
- Integration:
  - Parse the aggregated report for actionable insights
  - Use dimension-specific feedback for targeted improvements
  - Store evaluations to track quality over time
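To make the integration advice concrete, here is a sketch that mines a dimension-keyed evaluation for targeted improvements. The dictionary shape is a hypothetical stand-in for a "dict"-type result; the real key names and structure returned by the council may differ:

```python
# Hypothetical shape of a "dict"-type evaluation; the real structure may differ.
evaluation = {
    "accuracy": "Strong sourcing overall; one unsupported claim in paragraph 2.",
    "helpfulness": "Actionable steps provided, but edge-case guidance is missing.",
    "conciseness": "Roughly a fifth of the text repeats earlier points.",
}

def flag_dimensions(evaluation: dict, keyword: str) -> list:
    """Return the dimensions whose feedback mentions the given keyword."""
    return [dim for dim, feedback in evaluation.items() if keyword in feedback.lower()]

# Surface dimensions that call out missing material for targeted follow-up
print(flag_dimensions(evaluation, "missing"))  # -> ['helpfulness']
```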