What is DeepEval?
DeepEval is an open-source evaluation framework for Large Language Models (LLMs) created by Confident AI. It provides tools for testing, evaluating, and improving LLM applications using both traditional metrics and LLM-based evaluation.

Purpose: Unit testing and evaluation framework for LLM outputs
Key Features:
- G-Eval and other LLM-based metrics
- Support for multiple LLM providers (OpenAI, Anthropic, Google, xAI)
- Test case management and evaluation tracking
- Integration with popular frameworks (LangChain, LlamaIndex)
Why DeepEval in CheckThat AI?
CheckThat AI uses DeepEval as its evaluation infrastructure for several reasons.

Core Benefits
DeepEval Advantages:
- Multi-Provider Support: Evaluate with GPT-4, Claude, Gemini, or Grok
- G-Eval Implementation: Built-in support for advanced LLM-based evaluation
- Test Case Management: Structured approach to evaluation scenarios
- Custom Metrics: Extensible framework for domain-specific evaluation
- Tracing and Observability: Track evaluation history and refinement loops
- Production-Ready: Robust error handling and async support
Architecture Overview
CheckThat AI’s DeepEval integration consists of three main components.

Core Components
1. DeepEval Model Wrapper
The DeepEvalModel class abstracts model provider selection:
2. G-Eval Metric Configuration
CheckThat AI defines evaluation criteria in STATIC_EVAL_SPECS:
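The actual spec table lives in /api/types/evals.py and is not reproduced here; the shape below is a guess at what such a structure could look like. The criterion names, wording, and thresholds are invented for illustration. Each entry would typically be turned into a DeepEval G-Eval metric (name, natural-language criteria, passing threshold).

```python
# Hypothetical shape of STATIC_EVAL_SPECS; names and thresholds are assumptions.
STATIC_EVAL_SPECS = {
    "factual_accuracy": {
        "criteria": "Does the output make only claims supported by the retrieved context?",
        "threshold": 0.8,
    },
    "clarity": {
        "criteria": "Is the claim stated unambiguously, without vague qualifiers?",
        "threshold": 0.7,
    },
}

# Each spec could then feed a G-Eval metric, e.g.:
# GEval(name=name, criteria=spec["criteria"], threshold=spec["threshold"], ...)
print(sorted(STATIC_EVAL_SPECS))  # → ['clarity', 'factual_accuracy']
```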
3. Refinement Service
The RefinementService orchestrates evaluation and iterative improvement:
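As a minimal sketch of the orchestration idea (not the actual code in /api/services/refinement/refine.py, whose names and signatures are not shown here): evaluate the text, and while it scores below the threshold, refine and re-evaluate up to a maximum number of iterations.

```python
# Refine-until-pass loop; function names and defaults are illustrative.
from typing import Callable

def refine_until_pass(
    text: str,
    evaluate: Callable[[str], float],   # returns a score in [0, 1]
    refine: Callable[[str], str],       # rewrites the text
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> tuple[str, float]:
    score = evaluate(text)
    for _ in range(max_iterations):
        if score >= threshold:
            break                       # good enough: stop refining
        text = refine(text)
        score = evaluate(text)
    return text, score

# Toy stand-ins: each refinement pass raises the score.
scores = {"draft": 0.5, "draft+": 0.7, "draft++": 0.9}
result = refine_until_pass("draft", scores.__getitem__, lambda t: t + "+")
print(result)  # → ('draft++', 0.9)
```

Capping iterations bounds both latency and judge-model cost, which matters once every loop turn is an LLM call.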
Test Case Structure
DeepEval uses LLMTestCase objects to structure evaluation:
Basic Test Case
Test Case Parameters
Example: Claim Evaluation
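The original claim-evaluation listing is not reproduced here. The sketch below shows the field-by-field shape of such a test case; in DeepEval these correspond to constructor parameters of `deepeval.test_case.LLMTestCase` (`input`, `actual_output`, `expected_output`, `retrieval_context`). It is written as a plain dict so the example stands alone, and the claim text is invented for illustration.

```python
# Shape of a claim-evaluation test case; field names mirror LLMTestCase
# parameters, and the claim/context text is invented for illustration.
claim_case = {
    "input": "Evaluate: 'Vaccines cause autism.'",
    "actual_output": "This claim is false; large cohort studies show no link.",
    "expected_output": "The claim contradicts the scientific consensus.",
    "retrieval_context": [
        "Large cohort studies have found no association between "
        "MMR vaccination and autism."
    ],
}
print(sorted(claim_case))
```

`input` and `actual_output` are the minimum most metrics need; reference-based metrics additionally compare against `expected_output`, and RAG-style metrics read `retrieval_context`.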
Custom Metrics
You can create domain-specific evaluation metrics.

Example: Scientific Accuracy Metric
Example: Check-Worthiness Metric
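The original metric listings are not shown here. As a stand-in, here is a hypothetical check-worthiness heuristic, not CheckThat AI's actual metric: it scores a claim higher when it contains concrete figures, causal language, and enough substance to verify. (A real DeepEval custom metric would instead subclass the framework's base metric class and implement its scoring method, typically delegating the judgment to an LLM.)

```python
# Hypothetical check-worthiness heuristic; thresholds and weights are invented.
import re

def check_worthiness(claim: str) -> float:
    score = 0.0
    if re.search(r"\d", claim):                                  # concrete figures
        score += 0.5
    if re.search(r"\b(causes?|leads? to|results? in)\b", claim, re.I):
        score += 0.3                                             # causal language
    if len(claim.split()) >= 5:                                  # enough to verify
        score += 0.2
    return min(score, 1.0)

print(check_worthiness("Smoking causes 8 million deaths per year."))  # → 1.0
```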
Async Event Loop Handling
The uvloop Problem
FastAPI apps typically run on uvloop (via uvicorn), but DeepEval’s evaluate() function creates its own event loop internally. Calling it directly from an async request handler therefore conflicts with the already-running loop:
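The conflict can be reproduced with plain asyncio, no FastAPI or DeepEval required. The stand-in function below mimics any library that calls `asyncio.run()` internally: invoked from inside a running event loop, it raises a RuntimeError.

```python
# Reproducing the event-loop conflict with a stand-in for deepeval.evaluate().
import asyncio

def blocking_evaluate():
    # Mimics a library that spins up its own event loop internally.
    return asyncio.run(asyncio.sleep(0, result="done"))

async def handler():
    try:
        # Called directly inside the running loop, as a FastAPI handler would.
        return blocking_evaluate()
    except RuntimeError as e:
        return f"conflict: {e}"

print(asyncio.run(handler()))
# → conflict: asyncio.run() cannot be called from a running event loop
```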
Solution: Thread Pool Execution
CheckThat AI runs DeepEval in a separate thread (/api/services/refinement/refine.py:34-44):
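The referenced lines are not reproduced here; the following is a minimal sketch of the thread-pool pattern using the same stand-in for `deepeval.evaluate()` as above. Because the worker thread has no running event loop, the library is free to create its own there.

```python
# Running a loop-creating function safely from async code via a thread pool.
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_evaluate():
    # Stand-in for deepeval.evaluate(); safe here because this worker thread
    # has no running event loop, so a new one can be created.
    return asyncio.run(asyncio.sleep(0, result=0.91))

async def handler():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Offload the blocking call; the request handler stays responsive.
        return await loop.run_in_executor(pool, blocking_evaluate)

print(asyncio.run(handler()))  # → 0.91
```

The trade-off is one extra thread per evaluation batch; reusing a single shared executor avoids repeated thread start-up costs under load.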
Evaluation Service
CheckThat AI also provides a standalone evaluation service.

Observability and Tracing
DeepEval provides tracing for monitoring evaluations.

Production Considerations
Error Handling
Wrap evaluation calls so that a single failed evaluation is caught and logged rather than crashing the whole batch.

Rate Limiting
Handle API rate limits gracefully, for example with exponential backoff and retries.

Cost Optimization
Cost-Saving Strategies:
- Use smaller models: Gemini Flash costs roughly a tenth as much as GPT-4
- Cache evaluations: Store scores for identical inputs
- Selective refinement: Only refine low-scoring claims
- Batch processing: Evaluate multiple claims in parallel
- Threshold tuning: A lower passing threshold means fewer claims fail and trigger refinement iterations
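The caching strategy above can be sketched as a simple memo table; this is an illustration of the idea, not the documented implementation. Identical (metric, text) pairs reuse a stored score, so the judge-model call is paid only once.

```python
# Sketch of evaluation-score caching keyed by metric name and input text.
import hashlib

_cache: dict[str, float] = {}

def cached_score(text: str, metric: str, evaluate) -> float:
    key = hashlib.sha256(f"{metric}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = evaluate(text)   # only pay for the LLM call once
    return _cache[key]

calls = []
def fake_evaluate(text):               # stand-in for a real judge-model call
    calls.append(text)
    return 0.9

cached_score("claim A", "accuracy", fake_evaluate)
cached_score("claim A", "accuracy", fake_evaluate)  # cache hit: no new call
print(len(calls))  # → 1
```

In production the cache would live in Redis or a database rather than process memory, and the key should include the judge model name so scores from different models are not conflated.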
API Integration
REST API Usage
CheckThat AI exposes refinement via a REST API.

References
Documentation
- DeepEval Docs: docs.confident-ai.com
- GitHub Repository: github.com/confident-ai/deepeval
- G-Eval Paper: “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment” (Liu et al., 2023)
Implementation Files
- DeepEval Wrapper: /api/_utils/deepeval_model.py
- Refinement Service: /api/services/refinement/refine.py
- Evaluation Service: /api/services/evaluation/evaluate.py
- Evaluation Specs: /api/types/evals.py