# Introduction to DeepAgents Evals
@deepagents/evals is a comprehensive evaluation framework for LLM applications. It provides everything you need to measure, track, and improve the quality of AI systems over time.
## Why Evaluate?
LLM evaluation is essential for:

- Measuring Quality — Quantify accuracy, factuality, and other metrics
- Detecting Regressions — Catch breaking changes before deployment
- Comparing Models — Objectively compare different model versions or providers
- Iterating Confidently — Make informed decisions backed by data
- CI/CD Integration — Run evals in your pipeline to block bad releases
## Key Features
### Dataset Loading

Load evaluation data from multiple sources.

### Scoring Functions

Use built-in scorers, or create custom ones that return a `ScorerResult`.
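The exact loading and scorer APIs are not shown on this page, so the sketch below only illustrates the likely shapes involved: a dataset as a list of cases, and a scorer that returns a `ScorerResult`. Apart from the `ScorerResult` name, everything here (field names, `exactMatch`, `fakeModel`) is an assumption, not the library's actual API.

```typescript
// Hypothetical shapes only -- the real @deepagents/evals types and function
// names are not documented on this page and will differ.

// A dataset is a list of cases, each pairing an input with an expected output.
interface EvalCase {
  input: string;
  expected: string;
}

// A scorer inspects the model's output for one case and returns a ScorerResult.
interface ScorerResult {
  score: number; // 0..1
  passed: boolean;
  reason?: string;
}

type Scorer = (output: string, testCase: EvalCase) => ScorerResult;

// An exact-string-match scorer, in the style of a built-in.
const exactMatch: Scorer = (output, testCase) => {
  const passed = output.trim() === testCase.expected.trim();
  return { score: passed ? 1 : 0, passed, reason: passed ? "exact match" : "mismatch" };
};

// An inline dataset; file or JSONL loaders would produce the same shape.
const dataset: EvalCase[] = [
  { input: "2 + 2", expected: "4" },
  { input: "capital of France", expected: "Paris" },
];

// A stand-in "model" so the sketch runs; a real target would call an LLM.
const fakeModel = (input: string): string => (input === "2 + 2" ? "4" : "unknown");

// Score every case; the real evaluate() adds concurrency, events, and persistence.
const results = dataset.map((c) => exactMatch(fakeModel(c.input), c));
```

Keeping the scorer separate from the dataset makes both reusable: the same scorer can run against any dataset whose cases carry the fields it needs.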
### Run Persistence

Store evaluation results in SQLite for historical analysis.

### Model Comparison

Compare two runs case-by-case to detect improvements and regressions.

### Reporters

Multiple output formats for different use cases.

## Architecture
The framework is organized into subpath exports for granular imports:

| Import | Description |
|---|---|
| `@deepagents/evals` | Top-level `evaluate()` function |
| `@deepagents/evals/dataset` | Dataset loading and transforms |
| `@deepagents/evals/scorers` | Scorer functions and combinators |
| `@deepagents/evals/store` | SQLite run persistence |
| `@deepagents/evals/engine` | Eval engine with concurrency and events |
| `@deepagents/evals/comparison` | Run diffing and regression detection |
| `@deepagents/evals/reporters` | Console, JSON, CSV, HTML, Markdown reporters |
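To make the comparison and reporting ideas concrete, here is a self-contained sketch of case-by-case run diffing with a minimal console summary. It does not use the actual `@deepagents/evals/comparison` or `@deepagents/evals/reporters` APIs (which are not shown on this page); the types and functions are illustrative stand-ins, and the real store persists runs to SQLite rather than keeping them in memory.

```typescript
// Illustrative stand-ins -- not the @deepagents/evals API.

interface CaseResult {
  caseId: string;
  score: number; // 0..1, as produced by a scorer
  passed: boolean;
}

interface Run {
  id: string;
  results: CaseResult[];
}

type Verdict = "improved" | "regressed" | "unchanged";

// Diff two runs case-by-case, pairing results by caseId.
function compareRuns(baseline: Run, candidate: Run): Map<string, Verdict> {
  const verdicts = new Map<string, Verdict>();
  const byId = new Map(candidate.results.map((r) => [r.caseId, r]));
  for (const base of baseline.results) {
    const cand = byId.get(base.caseId);
    if (!cand) continue; // case missing from the candidate run
    if (cand.score > base.score) verdicts.set(base.caseId, "improved");
    else if (cand.score < base.score) verdicts.set(base.caseId, "regressed");
    else verdicts.set(base.caseId, "unchanged");
  }
  return verdicts;
}

// Minimal console-style "reporter": summarize verdict counts on one line.
function report(verdicts: Map<string, Verdict>): string {
  const counts = { improved: 0, regressed: 0, unchanged: 0 };
  for (const v of verdicts.values()) counts[v]++;
  return `improved=${counts.improved} regressed=${counts.regressed} unchanged=${counts.unchanged}`;
}

const baseline: Run = {
  id: "run-1",
  results: [
    { caseId: "a", score: 1, passed: true },
    { caseId: "b", score: 0, passed: false },
    { caseId: "c", score: 1, passed: true },
  ],
};
const candidate: Run = {
  id: "run-2",
  results: [
    { caseId: "a", score: 1, passed: true },
    { caseId: "b", score: 1, passed: true }, // fixed since baseline
    { caseId: "c", score: 0, passed: false }, // regression
  ],
};

console.log(report(compareRuns(baseline, candidate)));
// → improved=1 regressed=1 unchanged=1
```

In a CI pipeline, a nonzero `regressed` count is a natural signal for blocking a release.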
## Next Steps
- Installation — Install the package and get started
- Quickstart — Run your first evaluation in 5 minutes
- API Reference — Explore the full API documentation
- Datasets — Learn about dataset loading and transforms