Testing Agents
Hive provides a goal-based testing framework that generates tests from your agent’s success criteria and constraints.
Testing Framework Overview
Tests in Hive are:
- Goal-Driven - Generated from success criteria and constraints
- LLM-Evaluated - Use LLM judges for complex assertions
- Approval-Required - All generated tests require human approval
- Pytest-Compatible - Standard pytest format for execution
Test Types
Three types of tests validate different aspects:
Constraint Tests
Validate that constraints are respected:
Success Criteria Tests
Validate achievement of success criteria:
Edge Case Tests
Validate handling of unusual inputs:
Test Generation
Generate tests from your goal definition:
Generate Constraint Tests
Generate Success Criteria Tests
MCP Tool Usage
When using the MCP server, tests are generated via tools:
Test Approval
All generated tests require approval:
Approval Status
- PENDING - Awaiting user review
- APPROVED - Accepted as-is
- MODIFIED - User edited before accepting
- REJECTED - Declined with reason
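The four statuses above can be modeled as a small enum. The sketch below is illustrative only: the `GeneratedTest` record, the enum's string values, and the rule that only APPROVED and MODIFIED tests are runnable are assumptions made for this example, not Hive's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ApprovalStatus(Enum):
    # Member names come from the status list; the string values are assumed.
    PENDING = "pending"
    APPROVED = "approved"
    MODIFIED = "modified"
    REJECTED = "rejected"


@dataclass
class GeneratedTest:
    """Hypothetical record for one generated test (not a Hive class)."""
    name: str
    status: ApprovalStatus = ApprovalStatus.PENDING
    rejection_reason: Optional[str] = None

    def reject(self, reason: str) -> None:
        # A rejection is "declined with reason", so an empty reason is refused.
        if not reason.strip():
            raise ValueError("a rejection needs a reason")
        self.status = ApprovalStatus.REJECTED
        self.rejection_reason = reason

    @property
    def runnable(self) -> bool:
        # Assumption: only human-accepted tests should ever be executed.
        return self.status in (ApprovalStatus.APPROVED, ApprovalStatus.MODIFIED)
```

Defaulting to PENDING means a freshly generated test cannot run until a reviewer has acted on it.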
Test Structure
Tests follow pytest conventions:
Running Tests
List Tests
Run Tests
Debug Failed Tests
LLM Judge
The LLMJudge evaluates complex outputs:
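The real LLMJudge interface is not reproduced here. To illustrate the general idea of judging an output against criteria, the following is a hypothetical stand-in: `SketchJudge`, its `llm` callable, and the PASS/FAIL reply convention are all assumptions made for this sketch, not Hive's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchJudge:
    """Hypothetical judge; NOT Hive's LLMJudge."""
    llm: Callable[[str], str]  # takes a prompt, returns the model's reply

    def build_prompt(self, output: str, criteria: str) -> str:
        return (
            "Evaluate the OUTPUT against every criterion.\n"
            f"CRITERIA:\n{criteria}\n"
            f"OUTPUT:\n{output}\n"
            "Answer PASS or FAIL on the first line, then explain."
        )

    def evaluate(self, output: str, criteria: str) -> bool:
        reply = self.llm(self.build_prompt(output, criteria)).strip()
        # Fail closed: an empty or unexpected reply counts as a failure.
        first_line = reply.splitlines()[0].strip().upper() if reply else ""
        return first_line.startswith("PASS")


# Usage with a canned reply in place of a real model call.
judge = SketchJudge(llm=lambda prompt: "PASS\nAll criteria are met.")
assert judge.evaluate("Report text...", "1. Cites at least one source.")
```

Injecting the model as a plain callable keeps the judge testable without network access.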
Judge Response
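The exact shape of a judge response is not specified here. As an illustrative assumption, suppose the judge replies with a JSON object containing a boolean `passed` and a free-text `explanation`; both field names and the `JudgeVerdict` type are hypothetical. A parser for such a reply might look like this:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical verdict type; field names are assumptions."""
    passed: bool
    explanation: str


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse a judge reply, treating anything malformed as a failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return JudgeVerdict(False, f"unparseable judge reply: {raw[:80]!r}")
    if not isinstance(data, dict) or not isinstance(data.get("passed"), bool):
        return JudgeVerdict(False, "judge reply is missing a boolean 'passed' field")
    return JudgeVerdict(data["passed"], str(data.get("explanation", "")))
```

Failing closed on malformed replies prevents a confused judge from silently passing a test.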
Test Storage
Tests are stored alongside your agent:
Test Results
Test runs are logged to test_results/ with timestamps.
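As a rough sketch of timestamped result logging, the helper below writes one JSON file per run. The `save_results` function, the `run_<timestamp>.json` naming scheme, and the result fields are assumptions for illustration; Hive's actual file layout may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(results: dict, root: Path = Path("test_results")) -> Path:
    """Write one run's results to a timestamped JSON file (hypothetical layout)."""
    root.mkdir(parents=True, exist_ok=True)
    # UTC timestamps sort lexicographically and avoid timezone ambiguity.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = root / f"run_{stamp}.json"
    path.write_text(json.dumps(results, indent=2))
    return path
```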
Real-World Example
Complete test for a research agent:
Test Best Practices
Test One Thing Per Test
Each test should validate a single constraint or success criterion. This makes failures easier to diagnose.
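A minimal illustration of this practice, using a hypothetical `summarize` function as a stand-in for an agent call: the length constraint and the non-empty criterion each get their own test, so a failure points at exactly one requirement.

```python
def summarize(text: str, max_words: int = 50) -> str:
    """Hypothetical stand-in for the agent under test."""
    return " ".join(text.split()[:max_words])


def test_summary_respects_word_limit():
    # Checks only the length constraint.
    summary = summarize("word " * 200, max_words=50)
    assert len(summary.split()) <= 50


def test_summary_is_not_empty():
    # Checks only that some output is produced.
    assert summarize("Hive tests agents against goals.").strip()
```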
Use LLM Judges for Complex Assertions
For nuanced evaluations (citation quality, coherence, completeness), use LLMJudge instead of brittle string matching.
Provide Clear Criteria to Judges
Write explicit, numbered criteria for LLM judges. Vague criteria lead to inconsistent evaluations.
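To make the contrast concrete, the sketch below sets a vague criterion beside an explicit, numbered set. The criteria text and the `format_criteria` helper are illustrative assumptions, not part of Hive.

```python
from typing import Sequence

# Vague: two judges (or two runs) can reasonably disagree on what "good" means.
VAGUE_CRITERIA = ["The report should be good."]

# Explicit: each item can be checked on its own.
EXPLICIT_CRITERIA = [
    "Every factual claim is followed by a citation.",
    "The report cites at least three distinct sources.",
    "The conclusion answers the question asked in the input.",
]


def format_criteria(criteria: Sequence[str]) -> str:
    """Render criteria as a numbered list, one per line."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    return "\n".join(
        f"{i}. {text.strip()}" for i, text in enumerate(criteria, start=1)
    )


print(format_criteria(EXPLICIT_CRITERIA))
```

Numbering also lets a judge's explanation refer to a specific criterion when it reports a failure.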
Test with Realistic Inputs
Use representative inputs that match production use cases. Toy inputs may miss real-world failure modes.
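One way to apply this is to parametrize a test over inputs shaped like real traffic, including messy ones. In the sketch below, the sample queries and the `normalize_query` function are hypothetical stand-ins for production inputs and the agent under test.

```python
import pytest

# Inputs written the way users actually type: mixed case, stray whitespace,
# embedded newlines, and run-on phrasing.
REALISTIC_QUERIES = [
    "compare two databases for time series data, about 2 TB per year",
    "  Summarise:   recent work on evaluating LLM agents??  ",
    "find sources on battery recycling\nand list the main open problems",
]


def normalize_query(raw: str) -> str:
    """Hypothetical preprocessing step standing in for the agent under test."""
    return " ".join(raw.split())


@pytest.mark.parametrize("raw", REALISTIC_QUERIES)
def test_normalized_query_is_clean(raw):
    cleaned = normalize_query(raw)
    assert cleaned  # never empty for a non-empty query
    assert cleaned == cleaned.strip()
    assert "  " not in cleaned and "\n" not in cleaned
```

A toy input such as "hello" would pass all three assertions without exercising the whitespace handling at all.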
Approve All Generated Tests
Review every generated test before approving it. LLMs can misinterpret constraints, so human oversight is critical.
Continuous Testing
Integrate tests into your development workflow:
Next Steps
- Goal Definition - Define testable success criteria
- Deployment - Deploy tested agents to production