Why Use Datasets
Fine-Tuning
Create training datasets from your best production requests to fine-tune custom models
Model Evaluation
Build consistent test sets to evaluate model performance and compare versions
Quality Control
Curate high-quality examples to improve prompt engineering and model outputs
Data Analysis
Export structured data for external analysis, research, and compliance
Quick Start
Filter production requests
Use custom properties, scores, or feedback ratings to find your best examples
Creating Datasets
From the Dashboard
The easiest way to build datasets is through the Helicone UI:
- Navigate to helicone.ai/requests
- Apply filters to find high-quality examples:
  - Custom properties: Tag production traffic (e.g., feature: "customer-support")
  - Scores: Filter by evaluation metrics (e.g., accuracy > 90)
  - Feedback: Select highly-rated responses (e.g., feedback: true)
  - User: Focus on specific users or use cases
- Select requests using checkboxes
- Click “Add to Dataset” and choose or create a dataset
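The filter criteria above can be sketched as a simple predicate. This is an illustrative sketch only: the field names (`properties`, `scores`, `feedback`) are assumptions for the example, not Helicone's exact request schema.

```python
def is_best_example(request: dict) -> bool:
    """Illustrative filter: keep customer-support requests with a high
    accuracy score and positive user feedback. Field names are assumed."""
    props = request.get("properties", {})
    scores = request.get("scores", {})
    return (
        props.get("feature") == "customer-support"
        and scores.get("accuracy", 0) > 90
        and request.get("feedback") is True
    )

candidates = [
    {"properties": {"feature": "customer-support"},
     "scores": {"accuracy": 95}, "feedback": True},
    {"properties": {"feature": "billing"},
     "scores": {"accuracy": 99}, "feedback": True},
]
selected = [r for r in candidates if is_best_example(r)]  # keeps only the first
```

In the dashboard these same criteria are applied with filter controls rather than code; the predicate just makes the selection logic explicit.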
Via API
Create and manage datasets programmatically for automated workflows:
- Create Dataset
- Add Requests
- Query Dataset
- Update Request
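As a rough sketch of the create and mutate calls, using the endpoints listed under API Reference below. The auth header and the body field names (`datasetName`, `addRequests`, `removeRequests`) are assumptions for illustration; check the API reference for the exact schema.

```python
import json
import urllib.request

BASE = "https://api.helicone.ai"

def build_request(path: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to a Helicone dataset endpoint.
    Header and body shapes are assumptions; verify against the API reference."""
    return urllib.request.Request(
        url=BASE + path,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Create a dataset, then add request IDs to it (field names are illustrative).
create = build_request("/v1/helicone-dataset",
                       {"datasetName": "support-finetune-v1"}, "YOUR_API_KEY")
mutate = build_request("/v1/helicone-dataset/abc123/mutate",
                       {"addRequests": ["req-1", "req-2"], "removeRequests": []},
                       "YOUR_API_KEY")
# urllib.request.urlopen(create) would actually send the call; omitted here.
```

Building the request objects separately from sending them keeps the sketch runnable offline and makes the payload shape easy to inspect.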
Rate Limits
Curating Quality Datasets
The Curation Process
Raw production logs contain noise; curation transforms them into valuable training data.
Start broad, then narrow
Add many potential examples initially. It’s easier to remove poor examples than to find good ones later.
Review each example
- Accuracy: Is the response correct and helpful?
- Consistency: Does it match the style and format you want?
- Completeness: Does it fully address the user’s request?
- Relevance: Is this the behavior you want to reinforce?
Remove poor examples
Delete requests that contain:
- Incorrect or misleading responses
- Off-topic or irrelevant content
- Inconsistent formatting or style
- Edge cases that might confuse the model
- Sensitive or inappropriate content
Quality beats quantity: 50-100 carefully curated examples often outperform thousands of uncurated ones. Focus on consistency and correctness over volume.
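Parts of the review pass above can be automated before the manual read-through. A minimal sketch, assuming hypothetical `prompt`/`response`/`flagged` fields on each example: drop empty or flagged responses and duplicate prompts, leaving fewer examples to review by hand.

```python
def curate(examples: list[dict]) -> list[dict]:
    """Keep one copy of each prompt; drop empty or flagged responses.
    Field names ('prompt', 'response', 'flagged') are illustrative."""
    seen = set()
    kept = []
    for ex in examples:
        prompt = ex.get("prompt", "").strip()
        response = ex.get("response", "").strip()
        if not prompt or not response or ex.get("flagged"):
            continue  # incomplete or explicitly rejected example
        if prompt in seen:
            continue  # duplicate prompts add no new signal
        seen.add(prompt)
        kept.append(ex)
    return kept

raw = [
    {"prompt": "How do I reset my password?", "response": "Go to Settings..."},
    {"prompt": "How do I reset my password?", "response": "Click forgot..."},
    {"prompt": "Off topic", "response": "Sure!", "flagged": True},
]
curated = curate(raw)  # keeps only the first example
```

Automated passes like this handle the mechanical cuts; judging accuracy, consistency, and relevance still requires a human read of each surviving example.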
Dataset Dashboard
Manage all your datasets at helicone.ai/datasets:
- Track progress: Monitor dataset size and last updated time
- Access datasets: Click to view and curate contents
- Export data: Download datasets when ready for fine-tuning
- Delete datasets: Remove datasets you no longer need
Exporting Data
Export Formats
Download your datasets in formats optimized for different use cases:
- Fine-Tuning (JSONL)
- Analysis (CSV)
The JSONL export matches the OpenAI fine-tuning format, ready to use directly with:
- OpenAI’s fine-tuning API
- Anthropic Claude fine-tuning
- Custom training pipelines
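OpenAI's chat fine-tuning format is one JSON object per line, each with a `messages` array of system/user/assistant turns. A sketch of converting curated rows into that shape (the `prompt`/`response` field names and the system prompt are illustrative):

```python
import json

def to_finetune_jsonl(rows: list[dict]) -> str:
    """Convert curated rows into OpenAI chat fine-tuning JSONL:
    one {"messages": [...]} object per line."""
    lines = []
    for row in rows:  # 'prompt'/'response' field names are illustrative
        record = {"messages": [
            {"role": "system", "content": "You are a helpful support agent."},
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["response"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_finetune_jsonl([
    {"prompt": "Where is my order?", "response": "You can track it under Orders."},
])
```

The Helicone JSONL export already produces this structure; a converter like this is only needed if you post-process rows (for example, to inject a consistent system prompt) before uploading.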
Programmatic Export
Retrieve dataset contents via API.
Use Cases
Replace Expensive Models with Fine-Tuned Alternatives
The most common use case: train cheaper models on expensive model outputs.
Log premium model outputs
Start logging successful requests from GPT-4, Claude Sonnet, or other expensive models
Build task-specific datasets
Create separate datasets for different tasks:
- Customer support responses
- Code generation
- Data extraction
- Content summarization
Curate for consistency
Review examples to ensure responses follow the same format, style, and quality standards
Fine-tune smaller models
Export JSONL and fine-tune models that are 10-50x cheaper:
- GPT-4o-mini (10x cheaper than GPT-4o)
- Gemini 2.5 Flash (50x cheaper than Gemini Pro)
- Claude Haiku (30x cheaper than Claude Sonnet)
A fine-tuned GPT-4o-mini can often match or exceed GPT-4o performance on specific tasks while costing 90% less. Start with 50-100 examples and iterate.
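The savings math is straightforward. The per-token prices below are hypothetical placeholders, not current provider rates; substitute real pricing for your models.

```python
# Hypothetical per-1M-token prices -- substitute current provider pricing.
premium_price = 5.00    # frontier model, $ per 1M tokens
finetuned_price = 0.50  # fine-tuned small model, 10x cheaper

monthly_tokens = 200_000_000  # 200M tokens/month of production traffic

premium_cost = premium_price * monthly_tokens / 1_000_000    # $1,000/month
finetuned_cost = finetuned_price * monthly_tokens / 1_000_000  # $100/month
savings = premium_cost - finetuned_cost                       # $900/month
```

At any realistic volume, a one-time fine-tuning cost amortizes quickly against per-token savings of this size.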
Task-Specific Evaluation Sets
Build test datasets to evaluate model performance consistently.
Continuous Improvement Pipeline
- Tag production requests with custom properties for filtering
- Score outputs based on automated metrics or user feedback
- Filter high-quality examples using scores and feedback
- Auto-add to datasets when examples meet quality thresholds
- Regular retraining with newly curated examples every week/month
- A/B test new models against production traffic before full rollout
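The "auto-add when examples meet quality thresholds" step reduces to a threshold filter over scored requests. A minimal sketch, assuming illustrative `id` and `score` fields (in practice the score would come from Helicone scores or user feedback):

```python
QUALITY_THRESHOLD = 0.9

def select_for_dataset(requests: list[dict],
                       threshold: float = QUALITY_THRESHOLD) -> list[str]:
    """Return IDs of requests whose score meets the quality threshold.
    Field names ('id', 'score') are illustrative."""
    return [r["id"] for r in requests if r.get("score", 0.0) >= threshold]

batch = [
    {"id": "req-1", "score": 0.95},
    {"id": "req-2", "score": 0.70},
    {"id": "req-3", "score": 0.92},
]
to_add = select_for_dataset(batch)  # ["req-1", "req-3"]
```

The selected IDs would then be passed to the dataset mutate endpoint on a schedule, closing the tag, score, filter, retrain loop.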
Research and Compliance
Export datasets for research, auditing, or compliance.
Best Practices
Quality over Quantity
Choose fewer, high-quality examples rather than large datasets with mixed quality
Task-Specific Datasets
Create separate datasets for different use cases rather than one general dataset
Regular Updates
Continuously add new examples as your application evolves and improves
Clear Criteria
Document what makes a “good” example for each dataset’s specific purpose
Version Control
Create new dataset versions when making significant changes to examples
Diverse Examples
Include varied inputs, edge cases, and different user types in your datasets
API Reference
Key Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/helicone-dataset | POST | Create new dataset with requests |
| /v1/helicone-dataset/query | POST | List all datasets |
| /v1/helicone-dataset/{id}/query | POST | Get dataset rows |
| /v1/helicone-dataset/{id}/mutate | POST | Add/remove requests |
| /v1/helicone-dataset/{id}/request/{requestId} | POST | Update request data |
| /v1/helicone-dataset/{id}/delete | POST | Delete dataset |
Related Features
Scores
Track evaluation metrics to identify best examples for datasets
Feedback
Use user ratings to find high-quality examples automatically
Custom Properties
Tag requests to make dataset creation easier with filtering
Sessions
Include full conversation context in your datasets
Datasets turn your production LLM logs into valuable training and evaluation resources. Start small with a focused use case, then expand as you see the benefits of curated, high-quality data.
