
Introduction

Data labeling is critical for supervised learning. This guide covers deploying Argilla for human annotation and generating synthetic datasets with LLMs.

Why Data Labeling Matters

Quality

High-quality labels directly improve model performance

Consistency

Clear guidelines ensure inter-annotator agreement

Efficiency

Proper tools accelerate the labeling process

Cost

Plan labeling budget based on dataset size and complexity

Argilla

Argilla is an open-source platform for data labeling and feedback collection.

Key Features

  • Modern UI: Intuitive interface for annotators
  • Flexible: Text, token, ranking, and custom tasks
  • Python SDK: Programmatic dataset creation
  • Collaboration: Multi-user support with workspaces
  • Feedback: Collect model predictions for RLHF
  • Integration: Works with HuggingFace, OpenAI

Quick Start with Docker

docker run -it --rm --name argilla -p 6900:6900 \
  argilla/argilla-quickstart:v2.0.0rc1
Access the UI at http://localhost:6900. The default credentials are listed in the quickstart image's Dockerfile.

Alternative Deployments

Deploy on K8s for production:
kubectl apply -f https://raw.githubusercontent.com/argilla-io/argilla/develop/examples/deployments/k8s/argilla.yaml
Full Kubernetes deployment examples are available in the Argilla repository.

Creating Labeling Datasets

Simple Text-to-SQL Dataset

labeling/create_dataset.py
from datasets import load_dataset
import argilla as rg

client = rg.Argilla(
    api_url="http://0.0.0.0:6900", 
    api_key="argilla.apikey"
)
WORKSPACE_NAME = "admin"

def create_text2sql_dataset():
    # Define guidelines
    guidelines = """
    Please examine the given SQL question and context. 
    Write the correct SQL query that accurately answers 
    the question based on the context provided. 
    Ensure the query follows SQL syntax and logic correctly.
    """
    
    # Create dataset settings
    settings = rg.Settings(
        guidelines=guidelines,
        fields=[
            rg.TextField(
                name="query",
                title="Query",
                use_markdown=False,
            ),
            rg.TextField(
                name="schema",
                title="Schema",
                use_markdown=True,
            ),
        ],
        questions=[
            rg.TextQuestion(
                name="sql",
                title="Please write SQL for this query",
                description="Please write SQL for this query",
                required=True,
                use_markdown=True,
            )
        ],
    )
    
    # Create dataset
    dataset = rg.Dataset(
        name="text2sql-123",
        settings=settings,
        workspace=WORKSPACE_NAME,
        client=client,
    )
    dataset.create()
    
    # Load the source data and log it as Argilla records
    data = load_dataset("b-mc2/sql-create-context")
    records = [
        rg.Record(
            fields={
                "query": row["question"],
                "schema": row["context"],
            },
        )
        for row in data["train"]
    ]
    dataset.records.log(records, batch_size=1000)
Run:
uv run ./labeling/create_dataset.py

Synthetic Data Generation

Use LLMs to generate training data programmatically.

Extract Database Schema

labeling/create_dataset_synthetic.py
import sqlite3

def get_sqllite_schema(db_name: str) -> str:
    with sqlite3.connect(db_name) as conn:
        cursor = conn.cursor()
        
        # sqlite_master.sql already stores the full CREATE TABLE statement,
        # so we only need to append a terminating semicolon
        cursor.execute(
            "SELECT sql || ';' FROM sqlite_master "
            "WHERE type='table' AND sql IS NOT NULL;"
        )
        db_schema_records = cursor.fetchall()
        
        db_schema = [x[0] for x in db_schema_records]
        db_schema = "\n".join(db_schema)
    
    return db_schema
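To sanity-check the extractor before pointing it at a real database, you can build a throwaway SQLite file and read its schema back. The snippet below is a standalone sketch: `dump_schema` is a compact variant of `get_sqllite_schema` above, and it uses a temporary file because two separate connections to `:memory:` see different databases.

```python
import os
import sqlite3
import tempfile

def dump_schema(db_name: str) -> str:
    # Same idea as get_sqllite_schema: sqlite_master.sql holds
    # each table's full CREATE TABLE statement verbatim
    with sqlite3.connect(db_name) as conn:
        rows = conn.execute(
            "SELECT sql || ';' FROM sqlite_master "
            "WHERE type='table' AND sql IS NOT NULL;"
        ).fetchall()
    return "\n".join(row[0] for row in rows)

# Build a throwaway database file with one table
fd, db_path = tempfile.mkstemp(suffix=".db")
os.close(fd)
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT)")

schema = dump_schema(db_path)
print(schema)  # CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
os.remove(db_path)
```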

Generate Synthetic Examples

labeling/create_dataset_synthetic.py
import json
from typing import Dict

from openai import OpenAI
from retry import retry

@retry(tries=3, delay=1)
def generate_synthetic_example(db_schema: str) -> Dict[str, str]:
    client = OpenAI()
    
    prompt = f"""
    Corresponding database schema: {db_schema}
    
    Please generate an example of what a user might ask
    of this database: in plain text and in SQL.
    Return only JSON in the format {{"user_text": "...", "sql": "..."}}
    """
    
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are SQLite and SQL expert.",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
        model="gpt-4o",
        response_format={"type": "json_object"},
        temperature=1,
    )
    
    sample = json.loads(chat_completion.choices[0].message.content)
    assert "user_text" in sample
    assert "sql" in sample
    return sample
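The asserts above only check the JSON shape. A cheap extra guard (a sketch, not part of the original script) is to ask SQLite itself to compile the generated query against the target database, which catches both syntax errors and references to missing tables or columns before a sample ever reaches annotators:

```python
import sqlite3

def sql_compiles(db_path: str, sql: str) -> bool:
    """Return True if SQLite can compile the query against this database."""
    with sqlite3.connect(db_path) as conn:
        try:
            # EXPLAIN compiles the statement without executing it
            conn.execute(f"EXPLAIN {sql}")
            return True
        except sqlite3.Error:
            return False
```

Samples that fail this check can simply be regenerated by another `@retry`-style loop.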

Create Synthetic Dataset

labeling/create_dataset_synthetic.py
from tqdm import tqdm
import argilla as rg

def create_text2sql_dataset_synthetic(num_samples: int = 10):
    db_schema = get_sqllite_schema("examples/chinook.db")
    
    # Generate samples
    samples = []
    for _ in tqdm(range(num_samples)):
        sample = generate_synthetic_example(db_schema=db_schema)
        samples.append(sample)
    
    # Create guidelines with schema
    guidelines = f"""
    Please examine the given SQL question and context. 
    Write the correct SQL query that accurately answers 
    the question based on the context provided.
    
    DB schema:\n\n{db_schema}\n\n
    
    To verify the query:
    - Download: https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip
    - Install SQLite
    - Run: sqlite3 chinook.db
    """
    
    # Create dataset
    settings = rg.Settings(
        guidelines=guidelines,
        fields=[
            rg.TextField(name="schema", title="Schema", use_markdown=True),
            rg.TextField(name="sync_query", title="Query", use_markdown=False),
            rg.TextField(name="sync_sql", title="SQL", use_markdown=True),
        ],
        questions=[
            rg.BooleanQuestion(
                name="valid",
                title="Is this SQL query correct?",
                description="Validate the SQL query",
                required=True,
            )
        ],
    )
    
    dataset = rg.Dataset(
        name="text2sql-chinook-synthetic-123",
        workspace="admin",
        settings=settings,
        client=client,
    )
    dataset.create()
    
    # Add records
    records = [
        rg.Record(
            fields={
                "sync_sql": sample["sql"],
                "sync_query": sample["user_text"],
                "schema": db_schema,
            }
        )
        for sample in samples
    ]
    dataset.records.log(records, batch_size=1000)
Run:
uv run ./labeling/create_dataset_synthetic.py

Labeling Guidelines

Good guidelines are essential for consistent annotations.

Guidelines Template

# [Task Name] Labeling Guidelines

## Objective
[Clear description of what annotators should accomplish]

## Task Definition
[Detailed explanation of the task]

## Label Definitions
### Label 1
- **Description**: ...
- **Example**: ...
- **Non-example**: ...

### Label 2
- **Description**: ...
- **Example**: ...
- **Non-example**: ...

## Decision Tree
1. First, check if...
2. Then, determine if...
3. Finally, assign...

## Edge Cases
- **Case 1**: How to handle...
- **Case 2**: What to do when...

## Quality Checks
- [ ] Label makes sense given context
- [ ] Followed decision tree
- [ ] Checked edge cases

## Examples
### Example 1
**Input**: ...
**Correct Label**: ...
**Rationale**: ...

### Example 2
**Input**: ...
**Correct Label**: ...
**Rationale**: ...

Best Practices

  • Use simple, unambiguous language
  • Provide concrete examples
  • Include visual aids when helpful
  • Define domain-specific terms
  • Cover all edge cases
  • Provide decision flowcharts
  • Include non-examples
  • Address ambiguous cases
  • Start with pilot labeling (50 samples)
  • Measure inter-annotator agreement
  • Update guidelines based on confusion
  • Re-label if agreement < 80%
  • Use gold-standard test sets
  • Calculate Cohen’s kappa
  • Review disagreements
  • Provide ongoing feedback
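Cohen's kappa from the checklist above needs no dependencies; the sketch below (toy label lists, pure Python) scores agreement between two annotators, corrected for the agreement you would expect by chance:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[lab] / n) * (counts_b[lab] / n)
        for lab in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)

# Two annotators over the same four samples (toy data)
kappa = cohens_kappa(["x", "x", "y", "y"], ["x", "x", "y", "x"])
print(f"kappa = {kappa:.2f}")  # kappa = 0.50
```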

Cost Estimation

Pilot Study Process

1. Label 50 samples

Time your labeling process:
import time

start = time.time()
# Label 50 samples
elapsed = time.time() - start

time_per_sample = elapsed / 50
print(f"Average: {time_per_sample:.2f}s per sample")
2. Calculate total time

total_samples = 10000
time_per_sample = 30  # seconds

total_hours = (total_samples * time_per_sample) / 3600
print(f"Total: {total_hours:.1f} hours")
3. Estimate cost

hourly_rate = 15  # USD
total_cost = total_hours * hourly_rate

# Add 20% for quality control
total_cost *= 1.2

print(f"Estimated cost: ${total_cost:,.2f}")

Typical Ranges

| Task Type | Time/Sample | Cost/1000 Samples |
| --- | --- | --- |
| Binary classification | 5-15s | $20-100 |
| Multi-class | 15-30s | $60-200 |
| Named entity recognition | 30-60s | $150-400 |
| Semantic segmentation | 2-5 min | $500-2000 |
| Question answering | 1-3 min | $250-1000 |

Data Validation

Ensure label quality with automated checks.

Using Cleanlab

import cleanlab
from cleanlab.classification import CleanLearning

# Train with noisy labels
cl = CleanLearning(clf=YourClassifier())
cl.fit(X_train, noisy_labels)

# Find label issues
issues = cl.get_label_issues()
print(f"Found {len(issues)} potential label errors")

# Predict labels with the noise-robust model
cleaned_labels = cl.predict(X_train)

Using Deepchecks

from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

# Create dataset
ds = Dataset(df, label='target', cat_features=['cat1', 'cat2'])

# Run integrity checks
suite = data_integrity()
result = suite.run(ds)

# View results
result.show()

Production Labeling Workflow

Active Learning

Prioritize labeling of informative samples:
from modAL.uncertainty import uncertainty_sampling
from modAL.models import ActiveLearner

# Initialize learner
learner = ActiveLearner(
    estimator=classifier,
    query_strategy=uncertainty_sampling,
    X_training=X_initial,
    y_training=y_initial
)

# Query most uncertain samples
query_idx, query_inst = learner.query(X_pool, n_instances=100)

# Label and teach
y_new = get_labels(query_inst)
learner.teach(query_inst, y_new)
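If you want the query strategy without a framework, least-confidence sampling reduces to ranking pool predictions by one minus the top class probability. A minimal pure-Python sketch (toy probabilities, not modAL's implementation):

```python
def least_confidence_query(probabilities, n_instances):
    """Indices of the n_instances samples the model is least sure about."""
    scores = [1.0 - max(p) for p in probabilities]  # higher = more uncertain
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:n_instances]

# Toy predicted class probabilities for a 3-sample pool
pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(least_confidence_query(pool_probs, 2))  # [1, 2]
```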

Alternative Tools

Label Studio

docker run -p 8080:8080 heartexlabs/label-studio
Features:
  • Rich media support
  • ML-assisted labeling
  • Export to many formats
