
Model Cards

Model cards provide standardized documentation for machine learning models, improving transparency, accountability, and reproducibility. They describe model performance, limitations, intended use cases, and ethical considerations.

What Are Model Cards?

Model cards are short documents that accompany trained ML models, providing essential information about:
  • **Model Details:** Architecture, training data, and hyperparameters
  • **Intended Use:** Primary use cases and out-of-scope applications
  • **Performance:** Metrics across different datasets and demographics
  • **Limitations:** Known issues, biases, and failure modes

Why Use Model Cards?

  • **Transparency:** Provide stakeholders with clear information about model capabilities and limitations. Essential for building trust in ML systems.
  • **Accountability:** Document training decisions, data sources, and evaluation methods. Enables audit trails for regulated industries.
  • **Reproducibility:** Record exact configurations, data versions, and training procedures. Allows others to reproduce or build upon your work.
  • **Risk assessment:** Identify potential harms, biases, and edge cases. Helps teams make informed decisions about model deployment.

Automatic Generation

HuggingFace Trainer automatically generates model cards:
```python
from transformers import Trainer, TrainingArguments

# Configure training
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=5,
    # ... other args
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train model
trainer.train()

# Generate model card
kwargs = {
    "finetuned_from": "google/mobilebert-uncased",
    "tasks": "text-classification",
    "language": "en",
    "dataset_tags": "sst2",
    "dataset": "SST-2 (Stanford Sentiment Treebank)",
}

trainer.create_model_card(**kwargs)
# Saves to: results/README.md
```
This generates a structured README with:
  • Model description and architecture
  • Training procedure and hyperparameters
  • Evaluation results
  • Framework versions
  • Carbon emissions estimate
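The generated README opens with a YAML metadata block that the Hugging Face Hub indexes for search and filtering. As a minimal sketch of reading that block back (flat `key: value` pairs only; nested keys like `tags:` lists would need a real YAML parser — the inlined card text below is illustrative, not actual Trainer output):

```python
def parse_front_matter(readme_text: str) -> dict:
    """Extract top-level key: value pairs from the leading '---' YAML block."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip list items and bare keys
            meta[key.strip()] = value.strip()
    return meta

# Illustrative card text (not real Trainer output)
card = """---
language: en
license: apache-2.0
---

# mobilebert-sst2
"""

print(parse_front_matter(card))  # {'language': 'en', 'license': 'apache-2.0'}
```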

Model Card Structure

A comprehensive model card includes:

1. Model Details

```markdown
## Model Details

- **Model type:** BERT-based sequence classification
- **Base model:** google/mobilebert-uncased
- **Training data:** SST-2 (Stanford Sentiment Treebank)
- **Language:** English
- **License:** Apache 2.0
- **Developed by:** Your Team/Organization
- **Date:** 2024-03-09
```

2. Intended Use

```markdown
## Intended Use

### Primary Use Cases
- Binary sentiment classification of English text
- Movie review analysis
- Social media sentiment monitoring

### Out-of-Scope Use Cases
- Multi-label classification
- Non-English text
- Clinical or medical decision making
- High-stakes decisions without human oversight
```

3. Training Details

```markdown
## Training Details

### Training Data
- Dataset: SST-2 from GLUE benchmark
- Size: 67,349 training examples
- Split: 80/20 train/validation
- Preprocessing: Standard BERT tokenization

### Hyperparameters
- Learning rate: 5e-5
- Batch size: 32
- Epochs: 5
- Optimizer: AdamW
- Max sequence length: 128
```
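These hyperparameters map directly onto `transformers.TrainingArguments` fields. A sketch of that mapping (the keys are real `TrainingArguments` parameter names; the object construction itself is left commented out so the snippet carries no heavy dependency):

```python
# Card hyperparameters expressed as TrainingArguments keyword arguments
hyperparameters = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 32,
    "num_train_epochs": 5,
    "optim": "adamw_torch",  # AdamW optimizer
}

# Max sequence length is applied at tokenization time, not via TrainingArguments
max_seq_length = 128

# from transformers import TrainingArguments
# training_args = TrainingArguments(output_dir="results", **hyperparameters)
```

Recording the exact dict used for training and the dict printed in the card from the same source avoids the two silently drifting apart.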

4. Evaluation Results

```markdown
## Evaluation Results

| Metric    | Value |
|-----------|-------|
| F1 Score  | 0.912 |
| Accuracy  | 0.908 |
| Precision | 0.915 |
| Recall    | 0.909 |

### Performance by Category
- Positive sentiment: F1 = 0.925
- Negative sentiment: F1 = 0.899
```
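Aggregate and per-class numbers like these are easy to recompute from saved predictions. A dependency-free sketch for the binary case (the toy labels below are illustrative, not the SST-2 outputs):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Calling `binary_metrics(y_true, y_pred, positive=0)` gives the negative-class numbers, which is how the per-category F1 breakdown above can be produced.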

5. Limitations and Biases

```markdown
## Limitations

- Trained only on movie reviews; may not generalize to other domains
- Limited to binary sentiment; cannot detect neutral or mixed sentiment
- Performance degrades on short text (<10 words)
- May exhibit bias toward formal language over slang

## Ethical Considerations

- Training data may contain societal biases
- Should not be used for decisions affecting individuals
- Requires human review for production use
```

6. Additional Information

```markdown
## Additional Information

- Repository: https://github.com/your-org/project
- Paper: [Link to paper if applicable]
- Contact: [email protected]
- W&B Report: https://wandb.ai/project/runs/abc123
```

Model Card Toolkit

For advanced use cases, use the Model Card Toolkit:
```python
import model_card_toolkit as mctlib

# Initialize toolkit
mct = mctlib.ModelCardToolkit(output_dir='model_card')

# Create model card
model_card = mct.scaffold_assets()

# Populate fields
model_card.model_details.name = 'BERT Sentiment Classifier'
model_card.model_details.overview = 'Fine-tuned MobileBERT for sentiment analysis'
model_card.model_details.version.name = 'v1.0'

# Add intended-use and limitation details (the schema wraps these in objects)
model_card.considerations.use_cases = [
    mctlib.UseCase(description='Movie review sentiment analysis'),
    mctlib.UseCase(description='Social media monitoring'),
]

model_card.considerations.limitations = [
    mctlib.Limitation(description='Limited to English text'),
    mctlib.Limitation(description='Binary classification only'),
]

# Add evaluation data (metric values are strings in the schema)
model_card.quantitative_analysis.performance_metrics = [
    mctlib.PerformanceMetric(type='f1', value='0.912'),
    mctlib.PerformanceMetric(type='accuracy', value='0.908'),
]

# Write the updated card and render the output document
mct.update_model_card(model_card)
mct.export_format()
```

Real-World Examples

OpenAI’s GPT-4 System Card is a prominent real-world example of comprehensive model documentation. It includes:
  • Detailed capability analysis
  • Safety evaluations and mitigations
  • Limitation documentation
  • Deployment considerations

Best Practices

Include all relevant information:
  • Complete training configuration
  • All evaluation metrics
  • Known limitations and biases
  • Intended and out-of-scope uses
Clearly document:
  • Model failures and edge cases
  • Performance gaps across demographics
  • Uncertainty in predictions
  • Limitations of training data
Maintain model cards over time:
  • Update with new evaluation results
  • Document discovered issues
  • Add user feedback and insights
  • Version model card with model updates
Ensure cards are easy to find and understand:
  • Store with model artifacts
  • Use clear, non-technical language
  • Include visual aids and examples
  • Provide contact information

Integration with Training

Automate model card generation in your training pipeline:
```python
from pathlib import Path
import logging

logger = logging.getLogger(__name__)

def train(config_path: Path):
    # Load config and build trainer (project-specific setup)
    trainer = get_trainer(...)
    train_result = trainer.train()

    # Save model
    trainer.save_model()

    # Generate model card
    kwargs = {
        "finetuned_from": model_args.model_name_or_path,
        "tasks": "text-classification",
        "language": "en",
        "dataset_tags": "sst2",
        "dataset": "SST-2",
    }
    trainer.create_model_card(**kwargs)

    logger.info(f"Model card saved to {training_args.output_dir}/README.md")

    # Upload to registry with model card (project-specific helper)
    upload_to_registry(
        model_name="my-model",
        model_path=training_args.output_dir,
    )
```

Checklist

Use this checklist to ensure complete model documentation:
  • Model architecture and base model specified
  • Training data described (source, size, preprocessing)
  • Hyperparameters documented
  • Evaluation metrics reported
  • Performance across subgroups analyzed
  • Intended use cases listed
  • Out-of-scope uses identified
  • Known limitations documented
  • Ethical considerations addressed
  • Contact information provided
  • Code and reproduction instructions included
  • License specified
  • Carbon footprint estimated (if applicable)
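A checklist like this can also be enforced mechanically, e.g. as a CI step that fails when a generated card is incomplete. A minimal sketch that scans card text for required headings (the heading names follow this page's suggested structure; they are not a formal standard):

```python
REQUIRED_SECTIONS = [
    "## Model Details",
    "## Intended Use",
    "## Training Details",
    "## Evaluation Results",
    "## Limitations",
    "## Ethical Considerations",
]

def missing_sections(card_text: str) -> list:
    """Return the required headings absent from the model card text."""
    return [s for s in REQUIRED_SECTIONS if s not in card_text]

# Illustrative card that is missing three sections
card = "## Model Details\n## Intended Use\n## Evaluation Results\n"
print(missing_sections(card))
# ['## Training Details', '## Limitations', '## Ethical Considerations']
```

In a pipeline, a non-empty result would fail the build and block an undocumented model from reaching the registry.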

Resources

Model Cards Paper

Original paper: “Model Cards for Model Reporting”

Google Model Cards

Interactive playbook and examples

Model Card Toolkit

TensorFlow’s model card generation toolkit

Example Cards

Collection of model cards and datasheets

Next Steps

Classic Training

Learn to train BERT-based models with automatic model card generation
