The sample module provides utilities for creating representative subsets of annotation datasets, supporting both random and stratified sampling strategies.

Main Function

sample_dataset

Sample a dataset according to specified options.
pub fn sample_dataset(
    dataset: &Dataset,
    opts: &SampleOptions
) -> Result<Dataset, PanlabelError>
Creates a subset of the input dataset by selecting a specified number or fraction of images. Supports:
  • Random uniform sampling
  • Stratified sampling (category-aware)
  • Category filtering
  • Deterministic sampling with seed
Parameters:
  • dataset - The dataset to sample from
  • opts - Sampling options (strategy, size, categories, seed)
Returns: A new Dataset containing the sampled subset
Errors:
  • InvalidSampleParams - Invalid sampling parameters (e.g., both n and fraction specified)
  • SampleFailed - Sampling failed (e.g., no images remain after filtering)

Types

SampleStrategy

Image sampling strategy.
pub enum SampleStrategy {
    /// Uniform random sampling
    Random,
    /// Category-aware weighted sampling
    Stratified,
}
  • Random: Selects images uniformly at random. Each image has equal probability of selection.
  • Stratified: Attempts to maintain category distribution from the original dataset. Useful for preserving class balance in the subset.
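As a rough illustration of what maintaining the category distribution can mean, the sketch below splits a sample size across categories in proportion to their sizes, using the largest-remainder method. It is not panlabel's implementation, which this page does not describe. `proportional_quotas` is a hypothetical helper, and it assumes each image belongs to exactly one category, which real annotation datasets often violate.

```rust
/// Split a sample of `n` items across strata in proportion to their sizes,
/// using the largest-remainder method. `counts[i]` is the size of stratum i.
/// Hypothetical helper for illustration; not part of panlabel.
fn proportional_quotas(counts: &[usize], n: usize) -> Vec<usize> {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return vec![0; counts.len()];
    }
    // Cannot sample more items than exist.
    let n = n.min(total);
    // Integer part of each stratum's proportional share.
    let mut quotas: Vec<usize> = counts.iter().map(|c| n * c / total).collect();
    // Order strata by the fractional part of their share, largest first.
    // The sort is stable, so ties go to the lower index.
    let mut order: Vec<usize> = (0..counts.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(n * counts[i] % total));
    // Hand the slots lost to rounding down to the largest remainders.
    let leftover = n - quotas.iter().sum::<usize>();
    for &i in order.iter().take(leftover) {
        quotas[i] += 1;
    }
    quotas
}
```

With category sizes 80, 15, and 5 and a sample size of 10, this allocation yields quotas of 8, 2, and 0: under strict proportional allocation, very small categories can round down to nothing.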

CategoryMode

Category filtering behavior.
pub enum CategoryMode {
    /// Keep whole images that contain at least one selected category
    Images,
    /// Keep only matching annotations; drop images with no remaining annotations
    Annotations,
}
  • Images: Filter at the image level. If an image contains at least one annotation from the selected categories, keep the entire image (with all its annotations).
  • Annotations: Filter at the annotation level. Only keep annotations matching the selected categories, and drop images that have no remaining annotations.
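The difference between the two modes is easiest to see on a toy dataset. The sketch below uses simplified stand-in types (`Image`, `Mode`, and `filter_by_category` are hypothetical, not panlabel's own definitions) and mirrors only the behavior described above.

```rust
// Simplified stand-ins for illustration; these are NOT panlabel's real types.
#[derive(Debug, Clone, PartialEq)]
struct Image {
    id: u32,
    labels: Vec<String>, // category name of each annotation on the image
}

#[derive(Debug, Clone, Copy)]
enum Mode {
    Images,
    Annotations,
}

/// Apply a category filter in the given mode. An empty filter keeps everything.
fn filter_by_category(images: &[Image], selected: &[&str], mode: Mode) -> Vec<Image> {
    if selected.is_empty() {
        return images.to_vec();
    }
    let is_selected = |label: &str| selected.iter().any(|s| *s == label);
    let mut out = Vec::new();
    for img in images {
        match mode {
            // Keep the whole image, with all annotations, if any one matches.
            Mode::Images => {
                if img.labels.iter().any(|l| is_selected(l.as_str())) {
                    out.push(img.clone());
                }
            }
            // Keep only matching annotations; drop the image if none remain.
            Mode::Annotations => {
                let kept: Vec<String> = img
                    .labels
                    .iter()
                    .filter(|l| is_selected(l.as_str()))
                    .cloned()
                    .collect();
                if !kept.is_empty() {
                    out.push(Image { id: img.id, labels: kept });
                }
            }
        }
    }
    out
}
```

For an image annotated with both "person" and "dog", filtering on "person" keeps both annotations in `Mode::Images` but only the "person" annotation in `Mode::Annotations`. An image with only "dog" is dropped in either mode.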

SampleOptions

Sampling configuration.
pub struct SampleOptions {
    pub n: Option<usize>,
    pub fraction: Option<f64>,
    pub seed: Option<u64>,
    pub strategy: SampleStrategy,
    pub categories: Vec<String>,
    pub category_mode: CategoryMode,
}
Fields:
  • n - Exact number of images to sample (mutually exclusive with fraction)
  • fraction - Fraction of images to sample, in range (0.0, 1.0] (mutually exclusive with n)
  • seed - Optional random seed for reproducible sampling
  • strategy - Sampling strategy (Random or Stratified)
  • categories - Optional category filter (empty = no filtering)
  • category_mode - How to apply category filtering
Constraints:
  • Exactly one of n or fraction must be set (not both, not neither)
  • If n is set, it must be > 0
  • If fraction is set, it must be in range (0.0, 1.0]
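The constraints above amount to a small decision table. The following sketch expresses them with a stand-in `Opts` struct and a plain `String` error. Both are assumptions made for illustration; the real check is validate_sample_options, which reports failures as a PanlabelError.

```rust
// Stand-in for the two size fields of SampleOptions; not panlabel's type.
struct Opts {
    n: Option<usize>,
    fraction: Option<f64>,
}

/// Check the three documented constraints and return the first violation.
/// Hypothetical helper; error messages are illustrative only.
fn check(opts: &Opts) -> Result<(), String> {
    match (opts.n, opts.fraction) {
        (Some(_), Some(_)) => Err("set either n or fraction, not both".into()),
        (None, None) => Err("one of n or fraction is required".into()),
        (Some(0), None) => Err("n must be greater than 0".into()),
        (Some(_), None) => Ok(()),
        // The range is (0.0, 1.0]. NaN fails the comparison and is rejected.
        (None, Some(f)) if f > 0.0 && f <= 1.0 => Ok(()),
        (None, Some(_)) => Err("fraction must be in (0.0, 1.0]".into()),
    }
}
```

Matching on the pair `(n, fraction)` lets the compiler confirm that every combination of set and unset fields is handled.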

Validation

validate_sample_options

Validate sampling options before running.
pub fn validate_sample_options(
    opts: &SampleOptions
) -> Result<(), PanlabelError>
Checks that the sampling options are valid. Called automatically by sample_dataset.

Examples

Random Sampling (50 images)

use panlabel::sample::{sample_dataset, SampleOptions, SampleStrategy, CategoryMode};
use panlabel::ir;

let dataset = ir::io_coco_json::read_coco_json("dataset.json")?;

let opts = SampleOptions {
    n: Some(50),
    fraction: None,
    seed: Some(42),  // Reproducible
    strategy: SampleStrategy::Random,
    categories: vec![],
    category_mode: CategoryMode::Images,
};

let subset = sample_dataset(&dataset, &opts)?;
println!("Sampled {} images", subset.images.len());

Stratified Sampling (10% of dataset)

use panlabel::sample::{sample_dataset, SampleOptions, SampleStrategy, CategoryMode};

// `dataset` is loaded as in the first example.

let opts = SampleOptions {
    n: None,
    fraction: Some(0.1),  // 10%
    seed: Some(42),
    strategy: SampleStrategy::Stratified,  // Preserve category distribution
    categories: vec![],
    category_mode: CategoryMode::Images,
};

let subset = sample_dataset(&dataset, &opts)?;

Category Filtering (Image-level)

Sample only images that contain “person” or “car” annotations:
use panlabel::sample::{sample_dataset, SampleOptions, SampleStrategy, CategoryMode};

// `dataset` is loaded as in the first example.

let opts = SampleOptions {
    n: Some(100),
    fraction: None,
    seed: None,
    strategy: SampleStrategy::Random,
    categories: vec!["person".to_string(), "car".to_string()],
    category_mode: CategoryMode::Images,  // Keep entire image
};

let subset = sample_dataset(&dataset, &opts)?;
// Resulting subset contains 100 images, each with at least one "person" or "car"
// Images may also contain other categories

Category Filtering (Annotation-level)

Keep only “person” annotations and drop images with no persons:
use panlabel::sample::{sample_dataset, SampleOptions, SampleStrategy, CategoryMode};

// `dataset` is loaded as in the first example.

let opts = SampleOptions {
    n: Some(100),
    fraction: None,
    seed: None,
    strategy: SampleStrategy::Random,
    categories: vec!["person".to_string()],
    category_mode: CategoryMode::Annotations,  // Filter annotations
};

let subset = sample_dataset(&dataset, &opts)?;
// Resulting subset contains 100 images
// Each image has ONLY "person" annotations (other categories removed)

Reproducible Sampling

Use a seed for deterministic results:
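use panlabel::sample::{sample_dataset, SampleOptions, SampleStrategy, CategoryMode};

// `dataset` is loaded as in the first example.
let opts = SampleOptions {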
    n: Some(50),
    fraction: None,
    seed: Some(12345),  // Same seed = same sample
    strategy: SampleStrategy::Random,
    categories: vec![],
    category_mode: CategoryMode::Images,
};

let sample1 = sample_dataset(&dataset, &opts)?;
let sample2 = sample_dataset(&dataset, &opts)?;
// sample1 and sample2 contain the same images
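
To show why a fixed seed yields the same sample, here is a minimal, dependency-free sketch of seeded selection. The generator (SplitMix64) and the partial Fisher-Yates shuffle are choices made for this illustration only. The page does not say which random number generator or selection algorithm panlabel uses, and `sample_indices` is a hypothetical helper.

```rust
/// SplitMix64: a tiny deterministic generator, used here only so the sketch
/// needs no external crates. Not necessarily what panlabel uses internally.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Pick `n` distinct indices out of `0..len` with a partial Fisher-Yates
/// shuffle. The same `seed` always produces the same indices.
fn sample_indices(len: usize, n: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    let mut idx: Vec<usize> = (0..len).collect();
    let n = n.min(len);
    for i in 0..n {
        // Choose a position in i..len and swap it into slot i.
        // The modulo introduces a slight bias, acceptable for a sketch.
        let j = i + (rng.next_u64() % (len - i) as u64) as usize;
        idx.swap(i, j);
    }
    idx.truncate(n);
    idx
}
```

Because all randomness flows from the seed, two calls such as `sample_indices(100, 10, 12345)` return identical index lists, which is the property the example above relies on.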
