
Overview

The TikTok Auto Collection Sorter compares three model types during training and selects the best performer via cross-validation:
  1. k-Nearest Neighbors (k-NN): Non-parametric baseline
  2. Logistic Regression: Linear classifier with L2 regularization
  3. Multi-Layer Perceptron (MLP): Two-layer neural network
This guide covers when to use each model, how to modify the MLP architecture, and how to add custom models.

Model Comparison

k-Nearest Neighbors

How it works (train.py:146-152):
k = min(5, len(X_train) - 1)
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_val)
Characteristics:
  • No training required (stores all training data)
  • Uses cosine similarity between embeddings
  • k=5 neighbors by default
When to use:
  • Best for small datasets (<100 samples)
  • When classes have tight, well-separated clusters
  • When you want instant “training” (no optimization step)
Limitations:
  • Slow inference on large datasets (compares against all training data)
  • No learned decision boundaries
  • Sensitive to noisy features
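The k = min(5, len(X_train) - 1) clamp above matters on tiny datasets, where asking for more neighbors than available training samples would raise an error. A quick illustration (the random data here is a stand-in for real embeddings):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(4, 8))   # only 4 training samples
y_train = np.array([0, 0, 1, 1])

k = min(5, len(X_train) - 1)        # clamped to 3, not the default 5
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)

X_new = rng.normal(size=(2, 8))
print(k)                          # 3
print(knn.predict(X_new).shape)   # (2,)
```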

Logistic Regression

How it works (train.py:155-160):
lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_val)
Characteristics:
  • Linear decision boundaries
  • L2 regularization (C=1.0 controls strength)
  • Built-in class balancing
When to use:
  • When classes are linearly separable
  • For interpretability (can inspect feature weights)
  • When you need fast, reliable inference
Limitations:
  • Cannot learn non-linear patterns
  • May underfit complex relationships
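One reason to prefer logistic regression is interpretability: after fitting, you can inspect the learned weights directly. A minimal sketch, with synthetic data standing in for the real embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_train = rng.normal(size=(60, 8))   # stand-in for 8-dim embeddings
y_train = np.arange(60) % 3          # three balanced classes

lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X_train, y_train)

# coef_ has shape (n_classes, n_features); larger |weight| = more influence
print(lr.coef_.shape)  # (3, 8)
for cls, weights in zip(lr.classes_, lr.coef_):
    top = np.argsort(np.abs(weights))[::-1][:3]
    print(f"class {cls}: most influential feature indices {top.tolist()}")
```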

Multi-Layer Perceptron (MLP)

Architecture (train.py:31-45):
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),      # 1024 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2), # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 128 → N
        )

    def forward(self, x):
        return self.net(x)
Characteristics:
  • Two hidden layers (256 → 128 neurons)
  • ReLU activations
  • Dropout regularization (0.3 and 0.2)
  • Adam optimizer with weight decay
When to use:
  • When classes have non-linear decision boundaries
  • With sufficient training data (>50 samples per class)
  • When logistic regression underfits
Limitations:
  • Requires more data than linear models
  • Slower training than k-NN or logistic regression
  • Risk of overfitting on very small datasets

Modifying MLP Hyperparameters

Hidden Layer Size

Increase capacity for complex datasets:
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=512):  # Was 256
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),       # 1024 → 512
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 512 → 256
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 256 → N
        )
Larger networks require more training data. If you have <200 labeled samples, stick with hidden_dim=256 or smaller to avoid overfitting.

Dropout Rates

Reduce overfitting by increasing dropout:
nn.Dropout(0.5),  # Was 0.3 - more aggressive regularization
Or decrease for small datasets where model is underfitting:
nn.Dropout(0.1),  # Was 0.3 - less regularization

Learning Rate and Optimizer

Modify train_mlp function (train.py:48-51):
def train_mlp(X_train, y_train, X_val, y_val, num_classes, device, 
              epochs=100, lr=5e-4):  # Was 1e-3
    model = MLP(input_dim, num_classes).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-3)  # Was 1e-4
Guidelines:
  • Lower learning rate (5e-4) for more stable training
  • Higher weight decay (1e-3) for stronger L2 regularization
  • More epochs (200) if training stops improving early

Batch Size

Change in train.py:64:
loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # Was 32
  • Larger batches (64) → more stable gradients, faster training
  • Smaller batches (16) → more noise, better generalization (useful for small datasets)

Adding a Third Hidden Layer

For very complex classification tasks:
class DeepMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),           # 1024 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim),          # 256 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),     # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),    # 128 → N
        )

    def forward(self, x):
        return self.net(x)
Replace the MLP class in both train.py and predict.py with DeepMLP.
Deeper networks need significantly more data. Only use 3+ hidden layers if you have >500 labeled samples.

Custom Model: Attention-Based MLP

Add an attention mechanism to weight feature importance:
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        # Attention layer
        self.attention = nn.Sequential(
            nn.Linear(input_dim, input_dim),
            nn.Tanh(),
            nn.Linear(input_dim, input_dim),
            nn.Softmax(dim=1)
        )
        
        # Main network
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        # Compute attention weights
        attn_weights = self.attention(x)
        # Apply attention to input features
        x_attended = x * attn_weights
        # Pass through main network
        return self.net(x_attended)
This model learns which features (visual vs. audio) are most important for classification.
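To check that the attention behaves as intended, you can inspect the softmax weights directly. The sketch below repeats the class definition so it runs standalone; the input and class dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AttentionMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(input_dim, input_dim), nn.Tanh(),
            nn.Linear(input_dim, input_dim), nn.Softmax(dim=1),
        )
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        return self.net(x * self.attention(x))

model = AttentionMLP(input_dim=1024, num_classes=4)
model.eval()  # disable dropout for inspection

x = torch.randn(2, 1024)  # two example embeddings
with torch.no_grad():
    weights = model.attention(x)  # one softmax weight per input dimension

print(weights.shape)             # torch.Size([2, 1024])
print(float(weights[0].sum()))   # ≈ 1.0: softmax normalizes across features
```

Dimensions with consistently high weights indicate features the model relies on most.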

Integrating Custom Models

  1. Add model class to train.py
  2. Update training loop in main() function:
# After line 166 in train.py, add:
# 4. Custom Attention MLP
attn_model, attn_acc = train_custom_mlp(
    X_train, y_train, X_val, y_val, num_classes, device
)
attn_model.eval()  # disable dropout for inference
with torch.no_grad():
    attn_preds = attn_model(torch.FloatTensor(X_val).to(device)).argmax(dim=1).cpu().numpy()
# (initialize results["attention_mlp"] and all_preds["attention_mlp"] alongside the built-in models)
results["attention_mlp"].append((attn_preds == y_val).mean())
all_preds["attention_mlp"][val_idx] = attn_preds
  3. Update prediction script (predict.py) to handle new model type
  4. Update model config to save model type metadata
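The snippet above calls train_custom_mlp, which does not exist in train.py; a minimal sketch, assuming it should mirror train_mlp's (model, validation accuracy) contract. The model_cls argument and the TinyNet demo class are illustrative additions, not part of the project:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

def train_custom_mlp(X_train, y_train, X_val, y_val, num_classes, device,
                     model_cls=None, epochs=100, lr=1e-3):
    """Train a custom model full-batch and return (model, validation accuracy)."""
    if model_cls is None:
        model_cls = AttentionMLP  # assumes the class from the section above
    model = model_cls(X_train.shape[1], num_classes).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    Xt = torch.FloatTensor(X_train).to(device)
    yt = torch.LongTensor(y_train).to(device)
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(Xt), yt)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        preds = model(torch.FloatTensor(X_val).to(device)).argmax(dim=1).cpu().numpy()
    return model, float((preds == y_val).mean())

# Demo with a tiny stand-in model (swap in AttentionMLP for real use):
class TinyNet(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 16)).astype("float32")
y = np.arange(80) % 2
model, acc = train_custom_mlp(X[:60], y[:60], X[60:], y[60:], num_classes=2,
                              device=torch.device("cpu"), model_cls=TinyNet,
                              epochs=50)
print(f"validation accuracy: {acc:.1%}")
```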

Cross-Validation Strategy

The system uses Stratified K-Fold to ensure balanced folds (train.py:136):
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
This guarantees each fold has proportional class representation. Custom models added to the training loop reuse the same folds, so no extra work is needed. Key parameters:
  • n_splits: Adjusted based on smallest class size (min 2, max 5)
  • shuffle=True: Randomizes data before splitting
  • random_state=42: Ensures reproducibility
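The clamp on n_splits can be sketched as follows (the exact expression in train.py may differ, and the class counts here are made up):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 12 + [1] * 7 + [2] * 3)  # smallest class: 3 samples

# n_splits cannot exceed the smallest class count; clamp to [2, 5]
n_splits = int(np.clip(np.bincount(y).min(), 2, 5))
print(n_splits)  # 3

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros((len(y), 1)), y)):
    counts = np.bincount(y[val_idx], minlength=3).tolist()
    print(f"fold {fold}: val class counts = {counts}")
```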

Hyperparameter Tuning Example

Systematic grid search for best MLP configuration:
import itertools
import numpy as np

# Define hyperparameter grid
hidden_dims = [128, 256, 512]
dropout_rates = [(0.2, 0.1), (0.3, 0.2), (0.4, 0.3)]
learning_rates = [1e-4, 5e-4, 1e-3]

best_acc = 0
best_config = None

for hidden_dim, (drop1, drop2), lr in itertools.product(
    hidden_dims, dropout_rates, learning_rates
):
    print(f"\nTesting: hidden={hidden_dim}, dropout=({drop1},{drop2}), lr={lr}")
    
    # Modify MLP class with current hyperparameters
    # (you'd need to pass these as arguments to MLP.__init__)
    
    # Run cross-validation
    cv_results = []
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        model, acc = train_mlp(X_train, y_train, X_val, y_val, 
                               num_classes, device, lr=lr)
        cv_results.append(acc)
    
    mean_acc = np.mean(cv_results)
    if mean_acc > best_acc:
        best_acc = mean_acc
        best_config = (hidden_dim, (drop1, drop2), lr)
    
    print(f"Mean CV accuracy: {mean_acc:.1%}")

print(f"\nBest config: {best_config} with {best_acc:.1%} accuracy")
Hyperparameter tuning requires many training runs. Each configuration multiplied by K folds can take 10-20 minutes on CPU. Consider using a GPU or reducing the search space.

Model Selection Insights

From train.py:176-179, the system automatically picks the best model:
mean_accs = {name: np.mean(accs) for name, accs in results.items()}
best_name = max(mean_accs, key=mean_accs.get)
print(f"\nBest model: {best_name} ({mean_accs[best_name]:.1%})")
Typical outcomes:
  • k-NN wins: Very small dataset (<50 samples) or highly clustered embeddings
  • Logistic Regression wins: Linearly separable classes, medium dataset (50-200 samples)
  • MLP wins: Complex boundaries, sufficient data (>200 samples), multimodal signals

Debugging Poor Performance

If all models perform poorly (<70% accuracy):
  1. Check feature quality: Visualize embeddings with t-SNE/UMAP
  2. Verify labels: Ensure folder assignments are consistent
  3. Increase data: Collect more labeled samples per class
  4. Adjust class weights: See Class Imbalance
  5. Try different architectures: Add/remove layers, change activations
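Step 1 can be as simple as a 2-D t-SNE projection; if classes do not form visible clusters, no classifier will separate them well. A sketch with stand-in data (swap in your real embeddings and labels):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 64))   # stand-in for real embeddings
y = np.arange(90) % 3           # stand-in labels

# perplexity must be smaller than the number of samples
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(coords.shape)  # (90, 2)

# Plot (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(coords[:, 0], coords[:, 1], c=y); plt.show()
```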
