## Overview

The TikTok Auto Collection Sorter compares three model types during training and selects the best performer via cross-validation:

- **k-Nearest Neighbors (k-NN)**: non-parametric baseline
- **Logistic Regression**: linear classifier with L2 regularization
- **Multi-Layer Perceptron (MLP)**: two-layer neural network

This guide covers when to use each model, how to modify the MLP architecture, and how to add custom models.
## Model Comparison
### k-Nearest Neighbors

**How it works** (train.py:146-152):

```python
k = min(5, len(X_train) - 1)
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_val)
```
**Characteristics:**

- No training phase (stores all training data)
- Uses cosine similarity between embeddings
- k=5 neighbors by default

**When to use:**

- Best for small datasets (<100 samples)
- When classes form tight, well-separated clusters
- When you want instant "training" (no optimization step)

**Limitations:**

- Slow inference on large datasets (compares against all training data)
- No learned decision boundaries
- Sensitive to noisy features
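The cosine-metric behavior is easy to sanity-check on toy data. A minimal sketch, where the 8-dim vectors are hypothetical stand-ins for the real embeddings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy "clusters" standing in for real embeddings.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=+1.0, scale=0.1, size=(20, 8)),   # class 0
    rng.normal(loc=-1.0, scale=0.1, size=(20, 8)),   # class 1
])
y_train = np.array([0] * 20 + [1] * 20)

k = min(5, len(X_train) - 1)
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)

# A query pointing in the class-0 direction is assigned to class 0.
query = np.full((1, 8), 0.9)
print(knn.predict(query))          # [0]

# kneighbors() exposes the cosine distances to the k nearest samples.
dists, idx = knn.kneighbors(query)
print(dists.shape)                 # (1, 5)
```

Because cosine distance ignores vector magnitude, only the *direction* of the embedding matters, which suits normalized embedding spaces.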
### Logistic Regression

**How it works** (train.py:155-160):

```python
lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_val)
```
**Characteristics:**

- Linear decision boundaries
- L2 regularization (C=1.0; note that C is the *inverse* regularization strength, so smaller values regularize more)
- Built-in class balancing via class_weight="balanced"

**When to use:**

- When classes are linearly separable
- For interpretability (you can inspect the learned feature weights)
- When you need fast, reliable inference

**Limitations:**

- Cannot learn non-linear patterns
- May underfit complex relationships
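To illustrate the interpretability point, here is a small sketch inspecting the learned weights; the two-feature toy data stands in for real embedding dimensions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: feature 0 determines the label, feature 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X, y)

# coef_ has shape (1, n_features) for binary problems; a larger |weight|
# means that feature moves the decision boundary more.
weights = lr.coef_[0]
print("most influential feature:", int(np.argmax(np.abs(weights))))  # 0
```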
### Multi-Layer Perceptron (MLP)

**Architecture** (train.py:31-45):

```python
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),        # 1024 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 128 → N
        )

    def forward(self, x):
        return self.net(x)
```
**Characteristics:**

- Two hidden layers (256 → 128 neurons)
- ReLU activations
- Dropout regularization (0.3 and 0.2)
- Adam optimizer with weight decay

**When to use:**

- When classes have non-linear decision boundaries
- With sufficient training data (>50 samples per class)
- When logistic regression underfits

**Limitations:**

- Requires more data than linear models
- Slower training than k-NN or logistic regression
- Risk of overfitting on very small datasets
## Modifying MLP Hyperparameters

### Hidden Layer Size

Increase capacity for complex datasets:

```python
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=512):  # Was 256
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),        # 1024 → 512
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 512 → 256
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 256 → N
        )
```
Larger networks require more training data. If you have fewer than ~200 labeled samples, stick with `hidden_dim=256` or smaller to avoid overfitting.
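To see why capacity matters, you can count parameters by hand: each linear layer has `in × out` weights plus `out` biases. A quick arithmetic sketch for the two-hidden-layer architecture above:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Trainable parameters in one fully connected layer (weights + biases)."""
    return n_in * n_out + n_out

def mlp_param_count(input_dim: int, num_classes: int, hidden_dim: int) -> int:
    """Total trainable parameters for the two-hidden-layer MLP."""
    return (linear_params(input_dim, hidden_dim)
            + linear_params(hidden_dim, hidden_dim // 2)
            + linear_params(hidden_dim // 2, num_classes))

# Doubling hidden_dim more than doubles the parameter count:
print(mlp_param_count(1024, 10, 256))  # 296586
print(mlp_param_count(1024, 10, 512))  # 658698
```

With only a few hundred labeled samples, a ~660k-parameter model has far more capacity than the data can constrain, which is why the smaller default is safer.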
### Dropout Rates

Reduce overfitting by increasing dropout:

```python
nn.Dropout(0.5),  # Was 0.3 - more aggressive regularization
```

Or decrease it on small datasets where the model is underfitting:

```python
nn.Dropout(0.1),  # Was 0.3 - less regularization
```
### Learning Rate and Optimizer

Modify the train_mlp function (train.py:48-51):

```python
def train_mlp(X_train, y_train, X_val, y_val, num_classes, device,
              epochs=100, lr=5e-4):  # Was 1e-3
    model = MLP(input_dim, num_classes).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-3)  # Was 1e-4
```
**Guidelines:**

- Lower learning rate (5e-4) for more stable training
- Higher weight decay (1e-3) for stronger L2 regularization
- More epochs (e.g. 200) if validation accuracy is still improving when training ends
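Rather than guessing an epoch count, you can stop automatically once validation accuracy plateaus. A minimal, framework-agnostic patience tracker (a hypothetical helper, not part of train.py) that would slot into the epoch loop:

```python
class EarlyStopping:
    """Signal a stop when validation accuracy hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc: float) -> bool:
        """Record one epoch's validation accuracy; return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Example: accuracy plateaus after epoch 2, so we stop at epoch 5.
stopper = EarlyStopping(patience=3)
history = [0.60, 0.72, 0.74, 0.74, 0.73, 0.74, 0.74]
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at)  # 5
```

Inside train_mlp this would replace the fixed `for epoch in range(epochs)` bound with a `break` once `stopper.step(val_acc)` returns True.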
### Batch Size

Change in train.py:64:

```python
loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # Was 32
```

- Larger batches (64) → more stable gradients, faster training
- Smaller batches (16) → noisier gradients, often better generalization (useful for small datasets)
### Adding a Third Hidden Layer

For very complex classification tasks:

```python
class DeepMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),        # 1024 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim),       # 256 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 128 → N
        )

    def forward(self, x):
        return self.net(x)
```
Replace the MLP class in both train.py and predict.py with DeepMLP.
Deeper networks need significantly more data. Only use 3+ hidden layers if you have >500 labeled samples.
## Custom Model: Attention-Based MLP

Add an attention mechanism to weight feature importance:

```python
import torch
import torch.nn as nn

class AttentionMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        # Attention layer
        self.attention = nn.Sequential(
            nn.Linear(input_dim, input_dim),
            nn.Tanh(),
            nn.Linear(input_dim, input_dim),
            nn.Softmax(dim=1),
        )
        # Main network
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        # Compute attention weights
        attn_weights = self.attention(x)
        # Apply attention to input features
        x_attended = x * attn_weights
        # Pass through main network
        return self.net(x_attended)
```
This model learns which features (visual vs. audio) are most important for classification.
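Because the softmax makes each attention row sum to 1, the weights can be read directly as per-feature importance. A small sketch of the attention block in isolation; the 8-dim input and the visual/audio split point are hypothetical:

```python
import torch
import torch.nn as nn

# The attention sub-network alone, at a toy input size.
attention = nn.Sequential(
    nn.Linear(8, 8),
    nn.Tanh(),
    nn.Linear(8, 8),
    nn.Softmax(dim=1),
)

with torch.no_grad():
    x = torch.randn(4, 8)      # a small batch of fake embeddings
    attn = attention(x)        # shape (4, 8); each row sums to 1

# Hypothetical layout: first half of the embedding is visual, second audio.
visual_mass = attn[:, :4].sum(dim=1)
audio_mass = attn[:, 4:].sum(dim=1)
print(attn.sum(dim=1))         # each entry ≈ 1.0
```

On the real model you would run `model.attention(x)` on a validation batch and compare the attention mass over the visual vs. audio slices of the fused embedding.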
## Integrating Custom Models

1. Add the model class to train.py.
2. Update the training loop in the main() function:

   ```python
   # After line 166 in train.py, add:
   # 4. Custom Attention MLP
   attn_model, attn_acc = train_custom_mlp(
       X_train, y_train, X_val, y_val, num_classes, device
   )
   attn_preds = attn_model(torch.FloatTensor(X_val).to(device)).argmax(dim=1).cpu().numpy()
   results["attention_mlp"].append((attn_preds == y_val).mean())
   all_preds["attention_mlp"][val_idx] = attn_preds
   ```

3. Update the prediction script (predict.py) to handle the new model type.
4. Update the model config to save model-type metadata.
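For the last step, one possible shape for that metadata is a small JSON file saved next to the weights so predict.py knows which class to instantiate. The field names below are illustrative, not what predict.py currently reads:

```python
import json
from pathlib import Path

# Hypothetical config layout written at the end of training.
config = {
    "model_type": "attention_mlp",   # or "knn", "logistic_regression", "mlp"
    "input_dim": 1024,
    "num_classes": 5,
    "hidden_dim": 256,
}
Path("model_config.json").write_text(json.dumps(config, indent=2))

# predict.py side: read the type back and dispatch on it.
loaded = json.loads(Path("model_config.json").read_text())
print(loaded["model_type"])  # attention_mlp
```

Recording the architecture hyperparameters alongside the type means predict.py can rebuild the exact network before loading the state dict.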
## Cross-Validation Strategy

The system uses Stratified K-Fold to ensure balanced folds (train.py:136):

```python
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
```

This guarantees each fold has proportional class representation. Custom models get this for free, since the folds are built before any model trains.

**Key parameters:**

- n_splits: adjusted based on the smallest class size (min 2, max 5)
- shuffle=True: randomizes data before splitting
- random_state=42: ensures reproducibility
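The n_splits adjustment amounts to capping the fold count by the rarest class, since StratifiedKFold cannot use more folds than the smallest class has samples. A sketch of the rule (not the exact train.py code):

```python
import numpy as np

def choose_n_splits(y, lo: int = 2, hi: int = 5) -> int:
    """Clamp the fold count between `lo` and `hi` by the rarest class's size."""
    smallest = int(np.bincount(y).min())
    return max(lo, min(hi, smallest))

# Rarest class (label 1) has 3 samples, so use 3 folds:
print(choose_n_splits(np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])))  # 3

# With plenty of samples per class, use the maximum of 5 folds:
print(choose_n_splits(np.array([0] * 10 + [1] * 10)))                   # 5
```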
## Hyperparameter Tuning Example

Systematic grid search for the best MLP configuration:

```python
import itertools
import numpy as np

# Define the hyperparameter grid
hidden_dims = [128, 256, 512]
dropout_rates = [(0.2, 0.1), (0.3, 0.2), (0.4, 0.3)]
learning_rates = [1e-4, 5e-4, 1e-3]

best_acc = 0
best_config = None

for hidden_dim, (drop1, drop2), lr in itertools.product(
    hidden_dims, dropout_rates, learning_rates
):
    print(f"\nTesting: hidden={hidden_dim}, dropout=({drop1}, {drop2}), lr={lr}")

    # Modify the MLP class with the current hyperparameters
    # (you'd need to pass hidden_dim, drop1, drop2 as arguments to MLP.__init__)

    # Run cross-validation
    cv_results = []
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        model, acc = train_mlp(X_train, y_train, X_val, y_val,
                               num_classes, device, lr=lr)
        cv_results.append(acc)

    mean_acc = np.mean(cv_results)
    if mean_acc > best_acc:
        best_acc = mean_acc
        best_config = (hidden_dim, (drop1, drop2), lr)
    print(f"Mean CV accuracy: {mean_acc:.1%}")

print(f"\nBest config: {best_config} with {best_acc:.1%} accuracy")
```

Hyperparameter tuning requires many training runs: this grid has 3 × 3 × 3 = 27 configurations, and each one, multiplied by K folds, can take 10-20 minutes on CPU. Consider using a GPU or reducing the search space.
## Model Selection Insights

From train.py:176-179, the system automatically picks the best model:

```python
mean_accs = {name: np.mean(accs) for name, accs in results.items()}
best_name = max(mean_accs, key=mean_accs.get)
print(f"\nBest model: {best_name} ({mean_accs[best_name]:.1%})")
```
**Typical outcomes:**

- **k-NN wins**: very small dataset (<50 samples) or highly clustered embeddings
- **Logistic Regression wins**: linearly separable classes, medium dataset (50-200 samples)
- **MLP wins**: complex boundaries, sufficient data (>200 samples), multimodal signals

If all models perform poorly (<70% accuracy):

1. **Check feature quality**: visualize embeddings with t-SNE/UMAP
2. **Verify labels**: ensure folder assignments are consistent
3. **Increase data**: collect more labeled samples per class
4. **Adjust class weights**: see Class Imbalance
5. **Try different architectures**: add/remove layers, change activations
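For the feature-quality check, a minimal t-SNE sketch; the random blobs stand in for the real (n_samples, 1024) embedding matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic clusters standing in for real video embeddings.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(+2.0, 0.2, size=(30, 16)),
    rng.normal(-2.0, 0.2, size=(30, 16)),
])

# perplexity must be smaller than n_samples; keep it low for tiny datasets.
coords = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X)
print(coords.shape)  # (60, 2)

# To eyeball separation, scatter-plot the 2-D coords colored by label, e.g.:
#   plt.scatter(coords[:, 0], coords[:, 1], c=labels)
```

If classes do not form visible clusters even in this projection, no classifier choice will fix it; improve the features or labels first.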
- **Class Imbalance**: handle skewed class distributions
- **Active Learning**: efficiently collect training data