Overview
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique that projects high-dimensional feature representations into 2D space for visualization. It reveals how well your model has learned to separate different classes and identifies patterns in the feature space.

Well-separated clusters in t-SNE space indicate that your model has learned discriminative features for classification.
How It Works
The feature embedding pipeline consists of:
- Feature Extraction: Extract embeddings from the model’s penultimate layer
- Dimensionality Reduction: Apply t-SNE, UMAP, or PCA to reduce to 2D
- Visualization: Plot samples colored by class, prediction, or correctness
- Quality Metrics: Compute clustering metrics (Silhouette, Davies-Bouldin)
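The four steps can be sketched end to end on synthetic data (scikit-learn assumed; random blobs stand in for real penultimate-layer features):

```python
# Sketch of the pipeline: extract -> reduce -> visualize -> score.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# 1. Feature extraction stand-in: synthetic "penultimate layer" features
feats, labels = make_blobs(n_samples=300, centers=4, n_features=128,
                           random_state=0)

# 2. Dimensionality reduction to 2-D (PCA shown; t-SNE/UMAP are drop-ins)
emb = PCA(n_components=2).fit_transform(feats)

# 3. Visualization would plot emb colored by labels (omitted here)

# 4. Quality metric on the 2-D embedding
print("silhouette:", round(silhouette_score(emb, labels), 2))
```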
Feature Extraction
The platform extracts features from the model’s final representation layer before the classifier. The extraction process automatically adapts to different model architectures (CustomCNN, TransferModel).
Architecture-Specific Handling
The implementation intelligently identifies where to extract features:

CustomCNN Models:
- Hooks into the last feature layer before classification
- Flattens spatial dimensions if needed

TransferModel Models:
- Hooks into the pre-trained base model output
- Captures rich representations from ImageNet-trained networks

Fallback:
- Uses final output logits if no suitable layer is found
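In PyTorch, a forward hook is the usual way to capture features from the layer before the classifier. The sketch below uses a hypothetical SimpleCNN stand-in, not the platform's actual CustomCNN or TransferModel classes:

```python
# Sketch: capturing penultimate-layer features with a forward hook.
# SimpleCNN is an illustrative stand-in model.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(8 * 4 * 4, n_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

captured = {}

def hook(module, inputs, output):
    # Flatten spatial dimensions so every architecture yields (N, D)
    captured["feats"] = torch.flatten(output, 1).detach()

model = SimpleCNN()
# Hook the last feature layer before classification
model.features.register_forward_hook(hook)

with torch.no_grad():
    _ = model(torch.randn(16, 3, 32, 32))

print(captured["feats"].shape)  # (16, 128)
```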
Dimensionality Reduction Methods
The platform supports three reduction techniques:

t-SNE (Recommended)
- Best for: Visualization and pattern discovery
- Preserves: Local neighborhood structure
- Speed: Slower (1-5 minutes for 1000 samples)
- Deterministic: Fixed random seed ensures reproducibility
Perplexity Parameter
Perplexity controls the effective number of neighbors considered:
- Low (5-15): Focuses on local structure, many small clusters
- Medium (20-50): Balanced view, typical choice
- High (50+): Emphasizes global structure, fewer larger clusters
Tuning guidelines:
- Start with 30 (default)
- Increase for larger datasets (>1000 samples)
- Decrease if you see fragmented clusters
- Must be less than number of samples
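The effect of perplexity is easiest to see by running the same data at two values; a sketch using scikit-learn's TSNE on synthetic data:

```python
# Sketch: the same features embedded at two perplexity values.
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=200, centers=4, n_features=64, random_state=42)

for perplexity in (5, 30):  # perplexity must stay below n_samples
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42,  # fixed seed for reproducibility
               init="pca").fit_transform(X)
    print(perplexity, emb.shape)  # a (200, 2) embedding each time
```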
UMAP (Fast Alternative)
- Best for: Large datasets, faster visualization
- Preserves: Both local and global structure
- Speed: Faster than t-SNE (2-10x speedup)
- Installation: Requires the umap-learn package
PCA (Linear Baseline)
- Best for: Quick baseline, linear separability check
- Preserves: Maximum variance directions
- Speed: Very fast (seconds)
- Limitation: May miss non-linear structure
Method comparison:
- t-SNE: Best visualization quality, preserves local structure
- UMAP: Balanced speed and quality, scales better
- PCA: Fast baseline, checks linear separability
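One way to expose all three reducers behind a single switch is a small dispatch helper. reduce_2d is a hypothetical name, not the platform's API; umap-learn is an optional dependency:

```python
# Sketch: one switch over the three reducers.
import numpy as np

def reduce_2d(X, method="tsne", **kwargs):
    if method == "pca":
        from sklearn.decomposition import PCA
        return PCA(n_components=2).fit_transform(X)
    if method == "umap":
        import umap  # requires the optional umap-learn package
        return umap.UMAP(n_components=2, **kwargs).fit_transform(X)
    from sklearn.manifold import TSNE
    return TSNE(n_components=2, random_state=42, **kwargs).fit_transform(X)

X = np.random.rand(100, 64)
print(reduce_2d(X, method="pca").shape)  # (100, 2)
```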
Clustering Quality Metrics
The platform automatically computes metrics to quantify cluster quality:

Silhouette Score

Range: -1 to 1

Interpretation:
- > 0.7: Excellent separation, strong clusters
- 0.5 - 0.7: Good separation, reasonable clusters
- 0.25 - 0.5: Weak separation, overlapping clusters
- < 0.25: Poor separation, no meaningful clusters
Silhouette measures how similar each point is to its own cluster compared to other clusters.
Davies-Bouldin Index
Range: 0 to ∞ (lower is better)

Interpretation:
- < 1.0: Excellent separation
- 1.0 - 2.0: Good separation
- 2.0 - 3.0: Moderate separation
- > 3.0: Poor separation
Davies-Bouldin measures the ratio of within-cluster scatter to between-cluster separation.
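Both metrics are available in scikit-learn; a minimal sketch on synthetic 2-D embeddings:

```python
# Sketch: scoring cluster quality with scikit-learn's built-in metrics.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

emb, labels = make_blobs(n_samples=300, centers=3, n_features=2,
                         random_state=0)

sil = silhouette_score(emb, labels)      # range -1..1, higher is better
dbi = davies_bouldin_score(emb, labels)  # range 0..inf, lower is better
print(f"Silhouette: {sil:.2f}, Davies-Bouldin: {dbi:.2f}")
```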
Visualization Interface
The interpretability dashboard provides three complementary views:

1. By True Class

Colors points according to ground truth labels:
- Each color represents one malware family or class
- Shows how classes naturally separate in feature space
- Reveals which classes are inherently similar
2. Correct vs Incorrect
Colors points by prediction correctness:
- Green: Correctly classified samples
- Red: Misclassified samples
- Red points at cluster boundaries → confusion between similar classes
- Red points within clusters → difficult or mislabeled samples
- Red points isolated → outliers or distribution shift
3. By Predicted Class
Colors points according to model predictions:
- Compare with “By True Class” to identify systematic errors
- Regions where the predicted color doesn’t match the true color indicate confusion patterns
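The "Correct vs Incorrect" view can be sketched as a two-color matplotlib scatter (random points stand in for a real embedding and real predictions):

```python
# Sketch: the "Correct vs Incorrect" view as a matplotlib scatter.
import matplotlib
matplotlib.use("Agg")  # headless rendering; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
emb = rng.random((100, 2))          # 2-D embedding (e.g. from t-SNE)
y_true = rng.integers(0, 3, 100)
y_pred = rng.integers(0, 3, 100)
correct = y_true == y_pred

fig, ax = plt.subplots()
ax.scatter(*emb[correct].T, c="green", s=12, label="correct")
ax.scatter(*emb[~correct].T, c="red", s=12, label="misclassified")
ax.legend()
fig.savefig("correct_vs_incorrect.png")
```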
Interactive Usage
The embedding interface provides intuitive controls:

Parameters and Controls
Method Selection:
- Choose between t-SNE, UMAP, or PCA
- Each method shows different aspects of the data
Sample Count:
- More samples: Better representation, slower computation
- Fewer samples: Faster preview, less reliable patterns
- Default: 500 samples (good balance)
Method Parameters:
- t-SNE Perplexity (5-50): Controls neighborhood size
- UMAP N-Neighbors (5-50): Controls local structure preservation
Workflow:
1. Select method and parameters
2. Click “Compute Embeddings”
3. Wait for processing (30s - 3min depending on method)
4. View three interactive plots
5. Check clustering metrics
Interpreting Results
Ideal Patterns
Well-Trained Model:
- Tight, well-separated clusters
- Each class forms distinct regions
- High Silhouette score (> 0.5)
- Low Davies-Bouldin index (< 2.0)
- Few red points (misclassifications)
Poorly-Trained Model:
- Overlapping clusters
- No clear class separation
- Low Silhouette score (< 0.25)
- High Davies-Bouldin index (> 3.0)
- Mixed class colors in the same region
Common Patterns
Pattern Analysis Guide
Horseshoe Shape:
- Common artifact in high-dimensional data
- Not necessarily a problem
- Indicates data lies on a manifold
Sub-Clusters Within a Class:
- Suggests class diversity
- May indicate sub-families or variants
- Check if sub-clusters correspond to known variants
Outliers:
- Points far from any cluster
- May indicate:
- Mislabeled data
- Dataset contamination
- Novel variants
- Adversarial examples
Cluster Overlap:
- Similar families in same region
- Expected for related malware families
- Check confusion matrix to confirm
Performance Considerations
Computation Time
PCA:
- ~1 second for 1000 samples
- Linear scaling with samples and features

UMAP:
- ~10-30 seconds for 1000 samples
- Relatively good scaling to large datasets

t-SNE:
- ~30-180 seconds for 1000 samples
- Quadratic complexity, slow for large datasets
Memory Requirements
Feature matrices can be large:
- 1000 samples × 512 features × 4 bytes = ~2 MB
- 5000 samples × 2048 features × 4 bytes = ~40 MB
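These estimates can be reproduced with a one-line helper (float32 assumed, 4 bytes per value; feature_matrix_mb is an illustrative name):

```python
# Sketch: rough memory estimate for a float32 feature matrix.
def feature_matrix_mb(n_samples, n_features, bytes_per_value=4):
    return n_samples * n_features * bytes_per_value / 2**20

print(round(feature_matrix_mb(1000, 512), 1))   # ~2 MB
print(round(feature_matrix_mb(5000, 2048), 1))  # ~40 MB
```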
Integration with Other Tools
Grad-CAM
Investigate outliers or misclassifications with visual explanations
Activation Maps
Understand what features create the embedding space
Advanced Use Cases
Detecting Dataset Issues
Mislabeled Samples:
- Points with wrong color in tight clusters
- Consistently misclassified samples
- Outliers far from their class cluster
Class Imbalance:
- Some clusters much larger than others
- May need data augmentation or rebalancing
Distribution Shift:
- Test samples form separate cluster from training distribution
- Indicates domain adaptation needed
Model Comparison
Compare embeddings from different models:
- Better model → tighter, more separated clusters
- Overfitted model → may show artificial separation
- Undertrained model → merged or overlapping clusters
Feature Space Evolution
Track embeddings during training:
- Early training: Random, mixed clusters
- Mid training: Clusters begin to separate
- Late training: Clean, well-separated clusters
Technical Details
Feature Flattening
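Features from convolutional layers arrive with spatial dimensions, so they must be collapsed to a flat (N, D) matrix before any reducer runs. A minimal NumPy sketch of the idea (flatten_features is a hypothetical helper, not the platform's API):

```python
# Sketch: collapsing trailing spatial dims to (N, D) before reduction.
import numpy as np

def flatten_features(feats):
    """e.g. (N, C, H, W) conv features -> (N, C*H*W) matrix."""
    arr = np.asarray(feats)
    return arr.reshape(arr.shape[0], -1)

print(flatten_features(np.zeros((16, 8, 4, 4))).shape)  # (16, 128)
```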
Sample Limiting
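Because t-SNE scales quadratically, large datasets are subsampled before reduction. A sketch of seeded random subsampling (limit_samples is a hypothetical helper; the 500 default matches the UI default mentioned above):

```python
# Sketch: seeded random subsampling before the (slow) reducer runs.
import numpy as np

def limit_samples(X, y, max_samples=500, seed=42):
    if len(X) <= max_samples:
        return X, y
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    idx = rng.choice(len(X), size=max_samples, replace=False)
    return X[idx], y[idx]

X, y = np.zeros((2000, 64)), np.zeros(2000)
Xs, ys = limit_samples(X, y)
print(Xs.shape, ys.shape)  # (500, 64) (500,)
```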
References
- t-SNE Paper: Visualizing Data using t-SNE
- UMAP Paper: UMAP: Uniform Manifold Approximation and Projection
- Source Code: app/content/interpret/engine/embeddings.py
- UI Implementation: app/content/interpret/sections/embeddings.py
Combine t-SNE analysis with confusion matrix analysis and Grad-CAM for comprehensive model understanding.