Overview
The UC Intel Final platform provides three families of neural network architectures, each designed for different use cases and computational constraints:
Custom CNN - Build convolutional neural networks from scratch with configurable layer stacks
Transfer Learning - Fine-tune pre-trained models (VGG, ResNet, EfficientNet) for faster convergence
Vision Transformer - State-of-the-art transformer architecture with self-attention mechanisms
Base Model Interface
All models inherit from the BaseModel abstract class, ensuring consistent interfaces:
Location: app/models/base.py:11-71
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple

import torch.nn as nn


class BaseModel(ABC):
    """Abstract base class for model implementations"""

    def __init__(self, config: Dict[str, Any]):
        """
        Initialize model with configuration

        Args:
            config: Model configuration dictionary
        """
        self.config = config
        self.model = None

    @abstractmethod
    def build(self) -> nn.Module:
        """
        Build and return the model

        Returns:
            PyTorch model (nn.Module)
        """
        pass

    @abstractmethod
    def get_parameters_count(self) -> Tuple[int, int]:
        """
        Get total and trainable parameter counts

        Returns:
            Tuple of (total_params, trainable_params)
        """
        pass

    def get_model_summary(self) -> Dict[str, Any]:
        """Get model summary statistics"""
        if self.model is None:
            self.model = self.build()
        total_params, trainable_params = self.get_parameters_count()
        return {
            "total_parameters": total_params,
            "trainable_parameters": trainable_params,
            "model_type": self.config.get("model_type", "Unknown"),
            "architecture": self.config.get("architecture", "Unknown"),
            "num_classes": self.config.get("num_classes", 0),
        }
```
All models implement the same interface, making it easy to swap architectures during experimentation.
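To illustrate the contract, here is a minimal, torch-free sketch of a concrete subclass. The `DummyModel` class and its fixed parameter counts are hypothetical, for illustration only; a real subclass would return an `nn.Module` from `build()`:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class BaseModel(ABC):
    """Simplified, torch-free stand-in for the interface in app/models/base.py."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.model = None

    @abstractmethod
    def build(self): ...

    @abstractmethod
    def get_parameters_count(self) -> Tuple[int, int]: ...

    def get_model_summary(self) -> Dict[str, Any]:
        if self.model is None:
            self.model = self.build()  # lazy build on first summary call
        total, trainable = self.get_parameters_count()
        return {
            "total_parameters": total,
            "trainable_parameters": trainable,
            "model_type": self.config.get("model_type", "Unknown"),
            "architecture": self.config.get("architecture", "Unknown"),
            "num_classes": self.config.get("num_classes", 0),
        }


class DummyModel(BaseModel):
    """Hypothetical subclass: build() returns a placeholder object."""

    def build(self):
        return object()  # a real subclass returns an nn.Module here

    def get_parameters_count(self) -> Tuple[int, int]:
        return 1000, 800  # fixed counts, for demonstration only


summary = DummyModel({"model_type": "Dummy", "num_classes": 9}).get_model_summary()
print(summary["trainable_parameters"])  # 800
```

Note that `get_model_summary()` builds the model lazily, so callers can request a summary before training starts.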
Custom CNN
Overview
The Custom CNN builder allows you to construct convolutional neural networks from a layer stack configuration. This provides maximum flexibility for architecture experimentation.
Location: app/models/pytorch/cnn_builder.py
Architecture
Supported Layer Types
The builder supports six categories of layers: convolutional, pooling, normalization, regularization, transition, and dense.
Conv2D - 2D convolutional layer

Parameters:
filters (int): Number of output channels (default: 32)
kernel_size (int): Kernel size (default: 3)
activation (str): Activation function - "relu", "leaky_relu", "gelu", "swish" (default: "relu")
padding (str): "same" or "valid" (default: "same")

Implementation (app/models/pytorch/cnn_builder.py:178-194):

```python
def _build_conv2d(self, in_channels: int, params: dict) -> tuple:
    filters = params.get("filters", 32)
    kernel_size = params.get("kernel_size", 3)
    activation = params.get("activation", "relu")
    padding_mode = params.get("padding", "same")
    padding = kernel_size // 2 if padding_mode == "same" else 0
    layers = [
        nn.Conv2d(in_channels, filters,
                  kernel_size=kernel_size,
                  padding=padding),
        self._get_activation(activation)
    ]
    return nn.Sequential(*layers), filters
```
Output shape: (batch, filters, height, width)

MaxPooling2D - Max pooling layer

Parameters:
pool_size (int): Pooling window size (default: 2)

Implementation:

```python
layer = nn.MaxPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
```

Output shape: (batch, channels, height//pool_size, width//pool_size)

AveragePooling2D - Average pooling layer

Parameters:
pool_size (int): Pooling window size (default: 2)

Implementation:

```python
layer = nn.AvgPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
```
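The shape bookkeeping above follows the standard convolution output-size formula. A quick check with plain integer arithmetic (the helper names below are illustrative, not part of the builder):

```python
def conv2d_out(size: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Standard conv output-size rule: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size: int) -> int:
    """The builder's 'same' rule: padding = kernel_size // 2."""
    return kernel_size // 2


# 'same' padding keeps spatial size for odd kernels at stride 1
print(conv2d_out(28, 3, same_padding(3)))  # 28
# 'valid' (padding = 0) shrinks the feature map
print(conv2d_out(28, 3, 0))                # 26
# pooling halves the spatial size: current_spatial // pool_size
print(28 // 2)                             # 14
```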
BatchNorm - Batch normalization layer

Normalizes activations across the batch dimension.

Implementation (app/models/pytorch/cnn_builder.py:133-135):

```python
layer = nn.BatchNorm2d(current_channels)
self.feature_layers.append(layer)
```

Benefits:
Stabilizes training
Allows higher learning rates
Reduces internal covariate shift
Acts as regularization
Dropout - Dropout layer

Parameters:
rate (float): Dropout probability (default: 0.25)

Implementation (app/models/pytorch/cnn_builder.py:137-144):

```python
rate = params.get("rate", 0.25)
if in_classifier:
    layer = nn.Dropout(rate)    # 1D dropout for FC layers
    self.classifier_layers.append(layer)
else:
    layer = nn.Dropout2d(rate)  # 2D dropout for conv layers
    self.feature_layers.append(layer)
```

Usage:
Use Dropout2d (spatial dropout) after convolutional layers
Use regular Dropout after dense layers
Flatten - Flatten spatial dimensions

Converts (batch, channels, height, width) → (batch, channels * height * width)

Implementation (app/models/pytorch/cnn_builder.py:146-148):

```python
flatten_features = current_channels * current_spatial * current_spatial
in_classifier = True
```

Used in forward pass:

```python
x = torch.flatten(x, 1)  # Flatten all dims except batch
```

GlobalAvgPool - Global average pooling

Converts (batch, channels, height, width) → (batch, channels)

Implementation (app/models/pytorch/cnn_builder.py:150-152):

```python
flatten_features = current_channels
in_classifier = True
```

Used in forward pass:

```python
x = torch.mean(x, dim=[2, 3])  # Average over spatial dims
```

Advantage: Reduces parameters compared to Flatten

Dense - Fully connected layer

Parameters:
units (int): Number of output units (default: 256)
activation (str): Activation function (default: "relu")

Implementation (app/models/pytorch/cnn_builder.py:154-167):

```python
units = params.get("units", 256)
activation = params.get("activation", "relu")
layer = nn.Linear(flatten_features, units)
self.classifier_layers.append(layer)
self.classifier_layers.append(self._get_activation(activation))
flatten_features = units  # Update for next layer
```

Note: Must come after Flatten or GlobalAvgPool
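The parameter savings from GlobalAvgPool are easy to quantify. A sketch assuming a 128-channel, 7x7 feature map feeding a 256-unit Dense layer (the sizes are chosen for illustration):

```python
def dense_params(in_features: int, units: int) -> int:
    """Weights plus biases of an nn.Linear layer."""
    return in_features * units + units


channels, spatial, units = 128, 7, 256

# Flatten: (batch, 128, 7, 7) -> (batch, 6272)
flatten_in = channels * spatial * spatial
# GlobalAvgPool: (batch, 128, 7, 7) -> (batch, 128)
pooled_in = channels

print(dense_params(flatten_in, units))  # 1605888
print(dense_params(pooled_in, units))   # 33024
```

Here the first Dense layer is roughly 48x smaller with global pooling, which is why the transition choice dominates the classifier's parameter count.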
Activation Functions
Location: app/models/pytorch/cnn_builder.py:75-81

```python
ACTIVATION_MAP = {
    "relu": nn.ReLU(inplace=True),
    "leaky_relu": nn.LeakyReLU(0.1, inplace=True),
    "gelu": nn.GELU(),
    "swish": nn.SiLU(inplace=True),
    "none": nn.Identity(),
}
```
ReLU

Formula: f(x) = max(0, x)

Pros:
Fast computation
Sparse activation
Widely used

Cons:
Dying ReLU: units stuck at zero stop learning

Leaky ReLU

Formula: f(x) = x if x > 0 else 0.1x

Pros:
Fixes dying ReLU
Allows negative gradients

Use when: Training deep networks

GELU

Formula: f(x) = x * Φ(x) (Gaussian Error Linear Unit)

Pros:
Smooth activation
Better for transformers
State-of-the-art results

Use when: Using transformer-style architectures

Swish (SiLU)

Formula: f(x) = x * sigmoid(x)

Pros:
Self-gated activation
Smooth and non-monotonic
Often outperforms ReLU

Use when: Need smooth gradients
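The four formulas above can be reproduced in a few lines of plain Python. These are scalar versions for intuition only; the builder itself uses the `nn` modules from ACTIVATION_MAP:

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.1) -> float:
    return x if x > 0 else slope * x


def gelu(x: float) -> float:
    # x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float) -> float:
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


# For negative inputs, ReLU is exactly zero; the others keep a small signal,
# which is what avoids the dying-unit problem.
print(relu(-2.0), leaky_relu(-2.0), gelu(-2.0), swish(-2.0))
```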
Example Configuration
Simple CNN for MNIST-style data:

```python
config = {
    "model_type": "Custom CNN",
    "num_classes": 9,
    "cnn_config": {
        "layers": [
            # Block 1
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 2
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 3
            {"type": "Conv2D", "params": {"filters": 128, "kernel_size": 3, "activation": "relu"}},
            {"type": "GlobalAvgPool"},
            # Classifier
            {"type": "Dense", "params": {"units": 256, "activation": "relu"}},
            {"type": "Dropout", "params": {"rate": 0.5}},
        ]
    }
}
```
Parameter count: ~200K parameters
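That figure can be sanity-checked analytically by counting weights and biases for the layers that carry parameters. The sketch below assumes 3 input channels and a final 256→9 output layer appended by the builder (both assumptions, not stated in the config itself); it lands near 175K, in the same ballpark as the figure above:

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    return k * k * c_in * c_out + c_out  # weights + biases


def bn_params(channels: int) -> int:
    return 2 * channels                  # gamma + beta


def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out


total = (
    conv_params(3, 32) + conv_params(32, 32) + bn_params(32)     # Block 1
    + conv_params(32, 64) + conv_params(64, 64) + bn_params(64)  # Block 2
    + conv_params(64, 128)                                       # Block 3
    + dense_params(128, 256)  # Dense after GlobalAvgPool
    + dense_params(256, 9)    # assumed output layer (num_classes = 9)
)
print(total)  # 174953
```

Pooling, dropout, and GlobalAvgPool contribute no parameters, which is why they do not appear in the sum.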
Forward Pass
Location: app/models/pytorch/cnn_builder.py:205-232

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Forward pass

    Args:
        x: Input tensor of shape (batch, channels, height, width)

    Returns:
        Output logits of shape (batch, num_classes)
    """
    # Apply feature extraction layers
    for layer in self.feature_layers:
        x = layer(x)

    # Apply transition (flatten or global pool)
    if self.use_global_pool:
        x = torch.mean(x, dim=[2, 3])
    else:
        x = torch.flatten(x, 1)

    # Apply classifier layers
    for layer in self.classifier_layers:
        x = layer(x)

    # Output layer
    x = self.output_layer(x)
    return x
```
Transfer Learning
Overview
Transfer learning leverages pre-trained models trained on ImageNet (1.2M images, 1000 classes) to accelerate training and improve performance on smaller datasets.
Location: app/models/pytorch/transfer.py
Supported Base Models
Four families are available: VGG16/VGG19, ResNet50/ResNet101, InceptionV3, and EfficientNetB0.
VGG16 / VGG19

Architecture: Deep CNNs with small 3x3 filters

Characteristics:
16 or 19 layers
Simple, uniform architecture
Large number of parameters (~138M for VGG16)

Input size: 224x224
Feature dimensions: 512 (after global pooling)
Use when: Need simple, well-understood architecture

Implementation (app/models/pytorch/transfer.py:152-154):

```python
"VGG16": lambda: models.vgg16(pretrained=use_pretrained),
"VGG19": lambda: models.vgg19(pretrained=use_pretrained),
```
ResNet50 / ResNet101

Architecture: Residual connections to enable very deep networks

Characteristics:
50 or 101 layers
Skip connections prevent vanishing gradients
Moderate parameter count (~25M for ResNet50)

Input size: 224x224
Feature dimensions: 2048
Use when: Need deeper network with good performance/cost ratio

Implementation (app/models/pytorch/transfer.py:155-156):

```python
"ResNet50": lambda: models.resnet50(pretrained=use_pretrained),
"ResNet101": lambda: models.resnet101(pretrained=use_pretrained),
```

Residual block:

```
x --> Conv --> BN --> ReLU --> Conv --> BN --> (+) --> ReLU
|                                               ^
+-----------------------------------------------+
```
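The key idea in the diagram is that the block computes F(x) + x, so the identity path is always available even if the learned transform contributes nothing. A minimal functional sketch (pure Python scalars instead of tensors, and omitting the final ReLU):

```python
def residual_block(x: float, transform) -> float:
    """y = F(x) + x: the skip connection adds the input back."""
    return transform(x) + x


# Even if the learned transform collapses to zero, the block passes x through,
# which is what keeps gradients flowing in very deep stacks.
print(residual_block(3.0, lambda v: 0.0))      # 3.0
print(residual_block(2.0, lambda v: v * 0.5))  # 3.0
```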
InceptionV3

Architecture: Multi-scale feature extraction with inception modules

Characteristics:
Parallel convolutions at multiple scales
Factorized convolutions
Efficient parameter usage (~24M params)

Input size: 299x299 (different from the others!)
Feature dimensions: 2048
Use when: Need multi-scale features

Implementation (app/models/pytorch/transfer.py:157-159):

```python
"InceptionV3": lambda: models.inception_v3(
    pretrained=use_pretrained,
    aux_logits=False  # Disable auxiliary classifier
),
```
EfficientNetB0

Architecture: Compound scaling of depth, width, and resolution

Characteristics:
State-of-the-art efficiency
Mobile-friendly architecture
Few parameters (~5M for B0)

Input size: 224x224
Feature dimensions: 1280
Use when: Need efficient inference or limited compute

Implementation (app/models/pytorch/transfer.py:160):

```python
"EfficientNetB0": lambda: models.efficientnet_b0(pretrained=use_pretrained),
```

Scaling strategy: Jointly scale depth, width, and resolution
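The compound-scaling rule from the EfficientNet paper picks coefficients α (depth), β (width), and γ (resolution) subject to α·β²·γ² ≈ 2, then raises each to a scaling exponent φ. A quick check with the published coefficients (α=1.2, β=1.1, γ=1.15):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution coefficients

# The constraint: increasing phi by 1 roughly doubles FLOPs
constraint = alpha * beta**2 * gamma**2
print(round(constraint, 3))  # close to 2

phi = 1  # B1-level scaling
print(round(alpha**phi, 2), round(beta**phi, 2), round(gamma**phi, 2))
```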
Fine-Tuning Strategies
Location: app/models/pytorch/transfer.py:194-217
Custom Classifier Head
Location: app/models/pytorch/transfer.py:125-146

```python
# Build custom classifier head
classifier_layers = []

if global_pooling:
    self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
else:
    self.global_pool = None

if add_dense:
    # Two-layer classifier
    classifier_layers.extend([
        nn.Linear(in_features, dense_units),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(dense_units, num_classes)
    ])
else:
    # Single-layer classifier
    classifier_layers.extend([
        nn.Dropout(dropout),
        nn.Linear(in_features, num_classes)
    ])

self.classifier = nn.Sequential(*classifier_layers)
```
Options:
Global Pooling: Reduces spatial dimensions to 1x1
Extra Dense Layer: Adds capacity (useful for complex domains)
Dropout: Regularization (default: 0.5)
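The head options trade capacity for parameters. A quick comparison assuming ResNet50 features (in_features = 2048), dense_units = 256, and 10 classes (the class count here is illustrative):

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of an nn.Linear layer."""
    return n_in * n_out + n_out


in_features, dense_units, num_classes = 2048, 256, 10

# Single-layer head: Dropout -> Linear(in_features, num_classes)
single = linear_params(in_features, num_classes)

# Two-layer head: Linear -> ReLU -> Dropout -> Linear
two_layer = (linear_params(in_features, dense_units)
             + linear_params(dense_units, num_classes))

print(single)     # 20490
print(two_layer)  # 527114
```

The extra Dense layer multiplies the head's size by roughly 25x in this configuration, so it is worth adding only when the target domain genuinely needs the capacity.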
Forward Pass
Location: app/models/pytorch/transfer.py:219-243

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Extract features with frozen/unfrozen base model
    features = self.base_model(x)

    # Apply global pooling if needed
    if self.global_pool is not None and len(features.shape) == 4:
        features = self.global_pool(features)
        features = torch.flatten(features, 1)
    elif len(features.shape) == 4:
        features = torch.flatten(features, 1)

    # Apply custom classifier
    output = self.classifier(features)
    return output
```
Vision Transformer
Overview
Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) to image classification by treating images as sequences of patches.
Location: app/models/pytorch/transformer.py
Paper: "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2020)
Architecture
Patch Embedding
Location: app/models/pytorch/transformer.py:72-114
Converts 2D image into sequence of patch embeddings:
```python
class PatchEmbedding(nn.Module):
    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,   # 16x16 patches
        in_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2  # 196 for 224x224

        # Use convolution to extract and embed patches
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size  # Non-overlapping patches
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P)
        x = self.proj(x)
        # (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = x.flatten(2).transpose(1, 2)
        return x
```
Example:
Input: (1, 3, 224, 224)
After projection: (1, 768, 14, 14)
After flatten: (1, 196, 768)
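The shape walk-through follows directly from the patch arithmetic, and can be checked with plain integers mirroring the defaults above:

```python
image_size, patch_size, in_channels, embed_dim = 224, 16, 3, 768

# Number of non-overlapping patches: (H/P) * (W/P)
num_patches = (image_size // patch_size) ** 2
# Raw pixel values per patch, before projection
patch_dim = patch_size * patch_size * in_channels

print(num_patches)      # 196
print(patch_dim)        # 768 (equals embed_dim only by coincidence of the defaults)
print(num_patches + 1)  # 197 tokens once the CLS token is prepended
```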
Multi-Head Self-Attention
Location: app/models/pytorch/transformer.py:117-164

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Single linear layer to compute Q, K, V
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.attn_drop = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape

        # Generate Q, K, V
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # Each: (B, num_heads, N, head_dim)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # Apply attention to values
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        # Output projection
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
```
Attention mechanism:
1. Linear projection to Q, K, V
2. Split into multiple heads
3. Compute attention scores: Attention(Q, K, V) = softmax(QK^T / √d_k)V
4. Concatenate heads
5. Output projection
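The core of these steps reduces to a few matrix operations. A dependency-free sketch of single-head scaled dot-product attention on tiny lists (the real module works on batched tensors with multiple heads):

```python
import math


def matmul(a, b):
    """Naive matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]                   # K^T
    scores = matmul(q, k_t)                                # Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]             # each row sums to 1
    return matmul(weights, v), weights


# Two tokens, head_dim = 2
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(q, k, v)
print([round(x, 3) for x in w[0]])  # token 0 attends mostly to itself
```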
Transformer Block
Location: app/models/pytorch/transformer.py:195-220
```python
class TransformerBlock(nn.Module):
    def __init__(
        self,
        embed_dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLP(
            in_features=embed_dim,
            hidden_features=int(embed_dim * mlp_ratio),  # 3072 for 768-dim
            dropout=dropout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention with residual (pre-norm)
        x = x + self.attn(self.norm1(x))
        # MLP with residual (pre-norm)
        x = x + self.mlp(self.norm2(x))
        return x
```
Structure : LayerNorm → Attention → Residual → LayerNorm → MLP → Residual
Configuration Options
ViT-Base

Configuration:
Patch size: 16
Embed dim: 768
Depth: 12 blocks
Heads: 12
MLP ratio: 4.0

Parameters: ~86M
Use when: Standard accuracy/speed tradeoff

ViT-Large

Configuration:
Patch size: 16
Embed dim: 1024
Depth: 24 blocks
Heads: 16
MLP ratio: 4.0

Parameters: ~307M
Use when: Maximum accuracy, large dataset

ViT-Small

Configuration:
Patch size: 16
Embed dim: 384
Depth: 12 blocks
Heads: 6
MLP ratio: 4.0

Parameters: ~22M
Use when: Limited compute, faster inference

Custom

Configurable parameters:
Patch size (8, 16, 32)
Embed dimension
Number of blocks
Number of heads
MLP ratio
Dropout rate

Use when: Specific requirements
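These parameter figures are dominated by the transformer blocks, so they can be estimated from embed_dim and depth alone. A sketch that counts one block's qkv projection, output projection, MLP, and LayerNorm parameters (patch embedding, position embeddings, and the classification head add a few more on top):

```python
def block_params(embed_dim: int, mlp_ratio: float = 4.0) -> int:
    """Approximate trainable parameters in one pre-norm transformer block."""
    hidden = int(embed_dim * mlp_ratio)
    qkv = embed_dim * 3 * embed_dim + 3 * embed_dim  # fused Q, K, V projection
    proj = embed_dim * embed_dim + embed_dim         # attention output projection
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    norms = 2 * 2 * embed_dim                        # two LayerNorms (weight + bias)
    return qkv + proj + mlp + norms


small = 12 * block_params(384)   # ViT-Small blocks: ~21M
base = 12 * block_params(768)    # ViT-Base blocks:  ~85M
large = 24 * block_params(1024)  # ViT-Large blocks: ~302M
print(small, base, large)
```

The estimates land just under the quoted totals, with the embeddings and head accounting for the remainder.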
Forward Pass
Location: app/models/pytorch/transformer.py:304-338

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    B = x.shape[0]

    # 1. Patch embedding
    x = self.patch_embed(x)  # (B, num_patches, embed_dim)

    # 2. Add CLS token
    cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
    x = torch.cat((cls_tokens, x), dim=1)  # (B, num_patches + 1, embed_dim)

    # 3. Add position embeddings
    x = x + self.pos_embed
    x = self.pos_drop(x)

    # 4. Apply transformer blocks
    for block in self.blocks:
        x = block(x)

    # 5. Normalize
    x = self.norm(x)

    # 6. Extract CLS token and classify
    cls_output = x[:, 0]  # (B, embed_dim)
    x = self.head(cls_output)  # (B, num_classes)
    return x
```
Model Selection Guide
Dataset Size

Small (<1000 images/class):
✅ Transfer Learning (Feature Extraction)
✅ Transfer Learning (Partial Fine-tuning)
⚠️ Custom CNN (risk of overfitting)
❌ Vision Transformer (requires large dataset)

Medium (1000-5000 images/class):
✅ Transfer Learning (Partial/Full Fine-tuning)
✅ Custom CNN (with regularization)
⚠️ Vision Transformer (may underperform)

Large (>5000 images/class):
✅ All architectures
✅ Vision Transformer (best performance)
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN (deep architectures)

Computational Budget

Low (CPU, <8GB RAM):
✅ Transfer Learning (Feature Extraction, small models)
✅ Custom CNN (shallow, <1M params)
⚠️ EfficientNetB0
❌ Vision Transformer
❌ Large ResNets

Medium (GPU, 8-16GB VRAM):
✅ Transfer Learning (all strategies)
✅ Custom CNN (deep)
✅ ViT-Small
⚠️ ViT-Base (small batch size)

High (GPU, >16GB VRAM):
✅ All architectures
✅ Large batch sizes
✅ ViT-Large

Domain Similarity

Similar to ImageNet (natural images):
✅ Transfer Learning (Feature Extraction)
Early layers capture generic features

Somewhat different (medical, satellite):
✅ Transfer Learning (Partial Fine-tuning)
✅ Custom CNN
Adapt mid-to-high level features

Very different (grayscale, textures):
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN
✅ Vision Transformer (if enough data)
Need to learn domain-specific features

Inference Speed

Real-time required (<50ms):
✅ EfficientNetB0
✅ Custom CNN (shallow)
⚠️ ResNet50 (optimized)
❌ Vision Transformer
❌ Large models

Batch processing OK (>100ms):
✅ All architectures
Optimize for accuracy over speed
Typical Results on Malware Dataset
| Architecture | Parameters | Training Time | Accuracy | GPU Memory |
|---|---|---|---|---|
| Custom CNN (Small) | ~200K | 1-2 hours | 85-88% | 2 GB |
| Custom CNN (Deep) | ~2M | 3-4 hours | 88-91% | 4 GB |
| ResNet50 (Feature Ext.) | ~25M | 1-2 hours | 90-93% | 4 GB |
| ResNet50 (Partial FT) | ~25M | 3-5 hours | 92-95% | 6 GB |
| ResNet50 (Full FT) | ~25M | 6-10 hours | 93-96% | 8 GB |
| EfficientNetB0 | ~5M | 2-4 hours | 91-94% | 3 GB |
| ViT-Small | ~22M | 8-12 hours | 90-93% | 8 GB |
| ViT-Base | ~86M | 12-24 hours | 94-97% | 16 GB |
Results vary based on dataset size, quality, and training configuration. These are representative ranges.
References
Custom CNN implementation: app/models/pytorch/cnn_builder.py
Transfer learning implementation: app/models/pytorch/transfer.py
Vision Transformer implementation: app/models/pytorch/transformer.py
Base model interface: app/models/base.py
Model building in training worker: app/training/worker.py:29-42