Overview
The UC Intel Final platform provides three families of neural network architectures, each designed for different use cases and computational constraints:
Custom CNN - Build convolutional neural networks from scratch with configurable layer stacks
Transfer Learning - Fine-tune pre-trained models (VGG, ResNet, EfficientNet) for faster convergence
Vision Transformer - State-of-the-art transformer architecture with self-attention mechanisms
Base Model Interface
All models inherit from the BaseModel abstract class, ensuring consistent interfaces:
Location: app/models/base.py:11-71
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple

import torch.nn as nn


class BaseModel(ABC):
    """Abstract base class for model implementations"""

    def __init__(self, config: Dict[str, Any]):
        """
        Initialize model with configuration

        Args:
            config: Model configuration dictionary
        """
        self.config = config
        self.model = None

    @abstractmethod
    def build(self) -> nn.Module:
        """
        Build and return the model

        Returns:
            PyTorch model (nn.Module)
        """
        pass

    @abstractmethod
    def get_parameters_count(self) -> Tuple[int, int]:
        """
        Get total and trainable parameter counts

        Returns:
            Tuple of (total_params, trainable_params)
        """
        pass

    def get_model_summary(self) -> Dict[str, Any]:
        """Get model summary statistics"""
        if self.model is None:
            self.model = self.build()
        total_params, trainable_params = self.get_parameters_count()
        return {
            "total_parameters": total_params,
            "trainable_parameters": trainable_params,
            "model_type": self.config.get("model_type", "Unknown"),
            "architecture": self.config.get("architecture", "Unknown"),
            "num_classes": self.config.get("num_classes", 0),
        }
```
All models implement the same interface, making it easy to swap architectures during experimentation.
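To illustrate the contract, here is a minimal, torch-free sketch of a concrete subclass. The `DummyModel` class and its fixed parameter counts are hypothetical, for illustration only; a real subclass would return an `nn.Module` from `build()`:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class BaseModel(ABC):
    """Simplified, torch-free stand-in for the interface in app/models/base.py."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.model = None

    @abstractmethod
    def build(self): ...

    @abstractmethod
    def get_parameters_count(self) -> Tuple[int, int]: ...

    def get_model_summary(self) -> Dict[str, Any]:
        if self.model is None:
            self.model = self.build()  # lazy build on first summary call
        total, trainable = self.get_parameters_count()
        return {
            "total_parameters": total,
            "trainable_parameters": trainable,
            "model_type": self.config.get("model_type", "Unknown"),
            "architecture": self.config.get("architecture", "Unknown"),
            "num_classes": self.config.get("num_classes", 0),
        }


class DummyModel(BaseModel):
    """Hypothetical subclass: build() returns a placeholder object."""

    def build(self):
        return object()  # a real subclass returns an nn.Module here

    def get_parameters_count(self) -> Tuple[int, int]:
        return 1000, 800  # fixed counts, for demonstration only


summary = DummyModel({"model_type": "Dummy", "num_classes": 9}).get_model_summary()
print(summary["trainable_parameters"])  # 800
```

Note that `get_model_summary()` builds the model lazily, so callers can request a summary before training starts.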
Custom CNN
Overview
The Custom CNN builder allows you to construct convolutional neural networks from a layer stack configuration. This provides maximum flexibility for architecture experimentation.
Location: app/models/pytorch/cnn_builder.py
Architecture
Supported Layer Types
The builder supports six categories of layers: convolutional, pooling, normalization, regularization, transition, and dense.
Conv2D - 2D convolutional layer

Parameters:
filters (int): Number of output channels (default: 32)
kernel_size (int): Kernel size (default: 3)
activation (str): Activation function - "relu", "leaky_relu", "gelu", "swish" (default: "relu")
padding (str): "same" or "valid" (default: "same")

Implementation (app/models/pytorch/cnn_builder.py:178-194):

```python
def _build_conv2d(self, in_channels: int, params: dict) -> tuple:
    filters = params.get("filters", 32)
    kernel_size = params.get("kernel_size", 3)
    activation = params.get("activation", "relu")
    padding_mode = params.get("padding", "same")
    padding = kernel_size // 2 if padding_mode == "same" else 0
    layers = [
        nn.Conv2d(in_channels, filters,
                  kernel_size=kernel_size,
                  padding=padding),
        self._get_activation(activation)
    ]
    return nn.Sequential(*layers), filters
```
Output shape: (batch, filters, height, width)

MaxPooling2D - Max pooling layer

Parameters:
pool_size (int): Pooling window size (default: 2)

Implementation:

```python
layer = nn.MaxPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
```

Output shape: (batch, channels, height//pool_size, width//pool_size)

AveragePooling2D - Average pooling layer

Parameters:
pool_size (int): Pooling window size (default: 2)

Implementation:

```python
layer = nn.AvgPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
```
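The shape bookkeeping above follows the standard convolution output-size formula. A quick check with plain integer arithmetic (the helper names below are illustrative, not part of the builder):

```python
def conv2d_out(size: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Standard conv output-size rule: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size: int) -> int:
    """The builder's 'same' rule: padding = kernel_size // 2."""
    return kernel_size // 2


# 'same' padding keeps spatial size for odd kernels at stride 1
print(conv2d_out(28, 3, same_padding(3)))  # 28
# 'valid' (padding = 0) shrinks the feature map
print(conv2d_out(28, 3, 0))                # 26
# pooling halves the spatial size: current_spatial // pool_size
print(28 // 2)                             # 14
```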
BatchNorm - Batch normalization layer

Normalizes activations across the batch dimension.

Implementation (app/models/pytorch/cnn_builder.py:133-135):

```python
layer = nn.BatchNorm2d(current_channels)
self.feature_layers.append(layer)
```

Benefits:
Stabilizes training
Allows higher learning rates
Reduces internal covariate shift
Acts as regularization
Dropout - Dropout layer

Parameters:
rate (float): Dropout probability (default: 0.25)

Implementation (app/models/pytorch/cnn_builder.py:137-144):

```python
rate = params.get("rate", 0.25)
if in_classifier:
    layer = nn.Dropout(rate)    # 1D dropout for FC layers
    self.classifier_layers.append(layer)
else:
    layer = nn.Dropout2d(rate)  # 2D dropout for conv layers
    self.feature_layers.append(layer)
```

Usage:
Use Dropout2d (spatial dropout) after convolutional layers
Use regular Dropout after dense layers
Flatten - Flatten spatial dimensions

Converts (batch, channels, height, width) → (batch, channels * height * width)

Implementation (app/models/pytorch/cnn_builder.py:146-148):

```python
flatten_features = current_channels * current_spatial * current_spatial
in_classifier = True
```

Used in forward pass:

```python
x = torch.flatten(x, 1)  # Flatten all dims except batch
```

GlobalAvgPool - Global average pooling

Converts (batch, channels, height, width) → (batch, channels)

Implementation (app/models/pytorch/cnn_builder.py:150-152):

```python
flatten_features = current_channels
in_classifier = True
```

Used in forward pass:

```python
x = torch.mean(x, dim=[2, 3])  # Average over spatial dims
```

Advantage: Reduces parameters compared to Flatten

Dense - Fully connected layer

Parameters:
units (int): Number of output units (default: 256)
activation (str): Activation function (default: "relu")

Implementation (app/models/pytorch/cnn_builder.py:154-167):

```python
units = params.get("units", 256)
activation = params.get("activation", "relu")
layer = nn.Linear(flatten_features, units)
self.classifier_layers.append(layer)
self.classifier_layers.append(self._get_activation(activation))
flatten_features = units  # Update for next layer
```

Note: Must come after Flatten or GlobalAvgPool
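The parameter savings from GlobalAvgPool are easy to quantify. A sketch assuming a 128-channel, 7x7 feature map feeding a 256-unit Dense layer (the sizes are chosen for illustration):

```python
def dense_params(in_features: int, units: int) -> int:
    """Weights plus biases of an nn.Linear layer."""
    return in_features * units + units


channels, spatial, units = 128, 7, 256

# Flatten: (batch, 128, 7, 7) -> (batch, 6272)
flatten_in = channels * spatial * spatial
# GlobalAvgPool: (batch, 128, 7, 7) -> (batch, 128)
pooled_in = channels

print(dense_params(flatten_in, units))  # 1605888
print(dense_params(pooled_in, units))   # 33024
```

Here the first Dense layer is roughly 48x smaller with global pooling, which is why the transition choice dominates the classifier's parameter count.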
Activation Functions
Location: app/models/pytorch/cnn_builder.py:75-81

```python
ACTIVATION_MAP = {
    "relu": nn.ReLU(inplace=True),
    "leaky_relu": nn.LeakyReLU(0.1, inplace=True),
    "gelu": nn.GELU(),
    "swish": nn.SiLU(inplace=True),
    "none": nn.Identity(),
}
```
ReLU

Formula: f(x) = max(0, x)

Pros:
Fast computation
Sparse activation
Widely used

Cons:
Dying ReLU: units stuck at zero stop learning

Leaky ReLU

Formula: f(x) = x if x > 0 else 0.1x

Pros:
Fixes dying ReLU
Allows negative gradients

Use when: Training deep networks

GELU

Formula: f(x) = x * Φ(x) (Gaussian Error Linear Unit)

Pros:
Smooth activation
Better for transformers
State-of-the-art results

Use when: Using transformer-style architectures

Swish (SiLU)

Formula: f(x) = x * sigmoid(x)

Pros:
Self-gated activation
Smooth and non-monotonic
Often outperforms ReLU

Use when: Need smooth gradients
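The four formulas above can be reproduced in a few lines of plain Python. These are scalar versions for intuition only; the builder itself uses the `nn` modules from ACTIVATION_MAP:

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.1) -> float:
    return x if x > 0 else slope * x


def gelu(x: float) -> float:
    # x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float) -> float:
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


# For negative inputs, ReLU is exactly zero; the others keep a small signal,
# which is what avoids the dying-unit problem.
print(relu(-2.0), leaky_relu(-2.0), gelu(-2.0), swish(-2.0))
```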
Example Configuration
Simple CNN for MNIST-style data:

```python
config = {
    "model_type": "Custom CNN",
    "num_classes": 9,
    "cnn_config": {
        "layers": [
            # Block 1
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 2
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 3
            {"type": "Conv2D", "params": {"filters": 128, "kernel_size": 3, "activation": "relu"}},
            {"type": "GlobalAvgPool"},
            # Classifier
            {"type": "Dense", "params": {"units": 256, "activation": "relu"}},
            {"type": "Dropout", "params": {"rate": 0.5}},
        ]
    }
}
```
Parameter count: ~200K parameters
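That figure can be sanity-checked analytically by counting weights and biases for the layers that carry parameters. The sketch below assumes 3 input channels and a final 256→9 output layer appended by the builder (both assumptions, not stated in the config itself); it lands near 175K, in the same ballpark as the figure above:

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    return k * k * c_in * c_out + c_out  # weights + biases


def bn_params(channels: int) -> int:
    return 2 * channels                  # gamma + beta


def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out


total = (
    conv_params(3, 32) + conv_params(32, 32) + bn_params(32)     # Block 1
    + conv_params(32, 64) + conv_params(64, 64) + bn_params(64)  # Block 2
    + conv_params(64, 128)                                       # Block 3
    + dense_params(128, 256)  # Dense after GlobalAvgPool
    + dense_params(256, 9)    # assumed output layer (num_classes = 9)
)
print(total)  # 174953
```

Pooling, dropout, and GlobalAvgPool contribute no parameters, which is why they do not appear in the sum.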
Forward Pass
Location: app/models/pytorch/cnn_builder.py:205-232

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Forward pass

    Args:
        x: Input tensor of shape (batch, channels, height, width)

    Returns:
        Output logits of shape (batch, num_classes)
    """
    # Apply feature extraction layers
    for layer in self.feature_layers:
        x = layer(x)

    # Apply transition (flatten or global pool)
    if self.use_global_pool:
        x = torch.mean(x, dim=[2, 3])
    else:
        x = torch.flatten(x, 1)

    # Apply classifier layers
    for layer in self.classifier_layers:
        x = layer(x)

    # Output layer
    x = self.output_layer(x)
    return x
```
Transfer Learning
Overview
Transfer learning leverages pre-trained models trained on ImageNet (1.2M images, 1000 classes) to accelerate training and improve performance on smaller datasets.
Location: app/models/pytorch/transfer.py
Supported Base Models
Four families are available: VGG16/VGG19, ResNet50/ResNet101, InceptionV3, and EfficientNetB0.
VGG16 / VGG19

Architecture: Deep CNNs with small 3x3 filters

Characteristics:
16 or 19 layers
Simple, uniform architecture
Large number of parameters (~138M for VGG16)

Input size: 224x224
Feature dimensions: 512 (after global pooling)
Use when: Need simple, well-understood architecture

Implementation (app/models/pytorch/transfer.py:152-154):

```python
"VGG16": lambda: models.vgg16(pretrained=use_pretrained),
"VGG19": lambda: models.vgg19(pretrained=use_pretrained),
```
ResNet50 / ResNet101

Architecture: Residual connections to enable very deep networks

Characteristics:
50 or 101 layers
Skip connections prevent vanishing gradients
Moderate parameter count (~25M for ResNet50)

Input size: 224x224
Feature dimensions: 2048
Use when: Need deeper network with good performance/cost ratio

Implementation (app/models/pytorch/transfer.py:155-156):

```python
"ResNet50": lambda: models.resnet50(pretrained=use_pretrained),
"ResNet101": lambda: models.resnet101(pretrained=use_pretrained),
```

Residual block:

```
x --> Conv --> BN --> ReLU --> Conv --> BN --> (+) --> ReLU
|                                               ^
+-----------------------------------------------+
```
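The key idea in the diagram is that the block computes F(x) + x, so the identity path is always available even if the learned transform contributes nothing. A minimal functional sketch (pure Python scalars instead of tensors, and omitting the final ReLU):

```python
def residual_block(x: float, transform) -> float:
    """y = F(x) + x: the skip connection adds the input back."""
    return transform(x) + x


# Even if the learned transform collapses to zero, the block passes x through,
# which is what keeps gradients flowing in very deep stacks.
print(residual_block(3.0, lambda v: 0.0))      # 3.0
print(residual_block(2.0, lambda v: v * 0.5))  # 3.0
```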
InceptionV3

Architecture: Multi-scale feature extraction with inception modules

Characteristics:
Parallel convolutions at multiple scales
Factorized convolutions
Efficient parameter usage (~24M params)

Input size: 299x299 (different from the others!)
Feature dimensions: 2048
Use when: Need multi-scale features

Implementation (app/models/pytorch/transfer.py:157-159):

```python
"InceptionV3": lambda: models.inception_v3(
    pretrained=use_pretrained,
    aux_logits=False  # Disable auxiliary classifier
),
```
EfficientNetB0

Architecture: Compound scaling of depth, width, and resolution

Characteristics:
State-of-the-art efficiency
Mobile-friendly architecture
Few parameters (~5M for B0)

Input size: 224x224
Feature dimensions: 1280
Use when: Need efficient inference or limited compute

Implementation (app/models/pytorch/transfer.py:160):

```python
"EfficientNetB0": lambda: models.efficientnet_b0(pretrained=use_pretrained),
```

Scaling strategy: Jointly scale depth, width, and resolution
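The compound-scaling rule from the EfficientNet paper picks coefficients α (depth), β (width), and γ (resolution) subject to α·β²·γ² ≈ 2, then raises each to a scaling exponent φ. A quick check with the published coefficients (α=1.2, β=1.1, γ=1.15):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution coefficients

# The constraint: increasing phi by 1 roughly doubles FLOPs
constraint = alpha * beta**2 * gamma**2
print(round(constraint, 3))  # close to 2

phi = 1  # B1-level scaling
print(round(alpha**phi, 2), round(beta**phi, 2), round(gamma**phi, 2))
```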
Fine-Tuning Strategies
Location: app/models/pytorch/transfer.py:194-217
Custom Classifier Head
Location: app/models/pytorch/transfer.py:125-146

```python
# Build custom classifier head
classifier_layers = []

if global_pooling:
    self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
else:
    self.global_pool = None

if add_dense:
    # Two-layer classifier
    classifier_layers.extend([
        nn.Linear(in_features, dense_units),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(dense_units, num_classes)
    ])
else:
    # Single-layer classifier
    classifier_layers.extend([
        nn.Dropout(dropout),
        nn.Linear(in_features, num_classes)
    ])

self.classifier = nn.Sequential(*classifier_layers)
```
Options:
Global Pooling: Reduces spatial dimensions to 1x1
Extra Dense Layer: Adds capacity (useful for complex domains)
Dropout: Regularization (default: 0.5)
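The head options trade capacity for parameters. A quick comparison assuming ResNet50 features (in_features = 2048), dense_units = 256, and 10 classes (the class count here is illustrative):

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of an nn.Linear layer."""
    return n_in * n_out + n_out


in_features, dense_units, num_classes = 2048, 256, 10

# Single-layer head: Dropout -> Linear(in_features, num_classes)
single = linear_params(in_features, num_classes)

# Two-layer head: Linear -> ReLU -> Dropout -> Linear
two_layer = (linear_params(in_features, dense_units)
             + linear_params(dense_units, num_classes))

print(single)     # 20490
print(two_layer)  # 527114
```

The extra Dense layer multiplies the head's size by roughly 25x in this configuration, so it is worth adding only when the target domain genuinely needs the capacity.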
Forward Pass
Location: app/models/pytorch/transfer.py:219-243

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Extract features with frozen/unfrozen base model
    features = self.base_model(x)

    # Apply global pooling if needed
    if self.global_pool is not None and len(features.shape) == 4:
        features = self.global_pool(features)
        features = torch.flatten(features, 1)
    elif len(features.shape) == 4:
        features = torch.flatten(features, 1)

    # Apply custom classifier
    output = self.classifier(features)
    return output
```
Vision Transformer
Overview
Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) to image classification by treating images as sequences of patches.
Location: app/models/pytorch/transformer.py
Paper: "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2020)
Architecture
Patch Embedding
Location: app/models/pytorch/transformer.py:72-114
Converts 2D image into sequence of patch embeddings:
```python
class PatchEmbedding(nn.Module):
    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,   # 16x16 patches
        in_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2  # 196 for 224x224

        # Use convolution to extract and embed patches
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size  # Non-overlapping patches
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P)
        x = self.proj(x)
        # (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = x.flatten(2).transpose(1, 2)
        return x
```
Example:
Input: (1, 3, 224, 224)
After projection: (1, 768, 14, 14)
After flatten: (1, 196, 768)
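The shape walk-through follows directly from the patch arithmetic, and can be checked with plain integers mirroring the defaults above:

```python
image_size, patch_size, in_channels, embed_dim = 224, 16, 3, 768

# Number of non-overlapping patches: (H/P) * (W/P)
num_patches = (image_size // patch_size) ** 2
# Raw pixel values per patch, before projection
patch_dim = patch_size * patch_size * in_channels

print(num_patches)      # 196
print(patch_dim)        # 768 (equals embed_dim only by coincidence of the defaults)
print(num_patches + 1)  # 197 tokens once the CLS token is prepended
```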
Multi-Head Self-Attention
Location: app/models/pytorch/transformer.py:117-164

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Single linear layer to compute Q, K, V
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.attn_drop = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape

        # Generate Q, K, V
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # Each: (B, num_heads, N, head_dim)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # Apply attention to values
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        # Output projection
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
```
Attention mechanism:
1. Linear projection to Q, K, V
2. Split into multiple heads
3. Compute attention scores: Attention(Q, K, V) = softmax(QK^T / √d_k)V
4. Concatenate heads
5. Output projection
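The core of these steps reduces to a few matrix operations. A dependency-free sketch of single-head scaled dot-product attention on tiny lists (the real module works on batched tensors with multiple heads):

```python
import math


def matmul(a, b):
    """Naive matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]                   # K^T
    scores = matmul(q, k_t)                                # Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]             # each row sums to 1
    return matmul(weights, v), weights


# Two tokens, head_dim = 2
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(q, k, v)
print([round(x, 3) for x in w[0]])  # token 0 attends mostly to itself
```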
Transformer Block
Location: app/models/pytorch/transformer.py:195-220
```python
class TransformerBlock(nn.Module):
    def __init__(
        self,
        embed_dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLP(
            in_features=embed_dim,
            hidden_features=int(embed_dim * mlp_ratio),  # 3072 for 768-dim
            dropout=dropout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention with residual (pre-norm)
        x = x + self.attn(self.norm1(x))
        # MLP with residual (pre-norm)
        x = x + self.mlp(self.norm2(x))
        return x
```
Structure : LayerNorm → Attention → Residual → LayerNorm → MLP → Residual
Configuration Options
ViT-Base

Configuration:
Patch size: 16
Embed dim: 768
Depth: 12 blocks
Heads: 12
MLP ratio: 4.0

Parameters: ~86M
Use when: Standard accuracy/speed tradeoff

ViT-Large

Configuration:
Patch size: 16
Embed dim: 1024
Depth: 24 blocks
Heads: 16
MLP ratio: 4.0

Parameters: ~307M
Use when: Maximum accuracy, large dataset

ViT-Small

Configuration:
Patch size: 16
Embed dim: 384
Depth: 12 blocks
Heads: 6
MLP ratio: 4.0

Parameters: ~22M
Use when: Limited compute, faster inference

Custom

Configurable parameters:
Patch size (8, 16, 32)
Embed dimension
Number of blocks
Number of heads
MLP ratio
Dropout rate

Use when: Specific requirements
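These parameter figures are dominated by the transformer blocks, so they can be estimated from embed_dim and depth alone. A sketch that counts one block's qkv projection, output projection, MLP, and LayerNorm parameters (patch embedding, position embeddings, and the classification head add a few more on top):

```python
def block_params(embed_dim: int, mlp_ratio: float = 4.0) -> int:
    """Approximate trainable parameters in one pre-norm transformer block."""
    hidden = int(embed_dim * mlp_ratio)
    qkv = embed_dim * 3 * embed_dim + 3 * embed_dim  # fused Q, K, V projection
    proj = embed_dim * embed_dim + embed_dim         # attention output projection
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    norms = 2 * 2 * embed_dim                        # two LayerNorms (weight + bias)
    return qkv + proj + mlp + norms


small = 12 * block_params(384)   # ViT-Small blocks: ~21M
base = 12 * block_params(768)    # ViT-Base blocks:  ~85M
large = 24 * block_params(1024)  # ViT-Large blocks: ~302M
print(small, base, large)
```

The estimates land just under the quoted totals, with the embeddings and head accounting for the remainder.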
Forward Pass
Location: app/models/pytorch/transformer.py:304-338

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    B = x.shape[0]

    # 1. Patch embedding
    x = self.patch_embed(x)  # (B, num_patches, embed_dim)

    # 2. Add CLS token
    cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
    x = torch.cat((cls_tokens, x), dim=1)  # (B, num_patches + 1, embed_dim)

    # 3. Add position embeddings
    x = x + self.pos_embed
    x = self.pos_drop(x)

    # 4. Apply transformer blocks
    for block in self.blocks:
        x = block(x)

    # 5. Normalize
    x = self.norm(x)

    # 6. Extract CLS token and classify
    cls_output = x[:, 0]  # (B, embed_dim)
    x = self.head(cls_output)  # (B, num_classes)
    return x
```
Model Selection Guide
Dataset Size

Small (<1000 images/class):
✅ Transfer Learning (Feature Extraction)
✅ Transfer Learning (Partial Fine-tuning)
⚠️ Custom CNN (risk of overfitting)
❌ Vision Transformer (requires large dataset)

Medium (1000-5000 images/class):
✅ Transfer Learning (Partial/Full Fine-tuning)
✅ Custom CNN (with regularization)
⚠️ Vision Transformer (may underperform)

Large (>5000 images/class):
✅ All architectures
✅ Vision Transformer (best performance)
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN (deep architectures)

Computational Budget

Low (CPU, <8GB RAM):
✅ Transfer Learning (Feature Extraction, small models)
✅ Custom CNN (shallow, <1M params)
⚠️ EfficientNetB0
❌ Vision Transformer
❌ Large ResNets

Medium (GPU, 8-16GB VRAM):
✅ Transfer Learning (all strategies)
✅ Custom CNN (deep)
✅ ViT-Small
⚠️ ViT-Base (small batch size)

High (GPU, >16GB VRAM):
✅ All architectures
✅ Large batch sizes
✅ ViT-Large

Domain Similarity

Similar to ImageNet (natural images):
✅ Transfer Learning (Feature Extraction)
Early layers capture generic features

Somewhat different (medical, satellite):
✅ Transfer Learning (Partial Fine-tuning)
✅ Custom CNN
Adapt mid-to-high level features

Very different (grayscale, textures):
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN
✅ Vision Transformer (if enough data)
Need to learn domain-specific features

Inference Speed

Real-time required (<50ms):
✅ EfficientNetB0
✅ Custom CNN (shallow)
⚠️ ResNet50 (optimized)
❌ Vision Transformer
❌ Large models

Batch processing OK (>100ms):
✅ All architectures
Optimize for accuracy over speed
Typical Results on Malware Dataset
| Architecture | Parameters | Training Time | Accuracy | GPU Memory |
|---|---|---|---|---|
| Custom CNN (Small) | ~200K | 1-2 hours | 85-88% | 2 GB |
| Custom CNN (Deep) | ~2M | 3-4 hours | 88-91% | 4 GB |
| ResNet50 (Feature Ext.) | ~25M | 1-2 hours | 90-93% | 4 GB |
| ResNet50 (Partial FT) | ~25M | 3-5 hours | 92-95% | 6 GB |
| ResNet50 (Full FT) | ~25M | 6-10 hours | 93-96% | 8 GB |
| EfficientNetB0 | ~5M | 2-4 hours | 91-94% | 3 GB |
| ViT-Small | ~22M | 8-12 hours | 90-93% | 8 GB |
| ViT-Base | ~86M | 12-24 hours | 94-97% | 16 GB |
Results vary based on dataset size, quality, and training configuration. These are representative ranges.
References
Custom CNN implementation: app/models/pytorch/cnn_builder.py
Transfer learning implementation: app/models/pytorch/transfer.py
Vision Transformer implementation: app/models/pytorch/transformer.py
Base model interface: app/models/base.py
Model building in training worker: app/training/worker.py:29-42