Overview
The keras_yolo.py module implements the YOLO v2 (You Only Look Once, version 2) object detection architecture in Keras. It provides functions for building the model, processing its outputs, computing the training loss, and evaluating predictions.
Helper Functions
space_to_depth_x2()
TensorFlow space-to-depth transformation with block size 2.

Parameters
Input tensor to be transformed.
Returns
Transformed tensor with spatial dimensions reduced by 2x and channels increased by 4x.
Description
This is a thin wrapper for TensorFlow’s space_to_depth operation with a fixed block size of 2. It reorganizes spatial data into the depth (channel) dimension, which is used in YOLO v2 to concatenate features from different scales.
Transformation:
- Input shape: (batch, height, width, channels)
- Output shape: (batch, height/2, width/2, channels*4)
Used by yolo_body() to create the passthrough layer that combines high-resolution features with low-resolution features.
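The data movement can be sketched in NumPy (the real function simply calls TensorFlow's space_to_depth op with block_size=2; this reproduces the same rearrangement for illustration):

```python
import numpy as np

def space_to_depth_x2_np(x):
    """NumPy sketch of the block-size-2 space-to-depth transform.

    Mirrors tf.space_to_depth(x, block_size=2): each 2x2 spatial block
    is flattened into the channel axis.
    """
    b, h, w, c = x.shape
    # Split height and width into (block index, within-block offset).
    x = x.reshape(b, h // 2, 2, w // 2, 2, c)
    # Move the two within-block axes next to the channel axis.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h // 2, w // 2, 4 * c)

x = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)
y = space_to_depth_x2_np(x)
print(y.shape)  # (2, 2, 2, 12)
```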
space_to_depth_x2_output_shape()
Calculate output shape for space_to_depth operation with block size 2.

Parameters
Input shape as (batch, height, width, channels).

Returns

Output shape as (batch, height//2, width//2, channels*4). If height is None, returns (batch, None, None, channels*4) for dynamic shapes.

Description
This helper function computes the output shape after applying space_to_depth_x2(). It’s used by Keras Lambda layers to determine the output shape at graph construction time.
For the TensorFlow backend, this function may not be strictly required, as shape inference can be automatic. However, it’s provided for compatibility and explicit shape specification.
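The shape rule described above is simple enough to sketch in pure Python:

```python
def space_to_depth_x2_output_shape_np(input_shape):
    """Sketch of the shape rule: halve spatial dims, quadruple channels.

    Returns (batch, None, None, channels*4) when the spatial dims are
    dynamic (None), as described above.
    """
    batch, height, width, channels = input_shape
    if height is None:
        return (batch, None, None, 4 * channels)
    return (batch, height // 2, width // 2, 4 * channels)

print(space_to_depth_x2_output_shape_np((None, 416, 416, 64)))
print(space_to_depth_x2_output_shape_np((None, None, None, 64)))
```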
Model Architecture Functions
yolo_body()
Creates the YOLO v2 CNN body architecture.

Parameters
Input tensor for the model.
Number of anchor boxes per grid cell.
Number of object classes to detect.
Returns
Keras Model with YOLO v2 architecture. Output shape is (batch, grid_h, grid_w, num_anchors * (num_classes + 5)).

Architecture Details
- Darknet-19 Base: Uses darknet_body() as feature extractor
- Conv20 Layers: Two additional 1024-filter 3x3 convolutions
- Passthrough Layer: Concatenates layer 43 output with conv20
- Space-to-depth: Reorganizes spatial data to depth dimension
- Final Convolution: Outputs predictions for anchors and classes
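The channel count of the final convolution follows directly from the output shape above: each of the num_anchors boxes per grid cell carries 4 coordinates, 1 objectness score, and num_classes class scores. A quick check of the arithmetic (the anchor and class counts below are illustrative, not fixed by the module):

```python
# 4 box coordinates + 1 objectness + num_classes class scores per anchor.
num_anchors, num_classes = 5, 80  # illustrative COCO-style settings
output_channels = num_anchors * (num_classes + 5)
print(output_channels)  # 425
```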
yolo()
Generates a complete YOLO v2 localization model by combining the model body and head.

Parameters
Input tensor for the model.
Anchor box definitions. Shape: (num_anchors, 2) with width/height pairs.

Number of object classes to detect.
Returns
Tuple of tensors (box_xy, box_wh, box_confidence, box_class_probs) representing processed predictions ready for evaluation.

Description
This is a convenience function that combines yolo_body() and yolo_head() to create a complete YOLO model in one step. It internally:
- Calls yolo_body() to create the CNN architecture
- Passes the model output through yolo_head() to get prediction tensors
- Returns the processed outputs
Usage Example
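A sketch of the intended call pattern (the import path below is an assumption based on the typical YAD2K project layout and may differ in your checkout):

```python
# Hypothetical usage; requires Keras with the TensorFlow backend:
#
#   from keras.layers import Input
#   from yad2k.models.keras_yolo import yolo  # assumed path
#
#   image_input = Input(shape=(416, 416, 3))  # dims should be multiples of 32
#   box_xy, box_wh, box_confidence, box_class_probs = yolo(
#       image_input, anchors, num_classes)
#
# Darknet-19 downsamples by a factor of 32, so a 416x416 input
# produces a 13x13 prediction grid:
image_size, stride = 416, 32
grid_size = image_size // stride
print(grid_size)  # 13
```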
Output Processing Functions
yolo_head()
Converts final layer features to bounding box parameters.

Parameters
Final convolutional layer features from the YOLO model.
Anchor box widths and heights. Shape: (num_anchors, 2).

Number of target classes.
Returns
box_xy: Box center coordinates (x, y), adjusted by spatial location in the conv layer. Values are normalized to [0, 1].
box_wh: Box dimensions (width, height), adjusted by anchors and conv spatial resolution. Values are normalized to [0, 1].
box_confidence: Probability estimate for whether each box contains any object. Values in [0, 1].
box_class_probs: Probability distribution over class labels for each box. Softmax normalized.
Processing Steps
- Reshape Features: Converts to (batch, conv_h, conv_w, num_anchors, num_classes + 5)
- Extract Components:
  - box_xy: Sigmoid activation on first 2 values
  - box_wh: Exponential on next 2 values
  - box_confidence: Sigmoid on 5th value
  - box_class_probs: Softmax on remaining values
- Adjust Predictions:
  - Add grid cell offset to xy coordinates
  - Multiply wh by anchor dimensions
  - Normalize by grid dimensions
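The processing steps above can be mirrored in NumPy (the real function performs the same math with Keras backend ops on tensors; this is an illustration, not the module's implementation):

```python
import numpy as np

def yolo_head_np(feats, anchors, num_classes):
    """NumPy sketch of YOLO v2 output decoding.

    feats: (batch, conv_h, conv_w, num_anchors * (num_classes + 5))
    anchors: (num_anchors, 2) width/height pairs in grid units.
    """
    b, conv_h, conv_w, _ = feats.shape
    num_anchors = len(anchors)
    feats = feats.reshape(b, conv_h, conv_w, num_anchors, num_classes + 5)

    # Grid of cell offsets in (x, y) order, shape (conv_h, conv_w, 1, 2).
    col, row = np.meshgrid(np.arange(conv_w), np.arange(conv_h))
    grid = np.stack([col, row], axis=-1)[:, :, None, :].astype(np.float32)
    dims = np.array([conv_w, conv_h], dtype=np.float32)

    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    box_xy = (sigmoid(feats[..., :2]) + grid) / dims        # cell offset, normalized
    box_wh = np.exp(feats[..., 2:4]) * anchors / dims       # anchor-scaled, normalized
    box_confidence = sigmoid(feats[..., 4:5])
    logits = feats[..., 5:]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True)) # stable softmax
    box_class_probs = e / e.sum(axis=-1, keepdims=True)
    return box_xy, box_wh, box_confidence, box_class_probs

anchors = np.array([[1.0, 1.0], [2.0, 2.0]], dtype=np.float32)
feats = np.zeros((1, 13, 13, 2 * (3 + 5)), dtype=np.float32)
xy, wh, conf, probs = yolo_head_np(feats, anchors, 3)
print(xy.shape, wh.shape, conf.shape, probs.shape)
```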
yolo_boxes_to_corners()
Converts YOLO box predictions to bounding box corners.

Parameters
Box center coordinates from yolo_head().

Box width and height from yolo_head().

Returns

Bounding box corners in format [y_min, x_min, y_max, x_max].

Filtering and Evaluation Functions
yolo_filter_boxes()
Filters YOLO boxes based on object and class confidence.

Parameters
Bounding box coordinates in corner format.
Object confidence scores.
Class probability distributions.
Minimum score threshold for keeping boxes.
Returns
Filtered bounding boxes that exceed the threshold.
Confidence scores for filtered boxes.
Predicted class indices for filtered boxes.
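The filtering logic can be sketched in NumPy (the module does the equivalent with backend ops and boolean masking; this illustrates the scoring rule only):

```python
import numpy as np

def yolo_filter_boxes_np(boxes, box_confidence, box_class_probs, threshold=0.6):
    """NumPy sketch of score-threshold filtering."""
    # Per-box, per-class score: objectness times class probability.
    box_scores = box_confidence * box_class_probs
    classes = np.argmax(box_scores, axis=-1)   # best class per box
    scores = np.max(box_scores, axis=-1)       # its score
    keep = scores >= threshold
    return boxes[keep], scores[keep], classes[keep]

boxes = np.array([[0.1, 0.1, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9]])
conf = np.array([[0.9], [0.3]])
probs = np.array([[0.8, 0.2], [0.6, 0.4]])
b, s, c = yolo_filter_boxes_np(boxes, conf, probs, threshold=0.6)
print(b, s, c)  # keeps only the first box: score 0.9*0.8, class 0
```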
yolo_eval()
Evaluates YOLO model on input and returns filtered boxes with non-maximum suppression.

Parameters
Tuple of (box_xy, box_wh, box_confidence, box_class_probs) from yolo_head().

Original image shape as [height, width].

Maximum number of boxes to return after NMS.
Minimum score for box filtering.
IoU threshold for non-maximum suppression.
Returns
Final bounding boxes scaled to original image dimensions. Shape: (num_boxes, 4).

Confidence scores for final boxes. Shape: (num_boxes,).

Class indices for final boxes. Shape: (num_boxes,).

Processing Pipeline
- Convert boxes to corner format
- Filter by score threshold
- Scale boxes to original image size
- Apply non-maximum suppression
- Return top-k boxes
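The non-maximum suppression step can be sketched as a greedy loop (the module delegates this to TensorFlow's built-in NMS op; this NumPy version shows the algorithm only):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [y1, x1, y2, x2] box against an array of boxes."""
    y1 = np.maximum(box[0], boxes[:, 0]); x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2]); x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms_np(boxes, scores, max_boxes=10, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop heavy overlaps, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size and len(keep) < max_boxes:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0., 0., 10., 10.], [1., 1., 10., 10.], [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms_np(boxes, scores))  # [0, 2]: the near-duplicate second box is suppressed
```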
Training Functions
yolo_loss()
YOLO localization loss function for training.

Parameters
Tuple of (yolo_output, true_boxes, detectors_mask, matching_true_boxes).

Final convolutional layer features from the model.

Ground truth boxes with shape [batch, num_true_boxes, 5]. Contains box x_center, y_center, width, height, and class.

Binary mask (0/1) for detector positions where there is a matching ground truth.

Corresponding ground truth boxes for positive detector positions, adjusted for conv height and width.

Anchor boxes for the model.

Number of object classes.

If True, set confidence target to IoU of best predicted box with closest matching ground truth.

If True, use tf.Print() to print loss components during training.

Returns
Mean localization loss across the minibatch.
Loss Components
The total loss combines four components:
- Confidence Loss (objects): Penalizes incorrect confidence for boxes with objects
- Confidence Loss (no objects): Penalizes false positives
- Classification Loss: Penalizes incorrect class predictions
- Coordinate Loss: Penalizes incorrect box coordinates
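A toy illustration of how the four components combine over two detector positions. The sum-of-squared-error form is in the spirit of the YOLO v2 loss, but the exact scaling factors and masking details here are illustrative assumptions, not the module's confirmed implementation:

```python
import numpy as np

# Two detector positions; detector 0 matches a ground truth box.
detectors_mask = np.array([1.0, 0.0])
pred_conf = np.array([0.8, 0.3])        # predicted objectness
target_conf = np.array([1.0, 0.0])      # 1 where an object is present

# Confidence losses: objects and background penalized separately.
obj_loss = np.sum(detectors_mask * (target_conf - pred_conf) ** 2)
noobj_loss = np.sum((1.0 - detectors_mask) * (0.0 - pred_conf) ** 2)

# Classification and coordinate losses apply only at matched detectors.
pred_probs = np.array([[0.7, 0.3], [0.5, 0.5]])
true_probs = np.array([[1.0, 0.0], [0.0, 0.0]])
cls_loss = np.sum(detectors_mask[:, None] * (true_probs - pred_probs) ** 2)

pred_box = np.array([[0.4, 0.4, 0.1, 0.2], [0.0, 0.0, 0.0, 0.0]])
true_box = np.array([[0.5, 0.5, 0.1, 0.2], [0.0, 0.0, 0.0, 0.0]])
coord_loss = np.sum(detectors_mask[:, None] * (true_box - pred_box) ** 2)

total = obj_loss + noobj_loss + cls_loss + coord_loss
print(total)
```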
preprocess_true_boxes()
Finds the detector position in YOLO grid where each ground truth box should appear.

Parameters
Ground truth boxes in form of relative [x, y, w, h, class]. Coordinates are in range [0, 1] as percentage of original image dimensions.

Anchor boxes in form of [w, h]. Assumed to be in range [0, conv_size] where conv_size is the spatial dimension of final conv features.

Image dimensions as [height, width] in pixels.

Returns
Binary mask with shape [conv_height, conv_width, num_anchors, 1] indicating detector positions to compare with ground truth.

Ground truth boxes adjusted for comparison with predicted parameters. Same shape as detectors_mask with box parameters.

Algorithm
- Downsamples ground truth to conv grid (32x downsampling)
- For each ground truth box:
- Finds grid cell containing box center
- Computes IoU with each anchor
- Assigns to anchor with highest IoU
- Adjusts box parameters for training:
- Offsets relative to grid cell
- Log-space width/height relative to anchor
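The per-box assignment described above can be sketched for a single ground truth box (a simplified illustration under the stated conventions, not the module's code; anchor IoU is computed with both boxes centered at the origin, as is standard for anchor matching):

```python
import numpy as np

def preprocess_one_box(box, anchors, conv_size):
    """Sketch of assigning one ground truth box to a grid cell and anchor.

    box: [x, y, w, h] relative to the image ([0, 1]);
    anchors: (num_anchors, 2) width/height pairs in grid units;
    conv_size: (conv_height, conv_width).
    """
    conv_h, conv_w = conv_size
    # Scale box to grid units and locate the owning cell.
    x, y = box[0] * conv_w, box[1] * conv_h
    w, h = box[2] * conv_w, box[3] * conv_h
    col, row = int(x), int(y)

    # IoU of the box against each anchor, both centered at the origin.
    inter = np.minimum(w, anchors[:, 0]) * np.minimum(h, anchors[:, 1])
    union = w * h + anchors[:, 0] * anchors[:, 1] - inter
    best = int(np.argmax(inter / union))

    # Training targets: offsets within the cell, log-space w/h vs anchor.
    adjusted = np.array([x - col, y - row,
                         np.log(w / anchors[best, 0]),
                         np.log(h / anchors[best, 1])])
    return row, col, best, adjusted

anchors = np.array([[1.0, 1.0], [3.0, 3.0]])
row, col, best, adj = preprocess_one_box(
    np.array([0.5, 0.5, 0.25, 0.25]), anchors, (13, 13))
print(row, col, best)  # 6 6 1: cell (6, 6), larger anchor wins
```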

