
Overview

The keras_yolo.py module implements the YOLO v2 (You Only Look Once version 2) object detection architecture in Keras. It provides functions for building the model, processing outputs, computing loss, and evaluating predictions.

Helper Functions

space_to_depth_x2()

TensorFlow space-to-depth transformation with block size 2.
space_to_depth_x2(x)

Parameters

x
tensor
required
Input tensor to be transformed.

Returns

output
tensor
Transformed tensor with spatial dimensions reduced by 2x and channels increased by 4x.

Description

This is a thin wrapper around TensorFlow's space_to_depth operation with a fixed block size of 2. It reorganizes spatial data into the depth (channel) dimension, which YOLO v2 uses to concatenate features from different scales. The transformation:
  • Input shape: (batch, height, width, channels)
  • Output shape: (batch, height/2, width/2, channels*4)
This function is used internally by yolo_body() to create the passthrough layer that combines high-resolution features with low-resolution features.
# Used in a Keras Lambda layer
from keras.layers import Lambda

conv21_reshaped = Lambda(
    space_to_depth_x2,
    output_shape=space_to_depth_x2_output_shape,
    name='space_to_depth')(conv21)
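To make the reorganization concrete, here is a NumPy sketch of the same operation (space_to_depth_x2_np is an illustrative name, not part of the module; it mirrors tf.space_to_depth semantics for NHWC input):

```python
import numpy as np

def space_to_depth_x2_np(x):
    """NumPy sketch of space-to-depth with block size 2."""
    b, h, w, c = x.shape
    # Split each 2x2 spatial block, then stack its 4 pixels along channels.
    x = x.reshape(b, h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h // 2, w // 2, 4 * c)

x = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3).astype('float32')
y = space_to_depth_x2_np(x)
print(y.shape)  # (2, 2, 2, 12)
```

Each output position holds the four pixels of one 2x2 input block, concatenated along the channel axis.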

space_to_depth_x2_output_shape()

Calculate output shape for space_to_depth operation with block size 2.
space_to_depth_x2_output_shape(input_shape)

Parameters

input_shape
tuple
required
Input shape as (batch, height, width, channels).

Returns

output_shape
tuple
Output shape as (batch, height//2, width//2, channels*4). If height is None, returns (batch, None, None, channels*4) for dynamic shapes.

Description

This helper function computes the output shape after applying space_to_depth_x2(). It’s used by Keras Lambda layers to determine the output shape at graph construction time.
For TensorFlow backend, this function may not be strictly required as shape inference can be automatic. However, it’s provided for compatibility and explicit shape specification.
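The shape rule is simple enough to sketch in pure Python (s2d_x2_output_shape is a hypothetical name used here for illustration):

```python
def s2d_x2_output_shape(input_shape):
    # Shape rule for space-to-depth with block size 2:
    # halve the spatial dimensions, quadruple the channels.
    batch, height, width, channels = input_shape
    if height is not None:
        return (batch, height // 2, width // 2, 4 * channels)
    # Dynamic spatial dimensions stay unknown.
    return (batch, None, None, 4 * channels)

print(s2d_x2_output_shape((None, 416, 416, 64)))    # (None, 208, 208, 256)
print(s2d_x2_output_shape((None, None, None, 64)))  # (None, None, None, 256)
```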

Model Architecture Functions

yolo_body()

Creates the YOLO v2 CNN body architecture.
yolo_body(inputs, num_anchors, num_classes)

Parameters

inputs
keras.Input
required
Input tensor for the model.
num_anchors
int
required
Number of anchor boxes per grid cell.
num_classes
int
required
Number of object classes to detect.

Returns

model
keras.Model
Keras Model with YOLO v2 architecture. Output shape is (batch, grid_h, grid_w, num_anchors * (num_classes + 5)).

Architecture Details

  1. Darknet-19 Base: Uses darknet_body() as feature extractor
  2. Conv20 Layers: Two additional 1024-filter 3x3 convolutions
  3. Passthrough Layer: Concatenates layer 43 output with conv20
  4. Space-to-depth: Reorganizes spatial data to depth dimension
  5. Final Convolution: Outputs predictions for anchors and classes
darknet = Model(inputs, darknet_body()(inputs))
conv20 = compose(
    DarknetConv2D_BN_Leaky(1024, (3, 3)),
    DarknetConv2D_BN_Leaky(1024, (3, 3)))(darknet.output)

conv13 = darknet.layers[43].output
conv21 = DarknetConv2D_BN_Leaky(64, (1, 1))(conv13)
conv21_reshaped = Lambda(
    space_to_depth_x2,
    output_shape=space_to_depth_x2_output_shape,
    name='space_to_depth')(conv21)

x = concatenate([conv21_reshaped, conv20])
x = DarknetConv2D_BN_Leaky(1024, (3, 3))(x)
x = DarknetConv2D(num_anchors * (num_classes + 5), (1, 1))(x)

yolo()

Generates a complete YOLO v2 localization model by combining the model body and head.
yolo(inputs, anchors, num_classes)

Parameters

inputs
keras.Input
required
Input tensor for the model.
anchors
array-like
required
Anchor box definitions. Shape: (num_anchors, 2) with width/height pairs.
num_classes
int
required
Number of object classes to detect.

Returns

outputs
tuple
Tuple of tensors (box_xy, box_wh, box_confidence, box_class_probs) representing processed predictions ready for evaluation.

Description

This is a convenience function that combines yolo_body() and yolo_head() to create a complete YOLO model in one step. It internally:
  1. Calls yolo_body() to create the CNN architecture
  2. Passes the model output through yolo_head() to get prediction tensors
  3. Returns the processed outputs
# Equivalent to:
num_anchors = len(anchors)
body = yolo_body(inputs, num_anchors, num_classes)
outputs = yolo_head(body.output, anchors, num_classes)
return outputs

Usage Example

from keras.layers import Input
from yad2k.models.keras_yolo import yolo
import numpy as np

# Define inputs and anchors
inputs = Input(shape=(416, 416, 3))
anchors = np.array([[1.08, 1.19], [3.42, 4.41], [6.63, 11.38], 
                    [9.42, 5.11], [16.62, 10.52]])

# Create complete YOLO model
box_xy, box_wh, box_confidence, box_class_probs = yolo(inputs, anchors, num_classes=20)

Output Processing Functions

yolo_head()

Converts final layer features to bounding box parameters.
yolo_head(feats, anchors, num_classes)

Parameters

feats
tensor
required
Final convolutional layer features from the YOLO model.
anchors
array-like
required
Anchor box widths and heights. Shape: (num_anchors, 2).
num_classes
int
required
Number of target classes.

Returns

box_xy
tensor
Box center coordinates (x, y) adjusted by spatial location in conv layer. Values are normalized to [0, 1].
box_wh
tensor
Box dimensions (width, height) adjusted by anchors and conv spatial resolution. Values are normalized by the grid dimensions, so 1.0 corresponds to the full image width or height.
box_confidence
tensor
Probability estimate for whether each box contains any object. Values in [0, 1].
box_class_probs
tensor
Probability distribution over class labels for each box. Softmax normalized.

Processing Steps

  1. Reshape Features: Converts to (batch, conv_h, conv_w, num_anchors, num_classes + 5)
  2. Extract Components:
    • box_xy: Sigmoid activation on first 2 values
    • box_wh: Exponential on next 2 values
    • box_confidence: Sigmoid on 5th value
    • box_class_probs: Softmax on remaining values
  3. Adjust Predictions:
    • Add grid cell offset to xy coordinates
    • Multiply wh by anchor dimensions
    • Normalize by grid dimensions
box_xy = K.sigmoid(feats[..., :2])
box_wh = K.exp(feats[..., 2:4])
box_confidence = K.sigmoid(feats[..., 4:5])
box_class_probs = K.softmax(feats[..., 5:])

box_xy = (box_xy + conv_index) / conv_dims
box_wh = box_wh * anchors_tensor / conv_dims
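The same decode can be sketched in NumPy for a single image (illustrative only; the real yolo_head operates on Keras tensors and builds conv_index symbolically, and decode_feats is a hypothetical name):

```python
import numpy as np

def decode_feats(feats, anchors):
    """Decode features of shape (grid_h, grid_w, num_anchors, num_classes + 5)."""
    grid_h, grid_w = feats.shape[:2]
    # Grid cell offsets: conv_index[i, j] = (j, i) in (x, y) order.
    col, row = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    conv_index = np.stack([col, row], axis=-1)[:, :, None, :].astype('float32')
    conv_dims = np.array([grid_w, grid_h], dtype='float32')

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    box_xy = (sigmoid(feats[..., :2]) + conv_index) / conv_dims
    box_wh = np.exp(feats[..., 2:4]) * anchors / conv_dims
    box_confidence = sigmoid(feats[..., 4:5])
    # Softmax over the class logits.
    logits = feats[..., 5:]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    box_class_probs = e / e.sum(axis=-1, keepdims=True)
    return box_xy, box_wh, box_confidence, box_class_probs

feats = np.zeros((13, 13, 5, 25), dtype='float32')  # 5 anchors, 20 classes
anchors = np.array([[1.08, 1.19], [3.42, 4.41], [6.63, 11.38],
                    [9.42, 5.11], [16.62, 10.52]], dtype='float32')
xy, wh, conf, probs = decode_feats(feats, anchors)
print(xy.shape, wh.shape)  # (13, 13, 5, 2) (13, 13, 5, 2)
```

With all-zero features, every confidence decodes to sigmoid(0) = 0.5 and every class distribution is uniform, which is a handy sanity check.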

yolo_boxes_to_corners()

Converts YOLO box predictions to bounding box corners.
yolo_boxes_to_corners(box_xy, box_wh)

Parameters

box_xy
tensor
required
Box center coordinates from yolo_head().
box_wh
tensor
required
Box width and height from yolo_head().

Returns

corners
tensor
Bounding box corners in format [y_min, x_min, y_max, x_max].
box_mins = box_xy - (box_wh / 2.)
box_maxes = box_xy + (box_wh / 2.)

return K.concatenate([
    box_mins[..., 1:2],  # y_min
    box_mins[..., 0:1],  # x_min
    box_maxes[..., 1:2],  # y_max
    box_maxes[..., 0:1]  # x_max
])
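A NumPy version of the corner conversion (boxes_to_corners_np is an illustrative name) makes the coordinate ordering easy to verify:

```python
import numpy as np

def boxes_to_corners_np(box_xy, box_wh):
    # Convert (center, size) to [y_min, x_min, y_max, x_max] corners.
    box_mins = box_xy - box_wh / 2.0
    box_maxes = box_xy + box_wh / 2.0
    return np.concatenate([
        box_mins[..., 1:2],   # y_min
        box_mins[..., 0:1],   # x_min
        box_maxes[..., 1:2],  # y_max
        box_maxes[..., 0:1],  # x_max
    ], axis=-1)

out = boxes_to_corners_np(np.array([[0.5, 0.5]]), np.array([[0.2, 0.4]]))
print(out)  # [[0.3 0.4 0.7 0.6]]
```

Note the (x, y) inputs are reordered to the (y, x) convention expected by tf.image.non_max_suppression.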

Filtering and Evaluation Functions

yolo_filter_boxes()

Filters YOLO boxes based on object and class confidence.
yolo_filter_boxes(boxes, box_confidence, box_class_probs, threshold=.6)

Parameters

boxes
tensor
required
Bounding box coordinates in corner format.
box_confidence
tensor
required
Object confidence scores.
box_class_probs
tensor
required
Class probability distributions.
threshold
float
default:"0.6"
Minimum score threshold for keeping boxes.

Returns

boxes
tensor
Filtered bounding boxes that exceed the threshold.
scores
tensor
Confidence scores for filtered boxes.
classes
tensor
Predicted class indices for filtered boxes.
box_scores = box_confidence * box_class_probs
box_classes = K.argmax(box_scores, axis=-1)
box_class_scores = K.max(box_scores, axis=-1)
prediction_mask = box_class_scores >= threshold

boxes = tf.boolean_mask(boxes, prediction_mask)
scores = tf.boolean_mask(box_class_scores, prediction_mask)
classes = tf.boolean_mask(box_classes, prediction_mask)
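The filtering logic can be sketched in NumPy for flattened inputs (filter_boxes_np is an illustrative name; the real function uses Keras/TensorFlow ops):

```python
import numpy as np

def filter_boxes_np(boxes, box_confidence, box_class_probs, threshold=0.6):
    """boxes: (N, 4), box_confidence: (N, 1), box_class_probs: (N, num_classes)."""
    box_scores = box_confidence * box_class_probs       # (N, num_classes)
    box_classes = np.argmax(box_scores, axis=-1)        # best class per box
    box_class_scores = np.max(box_scores, axis=-1)      # score of that class
    mask = box_class_scores >= threshold
    return boxes[mask], box_class_scores[mask], box_classes[mask]

boxes = np.array([[0.0, 0.0, 1.0, 1.0], [0.2, 0.2, 0.8, 0.8]])
conf = np.array([[0.9], [0.3]])
probs = np.array([[0.8, 0.2], [0.5, 0.5]])
kept, scores, classes = filter_boxes_np(boxes, conf, probs, threshold=0.6)
print(kept.shape, scores, classes)  # (1, 4) [0.72] [0]
```

Only the first box survives: its best class score is 0.9 * 0.8 = 0.72, while the second box peaks at 0.15.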

yolo_eval()

Evaluates YOLO model on input and returns filtered boxes with non-maximum suppression.
yolo_eval(yolo_outputs, image_shape, max_boxes=10, score_threshold=.6, iou_threshold=.5)

Parameters

yolo_outputs
tuple
required
Tuple of (box_xy, box_wh, box_confidence, box_class_probs) from yolo_head().
image_shape
tensor
required
Original image shape as [height, width].
max_boxes
int
default:"10"
Maximum number of boxes to return after NMS.
score_threshold
float
default:"0.6"
Minimum score for box filtering.
iou_threshold
float
default:"0.5"
IoU threshold for non-maximum suppression.

Returns

boxes
tensor
Final bounding boxes scaled to original image dimensions. Shape: (num_boxes, 4).
scores
tensor
Confidence scores for final boxes. Shape: (num_boxes,).
classes
tensor
Class indices for final boxes. Shape: (num_boxes,).

Processing Pipeline

  1. Convert boxes to corner format
  2. Filter by score threshold
  3. Scale boxes to original image size
  4. Apply non-maximum suppression
  5. Return top-k boxes
box_xy, box_wh, box_confidence, box_class_probs = yolo_outputs
boxes = yolo_boxes_to_corners(box_xy, box_wh)
boxes, scores, classes = yolo_filter_boxes(
    boxes, box_confidence, box_class_probs, threshold=score_threshold)

# Scale boxes back to original image shape
height = image_shape[0]
width = image_shape[1]
image_dims = K.stack([height, width, height, width])
image_dims = K.reshape(image_dims, [1, 4])
boxes = boxes * image_dims

# Non-maximum suppression
max_boxes_tensor = K.variable(max_boxes, dtype='int32')
K.get_session().run(tf.variables_initializer([max_boxes_tensor]))
nms_index = tf.image.non_max_suppression(
    boxes, scores, max_boxes_tensor, iou_threshold=iou_threshold)
boxes = K.gather(boxes, nms_index)
scores = K.gather(scores, nms_index)
classes = K.gather(classes, nms_index)
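The non-maximum suppression step can be sketched in NumPy as greedy suppression (iou_corners and nms_np are illustrative names; the real code delegates to tf.image.non_max_suppression):

```python
import numpy as np

def iou_corners(a, b):
    # IoU of two boxes in [y_min, x_min, y_max, x_max] format.
    y1, x1 = np.maximum(a[:2], b[:2])
    y2, x2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_np(boxes, scores, max_boxes=10, iou_threshold=0.5):
    # Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    # it above the IoU threshold, repeat on the remainder.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size and len(keep) < max_boxes:
        i = order[0]
        keep.append(i)
        order = order[1:][[iou_corners(boxes[i], boxes[j]) <= iou_threshold
                           for j in order[1:]]]
    return np.array(keep, dtype=int)

boxes = np.array([[0., 0., 10., 10.], [1., 1., 10., 10.], [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms_np(boxes, scores))  # [0 2]
```

The middle box overlaps the first with IoU 0.81 and is suppressed; the distant third box is kept.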

Training Functions

yolo_loss()

YOLO localization loss function for training.
yolo_loss(args, anchors, num_classes, rescore_confidence=False, print_loss=False)

Parameters

args
tuple
required
Tuple of (yolo_output, true_boxes, detectors_mask, matching_true_boxes).
yolo_output
tensor
required
Final convolutional layer features from the model.
true_boxes
tensor
required
Ground truth boxes with shape [batch, num_true_boxes, 5]. Contains box x_center, y_center, width, height, and class.
detectors_mask
array
required
Binary mask (0/1) for detector positions where there is a matching ground truth.
matching_true_boxes
array
required
Corresponding ground truth boxes for positive detector positions, adjusted for conv height and width.
anchors
tensor
required
Anchor boxes for the model.
num_classes
int
required
Number of object classes.
rescore_confidence
bool
default:"False"
If True, set confidence target to IoU of best predicted box with closest matching ground truth.
print_loss
bool
default:"False"
If True, use tf.Print() to print loss components during training.

Returns

total_loss
tensor
Scalar tensor containing the mean localization loss across the minibatch.

Loss Components

The total loss combines four components:
  1. Confidence Loss (objects): Penalizes incorrect confidence for boxes with objects
  2. Confidence Loss (no objects): Penalizes false positives
  3. Classification Loss: Penalizes incorrect class predictions
  4. Coordinate Loss: Penalizes incorrect box coordinates
object_scale = 5
no_object_scale = 1
class_scale = 1
coordinates_scale = 1

total_loss = 0.5 * (
    confidence_loss_sum + classification_loss_sum + coordinates_loss_sum)
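The confidence weighting can be illustrated with a simplified NumPy sketch (confidence_loss_np is a hypothetical name; the real yolo_loss additionally masks the no-object term by the best IoU against ground truth):

```python
import numpy as np

# Scales from the YOLO v2 loss shown above.
object_scale, no_object_scale = 5.0, 1.0

def confidence_loss_np(pred_conf, detectors_mask):
    """Squared-error confidence term, weighted differently for detectors
    with and without a matched ground truth box."""
    objects_loss = object_scale * detectors_mask * (1.0 - pred_conf) ** 2
    no_objects_loss = no_object_scale * (1.0 - detectors_mask) * pred_conf ** 2
    return np.sum(objects_loss + no_objects_loss)

pred_conf = np.array([0.9, 0.2, 0.8])       # predicted objectness per detector
detectors_mask = np.array([1.0, 0.0, 0.0])  # only detector 0 has a match
print(confidence_loss_np(pred_conf, detectors_mask))  # 0.73
```

Detector 0 is rewarded for high confidence (small penalty of 0.05), while detectors 1 and 2 are penalized for any confidence at all (0.04 and 0.64).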

preprocess_true_boxes()

Finds the detector position in YOLO grid where each ground truth box should appear.
preprocess_true_boxes(true_boxes, anchors, image_size)

Parameters

true_boxes
array
required
Ground truth boxes in relative [x, y, w, h, class] form. Coordinates are in the range [0, 1], as fractions of the original image dimensions.
anchors
array
required
Anchor boxes in the form [w, h]. Assumed to be in the range [0, conv_size], where conv_size is the spatial dimension of the final conv features.
image_size
array-like
required
Image dimensions as [height, width] in pixels.

Returns

detectors_mask
array
Binary mask with shape [conv_height, conv_width, num_anchors, 1] indicating detector positions to compare with ground truth.
matching_true_boxes
array
Ground truth boxes adjusted for comparison with predicted parameters. Same shape as detectors_mask with box parameters.

Algorithm

  1. Downsamples ground truth to conv grid (32x downsampling)
  2. For each ground truth box:
    • Finds grid cell containing box center
    • Computes IoU with each anchor
    • Assigns to anchor with highest IoU
  3. Adjusts box parameters for training:
    • Offsets relative to grid cell
    • Log-space width/height relative to anchor
conv_height = height // 32
conv_width = width // 32

for box in true_boxes:
    box_class = box[4:5]
    box = box[0:4] * np.array([conv_width, conv_height, conv_width, conv_height])
    i = np.floor(box[1]).astype('int')  # grid row
    j = np.floor(box[0]).astype('int')  # grid col
    
    # Find best anchor
    best_iou = 0
    best_anchor = 0
    for k, anchor in enumerate(anchors):
        iou = compute_iou(box[2:4], anchor)
        if iou > best_iou:
            best_iou = iou
            best_anchor = k
    
    if best_iou > 0:
        detectors_mask[i, j, best_anchor] = 1
        adjusted_box = np.array([
            box[0] - j,  # x offset from grid cell
            box[1] - i,  # y offset from grid cell
            np.log(box[2] / anchors[best_anchor][0]),  # log w
            np.log(box[3] / anchors[best_anchor][1]),  # log h
            box_class
        ])
        matching_true_boxes[i, j, best_anchor] = adjusted_box
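The compute_iou step above compares only shapes: box and anchor are treated as rectangles centered at the same point. A minimal sketch (compute_iou here is an illustrative helper, not part of the module API):

```python
def compute_iou(box_wh, anchor_wh):
    # IoU of a ground truth (w, h) and an anchor (w, h), both centered
    # at the origin, so only the shapes are compared.
    inter_w = min(box_wh[0], anchor_wh[0])
    inter_h = min(box_wh[1], anchor_wh[1])
    intersection = inter_w * inter_h
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - intersection
    return intersection / union

print(compute_iou((2.0, 2.0), (2.0, 2.0)))  # 1.0
print(compute_iou((2.0, 2.0), (4.0, 1.0)))  # 2/6 ≈ 0.333
```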

Constants

VOC Anchors

voc_anchors = np.array([
    [1.08, 1.19], 
    [3.42, 4.41], 
    [6.63, 11.38], 
    [9.42, 5.11], 
    [16.62, 10.52]
])
Predefined anchor boxes for Pascal VOC dataset.

VOC Classes

voc_classes = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor"
]
20 object classes from Pascal VOC dataset.

Usage Example

Building a Model

from keras import backend as K
from keras.layers import Input
from keras_yolo import yolo_body, yolo_head, yolo_eval, voc_anchors

# Create model
inputs = Input(shape=(416, 416, 3))
num_anchors = 5
num_classes = 20

model = yolo_body(inputs, num_anchors, num_classes)

# Process outputs
anchors = voc_anchors
yolo_outputs = yolo_head(model.output, anchors, num_classes)

# Evaluation
image_shape = K.placeholder(shape=(2,))
boxes, scores, classes = yolo_eval(
    yolo_outputs,
    image_shape,
    max_boxes=10,
    score_threshold=0.3,
    iou_threshold=0.5
)

Training

from keras_yolo import yolo_loss, preprocess_true_boxes

# Prepare training data
detectors_mask, matching_true_boxes = preprocess_true_boxes(
    true_boxes, anchors, image_size=(416, 416)
)

# Compile model with custom loss (illustrative; in practice YAD2K feeds
# these arguments to yolo_loss through a Lambda layer)
model.compile(
    optimizer='adam',
    loss=lambda y_true, y_pred: yolo_loss(
        (y_pred, true_boxes, detectors_mask, matching_true_boxes),
        anchors,
        num_classes
    )
)
