Welcome to neural network training! Last week you learned how to carry out inference in a neural network. This week, we’ll cover how to train a neural network on your own data.
Being able to take your own data and train your own neural network is a powerful and exciting capability. Let’s dive in!

Training Overview

Let’s continue with the handwritten digit recognition example: classifying an image as the digit 0 or 1.

Network Architecture

  • Input: Image pixels (X)
  • Layer 1: 25 units with sigmoid activation
  • Layer 2: 15 units with sigmoid activation
  • Output: 1 unit with sigmoid activation
Given a training set of images X with ground truth labels Y, how do you train the parameters of this neural network?

TensorFlow Training Code

Here’s the complete code to train a neural network in TensorFlow:
Step 1: Build the Model
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Create the model architecture
model = Sequential([
    Dense(units=25, activation='sigmoid'),  # First hidden layer
    Dense(units=15, activation='sigmoid'),  # Second hidden layer
    Dense(units=1, activation='sigmoid')    # Output layer
])
This step is familiar from the inference section - you’re specifying the layers and their configurations.
Step 2: Compile the Model
# Specify the loss function
model.compile(
    loss='binary_crossentropy',
    optimizer='adam'
)
The key part of compilation is specifying the loss function. Binary crossentropy is the standard loss for binary classification problems.
Step 3: Train the Model
# Fit the model to the data
model.fit(X, Y, epochs=100)
This tells TensorFlow to fit the model using the specified loss function to your dataset X and Y.

Understanding Training: Comparison with Logistic Regression

To understand neural network training, let’s first recall logistic regression training from the previous course.

Logistic Regression Training Steps

Step 1: Specify the Model

Define how to compute output given input:

# Logistic regression prediction
f(x) = g(w · x + b)

# Where g is the sigmoid function
g(z) = 1 / (1 + e^(-z))

Step 2: Define Loss and Cost Functions

Loss function (single training example):

L(f(x), y) = -y * log(f(x)) - (1 - y) * log(1 - f(x))

Cost function (average over all examples):

J(w, b) = (1/m) * Σ L(f(x^(i)), y^(i))

  • Loss function: Measures error on a single training example
  • Cost function: Average loss over the entire training set

Step 3: Train Using Gradient Descent

Update parameters to minimize cost:

# Gradient descent update
w = w - α * ∂J/∂w
b = b - α * ∂J/∂b
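The three steps above can be sketched end to end in NumPy. This is a minimal illustration on made-up toy data, not a reference implementation; the helper names and hyperparameter values are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, iterations=1000):
    """Gradient descent on the binary crossentropy cost J(w, b)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        f = sigmoid(X @ w + b)      # Step 1: model predictions f(x^(i))
        error = f - y               # derivative of the loss w.r.t. w·x + b
        dw = (X.T @ error) / m      # ∂J/∂w, averaged over m examples
        db = error.mean()           # ∂J/∂b
        w -= alpha * dw             # Step 3: w = w - α * ∂J/∂w
        b -= alpha * db             #         b = b - α * ∂J/∂b
    return w, b

# Toy data: label is 1 exactly when the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic_regression(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

After training, the learned weight is positive, so thresholding the sigmoid output at 0.5 recovers the labels.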

Neural Network Training Steps

Training a neural network follows the same three steps:

Step 1: Specify the Model

Define the neural network architecture:
model = Sequential([
    Dense(25, activation='sigmoid'),
    Dense(15, activation='sigmoid'),
    Dense(1, activation='sigmoid')
])
This specifies:
  • How many layers
  • How many neurons per layer
  • What activation functions to use
  • How to compute output given input and parameters

Step 2: Specify Loss and Cost

For binary classification, use binary crossentropy:
model.compile(loss='binary_crossentropy')
Binary Crossentropy Loss:
L(f(x), y) = -y * log(f(x)) - (1 - y) * log(1 - f(x))
This is the same loss function as logistic regression!
Binary crossentropy (also called logistic loss) is ideal for binary classification because:
  • It heavily penalizes confident wrong predictions
  • It provides smooth gradients for optimization
  • It’s theoretically derived from maximum likelihood estimation
Cost Function:
J(W, B) = (1/m) * Σ L(f(x^(i)), y^(i))
Average the loss over all m training examples.
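To make the loss/cost distinction concrete, here is the binary crossentropy computed by hand on a few toy predictions (the values of f and y below are made up for illustration):

```python
import numpy as np

def binary_crossentropy(f, y):
    # L(f(x), y) = -y * log(f(x)) - (1 - y) * log(1 - f(x)), per example
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

f = np.array([0.9, 0.2, 0.8])   # model outputs f(x^(i)) (toy values)
y = np.array([1.0, 0.0, 1.0])   # ground truth labels

per_example_loss = binary_crossentropy(f, y)   # one loss value per example
cost = per_example_loss.mean()                 # J = (1/m) * Σ L
```

Note the confident correct prediction (0.9 for a label of 1) gets a smaller loss than the less confident one (0.8).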

Step 3: Train Using Gradient Descent

model.fit(X, Y, epochs=100)
This executes gradient descent:
  1. Compute gradients of cost with respect to all parameters
  2. Update parameters: W = W - α * ∂J/∂W
  3. Repeat for specified number of epochs
An epoch is one complete pass through the entire training dataset. Training for 100 epochs means the algorithm sees each example 100 times.

Backpropagation Algorithm

The key to neural network training is computing gradients efficiently using backpropagation.
Backpropagation computes the gradient of the loss function with respect to each parameter by applying the chain rule from calculus. TensorFlow handles this automatically!

What Backpropagation Does

  1. Forward pass: Compute outputs layer by layer from input to output
  2. Compute loss: Calculate the error between prediction and actual label
  3. Backward pass: Propagate the error backward through the network
  4. Compute gradients: Calculate ∂J/∂W and ∂J/∂B for each layer
  5. Update parameters: Adjust weights and biases using gradient descent
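The five stages above can be traced by hand on a tiny one-hidden-layer network. This is a pedagogical sketch with made-up sizes and a single training example, not TensorFlow's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny network: 2 inputs -> 3 hidden sigmoid units -> 1 sigmoid output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([0.5, -1.0]), 1.0

# 1-2. Forward pass and loss
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)                 # prediction f(x)
loss = -y * np.log(a2) - (1 - y) * np.log(1 - a2)

# 3-4. Backward pass: apply the chain rule layer by layer
dz2 = a2 - y                               # ∂L/∂z2 for sigmoid + crossentropy
dW2, db2 = np.outer(dz2, a1), dz2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)         # propagate error through layer 1
dW1, db1 = np.outer(dz1, x), dz1

# 5. Gradient descent update
alpha = 0.1
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1

# The loss on the same example should decrease after one update
new_a2 = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
new_loss = -y * np.log(new_a2) - (1 - y) * np.log(1 - new_a2)
```

In practice you never write this by hand: `model.fit` performs exactly this forward/backward/update cycle over mini-batches of your data.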

Different Loss Functions

Depending on your problem, you might use different loss functions:
# Binary crossentropy for 2 classes
model.compile(loss='binary_crossentropy')
Use when: Output is 0 or 1 (binary classification)
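The other common choice, mean squared error for regression (mentioned again in the summary), differs only in how the per-example error is measured. A quick NumPy comparison of the two, on toy numbers:

```python
import numpy as np

f = np.array([0.9, 0.1])   # predictions (toy values)
y = np.array([1.0, 0.0])   # targets

# Binary crossentropy: for classification targets in {0, 1}
bce = np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

# Mean squared error: for continuous regression targets
mse = np.mean((f - y) ** 2)
```

In Keras the regression counterpart is `model.compile(loss='mean_squared_error')`.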

Complete Training Example

Here’s a complete example for the digit classification problem:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Prepare training data
X = np.array([...])  # Image pixel values
Y = np.array([...])  # Labels (0 or 1)

# Build model
model = Sequential([
    Dense(25, activation='sigmoid', input_shape=(784,)),
    Dense(15, activation='sigmoid'),
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Train model
history = model.fit(
    X, Y,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluate on held-out test data (X_test, Y_test prepared separately)
test_loss, test_accuracy = model.evaluate(X_test, Y_test)
print(f"Test accuracy: {test_accuracy}")

# Make predictions on new images (X_new prepared separately)
predictions = model.predict(X_new)

Training Parameters Explained

Epochs

Number of complete passes through the training data

Batch Size

Number of examples processed before updating parameters

Learning Rate

Step size for parameter updates (controlled by optimizer)

Validation Split

Fraction of data reserved for validation during training
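These parameters interact: with m training examples and a given batch size, each epoch performs ceil(m / batch_size) parameter updates. A quick sanity check with toy numbers:

```python
import math

m = 1000          # training examples (toy value)
batch_size = 32
epochs = 100

updates_per_epoch = math.ceil(m / batch_size)   # 31 full batches + 1 partial batch
total_updates = updates_per_epoch * epochs
```

So smaller batches mean more (noisier) updates per epoch, while larger batches mean fewer (smoother) ones.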

Monitoring Training Progress

TensorFlow provides tools to monitor training:
# Training with validation
history = model.fit(
    X_train, Y_train,
    epochs=100,
    validation_data=(X_val, Y_val),
    verbose=1
)

# Plot training history
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Monitoring validation loss helps detect overfitting. If training loss decreases but validation loss increases, your model is overfitting.

Common Training Issues

Slow or No Convergence

Symptoms: Loss stays constant or decreases very slowly.
Solutions:
  • Increase the learning rate
  • Check data preprocessing
  • Verify labels are correct
  • Try a different weight initialization

Overfitting

Symptoms: Training accuracy is high, but validation accuracy is low.
Solutions:
  • Add more training data
  • Use dropout or regularization
  • Reduce model complexity
  • Use early stopping

Underfitting

Symptoms: Both training and validation accuracy are low.
Solutions:
  • Increase model complexity
  • Train for more epochs
  • Reduce regularization
  • Check for bugs in the data pipeline

Optimizers in TensorFlow

TensorFlow offers several optimization algorithms:
# Adam optimizer (most common)
model.compile(
    loss='binary_crossentropy',
    optimizer='adam'
)

# SGD with momentum
model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
)

# RMSprop
model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001)
)
Adam is usually the best starting point. It combines the benefits of momentum and adaptive learning rates.
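As a rough intuition for what the momentum option does, here is plain SGD-with-momentum minimizing a 1-D quadratic. This is a toy sketch of the update rule, not Keras's implementation:

```python
# Minimize J(w) = w^2, whose gradient is dJ/dw = 2w
learning_rate, momentum = 0.01, 0.9
w, velocity = 5.0, 0.0

for _ in range(500):
    grad = 2 * w
    # Momentum update: velocity accumulates a decaying sum of past gradients,
    # smoothing the trajectory and speeding travel along consistent directions
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity
```

After 500 steps w has converged to (very near) the minimum at 0.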

Best Practices

  1. Normalize your data: Scale input features to similar ranges (e.g., 0-1 or standardize)
  2. Use appropriate batch sizes: Typical values are 32, 64, 128, or 256
  3. Monitor validation metrics: Always evaluate on a held-out validation set
  4. Save checkpoints: Save model weights periodically during training
  5. Use early stopping: Stop training when validation performance stops improving
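The first practice, normalizing inputs, can be as simple as scaling pixels to [0, 1] or standardizing each feature. A sketch with toy pixel data:

```python
import numpy as np

# Toy "images": two examples of three pixel values in 0..255
X = np.array([[0.0, 128.0, 255.0],
              [64.0, 192.0, 32.0]])

# Option 1: scale to [0, 1] (common for pixel data)
X_scaled = X / 255.0

# Option 2: standardize each feature to mean 0, std 1
mean, std = X.mean(axis=0), X.std(axis=0)
X_standardized = (X - mean) / std
```

Either choice keeps features on similar scales, which helps gradient descent converge; just apply the same transformation (with training-set statistics) to validation and test data.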

Early Stopping Example

from tensorflow.keras.callbacks import EarlyStopping

# Define early stopping
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

# Train with early stopping
model.fit(
    X_train, Y_train,
    epochs=1000,
    validation_data=(X_val, Y_val),
    callbacks=[early_stop]
)
Early stopping automatically stops training when the model stops improving, preventing overfitting and saving time.
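The patience logic can be sketched in plain Python to show what EarlyStopping does under the hood. This is a simplified sketch; the validation losses below are made up:

```python
# Made-up validation losses: improving, then plateauing
val_losses = [0.9, 0.7, 0.5, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51]

patience = 3
best_loss = float("inf")
wait = 0
stopped_epoch = None

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:          # improvement: record it, reset the counter
        best_loss = loss
        wait = 0
    else:                         # no improvement this epoch
        wait += 1
        if wait >= patience:      # out of patience: stop training
            stopped_epoch = epoch
            break
```

With `restore_best_weights=True`, Keras additionally rolls the model back to the weights from the best epoch (here, the one with loss 0.45).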

Summary

Neural network training involves three key steps:
  1. Specify the model: Define architecture with layers and activations
  2. Choose loss function: Binary crossentropy for classification, MSE for regression
  3. Train with gradient descent: Use backpropagation to compute gradients
TensorFlow automates the complex mathematics of backpropagation, letting you focus on model architecture and hyperparameters.

Next Steps

Now that you understand neural network training:
  • Experiment with different architectures
  • Try various loss functions and optimizers
  • Practice with real datasets
  • Learn about regularization techniques
  • Explore advanced architectures (CNNs, RNNs)
The ability to debug and improve your models comes from understanding what’s happening under the hood. Even when using high-level APIs, this knowledge is invaluable.
Congratulations! You now have the complete foundation for building, training, and deploying neural networks!
