Gradient descent is a fundamental optimization algorithm used throughout machine learning. This guide implements gradient descent for functions with varying complexity, exploring parameter selection and convergence behavior.
Prerequisites
Load the required packages:
```python
import numpy as np
import matplotlib.pyplot as plt
from w2_tools import plot_f, gradient_descent_one_variable

%matplotlib widget
```
Function with One Global Minimum
Consider the function $f(x) = e^x - \log(x)$ for $x > 0$. This function has one global minimum that cannot be found analytically.
The Gradient Descent Algorithm
Starting from an initial point $x_0$, gradient descent iteratively moves “down the hill” using:

$$x_1 = x_0 - \alpha \frac{df}{dx}(x_0)$$
where:

- $\alpha > 0$ is the learning rate (step size)
- $\frac{df}{dx}(x_0)$ is the gradient (the derivative evaluated at $x_0$)
- the subtraction moves against the gradient, toward the minimum
The gradient defines the direction of steepest increase. Moving in the opposite direction (subtracting it) leads toward the minimum.
Defining the Function
Implement $f(x) = e^x - \log(x)$ and its derivative:
```python
def f_example_1(x):
    return np.exp(x) - np.log(x)

def dfdx_example_1(x):
    return np.exp(x) - 1 / x
```
Visualize the function:
```python
plot_f([0.001, 2.5], [-0.3, 13], f_example_1, 0.0)
```
Implementing Gradient Descent
Define the Algorithm
```python
def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x
```
Set Parameters
```python
num_iterations = 25
learning_rate = 0.1
x_initial = 1.6
```
These three parameters are crucial:
- `num_iterations`: How many steps to take
- `learning_rate`: Size of each step
- `x_initial`: Starting point
Run Optimization
```python
x_min = gradient_descent(dfdx_example_1, x_initial,
                         learning_rate, num_iterations)
print("Gradient descent result: x_min =", x_min)
```
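If `w2_tools` is unavailable, the same run can be reproduced self-contained and the result checked against the stationarity condition $e^x = 1/x$ (the derivative vanishes at the minimum). A minimal sketch, restating the example functions:

```python
import numpy as np

def f(x):
    return np.exp(x) - np.log(x)

def dfdx(x):
    return np.exp(x) - 1 / x

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

x_min = gradient_descent(dfdx, x=1.6, learning_rate=0.1, num_iterations=25)
# At the minimum the derivative should be nearly zero: exp(x) ≈ 1/x
print(x_min, dfdx(x_min))
```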
Interactive Visualization
Visualize the gradient descent process:
```python
num_iterations = 25
learning_rate = 0.1
x_initial = 1.6

gd_example_1 = gradient_descent_one_variable(
    [0.001, 2.5], [-0.3, 13],
    f_example_1, dfdx_example_1,
    gradient_descent, num_iterations,
    learning_rate, x_initial, 0.0, [0.35, 9.5]
)
```
After the animation ends, click on the plot to choose a new initial point and observe how gradient descent behaves from different starting locations.
Parameter Sensitivity Analysis
Experiment with different parameter combinations to understand their effects:
Learning Rate Effects
Learning Rate = 0.1 (Baseline)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 1.6
```
Result : Converges successfully to the minimum around iteration 21. The learning rate is well-tuned for this problem.
Learning Rate = 0.3 (Increased)
```python
num_iterations = 25; learning_rate = 0.3; x_initial = 1.6
```
Result : Converges faster! Larger steps mean fewer iterations needed. However, larger steps are riskier…
Learning Rate = 0.5 (Too Large)
```python
num_iterations = 25; learning_rate = 0.5; x_initial = 1.6
```
Result : ❌ Diverges! Steps are too large, causing oscillation and missing the minimum entirely. Increasing the learning rate can speed up convergence but may cause divergence; finding the right balance is crucial.
Learning Rate = 0.04 (Too Small)
```python
num_iterations = 25; learning_rate = 0.04; x_initial = 1.6
```
Result : ❌ Doesn’t converge within 25 iterations. Steps are too small, making progress very slow. Increasing the iteration count compensates:

```python
# Increase iterations to compensate for the small learning rate
num_iterations = 75; learning_rate = 0.04; x_initial = 1.6
```
Result : ✓ Converges, but uses 3x more iterations—computationally expensive.
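The four experiments above can be compressed into a single sweep. A standalone sketch (the example function is restated, and the convergence thresholds are illustrative):

```python
import numpy as np

def dfdx(x):
    return np.exp(x) - 1 / x  # derivative of exp(x) - log(x)

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=25):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

for lr in [0.04, 0.1, 0.3, 0.5]:
    x_min = gradient_descent(dfdx, x=1.6, learning_rate=lr)
    # Treat a near-zero derivative at the final point as convergence
    status = "converged" if abs(dfdx(x_min)) < 1e-3 else "not converged"
    print(f"lr={lr}: final x = {x_min:.4f} ({status})")
```

With `lr=0.5` the iterates overshoot past $x = 0$ and leave the domain of $\log(x)$ entirely, which is the divergence seen in the animation.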
Initial Point Effects
x_initial = 1.6 (Baseline)

```python
num_iterations = 25; learning_rate = 0.1; x_initial = 1.6
```
Result : Converges smoothly. The function gradient is moderate at this point.
x_initial = 0.05 (Steep Region)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 0.05
```
Result : Still converges, but the first step is much larger due to the steeper gradient. The algorithm successfully navigates this.
x_initial = 0.03 (Very Steep)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 0.03
```
Result : First step is extremely large. Risky—close to missing the minimum.
x_initial = 0.02 (Critical)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 0.02
```
Result : ❌ Diverges! The gradient is so steep that even the first step overshoots dramatically. Starting points in steep regions can cause divergence even with reasonable learning rates, so initial point selection matters.
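The initial-point experiments can also be run as one standalone loop (restating the example function; the printed first step shows why steep starts are risky):

```python
import numpy as np

def dfdx(x):
    return np.exp(x) - 1 / x  # derivative of exp(x) - log(x)

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=25):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

for x0 in [1.6, 0.05, 0.03, 0.02]:
    first_step = -0.1 * dfdx(x0)   # size of the very first move
    x_min = gradient_descent(dfdx, x=x0)
    print(f"x0={x0}: first step {first_step:+.3f}, final x = {x_min:.4f}")
```

From `x0 = 0.02` the first step is so large that the iterates jump over the minimum and eventually leave the domain $x > 0$.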
Function with Multiple Minima
Now consider a more challenging problem: a function with multiple local minima.
```python
from w2_tools import f_example_2, dfdx_example_2

plot_f([0.001, 2], [-6.3, 5], f_example_2, -6)
```
Finding Different Minima
Run gradient descent from different starting points:
```python
print("Gradient descent results")

# Starting from x=1.3 → Global minimum
print("Global minimum: x_min =",
      gradient_descent(dfdx_example_2, x=1.3,
                       learning_rate=0.005, num_iterations=35))

# Starting from x=0.25 → Local minimum
print("Local minimum: x_min =",
      gradient_descent(dfdx_example_2, x=0.25,
                       learning_rate=0.005, num_iterations=35))
```
The Local Minimum Problem : Gradient descent can get “stuck” in local minima. The algorithm has no way to know if it has found the global minimum or just a local one.
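The definition of `f_example_2` lives in `w2_tools`, but the effect is easy to reproduce with any double-well function. A sketch using the illustrative (hypothetical) function $f(x) = x^4 - 3x^2 + x$, which has a global minimum near $x \approx -1.30$ and a local one near $x \approx 1.13$:

```python
import numpy as np

def f(x):
    return x**4 - 3 * x**2 + x

def dfdx(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(dfdx, x, learning_rate=0.05, num_iterations=100):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

left = gradient_descent(dfdx, x=-1.0)   # starts in the global minimum's basin
right = gradient_descent(dfdx, x=1.0)   # starts in the local minimum's basin
print(f"from -1.0 → {left:.3f}  (f = {f(left):.3f})")
print(f"from +1.0 → {right:.3f}  (f = {f(right):.3f})")
# Both are stationary points, but only one is the global minimum
```

Both runs stop at a point where the derivative is (numerically) zero; the algorithm itself cannot tell which of the two it found.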
Visualizing Multiple Minima
```python
# Try different starting points
num_iterations = 35; learning_rate = 0.005; x_initial = 1.3
# num_iterations = 35; learning_rate = 0.005; x_initial = 0.25
# num_iterations = 35; learning_rate = 0.01; x_initial = 1.3

gd_example_2 = gradient_descent_one_variable(
    [0.001, 2], [-6.3, 5],
    f_example_2, dfdx_example_2,
    gradient_descent, num_iterations,
    learning_rate, x_initial, -6, [0.1, -0.5]
)
```
After the animation, click different points on the plot to see which starting positions lead to the global minimum versus local minima.
Gradient Descent in Two Variables
Extend the algorithm to optimize functions of two variables, $f(x, y)$.
Algorithm for Two Variables
Starting from $(x_0, y_0)$, update both coordinates:

$$x_1 = x_0 - \alpha \frac{\partial f}{\partial x}(x_0, y_0)$$

$$y_1 = y_0 - \alpha \frac{\partial f}{\partial y}(x_0, y_0)$$
The gradient is now a vector: $\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$. The algorithm moves in the direction opposite to this vector.
Implementation
```python
from w2_tools import (plot_f_cont_and_surf, gradient_descent_two_variables,
                      f_example_3, dfdx_example_3, dfdy_example_3)

def gradient_descent(dfdx, dfdy, x, y, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        # Update both coordinates simultaneously, evaluating both partial
        # derivatives at the current point (x, y), as in the update equations
        x, y = (x - learning_rate * dfdx(x, y),
                y - learning_rate * dfdy(x, y))
    return x, y
```
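Before applying the routine to `f_example_3`, it can be checked on a quadratic bowl whose minimum is known exactly. A self-contained sketch with a hypothetical test function (not from `w2_tools`), restating the update with both coordinates changed simultaneously:

```python
def gradient_descent(dfdx, dfdy, x, y, learning_rate=0.1, num_iterations=100):
    for _ in range(num_iterations):
        x, y = (x - learning_rate * dfdx(x, y),
                y - learning_rate * dfdy(x, y))
    return x, y

# f(x, y) = (x - 2)^2 + (y - 3)^2 has its unique minimum at (2, 3)
dfdx = lambda x, y: 2 * (x - 2)
dfdy = lambda x, y: 2 * (y - 3)

x_min, y_min = gradient_descent(dfdx, dfdy, x=0.0, y=0.0)
print(x_min, y_min)  # both coordinates approach the known minimum (2, 3)
```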
Function with One Global Minimum
Visualize a two-variable function:
```python
plot_f_cont_and_surf([0, 5], [0, 5], [74, 85],
                     f_example_3, cmap='coolwarm',
                     view={'azim': -60, 'elev': 28})
```
Optimize the function:
```python
num_iterations = 30
learning_rate = 0.25
x_initial = 0.5
y_initial = 0.6

print("Gradient descent result: x_min, y_min =",
      gradient_descent(dfdx_example_3, dfdy_example_3,
                       x_initial, y_initial,
                       learning_rate, num_iterations))
```
Interactive Two-Variable Visualization
```python
num_iterations = 20; learning_rate = 0.25
x_initial = 0.5; y_initial = 0.6

# Try different parameters:
# num_iterations = 20; learning_rate = 0.5; x_initial = 0.5; y_initial = 0.6
# num_iterations = 20; learning_rate = 0.15; x_initial = 0.5; y_initial = 0.6
# num_iterations = 20; learning_rate = 0.15; x_initial = 3.5; y_initial = 3.6

gd_example_3 = gradient_descent_two_variables(
    [0, 5], [0, 5], [74, 85],
    f_example_3, dfdx_example_3, dfdy_example_3,
    gradient_descent, num_iterations, learning_rate,
    x_initial, y_initial,
    [0.1, 0.1, 81.5], 2, [4, 1, 171],
    cmap='coolwarm', view={'azim': -60, 'elev': 28}
)
```
The gradient vector points perpendicular to contour lines. Gradient descent takes steps perpendicular to these contours, moving toward lower values.
Complex Two-Variable Function
Optimize a function with multiple minima:
```python
from w2_tools import f_example_4, dfdx_example_4, dfdy_example_4

plot_f_cont_and_surf([0, 5], [0, 5], [6, 9.5],
                     f_example_4, cmap='terrain',
                     view={'azim': -63, 'elev': 21})
```
Test different starting points:
```python
# Converges to global minimum
num_iterations = 30; learning_rate = 0.2
x_initial = 0.5; y_initial = 3

# Converges to local minimum
# num_iterations = 20; learning_rate = 0.2
# x_initial = 2; y_initial = 3

# Converges to another local minimum
# num_iterations = 20; learning_rate = 0.2
# x_initial = 4; y_initial = 0.5

print("Result:",
      gradient_descent(dfdx_example_4, dfdy_example_4,
                       x_initial, y_initial,
                       learning_rate, num_iterations))
```
Key Insights
Advantages
Simple to implement
Computationally efficient
Works for high dimensions
Requires only first derivatives
Limitations
Sensitive to learning rate
Sensitive to initial point
Can get stuck in local minima
Requires parameter tuning
Parameter Selection Guidelines
| Parameter | Too Small | Optimal | Too Large |
|---|---|---|---|
| Learning Rate | Slow convergence | Fast convergence | Divergence/oscillation |
| Iterations | Doesn’t converge | Just enough | Wasted computation |
| Initial Point | May find local minimum | Finds global minimum | May diverge |
Finding Good Parameters : Start with small learning rates (0.001-0.01) and increase until convergence becomes unstable. Use multiple random initial points to avoid local minima.
Advanced Considerations
Why does learning rate matter so much?
The learning rate controls step size. Too small means slow progress; too large causes overshooting. The optimal rate depends on the function’s curvature—steeper functions need smaller rates.
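This dependence on curvature is exact for a quadratic. For $f(x) = \frac{c}{2}x^2$ the update is $x \leftarrow (1 - \alpha c)\,x$, so the iterates shrink only when $|1 - \alpha c| < 1$, i.e. $\alpha < 2/c$. A quick numerical check of this standard stability bound (the values of `c` and the rates are illustrative):

```python
def run(c, lr, x=1.0, steps=50):
    # Gradient of f(x) = c/2 * x^2 is c * x, so each step multiplies x by (1 - lr*c)
    for _ in range(steps):
        x = x - lr * c * x
    return x

c = 10.0                     # curvature; stability requires lr < 2 / c = 0.2
print(abs(run(c, lr=0.15)))  # |1 - 1.5| = 0.5 per step: shrinks toward 0
print(abs(run(c, lr=0.25)))  # |1 - 2.5| = 1.5 per step: grows without bound
```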
How to escape local minima?
Strategies include:
Random restarts : Run gradient descent from multiple initial points
Momentum : Add velocity to help escape shallow minima
Stochastic methods : Add randomness to the updates
Adaptive learning rates : Adjust α during optimization
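The first strategy, random restarts, can be sketched on a toy double-well function $f(x) = x^4 - 3x^2 + x$ (illustrative, with a global minimum near $x \approx -1.30$ and a local one near $x \approx 1.13$): run gradient descent from several random starting points and keep the best result.

```python
import numpy as np

def f(x):
    return x**4 - 3 * x**2 + x

def dfdx(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(dfdx, x, learning_rate=0.05, num_iterations=100):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

rng = np.random.default_rng(0)
candidates = [gradient_descent(dfdx, x0)
              for x0 in rng.uniform(-2.0, 2.0, size=20)]
best = min(candidates, key=f)   # keep the restart with the lowest cost
print(best)
```

Each restart lands in whichever minimum's basin contains its starting point; taking the candidate with the lowest $f$ recovers the global minimum with high probability.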
How many iterations do I need?
Monitor the cost function. When it stops decreasing significantly (or changes less than a threshold like 0.001), the algorithm has converged. Typical values: 100-10,000 iterations depending on problem complexity.
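This monitoring can be folded into the loop itself, stopping once the cost change falls below a threshold. A sketch using the earlier example $f(x) = e^x - \log(x)$ (the helper name and tolerance are illustrative):

```python
import numpy as np

def gradient_descent_until_converged(f, dfdx, x, learning_rate=0.1,
                                     max_iterations=10_000, tol=1e-6):
    fx = f(x)
    for iteration in range(1, max_iterations + 1):
        x = x - learning_rate * dfdx(x)
        fx_new = f(x)
        if abs(fx - fx_new) < tol:   # cost has stopped decreasing
            return x, iteration
        fx = fx_new
    return x, max_iterations

f = lambda x: np.exp(x) - np.log(x)
dfdx = lambda x: np.exp(x) - 1 / x
x_min, n = gradient_descent_until_converged(f, dfdx, x=1.6)
print(f"converged to x = {x_min:.4f} after {n} iterations")
```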
Summary
Gradient descent is powerful but requires careful parameter tuning:
Learning rate : Balance between speed and stability
Initial point : Critical for functions with multiple minima
Iterations : Enough to converge, not so many as to waste computation
Monitoring : Track the cost function to verify convergence
In Practice : Machine learning frameworks automatically tune these parameters using techniques like learning rate schedules, adaptive methods (Adam, RMSprop), and early stopping.
Next Steps
Apply gradient descent to real machine learning problems: