Gradient descent is a fundamental optimization algorithm used throughout machine learning. This guide implements gradient descent for functions of varying complexity, exploring parameter selection and convergence behavior.

Prerequisites

Load the required packages:
import numpy as np
import matplotlib.pyplot as plt
from w2_tools import plot_f, gradient_descent_one_variable

%matplotlib widget

Function with One Global Minimum

Consider the function $f(x) = e^x - \log(x)$ for $x > 0$. This function has one global minimum that cannot be found analytically.

The Gradient Descent Algorithm

Starting from an initial point $x_0$, gradient descent iteratively moves “down the hill” using the update

$x_1 = x_0 - \alpha \frac{df}{dx}(x_0)$

where:
  • $\alpha > 0$ is the learning rate (step size)
  • $\frac{df}{dx}(x_0)$ is the gradient (the derivative of $f$ evaluated at $x_0$)
  • Subtraction moves against the gradient, toward the minimum
The gradient defines the direction of steepest increase. Moving in the opposite direction (subtracting it) leads toward the minimum.
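To make the update rule concrete, here is a single hand-computed step on the toy function $f(x) = x^2$ (a simpler function than the example below, chosen only for illustration), whose derivative is $2x$:

```python
# One gradient descent step on the toy function f(x) = x^2 (derivative 2x).
alpha = 0.1   # learning rate
x0 = 2.0      # starting point
grad = 2 * x0             # df/dx evaluated at x0
x1 = x0 - alpha * grad    # step against the gradient
print(x1)  # 1.6 -- closer to the minimum at x = 0
```

Each step shrinks the distance to the minimum; repeating the update drives $x$ toward 0.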

Defining the Function

Implement $f(x) = e^x - \log(x)$ and its derivative:
def f_example_1(x):
    return np.exp(x) - np.log(x)

def dfdx_example_1(x):
    return np.exp(x) - 1/x
Visualize the function:
plot_f([0.001, 2.5], [-0.3, 13], f_example_1, 0.0)

Implementing Gradient Descent

Step 1: Define the Algorithm

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)  # step against the gradient
    return x

Step 2: Set Parameters

num_iterations = 25
learning_rate = 0.1
x_initial = 1.6
These three parameters are crucial:
  • num_iterations: How many steps to take
  • learning_rate: Size of each step
  • x_initial: Starting point

Step 3: Run Optimization

x_min = gradient_descent(dfdx_example_1, x_initial, 
                         learning_rate, num_iterations)
print("Gradient descent result: x_min =", x_min)
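As a sanity check, the derivative should be close to zero at the returned point. This sketch repeats the definitions above so it runs standalone; the result should land near $x \approx 0.567$, the solution of $e^x = 1/x$:

```python
import numpy as np

def dfdx_example_1(x):
    return np.exp(x) - 1 / x

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

x_min = gradient_descent(dfdx_example_1, 1.6, learning_rate=0.1, num_iterations=25)
# At a minimum the derivative vanishes, so |df/dx(x_min)| should be tiny.
print(f"x_min = {x_min:.4f}, |df/dx(x_min)| = {abs(dfdx_example_1(x_min)):.2e}")
```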

Interactive Visualization

Visualize the gradient descent process:
num_iterations = 25
learning_rate = 0.1
x_initial = 1.6

gd_example_1 = gradient_descent_one_variable(
    [0.001, 2.5], [-0.3, 13], 
    f_example_1, dfdx_example_1, 
    gradient_descent, num_iterations, 
    learning_rate, x_initial, 0.0, [0.35, 9.5]
)
After the animation ends, click on the plot to choose a new initial point and observe how gradient descent behaves from different starting locations.

Parameter Sensitivity Analysis

Experiment with different parameter combinations to understand their effects:

Learning Rate Effects

num_iterations = 25; learning_rate = 0.1; x_initial = 1.6
Result: Converges successfully to the minimum around iteration 21. The learning rate is well-tuned for this problem.
num_iterations = 25; learning_rate = 0.3; x_initial = 1.6
Result: Converges faster! Larger steps mean fewer iterations needed. However, larger steps are riskier…
num_iterations = 25; learning_rate = 0.5; x_initial = 1.6
Result: ❌ Diverges! Steps are too large, causing oscillation and missing the minimum entirely.
Increasing the learning rate can speed up convergence but may cause divergence. Finding the right balance is crucial.
num_iterations = 25; learning_rate = 0.04; x_initial = 1.6
Result: ❌ Doesn’t converge within 25 iterations. Steps are too small, making progress very slow.
# Increase iterations to compensate
num_iterations = 75; learning_rate = 0.04; x_initial = 1.6
Result: ✓ Converges, but uses 3x more iterations—computationally expensive.
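The learning-rate experiments above can also be reproduced programmatically. This self-contained sketch (repeating the earlier definitions) sweeps three rates and reports the final derivative magnitude; a small value indicates convergence within the 25 iterations. Still larger rates such as 0.5 overshoot out of the domain $x > 0$ entirely:

```python
import numpy as np

def dfdx_example_1(x):
    return np.exp(x) - 1 / x

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

for lr in [0.04, 0.1, 0.3]:
    x_min = gradient_descent(dfdx_example_1, 1.6, learning_rate=lr, num_iterations=25)
    # A small final |df/dx| indicates convergence within 25 iterations.
    print(f"learning_rate={lr}: x_min={x_min:.4f}, "
          f"|df/dx|={abs(dfdx_example_1(x_min)):.2e}")
```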

Initial Point Effects

num_iterations = 25; learning_rate = 0.1; x_initial = 1.6
Result: Converges smoothly. The function gradient is moderate at this point.
num_iterations = 25; learning_rate = 0.1; x_initial = 0.05
Result: Still converges, but the first step is much larger due to the steeper gradient. The algorithm successfully navigates this.
num_iterations = 25; learning_rate = 0.1; x_initial = 0.03
Result: First step is extremely large. Risky—close to missing the minimum.
num_iterations = 25; learning_rate = 0.1; x_initial = 0.02
Result: ❌ Diverges! The gradient is so steep that even the first step overshoots dramatically.
Starting points in steep regions can cause divergence even with reasonable learning rates. Initial point selection matters!
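The same sweep over starting points, as a self-contained sketch. Since $f$ is only defined for $x > 0$, a final point outside that domain is a clear sign of divergence:

```python
import numpy as np

def dfdx_example_1(x):
    return np.exp(x) - 1 / x

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

for x0 in [1.6, 0.05, 0.03, 0.02]:
    x_min = gradient_descent(dfdx_example_1, x0, learning_rate=0.1, num_iterations=25)
    # The function is only defined for x > 0; leaving that domain means divergence.
    ok = x_min > 0 and abs(dfdx_example_1(x_min)) < 1e-3
    print(f"x_initial={x0}: x_min={x_min:.4f} ({'converged' if ok else 'diverged'})")
```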

Function with Multiple Minima

Now consider a more challenging problem: a function with multiple local minima.
from w2_tools import f_example_2, dfdx_example_2

plot_f([0.001, 2], [-6.3, 5], f_example_2, -6)

Finding Different Minima

Run gradient descent from different starting points:
print("Gradient descent results")

# Starting from x=1.3 → Global minimum
print("Global minimum: x_min =", 
      gradient_descent(dfdx_example_2, x=1.3, 
                      learning_rate=0.005, num_iterations=35))

# Starting from x=0.25 → Local minimum
print("Local minimum: x_min =", 
      gradient_descent(dfdx_example_2, x=0.25, 
                      learning_rate=0.005, num_iterations=35))
The Local Minimum Problem: Gradient descent can get “stuck” in local minima. The algorithm has no way to know if it has found the global minimum or just a local one.
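Since `f_example_2` lives in `w2_tools` and its formula is not shown here, this sketch demonstrates the same pitfall, and the standard random-restart remedy, on a hypothetical two-minimum function $g(x) = x^4 - 3x^2 + x$ (chosen for illustration; it has a global minimum near $x \approx -1.30$ and a local one near $x \approx 1.13$):

```python
def g(x):
    return x**4 - 3 * x**2 + x

def dgdx(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

# Random restarts: run from several starting points and keep the best result.
candidates = [gradient_descent(dgdx, x0, learning_rate=0.05, num_iterations=100)
              for x0 in [-2.0, -1.0, 0.5, 1.5]]
best = min(candidates, key=g)
print(f"best x_min = {best:.3f}, g = {g(best):.3f}")  # the left (global) minimum
```

Starts to the right of the local maximum slide into the local minimum; comparing the function values across runs recovers the global one.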

Visualizing Multiple Minima

# Try different starting points
num_iterations = 35; learning_rate = 0.005; x_initial = 1.3
# num_iterations = 35; learning_rate = 0.005; x_initial = 0.25
# num_iterations = 35; learning_rate = 0.01; x_initial = 1.3

gd_example_2 = gradient_descent_one_variable(
    [0.001, 2], [-6.3, 5], 
    f_example_2, dfdx_example_2, 
    gradient_descent, num_iterations, 
    learning_rate, x_initial, -6, [0.1, -0.5]
)
After the animation, click different points on the plot to see which starting positions lead to the global minimum versus local minima.

Gradient Descent in Two Variables

Extend the algorithm to optimize functions of two variables, $f(x, y)$.

Algorithm for Two Variables

Starting from $(x_0, y_0)$, update both coordinates:

$x_1 = x_0 - \alpha \frac{\partial f}{\partial x}(x_0, y_0)$
$y_1 = y_0 - \alpha \frac{\partial f}{\partial y}(x_0, y_0)$

The gradient is now a vector: $\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$. The algorithm moves in the direction opposite to this vector.

Implementation

from w2_tools import (plot_f_cont_and_surf,
                      gradient_descent_two_variables,
                      f_example_3, dfdx_example_3, dfdy_example_3)

def gradient_descent(dfdx, dfdy, x, y, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        # Update both coordinates simultaneously, with the partial
        # derivatives evaluated at the same point (x, y).
        x, y = x - learning_rate * dfdx(x, y), y - learning_rate * dfdy(x, y)
    return x, y
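A quick check of the two-variable version on a simple bowl, $f(x, y) = (x - 2)^2 + (y - 3)^2$ (a hypothetical test function, not one of the `w2_tools` examples), whose minimum sits at $(2, 3)$:

```python
def gradient_descent(dfdx, dfdy, x, y, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        # Simultaneous update of both coordinates.
        x, y = x - learning_rate * dfdx(x, y), y - learning_rate * dfdy(x, y)
    return x, y

# Partial derivatives of the bowl f(x, y) = (x - 2)^2 + (y - 3)^2.
dfdx = lambda x, y: 2 * (x - 2)
dfdy = lambda x, y: 2 * (y - 3)

x_min, y_min = gradient_descent(dfdx, dfdy, 0.0, 0.0,
                                learning_rate=0.1, num_iterations=100)
print(f"x_min={x_min:.4f}, y_min={y_min:.4f}")  # approximately (2, 3)
```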

Function with One Global Minimum

Visualize a two-variable function:
plot_f_cont_and_surf([0, 5], [0, 5], [74, 85], 
                     f_example_3, cmap='coolwarm', 
                     view={'azim': -60, 'elev': 28})
Optimize the function:
num_iterations = 30
learning_rate = 0.25
x_initial = 0.5
y_initial = 0.6

print("Gradient descent result: x_min, y_min =", 
      gradient_descent(dfdx_example_3, dfdy_example_3, 
                      x_initial, y_initial, 
                      learning_rate, num_iterations))

Interactive Two-Variable Visualization

num_iterations = 20; learning_rate = 0.25
x_initial = 0.5; y_initial = 0.6
# Try different parameters:
# num_iterations = 20; learning_rate = 0.5; x_initial = 0.5; y_initial = 0.6
# num_iterations = 20; learning_rate = 0.15; x_initial = 0.5; y_initial = 0.6
# num_iterations = 20; learning_rate = 0.15; x_initial = 3.5; y_initial = 3.6

gd_example_3 = gradient_descent_two_variables(
    [0, 5], [0, 5], [74, 85], 
    f_example_3, dfdx_example_3, dfdy_example_3, 
    gradient_descent, num_iterations, learning_rate, 
    x_initial, y_initial, 
    [0.1, 0.1, 81.5], 2, [4, 1, 171], 
    cmap='coolwarm', view={'azim': -60, 'elev': 28}
)
The gradient vector points perpendicular to contour lines. Gradient descent takes steps perpendicular to these contours, moving toward lower values.

Complex Two-Variable Function

Optimize a function with multiple minima:
from w2_tools import f_example_4, dfdx_example_4, dfdy_example_4

plot_f_cont_and_surf([0, 5], [0, 5], [6, 9.5], 
                     f_example_4, cmap='terrain', 
                     view={'azim': -63, 'elev': 21})
Test different starting points:
# Converges to global minimum
num_iterations = 30; learning_rate = 0.2
x_initial = 0.5; y_initial = 3

# Converges to local minimum
# num_iterations = 20; learning_rate = 0.2
# x_initial = 2; y_initial = 3

# Converges to another local minimum
# num_iterations = 20; learning_rate = 0.2
# x_initial = 4; y_initial = 0.5

print("Result:", 
      gradient_descent(dfdx_example_4, dfdy_example_4, 
                      x_initial, y_initial, 
                      learning_rate, num_iterations))

Key Insights

Advantages

  • Simple to implement
  • Computationally efficient
  • Works for high dimensions
  • Requires only first derivatives

Limitations

  • Sensitive to learning rate
  • Sensitive to initial point
  • Can get stuck in local minima
  • Requires parameter tuning

Parameter Selection Guidelines

| Parameter     | Too Small              | Optimal             | Too Large              |
| ------------- | ---------------------- | ------------------- | ---------------------- |
| Learning rate | Slow convergence       | Fast convergence    | Divergence/oscillation |
| Iterations    | Doesn’t converge       | Just enough         | Wasted computation     |
| Initial point | May find local minimum | Finds global minimum | May diverge           |
Finding Good Parameters: Start with small learning rates (0.001-0.01) and increase until convergence becomes unstable. Use multiple random initial points to avoid local minima.

Advanced Considerations

The learning rate controls step size. Too small means slow progress; too large causes overshooting. The optimal rate depends on the function’s curvature—steeper functions need smaller rates.
Strategies include:
  • Random restarts: Run gradient descent from multiple initial points
  • Momentum: Add velocity to help escape shallow minima
  • Stochastic methods: Add randomness to the updates
  • Adaptive learning rates: Adjust α during optimization
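As an illustration of the momentum idea, here is a minimal sketch on a simple quadratic (this is not part of the lab code; the `beta` parameter and test function are assumptions for the example):

```python
def gradient_descent_momentum(dfdx, x, learning_rate=0.1, beta=0.9,
                              num_iterations=200):
    v = 0.0  # velocity: exponentially decaying sum of past gradient steps
    for iteration in range(num_iterations):
        v = beta * v + learning_rate * dfdx(x)
        x = x - v  # step along the accumulated velocity
    return x

# Minimize f(x) = (x - 1)^2, whose derivative is 2(x - 1).
x_min = gradient_descent_momentum(lambda x: 2 * (x - 1), x=5.0)
print(f"x_min = {x_min:.4f}")  # approximately 1.0
```

The velocity term carries the iterate through flat or shallow regions where plain gradient steps would stall.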
Monitor the cost function. When it stops decreasing significantly (or changes less than a threshold like 0.001), the algorithm has converged. Typical values: 100-10,000 iterations depending on problem complexity.
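A convergence check of this kind can be folded directly into the loop. This sketch stops once the cost changes by less than a tolerance (the `tol` parameter and test function are assumptions for the example):

```python
def gradient_descent_with_stopping(f, dfdx, x, learning_rate=0.1,
                                   tol=1e-9, max_iterations=10_000):
    f_prev = f(x)
    for iteration in range(max_iterations):
        x = x - learning_rate * dfdx(x)
        f_curr = f(x)
        if abs(f_prev - f_curr) < tol:  # cost has stopped decreasing
            return x, iteration + 1
        f_prev = f_curr
    return x, max_iterations

f = lambda x: (x - 1) ** 2
dfdx = lambda x: 2 * (x - 1)
x_min, n_steps = gradient_descent_with_stopping(f, dfdx, x=5.0)
print(f"x_min = {x_min:.4f} after {n_steps} iterations")  # stops long before 10,000
```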

Summary

Gradient descent is powerful but requires careful parameter tuning:
  1. Learning rate: Balance between speed and stability
  2. Initial point: Critical for functions with multiple minima
  3. Iterations: Enough to converge, not so many as to waste computation
  4. Monitoring: Track the cost function to verify convergence
In Practice: Machine learning frameworks automatically tune these parameters using techniques like learning rate schedules, adaptive methods (Adam, RMSprop), and early stopping.

Next Steps

Apply gradient descent to real machine learning problems.
