Gradient descent is a fundamental optimization algorithm used throughout machine learning. This guide implements gradient descent for functions with varying complexity, exploring parameter selection and convergence behavior.
Prerequisites
Load the required packages:
```python
import numpy as np
import matplotlib.pyplot as plt
from w2_tools import plot_f, gradient_descent_one_variable

%matplotlib widget
```
Function with One Global Minimum
Consider the function $f(x) = e^x - \log(x)$ for $x > 0$. This function has one global minimum that cannot be found analytically.
The Gradient Descent Algorithm
Starting from an initial point $x_0$, gradient descent iteratively moves “down the hill” using:

$$x_1 = x_0 - \alpha \frac{df}{dx}(x_0)$$
where:

- $\alpha > 0$ is the learning rate (step size)
- $\frac{df}{dx}(x_0)$ is the gradient (the derivative evaluated at $x_0$)
- the subtraction moves against the gradient, toward the minimum
The gradient defines the direction of steepest increase. Moving in the opposite direction (subtracting it) leads toward the minimum.
Defining the Function
Implement $f(x) = e^x - \log(x)$ and its derivative:
```python
def f_example_1(x):
    return np.exp(x) - np.log(x)

def dfdx_example_1(x):
    return np.exp(x) - 1 / x
```
Visualize the function:
```python
plot_f([0.001, 2.5], [-0.3, 13], f_example_1, 0.0)
```
Implementing Gradient Descent
Define the Algorithm
```python
def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x
```
Set Parameters
```python
num_iterations = 25
learning_rate = 0.1
x_initial = 1.6
```
These three parameters are crucial:
- `num_iterations`: How many steps to take
- `learning_rate`: Size of each step
- `x_initial`: Starting point
Run Optimization
```python
x_min = gradient_descent(dfdx_example_1, x_initial,
                         learning_rate, num_iterations)
print("Gradient descent result: x_min =", x_min)
```
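If `w2_tools` is unavailable, the same run can be reproduced self-contained and the result checked against the stationarity condition $e^x = 1/x$ (the derivative vanishes at the minimum). A minimal sketch, restating the example functions:

```python
import numpy as np

def f(x):
    return np.exp(x) - np.log(x)

def dfdx(x):
    return np.exp(x) - 1 / x

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

x_min = gradient_descent(dfdx, x=1.6, learning_rate=0.1, num_iterations=25)
# At the minimum the derivative should be nearly zero: exp(x) ≈ 1/x
print(x_min, dfdx(x_min))
```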
Interactive Visualization
Visualize the gradient descent process:
```python
num_iterations = 25
learning_rate = 0.1
x_initial = 1.6

gd_example_1 = gradient_descent_one_variable(
    [0.001, 2.5], [-0.3, 13],
    f_example_1, dfdx_example_1,
    gradient_descent, num_iterations,
    learning_rate, x_initial, 0.0, [0.35, 9.5]
)
```
After the animation ends, click on the plot to choose a new initial point and observe how gradient descent behaves from different starting locations.
Parameter Sensitivity Analysis
Experiment with different parameter combinations to understand their effects:
Learning Rate Effects
Learning Rate = 0.1 (Baseline)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 1.6
```
Result : Converges successfully to the minimum around iteration 21. The learning rate is well-tuned for this problem.
Learning Rate = 0.3 (Increased)
```python
num_iterations = 25; learning_rate = 0.3; x_initial = 1.6
```
Result : Converges faster! Larger steps mean fewer iterations needed. However, larger steps are riskier…
Learning Rate = 0.5 (Too Large)
```python
num_iterations = 25; learning_rate = 0.5; x_initial = 1.6
```
Result : ❌ Diverges! Steps are too large, causing oscillation and missing the minimum entirely. Increasing the learning rate can speed up convergence but may cause divergence; finding the right balance is crucial.
Learning Rate = 0.04 (Too Small)
```python
num_iterations = 25; learning_rate = 0.04; x_initial = 1.6
```
Result : ❌ Doesn’t converge within 25 iterations. Steps are too small, making progress very slow. Increasing the iteration count compensates:

```python
# Increase iterations to compensate for the small learning rate
num_iterations = 75; learning_rate = 0.04; x_initial = 1.6
```
Result : ✓ Converges, but uses 3x more iterations—computationally expensive.
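The four experiments above can be compressed into a single sweep. A standalone sketch (the example function is restated, and the convergence thresholds are illustrative):

```python
import numpy as np

def dfdx(x):
    return np.exp(x) - 1 / x  # derivative of exp(x) - log(x)

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=25):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

for lr in [0.04, 0.1, 0.3, 0.5]:
    x_min = gradient_descent(dfdx, x=1.6, learning_rate=lr)
    # Treat a near-zero derivative at the final point as convergence
    status = "converged" if abs(dfdx(x_min)) < 1e-3 else "not converged"
    print(f"lr={lr}: final x = {x_min:.4f} ({status})")
```

With `lr=0.5` the iterates overshoot past $x = 0$ and leave the domain of $\log(x)$ entirely, which is the divergence seen in the animation.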
Initial Point Effects
x_initial = 1.6 (Baseline)

```python
num_iterations = 25; learning_rate = 0.1; x_initial = 1.6
```
Result : Converges smoothly. The function gradient is moderate at this point.
x_initial = 0.05 (Steep Region)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 0.05
```
Result : Still converges, but the first step is much larger due to the steeper gradient. The algorithm successfully navigates this.
x_initial = 0.03 (Very Steep)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 0.03
```
Result : First step is extremely large. Risky—close to missing the minimum.
x_initial = 0.02 (Critical)
```python
num_iterations = 25; learning_rate = 0.1; x_initial = 0.02
```
Result : ❌ Diverges! The gradient is so steep that even the first step overshoots dramatically. Starting points in steep regions can cause divergence even with reasonable learning rates, so initial point selection matters.
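The initial-point experiments can also be run as one standalone loop (restating the example function; the printed first step shows why steep starts are risky):

```python
import numpy as np

def dfdx(x):
    return np.exp(x) - 1 / x  # derivative of exp(x) - log(x)

def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=25):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

for x0 in [1.6, 0.05, 0.03, 0.02]:
    first_step = -0.1 * dfdx(x0)   # size of the very first move
    x_min = gradient_descent(dfdx, x=x0)
    print(f"x0={x0}: first step {first_step:+.3f}, final x = {x_min:.4f}")
```

From `x0 = 0.02` the first step is so large that the iterates jump over the minimum and eventually leave the domain $x > 0$.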
Function with Multiple Minima
Now consider a more challenging problem: a function with multiple local minima.
```python
from w2_tools import f_example_2, dfdx_example_2

plot_f([0.001, 2], [-6.3, 5], f_example_2, -6)
```
Finding Different Minima
Run gradient descent from different starting points:
```python
print("Gradient descent results")

# Starting from x=1.3 → Global minimum
print("Global minimum: x_min =",
      gradient_descent(dfdx_example_2, x=1.3,
                       learning_rate=0.005, num_iterations=35))

# Starting from x=0.25 → Local minimum
print("Local minimum: x_min =",
      gradient_descent(dfdx_example_2, x=0.25,
                       learning_rate=0.005, num_iterations=35))
```
The Local Minimum Problem : Gradient descent can get “stuck” in local minima. The algorithm has no way to know if it has found the global minimum or just a local one.
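The definition of `f_example_2` lives in `w2_tools`, but the effect is easy to reproduce with any double-well function. A sketch using the illustrative (hypothetical) function $f(x) = x^4 - 3x^2 + x$, which has a global minimum near $x \approx -1.30$ and a local one near $x \approx 1.13$:

```python
import numpy as np

def f(x):
    return x**4 - 3 * x**2 + x

def dfdx(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(dfdx, x, learning_rate=0.05, num_iterations=100):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

left = gradient_descent(dfdx, x=-1.0)   # starts in the global minimum's basin
right = gradient_descent(dfdx, x=1.0)   # starts in the local minimum's basin
print(f"from -1.0 → {left:.3f}  (f = {f(left):.3f})")
print(f"from +1.0 → {right:.3f}  (f = {f(right):.3f})")
# Both are stationary points, but only one is the global minimum
```

Both runs stop at a point where the derivative is (numerically) zero; the algorithm itself cannot tell which of the two it found.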
Visualizing Multiple Minima
```python
# Try different starting points
num_iterations = 35; learning_rate = 0.005; x_initial = 1.3
# num_iterations = 35; learning_rate = 0.005; x_initial = 0.25
# num_iterations = 35; learning_rate = 0.01; x_initial = 1.3

gd_example_2 = gradient_descent_one_variable(
    [0.001, 2], [-6.3, 5],
    f_example_2, dfdx_example_2,
    gradient_descent, num_iterations,
    learning_rate, x_initial, -6, [0.1, -0.5]
)
```
After the animation, click different points on the plot to see which starting positions lead to the global minimum versus local minima.
Gradient Descent in Two Variables
Extend the algorithm to optimize functions of two variables, $f(x, y)$.
Algorithm for Two Variables
Starting from $(x_0, y_0)$, update both coordinates:

$$x_1 = x_0 - \alpha \frac{\partial f}{\partial x}(x_0, y_0)$$

$$y_1 = y_0 - \alpha \frac{\partial f}{\partial y}(x_0, y_0)$$
The gradient is now a vector: $\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$. The algorithm moves in the direction opposite to this vector.
Implementation
```python
from w2_tools import (plot_f_cont_and_surf, gradient_descent_two_variables,
                      f_example_3, dfdx_example_3, dfdy_example_3)

def gradient_descent(dfdx, dfdy, x, y, learning_rate=0.1, num_iterations=100):
    for iteration in range(num_iterations):
        # Update both coordinates simultaneously, evaluating both partial
        # derivatives at the current point (x, y), as in the update equations
        x, y = (x - learning_rate * dfdx(x, y),
                y - learning_rate * dfdy(x, y))
    return x, y
```
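Before applying the routine to `f_example_3`, it can be checked on a quadratic bowl whose minimum is known exactly. A self-contained sketch with a hypothetical test function (not from `w2_tools`), restating the update with both coordinates changed simultaneously:

```python
def gradient_descent(dfdx, dfdy, x, y, learning_rate=0.1, num_iterations=100):
    for _ in range(num_iterations):
        x, y = (x - learning_rate * dfdx(x, y),
                y - learning_rate * dfdy(x, y))
    return x, y

# f(x, y) = (x - 2)^2 + (y - 3)^2 has its unique minimum at (2, 3)
dfdx = lambda x, y: 2 * (x - 2)
dfdy = lambda x, y: 2 * (y - 3)

x_min, y_min = gradient_descent(dfdx, dfdy, x=0.0, y=0.0)
print(x_min, y_min)  # both coordinates approach the known minimum (2, 3)
```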
Function with One Global Minimum
Visualize a two-variable function:
```python
plot_f_cont_and_surf([0, 5], [0, 5], [74, 85],
                     f_example_3, cmap='coolwarm',
                     view={'azim': -60, 'elev': 28})
```
Optimize the function:
```python
num_iterations = 30
learning_rate = 0.25
x_initial = 0.5
y_initial = 0.6

print("Gradient descent result: x_min, y_min =",
      gradient_descent(dfdx_example_3, dfdy_example_3,
                       x_initial, y_initial,
                       learning_rate, num_iterations))
```
Interactive Two-Variable Visualization
```python
num_iterations = 20; learning_rate = 0.25
x_initial = 0.5; y_initial = 0.6

# Try different parameters:
# num_iterations = 20; learning_rate = 0.5; x_initial = 0.5; y_initial = 0.6
# num_iterations = 20; learning_rate = 0.15; x_initial = 0.5; y_initial = 0.6
# num_iterations = 20; learning_rate = 0.15; x_initial = 3.5; y_initial = 3.6

gd_example_3 = gradient_descent_two_variables(
    [0, 5], [0, 5], [74, 85],
    f_example_3, dfdx_example_3, dfdy_example_3,
    gradient_descent, num_iterations, learning_rate,
    x_initial, y_initial,
    [0.1, 0.1, 81.5], 2, [4, 1, 171],
    cmap='coolwarm', view={'azim': -60, 'elev': 28}
)
```
The gradient vector points perpendicular to contour lines. Gradient descent takes steps perpendicular to these contours, moving toward lower values.
Complex Two-Variable Function
Optimize a function with multiple minima:
```python
from w2_tools import f_example_4, dfdx_example_4, dfdy_example_4

plot_f_cont_and_surf([0, 5], [0, 5], [6, 9.5],
                     f_example_4, cmap='terrain',
                     view={'azim': -63, 'elev': 21})
```
Test different starting points:
```python
# Converges to global minimum
num_iterations = 30; learning_rate = 0.2
x_initial = 0.5; y_initial = 3

# Converges to local minimum
# num_iterations = 20; learning_rate = 0.2
# x_initial = 2; y_initial = 3

# Converges to another local minimum
# num_iterations = 20; learning_rate = 0.2
# x_initial = 4; y_initial = 0.5

print("Result:",
      gradient_descent(dfdx_example_4, dfdy_example_4,
                       x_initial, y_initial,
                       learning_rate, num_iterations))
```
Key Insights
Advantages
Simple to implement
Computationally efficient
Works for high dimensions
Requires only first derivatives
Limitations
Sensitive to learning rate
Sensitive to initial point
Can get stuck in local minima
Requires parameter tuning
Parameter Selection Guidelines
| Parameter | Too Small | Optimal | Too Large |
|---|---|---|---|
| Learning Rate | Slow convergence | Fast convergence | Divergence/oscillation |
| Iterations | Doesn’t converge | Just enough | Wasted computation |
| Initial Point | May find local minimum | Finds global minimum | May diverge |
Finding Good Parameters : Start with small learning rates (0.001-0.01) and increase until convergence becomes unstable. Use multiple random initial points to avoid local minima.
Advanced Considerations
Why does learning rate matter so much?
The learning rate controls step size. Too small means slow progress; too large causes overshooting. The optimal rate depends on the function’s curvature—steeper functions need smaller rates.
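This dependence on curvature is exact for a quadratic. For $f(x) = \frac{c}{2}x^2$ the update is $x \leftarrow (1 - \alpha c)\,x$, so the iterates shrink only when $|1 - \alpha c| < 1$, i.e. $\alpha < 2/c$. A quick numerical check of this standard stability bound (the values of `c` and the rates are illustrative):

```python
def run(c, lr, x=1.0, steps=50):
    # Gradient of f(x) = c/2 * x^2 is c * x, so each step multiplies x by (1 - lr*c)
    for _ in range(steps):
        x = x - lr * c * x
    return x

c = 10.0                     # curvature; stability requires lr < 2 / c = 0.2
print(abs(run(c, lr=0.15)))  # |1 - 1.5| = 0.5 per step: shrinks toward 0
print(abs(run(c, lr=0.25)))  # |1 - 2.5| = 1.5 per step: grows without bound
```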
How to escape local minima?
Strategies include:
Random restarts : Run gradient descent from multiple initial points
Momentum : Add velocity to help escape shallow minima
Stochastic methods : Add randomness to the updates
Adaptive learning rates : Adjust α during optimization
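The first strategy, random restarts, can be sketched on a toy double-well function $f(x) = x^4 - 3x^2 + x$ (illustrative, with a global minimum near $x \approx -1.30$ and a local one near $x \approx 1.13$): run gradient descent from several random starting points and keep the best result.

```python
import numpy as np

def f(x):
    return x**4 - 3 * x**2 + x

def dfdx(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(dfdx, x, learning_rate=0.05, num_iterations=100):
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

rng = np.random.default_rng(0)
candidates = [gradient_descent(dfdx, x0)
              for x0 in rng.uniform(-2.0, 2.0, size=20)]
best = min(candidates, key=f)   # keep the restart with the lowest cost
print(best)
```

Each restart lands in whichever minimum's basin contains its starting point; taking the candidate with the lowest $f$ recovers the global minimum with high probability.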
How many iterations do I need?
Monitor the cost function. When it stops decreasing significantly (or changes less than a threshold like 0.001), the algorithm has converged. Typical values: 100-10,000 iterations depending on problem complexity.
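This monitoring can be folded into the loop itself, stopping once the cost change falls below a threshold. A sketch using the earlier example $f(x) = e^x - \log(x)$ (the helper name and tolerance are illustrative):

```python
import numpy as np

def gradient_descent_until_converged(f, dfdx, x, learning_rate=0.1,
                                     max_iterations=10_000, tol=1e-6):
    fx = f(x)
    for iteration in range(1, max_iterations + 1):
        x = x - learning_rate * dfdx(x)
        fx_new = f(x)
        if abs(fx - fx_new) < tol:   # cost has stopped decreasing
            return x, iteration
        fx = fx_new
    return x, max_iterations

f = lambda x: np.exp(x) - np.log(x)
dfdx = lambda x: np.exp(x) - 1 / x
x_min, n = gradient_descent_until_converged(f, dfdx, x=1.6)
print(f"converged to x = {x_min:.4f} after {n} iterations")
```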
Summary
Gradient descent is powerful but requires careful parameter tuning:
Learning rate : Balance between speed and stability
Initial point : Critical for functions with multiple minima
Iterations : Enough to converge, not so many as to waste computation
Monitoring : Track the cost function to verify convergence
In Practice : Machine learning frameworks automatically tune these parameters using techniques like learning rate schedules, adaptive methods (Adam, RMSprop), and early stopping.
Next Steps
Apply gradient descent to real machine learning problems: