
Introduction

Logistic regression is probably the single most widely used classification algorithm in the world. Despite its name containing “regression,” it’s actually used for classification problems where the output is a category (0 or 1) rather than a continuous number.
Logistic regression is used when the output variable y can take on only one of a small number of discrete values. For binary classification, y is either 0 or 1.

Why Not Linear Regression for Classification?

Linear regression is not suitable for classification problems. Here’s why:

The Problem with Linear Regression

Suppose you’re classifying tumors as malignant (1) or benign (0) based on tumor size:
With a small dataset, linear regression might fit a line that, with a threshold of 0.5, classifies correctly:
  • Predictions < 0.5 → Class 0 (benign)
  • Predictions ≥ 0.5 → Class 1 (malignant)
But add one very large tumor example far to the right, and the best-fit line shifts. The point where the line crosses 0.5 moves, so previously correct predictions become wrong.
Worse, linear regression can output values less than 0 or greater than 1, which doesn't make sense for classification, where we want probabilities between 0 and 1.
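To see this numerically, here is a small sketch (with made-up tumor sizes) that fits an ordinary least-squares line to 0/1 labels and shows its predictions escaping the [0, 1] range:

```python
import numpy as np

# Toy illustration: fit a least-squares line to 0/1 labels
sizes = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Least-squares fit: slope w, intercept b
w, b = np.polyfit(sizes, labels, 1)

print(w * 12.0 + b)   # a very large tumor: prediction exceeds 1
print(w * (-1.0) + b) # prediction drops below 0
```

The fitted line happily produces "probabilities" above 1 and below 0, which is exactly the problem the sigmoid fixes.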

The Sigmoid Function

Logistic regression uses the sigmoid function (also called the logistic function) to squash predictions between 0 and 1.

Mathematical Definition

The sigmoid function is:
g(z) = 1 / (1 + e^(-z))
Where:
  • e ≈ 2.718 (mathematical constant)
  • z can be any real number (-∞ to +∞)
  • g(z) is always between 0 and 1

Properties of the Sigmoid Function

When z is very large (e.g., z = 100):
e^(-100) ≈ 0 (a tiny number)
g(100) = 1 / (1 + 0) ≈ 1
The sigmoid approaches 1.
When z is a very large negative number (e.g., z = -100):
e^(100) is a huge number
g(-100) = 1 / (1 + huge) ≈ 0
The sigmoid approaches 0.
When z = 0:
g(0) = 1 / (1 + e⁰) = 1 / (1 + 1) = 0.5
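These properties are easy to check numerically; a minimal sketch:

```python
import math

def sigmoid(z):
    """Sigmoid: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(100))   # very close to 1
print(sigmoid(-100))  # very close to 0 (e^100 dominates the denominator)
print(sigmoid(0))     # exactly 0.5
```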

Visualizing the Sigmoid

The sigmoid creates an S-shaped curve:
  • Starts near 0 for large negative values
  • Smoothly transitions through 0.5 at z = 0
  • Approaches 1 for large positive values

The Logistic Regression Model

Logistic regression combines linear regression with the sigmoid function in two steps:
Step 1: Compute the linear combination

z = w · x + b
Same as linear regression: a weighted sum of the features plus a bias.

Step 2: Apply the sigmoid function

f(x) = g(z) = 1 / (1 + e^(-z))
Pass z through the sigmoid to get an output between 0 and 1.

Complete Model

Combining these steps:
f(x) = g(w · x + b) = 1 / (1 + e^(-(w·x + b)))
This is the logistic regression model.
You can think of the output as the probability that y = 1 given input x. If f(x) = 0.7, the model estimates a 70% chance that y = 1.

Interpreting the Output

Probability Interpretation

The output f(x) represents:
f(x) = P(y = 1 | x)
Translation: The probability that y equals 1, given input features x.

Example: Tumor Classification

Suppose a patient has a tumor of certain size x, and the model outputs:
f(x) = 0.7
Interpretation:
  • 70% chance the tumor is malignant (y = 1)
  • 30% chance the tumor is benign (y = 0)
Probabilities must sum to 1. If P(y=1) = 0.7, then P(y=0) = 1 - 0.7 = 0.3.

Implementation

Python Implementation

import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid function
    
    Args:
        z: Input value(s), can be scalar or array
    
    Returns:
        g: Sigmoid of z, between 0 and 1
    """
    g = 1 / (1 + np.exp(-z))
    return g

def predict_logistic(x, w, b):
    """
    Make prediction using logistic regression
    
    Args:
        x: Feature vector
        w: Weight vector
        b: Bias parameter
    
    Returns:
        probability: P(y=1|x)
    """
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example: Tumor classification
w = np.array([0.4])  # Weight for tumor size
b = -7.0             # Bias

# Test different tumor sizes
tumor_sizes = np.array([5, 10, 15, 20, 25])

print("Tumor Size | Probability Malignant")
print("-" * 40)
for size in tumor_sizes:
    prob = predict_logistic(np.array([size]), w, b)
    print(f"{size:10.1f} | {prob:0.4f} ({prob*100:.1f}%)")
Output:
Tumor Size | Probability Malignant
----------------------------------------
       5.0 | 0.0067 (0.7%)
      10.0 | 0.0474 (4.7%)
      15.0 | 0.2689 (26.9%)
      20.0 | 0.7311 (73.1%)
      25.0 | 0.9526 (95.3%)

With Multiple Features

# Multiple features: size, age, etc.
def predict_multi_logistic(x, w, b):
    """
    Logistic regression with multiple features
    
    Args:
        x: Feature vector [x1, x2, ..., xn]
        w: Weight vector [w1, w2, ..., wn]
        b: Bias
    
    Returns:
        probability: P(y=1|x)
    """
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example with 2 features
w = np.array([0.5, 0.1])  # Weights for [size, age]
b = -15.0

# Patient: tumor size=20, age=50
patient = np.array([20, 50])
prob = predict_multi_logistic(patient, w, b)

print(f"Probability malignant: {prob:.4f} ({prob*100:.1f}%)")
print(f"Probability benign: {1-prob:.4f} ({(1-prob)*100:.1f}%)")

Decision Boundary

The decision boundary is where the model switches between predicting class 0 and class 1.

Threshold at 0.5

Common decision rule:
  • If f(x) ≥ 0.5, predict y = 1
  • If f(x) < 0.5, predict y = 0
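As a sketch, this rule can be wrapped in a small helper (the name `predict_class` and the example parameters are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, w, b, threshold=0.5):
    """Apply the decision rule: return 1 if f(x) >= threshold, else 0."""
    probability = sigmoid(np.dot(w, x) + b)
    return 1 if probability >= threshold else 0

w = np.array([0.4])
b = -7.0
print(predict_class(np.array([10.0]), w, b))  # f(x) ≈ 0.047 -> predicts 0
print(predict_class(np.array([25.0]), w, b))  # f(x) ≈ 0.953 -> predicts 1
```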

When is f(x) = 0.5?

Since sigmoid(0) = 0.5, we have f(x) = 0.5 when:
z = w · x + b = 0
This equation defines the decision boundary.
With two features x₁ and x₂:
z = w₁*x₁ + w₂*x₂ + b = 0
This is a straight line separating the two classes.
With polynomial features like x₁², x₂²:
z = w₁*x₁² + w₂*x₂² + b = 0
This creates a circular or elliptical boundary.
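A quick sketch confirms both facts: points where z = 0 give exactly f(x) = 0.5, and squared features carve out a circular boundary. Here the illustrative choice w₁ = w₂ = 1, b = -1 puts the boundary on the unit circle:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Circular boundary: z = x1^2 + x2^2 - 1 = 0 (the unit circle)
w = np.array([1.0, 1.0])
b = -1.0

def f(x1, x2):
    features = np.array([x1**2, x2**2])  # polynomial features
    return sigmoid(np.dot(w, features) + b)

print(f(1.0, 0.0))  # on the circle: z = 0, so f = 0.5 exactly
print(f(0.0, 0.0))  # inside: z = -1, f < 0.5 -> predict class 0
print(f(2.0, 0.0))  # outside: z = 3, f > 0.5 -> predict class 1
```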

Cost Function for Logistic Regression

The squared error cost function doesn’t work well for logistic regression: plugging the sigmoid into it produces a non-convex cost surface with many local minima. Instead, we use the logistic loss, also called binary cross-entropy:
J(w, b) = -(1/m) * Σ[y⁽ⁱ⁾ log(f(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾) log(1-f(x⁽ⁱ⁾))]
This cost function:
  • Is convex (one global minimum)
  • Heavily penalizes confident wrong predictions
  • Works well with gradient descent
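The cost formula translates directly into code. A sketch (the helper name `compute_cost_logistic` and the toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic(X, y, w, b):
    """Binary cross-entropy cost J(w, b), averaged over m examples."""
    m = len(y)
    cost = 0.0
    for i in range(m):
        f_wb = sigmoid(np.dot(w, X[i]) + b)
        cost += -y[i] * np.log(f_wb) - (1 - y[i]) * np.log(1 - f_wb)
    return cost / m

# Confident correct predictions -> low cost; confident wrong -> high cost
X = np.array([[1.0], [4.0]])
y = np.array([0, 1])
print(compute_cost_logistic(X, y, np.array([2.0]), -5.0))  # low cost
print(compute_cost_logistic(X, y, np.array([-2.0]), 5.0))  # high cost
```

Flipping the sign of the parameters makes both predictions confidently wrong, and the cost jumps by roughly two orders of magnitude.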

Training with Gradient Descent

Gradient descent for logistic regression:
def compute_gradient_logistic(X, y, w, b):
    """
    Compute gradient for logistic regression
    
    Args:
        X: Training examples (m x n)
        y: Labels (m, )
        w: Weights (n, )
        b: Bias
    
    Returns:
        dj_dw: Gradient of cost w.r.t. w
        dj_db: Gradient of cost w.r.t. b
    """
    m = len(y)
    n = len(w)
    
    dj_dw = np.zeros(n)
    dj_db = 0.0
    
    for i in range(m):
        z = np.dot(w, X[i]) + b
        f_wb = sigmoid(z)
        err = f_wb - y[i]
        
        for j in range(n):
            dj_dw[j] += err * X[i][j]
        dj_db += err
    
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    
    return dj_dw, dj_db

def gradient_descent_logistic(X, y, w_init, b_init, alpha, num_iters):
    """
    Performs gradient descent for logistic regression
    
    Args:
        X: Training examples
        y: Labels  
        w_init: Initial weights
        b_init: Initial bias
        alpha: Learning rate
        num_iters: Number of iterations
    
    Returns:
        w, b: Optimized parameters
    """
    w = w_init
    b = b_init
    
    for i in range(num_iters):
        dj_dw, dj_db = compute_gradient_logistic(X, y, w, b)
        
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        
        if i % 1000 == 0:
            print(f"Iteration {i}: w={w}, b={b:.2f}")
    
    return w, b
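Putting the pieces together, here is an end-to-end sketch on a small made-up, linearly separable dataset, using a vectorized form of the same gradient computed in the loop above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-feature dataset: small tumors benign (0), large tumors malignant (1)
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Batch gradient descent on the logistic cost
w = np.zeros(1)
b = 0.0
alpha = 0.1
for _ in range(10000):
    err = sigmoid(X @ w + b) - y          # (m,) prediction errors
    w -= alpha * (X.T @ err) / len(y)     # dJ/dw
    b -= alpha * np.sum(err) / len(y)     # dJ/db

print(sigmoid(np.dot(w, [2.0]) + b))  # small tumor: probability near 0
print(sigmoid(np.dot(w, [7.0]) + b))  # large tumor: probability near 1
```

The learned boundary lands between the two clusters, so a size-2 tumor gets a probability near 0 and a size-7 tumor a probability near 1.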

Key Takeaways

Classification, Not Regression

Despite its name, logistic regression is used for classification problems, predicting discrete categories rather than continuous values.

Outputs Probabilities

The sigmoid function ensures outputs are between 0 and 1, interpretable as probabilities that y = 1.

Decision Boundary

The decision boundary (where z = 0) separates regions where the model predicts different classes.

Widely Used

Logistic regression is one of the most commonly used algorithms in practice, powering applications from medical diagnosis to ad targeting.

Real-World Applications

Classifying whether a patient has a disease based on symptoms, test results, and medical history.
Determining if an email is spam based on content, sender, subject line, and other features.
Predicting whether a loan applicant will default based on income, credit history, and other factors.
Identifying which customers are likely to stop using a service based on usage patterns and demographics.

What’s Next

Now that you understand logistic regression, explore:
  • Regularization to prevent overfitting in classification
  • Multi-class classification for problems with more than 2 categories
  • Advanced optimization algorithms beyond gradient descent
  • Performance metrics like precision, recall, and F1-score
