This guide builds a neural network classifier using a single perceptron with sigmoid activation. You’ll learn to separate data into classes, implement log loss, and train models on linearly separable datasets.

Prerequisites

Import required packages:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors
from sklearn.datasets import make_blobs

%matplotlib inline
np.random.seed(3)

Simple Classification Problem

Classification assigns observations to categories. Binary classification has exactly two categories.

Example: Sentiment Classification

Classify sentences as “happy” or “angry” based on word counts:
  • Count occurrences of “aack” ($x_1$) and “beep” ($x_2$)
  • Rule: If $x_2 > x_1$ (more “beep”), classify as angry; otherwise, happy
  • This creates a linear decision boundary
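The counting rule above can be sketched as a small function (the name `classify_sentence` is illustrative, and punctuation handling is omitted for simplicity):

```python
def classify_sentence(sentence):
    words = sentence.lower().split()
    x1 = words.count("aack")  # x1: "aack" count
    x2 = words.count("beep")  # x2: "beep" count
    return "angry" if x2 > x1 else "happy"

print(classify_sentence("beep beep aack"))  # more "beep" -> angry
print(classify_sentence("beep aack"))       # tie -> happy
```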

Visualizing Linearly Separable Classes

Consider 4 sentences:
  1. “Beep!” → (0, 1) → Angry
  2. “Aack?” → (1, 0) → Happy
  3. “Beep aack…” → (1, 1) → Happy
  4. ”!?” → (0, 0) → Happy
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.4
x_line = np.arange(xmin, xmax, 0.1)

# Data points (observations) from two classes
ax.scatter(0, 0, color="b")  # Happy
ax.scatter(0, 1, color="r")  # Angry
ax.scatter(1, 0, color="b")  # Happy
ax.scatter(1, 1, color="b")  # Happy

ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
ax.set_xlabel('$x_1$ (aack count)')
ax.set_ylabel('$x_2$ (beep count)')

# Decision boundary: x₂ = x₁ + 0.5
ax.plot(x_line, x_line + 0.5, color="black")
plt.show()
Linearly Separable: Classes that can be separated by a straight line (or hyperplane in higher dimensions). This is the simplest classification scenario.

Finding the Decision Boundary

The line $x_1 - x_2 + 0.5 = 0$ separates the classes:
  • Above the line: $x_1 - x_2 + 0.5 < 0$ → Red class
  • Below the line: $x_1 - x_2 + 0.5 > 0$ → Blue class

Goal: Find parameters $w_1$, $w_2$, and $b$ in the equation $w_1x_1 + w_2x_2 + b = 0$ that define this boundary.

For this simple example, we can see that $w_1 = 1$, $w_2 = -1$, $b = 0.5$. But for complex problems, we need a neural network!
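A quick sanity check of the hand-picked parameters: evaluating $w_1x_1 + w_2x_2 + b$ at each of the four example points should put the single red (“angry”) point on the negative side and the three blue (“happy”) points on the positive side.

```python
# Hand-picked parameters from the example above
w1, w2, b = 1, -1, 0.5

points = {(0, 1): "angry", (1, 0): "happy", (1, 1): "happy", (0, 0): "happy"}
for (x1, x2), label in points.items():
    z = w1 * x1 + w2 * x2 + b
    predicted = "angry" if z < 0 else "happy"
    print((x1, x2), z, predicted, label)
```

All four points land on the expected side of the boundary.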

Single Perceptron with Activation Function

Neural Network Structure

The perceptron performs two operations:
  1. Linear combination: $z^{(i)} = w_1x_1^{(i)} + w_2x_2^{(i)} + b = Wx^{(i)} + b$
  2. Activation function: $a^{(i)} = \sigma(z^{(i)})$
The activation function converts continuous values into class probabilities, enabling classification.

Sigmoid Activation Function

The sigmoid function maps any real number to the range (0, 1):

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties:
  • $\sigma(0) = 0.5$
  • $\sigma(z) \to 1$ as $z \to \infty$
  • $\sigma(z) \to 0$ as $z \to -\infty$
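The listed properties are easy to verify numerically (a quick sketch; the test values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # exactly 0.5
print(sigmoid(50))   # ~1 for large positive z
print(sigmoid(-50))  # ~0 for large negative z
# A related identity: sigmoid(-z) = 1 - sigmoid(z)
print(sigmoid(2) + sigmoid(-2))  # ~1.0
```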

Classification Rule

Use threshold 0.5:

$$\hat{y} = \begin{cases} 1 & \text{if } a > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
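Putting the linear step, the sigmoid activation, and the threshold together for a single example (the weights here are illustrative, with the sign chosen so the red/“angry” class is class 1):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative parameters: negated hand-picked values, so z > 0 on the red side
W = np.array([[-1.0, 1.0]])   # shape (1, 2)
b = -0.5
x = np.array([[0.0], [1.0]])  # the "angry" point (0, 1), shape (2, 1)

z = W @ x + b                 # linear combination: z = w1*x1 + w2*x2 + b
a = sigmoid(z)                # probability of class 1
y_hat = int(a[0, 0] > 0.5)    # classification rule: threshold at 0.5
print(z[0, 0], a[0, 0], y_hat)
```

Here $z = 0.5$, so $a = \sigma(0.5) \approx 0.62 > 0.5$ and the point is assigned class 1.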

Mathematical Model

For a single training example:

$$\begin{align} z^{(i)} &= Wx^{(i)} + b \\ a^{(i)} &= \sigma(z^{(i)}) \end{align}$$

For $m$ training examples in matrix $X$ (size $2 \times m$):

$$\begin{align} Z &= WX + b \\ A &= \sigma(Z) \end{align}$$

where $b$ is broadcast to size $1 \times m$.

### Log Loss Cost Function

For classification, use **log loss** (cross-entropy):

$$\mathcal{L}(W, b) = \frac{1}{m}\sum_{i=1}^{m} \left[-y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1-a^{(i)})\right]$$

where:

- $y^{(i)} \in \{0, 1\}$ are true labels
- $a^{(i)}$ are predicted probabilities

<Accordion title="Why log loss instead of sum of squares?">
Log loss penalizes confident wrong predictions heavily. For classification, this encourages the model to output probabilities close to 0 or 1, making decision boundaries sharper. Sum of squares doesn't have this property.
</Accordion>

### Backward Propagation

Compute gradients using the chain rule:

$$\begin{align} \frac{\partial \mathcal{L}}{\partial w_1} &= \frac{1}{m}\sum_{i=1}^{m} (a^{(i)} - y^{(i)})x_1^{(i)} \\ \frac{\partial \mathcal{L}}{\partial w_2} &= \frac{1}{m}\sum_{i=1}^{m} (a^{(i)} - y^{(i)})x_2^{(i)} \\ \frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m}\sum_{i=1}^{m} (a^{(i)} - y^{(i)}) \end{align}$$

In matrix form:

$$\begin{align} \frac{\partial \mathcal{L}}{\partial W} &= \frac{1}{m}(A - Y)X^T \\ \frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m}(A - Y)\mathbf{1} \end{align}$$

<Info>
**Key Insight**: These gradient expressions are identical to linear regression! The difference is in forward propagation (sigmoid activation) and the cost function (log loss).
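The analytic gradient formulas can be sanity-checked against a finite-difference approximation of the log loss (a sketch with randomly generated data; the tolerance and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
X = rng.normal(size=(2, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W = rng.normal(size=(1, 2))
b = 0.3

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W, b):
    A = sigmoid(W @ X + b)
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

# Analytic gradients from the matrix formulas
A = sigmoid(W @ X + b)
dW = (A - Y) @ X.T / m
db = np.sum(A - Y) / m

# Finite-difference estimate of dW[0, 0]
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
numeric = (loss(W_plus, b) - loss(W, b)) / eps
print(dW[0, 0], numeric)  # the two values should agree closely
```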
</Info>

### Parameter Updates

$$\begin{align} W &= W - \alpha \frac{\partial \mathcal{L}}{\partial W} \\ b &= b - \alpha \frac{\partial \mathcal{L}}{\partial b} \end{align}$$

## Implementation

### Generate Dataset

Create 30 data points with binary labels:

```python
m = 30
X = np.random.randint(0, 2, (2, m))
Y = np.logical_and(X[0] == 0, X[1] == 1).astype(int).reshape((1, m))

print('Training dataset X containing (x1, x2) coordinates in columns:')
print(X)
print('Training dataset Y containing labels (0: blue, 1: red)')
print(Y)
print('The shape of X is:', X.shape)
print('The shape of Y is:', Y.shape)
print('I have m = %d training examples!' % (X.shape[1]))
```

### Define Sigmoid Function

```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print("sigmoid(-2) =", sigmoid(-2))
print("sigmoid(0) =", sigmoid(0))
print("sigmoid(3.5) =", sigmoid(3.5))
```

Sigmoid works element-wise on arrays:

```python
print(sigmoid(np.array([-2, 0, 3.5])))
# Output: [0.11920292 0.5        0.97068777]
```

### Step 1: Define Layer Sizes

```python
def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)

    Returns:
    n_x -- the size of the input layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[0]
    n_y = Y.shape[0]
    return (n_x, n_y)

(n_x, n_y) = layer_sizes(X, Y)
print("The size of the input layer is: n_x =", n_x)
print("The size of the output layer is: n_y =", n_y)
```

### Step 2: Initialize Parameters

```python
def initialize_parameters(n_x, n_y):
    """
    Returns:
    params -- python dictionary containing your parameters:
              W -- weight matrix of shape (n_y, n_x)
              b -- bias value set as a vector of shape (n_y, 1)
    """
    W = np.random.randn(n_y, n_x) * 0.01
    b = np.zeros((n_y, 1))
    parameters = {"W": W, "b": b}
    return parameters

parameters = initialize_parameters(n_x, n_y)
print("W =", parameters["W"])
print("b =", parameters["b"])
```

### Step 3: Forward Propagation

```python
def forward_propagation(X, parameters):
    """
    Arguments:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters

    Returns:
    A -- The sigmoid output
    """
    W = parameters["W"]
    b = parameters["b"]
    Z = np.matmul(W, X) + b
    A = sigmoid(Z)  # Apply sigmoid activation
    return A

A = forward_propagation(X, parameters)
print("Output vector A:", A)
```

<Note>
The only difference from regression: applying the sigmoid activation function to Z.
</Note>

### Step 4: Compute Cost (Log Loss)

```python
def compute_cost(A, Y):
    """
    Computes the log loss cost function

    Arguments:
    A -- The output of the neural network of shape (n_y, number of examples)
    Y -- "true" labels vector of shape (n_y, number of examples)

    Returns:
    cost -- log loss
    """
    m = Y.shape[1]
    logprobs = -np.multiply(np.log(A), Y) - np.multiply(np.log(1 - A), 1 - Y)
    cost = (1/m) * np.sum(logprobs)
    return cost

print("cost =", compute_cost(A, Y))
```

### Step 5: Backward Propagation

```python
def backward_propagation(A, X, Y):
    """
    Implements the backward propagation, calculating gradients

    Arguments:
    A -- the output of the neural network of shape (n_y, number of examples)
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels vector of shape (n_y, number of examples)

    Returns:
    grads -- python dictionary containing gradients
    """
    m = X.shape[1]
    dZ = A - Y
    dW = (1/m) * np.dot(dZ, X.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    grads = {"dW": dW, "db": db}
    return grads

grads = backward_propagation(A, X, Y)
print("dW =", grads["dW"])
print("db =", grads["db"])
```

### Step 6: Update Parameters

```python
def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Updates parameters using the gradient descent update rule

    Arguments:
    parameters -- python dictionary containing parameters
    grads -- python dictionary containing gradients
    learning_rate -- learning rate parameter for gradient descent

    Returns:
    parameters -- python dictionary containing updated parameters
    """
    W = parameters["W"]
    b = parameters["b"]
    dW = grads["dW"]
    db = grads["db"]
    W = W - learning_rate * dW
    b = b - learning_rate * db
    parameters = {"W": W, "b": b}
    return parameters

parameters_updated = update_parameters(parameters, grads)
print("W updated =", parameters_updated["W"])
print("b updated =", parameters_updated["b"])
```

### Step 7: Build Complete Model

```python
def nn_model(X, Y, num_iterations=10, learning_rate=1.2, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (n_y, number of examples)
    num_iterations -- number of iterations in the loop
    learning_rate -- learning rate parameter for gradient descent
    print_cost -- if True, print the cost every iteration

    Returns:
    parameters -- parameters learnt by the model
    """
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[1]
    parameters = initialize_parameters(n_x, n_y)

    for i in range(0, num_iterations):
        # Forward propagation
        A = forward_propagation(X, parameters)
        # Cost function
        cost = compute_cost(A, Y)
        # Backpropagation
        grads = backward_propagation(A, X, Y)
        # Update parameters
        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters
```

### Train the Model

```python
parameters = nn_model(X, Y, num_iterations=50, learning_rate=1.2, print_cost=True)
print("W =", parameters["W"])
print("b =", parameters["b"])
```

<Info>
After about 40 iterations, the cost decreases very slowly, indicating convergence. This is a good point to stop training.
</Info>

### Visualize Decision Boundary

```python
def plot_decision_boundary(X, Y, parameters):
    W = parameters["W"]
    b = parameters["b"]

    fig, ax = plt.subplots()
    plt.scatter(X[0, :], X[1, :], c=Y, cmap=colors.ListedColormap(['blue', 'red']))

    x_line = np.arange(np.min(X[0, :]), np.max(X[0, :]) * 1.1, 0.1)
    # Decision boundary: W[0,0]*x1 + W[0,1]*x2 + b = 0
    # Solve for x2: x2 = -(W[0,0]/W[0,1])*x1 - b/W[0,1]
    ax.plot(x_line, -W[0, 0] / W[0, 1] * x_line - b[0, 0] / W[0, 1],
            color="black", label="Decision Boundary")
    plt.legend()
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.show()

plot_decision_boundary(X, Y, parameters)
```

<Tip>
The decision boundary is the line where the perceptron output equals 0.5 (or equivalently, where $z = 0$).
</Tip>

### Make Predictions

```python
def predict(X, parameters):
    """
    Using the learned parameters, predicts a class for each example in X

    Arguments:
    parameters -- python dictionary containing your parameters
    X -- input data of size (n_x, m)

    Returns:
    predictions -- vector of predictions (False: blue / True: red)
    """
    A = forward_propagation(X, parameters)
    predictions = A > 0.5
    return predictions

X_pred = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 1]])
Y_pred = predict(X_pred, parameters)

print(f"Coordinates (in columns):\n{X_pred}")
print(f"Predictions:\n{Y_pred}")
```

## Performance on Larger Dataset

Test the model on a more realistic dataset:

```python
n_samples = 1000
samples, labels = make_blobs(
    n_samples=n_samples,
    centers=([2.5, 3], [6.7, 7.9]),
    cluster_std=1.4,
    random_state=0
)
X_larger = np.transpose(samples)
Y_larger = labels.reshape((1, n_samples))

plt.scatter(X_larger[0, :], X_larger[1, :], c=Y_larger,
            cmap=colors.ListedColormap(['blue', 'red']))
plt.show()
```

Train for 100 iterations:

```python
parameters_larger = nn_model(X_larger, Y_larger, num_iterations=100,
                             learning_rate=1.2, print_cost=False)
print("W =", parameters_larger["W"])
print("b =", parameters_larger["b"])
```

Visualize results:

```python
plot_decision_boundary(X_larger, Y_larger, parameters_larger)
```

<Note>
The decision boundary successfully separates the two clusters! Try changing `num_iterations` and `learning_rate` to see how they affect the results.
</Note>

## Key Concepts

<CardGroup cols={2}>
  <Card title="Sigmoid Activation" icon="wave-sine">
    Converts linear output to probabilities between 0 and 1
  </Card>
  <Card title="Log Loss" icon="function">
    Penalizes confident wrong predictions for better classification
  </Card>
  <Card title="Decision Boundary" icon="divide">
    Line (or hyperplane) that separates classes in feature space
  </Card>
  <Card title="Binary Classification" icon="code-branch">
    Assigns observations to one of two categories
  </Card>
</CardGroup>

<Accordion title="Why use sigmoid instead of step function?">
Step functions have zero gradient everywhere (except at the discontinuity, where it's undefined). This breaks gradient descent. Sigmoid is smooth and differentiable everywhere, enabling gradient-based learning.
</Accordion>

<Accordion title="What if classes aren't linearly separable?">
A single perceptron can only learn linear decision boundaries. For non-linear problems, you need:

- Multiple layers (deep neural networks)
- Non-linear activation functions in hidden layers
- More complex architectures
</Accordion>

<Accordion title="How do I choose the classification threshold?">
0.5 is standard, but you can adjust it based on application needs:

- Lower threshold (e.g., 0.3): More false positives, fewer false negatives
- Higher threshold (e.g., 0.7): Fewer false positives, more false negatives

Choose based on the relative costs of different error types.
</Accordion>

## Comparison: Regression vs Classification

| Aspect | Regression | Classification |
|--------|-----------|----------------|
| **Output** | Continuous value | Class label (0 or 1) |
| **Activation** | Identity ($a = z$) | Sigmoid ($a = \sigma(z)$) |
| **Cost Function** | Sum of squares | Log loss |
| **Interpretation** | Predicted value | Probability of class 1 |
| **Decision Rule** | None | Threshold at 0.5 |

<Info>
**Implementation Similarity**: The backward propagation formulas are identical! The key differences are forward propagation (activation function) and the cost function.
</Info>

## Summary

You've built a binary classifier using:

✓ Single perceptron architecture
✓ Sigmoid activation function
✓ Log loss cost function
✓ Gradient descent optimization
✓ Decision boundary visualization
✓ Predictions on new data

This foundation extends to more complex classification problems with multiple layers and classes!

## Next Steps

Explore advanced topics:

- Build multi-class classifiers (softmax activation)
- Add hidden layers for non-linear decision boundaries
- Implement regularization to prevent overfitting
- Try different optimization algorithms (Adam, RMSprop)
