This guide builds a neural network classifier using a single perceptron with sigmoid activation. You’ll learn to separate data into classes, implement log loss, and train models on linearly separable datasets.
## Prerequisites
Import required packages:
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors
from sklearn.datasets import make_blobs
%matplotlib inline

np.random.seed(3)
```
## Simple Classification Problem
Classification assigns observations to categories. Binary classification has exactly two categories.
### Example: Sentiment Classification
Classify sentences as “happy” or “angry” based on word counts:
- Count occurrences of “aack” ($x_1$) and “beep” ($x_2$)
- Rule: if $x_2 > x_1$ (more “beep”), classify as angry; otherwise, happy
- This creates a linear decision boundary
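The counting rule above can be sketched directly in code. This is a toy illustration only; `classify_sentence` is a hypothetical helper, not part of the model built below:

```python
# Toy sketch of the word-count rule: more "beep" than "aack" means angry.
def classify_sentence(sentence):
    x1 = sentence.lower().count("aack")  # x1: "aack" occurrences
    x2 = sentence.lower().count("beep")  # x2: "beep" occurrences
    return "angry" if x2 > x1 else "happy"

print(classify_sentence("Beep!"))      # angry: one "beep", no "aack"
print(classify_sentence("Beep aack"))  # happy: a tie goes to happy
```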
### Visualizing Linearly Separable Classes
Consider 4 sentences:
- “Beep!” → (0, 1) → Angry
- “Aack?” → (1, 0) → Happy
- “Beep aack…” → (1, 1) → Happy
- ”!?” → (0, 0) → Happy
```python
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.4
x_line = np.arange(xmin, xmax, 0.1)

# Data points (observations) from two classes
ax.scatter(0, 0, color="b")  # Happy
ax.scatter(0, 1, color="r")  # Angry
ax.scatter(1, 0, color="b")  # Happy
ax.scatter(1, 1, color="b")  # Happy

ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
ax.set_xlabel('$x_1$ (aack count)')
ax.set_ylabel('$x_2$ (beep count)')

# Decision boundary: x2 = x1 + 0.5
ax.plot(x_line, x_line + 0.5, color="black")
plt.show()
```
**Linearly Separable**: Classes that can be separated by a straight line (or a hyperplane in higher dimensions). This is the simplest classification scenario.
### Finding the Decision Boundary
The line $x_1 - x_2 + 0.5 = 0$ separates the classes:
- Above the line: $x_1 - x_2 + 0.5 < 0$ → red class (angry)
- Below the line: $x_1 - x_2 + 0.5 > 0$ → blue class (happy)

**Goal**: Find parameters $w_1$, $w_2$, and $b$ in the equation $w_1 x_1 + w_2 x_2 + b = 0$ that define this boundary.

For this simple example, we can read the parameters off directly: $w_1 = 1$, $w_2 = -1$, $b = 0.5$. But for complex problems, we need a neural network to learn them!
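As a sanity check, the hand-derived parameters can be plugged into the four example points. This is a quick standalone sketch, separate from the training code that follows:

```python
import numpy as np

# Hand-derived boundary: w1*x1 + w2*x2 + b = 0 with w1=1, w2=-1, b=0.5.
w1, w2, b = 1.0, -1.0, 0.5
points = np.array([[0, 1, 1, 0],   # x1 (aack counts) for the four sentences
                   [1, 0, 1, 0]])  # x2 (beep counts)
labels = np.array([1, 0, 0, 0])    # 1: angry (red), 0: happy (blue)

z = w1 * points[0] + w2 * points[1] + b
predictions = (z < 0).astype(int)  # the negative side of the line is the red class
print(predictions)  # should reproduce the labels
```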
## Single Perceptron with Activation Function
### Neural Network Structure
The perceptron performs two operations:
- Linear combination: $z^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + b = Wx^{(i)} + b$
- Activation function: $a^{(i)} = \sigma(z^{(i)})$
The activation function converts continuous values into class probabilities, enabling classification.
### Sigmoid Activation Function
The sigmoid function maps any real number to the range (0, 1):
$$a = \sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
- $\sigma(0) = 0.5$
- $\sigma(z) \to 1$ as $z \to \infty$
- $\sigma(z) \to 0$ as $z \to -\infty$
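These properties are easy to verify numerically. A standalone sketch (the implementation section below defines its own `sigmoid` the same way):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                # exactly 0.5
print(sigmoid(50))               # very close to 1 for large positive z
print(sigmoid(-50))              # very close to 0 for large negative z
print(sigmoid(2) + sigmoid(-2))  # symmetry: sigma(z) + sigma(-z) = 1
```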
### Classification Rule
Use threshold 0.5:
$$\hat{y} = \begin{cases} 1 & \text{if } a > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
### Mathematical Model
For a single training example:
$$\begin{align}
z^{(i)} &= Wx^{(i)} + b \\
a^{(i)} &= \sigma(z^{(i)})
\end{align}$$
For $m$ training examples in matrix $X$ (size $2 \times m$):
$$\begin{align}
Z &= WX + b \\
A &= \sigma(Z)
\end{align}$$
where $b$ is broadcast to size $1 \times m$.
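The shapes can be checked with small made-up arrays (illustrative values only, not the training data generated below):

```python
import numpy as np

# Vectorized forward pass shape check: W is (1, 2), X is (2, m), b is (1, 1).
m = 4
W = np.array([[1.0, -1.0]])           # one output unit, two inputs
X = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0]])  # m examples in columns
b = np.array([[0.5]])                 # broadcast across all m columns

Z = np.matmul(W, X) + b               # (1, m)
A = 1 / (1 + np.exp(-Z))              # (1, m), element-wise sigmoid
print(Z.shape, A.shape)               # (1, 4) (1, 4)
```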
### Log Loss Cost Function
For classification, use **log loss** (cross-entropy):
$$\mathcal{L}(W, b) = \frac{1}{m}\sum_{i=1}^{m} \left[-y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1-a^{(i)})\right]$$
where:
- $y^{(i)} \in \{0, 1\}$ are true labels
- $a^{(i)}$ are predicted probabilities
<Accordion title="Why log loss instead of sum of squares?">
Log loss penalizes confident wrong predictions heavily. For classification, this encourages the model to output probabilities close to 0 or 1, making decision boundaries sharper. Sum of squares doesn't have this property.
</Accordion>
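This claim is easy to see numerically: as a wrong prediction becomes more confident, the squared-error penalty stays bounded by 1, while log loss grows without bound. A sketch with toy values, assuming the true label is $y = 1$:

```python
import numpy as np

# Penalties for an increasingly confident wrong prediction (true label y = 1).
a = np.array([0.5, 0.1, 0.01, 0.001])  # predicted probability of class 1
squared = (1 - a) ** 2                  # sum-of-squares penalty, bounded by 1
logloss = -np.log(a)                    # log loss, unbounded
for ai, sq, ll in zip(a, squared, logloss):
    print(f"a = {ai:5.3f}   squared = {sq:.3f}   log loss = {ll:.3f}")
```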
### Backward Propagation
Compute gradients using the chain rule:
$$\begin{align}
\frac{\partial \mathcal{L}}{\partial w_1} &= \frac{1}{m}\sum_{i=1}^{m} (a^{(i)} - y^{(i)})x_1^{(i)} \\
\frac{\partial \mathcal{L}}{\partial w_2} &= \frac{1}{m}\sum_{i=1}^{m} (a^{(i)} - y^{(i)})x_2^{(i)} \\
\frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m}\sum_{i=1}^{m} (a^{(i)} - y^{(i)})
\end{align}$$
In matrix form:
$$\begin{align}
\frac{\partial \mathcal{L}}{\partial W} &= \frac{1}{m}(A - Y)X^T \\
\frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m}(A - Y)\mathbf{1}
\end{align}$$
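One way to gain confidence in the matrix formulas is a finite-difference check on a tiny random problem. This verification sketch uses made-up data and is separate from the implementation below:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(W, b, X, Y):
    """Log loss for the single-perceptron model."""
    A = sigmoid(np.matmul(W, X) + b)
    return np.mean(-Y * np.log(A) - (1 - Y) * np.log(1 - A))

rng = np.random.default_rng(0)
X = rng.random((2, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
W = rng.standard_normal((1, 2))
b = np.zeros((1, 1))

# Analytic gradient from the matrix formula: dW = (1/m) (A - Y) X^T
A = sigmoid(np.matmul(W, X) + b)
dW = (1 / 5) * np.matmul(A - Y, X.T)

# Central finite-difference estimate for dW[0, 0]
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (cost(W_plus, b, X, Y) - cost(W_minus, b, X, Y)) / (2 * eps)
print(dW[0, 0], numeric)  # the two estimates should agree closely
```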
<Info>
**Key Insight**: These gradient expressions are identical to linear regression! The difference is in forward propagation (sigmoid activation) and the cost function (log loss).
</Info>
### Parameter Updates
$$\begin{align}
W &= W - \alpha \frac{\partial \mathcal{L}}{\partial W} \\
b &= b - \alpha \frac{\partial \mathcal{L}}{\partial b}
\end{align}$$
## Implementation
### Generate Dataset
Create 30 data points with binary labels:
```python
m = 30
X = np.random.randint(0, 2, (2, m))
Y = np.logical_and(X[0] == 0, X[1] == 1).astype(int).reshape((1, m))
print('Training dataset X containing (x1, x2) coordinates in columns:')
print(X)
print('Training dataset Y containing labels (0: blue, 1: red)')
print(Y)
print('The shape of X is:', X.shape)
print('The shape of Y is:', Y.shape)
print('I have m = %d training examples!' % (X.shape[1]))
```
### Define Sigmoid Function
```python
def sigmoid(z):
return 1 / (1 + np.exp(-z))
print("sigmoid(-2) =", sigmoid(-2))
print("sigmoid(0) =", sigmoid(0))
print("sigmoid(3.5) =", sigmoid(3.5))
```
Sigmoid works element-wise on arrays:
```python
print(sigmoid(np.array([-2, 0, 3.5])))
# Output: [0.11920292 0.5 0.97068777]
```
### Step 1: Define Layer Sizes
```python
def layer_sizes(X, Y):
"""
Arguments:
X -- input dataset of shape (input size, number of examples)
Y -- labels of shape (output size, number of examples)
Returns:
n_x -- the size of the input layer
n_y -- the size of the output layer
"""
n_x = X.shape[0]
n_y = Y.shape[0]
return (n_x, n_y)
(n_x, n_y) = layer_sizes(X, Y)
print("The size of the input layer is: n_x =", n_x)
print("The size of the output layer is: n_y =", n_y)
```
### Step 2: Initialize Parameters
```python
def initialize_parameters(n_x, n_y):
"""
Returns:
params -- python dictionary containing your parameters:
W -- weight matrix of shape (n_y, n_x)
b -- bias value set as a vector of shape (n_y, 1)
"""
W = np.random.randn(n_y, n_x) * 0.01
b = np.zeros((n_y, 1))
parameters = {"W": W, "b": b}
return parameters
parameters = initialize_parameters(n_x, n_y)
print("W =", parameters["W"])
print("b =", parameters["b"])
```
### Step 3: Forward Propagation
```python
def forward_propagation(X, parameters):
"""
Argument:
X -- input data of size (n_x, m)
parameters -- python dictionary containing your parameters
Returns:
A -- The sigmoid output
"""
W = parameters["W"]
b = parameters["b"]
Z = np.matmul(W, X) + b
A = sigmoid(Z) # Apply sigmoid activation
return A
A = forward_propagation(X, parameters)
print("Output vector A:", A)
```
<Note>
The only difference from regression: applying the sigmoid activation function to Z.
</Note>
### Step 4: Compute Cost (Log Loss)
```python
def compute_cost(A, Y):
"""
Computes the log loss cost function
Arguments:
A -- The output of the neural network of shape (n_y, number of examples)
Y -- "true" labels vector of shape (n_y, number of examples)
Returns:
cost -- log loss
"""
m = Y.shape[1]
logprobs = -np.multiply(np.log(A), Y) - np.multiply(np.log(1 - A), 1 - Y)
cost = (1/m) * np.sum(logprobs)
return cost
print("cost =", compute_cost(A, Y))
```
### Step 5: Backward Propagation
```python
def backward_propagation(A, X, Y):
"""
Implements the backward propagation, calculating gradients
Arguments:
A -- the output of the neural network of shape (n_y, number of examples)
X -- input data of shape (n_x, number of examples)
Y -- "true" labels vector of shape (n_y, number of examples)
Returns:
grads -- python dictionary containing gradients
"""
m = X.shape[1]
dZ = A - Y
dW = (1/m) * np.dot(dZ, X.T)
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
grads = {"dW": dW, "db": db}
return grads
grads = backward_propagation(A, X, Y)
print("dW =", grads["dW"])
print("db =", grads["db"])
```
### Step 6: Update Parameters
```python
def update_parameters(parameters, grads, learning_rate=1.2):
"""
Updates parameters using the gradient descent update rule
Arguments:
parameters -- python dictionary containing parameters
grads -- python dictionary containing gradients
learning_rate -- learning rate parameter for gradient descent
Returns:
parameters -- python dictionary containing updated parameters
"""
W = parameters["W"]
b = parameters["b"]
dW = grads["dW"]
db = grads["db"]
W = W - learning_rate * dW
b = b - learning_rate * db
parameters = {"W": W, "b": b}
return parameters
parameters_updated = update_parameters(parameters, grads)
print("W updated =", parameters_updated["W"])
print("b updated =", parameters_updated["b"])
```
### Step 7: Build Complete Model
```python
def nn_model(X, Y, num_iterations=10, learning_rate=1.2, print_cost=False):
"""
Arguments:
X -- dataset of shape (n_x, number of examples)
Y -- labels of shape (n_y, number of examples)
num_iterations -- number of iterations in the loop
learning_rate -- learning rate parameter for gradient descent
print_cost -- if True, print the cost every iteration
Returns:
parameters -- parameters learnt by the model
"""
n_x = layer_sizes(X, Y)[0]
n_y = layer_sizes(X, Y)[1]
parameters = initialize_parameters(n_x, n_y)
for i in range(0, num_iterations):
# Forward propagation
A = forward_propagation(X, parameters)
# Cost function
cost = compute_cost(A, Y)
# Backpropagation
grads = backward_propagation(A, X, Y)
# Update parameters
parameters = update_parameters(parameters, grads, learning_rate)
if print_cost:
print("Cost after iteration %i: %f" % (i, cost))
return parameters
```
### Train the Model
```python
parameters = nn_model(X, Y, num_iterations=50, learning_rate=1.2, print_cost=True)
print("W =", parameters["W"])
print("b =", parameters["b"])
```
<Info>
After about 40 iterations, the cost decreases very slowly, indicating convergence. This is a good point to stop training.
</Info>
### Visualize Decision Boundary
```python
def plot_decision_boundary(X, Y, parameters):
W = parameters["W"]
b = parameters["b"]
fig, ax = plt.subplots()
plt.scatter(X[0, :], X[1, :], c=Y,
cmap=colors.ListedColormap(['blue', 'red']))
x_line = np.arange(np.min(X[0,:]), np.max(X[0,:])*1.1, 0.1)
# Decision boundary: W[0,0]*x1 + W[0,1]*x2 + b = 0
# Solve for x2: x2 = -(W[0,0]/W[0,1])*x1 - b/W[0,1]
    ax.plot(x_line, -(W[0,0] / W[0,1]) * x_line - b[0,0] / W[0,1],
            color="black", label="Decision Boundary")
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.show()
plot_decision_boundary(X, Y, parameters)
```
<Tip>
The decision boundary is the line where the perceptron output equals 0.5 (or equivalently, where $z = 0$).
</Tip>
### Make Predictions
```python
def predict(X, parameters):
"""
Using the learned parameters, predicts a class for each example in X
Arguments:
parameters -- python dictionary containing your parameters
X -- input data of size (n_x, m)
Returns:
predictions -- vector of predictions (False: blue / True: red)
"""
A = forward_propagation(X, parameters)
predictions = A > 0.5
return predictions
X_pred = np.array([[1, 1, 0, 0],
[0, 1, 0, 1]])
Y_pred = predict(X_pred, parameters)
print(f"Coordinates (in columns):\n{X_pred}")
print(f"Predictions:\n{Y_pred}")
```
## Performance on Larger Dataset
Test the model on a more realistic dataset:
```python
n_samples = 1000
samples, labels = make_blobs(
n_samples=n_samples,
centers=([2.5, 3], [6.7, 7.9]),
cluster_std=1.4,
random_state=0
)
X_larger = np.transpose(samples)
Y_larger = labels.reshape((1, n_samples))
plt.scatter(X_larger[0, :], X_larger[1, :], c=Y_larger,
cmap=colors.ListedColormap(['blue', 'red']))
plt.show()
```
Train for 100 iterations:
```python
parameters_larger = nn_model(X_larger, Y_larger,
num_iterations=100,
learning_rate=1.2,
print_cost=False)
print("W =", parameters_larger["W"])
print("b =", parameters_larger["b"])
```
Visualize results:
```python
plot_decision_boundary(X_larger, Y_larger, parameters_larger)
```
<Note>
The decision boundary successfully separates the two clusters! Try changing `num_iterations` and `learning_rate` to see how they affect the results.
</Note>
## Key Concepts
<CardGroup cols={2}>
<Card title="Sigmoid Activation" icon="wave-sine">
Converts linear output to probabilities between 0 and 1
</Card>
<Card title="Log Loss" icon="function">
Penalizes confident wrong predictions for better classification
</Card>
<Card title="Decision Boundary" icon="divide">
Line (or hyperplane) that separates classes in feature space
</Card>
<Card title="Binary Classification" icon="code-branch">
Assigns observations to one of two categories
</Card>
</CardGroup>
<Accordion title="Why use sigmoid instead of step function?">
Step functions have zero gradient everywhere (except at the discontinuity where it's undefined). This breaks gradient descent. Sigmoid is smooth and differentiable everywhere, enabling gradient-based learning.
</Accordion>
<Accordion title="What if classes aren't linearly separable?">
A single perceptron can only learn linear decision boundaries. For non-linear problems, you need:
- Multiple layers (deep neural networks)
- Non-linear activation functions in hidden layers
- More complex architectures
</Accordion>
<Accordion title="How do I choose the classification threshold?">
0.5 is standard, but you can adjust it based on application needs:
- Lower threshold (e.g., 0.3): More false positives, fewer false negatives
- Higher threshold (e.g., 0.7): Fewer false positives, more false negatives
Choose based on the relative costs of different error types.
</Accordion>
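The trade-off can be made concrete with a handful of made-up predicted probabilities and labels (illustrative numbers only):

```python
import numpy as np

probs = np.array([0.2, 0.4, 0.45, 0.55, 0.6, 0.9])  # model outputs (made up)
truth = np.array([0, 0, 1, 0, 1, 1])                 # true labels (made up)

for threshold in (0.3, 0.5, 0.7):
    pred = (probs > threshold).astype(int)
    fp = int(np.sum((pred == 1) & (truth == 0)))  # false positives
    fn = int(np.sum((pred == 0) & (truth == 1)))  # false negatives
    print(f"threshold = {threshold}: false positives = {fp}, false negatives = {fn}")
```

Lowering the threshold trades false negatives for false positives, and raising it does the reverse.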
## Comparison: Regression vs Classification
| Aspect | Regression | Classification |
|--------|-----------|----------------|
| **Output** | Continuous value | Class label (0 or 1) |
| **Activation** | Identity ($a = z$) | Sigmoid ($a = \sigma(z)$) |
| **Cost Function** | Sum of squares | Log loss |
| **Interpretation** | Predicted value | Probability of class 1 |
| **Decision Rule** | None | Threshold at 0.5 |
<Info>
**Implementation Similarity**: The backward propagation formulas are identical! The key differences are forward propagation (activation function) and the cost function.
</Info>
## Summary
You've built a binary classifier using:
✓ Single perceptron architecture
✓ Sigmoid activation function
✓ Log loss cost function
✓ Gradient descent optimization
✓ Decision boundary visualization
✓ Predictions on new data
This foundation extends to more complex classification problems with multiple layers and classes!
## Next Steps
Explore advanced topics:
- Build multi-class classifiers (softmax activation)
- Add hidden layers for non-linear decision boundaries
- Implement regularization to prevent overfitting
- Try different optimization algorithms (Adam, RMSprop)