
Introduction

Linear regression is probably the most widely used learning algorithm in the world today. It fits a straight line to your data to make predictions. As you become familiar with linear regression, many concepts you see here will also apply to other machine learning models.

Housing Price Prediction Example

Let’s predict the price of a house based on its size using a dataset from Portland, a city in the United States.
In this example:
  • Horizontal axis: Size of house in square feet
  • Vertical axis: Price of house in thousands of dollars
  • Each data point (cross) represents a house with its size and sale price
Suppose you’re a real estate agent helping a client sell her house. She asks: “How much do you think I can get for this house?” You measure the house: 1,250 square feet. How much could it sell for?

Building a Linear Regression Model

One approach is to build a linear regression model from the dataset. Your model will fit a straight line to the data:
  1. Measure the house size: 1,250 square feet
  2. Find where this intersects the best fit line
  3. Trace to the vertical axis to read the predicted price: approximately $220,000
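The three steps above amount to evaluating the fitted line at the new size. A minimal sketch, using hypothetical parameters w and b (invented here so the numbers match the $220,000 example, not fitted to real data):

```python
# Hypothetical parameters of the best-fit line (illustrative only)
w = 0.125  # price increase in $1000s per square foot
b = 63.75  # base price in $1000s

size = 1250  # square feet
predicted_price = w * size + b  # evaluate the line at this size
print(predicted_price)  # 220.0 -> roughly $220,000
```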
This is a supervised learning model because you first train it on data containing the right answers: both the size of the house and its price for each example.

What Makes This a Regression Model?

Linear regression is a particular type of supervised learning model called a regression model because it predicts numbers as output, like prices in dollars.
Any supervised learning model that predicts a number such as 220,000 or 1.5 or -33.2 is addressing a regression problem.

Regression vs Classification

The key difference:
  • Regression: Infinitely many possible outputs (any number)
  • Classification: Small, finite set of categories (e.g., cat vs dog, malignant vs benign)

Understanding the Training Dataset

You can visualize the data in two ways:

1. Scatter Plot (Graph)

Plots house size vs price with each data point as a cross

2. Data Table

Size (sq ft) | Price ($1000s)
2,104        | 400
1,416        | 232
1,534        | 315
If you have 47 rows in the data table, there are 47 crosses on the plot, each corresponding to one row.

Machine Learning Notation

x = input variable (also called a feature or input feature)
y = output variable (also called the target variable)
For the first training example: x = 2,104 (square feet) and y = 400 ($1000s)

Indexing Training Examples

To refer to a specific training example, use superscript notation:
  • x⁽ⁱ⁾, y⁽ⁱ⁾ = the i-th training example
  • x⁽¹⁾ = 2,104 (first example’s input)
  • y⁽¹⁾ = 400 (first example’s output)
The superscript (i) in parentheses is NOT exponentiation. x⁽²⁾ does not mean x squared—it refers to the second training example.
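Note that the math notation is 1-indexed while arrays in code are 0-indexed, so x⁽ⁱ⁾ corresponds to x[i - 1]. A quick sketch using the values from the data table above:

```python
import numpy as np

# Training set from the data table (sizes in sq ft, prices in $1000s)
x = np.array([2104, 1416, 1534])
y = np.array([400, 232, 315])

# Math notation is 1-indexed; NumPy arrays are 0-indexed
x_1 = x[0]      # x^(1) = 2104, first example's input
y_1 = y[0]      # y^(1) = 400, first example's output
i = 2
x_i = x[i - 1]  # x^(2) = 1416, the second training example
```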

The Cost Function

To measure how well our model fits the data, we need a cost function.

Model Representation

Our linear model is:
f(x) = w * x + b
Where:
  • w and b are parameters (also called coefficients or weights)
  • w determines the slope
  • b is the y-intercept

How Different Parameters Affect the Line

f(x) = 0 * x + 1.5
This creates a horizontal line at y = 1.5. The prediction is always 1.5 regardless of x.
f(x) = 0.5 * x + 0
When x = 0, f(x) = 0
When x = 2, f(x) = 1
This creates a line with slope 0.5 passing through the origin.
f(x) = 0.5 * x + 1
The line has slope 0.5 and y-intercept at 1.
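The three parameter settings above can be checked directly by evaluating f(x) = w * x + b:

```python
def f(x, w, b):
    """Linear model prediction."""
    return w * x + b

# Horizontal line: w = 0, b = 1.5 -> prediction is always 1.5
print(f(0, 0, 1.5), f(2, 0, 1.5))  # 1.5 1.5

# Slope 0.5 through the origin: w = 0.5, b = 0
print(f(0, 0.5, 0), f(2, 0.5, 0))  # 0.0 1.0

# Slope 0.5, y-intercept 1
print(f(0, 0.5, 1), f(2, 0.5, 1))  # 1.0 2.0
```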

Defining the Cost Function

How do we find the best values for w and b? We measure the error between predictions and actual values. For training example i:
  • ŷ⁽ⁱ⁾ = prediction = f(x⁽ⁱ⁾) = w * x⁽ⁱ⁾ + b
  • y⁽ⁱ⁾ = actual target value
  • Error = ŷ⁽ⁱ⁾ - y⁽ⁱ⁾
The squared error cost function is:
J(w, b) = (1 / 2m) * Σ(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)²
Where:
  • J(w, b) = cost function
  • m = number of training examples
  • Σ = sum over all training examples from i=1 to m
The cost function is divided by 2m (not just m) to make later calculations neater. This division by 2 doesn’t change which parameters are optimal.
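To see the formula in action, here is J(w, b) worked out for a tiny dataset with m = 3 (the data and candidate parameters are invented for illustration):

```python
x = [1.0, 2.0, 3.0]  # sizes in 1000 sq ft
y = [300, 500, 700]  # prices in $1000s
w, b = 100, 100      # candidate parameters (deliberately not the best fit)

# Errors f(x) - y: (100*1+100)-300, (100*2+100)-500, (100*3+100)-700
#                = -100, -200, -300
squared = [((w * xi + b) - yi) ** 2 for xi, yi in zip(x, y)]  # 10000, 40000, 90000
J = sum(squared) / (2 * len(x))  # 140000 / 6
print(J)  # approximately 23333.33
```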

Why Squared Error?

The squared error cost function:
  1. Penalizes larger errors more heavily
  2. Is always positive (no negative errors)
  3. Has nice mathematical properties for optimization
  4. Is the most commonly used cost function for regression problems

Implementation Example

import numpy as np

# Training data
x_train = np.array([1.0, 2.0, 3.0])  # Size in 1000 sq ft
y_train = np.array([300, 500, 700])   # Price in $1000s

# Model parameters
w = 200
b = 100

# Make predictions
def predict(x, w, b):
    return w * x + b

# Calculate cost
def compute_cost(x, y, w, b):
    m = len(x)
    total_cost = 0
    
    for i in range(m):
        f_wb = w * x[i] + b
        cost = (f_wb - y[i]) ** 2
        total_cost += cost
    
    return total_cost / (2 * m)

# Example prediction
size = 1.25  # 1,250 sq ft
predicted_price = predict(size, w, b)
print(f"Predicted price: ${predicted_price * 1000:,.0f}")

# Calculate cost for current parameters
cost = compute_cost(x_train, y_train, w, b)
print(f"Cost: {cost}")
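The loop in compute_cost can also be written with NumPy array operations; this is a sketch of an equivalent vectorized version (same math, no explicit loop):

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Squared error cost J(w, b) using NumPy array operations."""
    m = len(x)
    errors = w * x + b - y  # all residuals f(x^(i)) - y^(i) at once
    return np.sum(errors ** 2) / (2 * m)

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300, 500, 700])

# With w = 200, b = 100 the line passes through every point, so the cost is 0
print(compute_cost_vectorized(x_train, y_train, 200, 100))  # 0.0
```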

Key Takeaways

  1. Linear regression fits a straight line to data: the model f(x) = w * x + b predicts output y from input x.
  2. Parameters w and b define the line: w controls the slope, and b controls the y-intercept.
  3. The cost function measures prediction error: J(w, b) quantifies how well the model fits the training data.
  4. The goal is to minimize the cost function: find w and b that make J(w, b) as small as possible.

What’s Next

Now that you understand the linear regression model and cost function, the next step is to learn how to systematically find the optimal values for w and b. This is where gradient descent comes in—a powerful algorithm for minimizing the cost function and finding the best fit line for your data.
