Linear regression is perhaps the most foundational algorithm in machine learning—it appears in virtually every textbook and serves as the gateway to understanding more complex models. Even if you’ve never formally studied machine learning, you’ve likely encountered the concept of fitting a line to data points. In this article, we’ll go beyond the surface and explore the mathematical foundations, implementation details, and real-world applications of linear regression.
1. The Math: What Are We Building?
At its core, linear regression is elegantly simple: we’re trying to find the best-fitting line (or hyperplane in higher dimensions) for our data. The fundamental equation is:
$$\hat{y} = X \cdot w + b$$
Where:
- $X$: Our input features (training data matrix, shape: $n_{\text{samples}} \times n_{\text{features}}$)
- $w$: The weights vector (shape: $n_{\text{features}} \times 1$), representing how important each feature is
- $b$: The bias term (scalar), representing the intercept
- $\hat{y}$: Our predicted output (shape: $n_{\text{samples}} \times 1$)
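To make the shapes concrete, here is a tiny sketch of the forward pass in NumPy (made-up numbers, purely illustrative):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # 2 samples, 2 features (made-up data)
w = np.array([[0.5],
              [1.0]])        # weights, shape (2, 1)
b = 0.1                      # scalar bias
y_hat = X @ w + b            # predictions, shape (2, 1): [[2.6], [5.6]]
print(y_hat)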
The Optimization Goal: Our objective is to find the optimal values for $w$ and $b$ that minimize the prediction error across all training samples. We measure this error using the Mean Squared Error (MSE) loss function:
$$L(w, b) = \frac{1}{2n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$
The factor of $\frac{1}{2}$ is included for mathematical convenience: it cancels the exponent of the squared term when we differentiate during gradient descent. Since scaling the loss by a constant does not move its minimum, including or omitting the factor has no effect on the optimal parameters.
2. Step-by-Step Implementation
Step 1: Initialize Parameters
Every optimization algorithm needs a starting point. We initialize the weights and the bias to zero (random initialization is also common); because the MSE loss of linear regression is convex, zero initialization works perfectly well here, unlike in deep networks where it would prevent symmetry breaking.
import numpy as np

# Initialize Model Parameters
def initialize_params(dims: int):
    '''
    Inputs:
        dims: Number of features (dimensions)
    Outputs:
        w: Initialized weight vector (dims, 1)
        b: Initialized bias scalar
    '''
    # Initialize weights to zeros
    w = np.zeros((dims, 1))
    # Initialize bias to zero
    b = 0
    return w, b
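A quick usage sketch (illustrative, not from the original article):

w, b = initialize_params(10)
print(w.shape, b)  # (10, 1) 0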
Step 2: The Core Logic (Forward Pass & Gradient Computation)
This is where the mathematics comes to life. We need to implement three critical operations:
- Forward Pass (Prediction): Compute $\hat{y} = Xw + b$
- Loss Calculation: Compute the MSE to quantify our error
- Gradient Computation: Calculate the derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
Mathematical Derivation of Gradients:
Using the chain rule of calculus:
$$\frac{\partial L}{\partial w} = \frac{1}{n}X^T(\hat{y} - y)$$
$$\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$
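For readers who want the intermediate step, the chain rule expands as follows (a standard derivation, spelled out for completeness):

$$\frac{\partial L}{\partial w} = \frac{1}{2n}\sum_{i=1}^{n} 2(\hat{y}_i - y_i)\,\frac{\partial \hat{y}_i}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_i = \frac{1}{n}X^T(\hat{y} - y)$$

since $\hat{y}_i = x_i^T w + b$ gives $\frac{\partial \hat{y}_i}{\partial w} = x_i$ (the $i$-th sample as a column vector). For the bias, $\frac{\partial \hat{y}_i}{\partial b} = 1$, which yields the second formula directly.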
# Define Model Core
def linear_loss(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float):
    '''
    Inputs:
        X: Input feature matrix (num_samples, num_features)
        y: Output label vector (num_samples, 1)
        w: Weight parameters (num_features, 1)
        b: Bias parameter
    Outputs:
        y_hat: Linear model prediction
        loss: Mean Squared Error (MSE)
        dw: Gradient of weights
        db: Gradient of bias
    '''
    # Number of training samples
    num_train = X.shape[0]
    # Linear prediction: y_hat = Xw + b
    y_hat = np.dot(X, w) + b
    # Compute loss: (1/2n) * sum((y_hat - y)^2)
    # Note: the 1/2 factor cancels the exponent when differentiating
    loss = np.sum((y_hat - y) ** 2) / (2 * num_train)
    # Compute gradients (partial derivatives)
    # dw = (1/n) * X.T @ (y_hat - y)
    dw = np.dot(X.T, (y_hat - y)) / num_train
    # db = (1/n) * sum(y_hat - y)
    db = np.sum(y_hat - y) / num_train
    return y_hat, loss, dw, db
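As a sanity check (not part of the original pipeline), we can compare the analytic gradient returned by linear_loss against a finite-difference estimate on toy data; the two printed numbers should nearly match:

# Finite-difference gradient check on random toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))
y_toy = rng.normal(size=(5, 1))
w_toy, b_toy = initialize_params(3)
_, loss0, dw, db = linear_loss(X_toy, y_toy, w_toy, b_toy)
# Perturb the first weight by a small epsilon and re-evaluate the loss
eps = 1e-6
w_pert = w_toy.copy()
w_pert[0, 0] += eps
_, loss1, _, _ = linear_loss(X_toy, y_toy, w_pert, b_toy)
print((loss1 - loss0) / eps, dw[0, 0])  # numerical vs. analytic gradient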
Step 3: The Training Loop (Gradient Descent)
Now we implement the iterative optimization process. Gradient descent is an iterative algorithm that updates parameters in the direction that reduces the loss:
$$w := w - \alpha \frac{\partial L}{\partial w}$$
$$b := b - \alpha \frac{\partial L}{\partial b}$$
where $\alpha$ is the learning rate, controlling the step size of each update.
# Define Training Process
def linear_train(X: np.ndarray, y: np.ndarray, learning_rate: float = 0.01, epochs: int = 10000):
    '''
    Inputs:
        X: Training inputs
        y: Training labels
        learning_rate: Step size for gradient descent
        epochs: Number of iterations
    Outputs:
        loss_his: History of loss values
        params: Dictionary containing optimized 'w' and 'b'
        grads: Dictionary containing final gradients
    '''
    loss_his = []
    # Initialize parameters
    w, b = initialize_params(X.shape[1])
    # Iterative training
    for i in range(1, epochs + 1):
        # Forward pass & gradient calculation
        y_hat, loss, dw, db = linear_loss(X, y, w, b)
        # Gradient descent update
        w -= learning_rate * dw
        b -= learning_rate * db
        # Record loss
        loss_his.append(loss)
        # Print loss every 10,000 epochs
        if i % 10000 == 0:
            print('epoch %d loss %f' % (i, loss))
    # Save final parameters and gradients
    params = {'w': w, 'b': b}
    grads = {'dw': dw, 'db': db}
    return loss_his, params, grads
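Before moving to real data, a quick test on synthetic data (illustrative, not from the original article) is a good way to convince ourselves the loop works: generate targets from known parameters and check that training recovers them.

# Illustrative synthetic check: y = X @ w_true + b_true
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 2))
w_true = np.array([[2.0], [-3.0]])
y_syn = X_syn @ w_true + 0.5
loss_syn, params_syn, _ = linear_train(X_syn, y_syn, learning_rate=0.1, epochs=5000)
print(params_syn['w'].ravel(), params_syn['b'])  # should be close to [2, -3] and 0.5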
3. Training on Real Data
To validate our implementation, we use the Diabetes dataset from Scikit-Learn. This dataset contains 442 samples with 10 physiological features (age, BMI, blood pressure, etc.) used to predict disease progression.
Data Preprocessing:
# Import helpers from sklearn
from sklearn.datasets import load_diabetes
from sklearn.utils import shuffle
# Load diabetes dataset
diabetes = load_diabetes()
data, target = diabetes.data, diabetes.target
# Shuffle randomizes the data
X, y = shuffle(data, target, random_state=42)
# Split into Train (80%) and Test (20%)
offset = int(X.shape[0] * 0.8)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
# Reshape labels to column vectors (n, 1)
y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))
# Print shapes
print("X_train's shape: ", X_train.shape)
print("X_test's shape: ", X_test.shape)
print("y_train's shape: ", y_train.shape)
print("y_test's shape: ", y_test.shape)
Training the Model:
# Train Model
loss_his, params, grads = linear_train(X_train, y_train, 0.01, 200000)
# Print optimized parameters
print(params)
Training Results:
- Initial loss: ~1930
- Final loss: ~1392
The smooth convergence of the loss indicates that our model successfully learned the underlying patterns in the data. The gradual decrease shows that gradient descent is working correctly—each parameter update moves us closer to the optimal solution.
4. Evaluation: Measuring Performance
To quantify how well our model performs, we first need to make predictions on the test set, then calculate evaluation metrics.
Making Predictions:
### Define Prediction Function
def predict(X: np.ndarray, params: dict):
    '''
    Inputs:
        X: Test data
        params: Trained model parameters (w, b)
    Outputs:
        y_pred: Predicted values
    '''
    w = params['w']
    b = params['b']
    y_pred = np.dot(X, w) + b
    return y_pred
# Make predictions on test set
y_pred = predict(X_test, params)
Calculating R² Score:
We implement the $R^2$ score (coefficient of determination):
$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$
where $\bar{y}$ is the mean of the true values. An $R^2$ of 1 indicates perfect predictions, 0 means the model does no better than always predicting the mean, and negative values indicate a fit worse than that baseline.
### Define R-Squared (R2) Metric
def r2_score(y_test: np.ndarray, y_pred: np.ndarray):
    '''
    Inputs:
        y_test: True labels
        y_pred: Predicted labels
    Outputs:
        r2: R-squared coefficient
    '''
    # Mean of true labels
    y_avg = np.mean(y_test)
    # Total Sum of Squares (TSS)
    ss_tot = np.sum((y_test - y_avg) ** 2)
    # Residual Sum of Squares (RSS)
    ss_res = np.sum((y_test - y_pred) ** 2)
    # R2 formula
    r2 = 1 - (ss_res / ss_tot)
    return r2
print(r2_score(y_test, y_pred))
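The metrics below also include a test MSE, which the from-scratch code never computes explicitly; a minimal way to obtain it (standard definition, without the 1/2 factor used during training):

# Mean Squared Error on the test set (standard definition, no 1/2 factor)
mse = np.mean((y_test - y_pred) ** 2)
print(mse)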
Performance Metrics:
- $R^2$ Score: 0.4045
- MSE: 3315.50
An $R^2$ score of ~0.40 indicates that our model explains approximately 40% of the variance in the data. While this isn’t perfect, it’s quite reasonable for real-world medical data, which often contains significant noise and complex interactions that linear models cannot fully capture.
Visualization: The loss curve shows a characteristic exponential decay, confirming that gradient descent converged properly. This smooth, monotonic decrease is exactly what we want to see—no oscillations or divergence.
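A minimal matplotlib sketch to reproduce the loss curve (assuming matplotlib is installed; the original article does not show its plotting code):

import matplotlib.pyplot as plt

plt.plot(loss_his)
plt.xlabel('Epoch')
plt.ylabel('MSE loss')
plt.title('Training loss')
plt.show()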

5. Validation: Comparing with Scikit-Learn
A crucial step in implementing algorithms from scratch is validating against established libraries. Let’s compare our implementation with Scikit-Learn’s industry-standard LinearRegression:
### Compare with scikit-learn
# Import sklearn Linear Regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
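# Note: this r2_score import shadows the from-scratch r2_score defined earlier;
# from here on, r2_score refers to the sklearn implementation.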
# Create model
regr = linear_model.LinearRegression()
# Fit
regr.fit(X_train, y_train)
# Predict
y_pred_sklearn = regr.predict(X_test)
# Print MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred_sklearn))
# Print R2 Score
print('R2 score: %.2f' % r2_score(y_test, y_pred_sklearn))
Results Comparison:
| Metric | Our Implementation | Scikit-Learn |
|---|---|---|
| MSE | 3315.50 | 3368.04 |
| $R^2$ | 0.4045 | 0.40 |
The results are nearly identical! The small difference in MSE arises because Scikit-Learn solves the least-squares problem in closed form while our implementation relies on iterative gradient descent, which may not have fully converged after 200,000 epochs; even so, the $R^2$ scores agree to two decimal places. This confirms that our from-scratch implementation is mathematically correct and produces results on par with a professionally optimized library.
Conclusion
We’ve successfully built linear regression from the ground up, understanding every mathematical operation and implementation detail along the way.
Key Takeaways:
- Linear regression finds optimal parameters by minimizing Mean Squared Error
- Gradient descent iteratively updates parameters in the direction that reduces loss
- Proper initialization, learning rate selection, and sufficient training epochs are crucial
- Our scratch implementation matches professional libraries, validating our understanding
This article is part of a series on ml-from-scratch. All code is available with detailed comments for educational purposes.