Linear regression is perhaps the most foundational algorithm in machine learning—it appears in virtually every textbook and serves as the gateway to understanding more complex models. Even if you’ve never formally studied machine learning, you’ve likely encountered the concept of fitting a line to data points. In this article, we’ll go beyond the surface and explore the mathematical foundations, implementation details, and real-world applications of linear regression.
1. The Math: What Are We Building?
At its core, linear regression is elegantly simple: we’re trying to find the best-fitting line (or hyperplane in higher dimensions) for our data. The fundamental equation is:
$$\hat{y} = X \cdot w + b$$
Where:
- $X$: Our input features (training data matrix, shape: $n_{\text{samples}} \times n_{\text{features}}$)
- $w$: The weights vector (shape: $n_{\text{features}} \times 1$), representing how important each feature is
- $b$: The bias term (scalar), representing the intercept
- $\hat{y}$: Our predicted output (shape: $n_{\text{samples}} \times 1$)
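To make the shapes concrete, here is a tiny sketch of the forward pass in NumPy (made-up numbers, purely illustrative):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # 2 samples, 2 features (made-up data)
w = np.array([[0.5],
              [1.0]])        # weights, shape (2, 1)
b = 0.1                      # scalar bias
y_hat = X @ w + b            # predictions, shape (2, 1): [[2.6], [5.6]]
print(y_hat)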
The Optimization Goal: Our objective is to find the optimal values for $w$ and $b$ that minimize the prediction error across all training samples. We measure this error using the Mean Squared Error (MSE) loss function:
$$L(w, b) = \frac{1}{2n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$
The factor of $\frac{1}{2}$ is included for mathematical convenience: it cancels the exponent of the squared term when we differentiate during gradient descent. Since scaling the loss by a constant does not move its minimum, including or omitting the factor has no effect on the optimal parameters.
2. Step-by-Step Implementation
Step 1: Initialize Parameters
Every optimization algorithm needs a starting point. We initialize the weights and the bias to zero (random initialization is also common); because the MSE loss of linear regression is convex, zero initialization works perfectly well here, unlike in deep networks where it would prevent symmetry breaking.
import numpy as np

# Initialize Model Parameters
def initialize_params(dims: int):
    '''
    Inputs:
        dims: Number of features (dimensions)
    Outputs:
        w: Initialized weight vector (dims, 1)
        b: Initialized bias scalar
    '''
    # Initialize weights to zeros
    w = np.zeros((dims, 1))
    # Initialize bias to zero
    b = 0
    return w, b
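A quick usage sketch (illustrative, not from the original article):

w, b = initialize_params(10)
print(w.shape, b)  # (10, 1) 0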
Step 2: The Core Logic (Forward Pass & Gradient Computation)
This is where the mathematics comes to life. We need to implement three critical operations:
- Forward Pass (Prediction): Compute $\hat{y} = Xw + b$
- Loss Calculation: Compute the MSE to quantify our error
- Gradient Computation: Calculate the derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
Mathematical Derivation of Gradients:
Using the chain rule of calculus:
$$\frac{\partial L}{\partial w} = \frac{1}{n}X^T(\hat{y} - y)$$
$$\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$
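For readers who want the intermediate step, the chain rule expands as follows (a standard derivation, spelled out for completeness):

$$\frac{\partial L}{\partial w} = \frac{1}{2n}\sum_{i=1}^{n} 2(\hat{y}_i - y_i)\,\frac{\partial \hat{y}_i}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_i = \frac{1}{n}X^T(\hat{y} - y)$$

since $\hat{y}_i = x_i^T w + b$ gives $\frac{\partial \hat{y}_i}{\partial w} = x_i$ (the $i$-th sample as a column vector). For the bias, $\frac{\partial \hat{y}_i}{\partial b} = 1$, which yields the second formula directly.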
# Define Model Core
def linear_loss(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float):
    '''
    Inputs:
        X: Input feature matrix (num_samples, num_features)
        y: Output label vector (num_samples, 1)
        w: Weight parameters (num_features, 1)
        b: Bias parameter
    Outputs:
        y_hat: Linear model prediction
        loss: Mean Squared Error (MSE)
        dw: Gradient of weights
        db: Gradient of bias
    '''
    # Number of training samples
    num_train = X.shape[0]
    # Linear prediction: y_hat = Xw + b
    y_hat = np.dot(X, w) + b
    # Compute loss: (1/2n) * sum((y_hat - y)^2)
    # Note: the 1/2 factor cancels the exponent when differentiating
    loss = np.sum((y_hat - y) ** 2) / (2 * num_train)
    # Compute gradients (partial derivatives)
    # dw = (1/n) * X.T @ (y_hat - y)
    dw = np.dot(X.T, (y_hat - y)) / num_train
    # db = (1/n) * sum(y_hat - y)
    db = np.sum(y_hat - y) / num_train
    return y_hat, loss, dw, db
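As a sanity check (not part of the original pipeline), we can compare the analytic gradient returned by linear_loss against a finite-difference estimate on toy data; the two printed numbers should nearly match:

# Finite-difference gradient check on random toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))
y_toy = rng.normal(size=(5, 1))
w_toy, b_toy = initialize_params(3)
_, loss0, dw, db = linear_loss(X_toy, y_toy, w_toy, b_toy)
# Perturb the first weight by a small epsilon and re-evaluate the loss
eps = 1e-6
w_pert = w_toy.copy()
w_pert[0, 0] += eps
_, loss1, _, _ = linear_loss(X_toy, y_toy, w_pert, b_toy)
print((loss1 - loss0) / eps, dw[0, 0])  # numerical vs. analytic gradient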
Step 3: The Training Loop (Gradient Descent)
Now we implement the iterative optimization process. Gradient descent is an iterative algorithm that updates parameters in the direction that reduces the loss:
$$w := w - \alpha \frac{\partial L}{\partial w}$$
$$b := b - \alpha \frac{\partial L}{\partial b}$$
where $\alpha$ is the learning rate, controlling the step size of each update.
# Define Training Process
def linear_train(X: np.ndarray, y: np.ndarray, learning_rate: float = 0.01, epochs: int = 10000):
    '''
    Inputs:
        X: Training inputs
        y: Training labels
        learning_rate: Step size for gradient descent
        epochs: Number of iterations
    Outputs:
        loss_his: History of loss values
        params: Dictionary containing optimized 'w' and 'b'
        grads: Dictionary containing final gradients
    '''
    loss_his = []
    # Initialize parameters
    w, b = initialize_params(X.shape[1])
    # Iterative training
    for i in range(1, epochs + 1):
        # Forward pass & gradient calculation
        y_hat, loss, dw, db = linear_loss(X, y, w, b)
        # Gradient descent update
        w -= learning_rate * dw
        b -= learning_rate * db
        # Record loss
        loss_his.append(loss)
        # Print loss every 10,000 epochs
        if i % 10000 == 0:
            print('epoch %d loss %f' % (i, loss))
    # Save final parameters and gradients
    params = {'w': w, 'b': b}
    grads = {'dw': dw, 'db': db}
    return loss_his, params, grads
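Before moving to real data, a quick test on synthetic data (illustrative, not from the original article) is a good way to convince ourselves the loop works: generate targets from known parameters and check that training recovers them.

# Illustrative synthetic check: y = X @ w_true + b_true
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 2))
w_true = np.array([[2.0], [-3.0]])
y_syn = X_syn @ w_true + 0.5
loss_syn, params_syn, _ = linear_train(X_syn, y_syn, learning_rate=0.1, epochs=5000)
print(params_syn['w'].ravel(), params_syn['b'])  # should be close to [2, -3] and 0.5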
3. Training on Real Data
To validate our implementation, we use the Diabetes dataset from Scikit-Learn. This dataset contains 442 samples with 10 physiological features (age, BMI, blood pressure, etc.) used to predict disease progression.
Data Preprocessing:
# Import helpers from sklearn
from sklearn.datasets import load_diabetes
from sklearn.utils import shuffle
# Load diabetes dataset
diabetes = load_diabetes()
data, target = diabetes.data, diabetes.target
# Shuffle randomizes the data
X, y = shuffle(data, target, random_state=42)
# Split into Train (80%) and Test (20%)
offset = int(X.shape[0] * 0.8)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
# Reshape labels to column vectors (n, 1)
y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))
# Print shapes
print("X_train's shape: ", X_train.shape)
print("X_test's shape: ", X_test.shape)
print("y_train's shape: ", y_train.shape)
print("y_test's shape: ", y_test.shape)
Training the Model:
# Train Model
loss_his, params, grads = linear_train(X_train, y_train, 0.01, 200000)
# Print optimized parameters
print(params)
Training Results:
- Initial loss: ~1930
- Final loss: ~1392
The smooth convergence of the loss indicates that our model successfully learned the underlying patterns in the data. The gradual decrease shows that gradient descent is working correctly—each parameter update moves us closer to the optimal solution.
4. Evaluation: Measuring Performance
To quantify how well our model performs, we first need to make predictions on the test set, then calculate evaluation metrics.
Making Predictions:
### Define Prediction Function
def predict(X: np.ndarray, params: dict):
    '''
    Inputs:
        X: Test data
        params: Trained model parameters (w, b)
    Outputs:
        y_pred: Predicted values
    '''
    w = params['w']
    b = params['b']
    y_pred = np.dot(X, w) + b
    return y_pred
# Make predictions on test set
y_pred = predict(X_test, params)
Calculating R² Score:
We implement the $R^2$ score (coefficient of determination):
$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$
where $\bar{y}$ is the mean of the true values. An $R^2$ of 1 indicates perfect predictions, 0 means the model does no better than always predicting the mean, and negative values indicate a fit worse than that baseline.
### Define R-Squared (R2) Metric
def r2_score(y_test: np.ndarray, y_pred: np.ndarray):
    '''
    Inputs:
        y_test: True labels
        y_pred: Predicted labels
    Outputs:
        r2: R-squared coefficient
    '''
    # Mean of true labels
    y_avg = np.mean(y_test)
    # Total Sum of Squares (TSS)
    ss_tot = np.sum((y_test - y_avg) ** 2)
    # Residual Sum of Squares (RSS)
    ss_res = np.sum((y_test - y_pred) ** 2)
    # R2 formula
    r2 = 1 - (ss_res / ss_tot)
    return r2
print(r2_score(y_test, y_pred))
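The metrics below also include a test MSE, which the from-scratch code never computes explicitly; a minimal way to obtain it (standard definition, without the 1/2 factor used during training):

# Mean Squared Error on the test set (standard definition, no 1/2 factor)
mse = np.mean((y_test - y_pred) ** 2)
print(mse)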
Performance Metrics:
- $R^2$ Score: 0.4045
- MSE: 3315.50
An $R^2$ score of ~0.40 indicates that our model explains approximately 40% of the variance in the data. While this isn’t perfect, it’s quite reasonable for real-world medical data, which often contains significant noise and complex interactions that linear models cannot fully capture.
Visualization: The loss curve shows a characteristic exponential decay, confirming that gradient descent converged properly. This smooth, monotonic decrease is exactly what we want to see—no oscillations or divergence.
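A minimal matplotlib sketch to reproduce the loss curve (assuming matplotlib is installed; the original article does not show its plotting code):

import matplotlib.pyplot as plt

plt.plot(loss_his)
plt.xlabel('Epoch')
plt.ylabel('MSE loss')
plt.title('Training loss')
plt.show()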

5. Validation: Comparing with Scikit-Learn
A crucial step in implementing algorithms from scratch is validating against established libraries. Let’s compare our implementation with Scikit-Learn’s industry-standard LinearRegression:
### Compare with scikit-learn
# Import sklearn Linear Regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
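# Note: this r2_score import shadows the from-scratch r2_score defined earlier;
# from here on, r2_score refers to the sklearn implementation.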
# Create model
regr = linear_model.LinearRegression()
# Fit
regr.fit(X_train, y_train)
# Predict
y_pred_sklearn = regr.predict(X_test)
# Print MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred_sklearn))
# Print R2 Score
print('R2 score: %.2f' % r2_score(y_test, y_pred_sklearn))
Results Comparison:
| Metric | Our Implementation | Scikit-Learn |
|---|---|---|
| MSE | 3315.50 | 3368.04 |
| $R^2$ | 0.4045 | 0.40 |
The results are nearly identical! The small difference in MSE arises because Scikit-Learn solves the least-squares problem in closed form while our implementation relies on iterative gradient descent, which may not have fully converged after 200,000 epochs; even so, the $R^2$ scores agree to two decimal places. This confirms that our from-scratch implementation is mathematically correct and produces results on par with a professionally optimized library.
Conclusion
We’ve successfully built linear regression from the ground up, understanding every mathematical operation and implementation detail along the way.
Key Takeaways:
- Linear regression finds optimal parameters by minimizing Mean Squared Error
- Gradient descent iteratively updates parameters in the direction that reduces loss
- Proper initialization, learning rate selection, and sufficient training epochs are crucial
- Our scratch implementation matches professional libraries, validating our understanding
This article is part of a series on ml-from-scratch. All code is available with detailed comments for educational purposes.