Linear Algebra - Part 3
Sharath Lingam
April 6, 2026

In the previous blog I used Python loops to find the parameter values. Since that was not the most efficient approach, I learned to use matrix multiplication to make the gradient computation faster.
Where did I start?
My goal was simple: I wanted to understand how linear regression actually works, rather than just running Python loops until I found the parameters that give the global minimum.
The model was straightforward: y = mx + b
Loss Function:
We define loss as the mean squared error (MSE):
L = (1/n) Σ (yᵢ - y_predᵢ)²
This loss function tells us how wrong the predictions are: it collapses the effect of the parameters (m and b) into a single number.
Gradient Descent Formulas (in scalar terms):
Taking the derivatives of the loss function with respect to each parameter gives us the gradients we use to minimise the loss.
dL/dm = -2/n Σ xᵢ (yᵢ - y_predᵢ) -> 1
dL/db = -2/n Σ (yᵢ - y_predᵢ) -> 2
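The two scalar update rules above can be sketched as a plain Python loop. This is a minimal illustration on a toy dataset of my own choosing (generated from y = 2x + 1); the learning rate and iteration count are arbitrary picks, not prescriptions.

```python
# Toy dataset generated from y = 2x + 1 (my own example data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0   # start both parameters at zero
lr = 0.05         # learning rate (arbitrary but stable choice here)
n = len(xs)

for _ in range(2000):
    dm, db = 0.0, 0.0
    for x, y in zip(xs, ys):
        error = y - (m * x + b)
        dm += -2.0 / n * x * error   # equation 1: dL/dm
        db += -2.0 / n * error       # equation 2: dL/db
    m -= lr * dm
    b -= lr * db

print(round(m, 2), round(b, 2))  # converges toward m = 2, b = 1
```

Note how m's accumulator multiplies each error by its x, while b's accumulator sums the raw errors, exactly mirroring equations 1 and 2.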
Key insights I gained from the above:
- Error: y - y_pred. Each data point contributes to the gradient, and
- Each contribution depends on the given input x
This leads to:
- m's update depends on x * error, as shown in equation 1
- b's update depends only on the error, as shown in equation 2
The main intuition here: the gradient is not just a formula; it measures how much changing a parameter affects the loss.
For example,
- If x is large, then m has a strong influence.
- If x = 0, m has no influence.
That said,
Gradient = sum of (influence × error)
Learning rate behavior:
What I observed from playing around with the learning rate (LR):
- If the LR is too high, the loss diverges or oscillates.
- If the LR is too low, convergence is very slow.
- If the LR is a suitable value for the problem, the loss decreases smoothly, giving us a smooth convergence.
Important points to notice:
- The direction of the step comes from the gradient, and
- The learning rate controls the step size
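The three LR regimes are easy to reproduce. The sketch below (same toy dataset and update rules as before, all numbers my own choices) runs 50 gradient steps at three learning rates and reports the final loss:

```python
# Compare final MSE after a fixed number of gradient steps at different LRs.
def run(lr, steps=50):
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]   # toy data from y = 2x + 1
    n = len(xs)
    m, b = 0.0, 0.0
    for _ in range(steps):
        dm = sum(-2.0 / n * x * (y - (m * x + b)) for x, y in zip(xs, ys))
        db = sum(-2.0 / n * (y - (m * x + b)) for x, y in zip(xs, ys))
        m -= lr * dm
        b -= lr * db
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n

print(run(0.5))     # too high: the loss blows up (diverges)
print(run(0.0001))  # too low: the loss barely moves from its starting value
print(run(0.05))    # suitable: the loss shrinks close to zero
```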
Transitioning to vector form
Here, instead of computing dL/dm and dL/db separately, we combine the parameters into a single vector θ.
θ can be defined as θ = [m, b]
Then we construct the input matrix X:
X = [[x₁, 1], [x₂, 1], [x₃, 1], …]
Prediction y_pred becomes y_pred = Xθ
Error becomes: e = y - Xθ
Putting it all together, the vector gradient becomes:
Gradient = -(2/n) Xᵀe
It is simply a compact way to compute both parameter gradients at once.
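Here is the same gradient descent as before, rewritten in the vector form using NumPy (same toy dataset and hyperparameters, which are my own choices):

```python
import numpy as np

# Toy dataset generated from y = 2x + 1 (my own example data).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
X = np.column_stack([x, np.ones_like(x)])  # each row is [x_i, 1]
theta = np.zeros(2)                        # theta = [m, b]
lr, n = 0.05, len(x)

for _ in range(2000):
    e = y - X @ theta             # error vector: y - Xθ
    grad = -(2.0 / n) * X.T @ e   # gradient = -(2/n) Xᵀe
    theta -= lr * grad

print(theta)  # converges toward [2, 1], i.e. m = 2, b = 1
```

The inner loop over data points is gone: one matrix-vector product computes every prediction, and one more computes both gradients.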
Insights I gained about Xᵀe:
Transposing X rearranges the input so that:
First row => collects all the x values
Second row => collects all the constants (1s)
Multiplying Xᵀ by e, we get:
Xᵀe = [[Σ xᵢ eᵢ], [Σ eᵢ]]
Which exactly matches the structure of:
[[dL/dm], [dL/db]] (up to the -2/n factor)
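This correspondence is easy to verify numerically. The snippet below (arbitrary numbers of my own) computes Xᵀe once with a matrix product and once with the explicit sums:

```python
import numpy as np

# Arbitrary toy values: three inputs and three error terms.
x = np.array([1.0, 2.0, 3.0])
e = np.array([0.5, -1.0, 2.0])
X = np.column_stack([x, np.ones_like(x)])  # rows are [x_i, 1]

vec = X.T @ e                               # matrix form: Xᵀe
loop = np.array([sum(xi * ei for xi, ei in zip(x, e)),  # Σ xᵢ eᵢ
                 sum(e)])                               # Σ eᵢ
print(vec, loop)  # both are [4.5, 1.5]
```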
Scaling Vs Direction:
A gradient has two parts,
- Direction => given by X^T.e
- Scaling => given by constants like 2/n
Multiplying the gradient by a constant:
- does not change the direction, and
- only changes the step size
Note: The learning rate can compensate for the scaling constant.
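A quick sanity check of the direction/scaling split, using a made-up gradient vector: scaling by a positive constant (such as 2/n) leaves the unit direction unchanged.

```python
import numpy as np

g = np.array([3.0, 4.0])   # a made-up gradient vector
scaled = (2.0 / 5) * g     # e.g. the 2/n factor with n = 5

unit = g / np.linalg.norm(g)                    # direction of g
unit_scaled = scaled / np.linalg.norm(scaled)   # direction of the scaled g
print(unit, unit_scaled)  # identical unit vectors: [0.6, 0.8]
```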
Normalization (1/n):
Including the 1/n means:
- We average the contributions of the data points
- This keeps the gradient stable regardless of the dataset size.
Without normalization, large datasets end up producing large gradients, which in turn give us unstable updates.
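To see the effect of 1/n concretely, the sketch below (toy data of my own, duplicated to grow n) compares the raw sum-of-contributions gradient against the averaged one:

```python
import numpy as np

def grads(n_repeat):
    # Same toy data from y = 2x + 1, tiled n_repeat times to grow the dataset.
    x = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), n_repeat)
    y = 2 * x + 1
    X = np.column_stack([x, np.ones_like(x)])
    theta = np.zeros(2)
    e = y - X @ theta
    unnormalized = -2.0 * X.T @ e        # no 1/n: grows with dataset size
    normalized = unnormalized / len(x)   # with 1/n: an average per point
    return unnormalized, normalized

u_small, m_small = grads(1)
u_big, m_big = grads(100)
print(np.linalg.norm(u_big) / np.linalg.norm(u_small))  # 100x larger without 1/n
print(np.allclose(m_small, m_big))                      # True: averaged gradient is stable
```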
Loop vs Vectorized Implementation
Loop:
- Clear, good for learning and debugging, but less efficient
Vectorized implementation:
- Uses matrix operations, making it faster and more scalable
Both produce the same final results but differ in speed and in how they scale with the input.
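That equivalence is worth checking directly. The sketch below (random toy data, seeded for reproducibility; all names my own) computes the gradients both ways at the same parameter values and confirms they match:

```python
import numpy as np

# Random toy data around y = 2x + 1, seeded so the run is reproducible.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 2 * x + 1 + rng.normal(0, 0.1, size=1000)
m, b, n = 0.5, -0.3, len(x)   # arbitrary current parameter values

# Loop version: accumulate per-point contributions (equations 1 and 2).
dm = sum(-2.0 / n * xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
db = sum(-2.0 / n * (yi - (m * xi + b)) for xi, yi in zip(x, y))

# Vectorized version: one matrix expression, -(2/n) Xᵀe.
X = np.column_stack([x, np.ones_like(x)])
grad = -(2.0 / n) * X.T @ (y - X @ np.array([m, b]))

print(np.allclose([dm, db], grad))  # True: same gradients, different mechanics
```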
Conclusion
This shift from loops to linear algebra is not about added complexity; it is about expressing the same idea more efficiently and scaling it to real-world problems.
It took me a while but as soon as I understood the mapping,
Σ xᵢ eᵢ → Xᵀe
everything started to make sense.
Ten pages of rough work with matrix operations, explanations, and notes were worth every second of understanding this core concept :)