Linear Algebra - Part 3
Sharath Lingam
April 6, 2026

In the previous blog I used Python loops to find the parameter values. Since that was not the most efficient approach, I learned to use matrix multiplication to make the gradient computation faster.
Where did I start?
My goal was simple: I wanted to understand how linear regression actually works, rather than just running Python loops until I found the parameters that give the global minimum.
The model was straightforward: y = mx + b
Loss Function:
We define loss as the mean squared error (MSE):
L = (1/n) Σ (yᵢ - y_predᵢ)²
This loss function tells us how wrong the predictions are: it collapses the effect of the parameters (m and b) into a single number.
Gradient Descent Formulas (in scalar terms):
Taking the derivatives of the loss function with respect to each parameter gives us the gradients we use to minimise the loss.
dL/dm = -2/n Σ xᵢ (yᵢ - y_predᵢ) -> 1
dL/db = -2/n Σ (yᵢ - y_predᵢ) -> 2
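The two scalar update rules above can be sketched as a plain Python loop. This is a minimal illustration on a toy dataset of my own choosing (generated from y = 2x + 1); the learning rate and iteration count are arbitrary picks, not prescriptions.

```python
# Toy dataset generated from y = 2x + 1 (my own example data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0   # start both parameters at zero
lr = 0.05         # learning rate (arbitrary but stable choice here)
n = len(xs)

for _ in range(2000):
    dm, db = 0.0, 0.0
    for x, y in zip(xs, ys):
        error = y - (m * x + b)
        dm += -2.0 / n * x * error   # equation 1: dL/dm
        db += -2.0 / n * error       # equation 2: dL/db
    m -= lr * dm
    b -= lr * db

print(round(m, 2), round(b, 2))  # converges toward m = 2, b = 1
```

Note how m's accumulator multiplies each error by its x, while b's accumulator sums the raw errors, exactly mirroring equations 1 and 2.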
Key insights I gained from the above:
- Error: y - y_pred. Each data point contributes to the gradient, and
- Each contribution depends on the given input x
This leads to:
- m's update depends on x * error, as shown in equation 1
- b's update depends only on the error, as shown in equation 2
The main intuition here: the gradient is not just a formula; it measures how much changing a parameter affects the loss.
For example,
- If x is large, then m has a strong influence.
- If x = 0, m has no influence.
That said,
Gradient = sum of (influence × error)
Learning rate behavior:
What I observed from playing around with the learning rate (LR):
- If the LR is too high, the loss diverges or oscillates.
- If the LR is too low, convergence is very slow.
- If the LR is a suitable value for the problem, the loss decreases smoothly, giving us a smooth convergence.
Important points to notice:
- The direction of the step comes from the gradient, and
- The learning rate controls the step size
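The three LR regimes are easy to reproduce. The sketch below (same toy dataset and update rules as before, all numbers my own choices) runs 50 gradient steps at three learning rates and reports the final loss:

```python
# Compare final MSE after a fixed number of gradient steps at different LRs.
def run(lr, steps=50):
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]   # toy data from y = 2x + 1
    n = len(xs)
    m, b = 0.0, 0.0
    for _ in range(steps):
        dm = sum(-2.0 / n * x * (y - (m * x + b)) for x, y in zip(xs, ys))
        db = sum(-2.0 / n * (y - (m * x + b)) for x, y in zip(xs, ys))
        m -= lr * dm
        b -= lr * db
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n

print(run(0.5))     # too high: the loss blows up (diverges)
print(run(0.0001))  # too low: the loss barely moves from its starting value
print(run(0.05))    # suitable: the loss shrinks close to zero
```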
Transitioning to vector form
Here, instead of computing dL/dm and dL/db separately, we combine the parameters into a single vector θ.
θ can be defined as θ = [m, b]
Then we construct the input matrix X:
X = [[x₁, 1], [x₂, 1], [x₃, 1], …]
Prediction y_pred becomes y_pred = Xθ
Error becomes: e = y - Xθ
Putting it all together, the vector gradient becomes:
Gradient = -(2/n) Xᵀe
It is simply a compact way to compute both parameter gradients at once.
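Here is the same gradient descent as before, rewritten in the vector form using NumPy (same toy dataset and hyperparameters, which are my own choices):

```python
import numpy as np

# Toy dataset generated from y = 2x + 1 (my own example data).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
X = np.column_stack([x, np.ones_like(x)])  # each row is [x_i, 1]
theta = np.zeros(2)                        # theta = [m, b]
lr, n = 0.05, len(x)

for _ in range(2000):
    e = y - X @ theta             # error vector: y - Xθ
    grad = -(2.0 / n) * X.T @ e   # gradient = -(2/n) Xᵀe
    theta -= lr * grad

print(theta)  # converges toward [2, 1], i.e. m = 2, b = 1
```

The inner loop over data points is gone: one matrix-vector product computes every prediction, and one more computes both gradients.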
Insights I gained about Xᵀe:
Transposing X rearranges the input so that:
First row => collects all the x values
Second row => collects all the constants (1s)
Multiplying Xᵀ by e, we get:
Xᵀe = [[Σ xᵢ eᵢ], [Σ eᵢ]]
Which exactly matches the structure of:
[[dL/dm], [dL/db]] (up to the -2/n factor)
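This correspondence is easy to verify numerically. The snippet below (arbitrary numbers of my own) computes Xᵀe once with a matrix product and once with the explicit sums:

```python
import numpy as np

# Arbitrary toy values: three inputs and three error terms.
x = np.array([1.0, 2.0, 3.0])
e = np.array([0.5, -1.0, 2.0])
X = np.column_stack([x, np.ones_like(x)])  # rows are [x_i, 1]

vec = X.T @ e                               # matrix form: Xᵀe
loop = np.array([sum(xi * ei for xi, ei in zip(x, e)),  # Σ xᵢ eᵢ
                 sum(e)])                               # Σ eᵢ
print(vec, loop)  # both are [4.5, 1.5]
```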
Scaling Vs Direction:
A gradient has two parts,
- Direction => given by X^T.e
- Scaling => given by constants like 2/n
Multiplying the gradient by a constant:
- does not change the direction, and
- only changes the step size
Note: The learning rate can compensate for the scaling constant.
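A quick sanity check of the direction/scaling split, using a made-up gradient vector: scaling by a positive constant (such as 2/n) leaves the unit direction unchanged.

```python
import numpy as np

g = np.array([3.0, 4.0])   # a made-up gradient vector
scaled = (2.0 / 5) * g     # e.g. the 2/n factor with n = 5

unit = g / np.linalg.norm(g)                    # direction of g
unit_scaled = scaled / np.linalg.norm(scaled)   # direction of the scaled g
print(unit, unit_scaled)  # identical unit vectors: [0.6, 0.8]
```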
Normalization (1/n):
Including the 1/n means:
- We average the contributions of the data points
- This keeps the gradient stable regardless of the dataset size.
Without normalization, large datasets end up producing large gradients, which in turn give us unstable updates.
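To see the effect of 1/n concretely, the sketch below (toy data of my own, duplicated to grow n) compares the raw sum-of-contributions gradient against the averaged one:

```python
import numpy as np

def grads(n_repeat):
    # Same toy data from y = 2x + 1, tiled n_repeat times to grow the dataset.
    x = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), n_repeat)
    y = 2 * x + 1
    X = np.column_stack([x, np.ones_like(x)])
    theta = np.zeros(2)
    e = y - X @ theta
    unnormalized = -2.0 * X.T @ e        # no 1/n: grows with dataset size
    normalized = unnormalized / len(x)   # with 1/n: an average per point
    return unnormalized, normalized

u_small, m_small = grads(1)
u_big, m_big = grads(100)
print(np.linalg.norm(u_big) / np.linalg.norm(u_small))  # 100x larger without 1/n
print(np.allclose(m_small, m_big))                      # True: averaged gradient is stable
```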
Loop vs Vectorized Implementation
Loop:
- Clear, good for learning and debugging, but less efficient
Vectorized implementation:
- Uses matrix operations, making it faster and more scalable
Both produce the same final results but differ in speed and in how they scale with the input.
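That equivalence is worth checking directly. The sketch below (random toy data, seeded for reproducibility; all names my own) computes the gradients both ways at the same parameter values and confirms they match:

```python
import numpy as np

# Random toy data around y = 2x + 1, seeded so the run is reproducible.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 2 * x + 1 + rng.normal(0, 0.1, size=1000)
m, b, n = 0.5, -0.3, len(x)   # arbitrary current parameter values

# Loop version: accumulate per-point contributions (equations 1 and 2).
dm = sum(-2.0 / n * xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
db = sum(-2.0 / n * (yi - (m * xi + b)) for xi, yi in zip(x, y))

# Vectorized version: one matrix expression, -(2/n) Xᵀe.
X = np.column_stack([x, np.ones_like(x)])
grad = -(2.0 / n) * X.T @ (y - X @ np.array([m, b]))

print(np.allclose([dm, db], grad))  # True: same gradients, different mechanics
```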
Conclusion
This shift from loops to linear algebra is not about added complexity; it is about expressing the same idea more efficiently and scaling it to real-world problems.
It took me a while but as soon as I understood the mapping,
Σ xᵢ eᵢ → Xᵀe
everything started to make sense.
Ten pages of rough work with matrix operations, explanations, and notes were worth every second of understanding this core concept :)