Linear Algebra - Part 2
Sharath Lingam
April 1, 2026

Intro
In this section, I am documenting linear regression and gradient descent, and explaining why derivatives are effective in this context.
Problem
Given a set of data points, how do we mathematically find the best-fitting line that minimises prediction error?
I had trouble understanding this part of linear regression and the concept behind it.
Learnings
- Linear regression tries to find a line that best fits the data by minimising prediction error.
- Regression in machine learning is a technique used to predict continuous numeric values based on the relationship between the input features and the target variable.
- "Linear" means the model is a straight line.
- Equation of a line:
- y = mx + b
- m = slope
- b = intercept
- The goal is to find m and b such that the prediction error is minimised.
- Learning Rate (α):
- Controls the size of the steps a model takes when updating its parameters.
- Example - walking down a hill:
- Imagine someone climbing down a hill blindfolded (the model during regression).
- They take small, calculated steps until flat ground is reached, i.e. until the global minimum is reached in machine-learning terms.
- To reach the global minimum, the step sizes must be controlled, which is what the learning rate does.
- Mean Squared Error (MSE):
- The function used to measure the prediction error.
- MSE = (1/n) · Σ (predicted − actual)², i.e. the average of the squared errors.
- MSE squares each error so that bigger mistakes are penalised more, and so that negative errors do not cancel out positive ones.
- Loss function: the general name for the function being minimised; MSE is one common choice of loss function (not the other way around).
- Gradient Descent
- Gradient descent is an optimisation technique used in supervised machine learning to update the parameters (m, b) in the direction that reduces the loss.
- Gradient descent is highly sensitive to the learning rate. If the learning rate is too high, the updates overshoot: the parameter values swing far from the desired output, the loss grows instead of shrinking, and the model may never converge to the global minimum.
- Convergence behaviour:
- Convergence is the process/state of settling on a final answer.
- Convergence is when the loss stabilises and the parameter updates become negligibly small.
- In linear regression, converging means reaching (or getting very close to) the global minimum of the loss.
- Derivatives tell how much the output changes when the input changes slightly.
- They answer the question: "If I change the input, how much will the loss change?"
- For a curve like y = x², as x increases the slope increases, so the rate of change of the output increases.
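As a small example of the "if I change the input, how much does the loss change?" idea, the sketch below (with made-up data points, not from these notes) compares the analytic derivative of the MSE with respect to the slope m against a finite-difference estimate — nudging m slightly and seeing how much the loss moves:

```python
# Made-up data generated from y = 2x + 1, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

def mse(m, b):
    # Mean Squared Error: average of squared prediction errors
    return sum((m * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

def dmse_dm(m, b):
    # Analytic partial derivative of the MSE with respect to the slope m
    return sum(2 * xi * (m * xi + b - yi) for xi, yi in zip(x, y)) / len(x)

m, b = 0.0, 0.0
eps = 1e-6
# Finite difference: nudge m by a tiny amount and measure the change in loss
numeric = (mse(m + eps, b) - mse(m, b)) / eps
print(dmse_dm(m, b), numeric)  # the two values should nearly match
```

The two numbers agreeing is exactly why derivatives are useful here: they predict the effect of a tiny parameter change on the loss without having to try the change.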
Practical learnings
Steps to fit a linear regression:
- Initialise the values of m and b.
- Use m and b to compute predictions with the slope-intercept formula, and measure the error using the cost function (MSE - Mean Squared Error).
- Reduce the error by finding better values of m and b using gradient descent.
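The steps above can be sketched as a small training loop. This is a minimal sketch, assuming made-up data that follows y = 2x + 1; the learning rate and iteration count are illustrative choices, not values from these notes:

```python
# Made-up data generated from y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
n = len(x)

m, b = 0.0, 0.0   # step 1: initialise the parameters
alpha = 0.05      # learning rate (alpha) - an illustrative choice

for _ in range(2000):
    # step 2: predict with the current line and measure the error
    errors = [m * xi + b - yi for xi, yi in zip(x, y)]
    # step 3: gradient descent - move m and b downhill along the MSE gradient
    dm = sum(2 * xi * e for xi, e in zip(x, errors)) / n
    db = sum(2 * e for e in errors) / n
    m -= alpha * dm
    b -= alpha * db

print(round(m, 2), round(b, 2))  # should land close to m = 2, b = 1
```

Printing the gradients over the iterations also shows the behaviour described below: early on they are large and the loss drops quickly; near the minimum they shrink and each update barely changes the loss.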
The final takeaway for me on linear regression: when the gradient is large, the loss is large and the predictions are far off. As the algorithm runs, the loss decreases, and so do the error and the gradient. Near the minimum, the parameter updates have very little effect on the loss and give only minimal improvement.
What actually clicked for me?
I understood linear regression conceptually, but things only became clear after implementing gradient descent and seeing derivatives as a tool that drives the optimisation rather than just as formulas.
This just made gradient descent intuitive for me :)
Conclusion
Treating machine learning algorithms as something that helps us build something cool rather than just memorising the formulas has helped me understand the concept and the mechanism behind it.
Resources
Resources used to learn these concepts:
- https://www.youtube.com/watch?v=-FOH7LHehGc [Tamil explanation]
- ChatGPT [Code examples and learning structure]
PS: If you find any misunderstandings I have in any concepts, it will mean a lot to let me know about it here: sharathlingam.s@gmail.com