
Gradient Descent vs Normal Equation for Regression Problems

In this article, we will see the actual difference between gradient descent and the normal equation through a practical approach. Most newcomers to machine learning learn about gradient descent while studying linear regression and move on without ever knowing about the normal equation, a much underestimated method that is far less complex and gives very good results for small to medium-sized datasets.
If you are new to machine learning, or not familiar with the normal equation or gradient descent, don't worry: I'll try my best to explain them in layman's terms. So, I will start by explaining a little about the regression problem.
Regression is an entry-level supervised machine learning algorithm (both the feature and target variables are given). If we plot all these variables in space, the main task is to fit a line in such a way that it minimizes the cost function, or loss (don't worry, I'll explain this too). There are various types of linear regression, such as simple (one feature), multiple, and logistic (for classification). We use multiple linear regression in this article. The actual regression formula is:

hθ(x) = θ₀ + θ₁x
where θ₀ and θ₁ are the parameters that we have to find in such a way that they minimize the loss. In multiple regression, the formula extends to θ₀ + θ₁x₁ + θ₂x₂ + …. The cost function measures the error between the actual values and the values predicted by our algorithm; it should be as small as possible. Its formula is:

J(θ₀, θ₁) = (1 / 2m) · Σᵢ₌₁ᵐ (hθ(xᶦ) − yᶦ)²
where m = the number of examples (rows) in the dataset, xᶦ = the feature values of the iᵗʰ example, and yᶦ = the actual outcome of the iᵗʰ example.
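To make the cost function concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from the article); it assumes X already has a leading column of ones so that θ₀ is handled like any other parameter:

import numpy as np

def compute_cost(X, y, theta):
    # X: (m, n+1) feature matrix with a leading column of ones for the intercept θ₀
    # y: (m,) vector of actual outcomes
    # theta: (n+1,) vector of parameters
    m = len(y)
    predictions = X @ theta        # hθ(x) for every example
    errors = predictions - y
    return (1 / (2 * m)) * np.sum(errors ** 2)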
Gradient descent is an optimization technique to find the combination of parameters that minimizes the cost function. We start with random values of the parameters (zero in most cases) and then keep changing them to reduce J(θ₀, θ₁) until we end up at a minimum. The update rule is:

θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)
where j represents the index of the parameter and α represents the learning rate. I will not discuss it in depth; you can find handwritten notes for these here.
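As a rough illustration of that update rule, here is a hedged NumPy sketch of batch gradient descent; the learning rate and iteration count are arbitrary placeholders, not values from the article:

def gradient_descent(X, y, theta, alpha=0.01, iterations=1000):
    # Repeatedly apply θⱼ := θⱼ − α · ∂J/∂θⱼ, updating all parameters simultaneously.
    m = len(y)
    for _ in range(iterations):
        errors = X @ theta - y
        gradient = (1 / m) * (X.T @ errors)   # vector of partial derivatives of J
        theta = theta - alpha * gradient
    return theta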
The normal equation is an approach in which we can directly find the best values of the parameters without using gradient descent. It is a very effective algorithm, or I'd rather say formula (as it consists of only one line, θ = (XᵀX)⁻¹Xᵀy), when you are working with smaller datasets.
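For comparison with the gradient descent sketch above, here is a minimal NumPy version of the normal equation; the toy data at the bottom is made up purely for illustration:

def normal_equation(X, y):
    # θ = (XᵀX)⁻¹ Xᵀ y, using the pseudo-inverse for numerical stability
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Example usage (toy data):
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column of ones for θ₀
y = np.array([2.0, 2.5, 3.5])
theta_gd = gradient_descent(X, y, np.zeros(2))
theta_ne = normal_equation(X, y)
# Both should give roughly the same parameters on this tiny dataset.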
