Less than 1000 views?!!! This video should have at least 100,000 views. This was amazing, thaaaaank you!
Thank you! I am glad you enjoyed the video. Feel free to share the video and subscribe to the channel.
This tutorial helped me understand the cost function easily.
Glad it helped! Feel free to like the video and subscribe to the channel for more videos! Thanks for watching
Thanks for making linear regression interesting & helping in developing intuitions. Appreciate your effort!
Thank you for watching! I am glad you found it useful
May I ask how the 2 in "2/2m" at 17:34 ends up there? Where does it come from? I understand that it is part of the derivative in the non-vectorized solution, but where does it come from in the vectorized one?
When you have a vector term like (a^T) * (a), where (a) is a vector, that is essentially the squared norm of the vector. So when you take the derivative, it is like taking the derivative of a^2, and you end up with 2 * (a^T). The same can be said for the term (X*THETA - Y)^T * (X*THETA - Y).
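That identity can be checked numerically with a finite-difference sketch (the vector below is made up for illustration):

```python
import numpy as np

# Verify numerically that the gradient of f(a) = a^T a is 2a,
# which is where the factor of 2 in the derivation comes from.
def f(a):
    return a @ a  # squared Euclidean norm of a

a = np.array([1.5, -2.0, 0.5])
eps = 1e-6

# Central finite-difference approximation, one component at a time
numeric_grad = np.zeros_like(a)
for i in range(len(a)):
    step = np.zeros_like(a)
    step[i] = eps
    numeric_grad[i] = (f(a + step) - f(a - step)) / (2 * eps)

analytic_grad = 2 * a  # the analytic result used in the video
print(np.allclose(numeric_grad, analytic_grad))  # True
```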
Thank you for this clear explanation of linear regression with gradient descent, it was a very well thought-out class. For the first time I feel like I understand!
Thank you for watching! I am glad you found it useful
Please let me know what other topics you would like to see videos on
I really liked how you explained convergence at 9:00
Hi Pablo! Glad you enjoyed the video.. feel free to check out other videos on the channel, and let me know what topics you would like to see covered!
Can I ask you one thing: consider I have a dataset with many different features (counts of bacterial phyla), and the dependent variable is whether the subjects have a disease. I understand this is not linear, but let's say it were. I am wondering: could you, instead of [1, x, x^2 ..... x^n], use [feature1, feature2, feature3 .... feature4] and treat the function as a linear combination of features? Or am I completely wrong here? Mainly, I'm trying to understand how the rows and columns of my dataset relate to regression.
Hi Louise. Yes, that is actually a good mathematical trick: your features (or basis functions, as they are sometimes called) can be of any shape, as long as the model is a linear combination of them. And you can still use linear regression to fit that model.
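A small sketch of that idea (the feature data and coefficients below are made up; rows are subjects, columns are features such as per-phylum counts):

```python
import numpy as np

# Instead of a polynomial basis [1, x, x^2, ...], each column of the
# design matrix is one measured feature, plus a column of ones for the
# intercept. The model stays a linear combination, so ordinary least
# squares still applies.
rng = np.random.default_rng(0)
num_samples, num_features = 50, 3

features = rng.normal(size=(num_samples, num_features))
true_theta = np.array([2.0, -1.0, 0.5, 3.0])          # last entry = intercept
X = np.hstack([features, np.ones((num_samples, 1))])   # design matrix
y = X @ true_theta                                     # noiseless linear combination

# Least-squares fit recovers the coefficients of the linear combination
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)
```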
When we use simple linear regression or multiple linear regression, does it use OLS by default, or does it use gradient descent to find the best-fit line? Please answer my question.
The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. If you have a function that you wish to minimize which contains more terms, for example regularization terms, then ordinary least squares may not work. Gradient descent is taught with a linear regression problem because it is easier to understand, but it can also be applied to other problems.
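A minimal sketch of how the two relate (data and step size here are made up): for the plain sum-of-squared-residuals cost, gradient descent converges to the same line that the OLS closed form gives.

```python
import numpy as np

# Toy data on the line y = 1 + 3x
x = np.linspace(0, 1, 20)
y = 3.0 * x + 1.0
X = np.column_stack([np.ones_like(x), x])   # [1, x] design matrix

# OLS closed form: theta = (X^T X)^{-1} X^T y
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on J(theta) = (1/2m) ||X theta - y||^2
theta = np.zeros(2)
alpha, m = 0.5, len(y)
for _ in range(5000):
    theta -= alpha / m * X.T @ (X @ theta - y)

print(theta_ols, theta)  # both close to [1.0, 3.0]
```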
Wow, that is exactly what I struggled with back in college, thinking through problem solving, hehe..
Wow! Maths is explained very well. Please make some more videos
I am glad you enjoyed the video. I will be putting out some new videos soon; let me know if you have any topics you want videos on.
Please subscribe if you have not already, and share the videos you like! Thanks for watching
Thanks for your reply!
I have watched many videos on the math of linear regression, but in your video you have explained it in great detail, with some extra things as well!
So I request that you make a playlist on machine learning and AI.
nice explanations!!
Very good explanation
Thank you! I am glad you found this video clear and useful. Please let me know if there are other topics you would like to see videos on
thanks
Hello guys, I'm a little confused here, so help me out. In the theoretical lecture, you take the gradient of J (the cost function) as the summation of ((y_hat[i] - y[i]) * x[i]) / m over all values of i = 1, ..., m. Only after this summed gradient is computed over all i do we calculate the new theta (parameters), by plugging in the gradient along with alpha and subtracting from the current theta value.
But when I went through the code in the Jupyter notebook, in the function lin_reg_batch_gradient_descent, you calculate a new theta (parameter) for every i and add it to the current value of theta.
So instead of this:
for x, y in zip(input_var, output_var):
    y_hat = np.dot(params, np.array([1.0, x]))
    gradient = np.array([1.0, x]) * (y - y_hat)
    params += alpha * gradient / num_samples
shouldn't it be this (according to the theoretical lecture)?
gradient = np.zeros(2)
for x, y in zip(input_var, output_var):
    y_hat = np.dot(params, np.array([1.0, x]))
    gradient += np.array([1.0, x]) * (y - y_hat)
params += alpha * gradient / num_samples
When I used the second piece of code, the gradient value becomes so big that I get an overflow error.
Hi Sundeep. Thanks for the comment.
If you look at the first equation in the notebook you will see the update equation for the parameters as follows:
theta+ = theta- + (alpha/m) (yi - h(xi)) x_bar
This equation implies that the update happens for every data sample.
The summation operator used in the theoretical lecture has a distributive property: any constant multiplying the sum can be moved inside or outside the summation. For example, alpha * sum_i(x_i) is the same as sum_i(alpha * x_i) when alpha is constant.
However, when implementing gradient descent, if the number of samples is large your gradient can explode if you apply the alpha scaling only after you do the sum (which is what you are seeing). This is an issue with finite-precision systems like computers. To keep the numbers manageable we bring the constant alpha inside the sum.
In summary, both approaches are mathematically equivalent, but we use the one with alpha inside the sum to maintain numerical stability.
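A sketch of the two update schemes discussed above, on made-up, noiseless data (step size and epoch count are illustrative). Variant A scales by alpha/m and updates params once per sample, as in the notebook; variant B accumulates the summed gradient and updates once per pass. With a small step size, both settle on the same line here.

```python
import numpy as np

input_var = np.linspace(0, 1, 40)
output_var = 2.0 * input_var + 0.5          # true line: intercept 0.5, slope 2.0
alpha, num_samples, epochs = 0.1, len(input_var), 3000

# Variant A: alpha brought inside the sum, update per sample
params_a = np.zeros(2)
for _ in range(epochs):
    for x, y in zip(input_var, output_var):
        y_hat = np.dot(params_a, np.array([1.0, x]))
        params_a += alpha * np.array([1.0, x]) * (y - y_hat) / num_samples

# Variant B: sum the full gradient, then apply alpha once per pass
params_b = np.zeros(2)
for _ in range(epochs):
    gradient = np.zeros(2)
    for x, y in zip(input_var, output_var):
        y_hat = np.dot(params_b, np.array([1.0, x]))
        gradient += np.array([1.0, x]) * (y - y_hat)
    params_b += alpha * gradient / num_samples

print(params_a, params_b)  # both approach [0.5, 2.0]
```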
Thank you @EndlessEngineering for clearly explaining the reasoning behind why we apply alpha and calculate theta for every data point. So, if my understanding is correct, we use every data point with alpha to calculate the new theta when the data volume is huge, to avoid gradient explosion; if the data size is small, we can use the summation of the gradients to calculate the new theta.
Also, I would like to compliment you on your code. It's very easy to understand: a very smart use of the np.dot function to calculate the slope and intercept of the regression line, avoiding two separate variables for the parameters.
You are amazing! Thank you so much
Thank you! I am glad you enjoyed the video
great lecture! thanks!
Thank you for watching! I am glad you found it useful
I have a query: how could I optimize the selection of theta on the cost-function curve so as to lower the computation time, rather than selecting it randomly?
Hi Ajitesh, thanks for the question.
I am not sure I fully understand your question. In linear regression the cost function is defined to be a quadratic function of the error, and the computation time to minimize it depends on the algorithm used. In this video we are using the gradient descent algorithm, which has certain computational characteristics. You can certainly try other algorithms for solving the optimization problem.
I am unclear on what you mean by "selecting it randomly." Are you referring to the initial value of theta, or the cost function itself?
@EndlessEngineering Thanks for replying. I was referring to the initial value of theta. I have a college project on minimizing the computation time by selecting a good initial value of theta, so that future iterations can be reduced. Any help would be appreciated.
@ajiteshbhan For most numerical optimization algorithms, the initial condition is not what determines how fast they run. Even if you knew the minimum (which never happens in real life) and initialized your theta close to it, the algorithm may still choose to explore the parameter space and take a number of iterations. That is why random initialization is a good strategy. As I mentioned in my earlier comment, the convergence time depends more on the algorithm used.
Having said that, if your optimization problem has a certain structure and you can guess where the minimum is (or close to it), you should initialize your parameters as close to that as possible. For example, we know that for a cost of x^2 the minimum is at x = 0.0, so we can initialize close to that.
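A tiny sketch of this trade-off on the x^2 example (step size and tolerance are made up): a closer start saves some iterations, but the contraction per step, set by the step size, dominates how fast either run converges.

```python
# Gradient descent on f(x) = x^2, whose minimum is at x = 0.
# Count iterations to reach a tolerance from different starting points.
def iterations_to_converge(x0, alpha=0.1, tol=1e-8):
    x, steps = x0, 0
    while abs(x) > tol:
        x -= alpha * 2 * x   # gradient of x^2 is 2x
        steps += 1
    return steps

print(iterations_to_converge(100.0))  # far start
print(iterations_to_converge(0.1))    # near start: fewer iterations
```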
I don't understand: gradient descent for ML, and least squares for statistical learning. Which one is better?
Hi Phuc, thanks for your question.
Gradient descent is an algorithm used to minimize a function; in the linear regression case, that is the sum-of-squared-errors cost function. The least squares solution I show in the video is an analytical solution that is possible when the cost function is the sum of the squared errors.
In short, the gradient descent algorithm can be applied to any kind of cost (although you have to be careful with convergence), but the least squares analytical solution applies to the quadratic cost only.
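As a minimal sketch of that generality (one-parameter model with made-up data): gradient descent can minimize a non-quadratic cost, such as the smooth log-cosh robust loss chosen here for illustration, where no least-squares closed form applies.

```python
import numpy as np

# Fit y ≈ theta * x by minimizing mean(log(cosh(residual))),
# a smooth non-quadratic cost with no analytic least-squares solution.
x = np.linspace(-1, 1, 30)
y = 2.0 * x                           # noiseless line through the origin

theta, alpha = 0.0, 0.1
for _ in range(2000):
    r = theta * x - y                 # residuals
    grad = np.mean(np.tanh(r) * x)    # d/dtheta of mean(log(cosh(r)))
    theta -= alpha * grad

print(theta)  # close to 2.0
```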
It was great!!!!
Thank you for watching! I am glad you found it useful
I think the xbar vector is in the space of n+1 not n
Thanks for your comment. The space of xbar depends on how you define the model; I chose to define xbar as an n-space vector. If you choose to define x as an n-space vector, then your xbar would be an (n+1)-space vector. Just be sure that the shape of your parameters vector matches the model definition.
You should have used conventional symbols; this is making my brain hurt. Why use m when n could stand for the number of data points? Why use theta at all?? Your explanation is good, but it could be better by leaving out complicated symbols. Why convert into a vector??