Jesus man, I remember back before I started college when I checked out Prof Strang’s calculus series.
He’s aged quite a lot since that series, but he’s always sharp as a tack. And I’m just astonished that even being so old he knows so much about machine learning, I didn’t think it was his field.
Huge kudos Gilbert Strang, huge kudos.
Impressive indeed. I'd be happy to be 50% as sharp at that age as he was here.
Professor Strang, thank you for an old-fashioned lecture on Accelerating Gradient Descent.
These topics are very theoretical for the average student.
Those who have the sixth edition of Introduction to Linear Algebra can enjoy this course!!! In my view this course really increases the value of the book.
Such a great lecturer, here as well as in his classic Linear Algebra lecture series. Really nice to see him up and healthy, sharp, and as great a step-by-step explainer as ever.
I'm so happy to see you here. I only trust you when it comes to lecture
Wow this old man is so smart. I would wish to see more lectures from him and learn much more of this stuff.
Absolutely! This man is a pure treasure.
Check out his linear algebra course, this is one of the most liked playlists of MIT.
ua-cam.com/video/7UJ4CFRGd-U/v-deo.html
Why are there not more comments for such a great course? MIT is a great university!
I'm just speechless.
He radiates knowledge. Love the content!
Wishing the old gentleman good health, and thank you very much!
Prof Boyd is also a very good teacher!
I enjoy his lecture very much.
such great lecturing makes me wonder what part of MIT student success is due to innate ability and how much due to superior teaching
In terms of this very lecture: think about a professor as a gradient with your ability being a momentum. ;)
I loved this amazing lecture. Great professor, and great content. Thanks for sharing it openly on YouTube.
Tough course to follow, from what I feel (I'm currently in my 4th semester of undergrad)
Great lecture of Prof Gilbert, I feel kinda dumb after listening to this lecture, will try again
Finally a lecture that explains the magic numbers in momentum! Those shorter video formats are great for introduction but leave me confused about the math behind it. Love the ground up approach to explaining.
Could anyone tell me which book Professor Strang mentioned at 06:53 of the lecture?
web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
It’s nice you got it on a linear line.
wow, beautiful, now i see why it oscillates
Crystal clear! Thank you very much for sharing it
Excellent lecture
why is it enough to assume x follows an eigenvector to demonstrate the rate of convergence?
Can this procedure be expanded to deal with problems in multiple dimensions? So a, b, c, and d are not scalars but actually vectors themselves, representing the inputs x1, x2, x3 to a function f(x1, x2, x3). How would you form R that way, and would you have different condition numbers for each element of b?
At 27:00, why follow the direction of an eigenvector? It just comes out of nowhere.
i think it has something to do with pca.
Can anyone provide some clarification here?
I think why we would like to follow an eigen-vector is made clear, but what's not clear to me is why we expected this would work prior to deriving the result (that f decreases faster).
I can see that following an eigen vector reduces the problem of inverting a block matrix containing the original S to just inverting a much smaller matrix of scalars. So, maybe this strategy was just wishful thinking that paid off?
Insight would be very welcome. Thanks.
@@e2DAiPIE maybe if you can show that the method converges in all directions pointed by eigenvectors then it also converges with at least the same rate in all other directions (since any vector x in S can be written as a linear combination of the eigenbasis)
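To make the linear-combination argument above concrete, here is a minimal sketch, not anything from the lecture itself. It assumes the momentum iteration has the form x_(k+1) = x_k - s*z_k, z_(k+1) = grad f(x_(k+1)) + beta*z_k with grad f(x) = S x. The matrix S, the values of s and beta, and all function names are made up for illustration. It runs the iteration once on the full vector and once as separate scalar recurrences (c_k, d_k) along each eigenvector, and the two should agree because every step is linear.

```python
import math

# Assumed example: S = [[2, 1], [1, 2]] has eigenvalues 1 and 3 with
# orthonormal eigenvectors (1, -1)/sqrt(2) and (1, 1)/sqrt(2).
S = [[2.0, 1.0], [1.0, 2.0]]
EIGS = [
    (1.0, (1 / math.sqrt(2), -1 / math.sqrt(2))),
    (3.0, (1 / math.sqrt(2), 1 / math.sqrt(2))),
]


def matvec(A, x):
    """2x2 matrix times 2-vector."""
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]


def momentum_full(x0, s, beta, steps):
    """Momentum iteration on the full vector; the gradient of f is S x."""
    x = list(x0)
    z = matvec(S, x)  # z_0 = grad f(x_0)
    for _ in range(steps):
        x = [x[i] - s * z[i] for i in range(2)]
        g = matvec(S, x)
        z = [g[i] + beta * z[i] for i in range(2)]
    return x


def momentum_by_eigenvector(x0, s, beta, steps):
    """Same iteration, run as scalar recurrences (c_k, d_k) per eigenvector."""
    x = [0.0, 0.0]
    for lam, q in EIGS:
        c = q[0] * x0[0] + q[1] * x0[1]  # coefficient of x_0 along q
        d = lam * c                      # z_0 = S x_0 has coefficient lam * c
        for _ in range(steps):
            c = c - s * d
            d = lam * c + beta * d
        x = [x[i] + c * q[i] for i in range(2)]  # reassemble x_k
    return x


full = momentum_full([1.0, 2.0], 0.5, 0.1, 5)
split = momentum_by_eigenvector([1.0, 2.0], 0.5, 0.1, 5)
print(full, split)
```

So the slowest eigen-direction sets the overall rate, which is why it seems to be enough to analyze one eigenvector at a time.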
Such a great lecturer. Thank you!
That guy who is always capturing the photo
B is just the momentum :)
why do we need to make the eigen vector as small as possible ?
You mean why are we trying to make the eigenvalue as small as possible? I am also wondering the same... if we make eigenvalues of R small, then R^k -->0 as k-->\infty and you end up with c_k, d_k --> 0, and what good is that? I am surely missing a few parts to this story...
@@samymohammed596 1) if on the contrary, the powers of R were increasing, the new values of c_k, d_k would increase with them, meaning that x_k = c_k*q would never settle for the minimum of the function but diverge from it.
2) you do want the value of d_k to approach zero, meaning that z_k = d_k*q = 0 which then makes x_(k+1) = x_k, the point of convergence would be found at the minimum of the function.
it's true that R^k --> 0 as k --> inf but we are not computing these values that many times! Taking this into account, R^k*[c_k, d_k] is not = [0, 0]
@@0ScarletBlood0 Ah, of course you are right about wanting d_k = 0! :):) Thanks for making that point clear!
I certainly see the issue with powers of R increasing and then that causing immediate divergence. Yes, better for eigenvalues to be less than 1 in magnitude because then at least you don't start off with divergence...
But then you might hit zero... I guess you need a little skill to pick the parameters s, beta to ensure that your problem is well defined so that you reach convergence (d_k = 0) before the powers of R run away and make the whole thing zero! Just my 2 cents... but thanks very much for your reply!
@@samymohammed596 that matrix has full rank, as long as β!=0.
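A rough sketch of the R discussion above, under the same assumption about the iteration (x_(k+1) = x_k - s*z_k, z_(k+1) = S x_(k+1) + beta*z_k), which along an eigenvector with eigenvalue lam gives R = [[1, -s], [lam, beta - lam*s]]. The formulas in optimal_parameters are the ones usually quoted for this quadratic model, so treat them as an assumption rather than a transcript of the lecture. The eigenvalue range 0.01 to 1 and all names are hypothetical. The determinant of R works out to beta, which matches the full-rank remark, and c_k, d_k going to 0 just means x_k goes to 0, the minimizer of f = (1/2) x^T S x.

```python
import math


def momentum_matrix(lam, s, beta):
    """R with [c_(k+1), d_(k+1)] = R [c_k, d_k] for eigenvalue lam."""
    return [[1.0, -s], [lam, beta - lam * s]]


def optimal_parameters(lam_min, lam_max):
    """Step size s and momentum beta commonly quoted for the quadratic model."""
    root_sum = math.sqrt(lam_max) + math.sqrt(lam_min)
    s = (2.0 / root_sum) ** 2
    beta = ((math.sqrt(lam_max) - math.sqrt(lam_min)) / root_sum) ** 2
    return s, beta


def iterate(R, c, d, steps):
    """Apply R repeatedly to the pair (c, d)."""
    for _ in range(steps):
        c, d = R[0][0] * c + R[0][1] * d, R[1][0] * c + R[1][1] * d
    return c, d


lam_min, lam_max = 0.01, 1.0  # assumed example with condition number 100
s, beta = optimal_parameters(lam_min, lam_max)
for lam in (lam_min, lam_max):
    R = momentum_matrix(lam, s, beta)
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]  # equals beta up to rounding
    c, d = iterate(R, 1.0, lam, 200)             # start at c_0 = 1, d_0 = lam
    print(lam, det, beta, c, d)
```

Both eigen-directions shrink toward zero here, and that is the convergence we want rather than a problem.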
All I know is it’s based on symmetry and the remaining 5 will be at the end of the spool.
Why is f equal to (1/2) x^T S x? The prof did not explain what S is. Does anyone know what it is?
see lecture 22 for the definition
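In case it helps, my reading (an assumption, please check it against lecture 22) is that S is a symmetric positive definite matrix and f is the standard quadratic model problem. Under that assumption the pieces used in this lecture would be:

```latex
f(x) = \tfrac{1}{2}\, x^{\mathsf T} S x, \qquad
\nabla f(x) = S x, \qquad
\kappa = \frac{\lambda_{\max}(S)}{\lambda_{\min}(S)}
```

The minimum is at x = 0 where the gradient S x vanishes, and the condition number kappa measures how stretched the level sets are.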
This subchapter is limited to convex functions. Convexity provides a nice property: a local minimum is also a global minimum.
Momentum forsenCD
Would they please stop calling Nesterov's algorithm ``descent''? It's not a descent method as Nesterov himself keeps repeating. Otherwise, a wonderful lecture, and an impressive feat for the lecturer given his age.
I agree with your point.
I'm back! 🤓
The chief is 85 years old and his mind is razor sharp.