It's amazing how well the timing of this video came out. I sat down to write an article on neural networks, went on UA-cam to play some background music, and had one of my favorite educators show up on my home screen with this extremely relevant topic. Awesome!
Nice video! For something so simple, it is fascinating how well gradient descent and its variants work. As you mention, it is not at all guaranteed to find a good minimizer for non-convex functions in low dimensions, but in higher dimensions things just magically seem to work out, both numerically and (recently) mathematically. There's so much about high-dimensional objects that isn't really captured by our low-dimensional intuition, which unfortunately is quite limited. I recently saw a quote from Geoff Hinton that said:
"To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly."
Lol. By the way, what do you mean by things recently working out mathematically for gradient descent in higher dimensions?
@@proxagonal5954 When optimizing a neural network to fit training data, there are two important factors: (i) the training itself, and (ii) the generalization to new data. For (i) to go well, you want to find a (nearly) global minimum of your training loss. Recent results show that, when you "overparameterize" your model (i.e., use many more parameters for the neural network than you have data points), every local minimum is global (at least in the infinite limit). Even though in lower dimensions it seems that adding more parameters should make the function "less convex", it somehow just works.
Now, think about (ii) in two dimensions: say you want to fit a polynomial curve to a number of fixed points in the plane. If you fit a first-degree polynomial, i.e. a straight line, you cannot "overfit" to the data points: you get a straight line that sort of goes through them, without bending too extremely because of any one of them. If you use a high-degree polynomial, you can fit all the points very well, but your curve may bend extremely and fit the training points "too" well -- especially if some of them are noisy. However, if you find the high-degree polynomial with the minimum norm of the coefficients, this typically works better. The magic of gradient descent in high dimensions is that it automatically finds this minimum-norm solution! This seems to be true also for complex models like neural networks, but this stuff isn't fully understood yet.
If you search for "neural tangent kernel", you can find some info about these kinds of results.
@@ProbablyApproximatelyCorrect Thank you!
@@ProbablyApproximatelyCorrect what do you mean by "if you find the high-degree polynomial with the minimum norm of the coefficients" ???
@@jeevan88888 (silly example) Let's say you have three data points (-1, -1), (0, 0), and (1, 1) and you want to fit a degree-3 polynomial to this. One way to do this is to use:
y = 500x^3 - 499x = 500*x^3 + 0*x^2 - 499*x + 0
This fits all three points perfectly, so it works great for your "training data". Another solution (which intuitively makes more sense) is just to use
y = x = 0*x^3 + 0*x^2 + 1*x + 0
This also perfectly fits your known data. Now, say you want to predict the y-value for x=0.5. The first option predicts -187, the second predicts 0.5. The 0.5 prediction seems much more reasonable; the first curve is too bendy. One way to quantify this is to form vectors of the coefficients of the functions and compute their (Euclidean 2-) norm:
First, || [500, 0, -499, 0] ||_2 = 706.4
Second, || [0, 0, 1, 0] ||_2 = 1
The second has much lower norm! This means that it is in some sense simpler, and expected to generalize better to new data. (This example can be made a bit more interesting by adding a bit of noise to the observed data, or outliers. The principle still holds)
So what I mean is: a polynomial that passes through the observed data points while keeping the norm of the coefficient vector (as above) as small as possible.
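To make the minimum-norm idea concrete, here is a small NumPy sketch (not from the original thread) fitting a cubic to the same three points. One wrinkle worth noting: the solver's actual minimum-norm interpolant works out to 0.5x^3 + 0.5x, whose coefficient norm (about 0.71) is even smaller than that of y = x, and it still gives a sensible prediction at x = 0.5.

```python
import numpy as np

# Three training points from the example above.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1.0, 0.0, 1.0])

# Design matrix for a cubic: columns are x^3, x^2, x, 1 (underdetermined: 3 equations, 4 unknowns).
A = np.vander(x, 4)

# lstsq returns the minimum-norm coefficient vector among all exact interpolants.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)                   # approx [0.5, 0.0, 0.5, 0.0]
print(np.polyval(coeffs, 0.5))  # approx 0.3125, nothing like -187
print(np.linalg.norm(coeffs))   # approx 0.707, even smaller than the norm of [0, 0, 1, 0]
```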
Best math channel on UA-cam. The way you distill things down to intuitive graphics really helped me get more out of Calc III last semester.
So glad I was able to help!
Wtf is Penn then
2:52 If we want to maximize, shouldn't we add the gradient? Of course gamma can be negative but just conceptually?
Optimizing functions ❌️
Something we've heard about but didn't give enough time in that 30-hour machine learning course ✅️
thanks for the vid!
I literally searched "Gradient Descent Trefor Bazett" two days ago... This is destiny....
haha perfect timing!
Now this is amazing! Please continue the ML math stuff!!
Hello Dr. Bazett, I have been watching a lot of your videos during my PhD. I just want to say that you make hard concepts visual and more intuitive. It has been a great help to me! Please continue making more! I'd love to see you explain some convergence analysis!
dude, I am studying CS and becoming an expert in AI, but I found your account because of LaTeX. Now this gets recommended to me. The recommendation algorithm predicting at its finest ;D
ha all praise the algorithm!
Just want to say this channel is fantastic; it's very enjoyable listening to you talk about these topics.
Glad you enjoy it!
I love you man. Like I've been trying to understand this but your graphics and explanations just make life easier
Hello, that was perhaps one of the most beautiful explanations and visualisations of the gradient I've come across. I could not be happier after watching this video; your excitement is what mathematics is all about, discovering beautiful things. Thank you so much!
Wow, thank you!
wow, the most beautiful explanation I have come across today
Glad you enjoyed!
Really happy to see ML content on the channel!
Thanks! Plan to do more for sure
This is so much better than a lot of the textbooks I've read
Great video!! Please tell Brilliant that our mobile app needs a dark mode ASAP. Love their stuff
One of the best UA-cam channels
Thank you!
This year I will be going to college once I am finished with my high school exams and entrance exams. Can't wait to explore all of your content in college.
You’re going to love it!
How? @@DrTrefor
This video has been very interesting. I have really enjoyed and benefited from your videos, especially the ones on vector calculus. When I did my engineering degree ten years ago, either I wasn't aware of these types of videos or they didn't exist. I am currently studying a physics degree remotely and these videos have been IMMENSELY helpful. You and Prof. Leonard and many others are changing lives. I understand concepts now that I didn't even realize I didn't understand the first time around. The graphics and your teaching style - awesome. Thank you very very much.
Thank you so much!
I'm in the 9th grade, and I have understood 1/27 of what you're talking about. I guess I have to watch this 27 more times.
I had been thinking about this topic for several days, and now the video is here. Thanks, Prof., you have helped me a lot.
Glad it was helpful!
Unbelievable explanation as usual, prof.
Thanks
excellent video trefor
Thank you very much!
Amazing visualizations, really well done!!!
When I did my PhD in aerodynamic shape optimization for airplanes, we would have hundreds or thousands of variables defining the shape of the wing, and the "function" we were minimizing (e.g. drag at a fixed lift) could not be defined analytically (it required a full flow analysis), so a lot of research went into finding the gradient direction. Also, while we would start with the steepest descent direction, we would also build an approximate Hessian matrix that gave us a progressively clearer picture of the function space after every step. The most efficient step sizes we ended up using were quite different from the steepest descent line search.
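As a toy illustration of the quasi-Newton idea described above (not the commenter's actual aerodynamic code, which would need a full flow solver), here is a rough Python/SciPy sketch where the Rosenbrock function stands in for the expensive objective and BFGS builds an approximate Hessian from successive gradients.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for an expensive objective (the real one would call a flow solver).
def rosenbrock(v):
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

x0 = np.array([-1.2, 1.0])

# BFGS builds an approximate (inverse) Hessian from successive gradient evaluations,
# much like the quasi-Newton approach described in the comment above.
result = minimize(rosenbrock, x0, jac=rosenbrock_grad, method="BFGS")
print(result.x, result.nit)  # converges near (1, 1) in far fewer steps than plain steepest descent
```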
AWESOME, thanks, gradient descent is kind of neat
Isn't it cool?
my only complaint is that it's too short! I would love to see more about variations of gradient descent, like space curves and how you mentioned things like momentum as a parameter, or higher/complex dimensions, etc.! Especially coming up with cool new interpolations on scalar or vector fields with calculus.
I kind of want to do some follow ups because you are right, this just scratches the surface of all the options!
@@DrTrefor I'm a big algebraic topology and Lie group/algebra nut myself, so of course my first thought is "what are some natural yet interesting families of curves that can come from this mathematical structure?" But I'm so glad you agree!!! I really do hope you go further, I would play with generating families of curves/manifolds if I weren't so slammed with university work right now so it will be amazing to have a springboard to explore from with a followup!
2:20 Surprising notation for a vector :) Usually those brackets are used for the scalar product.
Every choice sucks a bit. One argument for this vector notation is to distinguish it, sort of philosophically, from the point (a,b). But yes, the notation then overlaps with the inner product notation.
The visuals are impressive
Very interesting to reduce this to a 1D optimization for each step. How would you compare the convergence speed and cost of this method with, for example, an update based on knowing both the Jacobian and the Hessian matrix (without, of course, so many variables that the Hessian inversion becomes too cumbersome)?
In the video's approach you still need to find a stationary point along your 1D constraint, with some guess of a second directional derivative, but of course it looks much leaner.
Thank you so much for the insight.
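A minimal sketch of the trade-off being asked about, on a two-variable quadratic where both updates have closed forms (illustrative only, not the video's code): the Newton update uses the Hessian and lands on the minimizer in one step, while the steepest-descent update needs only the gradient plus a 1D line search.

```python
import numpy as np

# A simple quadratic f(x) = 0.5 x^T A x - b^T x, so the gradient and Hessian are known exactly.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])

def grad(x):
    return A @ x - b  # gradient of the quadratic; the Hessian is just A

x = np.array([5.0, -3.0])

# One Newton step: solve H d = -grad; for a quadratic this lands on the minimizer immediately.
x_newton = x - np.linalg.solve(A, grad(x))

# One steepest-descent step with exact line search (the optimal step has a closed form for quadratics).
g = grad(x)
gamma = (g @ g) / (g @ A @ g)
x_gd = x - gamma * g

print(x_newton)  # equals the true minimizer A^{-1} b
print(x_gd)      # closer than the start, but generally needs many more iterations
```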
Great video!
Thanks!
Well done 👍 It's far easier to understand, and with much higher information density, than the AI open course from Stanford University 😊
can we get a gradient descent video using backtracking line search? this video was really helpful :)
great video! helped me a lot
Great work
nice visualization
What an interesting and useful video! I haven't thought about this kind of thing for a good while now. But I'm curious.... what's the deal with the *conjugate gradient* method?
Great explanation of the variable step-size method. But how do you calculate the gradient vector? It seems that you need that computationally expensive differentiation?!
Yes indeed, it is defined by the partial derivatives in each component. However, you can approximate those (take a small step in the x direction and compute the secant slope), so depending on your accuracy requirements they can be computed relatively efficiently. All of this is ultimately a balance of different types of efficiencies.
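To make that concrete, here is a minimal sketch (my own, not from the video) of a forward-difference gradient estimate for a two-variable function; the step size h is an arbitrary choice.

```python
import numpy as np

def approx_gradient(f, x, y, h=1e-6):
    """Forward-difference (secant) estimate of the gradient of f(x, y)."""
    f0 = f(x, y)
    dfdx = (f(x + h, y) - f0) / h  # small step in the x direction
    dfdy = (f(x, y + h) - f0) / h  # small step in the y direction
    return np.array([dfdx, dfdy])

f = lambda x, y: x ** 2 + 3 * y ** 2
print(approx_gradient(f, 1.0, 2.0))  # close to the exact gradient [2, 12]
```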
@@DrTrefor got it, thanks
amazing!
Yes, well explained.
I had this type of situation more than once (admittedly one dimensional). Defining a search pattern that is relatively fast and guaranteed to converge was interesting. Even defining the end point condition required clear thought.
Who said the life and pensions industry was boring?
I suppose life insurance and pensions is really all just math!
@@DrTrefor Just don't tell the salesman.
Great video! I was curious about the method a month ago and decided to implement it and animate it using "naive GD", a.k.a. a constant step size. What is the name of this gradient descent method, where you optimize the choice of the step size at each iteration?
Typically just "line search".
I am interested in gradient descent when you only have samples and resampling is not necessarily possible. Do you know of any good research in that area?
Thank you for the video! Which tool did you use to display the planes and graphs with the ascents?
this is all MATLAB
First things first, what did you use for the plots? ;-)
Ha this is all MATLAB, but you can use lots of things to generate these types of plots
what's the i in the reworked gradient descent formula? the one where the gamma is affected by it?
One question. Say I wanted to implement this in MATLAB. I thought about building the one-variable function by estimating the gradient numerically. Now, once I have this function defined, how would you go about finding its extrema? I thought of estimating its derivative, also numerically, and then defining a function "FindExtrema(g'(phi))" that performs Newton's method on a set of N equispaced points in (0, b], then uses the estimated roots to evaluate the function of phi and picks the one that minimizes or maximizes the value, respectively. Does this make sense?
Ps: I would also evaluate the endpoints to see if the extrema is there
Yes that is exactly right. When I was programming this in MATLAB, I got MATLAB to solve the derivative equal to zero formula for the specific line. You might also want to google backtracking line search for a slightly better way.
@@DrTrefor thanks so much. The thing is I would like to animate it for arbitrary functions defined by the user so I think I’ll have to do everything numerically
Also, what method do you use for animations like this one? I have learned a way in MATLAB that works by calling the drawnow() function to update the appearance of a plot after having used set(h, 'XData', ...) to update the plotted x, y, z points. Thanks so much in advance. If you could also upload the script in the description, that would be great!
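Not the actual MATLAB from the video, but here is a rough Python sketch of the fully numerical version discussed in this thread: estimate the gradient by finite differences, then hand the one-variable function g(phi) = f(x - phi*grad f(x)) to a bounded scalar minimizer over (0, b] instead of running Newton's method on g' yourself. The test function, the bound b, and the tolerances are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def numerical_grad(f, x, h=1e-6):
    """Forward-difference estimate of the gradient of f at x."""
    g = np.zeros_like(x)
    fx = f(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - fx) / h
    return g

def line_search_descent(f, x0, b=1.0, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = numerical_grad(f, x)
        if np.linalg.norm(g) < tol:
            break
        # 1D subproblem: minimize phi -> f(x - phi * g) over (0, b]
        res = minimize_scalar(lambda phi: f(x - phi * g), bounds=(0.0, b), method="bounded")
        x = x - res.x * g
    return x

# Example with an arbitrary user-supplied function.
f = lambda v: (v[0] - 2) ** 2 + 3 * (v[1] + 1) ** 2
print(line_search_descent(f, [5.0, 5.0]))  # approx [2, -1]
```

The bounded scalar minimizer here plays the role of the "FindExtrema" routine sketched in the question.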
I was talking to a friend yesterday about strategies to optimize the learning rate and boom! you made it really clear. Thanks. Though this might seem less expensive computationally, in a real problem one does not know the loss/target function and must evaluate many model parameters sampled/updated in the direction of the gradient and then select the best model parameters according to the newly calculated losses, right? So the optimizer needs many, many more model evaluations to select the best update. This can be even more computationally expensive than SGD.
@drtrefor it's sad that my comment/question is one of the only 3 you did not answer! :(
I'm currently using the method of moving asymptotes (Svanberg, 1987) for nonlinear optimization of nonconvex problems in topology optimization. However, using it and fully understanding it are two different things. I would be curious whether you would be able to give a simple explanation for these kinds of methods.
So once I have my gradient vector for the two-dimensional surface, how do I turn it into a single-variable equation?
Great t-shirt! Where can one get them...?
Merch link in description!
Do you know of good techniques for discontinuous search spaces? I had been thinking of a genetic algorithm (though that's on the back burner while I write a ton of code for it to figure out how good whatever it comes up with is).
We know the zero of the Error (loss) term for each iteration. Why not divide the delta by 2 and approach exponentially slower?
Bro i was watching this like "Cool video, how old is this?"
*1 hour ago*
haha:D
@@DrTrefor I've always watched your older videos through recommendations, nice to catch a newer video
Cool t-shirt! :))
ua-cam.com/video/fXQXE96r4AY/v-deo.html
Is it not a second-order gradient descent optimization where you also need to know all the partial derivatives?
If not, where can I see the formula for this gradient descent optimization?
Clearly not exact maths. Very clever.
Nope! But great for getting close.
♥️♥️♥️
I have used gradient descent a lot. It is good this video mentions line search, because most don't, and line search is very effective relative to other techniques. I have an example using the Rosenbrock function. This is easy, but gradient descent takes more computations to get to the minimum. Also, with real data there isn't a cost function you can take the derivative of. Finding the gradient involves evaluating the cost function after taking small steps in each direction and computing the slope. This is time consuming and not as accurate. There is no way to compute where the slope along the line search is zero without a lot of extra math, and it isn't noise-free.
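For what it's worth, here is a rough Python sketch of the kind of setup described: Rosenbrock as the cost function, a finite-difference gradient (since no analytic derivative is assumed), and a backtracking line search that only needs function values rather than locating where the slope is exactly zero. The constants and tolerances are arbitrary choices.

```python
import numpy as np

def rosenbrock(v):
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def fd_gradient(f, x, h=1e-6):
    """Central-difference gradient: a small step each way per coordinate."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def backtracking_descent(f, x0, gamma0=1.0, beta=0.5, c=1e-4, tol=1e-5, max_iter=20000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = fd_gradient(f, x)
        if np.linalg.norm(g) < tol:
            return x, k
        fx = f(x)
        gg = g @ g
        gamma = gamma0
        # Shrink the step until the Armijo sufficient-decrease condition holds;
        # only function evaluations are needed, no zero-slope solve.
        while f(x - gamma * g) > fx - c * gamma * gg:
            gamma *= beta
        x = x - gamma * g
    return x, max_iter

x_star, iters = backtracking_descent(rosenbrock, [-1.2, 1.0])
print(x_star, iters)  # slowly approaches (1, 1); Rosenbrock's curved valley makes steepest descent take many steps
```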