It's amazing how well the timing of this video came out. I sat down to write an article on neural networks, went on UA-cam to play some background music, and had one of my favorite educators show up on my home screen with this extremely relevant topic. Awesome!
Nice video! For something so simple, it is fascinating how well gradient descent and its variants work. As you mention, it is not at all guaranteed to find a good minimizer for non-convex functions in low dimensions, but in higher dimensions things just magically seem to work out, both numerically and (recently) mathematically. There's so much about high-dimensional objects that isn't really captured by our low-dimensional intuition, which unfortunately is quite limited. I recently saw a quote from Geoff Hinton that said:
"To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly."
Lol. By the way, what do you mean by things recently working out mathematically for gradient descent in higher dimensions?
@@proxagonal5954 When optimizing a neural network to fit training data, there are two important factors: (i) the training itself, and (ii) the generalization to new data. For (i) to go well, you want to find a (nearly) global minimum of your training loss. Recent results show that, when you "overparameterize" your model (i.e., use many more parameters for the neural network than you have data points), every local minimum is global (at least in the infinite limit). Even though in lower dimensions it seems that adding more parameters should make the function "less convex", it somehow just works.
Now, think about (ii) in two dimensions: say you want to fit a polynomial curve to a number of fixed points in the plane. If you fit a first-degree polynomial, i.e. a straight line, you cannot "overfit" to the data points: you get a straight line that sort of goes through them, without bending too extremely because of any one of them. If you use a high-degree polynomial, you can fit all the points very well, but your curve may bend extremely and fit the training points "too" well -- especially if some of them are noisy. However, if you find the high-degree polynomial with the minimum norm of the coefficients, this typically works better. The magic of gradient descent in high dimensions is that it automatically finds this minimum-norm solution! This seems to be true also for complex models like neural networks, but this stuff isn't fully understood yet.
If you search for "neural tangent kernel", you can find some info about these kinds of results.
@@ProbablyApproximatelyCorrect Thank you!
@@ProbablyApproximatelyCorrect what do you mean by "if you find the high-degree polynomial with the minimum norm of the coefficients" ???
@@jeevan88888 (silly example) Let's say you have three data points (-1, -1), (0, 0), and (1, 1) and you want to fit a degree-3 polynomial to this. One way to do this is to use:
y = 500x^3 - 499x = 500*x^3 + 0*x^2 - 499*x + 0
This fits all three points perfectly, so it works great for your "training data". Another solution (which intuitively makes more sense) is just to use
y = x = 0*x^3 + 0*x^2 + 1*x + 0
This also perfectly fits your known data. Now, say you want to predict the y-value for x=0.5. The first option predicts -187, the second predicts 0.5. The 0.5 prediction seems much more reasonable; the first curve is too bendy. One way to quantify this is to form vectors of the coefficients of the functions and compute their (Euclidean 2-) norm:
First, || [500, 0, -499, 0] ||_2 = 706.4
Second, || [0, 0, 1, 0] ||_2 = 1
The second has much lower norm! This means that it is in some sense simpler, and expected to generalize better to new data. (This example can be made a bit more interesting by adding a bit of noise to the observed data, or outliers. The principle still holds)
So what I mean is: a polynomial that passes through the observed data points while keeping the norm of the coefficient vector (as above) as small as possible.
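To make the minimum-norm idea concrete, here is a small NumPy sketch (not from the original thread) fitting a cubic to the same three points. One wrinkle worth noting: the solver's actual minimum-norm interpolant works out to 0.5x^3 + 0.5x, whose coefficient norm (about 0.71) is even smaller than that of y = x, and it still gives a sensible prediction at x = 0.5.

```python
import numpy as np

# Three training points from the example above.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1.0, 0.0, 1.0])

# Design matrix for a cubic: columns are x^3, x^2, x, 1 (underdetermined: 3 equations, 4 unknowns).
A = np.vander(x, 4)

# lstsq returns the minimum-norm coefficient vector among all exact interpolants.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)                   # approx [0.5, 0.0, 0.5, 0.0]
print(np.polyval(coeffs, 0.5))  # approx 0.3125, nothing like -187
print(np.linalg.norm(coeffs))   # approx 0.707, even smaller than the norm of [0, 0, 1, 0]
```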
Best math channel on UA-cam. The way you distill things down to intuitive graphics really helped me get more out of Calc III last semester.
So glad I was able to help!
Wtf is Penn then
2:52 If we want to maximize, shouldn't we add the gradient? Of course gamma can be negative but just conceptually?
Optimizing functions ❌️
Something we've heard about but didn't give enough time in that 30-hour machine learning course ✅️
thanks for the vid!
I literally searched "Gradient Descent Trefor Bazett" two days ago... This is destiny....
haha perfect timing!
Now this is amazing! Please continue the ML math stuff!!
Hello Dr. Bazett, I have been watching a lot of your videos during my PhD. I just want to say that you make hard concepts visual and more intuitive. It has been a great help to me! Please continue making more! I'd love to see you explain some convergence analysis!
dude, I am studying CS and becoming an expert in AI, but I found your account because of LaTeX. Now this gets recommended to me. The recommendation algorithm predicting at its finest ;D
ha all praise the algorithm!
Just want to say this channel is fantastic; it's very enjoyable listening to you talk about these topics.
Glad you enjoy it!
I love you man. Like I've been trying to understand this but your graphics and explanations just make life easier
Hello, that was perhaps one of the most beautiful explanations and visualisations of the gradient I've come across. I could not be happier after watching this video; your excitement is what mathematics is all about, discovering beautiful things. Thank you so much!
Wow, thank you!
wow, the most beautiful explanation I have come across today
Glad you enjoyed!
Really happy to see ML content on the channel!
Thanks! Plan to do more for sure
This is so much better than a lot of the textbooks I've read
Great video!! Please tell Brilliant that our mobile app needs a dark mode ASAP. Love their stuff
One of the best UA-cam channels
Thank you!
This year I will be going to college once I am finished with my high school exams and entrance exams. Can't wait to explore all of your content in college.
You’re going to love it!
How? @@DrTrefor
This video has been very interesting. I have really enjoyed and benefited from your videos, especially the ones on vector calculus. When I did my engineering degree ten years ago, either I wasn't aware of these types of videos or they didn't exist. I am currently studying a physics degree remotely and these videos have been IMMENSELY helpful. You and Prof. Leonard and many others are changing lives. I understand concepts now that I didn't even realize I didn't understand the first time around. The graphics and your teaching style - awesome. Thank you very very much.
Thank you so much!
I'm in the 9th grade, and I have understood 1/27 of what you're talking about. I guess I have to watch this 27 more times.
I had been thinking about this topic for several days, and now the video is here. Thanks, Prof., you have helped me a lot.
Glad it was helpful!
Unbelievable explanation as usual, prof.
Thanks
excellent video trefor
Thank you very much!
Amazing visualizations, really well done!!!
When I did my PhD in aerodynamic shape optimization for airplanes, we would have hundreds or thousands of variables defining the shape of the wing, and the "function" we were minimizing (e.g. drag at a fixed lift) could not be defined analytically (it required a full flow analysis), so a lot of research went into finding the gradient direction. Also, while we would start with the steepest descent direction, we would also build an approximate Hessian matrix that gave us a progressively clearer picture of the function space after every step. The most efficient step sizes we ended up using were quite different from the steepest descent line search.
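As a toy illustration of the quasi-Newton idea described above (not the commenter's actual aerodynamic code, which would need a full flow solver), here is a rough Python/SciPy sketch where the Rosenbrock function stands in for the expensive objective and BFGS builds an approximate Hessian from successive gradients.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for an expensive objective (the real one would call a flow solver).
def rosenbrock(v):
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

x0 = np.array([-1.2, 1.0])

# BFGS builds an approximate (inverse) Hessian from successive gradient evaluations,
# much like the quasi-Newton approach described in the comment above.
result = minimize(rosenbrock, x0, jac=rosenbrock_grad, method="BFGS")
print(result.x, result.nit)  # converges near (1, 1) in far fewer steps than plain steepest descent
```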
AWESOME, thanks, gradient descent is kind of neat
Isn't it cool?
my only complaint is that it's too short! I would love to see more about variations of gradient descent, like space curves and how you mentioned things like momentum as a parameter, or higher/complex dimensions, etc.! Especially coming up with cool new interpolations on scalar or vector fields with calculus.
I kind of want to do some follow ups because you are right, this just scratches the surface of all the options!
@@DrTrefor I'm a big algebraic topology and Lie group/algebra nut myself, so of course my first thought is "what are some natural yet interesting families of curves that can come from this mathematical structure?" But I'm so glad you agree!!! I really do hope you go further, I would play with generating families of curves/manifolds if I weren't so slammed with university work right now so it will be amazing to have a springboard to explore from with a followup!
2:20 Surprising notation for a vector :) Usually those brackets are used for the scalar product.
Every choice sucks a bit. One argument for this vector notation is to distinguish it, sort of philosophically, from the point (a,b). But yes, the notation then overlaps with the inner product notation.
The visuals are impressive
Very interesting to reduce this to a 1D optimization for each step. How would you compare the convergence speed and cost of this method with, for example, an update based on knowing both the Jacobian and the Hessian matrix (without, of course, so many variables that the Hessian inversion becomes too cumbersome)?
In the video's approach you still need to find a stationary point along your 1D constraint, with some guess of a second directional derivative, but of course it looks much leaner.
Thank you so much for the insight.
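A minimal sketch of the trade-off being asked about, on a two-variable quadratic where both updates have closed forms (illustrative only, not the video's code): the Newton update uses the Hessian and lands on the minimizer in one step, while the steepest-descent update needs only the gradient plus a 1D line search.

```python
import numpy as np

# A simple quadratic f(x) = 0.5 x^T A x - b^T x, so the gradient and Hessian are known exactly.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])

def grad(x):
    return A @ x - b  # gradient of the quadratic; the Hessian is just A

x = np.array([5.0, -3.0])

# One Newton step: solve H d = -grad; for a quadratic this lands on the minimizer immediately.
x_newton = x - np.linalg.solve(A, grad(x))

# One steepest-descent step with exact line search (the optimal step has a closed form for quadratics).
g = grad(x)
gamma = (g @ g) / (g @ A @ g)
x_gd = x - gamma * g

print(x_newton)  # equals the true minimizer A^{-1} b
print(x_gd)      # closer than the start, but generally needs many more iterations
```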
Great video!
Thanks!
Well done 👍 It's far easier to understand, and with much higher information density, than the AI open course from Stanford University 😊
can we get a gradient descent video using backtracking line search? this video was really helpful :)
great video! helped me a lot
Great work
nice visualization
What an interesting and useful video! I haven't thought about this kind of thing for a good while now. But I'm curious.... what's the deal with the *conjugate gradient* method?
Great explanation of the variable step-size method. But how do you calculate the gradient vector? It seems that you need that computationally expensive differentiation?!
Yes indeed, it is defined by the partial derivatives in each component. However, you can approximate those (take a small step in the x direction and compute the secant slope), so depending on your accuracy requirements they can be computed relatively efficiently. All of this is ultimately a balance of different types of efficiencies.
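To make that concrete, here is a minimal sketch (my own, not from the video) of a forward-difference gradient estimate for a two-variable function; the step size h is an arbitrary choice.

```python
import numpy as np

def approx_gradient(f, x, y, h=1e-6):
    """Forward-difference (secant) estimate of the gradient of f(x, y)."""
    f0 = f(x, y)
    dfdx = (f(x + h, y) - f0) / h  # small step in the x direction
    dfdy = (f(x, y + h) - f0) / h  # small step in the y direction
    return np.array([dfdx, dfdy])

f = lambda x, y: x ** 2 + 3 * y ** 2
print(approx_gradient(f, 1.0, 2.0))  # close to the exact gradient [2, 12]
```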
@@DrTrefor got it, thanks
amazing!
Yes, well explained.
I had this type of situation more than once (admittedly one dimensional). Defining a search pattern that is relatively fast and guaranteed to converge was interesting. Even defining the end point condition required clear thought.
Who said the life and pensions industry was boring?
I suppose life insurance and pensions is really all just math!
@@DrTrefor Just don't tell the salesman.
Great video! I was curious about the method a month ago and decided to implement it and animate it using "naive GD", a.k.a. a constant step size. What is the name of this gradient descent method, where you optimize the choice of the step size at each iteration?
Typically just "line search".
I am interested in gradient descent when you only have samples and resampling is not necessarily possible. Do you know of any good research in that area?
Thank you for the video! Which tool did you use to display the planes and graphs with the ascents?
this is all MATLAB
First things first, what did you use for the plots? ;-)
Ha this is all MATLAB, but you can use lots of things to generate these types of plots
what's the i in the reworked gradient descent formula? the one where the gamma is affected by it?
One question. Say I wanted to implement this in MATLAB. I thought about building the one-variable function by estimating the gradient numerically. Now, once I have this function defined, how would you go about finding its extrema? I thought of estimating its derivative, also numerically, and then defining a function "FindExtrema(g'(phi))" that performs Newton's method on a set of N equispaced points in (0, b], then uses the estimated roots to evaluate the function of phi and picks the one that minimizes or maximizes the value, respectively. Does this make sense?
Ps: I would also evaluate the endpoints to see if the extrema is there
Yes that is exactly right. When I was programming this in MATLAB, I got MATLAB to solve the derivative equal to zero formula for the specific line. You might also want to google backtracking line search for a slightly better way.
@@DrTrefor thanks so much. The thing is I would like to animate it for arbitrary functions defined by the user so I think I’ll have to do everything numerically
Also, what method do you use for animations like this one? I have learned a way in MATLAB that works by calling the drawnow() function to update the appearance of a plot after having used set(h, 'XData', ...) to update the plotted x, y, z points. Thanks so much in advance. If you could also upload the script in the description, that would be great!
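Not the actual MATLAB from the video, but here is a rough Python sketch of the fully numerical version discussed in this thread: estimate the gradient by finite differences, then hand the one-variable function g(phi) = f(x - phi*grad f(x)) to a bounded scalar minimizer over (0, b] instead of running Newton's method on g' yourself. The test function, the bound b, and the tolerances are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def numerical_grad(f, x, h=1e-6):
    """Forward-difference estimate of the gradient of f at x."""
    g = np.zeros_like(x)
    fx = f(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - fx) / h
    return g

def line_search_descent(f, x0, b=1.0, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = numerical_grad(f, x)
        if np.linalg.norm(g) < tol:
            break
        # 1D subproblem: minimize phi -> f(x - phi * g) over (0, b]
        res = minimize_scalar(lambda phi: f(x - phi * g), bounds=(0.0, b), method="bounded")
        x = x - res.x * g
    return x

# Example with an arbitrary user-supplied function.
f = lambda v: (v[0] - 2) ** 2 + 3 * (v[1] + 1) ** 2
print(line_search_descent(f, [5.0, 5.0]))  # approx [2, -1]
```

The bounded scalar minimizer here plays the role of the "FindExtrema" routine sketched in the question.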
I was talking to a friend yesterday about strategies to optimize the learning rate and boom! you made it really clear. Thanks. Though this might seem less expensive computationally, in a real problem one does not know the loss/target function and must evaluate many model parameters sampled/updated in the direction of the gradient and then select the best model parameters according to the newly calculated losses, right? So the optimizer needs many, many more model evaluations to select the best update. This can be even more computationally expensive than SGD.
@drtrefor it's sad that my comment/question is one of the only 3 you did not answer! :(
I'm currently using the method of moving asymptotes (Svanberg, 1987) for nonlinear optimization of nonconvex problems in topology optimization. However, using it and fully understanding it are two different things. I would be curious whether you would be able to give a simple explanation for these kinds of methods.
So once I have my gradient vector for the two-dimensional surface, how do I turn it into a single-variable equation?
Great t-shirt! Where can one get them...?
Merch link in description!
Do you know of good techniques for discontinuous search spaces? I had been thinking of a genetic algorithm (though that's on the back burner while I write a ton of code for it to figure out how good whatever it comes up with is).
We know the zero of the Error (loss) term for each iteration. Why not divide the delta by 2 and approach exponentially slower?
Bro i was watching this like "Cool video, how old is this?"
*1 hour ago*
haha:D
@@DrTrefor I've always watched your older videos through recommendations, nice to catch a newer video
Cool t-shirt! :))
ua-cam.com/video/fXQXE96r4AY/v-deo.html
Is it not a second-order gradient descent optimization where you also need to know all the partial derivatives?
If not, where can I see the formula for this gradient descent optimization?
Clearly not exact maths. Very clever.
Nope! But great for getting close.
♥️♥️♥️
I have used gradient descent a lot. It is good this video mentions line search, because most don't, and line search is very effective relative to other techniques. I have an example using the Rosenbrock function. This is easy, but gradient descent takes more computations to get to the minimum. Also, with real data there isn't a cost function you can take the derivative of. Finding the gradient involves evaluating the cost function after taking small steps in each direction and computing the slope. This is time consuming and not as accurate. There is no way to compute where the slope along the line search is zero without a lot of extra math, and it isn't noise-free.
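For what it's worth, here is a rough Python sketch of the kind of setup described: Rosenbrock as the cost function, a finite-difference gradient (since no analytic derivative is assumed), and a backtracking line search that only needs function values rather than locating where the slope is exactly zero. The constants and tolerances are arbitrary choices.

```python
import numpy as np

def rosenbrock(v):
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def fd_gradient(f, x, h=1e-6):
    """Central-difference gradient: a small step each way per coordinate."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def backtracking_descent(f, x0, gamma0=1.0, beta=0.5, c=1e-4, tol=1e-5, max_iter=20000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = fd_gradient(f, x)
        if np.linalg.norm(g) < tol:
            return x, k
        fx = f(x)
        gg = g @ g
        gamma = gamma0
        # Shrink the step until the Armijo sufficient-decrease condition holds;
        # only function evaluations are needed, no zero-slope solve.
        while f(x - gamma * g) > fx - c * gamma * gg:
            gamma *= beta
        x = x - gamma * g
    return x, max_iter

x_star, iters = backtracking_descent(rosenbrock, [-1.2, 1.0])
print(x_star, iters)  # slowly approaches (1, 1); Rosenbrock's curved valley makes steepest descent take many steps
```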