Intro to Gradient Descent || Optimizing High-Dimensional Equations

  • Published 13 Feb 2023
  • Keep exploring at ► brilliant.org/TreforBazett. Get started for free for 30 days - and the first 200 people get 20% off an annual premium subscription!
    How can we find maximums and minimums of complicated functions with an enormous number of variables, like the ones you get when studying neural networks or machine learning? In this video we talk about a topic from nonlinear optimization called Gradient Descent (or Gradient Ascent if you want maximums), where you approach an extremum step by step by moving along the gradient vector (with it for maxima, against it for minima). We'll see the basic algorithm, look at some common pitfalls, and then upgrade it with a method called line search to improve efficiency. (A minimal code sketch of the basic update appears after the description links below.)
    Check out my MATH MERCH line in collaboration with Beautiful Equations
    ►beautifulequations.net/pages/...
    COURSE PLAYLISTS:
    ►DISCRETE MATH: • Discrete Math (Full Co...
    ►LINEAR ALGEBRA: • Linear Algebra (Full C...
    ►CALCULUS I: • Calculus I (Limits, De...
    ► CALCULUS II: • Calculus II (Integrati...
    ►MULTIVARIABLE CALCULUS (Calc III): • Calculus III: Multivar...
    ►VECTOR CALCULUS (Calc IV) • Calculus IV: Vector Ca...
    ►DIFFERENTIAL EQUATIONS: • Ordinary Differential ...
    ►LAPLACE TRANSFORM: • Laplace Transforms and...
    ►GAME THEORY: • Game Theory
    OTHER PLAYLISTS:
    ► Learning Math Series
    • 5 Tips To Make Math Pr...
    ►Cool Math Series:
    • Cool Math Series
    BECOME A MEMBER:
    ►Join: / @drtrefor
    MATH BOOKS I LOVE (affiliate link):
    ► www.amazon.com/shop/treforbazett
    SOCIALS:
    ►Twitter (math based): / treforbazett
    ►Instagram (photography based): / treforphotography
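
A minimal sketch of the basic update described in the blurb above, written in Python (the video's own plots are MATLAB); the quadratic test function, the fixed step size gamma, and the helper name gradient_descent are invented for illustration, not taken from the video:

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, steps=1000, tol=1e-8):
    """Repeat x_{k+1} = x_k - gamma * grad(x_k); flip the sign of the step for gradient ascent."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # (near-)stationary point reached
            break
        x = x - gamma * g             # step downhill, against the gradient
    return x

# Toy example: f(x, y) = (x - 1)^2 + 2*(y + 3)^2 has its minimum at (1, -3).
grad_f = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # approximately [1, -3]
```

The fixed step size is the "naive" version; the line-search variants discussed in the video and in the comments below choose gamma adaptively at every step.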

COMMENTS • 89

  • @spongebobinator123
    @spongebobinator123 1 year ago +29

    It's amazing how well the timing of this video came out. I sat down to write an article on neural networks, went on YouTube to play some background music, and had one of my favorite educators show up on my home screen with this extremely relevant topic. Awesome!

  • @ProbablyApproximatelyCorrect
    @ProbablyApproximatelyCorrect 1 year ago +31

    Nice video! For being so simple, it is fascinating how well gradient descent and its variants work. As you mention, it is not at all guaranteed to find a good minimizer for non-convex functions in low dimensions, but in higher dimensions, things just magically seem to work out, both numerically and (recently) mathematically. There's so much about high-dimensional objects that isn't really captured by our low-dimensional intuition, which unfortunately is quite limited. I recently saw a quote from Geoff Hinton that said:
    "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly."

    • @proxagonal5954
      @proxagonal5954 1 year ago +2

      Lol. By the way, what do you mean by things recently working out mathematically for gradient descent in higher dimensions?

    • @ProbablyApproximatelyCorrect
      @ProbablyApproximatelyCorrect 1 year ago +8

      @@proxagonal5954 When optimizing a neural network to fit training data, there are two important factors: (i) the training itself, (ii) the generalization to new data. For (i) to go well, you want to find a (nearly) global minimum of your training loss. Recent results show that, when you "overparameterize" your model (i.e., use many more parameters for the neural network than you have data points), every local minimum is global (at least in the infinite-width limit). Even though in lower dimensions it seems that adding more parameters should make the function "less convex", it somehow seems to just work.
      Now, think about (ii) in two dimensions: say you want to fit a polynomial curve to a number of fixed points on the plane. If you fit a first-degree polynomial, i.e. a straight line, you cannot "overfit" the data points; you get a straight line that sort of goes through them without bending too extremely because of any one of them. If you use a high-degree polynomial, you can fit all the points very well, but your curve may bend extremely and fit the training points "too" well -- especially if some of them are noisy. However, if you find the high-degree polynomial with the minimum norm of the coefficients, this typically works better. The magic of gradient descent in high dimensions is that it automatically finds this minimum-norm solution! This seems to be true also for complex models like neural networks, but this stuff isn't fully understood yet.
      If you search for "neural tangent kernel", you can find some info about these kinds of results. (A small numerical illustration of the minimum-norm behaviour follows this thread.)

    • @proxagonal5954
      @proxagonal5954 1 year ago +1

      @@ProbablyApproximatelyCorrect Thank you!
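
A small numerical sketch of the minimum-norm point made in the comment above (referenced there): gradient descent on an overparameterized least-squares problem, started from zero, lands on the minimum-norm solution that fits the data exactly. The random 5-by-20 system below is an invented toy example, not anything from the neural tangent kernel literature:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))   # 5 "data points", 20 parameters: heavily overparameterized
b = rng.normal(size=5)

# Gradient descent on the squared loss ||A w - b||^2, started from w = 0.
gamma = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # conservative step size for this quadratic
w = np.zeros(A.shape[1])
for _ in range(20_000):
    w -= gamma * 2 * A.T @ (A @ w - b)

# The iterates never leave the row space of A, so the limit is the unique
# interpolating solution in that subspace: the minimum-norm (pseudoinverse) solution.
w_min_norm = np.linalg.pinv(A) @ b
print(np.allclose(A @ w, b), np.allclose(w, w_min_norm, atol=1e-6))   # True True
```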

  • @vincentalegrete3287
    @vincentalegrete3287 1 year ago +26

    Best math channel on YouTube. The way you distill things down to intuitive graphics really helped me get more out of Calc III last semester.

  • @fatianass8317
    @fatianass8317 1 year ago +2

    Optimizing functions ❌️
    Something we've heard about but didn't give enough time in that 30-hour machine learning course ✅️
    Thanks for the vid!

  • @jinc7021
    @jinc7021 1 year ago +3

    Hello Dr. Bazett, I have been watching a lot of your videos during my current PhD process. I just want to say that you make hard concepts visual and more intuitive. It has been a great help to me! Please continue making more! I would love to see you explain some convergence analysis!

  • @mohammedmhilal4129
    @mohammedmhilal4129 1 year ago +3

    I literally searched "Gradient Descent Trefor Bazett" two days ago... This is destiny...

    • @DrTrefor
      @DrTrefor  1 year ago +2

      haha perfect timing!

  • @judedavis92
    @judedavis92 1 year ago +6

    Now this is amazing! Please continue the ML math stuff!!

  • @teikono
    @teikono 1 year ago +2

    Just want to say this channel is fantastic; it's very enjoyable listening to you talk about these topics.

  • @pierreretief
    @pierreretief 1 year ago +5

    This video has been very interesting. I have really enjoyed and benefited from your videos, especially the ones on vector calculus. When I did my engineering degree ten years ago, either I wasn't aware of these types of videos or they didn't exist. I am currently studying a physics degree remotely and these videos have been IMMENSELY helpful. You and Prof. Leonard and many others are changing lives. I understand concepts now that I didn't even realize I didn't understand the first time around. The graphics and your teaching style - awesome. Thank you very very much.

  • @tooooomboh
    @tooooomboh 5 months ago +1

    Amazing visualizations, really well done!!!

  • @babybeel8787
    @babybeel8787 1 year ago +2

    Hello, that was perhaps one of the most beautiful explanations and visualisations of the gradient I've come across. I could not be happier after watching this video; your excitement is what mathematics is all about, discovering beautiful things. Thank you so much!

  • @egycg3569
    @egycg3569 1 year ago +1

    Unbelievable explanation as usual, prof.
    Thanks

  • @Heisenberg355
    @Heisenberg355 1 year ago +2

    Dude, I am studying CS and becoming an expert in AI, but I found your account because of LaTeX. Now this gets recommended to me. The recommendation algorithm predicting at its finest ;D

    • @DrTrefor
      @DrTrefor  1 year ago +1

      ha all praise the algorithm!

  • @bhavesh.adhikari
    @bhavesh.adhikari 1 year ago +1

    I had been thinking about this topic for several days, and now the video is here. Thanks, Prof., you have helped me a lot.

    • @DrTrefor
      @DrTrefor  1 year ago +3

      Glad it was helpful!

  • @erkangwang630
    @erkangwang630 2 months ago

    This is so much better than a lot of the textbooks I've read.

  • @siddharthvj1
    @siddharthvj1 1 year ago +3

    excellent video trefor

    • @DrTrefor
      @DrTrefor  1 year ago +2

      Thank you very much!

  • @markolazar2149
    @markolazar2149 9 months ago

    great video! helped me a lot

  • @balakrishnaprasad8928
    @balakrishnaprasad8928 1 year ago +1

    Great work

  • @reckless2830
    @reckless2830 1 year ago +1

    I'm in the 9th grade and I have understood 1/27 of what you're talking about. I guess I have to watch this 27 more times.

  • @geraldsnodd
    @geraldsnodd 1 year ago +1

    This year I will be going to college once I am finished with my high school exams and entrance exams. Can't wait to explore all of your content in college.

    • @DrTrefor
      @DrTrefor  1 year ago +2

      You’re going to love it!

  • @yann9637
    @yann9637 5 months ago

    The visuals are impressive

  • @sedthh
    @sedthh 1 year ago +1

    Really happy to see ML content on the channel!

    • @DrTrefor
      @DrTrefor  1 year ago +1

      Thanks! Plan to do more for sure

  • @yeshengwei79
    @yeshengwei79 3 months ago

    Well done 👍 It's far easier to understand, and has much higher information density, than Stanford University's open AI course 😊

  • @speedbird7587
    @speedbird7587 1 month ago +1

    amazing!

  • @tutorchristabel
    @tutorchristabel 1 year ago +1

    wow, the most beautiful explanation i have come across today

  • @maurolocatelli3321
    @maurolocatelli3321 1 year ago +2

    Very interesting to reduce this to a 1D optimization at each step. How would you compare the convergence speed and cost of this method to, for example, an update based on knowing both the Jacobian and the Hessian matrix (assuming, of course, the number of variables doesn't make the Hessian inversion too cumbersome)?
    With the video's approach you still need to find a stationary point along your 1D constraint, perhaps with some guess of a second directional derivative, but of course it looks much leaner.
    Thank you so much for the insight.
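
For reference, the two updates being compared in the question above, written in standard textbook notation (not notation taken from the video): the line-search gradient step needs only first derivatives plus a one-dimensional search, while the Newton-type step needs the full Hessian and a linear solve every iteration, which is the usual cost versus convergence-speed trade-off.

```latex
x_{k+1} = x_k - \gamma_k \,\nabla f(x_k),
\qquad
\gamma_k = \operatorname*{arg\,min}_{\gamma \ge 0} f\bigl(x_k - \gamma \nabla f(x_k)\bigr)
\quad \text{(steepest descent with exact line search)}

x_{k+1} = x_k - \bigl[\nabla^2 f(x_k)\bigr]^{-1} \nabla f(x_k)
\quad \text{(Newton's method)}
```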

  • @TimLeung12345
    @TimLeung12345 1 year ago

    When I did my PhD in aerodynamic shape optimization for airplanes, we would have hundreds or thousands of variables to define the shape of the wing, and the “function” we were minimizing (e.g. drag at a fixed lift) could not be defined analytically (it required a full flow analysis), so there was a lot of research done on finding the gradient direction. Also, while we would start with the steepest descent direction, we would also build an approximate Hessian matrix that gave us a progressively clearer picture of the function space after every step. The most efficient step sizes that we ended up using were quite different from the steepest descent line search.
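
A compact way to see the "build an approximate Hessian as you go" idea from the comment above is a quasi-Newton method such as BFGS. The sketch below just hands a toy quadratic to SciPy's BFGS routine; it only illustrates the idea and is nothing like the commenter's aerodynamic setup, where every function value required a full flow solve:

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective and gradient, standing in for an expensive black-box simulation.
f = lambda v: (v[0] - 1) ** 2 + 2 * (v[1] + 3) ** 2
grad_f = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])

# BFGS's first step is essentially steepest descent; after that it builds an
# approximate inverse Hessian from the observed gradient changes at each iterate.
result = minimize(f, x0=np.array([10.0, 10.0]), jac=grad_f, method="BFGS")
print(result.x, result.nit)   # roughly [1, -3], in only a handful of iterations
```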

  • @amandeepsaha
    @amandeepsaha 1 year ago +1

    One of the best YouTube channels

  • @camrouxbg
    @camrouxbg 1 year ago +2

    What an interesting and useful video! I haven't thought about this kind of thing for a good while now. But I'm curious.... what's the deal with the *conjugate gradient* method?

  • @maksim-surov
    @maksim-surov 1 year ago +1

    2:20 Surprising notation for a vector :) Usually angle brackets are used for the scalar product.

    • @DrTrefor
      @DrTrefor  1 year ago +1

      Every choice sucks a bit. One argument for vector is to distinguish it sort of philosophically from the point (a,b). But yes, then the notation is aligned with the inner product notation.

  • @sinonimo8719
    @sinonimo8719 1 year ago +1

    AWESOME, thanks, gradient descent is kind of neat

  • @murillonetoo
    @murillonetoo 1 year ago +1

    Great video!

  • @Kokurorokuko
    @Kokurorokuko 8 months ago +1

    2:52 If we want to maximize, shouldn't we add the gradient? Of course gamma can be negative, but just conceptually?

  • @porschepanamera92
    @porschepanamera92 1 year ago

    I'm currently using the method of moving asymptotes (Svanberg, 1987) for nonlinear optimization of nonconvex problems in topology optimization. However, using it and fully understanding it are two different things. I would be curious whether you would be able to give a simple explanation of these kinds of methods.

  • @lexinwonderland5741
    @lexinwonderland5741 1 year ago +1

    My only complaint is that it's too short! I would love to see more about variations of gradient descent, like space curves and how you mentioned things like momentum as a parameter, or higher/complex dimensions, etc! Especially coming up with cool new interpolations on scalar or vector fields with calculus.

    • @DrTrefor
      @DrTrefor  1 year ago +1

      I kind of want to do some follow ups because you are right, this just scratches the surface of all the options!

    • @lexinwonderland5741
      @lexinwonderland5741 1 year ago

      @@DrTrefor I'm a big algebraic topology and Lie group/algebra nut myself, so of course my first thought is "what are some natural yet interesting families of curves that can come from this mathematical structure?" But I'm so glad you agree!!! I really do hope you go further, I would play with generating families of curves/manifolds if I weren't so slammed with university work right now so it will be amazing to have a springboard to explore from with a followup!

  • @johnsalkeld1088
    @johnsalkeld1088 1 year ago +1

    I am interested in gradient descent when you only have samples and resampling is not necessarily possible. Do you know of any good research in that area?

  • @user-dl7bk2yi1e
    @user-dl7bk2yi1e 4 months ago

    What's the i in the reworked gradient descent formula? The one where the gamma is affected by it?

  • @ja6920
    @ja6920 1 year ago

    So once I have my gradient vector for the two-dimensional space, how do I turn it into a single-variable equation?

  • @dsizov
    @dsizov 1 year ago +3

    Great explanation of the variable step method. But how do you calculate the gradient vector? It seems that you need that computationally expensive differentiation?!

    • @DrTrefor
      @DrTrefor  1 year ago +3

      Yes indeed, it is defined by the partial derivatives in each component. However, you can approximate those (take a small step in the x direction and compute the secant line), so depending on your accuracy needs they can be done relatively efficiently. All of this is ultimately a balance between different types of efficiency. (A short finite-difference sketch follows this thread.)

    • @dsizov
      @dsizov 1 year ago

      @@DrTrefor got it, thanks
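
A minimal sketch of the finite-difference approximation mentioned in the reply above (and referenced there), using central differences rather than the one-sided secant; the test function is an invented example:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences.
    Costs two extra function evaluations per coordinate."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

f = lambda v: (v[0] - 1) ** 2 + 2 * (v[1] + 3) ** 2
print(numerical_gradient(f, [0.0, 0.0]))   # close to the exact gradient [-2, 12]
```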

  • @alexfernandez4883
    @alexfernandez4883 1 year ago +2

    Great video! I was curious about the method a month ago and decided to implement and animate it using "naive GD", aka constant step size. What is the name of this gradient descent method, where you optimize the choice of the step size at each step i?

    • @DrTrefor
      @DrTrefor  1 year ago +1

      Typically just "line search".

  • @user-so4bi2if9c
    @user-so4bi2if9c 8 months ago +1

    Thank you for the video! Which tool did you use to display the planes and graphs with the ascents?

    • @DrTrefor
      @DrTrefor  8 months ago +2

      this is all matlab

  • @morgan0
    @morgan0 1 year ago

    Do you know of good techniques for discontinuous search spaces? I had been thinking of a genetic algorithm (tho that's on the back burner while I write a ton of code for it to figure out how good something it comes up with is).

  • @-slt
    @-slt 1 year ago

    I was talking to a friend yesterday about strategies to optimize the learning rate and boom! you made it really clear. Thanks. Though this might seem less expensive computationally, in a real problem one does not know the loss/target function and must evaluate many model parameters sampled/updated in the direction of the gradient and then select the best model parameters according to newly calculated losses, right? So the optimizer needs many many more model evaluations to select the best update. This can be even more computationally expensive than SGD.

    • @-slt
      @-slt 1 year ago

      @drtrefor it's sad that my comment/question is one of the only 3 you did not answer! :(

  • @FelipeBalbi
    @FelipeBalbi 1 year ago +2

    First things first, what did you use for the plots? ;-)

    • @DrTrefor
      @DrTrefor  1 year ago +1

      Ha this is all MATLAB, but you can use lots of things to generate these types of plots

  • @andrewharrison8436
    @andrewharrison8436 1 year ago +1

    Yes, well explained.
    I had this type of situation more than once (admittedly one dimensional). Defining a search pattern that is relatively fast and guaranteed to converge was interesting. Even defining the end point condition required clear thought.
    Who said the life and pensions industry was boring?

    • @DrTrefor
      @DrTrefor  1 year ago +1

      I suppose life insurance and pensions is really all just math!

    • @andrewharrison8436
      @andrewharrison8436 1 year ago

      @@DrTrefor Just don't tell the salesman.

  • @alexfernandez4883
    @alexfernandez4883 1 year ago +1

    One question. Say I wanted to implement it in MATLAB. I thought about estimating the one-variable function by estimating the gradient numerically. Now, once I have this function defined, how would you go about finding its extrema? I thought of also estimating its derivative numerically and then defining a function "FindExtrema(g'(phi))" that performs Newton's method on a set of N equispaced points from (0, b], then uses the estimated roots to evaluate the function of phi at them and picks the one that minimizes or maximizes the value, respectively. Does this make sense?
    PS: I would also evaluate the endpoints to see if the extremum is there.

    • @DrTrefor
      @DrTrefor  1 year ago +2

      Yes, that is exactly right. When I was programming this in MATLAB, I had MATLAB solve the derivative-equals-zero equation along the specific line. You might also want to google backtracking line search for a slightly better way. (A sketch of backtracking line search follows this thread.)

    • @alexfernandez4883
      @alexfernandez4883 1 year ago

      @@DrTrefor thanks so much. The thing is I would like to animate it for arbitrary functions defined by the user so I think I’ll have to do everything numerically

    • @alexfernandez4883
      @alexfernandez4883 1 year ago

      Also, what method do you use for animations like this one? I have learned a way in MATLAB that works by calling the drawnow() function to update the appearance of a plot after having used set(h, "XData", whatever) to update the plotted x, y, z points. Thanks so much in advance. If you could also put the script in the description that would be great!
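
As referenced a couple of replies up, here is a minimal sketch of Armijo backtracking line search; the constants gamma0, beta, and c are common textbook defaults and the quadratic test problem is invented, so this only approximates what the video's MATLAB script might do:

```python
import numpy as np

def backtracking_line_search(f, x, g, gamma0=1.0, beta=0.5, c=1e-4):
    """Shrink gamma until the Armijo sufficient-decrease condition holds:
    f(x - gamma*g) <= f(x) - c * gamma * ||g||^2."""
    gamma, fx, g2 = gamma0, f(x), float(np.dot(g, g))
    while f(x - gamma * g) > fx - c * gamma * g2:
        gamma *= beta
    return gamma

# Descent loop on f(x, y) = (x - 1)^2 + 2*(y + 3)^2, minimum at (1, -3).
f = lambda v: (v[0] - 1) ** 2 + 2 * (v[1] + 3) ** 2
grad_f = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])

x = np.array([10.0, 10.0])
for _ in range(100):
    g = grad_f(x)
    x = x - backtracking_line_search(f, x, g) * g
print(x)   # approximately [1, -3]
```

Backtracking only ever shrinks the trial step, so it avoids solving the one-dimensional "derivative equals zero" problem exactly at every iteration.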

  • @eceto123
    @eceto123 1 year ago +3

    ♥️♥️♥️

  • @TheNetkrot
    @TheNetkrot 6 months ago +1

    Great t-shirt, where can one get them ...?

    • @DrTrefor
      @DrTrefor  6 months ago +1

      Merch link in description!

  • @lior0110
    @lior0110 1 year ago

    ua-cam.com/video/fXQXE96r4AY/v-deo.html
    Is it not a second-order gradient descent optimization where you also need to know all the partial derivatives?
    If not, where can I see the formula for this gradient descent optimization?

  • @o5-1-formerlycalvinlucien60
    @o5-1-formerlycalvinlucien60 1 year ago +2

    Bro I was watching this like "Cool video, how old is this?"
    *1 hour ago*

  • @cwaddle
    @cwaddle 1 year ago +3

    Clearly not exact maths. Very clever

    • @DrTrefor
      @DrTrefor  1 year ago +3

      Nope! But great for getting close.

  • @azzteke
    @azzteke 1 year ago

    The plural of maximum is maxima. Anything else is uneducated!

  • @pnachtwey
    @pnachtwey 24 days ago

    I have used gradient descent a lot. It is good that this video mentions line search, because most don't, and line search is very effective relative to other techniques. I have an example using the Rosenbrock function. It is easy, but gradient descent takes more computations to get to the minimum. Also, with real data there isn't a cost function one can take the derivative of. Finding the gradient involves evaluating the cost function after taking small steps in each direction and computing the slope. This is time-consuming and not as accurate. There is no way to compute where the slope along the line search is zero without using lots of math, and it isn't noise-free.
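
To make the Rosenbrock point above concrete, here is a small sketch of steepest descent with Armijo backtracking on the standard two-dimensional Rosenbrock function; the starting point and constants are conventional textbook choices, not the commenter's code, and the slow crawl toward the minimum at (1, 1) is the behaviour being described:

```python
import numpy as np

rosenbrock = lambda v: (1 - v[0]) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2
rosen_grad = lambda v: np.array([
    -2 * (1 - v[0]) - 400 * v[0] * (v[1] - v[0] ** 2),
    200 * (v[1] - v[0] ** 2),
])

x = np.array([-1.2, 1.0])   # classic starting point
for _ in range(20_000):
    g = rosen_grad(x)
    gamma, fx = 1.0, rosenbrock(x)
    while rosenbrock(x - gamma * g) > fx - 1e-4 * gamma * np.dot(g, g):
        gamma *= 0.5        # Armijo backtracking: shrink the step until it decreases f enough
    x = x - gamma * g
print(x, rosenbrock(x))     # creeps toward (1, 1); steepest descent is famously slow here
```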