Machine Learning Lecture 32 "Boosting" -Cornell CS4780 SP17

  • Published 26 Dec 2024
  • Lecture Notes:
    www.cs.cornell....

COMMENTS • 55

  • @rohit2761
    @rohit2761 3 years ago +12

    Kilian + StatQuest = 100 percent learning.
    Thank You Prof Kilian, absolutely loved the series.
    Now I'll head over to Deep Learning :)

    • @ayeshavlogsfun
      @ayeshavlogsfun 2 years ago +1

      Why not only statquest?

    • @AJ-zc4mx
      @AJ-zc4mx 1 year ago +1

      Triple bamm!!

    • @RELAXISLANDS
      @RELAXISLANDS 4 months ago +1

      @@ayeshavlogsfun Because StatQuest lacks the mathematics part; Prof. Kilian completes it.

  • @OliverJanShD
    @OliverJanShD 2 years ago +7

    Thank you for posting these! I find your content to be the most educational/entertaining on this subject matter.

  • @ayushinayak3980
    @ayushinayak3980 5 years ago +17

    Really loved the drunk walk demo at the start. You are a good one, Professor!

  • @sans8119
    @sans8119 3 years ago +15

    Would be amazing if we could have similar quality content on reinforcement learning from you! Thank you for having the lectures on YouTube!

  • @linxingyao9311
    @linxingyao9311 4 years ago +4

    Really love it, I learn quite a lot each time I watch this video.

  • @vatsan16
    @vatsan16 4 years ago +9

    "It begins with a Tay and ends with a Lor" hahaha "Taylor expansion" "That's right! HOW DID YOU KNOW?" I burst out laughing at that.

  • @vatsan16
    @vatsan16 4 years ago +8

    I have had my fruit and I am ready!! Let's do this!

  • @rishabhkumar-qs3jb
    @rishabhkumar-qs3jb 3 years ago +1

    Awesome lectures, you made machine learning easy for me; now I have a better grip on the concepts of boosting.

  • @HLi-pc4km
    @HLi-pc4km 2 years ago +2

    "Boosting is gradient descent in function space" - I am gonna steal this line.

    • @dragonsaige
      @dragonsaige 2 years ago

      In the reference textbook Machine Learning - A Probabilistic Perspective they call it "functional gradient descent"

  • @kirtanpatel797
    @kirtanpatel797 5 years ago +10

    Waking up from a coma: "What is boosting?" Drunk guy.

  • @ruman2494
    @ruman2494 5 years ago +6

    Explained very well. Thanks a lot, sir!

  • @andreamercuri9984
    @andreamercuri9984 4 years ago +2

    the idea of putting the labels in a very high dimensional space is pedagogically very, very good! I had some troubles trying to understand AnyBoost from the original paper, but Prof. Weinberger made the principle behind it very easy to understand. Great as usual

  • @juliocardenas4485
    @juliocardenas4485 2 years ago

    It’s wonderful to revisit these lectures

  • @xiaoyu5181
    @xiaoyu5181 4 years ago +3

    Your lecture really helped me understand bagging and boosting more clearly. Thanks for sharing.

  • @KW-fb4kv
    @KW-fb4kv 9 months ago

    Your English is so perfect.

  • @Ange-ClementAkazan-k5s
    @Ange-ClementAkazan-k5s 1 year ago

    Very Great Lecture

  • @jiahao2709
    @jiahao2709 5 years ago +2

    I really like your lecture!

  • @yannickpezeu3419
    @yannickpezeu3419 3 years ago

    So well and funnily explained, thank you!

  • @aciobanusebi
    @aciobanusebi 5 years ago +3

    Really great lectures! Thanks! I have a question: in Gradient Boosting the algorithm says "h* = argmin sum (h(x_i) - t_i)^2". I don't understand whether one should:
    1. go through all the possible regression trees (with a maximum depth set) and find that "h*",
    OR
    2. just apply CART with the new labels "t_i" instead of "y_i".
    As I might guess, we should go with 2. But doesn't 2 give just an approximate solution to that minimization problem? Am I missing something?

    • @kilianweinberger698
      @kilianweinberger698  5 years ago +4

      You are right, it is just an approximation. But in practice, it is pretty good.:-)

    • @sudhanshuvashisht8960
      @sudhanshuvashisht8960 3 years ago +1

      I had a similar question. I guess the answer ties back to the decision tree lecture, where the Prof explained that these kinds of problems (minimizing the squared loss over all possible CART trees) are NP-hard, so we take a 'greedy' approach and only optimize the split we are making at a given point. Essentially, we are doing the same thing here: instead of solving the NP-hard problem, we build a typical CART tree (with the residual as the prediction target), which, as the Prof said, is of course an approximation.
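
The greedy residual-fitting approximation discussed in this thread can be sketched in a few lines. This is an illustrative sketch, not course code: the function names (`gradient_boost`, `boost_predict`) and hyperparameter values are made up here; the weak learner is scikit-learn's `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, alpha=0.1, max_depth=3):
    """Squared-loss gradient boosting: each round greedily fits a shallow
    CART tree to the current residuals t_i = y_i - H(x_i), rather than
    searching over all depth-limited trees for the exact argmin."""
    H = np.zeros(len(y))              # ensemble predictions H(x_i), start at 0
    trees = []
    for _ in range(n_rounds):
        t = y - H                     # residuals = negative gradient of squared loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, t)
        H = H + alpha * h.predict(X)  # H <- H + alpha * h
        trees.append(h)
    return trees

def boost_predict(trees, X, alpha=0.1):
    """Sum the (shrunken) weak learners' predictions."""
    return alpha * sum(h.predict(X) for h in trees)
```

Each tree only needs to correlate with the residuals, which is why the greedy CART fit is good enough in practice.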

  • @gregmakov2680
    @gregmakov2680 2 years ago

    Here we go again :D that graph is so strikingly, boldly beautiful :D

  • @vocabularybytesbypriyankgo1558
    @vocabularybytesbypriyankgo1558 2 months ago

    Thank you !!

  • @HuyNguyen-fp7oz
    @HuyNguyen-fp7oz 4 years ago +1

    Can someone help me: are dl/dH and h vectors (of what?) or scalar values?

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +1

      h is a function (the weak learner). But the trick is that you only care about the predictions on your training samples, so every function can be represented and written as an n-dimensional vector (with one dimension for the prediction on each training sample). So then it should hopefully all come together. l is a loss function that computes the loss of the current prediction, which is the vector H + alpha*h. You take the derivative with respect to h. Hope this helps.
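
The function-as-vector view can be made concrete with a tiny numeric example (illustrative only; the label and prediction values below are made up, and squared loss is used for simplicity):

```python
import numpy as np

# On n training points, a function is fully described by its length-n
# vector of predictions.
y = np.array([1.0, -2.0, 0.5])   # labels y_i
H = np.array([0.8, -1.0, 0.0])   # current ensemble predictions H(x_i)

# For squared loss L(H) = 1/2 * sum_i (H(x_i) - y_i)^2, the gradient with
# respect to the prediction vector H has one entry per training sample:
grad = H - y

# The next weak learner h is chosen to point along the negative gradient,
# i.e. to approximate the residuals y - H on the training samples.
residuals = -grad
```

So "taking a derivative with respect to a function" here just means taking a derivative with respect to its n prediction values.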

  • @vaithak
    @vaithak 4 years ago +1

    Dr. Weinberger, I think there is a minor error in the notes on your website, in the derivation of Gradient Boosted Regression Trees there should not be the (-) sign before the argmin in step 3. I am referring to your notes here: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote19.html

  • @harshk2010
    @harshk2010 5 years ago +2

    Perfect

  • @charles-zk2bb
    @charles-zk2bb 2 years ago

    Thank you for the lecture series, but I could not understand the Kaggle bit: what does flipping random bits mean?

    • @kilianweinberger698
      @kilianweinberger698  2 years ago +1

      For each test sample you classify it as class 1 or 0 (these are the bits). You can then create and submit a random bit vector and observe whether you perform worse or better than 50% error (assuming the data is balanced). If you are worse, you can flip the bits and then you will do better.

    • @charles-zk2bb
      @charles-zk2bb 2 years ago

      @@kilianweinberger698 yes got it, thank you :)
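
The bit-flipping trick in this thread can be sketched directly (illustrative names; `flip_if_worse` and the simulated leaderboard are made up for this example):

```python
import numpy as np

def flip_if_worse(pred, err):
    """Given a 0/1 prediction vector and its reported error rate on a
    balanced test set, flip every bit whenever err > 0.5. The resulting
    error is min(err, 1 - err), i.e. never worse than chance."""
    return 1 - pred if err > 0.5 else pred

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 1000)      # hidden test labels
pred = rng.integers(0, 2, 1000)       # a random bit-vector submission
err = float(np.mean(pred != truth))   # the leaderboard's feedback
better = flip_if_worse(pred, err)     # guaranteed <= 50% error
```
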

  • @ujala5538
    @ujala5538 4 years ago

    Why do we treat L(H) as a constant? It is a function of H, so it varies. Sorry if I am missing anything basic

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +2

      So the previous weak learners are treated as fixed, and the only change we are allowed to do in each iteration is add another weak learner. So H, and therefore L(H) is fixed and we find a h to minimize L(H+\alpha h).
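
In symbols, the first-order Taylor step behind this (standard notation, not copied verbatim from the course notes):

```latex
\mathcal{L}(H + \alpha h)
  \;\approx\; \mathcal{L}(H) \;+\; \alpha \,\langle \nabla \mathcal{L}(H),\, h \rangle ,
\qquad
h^{*} \;=\; \operatorname*{argmin}_{h} \;\langle \nabla \mathcal{L}(H),\, h \rangle
```

Within an iteration H is fixed, so the L(H) term is a constant and drops out of the argmin over h.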

  • @JeremyWang-e6t
    @JeremyWang-e6t 5 years ago

    excellent, very good

  • @HLi-pc4km
    @HLi-pc4km 2 years ago

    I have to leave another comment. Prof. Weinberger is a teaching genius. Also, who says Germans ain't funny?

  • @rameezusmani1294
    @rameezusmani1294 3 years ago

    Thank you for the wonderful and easy explanation. However, I am confused about one thing. What I understand is that we are taking multiple weak classifiers and summing their outputs to get the final result. If we have our first weak classifier wc1, how do we create wc2 and so on? If we are generating wc2, wc3, ..., wcn from the training data points, aren't we overfitting?

    • @kilianweinberger698
      @kilianweinberger698  3 years ago +1

      You will be overfitting eventually, but not for a long time. The reason is that a) your weak learners are pretty "weak", b) you multiply them with a small learning rate / shrinkage constant (thus reducing their influence), c) your loss function (e.g. exponential loss) may learn to maximize the margin between the two classes (which helps a lot against overfitting).

    • @rameezusmani1294
      @rameezusmani1294 3 years ago

      @@kilianweinberger698 Thank you. (b) is clear, but not (c); I will try to figure it out. Thank you again for the help.
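
Points (a) and (b) above map directly onto knobs exposed by common libraries; here is a minimal scikit-learn sketch (the parameter names and values are scikit-learn's, chosen for illustration, not from the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data, purely for demonstration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = GradientBoostingClassifier(
    max_depth=2,         # (a) keep each weak learner shallow / "weak"
    learning_rate=0.05,  # (b) shrinkage: scales down each learner's influence
    n_estimators=200,    # number of boosting rounds
).fit(X, y)
```

Smaller `learning_rate` values typically need more rounds but delay overfitting, which is exactly the trade-off described above.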

  • @HuyNguyen-fp7oz
    @HuyNguyen-fp7oz 4 years ago

    Why sum(h(x)) = c (constant) ?

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +2

      Well, it is a trick that you assume. You can always do this by normalizing your weak learners. (in practice nobody does, actually)

    • @hastings_clf
      @hastings_clf 2 years ago

      Hey Professor @@kilianweinberger698, loving the analogies. So is this constraint on h applied to other types of learners/algorithms as well? I am having trouble reasoning why only CART learners would be susceptible to rescaling by a large constant.

    • @dragonsaige
      @dragonsaige 2 years ago

      Why is it that multiplying our weak classifier h by a large number gives a smaller value for ∑ ri h(xi) in h = argmin ∑ ri h(xi)? Shouldn't it make it even larger?
      Never mind lol, I got it as I was typing the question, but I'll type it out anyway in case anybody else is wondering. In ∑ ri h(xi), ri is the derivative of the loss function with respect to each training input. Visually, in function space, ri is a vector from the ground-truth y to our current ensemble vector H. We are trying to find an h which points in the direction opposite to the derivative, so we minimize the inner product between h and ri. But we can make this inner product as negative (and hence as small) as we want by inflating the values of h. This should not be allowed, otherwise we would find h to be an infinitely long vector, so we restrict h to lie in a small ball.
      Short answer: "Shouldn't it make it even larger?" It does make it as large as possible, but in the negative direction.
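
The degeneracy described in this thread is easy to see numerically (the vectors r and h below are made-up toy values, not from the lecture):

```python
import numpy as np

r = np.array([0.5, -1.0, 2.0])    # gradient entries r_i = dL/dH(x_i)
h = np.array([-1.0, 1.0, -1.0])   # any weak learner with <r, h> < 0

# Scaling h by a constant c scales the inner product by c as well, so
# without a norm constraint the "argmin" runs off toward -infinity.
scores = [float(np.dot(r, c * h)) for c in (1, 10, 100)]
```

The inner product keeps dropping as c grows, which is why h must be restricted to a bounded set (e.g. normalized predictions) for the argmin to be well defined.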

  • @sudhanshuvashisht8960
    @sudhanshuvashisht8960 4 years ago +2

    I can't appreciate you enough, Sir. Absolutely grateful for these video lectures. I had a couple of small questions:
    a) For two different classifiers (say H_1 and H_2) which, let's say, have exactly the same predictions on our training data: will the derivative of the loss function with respect to both these functions (∂L/∂H_1 and ∂L/∂H_2, where L is the loss function) be the same?
    b) Also, what's the intuition behind this derivative (∂L/∂H)? I can only think of it as the change in the loss function per small nudge in H (I'm a bit confused about what a nudge in a classifier H means here).

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +2

      a) yes exactly
      b) the gradient is basically telling you how you would have to change H(x_i) in order to reduce the loss, and then you do make that change (approximately) by adding another tree to H.

  • @denismamedov5480
    @denismamedov5480 4 years ago +1

    Love his jokes :)

  • @sebmulti21
    @sebmulti21 5 years ago

    How is H constant when you discarded the term l(H) + …?

    • @sagar35756
      @sagar35756 5 years ago

      It means that it does not depend on small h; it's fixed for all h.

  • @JD-ov5gt
    @JD-ov5gt 3 years ago

    Cordell, cut Mr. Barney a check for two hundred and fifty thousand dollars.

  • @sofluzik
    @sofluzik 4 years ago

    His grunting is even more predictable