Kilian + StatQuest = 100 percent learning.
Thank You Prof Kilian, absolutely loved the series.
Now I'll head over to Deep Learning :)
Why not only StatQuest?
Triple bamm!!
@ayeshavlogsfun Because it lacks the mathematics part; Prof. Kilian completes it.
Thank you for posting these! I find your content to be the most educational/entertaining on this subject matter.
Really loved the drunk walk demo at the start. You are a good one, Professor!
Would be amazing if we could have similar quality content on reinforcement learning from you! Thank you for having the lectures on YouTube!
Really love it, I learn quite a lot each time I watch this video.
"It begins with a Tay and ends with a Lor" hahaha "Taylor expansion" "That's right! HOW DID YOU KNOW?" i burst out laughing at that.
I have had my fruit and I am ready!! Let's do this!
Awesome lectures, you made machine learning easy for me; now I have a better grip on the concepts of boosting.
"Boosting is gradient descent in function space" - I am gonna steal this line.
In the reference textbook Machine Learning - A Probabilistic Perspective they call it "functional gradient descent"
Waking up from a coma: "What is boosting?" "Drunk guy."
Explained very well. Thanks a lot, sir!
the idea of putting the labels in a very high dimensional space is pedagogically very, very good! I had some troubles trying to understand AnyBoost from the original paper, but Prof. Weinberger made the principle behind it very easy to understand. Great as usual
It’s wonderful to revisit these lectures
Your lecture really helped me understand bagging and boosting more clearly. Thanks for sharing.
Your English is so perfect.
Really great lecture!
I really like your lecture!
So well and funnily explained, TY.
Really great lectures! Thanks! I have a question: In Gradient Boosting the algorithm says "h* = argmin sum (h(x_i) - t_i)^2". I don't understand if one should:
1. go through all the possible regression trees (with a maximum depth set) and find that "h*"
OR
2. one could just apply CART with the new labels "t_i" instead of "y_i".
As I might guess, we should go with 2. But doesn't 2 give just an approximate solution to that minimization problem? Am I missing something?
You are right, it is just an approximation. But in practice, it is pretty good.:-)
I had a similar question. I guess the answer ties back to the decision tree lecture, where the Prof said that in general these kinds of problems (minimizing the squared loss over all possible CART trees) are NP-hard, so we just take a 'greedy' approach where we only care about the split we are making at a given point. Essentially, we are doing the same thing here: instead of solving the NP-hard problem, we grow a typical CART tree (with the residual as the prediction target), which, as the Prof said, is of course an approximation.
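In case it helps anyone else, here is a minimal sketch of option 2 under the squared loss: each round fits a standard (greedy) CART regression tree to the current pseudo-targets t_i, i.e. the residuals. The function names, the shrinkage constant alpha, and the depth limit are illustrative, not taken from the lecture notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbrt(X, y, n_rounds=100, alpha=0.1, max_depth=3):
    """Gradient boosted regression trees for the squared loss (sketch)."""
    H = np.zeros(len(y))              # ensemble predictions on the training set
    trees = []
    for _ in range(n_rounds):
        t = y - H                     # pseudo-targets t_i = negative gradient = residuals
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, t)  # greedy CART fit (the approximation)
        H = H + alpha * h.predict(X)  # H <- H + alpha * h
        trees.append(h)
    return trees

def predict_gbrt(trees, X, alpha=0.1):
    return alpha * sum(h.predict(X) for h in trees)
```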
Winning us over again :D That graph is so arrogantly, outrageously beautiful :D
Thank you !!
Can someone help me: are ∂l/∂H and h vectors (of what?) or scalar values?
h is a function (the weak learner). But the trick is that you only care about the predictions on your training samples, so every function can be represented and written as an n-dimensional vector (with one dimension for the prediction on each training sample). So then it should hopefully all come together. l is a loss function that computes the loss of the current prediction, which is the vector H + αh. You take the derivative with respect to h. Hope this helps.
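In case the vector view helps anyone else, written out explicitly (my own notation, not copied from the lecture notes): restricted to the n training points,

$$\mathbf{H} = \big(H(x_1), \ldots, H(x_n)\big), \qquad \mathbf{h} = \big(h(x_1), \ldots, h(x_n)\big), \qquad \nabla \ell(\mathbf{H}) = \Big(\tfrac{\partial \ell}{\partial H(x_1)}, \ldots, \tfrac{\partial \ell}{\partial H(x_n)}\Big),$$

so ∂l/∂H is an n-dimensional vector too (one partial derivative per training sample), and the term that matters when you add α·h is the inner product of that gradient vector with the prediction vector of h.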
Dr. Weinberger, I think there is a minor error in the notes on your website, in the derivation of Gradient Boosted Regression Trees there should not be the (-) sign before the argmin in step 3. I am referring to your notes here: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote19.html
Yes, good catch. Thanks for pointing this out!
Perfect
thank you for the lecture series, but I could not understand the Kaggle bit, what does flipping random bits mean?
For each test sample you either classify it as class 1 or 0 (these are the bits). You can then create and submit a random bit vector and observe whether you perform worse or better than 50% error (assuming the data is balanced). If you are worse, you can flip the bits and then you will do better.
@kilianweinberger698 Yes, got it, thank you :)
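A toy version of the trick, in case the description above is still abstract (assuming balanced binary labels and a leaderboard that reports the error rate; everything here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # hidden test labels (unknown to us)
guess = rng.integers(0, 2, size=1000)    # a random bit vector we "submit"

error = np.mean(guess != y_true)         # the score the leaderboard would report
if error > 0.5:                          # worse than chance?
    guess = 1 - guess                    # flip every bit
    error = np.mean(guess != y_true)     # now the error is 1 - old error < 0.5
print(f"error after (possible) flip: {error:.3f}")
```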
Why do we treat L(H) as a constant? It is a function of H, so it varies. Sorry if I am missing something basic.
So the previous weak learners are treated as fixed, and the only change we are allowed to make in each iteration is to add another weak learner. So H, and therefore L(H), is fixed, and we find an h to minimize L(H + αh).
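To spell it out (my notation, hopefully consistent with the lecture): with the previous ensemble H frozen, the first-order Taylor expansion is

$$L(H + \alpha h) \;\approx\; L(H) + \alpha \sum_{i=1}^{n} \frac{\partial L}{\partial H(x_i)}\, h(x_i),$$

and since L(H) is the same number for every candidate h, minimizing over h only involves the second term.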
excellent, very good
I have to leave another comment. Prof. Weinberger is a teaching genius. Also, who says Germans ain't funny?
Thank you for the wonderful and so easy explanation. However, I am confused about one thing. What I understand is that we are taking multiple weak classifiers and summing their outputs to get the final result. If we have our first weak classifier wc1, how do we create wc2 and so on? If we are generating wc2, wc3, ..., wcn from the training data points, aren't we overfitting?
You will be overfitting eventually, but not for a long time. The reason is that a) your weak learners are pretty "weak", b) you multiply them with a small learning rate / shrinkage constant (thus reducing their influence), c) your loss function (e.g. exponential loss) may learn to maximize the margin between the two classes (which helps a lot against overfitting).
@kilianweinberger698 Thank you. b) is clear, but I am not clear about c). Will try to figure it out. Thank you again for the help.
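If it helps, here is what points a) and b) look like in code; this is just scikit-learn's generic gradient boosting with illustrative numbers, not the exact setup from the lecture:

```python
from sklearn.ensemble import GradientBoostingClassifier

# a) shallow trees keep each weak learner weak,
# b) a small learning_rate (shrinkage) scales down each tree's contribution,
# c) the exponential loss relates to margin-maximizing, AdaBoost-style behavior.
model = GradientBoostingClassifier(
    n_estimators=500,    # many boosting rounds before overfitting becomes an issue
    learning_rate=0.05,  # shrinkage constant multiplying each new tree
    max_depth=3,         # depth limit that keeps individual trees weak
    loss="exponential",  # exponential loss, as mentioned in point c)
)
# model.fit(X_train, y_train) would then train the ensemble on your data.
```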
Why is sum(h(x)) = c (a constant)?
Well, it is a trick that you assume. You can always do this by normalizing your weak learners. (in practice nobody does, actually)
Hey Professor @kilianweinberger698, loving the analogies. So is this constraint for h applied to other types of learners/algorithms as well? I am having trouble reasoning why only CART learners would be susceptible to rescaling with a large constant.
Why is it that multiplying our weak classifier h by a large number gives a smaller value for ∑ri h(xi) in h = argmin ∑ rih(xi)? Shouldn't it make it even larger?
Never mind lol, I got it as I was typing the question, but I'll type it out anyway in case anybody else is wondering. In ∑ri h(xi), ri is the derivative of the loss function with respect to each training input. Visually in function space, ri is a vector from the ground-truth y to our current ensemble vector H. We are trying to find an h which points in the direction opposite to the derivative, so we try to minimize the inner product between an arbitrary h and ri - but we can make this inner product as negative (and hence as small) as we want by inflating the values of h. This should not be allowed, otherwise we would find h to be an infinitely long vector, so we restrict it to lie in a small circle.
Short answer: " Shouldn't it make it even larger?" it does make it as large as possible, but in negative.
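For completeness, the constrained step the two threads above are talking about, in symbols (my notation; the exact form of the normalization may differ slightly from the lecture notes):

$$h^{*} \;=\; \operatorname*{argmin}_{h \,:\, \sum_{i} h(x_i)^2 \,=\, \text{const}} \;\sum_{i=1}^{n} r_i\, h(x_i), \qquad r_i = \frac{\partial L}{\partial H(x_i)}.$$

Without the constraint on the scale of h, you could push the objective to minus infinity just by multiplying any h with negative inner product by a huge constant.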
I can't appreciate you enough, Sir. Absolutely grateful for these video lectures. I had a couple of small questions:
a) For two different classifiers (say H_1 and H_2) which, let's say, have the exact same predictions on our training data: will the derivative of the loss function with respect to both these functions (∂L/∂H_1 and ∂L/∂H_2, where L is the loss function) be the same?
b) Also, what is the intuition behind this derivative (∂L/∂H)? I can only think of it as the change in the loss function per small nudge in H (a bit confused about what the intuition of that nudge in a classifier H would be here).
a) yes exactly
b) the gradient is basically telling you how you would have to change H(x_i) in order to reduce the loss, and then you do make that change (approximately) by adding another tree to H.
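A concrete special case that may help with the intuition in b) (my own example, not from the reply above): for the squared loss,

$$L(H) = \frac{1}{2}\sum_{i=1}^{n}\big(H(x_i) - y_i\big)^2 \quad\Rightarrow\quad \frac{\partial L}{\partial H(x_i)} = H(x_i) - y_i,$$

so the nudge that reduces the loss at x_i points exactly toward the residual y_i - H(x_i), and the next tree is fit to approximate those residuals.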
Love his jokes :)
How is H constant when you discarded the term l(H)?
It means that it does not depend on the small h; it is fixed for all h.
Cordell, cut Mr. Barney a check for two hundred and fifty thousand dollars.
His grunting is even more predictable