NOTE: To apply this method to a classification tree, replace SSR with Gini Impurity (or Information Gain or Entropy or whatever metric you are using).
Support StatQuest by buying my book, The StatQuest Illustrated Guide to Machine Learning, or a Study Guide or Merch!!! statquest.org/statquest-store/
I read some reference books like Introduction to Machine Learning and Hands-On Machine Learning, but I didn't find the kind of detail about Decision Trees, or, let's say, other methods, that you covered!! Could you suggest some more references to get a deeper understanding?
And thanks a lot; this is really the best place for me to learn about machine learning as my first source. The explanations are deep yet still accessible to a beginner.
This is the best explanation of regression trees that I could find online. Professors are always too mathematical and programmers are too practical. Your explanation is juusssst right. Thanks a bunch for this!
Josh: you have truly cracked how to use technology (slides/basic animation) to change the way we have been teaching for decades. I wish all universities would take a note from you and revise the way they teach.
Yes, 100% true: a lot of knowledge shared with just simple visualisation. As you mentioned, the way of teaching matters a lot. Thanks to our MASTER Josh Starmer once again for this awesome video/content !!!
@@statquest I can confirm this. I am pursuing a Masters Degree in Data Analytics Engineering. And I have this course that is giving me a headache: Statistical Modelling. Your videos have helped me a lot; this is the perfect stuff I was looking for. BAM!!!! BTW, I saw the video where you explained how your pop helped you and the StatQuest community in general; I totally get why your videos are perfect. Double BAM!!!!! Seriously man, thanks a ton.....
seriously yes! I'm taking an online course from MIT....brilliant faculty but just oh so removed from the every day regular experience of learning as a student. Their slides, and teaching methods leave much to be desired. I don't understand anything they are teaching in stats or ML but Starmer is saving my life atm.
Sir, I am learning ML from your videos and every day I am forced to comment expressing the beauty with which the concepts are explained. And the best part is you still clear our doubts even after 3 years. For those who don't know, sir has also written a book, which is too good.
I searched a lot of things to start my ML project from scratch. Then I landed here. You nailed it. 🔥🔥🙏 Now I am on the edge of completing my project. Thanks a lot.
Absolutely brilliant videos!!! I watched everything from the 1st one to this one in the list and understood so many things that I never understood in school. I love your videos so much!
I love the way you say "BAMM"!!! Gives great relief during the video :) I want to say your style of teaching is great. The way you explain makes it very easy for us to understand. In my opinion: "A difficult subject made easy to understand with your video lectures!" Thank you very much.
This is awesome. I remember I was a bit confused when I was reading about tree-based methods in An Introduction to Statistical Learning. This really helps me understand it much more easily, since I can visualize it rather than just read formulas. Thank you!
Thanks, Josh Starmer. Using the train + test data to find the list of alphas, and then using K-fold CV on the training data to find the optimal alpha, leads to data leakage.
Oh my Buddha! I'm falling in love with the fun in your voice as you explain. Before I found your channel, my head was spinning round and round. I didn't know what to do with my learning, but you came in and took me by big surprise. You made the abstract concepts simple! Thank you!!!!! :)))))))
At 12:20, let's say the right-hand side of the tree had a node instead of just a leaf, and that node led to two leaves. In this situation, what would be pruned first to create the pruned comparison tree: the two leaves at the bottom of the left side, because it's deeper? The two leaves at the bottom of the right side? Or all ends with two leaves at the same time?
We always remove the leaves that result in the smallest increase in SSR.
Thanks so much for your video. I've watched it 8 times and still have one question. It is about 13:17~14:08. Could you possibly explain in more detail the sentence at 13:17, "Use the alpha values we found before to build trees (full and sub) that minimize the tree score"? My questions about this sentence are:
1. How can we use alpha in the tree-building process? I thought the way we build trees (full and sub) is the same as in your 'regression tree' video. I understand the tree score, alpha, and how alpha plays a role in changing the tree scores of different sized trees (full and sub trees); my question is about the role of alpha (from the whole data) in creating trees.
2. If we can build trees the same way as in the 'regression tree' video, why do we need this process at 13:17~14:08?
The idea is that when alpha is 0, we build a full tree just like in the original regression tree video. Then we increase alpha to the first value, and that causes us to prune that tree a little bit. Then we increase alpha some more, and we prune the tree some more, until we've used all of the values for alpha we identified earlier.
At 12:33, it reads on the video that "we increase alpha again until pruning leaves will give us a lower tree score". My question is: lower than what? My understanding is lower than the tree score of the full-sized tree at that specific alpha value, in this example at alpha = 10,000, because we have already established that the full-sized tree has the smallest tree score at alpha = 0. Similarly, at 12:43 the third tree is chosen at alpha = 15,000 because it has a lower tree score than the second tree at alpha = 15,000. Please let me know if this is correct. Thanks.
Yes
At 12:05 you said that the full tree has the lowest tree score when alpha = 0. By default, pruned trees have a higher SSR. Won't increasing alpha increase their tree scores even further, instead of lowering them?
Increasing alpha increases the scores of all of the trees (it never lowers the scores). However, remember, we multiply alpha by the number of leaves in a tree. So a large tree will get a much larger penalty (alpha * number of leaves) than a smaller tree.
@@statquest Actually, I just figured out that we have to increase the alpha on the previous tree structure to see the comparison. I thought we only needed to increase the alpha on the pruned trees (to make them smaller). Thanks for the reply.
@@statquest Hi Josh, I am a beginner in learning data analytics. First, I have to say our professor uses your videos in class and they really helped in understanding the course materials! As for the problem mentioned by Bobby, I also had a problem with it until I saw your interaction here. I was stopped by "increase alpha until pruning leaves will give us a lower tree score." My problem was the same as Bobby's: I thought increasing alpha only increases the tree scores, so where should I start pruning leaves? The alpha and T got fuzzy in my mind. However, when I thought "increasing alpha only increases the tree scores", I was comparing with just the full tree, so the increase in tree scores seemed to have no stop. Actually, the stop is the SSR of the next tree, where the leaves got pruned. At that point, the alpha is the knot which will be assigned to the next tree. (I hope I am right. Please correct me if not.) Having this in mind, when I look at your words "increase alpha until pruning leaves will give us a lower tree score", I understand them completely and I think they're very accurate. I would recommend some explanation here, or making it more obvious in the arrangement of the graphical illustrations in the video. Very lengthy comment; I was trying to organize my thoughts. Thank you again for wonderful videos!
Thank you very much for this video! I really enjoyed the full step-by-step process of building the various trees using different alpha values and the use of cross validation to select the best alpha!
One good idea for cross-validation: maybe we can split the data first into train and test, and then split train again into train and validation sets. That way we can guarantee that our test set is totally new to the environment. This will result in more realistic scores.
Yep
I'm curious as to how we find sensible alphas. There seems to be an explanatory gap here, since in the section "comparing pruned trees with alpha" it says (in the NOTE at 8:23) that we find it during cross validation, but in the cross validation section at 13:18 we are supposed to "use the alpha values we found before". Sensible alpha values would probably vary widely depending on the SSRs of the trees (and ultimately the variable ranges), and even more so once we do classification, since Gini and entropy give small values for which alphas of 10k would not work usefully. Surely simply guessing various alpha levels and finding which of the guessed ones work best in cross validation is not the best method, or am I misunderstanding something here?
We use all of the data to find candidate values for alpha (this is demonstrated at 11:18). Once we have candidate values, we test each one with cross validation to find the optimal value for alpha.
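For readers who want to follow along in code: scikit-learn implements this same two-step recipe, and its cost_complexity_pruning_path method returns the candidate alphas directly. Below is a minimal sketch on made-up dosage/effectiveness data (note that scikit-learn scales impurity by the number of samples, so its alphas come out much smaller than the 10,000-scale values in the video):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up dosage/effectiveness data, just for illustration.
rng = np.random.default_rng(0)
dosage = rng.uniform(0, 40, size=(80, 1))
effectiveness = np.where((dosage[:, 0] > 14) & (dosage[:, 0] < 29), 100.0, 5.0)
effectiveness += rng.normal(0, 5, size=80)

# Step 1: use all of the data to find the candidate values for alpha.
# Each alpha in the path is the point at which pruning one more branch
# gives a lower tree score (SSR + alpha * number of leaves).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    dosage, effectiveness)
print(path.ccp_alphas)  # candidate alphas, starting at 0 for the full tree
```

Step 2, testing each candidate with cross validation, is sketched further down the thread.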
Your lectures are awesome. I like the way you explain things. I have read your book too; it's brilliant. I just want the PDF of this lecture. Can you share it?
I'm glad you like my stuff! Unfortunately, I can't share PDFs of my work.
Hi Josh, I don't get this part: 12:10 "increase alpha until pruning leaves gives a lower tree score". What does "lower tree score" mean? Lower than the score from the previous round? The first (full) tree uses an alpha of zero. The second tree, pruned, will have a higher SSR, hence a higher Tree Score. A non-zero alpha will only increase the Tree Score further. So, how is it getting a "lower tree score"? Thanks, Kar Ann
---
Notes for my own future reference:
*WHY PRUNING*
To avoid overfitting, a less-than-full tree is needed. The pruning procedure uses a tree score that balances prediction error against the number of leaves.
*OVERALL STEPS*
1. Create a set of pruned trees. Use the Tree Score to decide how much to prune. The trees should have comparable "quality" in terms of error and number of leaves.
Tree Score = SSR + αT
2. Then apply different subsets of data (cross-validationally) to all pruned trees.
The pruned tree that gives the lowest SSR (on the testing data) is the winner.
*DETAILED STEPS*
Tree Score, TS = SSR + α*T (T = no. of leaves).
TS is used to decide how much to prune, and to create the different trees.
*1. Create a set of pruned trees of comparable quality, in terms of error and number of leaves*
Create the 1st trial tree, the fullest tree: Tree1. α1 = 0. Get TS1.
Create the 2nd trial tree, a pruned version of Tree1: Tree2, with α2 = a new value bigger than 0.
Compare the Tree Scores of the old (full) tree and the new (pruned, smaller) tree.
Even though the newer (pruned) tree has a bigger SSR, it has a lower T.
At α = 0, the newer tree has a higher TS than the old tree.
But the new tree's TS increases more slowly with α than the old tree's, because the old tree has a higher T.
Eventually, at a certain α2 value, the newer tree has a lower TS.
Get TS2, using the new α2.
Create the 3rd tree, a pruned version of Tree2. Get TS3, using the new α3.
Repeat to create more pruned trees, each with its own α value.
Now there are many trees.
The first tree is the full tree.
The last tree is only one leaf.
All trees have comparable quality, based on a fine balance between prediction error and number of leaves.
*2. Apply different sets of data, cross-validationally, to all trees. The pruned tree that gives the lowest average SSR is the winner*
.Take the different values of alpha from the above, e.g. α1 to α5.
.Use different sets of training and testing data, i.e. 10-fold cross validation.
.For each of the 10 folds of training and testing data, get the SSR for each of α1 to α5.
.Get the SSR of each α, averaged over the 10 sets of training and testing data.
.Pick the tree, and its associated α, that gives the lowest average SSR.
The fullest tree, though it gives the lowest SSR with the training data, doesn't give the lowest SSR with the test data, due to overfitting. Cross validation also helps to minimise the impact of overfitting.
The tree score is the SSR + alpha*number of leaves. We increase alpha until it makes sense to reduce the number of leaves in order to get a lower tree score. Although reducing the number of leaves will increase the SSR, if alpha is large enough, it can overcome that increase. For example, if a full tree has 10 leaves and SSR = 10, and the same tree with 9 leaves has SSR = 20, then when alpha = 11, the tree score for the tree with 9 leaves will be smaller than the tree score for the tree with 10 leaves. In other words: 10 + (11 * 10) > 20 + (11 * 9) -> 120 > 119
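A quick way to sanity-check that arithmetic is to write the tree score as a function (a toy sketch; the numbers are the ones from the reply above, not from the video):

```python
def tree_score(ssr, n_leaves, alpha):
    # Tree Score = SSR + alpha * T, where T is the number of leaves
    return ssr + alpha * n_leaves

print(tree_score(ssr=10, n_leaves=10, alpha=11))  # 120: the full tree
print(tree_score(ssr=20, n_leaves=9, alpha=11))   # 119: the pruned tree wins
```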
@@statquest I got it now. Thanks! But one more question.... Why "We increase alpha until it makes sense to reduce the number of leaves in order to get a lower tree score"? Doesn't it make better sense to say "We reduce the number of leaves and increase alpha until we get a lower tree score"?
@@statquest Sorry, Josh, Re-watching it... and more questions sprung up. Q1) At 13:17: "Using the training data... use the alpha values we found before to build a full tree and a sequence of subtrees that minimise the Tree Score" Why do we need to build trees again? Weren't they already created in the previous steps, when the alphas of subtrees were evaluated? Q2. The objective is to apply different folds of data to find out which subtrees give the lowest average SSR. Why is alpha needed here again? Alphas were used in Tree Score, to create the subtrees. But SSR doesn't need alpha, right?
@@karannchew2534 1) The first set of trees was built with all of the available data (not including a validation dataset, if you have one); the subsequent sets of trees are built with different subsets of all of the available data, and this allows us to pick the best value for alpha using k-fold cross validation. 2) Correct. Ultimately we pick the tree with the lowest average SSR after cross validation.
This video is awesome! Why didn't I know StatQuest earlier? It really helps, THX!!! BTW: I'm confused about the building of trees at 13:19. How could we know that the bottom 2 leaves should be cut for a new training set when α = 10,000 (pre-calculated)? Is that just a coincidence? Or does the cut depend on which leaves give the lowest Tree Score?
My understanding of the pruning process is this: ① use the whole data set to build a full-sized tree, and then increase alpha from 0 to get different sub-trees that are pruned and have a lower Tree Score corresponding to each alpha ② build a new training set and testing set from the whole data, then build a full-sized tree, use the alphas we have from before to build sub-trees, then calculate the SSR on the testing data ③ repeat ② until we've done 10-fold cross validation ④ for each iteration, choose the alpha that has the lowest SSR ⑤ calculate the average among all the alphas in ④ to get its final value. Am I getting it right?
You're correct about everything except 5. We don't calculate the average of the alphas, we calculate the average of the sum of the squared residuals for each level of alpha, and select the level of alpha that corresponds to the lowest average sum of the squared residuals.
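A sketch of that selection step in scikit-learn, assuming the candidate alphas have already been found (the function name pick_best_alpha and the use of KFold are illustrative choices, not the video's exact procedure):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def pick_best_alpha(X, y, candidate_alphas, n_splits=10):
    """For each candidate alpha, average the testing-data SSR over the
    folds, then return the alpha with the lowest average SSR.
    X and y are numpy arrays."""
    ssr_per_alpha = {a: [] for a in candidate_alphas}
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        for a in candidate_alphas:
            tree = DecisionTreeRegressor(ccp_alpha=a, random_state=0)
            tree.fit(X[train_idx], y[train_idx])
            residuals = y[test_idx] - tree.predict(X[test_idx])
            ssr_per_alpha[a].append(np.sum(residuals ** 2))
    return min(candidate_alphas, key=lambda a: np.mean(ssr_per_alpha[a]))
```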
@@statquest Thx so much for your reply and correction! I think I get the process and concept a little bit now, but I still need more time to fully understand it. Your videos are just the most helpful, vivid, and clear! Thx again for all the effort and sharing! I'm moving on to SVM right now hahaha
@@statquest I am also having the same doubt. Can you please answer the original question? "How could we know that the bottom 2 leaves should be cut for a new training set when α = 10,000 (pre-calculated)? Is that just a coincidence? Or does the cut depend on which leaves give the lowest Tree Score?"
Some questions here! Hope they get explained. 1. At 14:14, why calculate the SSR, not the Tree Score? 2. At 14:34, when making a new tree based on new training data, is there evidence that, when using the previously obtained alpha value, it always eliminates the leaves?
@statquest Then, are the pictures of trees in 14:34 just arbitrary? Since the training sample is different, the trees made by the new data samples may not be the same shape as the tree before.
Cost Complexity Pruning is otherwise known as post-pruning, while limiting the tree depth, enforcing a minimum number of samples per leaf/split, or a minimum impurity decrease is known as pre-pruning, correct? My question is: can you apply pre-pruning first, and then apply post-pruning to the pre-pruned tree? If yes, then I assume that the alpha parameters will be found from the pre-pruned tree, right? Not the initial full tree? And then cross validation will also be performed with the pre-pruned tree, not the initial full tree, to determine the final optimal alpha? And on a separate note, should only those alphas obtained from the tree trained on all the data be used when cross validating, and why? Is there no chance that some other, random alpha might result in better performance? Considering that cross validation is done on several different test/train splits, and there will be splits that do better with one alpha and splits that do better with other alphas, doesn't it make sense to try all possible alphas (from 0 to infinity) in the cross validation, not only those that give the best tree scores for the full tree? Isn't there a chance that some other alpha will give, on average, a lower sum of squared residuals than those obtained from the full tree?
I believe you are correct about how pre-pruned trees are used. And this is just how the algorithm is spelled out - possibly to keep the running time down.
How are new trees built by imposing the previous alpha values? Maybe it is not possible to find new, continuously smaller trees of reduced Tree Scores for fixed alphas. Before, alpha was a parameter, and during cross validation it is a constraint :(
(Making notes for my own future reference) Tree Score, SSR + αT, is used to create a set of pruned trees. Then, apply data (cross-validationally) to all pruned trees. The pruned tree (and its α value) that gives the lowest SSR (on the testing data) is the winner.
A quick question please. In this example, the tree is not balanced, in the sense that the right subtree is a lot deeper. What if the left subtree were as deep as the right subtree: how would we choose which side of the internal node to collapse, or in other words, which side's leaf nodes to delete? Based on what is shown at 12:33, we should go with whichever gives us the lower tree score, right?
Hi Josh, I hope you answer my question; I've been searching for 3 days now and got nothing. I have 2 problems: 1. How do we determine alpha when there is more than one leaf at the bottom of the tree? (You said to increase alpha until pruning this leaf gives a lower score.) If I have more than one leaf in the last level of the tree, which one should I cut? Or should I look at all subtrees every time I increase alpha? It seems like that would get very complex. 2. In implementation, when I give the model the ideal alpha to build the decision tree, how will the model know, at every step it takes while building, that it will end up at the subtree related to this alpha? You are amazing; I really enjoyed every lesson I took from this channel.
Just a random thought: what if we prune the tree directly based on the SSR computed on the validation set, instead of adding a penalty? Either way, the tree that works well on the validation set is selected. Why are we adding a penalty? Does it help control fluctuations in the number of leaves selected across cross-validation folds during hyperparameter tuning?
Hey Josh, how can we choose our alpha wisely? So that the tree with the minimum tree score will really work well for testing data too. Is there a specific rule of thumb?
@statquest Can you reconcile what you said at 10:41, 15:05 and 15:35? If we are picking the final pruned tree at 15:35, what are we doing at 10:41 with the tree score? I thought cross validation was just to pick the optimal value of alpha (15:05), which is then used at 10:41 to pick the final pruned tree. Any clarification will be helpful.
The way you explain, and the amount of effort you put into the videos, is great. I have learnt a lot from you, sir. I always feel so positive and motivated while learning from you. Thanks a lot 🎈
Hey, Josh! Is it OK that we are using train+test to find the alpha values? I mean that we are peeping into the future: we learn good thresholds using the test sample (not only train), or am I wrong? Thank you
@3:35 When you remove the leaves, how do you change the decision rule of the parent node? @11:55 When you build the tree from the full data set, how do you get the SSR? I mean, what is the input to get it?
1) The parent node reverts to the decision it made before the branch was added. In this case that means the average of the drug effectiveness for all values with dosage >= 14.5 2) Calculating the SSR is described in the video that explains Regression Trees here: ua-cam.com/video/g9c66TUylZ4/v-deo.html
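In code, the reverted leaf's prediction and its SSR follow directly from the observations in that region (toy numbers, not the video's data):

```python
import numpy as np

# Effectiveness values for the observations with dosage >= 14.5 that the
# pruned branch used to split further.
effectiveness = np.array([98.0, 100.0, 52.0, 29.0, 2.0])

prediction = effectiveness.mean()                # the reverted leaf's output
ssr = np.sum((effectiveness - prediction) ** 2)  # its contribution to the SSR
print(prediction, ssr)
```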
Thank you for the video, I just have one question. We use pruning to figure out the optimal number of leaves for the tree we build using the original training data. The purpose of the original validation/testing data was to validate the performance of this tree, as we did not use that data in the creation of said tree. Now in the pruning process, the original testing/validation data is used in conjunction with the original training data to determine alpha. This means that the original validation data had a large impact on the tree we created from the original training data and is therefore no longer independent from it. How is this procedure justified, and would it be necessary to get a third independent data set for a true validation of the tree's predictive performance?
The term testing data is used a bit misleadingly here. You usually have 3 sets of data: Training, Validation and Testing. The Validation data is used to optimize the model (in this case the pruning). The test data is separate and should, as you said, be completely independent from the training data. I've noticed that quite some people use 'validation' and 'test' data synonymously, when they are in fact two different things.
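A minimal sketch of that three-way split with scikit-learn; the placeholder data and the 0.2/0.25 fractions are arbitrary (they give a 60/20/20 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100.0).reshape(-1, 1), np.arange(100.0)  # placeholder data

# Hold out the test set first, so it never influences pruning or alpha...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# ...then carve a validation set out of what remains.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```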
Thank you for the amazing content, Josh! I had a doubt: do we change the alpha and the number of nodes simultaneously? I thought that we change just alpha and then check which tree performs better (keeping one thing constant and changing the other gives us a better way to compare). Also, at 11:32 you used the entire data (train+test) to build the model and find different values of alpha, and then used those values on the train split to build the model, compare test scores, and finally find the best value of alpha. If we use the train+test data to find alpha, isn't that causing data leakage? Also, why do we compute the alpha values beforehand? Why don't we do it the usual way (the way we do in ridge regression), where we find the optimal value of the regularization parameter using GridSearchCV, since alpha is analogous to a regularization parameter (I think)?
1) We find values for alpha to result in better performance after pruning. 2) We can keep some data set aside for validation. 3) Perhaps for performance reasons.
@@statquest Sorry, I don't get your answer for #1. Can you please elaborate more? Why do you find alpha using test+training data? If you just use the training data to find alpha, you will get an alpha that results in good performance, but is it better to use both the training and testing sets? Can you please confirm?
@@statquest Thank you. Can you please confirm what I understand from this video? You pick a number of alphas from a set which combines both the training and test sets. First, you have the full (untouched) tree where alpha is 0. Then you increase alpha until pruning leaves gives a lower tree score, and keep increasing alpha until pruning additional leaves gives a lower tree score. Long story short, a set of alphas is derived from the (training and testing) data. You then train models using the set of alphas we calculated before. We test the trained models by picking the optimal alpha value with the lowest SSE on the first fold used as the testing set. You pick the next optimal alpha value by selecting the lowest SSE on the second fold as the next testing set. If you are using 5-fold CV, then you pick 5 optimal alphas on 5 different testing sets, and you just select the alpha that appears the most. Can you please verify if what I understand is correct? Thank you, Josh!
@@shawnkim6287 That seems reasonable to me. If you want to double check, you can find the algorithm in The Introduction to Statistical Learning in R (which is a free PDF if you look for it).
Thanks for the great video. One question though: why is the full-sized tree built from all data (see 11:25) and not just the training data? Couldn't this potentially cause problems w.r.t. leakage?
Regarding choosing α (starting at 11:19): when we fit a new regression tree to the FULL data, does this not cause us to 'overfit' α to some extent? I was wondering what would happen if we did the following: i) Split the data k different ways into a set that we find α for and a set that we ignore. ii) On the set that we find alpha for, we get [α11, α12, α13] (the first set of αs such that we get better tree scores cutting the tree by 1, 2, and 3 levels respectively), up to [αk1, αk2, αk3]. iii) We then take the average α for each cut, so [(α11+α21+...+αk1) / k, (α12+α22+...+αk2) / k, (α13+α23+...+αk3) / k] as our set of final αs. iv) Perform K-Fold Cross Validation using the above to see which α gives the lowest SSR for its optimal tree. Would my method make little to no difference? Or is my method overfitting more in some sense? Let me know what you think!
In the video, you selected 4 values of alpha, i.e. 0, 10k, 15k and 22k. How did you come up with those exact alpha values? In the real world, the dimensionality of the data will be huge, so how do we decide the values of alpha?
In the real world, you keep track of the sum of the squared residuals at each point in the tree. You can then figure out what values for alpha work directly from that information.
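In the standard CART formulation (weakest-link pruning, which the video doesn't spell out), those stored SSRs give each candidate alpha in closed form: collapsing a subtree into a single leaf pays off once alpha exceeds the increase in SSR divided by the number of leaves removed. A sketch with a made-up function name:

```python
def crossover_alpha(ssr_as_leaf, ssr_of_subtree, n_leaves_of_subtree):
    # The alpha at which the two tree scores are equal:
    #   ssr_as_leaf + alpha * 1 == ssr_of_subtree + alpha * n_leaves_of_subtree
    return (ssr_as_leaf - ssr_of_subtree) / (n_leaves_of_subtree - 1)

# e.g. pruning a 2-leaf branch that raises the SSR from 10 to 20 pays off
# once alpha exceeds 10.
print(crossover_alpha(ssr_as_leaf=20.0, ssr_of_subtree=10.0,
                      n_leaves_of_subtree=2))
```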
Hi Josh, thanks for this GREAT video!!! I just wanted to ask: what's the principle for choosing the starting alpha values? I.e., in the video you chose these 3 alpha values as the candidates to determine the final alpha: 10,000, 15,000, 22,000. How did you come up with these three?
I started with alpha =0. I then increased alpha until I got a lower SSR + alpha score by pruning a branch. That was my second value for alpha. I then increased until I got a lower SSR + alpha by pruning another branch...etc.
@@statquest Thank you Josh! Really nice videos! I am looking at most of them! I rank them by the total BAM count of each video :) One thing I don't understand: in the example above, the full tree and the tree with one pruned leaf would have an equal tree score at alpha = 5494.8 - 543.8 = 4951.0, whereas the trees with one and two pruned nodes would have an equal score at 19243.7 - 5494.8 = 13748.9. So instead of choosing a candidate alpha of 10000 as you did in the very first step, you could have chosen any value 4951.0 < alpha < 13748.9. Would perhaps the best value to choose for a candidate alpha then be the mean of 4951.0 and 13748.9 = 9349.95 rather than 10000?
@@statquest Sorry for my unclear question. Between 12:00 and 12:54 you describe that you increase alpha until pruning one leaf renders a lower score. So the border between the full tree and the tree with one pruned leaf is at 10000 here, and the border between the trees with one and two pruned leaves is at 15000. Then couldn't the candidate alpha for the tree with one pruned leaf be set to 12500, i.e. the middle of the interval where this subtree is optimal, rather than at the border to the full tree?
Why do you need to create the tree with the full dataset (training+test, at 11:25)? Aren't the trees created with the training data during the k-fold enough?
We use all of the data to identify reasonable values for alpha. Once those are identified, we use cross validation to identify the best value for alpha.
@@statquest I think I got confused because you first used the full dataset, then the k-folds, and later the full set again. I think I can first do the k-fold to choose the best alpha, and later prune the tree built on the full dataset with that best alpha. Isn't that right?
It is the lowest Sum of Squared Residuals with the Testing Data, not the lowest tree score. In other words, when alpha = 10,000, we have a tree that does not over fit the training data (which would result in a larger SSR for the Testing Data) or under fit the training data (which would also result in a larger SSR for the Testing Data).
Thank you very much Josh. Could I ask a question: what if we throw away the alphas, and just use the three sub-trees and the full tree to run the cross validation and find the optimal tree structure with the lowest SSR? I mean, deciding the best alpha is essentially the same as finding the best tree structure. For example, in the video at 15:16, alpha = 10,000 gives the lowest SSR; I could also say the three-leaf tree gives the lowest SSR, and then just choose the three-leaf tree as my final decision tree. So it seems meaningless to find alpha.
I believe we use alpha because we use it to prune a full-sized tree, rather than build sets of smaller trees, like you are describing. In other words, we don't set out to build a 3-node tree; we set out to build a full-sized tree, and then prune according to various values for alpha. Since we are using Cross Validation, and thus have a different training dataset each time, the same value for alpha may not always result in a tree with the same number of nodes.
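That last point is easy to check with scikit-learn: fitting with the same ccp_alpha on two different training subsets need not produce trees with the same number of leaves (synthetic data and an arbitrary alpha, purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 40, size=(120, 1))
y = np.sin(X[:, 0] / 6) * 50 + rng.normal(0, 10, size=120)

for seed in (0, 1):  # two different "training datasets"
    idx = np.random.default_rng(seed).choice(120, size=80, replace=False)
    tree = DecisionTreeRegressor(ccp_alpha=5.0, random_state=0)
    tree.fit(X[idx], y[idx])
    print(f"subset {seed}: {tree.get_n_leaves()} leaves")
```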
Hello, first of all thanks for the great material you produced and shared; it's certainly among the clearest and most effective I've come across. My questions are about the cross-validation trees used to determine the right alpha values. As a premise, if I understood correctly, we first determine candidate alpha values by: a) creating a "full" tree from the full training+testing dataset; b) producing the corresponding family of "pruned" versions (and, I guess, assessing their SSRs in preparation for the next step) based on the morphology of the "full" tree (meaning, all possible pruned trees are considered; is that correct?); c) identifying the candidate alpha values as those at which the "full" tree's score becomes higher than one of the pruned versions. Assuming the above is correct, when we move on to cross-validate in order to ultimately determine the right alpha, I understand that we resample a training set (and a corresponding test set) a number of times. Each time, we build a new tree from the training set, and its associated set of pruned versions (let me call these trees a "cross-validation family of trees" (CVFTs)), and assess their SSRs based on the test set for the current round in order to contribute to ultimately calculating the actual alpha to use. First question: how come every CVFTs in your slides has a number of members that equals the number of candidate values for alpha? Couldn't a resampled training set give rise to trees with more or even fewer leaves (and corresponding pruned versions) than the tree that was used to identify the candidate alpha values? In that case, the candidate alpha values might be larger or smaller in number than the possible number of trees in the CVFTs at hand. I imagine that a possible answer is that the number of members in a CVFTs can actually be different than the number of candidate alphas, and that the pruned trees in a CVFTs are actually identified through their Tree Scores when each of the candidate alpha values is applied; if so, I guess the issue is that perhaps this mechanism does not stand out 100% from the presentation... Second question: if we assess the trees in each CVFTs only by their SSRs, wouldn't the tree with the most leaves (therefore alpha=0) always win? Thanks much
What you wrote for b) "all possible pruned trees are considered" is not correct. When we remove a leaf, we don't just create all possible subtrees with one leaf removed. Instead, we pick the one subtree that, when we remove one leaf, results in the smallest increase in the sum of squared residuals.
@@statquest Josh, OK, that makes sense -- so this is repeated on each new subtree to produce the set of trees where the candidate alpha values are then formulated as at minute 13:03, correct? If so, are my subsequent questions still standing?
@@stedev2256 Each time we do cross validation we get a new "full sized tree", which may have a different size than the original. We then use the pre-determined alpha values to prune that new tree and use the test dataset to find out which tree (and alpha value) is best for that iteration. As for your second question, this is where the "testing" data comes in handy. A full sized tree with the most leaves (and alpha=0) will probably overfit the training data, and thus, do a pretty bad job predicting the testing data. So in practice, the full size tree (with alpha = 0) performs great with the training data (low SSR) but poorly on testing data (high SSR).
@@statquest Josh, thanks, I was actually rephrasing / correcting my last post, and clarified a number of things to myself while doing that... I didn't think you could see my second post while I was editing it... sorry. But it was not all in vain, as what you wrote last confirms what I was getting to while revising my question and in light of your previous answer, and things seem clear now. Thanks much
Are the trees at 11:25 (based on all the data), 13:36 (based on 1st set of training data), and at 14:40 (based on 2nd set of training data), identical on the dosage threshold values in the blue boxes or not? That is unclear to me...
Every time we build a "full sized" tree (with alpha=0), we fit it to whatever data we are using and find optimal threshold values that are specific to the current dataset. So if we are using all of the data, we fit the tree to all of the data and find the optimal threshold values for that specific dataset. Then, when we are using the 1st set of training data, we fit the tree to that data and find new optimal thresholds for that specific dataset. Then, when we are using the 2nd set of training data, we fit the tree to that data and find new optimal thresholds for that specific dataset.
I'm a bit confounded. Isn't it the case that the more nodes there are, the smaller the SSR (as in the example at 9:30)? If so, how is it possible that a pruned tree with an alpha of 10,000 (14:22) has a smaller SSR than a full tree with an alpha of 0? Or could my second sentence possibly be true only in terms of constructing the regression tree (based on the training data) and not necessarily when applying it to the test data? I hope I'm formulating the question clearly enough :P
The full tree will always have the smallest SSR when applied to the training data, because it was built specifically to minimize the SSR for that training data. However, good performance (lowest SSR) with the training data does not mean we will have good performance (lowest SSR) with the testing data, which is a completely different set of data. The goal for pruning is to minimize the SSR when the tree is applied to testing data.
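A sketch that makes this concrete: the unpruned tree essentially memorizes its training data (training SSR near zero), while a pruned tree usually does better on held-out data. The data and the ccp_alpha of 50 are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 40, size=(150, 1))
y = np.where((X[:, 0] > 14) & (X[:, 0] < 29), 100.0, 5.0)
y += rng.normal(0, 10, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def ssr(tree, X, y):
    # sum of squared residuals of the tree's predictions on (X, y)
    return np.sum((y - tree.predict(X)) ** 2)

full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeRegressor(ccp_alpha=50.0, random_state=0).fit(X_tr, y_tr)

print("training SSR:", ssr(full, X_tr, y_tr), "vs", ssr(pruned, X_tr, y_tr))
print("testing  SSR:", ssr(full, X_te, y_te), "vs", ssr(pruned, X_te, y_te))
```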
Josh, the videos on this channel are nothing short of superb. I have only one suggestion: how about a dark theme for these presentations? That white background is like a supernova, especially on my 55" TV.
Thanks a lot for this. I came here after getting confused reading about this concept in a book. I am inspired by your teaching style. Your style of teaching by examples is the best way to transfer knowledge without losing the audience at any point. May I ask how much time you spend creating a tutorial like this? Also, what kind of tools do you use to make these videos?
Thanks for the great video, Josh. Quick question: at 15:06, what is the exact meaning of "on average"? Do you mean the most frequently selected alpha during the CV is simply the final value? Or should I calculate the average of the SSR for the test set, something like that?
@@statquest Thank you for the quick reply and sorry for the unclear question. What I wanted to ask is the way of averaging. Should I count how many times each alpha was the best during each fold of C.V.? And then pick the most frequently selected alpha?
Great video, thanks a lot! I just have one question about the cross-validation procedure: when creating the original sequence of subtrees from the whole big dataset, we get certain discrete alpha values for each subtree (in the video these are 10,000, 15,000 and 22,000). You said we obtain these values as soon as pruning leaves gives us a lower tree score. So the pruning is done first, and then alpha is increased until the tree score is lower than for the unpruned tree. Are we then using these discrete alpha values in the cross-validation procedure? Because then there must be something like a guarantee that these discrete alpha values also lead to a pruning in the training subset we are using in the current step of cross-validation, or not?
So a question. We learned previously that cross validation is used to test the model on different "blocks" of the test set. But in this case you are advocating for the cross validation to be used for hyper parameter tuning. Does that mean the test sets remain constant?
A lot of people ask about this, and I probably could have done a much better job wording things. The way I see it is that we have all of the data and we can split that into "all the data we want for training" and "all the data we want for testing". We then build a tree and prune etc. using all the data we want for training, and test against the testing data.
Thanks Josh for this beautiful video. I have a question: in the StatQuest on classification trees in Python, you extract the alphas only from the training data; however, in this StatQuest you extract alphas from all the data (train and test)?
Unfortunately I was a little sloppy with the terminology here. Because we are applying CV to the data, "train and test" refer to just the data used for training, but split up into "training and testing" sets for CV.
@@statquest Ahh okay, so whenever I want to determine the values of alpha I don't have to use the whole data; the training data alone might be sufficient? But applying CV involves both the train and test sets?
@StatQuest with Josh Starmer, I have purchased your book, but I didn't find these concepts (pruning, random forest, AdaBoost, gradient boosting) in it. Is there a way to access these presentation slides?
How do we decide the alpha values? From which value should I start testing the tree score? In this video you took alpha = 0, 10000, 15000, 22000; if I have a different data set, then how should I pick the alpha values?
This question is answered at 12:10 in the video. Start with alpha = 0 and increment it until we prune the tree. That's your first value. Then continue to increment it until you prune more etc.
@@statquest But in which order should we increase the alpha? If I take the first alpha = 0, then what should my second value of alpha be? You took 10,000 in the video; how can we choose that second value?
@@Aman-uk6fw Start with alpha = 1. See if it results in pruning. If not, increment alpha by 1. Repeat until you prune, that's your first value for alpha. etc.
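A brute-force sketch of that search (in practice the crossover alphas can be computed directly, e.g. with scikit-learn's cost_complexity_pruning_path, so this loop is only to illustrate the idea; the made-up function assumes the tree at start_alpha has more than one leaf, otherwise the loop never ends):

```python
from sklearn.tree import DecisionTreeRegressor

def next_pruning_alpha(X, y, start_alpha=0.0, step=1.0):
    """Increase alpha from start_alpha until the fitted tree loses at
    least one leaf; that alpha is the next candidate value."""
    n_leaves = DecisionTreeRegressor(
        ccp_alpha=start_alpha, random_state=0).fit(X, y).get_n_leaves()
    alpha = start_alpha
    while True:
        alpha += step
        tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
        if tree.fit(X, y).get_n_leaves() < n_leaves:
            return alpha
```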
That's a good question and it comes up a lot. You can always use an additional testing dataset, but then you have the problem of deciding which data would make the best dataset. A better way might be to estimate how things will perform by looking at the sum of squared residuals for the testing data from all of the trees associated with the final level of alpha you pick. For example, when we do the cross validation to figure out which value for alpha to use, we can keep track of the sum of squared residuals for each level of alpha, and at the end, when we know the alpha value, just average the sum of squared residuals for that level.
@@toja240 Whether or not you should have a special testing dataset that is never used for cross validation or anything else really depends on how much data you have. If you have a ton of data, sure, hold out some of it to be a special testing-dataset. If you don't have a ton of data, then you just estimate the models performance based on how it did in the cross-validation.
This was actually the very first XGBoost video. XGBoost uses unique trees and to understand why, you have to know everything about normal regression trees. This information was originally in a XGBoost StatQuest, but it was too much, so I made it a stand alone video. That being said, the next video I put out, in the next two weeks or so, will be on XGBoost, then the next and the next, etc. XGBoost is a huge algorithm with a lot of parts. I've got 3 videos worth of material so far and I've only scratched the surface. I'm expecting to have at least 4 XGBoost videos, maybe more.
@@statquest Thanks so much for the reply. I've seen all your tree-based videos. For the last two days, I've been reading about XGBoost all over the internet and it was quite difficult to grasp the whole picture of XGB. I genuinely thought it would be a lot easier and more intuitive if it were explained by you. Appreciate all your work. Couldn't be more excited for the future XGBoost videos. BAM!
I want to ask a question. Do we use the same alpha values for the trees built on the training data as the ones we found using the tree built on the full data? Is it possible, since trees for two different datasets may differ, that the alpha values we chose from the full-sized data aren't right for the training trees? By "not right" here I mean maybe those alpha values don't reduce the tree score after pruning the training tree, and we need to use other alpha values. Also, why exactly we need to train our tree on the full dataset, and choose that model, is not clear.
Yes, it's possible to get different trees, but if the dataset is large enough, it shouldn't be a problem. To see how this is done in practice, see: ua-cam.com/video/q90UDEgYqeI/v-deo.html
Great video! I'm trying to implement this from scratch now, and I have a few questions: at 12:12, "now we will increase alpha until pruning leaves gives us a lower tree score". So, do we precompute all of the trees first (I have L trees if I start with L leaves?), and then continue to increase alpha by some increment until the tree in question has the lowest score? If I'm wrong, maybe somebody could clarify? Thanks so much, and thanks for the video!!
To be honest, I'm not sure of the best way to implement this. However, my guess is that when you build a tree, you can keep track of how much the SSR decreases with each additional branch. Then, when you increase alpha, you can compare it to those stored SSR values and use that to decide when to prune. I believe that this would be easier and faster than building a bunch of trees at the very start.
@@statquest oh, that's a good point. So every single node can have an SSR that is calculated when building the tree (stored in each node, leaf or not), and then finding the best subtree for the same training set & some fixed alpha can be done with just one more traversal, recursively. I will give this a try! 🙂
@@rkotcher Bam! Let me know how it goes. Also, if you have An Introduction to Statistical Learning in R (it's a free download), they may have pseudo code you can use as a guide.
@@statquest Thanks for the reference! I'll have to check it out. You mentioned to say how it goes - I had two issues calculating SSR cleanly during tree construction (my implementation is recursive, so right off the bat this might not be optimal). The first issue is that, because you must calculate SSR after the current node is attached to the parent node, you've already finished building the subtree so the results will include that subtree. The second issue is that, because you always build either the left or right subtree first, calculating SSR might mean that you're missing distant relative nodes (aunt/uncle nodes, for example). I believe (?) this is different than what we want, which is actually to calculate SSR for all subtrees constructed from removing exactly one node. Anyway, if somebody stumbles upon this and agrees/disagrees, I'd be curious to hear :) I ended up calculating SSR & tree score in a second pass.
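For anyone attempting the same from-scratch implementation, here is a minimal sketch of that second pass over a hypothetical Node structure (it assumes every internal node has both children): store each node's SSR-if-collapsed during construction, then recurse once to get subtree SSRs, leaf counts, and the crossover alphas:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    ssr_as_leaf: float          # SSR if this node were collapsed to a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def subtree_ssr_and_leaves(node: Node):
    """Second pass: total SSR and leaf count of the subtree under node."""
    if node.left is None:       # a leaf (internal nodes have both children)
        return node.ssr_as_leaf, 1
    l_ssr, l_n = subtree_ssr_and_leaves(node.left)
    r_ssr, r_n = subtree_ssr_and_leaves(node.right)
    return l_ssr + r_ssr, l_n + r_n

def weakest_link_alpha(node: Node) -> float:
    """Alpha at which collapsing this subtree lowers the tree score."""
    ssr, n_leaves = subtree_ssr_and_leaves(node)
    return (node.ssr_as_leaf - ssr) / (n_leaves - 1)
```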
Great explanation. I really enjoyed the video, but I'm a bit confused. Why use all of the data to build the initial tree in step 1? If we do that there is no test data left for testing the tree.
NOTE: To apply this method to a classification, replace SSR with Gini Impurity (or Information Gain or Entropy or whatever metric you are using).
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
I read some reference books like Introduction to Machine learning and Hands on Machine learning, but I didn't find any details about Decision Trees, or lets say other methods too like you covered!! Could you suggest some more references to get more deeper understanding?
I really wanted to try out the offline material too, but I'm still a student and can't afford it
And, Thanks a lot, this is really the best place for me to learn about machine learning as my first source, the explanations are both deep yet still accessible to a beginner
@Pradeep Tripathi We use predicted and actual classes to calculate gini. For details, see: ua-cam.com/video/_L39rN6gz7Y/v-deo.html
This is the best explanation of regression trees that I could find online. Professors are always too mathematical and programmers are too practical. You're explanation is juusssst right. Thanks a bunch for this!
Thanks!
Josh: you have truly cracked how to use technology(slides/basic animation) to change the way we are teaching for decades. I wish all universities take a note from you and revise the way they are teaching
Thank you very much! :)
yes, 100% true, lot of knowledge sharing just with simple visualisation, as you mentioned, way of teaching matters a lot.....Tks to our MASTER Josh Starmer once again for this awesome video/content !!!
@@statquest I can confirm this. I am pursuing a Masters Degree in Data Analytics Engineering. And I have this course that is giving me a headache: Statistical Modelling.
Your videos have helped me a lot, this is the the perfect stuff I was looking for. BAM!!!! BTW, I saw the video where you explained how your pop helped you and the StatQuest community in general, I totally get it why your videos are perfect. Double BAM!!!!!
Seriously man, thanks a ton.....
seriously yes! I'm taking an online course from MIT....brilliant faculty but just oh so removed from the every day regular experience of learning as a student. Their slides, and teaching methods leave much to be desired. I don't understand anything they are teaching in stats or ML but Starmer is saving my life atm.
The channel with the lowest Gini score of likes vs. dislikes
You get a DOUBLE BAM for that comment. Funny! :)
sir i am learning ML from your videos and everyday i am forced to comment expressing the beauty with which the concept is explained..and the best part is you still clear our doubts even after 3 years..for those who don't know sir has also written a book which is too good
Thank you very much! :)
Sir can i connect with you on linkedin
@@pratyanshvaibhav Linkedin limits the number of connections anyone can have, and I've hit that limit.
Okay sir
I searched lot of thing for my project on ML to start from scratch.
Then i landed here
You nailed it. 🔥🔥🙏
Now i am on edge of completing my project
thank a lot
Awesome! Glad my videos are helpful. :)
Absolutely brillian videos!!! I watched everything from the 1st one to this one in the list and understood so many things that I never understood in schools. I love your videos so much!
Awesome, thank you!
The good thing about his videos is you just have to watch any video once and the concept will not leave you for a long time.
Awesome! :)
lol, that intro one who watches friends and Stat Quest would get it!! love your content its the best machine learning tutorials available
Thank you very much! :)
Re watching after practising I can even further appreciate the quality of your explanations, thanks Josh :)
Thank you very much! Good luck with your practicing. :)
I don't know why I spend a lot of time googling if I always end up watching statquest haahahha
BAM! :)
What a beautiful content!
I'm not an English speaker, but His video is more helpful than the Korean lecture provided by the college I attending.
bam! :)
I love the way you say "BAMM"!!! Gives great relief during the video :) I want to say your style of teaching is great. The way you are explaining is making very easy for us to understand. In my opinion I can say "A difficult subject with easy to understand using your video lectures!" Thank you very much.
Thank you! 😃
This is awesome. I remember I was a bit confused when I was reading tree based methods in An Introduction to Statistical Learning. This really helps me understand it much easier when I can visualize it other than read some formulas. Thank you!
Hooray! I'm glad the this video is helpful. :)
I am reading the same book now and without this video series it's impossible to understand.
Thanks, Josh Starmer. The way of using train + test data to find a list of alpha, then use K-fold CV on train data to find out the optimal alpha leads to the data leakage.
Noted
Best video on pruning and tree selection till date!!!!!
Wow, thanks!
Oh my Buddha!
I'm falling in love with funny of your voice when you're explaining.
Before I met your channel, my head is spinning round and round.
I don't know what to do with my learning, but you came in and took me by big surprise.
You made the abstract concept to be simple!
Thank you!!!!! :)))))))
Wow, thank you!!!
Ahhh Phobeee from Friends aka Smelly Cat. Haha good one Josh.
Thanks! :)
At 12:20, let's say the right-hand side of the tree had node instead of just a leaf, and that node led to two leaves. In this situation, what would be pruned first to created the pruned comparison tree: prune the two leaves at the bottom of the left side only first because it's deeper? or the two leaves at the bottom of the right only first? or prune all ends with two leaves at the same time?
We always remove the leaves that result in the smallest increase in SSR.
Excellent explanation, you are the master in teaching, thank you so much much for your valuable effort
Many thanks!
Liked and Commented to help you with the UA-cam algorithm.
Triple bam! :)
Thanks so much for your video.
I've watched 8-times and have still one question.
It is about 13:17~14:08
Could you possibly explain more details about the sentence
13:17 “Use the alpha values we found before to build trees(full and sub) that minimize the tree score”?
My questions about this sentence are
1. How can we use alpha in building trees process?
- I thought the way we build trees(full and sub) is the same as how we did in your video ‘regression tree’
*I understand tree score, alpha, and how alpha plays role in changing the tree scores of different size trees(full, sub trees)
my question about the role of alpha(from whole data) in creating trees..
2. If we can build trees with the same way as we did in ‘regression tree video’, why do we need this process of 13:17~14:08?
The idea is that when alpha is 0, we build a full tree just like in the original regression tree video. Then we increase alpha to the first value and that causes us to prune that tree a little bit. Then we increase alpha some more and then we prune the tree some more, until we've used all of the values from alpha we identified earlier.
At 12:33, it reads on the video that "we increase alpha again until pruning leaves will give us a lower tree score". My question is lower than what? My understanding is lower than tree score of the full sized tree at that specific alpha value, in this example, at alpha=10000, because we have already established the tree score of the full sized tree at alpha = 0 has the smallest tree score. Similarly at 12:43, the third tree at alpha = 15,000 is chosen because it has lower tree score than the second tree at alpha = 15000. Please let me know if this is correct. Thanks.
Yes
Thanks, Josh, every time I watch your video, I feel like the concept is very easy to understand! lol
Happy to help!
There are some notifications... Right when it shows on ur phone... ur feelings says.. I going to learn something today...
MEGAAA BAMMMMM
Hooray!!!! :)
Thank you very much
At 12:05 you said that full tree has lowest tree score when alpha = 0.
By default, pruned trees have higher SSR. Won't increasing alpha increase their tree scores even further, instead of lowering them?
Increasing alpha increases the scores of all of the trees (it never lowers the scores). However, remember, we multiply alpha by the number of leaves in a tree. So a large tree will get a much larger penalty (alpha * number of leaves) than a smaller tree.
@@statquest Actually I just figured out earlier that we have to increase the alpha on the previous tree structure to see the comparison. I thought we only need to increase the alpha only on pruned trees (to make it smaller).
Thanks for the reply.
@@statquest Hi Josh, I am a beginner in learning data analytics. First I have to say our professor uses your videos in class and they really helped in understanding the course materials! As for the problem mentioned by Bobby, I also had problem with it until I saw your interaction here. I was stopped by "increase alpha until pruning leaves will give us a lower tree score."My problem was the same as Bobby. I thought increasing alpha only increase the tree scores and where should I start pruning leaves? The alpha and T get fuzzy in my mind. However, when i think "increasing alpha only increases the tree scores, i was comparing with just the full tree. Therefore the increasing of tree scores seems to have no stop. While actually, the stop is the SSR of the next tree where the leaves got pruned. At that point, the alpha is the knot which will be assigned to the next tree. (I hope I am right. Please correct if not.) Having this in mind, when I look at your words "increase alpha until pruning leaves will give us a lower tree score.", I understand it completely and I think it's very accurate. I would recommend to have some explanation here or make it more obvious in arranging of the graphical illustrations in the video. very lengthy comments. I was trying to organize my thoughts. Thank you again for wonderful videos!
@@graceqin5024 I'm glad you were able to understand it! BAM! :)
Thank you very much for this video! I really enjoyed the full step-by-step process of building the various trees using different alpha values and the use of cross validation to select the best alpha!
Double bam! :)
This video really helped me to clearly understand the concept. Thank you for this good work
Awesome! :)
Damn. I was finding only scientific articles and I was having trouble understanding the CCP, now you made it very clear! Thanks.
Hooray!
One good idea for cross-validation maybe is we can split data first to train and test and again split train to train and validation sets. Therefore, we can guarantee that our test set is totally new to environment. This will result in more realistic scores.
yep
Was looking for this comment.
I'm 50% here for stats and 50% here for the sound effects!
double bam! :)
BEST CHANNEL! no i m not just saying, i m shouting!
BAM! :)
Your intros make me smile :)
Thank you! :)
Thanks!
TRIPLE BAM!!! Thank you so much for supporting StatQuest!!! :)
i'm curious as to how we find sensible alphas. there seems to be an explanatory gap here since in the section "comparing pruned trees with alpha" it says (in the NOTE at 8:23) that we find it during cross validation, but in the cross validation section at 13:18 we are supposed to "use the a values we found before". sensible alpha values would probably vary widely depending on the SSRs of the trees (and ultimately the variable ranges) and even more so once we do classification since gini and entropy give small values for which alphas of 10k would not work usefully. surely simply guessing various alpha levels and finding which of the guessed ones work best in cross validation is not the best method or am i misunderstanding something here?
We use all of the data to find candidate values for alpha (this is demonstrated at 11:18). Once we have candidate values, we test each one with cross validation to find the optimal value for alpha.
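For anyone who wants to try this, scikit-learn exposes the first step (finding candidate alphas from a full tree) directly via cost_complexity_pruning_path. Here is a sketch using made-up dosage/effectiveness numbers as stand-ins for the video's data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-ins for the video's dosage vs. effectiveness data.
X = np.array([[3], [5], [8], [12], [15], [18], [24], [29], [33], [38]])
y = np.array([0.0, 5.0, 22.0, 100.0, 100.0, 98.0, 55.0, 28.0, 5.0, 2.0])

# Fit a full tree and extract the candidate alphas: each value is the point
# at which pruning one more branch starts giving a lower tree score.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)
```

Each candidate alpha is then tested with cross validation, as described above.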
love the reference to Phoebe ! also thank you, all your videos are very helpful
Glad you like them!
Your lectures are awesome. I like the way you explain things. I have read your book too, and it's brilliant. I would just like the PDF of this lecture. Can you share it?
I'm glad you like my stuff! Unfortunately, I can't share PDFs of my work.
Hi Josh,
I don't get this part at 12:10: "increase alpha until pruning leaves will give us a lower tree score".
What does "lower tree score" mean? Lower than the score from the previous round?
The first (full) tree uses an alpha of zero. The second tree, pruned, will have a higher SSR, hence a higher Tree Score. A non-zero alpha will only increase the Tree Score further.
So, how is it getting a "lower tree score"?
Thanks,
Kar Ann
---
Notes for my own future reference:
*WHY PRUNING*
To avoid overfitting, a less-than-full tree is needed. The pruning procedure uses a tree score that balances the error against the number of leaves.
*OVERALL STEPS*
1. Create a set of Pruned Trees. Use the Tree Score to decide how much to prune. The trees should have comparable "quality" in terms of error and number of leaves.
Tree Score = SSR + αT
2. Then, apply different subsets of the data (via cross validation) to all the Pruned Trees.
The Pruned Tree that gives the lowest SSR (on the testing data set) is the winner.
*DETAILED STEPS*
Tree Score, TS = SSR + alpha*T (T = no. of leaves)
TS is used to decide how much to prune, and to create different trees.
*1. Create a set of Pruned Trees, which have comparable quality in terms of error and number of leaves*
Create the 1st trial tree. The fullest tree. Tree1.
α1=0. Get TS1.
Create the 2nd trial tree. A new tree. A pruned version of Tree1. Tree2. α2=a new value bigger than 0.
Compare the Tree Scores of the old (full) tree and the new tree (of smaller size, i.e. pruned).
Even though the newer (pruned) tree has a bigger SSR, it has a lower T.
At α = 0, the newer tree has a higher TS than the old tree.
But the new tree's TS increases more slowly than the old tree's, because the old tree has a higher T.
Eventually, at a certain α2 value, the newer tree has a lower TS.
Get TS2, using the new α2.
Create the 3rd tree. A pruned version of Tree2.
Get TS3, using the new α3.
Repeat and create more pruned trees, each with its own α value.
Now there is a set of trees.
The first tree is the full tree.
The last tree is only one leaf.
All trees have comparable quality, based on a fine balance between error and number of leaves.
*2. Apply different sets of data, via cross validation, to all trees. The Pruned Tree that gives the lowest average SSR is the winner*
.Create different values of alpha from the above, e.g. α1 - α4
.Use different sets of training and test data, i.e. 10-fold cross validation
.From the 10 folds of training and testing data, get the SSR for each of α1 - α4
.Get the SSR of each α, averaged over the 10 sets of training and testing data
.Pick the tree, and its associated α, that gives the lowest average SSR
The fullest tree, though it gives the lowest SSR with the training data, doesn't give the lowest SSR with the test data, due to overfitting.
Cross validation also helps to minimise the impact of overfitting.
The tree score is the SSR + alpha*number of leaves. We increase alpha until it makes sense to reduce the number of leaves in order to get a lower tree score. Although reducing the number of leaves will increase the SSR, if alpha is large enough, it can overcome that increase. For example, if a full tree has 10 leaves and SSR = 10, and the same tree with 9 leaves has SSR = 20, then when alpha = 11, the tree score for the tree with 9 leaves will be smaller than the tree score for the tree with 10 leaves. In other words: 10 + (11 * 10) > 20 + (11 * 9) -> 120 > 119
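That inequality is easy to sanity-check in a couple of lines:

```python
# Check the example above: full tree (10 leaves, SSR = 10) vs. pruned tree (9 leaves, SSR = 20).
alpha = 11
full_score = 10 + alpha * 10   # 120
pruned_score = 20 + alpha * 9  # 119
print(full_score > pruned_score)  # True: once alpha > 10, pruning wins
```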
@@statquest I got it now. Thanks! But one more question....
Why "We increase alpha until it makes sense to reduce the number of leaves in order to get a lower tree score"?
Doesn't it make better sense to say "We reduce the number of leaves and increase alpha until a lower tree score"?
@@statquest Thanks Josh
@@statquest Sorry, Josh,
Re-watching it... and more questions sprung up.
Q1) At 13:17: "Using the training data... use the alpha values we found before to build a full tree and a sequence of subtrees that minimise the Tree Score"
Why do we need to build trees again? Weren't they already created in the previous steps, when the alphas of subtrees were evaluated?
Q2. The objective is to apply different folds of data to find out which subtrees give the lowest average SSR. Why is alpha needed here again? Alphas were used in Tree Score, to create the subtrees. But SSR doesn't need alpha, right?
@@karannchew2534 1) The first set of trees were built with all of the available data (not including a validation dataset if you have one), the subsequent sets of trees are built with different subsets of all of the available data and this allows us to pick the best value for alpha using k-fold cross validation. 2) Correct. Ultimately we pick the tree with the lowest average SSR after cross validation.
So good! Consistently high quality across videos and time! Keep going. Many thanks!
Thank you very much! :)
Explanation is TRIPLE BAM!!!
Hooray! :)
Thank you Josh for your very easy-to-understand explanation & lovely rhythm
Thanks! :)
This video is awesome! Why didn't I know StatQuest earlier, it really helps, THX!!!
BTW: I'm confused about the building of trees at 13:19. How could we know that it's the bottom 2 leaves that should be cut for a new training set when α = 10000 (pre-calculated)? Is that just a coincidence? Or does the cut depend on which leaves give the lowest Tree Score?
My understanding of the pruning process is this:
①use the whole data set to build a full-sized tree, then increase alpha from 0 to get different sub-trees that are pruned and have lower Tree Scores, each corresponding to a different alpha
②build a new training set and testing set from the whole data, then build a full-sized tree, use the alphas we found before to build sub-trees, then calculate the SSR on the testing data
③repeat ② until we have done 10-fold cross validation
④for each iteration, choose the alpha that has the lowest SSR
⑤calculate the average among all the alphas in ④ to get the final value
Did I get it right?
You're correct about everything except 5. We don't calculate the average of the alphas, we calculate the average of the sum of the squared residuals for each level of alpha, and select the level of alpha that corresponds to the lowest average sum of the squared residuals.
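Continuing the earlier sketch (and assuming the same hypothetical X, y, and path.ccp_alphas), the selection might look like this: for each candidate alpha, average the testing SSR across folds and keep the alpha with the lowest average.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

kf = KFold(n_splits=10, shuffle=True, random_state=0)
avg_ssr = {}
for alpha in path.ccp_alphas:  # candidate alphas from the full-data tree
    ssrs = []
    for train_idx, test_idx in kf.split(X):
        tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
        tree.fit(X[train_idx], y[train_idx])
        residuals = y[test_idx] - tree.predict(X[test_idx])
        ssrs.append(np.sum(residuals ** 2))  # SSR on this fold's testing data
    avg_ssr[alpha] = np.mean(ssrs)

best_alpha = min(avg_ssr, key=avg_ssr.get)  # lowest average SSR wins
```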
@@statquest Thx so much for your reply and correction! I think I get the process and concept a little bit now, but I still need more time to fully understand it, and your videos are just the most helpful, vivid, and clear! Thx again for all the efforts and sharing!
I'm moving to SVM right now hahaha
@@statquest and BAAAM!
@@statquest I also have the same doubt; can you please answer the original question? "How could we know that it is the bottom 2 leaves that should be cut for a new training set when α = 10000 (pre-calculated)? Is that just a coincidence? Or does the cut depend on which leaves give the lowest Tree Score?"
Some questions here! Hope they get explained.
1. At 14:14, why calculate the SSR, not the Tree Score?
2. At 14:34, when making a new tree based on new training data, is there evidence that using the previously obtained alpha values always eliminates the leaves?
1. Because we're simply interested in how the tree performs with the testing dataset.
2. Not that I know of.
@statquest
Then, are the pictures of the trees at 14:34 just arbitrary? Since the training sample is different, the trees made from the new data samples may not be the same shape as the tree before.
@@NICKNAME_NONE-r7n That is correct
The Best! The Best! The Best! video I've ever seen about Tree Pruning.
Thanks a lot. Now I got the concepts.
BAMM!
Hooray!!! Thank you very much! :)
Best intro of all the StatQuest videos I have seen 😍
Hooray! :)
Cost Complexity Pruning is otherwise known as post-pruning, while limiting the tree depth, enforcing a minimum number of samples per leaf/split, or a minimum impurity decrease is known as pre-pruning, correct? My question is, can you apply pre-pruning first, and then apply post-pruning to the pre-pruned tree? If yes, then I assume that the alpha parameters will be found from the pre-pruned tree, right? Not the initial full tree? And then cross validation will also be performed with the pre-pruned tree, not the initial full tree, to determine the final optimal alpha value?
And on a separate note, should only the alphas obtained from the tree trained on all the data be used when cross validating, and why? Is there no chance that some other, random alpha might result in better performance? Considering that cross validation is done on several different test/train splits, and there will be splits that do better with one alpha and splits that do better with other alphas, doesn't it make sense to try all possible alphas (from 0 to infinity) in the cross validation, not only those that give the best tree scores for the full tree? Isn't there a chance that some other alpha will give, on average, a lower sum of squared residuals than those obtained from the full tree?
I believe you are correct about how pre-pruned trees are used. And this is just how the algorithm is spelled out - possibly to keep the running time down.
As always the clearest explanation! Thank you so much 😊 and BAM BAM BAM
Thank you! :)
Love the song in the beginning!!
Bam! :)
How are new trees built by imposing the previous alpha values? Maybe it is not possible to keep finding smaller trees with reduced Tree Scores for fixed alphas. Before, alpha was a parameter, and during cross validation it is a constraint :(
(Making notes for my own future reference)
Tree Score, SSR + αT, is used to create a set of Pruned Trees.
Then, apply data (via cross validation) to all the Pruned Trees.
The Pruned Tree (and its α value) that gives the lowest SSR (on the testing data set) is the winner.
Noted
A quick question, please. In this example, the tree is not balanced, in the sense that the right subtree is a lot deeper. What if the left subtree were as deep as the right subtree? Then how would we choose which side of the internal node to collapse, or in other words, which side's leaf nodes to delete? Based on what is shown at 12:33, we should go with whichever gives us the lower tree score, right?
Correct
Hi Josh, I hope you answer my question; I've been searching for 3 days and got nothing.
I have 2 problems:
1_ How do we determine alpha when there is more than one leaf at the bottom of the tree? (i.e., you said to increase alpha until pruning a leaf gives a lower score.) If I have more than one leaf in the last level of the tree, which one should I cut? Or should I look at all the subtrees every time I increase alpha? It seems like that would get very computationally complex.
2_ In implementation, when I give the model the ideal alpha to build the decision tree, how will the model know, at every step it takes while building, that it will lead to the subtree related to this alpha?
Finally, you are amazing; I really enjoyed every lesson I took from this channel.
1) You remove the leaf that results in the smallest increase in SSR.
2) You build the full tree, and prune just like we did before.
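A tiny sketch of point 1, the "weakest link" idea, with hypothetical node names (the SSR increases are just the figures quoted elsewhere in this thread, used for illustration):

```python
# Among all prunable branches, collapse the one whose removal increases
# the SSR the least (hypothetical node names and numbers).
candidates = [("node_A", 543.8), ("node_B", 5494.8), ("node_C", 19243.7)]
weakest_node, ssr_increase = min(candidates, key=lambda c: c[1])
print(f"prune {weakest_node}: SSR only goes up by {ssr_increase}")
```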
@@statquest
Thank you Josh, I really appreciate your taking the time to answer my questions.
Really love your content. Must-watch material for any learner.
Thanks!
Just a random thought: what if we prune the tree directly based on the SSR computed on the validation set, instead of adding a penalty? Either way, the tree that works well on the validation set is selected. Why are we adding a penalty? Does it help control fluctuations in the number of leaves selected across cross validation folds during hyperparameter tuning?
Hey Josh, how can we choose our alpha wisely, so that the tree with the minimum tree score will really work well on testing data too? Is there a specific rule of thumb?
I give a practical tutorial on building trees with real data here: ua-cam.com/video/q90UDEgYqeI/v-deo.html
@statquest Can you reconcile what you said at 10:41, 15:05, and 15:35? If we are picking the final pruned tree at 15:35, what are we doing at 10:41 with the tree score? I thought cross validation was just to pick the optimal value of alpha (15:05), which is then used, as at 10:41, to pick the final pruned tree. Any clarification would be helpful.
At 10:41 we identify values for alpha that we will then test with cross validation at 15:05 and later.
The way you explain, and the amount of effort you put into the videos, is great. I have learnt a lot from you, sir. I always feel so positive and motivated while learning from you. Thanks a lot 🎈
Thank you! :)
this guy is a living legend ❤
Thank you!
Hey, Josh! Is it OK that we are using train+test data to find the alpha values? I mean, aren't we peeping into the future? Do we only get good thresholds by using the test sample (not only the train sample), or am I wrong? Thank you
You can set aside a set of data for final validation.
@3:35 When you remove the leaves, how do you change the decision rule of the parent node?
@11:55 When you build the tree from the full data set, how do you get the SSR? I mean, what is the input to get it?
1) The parent node reverts to the decision it made before the branch was added. In this case that means the average of the drug effectiveness for all values with dosage >= 14.5
2) Calculating the SSR is described in the video that explains Regression Trees here: ua-cam.com/video/g9c66TUylZ4/v-deo.html
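For reference, the SSR calculation itself is tiny. A sketch for a single leaf, with made-up effectiveness values:

```python
import numpy as np

# A leaf predicts the mean of the training observations that land in it;
# its SSR is the sum of squared differences from that mean.
effectiveness = np.array([0.0, 5.0, 2.0])  # made-up observations in one leaf
prediction = effectiveness.mean()
ssr = np.sum((effectiveness - prediction) ** 2)
# The whole tree's SSR is this quantity summed over every leaf.
print(ssr)
```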
Thank you for the video, I just have one question. We use pruning to figure out the optimal number of leaves for the tree we build using the original training data. The purpose of the original validation/testing data was to validate the performance of this tree, as we did not use that data in the creation of said tree. Now, in the pruning process, the original testing/validation data is used in conjunction with the original training data to determine alpha. This means that the original validation data had a large impact on the tree we created from the original training data and is therefore no longer independent of it. How is this procedure justified, and would it be necessary to get a third, independent data set for a true validation of the tree's predictive performance?
Often people reserve a 3rd set of data only to be used after pruning for true validation.
The term "testing data" is used a bit misleadingly here. You usually have 3 sets of data: Training, Validation, and Testing. The Validation data is used to optimize the model (in this case, the pruning). The test data is separate and should, as you said, be completely independent from the training data.
I've noticed that quite a few people use 'validation' and 'test' data synonymously, when they are in fact two different things.
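In code, that three-way split might look like this (a sketch, assuming feature and target arrays X and y):

```python
from sklearn.model_selection import train_test_split

# Hold out a final test set that the pruning never sees...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the rest into training and validation for tuning alpha.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
```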
It is such a great explanation. Super helpful! Clear and fun! Really appreciate your time in making the video. Thank you!
Glad it was helpful!
Hey Josh, Great video ! Quick questions: for classification tree, do we simply replace SSR with gini impurity and follow similar steps?
Yes.
you are literally doing god's work
Thank you!
Thank you for the amazing content Josh!
I had a doubt: do we change the alpha and the number of nodes simultaneously? I thought we change just alpha and then check which tree performs better (keeping one thing constant and changing the other gives us a better way to compare). Also, at 11:32 you used the entire dataset (train+test) to build the model and find different values of alpha, and then used those values on the train split to build the model, compare test scores, and finally find the best value of alpha. If we use the train+test data to find alpha, isn't that causing data leakage?
Also, why do we compute the alpha values beforehand? Why don't we do it the usual way (the way we do in ridge regression), where we find the optimal value of the regularization parameter using GridSearchCV? After all, alpha is analogous to a regularization parameter (I think).
1) We find values for alpha to result in better performance after pruning.
2) We can keep some data set aside for validation.
3) Perhaps for performance reasons.
@@statquest Sorry, I don't get your answer for #1. Can you please elaborate? Why do you find alpha using test+training data? If you just use the training data to find alpha, you will get an alpha that results in good performance, but is it better if we use both the training and testing sets? Can you please confirm?
@@shawnkim6287 Presumably using more data results in better performance.
@@statquest Thank you. Can you please confirm what I understand from this video?
You pick a number of alphas from a set which combines both the training and test sets. First, you have the full (untouched) tree, where alpha is 0. Then you increase alpha until pruning leaves gives us a lower tree score, and you keep increasing alpha until pruning additional leaves gives us a lower tree score. Long story short, a set of alphas is derived from the (training and testing) sets.
You train the models using the set of alphas we calculated before.
We then test the trained models, picking the optimal alpha value by selecting the lowest SSR in the first fold, which is the testing set. You pick the next optimal alpha value by selecting the lowest SSR in the second fold as the next testing set. If you are using 5-fold CV, then you have to pick 5 optimal alphas on 5 different testing sets. You just select the alpha that appears the most.
can you please verify if what I understand is correct? Thank you, Josh!
@@shawnkim6287 That seems reasonable to me. If you want to double check, you can find the algorithm in The Introduction to Statistical Learning in R (which is a free PDF if you look for it).
Thank you so much for this amazing video. Very Amazing!
Thank you!
Thanks for the great video. One question though: why is the full-sized tree built from all the data (see 11:25) and not just the training data? Couldn't this potentially give problems w.r.t. leakage?
You can always create a validation dataset and hold onto that until the end.
I thought as much but I was a bit astonished that it was emphasized to use all data. Thanks for the clarification.
Regarding choosing α (starting at 11:19) when we fit a new regression to the FULL data, does this not cause us to 'overfit' α to some extent? I was wondering what would happen if we did the following:
i) Split the data k different ways into a set that we find α for and a set that we ignore.
ii) On the set that we find alpha for, we get [α11, α12, α13] (the first set of α's such that we get better tree scores cutting the tree by 1, 2, and 3 levels respectively.) up to [αk1, αk2, αk3].
iii) We then take the average α for each cut so [(α11+α21+...+αk1) / k, (α12+α22+...+αk2) / k, (α13+α23+...+αk3) / k] as our set of final α's.
iv) Perform K-Fold Cross Validation using the above to see which α gives the lowest SSR for its optimal tree.
Would my method make little to no difference? Or is my method overfitting more in some sense? Let me know what you think!
It seems reasonable to me.
In the video, you selected 4 values of alpha, i.e. 0, 10k, 15k, and 22k. How did you come up with those exact alpha values? In the real world, the dimensions of the data will be huge, so how do we decide the values of alpha?
In the real world, you keep track of the sum of the squared residuals at each point in the tree. You can then figure out what values for alpha work directly from that information.
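Put differently, every internal node has an "effective alpha": the penalty at which collapsing its subtree breaks even. A sketch of that calculation (the SSR figures are illustrative, borrowed from a comment further down this thread):

```python
# The "effective alpha" of an internal node: the alpha at which collapsing
# its subtree to a single leaf gives the same tree score as keeping it.
def effective_alpha(ssr_if_collapsed, ssr_of_subtree, leaves_in_subtree):
    # Collapsing removes (leaves_in_subtree - 1) leaves and raises the SSR;
    # break-even is the SSR increase per leaf removed.
    return (ssr_if_collapsed - ssr_of_subtree) / (leaves_in_subtree - 1)

# Illustration: collapsing a 2-leaf branch raises the SSR from 5494.8 to 19243.7,
# so that branch becomes worth pruning once alpha exceeds 13748.9.
print(effective_alpha(19243.7, 5494.8, 2))
```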
Hi Josh, Thanks for this GREAT video!!! Just wanted to ask what's the principal to choose several alpha as the starting values? ie. in the video, you choose these 3 alpha values as the candidates to determine the final alpha: 10,000, 15,000, 22,000, how did you come up with these three alpha?
I started with alpha =0. I then increased alpha until I got a lower SSR + alpha score by pruning a branch. That was my second value for alpha. I then increased until I got a lower SSR + alpha by pruning another branch...etc.
@@statquest Thanks! Make perfect sense to me. Hooray! : )
@@statquest Thank you Josh! Really nice videos! I am looking at most of them! I rank them by the total BAM count of each video :)
One thing I don't understand is: in the example above the full tree and the tree with one pruned leaf would have an equal tree score at alpha = 5494.8 - 543.8 = 4951.0, whereas the trees with one and two pruned nodes would have an equal score at 19243.7 - 5494.8 = 13748.9. So instead of choosing a candidate alpha of 10000 as you did in the very first step, you could have chosen any value 4951.0 < alpha < 13748.9.
Would perhaps the best value to choose for a candidate alpha then be the mean of 4951.0 and 13748.9, i.e. (4951.0 + 13748.9)/2 = 9349.95, rather than 10000?
@@fredrikedin8880 What time point, minutes and seconds, are you asking about?
@@statquest Sorry for my unclear question. Between 12:00 and 12:54 you describe that you increase alpha until pruning one leaf renders a lower score. So the border between the full tree and the tree with one pruned leaf is at 10000 here, and the border between the trees with one and two pruned leaves is at 15000. Then couldn't the candidate alpha for the tree with one pruned leaf be set to 12500, i.e. the middle of the interval where this subtree is optimal, rather than at the border to the full tree?
Why do you need to create the tree with the full dataset (training+test, at 11:25)? Isn't it enough to use the trees created with the training data during the k-fold?
We use all of the data to identify reasonable values for alpha. Once those are identified, we use cross validation to identify the best value for alpha.
@@statquest I think I got confused because you first used the full dataset, then the k-folds, and later the full set again.
I think I can first do the k-fold to choose the best alpha, and later prune the trees with the full dataset and that best alpha. Isn't that right?
yep
@@statquest However, in theory we do not know the test set. Do you mean train and validation set maybe?
@@dumtektek5141 I mean "whatever data you are using to build the tree".
Teachers all over the world, must learn from josh bro!
Thanks!
As always awesome! Thank you Josh!!! Horraaaayyyy
Thanks! :)
at 14:19 I think it should be lowest tree score, instead of lowest sum of squared residuals. Correct me if I am wrong.
It is the lowest Sum of Squared Residuals with the Testing Data, not the lowest tree score. In other words, when alpha = 10,000, we have a tree that does not over fit the training data (which would result in a larger SSR for the Testing Data) or under fit the training data (which would also result in a larger SSR for the Testing Data).
Thank you very much Josh. Could I ask a question: what if we throw away the alpha, and just use the three sub-trees and the full tree to run the cross validation and find the optimal tree structure with the lowest SSR? I mean, deciding the best alpha is essentially the same as finding the best tree structure. For example, at 15:16 in the video, alpha = 10000 gives the lowest SSR; I can also say the three-leaf-node tree gives the lowest SSR, and then I just choose the three-leaf-node tree as my final decision tree. So it seems meaningless to find alpha.
I believe we use alpha because we use it to prune a full-sized tree, rather than build sets of smaller trees, like you are describing. In other words, we don't set out to build a 3-node tree - we set out to build a full-sized tree, and then prune according to various values of alpha. Since we are using Cross Validation, and thus have a different training dataset each time, the same value of alpha may not always result in a tree with the same number of nodes.
Wonderful explanation. Thank you very much
You are welcome!
Hello,
first of all thanks for the great material you produced and shared, certainly among the clearest and effective I've come across.
My questions are about the cross-validation trees to determine the right alpha values.
As a premise, if I understood correctly, we first determine candidate alpha values by :
a) create a "full" tree from the full training+testing datasets
b) produce the corresponding family of "pruned" versions (and I guess assess their SSRs in preparation for the next step) based on the morphology of the "full" tree (meaning, all possible pruned trees are considered - is that correct?)
c) identify the candidate alpha values as those by which the "full" tree's score becomes higher than one of the pruned versions.
Assuming the above is correct, when we move on to cross-validate in order to ultimately determine the right alpha, I understand that we resample a training set (and a corresponding test set) for a number of times.
Each time, we build a new tree from the training set, and its associated set of pruned versions (let me call these trees a "cross-validation family of trees" (CVFTs)), and assess their SSRs based on the test set for the current round, in order to contribute to ultimately calculating the actual alpha to use.
First question: how come every CVFTs in your slides has a number of members that equals the number of candidate values for alpha?
Couldn't a resampled training set give rise to trees with more or even fewer leaves - and corresponding pruned versions - than the tree that was used to identify the candidate alpha values? And in that case, the candidate alpha values might be larger or smaller in number than the possible number of trees in the CVFTs at hand.
I imagine that a possible answer is that the number of members in a CVFTs can actually be different from the number of candidate alphas, and that the pruned trees in a CVFTs are actually identified through their Tree Scores when each of the candidate alpha values is applied -- and if so, I guess the issue is that perhaps this mechanism does not stand out 100% from the presentation...
Second question: if we assess the trees in each CVFTs only by their SSRs, wouldn't always the tree with more leaves (therefore alpha=0) win?
Thanks much
What you wrote for b) "all possible pruned trees are considered" is not correct. When we remove a leaf, we don't just create all possible subtrees with one leaf removed. Instead, we pick the one subtree that, when we remove one leaf, results in the smallest increase in the sum of squared residuals.
@@statquest
Josh, OK, that makes sense -- so this is repeated on each new subtree to produce the set of trees where the candidate alpha values are then formulated as at minute 13:03, correct?
If so, are my subsequent questions still standing?
@@stedev2256 Each time we do cross validation we get a new "full sized tree", which may have a different size than the original. We then use the pre-determined alpha values to prune that new tree and use the test dataset to find out which tree (and alpha value) is best for that iteration.
As for your second question, this is where the "testing" data comes in handy. A full sized tree with the most leaves (and alpha=0) will probably overfit the training data, and thus, do a pretty bad job predicting the testing data. So in practice, the full size tree (with alpha = 0) performs great with the training data (low SSR) but poorly on testing data (high SSR).
@@statquest
Josh,
thanks, I was actually rephrasing / correcting my last post, and clarified a number of things to myself while doing that... I didn't think you could see my second post while I was editing it... sorry.
But it was not all in vain, as what you wrote last confirms what I was getting to while revising my question and in light of your previous answer, and things seem clear now.
Thanks much
@@stedev2256 bam! :)
Are the trees at 11:25 (based on all the data), 13:36 (based on 1st set of training data), and at 14:40 (based on 2nd set of training data), identical on the dosage threshold values in the blue boxes or not? That is unclear to me...
Every time we build a "full sized" tree (with alpha = 0), we fit it to whatever data we are using and find optimal threshold values that are specific to the current dataset. So if we are using all of the data, we fit the tree to all of the data and find the optimal threshold values for that specific dataset. Then, when we are using the 1st set of training data, we fit the tree to that data and find new optimal thresholds for that specific dataset. Then, when we are using the 2nd set of training data, we fit the tree to that data and find new optimal thresholds for that specific dataset.
I'm a bit confounded. Isn't it the case that the more nodes there are, the smaller the SSR (as in the example at 9:30)? If so, how is it possible that a pruned tree with an alpha of 10,000 (14:22) has a smaller SSR than a full tree with an alpha of 0? Or could my second sentence possibly be true only in terms of constructing the regression tree (based on the training data), and not necessarily when using it on the test data? I hope I'm formulating the question clearly enough :P
The full tree will always have the smallest SSR when applied to the training data, because it was built specifically to minimize the SSR for that training data. However, good performance (lowest SSR) with the training data does not mean we will have good performance (lowest SSR) with the testing data, which is a completely different set of data. The goal of pruning is to minimize the SSR when the tree is applied to testing data.
@@statquest
Got it. Thank you :)
This channel is a gold mine, I am telling ya :D
Can you cover the Box-Cox transformation (power transformations)?
Josh, the videos on this channel are nothing short of superb. I have only one suggestion: how about a dark theme for these presentations? That white background is like a supernova, especially on my 55" TV.
Hi Josh, these explanatory videos are incredible. Thanks for the great work!
Thank you!!!
Thanks a lot for this. I came here after getting confused reading this concept from a book. I am inspired by your teaching style. Your style of teaching by examples is the best way to transfer knowledge without losing the audience at any point.
May I ask how much time you spend creating a tutorial like this? Also, what kind of tools do you use to make these videos?
Each video takes a long time - maybe 6 weeks or more. And I talk about how I do everything in this video: ua-cam.com/video/crLXJG-EAhk/v-deo.html
Thanks for the great video, Josh. Quick question: at 15:06, what is the exact meaning of "on average"? Do you mean that simply the most frequently selected alpha during the CV is the final value? Or should I calculate the average SSR over the test sets, something like that?
It means that the same alpha value may not always give you the best tree during each fold of cross validation, but most of the time it does.
@@statquest Thank you for the quick reply and sorry for the unclear question. What I wanted to ask is the way of averaging. Should I count how many times each alpha was the best during each fold of C.V.? And then pick the most frequently selected alpha?
@@onyman8837 yes
Great video, thanks a lot! I just have one question about the cross-validation procedure:
When creating the original sequence of subtrees from the whole big dataset, we get certain discrete alpha values for each subtree (in the video they are 10,000, 15,000, and 22,000). You said we obtain these values as soon as pruning leaves gives us a lower tree score. So the pruning is done first, and then alpha is increased until the tree score is lower than for the unpruned tree. Are we then using these discrete alpha values in the cross-validation procedure? Because then there must be something like a guarantee that these discrete alpha values also lead to pruning in the training subset we are using in the current step of cross-validation, or not?
To be honest, I do not know the exact details of how the cross validation is performed.
@@statquest Thank you anyways!
We love you DOSS.. I hope I too will surely one day be a Patreon member
Your channel is amazing man. Great job
Thank you! :)
So, a question. We learned previously that cross validation is used to test the model on different "blocks" of the data. But in this case you are advocating for cross validation to be used for hyperparameter tuning. Does that mean the test sets remain constant?
A lot of people ask about this, and I could probably have worded things better. The way I see it is that we have all of the data, and we can split that into "all the data we want for training" and "all the data we want for testing". We then build a tree, prune, etc. using all the data we want for training, and test against the testing data.
Josh, great job as usual; however, you just pulled the starting alpha and subsequent values out of the air. How did you come up with these values?
At 8:22 I explain where we get alpha. We use cross validation. That means we try a bunch of values and test each one to find out which is best.
@@statquest ok I’ll take another look. Thanks.
At 14:11, in cross-validation, we are calculating the SSR using the testing data, not the tree score????
The tree score = SSR + alpha * number of leaves. So we calculate the SSR from the testing data and that value is used to calculate the tree score.
Thanks Josh for this beautiful video. I have a question: in the StatQuest on classification trees in Python, you extract the alphas only from the training data; however, in this StatQuest you extract the alphas from all the data (train and test data)?
Unfortunately I was a little sloppy with the terminology here. Because we are applying CV to the data, "train and test" refer to just the data used for training, but split up into "training and testing" sets for CV.
@@statquest Ahh okay, so whenever I want to determine the values of alpha I don't have to use the whole dataset; maybe the training data would be sufficient? But applying CV involves both the train and test sets?
@@marahakermi-nt7lc Well, cross validation splits the data you give it into training and testing sets.
@StatQuest with Josh Starmer, I have purchased your book but I didn't find these concepts (pruning, random forest, adaboost, gradient boosting) in that. Is there a way to access these presentation slides?
Those will be in a future book.
How do we decide the alpha values? From which value should I start testing the tree score? In this video you took alpha = 0, 10000, 15000, 22000; if I have a different data set, then how should I choose the alpha values?
This question is answered at 12:10 in the video. Start with alpha = 0 and increment it until we prune the tree. That's your first value. Then continue to increment it until you prune more etc.
@@statquest But by how much should we increase alpha? If I first take alpha = 0, then what should my second value of alpha be? You took 10,000 in the video; how can we choose that second value?
@@Aman-uk6fw Start with alpha = 1. See if it results in pruning. If not, increment alpha by 1. Repeat until you prune, that's your first value for alpha. etc.
@@statquest Thanks, I get it 😁
Potentially stupid question: Will we still need an additional (outside of the "full data") testing set to test the final (pruned) tree?
That's a good question, and it comes up a lot. You can always use an additional testing dataset, but then you have the problem of deciding which data would make the best dataset. A better way might be to estimate how things will perform by looking at the sum of squared residuals for the testing data from all of the trees associated with the final level of alpha you pick. For example, when we're doing the cross validation to figure out which value of alpha to use, we can keep track of the sum of squared residuals for each level of alpha, and at the end, when we know the alpha value, just average the sums of squared residuals for that level.
@@statquest But still, shouldn't we always have a dataset that is never seen by a model? Or did you just mean that?
@@toja240 Whether or not you should have a special testing dataset that is never used for cross validation or anything else really depends on how much data you have. If you have a ton of data, sure, hold out some of it to be a special testing-dataset. If you don't have a ton of data, then you just estimate the models performance based on how it did in the cross-validation.
@@statquest OK, thanks for a fast reply. I really appreciate your work. Keep it up!
Thanks, Josh, for your lucid explanation. We are longing for the XGBoost videos. Any updates on that?
This was actually the very first XGBoost video. XGBoost uses unique trees and to understand why, you have to know everything about normal regression trees. This information was originally in a XGBoost StatQuest, but it was too much, so I made it a stand alone video. That being said, the next video I put out, in the next two weeks or so, will be on XGBoost, then the next and the next, etc. XGBoost is a huge algorithm with a lot of parts. I've got 3 videos worth of material so far and I've only scratched the surface. I'm expecting to have at least 4 XGBoost videos, maybe more.
@@statquest Thanks so much for the reply. I've seen all your tree based videos. For the last two days, I've been reading about XGBoost all over the internet, and it was quite difficult to grasp the whole picture of XGBoost. I genuinely thought it would be a lot easier and more intuitive if explained by you. Appreciate all your work. Couldn't be more excited for the future XGBoost videos. BAM!
@@mohamedhanifansari9224 Just a few more weeks to wait! (and I'm just as excited as you are about this XGBoost thing - it's become an obsession!)
@@statquest Thanks so much! :')
I want to ask a question. Do we use the same alpha values for the trees built on the training data as the ones we found using the tree built on the full data? Since the trees for two different datasets may differ, is it possible that the alpha values we chose from the full-sized data aren't right for the training trees? By "not right" I mean maybe those alpha values don't reduce the tree score after pruning the training tree, and we need to use another alpha value. Also, why exactly we need to train our tree on the full dataset, and choose that model, is not clear to me.
Yes, it's possible to get different trees, but if the dataset is large enough, it shouldn't be a problem. To see how this is done in practice, see: ua-cam.com/video/q90UDEgYqeI/v-deo.html
excellent explanation. thank you
Thank you!
Great video! Trying to implement it now, from scratch, and I have a few questions: at 12:12, "now we will increase alpha until pruning leaves gives us a lower tree score". So, do we precompute all of the trees first (I have L trees if I start with L leaves?), and then continue to increase alpha by some increment until the tree in question has the lowest score? If I'm wrong, maybe somebody could clarify? Thanks so much, and thanks for the video!!
To be honest, I'm not sure of the best way to implement this. However, my guess is that when you build a tree, you can keep track of how much the SSR decreases with each additional branch. Then, when you increase alpha, you can compare it to those stored SSR values and use that to decide when to prune. I believe this would be easier and faster than building a bunch of trees at the very start.
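One way to flesh out that guess (a hypothetical node structure, not from any particular library): store each node's would-be-leaf SSR and its subtree's SSR at build time, so pruning for any alpha becomes a single traversal.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    ssr_as_leaf: float             # SSR if this node were collapsed to a leaf
    ssr_of_subtree: float          # total SSR of the leaves below it
    n_leaves: int                  # number of leaves in its subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def should_prune(node: Node, alpha: float) -> bool:
    # Collapse when the leaf's tree score (SSR + alpha * 1) is no worse
    # than the subtree's tree score (SSR + alpha * n_leaves).
    return node.ssr_as_leaf + alpha <= node.ssr_of_subtree + alpha * node.n_leaves
```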
@@statquest oh, that's a good point. So every single node can have an SSR that is calculated when building the tree (stored in each node, leaf or not), and then finding the best subtree for the same training set & some fixed alpha can be done with just one more traversal, recursively. I will give this a try! 🙂
@@rkotcher Bam! Let me know how it goes. Also, if you have An Introduction to Statistical Learning in R (it's a free download), they may have pseudocode you can use as a guide.
@@statquest Thanks for the reference! I'll have to check it out. You mentioned to say how it goes - I had two issues calculating SSR cleanly during tree construction (my implementation is recursive, so right off the bat this might not be optimal). The first issue is that, because you must calculate SSR after the current node is attached to the parent node, you've already finished building the subtree so the results will include that subtree. The second issue is that, because you always build either the left or right subtree first, calculating SSR might mean that you're missing distant relative nodes (aunt/uncle nodes, for example). I believe (?) this is different than what we want, which is actually to calculate SSR for all subtrees constructed from removing exactly one node. Anyway, if somebody stumbles upon this and agrees/disagrees, I'd be curious to hear :) I ended up calculating SSR & tree score in a second pass.
@@rkotcher Bam!
Thank you so much for posting this!
Hooray! :)
Great explanation. I really enjoyed the video, but I'm a bit confused. Why use all of the data to build the initial tree in step 1? If we do that there is no test data left for testing the tree.
This is assuming you've already separated out your testing and validation data.