CatBoost Part 2: Building and Using Trees

  • Published 14 Jun 2024
  • Just like we saw in CatBoost Part 1, Ordered Target Encoding, we're going to use the training data one row at a time to build trees and calculate their output values. This is part of CatBoost's determined effort to avoid leakage like there is no tomorrow. We'll also learn how CatBoost makes predictions once the trees are made.
    NOTE: This StatQuest is based on the original CatBoost manuscript... arxiv.org/abs/1706.09516
    ...and an example provided in the CatBoost documentation...
    catboost.ai/en/docs/concepts/...
    English
    This video has been dubbed using an artificial voice via aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.
    Spanish
    This video has been dubbed into Spanish using an artificial voice via aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.
    Portuguese
    This video has been dubbed into Portuguese using an artificial voice via aloud.area120.google.com to improve accessibility. You can change the audio track language in the Settings menu.
    If you'd like to support StatQuest, please consider...
    Patreon: / statquest
    ...or...
    UA-cam Membership: / @statquest
    ...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
    statquest.org/statquest-store/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    1:10 Building the first tree
    6:05 Quantifying the effectiveness of the first threshold
    6:56 Testing a second threshold
    9:05 Building the second tree
    10:21 The main idea of how CatBoost works
    12:15 Making predictions
    13:02 Symmetric Decision Trees
    14:56 Summary of the main ideas
    Corrections:
    2:05 Red should have gone into bin 0 instead of bin 1.
    7:23 I should have said that the cosine similarity was 0.71.
    #StatQuest #CatBoost #DubbedWithAloud
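
    For anyone who wants to try this out in code: below is a minimal sketch (not from the video) of fitting a CatBoostRegressor on a tiny made-up dataset. The column names, values, and parameter settings are illustrative assumptions, not the exact example from the video.

    from catboost import CatBoostRegressor
    import pandas as pd

    # Toy training data: one categorical feature plus one numeric feature and a continuous target.
    X = pd.DataFrame({"Favorite Color": ["Blue", "Green", "Blue", "Red", "Green", "Red"],
                      "Age": [25, 31, 22, 40, 35, 28]})
    y = [1.72, 1.32, 1.81, 1.56, 1.64, 1.50]

    # depth sets the number of levels in each symmetric (oblivious) tree;
    # cat_features tells CatBoost which columns to target-encode internally.
    model = CatBoostRegressor(iterations=100, depth=2, learning_rate=0.1, verbose=False)
    model.fit(X, y, cat_features=["Favorite Color"])

    # Predict the Height of a new person who likes Blue.
    print(model.predict(pd.DataFrame({"Favorite Color": ["Blue"], "Age": [30]})))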

COMMENTS • 85

  • @statquest
    @statquest  1 year ago +2

    NOTE: At 7:23 I should have said that the cosine similarity was 0.71.
    To learn more about Lightning: lightning.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @sahilpalsaniya724
    @sahilpalsaniya724 1 year ago +3

    "BAM" and its variants are stuck in my head. every time I solve a problem my head plays your voice

  • @Monkey_uho
    @Monkey_uho 1 year ago +1

    Awesome work!
    I've been watching a lot of your videos to understand the basic ML algorithms; keep it up! Thank you for taking the time and energy to spread knowledge with others. Also, I would like to say that, like others, I would also love a video explaining the concepts behind LightGBM.

    • @statquest
      @statquest  1 year ago

      Thank you! And one day I hope to do LightGBM

    • @weipenghu4463
      @weipenghu4463 11 months ago +1

      looking forward to it❤

  • @aakashdusane
    @aakashdusane 1 month ago +4

    Not gonna lie, CatBoost's nuances were significantly more difficult to understand than any other ensemble model to date, although the basic intuition is pretty straightforward.

    • @statquest
      @statquest  1 month ago +2

      It's a weird one for sure.

  • @razielamadorrios7284
    @razielamadorrios7284 1 year ago +4

    Such a great video, Josh! I really enjoyed it. Any chance you could do an explanation of LightGBM? Thanks in advance. Additionally, I'm a huge fan of your work :)

  • @Quami111
    @Quami111 1 year ago +9

    At 2:09 and 12:40, you assigned the row with height=1.32 to bin=1, but you said that rows with smaller heights would have bin=0. That doesn't match 11:24, where the row with height=1.32 has bin=0, so I guess it is a mistake.

    • @statquest
      @statquest  1 year ago +12

      Oops! That was a mistake. 1.32 was supposed to be in bin 0 the whole time.

    • @OscarMartinez-gg5du
      @OscarMartinez-gg5du 1 year ago +1

      @@statquest At 1:15, when creating the randomized order for the first tree, the heights also seem to be shuffled away from their corresponding Favorite Color, and that changes the examples for the creation of the stumps. However, the explanation is very clear. I love your videos!!

  • @LL-hj8yh
    @LL-hj8yh 8 months ago +2

    Hey Josh, thanks as always! Are you planning to roll out lightgbm videos as well?

    • @statquest
      @statquest  8 months ago +1

      Eventually that's the plan.

  • @TheDataScienceChannel
    @TheDataScienceChannel 1 year ago +4

    As always a great video. Was wondering if you intend to add a code tutorial as well?

    • @statquest
      @statquest  1 year ago +1

      I'll keep it in mind!

    • @asmaain5856
      @asmaain5856 1 year ago

      @@statquest Please do one soon, I reaaaaally need it!

    • @rikki146
      @rikki146 1 year ago

      APIs for shallow models are mostly similar :\

  • @user-lu5ds2qp2f
    @user-lu5ds2qp2f 1 year ago +1

    Big Fan !! 🙌

  • @near_.
    @near_. 1 year ago +1

    Awesome.
    I'm your new subscriber 🙂

  • @rishabhsoni
    @rishabhsoni 7 months ago +1

    Great video. One question: is the intuition behind using high cosine similarity to pick a threshold that, since we add the scaled leaf outputs to create predictions, leaf outputs that are closer to the residuals move us in the right direction (the residuals represent how far away we are from the actual target)?
    Usually we minimize the residuals, which kind of means finding similarity with the target.

    • @statquest
      @statquest  6 months ago

      I think that is correct. A high similarity means the output value is close to the residuals, so we're moving in the right direction.

    • @rishabhsoni
      @rishabhsoni 6 months ago

      But one question that comes to mind: cosine similarity is based on the L2 norm, i.e. Euclidean distance. Wouldn't the number of rows of data act as the dimension in this case and cause weird output due to the curse of dimensionality?
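
    Following up on this exchange, here is a rough sketch of scoring a candidate threshold with the cosine similarity between the residuals and the leaf outputs. It leaves out CatBoost's ordered, one-row-at-a-time bookkeeping, and the numbers are made up, so treat it as an illustration of the idea rather than the actual algorithm.

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def score_threshold(feature, residuals, threshold):
        left = feature < threshold
        # Each leaf's output is the mean residual of the rows that land in it...
        outputs = np.where(left, residuals[left].mean(), residuals[~left].mean())
        # ...and the threshold is scored by how well those outputs line up with the residuals.
        return cosine_similarity(outputs, residuals)

    feature = np.array([0.05, 0.35, 0.35, 0.65])    # e.g. an encoded Favorite Color column
    residuals = np.array([-0.4, -0.1, 0.2, 0.3])    # observed target minus current prediction

    for t in [0.2, 0.5]:
        print(t, round(score_threshold(feature, residuals, t), 2))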

  • @nitinsiwach1989
    @nitinsiwach1989 5 months ago

    What do bins have to do with the ordered encoding computation you mentioned at 11:26? In the video, you mentioned one use-case for the bins, which is to reduce the number of thresholds tested, like other gradient boosting methods.

    • @statquest
      @statquest  5 months ago

      The bins are used to give us a discrete target value for Ordered Target Encoding (since it doesn't work directly with a continuous target). For details, see: ua-cam.com/video/KXOTSkPL2X4/v-deo.html
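
    A tiny illustration of that reply. The above/below-the-mean rule here is my own simplification, not necessarily the exact binning rule CatBoost or the video uses:

    import numpy as np

    heights = np.array([1.72, 1.32, 1.81, 1.56, 1.64])   # continuous target
    bins = (heights > heights.mean()).astype(int)        # discrete stand-in for the target
    print(bins)   # these 0/1 bin labels are what Ordered Target Encoding counts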

  • @nitinsiwach1989
    @nitinsiwach1989 4 months ago

    Hello Josh,
    Thank you for your amazing channel
    In the catboost package, why do we have both 'depth' and 'max_leaves' as parameters? One would think that since the trees here are oblivious, the two are deterministically related. Can you shed some light on this?

    • @statquest
      @statquest  4 months ago

      That's a good question. Unfortunately, there have been a lot of changes to CatBoost since it was originally published, and it's hard to get answers about what's going on.

  • @reynardryanda245
    @reynardryanda245 1 year ago

    12:41 How did you get the optionCount for the prediction? I thought it was the number of times that color appears sequentially for that bin. But if it's for a prediction, we don't know the actual bin, right?

    • @statquest
      @statquest  1 year ago

      At 12:41, we are trying to predict the height of someone who likes the color blue. So, in order to change "blue" into a number, we look at the training data on the left, which has two rows with the color blue in it. One of those rows is in Bin 0 and the other is in Bin 1. Thus, to get the option count for "blue", we add 0 + 1 = 1. In other words, the option count for the new observation is derived entirely from the training dataset.
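
    Here is that arithmetic as a small sketch. The 0.05 prior follows the CatBoost documentation example the video is based on; the rest of the numbers are made-up stand-ins for the training data.

    train_colors = ["Blue", "Green", "Blue", "Red"]   # made-up training data
    train_bins   = [0, 1, 1, 0]                       # the discretized target for those rows

    def encode_new(color, prior=0.05):
        # Use every training row with this color (the new row itself has no bin).
        bins_for_color = [b for c, b in zip(train_colors, train_bins) if c == color]
        option_count = sum(bins_for_color)            # "Blue": 0 + 1 = 1
        n = len(bins_for_color)                       # "Blue" appears in 2 training rows
        return (option_count + prior) / (n + 1)

    print(encode_new("Blue"))                         # (1 + 0.05) / (2 + 1) = 0.35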

  • @user-fi2vi9lo2c
    @user-fi2vi9lo2c 8 months ago

    Dear Josh, I have a question about using CatBoost for classification. In this video, which covers using CatBoost for regression, we calculated the output value of a leaf as the average of the residuals in that leaf. How do we calculate the output value for classification? Do we use the same formula as for Gradient Boosting, i.e. (Sum of residuals) in the numerator and Sum of (Previous probability(i) * (1 - Previous probability(i))) in the denominator?

    • @statquest
      @statquest  8 months ago

      CatBoost is, fundamentally, based on Gradient Boost, which does classification by converting the target into a log(odds) value and then treating it like a regression problem. For details, see: ua-cam.com/video/jxuNLH5dXCs/v-deo.html
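
    For reference, the Gradient Boost classification leaf output the question describes looks like this. This is the standard Gradient Boost formula from the linked video; I'm assuming, rather than confirming, that CatBoost's implementation matches it exactly.

    import numpy as np

    def classification_leaf_output(residuals, previous_probabilities):
        # residual_i = observed_i - previous_probability_i, with the model kept in log(odds) space
        p = np.asarray(previous_probabilities, dtype=float)
        return np.sum(residuals) / np.sum(p * (1.0 - p))

    print(classification_leaf_output([0.3, -0.4, 0.3], [0.7, 0.4, 0.7]))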

  • @serdargundogdu7899
    @serdargundogdu7899 9 months ago

    I wish you could replay this part again :)

  • @aryanshrajsaxena6961
    @aryanshrajsaxena6961 2 months ago

    Will we use k-fold target encoding for the case of more than 2 bins?

    • @statquest
      @statquest  2 months ago +1

      I believe that is correct.

  • @alphatyad8131
    @alphatyad8131 8 months ago

    Excuse me again, Dr. Starmer. Do you know how CatBoost determines the final set of trees (I mean, out of the many trees that gradient boosting builds) before they become a rule that can predict new data?
    I haven't found a source that gives an explicit explanation of how CatBoost builds the decision trees until they can be used to predict. Thanks in advance, Dr.
    (Or, for anyone who knows, I would appreciate your help.)

    • @statquest
      @statquest  8 months ago +1

      You build a bunch of trees and see if the predictions have stopped improving. If so, then you are done. If not, and it looks like the general trend is to continue improving, then build more trees.

    • @alphatyad8131
      @alphatyad8131 8 months ago

      I got it & am so appreciate it, Dr. And then if I could ask again;
      Is it safe if we say CatBoost is similar to the XGBoost method in the way it chooses features for building the tree (made predictor) & defining -in this case- the classification class for the given data?

    • @statquest
      @statquest  8 months ago +1

      @@alphatyad8131 They're pretty different. To learn more about XGBoost, see: ua-cam.com/video/OtD8wVaFm6E/v-deo.html and ua-cam.com/video/8b1JEDvenQU/v-deo.html

    • @alphatyad8131
      @alphatyad8131 8 months ago +1

      @@statquest Great explanation, Dr. Josh Starmer. Actually, I'm still learning by watching your videos on 'Machine Learning'. I appreciate it; I no longer feel stuck in the same place as before, thanks to your help.
      Have a nice day, Dr.
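
    Tying back to the earlier question in this thread about when CatBoost is done building trees: in the Python package this is usually handled with a validation set and early stopping. A sketch, with made-up data and placeholder parameter values:

    from catboost import CatBoostRegressor
    import pandas as pd

    # Made-up data, just to have something to fit.
    X = pd.DataFrame({"Favorite Color": ["Blue", "Green", "Blue", "Red", "Green", "Red"] * 20,
                      "Age": [25, 31, 22, 40, 35, 28] * 20})
    y = [1.72, 1.32, 1.81, 1.56, 1.64, 1.50] * 20

    X_train, y_train = X.iloc[:100], y[:100]
    X_val, y_val = X.iloc[100:], y[100:]

    model = CatBoostRegressor(iterations=1000, depth=4, learning_rate=0.1, verbose=False)
    model.fit(X_train, y_train,
              cat_features=["Favorite Color"],
              eval_set=(X_val, y_val),          # watch predictions on held-out data...
              early_stopping_rounds=50)         # ...and stop once they stop improving
    print(model.tree_count_)                    # how many trees were actually kept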

  • @user-hv2lq3yt4w
    @user-hv2lq3yt4w 8 months ago

    Thanks a lot~ I'm looking for an answer!
    For the new data whose "Favorite Color" is blue, why does it belong to bin #0 instead of bin #1?

    • @statquest
      @statquest  8 months ago

      The new data is not assigned to a bin at all. We just use the old bin numbers associated with the Training Data (and only the training data) to convert the color, "blue", to a number. The bin numbers in the training data are used for the sum of the 1's for the numerator.

    • @user-hv2lq3yt4w
      @user-hv2lq3yt4w 8 months ago

      @@statquest I misunderstood, sorry~ For new data whose "Favorite Color" is blue, we use all the rows with the same color, "blue", to get OptionCount and n.

    • @statquest
      @statquest  8 months ago

      @@user-hv2lq3yt4w yep

  • @danieleboch3224
    @danieleboch3224 3 months ago

    I have a question about leaf outputs. Don't gradient boosting algorithms on trees build a new tree all the way down and only then assign values to its leaves? You did it iteratively instead, calculating outputs while the tree wasn't fully built yet.

    • @statquest
      @statquest  3 months ago

      As you can see in this video, not all gradient boosting algorithms with trees do things the same way. In this case, the trees are built differently, and this is done to avoid leakage.

    • @danieleboch3224
      @danieleboch3224 3 months ago

      @@statquest Thanks, I got it now! But I have another question: in the CatBoost documentation there is a leaf estimation parameter (set to "Newton"), which is weird, since the Newton method is exactly the method used to find leaf values in XGBoost; it uses the second derivative of the loss function and builds the tree according to an information criterion based on that method. Why would we need that if we already build trees in the ordered way, finding the best split with the cosine similarity function?

    • @statquest
      @statquest  3 months ago +1

      @@danieleboch3224 To be honest, I can only speculate about this. My guess is that they started to play around with different leaf estimation methods and found that the one XGBoost uses works better than the one they originally came up with. The "theory" of CatBoost seems to be quite different from how it works in practice, and this is very disappointing to me.

  • @alexpowell-perry2233
    @alexpowell-perry2233 8 months ago

    At 11:48, when you are calculating the output values of the second tree, the residual for the 3rd record, with a Favourite Colour value of 0.525 and a Residual of 1.81, gets sent down the LHS leaf, even though the LHS leaf contains Residuals that are

    • @statquest
      @statquest  8 months ago +1

      Oops! That's a mistake. Sorry for the confusion!

  • @frischidn3869
    @frischidn3869 11 months ago

    What will the residuals and leaf output be when it is a multiclass classification?

    • @statquest
      @statquest  11 months ago +1

      Presumably it's log likelihoods from cross entropy. I don't show how this works with CatBoost, but I show how it works with Neural Networks here: ua-cam.com/video/6ArSys5qHAU/v-deo.html

  • @alexpowell-perry2233
    @alexpowell-perry2233 8 months ago

    How does catboost decide on the best split at level 2 in the tree if it has to be symmetric? What if the best threshold for the LHS node is different to the best threshold for the RHS node?

    • @statquest
      @statquest  8 months ago

      It finds the best threshold given that it has to be the same for all nodes at that level. Compared to how a normal tree is created, this is not optimal. However, the point is not to make an optimal tree, but instead to create a "weak learner" so that we can combine a lot of them to build something that is good at making predictions. Pretty much all "boosting" methods do something to make the trees a little worse at predicting on their own because trees are notorious for overfitting the training data. By making the trees a little worse, they prevent overfitting.

    • @alexpowell-perry2233
      @alexpowell-perry2233 8 months ago

      @@statquest Thanks so much for the reply, but I still don't quite understand this. So does each LEVEL get a similarity score? I don't understand how you can quantify a threshold when that threshold is being applied to more than one node in the tree. In your example you showed us how to calculate the cosine similarity for a split that is applied to just one node; how do we calculate it when the split is applied to two nodes simultaneously (in the case of a level-2 split)? I also have one more question: since the tree must be symmetrical, I am assuming that a characteristic (in the case of your example, "Favourite Film") can only ever appear in a tree once?

    • @statquest
      @statquest  8 months ago

      @@alexpowell-perry2233 In the video I show how the cosine similarity is calculated using 2 leaves. Adding more leaves doesn't change the process. Regardless of how many leaves are on a level, we calculate the cosine similarity between the residuals and the predictions for all of the data.
      And yes, a feature will not be used if it can no longer split the data into smaller groups.
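
    To make that concrete, here is a rough sketch of scoring one candidate threshold for a whole level of a symmetric (oblivious) tree: the same split is applied to every node on the level, and a single cosine similarity is computed between the residuals and the resulting predictions for all of the data. As before, the ordered bookkeeping is left out and the numbers are made up.

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def score_level(leaf_ids, feature, residuals, threshold):
        # The same threshold splits every existing leaf, doubling the number of leaves.
        new_leaf_ids = leaf_ids * 2 + (feature < threshold)
        predictions = np.zeros_like(residuals)
        for leaf in np.unique(new_leaf_ids):
            in_leaf = new_leaf_ids == leaf
            predictions[in_leaf] = residuals[in_leaf].mean()   # leaf output = mean residual
        return cosine_similarity(predictions, residuals)

    leaf_ids  = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # which level-1 leaf each row is in
    feature   = np.array([0.1, 0.6, 0.3, 0.9, 0.2, 0.7, 0.4, 0.8])   # feature tried at level 2
    residuals = np.array([-0.5, 0.1, -0.2, 0.6, -0.3, 0.4, -0.1, 0.5])

    print(round(score_level(leaf_ids, feature, residuals, 0.5), 2))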

  • @sanukurien2752
    @sanukurien2752 2 months ago

    What happens at inference time, when the target is not available? How are the categorical variables encoded then?

    • @statquest
      @statquest  2 months ago

      You use the full training dataset to encode the new data.

  • @yufuzhang1187
    @yufuzhang1187 1 year ago +2

    Dr. Starmer, when you have a chance, can you please make videos on LightGBM, which is quite popular these days? Also, can you do ChatGPT or GPT or Transformers, clearly explained? Thank you so much!

    • @statquest
      @statquest  1 year ago +3

      I'm working on Transformers right now.

    • @yufuzhang1187
      @yufuzhang1187 1 year ago +1

      @@statquest Thank you so much! Looking forward!

    • @xaviernogueira
      @xaviernogueira 1 year ago +1

      ​@@statquest excited for that

  • @Mark_mochi
    @Mark_mochi 6 months ago +1

    In 8:25, why does the threshold change to 0.87 all of a sudden?

    • @statquest
      @statquest  6 months ago

      Oops. That looks like a typo.

  • @user-zq4cv6yn8u
    @user-zq4cv6yn8u 8 months ago +1

    Thank you for your content! It's very nice, everything is clear. I hope you won't stop producing your content :)

  • @serdargundogdu7899
    @serdargundogdu7899 9 months ago

    how was "favorite color < 29" changed into "favorite color < 0.87" in 8:28 ? Could you please explain?

    • @statquest
      @statquest  9 months ago

      That's just a horrible and embarrassing typo. :( It should be 0.29.

  • @alphatyad8131
    @alphatyad8131 1 year ago

    Dr. Starmer, I tried to calculate it manually, and with a calculator too, several times, but my result was different from the one at 7:23. I get 0.7368, but the video shows 0.79. Am I missing something? Does anyone get the same result as me?

    • @statquest
      @statquest  1 year ago +1

      That's just a typo in the video. Sorry for the confusion.

    • @alphatyad8131
      @alphatyad8131 1 year ago +1

      ​Okay. Thank you for your attention and the great explanation, Dr. Josh Starmer. Such an honor and my pleasure to contribute to this video. Have a great day, Dr.

  • @recklesspanda8669
    @recklesspanda8669 11 months ago

    Does it still work like that if I use classification?

    • @statquest
      @statquest  11 months ago +1

      I believe classification is just like classification for standard Gradient Boost: ua-cam.com/video/jxuNLH5dXCs/v-deo.html

    • @recklesspanda8669
      @recklesspanda8669 11 months ago +1

      @@statquest thank you🤗

  • @YUWANG-du4pv
    @YUWANG-du4pv 1 year ago

    Dr. Starmer, could you explain lightGBM🤩

  • @TrusePkay
    @TrusePkay 1 year ago

    Do a video on LightGBM

  • @nilaymandal2408
    @nilaymandal2408 5 months ago +1

    5:28

  • @TheDankGoat
    @TheDankGoat 8 months ago

    obnoxious, arrogant, has mistakes, but useful....

    • @statquest
      @statquest  8 months ago

      What parts do you think are obnoxious? What parts are arrogant? And what time points, minutes and seconds, are mistakes? (The mistakes I might be able to correct, or at least add a note that mentions them.)