CatBoost Part 1: Ordered Target Encoding

  • Published Jun 6, 2024
  • One of the defining features of CatBoost is its concerted effort to avoid data leakage at all costs. In this video, we'll see how it eliminates a potential threat in Target Encoding by ordering the data and encoding it sequentially. This ordered approach is central to everything CatBoost does and we'll see it again in Part 2 when we talk about how it builds trees.
    NOTE: This StatQuest is based on the original CatBoost manuscript... arxiv.org/abs/1706.09516
    ...and an example provided in the CatBoost documentation...
    catboost.ai/en/docs/concepts/...
    English
    This video has been dubbed using an artificial voice via aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.
    If you'd like to support StatQuest, please consider...
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
    statquest.org/statquest-store/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    1:56 A slight problem with k-fold target encoding
    3:42 Ordered Target Encoding
    Corrections:
    4:09 It is also worth noting that if there were more than 2 target values, for example, if Loves Troll 2 could be 0, 1 and 2, then, when calculating the OptionCount for a sample with Loves Troll 2 = 1, we would include rows that had Loves Troll 2 = 1 and 2.
    #StatQuest #CatBoost #dubbedwithaloud
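For readers who want to see the idea in code, here is a minimal sketch of the ordered encoding described above, assuming a binary 0/1 target, the (OptionCount + prior) / (n + 1) form used in the video, and the 0.05 prior from the CatBoost documentation example; the function and variable names are illustrative, not CatBoost's:

```python
def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row's category using only the rows that came before it."""
    encoded = []
    for i, cat in enumerate(categories):
        # Earlier rows with the same category value...
        n = sum(1 for j in range(i) if categories[j] == cat)
        # ...and how many of those had target = 1 (the "OptionCount").
        option_count = sum(1 for j in range(i)
                           if categories[j] == cat and targets[j] == 1)
        encoded.append((option_count + prior) / (n + 1))
    return encoded

colors = ["Blue", "Red", "Blue", "Blue", "Red"]
loves_troll_2 = [1, 0, 1, 0, 1]
# The first row of each category has no history, so it gets prior/(0 + 1) = 0.05.
print(ordered_target_encode(colors, loves_troll_2))
```

Because each row only sees the rows above it, a category like Blue is encoded to different values at different positions, which is the point of the ordering.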

COMMENTS • 84

  • @statquest
    @statquest  1 year ago +3

    To learn more about Lightning: lightning.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @mrcoet
    @mrcoet 1 year ago +9

    Thank you! I'm doing my master thesis and I'm checking your channel every day waiting for Transformers. Thank you again!

    • @statquest
      @statquest  1 year ago +7

      I'm still working on it.

    • @dihancheng952
      @dihancheng952 1 year ago +1

      @@statquest Same - eagerly waiting here.

  • @firstkaransingh
    @firstkaransingh 1 year ago +5

    Finally, a video on CatBoost. I was waiting for a proper explanation.

  • @xaviernogueira
    @xaviernogueira 1 year ago +7

    Glad to see CatBoost! Would love to hear more about data leakage mitigation.

    • @statquest
      @statquest  1 year ago +7

      Thanks! Yes, I think at one point I need to do a video just on all the types of leakage.

  • @aghazi94
    @aghazi94 1 year ago +2

    I have been waiting for this for so long. Thanks a lot!

  • @joy5636
    @joy5636 1 year ago +1

    Wow, I am so excited to see the CatBoost topic! Thank you!

  • @AllNightNightwish
    @AllNightNightwish 1 year ago +2

    Hi Josh, I agree with your point here about it being unnecessary (also having seen the previous, longer explanation you posted a while back). However, I think their main point and contribution was not the mitigation in a single tree, but throughout the ensemble. If I understand it correctly, by using ordered boosting and randomization over each tree, they guarantee that there is no leakage between the separate trees, because none of the samples have ever seen the original value. They use multiple models trained on different fractions of the dataset for each tree, just so they can make predictions that don't have any leakage at all. I'm still not sure that it wouldn't just work with leave-one-out encoding, but given that context it seems to be more useful at least.

    • @statquest
      @statquest  1 year ago

      Part 2 in this series (which comes out in less than 24 hours) shows how the trees are built using the same approach that limits leakage. I guess one of my issues with CatBoost making such a big deal about leakage is that, even though other methods (XGBoost, LightGBM, Random Forests, etc.) might result in leakage, they still perform well - and the whole point of avoiding leakage is simply to have a model perform well.

  • @TJ-hs1qm
    @TJ-hs1qm 1 year ago +2

    Hey Josh, I was wondering if you could do a series on graph theory and NLP? Exploring this stuff would be really helpful. Thanks!

  • @matteomorellini5974
    @matteomorellini5974 1 year ago

    Hi Josh, first of all, thanks for your amazing work and passion. I'd like to suggest a video about Optuna, which, at least in my case, would be extremely helpful.

  • @davidguo1267
    @davidguo1267 1 year ago

    Thanks for the explanation. By the way, have you talked about backpropagation through time in recurrent neural networks? If not, are you planning to talk about it?

    • @statquest
      @statquest  1 year ago

      Backpropagation through time is just "unroll the RNN and then do normal backpropagation". I have thought about doing a video on it and have notes, but it's not a super high priority right now. Instead I want to get to transformers.

  • @ravi122133
    @ravi122133 3 months ago

    @statquest, I think in the paper they take the case where each sample has a unique category to show that it leads to leakage, not the case where all samples have the same category. See Section 3.2, Greedy TS, of the CatBoost paper.

    • @statquest
      @statquest  3 months ago

      Yes, but in either case, you could just remove that column.

  • @murilopalomosebilla2999
    @murilopalomosebilla2999 1 year ago +2

    It may be silly, but having a boosting method with cat in its name is really cool haha

  • @heteromodal
    @heteromodal 11 months ago

    Hey Josh - is there a mathematical justification for the prior in the numerator being defined as 0.05? And regardless of whether a justification exists :) - is it always that value, or is it just what you saw in their examples, with no certainty that it's fixed?
    Thank you, as always, for a great video!

    • @statquest
      @statquest  11 months ago +1

      I saw 0.05 used as the prior here: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic and, on that page, it says you can set the prior. But I've looked in the documentation and I can't find where it is set, so I really don't know if it is always the case or not.

    • @heteromodal
      @heteromodal 11 months ago +1

      @@statquest Thank you!
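To make the role of the prior concrete, here is a small numeric check based on the (OptionCount + prior) / (n + 1) form from the video; the function name is mine. A category with no matching rows is encoded as prior / 1, so the prior is simply the default value an unseen category receives - and, as noted elsewhere on this page, in theory it could be any number:

```python
def encode(option_count, n, prior=0.05):
    """Target-encode a category from its counts; the 0.05 default follows
    the CatBoost documentation example."""
    return (option_count + prior) / (n + 1)

print(encode(0, 0))            # unseen category -> the prior itself, 0.05
print(encode(0, 0, prior=12))  # a different prior just shifts that default
print(encode(3, 5))            # 3 of 5 earlier matching rows had target = 1
```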

  • @junaidbutt3000
    @junaidbutt3000 1 year ago +1

    Clear and concise as always, Josh! I was wondering if there is a natural way to extend the OptionCount metric to multiclass problems? It makes sense in binary classification: we count the observations where a category value c co-occurs with the positive class of the target variable (1 in this case). If this were adapted for multiclass problems, how would we adapt the encoding equation?

    • @statquest
      @statquest  1 year ago +1

      Great question - and the CatBoost documentation has a good description of how it works for more classes: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic

    • @texla-kh9qx
      @texla-kh9qx 1 year ago

      @@statquest From the documentation, "Multiclassification: The label values are integer identifiers of target classes (starting from "0").", it seems that they simply integer-encode the classes? Doesn't this introduce an artificial ordering in the target classes?

    • @statquest
      @statquest  1 year ago

      @@texla-kh9qx You have to remember that we don't split the data based on the target value, so using integer values for the target isn't a problem.

    • @texla-kh9qx
      @texla-kh9qx 1 year ago

      @@statquest The categorical features of the independent variables are encoded by target statistics, i.e. the transformation from categories to numerical values. If there is an artificial ordering in the target variable y, it propagates to that categorical feature of X. So integer-encoding the classes doesn't seem like a good choice.

    • @statquest
      @statquest  1 year ago

      @@texla-kh9qx If you look at the equations for target encoding independent variables, you'll see that they don't include the target value, just the number of rows with the same category. So I don't believe that the target values propagate to the independent variables.

  • @beautyisinmind2163
    @beautyisinmind2163 10 months ago

    Is categorical boosting only suitable for data with categorical features, or can we use it even if our data has no categorical features? When using it on continuous features, does it require any conversion?

    • @statquest
      @statquest  10 months ago

      You can certainly use CatBoost on a dataset that doesn't have any categorical features. And it wouldn't require conversion.

  • @frischidn3869
    @frischidn3869 1 year ago +1

    Hello, thanks for the video. I want to ask: what if the target variable (Loves Troll 2) is multiclass (Like, Dislike, So-so)? How will the encoding work then for Favorite Color?
    And should we first encode the target variable, such as
    0 = Dislike
    1 = So-so
    2 = Like
    before we then proceed to CatBoost-encoding the feature (Favorite Color)?

    • @statquest
      @statquest  1 year ago

      When there are more than 2 classes, the equation changes, but just a little bit. You can find it in the documentation: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic

    • @frischidn3869
      @frischidn3869 1 year ago +1

      @@statquest It says there, "The label values are integer identifiers of target classes (starting from "0")."
      So I have to encode the target variable as 0, 1, 2 outside the CatBoost algorithm if there are 3 classes?

    • @statquest
      @statquest  1 year ago

      @@frischidn3869 Sounds like it.

  • @luiscarlospallaresascanio2374

    What did you use to translate the text into Spanish? :0 I had already seen other translated videos, but I didn't think they would make the change so quickly.

  • @tapiotanskanen3494
    @tapiotanskanen3494 1 year ago +1

    1:57 - Is this correct? In chapter 3.2 - *Greedy TS* - they discuss a problem, _"This estimate is noisy for low-frequency categories...",_ but your example has a (maximally) high-frequency category. Later they stipulate _"Assume i-th feature is categorical, _*_all its values_*_ are unique, ..."._ To me this means that there is only a single row for each category. In other words, each category (label) is unique, i.e. we have exactly one example per category (label).

    • @statquest
      @statquest  1 year ago +1

      The video is correct. If you keep reading the manuscript, just a few more paragraphs, you'll get to the section titled "Leave-one-out TS", and you'll see what I'm talking about in this video.

    • @texla-kh9qx
      @texla-kh9qx 1 year ago

      The video is talking about the example with a constant categorical feature introduced in the "Leave-one-out TS" section of their paper. However, I think the formula for target statistics in this video differs from the one in the paper, though the conclusion is the same. Put another way, the categorical feature, which originally has a uniform value, carries no information at all. After target-statistics encoding, that categorical feature is transformed into a numerical feature with binary values that exactly distinguish the binary target classes. This is clearly target leakage, as you can make a perfect prediction relying on a single feature.
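The leakage described in this thread is easy to demonstrate numerically. Below is a hedged sketch using a plain leave-one-out mean (the paper's formula differs slightly, as the comment notes): with a constant categorical feature, the encoded value alone perfectly separates the two target classes.

```python
targets = [1, 0, 1, 1, 0, 0, 1]   # every row has the same category value
n = len(targets)
S = sum(targets)

# Leave-one-out mean: encode each row from all the OTHER rows' targets.
encoded = [(S - t) / (n - 1) for t in targets]

# Rows with target = 1 all get (S-1)/(n-1); rows with target = 0 all get
# S/(n-1), so a threshold between the two recovers the target exactly.
threshold = (2 * S - 1) / (2 * (n - 1))
predictions = [1 if e < threshold else 0 for e in encoded]
print(predictions == targets)  # prints True
```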

  • @ericchang927
    @ericchang927 1 year ago +1

    Great video!!! Could you please also introduce LightGBM? Thanks!

    • @statquest
      @statquest  1 year ago +2

      I'll keep that in mind. I have some notes on it already so hopefully I can do it soon.

  • @daniellaicheukpan
    @daniellaicheukpan 1 year ago

    Hi Josh, thanks for your videos. I have one question: in your example, the color Blue can be encoded to several numerical values. Assume that I trained and deployed this model; when new data comes in with color = Blue and no "Loves Troll 2" column, how does the model know which value to encode the color as? Thanks so much!

    • @statquest
      @statquest  1 year ago

      You use all of the color blue samples in the original training dataset.

    • @daniellaicheukpan
      @daniellaicheukpan 1 year ago

      @@statquest that means take the average?

    • @statquest
      @statquest  1 year ago

      @@daniellaicheukpan I was thinking more along the lines of plugging all of the blue rows into the equation. That might be the same as taking the average, but I haven't worked that out.

  • @dl569
    @dl569 1 year ago +1

    Can't wait to see Transformers, PLEASE!!!!!!

  • @shubhamgupta6551
    @shubhamgupta6551 1 year ago

    How is ordered target encoding applied at scoring time? There will not be any target variable, and we don't have a single value for a category - i.e., the color Blue is encoded multiple times with different values.

    • @statquest
      @statquest  1 year ago

      We use the entire training dataset to encode new data.
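A sketch of what "use the entire training dataset" might look like at scoring time, assuming the same (OptionCount + prior) / (n + 1) form from the video; the helper name and the 0.05 prior are illustrative, not CatBoost's actual API:

```python
def encode_new_row(category, train_categories, train_targets, prior=0.05):
    """Encode a new row's category using every matching training row."""
    # All training rows with this category value...
    n = sum(1 for c in train_categories if c == category)
    # ...and how many of them had target = 1 (sum of 0/1 targets).
    option_count = sum(t for c, t in zip(train_categories, train_targets)
                       if c == category)
    return (option_count + prior) / (n + 1)

train_colors = ["Blue", "Red", "Blue", "Blue", "Red"]
train_loves = [1, 0, 1, 0, 1]
print(encode_new_row("Blue", train_colors, train_loves))
```

Unlike during training, every new Blue row gets the same value here, because the whole training set is used instead of only the preceding rows.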

  • @johndavid5907
    @johndavid5907 9 months ago

    Hi there, sir. Can you tell me whether the value the prior variable holds is the significance-level value?

    • @statquest
      @statquest  9 months ago

      0.05 is often used as a threshold for statistical significance, but in this case, that concept has nothing to do with how we assign a value to the prior. In theory, the prior could be anything, like 12, and that's not even an option for the threshold for statistical significance.

  • @tessa10001
    @tessa10001 1 year ago +2

    Where was this when I wrote my master's thesis with CatBoost :(

  • @EvanZamir
    @EvanZamir 9 months ago

    Can Lightning be used with CatBoost?

    • @statquest
      @statquest  9 months ago

      Lightning AI provides a platform to do things easily in the cloud. So, anytime you have a ton of data or a large model, Lightning can help.

  • @c.nbhaskar4718
    @c.nbhaskar4718 1 year ago

    Great tutorial, but I am eagerly waiting for a StatQuest on Transformers.

  • @BlueRS123
    @BlueRS123 11 months ago

    Will you cover LightGBM?

    • @statquest
      @statquest  11 months ago

      I've got notes on it and when I have time I will.

    • @BlueRS123
      @BlueRS123 11 months ago

      @@statquest Cool! Are videos of gradient descent optimizers planned, too? (Momentum, Adam, etc.)

    • @statquest
      @statquest  11 months ago

      @@BlueRS123 I've got notes for Adam as well, so it's just a function of finding some time.

  • @EvanZamir
    @EvanZamir 9 months ago

    My guess is the ordered target encoding acts like a form of regularization.

    • @statquest
      @statquest  9 months ago

      Yes, that makes sense to me.

  • @guimaraesalysson
    @guimaraesalysson 1 year ago

    In this simple example of people's favorite colors and whether or not they liked the movie, wouldn't "leakage" make sense? After all, if, for example, 90% of people who like blue liked the movie, wouldn't knowing that the next person's favorite color is blue already provide information? Why is the leak a leak in this case?

    • @statquest
      @statquest  1 year ago +1

      Leakage comes from using the same row's target value to modify its value for Favorite Color. This is typically dealt with by using k-fold target encoding - ua-cam.com/video/589nCGeWG1w/v-deo.html

  • @nitinsiwach1989
    @nitinsiwach1989 6 months ago

    Not only is the motivation unjustifiable - the way target encoding is done by CatBoost also makes no sense. Even in your toy example, the different categories are numerically exactly the same when encoded, and there is absolutely no reason that should be the case.

  • @Joaopedro_
    @Joaopedro_ 6 months ago

    Give a shout-out to Caio Ducati!