To learn more about Lightning: lightning.ai/
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
It's mind-boggling how much better Josh is at explaining complicated topics than anyone else.
BAM!
Thank you so much for your videos, this is by far the best educational Machine Learning channel I’ve ever come across
Wow, thanks!
This is literally the best explanation about statistics & traditional ML model. I am so lucky to see your video with my journey of data science started.
Thank you!
Much better explanation than what I had at class!
BAM! :)
What class?
Thank you Josh and the people contributing to it; it is very helpful and nice content that enhances my studying experience! :)
By the way, these are the best explanations and visualizations I have ever seen for these topics, and for free! What a fairy tale
Thanks from a Ukrainian student!
Thank you! I'm glad it helped!
I really loved your explanation and your sense of humor. I really did!
Thanks!
Hi Josh, I wanted to thank you for your content. I'm finishing your stats playlist and it's very good. Statsquatch has become my friend. Big hug straight from Brazil!
Muito obrigado! (Thank you very much!) :)
Hi Josh. Just came across your channel. Your method of explaining is so concise, clear and appealing. Definitely I would learn a lot from this channel.
Awesome, thank you!
Love the dry humour in your videos 🤣. Great content too!
Thank you!
Absolutely looooooooooove all videos of this channel (especially 'shameless self-promotion' hehe)
bam! :)
thank you your explanations are always simple and clear.
You are welcome!
But what happens during inference? Say you trained a great model and now you are predicting on new data: do you use the mean of the old data or the mean of the new data? If you use target encoding, well, the new data doesn't have a target, so what now?!?
Typically you take the average of the training data for each category at inference time
BAM! :)
For target encoding, you usually store a dictionary of the encodings for the categorical data. If a value was unseen in the training set, best practice is to use the overall mean.
Hi Josh, a heartfelt thank you for sharing these encoding techniques.
I have one doubt; it may look stupid, but I just want to clarify it with you.
At 13:40, the encoding of the green colour with target value 1 is 0.42, and below that, the green colour with target value 1 is 0.67.
So when the encoding transforms new data, will the system change the green colour to 0.42 or 0.67?
When encoding new data, we'll use all of the training data to convert the option "green" into numbers. Thus, I believe we'll convert it to ((3 * (2/3)) + (2 * (3/7))) / (3 + 2) = 0.57
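That arithmetic, as a minimal Python sketch of the weighted mean formula (the specific values, n = 3 "green" rows with an option mean of 2/3, weight m = 2, and an overall mean of 3/7, are the ones from the comment above):

```python
# Weighted mean = (n * option mean + m * overall mean) / (n + m)
n, option_mean = 3, 2 / 3
m, overall_mean = 2, 3 / 7
weighted_mean = (n * option_mean + m * overall_mean) / (n + m)
print(round(weighted_mean, 2))  # 0.57
```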
Totally great explanation, congratulations
Many thanks!
How do you use k-fold target encoding for a test data set, since blue now has several distinct numeric values as a predictor in the training set?
Great questions - with new data, you can use all of the original training data to find the best value.
Hey Josh, great job. Thank you a lot!
Thank you!
I NEED YOUR HELP
At 4:20 our brother Josh said that some ML models might have problems with label encoding, and as always he explains it in a beautiful way. I'm writing my thesis and I need at least one paper that confirms this (although it makes totally perfect sense...). It's been 2 days and I still can't find anything... can anybody help me?
Thank you people, stay smart and keep learning
I think part of the problem is that the issue with label encoding is pretty obvious so no one thinks to publish about it, and reviewers might think it is too obvious to be published. Sort of like publishing a paper that says 2 + 2 = 4. However, I was able to find this: arxiv.org/pdf/2201.11358.pdf
@@statquest Josh, thank you so much. Really, the video you made was already enough. You can't imagine how much I appreciate it. Wish you the best
Hi Josh, thanks for the explanation. But I want to know how to transform unseen data using k-fold target encoding. Is it okay to use the mean value of the transformed category? Thanks in advance
I have the same question
Great questions - with new data, you can use all of the original training data to find the best value.
First of all, thank you so much for the amazing content.
My question: how do we perform encoding when we are solving a multi-class classification problem, or say a regression problem? How do we tackle such cases?
I think you should consider one-hot encoding in that case.
Awesome as always ☺️👏
Thank you! :)
13:33 okay...but how do you use this trained model? If you wanted to make a prediction using k-fold target encoding, which of the numbers for that color would you use?
(Ps - for colors I'd use their HSV values, or for simple colors like this I'd just use their hue which is a simple spectrum)
You can use the entire training dataset for new data.
@@statquest Could you please elaborate on how would you do this? I'm probably missing something obvious, but thanks in advance!
Is it that you apply the K-Fold target encoding to the new data based on the training data?
@@BlueRS123 When we have new data, instead of splitting the training data into k-folds, we use all of it to calculate the option mean and the overall mean.
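A minimal sketch of that, assuming pandas DataFrames `train` and `new` with a categorical column and a 0/1 target (names are illustrative, not from the video):

```python
import pandas as pd

# Encode new/production data: no k-fold split; use ALL training rows to
# compute the option means and the overall mean, then apply the weighted mean.
def encode_new_data(train, new, column, target, m=2):
    overall_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    weighted = (stats["count"] * stats["mean"] + m * overall_mean) / (stats["count"] + m)
    # Options never seen in training fall back to the overall mean.
    return new[column].map(weighted).fillna(overall_mean)
```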
Learning ML has never been this fun before 😂
bam!
For one-hot encoding, shouldn't there be one less position to avoid the dummy variable trap?
That only applies for models that depend on design matrices (like linear models or logistic regression).
Great video. 👍🏽
I find it less confusing, however, to say categorical or qualitative data instead of discrete data.
Numeric data can be discrete (integers)
Noted
Thank you for the explanation. Is it possible to buy the slides in this video? I didn't find them in The StatQuest Illustrated Guide to Machine Learning. How can I buy all your slides?
Thanks! Unfortunately my slides are not available. :(
it's time you introduce Quadruple Bam!!!!
Not yet - I'm waiting for 1 million subscribers.
Hi Josh .... thanks for your extreme efforts.... I hope to see a StatQuest about LightGBM and CatBoost.....
This is actually a lead in to CatBoost (as well as word embeddings).
Hi Josh. You rock.
Can you help me understand how the k-fold target encoding results are used on the test data that appears months after the model is trained and is in deployment?
Once trained, you use all of the training data to encode future data.
Is there an sklearn package to do K-fold target encoding?
Not that I know of, but there are lots of good examples on the web. Here's one: www.kaggle.com/code/anuragbantu/target-encoding-beginner-s-guide
@@statquest thanks, will have a look
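If it helps anyone searching later, here is a minimal sketch of the idea from the video, assuming a pandas DataFrame with one categorical column and a 0/1 target (this illustrates the method; it is not a tuned implementation):

```python
import pandas as pd
from sklearn.model_selection import KFold

# K-fold target encoding sketch: each fold is encoded using the weighted
# mean computed from the OTHER folds, to limit target leakage.
def kfold_target_encode(df, column, target, k=2, m=2):
    encoded = pd.Series(index=df.index, dtype=float)
    for fit_idx, transform_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(df):
        fit_fold = df.iloc[fit_idx]  # out-of-fold rows used to compute the means
        overall_mean = fit_fold[target].mean()
        stats = fit_fold.groupby(column)[target].agg(["mean", "count"])
        weighted = (stats["count"] * stats["mean"] + m * overall_mean) / (stats["count"] + m)
        values = df.iloc[transform_idx][column].map(weighted).fillna(overall_mean)
        encoded.iloc[transform_idx] = values.to_numpy()
    return encoded
```

As a side note, newer scikit-learn releases (1.3+) do include sklearn.preprocessing.TargetEncoder, which performs this kind of cross-fitting internally, and the category_encoders package also ships a TargetEncoder.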
Thanks for a great video! I am trying to apply k-fold target encoding on my train and test data. I target encoded my train data using k-fold target encoding just like the video, but how should I encode my test data? If the feature is BLUE, should I get the mean of BLUE (target encoded) in the train data and use it for the test data? OR should I just use the whole train data to get new target encoding values for the test data?
Use the full training dataset to encode the test data.
Question: How did you arrive at the conclusion that the weight of the overall mean should be 2?
Answer: See 7:23
It's not clear why you chose m = 2. Can you explain again or give me some resource to read? @@statquest
@@khushalkumar31 Ultimately, this, and all hyperparameters in any model, should be determined with cross validation. In other words, try a bunch of values (maybe 1, 10, 100, 1000) and see which one works the best. For details on how cross validation works, see: ua-cam.com/video/fSytzGwwBVw/v-deo.html
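As a sketch of what that search could look like, reusing the hypothetical kfold_target_encode() helper sketched earlier in this thread and assuming a DataFrame `train` with made-up columns "color" and "loves_troll2" (strictly, the encoding should be re-fit inside each CV split to fully avoid leakage; this compact version just shows the search loop):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Try several values for the weight 'm' and keep the one that cross-validates best.
for m in [1, 10, 100, 1000]:
    X = kfold_target_encode(train, "color", "loves_troll2", k=2, m=m).to_frame()
    y = train["loves_troll2"]
    score = cross_val_score(LogisticRegression(), X, y, cv=3).mean()
    print(f"m={m}: mean CV accuracy = {score:.3f}")
```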
Is there any tool like sklearn that includes K-fold or any of these advanced encodings?
Good question! I don't know.
I have two different datasets that have 750 columns after one-hot encoding and aligning manually. Now my problem is that I'm facing long load times in the process, so how can I reduce those columns?
If you don't store them that way, they should load faster. If the datasets are not super huge, you should be able to one-hot-encode in RAM and then run it through your analysis.
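For example, a hedged sketch using scikit-learn (the column names are made up): OneHotEncoder returns a sparse matrix by default, which is far lighter to keep in RAM and to store than 750 dense columns.

```python
from scipy.sparse import save_npz, load_npz
from sklearn.preprocessing import OneHotEncoder

# Sparse one-hot encoding: only the positions of the 1's are stored.
encoder = OneHotEncoder(handle_unknown="ignore")
X_sparse = encoder.fit_transform(df[["color", "favorite_movie"]])

save_npz("encoded.npz", X_sparse)   # compact on disk...
X_again = load_npz("encoded.npz")   # ...and fast to reload
```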
Hello Josh, I have a question: how can we provide values for testing data when we don't know the outcome?
When we have new data that we want to classify, we simply use all of the training data to calculate the encoding values.
Great work like always! What to do, when target encoding results in the same number for two labels?
That's fine - because the goal isn't to find unique values, the goal is to find values that help make the best predictions.
What if you are using a categorical column to create new features through feature engineering (with groupby in Python, for instance)? Does it mean that you should do your feature engineering first and then apply K-fold Target Encoding after doing this?
That makes sense to me.
Hey! I have a question regarding K-Fold target encoding. While I grasp the concept and the reason behind obtaining multiple values for the same category, I am uncertain about how to encode new data. Specifically, if I train a machine learning model using K-Fold target encoding and subsequently wish to test the model with new data, for which I do not have the target value, how should I go about encoding this data?
Should I take for instance the mean of all values corresponding to "red" ?
Yes, exactly. You use the full training dataset to encode new data when you are using the model to make new predictions.
@@statquest Thank you very much! I have another question though: what if I want to encode the target itself? In my problem, I have a classification problem where the target can take ~250 different values. I guess label encoding or one-hot are not suitable for this, and I don't really know if I can use target encoding to encode the target itself... Is there another way to encode data in this situation?
@@Monkey_uho Depending on your method, you might need to encode the target. For example, if you were using a neural network, you would just have one output per target value. Or, if you are using a tree based method (like xgboost), you can probably just assign a number to each target value - this is because you don't use the target value itself to determine how to build the tree.
Hey! What if my label is also categorical? How do I do target encoding then?
If you just have 2 options, it's the same as here. Otherwise you use counts for each option as described here: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
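One common multiclass variant, as a hedged sketch in the spirit of the CatBoost doc linked above: instead of a single mean, create one encoded column per class holding the smoothed proportion of that class within each option (column and class names here are illustrative):

```python
import pandas as pd

# Multiclass target encoding: one column of smoothed P(class | option) per class.
def multiclass_target_encode(df, column, target, m=2):
    out = pd.DataFrame(index=df.index)
    for cls in df[target].unique():
        is_cls = (df[target] == cls).astype(float)  # 0/1 indicator for this class
        overall = is_cls.mean()
        stats = is_cls.groupby(df[column]).agg(["mean", "count"])
        weighted = (stats["count"] * stats["mean"] + m * overall) / (stats["count"] + m)
        out[f"{column}_enc_{cls}"] = df[column].map(weighted)
    return out
```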
Here, at 7:54, I see that while calculating the weighted average for 'Red', you mentioned the 'overall mean' as 3/7. Isn't that incorrect? Shouldn't it be '1/7'? Please correct me in case I am calculating it wrong.
Overall mean is the average of the "Loves Troll 2" column for all colors. The "option mean" is the average of the "Loves Troll 2" column for a specific color (like "red")
Hi
Thank you for the video but I just have some questions and hope I get clarifications from you:
1) Could you explain why you were concerned about data leakage when doing the k-fold target encoding?
The data leakage I know is when some test set information is incorporated into the model training. Is the dataset shown in the video supposed to be the full dataset or is it just the training/testing data for a machine learning model?
2) Target encoding feels very different from one-hot encoding and feels like we are replacing the original column with a derived column.
Hence I was wondering, what is the intuition and reason behind target encoding? In what situation would I use it over one-hot encoding?
1) There are different ways to corrupt training, and one of the ways is to use the target values in the training data to impute values that are also used in the training data.
2) One-hot-encoding is great when there are only a few options. But if you have a lot of options, target encoding can be better.
Hi, just a simple question... when we are testing, we will also need to "replace" the actual Red, Green, and Blue values with numbers. In the testing - real world - phase, we won't have the target. And in the K-Fold case we have different numbers for Red, Green, and Blue. So, in a test environment, how would you replace them? I mean, would you take the mean of Red, Green, and Blue from K-Fold, store it somewhere, and replace at testing time, or something else?
Thanks!!!
When you are testing, you use the entire training dataset to determine what numbers replace "red, green and blue".
@@statquest Okay, but how can you get the target means at testing time? The target doesn't exist then, right?
@@aayushsmarten The target means come from the full training dataset.
Isn't the formula for the weighted mean (num of positive option targets + m * overall mean)/(n + m), since n * Option Mean = n * (sum of option targets)/n = sum of option targets = num of positive option targets?
Yep, you can write it either way. I like the way I wrote it because it makes the weighting explicit and easy to see. The other thing is that 'n' does not have to be the number of rows, and thus, it doesn't have to cancel out the denominator of the mean. So that's something to think about as well.
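For anyone following along, the equivalence is just algebra, writing Option Mean as the sum of the option's 0/1 targets divided by the option's row count n:

```latex
\frac{n \cdot \text{Option Mean} + m \cdot \text{Overall Mean}}{n + m}
= \frac{n \cdot \frac{\sum \text{option targets}}{n} + m \cdot \text{Overall Mean}}{n + m}
= \frac{(\text{num. positive option targets}) + m \cdot \text{Overall Mean}}{n + m}
```

since, with 0/1 targets, the sum of the targets equals the number of positives.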
Hi Josh! Great video. Are you planning to add to these videos how to apply them in Python?
Thanks!
That's a good idea. I'll keep that in mind. I show how to do one-hot encoding in my video on XGBoost ua-cam.com/video/GrJP9FLV3FE/v-deo.html but I haven't shown how to use Target Encoding.
Hi Josh. How can you apply these methods in regression models? One-hot encoding seems self-explanatory, but the weighted mean or the CatBoost methods seem like they need additional steps...
For the CatBoost method, they quantize the "label value". catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
So how do we assign values to a categorical column that underwent K-fold Target Encoding for training when it is time to make actual predictions?
If you have new data to make a prediction for, you use the entire training dataset.
What do we do when 2 variables give the same mean value?
Then you use the same value for both. The goal isn't necessarily to give each category a unique value, but to convert the variables into something the algorithm can use in a helpful way, and if multiple variables give you the same mean, well, that might be what helps the most.
Hi Josh! Thank you very much for sharing information regarding statistics. Do you have a video regarding linear model and linear mixed model? I have been struggling with these two. Have a nice day!
Here's my series on linear models: ua-cam.com/play/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU.html
@@statquest thank you, but the linear mixed models is not in there, would love to watch your explanation about it
@@adiskaop I'll keep that in mind. However, I'll tell you the basic idea: linear mixed models boil down to a model where you don't have enough data to estimate all of the parameters in a standard linear model. So, instead of estimating parameters, you just make assumptions.
In the case of LOO, do we sum the weighted means of the remaining subsets while encoding a subset?
For LOO, we use all but one row to calculate the encoding value.
@@statquest So how do we decide which subset to choose and calculate its value for encoding another subset?
@@berke7255 For LOO, you don't decide. When you are encoding a row, you use all other rows to calculate the value.
@@statquest do we sum all other row values or just take the average?
@@berke7255 You just plug the data into the equation for the Weighted Mean, as specified at 11:05. To encode a row, we treat that one row of interest as subset "A", and all of the other rows as subset "B".
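A minimal sketch of that, assuming a pandas DataFrame with a 0/1 target (the group-sum trick below computes each row's "all other rows" statistics without an explicit loop):

```python
import pandas as pd

# Leave-one-out target encoding: to encode row i, treat that row as subset "A"
# and every other row as subset "B", then apply the weighted mean.
def loo_target_encode(df, column, target, m=2):
    grp = df.groupby(column)[target]
    option_sum = grp.transform("sum") - df[target]  # sum over the OTHER rows of this option
    option_count = grp.transform("count") - 1       # count of the OTHER rows of this option
    overall_mean = df[target].mean()
    return (option_sum + m * overall_mean) / (option_count + m)
```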
Hey, how can I use K-Fold Target Encoding when I have more than two targets (not only 0 or 1)? In my dataset there are 1,604 different classes.
See: towardsdatascience.com/target-encoding-for-multi-class-classification-c9a7bcb1a53
Should I care about data leakage if I won't use my model for prediction? I'm trying to use a clustering algorithm to describe my data and I need to convert categorical data to numeric.
I wouldn't worry about it in that case.
So essentially everything greater than 1-fold target encoding is the same as leave one out target encoding?
What time point, minutes and seconds, are you asking about?
@@statquest Hi, first of all, thanks for replying! At mark 10:30 you're showing the method for 2-fold target encoding, where you divide the dataset into 2 subsets. Then at 13:55 you're dividing it into 7 subsets as an example of 7-fold target encoding but, at the same time, leave-one-out target encoding. Isn't it the same? The terminology got me confused. What's the difference between the terminology, i.e., 7-fold target encoding and leave-one-out target encoding?
@@jakubstrawa8629 For k-fold encoding, if k=the number of rows, then it is the exact same as leave-one-out encoding and we can use either term interchangeably.
@@statquest Ohhh okay, I understand the difference now. So if we have a dataset of, let's say, 10 rows and we divide it into 3 parts (3/3/4), then it's only 3-fold encoding but not leave-one-out?
@@jakubstrawa8629 yep.
Well, it is explained for classification, but what about regression?
Everything is the same for both methods (classification and regression). Just replace the 1's and 0's in the target variable with continuous values.
In a better world, everyone would like Troll 2.
bam! :)
hey Josh and thank you as always for an amazing tutorial! :)
When we perform k-fold target encoding, does it apply to all the data, and then you split into train-test sets? And then, when you have new data that you feed to the trained model and want to get predictions for, what values would you assign to the categorical features? The ones you'd get by just performing target encoding on all the initial data you had (train + test)?
thank you!
Typically, we would split the data into training and testing bits. Then apply k-fold target encoding on the training data and build the model. Then, when testing, we use all of the training data to encode each row of testing data and run it through the model.
@@statquest thanks!
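That workflow, sketched end-to-end with the hypothetical helpers from earlier in this thread (kfold_target_encode() and encode_new_data(); `df`, "color", and "loves_troll2" are made-up names):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1) Split into training and testing bits.
train, test = train_test_split(df, test_size=0.3, random_state=0)

# 2) K-fold target encode the training data and build the model.
train = train.copy()
train["color_enc"] = kfold_target_encode(train, "color", "loves_troll2", k=2, m=2)
model = RandomForestClassifier().fit(train[["color_enc"]], train["loves_troll2"])

# 3) Encode the test rows using ALL of the training data, then predict.
test = test.copy()
test["color_enc"] = encode_new_data(train, test, "color", "loves_troll2", m=2)
predictions = model.predict(test[["color_enc"]])
```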
This is the first Statquest I've ever watched which I've actually found quite confusing - two things it leaves unanswered:
a) It's quite a leap to go from encoding categorical features in a way that each category always has the same numerical representation (e.g. blue is always 1), to a way that doesn't, and there is no explanation of why that works. I'm thinking - "How does the model know that 0.22 and 0.5 are referring to the same thing - isn't that an issue?".
b) With k-fold TE, how do you encode the test data?
a) The model doesn't know that 0.22 and 0.5 are referring to the same thing. Is this ideal? Probably not - leaving the features as discrete options to begin with would probably be better. But if we don't have that choice, we have to make some sort of compromise.
b) We just use the full training dataset.
@@statquest Ah okay, that makes sense re. a), thanks so much for the response. For b), makes sense that you would use the whole dataset for training, but how would you encode the categorical variables if you wanted to use the model with new data/in production?
@@tompease95 To be clear - when you have new data and your model is in production, you use the full training dataset. In other words, given the 7 rows of data in the example in the video, you would use all 7 rows to encode new data when using this in a production setting. You can precompute these values from the full training dataset so the encoding can be very fast.
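A sketch of that precomputation (names hypothetical): fit the lookup table once from the full training data, so serving is just a dictionary lookup.

```python
# Precompute production encodings from the full training data (m = 2 here).
overall_mean = train["loves_troll2"].mean()
stats = train.groupby("color")["loves_troll2"].agg(["mean", "count"])
lookup = ((stats["count"] * stats["mean"] + 2 * overall_mean) / (stats["count"] + 2)).to_dict()

def encode_for_serving(color):
    # Categories never seen in training fall back to the overall mean.
    return lookup.get(color, overall_mean)
```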
Great video! One question - in using k-fold target encoding, it seems like you're being counterproductive in producing encodings that are useful to a model. For example, a node of an XGBoost tree will look at 'colour' and split based on the value of the encodings, treating it as a continuous variable. But what if, by k-fold encoding, 'Blue' has values between [0.3, 0.5] and 'Red' has values between [0.2, 0.5]? Now you've disrupted the model's ability to effectively split on this node! Appreciate any thoughts on this you might have.
First of all, I'm pretty sure XGBoost efficiently uses one-hot encoding for all discrete features, no matter how many options there are (by implementing sparse matrices). However, assuming you still have to translate the discrete features to continuous values, if there is overlap between Red and Blue, then that might actually be fine. The important thing isn't to split Red from Blue, but to make good predictions, and that might still happen, even with overlap.
If my column has 3 categories, should I convert it into 2 columns? (because with the information from two columns, I can determine the value of the third category). I understand that this conversion is commonly done in linear models to handle multicollinearity caused by the additional columns, but I'm unsure about its application in models like decision trees.
Great videos!
The way design matrices are made for linear models is different than one-hot-encoding because they, specifically, need to deal with multicollinearity. However, most other ML methods, including decision trees and neural networks, don't have that problem, so for them, if you have 3 options, you have 3 columns.
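A quick pandas illustration of the distinction (a sketch): drop a column for design-matrix-style linear models, keep all of them for trees and neural networks.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["blue", "red", "green"]})
full = pd.get_dummies(colors, columns=["color"])                      # 3 columns: trees, neural nets
reduced = pd.get_dummies(colors, columns=["color"], drop_first=True)  # 2 columns: avoids multicollinearity
print(full.shape[1], reduced.shape[1])  # 3 2
```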
Hello, I have watched some videos but don't know the meaning of "Bam". Can anyone explain it for me, please?
Sure! If you want to learn about "Bam!", check out this StatQuest: ua-cam.com/video/i4iUvjsGCMc/v-deo.html It's clearly explained!!! :)
great!
:)
5:06 that would be the average. The mean would be 0.
Edit: excuse me, I was confusing the mean with the median. Carry on.
bam! :)
Hi Josh.. Another great video.. Just curious, if we want to encode postcodes or something similar like user IDs or customer IDs where the target variable is not available, which technique should we use?
All of the methods I know of require some sort of target.
@@statquest Thanks Josh.. If possible, could you please cover some videos on contextual anomaly detection.
@@DanishAlam-lp2cq I'll keep that in mind.
Hey Josh,
Love the videos. I'm left with one question: is there anything we can do when we are doing multiclass classification and need to transform our predicted variable so that the algorithm isn't working with string data?
See: stats.stackexchange.com/questions/452022/can-target-encoding-be-performed-on-a-multi-label-classification-problem
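For the predicted variable itself, a plain label encoding is usually fine, since most classifiers treat the target as unordered class IDs (the false-ordering concern from the video applies to features, not the target). A minimal sketch with scikit-learn:

```python
from sklearn.preprocessing import LabelEncoder

# Map string class labels to integer IDs and back (labels here are made up).
le = LabelEncoder()
y_encoded = le.fit_transform(["cat", "dog", "bird", "dog"])  # -> [1, 2, 0, 2]
original = le.inverse_transform(y_encoded)                   # back to the strings
```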
Isn't it Trolls 2, not Troll 2? Or is this some other thing or fictional version or something?
It's Troll 2: en.wikipedia.org/wiki/Troll_2
The methodology is very clear here. But one thing that bothers me is how a categorical feature can now have different numeric values - like Blue being .22 and .50. Let's say that all people who liked Blue also like Troll 2, and nobody else liked it. It seems that relationship would be easily recognized in a model if Blue was kept as is, or was given the same numeric value, but now it could be masked with the encoding techniques giving Blue different values. Am I missing something?
Yes, when you have crazy situations where there is a super tight correlation between a category and the label, this can cause problems. Thus, the folks that created CatBoost came up with a work around. To learn more about that strategy, see: ua-cam.com/video/KXOTSkPL2X4/v-deo.html
Why did we choose m=2?
This question is answered at 7:23
If anybody can help me understand the ONLY thing that's not clear to me:
- In K-Fold target encoding, based on what do you choose the value of the weight?
- Moreover, when I try it, I get results slightly above 1 (like 1.04, 1.07, etc.). That shouldn't be right? But I can't figure out why, though.
I talk about how to think about what would be a good value for 'm' at 7:11. However, if you have a large dataset, I've seen a lot of people just set it to 10. Also, the output should always be between 0 and 1.
Hi Josh!
I'm sorry, but I don't understand something in the K-fold target encoding. If blue has different values when I concatenate the subsets into one single dataframe, how does the algorithm know that both are blue, given that they have different values?
What confuses me is that every previous method keeps the information of the color no matter the row, and only this one changes it...
Thanks for all your good work 😁
The algorithm doesn't keep track of the original colors - just the values that replaced them. In this example, we end up with two different numbers that represent blue. If the dataset is relatively large, then those numbers should be relatively close to each other compared to the numbers that represent other colors, so in that sense, by proximity, information about the original color "blue" is retained.
@@statquest Thank you for your fast answer, it's getting clearer in my mind ! Best regards from France
BAM!💥
:)
3:03 5:33 6:17 6:27 6:42
i come here for beep boop boop beep beep
bam!
When I get stuck in comprehending any ML concept, I turn to StatQuest.
bam! :)
@@statquest Double bam!
I will stick with the 43k columns
:)
BAM!!
:)
Multiple BAMS
:)
Like this comment for "Shameless self liking" 😀
bam! :)
pin for absolutely no reason?
who's watching when neural networks took over the world (2152)
👇