Thanks for watching! Let me know if you have any questions about OneHotEncoder, OrdinalEncoder, or LabelEncoder! 👇
Shouldn't one-hot encoding be done differently on the train and test sets, like ohe.fit_transform(X_train) and ohe.transform(X_test)? And if yes, then how do we ensure that the number of distinct values in a variable is equal in the train and test sets, so that we don't get more dummy variables in the test set?
Yes, OneHotEncoder should be done differently on train and test, just like you outlined! And because you are using fit only on train (and not on test), it will learn the categories present in train, and apply those same categories to test. In other words, you don't have to ensure that train and test have the same categories.
If the testing set does have additional categories that were not present in the training set, here's how you can handle that situation: ua-cam.com/video/bA6mYC1a_Eg/v-deo.html
Does that help?
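A minimal sketch of the pattern described in the answer above (the column name and values are made up for illustration): the encoder is fit on the training set only, so the test set is encoded with the categories learned from training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'Shape': ['square', 'oval', 'square']})
X_test = pd.DataFrame({'Shape': ['oval', 'square']})

ohe = OneHotEncoder()
train_encoded = ohe.fit_transform(X_train).toarray()  # learns categories from train only
test_encoded = ohe.transform(X_test).toarray()        # reuses those same categories
```

Both outputs have one column per category seen during fitting, regardless of what appears in the test set.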
Thank you for the video... but is there a library available in Python right now that I can use to actually run an ordinal/probit logistic regression? statsmodels seems to have dropped it and I haven't found another library...
Thanks, you do a great job explaining these crucial concepts for the ML workflow and being concise in this playlist. Great work
Glad it was helpful!
What a great content, thanks for sharing it.
As always something very good.
I'm your fan.
Thank you so much! 🙏
How would you encode data that has an ordering but is cyclical? An example might be a "Time of Day" feature with entries such as 'morning', 'noon', 'evening', 'night'. There is an ordering to this, but it's cyclical, so there is no lowest or highest value.
Thanks! I very much enjoy your explanations and videos.
Excellent question! By default, I would use OneHotEncoder. However, I might try OrdinalEncoder if I believed that there was an ordered relationship in "Time of Day" that was predictive of the target. For example, if I was predicting crime rate, and I discovered that crime tended to increase as the day goes on, I might encode morning=0, noon=1, evening=2, and night=3. Ultimately, I would choose whichever encoding was producing more accurate results. Hope that helps!
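The ordered encoding suggested in the answer above can be sketched by passing an explicit category list to OrdinalEncoder (the sample values are made up):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'TimeOfDay': ['noon', 'morning', 'night', 'evening']})

# Explicitly define the logical ordering so morning=0, noon=1, evening=2, night=3
oe = OrdinalEncoder(categories=[['morning', 'noon', 'evening', 'night']])
encoded = oe.fit_transform(X)
```

Without the categories parameter, OrdinalEncoder would instead assign codes alphabetically, which would not reflect the time-of-day ordering.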
Thanks for these educative lectures. Please, I have a thought: if I have a pipeline with OneHotEncoder and OrdinalEncoder, with the same data you used for the practice for example, how will the pipeline know which columns to OneHotEncode and which ones to OrdinalEncode?
Thanks
short and precise !! thank you so much
You're very welcome!
Thanks for sharing this useful tip
You're welcome! Glad it was useful to you!
Amazing video, subscribed! Just to clarify on the plane problem: I'd order [third, second, first] to make first class the highest rank during training, right? Thanks!
Glad you liked the video! Regarding your question, reversing the ordering of the categories for the OrdinalEncoder would indeed reverse the encoding scheme so that third=0, second=1, first=2. However, reversing the encoding would not benefit (or hinder) the ML model, so you are welcome to use either ordering. Hope that helps!
I'm just wondering about something: when do we encode using dummy variables? I mean, when is it necessary?
Thanks for the explanation, sir. However, I have several questions. Does LabelEncoder work as well for nominal unordered data? How do we find the mode for nominal data such that it shows the pre-encoded data (not the numbers)? How do we find the mode and median (Q1, Q2, Q3) for ordinal data such that it shows the pre-encoded data (not the numbers)? Thanks
Great video! I have a dataset with categorical and numerical values, and my question is: should I encode just the categorical values, then merge that with the dataset, and then drop the non-encoded values?
I recommend using ColumnTransformer and Pipeline instead! More details here: courses.dataschool.io/scikit-learn-tips
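A sketch of the ColumnTransformer approach recommended above, which also answers the earlier pipeline question: you tell it explicitly which columns get which encoder (the column names and values here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    'Shape': ['square', 'oval', 'circle'],   # nominal column
    'Class': ['third', 'first', 'second'],   # ordinal column
    'Size': [10, 20, 30],                    # numerical column
})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['Shape']),
     ('ordinal', OrdinalEncoder(categories=[['third', 'second', 'first']]), ['Class'])],
    remainder='passthrough')  # numerical column passes through unchanged

result = ct.fit_transform(df)
```

No merging or dropping is needed by hand: ColumnTransformer applies each encoder to its listed columns and stacks the results side by side.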
Super good! Thank you!
Glad it's helpful to you! 🙌
phenomenal presentation as always
Thank you! 🙏
What would be a good way to encode, when there are lots of ordinal features with multiple labels? Is manually defining the categories the only way?
Great question! If the features are ordinal, and you want the model to learn the logical ordering of the categories, then yes the only way is to manually define category ordering.
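For several ordinal columns at once, OrdinalEncoder accepts one ordered list per column, in the same order as the columns (a sketch with made-up column names and levels):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'Size': ['M', 'S', 'L'],
                  'Quality': ['good', 'poor', 'great']})

# One ordered category list per column, matching the column order
oe = OrdinalEncoder(categories=[['S', 'M', 'L'],
                                ['poor', 'good', 'great']])
encoded = oe.fit_transform(X)
```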
@@dataschool Great. Thanks!
Hi, so I'm trying to work with data that has two entries: Column X: X1,X2,X3,X4,X1 and column Y: Y3,Y4,Y2,Y1,Y1. Is there a way to get a matrix of shape rows = X1,..,X4 and columns = Y1,..,Y4 (all unique) with 1s where a relationship exists and zero otherwise? For example the first row would be X1: Y1(1),Y2(0),Y3(1),Y4(0), similar for X2, X3 and X4 rows? Thanks
I'm so sorry, but I don't quite follow... I hope you were able to figure it out!
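For readers who interpreted the question as building a 0/1 co-occurrence matrix, one possible sketch uses pandas crosstab with the exact data given in the question:

```python
import pandas as pd

df = pd.DataFrame({'X': ['X1', 'X2', 'X3', 'X4', 'X1'],
                   'Y': ['Y3', 'Y4', 'Y2', 'Y1', 'Y1']})

# Count co-occurrences of each (X, Y) pair, then clip counts to 1
# to get an indicator matrix with rows X1..X4 and columns Y1..Y4
matrix = pd.crosstab(df['X'], df['Y']).clip(upper=1)
```

The first row (X1) comes out as Y1=1, Y2=0, Y3=1, Y4=0, matching the example in the question.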
Hi, I want to know: for ordinal columns, after the data is split, should I just use transform on the validation data (like the test data), or should both train and validation data get fit_transform?
I'm sorry, but I would need a lot more information in order to answer your question. I hope you were able to figure it out!
@@dataschool What I meant was: I have separate train and test datasets. I will keep the test set aside, and I want to split the train data into train and validation sets. So when I run scaling on the train and validation sets, X_train will take fit_transform, but what about X_validation... will it be fit_transform, or just transform like the test set?
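The usual convention, following the same logic as the train/test answer earlier in the thread: fit on the training set only; both validation and test sets get transform. A minimal sketch with StandardScaler and made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_val = np.array([[2.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_val_scaled = scaler.transform(X_val)          # transform only, no fitting
X_test_scaled = scaler.transform(X_test)        # transform only, no fitting
```

Fitting on the validation set would leak information from it into the preprocessing, defeating its purpose as a held-out set.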
What if I had a column such as a city name and I wanted to give each city a unique ID? Assume the model I am using is the likes of a decision tree or random forest. Here I cannot use OrdinalEncoder as there is no specific order, and using OneHotEncoder might generate a lot of columns if there are many unique city names. Can I use LabelEncoder on that column then?
Great question! With a tree-based model, you should try both OrdinalEncoder and OneHotEncoder, and see which performs better. For OrdinalEncoder, if you don't specify the categories, it will just assign them alphabetically (just like LabelEncoder would).
With a non-tree-based model, you should still use OneHotEncoder even if it creates lots of columns.
Hope that helps!
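The default behavior mentioned above can be sketched as follows (city names are made up): with no categories argument, OrdinalEncoder sorts the values it finds and assigns codes in that order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'City': ['Tokyo', 'Boston', 'Madrid']})

oe = OrdinalEncoder()  # no categories specified: sorted (alphabetical) order
oe.fit(X)
```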
@@dataschool Yes, that was helpful. I didn't realize that categories is an optional parameter for OrdinalEncoder.
Another question, though: how do these encoders handle/encode values that were not present when fitting the encoders? For example, if I fit any of these three encoders on a column in the training set which only had categories 'A' and 'B', and use it to transform the same column in the testing set which has categories 'A', 'B', and 'C', how would 'C' be encoded?
For OneHotEncoder, there is a "handle_unknown" parameter that controls what happens when you pass it a previously unseen category during the transform. I'll be covering that during the tip 7 video, which comes out this Tuesday!
For OrdinalEncoder, there is no similar functionality yet available, but it will be available in the next scikit-learn release (0.24).
For LabelEncoder, I assume it will error (but I haven't checked).
Thanks again for the questions!
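A sketch of the handle_unknown parameter mentioned above (column name and values are made up): with handle_unknown='ignore', a category never seen during fitting is encoded as a row of all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'Letter': ['A', 'B', 'A']})
X_test = pd.DataFrame({'Letter': ['A', 'C']})  # 'C' was never seen during fit

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train)
encoded = ohe.transform(X_test).toarray()  # unseen 'C' becomes an all-zero row
```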
Hi, I have a question: do you have to first convert the column to 'categorical' prior to encoding? In some examples I see the column dtype being converted, and in some examples I do not see it being converted. I performed a test and the results were the same whether I convert it or not. Could you please shed some light?
Great question! Doesn't make a difference because scikit-learn doesn't currently use that information from pandas.
If we use OrdinalEncoder, would the algorithm consider the relationship between predictor and response to be additive?
I'm sorry, I don't know exactly what you mean by "additive"... could you clarify? Thanks!
@@dataschool Hmm, maybe not the best choice of word, but for example: the effect of 2 is twice the effect of 1, 3 thrice that of 1, and so on... Am I making any sense?
Thanks for clarifying! The answer is that it depends on the type of model. For a linear model, the answer is yes. For a tree-based model, the answer is no. Hope that helps!
Can you explain the differences between labels and features?
Sure! Features are the inputs to the model, and are commonly referred to as "X". Label refers to the "class label" or "target value" or "response value", and is commonly referred to as "y". For example, if I was predicting "spam" or "not spam" as a classification problem, then the column that says "spam" or "not spam" for each sample is called the label. Hope that helps!
@@dataschool Thanks Kevin!!
I do not understand why OneHotEncoder.fit_transform() took X[['Shape']] as argument rather than X['Shape'].
EDIT: Now I think I understand. Passing a list of columns when subsetting a DataFrame will produce a new DataFrame (rather than a Series) even if the list has length 1.
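The distinction described in the edit above can be sketched directly (column names are made up): single brackets return a one-dimensional Series, while a list of columns, even of length one, returns a two-dimensional DataFrame, which is the shape OneHotEncoder expects.

```python
import pandas as pd

X = pd.DataFrame({'Shape': ['square', 'oval'], 'Size': [1, 2]})

single = X['Shape']     # Series (1-dimensional)
double = X[['Shape']]   # DataFrame (2-dimensional), even with one column
```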
Thank you.
You're welcome!
Nice videos
Thanks!