Using One Hot Encoder for creating dummy variables & encoding categorical columns | Machine Learning

  • Published 13 Jul 2020
  • In this tutorial, we'll go over One Hot Encoding, a very popular technique for encoding "nominal" categorical variables, i.e. ones which don't have any particular order between them, e.g. a list of: Tomato, Onion, Mango, Banana, Apple.
    One Hot Encoder creates one column per category per categorical column. Hence there is a potential "downside" of expanding the feature space, but more often than not, it is not that big an issue.
    With code examples, we'll go through the main differences between One Hot Encoder and get_dummies, and the advantages of the former over the latter (see the short sketch at the end of this description).
    Here is the link for the explanation of 'dummy variable trap' as talked about in the lecture:
    www.algosome.com/articles/dum...
    You can, of course, head over to a YouTube video explanation of the same if watching static text isn't helping.
    I've uploaded all the relevant code and datasets used here (and in all the other tutorials, for that matter) on my GitHub page, which is accessible here:
    github.com/rachittoshniwal/ma...
    If you like my content, please do not forget to upvote this video and subscribe to my channel.
    If you have any qualms regarding any of the content here, please feel free to comment below and I'll be happy to assist you in whatever capacity possible.
    Thank you!
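    Here is a minimal sketch of the two approaches compared in the video; the 'fruit' column is a made-up example, and in scikit-learn 1.2+ the sparse parameter is named sparse_output:
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    # a toy nominal column
    df = pd.DataFrame({'fruit': ['Tomato', 'Onion', 'Mango', 'Banana', 'Apple']})
    # pandas route: encodes only the categories present in this particular frame
    dummies = pd.get_dummies(df['fruit'])
    # sklearn route: remembers the categories seen at fit time, so a test set
    # can later be transformed with the exact same column structure
    ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
    encoded = ohe.fit_transform(df[['fruit']])
    print(ohe.categories_)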

COMMENTS • 41

  • @KA00_7 • 2 months ago

    In-depth and the best explanation video.

  • @pankajgoikar4158 • 1 year ago

    Very nice video. It cleared up my concepts. Love from London.

    • @pankajgoikar4158 • 1 year ago

      I have one question, and I don't know if this is the right time to ask. After applying OneHotEncoding to the independent and dependent features, do we have to combine the encoded df with the original df before applying any algorithm, or just apply the algorithm on the encoded independent and dependent features? Kindly let me know. Thank you in advance.

  • @thepresistence5935 • 2 years ago

    Ultimate tutorial

  • @harishgehlot__ • 2 years ago

    This is such a great playlist brother🙏🙏

  • @atiaspire • 3 years ago +4

    Can you please make a video on the correct steps to complete an ML project life cycle?

  • @jongcheulkim7284 • 1 year ago

    Thank you

  • @chandravardhansinghkhichi2648 • 3 years ago

    Great explanation, buddy.

  • @owusubright1046 • 3 years ago +1

    Thanks for your assistance, sir, I really appreciate it a lot. There are many feature selection techniques; which one do you think is the best for selecting the most important features to train your model on?

    • @rachittoshniwal • 3 years ago

      There's no one right method; it varies with every dataset.

  • @ncheymbamalu4013 • 2 years ago

    What about the test set's multiclass categorical features? You only one-hot encoded the train set.

  • @levon9 • 3 years ago

    Quick question: when you do the list iteration at 18:14 to find categorical columns, why does the comparison with an uppercase 'O' work when the dtype is 'object'? What you have works, but I'm not sure why it shouldn't be lower case (which I tried, and which doesn't work). Is there a list of dtype abbreviations that are all in upper case?
    Also, any reason to prefer this approach over using select_dtypes()? Thanks.
    Again, I really enjoy your videos/explanations. I subscribed.

    • @rachittoshniwal • 3 years ago +1

      Hi, thanks!
      Maybe this SO question helps:
      stackoverflow.com/questions/29245848/what-are-all-the-dtypes-that-pandas-recognizes
      Also, no particular reason not to use the select_dtypes() approach. Either should be just fine.
      Hope it helps!
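      For reference, a minimal sketch of both approaches, assuming a DataFrame named df ('O' is NumPy's upper-case kind code for the object dtype, which is why a lower-case 'o' fails to match):
      cat_cols = [col for col in df.columns if df[col].dtype == 'O']
      # equivalent, using pandas directly:
      cat_cols = df.select_dtypes(include='object').columns.tolist()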

  • @owusubright1046 • 3 years ago

    Please, is it okay to use pandas.fillna() to replace null values?

  • @akashkunwar • 2 years ago

    Why did you transform the ordinal-value column with OHE in this video? Shouldn't it be encoded with OrdinalEncoder?

  • @atiaspire • 3 years ago +1

    I actually didn't understand the use of 'sparse'. Also, if we drop one label, then, as per the previous video's example of the 'pink' color which was not present in train but was present in test, the values will be 0 for both the new label and the one which was dropped. Please correct me.

    • @rachittoshniwal • 3 years ago +2

      Great questions!
      I think you'd learn a lot about sparse matrices by reading this:
      dziganto.github.io/Sparse-Matrices-For-Efficient-Machine-Learning/
      Regarding your second question:
      sklearn will throw an error when you set drop='first' and handle_unknown='ignore' at the same time, and this is precisely to avoid the ambiguity you talked of.
      You can only use one of the two parameters at a time.
      I hope I was clear. If not, do let me know!
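      If it helps, a small sketch of both points, assuming a scikit-learn version from around when this video was made (since v1.0 the two parameters can actually be combined, and since v1.2 sparse is named sparse_output):
      from sklearn.preprocessing import OneHotEncoder
      # sparse=True (the default) returns a SciPy sparse matrix that stores only
      # the non-zero entries; sparse=False returns an ordinary dense NumPy array
      ohe_dense = OneHotEncoder(sparse=False)
      # in versions of that era, combining these two raised a ValueError at fit
      # time, precisely to avoid the ambiguity described above
      ohe_ambiguous = OneHotEncoder(drop='first', handle_unknown='ignore')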

  • @mrbalajiganesan • 3 years ago

    Thanks for the tutorial. I have one question: how do we encode a feature that has thousands of values? What should the approach be?

    • @rachittoshniwal • 3 years ago +1

      Hi, thank you for liking the video!
      I'd like to point you to some excellent resources to allay your doubts, hope they help!
      1. If using a tree-based model, you can use a normal ordinal encoder instead of OHE and get quite similar results without expanding the feature space at all:
      nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/43_ordinal_encoding_for_trees.ipynb
      2. And this excellent stack exchange answer:
      stats.stackexchange.com/questions/367793/how-to-handle-too-many-categorical-features-with-too-many-categories-for-xgboost
      (Btw, for humans, that many columns is a frightening concept, but for a computer it may not be that difficult, you know!)
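      For what it's worth, a minimal sketch of the ordinal-encoding idea from the first link (assuming scikit-learn >= 0.24, where the unknown-value handling was added):
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import OrdinalEncoder
      # a high-cardinality column stays one integer column instead of exploding
      # into thousands of one-hot columns; categories unseen at fit time are
      # mapped to -1 instead of raising an error
      encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
      model = make_pipeline(encoder, RandomForestClassifier())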

    • @mrbalajiganesan • 3 years ago +1

      @@rachittoshniwal Thanks a lot 👍

    • @rachittoshniwal • 3 years ago

      @@mrbalajiganesan you're welcome!

  • @owusubright1046 • 3 years ago +1

    Please, sir, why would someone drop a column when performing one-hot encoding?

    • @rachittoshniwal • 3 years ago

      It is optional, really. Even if you drop one of the columns generated from the n categories of a column, you are not going to lose any information: it can be recovered from the other columns. For example, if all of a row's values are zero, it implies the row belongs to the category that was dropped.
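      A tiny sketch of that, with a made-up 'color' column (in scikit-learn 1.2+ the sparse parameter is named sparse_output):
      import pandas as pd
      from sklearn.preprocessing import OneHotEncoder
      df = pd.DataFrame({'color': ['red', 'green', 'blue']})
      ohe = OneHotEncoder(drop='first', sparse=False)
      print(ohe.fit_transform(df[['color']]))
      # categories sort to ['blue', 'green', 'red'] and 'blue' is dropped,
      # so its row comes out as all zeros: [[0. 1.], [1. 0.], [0. 0.]]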

    • @owusubright1046 • 3 years ago

      Okay, after reading about the dummy variable trap, I understand. But suppose I have a feature containing male and female, and I drop, let's say, the male column. After building my model and deploying it into production, if the input features require a gender (male/female) and the model was only trained using the female column, won't this affect the performance of the model in production, since it has not been trained on males? Please explain it to me, thanks.

    • @rachittoshniwal • 3 years ago

      @@owusubright1046 It should not, because the female column contains 1 for female, right? Wherever it is 0, it implies male. So the information is not technically lost.

  • @7justfun • 3 years ago

    Namaste to you too...

  • @mahery_ranaivoson • 3 years ago

    It took me 2 hours to figure out that all the column names have a leading space. You could have warned your audience about this kind of thing. Not everyone likes copy/paste.

    • @rachittoshniwal • 3 years ago +3

      Sad to hear you faced an issue with that.
      In my (small) defence, I do mention this thing about the columns in some videos, but unfortunately I didn't do it in this one.
      On a positive note though, now you know exactly what that error message means! And you won't trip over such errors in the future :)

    • @mahery_ranaivoson • 3 years ago +1

      @@rachittoshniwal Thanks for your reply. By the way, can you make videos about techniques to analyze OneHotEncoded features?

    • @rachittoshniwal • 3 years ago

      @@mahery_ranaivoson "techniques to analyze OHE features" as in? I didn't quite get you

    • @mahery_ranaivoson • 3 years ago +1

      @@rachittoshniwal Like finding multicollinearity, as you mentioned in the video, if I'm not mistaken.

    • @rachittoshniwal • 3 years ago +1

      @@mahery_ranaivoson sure, alright!

  • @phoebemagdy1554 • 3 years ago

    Many thanks for the highly informative tutorial.
    But unfortunately, I faced a problem:
    I first split the data and then created a OneHotEncoder instance:
    OHE = OneHotEncoder(sparse=False)
    OHE.fit_transform(X_train[['length']])
    My X_train categorical variable has 3 categories, and the units available in my X_train dataset are a mix of the 3 categories,
    but the units available in my X_test dataset have only 2 categories:
    OHE.fit_transform(X_test[['length']])
    When I apply it on my test set, it does not retain the same structure of the 3 categories as in the video; unfortunately, it behaves exactly as if I were using pd.get_dummies.
    Kindly see the attached photo for more illustration of the problem I faced:
    drive.google.com/file/d/1WGdbbbpMximUU51VXUSfq8VUMlba0l3C/view?usp=sharing

    • @rachittoshniwal • 3 years ago +1

      Oh, no no no. You shouldn't be "fitting" the OHE on the test set. Just take the OHE you've fitted on the train set and use its transform method on the test set.
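      In code, reusing the X_train / X_test / 'length' names from the comment above, the pattern would look something like this (handle_unknown='ignore' is an extra safety net for categories the train set never saw):
      from sklearn.preprocessing import OneHotEncoder
      ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
      # learn the 3 categories from the training data only...
      X_train_encoded = ohe.fit_transform(X_train[['length']])
      # ...then reuse them on the test set, so the 2 categories present there
      # still come out as 3 columns, matching the training structure
      X_test_encoded = ohe.transform(X_test[['length']])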

    • @phoebemagdy1554 • 3 years ago +1

      @@rachittoshniwal Many thanks for your prompt reply. It now works properly, exactly as you explained in the tutorial. Thanks again.