I really like that he explains the extra attributes and the things that people gloss over. For example, the : and axis. I'm a newbie, so that little stuff was useful to me.
Great to hear!
Amazingly clear explanation!!! Thank you!!
You're very welcome!
Your videos are making me passionate about a data science career again, and they are also making my first job in data analytics so much easier. Thank you so much!
You're welcome!
Wow!!! This is my first video watching you teach. it's crystal clear!!! Looking forward to more video!
Awesome! Thank you!
very easy to follow and understand, in contrast with many other tutorials I found, great and many thx
Great to hear! Thanks for your kind words!
I am addicted to your videos... I want to redo my old assignments with all the tricks :)
Ha! Great to hear :)
Thank you for illustrating it so well. I was not clear on the reasoning behind dropping the first column when using dummies, but now I have a clear idea about that.
Glad I could be helpful!
Hey I'm new to Python and I just wanted to say that your videos are super clear and easy to understand! This has been a great help for me! Teaching code is clearly your calling
Thanks very much for your kind words! I really appreciate it 🙏
Brother, I couldn't get my head around using categorical variables even after watching so many tutorials, and I had decided not to use them in the model. But you made it so easy. Thanks a lot for such a wonderful explanation :p
Great! You're very welcome!
These are great tutorials! Finally found a clear, concise explanation for why your code is written the way it is :)
Thank you!
This was so useful! I didn't know your channel before I googled how to make dummies in pandas. Definitely going to check out your other videos :)
Most underrated pandas tutor/course. Absolutely amazing! Thanks for sharing, Kevin!
Thanks very much for your kind words!
Easy to understand and straight to the point. Thank you for your tutorials, they have been a great help.
You're welcome!
Love all you videos. Clear, easy to follow, and great tone of voice. Thanks!!
Thanks!
Dude seriously, you just saved me a lot of work.
Awesome, that's great to hear!
Best tutorial video on dummy variables.
Hey, this was great! Just implemented this on a live project. You, do not stop making videos, please. Thanks.
Awesome! Glad I could be of help!
Hello Kevin. I am a big fan of your work. As a heavy R user, your tutorials have made me like Python so much that I have completely switched to Python at work now. It would be very helpful if you did a video series on each of the other basic Python packages, like NumPy, Matplotlib, Seaborn, statsmodels, and Bokeh. Learning from your videos is so much easier and less time-consuming. I am currently working on an internship during my course, and I use at least one of your tips daily at work. Thanks again. Hoping to see more good content like this. Cheers!
That's awesome to hear! Thanks for your kind comments and suggestions! I will do my best :)
Thank you. I was searching for what drop_first=True does, and I found this video. The bonus tip you explained cleared up this doubt. Please make more videos like this on interesting tricks and tips for Python, machine learning, and data science.
You are in luck, because I'm working on a video of my top 25 pandas tricks right now!! Stay tuned...
Thanks for showing how to add this to the dataframe, very helpful!
Glad it was helpful!
3:55
Try the following piece of code:
train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex_male'],
      dtype='object')
As usual... I try to understand some ML concept that isn't clear to me, and I go the same way: click, click, click between videos from YouTubers - most of them make videos from the same source, without thinking, without understanding. And then finally, once again, I land on your channel and you explain everything to me clearly and slowly. Thanks for your amazing work!
Thanks so much for your kind words!
Your videos are awesome - I come back to them whenever I get stuck anywhere. Not only do I get solutions, I get a bonus, which is always useful.
Thanks for your kind words! Glad I can be helpful :)
Hello Kevin, for a multi-label categorical field with more than 600 entries, can the same dummy variable strategy be followed? If not, then please suggest ways in which it can be converted to numeric form. Thank you.
You can use the same strategy, though I would recommend using OneHotEncoder from scikit-learn. Hope that helps!
Thanks a lot! I was struggling to get dummies in a different dataset with 9 columns of characters and the rest numbers. This was very helpful. Keep up the good work :)
You're very welcome! :)
I really like your teaching style. Very clear!
Thanks very much for your kind words!
That bonus is awesome... Thank you so much... You explained it so well!
My pleasure!
Fantastic tutorial. A clear cut. Good job!
Thanks!
Great tutorial on how to deal with categorical variables using dummies. The last bonus tip helped with my assignment.
Great to hear!
Thanks again for your wonderful course. It was my huge luck to find it - it's so clear and detailed!
You're very welcome!
We'd like to know more about tensorflow and machine learning. Thanks so much for great videos.
Thanks for your suggestion!
Thanks so much Sir.
Yeah, thanks Kevin, but a TensorFlow tutorial would be booommmm. Please try it, thanks!
I appreciate the suggestion!
Can't thank you enough for the BONUS tip!!!! Impressed!!!
You're very welcome! :)
It is crystal clear , thanks man❤️
You're very welcome!
That bonus tip is amazing. Thank you!
Glad you liked it! You're very welcome :)
you are a wonderful teacher!! thank you very much!!
Thanks so much!
Thank you so much for this video, really well and easily explained!
Thank you!
Thanks for making this crystal clear!
Very helpful and very interesting....keep up the good work always bro....
Thank you!
Dude, you're amazing! new follower here!
Thanks!
3:50
Dummy Variables - an alternative method
pd.get_dummies(train.Sex)
This is a top-level function, meaning you have to write pandas. (or pd.) before it, such as:
pandas.get_dummies()
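For anyone following along, a minimal runnable sketch of the above (the tiny train frame is my own stand-in for the Titanic data):

```python
import pandas as pd

# Toy stand-in for the Titanic 'train' DataFrame used in the video
train = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# get_dummies is a top-level pandas function, so it is called as pd.get_dummies()
dummies = pd.get_dummies(train.Sex)

# One column per category, named after the category values
print(dummies.columns.tolist())  # ['female', 'male']
```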
Exactly what I was looking for. Perfect explanation!!
Awesome! That's great to hear!
Thank you so much for this explanation. Crystal clear concepts and videos!
You're very welcome!
very good especially the last trick !
Thank you!
Hey Kevin, regarding dummy variables: what technique can I apply to model input data if I am foreseeing high sales performance in the future, or pent-up demand, etc.? Would you still add 0 and 1 to flag those dates?
Wow, rarely do you see such a high rating. Usually about 10-20% vote down; good ones are around 5%, and this guy has less than 1%. Gotta subscribe to him if he's that good! I loved the first video, can't wait to see more.
Thank you so much!
Your bonus tips are gorgeous. Cheers !!
Thanks very much! Glad you like them :)
your bonus tip was a life saver! Thank you thank you thank you
Glad it helped!
Thanks bro. You are my hero ❤
Thank you!
Hi Kevin,
In relation to the bonus question,
Do I need to assign the results of get dummies to a variable in order to make the changes permanent?
Yes you do!
Detailed and systematic = easy to follow.
Thanks!
Great Explanation, Just Amazing.
Thank you!
This was useful. How do you create dummies for specific ranges?
For instance, 10-50% is group 1, 50-70% is group 2, etc.
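For future readers, one way to do this is pandas.cut followed by get_dummies. A minimal sketch (the bin edges, labels, and values below are invented for illustration):

```python
import pandas as pd

pct = pd.Series([12, 55, 68, 33, 90])  # hypothetical percentage values

# Bin into labeled groups; right edges are inclusive by default,
# so the bins here are (10, 50], (50, 70], and (70, 100]
groups = pd.cut(pct, bins=[10, 50, 70, 100],
                labels=['group1', 'group2', 'group3'])

# The binned labels can then be expanded into dummy columns as usual
dummies = pd.get_dummies(groups, prefix='pct')
```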
Brilliantly explained!
Glad it was helpful to you! :)
Hi Kevin,
How could I apply this to numeric variables? For example, if the ticket fare is in [0, 2000), assign 0, and if it is in [2000, inf), assign 1.
Thanks!
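For future readers, a minimal sketch for this two-bin case (the fare values are invented):

```python
import pandas as pd

fare = pd.Series([7.25, 512.33, 2500.0, 1999.99])  # hypothetical fares

# With only two bins, a boolean comparison cast to int is enough:
# 0 if fare < 2000, else 1
fare_flag = (fare >= 2000).astype(int)
```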
Very nice video and great explanation. Keep it up 👏👏
Thank you!
Excellent delivery
Thanks!
The bonus tip was freaking awesome... it will save me a hell of a lot of time... thanks a lot :)
You're very welcome! I love that tip as well :)
Thank you for all those great videos!
Thanks!
Awesome - god's work! Clear, to the point, and very, very useful.
Glad it was helpful to you! :)
Thank you so much. You are a really wonderful great instructor..
Thank you so much!
Should we apply feature scaling to categorical columns?
I'm not sure there is a definitive answer to this, sorry!
If you can create a video or series on TensorFlow that is not esoteric, I would be even more impressed than I already am with your video tutorials. Many thanks.
Thanks for your suggestion!
Thank you! Last trick helps a lot!
Great!
Awesome! I have a question. Why are we dropping the first column? As in, for example, Embarked_C?
6:45
Break the bottom piece of code into several segments and contemplate the output. Ask yourself why it happens and when it doesn't. Then following along will start making more sense. Remember, a good data scientist is always thinking and always LEARNING.
pd.get_dummies(train.Sex)
pd.get_dummies(train.Sex, prefix='Sex')
pd.get_dummies(train.Sex, prefix='Sex').iloc[:,:1]
Very useful feature I didn't know about. Thanks.
You're welcome!
Hi. I understand we only need k-1 dummy variables because we can infer the last variable from the rest, but how would that affect certain classifiers, like rule-based ones, for example? If they don't have that last variable, they cannot create rules like "IF Vk = 1 THEN class = 0". I'm thinking they might not be able to infer it because they only use the columns they have.
I'm not sure how to answer that question, I'm sorry!
I am always inspired by your lecture thanks
Thank you! 🙏
Once again, my fighter!
Thanks for this tutorial, it's clear and to the point. :)
You're very welcome!
If we want all three values of the Embarked feature in a single column, i.e. values 0, 1, and 2 for the individual categories, how could we do it?
You could use the map method, see this video for an example: ua-cam.com/video/P_q0tkYqvSk/v-deo.html
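A minimal sketch of that map approach for Embarked (the integer assigned to each port is my own choice, not from the video):

```python
import pandas as pd

embarked = pd.Series(['S', 'C', 'Q', 'S'])

# map replaces each category with the chosen integer, keeping a single column
codes = embarked.map({'C': 0, 'Q': 1, 'S': 2})
```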
Hi, I wanted to know which one you prefer: OneHotEncoder from sklearn or the pandas get_dummies method?
What are the pros and cons of both methods?
I now recommend OneHotEncoder from scikit-learn if your goal is to prepare your dataset for Machine Learning. I have a whole video explaining exactly how to do this: ua-cam.com/video/irHhDMbw3xo/v-deo.html
SUPER VIDEO!! Very Useful!
Thanks!
the bonus tip was really cool. thanks a lot.
You're very welcome!
Hi, just wanted to say I love your videos. Can you please do a video on join(), concat(), and merge()?
Thanks for your suggestion! See here for concat: ua-cam.com/video/15q-is8P_H4/v-deo.html
How do you use get_dummies in a data pipeline?
For example, when the test data and train data are not split from the same source?
For creating dummy variables within a pipeline, I definitely recommend using scikit-learn's OneHotEncoder instead. I have a lesson about that here: ua-cam.com/video/irHhDMbw3xo/v-deo.html
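To make that concrete, here is a minimal sketch of the fit-on-train / transform-on-test pattern with OneHotEncoder (the tiny frames are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test frames that come from different sources
train = pd.DataFrame({'Sex': ['male', 'female', 'male']})
test = pd.DataFrame({'Sex': ['female', 'female', 'male']})

# Fit on the training data only; the learned categories are reused on the
# test data, so both sets get identical columns regardless of their source
ohe = OneHotEncoder(handle_unknown='ignore')
train_encoded = ohe.fit_transform(train[['Sex']]).toarray()
test_encoded = ohe.transform(test[['Sex']]).toarray()
```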
Thank you wonderful explanation as always!
You're very welcome!
Thank you 💞 it's really great ❣️
You're welcome!
Great video and simple explanation. Thank you.
One clarification: if we need to feed the columns into the data frame for modeling, I hope we should not use drop_first=True (since the variable will be lost), or am I assuming wrong?
Can you please explain the difference between join, concat, and append? Thanks.
I just released a video on that topic! See here: ua-cam.com/video/iYWKfUOtGaw/v-deo.html
This really help me today! Thanks!!!
Glad it helped!
Hi Kevin, after using the get_dummies method on a column, should we drop the original column?
Also, other columns will be added to the dataframe due to get_dummies, so how should we assign values to the training data (X)?
Great question, Dilip! There's no one answer as to whether you should drop the original column, but some people recommend it. As to how to include the new columns in X, this video should help: ua-cam.com/video/xvpNA7bC8cs/v-deo.html
2:05
the Series-Map method
train['Sex_male']=train.Sex.map({'female':0, 'male':1})
train.head(2)
Dummy encoded
map({'female':0, 'male':1})
female ==> 0, male ==> 1
This was an exceptional video! Thank you so much! Sincerely!
You're very welcome! Glad it was helpful!
Sir, I want to know: what does the get_dummies() function do, and why is it needed?
That's what the video covers! Hope it's helpful to you.
Really useful video! Thanks for sharing!
You're very welcome!
Last part (Bonus) is awesome.
Thanks!
Suppose I have multiple columns of dummy variables and I simply want a sum of the variables across those columns, how do I do that?
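One way to do this (a sketch with made-up dummy columns): sum with axis=1 to get a per-row total across the chosen columns.

```python
import pandas as pd

# Hypothetical frame of dummy columns
df = pd.DataFrame({'Sex_male':   [1, 0, 1],
                   'Embarked_Q': [0, 1, 0],
                   'Embarked_S': [1, 0, 0]})

# axis=1 sums across the columns, giving one total per row
dummy_cols = ['Sex_male', 'Embarked_Q', 'Embarked_S']
df['dummy_total'] = df[dummy_cols].sum(axis=1)
```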
Hey Kevin :) One question: if we use get_dummies, we add more and more columns to our data frame. Is there any way to do this in place? For example, if our series has 'adult', 'kid', 'senior_citizen', then wherever it occurs, adult gets replaced by 0, kid by 1, senior_citizen by 2, and so on for the different values. Can I map like this? Thanks.
EDIT: I have found it. For future readers, we can do this using sklearn's preprocessing package.
STEPS:
1) Import the class: from sklearn.preprocessing import LabelEncoder
2) Make the object: le = LabelEncoder()
3) To convert into numbers: train['Sex'] = le.fit_transform(train['Sex'])
4) To convert back: train['Sex'] = le.inverse_transform(train['Sex'])
That's it :)
Right! LabelEncoder is useful for taking a series of categorical data and converting it into a series of integers representing the categories.
You can also do this within pandas using factorize: pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.factorize.html
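A minimal factorize sketch (the toy series is invented for illustration):

```python
import pandas as pd

ages = pd.Series(['adult', 'kid', 'senior', 'kid', 'adult'])

# Codes are assigned in order of first appearance; the returned labels
# let you reverse the encoding later
codes, labels = pd.factorize(ages)

# Reverse the encoding by indexing the labels with the codes
restored = labels[codes]
```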
Very clear and very helpful, thank you very much.
But I still don't understand why we need to remove a column after making the dummy columns.
Is it like with training and test data?
Whether or not you need to depends on the circumstances. See this video for more: ua-cam.com/video/NYtwyvyvDEk/v-deo.html
I am sorry if my question is silly, but what is the reason behind dropping the first column? I mean, when we have just two categorical values, it is understood that if one is 0 then the other will be 1. But if we have three or more categorical values, then one value (precisely, the name of that value, like "C" in Embarked) can nowhere be seen.
If you have 3 possible values, then you can encode them with 2 columns: 00, 01, and 10. If you have 4 possible values, then you can encode them with 3 columns: 000, 001, 010, and 100. Does that help?
@@dataschool OK, tell me whether what I understood is correct or not: for three possible values, 00 means two values (out of three) are absent, thus it's understood that the third one is present.
Exactly!
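To make this thread concrete, a small sketch of the k-1 encoding via drop_first (toy data, not the Titanic set):

```python
import pandas as pd

embarked = pd.Series(['C', 'Q', 'S', 'C'])

# Full encoding uses one column per category
full = pd.get_dummies(embarked, prefix='Embarked')

# drop_first=True removes the first category ('C'), which becomes the
# all-zeros pattern: 3 categories need only 2 columns
reduced = pd.get_dummies(embarked, prefix='Embarked', drop_first=True)
```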
Superb explanation !! helped a lot !!
Great to hear!
so helpful, thanks a lot!
Great to hear!
Great explanation. Thank you.
You're welcome!
Thanks for the bonus tip, so grateful
You're welcome!
Which encoder should we use if the column has more than 100 categorical values?
That's a complex question, but you can always try one-hot encoding or ordinal encoding, regardless of the number of levels.
Wowzers!! This is so Cool. I am learning a ton from you. Thank you!!
Awesome! You're very welcome :)
2:12 creating a new column, and mapping variables to that column.
3:49 pd.get_dummies function to get dummy variables
6:47 .iloc to separate the dataframe
11:01 bonus code getting dummies with entire dataframe
Thanks for posting the time codes!
I have a question. 🙋♂️
There are two variables and dozens of observations in the set that we converted to dummy variables. If we delete one of the dummy variables and then also delete the original variable, how does the trained machine understand which value belongs to which variable?
E.g.:
Sex_Female and Embarked_C have been deleted.
Then a new observation comes in for prediction: Sex_Male: 0, Embarked_Q: 0, Embarked_S: 0. Is it Sex_Female that is 0, or is it Embarked_C? How does the machine know which variable is Sex_Female and which is Embarked_C? (There is no ordering, because the real variables have been deleted.)
P.S. If you do not understand the question, I'm sorry for my bad English.
Can we assign numbers from 0,1,2,3... to a categorical variable rather than making n-1 extra columns?
Yes, but you should generally only do that if the categories have a natural ordering.
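A minimal sketch of such an ordinal mapping (the category names and numbers are my own example):

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])

# The natural ordering (small < medium < large) justifies a single integer
# column; the exact numbers assigned are our own choice
size_codes = sizes.map({'small': 0, 'medium': 1, 'large': 2})
```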