How do I create dummy variables in pandas?

  • Published Jan 6, 2025

COMMENTS • 380

  • @haciendadad
    @haciendadad 5 years ago +2

    I really like that he explains the extra attributes and the things that people gloss over. For example, the : and axis. I'm a newbie, so that little stuff was useful to me.

  • @jwilliams8210
    @jwilliams8210 1 year ago +2

    Amazingly clear explanation!!! Thank you!!

  • @Negr0ni
    @Negr0ni 3 years ago +2

    Your videos are making me passionate about a data science career again, and they are making my first job in data analytics so much easier. Thank you so much!

  • @brendensong8000
    @brendensong8000 4 years ago +1

    Wow!!! This is the first video of yours I've watched, and it's crystal clear!!! Looking forward to more videos!

  • @dy1262
    @dy1262 8 years ago +16

    Very easy to follow and understand, in contrast with many other tutorials I found. Great work, and many thanks.

    • @dataschool
      @dataschool  8 years ago +3

      Great to hear! Thanks for your kind words!

  • @manasa41087
    @manasa41087 8 years ago +1

    I am addicted to your videos... I want to redo my old assignments with all the tricks :)

  • @Raaajzzz
    @Raaajzzz 1 year ago +1

    Thank you for illustrating it so well. I was not clear on the reasoning behind dropping the first column when using the dummies, but now I have a clear idea about that.

  • @alensadventures2080
    @alensadventures2080 5 years ago +1

    Hey I'm new to Python and I just wanted to say that your videos are super clear and easy to understand! This has been a great help for me! Teaching code is clearly your calling

    • @dataschool
      @dataschool  4 years ago

      Thanks very much for your kind words! I really appreciate it 🙏

  • @AshishSharma-pm1dc
    @AshishSharma-pm1dc 6 years ago

    Brother, I couldn't get the hang of using categorical variables even after watching so many tutorials, and I had decided not to use them in the model. But you made it so easy. Thanks a lot for such a wonderful explanation :p

    • @dataschool
      @dataschool  6 years ago +1

      Great! You're very welcome!

  • @ashwinsingh1325
    @ashwinsingh1325 5 years ago +2

    These are great tutorials! Finally found a clear, concise explanation for why your code is written the way it is :)

  • @nadineprins1647
    @nadineprins1647 4 years ago +1

    This was so useful! I didn't know your channel before I googled how to make dummies in pandas. Definitely going to check out your other videos :)

  • @PatrickBateman12420
    @PatrickBateman12420 5 years ago

    Most underrated pandas tutor / course. Absolutely amazing! Thanks for sharing, Kevin!

    • @dataschool
      @dataschool  5 years ago

      Thanks very much for your kind words!

  • @flutterflowhack
    @flutterflowhack 2 years ago +1

    Easy to understand and straight to the point. Thank you for your tutorials; they have been of great help.

  • @sophiar5280
    @sophiar5280 4 years ago

    Love all your videos. Clear, easy to follow, and great tone of voice. Thanks!!

  • @LS-rw3hn
    @LS-rw3hn 6 years ago +2

    Dude seriously, you just saved me a lot of work.

    • @dataschool
      @dataschool  6 years ago

      Awesome, that's great to hear!

  • @ajaykushwaha-je6mw
    @ajaykushwaha-je6mw 3 years ago

    Best tutorial video on dummy variables.

  • @anirbanaws143
    @anirbanaws143 7 years ago

    Hey, this was great! Just implemented this on a live project. Please do not stop making videos. Thanks.

    • @dataschool
      @dataschool  7 years ago

      Awesome! Glad I could be of help!

  • @rohitjacob8890
    @rohitjacob8890 6 years ago +2

    Hello Kevin. I am a big fan of your work. As a heavy R user, I have come to like Python so much through your tutorials that I have completely switched to Python at work. It would be very helpful if you did a video series on each of the other basic Python packages, like NumPy, Matplotlib, seaborn, statsmodels, and Bokeh. Learning from your videos is so much easier and less time-consuming. I am currently doing an internship as part of my course, and I use at least one of your tips daily at work. Thanks again. Hoping to see more good content like this. Cheers!

    • @dataschool
      @dataschool  6 years ago

      That's awesome to hear! Thanks for your kind comments and suggestions! I will do my best :)

  • @sandy011187
    @sandy011187 5 years ago

    Thank you. I was searching for what drop_first=True does, and I found this video. The bonus tip you explained cleared up my doubt. Please make more videos like this on interesting tips and tricks for Python, machine learning, and data science.

    • @dataschool
      @dataschool  5 years ago

      You are in luck, because I'm working on a video of my top 25 pandas tricks right now!! Stay tuned...

  • @seansantiagox
    @seansantiagox 3 years ago

    Thanks for showing how to add this to the dataframe, very helpful!

  • @fet1612
    @fet1612 5 years ago

    3:55
    Try the following piece of code:
    train.columns
    Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
           'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex_male'],
          dtype='object')

  • @D4nte-RN
    @D4nte-RN 1 year ago

    As usual, I was trying to understand an ML concept that was not clear to me. I always go about it the same way: click, click, click between videos from YouTubers, most of whom make videos from the same source without thinking or understanding. And then finally, once again, I end up on your channel, and you explain everything clearly and slowly. Thanks for your amazing work!

    • @dataschool
      @dataschool  1 year ago

      Thanks so much for your kind words!

  • @kuldipchauhan524
    @kuldipchauhan524 6 years ago

    Your videos are awesome. I come back to them whenever I get stuck anywhere. Not only do I get solutions, I also get a bonus, and the bonus is always the real deal.

    • @dataschool
      @dataschool  6 years ago

      Thanks for your kind words! Glad I can be helpful :)

  • @TheNikhileshYadav
    @TheNikhileshYadav 5 years ago +1

    Hello Kevin, for a multi-label categorical field with more than 600 entries, can the same dummy-variable strategy be followed? If not, please suggest ways in which it can be converted to numeric form. Thank you.

    • @dataschool
      @dataschool  5 years ago

      You can use the same strategy, though I would recommend using OneHotEncoder from scikit-learn. Hope that helps!
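
A minimal sketch of the OneHotEncoder suggestion, assuming "more than 600 entries" means a single column with roughly 600 distinct values. The column name `category` and the labels are made up for illustration. By default the encoder returns a SciPy sparse matrix, which keeps memory use low when there are this many dummy columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# 600 made-up category labels, each appearing twice (illustrative data only).
labels = [f"cat_{i}" for i in range(600)] * 2
df = pd.DataFrame({"category": labels})

# With default settings the encoder returns a SciPy sparse matrix.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["category"]])

print(encoded.shape)  # one row per sample, one column per distinct category
print(encoded.nnz)    # number of stored (non-zero) entries: one per row
```

Only the non-zero entries are stored, so the 600 columns cost far less memory than a dense DataFrame of dummies would.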

  • @bondmanu
    @bondmanu 7 years ago

    Thanks a lot! I was struggling to get dummies in a different dataset having 9 columns of characters and rest numbers. This was very much helpful. Keep up the good work :)

  • @resap.9128
    @resap.9128 6 years ago +1

    I really like your teaching style. Very clear!

    • @dataschool
      @dataschool  5 years ago

      Thanks very much for your kind words!

  • @sandhya6818
    @sandhya6818 4 years ago +1

    That bonus is awesome... Thank you so much... You explained it so well...

  • @ronenfischer6841
    @ronenfischer6841 7 years ago

    Fantastic tutorial. A clear cut. Good job!

  • @vijayanandhan4649
    @vijayanandhan4649 6 years ago +1

    Great tutorial on how to deal with categorical variables using dummies. The bonus tip at the end helped with my assignment.

  • @nadyamoscow2461
    @nadyamoscow2461 3 years ago

    Thanks again for your wonderful course. I was very lucky to find it. It's so clear and detailed!

  • @sushichanel7299
    @sushichanel7299 6 years ago +2

    We'd like to know more about tensorflow and machine learning. Thanks so much for great videos.

    • @dataschool
      @dataschool  6 years ago +1

      Thanks for your suggestion!

    • @sushichanel7299
      @sushichanel7299 6 years ago

      Thanks so much Sir.

    • @samc2481
      @samc2481 6 years ago

      Yeah, thanks Kevin, but a TensorFlow tutorial would be booommmm. Please try it, thanks.

    • @dataschool
      @dataschool  6 years ago

      I appreciate the suggestion!

  • @sumitbali9194
    @sumitbali9194 6 years ago +1

    Can't thank you enough for the BONUS tip!!!! Impressed!!!

  • @ashokgahatraj1210
    @ashokgahatraj1210 2 years ago +1

    It is crystal clear, thanks man ❤️

  • @thereadletter2426
    @thereadletter2426 8 years ago +2

    That bonus tip is amazing. Thank you!

    • @dataschool
      @dataschool  8 years ago

      Glad you liked it! You're very welcome :)

  • @alainleclerc4523
    @alainleclerc4523 2 years ago +1

    You are a wonderful teacher!! Thank you very much!!

  • @robertue1
    @robertue1 3 years ago +1

    Thank you so much for this video, really well and easily explained!

  • @danielcecchin
    @danielcecchin 4 years ago

    Thanks for making this crystal clear!

  • @dembobademboba6924
    @dembobademboba6924 1 year ago

    Very helpful and very interesting....keep up the good work always bro....

  • @luisportillo3491
    @luisportillo3491 3 years ago +1

    Dude, you're amazing! New follower here!

  • @fet1612
    @fet1612 5 years ago

    3:50
    Dummy variables - an alternative method:
    pd.get_dummies(train.Sex)
    This is a top-level function, meaning you have to write pandas. (or pd.) before it, as in:
    pandas.get_dummies()

  • @heliobteixeira1
    @heliobteixeira1 7 years ago

    Exactly what I was looking for. Perfect explanation!!

    • @dataschool
      @dataschool  7 years ago

      Awesome! That's great to hear!

  • @ameysawant5656
    @ameysawant5656 4 years ago

    Thank you so much for this explanation. Crystal clear concepts and videos!

  • @andreacazzaniga8488
    @andreacazzaniga8488 5 years ago +1

    Very good, especially the last trick!

  • @RachelBb-k6q
    @RachelBb-k6q 3 months ago

    Hey Kevin, regarding dummy variables, what technique can I apply to a model's input data if I am foreseeing high sales performance in the future, pent-up demand, etc.? Would you still add 0 and 1 to flag those dates?

  • @haciendadad
    @haciendadad 5 years ago

    Wow, rarely do you see such a high rating. Usually about 10-20% vote down. Good ones are around 5%; this guy has less than 1%. Gotta subscribe to him if he is that good! I loved the first video, can't wait to see more.

  • @krishnendusaha5940
    @krishnendusaha5940 7 years ago

    Your bonus tips are gorgeous. Cheers !!

    • @dataschool
      @dataschool  6 years ago

      Thanks very much! Glad you like them :)

  • @anotherone6276
    @anotherone6276 4 years ago

    Your bonus tip was a life saver! Thank you, thank you, thank you.

  • @tronalddump2444
    @tronalddump2444 2 years ago +2

    Thanks bro. You are my hero ❤

  • @ramleo1461
    @ramleo1461 5 years ago +3

    Hi Kevin,
    In relation to the bonus question:
    Do I need to assign the result of get_dummies to a variable in order to make the changes permanent?
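
A small sketch that may help with this question. The `train` frame below is a made-up stand-in for the Titanic data, not the real file. As far as pandas behaves here, `get_dummies` returns a new DataFrame and leaves its input alone, so the result has to be assigned to a name to be kept.

```python
import pandas as pd

# Toy stand-in for the training data used in the video (values are made up).
train = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": [22, 38, 26]})

# get_dummies returns a new DataFrame; `train` itself is not modified.
encoded = pd.get_dummies(train, columns=["Sex"], drop_first=True)

print(list(train.columns))    # still the original columns
print(list(encoded.columns))  # the dummy column exists only in the returned object

# To keep the change, assign the result to a name (reusing `train` also works).
train = encoded
```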

  • @22MJangel
    @22MJangel 5 years ago +1

    Detailed and systematic = easy to follow.

  • @dipakraut6058
    @dipakraut6058 5 years ago +1

    Great Explanation, Just Amazing.

  • @dikshyasurvi6869
    @dikshyasurvi6869 3 years ago

    This was useful. How do you create dummies for specific ranges?
    For instance, 10-50% as group 1, 50-70% as group 2, etc.
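
One possible approach, sketched under assumptions: bin the numeric column with `pd.cut` first, then pass the binned column to `get_dummies`. The column name `pct`, the group labels, and the sample values are hypothetical. The bin edges follow the ranges in the question.

```python
import pandas as pd

# Hypothetical percentage column; name and values are illustrative only.
df = pd.DataFrame({"pct": [12, 49, 50, 65, 70]})

# pd.cut uses right-closed intervals by default, so the bins here are
# (10, 50] and (50, 70]: a value of exactly 50 lands in the first group.
df["pct_group"] = pd.cut(df["pct"], bins=[10, 50, 70], labels=["group1", "group2"])

# One dummy column per bin.
dummies = pd.get_dummies(df["pct_group"], prefix="pct")
print(dummies)
```

Values outside all bins become missing in `pct_group` and get zeros in every dummy column, so the outer edges are worth checking against the data.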

  • @bashhwu
    @bashhwu 6 years ago +1

    Brilliantly explained!

    • @dataschool
      @dataschool  6 years ago

      Glad it was helpful to you! :)

  • @carlosdiaz3428
    @carlosdiaz3428 4 years ago

    Hi Kevin,
    How could I apply this to numeric variables? For example, if the ticket fare is in [0, 2000) have a 0 and if it is in [2000, inf) have a 1
    Thanks!
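
For a single threshold like this, a plain comparison may be enough. The sketch below assumes a numeric `Fare` column and uses made-up fares. The new column name `Fare_high` is hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Fare": [7.25, 71.28, 2000.0, 2500.0]})  # made-up fares

# The comparison yields booleans; astype(int) turns them into 0/1.
# Fares in [0, 2000) give 0, fares of 2000 or more give 1.
train["Fare_high"] = (train["Fare"] >= 2000).astype(int)
print(train)
```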

  • @alal-zj4zb
    @alal-zj4zb 5 years ago +1

    Very nice video and great explanation. Keep it up 👏👏

  • @sheheryar89
    @sheheryar89 6 years ago

    Excellent delivery

  • @AnilSahu-hs9tq
    @AnilSahu-hs9tq 7 years ago

    The bonus tip was freaking awesome... it will save me a hell of a lot of time... thanks a lot :)

    • @dataschool
      @dataschool  7 years ago

      You're very welcome! I love that tip as well :)

  • @tensianne
    @tensianne 4 years ago

    Thank you for all those great videos!

  • @Analyse_Us_Consulting
    @Analyse_Us_Consulting 7 years ago

    Awesome - gods work! Clear, to the point, and very very useful.

    • @dataschool
      @dataschool  7 years ago

      Glad it was helpful to you! :)

  • @muslumyildiz5694
    @muslumyildiz5694 3 years ago

    Thank you so much. You are a really wonderful great instructor..

  • @arjunpukale3310
    @arjunpukale3310 5 years ago

    Should we apply feature scaling to categorical columns?

    • @dataschool
      @dataschool  4 years ago

      I'm not sure there is a definitive answer to this, sorry!

  • @marklittlewood2418
    @marklittlewood2418 7 years ago +1

    If you can create a video or series on TensorFlow that is not esoteric, then I would be even more impressed than I already am with your video tutorials. Many thanks.

    • @dataschool
      @dataschool  7 years ago

      Thanks for your suggestion!

  • @salamatburj9502
    @salamatburj9502 6 years ago

    Thank you! Last trick helps a lot!

  • @sarmigarmi
    @sarmigarmi 4 years ago

    Awesome! I have a question. Why are we dropping the first column? As in, for example, Embarked_C?

  • @fet1612
    @fet1612 5 years ago

    6:45
    Break the piece of code below into several segments and contemplate the output of each. Ask yourself why it happens and when it doesn't; then following along will start making more sense. Remember, a good data scientist is always thinking and always LEARNING.
    pd.get_dummies(train.Sex)
    pd.get_dummies(train.Sex, prefix='Sex')
    pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]
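
To make the step-by-step exercise concrete, here is a sketch on a four-row toy frame (made-up data, not the Titanic file). It also contrasts the two `iloc` slices, since they are easy to mix up: `[:, :1]` keeps only the first column, while `[:, 1:]` drops it.

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})  # toy data

plain = pd.get_dummies(train.Sex)                   # columns: female, male
prefixed = pd.get_dummies(train.Sex, prefix="Sex")  # columns: Sex_female, Sex_male

first_only = prefixed.iloc[:, :1]     # all rows, only the first column (Sex_female)
without_first = prefixed.iloc[:, 1:]  # all rows, every column after the first (Sex_male)

print(list(first_only.columns), list(without_first.columns))
```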

  • @chronicfantastic
    @chronicfantastic 8 years ago

    Very useful feature I didn't know about. Thanks.

  • @Thelaunius
    @Thelaunius 5 years ago

    Hi. I understand we only need k-1 dummy variables because we can infer the last variable from the rest, but how would that affect certain classifiers like rule-based ones for example? If they don't have that last variable they cannot create rules like "IF Vk = 1 THEN class = 0". I am thinking that they might not be able to infer it because they only use what columns they have.

    • @dataschool
      @dataschool  4 years ago

      I'm not sure how to answer that question, I'm sorry!

  • @twafsimon103
    @twafsimon103 3 years ago

    I am always inspired by your lectures, thanks.

  • @misslindiwelive
    @misslindiwelive 2 years ago

    Once again, my fighter!

  • @yasmin_jsmn
    @yasmin_jsmn 6 years ago

    Thanks for this tutorial, it's clear and to the point. :)

  • @twafsimon103
    @twafsimon103 3 years ago

    If we want all three values of the Embarked feature in a single column, meaning the values 0, 1, and 2 for the individual categories, how could we do it?

    • @dataschool
      @dataschool  3 years ago

      You could use the map method, see this video for an example: ua-cam.com/video/P_q0tkYqvSk/v-deo.html
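
A minimal sketch of the map approach for three categories, on made-up data. The choice of which port gets which integer is arbitrary here, and the column name `Embarked_code` is hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})  # toy data

# Each category is looked up in the dictionary; the codes are an arbitrary choice.
train["Embarked_code"] = train["Embarked"].map({"C": 0, "Q": 1, "S": 2})
print(train)
```

Any value missing from the dictionary (including NaN) comes out as NaN, so it is worth checking for missing values afterwards.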

  • @vinayaknaik540
    @vinayaknaik540 4 years ago

    Hi, I wanted to know which one you prefer: OneHotEncoder from sklearn or the pandas get_dummies function?
    What are the pros and cons of both methods?

    • @dataschool
      @dataschool  4 years ago

      I now recommend OneHotEncoder from scikit-learn if your goal is to prepare your dataset for Machine Learning. I have a whole video explaining exactly how to do this: ua-cam.com/video/irHhDMbw3xo/v-deo.html

  • @Jinsh0
    @Jinsh0 6 years ago +1

    SUPER VIDEO!! Very Useful!

  • @RoshanPadmanabhan
    @RoshanPadmanabhan 8 years ago

    The bonus tip was really cool. Thanks a lot.

  • @JerryBlane
    @JerryBlane 5 years ago +1

    Hi, just wanted to say I love your videos. Can you please do a video on join(), concat(), and merge()?

    • @dataschool
      @dataschool  5 years ago

      Thanks for your suggestion! See here for concat: ua-cam.com/video/15q-is8P_H4/v-deo.html

  • @anandrathi871
    @anandrathi871 4 years ago

    How do you use get_dummies in a data pipeline,
    for example when the test data and train data are not split from the same source?

    • @dataschool
      @dataschool  4 years ago

      For creating dummy variables within a pipeline, I definitely recommend using scikit-learn's OneHotEncoder instead. I have a lesson about that here: ua-cam.com/video/irHhDMbw3xo/v-deo.html
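
A short sketch of why the encoder suits this situation, using made-up data in which the test set contains a category ("X") that never appears in training. The encoder learns its columns from the training data only, and `handle_unknown="ignore"` encodes unseen categories as all zeros. The same fitted object could also be placed inside a scikit-learn Pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})  # toy training data
test = pd.DataFrame({"Embarked": ["C", "X"]})             # "X" is unseen in train

# Learn the categories from the training data only.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["Embarked"]])

# The test data gets the same columns (C, Q, S); the unseen "X" becomes all zeros.
X_test = ohe.transform(test[["Embarked"]]).toarray()
print(ohe.categories_)
print(X_test)
```

Calling get_dummies separately on train and test could instead produce frames with different columns.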

  • @usmanshaikh1115
    @usmanshaikh1115 6 years ago +1

    Thank you wonderful explanation as always!

  • @AhmedKhaliet
    @AhmedKhaliet 2 years ago +1

    Thank you 💞 it's really great ❣️

  • @jaikishank
    @jaikishank 4 years ago

    Great video and simple explanation. Thank you.
    One clarification: if we need to feed the columns into the data frame for modelling, I assume we should not use drop_first=True (since the variable will be lost), or am I assuming wrong?

  • @rashayahya
    @rashayahya 5 years ago

    Can you please explain the difference between join, concat, and append? Thanks.

    • @dataschool
      @dataschool  4 years ago

      I just released a video on that topic! See here: ua-cam.com/video/iYWKfUOtGaw/v-deo.html

  • @MrSaintArmand
    @MrSaintArmand 4 years ago

    This really helped me today! Thanks!!!

  • @dilipgawade9686
    @dilipgawade9686 5 years ago

    Hi Kevin, after using the get_dummies function on a column, should we drop the original column?
    Also, get_dummies adds other columns to the DataFrame, so how should we assign values to the training data (X)?

    • @dataschool
      @dataschool  5 years ago

      Great question, Dilip! There's no one answer as to whether you should drop the original column, but some people recommend it. As to how to include the new columns in X, this video should help: ua-cam.com/video/xvpNA7bC8cs/v-deo.html
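
One way this could look in practice, sketched on a three-row toy frame (made-up values). The names `feature_cols`, `X`, and `y` are illustrative choices and not necessarily what the linked video uses. The original `Sex` column is simply left out of the feature list instead of being dropped.

```python
import pandas as pd

train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})  # toy data

# Build the dummy column and keep only Sex_male.
sex_dummies = pd.get_dummies(train.Sex, prefix="Sex").iloc[:, 1:]

# axis=1 places the new column side by side with the existing ones.
train = pd.concat([train, sex_dummies], axis=1)

# Select the numeric features for the model; the text column "Sex" is not listed.
feature_cols = ["Pclass", "Sex_male"]
X = train[feature_cols]
y = train["Survived"]
```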

  • @fet1612
    @fet1612 5 years ago

    2:05
    The Series map method:
    train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
    train.head(2)
    Dummy encoding with map({'female':0, 'male':1}):
    female ==> 0, male ==> 1

  • @eric3372
    @eric3372 6 years ago +1

    This was an exceptional video! Thank you so much! Sincerely!

    • @dataschool
      @dataschool  6 years ago

      You're very welcome! Glad it was helpful!

  • @ankitgupta6697
    @ankitgupta6697 5 years ago

    Sir, I want to know: what does the get_dummies() function do, and why is it needed?

    • @dataschool
      @dataschool  4 years ago

      That's what the video covers! Hope it's helpful to you.

  • @ferbose
    @ferbose 7 years ago

    Really useful video! Thanks for sharing!

  • @anunitb
    @anunitb 7 years ago

    Last part (Bonus) is awesome.

  • @ROT4C
    @ROT4C 4 years ago

    Suppose I have multiple columns of dummy variables and I simply want a sum of the variables across those columns, how do I do that?
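
A possible way to do this, sketched on made-up data: select the dummy columns of interest and call `sum`. The selection in `dummy_cols` is a hypothetical choice. With `axis=1` the sum runs across the columns of each row; without it, each column is totalled.

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Sex": ["male", "female", "male"]})  # toy data
dummies = pd.get_dummies(df, columns=["Embarked", "Sex"])

dummy_cols = ["Embarked_C", "Sex_female"]  # hypothetical selection of dummy columns

row_totals = dummies[dummy_cols].sum(axis=1)  # one total per row
col_totals = dummies[dummy_cols].sum()        # one total per column

print(row_totals.tolist())
print(col_totals.to_dict())
```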

  • @PankajMishra-rt6hr
    @PankajMishra-rt6hr 8 years ago

    Hey Kevin :) One question: if we use get_dummies, we add more and more columns to our data frame. Is there any way to do this in place? For example, if our series has 'adult', 'kid', and 'senior_citizen', then 'adult' gets replaced by 0, 'kid' by 1, 'senior_citizen' by 2, and so on for different values wherever they occur in the series. Can I map like this? Thanks

    • @PankajMishra-rt6hr
      @PankajMishra-rt6hr 8 years ago

      EDIT: I have found it. For future readers: we can do this using sklearn's preprocessing package.
      STEPS:
      1) Import the class - from sklearn.preprocessing import LabelEncoder
      2) Create an encoder object - le = LabelEncoder()
      3) To convert into numbers - train['Sex'] = le.fit_transform(train['Sex'])
      4) To convert back - train['Sex'] = le.inverse_transform(train['Sex'])
      That's it :)

    • @dataschool
      @dataschool  8 years ago

      Right! LabelEncoder is useful for taking a series of categorical data and converting it into a series of integers representing the categories.
      You can also do this within pandas using factorize: pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.factorize.html
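
A brief sketch of the factorize route, using the category names from the question as made-up data. `pd.factorize` returns the integer codes together with the unique values, numbered in order of first appearance, and the unique values can be used to map the codes back.

```python
import pandas as pd

ages = pd.Series(["adult", "kid", "senior_citizen", "kid", "adult"])  # toy data

# codes: one integer per row; uniques: the distinct values in order of first appearance.
codes, uniques = pd.factorize(ages)

# Map the integer codes back to the original labels.
restored = uniques.take(codes)

print(codes)
print(list(uniques))
```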

  • @8eck
    @8eck 3 years ago

    Very clear and very helpful, thank you very much.
    But I still don't understand why we need to remove a column after making dummy columns.
    Is it like with training and test data?

    • @dataschool
      @dataschool  3 years ago

      Whether or not you need to depends on the circumstances. See this video for more: ua-cam.com/video/NYtwyvyvDEk/v-deo.html

  • @minakshimathpal8698
    @minakshimathpal8698 4 years ago

    I am sorry if my question is silly, but what is the reason behind dropping the first column? When we have just two categorical values, it is understood that if one is 0 then the other will be 1. But if we have three or more categorical values, then one value (specifically its name, like "C" in Embarked) is nowhere to be seen.

    • @dataschool
      @dataschool  4 years ago +1

      If you have 3 possible values, then you can encode them with 2 columns: 00, 01, and 10. If you have 4 possible values, then you can encode them with 3 columns: 000, 001, 010, and 100. Does that help?

    • @minakshimathpal8698
      @minakshimathpal8698 4 years ago

      @@dataschool OK... tell me whether what I understood is correct: for three possible values, 00 means two of the three values are absent, so it is understood that the third one is present...

    • @dataschool
      @dataschool  4 years ago

      Exactly!
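
The encoding discussed in this thread can be seen directly with `drop_first=True`, here on a made-up three-row frame with one row per port. The dropped category (C) is the row where both remaining columns are 0.

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["C", "Q", "S"]})  # toy data, one row per category

full = pd.get_dummies(train.Embarked, prefix="Embarked")
reduced = pd.get_dummies(train.Embarked, prefix="Embarked", drop_first=True)

print(list(full.columns))     # three columns, one per category
print(reduced.astype(int))    # two columns: C is 0/0, Q is 1/0, S is 0/1
```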

  • @pranky28
    @pranky28 8 years ago

    Superb explanation!! Helped a lot!!

  • @watheusbr
    @watheusbr 2 years ago +1

    So helpful, thanks a lot!

  • @myworld123321
    @myworld123321 6 years ago

    Great explanation. Thank you.

  • @akashdhar4499
    @akashdhar4499 5 years ago

    Thanks for the bonus tip, so grateful

  • @sabinadhikari2643
    @sabinadhikari2643 3 years ago

    Which encoder should we use if the column has more than 100 categorical values?

    • @dataschool
      @dataschool  3 years ago

      That's a complex question, but you can always try one-hot encoding or ordinal encoding, regardless of the number of levels.

  • @jasmineejoseph84
    @jasmineejoseph84 8 years ago

    Wowzers!! This is so Cool. I am learning a ton from you. Thank you!!

    • @dataschool
      @dataschool  8 years ago

      Awesome! You're very welcome :)

  • @jordanhensiek3882
    @jordanhensiek3882 4 years ago

    2:12 Creating a new column and mapping values to that column.

  • @the.texnik
    @the.texnik 3 years ago

    I have a question. 🙋‍♂️
    There are two variables, with dozens of observations, that we converted to dummy variables. If we delete one dummy column for each and then delete the original variables, how does the model understand at training time which deleted column belongs to which variable?
    E.g.:
    Sex_Female and Embarked_C have been deleted.
    Then a new observation comes in for prediction: Sex_Male: 0, Embarked_Q: 0 Embarked_Q: 0, 1, 0. Is it Sex_Female 0 or is it Embarked_C 0?
    How does the machine know which variable is Sex_Female and which is Embarked_C? (There is no ordering, because the real variables have been deleted.)
    P.S. If you do not understand the question, I'm sorry for my bad English.

  • @vipul5340
    @vipul5340 4 years ago

    Can we assign numbers from 0,1,2,3... to a categorical variable rather than making n-1 extra columns?

    • @dataschool
      @dataschool  4 years ago

      Yes, but you should generally only do that if the categories have a natural ordering.
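
When the categories do have a natural ordering, one option is an ordered pandas Categorical, sketched here on made-up size labels. The order list is an assumption supplied by the user, since pandas cannot infer it from the strings.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])  # toy data

# The order must be stated explicitly; the codes follow the position in this list.
order = ["small", "medium", "large"]
codes = pd.Categorical(sizes, categories=order, ordered=True).codes

print(codes.tolist())
```

Values not listed in `order` would get the code -1, which is worth checking for before modelling.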