Encode categorical features using OneHotEncoder or OrdinalEncoder

  • Published 11 Jan 2025

COMMENTS

  • @dataschool  4 years ago +6

    Thanks for watching! Let me know if you have any questions about OneHotEncoder, OrdinalEncoder, or LabelEncoder! 👇

    • @anshulgandhi1707 4 years ago +2

      Shouldn't one-hot encoding be done differently on the train and test sets, like ohe.fit_transform(x_train) and ohe.transform(x_test)? And if yes, how do we ensure that the number of distinct values in a variable is equal in the train and test sets, so that we don't get more dummy variables in test?

    • @dataschool  4 years ago +1

      Yes, OneHotEncoder should be done differently on train and test, just like you outlined! And because you are using fit only on train (and not on test), it will learn the categories present in train, and apply those same categories to test. In other words, you don't have to ensure that train and test have the same categories.
      If the testing set does have additional categories that were not present in the training set, here's how you can handle that situation: ua-cam.com/video/bA6mYC1a_Eg/v-deo.html
      Does that help?
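The fit-on-train, transform-on-test pattern described above can be sketched as follows. This is a minimal illustration, not the code from the video: the toy `Shape` data is hypothetical, and `handle_unknown='ignore'` is one possible way to deal with a category that appears only in the test set (it may or may not be the approach shown in the linked video).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'oval' appears only in the test set
X_train = pd.DataFrame({'Shape': ['square', 'circle', 'square', 'circle']})
X_test = pd.DataFrame({'Shape': ['circle', 'oval']})

# Assumption: unseen categories should be encoded as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')

# Learn the categories from the training set only
train_encoded = ohe.fit_transform(X_train).toarray()

# Apply the categories learned from train to the test set
test_encoded = ohe.transform(X_test).toarray()

print(ohe.categories_)   # categories learned during fit
print(test_encoded)      # same number of columns as train_encoded
```

Because the encoder is fit only on the training data, the test output always has the same number of columns as the training output, regardless of which categories the test set contains.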

    • @Person_Not_Known 3 years ago

      Thank you for the video... but is there a library available in Python right now that I can use to actually run an ordinal/probit log regression? statsmodels seems to have dropped it and I haven't found another library...

  • @-hedredo8420 9 months ago +1

    Thanks, you do a great job explaining those crucial concepts for ML flow and being concise on this playlist. Great work

    • @dataschool  8 months ago

      Glad it was helpful!

  • @DevMadeEasy 4 years ago +5

    What great content, thanks for sharing it.
    As always something very good.
    I'm your fan.

  • @mathgeek420 4 years ago +8

    How would you encode data that has an ordering, but is cyclical? An example might be a "Time of Day" feature with entries such as 'morning', 'noon', 'evening', 'night'. There is an ordering to this, but it's cyclical, so that there is no lowest or highest value.
    Thanks! I very much enjoy your explanations and videos.

    • @dataschool  4 years ago +6

      Excellent question! By default, I would use OneHotEncoder. However, I might try OrdinalEncoder if I believed that there was an ordered relationship in "Time of Day" that was predictive of the target. For example, if I was predicting crime rate, and I discovered that crime tended to increase as the day goes on, I might encode morning=0, noon=1, evening=2, and night=3. Ultimately, I would choose whichever encoding was producing more accurate results. Hope that helps!
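The two options in this reply can be compared side by side. The sketch below assumes a hypothetical `time_of_day` column; the explicit category order for OrdinalEncoder is the morning=0, noon=1, evening=2, night=3 scheme mentioned above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical feature column
X = pd.DataFrame({'time_of_day': ['noon', 'night', 'morning', 'evening']})

# Option 1 (the default choice): one column per category, no ordering implied
ohe = OneHotEncoder()
onehot = ohe.fit_transform(X).toarray()

# Option 2: a single column with an explicit order,
# so that morning=0, noon=1, evening=2, night=3
oe = OrdinalEncoder(categories=[['morning', 'noon', 'evening', 'night']])
ordinal = oe.fit_transform(X)

print(onehot)
print(ordinal)
```

Which of the two works better is an empirical question, so both encodings would typically be evaluated with the same model and cross-validation setup.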

  • @abdulraheemabdul9667 1 year ago

    Thanks for these educational lectures. Please, I have a question: if I have a pipeline with OneHotEncoder and OrdinalEncoder, with the same data you used for the practice for example, how will the pipeline know which columns to one-hot encode and which to ordinal encode?
    Thanks

  • @nesrinehadjamar2197 2 years ago

    Short and precise!! Thank you so much

  • @jaikishank 4 years ago +1

    Thanks for sharing this useful tip

    • @dataschool  4 years ago

      You're welcome! Glad it was useful to you!

  • @TheWisePhotographer 3 years ago +1

    Amazing video, subscribed! Just to clarify on the plane problem, I'd order [third, second, first] to make first class the highest rank during training, right? Thanks!

    • @dataschool  3 years ago

      Glad you liked the video! Regarding your question, reversing the ordering of the categories for the OrdinalEncoder would indeed reverse the encoding scheme so that third=0, second=1, first=2. However, reversing the encoding would not benefit (or hinder) the ML model, so you are welcome to use either ordering. Hope that helps!
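To make the effect of reversing the order concrete, here is a small sketch. The `Class` column and its values are hypothetical stand-ins for the example in the video; only the `categories` argument differs between the two encoders.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical passenger class column
X = pd.DataFrame({'Class': ['third', 'first', 'second']})

# Original order: first=0, second=1, third=2
oe_forward = OrdinalEncoder(categories=[['first', 'second', 'third']])
forward = oe_forward.fit_transform(X).ravel()

# Reversed order: third=0, second=1, first=2
oe_reversed = OrdinalEncoder(categories=[['third', 'second', 'first']])
reversed_codes = oe_reversed.fit_transform(X).ravel()

print(forward)
print(reversed_codes)
```

The two encodings carry the same information, since one is just the mirror image of the other, which is consistent with the point that the choice of direction does not help or hurt the model.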

  • @nesrinehadjamar2197 2 years ago

    I'm just wondering about something: when do we encode using dummy variables? I mean, when is it necessary?

  • @26-arielnicholascaryndrasd43 3 years ago

    Thanks for the explanation, sir. However, I have several questions. Does LabelEncoder work as well for nominal unordered data? How do we find the mode for nominal data so that it shows the pre-encoded values (not the numbers)? How do we find the mode and median (Q1, Q2, Q3) for ordinal data so that it shows the pre-encoded values (not the numbers)? Thanks

  • @jorgesisco981 3 years ago

    Great video! I have a dataset with categorical and numerical values, and my question is: should I encode just the categorical values and then merge that with the dataset, and then drop the unencoded values?

    • @dataschool  2 years ago

      I recommend using ColumnTransformer and Pipeline instead! More details here: courses.dataschool.io/scikit-learn-tips
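A rough sketch of what the ColumnTransformer and Pipeline approach might look like is shown below. The column names, data, and choice of LogisticRegression are all hypothetical; the point is only that each encoder is paired with an explicit list of columns, and the numeric column passes through untouched, so no manual merging or dropping is needed.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical dataset: two categorical columns and one numeric column
X = pd.DataFrame({
    'Shape': ['square', 'circle', 'oval', 'square', 'circle', 'oval'],
    'Class': ['third', 'first', 'second', 'first', 'third', 'second'],
    'Size': [10, 20, 15, 12, 18, 30],
})
y = [0, 1, 1, 0, 1, 0]

# Each encoder is told explicitly which columns it applies to;
# remainder='passthrough' keeps the numeric column as-is
ct = make_column_transformer(
    (OneHotEncoder(), ['Shape']),
    (OrdinalEncoder(categories=[['first', 'second', 'third']]), ['Class']),
    remainder='passthrough',
)

# Inspect the encoded feature matrix on its own
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)

# Chain the preprocessing with a model (the model choice is an assumption)
pipe = make_pipeline(ct, LogisticRegression())
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Fitting the pipeline fits the encoders and the model together, so calling `pipe.predict` on new data applies the same encoding automatically.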

  • @8eck 3 years ago

    Super good! Thank you!

    • @dataschool  3 years ago

      Glad it's helpful to you! 🙌

  • @apostolosmavropoulos177 3 years ago

    phenomenal presentation as always

  • @sasidharansathiyamoorthy6918 3 years ago +1

    What would be a good way to encode, when there are lots of ordinal features with multiple labels? Is manually defining the categories the only way?

    • @dataschool  3 years ago +1

      Great question! If the features are ordinal, and you want the model to learn the logical ordering of the categories, then yes the only way is to manually define category ordering.
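Manually defining the ordering for several ordinal features at once might look like the sketch below. The column names and category lists are hypothetical; the `categories` argument takes one list per column, in the same order as the columns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with two ordinal features
X = pd.DataFrame({
    'Class': ['third', 'first', 'second'],
    'Size': ['S', 'L', 'M'],
})

# One ordered list per column, matching the column order of X
oe = OrdinalEncoder(categories=[
    ['first', 'second', 'third'],  # logical order for 'Class'
    ['S', 'M', 'L'],               # logical order for 'Size'
])
encoded = oe.fit_transform(X)

print(encoded)
```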

    • @sasidharansathiyamoorthy6918 3 years ago

      @@dataschool Great. Thanks!

  • @krishln7830 4 years ago

    Hi, so I'm trying to work with data that has two entries: Column X: X1,X2,X3,X4,X1 and column Y: Y3,Y4,Y2,Y1,Y1. Is there a way to get a matrix of shape rows = X1,..,X4 and columns = Y1,..,Y4 (all unique) with 1s where a relationship exists and zero otherwise? For example the first row would be X1: Y1(1),Y2(0),Y3(1),Y4(0), similar for X2, X3 and X4 rows? Thanks

    • @dataschool  3 years ago

      I'm so sorry, but I don't quite follow... I hope you were able to figure it out!

  • @getchethanbr86 3 years ago

    Hi, I want to know: for an ordinal column, after the data is split, should I just use transform on the validation data (like the test data), or should both the train and validation data use fit_transform?

    • @dataschool  3 years ago

      I'm sorry, but I would need a lot more information in order to answer your question. I hope you were able to figure it out!

    • @getchethanbr86 3 years ago

      @@dataschool What I meant was: I have separate train and test data sets. I will keep the test set aside, and I want to split the train data into train and validation sets. So, when I run scaling on the train and validation sets, X_train will take fit_transform, but what about X_validation? Will it be fit_transform or just transform, like the test set?

  • @Mayur7Garg 4 years ago +1

    What if I had a column such as a city name and I wanted to give each city a unique ID. Assume the model I am using is the likes of Decision tree or Random Forests. Herein I cannot use Ordinal encoder as there is no specific order and using OneHotEncoder might generate a lot of columns if there are many many unique city names. Can I use Label encoder on that column then?

    • @dataschool  4 years ago

      Great question! With a tree-based model, you should try both OrdinalEncoder and OneHotEncoder, and see which performs better. For OrdinalEncoder, if you don't specify the categories, it will just assign them alphabetically (just like LabelEncoder would).
      With a non-tree-based model, you should still use OneHotEncoder even if it creates lots of columns.
      Hope that helps!
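The default behavior mentioned above, where OrdinalEncoder assigns codes in sorted order when no categories are given, can be checked with a small sketch. The `City` column and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical city column with no meaningful order
cities = pd.DataFrame({'City': ['Paris', 'Austin', 'Tokyo', 'Austin']})

# No categories specified: codes follow the sorted order of the values
oe = OrdinalEncoder()
oe_codes = oe.fit_transform(cities).ravel()

# LabelEncoder expects a 1D input and uses the same sorted order
le = LabelEncoder()
le_codes = le.fit_transform(cities['City'])

print(oe.categories_)
print(oe_codes)
print(le_codes)
```

Note that OrdinalEncoder takes 2D input (features), while LabelEncoder takes 1D input, since it is designed for encoding the target rather than features.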

    • @Mayur7Garg 4 years ago

      @@dataschool Yes. That was helpful. Didn't realize that categories is an optional parameter for Ordinal Encoder.
      Another question though: How do these encoders handle/encode values that were not present when fitting the encoders? For example, if I fit any of these three encoders on a column in the training set which only had categories 'A' and 'B', and use it to transform the same column in the testing set, which has categories 'A', 'B' and 'C', how would 'C' be encoded?

    • @dataschool  4 years ago +1

      For OneHotEncoder, there is a "handle_unknown" parameter that controls what happens when you pass it a previously unseen category during the transform. I'll be covering that during the tip 7 video, which comes out this Tuesday!
      For OrdinalEncoder, there is no similar functionality yet available, but it will be available in the next scikit-learn release (0.24).
      For LabelEncoder, I assume it will error (but I haven't checked).
      Thanks again for the questions!
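For readers on scikit-learn 0.24 or later, the behavior of the three encoders with an unseen category can be sketched as below, using the 'A', 'B', 'C' example from the question. The column name `col` and the choice of -1 as the unknown value are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Train contains only 'A' and 'B'; test also contains the unseen 'C'
train = pd.DataFrame({'col': ['A', 'B', 'A']})
test = pd.DataFrame({'col': ['A', 'B', 'C']})

# OneHotEncoder: handle_unknown='ignore' encodes 'C' as all zeros
ohe = OneHotEncoder(handle_unknown='ignore').fit(train)
ohe_out = ohe.transform(test).toarray()

# OrdinalEncoder (0.24+): map unseen categories to a chosen value
# (-1 here is an arbitrary choice)
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe_out = oe.fit(train).transform(test).ravel()

# LabelEncoder: has no such option and raises on unseen labels
le = LabelEncoder().fit(train['col'])
try:
    le.transform(test['col'])
    label_encoder_failed = False
except ValueError:
    label_encoder_failed = True

print(ohe_out)
print(oe_out)
print(label_encoder_failed)
```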

  • @floyddsouza7190 3 years ago

    Hi, I have a question: do you have to first convert the column to 'categorical' prior to encoding? In some examples I see the column being converted, and in others I do not. I performed a test and the results were the same whether I converted it or not. Could you please shed some light?

    • @dataschool  2 years ago

      Great question! Doesn't make a difference because scikit-learn doesn't currently use that information from pandas.
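The test described in the question can be reproduced with a short sketch. The `Shape` column is hypothetical, and the comparison reflects scikit-learn's behavior as described in the reply, which could change in later releases.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column stored with the default string/object dtype
df_object = pd.DataFrame({'Shape': ['square', 'circle', 'oval', 'circle']})

# The same data converted to the pandas 'category' dtype
df_category = df_object.astype('category')

encoded_object = OneHotEncoder().fit_transform(df_object).toarray()
encoded_category = OneHotEncoder().fit_transform(df_category).toarray()

# Compare the two encoded outputs
same_result = np.array_equal(encoded_object, encoded_category)
print(same_result)
```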

  • @mikhaeldito 4 years ago

    If we use OrdinalEncoder, would the algorithm consider the relationship between predictor and response to be additive?

    • @dataschool  4 years ago

      I'm sorry, I don't know exactly what you mean by "additive"... could you clarify? Thanks!

    • @mikhaeldito 4 years ago

      @@dataschool Hmm maybe not the best choice of word, but for example, the effect of 2 is twice the effect of 1, 3 thrice of 1, and so on.. Am I making any sense?

    • @dataschool  4 years ago +1

      Thanks for clarifying! The answer is that it depends on the type of model. For a linear model, the answer is yes. For a tree-based model, the answer is no. Hope that helps!
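The difference between the two model types can be illustrated with a deliberately non-monotonic toy target. The data below is made up: the ordinal codes 0, 1, 2 map to targets 0, 10, 1, so a constant step per code cannot fit it, while a tree can split between the codes freely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up data: ordinal codes 0, 1, 2 with a non-monotonic target
X = np.array([[0], [1], [2]] * 3)
y = np.array([0.0, 10.0, 1.0] * 3)

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

codes = np.array([[0], [1], [2]])
linear_pred = linear.predict(codes)
tree_pred = tree.predict(codes)

# Linear model: every one-unit increase in the code changes the
# prediction by the same amount (the single coefficient)
print(linear_pred)

# Tree: treats the codes as split points, so each code can get
# its own prediction
print(tree_pred)
```

In the linear model the contribution of the feature is the coefficient times the code, so code 2 contributes exactly twice what code 1 does, which matches the "additive" reading in the question.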

  • @VarunKumar-pz5si 4 years ago +1

    Can you specify differences between Labels and Features....?

    • @dataschool  4 years ago +4

      Sure! Features are the inputs to the model, and are commonly referred to as "X". Label refers to the "class label" or "target value" or "response value", and is commonly referred to as "y". For example, if I was predicting "spam" or "not spam" as a classification problem, then the column that says "spam" or "not spam" for each sample is called the label. Hope that helps!
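In code, the split between features and label for the spam example might look like the sketch below. The feature columns `num_links` and `has_attachment` are invented for illustration; LabelEncoder is used here because it is meant for the label (y), not for features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical spam dataset
df = pd.DataFrame({
    'num_links': [5, 0, 7, 1],
    'has_attachment': [1, 0, 1, 0],
    'label': ['spam', 'not spam', 'spam', 'not spam'],
})

# Features: the model inputs, conventionally called X (2D)
X = df[['num_links', 'has_attachment']]

# Label: the value being predicted, conventionally called y (1D)
y = df['label']

# Encode the string labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(le.classes_)
print(y_encoded)
```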

    • @VarunKumar-pz5si 4 years ago +1

      @@dataschool Thanks Kevin!!

  • @filosofiadetalhista 2 years ago +1

    I do not understand why OneHotEncoder.fit_transform() took X[['Shape']] as argument rather than X['Shape'].
    EDIT: Now I think I understand. Passing a list of columns when subsetting a DataFrame will produce a new DataFrame (rather than a Series) even if the list has length 1.
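The distinction in the EDIT above can be verified directly. In this sketch the `Shape` data is hypothetical; it shows that single brackets give a 1D Series, double brackets give a 2D DataFrame, and OneHotEncoder accepts only the 2D form.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data
X = pd.DataFrame({'Shape': ['square', 'circle', 'oval']})

as_series = X['Shape']     # single brackets: 1D Series
as_frame = X[['Shape']]    # list of columns: 2D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

ohe = OneHotEncoder()

# The 2D DataFrame is accepted
encoded = ohe.fit_transform(as_frame).toarray()

# The 1D Series is rejected, since the encoder expects 2D input
try:
    ohe.fit_transform(as_series)
    series_accepted = True
except ValueError:
    series_accepted = False

print(series_accepted)
```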

  • @gabrielkamkar2110 3 years ago

    Thank you.

  • @mufseeramusthafa2170 4 years ago

    Nice videos