Feature Engineering - Handle Categorical Features with Many Categories (Count/Frequency Encoding)

  • Published 11 Sep 2019
  • In this video we discuss how to handle categorical features using Count or Frequency Encoding (a minimal pandas sketch of the idea appears at the end of this description).
    github.com/krishnaik06/Comple...
    If you like music support my brother's channel
    / @ultralifeproject
    Support me on Patreon: / 2340909
    Buy the best book on Machine Learning and Deep Learning with Python, sklearn and TensorFlow from the link below
    amazon url:
    www.amazon.in/Hands-Machine-L...
    You can buy my book on Finance with Machine Learning and Deep Learning from the URL below
    amazon url: www.amazon.in/Hands-Python-Fi...
    Connect with me here:
    Twitter: / krishnaik06
    Facebook: / krishnaik06
    instagram: / krishnaik06
    Subscribe to my unboxing channel
    / @krishnaikhindi
    Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning!
    Deep Learning Playlist: • Tutorial 1- Introducti...
    Data Science Projects playlist: • Generative Adversarial...
    NLP playlist: • Natural Language Proce...
    Statistics Playlist: • Population vs Sample i...
    Feature Engineering playlist: • Feature Engineering in...
    Computer Vision playlist: • OpenCV Installation | ...
    Data Science Interview Question playlist: • Complete Life Cycle of...
    🙏🙏🙏🙏🙏🙏🙏🙏
    YOU JUST NEED TO DO
    3 THINGS to support my channel
    LIKE
    SHARE
    &
    SUBSCRIBE
    TO MY YOUTUBE CHANNEL
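
A minimal pandas sketch of the count/frequency encoding idea discussed in the video (the DataFrame, column name, and values below are illustrative, not taken from the video's notebook):

```python
import pandas as pd

# Illustrative data -- a single high-cardinality categorical column.
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Delhi", "Mumbai"]
})

# 1. Count how often each category occurs.
counts = df["City"].value_counts().to_dict()   # {'Delhi': 3, 'Mumbai': 2, 'Chennai': 1}

# 2. Replace each category with its count, or with its relative frequency.
df["City_count"] = df["City"].map(counts)
df["City_freq"] = df["City"].map(counts) / len(df)

print(df)
```

Because each category collapses to one number, the feature stays a single numeric column even when it has hundreds of unique values, which is where one-hot encoding would explode the dimensionality.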

COMMENTS • 51

  • @omkarr8282
    @omkarr8282 4 years ago +3

    Thanks for sharing your research ideas... whenever I feel I am comfortable with a concept in DS, every video of yours adds an interesting additional perspective to my knowledge... thanks for taking the effort to share everything that you do😀

  • @muhammadzubairbaloch3224
    @muhammadzubairbaloch3224 4 years ago +3

    Sir, I am very impressed by your mode of teaching. Easily understandable. Sir, please make some high-level computer vision based video lectures. Thanks

  • @shashwatbilgrami1796
    @shashwatbilgrami1796 4 years ago +1

    Thank you for the videos they help a lot.

  • @shriyzfr15
    @shriyzfr15 5 months ago

    Was very useful, Krish. More power to you. Your work is very valuable for Data Science aspirants!!!

  • @hazhir3861
    @hazhir3861 1 year ago

    dude, you are amazing. well done.

  • @srishtikumari6664
    @srishtikumari6664 3 years ago +1

    Thanks for sharing this video!
    I was looking for this type of code for replacing a column with its count column. I am learning feature engineering from the Kaggle courses. There, count_encoder() is used, and I was trying to write code for the steps used in count_encoder(), which I found in this video.
    I have a doubt: the column in that tutorial was numerical, and both columns (col and count_column) were used in predicting the output and calculating the validation score. I calculated the validation score both ways (i.e. including both col & count_col, and then only count_col), and I found the score to be higher in the model using both (col & count_col). Can you please clarify: should we use the original column as well if it's numerical?
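
A hedged sketch of what the comment above describes: building a count column for an existing (here numeric) column and keeping both as candidate model inputs. The data and names are illustrative; whether the original column should be kept is best decided by the validation score, as the commenter did.

```python
import pandas as pd

# Illustrative data -- "rooms" stands in for the numeric column from the Kaggle course.
df = pd.DataFrame({"rooms": [2, 3, 2, 4, 2, 3]})

# Build the count column without dropping the original.
counts = df["rooms"].value_counts().to_dict()
df["rooms_count"] = df["rooms"].map(counts)

# Option A: feed both columns to the model; Option B: keep only the count column.
X_both = df[["rooms", "rooms_count"]]
X_count_only = df[["rooms_count"]]
```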

  • @sandeepnataraj2430
    @sandeepnataraj2430 3 years ago

    Thanks, Krish, for sharing your knowledge. I want to know how to encode multi-select categorical variables in Python. For example, there is a field "Languages Known" which can hold multiple values like English, Kannada, Hindi or French, Telugu, English or Malayalam, Tamil, English, etc., and assume the overall set of unique languages is around 25.
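
One common way to handle such a multi-select field (not covered in this video) is to split the string and multi-hot encode it. A sketch using scikit-learn's MultiLabelBinarizer, with made-up data:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative rows for a multi-select "Languages Known" field.
df = pd.DataFrame({
    "Languages Known": ["English, Kannada, Hindi",
                        "French, Telugu, English",
                        "Malayalam, Tamil, English"]
})

# Split each comma-separated string into a list of labels.
labels = df["Languages Known"].str.split(",").apply(lambda xs: [x.strip() for x in xs])

# Multi-hot encode: one 0/1 column per language (~25 columns for 25 unique languages).
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(labels), columns=mlb.classes_, index=df.index)
print(encoded)
```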

  • @meharajbeguma7493
    @meharajbeguma7493 4 years ago +2

    Hi Sir, can you explain how to do encoding for a target feature with multiple labels for each instance, separated by commas?

  • @traptigupta2628
    @traptigupta2628 3 years ago

    so helpful 😁

  • @balajivarma83
    @balajivarma83 4 years ago +4

    Hi Sir, could you please complete your Kaggle competition code? Waiting for the next part. Thanks so much, you are the best 👍💯

  • @sachinborgave8094
    @sachinborgave8094 4 years ago +1

    Super sir

  • @venkateshsadagopan2505
    @venkateshsadagopan2505 4 years ago +1

    Hi Krish,
    If we have a very large dataset with few features, i.e. the number of features is very small compared to the number of samples, how should I approach the problem, what techniques can I apply to get a reasonable result, and how can I avoid problems like overfitting in this case?
    Thanks in advance

  • @MyName-ur3ir
    @MyName-ur3ir 3 years ago

    Sir, could you please recommend ML algorithms which can be applied effectively with this technique? Pros and cons?

  • @raseluddin
    @raseluddin 3 years ago

    Best tutorial

  • @niveditaparab6772
    @niveditaparab6772 3 years ago

    In this case, how do we interpret the prediction coefficient with respect to the encoded feature?

  • @VinayKumar-hy6ee
    @VinayKumar-hy6ee 4 years ago +4

    Hi, can you upload a video explaining LightGBM and CatBoost with simple examples?

  • @devanshgoel9070
    @devanshgoel9070 2 years ago

    thank you sir

  • @ogbeidenelson8843
    @ogbeidenelson8843 4 years ago +2

    Can we have a video on hypothesis testing and its implementation on a real-world problem with Python?

  • @Elsa.zoneNt
    @Elsa.zoneNt 1 year ago

    How about adding something like 0.000000001 or 1 to the dictionary values for categories with similar counts? Does that work?

  • @shashwatbilgrami1796
    @shashwatbilgrami1796 4 years ago +1

    Bro, can you please perform the feature engineering technique on the flight price problem?

  • @sandipansarkar9211
    @sandipansarkar9211 2 years ago

    finished watching

  • @AshitDebdas
    @AshitDebdas 3 years ago

    Similarly, I have IP_ADDRESS as a feature; how can I encode it?

  • @AshutoshSinha1111
    @AshutoshSinha1111 3 years ago

    Thanks for sharing this informative lesson. I have a bank database and I need to identify the categorical features in a table with the following columns:
    Customer Age,
    Professional Experience,
    Annual Income,
    Family Size,
    CC Avg monthly spend,
    Education (1: Undergrad; 2: Graduate; 3: Advanced/Professional),
    Mortgage Value,
    Personal Loan (Yes/No),
    Securities Account (Yes/No),
    Credit Card (Yes/No). Can you help me with the reasons for your selection?

  • @GauravPadawe
    @GauravPadawe 4 years ago +9

    Do make a video on complete hypothesis testing.

  • @manjunathangadi457
    @manjunathangadi457 4 years ago

    Where did you complete your data science course?

  • @anandacharya9919
    @anandacharya9919 4 years ago +2

    Sir, please explain where we use the binomial distribution, Poisson distribution, normal distribution, ANOVA test, and one- and two-tailed tests in machine learning / data science?

  • @sunnysavita9071
    @sunnysavita9071 4 years ago +1

    Sir, please make a video on roc_auc.

  • @thepresistence5935
    @thepresistence5935 2 years ago

    singam kadhal vandhalle kalla redum thannale 😁😂, ennakum romba pudikum

  • @shaz-z506
    @shaz-z506 4 years ago +1

    Hi Krish,
    It's a good video; probably not everyone knows about this, and I hadn't encountered this technique either. But I'm not sure how a distance-based algorithm will work with and interpret a categorical feature encoded with this technique. Please let me know if there is a specific set of algorithms to be used when performing regression or classification, or whether this technique is applicable to all algorithms.

    • @vivekpuurkayastha1580
      @vivekpuurkayastha1580 4 years ago

      It is not algorithm dependent. Another advantage is that it does not need feature scaling.

    • @nivedithabaskaran1669
      @nivedithabaskaran1669 1 year ago +1

      @@vivekpuurkayastha1580 Whether we scale the numeric columns or not depends on what ML algorithm we choose, right? Can you explain why feature scaling is not needed for this?

  • @kiranarun1868
    @kiranarun1868 4 years ago +2

    How is this different from LabelEncoding? Even there it counts the unique labels and assigns accordingly; here it counts the frequency and assigns. If both are the same, why use this??
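
They are different mappings: label encoding assigns an arbitrary integer ID to each category, while count/frequency encoding assigns how often the category occurs in the data. A small illustrative contrast (made-up data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Label encoding: an arbitrary integer per category (alphabetical here: blue=0, green=1, red=2).
print(LabelEncoder().fit_transform(s))

# Count encoding: each category replaced by its number of occurrences (red=3, blue=2, green=1).
print(s.map(s.value_counts()))
```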

  • @sreeragsasidharan5192
    @sreeragsasidharan5192 3 years ago

    Sir, then what about categories which have a similar count..???

  • @prabhatgupta9808
    @prabhatgupta9808 3 years ago

    Hello Sir,
    If the categories are not ordinal, and we replace each category with its count in that column, aren't we making it ordinal, since different categories will have different numeric values...?
    Can anyone please help me understand this?

    • @arpeggio7449
      @arpeggio7449 6 months ago

      Yes, this is very similar to ordinal encoding, except with ordinal encoding we just give a sequential value to each unique value. So this is a bit different, and the effects are also different. Using the count is very unpredictable: what if there are two or more values with the same count? Then they will all have the same encoded value, resulting in a loss of information. Also, just like with ordinal encoding, there is an effect of how important each value is; higher numbers often result in higher activations. We already have that with ordinal encoding, but here the effects are even stronger. So I am really not sure if this is a good solution; my guess is it isn't.
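
A tiny illustration of the tie problem described above (made-up data): two different categories with the same count become indistinguishable after encoding.

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat", "dog", "bird"])
counts = s.value_counts().to_dict()   # {'cat': 2, 'dog': 2, 'bird': 1}

# 'cat' and 'dog' both map to 2, so the model can no longer tell them apart.
print(s.map(counts))
```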

  • @brightsides2881
    @brightsides2881 4 years ago +3

    How many hours do data scientists work in a week in service-based companies?

  • @mohammadarif8057
    @mohammadarif8057 3 years ago

    What if two different categories have the same count?

  • @akansha_bhatt28
    @akansha_bhatt28 3 years ago +1

    Can someone help me by explaining the 2nd disadvantage of this algorithm?

    • @Rukshan918
      @Rukshan918 3 years ago +1

      Suppose you have some feature like blood pressure. This can be categorized as Low, Medium, High. In integer encoding, we encode this as 1 (Low), 2 (Medium), 3 (High), so the ML model can understand Low < Medium < High. This is what "weights" means; it is valuable in prediction. In one-hot encoding we just put some meaningless numbers; they have no weights. I think this is what it means. I had the same problem & searched; this is what I understood. Someone correct me if I'm wrong.
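
A small sketch of the two encodings contrasted in the reply above, using the blood-pressure example (data illustrative):

```python
import pandas as pd

s = pd.Series(["Low", "High", "Medium", "Low"])

# Integer / ordinal encoding: the numbers carry an order (Low < Medium < High).
ordinal = s.map({"Low": 1, "Medium": 2, "High": 3})

# One-hot encoding: one 0/1 column per category, with no implied order or magnitude.
one_hot = pd.get_dummies(s)

print(ordinal)
print(one_hot)
```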

  • @arjyabasu1311
    @arjyabasu1311 4 years ago

    Sir, are mean encoding and frequency encoding the same?? If not, please do make a video on it.

    • @vivekpuurkayastha1580
      @vivekpuurkayastha1580 4 years ago +2

      Mean Encoding is a feature imputation technique, whereas frequency encoding is an encoding technique used for encoding categorical features, also called response coding. Use the response-encoding library available through pip: pip install response-encoding.

  • @thealgorithm7633
    @thealgorithm7633 1 year ago

    But it will give high weightage to the higher counts, and the model will overfit.

  • @anilsunny4443
    @anilsunny4443 3 years ago

    I thought we were supposed to consider the top 10 most frequent labels and perform one-hot encoding on them, but this is completely different :/. Somebody pls help, I'm a beginner at this and any lead on this would be greatly appreciated.

    • @TheSecretAgentRayan
      @TheSecretAgentRayan 3 years ago

      They're 2 different methods. You can pick either, or maybe even try both separately.

  • @dipk.mishra
    @dipk.mishra 4 years ago

    Sir, can you share that feature engineering zip file again pls!

  • @vinothkumarselvaraj185
    @vinothkumarselvaraj185 4 years ago

    Friends,
    What if we have a column (say "Location") which consists of more than 1000 categories?? FYI, this column is an independent variable and one of the important parameters for predicting the label. Answer pls....
    Thanks in advance

  • @tikendraw
    @tikendraw 2 years ago

    Sir, since we are just assigning count numbers to the categorical values, and it may even lead to a problematic situation if the counts are the same, why don't we use label encoding on each column? Yes, it may not be ordinal data, but it does a better job than the method you are talking about; the goal is to assign numbers to strings.
    -- I may be wrong, but according to the info you provided, what I said should work too, and it is easier.
    CORRECT ME IF I AM WRONG @KRISH NAIK