Impute missing values using KNNImputer or IterativeImputer

  • Published Nov 2, 2024

COMMENTS • 108

  • @dataschool
    @dataschool  4 years ago +2

    Thanks for watching! 🙌 Let me know if you have any questions about imputation and I'm happy to answer them! 👇

    • @Liftsandeats
      @Liftsandeats 4 years ago

      Is imputation a method to replace missing/null values in the dataset?

    • @hakunamatata-qu7ft
      @hakunamatata-qu7ft 4 years ago +1

      Awesome bro, very useful technique

    • @dataschool
      @dataschool  4 years ago

      That's correct: "missing value imputation" means that you are replacing missing values (also known as "null values") in your dataset with your best approximation of what values they might be if they weren't missing. Hope that helps!

    • @Steegwolf
      @Steegwolf 3 years ago

      Have you worked with fancyimpute? Offers even more variety and works great

    • @yunusemreylmaz3642
      @yunusemreylmaz3642 3 years ago

      Thanks for all the videos, they helped me a lot. However, I searched on Google for a long time but could not find anything about my problem. I am trying to fill missing values using other columns. I mean, there are some missing values for a car's body type, but there is information about the body type in another column.

  • @levon9
    @levon9 3 years ago +5

    I really love your videos, they are just right, concise and informative, no unnecessary fluff. Thank you so much for these.

    • @dataschool
      @dataschool  3 years ago

      Thank you so much for your kind words!

  • @sachin-b8c4m
    @sachin-b8c4m 1 year ago +2

    Thank you.
    Love the clarity in your explanation!

  • @Matt-me2yh
    @Matt-me2yh 2 months ago

    Thank you! I really needed this to understand the concepts, you are an outstanding teacher.

  • @fobaogunkeye3551
    @fobaogunkeye3551 9 months ago +1

    Awesome video! I was wondering if you can share how the process works behind the scenes for cases where we have rows with multiple columns that are null, with respect to the iterative imputer that builds a model behind the scenes. I understand the logic when we only have a single column with null values but can't wrap my head around what will be assigned as training and test data if we have multiple columns with null values. Looking forward to your response. Thanks

  • @lovejazzbass
    @lovejazzbass 4 years ago +1

    Kevin, you just expanded my column transformation vocabulary. Thank you.

  • @ilducedimas
    @ilducedimas 2 years ago +1

    Awesome video, couldn't be clearer. Thanks

  • @dogs4ever1000
    @dogs4ever1000 1 year ago +1

    Thank you, this is exactly what I need. Plus you've explained it very well!

  • @atiqrehman8435
    @atiqrehman8435 10 months ago

    God bless you man such valuable content you are producing!

    • @dataschool
      @dataschool  10 months ago

      Thank you so much! 🙏

  • @ericsims3368
    @ericsims3368 3 years ago +3

    Super helpful, as always. Is IterativeImputer the sklearn version of MICE?

    • @dataschool
      @dataschool  3 years ago

      Great question! IterativeImputer was inspired by MICE. More details are here: scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation (a code sketch of this idea follows at the end of this thread)

    • @SixSigmaData
      @SixSigmaData 2 years ago

      Hey Eric! 😀
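
For readers wondering what MICE-style behavior looks like with IterativeImputer, here is a minimal sketch (the data and the number of runs are made up for illustration, not taken from the video or the linked docs): running the imputer several times with sample_posterior=True yields multiple plausible completed datasets.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0],
              [4.0, np.nan],
              [np.nan, 6.0],
              [8.0, 8.0]])

# Each run draws imputed values from a posterior predictive distribution,
# so different random seeds give different plausible completed datasets.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(imputations[0])
```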

  • @dizetoot
    @dizetoot 2 years ago +4

    Thanks for posting this. For features where there are missing values, should I be passing in the whole df to impute the missing values, or should I only include features that are correlated with the dependent variable I'm trying to impute?

    • @GabeNicholson
      @GabeNicholson 2 years ago +2

      This goes back to the bias-variance tradeoff. If you are adding hundreds of other columns that are likely to be uncorrelated, I would suggest not doing that, since it will likely overfit the data. You could use the parameter "n_nearest_features", which makes the IterativeImputer only use the top "n" features to predict the missing values. This could be a way to add all the columns from your entire dataframe while still minimizing the increase in variance.
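
A minimal sketch of the n_nearest_features idea described above (the data here is made up for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, 10.0, 5.0],
              [2.0, np.nan, 20.0, 4.0],
              [3.0, 6.0, np.nan, 3.0],
              [4.0, 8.0, 40.0, 2.0]])

# Only the 2 features most correlated with the column being imputed are used
# as predictors, rather than every other column in the dataframe.
imputer = IterativeImputer(n_nearest_features=2, random_state=0)
print(imputer.fit_transform(X))
```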

  • @mooncake4511
    @mooncake4511 3 years ago +8

    Hi, I tried encoding my categorical variables (a boolean value column) and then running the data through a KNNImputer, but instead of getting 1's and 0's I got values in between, for example 0.4, 0.9, etc. Is there anything I am missing, or is there any way to improve the prediction of this imputer?

    • @matrix4776
      @matrix4776 3 years ago +1

      That's also my question.

    • @dataschool
      @dataschool  3 years ago +5

      Great question! I don't recommend using KNNImputer in that case. Here's what I recommend instead: (1) If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. (2) If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps! (A sketch of option 2 follows at the end of this thread.)

    • @koklu01
      @koklu01 3 years ago

      @@dataschool In that case, can we interpret the results (0.4, 0.9) as the probabilities of those values being 0 or 1? Does it make sense to assign a threshold like 0.5, transforming values below it to 0 and above it to 1?

    • @vishalnaik5453
      @vishalnaik5453 8 months ago

      If that column or feature has discrete values like 0 or 1, better check the semantics of that column; most probably it should be treated as categorical.
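
A minimal sketch of option (2) from the reply above, a pipeline of SimpleImputer (strategy='constant') followed by OneHotEncoder (the column name and data are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q']})

# Impute missing categories with a constant label, then one-hot encode,
# so the missing values end up with their own indicator column.
categorical_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(),
)
print(categorical_pipe.fit_transform(df[['Embarked']]).toarray())
```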

  • @mapa5000
    @mapa5000 1 year ago

    Fantastic video !! 👏🏼👏🏼👏🏼 … thank you for spreading the knowledge

    • @dataschool
      @dataschool  1 year ago

      You're welcome! Glad it was helpful to you!

  • @seansantiagox
    @seansantiagox 3 years ago

    You are awesome man!! Saved me a lot of time yet again!!!!

    • @dataschool
      @dataschool  3 years ago

      That's awesome to hear! 🙌

  • @primaezy5834
    @primaezy5834 3 years ago +1

    Very nice video. However, I want to ask: can the KNNImputer be used for object (string) data?

    • @dataschool
      @dataschool  3 years ago +1

      Great question! KNNImputer can't be used for strings (categorical features), but you can use SimpleImputer in that case with strategy='constant' or strategy='most_frequent'. Hope that helps!
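
A minimal sketch of the two SimpleImputer strategies mentioned above, applied to a made-up string column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S']})

# most_frequent fills the NaN with the mode ('S');
# constant fills it with a fixed label of your choosing.
print(SimpleImputer(strategy='most_frequent').fit_transform(df[['Embarked']]))
print(SimpleImputer(strategy='constant', fill_value='missing').fit_transform(df[['Embarked']]))
```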

  • @dariomelconian9502
    @dariomelconian9502 11 months ago

    Do you have a recommended tool/package for doing imputation with categorical variables?

    • @dataschool
      @dataschool  10 months ago +1

      The simplest way is to use scikit-learn's SimpleImputer.

  • @rajnishadhikari9280
    @rajnishadhikari9280 6 months ago

    We can do this for numerical data, but what about categorical data?
    Can you mention any method for that?

  • @evarondeau6595
    @evarondeau6595 1 year ago

    Hello! Thank you very much for your interesting video! Do you know where I can find a video like this one on how to choose the number of neighbors?
    Thank you very much

    • @dataschool
      @dataschool  1 year ago

      Sure! ua-cam.com/video/6dbrR-WymjI/v-deo.html

  • @ling6701
    @ling6701 1 year ago

    Thanks, that was very interesting.

  • @-o-6100
    @-o-6100 2 years ago +1

    Question: If we impute values of a feature based on other features, wouldn't that increase the likelihood of multicollinearity?

  • @riyaz8072
    @riyaz8072 2 years ago +1

    Why don't you have 2M subscribers, man?

  • @dariomelconian9502
    @dariomelconian9502 11 months ago

    Are you generally performing your imputation prior to any feature selection, or after? I always see mixed reviews about performing it before and after.

    • @dataschool
      @dataschool  10 months ago

      Great question! Imputation prior to feature selection.

  • @hardikvegad3508
    @hardikvegad3508 2 years ago

    Hey Kevin, quick question... should k in KNN always be odd? If yes, why, and if no, why not? I was asked this in an interview... Thanks for all your content.

    • @dataschool
      @dataschool  2 years ago +1

      Great question! For KNNImputer, the answer is no, because it's just looking at other numeric samples and averaging them (there is never a "tie"). For KNN with binary classification, then yes an odd K is a good idea in order to avoid ties. Hope that helps!
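
A tiny sketch illustrating the averaging behavior described above (made-up data):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

# The NaN is replaced with the mean of that feature across the 2 nearest rows
# ([2.0, 20.0] and [4.0, 40.0]), so an even k never produces a "tie".
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```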

  • @rishisingh6111
    @rishisingh6111 2 years ago

    Thanks for sharing this!
    Why can't the KNN imputer be used for categorical variables? The KNN algorithm works with classification problems.

    • @dataschool
      @dataschool  2 years ago +1

      With KNNImputer, the features have to be numeric in order for it to determine the "nearest" rows. That is separate from using KNN with a classification problem, because in a classification problem, the target is categorical. Hope that helps!

  • @soumyabanerjee1424
    @soumyabanerjee1424 3 years ago +1

    Can IterativeImputer and KNNImputer work only with numerical values, or can they also impute string/alphanumeric values?

    • @dataschool
      @dataschool  3 years ago

      Great question! Only numerical.

  • @saravanansenguttuvan319
    @saravanansenguttuvan319 4 years ago +2

    What about the best imputer for categorical variables?

    • @dataschool
      @dataschool  4 years ago +3

      Great question! For categorical features, you can use SimpleImputer with strategy='most_frequent' or strategy='constant'. Which approach is better depends on the particular situation. More details are here: scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

    • @saravanansenguttuvan319
      @saravanansenguttuvan319 4 years ago

      @@dataschool Ok...Thanks :)

    • @lejuge7426
      @lejuge7426 3 years ago +1

      @@dataschool Thanks a lot mate, you're a LEGEND

  • @intelligencejunction
    @intelligencejunction 1 year ago

    Thank you!

  • @AnumGulzar-iy7tl
    @AnumGulzar-iy7tl 1 year ago

    Respected Sir,
    Can we do multiple imputation in EViews 9 for panel data?

    • @dataschool
      @dataschool  1 year ago +1

      I'm not sure I understand your question, I'm sorry!

  • @susmitvengurlekar
    @susmitvengurlekar 4 years ago +1

    My idea: make a line plot of the columns which have null values against other continuous columns, and a box plot for discrete ones, and then impute a constant value according to the result of this process. For example, if Pclass is 2, you impute the median fare of Pclass 2 wherever Fare is missing and Pclass is 2. Basically similar to IterativeImputer, only manual work: slow, but maybe better results because of human knowledge about the problem statement. What are your thoughts about this idea? (A sketch of this idea follows at the end of this thread.)

    • @dataschool
      @dataschool  4 years ago +1

      It's an interesting idea, but a manual process is probably not practical for any large dataset, and it's definitely impractical for cross-validation (since you would have to do the imputation after each split). In general, any "manual" process (in which your intervention is required) does not fit well into the scikit-learn workflow. Hope that helps!

    • @susmitvengurlekar
      @susmitvengurlekar 4 years ago

      @@dataschool I meant finding the values in an exploratory way and then using the values found as constants in a SimpleImputer in a pipeline during cross-validation and evaluation. A custom transformer can also be created which does the imputation according to fitted values, e.g. during transformation, find similar records and then use the median. But then that's pretty much the same as KNNImputer and IterativeImputer.

    • @dataschool
      @dataschool  4 years ago +1

      Sure, you could probably do that using a custom transformer. Or if you think you could make a strong case for this functionality being available in scikit-learn, then you could propose it as a feature request!

    • @susmitvengurlekar
      @susmitvengurlekar 4 years ago

      @@dataschool I am not sure whether this is the correct platform, but I have written a library named custom_transformers which contains transformers for handling dates, times, nulls, outliers, and some other commonly needed custom transformers. If you have time, I would greatly appreciate it if you provided your valuable feedback on Kaggle.
      This is the notebook demonstrating the use of the library:
      www.kaggle.com/susmitpy03/demonstrating-common-transformers
      I intend to package it and publish it on PyPI.
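
A rough pandas sketch of the group-wise median idea from this thread (Titanic-style column names and made-up values; this is the commenter's manual approach, not a scikit-learn imputer):

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 2, 2, 3, 3],
    'Fare':   [80.0, 20.0, None, 8.0, None],
})

# Fill each missing Fare with the median Fare of the matching Pclass group.
group_medians = df.groupby('Pclass')['Fare'].transform('median')
df['Fare'] = df['Fare'].fillna(group_medians)
print(df)
```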

  • @SUGATORAY
    @SUGATORAY 4 years ago

    Could you please consider making another video on MissForest imputation? (#missingpy)

    • @dataschool
      @dataschool  3 years ago

      Thanks for your suggestion!

  • @joxa6119
    @joxa6119 2 years ago

    This imputation returns an array, but the OHE wants a dataframe. How can we solve this if we want to put both inside a pipeline?

  • @jongcheulkim7284
    @jongcheulkim7284 2 years ago

    Thank you^^

  • @isfantauhid
    @isfantauhid 1 year ago

    Can this be applied to categorical data, or only numerical?

    • @Tazy50
      @Tazy50 11 months ago +1

      No, only numerical. He mentions it at the end of the video.

  • @akshatrailaddha5900
    @akshatrailaddha5900 1 year ago

    Does this work for categorical features also?

    • @dataschool
      @dataschool  10 months ago

      SimpleImputer works for categorical features, but KNNImputer and IterativeImputer do not.

  • @gisleberge4363
    @gisleberge4363 1 year ago

    No need to standardise the SibSp and Age columns (e.g. between 0 and 1) before the imputation process? Or is that not relevant here?

    • @dataschool
      @dataschool  10 months ago +1

      Great question! That's not relevant here because imputation values are learned separately for each column.

  • @RA-sv3bv
    @RA-sv3bv 3 years ago

    In the example we have only 1 missing value, so the imputer has an "easy" mission. What if we had more than a few missing values in this column/feature, and we were facing "randomly" missing values across different columns/features? How does the imputer decide how to fill them: is one column imputed first, and then, based on that filling, does it advance to the "next best" column and fill in its missing values... and so on?

    • @dataschool
      @dataschool  3 years ago

      Great question! I don't know the specific logic it uses in terms of order, but I don't believe it tries to use imputed values to impute other values. For example, IterativeImputer is just doing a regression problem, and it works the same way regardless of whether it is predicting the values for one row or multiple rows. If there are missing values in the input features to that regression problem, I assume it just ignores those rows entirely.
      I'm not sure if that entirely answers your question... it's not easy for me to say with certainty how it handles all of the possible edge cases because I haven't studied the imputer's code. Hope that helps!

  • @rongshiu2
    @rongshiu2 3 years ago

    Kevin, how does it work if let's say B and C are both missing?

    • @dataschool
      @dataschool  3 years ago

      I haven't read the source code, and I don't think the documentation explains it in detail, so I can't say... sorry!

  • @Kenneth_Kwon
    @Kenneth_Kwon 3 years ago

    What if the first column has a missing value?
    It is a categorical feature, and it would be better if we used multivariate regression.
    It has 0 or 1, but if we use KNNImputer or IterativeImputer, it imputes a float value. I think there's the same question as mine in the comments.

    • @dataschool
      @dataschool  3 years ago

      In scikit-learn, multivariate imputation isn't currently an option for categorical data. I recommend using SimpleImputer instead. Hope that helps!

    • @aronpollner
      @aronpollner 1 year ago

      @@dataschool Is there any library that has this option?

  • @joxa6119
    @joxa6119 2 years ago

    What is the effect on the dataset after imputation? Any bias or something? I understand it's a mathematical way to insert a value into a NaN, but I feel there must be some effect from this action. So, when do we need to remove NaNs and when do we need to use imputation?

    • @whaleg3219
      @whaleg3219 2 years ago +1

      If the percentage of NaN values in a column is more than 50%, we should eliminate the column; otherwise we should impute it using univariate methods like SimpleImputer or the multivariate methods mentioned by the author. (A sketch of this rule of thumb follows at the end of this thread.)

    • @joxa6119
      @joxa6119 2 years ago

      @@whaleg3219 @DataSchool I see. What if there are NaNs in the target feature? Can we use imputation, or is removal of the NaNs better?
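
A rough sketch of the rule of thumb from the reply above (the 50% threshold and the data are illustrative, not a recommendation from the video):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'a_few_missing':  [1.0, np.nan, 3.0, 4.0],
})

# Drop columns that are more than 50% NaN, then impute whatever remains.
df = df.loc[:, df.isna().mean() <= 0.5]
print(SimpleImputer(strategy='median').fit_transform(df))
```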

  • @matrix4776
    @matrix4776 3 years ago +1

    How to handle missing categorical variables?

    • @dataschool
      @dataschool  3 years ago

      If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!

  • @shreyasb.s3819
    @shreyasb.s3819 3 years ago

    I have one doubt: which process comes first, missing value imputation or outlier removal?

    • @dataschool
      @dataschool  3 years ago

      Off-hand, I don't have clear advice on that topic. I'm sorry!

    • @hemantdhoundiyal1327
      @hemantdhoundiyal1327 3 years ago

      In my opinion, if you are using methods like the median, you can impute missing values first, but if you are imputing with methods like the mean (outliers will affect these), it is better to remove outliers first.

  • @ashwinkrishnan4435
    @ashwinkrishnan4435 3 years ago

    What do I use if the values are categorical?

    • @dataschool
      @dataschool  3 years ago

      You can use SimpleImputer instead, with strategy='most_frequent' or strategy='constant'. Here's an example: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/27_impute_categorical_features.ipynb Hope that helps!

  • @alazkakumu
    @alazkakumu 3 years ago

    How to use KNN to interpolate time series data?

    • @dataschool
      @dataschool  3 years ago

      I'm not sure the best way to do this, I'm sorry!

  • @whaleg3219
    @whaleg3219 2 years ago

    It seems that we should definitely not try it on a large dataset. It takes forever.

  • @WheatleyOS
    @WheatleyOS 3 years ago

    I can't think of a realistic example where KNNImputer is better than IterativeImputer; IterativeImputer seems much more robust.
    Am I the only one thinking this?

    • @dataschool
      @dataschool  3 years ago

      The "no free lunch" theorem says that no one method will be better than the others in all cases. In other words, IterativeImputer might work better in most cases, but KNNImputer will surely be better in at least some cases, and the only way to know for sure is to try both!

  • @aniket1152
    @aniket1152 3 years ago +1

    Thank you for such an amazing video!
    I encoded my categorical data as numbers and then ran the KNNImputer, but it's giving me an error: TypeError: invalid type promotion.
    Any insights into what might be going wrong?

    • @dataschool
      @dataschool  3 years ago +1

      I'm not sure, though I strongly recommend using OneHotEncoder for encoding your categorical features. I explain why in this video: ua-cam.com/video/yv4adDGcFE8/v-deo.html Hope that helps!

  • @sergiucasian3085
    @sergiucasian3085 3 years ago

    Thank you!