Thanks for watching! 🙌 Let me know if you have any questions about imputation and I'm happy to answer them! 👇
Is imputation a method to replace missing/null values in the dataset?
Awesome bro, very useful technique!
That's correct: "missing value imputation" means that you are replacing missing values (also known as "null values") in your dataset with your best approximation of what values they might be if they weren't missing. Hope that helps!
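For anyone who wants to see the idea in code, here's a minimal sketch using scikit-learn's SimpleImputer (the numbers are made up for illustration):

```python
# A minimal sketch of imputation: replace NaN with a best guess (here, the column mean).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[20.0], [30.0], [np.nan], [40.0]])
print(SimpleImputer(strategy='mean').fit_transform(X))
# The NaN is replaced with 30.0, the mean of the observed values
```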
Have you worked with fancyimpute? It offers even more variety and works great.
Thanks for all the videos, they helped me a lot. However, I searched on Google for a long time but could not find an answer to my problem. I am trying to fill missing values using other columns. I mean, there are some missing values for car body type, but there is information about the body type in another column.
I really love your videos, they are just right, concise and informative, no unnecessary fluff. Thank you so much for these.
Thank you so much for your kind words!
thank you.
love the clarity in your explanation!
Glad it was helpful!
Thank you! I really needed this to understand the concepts, you are an outstanding teacher.
Glad it was helpful!
Awesome video! I was wondering if you can share how the IterativeImputer, which builds a model behind the scenes, handles cases where we have rows with multiple null columns. I understand the logic when only a single column has null values, but I can't wrap my head around what will be assigned as training and test data if we have multiple columns with null values. Looking forward to your response. Thanks
Kevin, you just expanded my column transformation vocabulary. Thank you.
Great to hear!
Awesome video, couldn't be clearer. Thanks
Thank you! 🙏
Thank you, this is exactly what I need. Plus you've explained it very well!
Glad it was helpful!
God bless you man such valuable content you are producing!
Thank you so much! 🙏
Super helpful, as always. Is IterativeImputer the sklearn version of MICE?
Great question! IterativeImputer was inspired by MICE. More details are here: scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
Hey Eric! 😀
Thanks for posting this. For features where there are missing values, should I be passing in the whole df to impute the missing values, or should I only include features that are correlated with the dependent variable I'm trying to impute?
This goes back to the bias-variance tradeoff. If you are adding hundreds of other columns that are likely to be uncorrelated, then I would suggest not doing that, since that will likely overfit the data. You could use the parameter "n_nearest_features", which makes the IterativeImputer use only the top "n" features to predict the missing values. This could be a way to include all the columns from your entire dataframe while still minimizing the increase in variance.
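As a minimal sketch of that parameter (the column names and toy data below are just placeholders, not from the video):

```python
# Hedged sketch: limiting IterativeImputer to the most correlated features
# via n_nearest_features. Data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: required for IterativeImputer
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'Age':    [22, 38, 26, 35, np.nan, 54],
    'Fare':   [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    'SibSp':  [1, 1, 0, 1, 0, 0],
    'Pclass': [3, 1, 3, 1, 3, 1],
})

# Only the 2 features most correlated with the column being imputed are used
imputer = IterativeImputer(n_nearest_features=2, random_state=0)
imputed = imputer.fit_transform(df)
print(pd.DataFrame(imputed, columns=df.columns))
```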
Hi, I tried encoding my categorical variables (a boolean value column) and then running the data through a KNNImputer, but instead of getting 1's and 0's I got values in between, for example 0.4, 0.9, etc. Is there anything I am missing, or is there any way to improve the predictions of this imputer?
That's also my question.
Great question! I don't recommend using KNNImputer in that case. Here's what I recommend instead: (1) If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. (2) If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
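For anyone who wants to see approach (2) in code, here's a minimal sketch, assuming a hypothetical 'Embarked' column and an arbitrary fill_value of 'missing':

```python
# Hedged sketch of approach (2): impute a placeholder category, then one-hot encode.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q', 'S']})

pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder()
)
print(pipe.fit_transform(X).toarray())  # one column per category, including 'missing'
```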
@@dataschool In that case, can we interpret the results (0.4, 0.9) as the probabilities of those values being 0 or 1? Does it make sense to assign a threshold like 0.5, transforming values below it to 0 and values above it to 1?
If that column or feature has discrete values like 0 or 1, better check the semantics of that column; most probably it would fall under categorical.
Fantastic video !! 👏🏼👏🏼👏🏼 … thank you for spreading the knowledge
You're welcome! Glad it was helpful to you!
You are awesome man!! Saved me a lot of time yet again!!!!
That's awesome to hear! 🙌
Very nice video! However, I want to ask: can the KNNImputer be used for object (string) data?
Great question! KNNImputer can't be used for strings (categorical features), but you can use SimpleImputer in that case with strategy='constant' or strategy='most_frequent'. Hope that helps!
Do you have a recommended tool/package for doing imputation with categorical variables?
The simplest way is to use scikit-learn's SimpleImputer.
We can do this for numerical data, but what about categorical data? Can you mention any method for that?
Hello! Thank you very much for your interesting video! Do you know where I can find a video like this one to learn how many neighbors to choose?
Thank you very much
Sure! ua-cam.com/video/6dbrR-WymjI/v-deo.html
Thanks, that was very interesting.
Glad you enjoyed it!
Question: If we impute values of a feature based on other features, wouldn't that increase the likelihood of multicollinearity?
Why don't you have 2M subscribers, man?
You are so kind! 🙏
Are you generally performing your imputation prior to any feature selection, or after? I always see mixed reviews about performing it before versus after.
Great question! Imputation prior to feature selection.
Hey Kevin, quick question: should k in KNN always be odd? If yes, why, and if no, why not? I was asked this in an interview. Thanks for all your content.
Great question! For KNNImputer, the answer is no, because it's just looking at other numeric samples and averaging them (there is never a "tie"). For KNN with binary classification, then yes an odd K is a good idea in order to avoid ties. Hope that helps!
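A tiny sketch (with made-up numbers) of why an even k is fine for KNNImputer: the imputed value is simply the mean of the neighbors, so there is nothing to vote on:

```python
# Hedged sketch: with an even n_neighbors, KNNImputer just averages the
# neighbors' values, so there is nothing to "tie" on. Toy data is made up.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

X = pd.DataFrame({
    'Age':  [20, 21, 40, 41, np.nan],
    'Fare': [10, 11, 50, 51, 10.5],
})

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
# The missing Age is filled with the mean Age of the 2 nearest rows (20 and 21),
# i.e. 20.5 -- an average, not a vote.
```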
Thanks for sharing this!
Why can't the KNN imputer be used for categorical variables? The KNN algorithm works with classification problems.
With KNNImputer, the features have to be numeric in order for it to determine the "nearest" rows. That is separate from using KNN with a classification problem, because in a classification problem, the target is categorical. Hope that helps!
Can the IterativeImputer and KNNImputer work with only numerical values, or can they impute string/alphanumeric values as well?
Great question! Only numerical.
What about the best imputer for categorical variables??
Great question! For categorical features, you can use SimpleImputer with strategy='most_frequent' or strategy='constant'. Which approach is better depends on the particular situation. More details are here: scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
@@dataschool Ok...Thanks :)
@@dataschool Thanks a lot mate you re a LEGEND
Thank you!
You're welcome!
Respected Sir,
Can we do multiple imputation in EViews 9 for panel data?
I'm not sure I understand your question, I'm sorry!
My idea: make line plots of the columns that have null values against other continuous columns, and box plots against discrete ones, and then impute a constant value based on the result of this process. For example, if Pclass is 2, impute the median fare of Pclass 2 wherever Fare is missing and Pclass is 2. Basically similar to the IterativeImputer, only manual work: slow, but maybe better results because of human knowledge about the problem statement. What are your thoughts about this idea?
It's an interesting idea, but a manual process is probably not practical for any large dataset, and it's definitely impractical for cross-validation (since you would have to do the imputation after each split). In general, any "manual" process (in which your intervention is required) does not fit well into the scikit-learn workflow. Hope that helps!
@@dataschool I meant finding the values in an exploratory way and then using them as constants in a SimpleImputer within a pipeline during cross-validation and evaluation. A custom transformer could also be created that does the imputation according to fitted values, for example by finding similar records during transformation and then using their median. But then that's pretty much similar to KNNImputer and IterativeImputer.
Sure, you could probably do that using a custom transformer. Or if you think you could make a strong case for this functionality being available in scikit-learn, then you could propose it as a feature request!
@@dataschool I am not sure whether this is the correct platform, but I have written a library named custom_transformers which contains transformers for handling dates, times, nulls, outliers, and some other commonly needed custom transformers. If you have time, I would greatly appreciate it if you provided your valuable feedback on Kaggle.
This is the notebook demonstrating the use of the library:
www.kaggle.com/susmitpy03/demonstrating-common-transformers
I intend to package it and publish it on PyPI.
Could you please consider making another video on MissForest imputation? (#missingpy)
Thanks for your suggestion!
This imputation returns an array, but the OHE wants a dataframe. How can we solve this if we want to put both inside a pipeline?
Thank you^^
You're welcome 😊
Can this be applied to categorical data? Or only numerical?
No, only numerical. He mentions it at the end of the video.
Does this work for categorical features also?
SimpleImputer works for categorical features, but KNNImputer and IterativeImputer do not.
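If a DataFrame has both types of columns, one possible pattern (just a sketch, with hypothetical column names) is to route each type to a different imputer using a ColumnTransformer:

```python
# Hedged sketch: impute numeric and categorical columns differently via a ColumnTransformer.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    'Age':      [22, np.nan, 26, 35],
    'Fare':     [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', np.nan, 'Q'],
})

ct = make_column_transformer(
    (KNNImputer(n_neighbors=2), ['Age', 'Fare']),           # numeric columns
    (SimpleImputer(strategy='most_frequent'), ['Embarked'])  # categorical column
)
print(ct.fit_transform(df))
```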
No need to standardise the SibSp and Age columns (e.g. between 0 and 1) before the imputation process? Or is that not relevant here?
Great question! That's not relevant here because imputation values are learned separately for each column.
In the example we have only one missing value, so the imputer has an "easy" mission. What if we had more than a few missing values in this column/feature, and we were facing "randomly" missing values across different columns/features? How does the imputer decide which column to impute first, and then, based on that filling, advance to the "next best" column to handle, and so on?
Great question! I don't know the specific logic it uses in terms of order, but I don't believe it tries to use imputed values to impute other values. For example, IterativeImputer is just doing a regression problem, and it works the same way regardless of whether it is predicting the values for one row or multiple rows. If there are missing values in the input features to that regression problem, I assume it just ignores those rows entirely.
I'm not sure if that entirely answers your question... it's not easy for me to say with certainty how it handles all of the possible edge cases because I haven't studied the imputer's code. Hope that helps!
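If it's useful, here's a minimal sketch (with made-up data) showing that IterativeImputer will fill NaNs spread across several columns in one call; it doesn't demonstrate the internal ordering, which I can't speak to:

```python
# Hedged sketch: IterativeImputer handling missing values in more than one column.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0],
    'B': [10.0, np.nan, 30.0, 40.0],
    'C': [100.0, 200.0, 300.0, np.nan],
})

imputer = IterativeImputer(max_iter=10, random_state=0)
print(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))
```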
Kevin, how does it work if let's say B and C are both missing?
I haven't read the source code, and I don't think the documentation explains it in detail, so I can't say... sorry!
What if the first column has a missing value? It is a categorical feature, and it would be better if we used multivariate regression. It has 0 or 1, but if we use KNNImputer or IterativeImputer, it imputes a float value. I think there's a question similar to mine in the comments.
In scikit-learn, multivariate imputation isn't currently an option for categorical data. I recommend using SimpleImputer instead. Hope that helps!
@@dataschool Is there any library that has this option?
What is the effect on the dataset after imputation? Any bias or something? I understand it's a mathematical way to insert a value into a NaN, but I feel there must be some effect from this action. So, when do we need to remove NaNs and when do we need to use imputation?
If the percentage of NaN values in a column is more than 50%, we should eliminate the column; otherwise, we should impute it using univariate methods like SimpleImputer or the multivariate methods mentioned by the author.
@@whaleg3219 @DataSchool I see. What if there are NaNs in the target feature? Can we use imputation, or is removing the NaNs better?
How do you handle missing categorical variables?
If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
I have one doubt: which process comes first, missing value imputation or outlier removal?
Off-hand, I don't have clear advice on that topic. I'm sorry!
In my opinion, if you are using methods like the median, you can impute missing values first; but if you are imputing with methods like the mean (outliers will affect these), it is good to remove outliers first.
What do I use if the values are categorical?
You can use SimpleImputer instead, with strategy='most_frequent' or strategy='constant'. Here's an example: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/27_impute_categorical_features.ipynb Hope that helps!
How can I use KNN to interpolate time series data?
I'm not sure the best way to do this, I'm sorry!
It seems that we should definitely not try it on a large dataset. It takes forever.
I can't think of a realistic example where KNNImputer is better than IterativeImputer; IterativeImputer seems much more robust.
Am I the only one thinking this?
The "no free lunch" theorem says that no one method will be better than other in all cases. In other words, IterativeImputer might work better in most cases, but KNNImputer will surely be better in at least some cases, and the only way to know for sure is to try both!
Thank you for such an amazing video!
I encoded my categorical data numerically and then ran the KNNImputer, but it's giving me an error: TypeError: invalid type promotion.
Any insights into what might be going wrong?
I'm not sure, though I strongly recommend using OneHotEncoder for encoding your categorical features. I explain why in this video: ua-cam.com/video/yv4adDGcFE8/v-deo.html Hope that helps!
Thank you!
You're welcome!