Thank you, I really enjoyed the code, but is it possible to use it when we have missing data in both the features and the labels (multilabel) at the same time?
Sure. You can impute missing values in the whole dataset, including the labels. But if you have training data with some values missing in the labels - the best bet is to drop those rows because imputing the labels and then treating these examples as ground truth is not the best practice.
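A minimal sketch of that advice, assuming a pandas DataFrame with a label column named `target` (the column names and the median fill are placeholders for illustration, not the library's method): rows with a missing label are dropped, and only the features are imputed.

```python
import numpy as np
import pandas as pd

# Toy data; "target" is a hypothetical label column name.
df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "feature_b": [10.0, 20.0, np.nan, 40.0, 50.0],
    "target":    [0.0, 1.0, np.nan, 1.0, 0.0],
})

# 1) Drop rows whose label is missing instead of inventing a ground truth.
train = df.dropna(subset=["target"]).copy()

# 2) Impute only the feature columns. The median is just a stand-in here;
#    any imputer could be plugged in at this step.
feature_cols = ["feature_a", "feature_b"]
train[feature_cols] = train[feature_cols].fillna(train[feature_cols].median())

print(train)
```

For a multilabel problem, the same call should work with a list of all label columns passed to `subset`.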
I'm learning Data Science, and most tutorials just use the mean value. This didn't make any sense to me. I was wondering how on earth their model works in the real world with all these wrong values that have been used during training. Now I see what pros do.
Yeah, the naive (mean) approach works in a purely technical sense. It’s used to fill in the blanks so that models which can’t handle NaN are able to train. But the volume of incorrectly filled missing values will directly affect the model’s generalization.
Danil, thank you for sharing, interesting library. One idea: it would be great if next time we could compare something like:
1) mean imputation
2) dropping
3) ML
and then fit the same model on each version of the data and predict, so that at the end we can compare which imputation gives the lowest RMSE
I've done such comparisons many times. It depends very much on the data, but on average ML-based missing value imputation yields better results.
@lifecrunch Yes, agreed. That's why I'm writing: to show your viewers that your idea works better than simple imputation. You are giving them gold, so it would be better if you gave a comparison at the end.
Agree, this would be a great illustration of the concept.
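A rough sketch of what such a comparison could look like, on synthetic data and with NumPy only. Everything here is an assumption for illustration: the data-generating process, the 30% missing rate, and a plain least-squares regression standing in for the ML imputer (it is not verstack.NaNImputer or LightGBM). As noted above, the ranking depends heavily on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # x2 is strongly related to x1, so it can be predicted from it.
    x1 = rng.normal(size=n)
    x2 = 2.0 * x1 + rng.normal(scale=0.3, size=n)
    y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)
    return x1, x2, y

def fit_ols(X, y):
    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

x1, x2, y = make_data(2000)            # training data
x1_te, x2_te, y_te = make_data(1000)   # complete hold-out data
X_test = np.column_stack([x1_te, x2_te])

# Hide about 30% of x2 in the training data, completely at random.
missing = rng.random(len(x2)) < 0.3
obs = ~missing

# 1) Mean imputation.
x2_mean = np.where(missing, x2[obs].mean(), x2)

# 3) Model-based imputation: predict x2 from x1 using the observed rows.
imp_coef = fit_ols(x1[obs].reshape(-1, 1), x2[obs])
x2_model = np.where(missing, predict(imp_coef, x1.reshape(-1, 1)), x2)

train_sets = {
    "mean": (np.column_stack([x1, x2_mean]), y),
    "drop": (np.column_stack([x1[obs], x2[obs]]), y[obs]),  # 2) dropping
    "model": (np.column_stack([x1, x2_model]), y),
}

# Fit the same downstream model on each version, score on the hold-out set.
results = {}
for name, (X_tr, y_tr) in train_sets.items():
    coef = fit_ols(X_tr, y_tr)
    results[name] = rmse(y_te, predict(coef, X_test))
    print(f"{name:>5}: RMSE = {results[name]:.3f}")
```

In this toy setup the values are missing completely at random, so dropping rows costs little apart from lost training data; with other missingness patterns the picture may well change.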
Informative
Hi,
First of all, your video provides very useful information, and I want to thank you for that. I have a question I would like to ask you.
I am analyzing air pollution in a city in my country. For this purpose, I have created a dataset using air pollution data and meteorological data. I then organized these data into hourly intervals. However, I encountered a problem. My dataset contains null values. These null values appear consecutively in some parts of the dataset. For example, in the first 3000 rows, there are approximately 2500 null values for the NO2, NOX, and NO air pollutants, but in the remaining part of the dataset, there are very few null values. In addition, there are rows where data for all air pollutants are missing, but these rows cover a short period consecutively. I believe this might be due to workers turning off the devices after working hours on certain days. I have previously trained a few models to fill in these missing values, but I did not achieve good results. I would like to ask for your guidance. In these two cases, should I fill in the missing data or exclude them from the dataset? What would be the most accurate method to complete these missing values?
In the first case (a lot of consecutive missing values at the top), I would just drop them.
As for those NaNs in the middle, since your data is a time series, I would use something like a rolling window or nearest-neighbor values to fill in the blank spots.
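Both suggestions can be sketched with pandas, assuming an hourly series with a DatetimeIndex. The values, the `NO2` name, the gap limit and the window size are all made up for illustration. Note that `first_valid_index` only trims a fully empty leading block; for a mostly empty start like the one described, a manually chosen cutoff timestamp is probably more appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings: an empty block at the top, short gaps later.
idx = pd.date_range("2024-01-01", periods=12, freq=pd.Timedelta(hours=1))
no2 = pd.Series(
    [np.nan, np.nan, np.nan, np.nan, 20.0, 22.0,
     np.nan, 26.0, 28.0, np.nan, np.nan, 34.0],
    index=idx,
    name="NO2",
)

# 1) Drop the leading block: keep everything from the first valid reading on.
trimmed = no2.loc[no2.first_valid_index():]

# 2a) Fill short gaps from neighboring values, weighted by time distance.
#     limit=3 caps how many consecutive NaNs get filled, so long outages
#     are not silently invented.
filled = trimmed.interpolate(method="time", limit=3)

# 2b) Alternative: fill each gap with a centered rolling mean of its neighbors.
rolling_mean = trimmed.rolling(window=5, center=True, min_periods=1).mean()
rolling_fill = trimmed.fillna(rolling_mean)

print(pd.DataFrame({"raw": trimmed, "interpolated": filled, "rolling": rolling_fill}))
```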
Very helpful, thanks. But is it required to do hyperparameter tuning of the LightGBM models?
For the purpose of missing values imputation - not necessary. Tuning can give a subtle accuracy improvement and it’s justified for an actual prediction model, but I wouldn’t do it for a data processing step.
Absolutely love this library!
Hi there, this is an awesome approach for imputation. How would you go about validating it, though? It would be helpful to demonstrate that it's more accurate than methods like the simple or iterative imputer.
I have benchmarked this approach against the iterative imputer along with all the statistical methods. Every time, verstack.NaNImputer gave better results, especially compared to the statistical methods. And there's really no magic: a sophisticated model like LightGBM is a gold standard when it comes to tabular data.
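One common way to validate an imputer, sketched here under assumptions (this is not the benchmark mentioned above): take complete data, hide some known values, impute them, and measure the error against the hidden truth. The synthetic data is made up, and a simple linear fit stands in for the model-based imputer; in a real check you would plug verstack.NaNImputer or any other imputer into the same loop.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Complete synthetic data: feature_y depends on feature_x plus noise.
feature_x = rng.normal(15.0, 8.0, n)
feature_y = 80.0 - 1.5 * feature_x + rng.normal(0.0, 5.0, n)
true_values = feature_y.copy()

# Hide about 20% of the known values on purpose.
mask = rng.random(n) < 0.2
observed = ~mask
y_with_gaps = np.where(mask, np.nan, feature_y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Candidate 1: mean of the observed values.
mean_fill = np.full(mask.sum(), np.nanmean(y_with_gaps))

# Candidate 2: a linear fit on the observed rows (stand-in for an ML imputer).
slope, intercept = np.polyfit(feature_x[observed], y_with_gaps[observed], deg=1)
model_fill = slope * feature_x[mask] + intercept

# Score each candidate only on the cells that were hidden.
scores = {
    "mean": rmse(true_values[mask], mean_fill),
    "regression": rmse(true_values[mask], model_fill),
}
for name, score in scores.items():
    print(f"{name:>10}: RMSE on hidden values = {score:.2f}")
```

Repeating this with several random masks and missing rates should give a more stable picture than a single run.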
Is it possible to get a copy of the code to study, sir? Thanks in advance 👌👍
Unfortunately, I didn't save the code from this video... You can code along, the script is not very complicated.
@lifecrunch 👍
lol, I am working on a sort of analysis automation tool for my college project, and this is exactly what I was looking for. Initially I was thinking about going with IterativeImputer or KNNImputer. Is your NaNImputer better than them? If that's the case, then you are a fucking genius
IterativeImputer is a similar ML-based approach, while KNNImputer is more on the statistical side but is also quite good. verstack.NaNImputer uses LightGBM under the hood, which is considered to be a more powerful ML algorithm. My guess is that it performs better than the rest in most cases.
@lifecrunch Hey, I have a question for you: can you make a video or something about ways to detect and handle outliers in the training data, just like you did with missing values? That's a huge favor to ask, but please consider it.
Also, yess xd. Your NaNImputer module performs wayyy better than anything else in the community as of now
Absolute mad lad
😎
Nice Work man
Thanks 🔥
Thank you!
Welcome!
Great, but I am not the right audience. Too fast.
You’ll get there…