An advanced missing-value imputation technique to supercharge your training data.

  • Published 19 Jan 2025

COMMENTS • 30

  • @FatemehBoobord
    @FatemehBoobord 27 days ago

    Thank you, I really enjoy the code, but is it possible to use it when we simultaneously have missing data in the features and the labels (multilabel)?

    • @lifecrunch
      @lifecrunch  20 days ago

      Sure. You can impute missing values in the whole dataset, including the labels. But if your training data has some values missing in the labels, the best bet is to drop those rows: imputing the labels and then treating those examples as ground truth is not best practice.
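
      A minimal sketch of that workflow (pandas plus verstack; the file and column names are illustrative):

        import pandas as pd
        from verstack import NaNImputer

        df = pd.read_csv('train.csv')        # hypothetical training file
        df = df.dropna(subset=['label'])     # drop rows whose label is missing
        imputer = NaNImputer()               # ML-based imputation for the remaining feature NaNs
        df = imputer.impute(df)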

  • @nawaz_haider
    @nawaz_haider 1 year ago +2

    I'm learning data science, and most tutorials just use the mean value. That never made sense to me: I kept wondering how on earth their models work in the real world with all those wrong values used during training. Now I see what the pros do.

    • @lifecrunch
      @lifecrunch  1 year ago +1

      Yeah, the naive (mean) approach just works technically. It's used to fill in the blanks so that models which can't handle NaNs can train at all. But the volume of incorrectly filled missing values will directly affect the model's generalization.
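
      For reference, the naive approach is a one-liner with scikit-learn's SimpleImputer (toy data for illustration):

        import numpy as np
        from sklearn.impute import SimpleImputer

        X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
        imputer = SimpleImputer(strategy='mean')   # every NaN becomes its column mean
        X_filled = imputer.fit_transform(X)        # [[1, 2], [4, 3], [7, 2.5]]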

  • @akmalmir8531
    @akmalmir8531 1 year ago

    Danil, thank you for sharing this interesting library. One idea: it would be great if next time we could compare:
    1) mean imputation
    2) dropping
    3) ML
    and then fit and predict any model on the data, so at the end we can compare which imputation gives the minimum RMSE.

    • @lifecrunch
      @lifecrunch  1 year ago

      I've done such a comparison many times. It depends very much on the data, but on average the ML-based missing-value imputation yields better results.
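
      A rough sketch of that kind of comparison (scikit-learn plus verstack; the dataset, target name, and model are placeholders):

        import pandas as pd
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_squared_error
        from sklearn.model_selection import train_test_split
        from verstack import NaNImputer

        def downstream_rmse(df, target='y'):
            # fit a generic model on the processed data and return test RMSE
            X, y = df.drop(columns=target), df[target]
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
            model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
            return mean_squared_error(y_te, model.predict(X_te)) ** 0.5

        df = pd.read_csv('data_with_nans.csv')   # hypothetical dataset
        results = {
            'drop': downstream_rmse(df.dropna()),
            'mean': downstream_rmse(df.fillna(df.mean(numeric_only=True))),
            'ml':   downstream_rmse(NaNImputer().impute(df)),
        }
        print(results)   # the lowest RMSE wins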

    • @akmalmir8531
      @akmalmir8531 1 year ago

      @@lifecrunch Yes, agreed. That's why I'm writing: to show your viewers that your idea works better than simple imputation. You're giving them gold; it would be better to include the comparison at the end.

    • @lifecrunch
      @lifecrunch  1 year ago

      Agree, this would be a great illustration of the concept.

  • @akshu7832
    @akshu7832 1 month ago

    Informative

  • @mkaya4677
    @mkaya4677 4 months ago

    Hi,
    First of all, your video provides very useful information, and I want to thank you for that. I have a question I would like to ask you.
    I am analyzing air pollution in a city in my country. For this purpose, I created a dataset from air pollution data and meteorological data, organized into hourly intervals. However, I ran into a problem: my dataset contains null values, and in some parts they appear consecutively. For example, in the first 3000 rows there are approximately 2500 null values for the NO2, NOX, and NO pollutants, while the rest of the dataset has very few. In addition, there are short consecutive stretches of rows where data for all pollutants are missing. I believe this might be because workers turn off the devices after working hours on certain days.

    I have previously trained a few models to fill in these missing values, but I did not achieve good results. I would like to ask for your guidance: in these two cases, should I fill in the missing data or exclude it from the dataset? And what would be the most accurate method to complete the missing values?

    • @lifecrunch
      @lifecrunch  3 months ago

      In the first case (a lot of consecutive missing values at the top) I would just drop them.
      As for the NaNs in the middle, since your data is a time series, I would use something like a rolling window or the nearest neighboring values to fill in the blank spots.
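
      In pandas, both options could look something like this (column and file names are illustrative):

        import pandas as pd

        df = pd.read_csv('air_quality.csv', parse_dates=['timestamp'], index_col='timestamp')
        df = df.iloc[3000:]   # drop the mostly-empty leading block

        # option 1: time-based interpolation between the nearest known neighbors
        df['NO2'] = df['NO2'].interpolate(method='time')

        # option 2: fill each gap with a centered rolling-window mean of nearby hours
        rolling_mean = df['NOX'].rolling(window=24, min_periods=1, center=True).mean()
        df['NOX'] = df['NOX'].fillna(rolling_mean)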

  • @anmolchhetri3033
    @anmolchhetri3033 2 months ago +1

    Very helpful, thanks. But is it required to do hyperparameter tuning of the LightGBM models?

    • @lifecrunch
      @lifecrunch  2 months ago

      For the purpose of missing-value imputation, it's not necessary. Tuning can give a subtle accuracy improvement, and it's justified for an actual prediction model, but I wouldn't do it for a data-processing step.

  • @soccerdadsg
    @soccerdadsg 1 year ago

    Absolutely love this library!

  • @tnwu4350
    @tnwu4350 9 months ago

    Hi there, this is an awesome approach to imputation. How would you go about validating it, though? It would be helpful to demonstrate that it's more accurate than methods like SimpleImputer or IterativeImputer.

    • @lifecrunch
      @lifecrunch  9 months ago

      I have benchmarked this approach against IterativeImputer along with all the statistical methods. Every time, verstack.NaNImputer gave better results, especially compared to the statistical methods. And there's really no magic: a sophisticated model like LightGBM is the gold standard when it comes to tabular data.
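
      One way to validate this yourself: mask cells whose true values you know, impute, and score against the held-out truth (a sketch; the dataset and column name are placeholders, and it assumes the imputer preserves the row index):

        import numpy as np
        import pandas as pd
        from verstack import NaNImputer

        df = pd.read_csv('complete_data.csv')   # hypothetical dataset with no NaNs
        rng = np.random.default_rng(0)

        col = 'feature_1'                        # illustrative numeric column
        mask = rng.random(len(df)) < 0.2         # hide 20% of the known values
        truth = df.loc[mask, col].copy()

        df_holes = df.copy()
        df_holes.loc[mask, col] = np.nan
        df_imputed = NaNImputer().impute(df_holes)

        rmse = ((df_imputed.loc[mask, col] - truth) ** 2).mean() ** 0.5
        print(f'imputation RMSE on masked cells: {rmse:.4f}')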

  • @yolomc2
    @yolomc2 10 months ago

    Is it possible to get a copy of the code to study, sir? Thanks in advance 👌👍

    • @lifecrunch
      @lifecrunch  9 months ago +1

      Unfortunately, I didn't save the code from this video... You can code along; the script is not very complicated.

    • @yolomc2
      @yolomc2 9 months ago

      @@lifecrunch 👍

  • @mubashirshaikh
    @mubashirshaikh 1 month ago

    lol, I'm working on creating a sort of analysis-automation tool for my college project, and this is exactly what I was looking for. Initially I was thinking about going with IterativeImputer or KNNImputer. Is your NaNImputer better than them? If that's the case, then you're a genius.

    • @lifecrunch
      @lifecrunch  20 days ago +1

      IterativeImputer is a similar ML-based approach, while KNNImputer is more on the statistical side but is also quite good. verstack.NaNImputer uses LightGBM under the hood, which is considered the more powerful ML algorithm, so my guess is that it performs better than the rest in most cases.
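
      All three can be swapped in with almost identical code (a sketch; assumes an all-numeric DataFrame, since the scikit-learn imputers require one):

        import pandas as pd
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer, KNNImputer
        from verstack import NaNImputer

        df = pd.read_csv('data_with_nans.csv')   # hypothetical numeric dataset

        # statistical-leaning: average the k most similar rows
        knn = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df), columns=df.columns)

        # ML-based: round-robin regressions, linear models by default
        iterative = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df), columns=df.columns)

        # ML-based: a LightGBM model per column with missing values
        ml = NaNImputer().impute(df)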

    • @mubashirshaikh
      @mubashirshaikh 14 days ago

      @@lifecrunch Hey, I have a question for you: can you make a video or something about ways to detect and handle outliers in the training data, just like you did with missing values? That's a huge favor to ask, but please consider it.
      Also, yes xd. Your NaNImputer module performs way better than anything else in the community as of now.

  • @prestonryan3734
    @prestonryan3734 2 months ago

    Absolute mad lad

  • @likhithp9934
    @likhithp9934 8 months ago

    Nice work, man

  • @AlexErdem-lo5rz
    @AlexErdem-lo5rz 8 months ago

    Thank you!

  • @kalyanchatterjee8624
    @kalyanchatterjee8624 6 months ago

    Great, but I am not the right audience. Too fast.

    • @lifecrunch
      @lifecrunch  6 months ago +1

      You’ll get there…