R demo | How to impute missing values with machine learning

  • Published Jan 31, 2025

COMMENTS • 57

  • @mustafa_sakalli
    @mustafa_sakalli 3 years ago +4

    You are a legend! I've spent hours trying to find a proper tutorial on missing data imputation. They were all about mice, applied to a 20-rows-by-5-columns dataset :D Since my dataset is relatively big, the mice package was struggling to compute missing values. But with the help of your small script I was able to get a result in approximately 45 minutes. Thank you again

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  3 years ago

      Great to hear, Mustafa! I am glad it's useful not only for me :)

  • @haythemboucharif7750
    @haythemboucharif7750 2 years ago +1

    Mister, I am French, so I can tell you that we have problems with English, but let me say that you speak really smoothly and very, very well. Thank you very much

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago

      Thanks, Haythem! Glad you liked it. I can recommend the Deep Exploratory Analysis video. It's long but very useful. Cheers

  • @45tanviirahmed82
    @45tanviirahmed82 8 months ago

    This video ends abruptly 🤣 I was so into it that I thought there was a problem.
    Great video! Your playlist on R is becoming an addiction

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  8 months ago

      Awesome! Really happy you like it. I think I did cut it at the end, because the ending turned out to be useless, which killed the retention. So, in more recent videos I try to provide value every second... it doesn't always work, but I think the videos have gotten a bit better since then :) Thank you so much for the feedback!

  • @auliakhoirunnisa9447
    @auliakhoirunnisa9447 6 months ago

    Thank you for your explanation. It really helps me a lot! Your voice is indeed calming and soothing😃
    I will definitely subscribe, Sir!

  • @muhammedhadedy4570
    @muhammedhadedy4570 1 year ago

    You are a true legend. I enjoy every single video of your tutorials.

  • @jameswhitaker4357
    @jameswhitaker4357 1 year ago

    I'm just a mere junior analyst, but I am enamored by cool statistical methods. I have a lot of questions that you answered. While I have a minor mistrust in algos and ML, I have a major intrigue in how accurate and precise imputations can be.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago +1

      Glad it was helpful! The missRanger command can even tell you, for every variable, how good the imputation was via the OOB (out-of-bag) error rate. I don't remember anymore whether I talk about it in the video.
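      A minimal sketch of what the reply describes, assuming the {missRanger} package is installed; `airquality` ships with base R and already contains NAs, and `returnOOB = TRUE` attaches the final per-variable out-of-bag errors as an attribute:

      ```r
      # Sketch, assuming the {missRanger} package is installed.
      # airquality has NAs in the Ozone and Solar.R columns.
      library(missRanger)

      imp <- missRanger(airquality, formula = . ~ .,
                        num.trees = 100, seed = 1, returnOOB = TRUE)

      # Per-variable out-of-bag prediction errors of the final iteration;
      # values close to 0 mean the imputation model predicted well.
      attr(imp, "oob")
      ```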

  • @angelajulienne3122
    @angelajulienne3122 1 year ago

    AMAZIIIINGGGGG !! You're incredible thanks :D

  • @yaoliao3517
    @yaoliao3517 2 years ago

    Really helpful to me. I saw your recommendation on Twitter.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago

      Glad it was helpful! And thanks for the nice comments! They help :)

  • @rayray0313
    @rayray0313 3 years ago

    Excellent stuff. Thanks for making this video.

  • @syhusada1130
    @syhusada1130 2 years ago

    Thank you

  • @mkklindhardt
    @mkklindhardt 6 months ago

    Amazing 👏🏽 thank you

  • @robertasampong
    @robertasampong 2 years ago

    Absolutely excellent! Thank you!

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +1

      Glad you enjoyed it! Check out the later videos. You might like those too. Thanks for the feedback!

  • @TheBaudoing2007
    @TheBaudoing2007 1 year ago

    Thank you! This is great

  • @angezoclanclounon1751
    @angezoclanclounon1751 3 years ago

    Awesome video! Thanks a lot.

  • @ntran04299
    @ntran04299 1 year ago

    Thank you for this great video. May I ask the assumptions that should/must be met before using missRanger to impute data?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago +1

      Yes, you may :) but I am afraid they are just common sense. For example, I never impute the response variable, I don't impute when there are a lot (>20%) of missing values, and I always check the imputation results and accept or reject them depending on the result. For instance, when I impute categories and after imputation only one category was filled up while the others were not (in a case where something like 10% or more needed to be imputed), then I don't accept that. So, no formal assumptions, but your own sanity tests are important here. Hope that helps! Cheers
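      A hedged sketch of the kind of check described above (the names `d_raw`, `d_imp`, and `my_cat` are hypothetical placeholders for your own data frames and categorical column):

      ```r
      # Compare category shares before vs after imputation; if one level
      # swallowed nearly all imputed values, reject the imputation.
      # d_raw = data with NAs, d_imp = imputed data (hypothetical names).
      before <- prop.table(table(d_raw$my_cat))   # shares among observed values
      after  <- prop.table(table(d_imp$my_cat))   # shares after imputation
      round(rbind(before = before, after = after), 3)
      ```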

    • @ntran04299
      @ntran04299 1 year ago

      @@yuzaR-Data-Science I see, thank you sir!

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago

      you are very welcome!

  • @edaemanet26
    @edaemanet26 3 years ago

    Thank you sir this is perfect!

  • @syhusada1130
    @syhusada1130 2 years ago

    I've been coming back to this video. For a dataset with 165,040 rows, missRanger crashed my RStudio. I ended up using imputate_na with mode as the method, since I'm not sure which yvar I should use in the dataset. It produced an "imputation" class object, and I'm not sure what to do with it; can I just insert the result into the dataset?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +2

      I would still recommend using missRanger. I've had the best experience with it. Are all 165k rows and all variables important? If not, reduce the dataset or split it into smaller sets. By the way, it has never crashed my RStudio; it only took a little longer when the dataset was huge. Then ask yourself: do all variables (rows) contribute to a meaningful imputation? E.g. IDs or overly diverse categorical variables don't, but they make missRanger think more for no return. If some variables have too many missing values, like 30%, do you actually want them to be imputed?
      I suggest missRanger over "imputate_na" because you can track the OOB error (which is amazing) and because you create a new dataset, which you can immediately use if the OOB is low:
      d_imputed <- your_data %>%
        missRanger(., formula = . ~ .,
                   num.trees = 1000, verbose = 2, seed = 999, returnOOB = TRUE)

    • @syhusada1130
      @syhusada1130 2 years ago

      @@yuzaR-Data-Science thank you for the extra tips, amazing channel by the way, love it!

  • @muhammadasadkhan9620
    @muhammadasadkhan9620 1 month ago

    Yes, I have seen this video, but I don't know how to generate missing values by different mechanisms (MCAR, MAR and MNAR) or how to test the performance of each imputation method with root mean square error and mean absolute error.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 month ago

      Oh yeah, I see what you mean. In this case I have not looked at MCAR etc. yet, but I'll put it on the list for future videos. Until then, have a look at this and similar articles: cran.r-project.org/web/packages/finalfit/vignettes/missing.html

  • @dle3528
    @dle3528 2 years ago

    This video is awesome. Congrats! Can I use this method before estimating ML models? Should I impute data before or after partitioning the data?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +2

      Thanks a lot! Imputation before partitioning is for sure better, because you have more data for the model to learn from, so the imputation quality will be better. Cheers!

    • @dle3528
      @dle3528 2 years ago

      @@yuzaR-Data-Science thank you so much ! 😃😃

  • @chrisdietrich6400
    @chrisdietrich6400 2 years ago

    Thanks a lot! Super helpful video! I am just wondering at which point in the data management process it would be the best to apply the imputation - I have some categorical items that I use for multiple factor analysis, which I then use for multilevel modelling. I am currently applying the imputation after I created the factors - however my intuition says it might be wiser to impute as a very first step. Do you have an opinion on that? (or some literature in regard to this?)

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +1

      Glad it was helpful! Well, I am not familiar with any hard rules, or even rules of thumb. For me it depends on the imputation goals and common sense. I once did 3 imputations because the dataset needed lots of operations, so, in order not to lose a few points here and there, I did 3 rounds. Another thing is that the categories or factors are supposed to be recognised automatically. So, factorising before imputation makes sense to me. If you have 3 categories, 1, 2 and 3, and ask for imputation of such a "numeric" variable, you might get some odd continuous numbers. However, if you want exactly that - go for it.

  • @vyshnavisanagapalli4314
    @vyshnavisanagapalli4314 11 months ago

    Hi sir, I have gone through this video but I'm not able to get the plot_na_pareto function in RStudio. It's throwing an error: "builtinfunction not found". Can you help me overcome this issue!?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  11 months ago

      Hi, it works perfectly on my PC. Have you installed and loaded the {dlookr} library?
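      For reference, a minimal call (assuming the {dlookr} package is installed and loaded) that produces the plot discussed in this thread:

      ```r
      # Sketch, assuming the {dlookr} package is installed.
      library(dlookr)

      # Pareto chart of missingness: bars = NA count per variable,
      # line = cumulative share of all missing values.
      plot_na_pareto(airquality)
      ```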

    • @vyshnavisanagapalli4314
      @vyshnavisanagapalli4314 11 months ago

      Thank you for replying, Sir. I am getting the plot now; actually there was some problem with my RStudio, which I rectified. And I must say your videos are so informative and easy to understand.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  10 months ago

      that's so nice of you! huge thanks! I am happy my content is useful!

  • @LeviRafal
    @LeviRafal 1 year ago

    Thank you very much for your informative, succinct video! Is {missRanger} package considered the best package for multivariate imputation? Is {missRanger} package better than {mice} or {miceRanger} packages? How did you discover {missRanger} package? I'm sorry if I asked too many questions because I'm new to data imputation and would like to select the best package to impute my data.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago +1

      Sorry for the late reply. I was traveling a lot lately. I believe missRanger is the best, but it's just a belief, not a fact. I think so because it combines predictive mean matching and multiple imputation. It iterates until the OOB stops improving. I did not, however, directly compare the results and usability of the packages. The usability is also important, because there are tons of packages that don't run without some special handling. missRanger does, and does it quickly. Having said this, if you do compare the results of different imputations, I would be grateful to know how it went. Kind regards and thanks for your nice feedback!

    • @LeviRafal
      @LeviRafal 1 year ago

      @@yuzaR-Data-Science No worries. Thank you very much for your answer.😀I think I will probably stick with the {missRanger} package for now, due to the fact that it is easy to implement and the great features you have discussed.😄// I was wondering if you could provide me with quantitative method(s) which could be used to assess the accuracy of the imputation, rather than visualisation?// In your demonstration of the {missRanger} package (5:28), I think it is essential to include the argument `pmm.k` (e.g. pmm.k = 3) to preserve the data structure/format. This is because, when I first ran the code without the `pmm.k` argument, it gave different rounding to my values. I checked the package's vignette, and it confirms that setting the `pmm.k` argument to a positive number is needed so that all imputations done during the process are combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values.🤔 Best regards, Poss.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago +1

      Oh, cool, thanks for pointing out the "pmm.k" option! I actually often forget it. What I usually use to follow up on predictions is "num.trees = 10000, verbose = 2, seed = 1, returnOOB = T", which displays the Out-Of-Bag error for each variable at each iteration. After some iterations, when the OOB stops improving, it stops imputing and you have a final dataset. I try not to accept any OOB above 10% ... but yeah, it depends on the situation. I also usually plot the data, just to see whether some very strange values were predicted ... it has never been the case till now. Of course, the more data you have, the better the predictions. Cheers ;)
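      Putting the two tips from this thread together, a hedged sketch on base R's `airquality` (the argument values are illustrative, not a recommendation):

      ```r
      library(missRanger)

      # pmm.k = 3 adds a predictive-mean-matching step, so every imputed
      # value is drawn from actually observed values (original rounding
      # is preserved); verbose = 2 prints the per-variable OOB error at
      # each iteration, and returnOOB = TRUE keeps the final errors.
      imp <- missRanger(airquality, formula = . ~ ., pmm.k = 3,
                        num.trees = 1000, verbose = 2, seed = 1,
                        returnOOB = TRUE)
      attr(imp, "oob")
      ```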

    • @LeviRafal
      @LeviRafal 1 year ago

      @@yuzaR-Data-Science Oh, I see. Because my data contains over 3,000 rows and 30 variables, I reduced "num.trees" to 100 to shorten the processing time. Consequently, this led to different rounding of the imputed values, so I added the "pmm.k" argument to retain their data structure. Thank you very much for your clarification! :)

  • @chacmool2581
    @chacmool2581 2 years ago +1

    Using 'ggplotly' to make a missing value heatmap interactive is too computationally expensive and slow for anything but very small datasets. Instead, you could try making an interactive heatmap directly using 'd3heatmap'. Much faster. Plus you can control the aesthetics of 'd3heatmap' to a greater degree than the 'vis_miss()' function.
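    A minimal sketch of this suggestion, assuming the {d3heatmap} package is available (it has been archived from CRAN, so it may need to be installed from GitHub); the choice of colors and the `dendrogram = "none"` setting are illustrative:

    ```r
    library(d3heatmap)

    # 0/1 matrix of missingness: 1 = NA, 0 = observed
    miss_mat <- 1 * is.na(airquality)

    # Interactive heatmap; dendrogram = "none" keeps the row order
    d3heatmap(miss_mat, dendrogram = "none",
              colors = c("grey90", "firebrick"))
    ```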

  • @achual1909
    @achual1909 2 years ago

    Can I use this for time-series data?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +1

      If you mean the date format itself (day/month/year : sec/min/hour), I don't think so. But if your timepoints are columns, and you just have some things measured and sometimes missing, then for sure.