R: Regression With Multiple Imputation (missing data handling)

  • Published 14 Oct 2024
  • How best to treat missing data in linear regression analysis? The current view is that multiple imputation by chained equations (mice) is one of the best approaches to handling missing data in regression. This multiple imputation tutorial shows you how to use the mice package in R to analyze datasets with missing data (MCAR, MAR) in a regression framework.
    Here is a current journal article giving theoretical background and specific recommendations regarding the use of multiple imputation for missing data:
    Austin, P. C., White, I. R., Lee, D. S., & van Buuren, S. (2020). Missing data in clinical research: a tutorial on multiple imputation. Canadian Journal of Cardiology.
    www.sciencedir...
    Companion webpage with the R code:
    www.regorz-stat...
    Tutorial for checking regression assumptions with multiple imputation:
    • Multiple Imputation an...

COMMENTS • 39

  • @nacentdatanerd
    @nacentdatanerd 2 months ago +1

    This is a great video! Thanks for going over the details with such clarity.
    Thanks so much!

  • @Sigourney-Cleaver
    @Sigourney-Cleaver 1 year ago +2

    THANK YOU for this video with clear audio! I have been searching all over for a reference example for handling simple regressions with mice(), and so many of the videos out there sound like they were recorded via laptop mics while standing right under an air conditioner. Clear and helpful, thank you again!

  • @kamarularifinkasim3138
    @kamarularifinkasim3138 1 year ago +1

    Thank you so much for making such a video. Your explanation and coding are simple and clear, which makes it easier to understand and very helpful for the analysis in my dissertation, where I used the Simulacrum dataset.

  • @malithapatabendige6541
    @malithapatabendige6541 1 year ago +2

    Thanks for this! It is crystal clear up to pooling. However, I have 3 questions.
    1. How can we get a final dataset with pooled results? The combine function gives a dataset with 10 or 20 cycles - do we need to get one final pooled dataset?
    2. If we have more than one variable with missing data, do we need to run the regression model for each of them?
    3. Do we need to upload the full dataset with the other, non-missing variables for the MICE process?

    • @RegorzStatistik
      @RegorzStatistik  1 year ago +1

      1. With multiple imputation there is no pooled dataset. The results are pooled, not the datasets.
      2. During imputation more than one variable can be imputed.
      3. If you want to use other variables to help with imputation then you have to upload them.

    • @malithapatabendige6541
      @malithapatabendige6541 1 year ago +1

      @@RegorzStatistik Thanks very much for your prompt reply.
      1. It means we can select one of 5 (if m = 5) datasets with imputed values for the final analysis. Am I right?
      2. What is the aim of 'pooling the results'?
      Is it to decide whether our assumptions are correct? (MNAR or MAR)
      3. What if the pooled results contain statistically significant estimates?
      4. Can we use Random forest for this?
      Many thanks

    • @RegorzStatistik
      @RegorzStatistik  1 year ago +1

      @@malithapatabendige6541
      1.-3.
      No.
      MI has 3 steps:
      Step 1: Imputing m datasets
      Step 2: Running your analysis in each of your datasets - you don't choose one dataset but you use all of them. So you get m different regression results.
      Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts.
      I recommend reading an introductory journal article about MI to get a theoretical understanding of the procedure.
      I don't know if MI works with random forests.
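      For example, a minimal sketch of those three steps with mice (hypothetical data frame my_data and hypothetical variable names, not the data from the video):
        library(mice)
        # Step 1: impute m = 5 completed datasets
        imp <- mice(my_data, m = 5, seed = 123, printFlag = FALSE)
        # Step 2: run the same regression in each of the 5 completed datasets
        fits <- with(imp, lm(outcome ~ pred1 + pred2))
        # Step 3: pool the 5 regression results into one result (Rubin's rules)
        summary(pool(fits))  # pooled estimates, SEs, p-values - this is what you report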

    • @malithapatabendige6541
      @malithapatabendige6541 1 year ago +2

      @@RegorzStatistik Thanks. These 3 steps are clear. But nobody has mentioned how to 'interpret' the pooled results and how to get the 'final' imputed data for the analysis of the original research. Basically, once it is pooled, which imputed dataset is to be selected out of the m sets?
      "Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts" - the next step has not been mentioned anywhere. It is strange: what are we supposed to do with the pooled result, and where can we get one single dataset with imputed data to 'start' the original analysis?

    • @malithapatabendige6541
      @malithapatabendige6541 1 year ago

      @@RegorzStatistik I think I have to compare the pooled estimates, p-values, F-statistic, etc., with each of the m data sets and get the BEST GUESS of the imputed data set out of it. Thanks.

  • @bornaloncar2458
    @bornaloncar2458 11 months ago

    Thank you, this is very informative. Could you point me to a source or clarify 1. how the regression is meant to be set up if more than 1 item/variable is missing and you want to impute? Is the dependent variable in the regression model the only variable that gets imputed? 2. How do you obtain a table that combines imputed data and original data? Thank you!!

    • @RegorzStatistik
      @RegorzStatistik  11 months ago +1

      1. I don't have a source available. But MI works the same whether there is 1 item missing or more (in my example, there are rows with more than 1 item missing - so the dependent variable is not the only variable that gets imputed).
      2. Only by combining those tables by hand (e.g. with tidyverse). However, that rarely makes sense because you don't have one imputed dataset! In my example you have 50 imputed datasets, so combining those 50 datasets with the original dataset would lead to something quite large and difficult to interpret.
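      If you really want one big table, a sketch of an alternative to combining by hand is mice's complete() in long format (assuming the mids object returned by mice() is called imp):
        # original data (.imp == 0) plus all m imputed datasets (.imp == 1 ... m),
        # stacked into one long data frame; .imp and .id identify the copy and the row
        long_data <- complete(imp, action = "long", include = TRUE)
        head(long_data)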

  • @emilypet01
    @emilypet01 1 month ago

    Thank you for this video. Is it possible to get the F-statistic for the pooled model? And is there a way to get standardized coefficients as well?

    • @RegorzStatistik
      @RegorzStatistik  1 month ago

      I don't know the answer to those two questions. (For the second question there could be one very complicated possible solution: taking all imputed samples, standardizing all predictors and the criterion variable in each imputed sample, and then using those standardized values for the regression. I guess that would give you standardized regression results - but I am not 100% certain that this would be correct.)
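      A rough sketch of that idea, with hypothetical variable names and no guarantee that it is statistically sound (imp is the mids object from the imputation step):
        # standardize criterion and predictors within each imputed dataset via scale(),
        # then pool the resulting regression coefficients as usual
        fits_std <- with(imp, lm(scale(outcome) ~ scale(pred1) + scale(pred2)))
        summary(pool(fits_std))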

  • @EHJ599
    @EHJ599 1 year ago

    Thank you very much for this clear and helpful tutorial!
    Interestingly, my imputed datasets consisted of fewer rows per variable than I expected (9 to be exact). Do you have any idea what happened and how to get R to impute all missingness? Thank you in advance :).

    • @EHJ599
      @EHJ599 1 year ago

      P.S. I checked whether the number of imputations (m) or iterations made a difference. It did not, and neither did the seed or a change of methods.

    • @RegorzStatistik
      @RegorzStatistik  1 year ago +1

      Based on that information I don't know why that happened.

  • @shadens98
    @shadens98 8 months ago

    Super interesting video. Do you have any videos or tips on how we can get the pooled results of MLR after MI using SPSS? I try to do it, but for the important values I get either no pooled values or many missings in the pooled values, so I cannot report them properly.

    • @RegorzStatistik
      @RegorzStatistik  8 months ago +1

      Unfortunately, I don't know how to do it in SPSS.

    • @shadens98
      @shadens98 8 months ago

      @@RegorzStatistik Thanks a lot for getting back to me so quickly! I will try it out with R. Is there something extra one must do if I am importing an already imputed data file from SPSS before I run the regression and pooled regression code there?

    • @RegorzStatistik
      @RegorzStatistik  8 months ago

      @@shadens98 I only know how to do imputation completely in R, unfortunately.

  • @DariaKoksal
    @DariaKoksal 1 year ago +1

    Thank you very much for the video! Could you please explain how to save the completed file?

    • @RegorzStatistik
      @RegorzStatistik  1 year ago +1

      In my code example the dataframe with the completed data is called imp.datasets. You can save that as you would any other dataframe in R, e.g. with the write.csv() function.
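      For example (assuming imp.datasets is a plain data frame, as in the code example):
        write.csv(imp.datasets, "imputed_datasets.csv", row.names = FALSE)
        # or save the R object itself, keeping all attributes:
        saveRDS(imp.datasets, "imputed_datasets.rds")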

  • @andreapatrignani2026
    @andreapatrignani2026 7 months ago

    Thank you very much. I have a question: why do you do the pooling on the imputed-values model instead of the complete dataset? Wouldn't it be better to also have information from the non-imputed data in the model before pooling, so you can have better data for modelling and after pooling?

    • @RegorzStatistik
      @RegorzStatistik  7 months ago

      Pooling is the 3rd step, after running the model in all imputed datasets (2nd step). And "imputed datasets" does not mean that they only contain the cases with missing values - those are completed datasets. You can see that at 0:10:09 in the video: the regression result is based on the df of a regression with all cases.
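      You can also check that in R - complete() returns full datasets in which the observed values are kept and only the missing cells are filled in (sketch with a hypothetical mids object imp and original data frame my_data):
        library(mice)
        first_completed <- complete(imp, 1)      # first of the m completed datasets
        nrow(first_completed) == nrow(my_data)   # TRUE: same number of cases as the original data
        anyNA(first_completed)                   # FALSE: no missing values left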

  • @elissamsallem688
    @elissamsallem688 1 year ago

    Thank you for this video! If I want to impute missing values for only 1 categorical variable in a large dataset, what should I do?

    • @RegorzStatistik
      @RegorzStatistik  1 year ago

      The key question is which other variables to include in order to impute the categorical variable. You should at least include all variables you are going to use in your regression model.
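      A sketch of how such a setup can look in mice (hypothetical data frame my_data with a categorical variable cat_var stored as a factor; the dry run with maxit = 0 only creates the default settings so they can be edited):
        library(mice)
        setup <- mice(my_data, maxit = 0)   # dry run, no imputation yet
        meth <- setup$method
        meth[] <- ""                        # do not impute the other variables
        meth["cat_var"] <- "polyreg"        # multinomial imputation ("logreg" if cat_var is binary)
        pred <- setup$predictorMatrix       # by default, all other variables are used as predictors
        imp <- mice(my_data, method = meth, predictorMatrix = pred, m = 5, seed = 123)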

  • @gallinule6213
    @gallinule6213 4 months ago

    Is this the same approach that you'd use for multiple imputation in logistic regression, or just linear regression?

    • @RegorzStatistik
      @RegorzStatistik  4 months ago +1

      I haven't used it for logistic regression yet, so I don't know whether the pooling function of mice works for that as well.
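      If you want to try it, the structure would presumably be the same, just with glm() in step 2 - a sketch with hypothetical names that I cannot vouch for:
        fits_log <- with(imp, glm(binary_outcome ~ pred1 + pred2, family = binomial))
        summary(pool(fits_log))   # pooled coefficients on the log-odds scale, if pooling works for glm fits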

    • @gallinule6213
      @gallinule6213 4 months ago

      @@RegorzStatistik Good to know, thanks for the response!

  • @666dazai
    @666dazai 4 months ago

    Hello, thank you for this video, but I get this error and I could not figure out how to solve it:
    > imp.data

    • @RegorzStatistik
      @RegorzStatistik  4 months ago

      This looks to me like the regression did not converge for some of the models. However, I am somewhat astonished about "glm.fit" - I would expect that message in, e.g., a logistic regression, not in a linear regression.

    • @666dazai
      @666dazai 4 months ago

      @@RegorzStatistik I used logreg as the imputation method for my variables as they are dichotomous. I am suspecting that is the reason

    • @RegorzStatistik
      @RegorzStatistik  4 months ago

      @@666dazai That could be the case - I am not sure whether that package works with logistic regression or not (I haven't tried it yet).

    • @666dazai
      @666dazai 4 months ago

      @@RegorzStatistik Alright, thank you for your answer!

  • @christoph3933
    @christoph3933 10 months ago

    How about auxiliary variables? Are they not needed here?

    • @RegorzStatistik
      @RegorzStatistik  10 months ago +1

      I think in this case age is an auxiliary variable since it is not used in the regression model (but during imputation).
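      In code terms: the auxiliary variable is part of the data frame handed to mice(), but not part of the regression formula (sketch with hypothetical names):
        imp <- mice(my_data, m = 5, seed = 123)         # my_data contains outcome, pred1, pred2 AND age
        fits <- with(imp, lm(outcome ~ pred1 + pred2))  # age informs the imputation but is not in the model
        summary(pool(fits))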

  • @solomonwafula311
    @solomonwafula311 1 year ago

    What if I want to impute variables before using them in PCA? Regressions may not work. Kindly suggest how to handle that.

    • @RegorzStatistik
      @RegorzStatistik  1 year ago

      Maybe you could look into the package missMDA. There seems to be a function you can use for imputing a PCA (but I haven't used it yet).
      search.r-project.org/CRAN/refmans/missMDA/html/MIPCA.html
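      Based on the documentation (I haven't run this myself), a sketch could look like this, assuming my_data contains only the quantitative variables for the PCA:
        library(missMDA)
        ncp <- estim_ncpPCA(my_data)                        # estimate the number of components to use
        res <- MIPCA(my_data, ncp = ncp$ncp, nboot = 100)   # multiple imputation for the PCA
        # res$res.MI should contain the multiply imputed datasets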