THANK YOU for this video with clear audio! I have been searching all over for a reference example for handling simple regressions with mice(), and so many of the videos out there sound like they were recorded via laptop mics while standing right under an air conditioner. Clear and helpful, thank you again!
Thank you so much for making this video. Your explanation and coding are so simple and clear that they are easy to understand, and very helpful for the analysis in my dissertation, where I used the Simulacrum dataset.
This is a great video! thanks for going over the details with such clarity.
Thanks so much!
Thanks for this! It is crystal clear up to pooling. However, I have 3 questions.
1. How can we get a final dataset with pooled results? The combine function gives a dataset with 10 or 20 cycles - do we need to get one final pooled dataset?
2. If we have more than one variable with missing data, do we need to do the regression model for each of these?
3. Do we need to upload the full dataset with other non-missing variables for the MICE process?
1. With multiple imputation there is no pooled dataset. The results are pooled, not the datasets.
2. During imputation more than one variable can be imputed.
3. If you want to use other variables to help with imputation then you have to upload them.
@@RegorzStatistik Thanks very much for your prompt reply.
1. Does that mean we can select one of the 5 (if m = 5) datasets with imputed values for the final analysis? Am I right?
2. What is the aim of 'pooling the results'?
Is it to decide whether our assumptions are correct? (MNAR or MAR)
3. What if the pooled results contain statistically significant estimates?
4. Can we use Random forest for this?
Many thanks
@@malithapatabendige6541
1.-3.
No.
MI has 3 steps:
Step 1: Imputing m datasets
Step 2: Running your analysis in each of your datasets - you don't choose one dataset but you use all of them. So you get m different regression results.
Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts.
I recommend reading an introductory journal article about MI to get a theoretical understanding of the procedure.
I don't know if MI works with random forests.
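For illustration, a minimal sketch of these three steps with the mice package in R (mydata, y, x1, x2 and m = 5 are placeholders, not the exact code from the video):

library(mice)

# Step 1: impute m = 5 completed datasets
imp <- mice(mydata, m = 5, seed = 123, printFlag = FALSE)

# Step 2: run the same regression in each of the 5 completed datasets
fit <- with(imp, lm(y ~ x1 + x2))

# Step 3: pool the 5 regression results into one result (Rubin's rules)
summary(pool(fit))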
@@RegorzStatistik Thanks. These 3 steps are clear. But nobody has mentioned how to 'interpret' the pooled results and how to get the 'final' imputed data for the analysis of the original research. Basically, once it is pooled, which imputed dataset is to be selected out of the m sets?
"Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts" - next step has not been mentioned anywhere. It is strange what are we supposed to do with the pooled result and where can we get one single dataset with imputed data to 'start' the original analysis.
@@RegorzStatistik I think I have to compare the pooled estimates, p-values, F-statistic, etc., with each of the m data sets and take the BEST GUESS of the imputed data set out of it. Thanks.
Thank you for this video. Is it possible to get the F-statistic for the pooled model? And is there a way to get standardized coefficients as well?
I don't know the answer to those two questions. (For the second question there could be one very complicated possible solution: Taking all imputed samples, standardizing all predictors and the criterion variable in each imputed sample, and then using those standardized values for the regression. I guess that would give you standardized regression results - but I am not 100% certain that this would be correct).
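A rough, untested sketch of that idea with mice (y, x1, x2 are placeholder names; each completed dataset is standardized separately before refitting and pooling):

library(mice)

imp <- mice(mydata, m = 5, seed = 123, printFlag = FALSE)

# all completed datasets (plus the original data) in long format
long <- complete(imp, action = "long", include = TRUE)

# z-standardize criterion and predictors within each completed dataset
long.std <- do.call(rbind, lapply(split(long, long$.imp), function(d) {
  d$y  <- as.numeric(scale(d$y))
  d$x1 <- as.numeric(scale(d$x1))
  d$x2 <- as.numeric(scale(d$x2))
  d
}))

# convert back to a mids object, refit, and pool
imp.std <- as.mids(long.std)
fit.std <- with(imp.std, lm(y ~ x1 + x2))
summary(pool(fit.std))   # coefficients are now on a standardized scale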
Thank you, this is very informative. Could you point me to a source or clarify 1. how the regression is meant to be set up if more than 1 item/variable is missing and you want to impute? Is the dependent variable in the regression model the only variable that gets imputed? 2. How do you obtain a table that combines imputed data and original data? Thank you!!
1. I don't have a source available. But the MI procedure is the same whether 1 item is missing or more (in my example there are rows with more than 1 item missing - so the dependent variable is not the only variable that gets imputed).
2. Only by combining those tables by hand (e.g. with tidyverse). However, that rarely makes sense because you don't have one imputed dataset! In my example you have 50 imputed datasets, so combining those 50 datasets with the original dataset would lead to something quite large and difficult to interpret.
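If someone wants such a combined table anyway, mice can return all imputed datasets stacked on top of the original data (a minimal sketch; imp stands for the mids object returned by mice()):

library(mice)

# long format: .imp = 0 is the original data (with NAs),
# .imp = 1 ... m are the imputed datasets
all.data <- complete(imp, action = "long", include = TRUE)
head(all.data)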
Thank you very much for the video! Could you please explain how to save the completed file?
In my code example the dataframe with the completed data is called imp.datasets. You can save that as you would any other dataframe in R, e.g. with the write.csv() function.
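For example (imp.datasets is the name from the video's code; the file name is just a placeholder):

write.csv(imp.datasets, "imputed_datasets.csv", row.names = FALSE)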
Thank you very much, I have a question: why do you do the pooling on the imputed-values model instead of the complete dataset? Wouldn't it be better to also have the information from the non-imputed data in the model before pooling, so you have better data for modelling and after pooling?
Pooling is the 3rd step, after running the model in all imputed datasets (2nd step), and "imputed datasets" does not mean that they only contain the cases with missing values - those are completed datasets. You can see that at 0:10:09 in the video: based on the df, the regression result comes from a regression with all cases.
Super interesting video. Do you have any videos or tips on how we can get the pooled results of MLR after MI using SPSS? I try to do it, but for the important values I get either no pooled values or many missings in the pooled values, so I can't report them properly.
Unfortunately, I don't know how to do it in SPSS.
Thanks a lot for getting back to me so quickly! I will try it out with R. Is there something extra one must do if I am importing an already imputed data file from SPSS before I run the regression and pooled regression code there? @@RegorzStatistik
@@shadens98 I only know how to do imputation completely in R, unfortunately.
Thank you for this video! If I want to impute missing values for only 1 categorical variable in a large dataset, what should I do?
The key question is which other variables to include in order to impute the categorical variable. You should at least include all variables you are going to use in your regression model.
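A minimal sketch of how that could look with mice, assuming the categorical variable is a factor called group, the other variables are complete, and all other names are placeholders:

library(mice)

meth <- make.method(mydata)
meth[] <- ""                 # impute nothing by default
meth["group"] <- "polyreg"   # polytomous regression for a factor with more than 2 levels
                             # (use "logreg" for a dichotomous factor)

imp <- mice(mydata, method = meth, m = 5, seed = 123, printFlag = FALSE)
fit <- with(imp, lm(y ~ x1 + group))   # analysis model as usual
summary(pool(fit))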
Thank you very much for this clear and helpful tutorial!
Interestingly, my imputed datasets consisted of fewer rows per variable than I expected (9 to be exact). Do you have any idea what happened and how to get R to impute all missingness? Thank you in advance :).
PS: I checked whether the number of imputations (m) or iterations made a difference. It did not, and neither did the seed or a change of methods.
Based on that information I don't know why that happened.
Hello, thank you for this video but I get this error and I could not figure out how to solve it:
> imp.data
This looks to me that for some of the models the regression did not converge. However, I am somewhat astonished about "glm.fit" - I would expect that message in, e.g., a logistic regression, not in a linear regression.
@@RegorzStatistik I used logreg as the imputation method for my variables as they are dichotomous. I suspect that is the reason.
@@666dazai That could be the case - I am not sure whether that package works with logistic regression or not (haven't tried it yet).
@@RegorzStatistik Alright, thank you for your answer!
Is this the same approach that you'd use for multiple imputation in logistic regression, or just linear regression?
I haven't used it for logistic regression, yet, so I don't know whether the pooling function of mice works for that as well.
@@RegorzStatistik Good to know, thanks for the response!
What if I want to impute variables before using them in PCA? Regressions may not work. Kindly suggest how to handle that.
Maybe you could look into the package missMDA. There seems to be a function you can use for imputing a PCA (but I haven't used it yet).
search.r-project.org/CRAN/refmans/missMDA/html/MIPCA.html
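A rough, untested sketch based on the linked missMDA documentation (mydata is a placeholder for a data frame of continuous variables):

library(missMDA)

ncp <- estim_ncpPCA(mydata)$ncp                  # estimate the number of PCA dimensions
res.mi <- MIPCA(mydata, ncp = ncp, nboot = 100)  # multiple imputation for a PCA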
How about auxiliary variables? Are they not needed here?
I think in this case age is an auxiliary variable since it is not used in the regression model (but during imputation).
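As a small sketch of that idea (variable names are placeholders; age is given to mice() but left out of the analysis model):

library(mice)

imp <- mice(dat[, c("y", "x1", "x2", "age")], m = 5, seed = 123)  # age helps the imputation
fit <- with(imp, lm(y ~ x1 + x2))   # age is not part of the regression model
summary(pool(fit))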