I loved that you added those simulation results. That was very interesting and helped my understanding
You are welcome!
Very helpful and simplified explanation. Thanks for the video!
You are welcome!
Great job, but what about missingness that exists in a single column at a rate above 50%? Would deep models like GANs be useful for imputation (in time-series prediction)? Many thanks🙏
I assume GAN refers to some kind of neural network. Imputation works regardless of the amount of missing data, under these three conditions:
1) You are doing multiple imputation and not single imputation so that you can quantify the uncertainty introduced by the imputation process.
2) The imputation model contains all features of your data that are relevant for the analysis.
3) The missingness does not depend on the missing value itself (i.e., data are MAR or MCAR).
I do not really see what neural nets would add over a thoughtfully developed imputation model, but they are likely to increase sample size requirements.
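Condition 1 above, quantifying the uncertainty added by imputation, is typically handled by pooling results across the m imputed datasets with Rubin's rules. A minimal sketch, using made-up per-imputation coefficient estimates and squared standard errors in place of real analysis output:

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Pool point estimates and variances from m imputed datasets
    using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()             # pooled point estimate
    w_bar = variances.mean()             # within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b  # total variance
    return q_bar, np.sqrt(total_var)

# Toy numbers: one coefficient's estimate and squared SE from m = 5 imputations
est = [0.52, 0.48, 0.55, 0.50, 0.47]
se2 = [0.010, 0.012, 0.011, 0.010, 0.013]
beta, se = pool_estimates(est, se2)
```

The pooled standard error exceeds the average within-imputation standard error because the between-imputation variance is added on top; that extra term is exactly the imputation uncertainty that single imputation ignores.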
@@mronkko "Hi again Mikko, I'm tackling a unique challenge with my dataset and believe your insights could greatly help. Could you share any contact info for a brief discussion? Thanks!"
@@mohamadmatinhavaei9859 I take consulting orders through instats.org/expert/mikko--rönkkö-829.
Strong Finnish accent :).. Thank you for the awesome content
I take the comment about my accent as a compliment ;) Funny thing: I used to live in the US, and part of the accent was lost during that time. Even though that was about 20 years ago, I still notice my accent diminishing when I spend a couple of days there. But now that we cannot travel, the accent is as strong as ever!
I found your channel recently and have started to like your teaching approach. I want to ask whether pairwise deletion is possible in regression y = X*beta + e, beta = inv(X'X)X'y. It is possible to calculate a pairwise version of X'X. Would love to hear your thoughts. Thx
In a pairwise X'X you would need to adjust for the sample size of each cell. But in principle you can estimate pairwise covariances of all the variables and then estimate the regression from that covariance matrix. The resulting estimator should be consistent under MCAR, but getting the standard errors right would require adjustments to the complete-data standard error formulas. I have not seen any paper discussing how to do this, and therefore I would not be comfortable using this approach. That said, the fact that I have not read something does not mean it does not exist. I have simply come to the conclusion that because FIML and multiple imputation already exist and I know how to do both, there is little reason for me to learn other approaches to adjusting for missing data in estimation.
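The pairwise-covariance idea described above can be sketched in a few lines. pandas computes each covariance from the pairwise-complete cases by default, and the slopes then come from solving the normal equations on that covariance matrix. Toy MCAR data; note this yields slopes only (no intercept), and, as said above, the usual standard error formulas would not apply:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Knock out ~20% of the values completely at random (MCAR)
df = df.mask(rng.random(df.shape) < 0.2)

# Each entry of cov is computed from the pairwise-complete observations,
# so every cell may rest on a different effective sample size.
cov = df.cov()
cov_xx = cov.loc[["x1", "x2"], ["x1", "x2"]].to_numpy()
cov_xy = cov.loc[["x1", "x2"], "y"].to_numpy()

beta = np.linalg.solve(cov_xx, cov_xy)  # slopes from pairwise covariances
```

Under MCAR the slopes land near the true values (1.0 and 0.5 here); the unresolved problem is attaching correct standard errors to them.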
Hi
It was mentioned that "the imputed data can only be used within the pooling testing and cannot be used for the model testing".
Does this mean the data is only imputed/simulated for the purpose of analysing its reliability? If it cannot be used for model testing, does that mean we still need to use the actual data and delete the missing observations?
Correct me if I'm wrong.
Thank you
I need more context. Can you give me a timestamp from the video?
Good Job 👍👍👍
Thanks!
Hi, thank you for the content. I would like to know how to choose the reference variable; for example, in your case IQ is used as a reference when imputing job performance. I have a lot of variables in my data set, and some of them have a lot of missing values. How can I identify which variable to refer to when I want to impute another one?
Your imputation model needs to use all variables and model all relationships that you have in your main model. In addition, you can use auxiliary variables (I have a video about that). The rule with auxiliary variables is that you should be liberal in including them. However, if your sample size is small you can start to get bias and computational difficulties if you include too many.
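To illustrate why the imputation model must include the analysis variables (here, the outcome) and why the m imputations must differ from each other, here is a simplified stochastic-regression sketch with toy IQ/job-performance data. This is a teaching sketch, not a full multiple-imputation implementation; real applications would use MI software that also draws the imputation-model parameters from their posterior:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
iq = rng.normal(100, 15, n)
perf = 0.05 * iq + rng.normal(0, 1, n)  # job performance depends on IQ
miss = rng.random(n) < 0.3              # 30% of IQ values missing, MCAR

# Imputation model: regress IQ on performance among complete cases.
obs = ~miss
A = np.column_stack([np.ones(obs.sum()), perf[obs]])
coef, *_ = np.linalg.lstsq(A, iq[obs], rcond=None)
resid_sd = np.std(iq[obs] - A @ coef, ddof=2)

# m imputations: predicted value plus random noise, so each differs.
m = 5
slopes = []
for _ in range(m):
    iq_imp = iq.copy()
    pred = coef[0] + coef[1] * perf[miss]
    iq_imp[miss] = pred + rng.normal(0, resid_sd, miss.sum())
    # Analysis model: performance ~ IQ, fit on the completed data
    B = np.column_stack([np.ones(n), iq_imp])
    b, *_ = np.linalg.lstsq(B, perf, rcond=None)
    slopes.append(b[1])

pooled_slope = float(np.mean(slopes))  # point-estimate part of Rubin's rules
```

Because the imputation model includes the outcome and adds residual noise, the pooled slope stays close to the true value of 0.05; imputing with the bare predictions (no noise) or omitting the outcome from the imputation model would bias it.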
Thanks a lot sir
Most welcome
Our teacher is focusing on having us use KNN to impute data. This seems like a biased method, like the traditional methods, but I'm not 100% sure.
What does KNN stand for?
Hi! In what types of research can I use pairwise/listwise deletion?
Deleting observations is never ideal if you consider it only from a statistical perspective. However, simplicity is also a virtue in applied research (for example, you are less likely to make mistakes if you keep things simple), and simple techniques should be used over complex ones when the difference in outcomes is small. Deleting observations is OK if a) your sample size is sufficient after deletion and b) your missing data are MCAR. I would not use pairwise deletion, because using a different sample size for different analyses complicates things, but this depends on how the data are missing.
got this, thank you!
Sir... which book are you using?
Enders 2010. It is cited in the video.
thank you
You are welcome