Handling missing values in PCA

  • Published 3 Feb 2025

COMMENTS • 31

  • @mab963
    @mab963 6 years ago +4

    Thank you very much, very useful!

  • @vincentsmith8339
    @vincentsmith8339 4 years ago +1

    Great presentation

  • @DanielMorenoSoto
    @DanielMorenoSoto 6 years ago +2

    Professor Husson,
    I want to thank you and acknowledge the great usefulness of this tool. Dealing with missing values has always been a major issue for me, especially with some molecular biology techniques (such as quantitative PCR) that will eventually produce some NAs simply because the detection threshold of the machine is not reached.
    I've just used the package and it worked (it imputed values and ran the PCA), but I wanted to know if it's possible for the number of dimensions estimated by estim_ncpPCA to be equal to zero (I used zero when imputing), which was the case with my data.
    Thanks in advance.
    Best regards.

  • @juanmanuelcelyarevalo8703
    @juanmanuelcelyarevalo8703 14 days ago

    Thanks!! Great video :)

  • @3Mus-cat-tears
    @3Mus-cat-tears 1 year ago

    Professor Husson,
    Thank you for the video and the explanation of the concept.
    I have one question though. I have run through your code and generated all these graphs, but how do I put those imputed values back into the original orange dataset?

    • @HussonFrancois
      @HussonFrancois 1 year ago

      The imputed values are in the object res$completeObs.

  • @EricSmith9000
    @EricSmith9000 6 years ago +1

    Well done. Many thanks!

  • @javierhernando5063
    @javierhernando5063 2 years ago

    Nice video! So helpful! Do you think this could be applied to gene expression? A bunch of genes after treatment to see if they form clusters?

    • @HussonFrancois
      @HussonFrancois 2 years ago +1

      Yes sure!

    • @javierhernando5063
      @javierhernando5063 2 years ago

      @HussonFrancois Would you recommend scaling and centering the variables? Bear in mind that gene expression values are already normalized using a control gene, and that every sample (and consequently every gene) is considered after treatment relative to before treatment (for every patient). I mean, we are already carrying out a normalization.

  • @cindyconlin2124
    @cindyconlin2124 2 years ago

    Professor Husson,
    Thank you very much for the helpful video and packages. Is it possible to impute data with a large number of variables, and then use only a subset of the variables in the plot for MIPCA and also in the PCA? (I have "total" variables that are a sum of variables within that subcategory. If I use only the total variables to impute, estim_ncpPCA tells me to use 0 dimensions, but if I impute with the subvariables also, estim_ncpPCA tells me to use 1 dimension. I would like to use the full dataset to impute, but then only use the "total" variables for the rest of the analysis).
    If this is possible, how would I specify the subset of variables I would like to see plotted in the plot(mi) command?
    If this isn't a wise strategy, could you kindly advise me on better approaches? Many thanks.

  • @fabricen26
    @fabricen26 8 years ago +1

    Great! Thanks for your help

  • @kirkgeier417
    @kirkgeier417 2 years ago

    Big thank you!

  • @kevintschirhart412
    @kevintschirhart412 7 years ago +1

    Great video! Thanks

  • @doctorwhyphi
    @doctorwhyphi 5 years ago +1

    Hello! Could I use the imputed data for an exploratory factor analysis of a Likert questionnaire? Thank you

    • @HussonFrancois
      @HussonFrancois 5 years ago +1

      Yes, you can use the imputed dataset for any statistical method. But don't forget that the links between variables are reinforced when you complete the data, especially if you have a lot of missing values.
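
The warning in the reply above, that completing the data reinforces the links between variables, can be illustrated with a small sketch. It is written in Python rather than R and does not use missMDA: it swaps in simple regression imputation as a stand-in for PCA-based imputation, and the toy data and the helper names `pearson` and `fit_line` are hypothetical. Under those assumptions, the imputed points fall exactly on the fitted line, so the correlation computed on the completed data is stronger than the one computed on the observed pairs alone.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical toy data: y is missing (None) in the last two rows.
x = [1.0, 2.0, 3.0, 4.0, 0.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, None, None]

# Keep only the rows where both variables are observed.
observed = [(a, b) for a, b in zip(x, y) if b is not None]
x_obs = [a for a, _ in observed]
y_obs = [b for _, b in observed]

# Stand-in imputation: fill each missing y with its value on the fitted line.
slope, intercept = fit_line(x_obs, y_obs)
y_completed = [intercept + slope * a if b is None else b for a, b in zip(x, y)]

r_observed = pearson(x_obs, y_obs)
r_completed = pearson(x, y_completed)
print(round(r_observed, 3), round(r_completed, 3))
```

The more values are imputed this way, the more the completed table reflects the imputation model rather than the data, which is why downstream analyses run on a single completed dataset tend to look more certain than they should.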

    • @doctorwhyphi
      @doctorwhyphi 5 years ago

      @HussonFrancois Thank you very much!

  • @valabreu5870
    @valabreu5870 6 years ago

    Dear Dr. Husson, I'm running the ncpMCA function on a data frame of categorical variables, but keep getting this error: 'Error in tab.disj.comp - vrai.tab : non-conformable arrays'. Would you please advise me? Thank you so much

  • @aboubakeraden9116
    @aboubakeraden9116 5 years ago

    Hello Professor,
    I have a few questions for you:
    - Is multicollinearity between input features a problem for neural nets?
    - After all these years of hype around neural networks (or deep learning), how is it that neural networks fail, in 95% of cases in the literature or in Kaggle competitions ON STRUCTURED OR TABULAR DATA, to beat nonlinear models such as tree-based ones (Xgboost, Adaboost...)?
    - Is there a scientific explanation for why neural networks do not work as well on structured data as they did on unstructured data (images, videos, audio, text, sequences...)?

  • @meghanshirleybezerra7079
    @meghanshirleybezerra7079 4 years ago

    Hello! I am getting a value of '0' for the estimated ncps. Can you advise what to do in this situation? Thanks so much.

    • @HussonFrancois
      @HussonFrancois 4 years ago

      A value of 0 means that imputation by the mean of each variable is the best. It also means that there are not many links between your variables.
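
The reply above equates an estimated dimension of 0 with imputing each variable by its mean. As a minimal sketch of what that imputation amounts to, here is a plain-Python version. It is not missMDA's code: the function name `impute_column_means` and the toy table are hypothetical, and missing cells are represented by `None`.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace each None by the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical toy table with two variables and two missing cells.
data = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [6.0, 20.0],
]

completed = impute_column_means(data)
for row in completed:
    print(row)
```

With no dimensions retained, nothing about the other variables is used to fill a cell, which is consistent with the remark that the variables are only weakly linked.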

    • @meghanshirleybezerra7079
      @meghanshirleybezerra7079 4 years ago

      @HussonFrancois Thank you for the quick reply! I don't have very many missing variables - is it possible this is also a potential reason for mean imputation being the best option?

  • @Nicolas-mp4oq
    @Nicolas-mp4oq 3 years ago

    thx

  • @liwenzhao8785
    @liwenzhao8785 6 years ago

    Got error when I tried to compute:
    nb

    • @noereyna2553
      @noereyna2553 2 years ago

      Were you able to solve this? I’m wondering the same thing

  • @machheydt176
    @machheydt176 7 years ago

    I'm currently working on my MSc degree and I'm in need of a package that can impute my missing data. So I found this video and started reading your paper, but you mention quite early that the aim of missMDA is more to try and visualise the PCA despite missing data, whereas other approaches might be better suited to imputing the missing data.
    I don't know if this is the right place for such a discussion, but what is the difference between performing 'your' PCA on a set including missing data, and performing 'regular' pca after you've imputed the data with a better suited package/approach? It seems that performing PCA on a set of which you've imputed the data more correctly is inherently more correct? Or are there more intricacies that I'm not (yet) aware of?
    I'm just curious. Anyway, this video does help a lot understanding your paper, as I'm not by any means a good statistician or mathematician or anything. Merci!

    • @HussonFrancois
      @HussonFrancois 7 years ago +2

      In fact, we first considered missMDA as a way to handle missing values in PCA, but we then studied the quality of the imputations obtained by missMDA, and the results were better than or equivalent to those of competing methods such as random forests.

    • @machheydt176
      @machheydt176 7 years ago +1

      Could you then please elaborate why on page 3 of your article you mention that
      'The main difference between the two R packages missMDA and pcaMethods is that the primary aim of missMDA is to estimate PCA parameters and obtain the associated graphical representations in spite of missing values, whereas pcaMethods focuses more on imputation aspects.' ?
      What does this then really mean? Is missMDA 'better' at imputing values than pcaMethods? Even though pcaMethods 'focuses more on imputation aspects'?
      Thank you for your reply!

    • @mattbeets
      @mattbeets 3 years ago

      @machheydt176 I know this is a very late answer, but maybe this quote from the missMDA::MIPCA documentation gives a clue to the difference between methods? (at least the Bayesian vs. bootstrap methods in missMDA):
      "The methods differ by the way in which the variability due to missing values is reflected. The method used is controlled by the method.mi argument. By default, MIPCA uses the parametric bootstrap method.mi="Boot". This bootstrap method is more recommended to evaluate uncertainty in PCA (through confidence ellipses). Otherwise, the Bayesian method can be used by specifying the argument method.mi="Bayes". It is based on an iterative algorithm which alternates imputation of the data set and draw of the PCA parameters in a posterior distribution. These steps are repeated Lstart times to reach a convergence. Then, one imputed data set is kept each L iterations to ensure independence between imputed values from a data set to another. The Bayesian method is more recomm[e]nded to apply a statistical method on an incomplete data set."

  • @nayldev9185
    @nayldev9185 4 years ago

    Your English is very good!

  • @HouDa-fi9nq
    @HouDa-fi9nq 2 years ago

    3:52