Principal Component Analysis (PCA) in R (presence-absence data)
- Published 3 Jan 2025
- In this tutorial, we discuss what a principal component analysis (PCA) is, walk through an example in R using species presence-absence data, and create and interpret a PCA biplot.
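For readers who want to try this before watching, here is a minimal sketch of a PCA on presence-absence data. It assumes base R's prcomp() and biplot(), which may not be the functions used in the video, and the site and species names are made up for illustration.

```r
# Minimal sketch: PCA on simulated species presence-absence data (base R only).
set.seed(1)

# Hypothetical data: 10 sites (rows) x 5 species (columns); 1 = present, 0 = absent
pa <- matrix(rbinom(10 * 5, size = 1, prob = 0.5), nrow = 10, ncol = 5,
             dimnames = list(paste0("site", 1:10), paste0("sp", 1:5)))

# Centre each species column; scaling is left off because every column is already 0/1
pca <- prcomp(pa, center = TRUE, scale. = FALSE)

# Variance explained by each principal component
summary(pca)

# Biplot of sites (points) and species (arrows); drawn only in an interactive session
if (interactive()) biplot(pca)
```

Replacing pa with your own sites-by-species table should be the only change needed, as long as every column is numeric.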
By far the best video I could find on this topic, thanks a lot!
BY FAR THE BEST AND MOST USEFUL VIDEO I HAVE SEEN SINCE BEING BORN!!!!!! THANK YOU FOR EXPLAINING THIS TO ME!
I know right?
Thanks for making it easy for us❤
OMG finally an explanation of this that makes sense to me. Thanks!
Thanks so much for the video, helped me a lot, this was exactly what I needed for my ciliate data! Greetings from Austria:)
So glad this could be helpful for your ciliate data! Good luck with your project!!
Incredible video! Simple and straight to the point! I wish you well. Regards from Brazil 😇
Thank you so much, Peter! I am so glad people are finding this tutorial helpful!
I find this tutorial really helpful. Thank you for making this video.
Oh my god I'm so happy I found this video before my ecology final tomorrow. Thanks so much 😭💖
Video was incredibly helpful in walking through and understanding PCA
Thank you, Kelsey!
Hello :) first of all, I would like to thank you for THE most straightforward ecology/biology video on this subject I have ever found!! It helped me understand and solve several mistakes and questions I made/had. REALLY, thank you! I would appreciate and love it if you could find the time to do more videos like this.
I only have one question: how to avoid the overlapping of the site labels/species labels? I have 30 sites and more than 30 species, the problem being that I cannot see which sites are overlapping :(
Thank you again and I wish you success and great accomplishments in your field of study!
Thank you so much! I have a few more similar videos, but am planning on starting up making more soon! I just came back from 5 months of field work so I am just getting back into the groove of things!
Those overlapping labels are always the worst! I have trouble doing it with the biplot function because it's a bit limited in the level of customization you can do. An alternative would be to use the fviz_pca_biplot() function in the 'factoextra' package. Similarly, the first thing you would put in the brackets would be the name of the PCA object. With this function there is an argument you can add in the brackets called repel, which can be 'repel = TRUE' or 'repel = FALSE'. If you try setting it to TRUE it might solve your problem.
Another option which is a bit of a 'cheat' is what I usually do to be honest because it is much simpler. For some reason I find any text label repel code I find doesn't really do what I want it to. So I save the plot but as a metafile rather than an image or however else you generally save it. I then insert the metafile into PowerPoint. Here, I can right click, go to group, then ungroup. This unconstrains the text and any other discrete plot element so you can edit it. I use it to move around my labels so they aren't on top of each other and sometimes to revise some variable names to make the aesthetic better. Just be careful to not accidentally alter your plot itself!
I hope this is helpful but please let me know if you have any questions!!
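A minimal sketch of the repel suggestion above, assuming the PCA object comes from prcomp() and that the 'factoextra' package is installed. The data are simulated, and the object names (pa, pca, p) are only for illustration.

```r
# Sketch: biplot with non-overlapping labels via factoextra (if it is installed).
set.seed(2)

# Hypothetical presence-absence data: 30 sites x 8 species
pa <- matrix(rbinom(30 * 8, size = 1, prob = 0.5), nrow = 30, ncol = 8,
             dimnames = list(paste0("site", 1:30), paste0("sp", 1:8)))

pca <- prcomp(pa, center = TRUE, scale. = FALSE)

if (requireNamespace("factoextra", quietly = TRUE)) {
  # repel = TRUE asks the plot to nudge text labels apart
  p <- factoextra::fviz_pca_biplot(pca, repel = TRUE)
  if (interactive()) print(p)
} else {
  message("Install 'factoextra' to try fviz_pca_biplot()")
}
```

With many sites and species the labels can still be crowded, which is where the manual PowerPoint approach described above may be easier.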
@justonebirdsopinion Hello again, Thank you for your comment!!!
I am back with another question :) What if we want to make a PCA with the environmental data?
chim
thanks dude for explaining it 😎
you're welcome
🙌🙌🙌Grateful for the video 👏👏
Amazing video!! Thanks!!
Epic channel so far! Thanks for the video, from South Africa :) Could you do a PCA video with some environmental variables linked with the species' absence-presence? Would that be a scaling = 2 PCA?
Responding here with my other account so if you respond I'll get a notification😂
@liamtaylor7710 Hi Liam! What you are thinking of is an RDA rather than a PCA. An RDA tries to fit the composition (presence-absence) to explanatory variables (environmental predictors). I will for sure make a video on this so stay posted!!
RDA stands for redundancy analysis by the way!
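To make the idea concrete, here is a rough from-scratch sketch of an RDA in base R: regress the centred species matrix on the environmental predictors, then run a PCA on the fitted values. The variable names (elevation, canopy, sp_a to sp_d) and the simulated data are assumptions for illustration. In practice the rda() function in the 'vegan' package is the usual tool, and this sketch is not meant to reproduce its scaling options.

```r
# Sketch: redundancy analysis (RDA) as "PCA on the fitted values of a regression".
set.seed(42)
n_sites <- 20

# Hypothetical environmental predictors, one row per site
env <- data.frame(elevation = runif(n_sites, 0, 1000),
                  canopy    = runif(n_sites, 0, 100))
elev_z   <- as.numeric(scale(env$elevation))
canopy_z <- as.numeric(scale(env$canopy))

# Hypothetical presence-absence data whose probabilities depend on the predictors
species <- cbind(sp_a = rbinom(n_sites, 1, plogis(2 * elev_z)),
                 sp_b = rbinom(n_sites, 1, plogis(-2 * elev_z)),
                 sp_c = rbinom(n_sites, 1, plogis(2 * canopy_z)),
                 sp_d = rbinom(n_sites, 1, 0.5))

# Step 1: centre the response matrix and regress every species on the predictors
Y   <- scale(species, center = TRUE, scale = FALSE)
fit <- lm(Y ~ elevation + canopy, data = env)

# Step 2: PCA of the fitted values gives the constrained (RDA) axes
rda_axes <- prcomp(fitted(fit), center = FALSE)

# Share of total species variance explained by the predictors
constrained_prop <- sum(rda_axes$sdev^2) / sum(apply(Y, 2, var))
round(constrained_prop, 3)
```

With two predictors there can be at most two constrained axes, so any remaining components of rda_axes have essentially zero variance.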
@justonebirdsopinion thanks. You're an absolute legend!
@liamtaylor7710 Haha so glad you think so! The RDA video is posted - I hope it's helpful! If it's not what you were looking to do, just let me know and I'd be happy to chat to come up with the proper analysis. Please don't hesitate to comment if you have any questions!
Wow, thanks!
Great explanation! Can I do a pca using rda with numeric variables of 11 levels?
Thanks! Yes you can, numeric variables are perfectly fine. Although if you end up with a whole bunch of variables consider doing some dimensionality reduction.
Great video! Super helpful!!
Would a PCA like this, for presence absence of species, be able to include explanatory variables that explain the distribution of the community in the PCA? Your video was super helpful and I was able to run the PCA with my data but now I'd like to visualize in the PCA plane how other variables relate to the spread and layout of the species in the pca. Is a PCA still even what I should be using?
Thanks so much! You are going to want to do a redundancy analysis (RDA) instead - it is essentially just a PCA with explanatory predictors for the response. I have a video on this on my channel, but let me know if you have any questions :) Good luck with your analysis!
Hi, thanks for the video!!! really helpful and helped me through the first step.
The species data I have is also presence/absence. However, because I'm dealing with plants, I have p/a data for 159 species :( and therefore my explained variance values are very low (e.g. 0.03 for PC1). What would you recommend in this case? Should I take out less important or very rare species?
Hi! Thank you :) I'm glad the video was helpful so far!
Oof that sounds frustrating! It sounds like this could be largely due to strong correlation between your species. You can create a correlation matrix in R and omit anything that is inflating your results (I suggest omitting anything with a value greater than 0.7 or less than -0.7). Here is the code for that (with df being the name of your dataframe):
pearson
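The code in this reply appears to be cut off after "pearson". As a rough sketch of the kind of filter described (not the author's original code), the following builds a Pearson correlation matrix with cor() and drops one species from each pair whose correlation is above 0.7 or below -0.7. The example data frame and the choice to drop the second species of each pair are assumptions.

```r
# Sketch: drop species that are strongly correlated with another species.
# Hypothetical presence-absence data frame (rows = sites, columns = species)
df <- data.frame(
  sp_a = c(1, 1, 0, 0, 1, 0, 1, 0),
  sp_b = c(1, 1, 0, 0, 1, 0, 1, 1),  # differs from sp_a at one site only
  sp_c = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Pairwise Pearson correlations between species
cor_mat <- cor(df, method = "pearson")

# Locate each pair once (upper triangle) where |r| exceeds 0.7
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)

# Assumption: drop the later column of each flagged pair
to_drop <- unique(colnames(cor_mat)[high[, "col"]])
df_reduced <- df[, !(names(df) %in% to_drop), drop = FALSE]

names(df_reduced)
```

Here only sp_a and sp_b exceed the threshold, so sp_b is dropped and the PCA would then be rerun on df_reduced.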
Hello! I just wanted to thank you for this amazing video! I also would like to ask you something. What would you do if you want to represent your species under different habitats (e.g. Forest, Meadow and Scrubland) but each habitat has their own amount of sites (e.g. 20 sites each one)? Would you combine your sites to represent the habitats? I'm a little bit lost about which approach I should take to work with my data. Thank you so much!
So how does the result (relatedness and unrelatedness) inform the science question at hand? (not an ecologist)
This video is really helpful to me, and I would like to know about the variables: is there a minimum number of variables for using PCA? For example, with 5 places and 6 or 7 parameters, could PCA be used? Thanks
Yes but how do the principal components that come back when you get the summary from a PCA in R correlate back to the variables you input? I have yet to make sense out of this, nobody seems to explain it clearly and simply, not my teacher and not 1 single youtube video I've watched. I'm lost...
Hi Greg. I am not super sure I understand what you mean, but will try to help! The principal components are axes explaining variation in your data (the variables you input). So high variance explained by a principal component indicates higher relatedness between the variables. If PC1 and PC2 cumulatively explain a super high amount like 90% of your variation, the story of your data is super clear! Pretty much all the relationships between your variables are explained by those two axes. Whereas if there are many principal components each explaining very little variation, then the relationship is more complicated or the variables may not be related at all. A biplot can be used to visualize the relationships between the variables (the magnitude of a positive/negative relationship). But the PC values are important to add validity to those relationships. Again, if the PC1 and PC2 axes explain 90%, the relatedness between variables is going to be much more reliable than if they were cumulatively explaining 20%. If you are interested in a deeper dive, you can look at the individual PC scores for each variable in the summary output to look at the distance between variables along each PC axis. If the visual biplot is not answering your question, then maybe doing this could help. Does any of this answer your question? Apologies if I am misunderstanding.
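One way to see the link between the components and the input variables is sketched below, assuming the PCA was run with base R's prcomp() (the data are simulated and the names are illustrative). The rotation matrix holds the weight each input variable gets on each component, and the site scores are just the centred data multiplied by those weights.

```r
# Sketch: how principal components map back to the input variables.
set.seed(7)

# Hypothetical presence-absence data: 12 sites x 4 species
pa <- matrix(rbinom(12 * 4, size = 1, prob = 0.5), nrow = 12,
             dimnames = list(paste0("site", 1:12), paste0("sp", 1:4)))

pca <- prcomp(pa, center = TRUE, scale. = FALSE)

# Proportion of total variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# Loadings: one row per input variable, one column per component
pca$rotation

# Each site's scores are the centred data projected onto the loadings
scores_by_hand <- scale(pa, center = TRUE, scale = FALSE) %*% pca$rotation
```

A species with a large positive or negative value in the PC1 column of pca$rotation contributes strongly to that axis, which is what the arrows in the biplot are drawing.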
I guess by now that you've noticed that your spelling of "principal" is incorrect.