I had listened to several other lectures on this topic, but the pace and the detail covered in this video are simply the best.
Please keep up the good work!
Thanks Sadia! Glad to be of help.
OMG, this tutorial is perfection, I'm serious. You make it sound so easy and you explain every single step. Also, that is the prettiest plot I've seen. Thank you so much for this.
You're very welcome! If you like pretty plots, check out my video on using ggplot2 ;) ua-cam.com/video/1GmQ5BdAhG4/v-deo.html
No one explains R better than Hefin. Give this man a medal already!!
Never seen a tutorial about PCA so clear and simple. Thanks
I'm in graduate school and you just explained PCA better than my professor. GOD BLESS YOU!!!!
5-year-old video, still one of the best I've found on the topic on YT. Thumbs up
I never comment on videos, but you really saved me here. Nothing was working on my dataset and this worked smoothly. Well done on the explanations too, everything was crystal clear.
I have my exam in 2 days and your video saved me tons of effort in combing through so many other articles and videos explaining PCA. A BIG Thank You! Hope you do many more videos and impart your knowledge to newbies like me. :)
Finally a perfect tutorial for PCA in RStudio. Thanks mate!
How I came across this video a week before my final-year project due date is a miracle. Thank you so much Hefin Rhys.
Jackie Mwaniki, what project are you doing?
@@mohamedadow8153 my topic is on Macroeconomic factors and the stock prices using the APT framework.
Explained everything one might need. If only every tutorial on UA-cam were like this one!
Really useful video, thank you. I've just started my MSc project using PCA, so this was a great help. I will be following subsequent videos.
Quite literally, the best tutorial I've ever seen on an advanced multivariate topic. Job well done, sir!
The best run through I've seen for using and understanding PCA.
Excellent tutorial. I have used this for analysis of my research. Thanks a lot for sharing your valuable knowledge.
Great help, been doing my own work following this tutorial step by step... the whole night
In all honesty this is the best tutorial I've seen in months. Nice job!
Damn, your accent is hypnotic! The explanation is good too!
Thanks! 😘
Excellent! Words cannot show how grateful I am!
I've been going through your tutorials and I'm so impressed. Legend!!!
This video gave a major leap in my study. Thanks.
Best explanation I’ve found so far! Thanks mate, legend!
Uploaded the script as well. What a guy!
Really love your explanation! Thank you so much for your video, really helpful and I can understand it! Keep it up! Looking forward to your many more upcoming videos.
It's so funny, I don't think you realize, but myPr ("my pyaar") in Urdu/Hindi means "my love". Thank you for an amazing and extremely helpful video.
This tutorial is outstanding. Excellent explanation! Thank you very much!!!
Thanks for the video, it helped me a lot!! Your explanation is very didactic!
I can say for sure that it's the best explanation I've ever seen!! Keep going, and I would be really grateful if you made one on Time Series and Forecasting :)
Thanks Elena! Thank you also for the feedback; I may make a video on time series in the future.
The explanation is just perfect. Thank you.
Great video. Very instructive. Please keep making them
Very nice tutorial, nicely explained and really complete, looking forward to learn more in R with other of your vids, thank you for the tremendous help!
Thank you! I'm glad it helped.
Excellent walkthrough. Thank you!
Added to my stats/math playlist! Very useful.
Thank you so much for the very clear and concise explanation!
In fact I found out how to overcome the multicollinearity, by using the eigenvalues of PC1 and PC2! I love PCA!
This is gold. I absolutely love you for this
Thanks for the nice and easy explanation. It really helped me a lot.
Thank you so much for this SUPER helpful video. (P.S. The explanation with the iris dataset was especially convenient for me as I'm working on a dataset with dozens of recorded plant traits:D)
Hi, I wonder if it's possible to put a label on each point? I tried geom_text but I get an error.
Yes you should be able to. What have you tried? If you have a column called names with the label for each point, something like this should work:
ggplot(df, aes(PC1, PC2, label = names)) +
geom_text()
Or use geom_label() if you prefer.
You can also check out the ggrepel package if you have many overlapping points.
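A self-contained sketch of that suggestion, with a hypothetical score data frame df; geom_text_repel() from ggrepel nudges overlapping labels apart:

library(ggplot2)
library(ggrepel)

# hypothetical data frame of PCA scores with a label column
df <- data.frame(PC1 = rnorm(10), PC2 = rnorm(10), names = letters[1:10])

ggplot(df, aes(PC1, PC2, label = names)) +
  geom_point() +
  geom_text_repel()  # or geom_text() / geom_label() from base ggplot2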
@@hefinrhys8572 I have 18 observations and 9 variables, which represent my environmental parameters. I successfully produced the ggplot figure, but I wanted to put a label on all the points in the figure to know what variables cluster together. I tried your suggestion but it gives me the numerical value, not the environmental variables. Any other suggestion?
Thank you so so much!! You just saved the day and helped me really understand my homework for predictive analysis.
It is a really nice and clear tutorial! Thanks a lot, Hefin~
You're welcome Flora! Thank you!
Excellent tutorial Hefin. Hooked and subscribed...
Vesselin Nikov thank you! Feel free to let me know if there are other topics you'd like to see covered.
OK, so Sepal.Width contributes over 80% to PC2, and the other three variables contribute more to PC1 (14:32), so Sepal.Width is fair enough as information to separate setosa in the next plot. Isn't it also advisable to apply PCA only to linear problems?
You're correct about the relative contributions of the variables to each principal component. The Setosa species is discriminated from the other two species mainly by PC1, to which Sepal.Width contributes less than the other variables. As PCA is a linear dimension reduction technique, it will best reveal clusters of cases that are linearly separable, but PCA is still a valid and useful approach to compress information, even in situations where this isn't true, or when we don't know about the structures in the data. Non-linear techniques such as t-SNE and UMAP are excellent at revealing non-linearly-separable clusters of cases in data, but interpreting their axes is very difficult/impossible.
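To check those contributions yourself, a quick sketch on the iris data from the video (myPr is the object name used there):

myPr <- prcomp(iris[, -5], scale. = TRUE)
myPr$rotation  # loadings: Sepal.Width dominates PC2; the other three load mainly on PC1
summary(myPr)  # proportion of variance explained by each component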
Thank you so much! This is GREAT! You explained very clearly and smoothly.
thank you so much for this video. incredibly helpful.
Finally understood this goddamn topic! Thank you dude
Very nice, guys hit the subscribe button, the best explanation so far.
Sweet baby Jesus. Thank you for making this video!
You're very welcome!
Perfect! Never seen such explanation
this tutorial is slap bang fuckin perfect, god bless you, you magnificent bastard
😘
@@hefinrhys8572 stats assignment due in 12 hours and you saved me a lot of hassle
thank you so much! you are the best, very clear explanation.
You are really a life saver! Thank you!
Hello! Thanks for the video. Just a question: how would you modify the code if you have NA values? In advance, thank you!
When I generate the PCA plot with the code explained @ 20:46, my legend appears as a gradient rather than as separate values (as in your three different species appearing in red, blue, and green). How can I change this?
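A gradient legend usually means the colour variable is numeric rather than a factor; wrapping it in as.factor() restores discrete colours. A minimal sketch with a hypothetical group column:

library(ggplot2)
df <- data.frame(PC1 = rnorm(9), PC2 = rnorm(9), group = rep(1:3, each = 3))
ggplot(df, aes(PC1, PC2, colour = as.factor(group))) +  # factor => discrete legend
  geom_point()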
Can I confess something that baffles me? Because I see this all the time. OK, so you, personally, are motivated to share your knowledge with the world, right? I mean, you took time, effort, energy, focus, planning, equipment, software, etc. to prepare this explanation and exercises. You screen-captured it, you set up your microphone, you edited the video; you did an enormous amount of work. You're clearly motivated.

Yet, when it actually comes time to deliver that instruction, you think it is 100% acceptable to place all your code into an absolutely minuscule fraction of the entire screen. Pretty close to 96% of the screen is 'dead space' from the perspective of the learner. The size of the typeface is minuscule (depending on your viewing system). It would be like producing a major blockbuster film, but then publishing it at the size of a postage stamp. Surely it would be possible for you to 'zoom into' that section of the IDE to show people what it was you were typing - the operators, the functions, the arguments, etc.

I'm not really picking on you, individually, per se. I see this happen all the time with instructors of every stripe. I have this insane idea that instruction has much, much less to do with the instructor's ability to demonstrate their knowledge to an uninformed person and much, much more to do with the instructor's ability to 'meet' the student 'where' they are and to carry the student from a place of relative ignorance (about a specific topic) to a place of relative competence.

One of the best tools for assessing whether you're meeting that criterion is to PRETEND that you know nothing about the topic - then watch your own video (stripping out all the assumptions you would automatically make about what is going on based on your existing knowledge). If you didn't have a 48" monitor and excellent eyesight, would you be able to see what was being written? Like... why would you do that? If writing the code IS NOT important - don't bother showing it. If writing the code IS important, then make it (freaking) visible and legible.

This really baffles me. I guess instructors are so "in their own head" when they're delivering content, they don't take time to realize that no one can see what is happening. It just baffles me how often I see this.
If 'zooming in' is not easily achieved, the least instructors could do is go into the preferences of the IDE and jack up the size of the text so that it would be reasonably legible on a screen typical of, say, a laptop or tablet. It just seems like such low-hanging fruit, an easy fix to facilitate learning and ensure legibility.
@@EV4UTube chill out dude
At 5:50, don't you mean that if we measured sepal width in kilometers then it would appear LESS important? Because if we measured it in kilometers instead of millimeters, our numerical values would be smaller and vary far less, making it less important in the context of PCA.
Thank you for this video.
Yes, you're absolutely correct! What I meant to say was that if the length was in kilometers but we measured it in millimeters, then it would be given greater importance. But yes, larger values are given greater importance.
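A tiny experiment that demonstrates the point; the unit blow-up is artificial:

dat <- iris[, -5]
dat$Sepal.Width <- dat$Sepal.Width * 1000   # same lengths, recorded in much smaller units
prcomp(dat)$rotation[, 1]                   # unscaled: PC1 is almost entirely Sepal.Width
prcomp(dat, scale. = TRUE)$rotation[, 1]    # scaled: the artefact disappears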
@@hefinrhys8572 Alright, thanks for the reply and for the video!
How is it possible to generate outliers uniformly in the p-parallelotope defined by the
coordinate-wise maxima and minima of the ‘regular’ observations in R?
Great teacher you are, thanks
Super well-explained, thank you!
a perfect tutorial for PCA... Thank you
Amazing video Hefin, there are a lot of details covered in this 27-minute video; we just have to be careful not to miss a second of it. I have a question: how are the scores calculated for each PC? Why do we have to check the correlation between the variables and PC1 & PC2? What value does it add practically?
Hi Hefin,
Thanks for this tutorial. What do we do if PC1 and PC2 can only explain around 50% of the variation? Do we also include PC3 and PC4? If so, how?
simple and clear. very good
Thank you so much for this tutorial, it really helped me!
Very informative and clear Thanks.
Hi Hefin, can I put the percentage of variance explained by PC1 and PC2 on the x- and y-axes? How do I do that?
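This isn't answered in the thread, but here is a hedged sketch, assuming the video's myPr object and a ggplot saved as p:

pct <- round(100 * myPr$sdev^2 / sum(myPr$sdev^2), 1)  # % variance per component
p + labs(x = paste0("PC1 (", pct[1], "%)"),
         y = paste0("PC2 (", pct[2], "%)"))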
Fantastic video Hefin! thanks
Many thanks for your efforts to make this complex issue much easier for us. Could you enlighten me on how to understand group similarity and dissimilarity using PCA?
Clear and straight forward, good work!
Bully for you! Lol
Where did you define PC1 and PC2 (where you use them in the ggplot)? I'm getting "Error: object 'PC1' not found"
Good tutorial!I have learnt a lot. Thanks !
10:21 - When using "prcomp", the calculation is done by a singular value decomposition. So, these are not actually eigenvectors, right?
SVD still finds eigenvectors as it's a generalization of eigen-decomposition. This might be useful: web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
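A quick numerical check of that equivalence on the iris data:

X <- scale(iris[, -5])                         # correlation-scale the data
eig <- eigen(cov(X))$vectors                   # eigenvectors of the correlation matrix
rot <- prcomp(X)$rotation                      # prcomp's rotation, computed via SVD
max(abs(abs(eig) - abs(unname(rot))))          # ~0: the same vectors, up to sign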
@@hefinrhys8572 Thank you answering! I will look into it.
Amazing video! Thanks for the explaining everything very simply. Could you please do a video on PLS-DA?
If my biological data only has numbers (1-, 2-, and 3-digit) and a lot of zeros, do I need to scale it as well?
Very informative video. Can you help me? When I'm plotting the last ggplot, it shows an error: R says there is no package called digest. How do I deal with it? Kindly advise.
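For anyone hitting the same error: digest is a dependency of ggplot2, and that message usually just means the package is missing. The likely fix:

install.packages("digest")  # then reload ggplot2 with library(ggplot2)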
Hi! I have a question: does it make sense to run a PCA on discrete data? I am trying something using your tutorial as a guide but I get a weird result in the plot, and I am wondering if it is because of the nature of my data. Thanks
Great question! If your data are not ordinal, you may get some use out of PCA if you numerically encode your discrete variables, but you may get more out of Multiple Correspondence Analysis (MCA) than PCA. Have a look here: www.rpubs.com/piterii/dimension_reduction
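A minimal MCA sketch with the FactoMineR package (assuming it's installed; the mtcars columns here are just illustrative stand-ins for discrete variables):

library(FactoMineR)
df <- data.frame(lapply(mtcars[, c("cyl", "gear", "carb")], factor))
mca <- MCA(df, graph = FALSE)
head(mca$ind$coord)  # case coordinates, analogous to PCA scores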
Great presentation! However, why did you not binarize the categorical variable first, and then do the subsequent analysis?
Thanks!
Thank you Hefin Rhys for explaining PCA in detail. Can you please explain how to find the weights of the variables by PCA for making a composite index? Are the rotation values for PC1, PC2, etc. the weights? For example, if I have (I = w1*X + w2*Y + w3*Z), how do I find w1, w2, w3 by PCA?
Thanks again. Quick one... would you mind also doing the Fama-MacBeth analysis without using the Ken French data frame?
Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x'
???
I have a question: what if I want to perform PCA on data that have not just different scales but also different units, such as data involving environmental parameters like temperature, humidity, light intensity, etc.? Will scaling the data solve this? Thank you
Hi, yes this is a common situation. Scaling our variables means we can use them to find meaningful principal components, irrespective of their different measurement scales. Try running PCA on your data set with and without scaling the variables; you'll likely see a big difference. Scaling is valid (and important) for variables with different measurement scales.
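The suggested comparison as a two-liner; the effect is modest on iris (all variables are in cm) and much larger when units differ wildly:

summary(prcomp(iris[, -5]))                 # unscaled: large-variance variables dominate
summary(prcomp(iris[, -5], scale. = TRUE))  # scaled: variance spreads across components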
I have a question: to use the prcomp command, is it not necessary to transpose the matrix to do the analysis on individuals and not on variables?
The prcomp function assumes the columns are variables, and each row is a case. In this way, the resulting components maximise the explained variance of the original variables. I'm not sure how you would interpret the principal components if you first transposed the matrix. Try it, and see what you get.
@@hefinrhys8572 I'm really not sure, I'm confused. I'm basing this on this video:
ua-cam.com/video/0Jp4gsfOLMs/v-deo.html
Thanks for your answer
Yes so in the video you link to, the matrix they create has the cases as the columns, and the variables as the rows. This is why they use the t() function to transpose the matrix so that the columns are variables, and the rows are cases, which the prcomp function expects. Does that make sense?
@@hefinrhys8572 Well, I'm not sure if it's a problem or just confusion between the names for columns and rows in Spanish and their English translation. We usually put the individuals in the rows and the characteristics in the columns, but from what I understand you call the individuals variables and the cases the characteristics, am I right?
Can you look at the following table and confirm which ones are the variables for you?
drive.google.com/file/d/11QipxFBhlL6hoJ45_1SIU0VrKNiANndc/view
I would really appreciate your help, I'm really confused
Ok so the language of columns and rows can be confusing as there are many different words that mean the same thing. Your interpretation is the wrong way round: features == variables == characteristics, and individuals == cases == subjects. So in the table you link to, the columns are variables/features/characteristics, and the rows are individuals/cases/subjects. So in that example, you would NOT transpose, as it is already in the format prcomp expects.
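The orientation rule as code, with a hypothetical matrix:

mat <- matrix(rnorm(40), nrow = 8, ncol = 5)  # 8 cases (rows) x 5 variables (columns)
pca <- prcomp(mat, scale. = TRUE)             # already the layout prcomp expects

flipped <- t(mat)                             # variables in rows, cases in columns
pca2 <- prcomp(t(flipped), scale. = TRUE)     # transpose back before calling prcomp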
Great tutorial, but it leaves me with the question: what do I do with it? Is this just the beginning of a k-means classification that gives me an idea of the proper k?
Lol, you just answered this at 26:00... Thank you so much!
Very cool Hefin. I'm trying to run a data reduction for panel data (220 countries, about 25 years of data, and about 100 different variables). Could PCA be used for this?
Hi Simon, it will depend on what kind of data you have and what your goal is. All the variables will need to be numeric as PCA can't handle categorical variables (check out multiple correspondence analysis for this). If you want to find linear combinations of variables that explain most of the variation in the data, then PCA is a good choice. If you're just interested in seeing whether there are subgroups of subjects in your dataset, you might want to try a non-linear dimension reduction algorithm like t-SNE or UMAP :)
How can I set a desired font and font size in that graph?
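Not answered in the thread, but ggplot2's theme() controls this. A hedged one-liner, assuming your plot is stored in an object p:

p + theme(text = element_text(family = "serif", size = 14))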
Thank you so much for your videos!! Your videos are the best I have seen hands down :) All of your explanations and step by step through R are what I needed to work on my research.
One area I am having trouble with (since I am not a statistician) is making sure I run my data through all the necessary statistical tests before running the PCA. My data is similar to the iris dataset (skull measurements categorized by family and subfamily levels) but I am seeing different sources run different tests before the PCA (ANOVA vs non-parametric tests). If anything, would you be able to recommend some good sources for me to refer to? Thank you! I really appreciate it!
Thank you! This was very helpful to me
Outstanding. Thank you.
Hi, good job, but if I have input data in the form of a wave, how can I extract and separate the values of the crests above a certain threshold?
I have a question: why is "iris[,-5]*myPr$rotation" not equal to "myPr$x"? Isn't the "myPr$rotation" matrix the factor loadings? Thanks in advance...
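For anyone puzzled by the same thing: prcomp centres (and, with scale. = TRUE, scales) the data before projecting it, and matrix multiplication in R is %*%, not *. A sketch, assuming myPr was built with scaling as in the video:

myPr <- prcomp(iris[, -5], scale. = TRUE)
scores <- scale(iris[, -5]) %*% myPr$rotation  # centre and scale first, then project
all.equal(unname(scores), unname(myPr$x))      # TRUE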
You "R" AWESOME!!!
amazing video, thank you
I run into an error when running line 17 (in the downloaded file): Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 510, 382. What is going wrong?
How can I load my data into RStudio to work with it?
Can you run PCA on factor variables coded as 0 vs 1, with 1 meaning the presence of something?
There are some answers here that might help: stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont
But I would ask what your goal is with this. Are you looking to uncover some underlying latent variables in your data? In which case factor analysis may be the way to go. If it's just to reduce dimensionality to uncover clusters/patterns in the data, then PCA might work, but it will treat those 0/1 variables as continuous, which might not yield the results you're hoping for.
Thank you for this very clear video. Question about interpretation: I get just one cluster in my ggplot; what does this mean? That all my variables relate to the same construct (component) and that they can't really be differentiated?
So when you apply PCA to your own data and plot the first two components, you see just a single cloud of data? This would indicate that you don't have distinct, linearly-separable sub-classes of cases in your dataset. PCA will still compress the majority of the information of your many variables into a smaller number of variables, so even if it doesn't reveal a class structure in your data, it can still be beneficial for dimension reduction.
@@hefinrhys8572 thanks for the quick reply. Yes, I only see a single cloud. I am not using PCA for dimension reduction - just using it to explore my data before including these variables in a SEM. In particular, I wanted to see if it makes sense to relate these 5 variables to a single latent variable in my SEM. All the loadings for PC1 are 0.7 or 0.8 or more, and PC1 captures 0.7 of the variation. Can I take this result as support for considering these 5 variables as part of the same measurement model (linked to the same latent variable) in my SEM? Theoretically it makes sense, but I wanted to see if the data supported this. I have never done PCA or SEM, so I have no idea if I am doing this right.
Why can I not use categorical data?
So you could actually include categorical variables by numerically encoding them, or dummy coding them. The issue is that PCA finds new axes that maximise the variance of the data along them, and calculating variance for a categorical variable doesn't really make sense. If you have categorical variables, you could look at Multiple Correspondence Analysis (MCA), or you could apply PCA to your continuous variables, select the components that capture most of the variance, and combine these with your categorical variables for your downstream analysis. This may or may not yield satisfactory results.
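A hedged sketch of the dummy-coding route, using iris' Species as the example factor:

dummies <- model.matrix(~ Species - 1, data = iris)  # one 0/1 column per level
dat <- cbind(iris[, -5], dummies)
pca <- prcomp(dat, scale. = TRUE)
summary(pca)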
Thank you for this great video. Can you show how to detect or treat multicollinearity with PCA? I have a data set with 40 variables with high intercorrelation because of cross-reactivity. VIF and the correlation matrix don't work, probably because of multiple comparisons... :(((
Thanks for your nice job! I have a question.
I have biostat data. As you said in this video, we do not need to know what our variable for colour grouping is!
Actually, I have a problem, and it does not work for me! aes(x = PC1, y = PC2, col = ???)
I would really appreciate it if you replied!
Why am I getting the error "Too few points to calculate an ellipse". Can someone please explain in dummy terms. I am using my own data btw and following along this tutorial.
Amazing tutorial. Very simple and straight to the point. Already subscribed. I have some questions. PCA is an unsupervised method, isn't it? Is it possible to further decompose the data for Versicolor and Virginica to find further grouping? I have read before there are supervised methods. Do you have some tutorial for those?
Thanks enthiran! Yes, PCA is unsupervised because we don't give it any information about group membership; we give it unlabelled data and let it find the optimal projection of the data into a lower-dimensional space that maximises the explained variance. If you wanted to build a model to predict group membership, then you would need to use a supervised classification algorithm, where you supply a training dataset with grouping labels (this is what makes it supervised). The algorithm will then learn which features in the data associate with each group, such that when you give the model unlabelled data, it will predict group membership. I have a video on various clustering algorithms here: ua-cam.com/video/PX5nSBGB5Tw/v-deo.html
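A hedged sketch of the supervised counterpart described above, using lda() from the MASS package (which ships with R):

library(MASS)
fit <- lda(Species ~ ., data = iris)          # trained with labels, i.e. supervised
predict(fit, newdata = iris[1:5, -5])$class   # predicted group membership for new cases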