Applied Principal Component Analysis in R

  • Published Oct 5, 2024

COMMENTS • 77

  • @nicholaslewis7148
    @nicholaslewis7148 2 years ago +1

    Thank you for doing such a straightforward and simple video. No one else seems to have a video applying PCA and eigenvalues to a real dataset.

  • @jonashansen6391
    @jonashansen6391 2 years ago +1

    Great video. A simple, intuitive rundown of something complex.

  • @joshpoland804
    @joshpoland804 2 years ago +1

    This is so incredibly helpful. Thank you!

  • @AndrzejFLena
    @AndrzejFLena 3 years ago

    Bloody love your channel mate - thanks for all your great work! Peace from UK :D

  • @dendeibrahimadekanmbi8022
    @dendeibrahimadekanmbi8022 3 years ago

    Thanks for this presentation

  • @SwoleMastrChase
    @SwoleMastrChase 4 years ago

    Good stuff! I'm doing PCA for my senior project and this video helped me out a ton!

  • @marlonedy55
    @marlonedy55 2 years ago

    Thank you for sharing your knowledge. You could make a video about RDA and CCA. Greetings from Ecuador 🇪🇨🙋‍♂️

  • @andrewdelgado7536
    @andrewdelgado7536 3 years ago

    Dude you're a lifesaver for these videos! THANK YOU! - Clinical Research PhD Student

  • @shivangigujar3184
    @shivangigujar3184 3 years ago

    Very nicely explained!!!

  • @PinkWitch
    @PinkWitch 2 years ago

    Very helpful in preparing for exam PA. Thx ~

  • @lauradiaz5829
    @lauradiaz5829 2 years ago

    Amazing video!

  • @davekimmerle9453
    @davekimmerle9453 1 year ago +1

    Hey Spencer, great video, keep up the good work!
    I wanted to ask again if you have an academic article, paper, or book that I could cite in my thesis when I do PCA.

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      Try this:
      royalsocietypublishing.org/doi/10.1098/rsta.2015.0202

  • @telmamendes9423
    @telmamendes9423 3 years ago

    Thank you so much! It helped a lot!

  • @oliesting4921
    @oliesting4921 2 years ago +1

    Great video! I hope you will do more tutorials in R. Do you use stepAIC for feature importance? How is the AIC method different from PCA? Thank you!

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Hi! I'm glad you liked it. stepAIC is used to determine which model is best; it does this by recalculating the AIC score as features are removed (so it can be considered a form of feature selection).
      PCA, by contrast, is used to shrink the number of features, reducing complexity while retaining as much of the variance in your features as possible.
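      If you want to see the difference concretely, here's a minimal sketch with MASS::stepAIC (not from the video; it assumes the built-in mtcars data):
      library(MASS)
      # stepAIC drops/adds whole features, keeping the subset with the lowest AIC
      full <- lm(mpg ~ ., data = mtcars)
      best <- stepAIC(full, direction = "both", trace = FALSE)
      summary(best)
      # PCA keeps information from ALL features, re-expressed as a few
      # uncorrelated components
      pc <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
      summary(pc)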

  • @kavyashreenm6815
    @kavyashreenm6815 2 years ago

    Please do a video on "how to avoid overlapping labels in k-means cluster plots and PCA scatter plots".

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      Hmm. I don't really follow. Are you referring to a preprocessing step for the visualization side?

  • @pramitthapa283
    @pramitthapa283 10 months ago

    All the experts in YouTube videos say 'PCA is dimensionality reduction, blah blah blah...'. However, no one explains in a simple way what the reduced dimensions or principal components (explaining x% of the variance) actually mean in terms of the original variables, in a way a beginner in statistics can understand.

  • @XxRoos898xX
    @XxRoos898xX 3 years ago

    Hey Spencer Pao,
    Thank you for the great video!
    Just a few questions:
    - Why are you entering cor = true (pc.teeth

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago +1

      Hello, I'm glad you liked it!
      1) The cor = TRUE argument indicates whether to use the correlation or the covariance matrix for the PCA calculation. Since it is set to TRUE here, I am using the correlation matrix.
      2) PCA naturally tries to rotate the observations' axes for the best fit.
      3) Deciding which rotation to use is something of an art form. You'd have to try different rotation methods to see which one is best suited to your use case. I would try maybe 2-3 popular rotation algorithms and observe the outcomes.
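      For reference, the two settings look like this (a minimal sketch on the built-in iris data, not the video's teeth dataset):
      # cor = TRUE: PCA on the correlation matrix (variables standardized first)
      pc_cor <- princomp(iris[, 1:4], cor = TRUE)
      # cor = FALSE: PCA on the covariance matrix (raw scales kept)
      pc_cov <- princomp(iris[, 1:4], cor = FALSE)
      summary(pc_cor)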

    • @XxRoos898xX
      @XxRoos898xX 3 years ago

      @@SpencerPaoHere Thank you for the clarification! I am only familiar with PCA in SPSS.
      My teacher explained PCA in short as the following steps:
      - examine the correlation matrix
      Variables have to be correlated for PCA to be appropriate (i.e., if they are not correlated, they are unlikely to share common factors).
      + as a guide, look for correlations > 0.35 in absolute size
      - extract all potential factors
      - examine eigenvalues
      The total variance explained by each factor is the eigenvalue.
      The most common method is to only retain factors with eigenvalues > 1.
      An alternative method is to use a scree plot (look for the break in the curve as an indication of the point at which further factors stop giving a worthwhile extra amount of explained variance).
      - examine the factor matrix (loadings - determine which variables load heavily on which components; does it make sense theoretically that they load heavily on a certain component?)
      - examine the final statistics (communalities fall since only a subset of the factors is used)
      Low communalities suggest that a variable may need to be excluded (i.e., it is not explained well by your components).
      - explore how rotations influence your PCA (if needed)
      How do you feel about the above explanation?
      Is it possible to have two highly correlated variables? E.g., would a correlation of > 0.90 be a problem for PCA?
      Also, if I wanted to examine communalities in R, do you know how I would code that?
      I am just starting to learn about PCA, so I apologise if there are any mistakes in the above text.

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      @@XxRoos898xX Hello. Are you referring to the mathematics of PCA? I like the shortest explanations possible that address the main steps of the algorithm haha. But by all means, when it comes to teacher explanations, I'd stick with what they are saying, since they are the ones grading your descriptions :p
      High-level overview:
      1) Compute the covariance or correlation matrix
      2) Calculate the eigenvectors/eigenvalues of that matrix
      3) Sort the eigenvectors by their eigenvalues and keep however many eigenvectors your scree plot suggests
      4) Transform your data into the new subspace using the chosen eigenvectors
      The communalities you are referring to are equivalent to the sums of squared loadings. Check out 15:02
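      Those four steps in base R, as a rough sketch (assuming the built-in iris data):
      X <- scale(iris[, 1:4])           # standardize the variables
      e <- eigen(cor(X))                # 1) + 2) correlation matrix, then eigendecomposition
      # eigen() returns eigenvalues in decreasing order, so 3) is just picking k
      k <- 2                            # choose k from your scree plot
      scores <- X %*% e$vectors[, 1:k]  # 4) project into the new subspace
      # Communalities = row sums of squared loadings over the kept components
      loadings <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
      rowSums(loadings^2)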

    • @XxRoos898xX
      @XxRoos898xX 3 years ago

      @@SpencerPaoHere Thank you! I will rewatch your video :) it truly is amazing! I will check out 15:02 again; thank you for your quick replies.
      Sorry, my notes were from my teacher's explanation of how to interpret/work through an SPSS output of a PCA - we didn't touch the algorithms behind PCA (I think this may be why it is difficult for me to follow some of the steps in R).

  • @fishfish20
    @fishfish20 2 years ago

    Where is the next part? I love your work

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      Hmm. You can probably scroll through the videos on my channel to see what you're looking for. If not, let me know!

  • @dimariscolonmolina2223
    @dimariscolonmolina2223 3 years ago

    Hello, I have a PCA analysis where I used prcomp, and I want to show in the graph the species and the environmental parameter for sorting (grain-size particle), but the graph does not show the sorting. This is the code that I have at the moment. I really appreciate all the help, thank you!!!
    pr.envt1

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      This sounds like a graphing issue. Have you taken a look at the sort() function? This can 'order' your independent variables when graphing. You can use the sort function on the "X" variables that you are trying to plot, and it should sort the variables to your liking.

    • @dimariscolonmolina2223
      @dimariscolonmolina2223 3 years ago

      @@SpencerPaoHere OK, so how would the R code with the sort function look? If you have an email address, I'd be happy to send you my R script so you can see what I'm doing wrong, if that's no problem for you.

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      I am wary of revealing an email address publicly. Do you have a GitHub? Maybe you can push your stuff up there and I can take a look at it?
      Or, even better, if you have reproducible code with sample data (e.g., the iris dataset) that reproduces the problem, I can help diagnose the issue.
      But, in essence, for the variable that you want to plot, try something like plot(sort(name_variable)...)

    • @dimariscolonmolina2223
      @dimariscolonmolina2223 3 years ago

      @@SpencerPaoHere ok awesome will do

  • @adriantyanandya4338
    @adriantyanandya4338 1 year ago

    Hello Spencer. First, I want to thank you for your great video. I have a question: why can't I use 'princomp'? It says it "can only be used with more units than variables". Is there a solution? Thank you.

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      Thanks! Hmm. Well, it is perfectly reasonable (and ideal) to have more observations than there are features! This also helps with collinearity.
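      If you really do have more variables than observations, one workaround (my suggestion, not something covered in the video) is prcomp(), which is SVD-based and accepts wide data:
      set.seed(1)
      wide <- as.data.frame(matrix(rnorm(10 * 20), nrow = 10))  # 10 rows, 20 columns
      # princomp(wide)  # errors: "can only be used with more units than variables"
      pc <- prcomp(wide, center = TRUE, scale. = TRUE)          # works
      summary(pc)  # with 10 rows, only the first 9 components carry variance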

  • @menoknow2
    @menoknow2 2 years ago

    Hi Spencer, great video! I was wondering if you could use PCA for binary data? I see that the bot.canine feature is binary. Thanks!

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      Yes! You absolutely can use PCA on binary data. But there are better options when it comes to modeling categorical variables.

    • @azarael77
      @azarael77 2 years ago

      I read an article that said there might be problems if the difference in item difficulties between the variables (the proportion of people agreeing with an item) is too high. It recommends using correspondence analysis in that case.

  • @justin2icy
    @justin2icy 3 years ago

    Thank you, this was very helpful! One quick question: if I wanted to know which variables were closely related to another variable, how could I interpret that? For example, of the 8 variables, which 3 are strongly related to bot.canine? How would I go about doing that?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      Hi! Try looking at the correlations (or R^2 values) amongst the variables. If the relationship between, say, X and Y is close to 1, then you know that they are strongly related. You can run the correlation function on all the features and generate a correlation matrix.
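      Something like this (a sketch; 'teeth' stands in for whatever your data frame is called):
      cor_mat <- cor(teeth)  # pairwise correlations between all features
      # the 3 variables most strongly related to bot.canine, besides itself:
      assoc <- sort(abs(cor_mat[, "bot.canine"]), decreasing = TRUE)
      head(assoc[-1], 3)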

  • @kx7522
    @kx7522 2 years ago

    How do we construct an index with PCA? Do we multiply the raw data of each column by the proportion of variance explained and sum them up?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      There is a really lengthy answer to that question, and I would not be doing it justice without this post -- in essence, it depends.
      This might better answer your question:
      stats.stackexchange.com/questions/133492/creating-a-single-index-from-several-principal-components-or-factors-retained-fr
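      One common recipe from that thread, as a sketch (not the only defensible choice; 'X' is a placeholder for your data):
      pc <- prcomp(X, center = TRUE, scale. = TRUE)
      k  <- 2                                     # however many components you retain
      w  <- pc$sdev[1:k]^2 / sum(pc$sdev[1:k]^2)  # weights = share of retained variance
      index <- as.vector(pc$x[, 1:k] %*% w)       # weighted sum of component scores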

    • @kx7522
      @kx7522 2 years ago

      @@SpencerPaoHere Thanks :) I have another question, do I need to scale my dataset before doing PCA if I were to use the 'prcomp' function in R?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@kx7522 Yes! (Because PCA's backend is sums of squares.) You should scale & normalize.
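      Conveniently, prcomp() can do that for you (a one-line sketch; 'my_data' is a placeholder):
      pc <- prcomp(my_data, center = TRUE, scale. = TRUE)  # centers and scales each column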

  • @adityapratapsingh6068
    @adityapratapsingh6068 2 years ago

    Does having a categorical variable along with continuous variables make any difference? What modifications are needed in the analysis if the dataset is like that? Thank you.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      You'd have to one-hot encode your categorical variables. Otherwise, your categorical variables would be interpreted as ordered (which is something you don't want).

    • @adityapratapsingh6068
      @adityapratapsingh6068 2 years ago

      @@SpencerPaoHere So suppose my variable is a rating from 1 to 5, so its values are like 0, 2, 5, 1, and so on. Can I use them for the analysis as they are?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@adityapratapsingh6068 Yep! Just one-hot encode that feature. Then you should be good to go!
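      In base R that can look like this (a sketch; 'df' and its 'rating' column are placeholders):
      df$rating <- factor(df$rating)                      # treat the ratings as unordered labels
      dummies   <- model.matrix(~ rating - 1, data = df)  # one 0/1 column per level
      X         <- cbind(df[sapply(df, is.numeric)], dummies)
      pc        <- prcomp(X, center = TRUE, scale. = TRUE)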

  • @dcsekhar
    @dcsekhar 1 year ago

    Sir, I have data with categorical independent variables, which I have converted into 0/1, and I am trying to fit a logistic regression because the response variable is also categorical (0/1). Can this technique be used to avoid the multicollinearity problem in the dataset, and can I then do discriminant analysis for prediction?

    • @SpencerPaoHere
      @SpencerPaoHere  7 months ago

      Yes. There are other penalization techniques out there -- but you can use PCA to avoid multicollinearity (in fact, that is one of the main purposes of PCA).
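      As a rough sketch ('X' is your 0/1 predictor matrix and 'y' your 0/1 response; the 90% cutoff is only an example):
      pc  <- prcomp(X, center = TRUE, scale. = TRUE)
      k   <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= 0.90)[1]  # components for ~90% of the variance
      dat <- data.frame(y = y, pc$x[, 1:k, drop = FALSE])
      fit <- glm(y ~ ., data = dat, family = binomial)  # the components are uncorrelated by construction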

  • @ajaydhungana1921
    @ajaydhungana1921 3 years ago

    Does the number of entries among columns matter or not?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      Are you referring to records/rows? It really depends on what type of machine learning model you use. In general, the more data you have, the better off your model will be. (Not always the case; billions of records for a PCA might be a bit overboard.)
      But if you don't have a lot of data (i.e., < 100 records/observations), then you will be worse off.

  • @Aaarya299
    @Aaarya299 3 years ago

    How do I analyze principal components using variance values?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      Well... technically the variances are the eigenvalues. You can check out the covariance matrix of the principal components and see how much variability is explained by each component. To find out how much of the variability is explained, take the diagonal values and divide by the sum of the diagonal values to get the 'explainability' of each component... I am not sure if I answered your question.
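      In code, that calculation looks like this (a sketch on the built-in iris data):
      pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
      v  <- diag(cov(pc$x))  # diagonal of the score covariance = the eigenvalues
      v / sum(v)             # proportion of variance explained by each component
      summary(pc)            # the same numbers, straight from prcomp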

  • @chiennguyenminh8109
    @chiennguyenminh8109 3 years ago

    Hi, I wonder how you can autocomplete variables so fast? At 11:02

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago +1

      Oh haha. I just edited the video so that you won't see me type the words out (cutting out the more mundane parts).
      However, in R there are ways to autocomplete. Try the 'Tab' + 'Enter' keys when writing variables/functions, etc.

    • @chiennguyenminh8109
      @chiennguyenminh8109 3 years ago

      @@SpencerPaoHere Thank you so much!!!

  • @asirbillah4987
    @asirbillah4987 3 years ago

    Hello, would you please tell me whether it's possible to pursue an MS in ML/DL at a good US university if I have a ~3.5 CGPA from undergrad?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago +1

      I'd argue that anything is possible. I'd take a look at Georgia Tech (note that there are a ton of EXCELLENT MS programs out there). But I specifically know people who have gone through Georgia Tech's program and have heard good things about it. They have a DS program that is remote/part-time.

  • @ziddneyyy
    @ziddneyyy 3 years ago

    Thanks for the tutorial, but do you know how to convert a principal component analysis into a principal component regression? I'm stuck after getting the components to use for the regression, but I don't know how to do the conversion. Thanks.

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago +1

      Hi!
      You could run lm() on your components and your Y variable.
      So it would be something like this:
      df
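      A fuller sketch of the same idea ('X' is your predictor matrix and 'y' your response; keeping 3 components is arbitrary):
      pc  <- prcomp(X, center = TRUE, scale. = TRUE)
      df  <- data.frame(y = y, pc$x[, 1:3])  # first 3 component scores as predictors
      fit <- lm(y ~ ., data = df)
      summary(fit)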

    • @ziddneyyy
      @ziddneyyy 3 years ago

      @@SpencerPaoHere OMG THANKS A LOT, U R HELPING TOO MUCH RIGHT NOW

    • @AJman24
      @AJman24 3 years ago

      @@SpencerPaoHere would we include all the components or just the main ones? And how would we do prediction using this regression, since the test data will be in a different format?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      @@AJman24 The idea behind PCA is to maximize the variance explained with as few components as possible. So your variance threshold determines how many components you use.
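      For the prediction side, the key is to project the test rows with the rotation learned on the training rows (a sketch; X_train, X_test, and y_train are placeholders):
      pc     <- prcomp(X_train, center = TRUE, scale. = TRUE)
      train  <- data.frame(y = y_train, pc$x[, 1:3])
      fit    <- lm(y ~ ., data = train)
      scores <- predict(pc, newdata = X_test)  # re-uses the stored center/scale/rotation
      preds  <- predict(fit, newdata = as.data.frame(scores)[, 1:3])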

    • @AJman24
      @AJman24 3 years ago

      @@SpencerPaoHere but what if we want to compare the testing results with another model, say ridge, and we have already set aside a few observations for that, and we want to test our PCR on the same observations we used for ridge?

  • @MarinaUganda
    @MarinaUganda 3 years ago

    I have been trying to plot a biplot of the varimax-rotated components. Is this possible?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago +1

      It should be! You can store the component values in X and Y variables and call plot(X, Y) as needed.

    • @MarinaUganda
      @MarinaUganda 3 years ago

      @@SpencerPaoHere Unfortunately, it won't work when I use fviz.

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      @@MarinaUganda Hmm, weird. Once you have obtained your fit rotated with varimax, you would plot your loadings to get the visualization.
      Your code should look something like this:
      fit
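      A fuller sketch of that idea with base R's varimax() (on the built-in USArrests data; two components for illustration):
      X  <- scale(USArrests)
      pc <- prcomp(X)
      vm <- varimax(pc$rotation[, 1:2])  # rotate the first two loading vectors
      L  <- unclass(vm$loadings)         # rotated loadings as a plain matrix
      scores <- X %*% L                  # scores on the rotated components
      plot(scores, xlab = "RC1", ylab = "RC2")  # hand-rolled biplot
      arrows(0, 0, L[, 1] * 3, L[, 2] * 3, col = "red", length = 0.1)
      text(L * 3.3, labels = rownames(L), col = "red")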

  • @muhammadzubairchishti1795
    @muhammadzubairchishti1795 2 years ago

    Dear respected professor! Thank you so much for providing us free knowledge. I highly appreciate your precious efforts. Could you please give me your email address? I want to send you my issue regarding R code. Thank you.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      I am not a professor. :p
      Though I can answer any questions you might have in the comment section, I can be reached at
      business.inquiry.spao@gmail.com

    • @muhammadzubairchishti1795
      @muhammadzubairchishti1795 2 years ago

      @@SpencerPaoHere Thank you so much for your kind reply.