StatQuest: PCA in Python

  • Published 22 Jan 2025

COMMENTS • 413

  • @statquest
    @statquest  4 years ago +16

    Correction:
    3:23 The array should only have wt through wt5, ko1 through ko5.
    Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @GoktugAsc123
      @GoktugAsc123 4 years ago

      Thank you, I was mentioning 3:23. Your videos are great.
      I am a medical doctor from Turkey and currently, I am planning a career change to data science and I have been watching your videos to get prepared for a data scientist position. Could you create a few videos regarding data science interviews if it is relevant for your channel content? Best Regards, Göktuğ Aşcı, MD.

    • @statquest
      @statquest  4 years ago +1

      @@GoktugAsc123 I'll keep that in mind.

    • @statquest
      @statquest  3 years ago +3

      @@keerthik3791 Unfortunately the random forest implementations for Python are really bad and they don't have all of the features. If you're going to use a random forest, I would highly recommend that you do it in R instead.

    • @keerthik2168
      @keerthik2168 3 years ago

      @@statquest Thank you for the suggestion. I am good at Python and MATLAB. Can I do random forest in MATLAB? Or is learning R necessary here?

    • @statquest
      @statquest  3 years ago

      @@keerthik2168 I have no idea. I've never tried to do random forests in Matlab.

  • @pressiyamu8976
    @pressiyamu8976 4 years ago +117

    Dude you deserve a humanitarian award.

  • @mohammedghouse235
    @mohammedghouse235 3 years ago +18

    Not only the best PCA demonstration but also THE BEST introduction to Python. Hats off to you man!!

  • @LittleScience
    @LittleScience 3 years ago +7

    I have been dabbling in data science for a while now, and only now learned that pandas stands for "panel data" xd
    This channel never ceases to amaze

  • @advaitshirvaikar4751
    @advaitshirvaikar4751 4 years ago +50

    Whenever I search for some machine learning based explanation, I add 'by statquest' in it ^_^. Keep up the great work :')

    • @statquest
      @statquest  4 years ago +5

      Thank you very much!

    • @shaktishivalingam3880
      @shaktishivalingam3880 4 years ago +5

      @@statquest It's true, I do the same thing. Thank you for your hard work!

  • @mattheckel2609
    @mattheckel2609 4 years ago +17

    "Note: We use samples as columns in this example because... but there is no requirement to do so."
    "Alternatively, we could have used..."
    "One last note about scaling with sklearn vs scale() in R"
    This is some of the gold that sets StatQuest apart. Thank you! ❤

  • @tl-lay
    @tl-lay 4 years ago +4

    YOU ARE SAVING MY DEGREE I LOVE YOU SO MUCH I CANT EVEN BELIEVE THIS IS THE SAME MATERIAL IM LEARNING IN MY MACHINE LEARNING CLASS RIGHT NOW.

  • @hayskapoy
    @hayskapoy 5 years ago +36

    Finally! You explain in the language I understand much better than English haha Thanks !!!

  • @vedparulekar478
    @vedparulekar478 1 year ago +2

    One of the best videos ever made on this topic. This channel has helped me a lot in understanding machine learning in greater detail. Keep up the good work !!

  • @AniruddhModi-y2x
    @AniruddhModi-y2x 1 month ago +1

    The fact that you said "bam" when the plot showed what we wanted really shows that even if you are a pro Python programmer, you still feel happy when your code works. Relatableeee

  • @BrunetteViking
    @BrunetteViking 2 years ago +1

    This channel is the best YouTube channel that I have discovered. Thank you, sir!

  • @DATABOI
    @DATABOI 7 years ago +74

    Python. Now you're speaking my language :)

  • @x11y22z33me
    @x11y22z33me 2 years ago +1

    Simply loving StatQuest. Concise, clear and fun videos. One point I noted while watching this video is that the latest version of sklearn's PCA() will center the data for you, but not scale it. So if you just need centering for doing PCA, you don't need to worry about preprocessing.

    • @statquest
      @statquest  2 years ago

      Thanks for the update!
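
A quick way to check the behaviour described above yourself (a minimal sketch with made-up numbers; `mean_` is the attribute where sklearn's PCA stores the centering it applied):

```python
import numpy as np
from sklearn.decomposition import PCA

# Tiny made-up matrix: rows are samples; the two columns have very
# different means and spreads.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

pca = PCA().fit(X)

# PCA stores the per-feature means it subtracted, i.e. it centers for you...
print(pca.mean_)
# ...but it does not divide by the standard deviation, so the
# large-variance column dominates the first component.
print(pca.explained_variance_ratio_)
```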

  • @neptunesbounty1786
    @neptunesbounty1786 5 years ago +2

    I learn so much better in Python for some reason, I think it's because it's more interactive and you can play around with the data! Good one. Stattttquueeeeeest.

    • @statquest
      @statquest  5 years ago +2

      Thanks! There should be a lot more Python videos and learning material out soon.

    • @godsperson5571
      @godsperson5571 3 years ago +1

      @@statquest looking forward to it :).

  • @samirsaci6723
    @samirsaci6723 3 years ago +1

    I push the like button even before I play the video. Because Josh never fails to amaze me.

  • @reneeliu6676
    @reneeliu6676 6 years ago +5

    I am watching the 1st minute and I'm already super excited. Thanks!!

  • @spag5296
    @spag5296 4 years ago +3

    You've got the right formula for simple explanations. Teach me dawg

  • @raphael3835
    @raphael3835 6 years ago +3

    The only good step by step explanation I found on the web. Thank you so much!

    • @statquest
      @statquest  6 years ago +1

      Hooray!!! Thank you so much! :)

  • @antomartanto
    @antomartanto 3 years ago +1

    You are one of the best teachers I've ever found. Thank you very much!

  • @amribrahim7850
    @amribrahim7850 3 years ago +3

    Awesome. Please create more videos about how to implement the machine learning and data science concepts explained here in Python. That would be super helpful for us, particularly beginners.

  • @oswaldocastro9600
    @oswaldocastro9600 6 years ago +7

    Hi Josh... Simply incredible all StatQuest videos... Triple Bam!!!

  • @shanmugapriyak7269
    @shanmugapriyak7269 4 years ago +1

    I can always find a new and detailed explanation of the steps in your videos! Thank you!

  • @LincolnFrias
    @LincolnFrias 6 years ago +2

    It's awesome to have the explanation based on python code. Thanks a lot!

    • @statquest
      @statquest  6 years ago +1

      No problem. I'm doing a lot more python coding these days, so hopefully I'll make more of these "in python" videos.

  • @jack.1.
    @jack.1. 3 years ago +2

    Wish there were more StatQuest coding-in-python videos, they are the best! I much prefer them to the regular content, although that is still really high quality.

  • @jiangxu3895
    @jiangxu3895 3 years ago +3

    Thank you Josh. Such practice is important and valuable!! And you really also taught some Python tricks that I don’t know.

  • @saiakhil4751
    @saiakhil4751 3 years ago +1

    Wow Josh.. Thanks for that unpacking concept. I never knew that my whole life...
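
For anyone else new to the unpacking trick the video uses for `index=[*wt, *ko]`, a minimal sketch:

```python
wt = ['wt' + str(i) for i in range(1, 6)]   # ['wt1', ..., 'wt5']
ko = ['ko' + str(i) for i in range(1, 6)]   # ['ko1', ..., 'ko5']

# The * operator "unpacks" each list's items into the new list,
# which is how the video builds the row labels for the PCA DataFrame.
labels = [*wt, *ko]                         # same result as wt + ko
print(labels)
```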

  • @christopheryogodzinski6860
    @christopheryogodzinski6860 7 years ago +2

    Another great StatQuest in the books!

  • @RimaHandewi
    @RimaHandewi 9 months ago +1

    Wow, your explanation is so clear!!

  • @jiayoongchong2606
    @jiayoongchong2606 4 years ago +2

    6:31 using scikit PCA
    8:35 plotting scree plot
    10:37 loading scores for each principal component

    • @statquest
      @statquest  4 years ago +1

      Thanks for the time points! I'll add those to the description to divide the video into chapters.
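
The three time-stamped steps above can be sketched end-to-end as follows (a hedged sketch, not the video's downloadable code: the gene count, variable names, and RNG seed here are made up, but the shape of the workflow matches the chapters listed above):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
wt = ['wt' + str(i) for i in range(1, 6)]
ko = ['ko' + str(i) for i in range(1, 6)]
genes = ['gene' + str(i) for i in range(1, 101)]

# One lambda per group per gene, as in the video's data-generation step
rows = {}
for gene in genes:
    rows[gene] = np.concatenate([
        rng.poisson(lam=rng.integers(10, 1000), size=5),   # wt1..wt5
        rng.poisson(lam=rng.integers(10, 1000), size=5),   # ko1..ko5
    ])
data = pd.DataFrame.from_dict(rows, orient='index', columns=[*wt, *ko])

# 6:31 - samples must be rows for sklearn, hence the transpose
scaled_data = StandardScaler().fit_transform(data.T)
pca = PCA()
pca_data = pca.fit_transform(scaled_data)   # one row of coordinates per sample

# 8:35 - percentages for the scree plot
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

# 10:37 - loading scores for PC1, ranked by magnitude
loading_scores = pd.Series(pca.components_[0], index=genes)
print(per_var)
print(loading_scores.abs().sort_values(ascending=False)[:10])
```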

  • @rohitrajora9832
    @rohitrajora9832 3 years ago +2

    Really appreciate this and would love to see more concepts implemented in python.

  • @nonalcoho
    @nonalcoho 4 years ago +1

    I like the way you plot the ratio of each PC~~
    It is really easy to read!
    BAM~~~~~~~~~~

  • @fvviz409
    @fvviz409 4 years ago +5

    MAKE MORE PYTHON CONTENT PLEASE I LOVE IT

    • @statquest
      @statquest  4 years ago +2

      I'm working on it. :)

  • @KikiBah
    @KikiBah 3 years ago +1

    This was so clear, thanks! Finally I can do PCA in python, BAM 😊 You DA BEST!

  • @jeremylv3029
    @jeremylv3029 3 years ago +1

    Man, u r a gem. I will pay for the knowledge later after my graduation bro. lol

  • @rabiabibi8634
    @rabiabibi8634 5 years ago +5

    Hi Josh. The best PCA explanation. Thanks a lot :-) May GOD bless you 😊

    • @statquest
      @statquest  5 years ago +1

      Thank you! :)

    • @pressiyamu8976
      @pressiyamu8976 4 years ago

      Yes, May god bless you 100 times. May the troubles of today’s world not reach your doorstep. You’re a great person.

  • @merrimac1
    @merrimac1 5 years ago +8

    Thanks for the tutorial! One thing I don't understand is why PC1 can separate the wt and ko samples. Their gene expression values are generated in the same way.

    • @3stepsahead704
      @3stepsahead704 2 years ago

      Just stating I have the same question 2 years later.

  • @vipulsonawane7508
    @vipulsonawane7508 2 years ago +1

    What a playlist, I simply loved it 😘

  • @IntegralDeLinha
    @IntegralDeLinha 2 years ago +1

    Woww! That was absolutely awesome!!! Thank you so much!

  • @kannurajnathamuni9966
    @kannurajnathamuni9966 4 months ago +1

    You are the best!!!! It would be great if you could make a video on speculative decoding using medusa and quantization of neural networks in general

  • @timharris72
    @timharris72 6 years ago

    This was a really good explanation using Python.

  • @henkhbit5748
    @henkhbit5748 4 years ago +1

    As always a great presentation, and the python code just gives it extra bite...

  • @danielcozetto421
    @danielcozetto421 3 years ago +2

    Hello Josh, thank you for the amazing video! Quick question: at 9:18, how can I adapt "index=[*wt, *ko]" for an Excel input? Let's say we have the same variables (genes vs wt/ko) but in an Excel file. How can I add these labels to the final plot (9:47)? Thank you again!!

    • @statquest
      @statquest  3 years ago

      I'm not sure I understand your question. You can export your data from excel and import it into python (or R or whatever). Or are you asking about something else?
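
One hedged sketch of the Excel workflow asked about above: read the file with the identifier column as the index, and the sample labels then come from the header row instead of a hand-built [*wt, *ko] list. (The CSV text here is a made-up stand-in so the snippet is self-contained; pd.read_excel('yourfile.xlsx', index_col=0) works the same way.)

```python
import io
import pandas as pd

# Made-up stand-in for a spreadsheet with genes as rows, samples as columns.
csv_text = """gene,wt1,wt2,ko1,ko2
gene1,10,12,500,480
gene2,300,310,20,25
"""
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The sample labels come straight from the file's header row, so they can
# replace the hand-built index=[*wt, *ko] when labelling the PCA rows:
sample_names = list(data.columns)
print(sample_names)   # ['wt1', 'wt2', 'ko1', 'ko2']
```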

  • @danielvmartins4635
    @danielvmartins4635 1 year ago +1

    Excellent work!!! 👏👏

  • @olehsorokin7963
    @olehsorokin7963 3 years ago

    That's a cool one. The fact that observations are columns makes it confusing, though. I'm really used to the tidy data convention.

  • @miskaknapek
    @miskaknapek 3 months ago +1

    very much enjoy your explanation style.
    many thanks for the great videos!

    • @statquest
      @statquest  3 months ago +1

      Thanks!

    • @miskaknapek
      @miskaknapek 3 months ago +1

      @@statquest excellent going - really.
      difficult to know what's up and down in data science, and so i'm happy your videos cover subjects from mathematical concepts to code implementation.
      excellent spirit and explanations, again.
      (sorry about the superlative avalanche - in the vast ocean that's the net, it's difficult finding authoritative sources covering subjects well )
      bests from Germany/Denmark ;)

    • @statquest
      @statquest  3 months ago

      @@miskaknapek BAM! :)

  • @angeloperera2022
    @angeloperera2022 11 months ago +1

    Amazing video! I initially watched the video explaining PCA and I was mind-blown, thank you so much! I was hoping to ask if anyone in the comment section, or even StatQuest if possible, would know how to implement PCA on a multivariate time-series dataset and also "examine the loading scores" in such a dataset. Thanks in advance! :)
    P.S. I'm extremely clueless about anything coding or ML, but I've got to use PCA (and other dimensionality reduction methods) on my time-series dataset, so I would greatly appreciate any direction on how to proceed.

    • @statquest
      @statquest  11 months ago +1

      See: stats.stackexchange.com/questions/158281/can-pca-be-applied-for-time-series-data

  • @KnightPapa
    @KnightPapa 5 years ago +2

    Thank you! This video helped a lot with what I'm trying to do.

  • @KomangWahyuTrisna
    @KomangWahyuTrisna 4 years ago +1

    I really like your clear explanations. Please do some videos about deep learning and NLP.

  • @geraldopontes37
    @geraldopontes37 3 years ago +1

    Your videos are great! Thanks

  • @liranzaidman1610
    @liranzaidman1610 4 years ago +2

    Amazing! this is so important, thanks a lot.

  • @karannchew2534
    @karannchew2534 3 years ago

    Question please...
    09:50 The wt and ko samples are both created with the same random function, Poisson(10, 1000). Why are the wt samples (and the ko samples) more correlated with each other?

    • @statquest
      @statquest  3 years ago

      Because rd.randrange(10, 1000) returns a random number between 10 and 1000. Once we get that random value, we use it to generate 5 values for the wt samples using a poisson distribution. Then we select another random value between 10 and 1000 and use it to generate 5 values for the ko using a different (because the random value is different) poisson distribution.
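
The reply above can be sketched in a few lines (made-up seed; rd.randrange picks one lambda per group, and that shared lambda is what makes samples within a group resemble each other):

```python
import random as rd
import numpy as np

rd.seed(0)
np.random.seed(0)

# One "gene": a single lambda for all 5 wt samples, and a different
# single lambda for all 5 ko samples.
wt_lambda = rd.randrange(10, 1000)
ko_lambda = rd.randrange(10, 1000)
wt_values = np.random.poisson(lam=wt_lambda, size=5)
ko_values = np.random.poisson(lam=ko_lambda, size=5)
print('wt:', wt_lambda, wt_values)   # five counts scattered around wt_lambda
print('ko:', ko_lambda, ko_values)   # five counts scattered around ko_lambda
```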

  • @epsilonprincipia9012
    @epsilonprincipia9012 5 years ago +1

    Good explanation. Thank you so much.

  • @MainakDev
    @MainakDev 1 year ago

    at 5:10 why do we scale our data?

    • @statquest
      @statquest  1 year ago +1

      I explain why we scale the data in this video: ua-cam.com/video/oRvgq966yZg/v-deo.html

  • @3stepsahead704
    @3stepsahead704 2 years ago

    Very concise, I will surely be coming back to this video. However, I would like to know why PCA is able to group these two categories (wt and ko) when it's shown they are generated from the same random method. If all indexes were generated at the same time, I would get it, but as they are generated index by index, I can't seem to grasp it.

    • @statquest
      @statquest  2 years ago

      The trick is at 3:48. For each group, wt and ko, we select a different parameter for the poisson distribution and generate 5 measurements from each of those two different distributions. One set is for wt and the other set is for ko.

    • @3stepsahead704
      @3stepsahead704 2 years ago

      @@statquest I think my confusion comes from the fact that this makes the two groups different from one another (all wt's different from ko's), but I wouldn't predict them to be similar within a group (wt1 close in the vertical to wt2, and to wt3, ...). Thus I'd expect PCA to tell them apart, but not as exactly two groups (wt's vs ko's); I would predict two clouds instead of two "vertical lines of points" in 2-D.

    • @statquest
      @statquest  2 years ago

      @@3stepsahead704 Remember how PCA actually works: it finds the axis that has the most variation (which is between WT and KO) and focuses on that, and then finds the secondary differences (among the WT and among the KO). However, because the differences between WT and KO are big, the scale on the x-axis will be much larger than the scale on the y-axis. Thus, the samples will appear to be in a vertical line rather than spaced apart like you might guess they should be. In short, check the scales of the axes; they will explain the difference between what you think you see and what you expect.

    • @3stepsahead704
      @3stepsahead704 2 years ago +1

      @@statquest Thank you very much for taking the time to explain this. I now get it!

  • @gbchrs
    @gbchrs 3 years ago

    Is centering included in sklearn's PCA model, and is that why there is no extra step to center?

  •  3 years ago

    I don't know why in the PCA graph you plot the "features"; in some other videos, they plot all the data points and visualize the data in the new subspace... And I don't know what the x-axis in that same plot means: what does -10 mean on PC1 - 89.9%? Thanks

    • @statquest
      @statquest  3 years ago

      I don't plot the features, I plot the subjects. For details, see: ua-cam.com/video/FgakZw6K1QQ/v-deo.html

  • @azrahasan3796
    @azrahasan3796 4 years ago

    Hi, you are a lifesaver. I am trying to do a PCA analysis on my own data, but every demo video either uses standard databases or, as here, creates its own data. I am missing some crucial steps, especially in defining the index when doing it with my own data. Would it be too much to ask for a few more videos on machine learning where you use an Excel sheet of data from your laptop?

    • @azrahasan3796
      @azrahasan3796 4 years ago

      I am a newbie in data science and programming. I am a Molecular Biologist who would love to learn machine learning.

    • @statquest
      @statquest  4 years ago

      I'll keep that in mind for a future video.

  • @metvava
    @metvava 10 months ago

    Great video! Thanks for these!!! Have you done a redundancy analysis and dbRDA plot video? Thank you for contributing to our education.

    • @statquest
      @statquest  10 months ago

      I haven't done that yet.

    • @metvava
      @metvava 10 months ago +1

      @@statquest let us know if you ever do! It would be a double bam from me. It just clicks the way you explain! Thank you again for your content!!!

  • @guohanzhao7813
    @guohanzhao7813 6 years ago

    COOOOOL, so easy to understand!

  • @manojhkumar2239
    @manojhkumar2239 10 days ago

    Wow, thanks. One question: while verifying the loading scores, I saw that 'False' argument. Typically, for PCA, the data needs to be scaled, right? But False means it is not scaled, so I am confused. Please clarify.

    • @statquest
      @statquest  9 days ago

      The data are already all on the same scale.

  • @wuyanyun
    @wuyanyun 5 years ago +3

    Thank you! I’ve been struggling with this problem for so long !

    • @statquest
      @statquest  5 years ago +1

      Hooray! I'm glad the video was helpful. :)

  • @wavyjones96
    @wavyjones96 2 years ago

    If you end up using the PCA data... wouldn't that cause data leakage in our predictive model, since scaling should be done after the train/test split?

    • @statquest
      @statquest  2 years ago

      If you're using it for machine learning, presumably you can come up with a standard scaling and centering procedure.

  • @karannchew2534
    @karannchew2534 3 years ago

    Question please. This line transforms the original data into a 10x10 array: pca_data = pca.transform(scaled_data)
    The video says it generates the coordinates for the PCA graph based on the loading scores and the scaled data.
    Apart from being coordinates in the graph, what do the values actually represent? How should I interpret them: as the amount of variance of the sample values attributed to each PC? As the distance of each sample along the PC line from the origin? What is the unit?

    • @statquest
      @statquest  3 years ago

      The coordinates do not have units. And, as far as I know, they are just coordinates.

    • @statquest
      @statquest  3 years ago +1

      Oops!! I made a mistake and deleted your follow-up comment. Sorry about that. However, my response is "Yes, the PCA graph is a graph that uses PCs as the axes."

    • @karannchew2534
      @karannchew2534 3 years ago +1

      @@statquest No problem. Thanks for confirming.
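
One way to see what those transformed values are: each one is the centered sample projected (a dot product) onto a principal-component direction, i.e. a unitless coordinate along that PC axis. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))     # 10 samples, 4 made-up features

pca = PCA()
coords = pca.fit_transform(X)    # the "pca_data" coordinates

# Each coordinate is the centered sample dotted with a PC direction:
manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(coords, manual))   # True
```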

  • @vizz2328
    @vizz2328 3 years ago

    What happens if we give n_components=d, d being the number of dimensions? Does PCA denoise the data, given that there won't be any reduction in dimensions?

    • @statquest
      @statquest  3 years ago

      I don't think it will. However, there is still value because you can draw the scree plot and see how many PCs are really useful (it might only be a few, or it could be all of them).
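
The reply's point, that keeping all components still lets the scree plot tell you how many PCs matter, can be sketched like this (made-up data with three real dimensions hidden in six features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(50, 3))          # 3 "real" dimensions
noise = 0.01 * rng.normal(size=(50, 3))
X = np.hstack([base, base + noise])      # 6 observed features

# n_components equal to the full dimension: nothing is dropped...
pca = PCA(n_components=6).fit(X)

# ...but the variance ratios (what the scree plot shows) still reveal
# that only a few PCs carry real signal.
ratios = pca.explained_variance_ratio_
needed = int(np.searchsorted(np.cumsum(ratios), 0.99)) + 1
print(np.round(ratios, 3))
print('PCs needed for 99% of the variance:', needed)
```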

  • @sarathkareti8993
    @sarathkareti8993 3 years ago

    This video is really awesome! I am just confused on one thing, what are your predictors and what is your target?

    • @statquest
      @statquest  3 years ago

      PCA does not have predictors and targets. All variables are just...variables. For more details about PCA, see: ua-cam.com/video/FgakZw6K1QQ/v-deo.html

  • @tymothylim6550
    @tymothylim6550 3 years ago +1

    Thanks for the great video! :)

  • @avadhootukirde1575
    @avadhootukirde1575 3 years ago +1

    What about the source of the dataset?

    • @statquest
      @statquest  3 years ago +1

      The dataset is created within the code.

    • @avadhootukirde1575
      @avadhootukirde1575 3 years ago +2

      @@statquest thanks for your quick reply

  • @zishanahmedshaikh
    @zishanahmedshaikh 6 years ago

    Hi Joshua, Great Videos!

  •  3 years ago

    In loading_scores, this error appears: ValueError: Length of passed values is 8, index implies 9686. I'm using my own dataset.

    • @statquest
      @statquest  3 years ago +1

      Maybe look at the first few items to make sure it is what you expect it to be.

    •  3 years ago +1

      @@statquest yes I got it, thanks

  • @reshmababuraj1900
    @reshmababuraj1900 4 years ago

    What do negative values in the loading scores indicate?

    • @statquest
      @statquest  4 years ago

      Loading scores are explained here: ua-cam.com/video/FgakZw6K1QQ/v-deo.html

    • @reshmababuraj1900
      @reshmababuraj1900 4 years ago +1

      @@statquest Thank you.

  • @donkkey245
    @donkkey245 4 years ago +1

    Dear instructor, will you release a Python version of your ML course? Super fan here!

    • @statquest
      @statquest  4 years ago +1

      One day I will.

    • @donkkey245
      @donkkey245 4 years ago +1

      @@statquest hope that day comes quick. stay well.

  • @jessicanathania5718
    @jessicanathania5718 4 years ago

    Dude, I'm trying to do isotonic regression with a toy dataset, but the error says x is not a 1D array. Can I use PCA to turn it into 1D?

  • @dnuyc
    @dnuyc 3 years ago

    Great tutorial! Sorry if my question is amateur, but how did they tell WT and KO apart in the final PCA? I thought the data set was randomly generated?

    • @statquest
      @statquest  3 years ago

      Early on we gave the rows and columns names and kept track of them.

  • @jiayiwu4101
    @jiayiwu4101 4 years ago

    What is the point of looking at the loading scores in the final step? My understanding is as follows: each gene is a sample. If their loading scores on PC1 are similar, it means a lot of samples project to a similar position on PC1, so they are apparently clustering. Am I right?

    • @statquest
      @statquest  4 years ago

      In this case, loading scores tell us which genes have the most influence on the PCs. This can tell us which genes have the most variation and are the most useful for determining why the cells cluster the way they do. For more details, see: ua-cam.com/video/FgakZw6K1QQ/v-deo.html

    • @jiayiwu4101
      @jiayiwu4101 4 years ago +1

      @@statquest Thank you! Just saw that you replied to my comment very fast! Wish I had known how to look at those notifications earlier!
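
The reply above can be sketched as follows (made-up data in which one hypothetical gene is given deliberately large variation, so it dominates PC1's loading scores):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
genes = ['gene1', 'gene2', 'gene3', 'gene4', 'gene5']
X = rng.normal(size=(8, 5))     # 8 samples x 5 hypothetical genes
X[:, 2] *= 10                   # give gene3 much more variation on purpose

pca = PCA().fit(X)

# Loading scores for PC1: how much each gene contributes to that axis.
loading_scores = pd.Series(pca.components_[0], index=genes)
# Rank by magnitude -- the sign only reflects direction along the PC.
ranked = loading_scores.abs().sort_values(ascending=False)
print(ranked)
print('Most influential gene:', ranked.index[0])
```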

  • @Cat_Sterling
    @Cat_Sterling 2 years ago

    Thank you!!! When we are speaking about variation in PCA, is that the same as variance?

    • @statquest
      @statquest  2 years ago

      Yep.

    • @Cat_Sterling
      @Cat_Sterling 2 years ago

      @@statquest Thank you very much for the clarification! I googled it, and it seems that they're two different things, but sometimes they can be used interchangeably or be the same thing.

    • @statquest
      @statquest  2 years ago

      @@Cat_Sterling Yes, I guess it depends on how you want to use them and whether you divide by 'n' or 'n-1', but, at least on a conceptual level, they are the same.

    • @Cat_Sterling
      @Cat_Sterling 2 years ago +1

      @@statquest Thank you so much again! Really appreciate your reply! Your channel helped me so much!!!

  • @shreyjain6447
    @shreyjain6447 3 years ago

    What if I get 4 variables with maximum variation in the scree plot? How would I then plot a PCA plot?

    • @statquest
      @statquest  3 years ago

      You can draw multiple PCA graphs (PC1 vs PC2, PC1 vs PC3, etc.)

  • @naviddavanikabir
    @naviddavanikabir 4 years ago

    Fantastic, as always.
    I wonder how the Poisson distribution causes the wt samples and the ko samples to each be correlated within their group?

    • @statquest
      @statquest  4 years ago

      Because we generated the data, I selected different lambda values for the wt samples than for the ko samples.

  • @saulmartinez7351
    @saulmartinez7351 2 years ago

    4:46 Why does gene4, ko1 have a value over 1000 if the command says "get a random value between 10 and 1000"?
    Thanks for the value !!

    • @statquest
      @statquest  2 years ago +1

      We select a random number between 10 and 1000 to be the mean of a poisson distribution. That's just the average value, and there can be larger and smaller values.

    • @saulmartinez7351
      @saulmartinez7351 2 years ago +1

      @@statquest Oh! I see!! Thank you so much, I'm still learning about this.
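
The reply's point, that the randomly chosen value is only the mean of the Poisson distribution, is easy to check (a minimal sketch; lam=1000 is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1000                          # an illustrative randomly chosen mean
draws = rng.poisson(lam=lam, size=10000)

# lam is only the *average*; individual counts scatter around it
# (roughly lam +/- a few times sqrt(lam)), so values above 1000 are normal.
print(draws.mean())
print(draws.max() > lam)   # True: some draws exceed the mean
```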

  • @manannawal8466
    @manannawal8466 4 years ago

    How do we interpret positive and negative load factors of features in terms of separating the samples?

    • @statquest
      @statquest  4 years ago

      I show some examples of this in my main PCA video: ua-cam.com/video/FgakZw6K1QQ/v-deo.html

    • @manannawal8466
      @manannawal8466 4 years ago

      @@statquest I saw that video. I want to use PCA to determine the weights of the variables in my index. I have 2 doubts:
      1) Shall I take the square of the load factor and multiply it by the variation of the principal component to get the weight for that variable?
      2) What is the role of the negative and positive signs if I am using PCA to derive weights? Shall I give negative load factors negative weights in my index?

    • @statquest
      @statquest  4 years ago +1

      @@manannawal8466 1) The loading scores are weights to begin with and determine how much of each variable is combined to make the principal component. In other words, you shouldn't need to transform the loading scores.
      2) The positive and negative signs are just relative to the other variables and how they contribute to the slope of the principal component. For example, if you had two genes, 1 and 2, then loading scores 5 and -2 (for genes 1 and 2 respectively) would give us the exact same slope as -5 and 2.
      Since the positive and negative signs are somewhat arbitrary, people often ignore them and instead concentrate on the magnitude of the loading scores as a way to determine variable importance.

    • @manannawal8466
      @manannawal8466 4 years ago +1

      @@statquest thank you :)

  • @allinone0126
    @allinone0126 2 years ago

    Hi,
    Could you help me with this question please?
    Xnp = np.asarray(X.todense())
    # Run a principal component analysis on Xnp
    # How much of the variance can be explained
    # by the first 10, 50, 100, 200, and 500 principal components?

    • @statquest
      @statquest  2 years ago

      If I had more time I could help, but today is super busy. Maybe someone else wants to help.
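
Since the question above went unanswered, here is one hedged sketch of how the "variance explained by the first k components" question is usually answered with sklearn (random made-up data in place of the question's Xnp):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 40))   # made-up stand-in for the dense Xnp

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Variance explained by the first k components, for a few k
# (the question's 100/200/500 would read off the same array on a wider matrix).
for k in (5, 10, 20, 40):
    print(f'first {k:2d} PCs explain {cumulative[k - 1]:.1%} of the variance')
```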

  • @timokimo61
    @timokimo61 2 years ago

    Could someone please tell me which are the variables/dimensions we want to reduce and which are the observations/samples? I'm a little confused, especially since I found 10 principal components in the scree plot.
    The genes are the variables/features, and the wt and ko are the samples, right?

    • @statquest
      @statquest  2 years ago

      The genes are features and the types of mice are the samples. Did you run the code that I wrote, or did you write your own? You can download my code for free.

    • @timokimo61
      @timokimo61 2 years ago

      @@statquest Thank you very much. Due to the reversed order of variables and samples, I got a little confused. Plus, the scree plot you showed in the video had 10 principal components, which equals the number of samples; perhaps the scree plot was just for visualization and there are more than 10 principal components? (I didn't run the code.)
      I have another small question please: what if there are far fewer samples than variables? Say it's an image and the variables are the pixels, 4000 pixels in total per image, and there are just 200 samples. In that case I would not get more than 200 principal components, right? Or in other words, only 200 PCs or fewer would be useful to me, right?

    • @statquest
      @statquest  2 years ago

      @@timokimo61 I answer that question in this video: ua-cam.com/video/oRvgq966yZg/v-deo.html

    • @timokimo61
      @timokimo61 2 years ago

      @@statquest I have seen the video before, yeah.
      But a simple answer here would help me a lot 😄

  • @godsperson5571
    @godsperson5571 3 years ago

    Thanks for the "easy" to follow tutorial. I am trying to do a PCA for my RNAseq data but when I run scaled_tmp = StandardScaler().fit_transform(tmp.T) I get an error message: 'could not convert string to float: 'lcl|NC_000913.3_cds_NP_414542.1_1'. The lcl... is my target gene ID and I cannot edit it since i will need it later on to identify speficific genes. Please how do I solve this error message?

    • @statquest
      @statquest  3 years ago

      It looks like one of the columns in your matrix is some sort of identifier instead of sequencing data. In the video, when we create the data, we move identifiers to be row names or column names (see: 3:17). Other than the row and column names, the matrix that we do math on can only contain numbers because... how do we do math with identifiers?

    • @godsperson5571
      @godsperson5571 3 years ago +1

      @@statquest Thanks, I was able to rectify the issue. I had not set my gene_id column as the index when loading the data. After setting the index column, it now works well.

    • @statquest
      @statquest  3 years ago

      @@godsperson5571 Hooray!
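
The fix described above, moving the string gene IDs into the index so that only numbers get scaled, can be sketched like this (the file contents are a made-up stand-in; index_col=0 in read_csv/read_excel does the same job as set_index after loading):

```python
import io
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for an RNA-seq counts file whose first column is a
# string gene ID (the kind of value that breaks StandardScaler).
csv_text = """gene_id,s1,s2,s3
lcl|gene_a,10,12,11
lcl|gene_b,500,480,510
"""
# index_col=0 moves the IDs out of the numeric matrix and into the index...
tmp = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ...so the transposed matrix that gets scaled contains only numbers.
scaled_tmp = StandardScaler().fit_transform(tmp.T)
print(scaled_tmp.shape)   # (3, 2): 3 samples, 2 genes
```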

  • @dafran500
    @dafran500 3 years ago

    How can I select the final components to apply them to new data?

    • @statquest
      @statquest  3 years ago

      You can use the loading scores. For details, see: ua-cam.com/video/_UVHneBUBW0/v-deo.html

    • @dafran500
      @dafran500 3 years ago

      @@statquest Thanks! Where can I find the loading scores in python?

    • @statquest
      @statquest  3 years ago

      @@dafran500 For details on how to do PCA in python, see: ua-cam.com/video/Lsue2gEM9D0/v-deo.html

  • @sualihahjan8673
    @sualihahjan8673 2 years ago

    What if I already have a dataset that I will just upload? What do I pass as the index in this line?
    pca_df = pd.DataFrame(pca_data, index=[], columns=labels)

    • @statquest
      @statquest  2 years ago

      Whatever you want the row names to be

  • @neilanthony7596
    @neilanthony7596 2 years ago

    Suppose a number of items exist of type 1 and 40 variables associated with each. Further items of type 2 exist, also having the same 40 associated variables. Is there a way to find which variables, or combination of variables, best discriminates whether an arbitrary item belongs to type 1 or type 2? Is this supervised PCA? Thank you for any help.

    • @statquest
      @statquest  2 years ago

      Consider using LDA instead of PCA for your problem. For details, see: ua-cam.com/video/azXCzI57Yfc/v-deo.html

    • @neilanthony7596
      @neilanthony7596 2 years ago +1

      @@statquest That's the perfect solution to this problem, thanks very much! N

  • @RachelDance
    @RachelDance 2 years ago

    Your channel has helped me immeasurably :) I just had one question here: how exactly do you go from the starting data sample array to the scaled data, by hand? I tried but didn't get the correct answer. I did watch the PCA Explained video as well, but just didn't get the same result here, and wonder if you could clarify exactly how it gets from one to the other... it should be: scaled_data = (data['wt1'][i] - np.mean(data['wt1'])) / np.std(data['wt1']) ... for each datapoint i and each column, right? This isn't real code, I'm just making the point that it's z = (x - u) / s :)

    • @statquest
      @statquest  2 years ago

      It depends on how the data are oriented. Sometimes it's in columns, sometimes rows. So check to make sure your data is in columns.

    • @RachelDance
      @RachelDance 2 years ago

      @@statquest For the test code you supplied (so, columns), am I using the correct method?

    • @statquest
      @statquest  2 years ago

      @@RachelDance It sounds like it.
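
A small sketch (made-up numbers) checking the by-hand z-score against scikit-learn; the key detail is that `StandardScaler` uses the population standard deviation, i.e. `np.std` with its default `ddof=0`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One hypothetical column of measurements
data = pd.DataFrame({'wt1': [10.0, 12.0, 9.0, 15.0, 11.0]})

# By hand: z = (x - mean) / std, using np.std's default ddof=0
manual = (data['wt1'] - np.mean(data['wt1'])) / np.std(data['wt1'])

# With scikit-learn
scaled = StandardScaler().fit_transform(data)
```

If the two disagree, the usual culprits are `ddof=1` sneaking in (e.g. pandas' `Series.std` defaults to `ddof=1`) or the data being oriented the wrong way.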

  • @weiqingwang1202
    @weiqingwang1202 4 years ago

    Are loading scores the eigenvalues? I wish to see a more linear-algebra-based explanation of PCA!

    • @statquest
      @statquest  4 years ago

      For more details on how PCA works, see: ua-cam.com/video/FgakZw6K1QQ/v-deo.html
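
For what it's worth, loading scores and eigenvalues live in different scikit-learn attributes: the loading scores are the entries of the eigenvectors in `pca.components_`, while the eigenvalues (the variance along each PC) are in `pca.explained_variance_`. A sketch with random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))

pca = PCA()
pca.fit(X)

# Loading scores: entries of the eigenvectors, one unit vector per PC
loadings_pc1 = pca.components_[0]

# Eigenvalues: the variance of the data along each PC, largest first
eigenvalues = pca.explained_variance_
```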

  • @harryliu1005
    @harryliu1005 3 years ago

    Hi Josh! This is an excellent video that helped me a lot!!!
    I have a question: what if PC3 and PC4 are also important? Do I need to draw two 2-D graphs, or what do I need to do?

    • @statquest
      @statquest  3 years ago

      If you want to draw the PCs and the data, then you'll have to draw multiple graphs. Or you can use the projections from the first 4 PCs and input to a dimension reduction algorithm like t-SNE: ua-cam.com/video/NEaUSP4YerM/v-deo.html
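
A sketch of the multiple-graphs option (random data, hypothetical figure layout): one 2-D scatter plot per pair of interesting PCs:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # draw off-screen; not needed in an interactive session
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 6))

pca = PCA()
coords = pca.fit_transform(X)  # one column of coordinates per PC

# One 2-D scatter plot per pair of interesting PCs
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(coords[:, 0], coords[:, 1])
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
ax2.scatter(coords[:, 2], coords[:, 3])
ax2.set_xlabel('PC3')
ax2.set_ylabel('PC4')
```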

  • @sayanbhattacharya3233
    @sayanbhattacharya3233 4 years ago

    Please post some intuitions on sparse deconvolution and compressive sensing. Would love to understand your approach ❤️

  • @vasanthakumar1991
    @vasanthakumar1991 6 years ago +1

    BAM!!! I understood what you said, and I show my gratitude. But I have a query:
    I am confused about which part of my dataset to treat as rows and which as columns.
    My dataset is from Phasor Measurement Units (PMUs) used in the electrical grid, i.e. the distribution lines we see around us.
    A single PMU measures 21 electrical parameters per timestamp.
    We use around four PMUs, each measuring the 21 parameters at different locations at the same time, continuously over a period of time.
    How can I arrange the above data for performing PCA, sir?

    • @vasanthakumar1991
      @vasanthakumar1991 6 years ago

      Sir, those two cases you mentioned where PCA would work are what I am also interested in calculating, apart from the combination of all of the PMUs' timestamps.
      Can you mention how to arrange the data (rows and columns) for both of the mentioned viable cases?
      Thank you so much!! You are really awesome, sir.
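
One plausible layout for data like this (all names and sizes here are hypothetical): one timestamp per row and one (PMU, parameter) pair per column, so PCA can look for structure across timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical: 4 PMUs, each measuring 21 parameters at 100 timestamps
n_pmus, n_params, n_times = 4, 21, 100
rng = np.random.default_rng(5)
raw = rng.normal(size=(n_pmus, n_params, n_times))

# One row per timestamp, one column per (PMU, parameter) pair
columns = ['pmu%d_param%d' % (p + 1, q + 1)
           for p in range(n_pmus) for q in range(n_params)]
wide = pd.DataFrame(raw.reshape(n_pmus * n_params, n_times).T, columns=columns)
```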

  • @nidhimadugonda5770
    @nidhimadugonda5770 3 years ago

    How do we import all those libraries?
    Do we have to download anything extra?

    • @statquest
      @statquest  3 years ago +1

      It probably depends on what Python you have. I believe I used Anaconda which comes with all of these libraries.

  • @koalaggcc
    @koalaggcc 11 months ago

    wt and ko are drawn randomly from the same distribution, so why do they look so different from the perspective of PCA?

    • @statquest
      @statquest  11 months ago

      They use the same distribution, but different parameters for that distribution. Specifically, they use different values for lambda.

    • @koalaggcc
      @koalaggcc 11 months ago +1

      @@statquest Oh, I see! Thank you so much, Josh. I have watched your videos from time to time in the past, and a lot more recently, and I'm always amazed at how extremely talented you are at teaching and explaining things!! Do you have somewhere I can show some appreciation (aka pay tuition) if I don't plan to buy anything?

    • @statquest
      @statquest  11 months ago

      @@koalaggcc There are lots of ways to support StatQuest. Here's a link that describes them all: statquest.org/support-statquest/
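
A sketch of that difference in generating code (the numbers are made up; the idea is the one described above): each gene gets one lambda for the wt samples and an independently chosen lambda for the ko samples, so the two groups separate even though both are Poisson:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
genes = ['gene' + str(i) for i in range(1, 101)]
wt = ['wt' + str(i) for i in range(1, 6)]
ko = ['ko' + str(i) for i in range(1, 6)]
data = pd.DataFrame(index=genes, columns=wt + ko)

for gene in genes:
    # Same distribution (Poisson), but different lambdas per group
    data.loc[gene, wt] = rng.poisson(lam=rng.integers(10, 1000), size=5)
    data.loc[gene, ko] = rng.poisson(lam=rng.integers(10, 1000), size=5)
```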

  • @kajalmishra6895
    @kajalmishra6895 3 years ago

    Is scaling to be done for both the test and train datasets?
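
This question goes unanswered in the thread; the usual practice (sketched here with made-up data) is to fit the scaler on the training data only, then apply those same parameters to both sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only, then apply the SAME means and standard
# deviations to the test data -- never re-fit on the test set
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```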

  • @ayeshaali6462
    @ayeshaali6462 2 years ago

    loading_scores=pd.Series(pca.components_[0],index=genes)
    What should I write in place of genes?

    • @statquest
      @statquest  2 years ago

      If you changed the index, as described at 3:17, you should probably use the same thing you changed it to.

  • @samiotmani9092
    @samiotmani9092 11 months ago +1

    Incredible French accent on “Poisson distribution”; I watched it three times 😆

  • @kamogelomaila3904
    @kamogelomaila3904 7 years ago +1

    Hi Joshua, thanks for that; really helpful. I'm quite new to Python myself, and I'm trying to compile a PCA across a range of macro-economic factors (inflation, GDP, FX, policy rate, etc.). Now, in all that you've done above, where is the display of the PCA, i.e. the newly uncorrelated dataset? Is it the loading scores you printed, or the wt and ko variables you plotted? Thanks
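
To make the answer concrete: the newly uncorrelated dataset is the array returned by `pca.fit_transform()` (or `pca.transform()`), not the loading scores. A sketch with made-up, deliberately correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
# Four strongly correlated hypothetical series (e.g. macro-economic factors)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 1)) for _ in range(4)])

pca = PCA()
new_data = pca.fit_transform(X)  # this is the transformed, uncorrelated data

# The columns of new_data (the PC coordinates) are uncorrelated
corr = np.corrcoef(new_data, rowvar=False)
```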

  • @ccli
    @ccli 2 years ago

    Generally, in ML, we use columns as features (variables) and rows as examples, but in the video it is the reverse. It is not a big deal, though.

    • @statquest
      @statquest  2 years ago

      It depends on the field you are in. I used to work in Genetics and this is the format they used. So it's always worth checking to make sure you have the data correctly oriented.
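
A tiny sketch of the two orientations (the gene and sample names are made up); `.T` converts the genetics-style layout into the one most ML tools expect:

```python
import numpy as np
import pandas as pd

# Genetics-style layout: one gene per ROW, one sample per COLUMN
genes_by_samples = pd.DataFrame(np.arange(12).reshape(3, 4),
                                index=['gene1', 'gene2', 'gene3'],
                                columns=['wt1', 'wt2', 'ko1', 'ko2'])

# Most ML tools expect one example per row, so transpose before fitting
samples_by_genes = genes_by_samples.T
```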

  • @innocenceesstt1
    @innocenceesstt1 4 years ago

    Thank you very much for this tutorial. Please can you explain how to get the correlation matrix?

    • @statquest
      @statquest  4 years ago +1

      With numpy, you use corrcoef().

    • @innocenceesstt1
      @innocenceesstt1 4 years ago +1

      @@statquest Thank you very much
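
For reference, a minimal sketch of `np.corrcoef()` with made-up data; note that by default it treats each row as a variable, so pass the variables separately (as here) or use `rowvar=False` for column-oriented data:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)

# Each argument is one variable; the result is a 2x2 correlation matrix
corr_matrix = np.corrcoef(x, y)
```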

  • @mramadan2009
    @mramadan2009 4 years ago

    Hi Josh, thank you for your efforts.
    StatQuest is really a magnificent channel.
    Could you please make a video on Singular Value Decomposition (SVD)?
    Thanks

  • @ayoubmarah4063
    @ayoubmarah4063 5 years ago

    Is it necessary to scale the data? Because sometimes a variable might have a std near 0, which generates NaNs.

    • @statquest
      @statquest  5 years ago

      You don't have to scale the data, but it is highly recommended. For more details why scaling is important, see this StatQuest: ua-cam.com/video/oRvgq966yZg/v-deo.html
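
A sketch of the NaN failure mode with made-up numbers: dividing by a standard deviation of 0 by hand produces NaNs, while scikit-learn's `StandardScaler` replaces a zero standard deviation with 1, so the constant column simply becomes zeros:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])  # second column is constant, so its std is 0

# By hand: 0/0 for the constant column gives NaNs
with np.errstate(invalid='ignore'):
    manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler treats a zero std as 1, so no NaNs appear
scaled = StandardScaler().fit_transform(X)
```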