Thank you Gabor! I don't work in data science or genetics, but canine genetics is a longstanding interest of mine. I never thought that I would be able to dive into canine genomics and do my own analyses, but it looks like it's on the horizon. I had no idea that the main programs used were accessible and learnable by me, sitting at home with my desktop and the internet. Biggest thanks. I am about to start your Data Wrangling With Plink playlist.
Great to hear! Also, we have a shared interest in canine genetics then!
Dear professor, thank you for the informative videos. Could you please tell me what the input data for this code should look like?
Hi Abdulrahman! The code in the video starts with a binary ped file. Actually, any type of PLINK file is suitable, but then you have to adapt the code a bit. Adaptations are likely necessary anyway, as you might not have the same species, or you want to tweak the quality control parameters or similar...
Video on the data formats is here: ua-cam.com/video/vZyf5aXlB-k/v-deo.html
Dear Professor, thank you for the enlightening video and book.
For the Colli et al. (2018) data, my analysis was successful, but once I changed the data (human genomics), I ran into a problem with the data size. The message I got from RStudio was that I need a larger computer. What should I do in this case?
Thank you,
Hi,
It is possible to do the entire PCA using PLINK, which might be less demanding on the computer, and then do just the visualization in R. An example and workflow are in the video here: ua-cam.com/video/vos6VeuNcaM/v-deo.html
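A minimal sketch of that workflow, written in Python only for illustration (the video itself uses PLINK and R). It assumes PLINK 1.9 is installed and callable as "plink", and that the data are a binary fileset; the prefixes "human_data" and "human_data_pca" are hypothetical. For human data no species flag is needed; other species would need the appropriate PLINK species option added.

```python
import subprocess


def build_plink_pca_cmd(bfile_prefix, out_prefix, plink_exe="plink"):
    """Return the argument list for a PLINK PCA run on a binary fileset."""
    return [plink_exe, "--bfile", bfile_prefix, "--pca", "--out", out_prefix]


def run_plink_pca(bfile_prefix, out_prefix):
    """Run PLINK; on success it writes <out_prefix>.eigenvec and <out_prefix>.eigenval."""
    subprocess.run(build_plink_pca_cmd(bfile_prefix, out_prefix), check=True)


# Hypothetical file prefixes, shown only to make the command visible.
cmd = build_plink_pca_cmd("human_data", "human_data_pca")
print(" ".join(cmd))
```

Only the small .eigenvec and .eigenval outputs then need to be loaded for plotting, instead of the full genotype data.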
Thanks for the video! I am learning so much with the Genomic Bootcamp book and this playlist!
One doubt though, I have read the PLINK manual, and:
1) --distance-matrix is deprecated, it exists in PLINK 1.9 just for backward compatibility: "New scripts should migrate to '--distance 1-ibs flat-missing' and '--distance ibs flat-missing'".
2) --distance-matrix computes the Identity-By-State (IBS) distance between individuals, and therefore the result is not a correlation matrix, which is the basis of PCA. In this case, we are not running a PCA, but a Multidimensional Scaling (MDS), also known as Principal Coordinate Analysis (PCoA).
Please correct me if I am wrong!
Cheers
Yes, this video is based on the MDS computations via R. In my experience the output is very similar to PCA. If you are interested in a direct and quicker approach, you can use the --pca option in PLINK.
thank you, excellent video...would love to watch more PCA examples
Thanks for your comment!
Are you interested in the interpretation of PCA plots or some other aspect?
@@GenomicsBootCamp thank you for replying. It would be helpful if there were a few more PCA examples with human data if possible.
@@pabgrg3199 Thank you! This would indeed be very interesting! I will try to look for suitable data, or if you have any suggestions, I am happy to hear them as well!
Thank you, Professor. I have tried this approach and the simple PCA with PLINK on your data, and the results look similar, but when I use my own VCF (merged from several VCF files with bcftools) and calculate the PCA for it, the plots look different. Do you have any suggestions for what I should check?
Dear Professor, thank you for the informative video. I have one question. If you used the same dataset as Colli et al. (2018), why did you get a lower PC1 (13.09 vs. 18.2 in the paper) and PC2 (5.5 vs. 8.93)? My PCs with this script on a different dataset are also lower than what my colleagues obtained. Thank you in advance for your explanation.
Hmmm... The small difference from the paper could be explained by a small difference in the analyzed data. In the paper, they exclude crossbreds and some other animals, which is not done in the presentation.
The other issue is also interesting. Do you mean that this script yields lower numbers on the proportion of variance explained by principal components 1 and 2, compared to another way of doing PCA, even in the case of exactly matching data sets?
The core computation is done by the cmdscale() R function - Classical (Metric) Multidimensional Scaling. Perhaps there is a difference in the method of calculation that yields slightly different proportions of explained variance.
The pictures with different approaches should look very similar though. Could you confirm this?
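One possible contributor, offered here as an assumption rather than a diagnosis of either result: the percentage for a component depends on what its eigenvalue is divided by. If one workflow divides by the sum of all eigenvalues and another only by the sum of the leading ones that were reported (PLINK's --pca, for instance, writes only the top components by default), the same analysis gives different percentages. The eigenvalues in this Python sketch are invented purely for illustration.

```python
def explained_percent(eigenvalues, denominator_values):
    """Percent of variance per eigenvalue, relative to the positive values in the denominator set."""
    # Classical MDS can return small negative eigenvalues; only positive ones are summed here.
    total = sum(v for v in denominator_values if v > 0)
    return [100.0 * v / total for v in eigenvalues]


# Hypothetical eigenvalues: 5 leading values plus a long tail of 50 small ones.
all_eigenvalues = [20.0, 10.0, 8.0, 7.0, 5.0] + [1.0] * 50
leading_only = all_eigenvalues[:5]

# PC1 and PC2 relative to the full spectrum.
full = explained_percent(all_eigenvalues[:2], all_eigenvalues)
# PC1 and PC2 relative to the 5 leading eigenvalues only.
truncated = explained_percent(all_eigenvalues[:2], leading_only)

print(full)
print(truncated)
```

The plot coordinates are unaffected by this choice, which would be consistent with the pictures looking alike while the axis percentages differ.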
I still wonder if there is a way to use the PCA components as covariates.
Of course you can! You can insert the PCA components into the model to account for population structure (e.g. in a GWAS setting, but in my experience this does not work too well). Also, if the components clearly distinguish between subpopulations (e.g. "breeds"), then you can use the PCA to account for breed identity in your model.
@@GenomicsBootCamp thank you so much. Your channel is eye-opening to genomics.
Thank you for these nice explanations. I have a question regarding the meaning of the colors of the PCA plot. How can we find the meaning of them? And can we use these commands for human genomics?
Hi, yes, exactly the same approach can be used in human genomics. You just do not need to specify the species in PLINK, as the program assumes the human genome/chromosome count by default.
For your first question, the meaning of the colors is not shown, because there are way too many breeds considered. The description would take up the whole screen. If you want to enable the description when visualizing a smaller number of groups, you need to turn on the "legend" by changing show.legend in the script to TRUE (has to be written in all capitals).
@@GenomicsBootCamp Thank you for your answer, but what is the meaning of the colors of the PCA plot? where can I find it?
@@fatmamokhtar19 Each color represents a different breed. Ideally, one can distinguish different breeds/groups/families also visually just by looking at the PCA plot, if these are highlighted by different colors. In the script itself, the colors are assigned on the same line 49, by "color = famids", so each differing value from the first column of the .fam file (i.e. the FID) gets a different color automatically.
Does this answer your question?
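The idea of "one color per distinct FID" can be sketched outside the R script as well. The Python below is not the script from the video; it only mirrors the logic under the assumption that the input is a standard .fam file (FID in the first column). The breed names and the palette are hypothetical.

```python
def assign_colors(fam_lines, palette):
    """Map each individual to a color based on its family ID (first .fam column)."""
    fids = [line.split()[0] for line in fam_lines if line.strip()]
    distinct = sorted(set(fids))
    # The palette is reused cyclically when there are more groups than colors.
    color_of = {fid: palette[i % len(palette)] for i, fid in enumerate(distinct)}
    return [color_of[fid] for fid in fids], color_of


# Hypothetical .fam content: FID IID PAT MAT SEX PHENOTYPE
fam_lines = [
    "BreedA ind1 0 0 1 -9",
    "BreedA ind2 0 0 2 -9",
    "BreedB ind3 0 0 1 -9",
]
colors, legend = assign_colors(fam_lines, ["red", "blue", "green"])
print(colors)
print(legend)
```

The returned dictionary plays the role of the legend: it is the lookup from color back to breed.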
@@GenomicsBootCamp Thank you so much for all this explanation.
What is the interpretation for the output "Using up to 8 threads (change this with --threads)"?
I am having trouble finding an explanation of what is meant by "threads".
Hi, the --threads option sets the maximum number of threads (roughly, CPU cores) that the PLINK run may use. So if something takes a very long time to run, one might increase it to speed up the process. (Not sure if this option works under Windows though...)
Great work! How practically can we use pca for a single population?
If you do a PCA for a single population, you can see if there is any obvious sub-population structure, i.e. if individuals tend to cluster in different places, or if you have one big "cloud" of dots, which does not imply a strong sub-population structure.
Also, you can better see relationships between individuals: those that are closer to each other are also generally more similar to each other.
If the question was about setting up a run for a single population instead of this many, the video next week (23.06.) will show something similar. Well... It will be an even simpler way of doing PCA with PLINK with three breeds, but from there you could simplify if needed.
@@GenomicsBootCamp thank you. I will wait till the next video.
can i use a ped file to perform PCA
Dear Professor, I really enjoyed your video. Can you please make some videos about simulation scripts for a genetic improvement study with packages like AlphaSimR and PLINK?
Thank you!
Hello Gabor,
thank you for your videos. Are you also doing online tutoring? I am doing a project, and I definitely need help from somebody who can explain and show me some things. Thank you for your reply.
Hi :). Thank you for this amazing channel.
Dear Gábor Mészáros,
First of all, thank you so much for the helpful valuable video series.
I really need help on the following issue during PCA for the cmdscale function.
I received the cmdscale error: NA values not allowed in 'd'.
I'm looking forward to hearing from you.
Best Regards
Hi,
I am not sure what the cause of the error is, but you can solve it in two ways:
1) If you really need the cmdscale function, and the missing values are the problem (as it seems), you can try to remove them, e.g. using the "drop_na()" function. I made a short video on it here: ua-cam.com/video/w6m7OAMZAzk/v-deo.html
2) If your goal is the PCA itself, there are other ways to achieve it. There is a simpler version, directly from PLINK, that does not use cmdscale at all (so I hope the problem will not apply here). A video on this, including the script, is here: ua-cam.com/video/vos6VeuNcaM/v-deo.html
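For option 1, the idea of removing missing values from a distance matrix can be sketched as follows. This is a Python illustration, not the R code from the video, and it rests on an assumption: that the NA values sit in the pairwise distance matrix and that dropping the affected individuals is acceptable. The IDs and distances are invented.

```python
import math


def drop_individuals_with_nan(dist, ids):
    """Remove individuals until no NaN remains in the square distance matrix.

    Greedy rule: repeatedly drop the individual with the most NaN distances.
    """
    keep = list(range(len(ids)))
    while keep:
        counts = {i: sum(1 for j in keep if math.isnan(dist[i][j])) for i in keep}
        worst = max(keep, key=lambda i: counts[i])
        if counts[worst] == 0:
            break
        keep.remove(worst)
    sub = [[dist[i][j] for j in keep] for i in keep]
    return sub, [ids[i] for i in keep]


nan = float("nan")
# Hypothetical 3 x 3 distance matrix; individual "C" has no usable distances.
dist = [
    [0.0, 0.2, nan],
    [0.2, 0.0, nan],
    [nan, nan, 0.0],
]
clean, kept_ids = drop_individuals_with_nan(dist, ["A", "B", "C"])
print(kept_ids)
print(clean)
```

Dropping whole individuals keeps the matrix square and symmetric, which a classical MDS routine requires; it is worth checking why those individuals had missing distances before discarding them.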
Thanks a lot, Professor. Please, add some more topics and further analysis using plink.
Yes, more types of analyses will come with time.
@@GenomicsBootCamp, Because of this, I'm always following your channel. Your tutorial brings me close to where I am supposed to be, in terms of knowledge in genomic analysis. I am sorry for asking this question here instead of on the changing file format clip. But please could you shed some light on why you interchangeably use make-bed and recode to make bed and map files.
@@georgewanjala4605 Thanks for the question! There is no specific reason behind it, most of the time.
The advantage of ped+map files is that you can open them with a text editor. It is easy to explain that one line in the ped file is one animal. So if my goal is to show you the contents on screen, I use --recode. The ped+map files take a bit more space though.
In general the binary ped files (bed+bim+fam) are the best solution, as they take the least space, and analyses run faster than with other PLINK file types.
So you can use any of the two, the results do not change.
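To make "one line in the ped file is one animal" concrete, here is a small Python sketch that splits a single .ped line into its parts. It assumes the standard whitespace-separated layout (six leading columns, then two alleles per SNP); the example line itself is invented.

```python
def parse_ped_line(line):
    """Split one .ped line into its six leading fields and per-SNP allele pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a .ped line needs at least 6 columns")
    fid, iid, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per SNP")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "FID": fid,
        "IID": iid,
        "father": father,
        "mother": mother,
        "sex": sex,
        "phenotype": phenotype,
        "genotypes": genotypes,
    }


# Hypothetical animal with three SNPs; "0 0" marks a missing genotype.
animal = parse_ped_line("Breed1 Animal1 0 0 1 -9 A G C C 0 0")
print(animal["IID"], animal["genotypes"])
```

The binary .bed file stores the same genotypes in a compact form that cannot be read this way, which is the trade-off described above.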
Hi,
Thank you for these helpful videos
I have a question:
How can I add the PCs to my final data? I want to use them as covariates in my analysis.
Thank you
Hi,
The coordinates that are plotted on the graph are saved in the mds_populations R data set, in the "points" part. You can get them as mds_populations$points, with each of the eigenvectors in a column and each of the individuals in a row, but without ID numbers. A follow-up merging is needed, which might lead to problems.
I suggest you do the PCA with PLINK instead - see the "Simple PCA analysis with PLINK" video on this channel. There you get the ".eigenvec" files, which have the values and the IDs for each individual. Less work, and the possibility of a merging mistake is eliminated (as the merging is already done).
ua-cam.com/video/vos6VeuNcaM/v-deo.html
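A sketch of attaching the components to your own data by ID, in Python for illustration. It assumes a PLINK-style .eigenvec layout (FID, IID, then one column per component) and a phenotype table keyed by the same FID and IID; the IDs, the "weight" column, and all values are hypothetical.

```python
def read_eigenvec(text, n_pcs=2):
    """Parse .eigenvec content into {(FID, IID): [PC1, PC2, ...]}."""
    pcs = {}
    for line in text.splitlines():
        # Skip blank lines and a possible header line starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        pcs[(fields[0], fields[1])] = [float(x) for x in fields[2:2 + n_pcs]]
    return pcs


def add_pc_covariates(phenotypes, pcs):
    """Return phenotype records extended with PC columns, plus the IDs that had no match."""
    merged, missing = [], []
    for record in phenotypes:
        key = (record["FID"], record["IID"])
        if key in pcs:
            row = dict(record)
            for k, value in enumerate(pcs[key], start=1):
                row[f"PC{k}"] = value
            merged.append(row)
        else:
            missing.append(key)
    return merged, missing


eigenvec_text = "B1 A1 0.01 0.02 0.5\nB1 A2 -0.03 0.04 0.1\n"
phenotypes = [
    {"FID": "B1", "IID": "A1", "weight": 52.0},
    {"FID": "B1", "IID": "A2", "weight": 48.5},
    {"FID": "B2", "IID": "A3", "weight": 50.0},
]
merged, missing = add_pc_covariates(phenotypes, read_eigenvec(eigenvec_text))
print(merged)
print(missing)
```

Matching on the ID pair rather than on row order is what removes the risk of a silent mismatch; the list of unmatched IDs shows who would be lost from the model.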
I am getting a single dot in the graph and I don't know why. Kindly guide me.
Check the data you are trying to visualize, to see if it has the size and format you expect.
The PCA plot is nothing more than a plot of the first and second principal components for each individual. So the file you visualize should have these. Also, the values should of course be different for each individual.
Do you use the data as in the example? In that case it should work as shown.
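The actual mistake was not shared in this thread, so the following is only a generic check, sketched in Python: given the (PC1, PC2) pairs about to be plotted, it reports how many points there are, how many are distinct, and how wide each axis is. The example values are invented.

```python
def describe_points(points):
    """Summarize (x, y) pairs: total count, distinct count, and the spread on each axis."""
    if not points:
        return {"n_points": 0, "n_distinct": 0, "x_range": None, "y_range": None}
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "n_points": len(points),
        "n_distinct": len(set(points)),
        "x_range": max(xs) - min(xs),
        "y_range": max(ys) - min(ys),
    }


# Hypothetical case: three individuals that all ended up with the same coordinates.
summary = describe_points([(0.1, 0.2), (0.1, 0.2), (0.1, 0.2)])
print(summary)
```

A single distinct point, or a zero range on both axes, means the problem is in the data handed to the plot rather than in the plotting itself.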
@@GenomicsBootCamp Yes sir, it worked. There was a minor mistake on my side. Thank you so much.
Can you tell me what was the mistake? I am having the same issue.
hi, I'm having the same error where I only get a single dot. Could you share what fixed it?
I need pca code in spyder 3.8
Hi, I am not good at Python (yet), so I don't know exactly what functions to use. A quick solution for you might be to check out the new video on the channel, "Simple PCA analysis with PLINK". There the PCA itself is done by PLINK, and all you need to do is make an X-Y plot of columns 3 and 4 from the .eigenvec file.
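A minimal Python sketch of that last step, under stated assumptions: the .eigenvec content has FID and IID in the first two columns and the components from column 3 onward, and matplotlib is installed if you want the plot. The file content and the output name "pca.png" are hypothetical.

```python
def eigenvec_xy(text):
    """Return columns 3 and 4 of .eigenvec content as x and y lists, plus the FID labels."""
    xs, ys, labels = [], [], []
    for line in text.splitlines():
        # Skip blank lines and a possible header line starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        labels.append(fields[0])
        xs.append(float(fields[2]))
        ys.append(float(fields[3]))
    return xs, ys, labels


def plot_pca(xs, ys, path="pca.png"):
    """Save a PC1 vs PC2 scatter plot; needs matplotlib, imported only when called."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    fig.savefig(path)


# Hypothetical .eigenvec content with two components per individual.
example = "BreedA ind1 0.10 -0.20\nBreedA ind2 0.12 -0.18\nBreedB ind3 -0.30 0.25\n"
xs, ys, labels = eigenvec_xy(example)
print(xs, ys, labels)
```

In practice you would read the text with open("your_prefix.eigenvec").read() and then call plot_pca(xs, ys).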
Hard to understand these PCA plots
Thank you for your feedback!
Could you elaborate? Are PCA plots in general hard to understand, or the one in this video specifically? What aspect is not understandable and would require more explanation?