Genomics in practice - Principal component analysis (PCA) based on SNP data

Genomics Boot Camp

Published 7 Nov 2024

COMMENTS • 52

@shirihoshen5921 1 year ago ⁺¹
Thank you Gabor! I don't work in data science or genetics, but canine genetics is a longstanding interest of mine. I never thought that I would be able to dive into canine genomics and do my own analyses, but it looks like it's on the horizon. I had no idea that the main programs that are used were accessible and learnable by me, sitting at home with my desktop and the internet. Biggest thanks. I am about to start your Data Wrangling With Plink playlist.
@GenomicsBootCamp 1 year ago ⁺¹
Great to hear! Also, we have a shared interest in canine genetics then!
@abdulrahmankhatib3539 3 years ago ⁺³
Dear professor, thank you for the informative videos. Could you please tell me what the input data for this code should look like?
@GenomicsBootCamp 3 years ago
Hi Abdulrahman! The code in the video starts with a binary ped file. Actually, any type of PLINK file is suitable, but then you have to adapt the code a bit. Adaptations are likely necessary anyway, as you might not have the same species, or you want to tweak the quality control parameters or similar...
Video on the data formats is here: ua-cam.com/video/vZyf5aXlB-k/v-deo.html
@sumaihal-hazzaa8668 3 years ago ⁺¹
Dear Professor, thank you for the enlightening video and book.
For the Colli et al. (2018) data, my data analysis was successful, but once I changed the data (human genomics), I faced a problem with the data size. The message I got from RStudio was that I need a larger computer. What should I do in this case?
Thank you,
@GenomicsBootCamp 3 years ago
Hi,
It is possible to do the entire PCA using PLINK, which might be less demanding on the computer, and then do just the visualization in R. An example and workflow are in the video here: ua-cam.com/video/vos6VeuNcaM/v-deo.html
@stes5429 8 months ago ⁺¹
Thanks for the video! I am learning so much with the Genomic Bootcamp book and this playlist!
One doubt though, I have read the PLINK manual, and:
1) --distance-matrix is deprecated; it exists in PLINK 1.9 just for backward compatibility. "New scripts should migrate to '--distance 1-ibs flat-missing' and '--distance ibs flat-missing'."
2) --distance-matrix computes Identity-By-State (IBS) distances between individuals, and therefore the result is not a correlation matrix, which is the basis of PCA. In this case, we are not running a PCA, but a Multidimensional Scaling (MDS), also known as Principal Coordinate Analysis (PCoA).
Please correct me if I am wrong!
Cheers
@GenomicsBootCamp 8 months ago
Yes, this video is based on the MDS computations via R. In my experience the output is very similar to PCA. If you are interested in a direct and quicker approach, you can use the --pca option in PLINK.
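To make the MDS side of this exchange concrete, here is a minimal pure-Python sketch of classical (metric) MDS, the kind of computation R's cmdscale() performs: square the distances, double-center them, and scale the leading eigenvector by the square root of its eigenvalue. The function names and the toy distance matrix are illustrative assumptions, not code from the video, and power iteration stands in for a full eigendecomposition, so this only recovers the first coordinate.

```python
import math

def double_center(dist):
    """Return B = -1/2 * J * D^2 * J for a symmetric distance matrix."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    total_mean = sum(row_mean) / n
    # D^2 is symmetric, so column means equal row means.
    return [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def classical_mds_1d(dist, iterations=500):
    """First MDS coordinate and its eigenvalue via power iteration (toy scale only)."""
    b = double_center(dist)
    n = len(b)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
        eigenvalue = norm  # for a unit vector, |B v| approaches the top eigenvalue
    coords = [x * math.sqrt(eigenvalue) for x in vec]
    return coords, eigenvalue

# Hypothetical distances between three individuals lying on a line at 0, 1 and 3.
toy_dist = [[0.0, 1.0, 3.0],
            [1.0, 0.0, 2.0],
            [3.0, 2.0, 0.0]]
coords, eigenvalue = classical_mds_1d(toy_dist)
print(coords, eigenvalue)
```

The recovered coordinates reproduce the input distances up to a shift and a sign flip, which is why an MDS plot of IBS distances can look much like a PCA plot without being the same computation.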
@pabgrg3199 3 years ago ⁺¹
Thank you, excellent video... I would love to watch more PCA examples.
@GenomicsBootCamp 3 years ago
Thanks for your comment!
Are you interested in the interpretation of PCA plots or some other aspect?
@pabgrg3199 3 years ago ⁺²
@@GenomicsBootCamp thank you for replying. It would be helpful if there were a few more PCA examples with human data if possible.
@GenomicsBootCamp 3 years ago
@@pabgrg3199 Thank you! This would indeed be very interesting! I will try to look for suitable data, or if you have any suggestions, I am happy to hear them as well!
@awsedrawsedrft 1 year ago
Thank you, professor. I have tried this approach and the simple PCA with PLINK with your data, and the results look similar, but when I use my own VCF (merged from several VCF files with bcftools) and calculate the PCA for it, the plots look different. Do you have any suggestions for what I should check?
@Fasilgetachew 3 years ago ⁺¹
Dear Professor, thank you for the informative video. I have one question. If you used the same dataset as Colli et al. (2018), why did you get a lower PC1 (13.09 vs. 18.2 in the paper) and PC2 (5.5 vs. 8.93)? My PCs with this script on a different dataset are also lower than what my colleagues obtained. Thank you in advance for your explanation.
@GenomicsBootCamp 3 years ago
hmmm... The small difference to the paper could be explained with a small difference in analyzed data. In the paper, they exclude crossbreds and some other animals, which is not done in the presentation.
The other issue is also interesting. Do you mean that this script yields lower numbers on the proportion of variance explained by principal components 1 and 2, compared to another way of doing PCA, even in the case of exactly matching data sets?
The core computation is done by the cmdscale() R function - Classical (Metric) Multidimensional Scaling. Perhaps there is a difference in the method of calculation that yields slightly different proportions of explained variance.
The pictures with different approaches should look very similar though. Could you confirm this?
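One generic way such differences can arise is the denominator: the "proportion of variance explained" depends on which eigenvalues the total is summed over. The sketch below shows the same eigenvalues giving different percentages when the total uses all positive eigenvalues versus only the first few. The function name and the numbers are made up for illustration, and whether this explains the gap to the paper is an assumption that the thread does not confirm.

```python
def percent_explained(eigenvalues, keep=None):
    """Percent of variance per component, relative to the eigenvalues kept.

    keep=None sums all positive eigenvalues for the denominator;
    keep=k sums only the first k (mimicking truncated output).
    """
    positive = [value for value in eigenvalues if value > 0]
    kept = positive if keep is None else positive[:keep]
    total = sum(kept)
    return [100.0 * value / total for value in kept]

toy_eigenvalues = [40.0, 20.0, 15.0, 10.0, 8.0, 7.0]  # made-up numbers

all_based = percent_explained(toy_eigenvalues)           # denominator: all six
top3_based = percent_explained(toy_eigenvalues, keep=3)  # denominator: first three

print("PC1, PC2 (all eigenvalues):", all_based[:2])
print("PC1, PC2 (first three only):", top3_based[:2])
```

The plot itself is unaffected by this choice, since only the axis labels change, which fits the expectation that pictures from different approaches look very similar.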
@mohammadj.shamim9342 1 year ago ⁺¹
I still wonder if there is a way to use PCA components as covariates.
@GenomicsBootCamp 1 year ago
Of course you can! You can insert the PCA components into the model to account for population structure (e.g. in a GWAS setting, although in my experience this does not work too well). Also, if the components clearly distinguish between subpopulations (e.g. "breeds"), then you can use the PCA to account for breed identity in your model.
@mohammadj.shamim9342 1 year ago
@@GenomicsBootCamp thank you so much. Your channel is eye-opening to genomics.
@fatmamokhtar19 3 years ago ⁺¹
Thank you for these nice explanations. I have a question regarding the meaning of the colors of the PCA plot. How can we find the meaning of them? And can we use these commands for human genomics?
@GenomicsBootCamp 3 years ago ⁺²
Hi, yes, exactly the same approach could be used in human genomics. You just do not need to specify the species in PLINK, as the program assumes the human genome/chromosome count by default.
For your first question, the color legend is not shown because far too many breeds are considered; it would take up the whole screen. If you want to show the legend when visualizing a smaller number of groups, change show.legend in the script to TRUE (it has to be written in all capitals).
@fatmamokhtar19 3 years ago
@@GenomicsBootCamp Thank you for your answer, but what is the meaning of the colors of the PCA plot? Where can I find it?
@GenomicsBootCamp 3 years ago ⁺²
@@fatmamokhtar19 Each color represents a different breed. Ideally, one can also distinguish different breeds/groups/families visually just by looking at the PCA plot, if these are highlighted by different colors. In the script itself, the colors are assigned on line 49, by "color = famids", so each distinct value from the first column of the .fam file (i.e. the FID) gets a different color automatically.
Does this answer your question?
@fatmamokhtar19 3 years ago ⁺¹
@@GenomicsBootCamp Thank you so much for all this explanation.
@MM-jm1il 3 years ago ⁺¹
What is the interpretation of the output "Using up to 8 threads (change this with --threads)"?
I am having trouble finding an explanation of what is meant by "threads".
@GenomicsBootCamp 3 years ago
Hi, the --threads option sets the maximum number of threads, which roughly corresponds to the number of CPU cores, that the PLINK run uses. So if something takes a very long time to run, you might increase it to speed up the process. (Not sure if this option works under Windows though...)
@pattarapolsumreddee1053 3 years ago ⁺¹
Great work! How practically can we use pca for a single population?
@GenomicsBootCamp 3 years ago ⁺¹
If you do a PCA for a single population, you can see if there is any obvious sub-population structure, i.e. whether individuals tend to cluster in different places, or whether you get one big "cloud" of dots, which does not imply a strong sub-population structure.
Also, you can better see relationships between individuals: those that are closer to each other are also generally more similar to each other.
If the question was about setting up a run for a single population instead of this many, the video next week (23.06.) will show something similar. Well... It will be an even simpler way of doing PCA with PLINK with three breeds, but from there you could simplify if needed.
@pattarapolsumreddee1053 3 years ago ⁺¹
@@GenomicsBootCamp thank you. I will wait till the next video.
@xyzstudent5642 1 year ago
Can I use a ped file to perform PCA?
@lucasf.c.y.dossoukpongan4684 2 years ago ⁺¹
Dear Professor, I really enjoyed your video. Could you please make some videos about simulation scripts for a genetic improvement study with packages like AlphaSimR and PLINK?
@Exosap 3 years ago ⁺²
Thank you!
@nikolalicakova2375 2 years ago
Hello Gabor,
thank you for your videos. Are you also doing online tutoring? I am doing a project, and I definitely need help from somebody who can explain and show me some things. Thank you for your reply.
@hediatnani3860 3 years ago ⁺¹
Hi :). Thank you for this amazing channel.
@elifvaccari8134 2 years ago
Dear Gábor Mészáros,
First of all, thank you so much for the helpful valuable video series.
I really need help on the following issue during PCA for the cmdscale function.
I received the cmdscale error: NA values not allowed in 'd'.
I'm looking forward to hearing from you.
Best Regards
@GenomicsBootCamp 2 years ago ⁺¹
Hi,
I am not sure what the cause of the error is, but you can solve it in two ways:
1) If you really need the cmdscale function, and the missing values are the problem (as it seems), you can try to remove them, e.g. using the "drop_na()" function. I made a short video on it here: ua-cam.com/video/w6m7OAMZAzk/v-deo.html
2) If your goal is the PCA itself, there are other ways to achieve it. There is a simpler version, directly from PLINK, that does not use cmdscale at all (so I hope the problem will not apply here). A video on this, including the script, is here: ua-cam.com/video/vos6VeuNcaM/v-deo.html
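Before dropping anything, it can help to see which individuals carry the missing distances. The sketch below is a minimal example, assuming the distance matrix is available as whitespace-separated text in which missing entries appear as "nan" or "NA". That layout, the function name and the toy matrix are assumptions for illustration, not details confirmed in the thread.

```python
import math

MISSING_TOKENS = {"NA", "na", "NaN", "nan"}

def rows_with_missing(matrix_text):
    """Return the 0-based row indices whose distances include a missing value."""
    bad_rows = []
    for row_index, line in enumerate(matrix_text.strip().splitlines()):
        values = [float("nan") if token in MISSING_TOKENS else float(token)
                  for token in line.split()]
        if any(math.isnan(value) for value in values):
            bad_rows.append(row_index)
    return bad_rows

# Toy 3 x 3 matrix: the distance between individuals 0 and 2 is missing.
toy_matrix = """\
0 0.10 nan
0.10 0 0.20
NA 0.20 0
"""
bad = rows_with_missing(toy_matrix)
print(bad)
```

Row i corresponds to the i-th individual in the matching ID file, so the flagged indices point at the animals whose genotype data leaves some pairwise distances undefined.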
@mdrasheduzzaman7613 3 years ago
Thanks a lot, Professor. Please add some more topics and further analyses using PLINK.
@GenomicsBootCamp 3 years ago ⁺⁴
Yes, more types of analyses will come with time.
@georgewanjala4605 3 years ago ⁺¹
@@GenomicsBootCamp, Because of this, I'm always following your channel. Your tutorials bring me close to where I am supposed to be in terms of knowledge of genomic analysis. I am sorry for asking this question here instead of under the clip on changing file formats. But could you please shed some light on why you interchangeably use --make-bed and --recode to make bed and map files?
@GenomicsBootCamp 3 years ago ⁺¹
@@georgewanjala4605 Thanks for the question! There is no specific reason behind it, most of the time.
The advantage of ped+map files is that you can open them with a text editor. It is easy to explain that one line in the ped file is one animal. So if my goal is to show you the contents on screen, I use --recode. The ped+map files take a bit more space though.
In general, the binary ped files (bed+bim+fam) are the best solution, as they take the least space, and analyses run faster than with other PLINK file types.
So you can use either of the two; the results do not change.
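To illustrate the "one line is one animal" point, here is a small sketch that splits one line of a text ped file into its six leading columns (family ID, individual ID, sire, dam, sex, phenotype) and the allele pairs that follow, two alleles per SNP. The function name and the example line are hypothetical.

```python
def parse_ped_line(line):
    """Split one ped line into IDs and a list of (allele1, allele2) pairs."""
    fields = line.split()
    fid, iid, sire, dam, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per SNP")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"fid": fid, "iid": iid, "sex": sex,
            "phenotype": phenotype, "genotypes": genotypes}

# Hypothetical animal with three SNPs; "0 0" marks a missing genotype.
animal = parse_ped_line("BreedA Animal1 0 0 1 -9 A A G T 0 0")
print(animal["iid"], animal["genotypes"])
```

The binary bed file stores the same genotypes in a compact form that a text editor cannot display, which is the trade-off described above.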
@drm1404 3 years ago ⁺¹
Hi,
Thank you for these helpful videos.
I have a question:
How can I add the PCs to my final data? I want to use them as covariates in my analysis.
Thank you
@GenomicsBootCamp 3 years ago ⁺¹
Hi,
The coordinates that are plotted in the graph are saved in the mds_populations R object, in the "points" part. You can get them as mds_populations$points, with each of the eigenvectors in a column and each of the individuals in a row, but without ID numbers. A follow-up merge is needed, which might lead to problems.
I suggest you do the PCA with PLINK instead; see the "Simple PCA analysis with PLINK" video on this channel. There you get the ".eigenvec" file, which has the values and the IDs for each individual. It is less work, and the possibility of a merging mistake is eliminated (as the merge is already done).
ua-cam.com/video/vos6VeuNcaM/v-deo.html
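Because the .eigenvec file carries the family and individual IDs in its first two columns, attaching the components to other data can be done by matching on those IDs rather than on row order. The following is a minimal sketch of that merge. The function names, the toy IDs and the phenotype values are made up, and the file is assumed to be whitespace-separated with no header line.

```python
def read_eigenvec(text, n_components=2):
    """Map (FID, IID) -> list of the first n_components principal components."""
    pcs = {}
    for line in text.strip().splitlines():
        fields = line.split()
        pcs[(fields[0], fields[1])] = [float(x) for x in fields[2:2 + n_components]]
    return pcs

def add_pc_covariates(phenotypes, pcs):
    """Join phenotypes with PCs by ID; individuals missing from either side are skipped."""
    merged = []
    for key, trait in phenotypes.items():
        if key in pcs:
            merged.append((key[0], key[1], trait, *pcs[key]))
    return merged

toy_eigenvec = """\
BreedA a1 0.10 -0.20 0.01
BreedA a2 0.12 -0.18 0.02
BreedB b1 -0.30 0.25 0.00
"""
toy_phenotypes = {("BreedA", "a1"): 5.1,
                  ("BreedB", "b1"): 4.3,
                  ("BreedB", "b9"): 6.0}  # b9 has no genotype data

rows = add_pc_covariates(toy_phenotypes, read_eigenvec(toy_eigenvec))
for row in rows:
    print(row)
```

Each output row holds FID, IID, the trait, PC1 and PC2, ready to be written out as a covariate table for whatever model is used downstream.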
@kunjanparikh7669 2 years ago
I am getting a single dot in the graph and I don't know why. Kindly guide me.
@GenomicsBootCamp 2 years ago
Check whether the data you are trying to visualize is in the size and format you expect.
The PCA plot is nothing more than a plot of the first and second principal components for each individual. So the file you visualize should contain these. Also, the values should of course differ between individuals.
Do you use the data as in the example? In that case it should work as shown.
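The checks described above can be done in a few lines before plotting. This sketch counts the points, counts how many are distinct, and reports the range on each axis. The function name and the toy values are hypothetical, and the two scenarios shown are only possible causes of a single visible dot, since the thread never states what the actual mistake was.

```python
def describe_points(points):
    """Summarize a list of (PC1, PC2) pairs before plotting them."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "n_points": len(points),
        "n_distinct": len(set(points)),
        "pc1_range": (min(xs), max(xs)) if xs else None,
        "pc2_range": (min(ys), max(ys)) if ys else None,
    }

# Case 1: every individual has the same coordinates, so only one dot is drawn.
identical = describe_points([(0.1, 0.2), (0.1, 0.2), (0.1, 0.2)])

# Case 2: one extreme value stretches the axes and squeezes the rest together.
with_outlier = describe_points([(0.0, 0.0), (0.01, 0.0), (50.0, 40.0)])

print(identical)
print(with_outlier)
```

If n_distinct is 1, the problem is in the data being plotted rather than in the plotting code. A very wide range on one axis suggests looking at individual extreme values instead.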
@kunjanparikh7669 2 years ago
@@GenomicsBootCamp Yes sir, it worked. There was a minor mistake on my side. Thank you so much.
@jaweriamumtaz3462 11 months ago
Can you tell me what the mistake was? I am having the same issue.
@jovanajovanovska925 5 months ago
Hi, I'm having the same error where I only get a single dot. Could you share what fixed it?
@mellmiss8522 3 years ago
I need PCA code in Spyder 3.8.
@GenomicsBootCamp 3 years ago
Hi, I am not good at Python (yet), so I don't know exactly which functions to use. A quick solution for you might be to check out the new video on the channel, "Simple PCA analysis with PLINK". There the PCA itself is done by PLINK, and all you need to do is make an X-Y plot of columns 3 and 4 from the .eigenvec file.
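For the Python side of that suggestion, a minimal sketch could read the .eigenvec text, take columns 3 and 4 as the X and Y values, and group the points by the first column (FID) so that each group can get its own color. The function name and the toy data are illustrative assumptions. Skipping lines that start with "#" is a precaution in case the file has a header line.

```python
from collections import defaultdict

def group_points_by_fid(eigenvec_text):
    """Return {FID: [(PC1, PC2), ...]} from .eigenvec-style text."""
    groups = defaultdict(list)
    for line in eigenvec_text.strip().splitlines():
        if line.startswith("#"):  # skip a header line, if the file has one
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not enough columns for PC1 and PC2
        groups[fields[0]].append((float(fields[2]), float(fields[3])))
    return dict(groups)

toy_eigenvec = """\
BreedA a1 0.10 -0.20 0.01
BreedA a2 0.12 -0.18 0.02
BreedB b1 -0.30 0.25 0.00
"""
groups = group_points_by_fid(toy_eigenvec)
for fid, points in groups.items():
    print(fid, points)
```

With matplotlib installed, calling plt.scatter once per group on the X and Y values would draw the plot with one color per FID. That plotting step is left out here so the sketch runs with the standard library alone.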
@PC-lu3zf 2 years ago ⁺¹
Hard to understand these PCA plots
@GenomicsBootCamp 2 years ago
Thank you for your feedback!
Could you elaborate? Are PCA plots in general hard to understand, or the one in this video specifically? What aspect is not understandable and would require more explanation?
