Genomics in practice - SNP data quality control with PLINK

  • Published 26 Aug 2024

COMMENTS • 35

  • @ricardodutradobem3144 3 years ago +9

    Dear Professor Gabor, I would like to thank you immensely. I'm working with runs of homozygosity in cattle and I've been learning a lot from your videos and the book. In an increasingly individualistic world, I congratulate you for sharing your knowledge.

  • @user-ri4wn7jh7v 7 months ago +1

    Thank you so much for simplifying a process that seemed too complicated.

  • @mohammadj.shamim9342 2 years ago +2

    Thank you so much. To be honest they are very informative and helpful. I was very interested in quality control as it is fundamental in GWAS analysis.

  • @artjamwithross4374 3 years ago +2

    Thank you so much for these very useful videos! 🙏🏼👍🏼 They have guided me so much when using plink 👍🏼

  • @moslemmoghbeli4325 6 months ago +1

    thank you, this is amazing

  • @samrawittsehay1610 3 years ago +4

    Thank you Prof. Gabor for your continuous tutorial. I was wondering why the sum of observed homozygosity and observed heterozygosity is different from one, even though we did the QC. Theoretically, the sum of observed homozygosity and observed heterozygosity should give one.

    • @GenomicsBootCamp 3 years ago +1

      Thanks for your comment! Could you point out which SNP you have in mind? I would like to check in detail.
      For now, my guess is that the difference comes from how the missing values are counted. Even after QC, some missing genotypes remain, because their rate was below the specified thresholds.
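
A minimal sketch of that guess, using invented genotype calls rather than the data from the video: if the proportions are computed with every sample in the denominator, the missing calls make homozygosity plus heterozygosity fall short of one, whereas dividing by called genotypes only gives exactly one. The function name and the numbers are illustrative assumptions.

```python
# Illustrative genotype calls for one SNP; None marks a missing call.
genotypes = ["AA", "AB", "BB", "AB", None, "AA", "AB", None, "BB", "AA"]


def hom_het_proportions(calls, count_missing_in_denominator):
    """Return (homozygous, heterozygous) proportions for one SNP."""
    called = [g for g in calls if g is not None]
    hom = sum(1 for g in called if g[0] == g[1])
    het = len(called) - hom
    # The only difference between the two conventions is the denominator.
    denom = len(calls) if count_missing_in_denominator else len(called)
    return hom / denom, het / denom


hom_all, het_all = hom_het_proportions(genotypes, True)
hom_called, het_called = hom_het_proportions(genotypes, False)
print(f"all samples in denominator: {hom_all + het_all:.2f}")
print(f"called genotypes only:      {hom_called + het_called:.2f}")
```

Whether a given report uses one denominator or the other has to be checked against the tool's documentation; the sketch only shows why the two conventions disagree.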

    • @samrawittsehay1610 3 years ago

      @GenomicsBootCamp Okay, I will contact you by email.

  • @michegn 1 year ago +2

    Dear Professor Gabor, thanks for developing such a friendly tutorial series. I'm trying to run all your proposed quality control criteria as in line 15, but R gives me the error "Error: --geno accepts at most 1 parameter". It seems it only allows one parameter at a time. My command was: system("plink --bfile AIM_ped_2023_03_21 --geno 0.1 --mind 0.1 --maf 0.05 --hwe 0.0000001 --allow-no-sex --nonfounders --make-bed --out afterQC"). I really appreciate your guidance. Once again, thanks!

    • @GenomicsBootCamp 1 year ago +1

      This error comes up when there is a missing space or a similar typo. In your case, there seems to be an odd dash right after the 0.1 of --geno. Try retyping that part. The command should work as you wrote it, so the cause is a typo or something similar.
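
One way to hunt for such a stray character is to scan the command for anything outside plain ASCII. The sketch below is a hypothetical helper (`find_suspicious_characters` is not part of PLINK or R), and the pasted command is an invented example containing a no-break space and an en dash, not the commenter's actual input.

```python
import unicodedata


def find_suspicious_characters(command):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(command)
        if ord(ch) > 127
    ]


# Hypothetical command with a no-break space and an en dash pasted in.
pasted = "plink --bfile mydata --geno 0.1\u00a0\u2013-mind 0.1 --make-bed --out afterQC"
for index, char, name in find_suspicious_characters(pasted):
    print(index, repr(char), name)
```

A command typed by hand with ordinary hyphens and spaces yields an empty list, so any hit marks a spot worth retyping.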

  • @emmanouilathanasakis310 11 months ago +2

    Dear Professor Gabor, great lessons! Would it be possible to illustrate how to convert PLINK genotyping data to the VCF file format? There are several processing steps both before and after the conversion, and clarifying them would be great! Thanks in advance.

    • @GenomicsBootCamp 11 months ago +1

      There is a similar video on the channel, maybe this is what you are looking for?
      Convert between PLINK to VCF file formats ua-cam.com/video/EJDknrHAkXs/v-deo.html

  • @HuyHoiHay 2 years ago +1

    thank you

  • @jadecelis6838 1 year ago +1

    Thank you for the clear explanation! For a SNP dataset I have to do QC for Hardy-Weinberg equilibrium and heterozygosity. Do you have a video on how to do this as well? And how do I choose the window when using the option '--indep-pairwise'?

    • @GenomicsBootCamp 1 year ago +1

      Hi, the HW check is covered in the general quality control video. As for LD pruning, I am not sure there is a "best" window to choose; to be honest, I go with the defaults here.

    • @jadecelis6838 1 year ago

      Thank you! Your videos are extremely helpful!

  • @shivambhardwaj2683 2 years ago +1

    Dear Professor Gabor, in genomic data analysis we are interested in identifying the loci undergoing selection. Most of the animal samples are taken from established herds, i.e. there is no random mating. Why, then, do we use the HWE quality check, which keeps only those loci that show no deviation from equilibrium?

    • @GenomicsBootCamp 2 years ago +1

      Greetings! When used in quality control, the HWE check is not meant to keep only loci that strictly follow HWE expectations. Rather, it is a tool for removing SNPs with a huge difference between observed and expected genotype proportions.
      But one should be **careful** with quality checks all the time. Sometimes these are exactly the SNPs we are looking for, so removing them through HWE filtering in QC would be disastrous for the results. Also, if we study multiple populations together, e.g. for admixture or Fst, HWE filtering on the joint data could remove the most interesting SNPs.
      So, to summarize, the HWE check is used to filter out potentially wrongly genotyped SNPs, but its use should be considered on a case-by-case basis.
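
To make "observed versus expected proportions" concrete, here is a rough sketch that compares genotype counts with their Hardy-Weinberg expectations using a chi-square statistic. This is a simplification chosen for brevity: PLINK's --hwe filter relies on an exact test rather than this statistic, and the function name and counts below are invented for illustration.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with HWE expectations.

    Assumes at least one genotyped individual.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts: the first SNP matches expectation, the second has no
# heterozygotes at all, a pattern that can point to a genotyping problem.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

A strict p-value threshold in QC targets only SNPs like the second one, where the departure is extreme, and leaves moderate deviations in the data.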

    • @shivambhardwaj2683 2 years ago

      @GenomicsBootCamp Thank you, Professor 👍

  • @elielsonveloso7517 2 years ago +1

    Dear Professor Gabor, thanks a lot for such an interesting tutorial! I was wondering which of these QC parameters and thresholds I should use when working with SNP panel data rather than exome or whole genome sequencing. Would you have any references or guidelines for such an application?

    • @GenomicsBootCamp 2 years ago +1

      Hi, to clarify: do you work with exome and whole genome sequence data? From your comment I understood that you work with SNP data, which is the same data type as in the video, so the thresholds there would fit you.
      If you work with WGS data of some sort, it should have its own quality control process, already before variant calling. If you then go further and, for example, extract the SNPs from the WGS data, then you can make a PLINK file out of them and see how the current quality control criteria hold up.

  • @bellofolaniyi5546 1 year ago +1

    Thank you, Prof. Gabor, for your explanation. Could the --dog option be used for chicken, considering that they both have an equal number of chromosomes?

    • @GenomicsBootCamp 1 year ago

      Yes, it is technically possible, but it is not good practice (although you might see a similar approach in my older videos).
      So it works, but it is much better to specify the exact chromosome number using --chr-set. As you can see in the link below, --dog is just a shortcut for --chr-set 38.
      www.cog-genomics.org/plink/1.9/input#chr_set
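
As a sketch of how the explicit option could be wired into a scripted call, the helper below only assembles the argument list and does not run PLINK. The helper name `plink_command` and the file prefixes are hypothetical; the value 38 is the one given in the thread for the `--dog` shortcut.

```python
import shlex


def plink_command(bfile, out, autosome_count, extra_flags=()):
    """Build a PLINK argument list with an explicit --chr-set value."""
    args = ["plink", "--bfile", bfile, "--chr-set", str(autosome_count)]
    args += list(extra_flags)
    args += ["--make-bed", "--out", out]
    return args


# Hypothetical file prefixes; 38 is the value quoted above for --dog.
cmd = plink_command("chicken_data", "chicken_afterQC", 38, ["--geno", "0.1"])
print(shlex.join(cmd))
```

Passing the count as a parameter keeps the species assumption visible in the script instead of hiding it behind another species' shortcut flag.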

  • @ccdj35 2 years ago +1

    8:28 Something bothers me about deleting alleles that occur at a frequency below 5%. Doesn't assuming that these differences could be mistakes create a situation where we overlook that we are losing some real alleles?

    • @GenomicsBootCamp 2 years ago

      Low-frequency alleles are not removed because they are thought to be faulty; in some cases we simply do not need, or do not want, them in our data set. But each and every time, we need to consider whether any of the quality control parameters are needed at all.
      For example, using a MAF threshold carelessly in Fst calculations might actually delete the results we are looking for. There are more such examples, so one should be careful all the time!
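
For readers who want to see what a MAF threshold does in numbers, here is a small sketch with invented genotype counts. It mirrors the behaviour of keeping SNPs whose minor allele frequency is at or above the threshold; the SNP names and counts are assumptions made for illustration.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / n_alleles
    return min(freq_a, 1 - freq_a)


# Hypothetical genotype counts (AA, AB, BB) per SNP.
snps = {"snp1": (60, 35, 5), "snp2": (96, 4, 0), "snp3": (0, 6, 94)}
maf_threshold = 0.05

kept = [
    name
    for name, counts in snps.items()
    if minor_allele_frequency(*counts) >= maf_threshold
]
print(kept)
```

The two SNPs that drop out here are real variants with rare alleles, not errors, which is why the threshold has to match the analysis rather than be applied by habit.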

  • @mwanganamubita9617 1 year ago +1

    @GenomicsBootCamp Dear Prof. Gabor, when preparing the map file for PLINK, how should SNPs be ordered if there are SNPs on several chromosomes and genetic position is set to 0? Does ordering in terms of the physical positions take precedence over the chromosome number?

    • @GenomicsBootCamp 1 year ago +1

      Hi, normally map files are ordered by chromosome (ascending) and then by the base pair position of the SNP within the chromosome (ascending).
      But to be honest, I have seen map files ordered alphabetically by SNP name (so a total mess in terms of chromosome and base pair position), and it worked just fine. Thus I assume PLINK has an internal mechanism to figure out the correct order and just uses it.
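
The usual ordering can be checked with a few lines of code. The sketch below assumes the four-column .map layout (chromosome, SNP name, genetic distance, base pair position) and numeric chromosome codes; the rows are invented. It only tests the order and deliberately does not rewrite the file, because the genotype columns of the matching .ped file follow the .map order.

```python
# Hypothetical .map rows: chromosome, SNP name, genetic distance, bp position.
map_lines = [
    "2 snp_c 0 15000",
    "1 snp_b 0 72000",
    "10 snp_d 0 500",
    "1 snp_a 0 1200",
]


def map_sort_key(line):
    """Sort key: chromosome first, then base pair position (both numeric)."""
    chrom, snp_id, genetic_distance, bp_position = line.split()
    return int(chrom), int(bp_position)


def is_sorted_map(lines):
    """True if rows are ordered by chromosome, then by base pair position."""
    keys = [map_sort_key(line) for line in lines]
    return keys == sorted(keys)


print(is_sorted_map(map_lines))
print(sorted(map_lines, key=map_sort_key))
```

Converting the chromosome code to an integer matters: a plain text sort would place chromosome 10 before chromosome 2.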

    • @mwanganamubita9617 1 year ago +1

      Thanks, Prof. Gabor. Just a quick question: what is the generally accepted HWE threshold? p < 0.05 with Bonferroni correction?

    • @GenomicsBootCamp 1 year ago

      That is an interesting approach... If HWE is to be used, I generally use it to see whether some SNPs do not comply with the expectations. For this I use a p-value of 10^-5 or 10^-6 (e.g. 0.00001), without stronger justification.

  • @getinetmekuriawtarekegn1916 3 years ago +1

    Dear Gabor, thanks for the very important guidance on managing genomic data. I have one problem: I want to upgrade R in RStudio. PLINK works in a recent R version, and I installed R version 4.0.5; however, when I check the library in RStudio, it has not changed (see lib="/Library/Frameworks/R.framework/Versions/4.0/Resources/library"). I upgraded RStudio too, but the R version is still 4.0 and would not change to 4.0.5. Could you advise me, please?

    • @GenomicsBootCamp 3 years ago

      Hi,
      PLINK should work with any R version, since we really only use the system() function here...
      To switch to another R version in RStudio, go to Tools > Global Options > General tab on the left.
      Right at the top, you will find the "R version" currently in use. Just hit the "Change" button next to it and make your choice there.

  • @sowadanognigamal134 1 year ago

    Dear Prof, I really appreciate the work you are doing for us. My question is this: I am currently working on rice SNP data. I downloaded the 3k SNP data from SNBI, but I don't know how to extract the ones that match my accessions and create my own files. Kindly guide me. Much appreciated.

    • @GenomicsBootCamp 1 year ago

      Hi, your request to "match" data is not clear to me.
      Do you have your own genotypes, and you want to merge them with some open access data?
      Or do you just want to select a specific subset of the open access data you downloaded?
      Or do you mean something else?

  • @georgewanjala4605 3 years ago

    It looked like R software.

    • @GenomicsBootCamp 3 years ago +2

      The question is not clear to me. Do you mean the system() command in R? If so, it can be invoked directly from R itself.