RNA Sequencing - Building a FASTQ to BAM pipeline

Поділитися
Вставка
  • Опубліковано 12 вер 2024

КОМЕНТАРІ • 111

  • @DannyArends
    @DannyArends  5 місяців тому +5

    UPDATE APRIL 2024
    Thanks for the engagement, comments and feedback Due to updates to STAR and PICARD tools, two additional steps (git checkout) are required to get the versions used in the video. I have updated the "0_installSoftware" script to make sure the correct versions are used. Please let me know if you get stuck on any additional steps.

  • @jaypatankar
    @jaypatankar Рік тому +3

    Dear Danny! I am inspired by your teaching style. I loved the series and found it very very useful and easy to follow. A lot of times coding abilities come with an attitude which gets in the way when people teach. You are the best! Thank you for this series and waiting to learn more!

    • @DannyArends
      @DannyArends  Рік тому

      Thanks for the kind words, and leaving a message. It's good to hear you're enjoying the lectures.

  • @CHD.77
    @CHD.77 4 місяці тому +2

    I recently saw that you would be making an explanatory video to address some issues that have arisen with PICARD. Some time ago, I installed all the software and everything worked great with the analysis. Although I haven't used them again... I have a question: Will all the software I have installed still work, or might it encounter errors due to updates? I would like to know if you will be publishing the video from April 27 on any platform. Thank you.

    • @DannyArends
      @DannyArends  4 місяці тому

      I updated the code recently to check out specific versions of picard and STAR, the main issue was that updates to both tools changed their behavior. PICARD now requires BAM files to have ReadGroups. STAR changed their libC version making it not compatible with the linux used in the tutorial.
      A "git checkout xxxxx" command to revert back to the previous version was added to the script. It's always tricky to keep scripts short enough to fit in a

  • @augustinechukwunta680
    @augustinechukwunta680 7 місяців тому +1

    Thank you so much for the batch 2 of the tutorial - it's super helpful

  • @testforall555
    @testforall555 Рік тому +2

    Thank you very much for these lectures/tutorials. They are precious to me. I am biologist who struggle to comprehend a lot from was presented here, especially the R codes. I replicate your steps on WSL and I get stuck at IGV, an error came with "no X11 etc.". I spend ages and installed putty and tried to forward and at the end I gave up and will try dual booting or virtual Linux like you. (Also, another error came when using bgzip that it is not known although i installed it and tried t on another ubuntu and it works, very weird. (By the way, sratoolkit on ubuntu, like what you told us to download has different path: it does not have etc or usr, just directly to the folder.)) Oh, the people who are trying to imitate bioinformatician need to know enormous amount of information (plus and pardon me in this, the true bioinformatician themselves seem complicate the matter and making so different versions and steps). I will try to set up account on twitch to follow you also there. Looking forward for the next lecture, IN SHAA ALLAH. Thanks a lot and best wishes. Mohamed.

    • @DannyArends
      @DannyArends  Рік тому +1

      Hi Mohamed,
      Thanks for your comments,
      X11 is the window manager for Linux which is unfortunately not available for WSL2 (which is command line only), fortunately you can install it under windows, and look at the BAM file in windows.
      bgzip is part of htslib, make sure you make a symbolic link (using ln) to add it to the ~/bin/ folder, so that you can execute it without having to type the full path ~/Software/htslib/bgzip
      And yes, bioinformatics is difficult because you have to know both the biology, and the computer science. I am always amazed by how many things can go wrong when installing software, and how complex it has become in the last couple of years.
      Danny

    • @testforall555
      @testforall555 Рік тому +1

      @@DannyArends Thank you. Mohamed

  • @nomawethumasina5269
    @nomawethumasina5269 2 місяці тому +1

    Good day, thank you for the video, very informative. I just want to know if i can use the vitual machine for human genome? I will be checking for gene expression levels between two groups and i figured that is a large data set. I will appreciate your guidance in this. thanks

    • @DannyArends
      @DannyArends  2 місяці тому

      For a human genome it would be advisable to run it in Linux, or even better on a university high performance compute cluster. The data size is going to be orders of magnitude more than the example S. Cerevisiae RNA sequencing dataset.

  • @histephenson007
    @histephenson007 Рік тому +1

    Thank you very much .. These videos are very helpful.
    btw which video games r u playing? ;)

    • @DannyArends
      @DannyArends  Рік тому +1

      I mostly play strategy/builder games, such as RimWorld, Dwarf Fortress, and I love doing multiplayer project zomboid

  • @sMr_Borgov
    @sMr_Borgov 6 місяців тому +1

    Hi! Thanks for the video, very instructive!! I have a quick question: Does Picard automatically detect and delete unwanted duplicates from the bam file? or is there a risk to loose biological data information when performing this step?

    • @DannyArends
      @DannyArends  6 місяців тому +1

      Picard can do removal or flagging, removal produces a BAM without the duplicates while flagging keeps the reads in the file but puts a duplicate flag in the information field so downstream tools know yhis read is highly likely to come from a sequencing artifact.
      Any preprocessing step can theoretically remove real biology. However, due to how short read sequencing preps shear dna (usually through mechanical force), it's very unlikely to have many reads with exactly the same start and end position. When using e.g. enzymatic digestion the chances of this are higher and the possibility of removing real biology goes up.

    • @sMr_Borgov
      @sMr_Borgov 6 місяців тому +1

      @@DannyArends Thanks! this is helpful! what would be the interpretation if after deduplication with Pcard the fold coverage decreases significantly (say from 40 to 10), would this be a source of concern? I have seen in the video how this affect specific chromosomes but I am wondering about the file as a whole. Thanks again!

    • @DannyArends
      @DannyArends  6 місяців тому +1

      I've never seen it drop so dramatically, but I guess a 4 fold reduction in read coverage globally would indeed point to an issue during library prep, which could cause significant differences in the results/interpretation downstream. Depending on if you're doing A) genome seq for SNP hunting this might cause "fake" SNP to appear as a result of PCR introduced mutations, and B) in RNA seq might lead to drop out of low expressed genes, or significantly less power to detect differential expression.

    • @sMr_Borgov
      @sMr_Borgov 6 місяців тому +1

      Thanks! That sounds reasonable. I am in the second category so it is probably wise to skip the Picard step (or else do both and compare the outcomes)@@DannyArends

    • @DannyArends
      @DannyArends  6 місяців тому +1

      Do not skip the step, run it with and without and look (using the IGV) at a couple highly expressed genes to figure out if reads seem equally distributed across the whole length of the gene. If you see a tower of 100s of reads at a single position all with the same start and end position is a clear indicator of a PCR artifacts, not real biology.

  • @Ice84letters
    @Ice84letters Рік тому +1

    I am reading a document called "RNA-Seq workflow: gene-level exploratory analysis and differential expression" and i am very confused about the workflow. They say that transcript abundance quantification methods like Kallisto, Salmon to estimate abundances without aligning reads, followed by tximport package are better because skip the generation of large files...in this video in what moment and with what tool do we do this? i guess i still do not understand all the workflow and all the tools that can be used. Thanks a lot

    • @DannyArends
      @DannyArends  Рік тому +1

      The workflow explained/showcased here is an alignment guided analysis and goes into detail which steps are taken in such a pipeline. Alignment guided helps to do other things besides RNA expression, like look for SNPs and look at alternative splicing. This cannot be done using alignment free methods.
      The pipeline is written from scratch, as such the goal of it is to NOT use an established pipeline. It's not a "build a pipeline in 15 minutes" by just using an established pipeline wrapped up in a single executable. It's a learning/educational introduction which spends 9 hours going into detail how RNA sequencing works to teach people how much steps are involved in the process.

  • @laurynwinter9086
    @laurynwinter9086 Рік тому +1

    Thank you so much for this informative series! I have a question: What if we cannot find the VCF file on ensemble? Do we skip those parts? Is there somewhere else to look? My organism is Papio anubis. Thank you!

    • @DannyArends
      @DannyArends  Рік тому +1

      I think you mean the gff/gtf file, since vcf is the output of SNP calling. In case your organism doesn't have an annotated genome RNA sequencing becomes a lot more challenging. You can still align reads, but you won't have the information on where a gene/it's exons start and end.
      You can use computational tools to generate an first annotation based on homology, or If you have RNA sequencing data on multiple tissues you can generate genome sequence annotation using the ensembl annotation pipeline or other tools.
      We're currently doing something similar for fruit bats, but this is not a trivial exercise.

    • @DannyArends
      @DannyArends  16 днів тому

      Yes, if you don't have a curated database of known SNPs it's not possible to do SNP/Indel realignment.

  • @freezingtolerance7493
    @freezingtolerance7493 Рік тому +1

    Thanks for your video.. just quick question... among trimming methods, there is also called ngsshort package except for trimmomatic... do you think I can also use ngsshort rather than trimmomatics for rna-seq data?

    • @DannyArends
      @DannyArends  Рік тому

      Yep, any trimmer will do, this is the reason pipelines are more or less modular, you can easily swap out one tool for another one. Do make sure that besides read trimming based on quality it will also trim sequencing adapters (just in case they're still on the read data)

    • @freezingtolerance7493
      @freezingtolerance7493 Рік тому

      @@DannyArends based on my coverage data, meanmapq is almost 55~60 across samples, but meandepth is 27~39. Du you think that I need to picard to remove duplication?

  • @vinayaraj100
    @vinayaraj100 10 місяців тому +1

    Thank you very much for these lectures/tutorials. Upon executing bgzip command the following error msg was shown
    > cmd

    • @DannyArends
      @DannyArends  10 місяців тому

      The bgzip version you're using is too old, 1.13 does not support -k (keep the original file) (1). Version 1.14 and above supports the option according to the documentation (2). Upgrading your htslib and bgzip should fix the error.
      (1) www.htslib.org/doc/1.13/bgzip.html
      (2) www.htslib.org/doc/1.14/bgzip.html

  • @conlosjuguetesdemihijo6310
    @conlosjuguetesdemihijo6310 Рік тому +2

    HI Danny. help me
    "I can't find the index file gtf.gz in the Ensembl FTP files. Where can I download it from? My organism is Capsicum annuum."

    • @DannyArends
      @DannyArends  Рік тому +2

      Hey @conlosjuguetesdemihijo6310 unfortunately that species has no genome or transcriptome information available on Ensembl.
      You probably look for a gtf file from where you got the reference genome.
      Google brings me to the following page: www.ncbi.nlm.nih.gov/genome?term=AYRZ00000000&cmd=DetailsSearch
      Here I see there is a reference genome available: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/878/395/GCF_002878395.1_UCD10Xv1.1/GCF_002878395.1_UCD10Xv1.1_genomic.fna.gz
      as well as the transcriptome (in GFF format) at: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/878/395/GCF_002878395.1_UCD10Xv1.1/GCF_002878395.1_UCD10Xv1.1_genomic.gff.gz
      You can convert the GFF to GTF using the AGAT tool: agat.readthedocs.io/en/latest/gff_to_gtf.html

    • @conlosjuguetesdemihijo6310
      @conlosjuguetesdemihijo6310 Рік тому

      @@DannyArends thank you so much

  • @JaskaranSingh-om5mv
    @JaskaranSingh-om5mv Рік тому +1

    I am using a UTM virtual Ubuntu on my M2 MacBook Air, can I use the softwares directly on MacBook terminal? Because the Ubuntu is taking a lot of space on my system 😢

    • @DannyArends
      @DannyArends  Рік тому

      In theory (and practice) yes, you should be able to get it working on a Mac, There might be some issues with java versions and/or installation of tools via homebrew instead of apt. Good luck !

  • @zahraamasrawi7266
    @zahraamasrawi7266 Рік тому +1

    Hi Danny,I have no words to thank you for this incredible lessons.I
    have a question how the code will be if Iam using data from a company.the file is tar.gz.sorry for the silly question Iam a beginner.thank you again

    • @DannyArends
      @DannyArends  Рік тому

      Just extract the raw data from the compressed file, you can extract a tar.gz file by using:
      tar -xvzf filename.tar.gz
      Just change filename.tar.gz to how your file is called and this command will extract it for you.

    • @zahraamasrawi7266
      @zahraamasrawi7266 Рік тому

      Thank you ❤❤.I'll try.

    • @zahraamasrawi7266
      @zahraamasrawi7266 Рік тому

      Hello Danny,I have another issue ,I couldn't solve for awhile,when using the code for trimmomatic . Unable to access java file error came out.I am using wsl2.and my working directory is network drive.thank you for your help

    • @DannyArends
      @DannyArends  Рік тому

      If wsl2 is unable to find java, you're going to have to install it under wsl2, see for example this thread on stackoverflow: stackoverflow.com/questions/63866813/what-is-the-proper-way-of-using-jdk-on-wsl2-on-windows-10
      Good luck

  • @farrkf
    @farrkf 5 місяців тому +1

    Hi Danny, thank you again for the second series of the tutorial!
    I followed your tutorial step-by-step but I encountered an error in generating the index using STAR:
    terminate called after throwing an instance of 'std::bad_alloc'
    what(): std::bad_alloc
    Aborted
    My input was similar to yours except that I changed the --genomeSAindexNbases to 14 as I'm working with the human sample:
    STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ~/genome/STAR --genomeSAindexNbases 14 --genomeFastaFiles ~/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile ~/genome/Homo_sapiens.GRCh38.111.gtf
    Correct me if I'm wrong but based of my googling, it has to do with RAM? The setting of the VM was exactly the same with your 1st tutorial.
    Your reply is highly appreciated, Danny! Thanks

    • @DannyArends
      @DannyArends  5 місяців тому +1

      Yes, this is an out of RAM/HDD space error. The human genome is 162 times bigger than the yeast genome. You'll have to adjust the RAM to 64 Gb minimum and allocate 250 times more disk space since all files will be bigger (genome, vcf, index, etc)

    • @farrkf
      @farrkf 5 місяців тому +1

      @@DannyArends Noted. Will change the RAM size accordingly. Thank you! Cheers

  • @juliangrandvallet5359
    @juliangrandvallet5359 Рік тому +1

    Minute 29:40, the file that tabix read is still compressed?

    • @DannyArends
      @DannyArends  Рік тому

      Yes, tabix is able to read uncompressed fasta, as well as bgzip compressed fasta files.

  • @danieljairenriquezvera3340
    @danieljairenriquezvera3340 Рік тому +1

    Thank you for your tutorial, There is any way to make the Index genome in star with less than 8Gb RAM (All i assigned to virtual box)?

    • @DannyArends
      @DannyArends  Рік тому +1

      Depending on your organism, there might already be an index available e.g. Human, Mouse, Rat, and other model organisms have an index available for download. However with only 8 Gb of RAM you'll run into very slow alignments during the next steps, since it'll use your HDD space to swap out RAM during alignment.

    • @danieljairenriquezvera3340
      @danieljairenriquezvera3340 Рік тому +1

      @@DannyArends thank you so much. Also, How can i get the vcf snp reference file for human genome?

    • @DannyArends
      @DannyArends  Рік тому +1

      You can get the known SNPs in humans (in vcf format) from the Ensembl ftp site: www.ensembl.org/info/data/ftp/index.html

    • @danieljairenriquezvera3340
      @danieljairenriquezvera3340 Рік тому

      @@DannyArends thank you again

  • @VikashKumar-jz8hw
    @VikashKumar-jz8hw 8 місяців тому

    How to handle if we do not know the exact adapter sequences?

    • @DannyArends
      @DannyArends  8 місяців тому

      It almost never occurs that the researcher analyzing the data does not know the adapter sequences. However, In case you don't you can use fastQC www.bioinformatics.babraham.ac.uk/projects/fastqc/ to figure out which adapter sequences are present.

    • @VikashKumar-jz8hw
      @VikashKumar-jz8hw 8 місяців тому

      @@DannyArends Thanks for your quick response! Actually I am stuck with demulltiplexing, sorry for wrong question. Do you have any suggestion regarding matching the indexes if somehow it is not yielding data, particularly for 6bp indexes?

    • @DannyArends
      @DannyArends  8 місяців тому

      Oww, sorry for not seeing your follow up question,
      I would first look to see if those 6bp indices are in your fastq file. Just dump the first 1000 reads or so, and look at them in a text editor to figure out what is going on (e.g. is there an adapter before the 6bp, it might be that the adapter trimming step removes part of the index). You can also grep on the 6bp index that you expect to be in the data and then dump the reads containing these indices in separate files.
      Normally demultiplexing is done after the chastity filter and before doing any adapter trimming

  • @solomonantonio8817
    @solomonantonio8817 Рік тому +1

    Thanks for the great video. When I try to run the script as one it stops after Step 4.1 and does not start Step 5. Just stays with a plus sign like if something is missing from the code.

    • @DannyArends
      @DannyArends  Рік тому +1

      The + sign is usually a sign of forgetting to close a curly bracket { } or an unclosed string " ", try looking up at where the > turns into a plus, often the unclosed curly bracket/string it close by.
      To test whether it's a } or a ” type one of them in R and look which one generates an error, since an unclosed { cannot be closed by a ", and vise versa.

    • @solomonantonio8817
      @solomonantonio8817 Рік тому +1

      @@DannyArends i can simply copy the rest of the code and run it and it works. So no idea why there is a random break when I try to copy the entire script and run it in one go.

    • @DannyArends
      @DannyArends  Рік тому +1

      Interesting, perhaps you're exceeding your maximum clipboard size on virtual box or set by your operating system (this could be causing issues on large scripts). Does it work when you save it and run it from the command line using Rscript?

    • @solomonantonio8817
      @solomonantonio8817 Рік тому +1

      @@DannyArends I’m not using virtual box. I installed Ubuntu to boot alongside Windows.
      I haven’t tried that as yet. Still a bit new to R and all these codes and commands. Not sure how to set up the cmdlineargs like you have in your video.

    • @DannyArends
      @DannyArends  Рік тому +1

      That will come in the next video or later on. Easiest is to just copy paste it in 2 parts. I'll see if I can replicate the weird behavior you're experiencing Tuesday when I have access to my Ubuntu workstation at work.

  • @dariocosemans8326
    @dariocosemans8326 5 місяців тому +1

    I want to do RNAseq data analysis using the human genome. When reading the following lines in, I get a failure error:
    # Generate genome/transcriptome index using STAR
    STAR --runThreadN 2 --runMode genomeGenerate \
    --genomeDir ~/genome/STAR \
    --genomeSAindexNbases 10 \
    --sjdbGTFfile ~/genome/Homo_sapiens.GRCh38.111.gtf \
    --genomeFastaFiles ~/genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
    STAR version: 2.7.11b compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
    Apr 06 00:23:00 ..... started STAR run
    Apr 06 00:23:00 ... starting to generate Genome files
    terminate called after throwing an instance of 'std::bad_alloc'
    what(): std::bad_alloc
    Afgebroken
    What is going wrong?

    • @DannyArends
      @DannyArends  5 місяців тому

      Bad_alloc means not enough memory, the human genome is much bigger (~150 times) compared to the yeast genome. You're going to have to allocate a lot more RAM (and HDD space) to the virtual machine to create the index, I think the recommendation is ~64 Gb for a human genome (128Gb for some larger genomes like grain or unions). Normally people use pre-built indexes for large genomes (e.g. the index is built on HPC infrastructure), and then use the pre-built index on their local machine. During alignment the memory depends not on the genome size but on the length of the reads, so you'll need less memory during the alignment phase.

    • @dariocosemans8326
      @dariocosemans8326 5 місяців тому

      @@DannyArends Okay, thank you! Can I download these pre-built indices from the internet? Does this change any steps from the video?

    • @dariocosemans8326
      @dariocosemans8326 5 місяців тому

      The site from the person that wrote STAR is not accessible for me

    • @DannyArends
      @DannyArends  5 місяців тому

      You could in theory if you find one matching your genome version.

    • @dariocosemans8326
      @dariocosemans8326 5 місяців тому

      @@DannyArends Do you know where to look for it because I have no idea and when looking I always end up with people that got the same problem as me but no solution. Is it maybe a good idea to use HISAT2 instead of STAR? I'm new to this and just thinking a loud

  • @darkjoker7256
    @darkjoker7256 Рік тому +1

    Hi I'm trying to install IGV on windows subsystem linux but its not working Can anyone help me out

    • @DannyArends
      @DannyArends  Рік тому

      Hey there, my advice would be to install IGV for windows directly (not under wsl2) and then load in the bam files, which you can access from the network path \\wsl$\

    • @darkjoker7256
      @darkjoker7256 Рік тому +1

      @@DannyArends ok sir I'll try that and get back to you

  • @kwakuenoch823
    @kwakuenoch823 Рік тому +1

    Thank you for this useful resource. I am stuck somewhere getting to the middle of the project, I hope you might be able to help me.
    Upon executing this command, I get bgzip not found, but I successfully installed all the software and linked them into a bin folder.
    # Compress the fasta file using bgzip (keep original)
    cmd

    • @DannyArends
      @DannyArends  Рік тому +1

      "Command not found" is a path problem, usually through forgetting a symbolic link. Update you $PATH variable to contain the bgzip install directory. Or add a symlink in /home//bin to the bgzip executable, and add that to your $PATH variable

    • @kwakuenoch823
      @kwakuenoch823 Рік тому +1

      @@DannyArends Thank you for the quick response.

    • @kwakuenoch823
      @kwakuenoch823 Рік тому +1

      @@DannyArends I have found the reason for the problem and I thought I should let you know about it. The reason I was getting the bgzip: not found error was because everytime I was going into R, I typed sudo R. Just typing R and hitting enter resolved the problem.

    • @DannyArends
      @DannyArends  Рік тому +1

      Yes, this is a major difference since using sudo you run R as the administrator account which is a different user account altogether.

    • @kwakuenoch823
      @kwakuenoch823 Рік тому +1

      @@DannyArends now I know. Grateful sir!

  • @rahulgopalam6479
    @rahulgopalam6479 Рік тому +1

    Hello, I need your advice again.
    EXITING because of INPUT ERROR: could not open genomeFastaFile: /home/rahul/genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa
    I'm getting this message when I try to generate the index using STAR.

    • @DannyArends
      @DannyArends  Рік тому

      Hi, does the file exist in that folder? The file is generated using the R script, so make sure it's located there, either by setting your working directory there in R (using setwd) or by copying it into the folder afterwards.

    • @rahulgopalam6479
      @rahulgopalam6479 Рік тому +1

      @@DannyArends Thanks for the information. I downloaded it in a different file. It's working fine now.

    • @DannyArends
      @DannyArends  Рік тому

      Good to hear it's working 👍