RNA Sequencing - Building a FASTQ to BAM pipeline

Danny Arends

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 30 січ 2025

КОМЕНТАРІ • 112

@DannyArends 9 місяців тому ⁺⁵
UPDATE APRIL 2024
Thanks for the engagement, comments and feedback Due to updates to STAR and PICARD tools, two additional steps (git checkout) are required to get the versions used in the video. I have updated the "0_installSoftware" script to make sure the correct versions are used. Please let me know if you get stuck on any additional steps.
@jaypatankar Рік тому ⁺³
Dear Danny! I am inspired by your teaching style. I loved the series and found it very very useful and easy to follow. A lot of times coding abilities come with an attitude which gets in the way when people teach. You are the best! Thank you for this series and waiting to learn more!
@DannyArends Рік тому
Thanks for the kind words, and leaving a message. It's good to hear you're enjoying the lectures.
@augustinechukwunta680 11 місяців тому ⁺¹
Thank you so much for the batch 2 of the tutorial - it's super helpful
@DannyArends 11 місяців тому
You're very welcome!
@testforall555 2 роки тому ⁺²
Thank you very much for these lectures/tutorials. They are precious to me. I am biologist who struggle to comprehend a lot from was presented here, especially the R codes. I replicate your steps on WSL and I get stuck at IGV, an error came with "no X11 etc.". I spend ages and installed putty and tried to forward and at the end I gave up and will try dual booting or virtual Linux like you. (Also, another error came when using bgzip that it is not known although i installed it and tried t on another ubuntu and it works, very weird. (By the way, sratoolkit on ubuntu, like what you told us to download has different path: it does not have etc or usr, just directly to the folder.)) Oh, the people who are trying to imitate bioinformatician need to know enormous amount of information (plus and pardon me in this, the true bioinformatician themselves seem complicate the matter and making so different versions and steps). I will try to set up account on twitch to follow you also there. Looking forward for the next lecture, IN SHAA ALLAH. Thanks a lot and best wishes. Mohamed.
@DannyArends 2 роки тому ⁺¹
Hi Mohamed,
Thanks for your comments,
X11 is the window manager for Linux which is unfortunately not available for WSL2 (which is command line only), fortunately you can install it under windows, and look at the BAM file in windows.
bgzip is part of htslib, make sure you make a symbolic link (using ln) to add it to the ~/bin/ folder, so that you can execute it without having to type the full path ~/Software/htslib/bgzip
And yes, bioinformatics is difficult because you have to know both the biology, and the computer science. I am always amazed by how many things can go wrong when installing software, and how complex it has become in the last couple of years.
Danny
@testforall555 2 роки тому ⁺¹
@@DannyArends Thank you. Mohamed
@Ice84letters Рік тому ⁺¹
I am reading a document called "RNA-Seq workflow: gene-level exploratory analysis and differential expression" and i am very confused about the workflow. They say that transcript abundance quantification methods like Kallisto, Salmon to estimate abundances without aligning reads, followed by tximport package are better because skip the generation of large files...in this video in what moment and with what tool do we do this? i guess i still do not understand all the workflow and all the tools that can be used. Thanks a lot
@DannyArends Рік тому ⁺¹
The workflow explained/showcased here is an alignment guided analysis and goes into detail which steps are taken in such a pipeline. Alignment guided helps to do other things besides RNA expression, like look for SNPs and look at alternative splicing. This cannot be done using alignment free methods.
The pipeline is written from scratch, as such the goal of it is to NOT use an established pipeline. It's not a "build a pipeline in 15 minutes" by just using an established pipeline wrapped up in a single executable. It's a learning/educational introduction which spends 9 hours going into detail how RNA sequencing works to teach people how much steps are involved in the process.
@CHD.77 9 місяців тому ⁺²
I recently saw that you would be making an explanatory video to address some issues that have arisen with PICARD. Some time ago, I installed all the software and everything worked great with the analysis. Although I haven't used them again... I have a question: Will all the software I have installed still work, or might it encounter errors due to updates? I would like to know if you will be publishing the video from April 27 on any platform. Thank you.
@DannyArends 9 місяців тому
I updated the code recently to check out specific versions of picard and STAR, the main issue was that updates to both tools changed their behavior. PICARD now requires BAM files to have ReadGroups. STAR changed their libC version making it not compatible with the linux used in the tutorial.
A "git checkout xxxxx" command to revert back to the previous version was added to the script. It's always tricky to keep scripts short enough to fit in a
@nomawethumasina5269 6 місяців тому ⁺¹
Good day, thank you for the video, very informative. I just want to know if i can use the vitual machine for human genome? I will be checking for gene expression levels between two groups and i figured that is a large data set. I will appreciate your guidance in this. thanks
@DannyArends 6 місяців тому
For a human genome it would be advisable to run it in Linux, or even better on a university high performance compute cluster. The data size is going to be orders of magnitude more than the example S. Cerevisiae RNA sequencing dataset.
@dariocosemans8326 10 місяців тому ⁺¹
I want to do RNAseq data analysis using the human genome. When reading the following lines in, I get a failure error:
# Generate genome/transcriptome index using STAR
STAR --runThreadN 2 --runMode genomeGenerate \
--genomeDir ~/genome/STAR \
--genomeSAindexNbases 10 \
--sjdbGTFfile ~/genome/Homo_sapiens.GRCh38.111.gtf \
--genomeFastaFiles ~/genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
STAR version: 2.7.11b compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Apr 06 00:23:00 ..... started STAR run
Apr 06 00:23:00 ... starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Afgebroken
What is going wrong?
@DannyArends 10 місяців тому
Bad_alloc means not enough memory, the human genome is much bigger (~150 times) compared to the yeast genome. You're going to have to allocate a lot more RAM (and HDD space) to the virtual machine to create the index, I think the recommendation is ~64 Gb for a human genome (128Gb for some larger genomes like grain or unions). Normally people use pre-built indexes for large genomes (e.g. the index is built on HPC infrastructure), and then use the pre-built index on their local machine. During alignment the memory depends not on the genome size but on the length of the reads, so you'll need less memory during the alignment phase.
@dariocosemans8326 10 місяців тому
@@DannyArends Okay, thank you! Can I download these pre-built indices from the internet? Does this change any steps from the video?
@dariocosemans8326 10 місяців тому
The site from the person that wrote STAR is not accessible for me
@DannyArends 10 місяців тому
You could in theory if you find one matching your genome version.
@dariocosemans8326 10 місяців тому
@@DannyArends Do you know where to look for it because I have no idea and when looking I always end up with people that got the same problem as me but no solution. Is it maybe a good idea to use HISAT2 instead of STAR? I'm new to this and just thinking a loud
@vinayaraj100 Рік тому ⁺¹
Thank you very much for these lectures/tutorials. Upon executing bgzip command the following error msg was shown
> cmd
@DannyArends Рік тому
The bgzip version you're using is too old, 1.13 does not support -k (keep the original file) (1). Version 1.14 and above supports the option according to the documentation (2). Upgrading your htslib and bgzip should fix the error.
(1) www.htslib.org/doc/1.13/bgzip.html
(2) www.htslib.org/doc/1.14/bgzip.html
@conlosjuguetesdemihijo6310 Рік тому ⁺²
HI Danny. help me
"I can't find the index file gtf.gz in the Ensembl FTP files. Where can I download it from? My organism is Capsicum annuum."
@DannyArends Рік тому ⁺²
Hey @conlosjuguetesdemihijo6310 unfortunately that species has no genome or transcriptome information available on Ensembl.
You probably look for a gtf file from where you got the reference genome.
Google brings me to the following page: www.ncbi.nlm.nih.gov/genome?term=AYRZ00000000&cmd=DetailsSearch
Here I see there is a reference genome available: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/878/395/GCF_002878395.1_UCD10Xv1.1/GCF_002878395.1_UCD10Xv1.1_genomic.fna.gz
as well as the transcriptome (in GFF format) at: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/878/395/GCF_002878395.1_UCD10Xv1.1/GCF_002878395.1_UCD10Xv1.1_genomic.gff.gz
You can convert the GFF to GTF using the AGAT tool: agat.readthedocs.io/en/latest/gff_to_gtf.html
@conlosjuguetesdemihijo6310 Рік тому
@@DannyArends thank you so much
@JaskaranSingh-om5mv Рік тому ⁺¹
I am using a UTM virtual Ubuntu on my M2 MacBook Air, can I use the softwares directly on MacBook terminal? Because the Ubuntu is taking a lot of space on my system 😢
@DannyArends Рік тому
In theory (and practice) yes, you should be able to get it working on a Mac, There might be some issues with java versions and/or installation of tools via homebrew instead of apt. Good luck !
@sMr_Borgov 11 місяців тому ⁺¹
Hi! Thanks for the video, very instructive!! I have a quick question: Does Picard automatically detect and delete unwanted duplicates from the bam file? or is there a risk to loose biological data information when performing this step?
@DannyArends 11 місяців тому ⁺¹
Picard can do removal or flagging, removal produces a BAM without the duplicates while flagging keeps the reads in the file but puts a duplicate flag in the information field so downstream tools know yhis read is highly likely to come from a sequencing artifact.
Any preprocessing step can theoretically remove real biology. However, due to how short read sequencing preps shear dna (usually through mechanical force), it's very unlikely to have many reads with exactly the same start and end position. When using e.g. enzymatic digestion the chances of this are higher and the possibility of removing real biology goes up.
@sMr_Borgov 11 місяців тому ⁺¹
@@DannyArends Thanks! this is helpful! what would be the interpretation if after deduplication with Pcard the fold coverage decreases significantly (say from 40 to 10), would this be a source of concern? I have seen in the video how this affect specific chromosomes but I am wondering about the file as a whole. Thanks again!
@DannyArends 11 місяців тому ⁺¹
I've never seen it drop so dramatically, but I guess a 4 fold reduction in read coverage globally would indeed point to an issue during library prep, which could cause significant differences in the results/interpretation downstream. Depending on if you're doing A) genome seq for SNP hunting this might cause "fake" SNP to appear as a result of PCR introduced mutations, and B) in RNA seq might lead to drop out of low expressed genes, or significantly less power to detect differential expression.
@sMr_Borgov 11 місяців тому ⁺¹
Thanks! That sounds reasonable. I am in the second category so it is probably wise to skip the Picard step (or else do both and compare the outcomes)@@DannyArends
@DannyArends 11 місяців тому ⁺¹
Do not skip the step, run it with and without and look (using the IGV) at a couple highly expressed genes to figure out if reads seem equally distributed across the whole length of the gene. If you see a tower of 100s of reads at a single position all with the same start and end position is a clear indicator of a PCR artifacts, not real biology.
@histephenson007 Рік тому ⁺¹
Thank you very much .. These videos are very helpful.
btw which video games r u playing? ;)
@DannyArends Рік тому ⁺¹
I mostly play strategy/builder games, such as RimWorld, Dwarf Fortress, and I love doing multiplayer project zomboid
@WintryLaurel Рік тому ⁺¹
Thank you so much for this informative series! I have a question: What if we cannot find the VCF file on ensemble? Do we skip those parts? Is there somewhere else to look? My organism is Papio anubis. Thank you!
@DannyArends Рік тому ⁺¹
I think you mean the gff/gtf file, since vcf is the output of SNP calling. In case your organism doesn't have an annotated genome RNA sequencing becomes a lot more challenging. You can still align reads, but you won't have the information on where a gene/it's exons start and end.
You can use computational tools to generate an first annotation based on homology, or If you have RNA sequencing data on multiple tissues you can generate genome sequence annotation using the ensembl annotation pipeline or other tools.
We're currently doing something similar for fruit bats, but this is not a trivial exercise.
@DannyArends 5 місяців тому
Yes, if you don't have a curated database of known SNPs it's not possible to do SNP/Indel realignment.
@juliangrandvallet5359 Рік тому ⁺¹
Minute 29:40, the file that tabix read is still compressed?
@DannyArends Рік тому
Yes, tabix is able to read uncompressed fasta, as well as bgzip compressed fasta files.
@farrkf 10 місяців тому ⁺¹
Hi Danny, thank you again for the second series of the tutorial!
I followed your tutorial step-by-step but I encountered an error in generating the index using STAR:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
My input was similar to yours except that I changed the --genomeSAindexNbases to 14 as I'm working with the human sample:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ~/genome/STAR --genomeSAindexNbases 14 --genomeFastaFiles ~/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile ~/genome/Homo_sapiens.GRCh38.111.gtf
Correct me if I'm wrong but based of my googling, it has to do with RAM? The setting of the VM was exactly the same with your 1st tutorial.
Your reply is highly appreciated, Danny! Thanks
@DannyArends 10 місяців тому ⁺¹
Yes, this is an out of RAM/HDD space error. The human genome is 162 times bigger than the yeast genome. You'll have to adjust the RAM to 64 Gb minimum and allocate 250 times more disk space since all files will be bigger (genome, vcf, index, etc)
@farrkf 10 місяців тому ⁺¹
@@DannyArends Noted. Will change the RAM size accordingly. Thank you! Cheers
@freezingtolerance7493 Рік тому ⁺¹
Thanks for your video.. just quick question... among trimming methods, there is also called ngsshort package except for trimmomatic... do you think I can also use ngsshort rather than trimmomatics for rna-seq data?
@DannyArends Рік тому
Yep, any trimmer will do, this is the reason pipelines are more or less modular, you can easily swap out one tool for another one. Do make sure that besides read trimming based on quality it will also trim sequencing adapters (just in case they're still on the read data)
@freezingtolerance7493 Рік тому
@@DannyArends based on my coverage data, meanmapq is almost 55~60 across samples, but meandepth is 27~39. Du you think that I need to picard to remove duplication?
@solomonantonio8817 Рік тому ⁺¹
Thanks for the great video. When I try to run the script as one it stops after Step 4.1 and does not start Step 5. Just stays with a plus sign like if something is missing from the code.
@DannyArends Рік тому ⁺¹
The + sign is usually a sign of forgetting to close a curly bracket { } or an unclosed string " ", try looking up at where the > turns into a plus, often the unclosed curly bracket/string it close by.
To test whether it's a } or a ” type one of them in R and look which one generates an error, since an unclosed { cannot be closed by a ", and vise versa.
@solomonantonio8817 Рік тому ⁺¹
@@DannyArends i can simply copy the rest of the code and run it and it works. So no idea why there is a random break when I try to copy the entire script and run it in one go.
@DannyArends Рік тому ⁺¹
Interesting, perhaps you're exceeding your maximum clipboard size on virtual box or set by your operating system (this could be causing issues on large scripts). Does it work when you save it and run it from the command line using Rscript?
@solomonantonio8817 Рік тому ⁺¹
@@DannyArends I’m not using virtual box. I installed Ubuntu to boot alongside Windows.
I haven’t tried that as yet. Still a bit new to R and all these codes and commands. Not sure how to set up the cmdlineargs like you have in your video.
@DannyArends Рік тому ⁺¹
That will come in the next video or later on. Easiest is to just copy paste it in 2 parts. I'll see if I can replicate the weird behavior you're experiencing Tuesday when I have access to my Ubuntu workstation at work.
@danieljairenriquezvera3340 Рік тому ⁺¹
Thank you for your tutorial, There is any way to make the Index genome in star with less than 8Gb RAM (All i assigned to virtual box)?
@DannyArends Рік тому ⁺¹
Depending on your organism, there might already be an index available e.g. Human, Mouse, Rat, and other model organisms have an index available for download. However with only 8 Gb of RAM you'll run into very slow alignments during the next steps, since it'll use your HDD space to swap out RAM during alignment.
@danieljairenriquezvera3340 Рік тому ⁺¹
@@DannyArends thank you so much. Also, How can i get the vcf snp reference file for human genome?
@DannyArends Рік тому ⁺¹
You can get the known SNPs in humans (in vcf format) from the Ensembl ftp site: www.ensembl.org/info/data/ftp/index.html
@danieljairenriquezvera3340 Рік тому
@@DannyArends thank you again
@VikashKumar-jz8hw Рік тому
How to handle if we do not know the exact adapter sequences?
@DannyArends Рік тому
It almost never occurs that the researcher analyzing the data does not know the adapter sequences. However, In case you don't you can use fastQC www.bioinformatics.babraham.ac.uk/projects/fastqc/ to figure out which adapter sequences are present.
@VikashKumar-jz8hw Рік тому
@@DannyArends Thanks for your quick response! Actually I am stuck with demulltiplexing, sorry for wrong question. Do you have any suggestion regarding matching the indexes if somehow it is not yielding data, particularly for 6bp indexes?
@DannyArends Рік тому
Oww, sorry for not seeing your follow up question,
I would first look to see if those 6bp indices are in your fastq file. Just dump the first 1000 reads or so, and look at them in a text editor to figure out what is going on (e.g. is there an adapter before the 6bp, it might be that the adapter trimming step removes part of the index). You can also grep on the 6bp index that you expect to be in the data and then dump the reads containing these indices in separate files.
Normally demultiplexing is done after the chastity filter and before doing any adapter trimming
@darkjoker7256 Рік тому ⁺¹
Hi I'm trying to install IGV on windows subsystem linux but its not working Can anyone help me out
@DannyArends Рік тому ⁺¹
Hey there, my advice would be to install IGV for windows directly (not under wsl2) and then load in the bam files, which you can access from the network path \\wsl$\
@darkjoker7256 Рік тому ⁺¹
@@DannyArends ok sir I'll try that and get back to you
@zahraamasrawi7266 Рік тому ⁺¹
Hi Danny,I have no words to thank you for this incredible lessons.I
have a question how the code will be if Iam using data from a company.the file is tar.gz.sorry for the silly question Iam a beginner.thank you again
@DannyArends Рік тому
Just extract the raw data from the compressed file, you can extract a tar.gz file by using:
tar -xvzf filename.tar.gz
Just change filename.tar.gz to how your file is called and this command will extract it for you.
@zahraamasrawi7266 Рік тому
Thank you ❤❤.I'll try.
@zahraamasrawi7266 Рік тому
Hello Danny,I have another issue ,I couldn't solve for awhile,when using the code for trimmomatic . Unable to access java file error came out.I am using wsl2.and my working directory is network drive.thank you for your help
@DannyArends Рік тому
If wsl2 is unable to find java, you're going to have to install it under wsl2, see for example this thread on stackoverflow: stackoverflow.com/questions/63866813/what-is-the-proper-way-of-using-jdk-on-wsl2-on-windows-10
Good luck
@liutrvcyrsui 3 місяці тому ⁺¹
+1
@kwakuenoch823 2 роки тому ⁺¹
Thank you for this useful resource. I am stuck somewhere getting to the middle of the project, I hope you might be able to help me.
Upon executing this command, I get bgzip not found, but I successfully installed all the software and linked them into a bin folder.
# Compress the fasta file using bgzip (keep original)
cmd
@DannyArends 2 роки тому ⁺¹
"Command not found" is a path problem, usually through forgetting a symbolic link. Update you $PATH variable to contain the bgzip install directory. Or add a symlink in /home//bin to the bgzip executable, and add that to your $PATH variable
@kwakuenoch823 2 роки тому ⁺¹
@@DannyArends Thank you for the quick response.
@kwakuenoch823 2 роки тому ⁺¹
@@DannyArends I have found the reason for the problem and I thought I should let you know about it. The reason I was getting the bgzip: not found error was because everytime I was going into R, I typed sudo R. Just typing R and hitting enter resolved the problem.
@DannyArends 2 роки тому ⁺¹
Yes, this is a major difference since using sudo you run R as the administrator account which is a different user account altogether.
@kwakuenoch823 2 роки тому ⁺¹
@@DannyArends now I know. Grateful sir!
@rahulgopalam6479 2 роки тому ⁺¹
Hello, I need your advice again.
EXITING because of INPUT ERROR: could not open genomeFastaFile: /home/rahul/genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa
I'm getting this message when I try to generate the index using STAR.
@DannyArends 2 роки тому
Hi, does the file exist in that folder? The file is generated using the R script, so make sure it's located there, either by setting your working directory there in R (using setwd) or by copying it into the folder afterwards.
@rahulgopalam6479 2 роки тому ⁺¹
@@DannyArends Thanks for the information. I downloaded it in a different file. It's working fine now.
@DannyArends 2 роки тому
Good to hear it's working 👍

Наступне

Автоматичне відтворення

RNA Sequencing - Read counts, RPKM, Volcano plot, and a Heatmap