@@DannyArends Thanks for your quick response! Actually I am stuck with demultiplexing, sorry for the wrong question. Do you have any suggestions on matching the indexes if it is somehow not yielding data, particularly for 6bp indexes?
Oww, sorry for not seeing your follow-up question. I would first look to see if those 6bp indices are in your fastq file. Just dump the first 1000 reads or so and look at them in a text editor to figure out what is going on (e.g. is there an adapter before the 6bp? It might be that the adapter trimming step removes part of the index). You can also grep on the 6bp index that you expect to be in the data and then dump the reads containing these indices into separate files. Normally demultiplexing is done after the chastity filter and before doing any adapter trimming.
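A minimal sketch of that kind of spot check, using a toy FASTQ file (the read sequences and the 6bp index ACGTAC are made up for illustration; substitute your own file and index):

```shell
# Build a toy FASTQ with two reads; only the first starts with the (hypothetical) index ACGTAC
printf '@r1\nACGTACGGGTTTAAA\n+\nIIIIIIIIIIIIIII\n@r2\nTTTTGGGGCCCCAAA\n+\nIIIIIIIIIIIIIII\n' > toy.fastq

# Dump the first 1000 reads (4 lines per FASTQ record) for eyeballing in a text editor
head -n 4000 toy.fastq > first1000.fastq

# Keep only the sequence lines (line 2 of every 4-line record) and count reads starting with the index
awk 'NR % 4 == 2' toy.fastq | grep -c '^ACGTAC'    # prints: 1
```

For gzipped data, pipe through zcat first (e.g. `zcat reads.fastq.gz | head -n 4000`).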
Hey there, my advice would be to install IGV for Windows directly (not under WSL2) and then load in the BAM files, which you can access from the network path \\wsl$\
Hi Danny, I have no words to thank you for these incredible lessons. I have a question: what would the code be if I am using data from a company? The file is tar.gz. Sorry for the silly question, I am a beginner. Thank you again
Just extract the raw data from the compressed file. You can extract a tar.gz file by using: tar -xvzf filename.tar.gz Just change filename.tar.gz to the name of your file and this command will extract it for you.
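As a self-contained sketch (the archive and file names below are toy placeholders, built on the spot so the extraction step can be demonstrated end to end):

```shell
# Create a toy tar.gz archive to extract
mkdir -p data && echo "ACGT" > data/sample.fastq
tar -czf archive.tar.gz data
rm -r data

# x = extract, v = verbose, z = decompress gzip, f = read from the named file
tar -xvzf archive.tar.gz
cat data/sample.fastq    # prints: ACGT
```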
Hello Danny, I have another issue I couldn't solve for a while, when using the code for trimmomatic: an "unable to access java file" error came out. I am using WSL2 and my working directory is a network drive. Thank you for your help
If WSL2 is unable to find Java, you're going to have to install it under WSL2; see for example this thread on Stack Overflow: stackoverflow.com/questions/63866813/what-is-the-proper-way-of-using-jdk-on-wsl2-on-windows-10 Good luck
Thank you for this useful resource. I am stuck somewhere around the middle of the project; I hope you might be able to help me. Upon executing this command, I get "bgzip not found", but I successfully installed all the software and linked them into a bin folder. # Compress the fasta file using bgzip (keep original) cmd
"Command not found" is a path problem, usually caused by a forgotten symbolic link. Update your $PATH variable to contain the bgzip install directory, or add a symlink in ~/bin to the bgzip executable and add that folder to your $PATH variable.
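A minimal sketch, assuming bgzip was built under ~/Software/htslib as in the install script (adjust the path to your own setup):

```shell
# Link bgzip into ~/bin so it can be run without typing the full path
mkdir -p ~/bin
ln -sf ~/Software/htslib/bgzip ~/bin/bgzip

# Put ~/bin on the PATH for this session; add this line to ~/.bashrc to make it permanent
export PATH="$HOME/bin:$PATH"
command -v bgzip    # should now print the ~/bin location
```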
@@DannyArends I have found the reason for the problem and I thought I should let you know about it. The reason I was getting the "bgzip: not found" error was that every time I was going into R, I typed sudo R. Just typing R and hitting enter resolved the problem.
Hello, I need your advice again. EXITING because of INPUT ERROR: could not open genomeFastaFile: /home/rahul/genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa I'm getting this message when I try to generate the index using STAR.
Hi, does the file exist in that folder? The file is generated using the R script, so make sure it's located there, either by setting your working directory there in R (using setwd) or by copying it into the folder afterwards.
UPDATE APRIL 2024
Thanks for the engagement, comments, and feedback. Due to updates to the STAR and PICARD tools, two additional steps (git checkout) are required to get the versions used in the video. I have updated the "0_installSoftware" script to make sure the correct versions are used. Please let me know if you get stuck on any additional steps.
Dear Danny! I am inspired by your teaching style. I loved the series and found it very very useful and easy to follow. A lot of times coding abilities come with an attitude which gets in the way when people teach. You are the best! Thank you for this series and waiting to learn more!
Thanks for the kind words, and leaving a message. It's good to hear you're enjoying the lectures.
Thank you so much for the batch 2 of the tutorial - it's super helpful
You're very welcome!
Thank you very much for these lectures/tutorials. They are precious to me. I am a biologist who struggles to comprehend a lot of what was presented here, especially the R code. I replicated your steps on WSL and got stuck at IGV; an error came up with "no X11 etc.". I spent ages, installed PuTTY, and tried to forward X11, and in the end I gave up and will try dual booting or a virtual Linux like you. (Also, another error came up when using bgzip: it is not known, although I installed it; I tried it on another Ubuntu and it works, very weird. By the way, sratoolkit on Ubuntu, like the one you told us to download, has a different path: it does not have etc or usr, just directly the folder.) Oh, the people who are trying to become bioinformaticians need to know an enormous amount of information (plus, and pardon me in this, the true bioinformaticians themselves seem to complicate the matter, making so many different versions and steps). I will try to set up an account on Twitch to follow you there as well. Looking forward to the next lecture, IN SHAA ALLAH. Thanks a lot and best wishes. Mohamed.
Hi Mohamed,
Thanks for your comments,
X11 is the windowing system for Linux, which is unfortunately not available under WSL2 (which is command line only). Fortunately, you can install IGV under Windows and look at the BAM file there.
bgzip is part of htslib; make sure you make a symbolic link (using ln -s) to add it to the ~/bin/ folder, so that you can execute it without having to type the full path ~/Software/htslib/bgzip
And yes, bioinformatics is difficult because you have to know both the biology, and the computer science. I am always amazed by how many things can go wrong when installing software, and how complex it has become in the last couple of years.
Danny
@@DannyArends Thank you. Mohamed
I am reading a document called "RNA-Seq workflow: gene-level exploratory analysis and differential expression" and I am very confused about the workflow. They say that transcript abundance quantification methods like Kallisto and Salmon, which estimate abundances without aligning reads, followed by the tximport package, are better because they skip the generation of large files... In this video, at what moment and with what tool do we do this? I guess I still do not understand the whole workflow and all the tools that can be used. Thanks a lot
The workflow explained/showcased here is an alignment-guided analysis, and the series goes into detail on which steps are taken in such a pipeline. Alignment-guided analysis lets you do other things besides quantifying RNA expression, like looking for SNPs and at alternative splicing. This cannot be done using alignment-free methods.
The pipeline is written from scratch; as such, the goal is explicitly NOT to use an established pipeline. It's not a "build a pipeline in 15 minutes" video that just uses an established pipeline wrapped up in a single executable. It's a learning/educational introduction which spends 9 hours going into detail on how RNA sequencing works, to teach people how many steps are involved in the process.
I recently saw that you would be making an explanatory video to address some issues that have arisen with PICARD. Some time ago, I installed all the software and everything worked great with the analysis. Although I haven't used them again... I have a question: Will all the software I have installed still work, or might it encounter errors due to updates? I would like to know if you will be publishing the video from April 27 on any platform. Thank you.
I updated the code recently to check out specific versions of Picard and STAR; the main issue was that updates to both tools changed their behavior. Picard now requires BAM files to have ReadGroups. STAR changed their libc version, making it incompatible with the Linux version used in the tutorial.
A "git checkout xxxxx" command to revert back to the previous version was added to the script. It's always tricky to keep scripts short enough to fit in a single video.
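The idea of pinning a version with git checkout can be demonstrated in a self-contained way (the toy repo and tag name v1.0 below are illustrative; the actual script checks out specific commits of STAR and Picard):

```shell
# Build a toy repository with two versions, then revert to the tagged one
git init -q toy && cd toy
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "v1"
git tag v1.0
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "v2 (behavior change)"

# Detach HEAD at the tag: the working tree now matches the exact pinned code
git checkout -q v1.0
git describe --tags    # prints: v1.0
```

Checking out a tag or commit hash instead of the moving main branch is what keeps a tutorial reproducible after upstream changes.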
Good day, thank you for the video, very informative. I just want to know if I can use the virtual machine for the human genome? I will be checking gene expression levels between two groups and I figured that is a large data set. I will appreciate your guidance on this. Thanks
For a human genome it would be advisable to run it on Linux, or even better on a university high-performance compute cluster. The data size is going to be orders of magnitude larger than the example S. cerevisiae RNA sequencing dataset.
I want to do RNAseq data analysis using the human genome. When reading the following lines in, I get a failure error:
# Generate genome/transcriptome index using STAR
STAR --runThreadN 2 --runMode genomeGenerate \
--genomeDir ~/genome/STAR \
--genomeSAindexNbases 10 \
--sjdbGTFfile ~/genome/Homo_sapiens.GRCh38.111.gtf \
--genomeFastaFiles ~/genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
STAR version: 2.7.11b compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Apr 06 00:23:00 ..... started STAR run
Apr 06 00:23:00 ... starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Afgebroken ("Aborted" in Dutch)
What is going wrong?
Bad_alloc means not enough memory; the human genome is much bigger (~150 times) compared to the yeast genome. You're going to have to allocate a lot more RAM (and HDD space) to the virtual machine to create the index. I think the recommendation is ~64 Gb for a human genome (128 Gb for some larger genomes like grains or onions). Normally people use pre-built indexes for large genomes (e.g. the index is built on HPC infrastructure), and then use the pre-built index on their local machine. During alignment the memory depends not on the genome size but on the length of the reads, so you'll need less memory during the alignment phase.
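A quick way to check whether the VM actually has enough memory before starting an index build (this reads /proc/meminfo, so it is Linux-only; compare the result against the ~64 Gb figure above):

```shell
# Report available memory in GB; /proc/meminfo values are in kB
awk '/MemAvailable/ {printf "%.1f GB available\n", $2 / 1024 / 1024}' /proc/meminfo
```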
@@DannyArends Okay, thank you! Can I download these pre-built indices from the internet? Does this change any steps from the video?
The site from the person that wrote STAR is not accessible for me
You could in theory if you find one matching your genome version.
@@DannyArends Do you know where to look for it? Because I have no idea, and when looking I always end up with people that have the same problem as me but no solution. Is it maybe a good idea to use HISAT2 instead of STAR? I'm new to this and just thinking aloud
Thank you very much for these lectures/tutorials. Upon executing the bgzip command, the following error message was shown
> cmd
The bgzip version you're using is too old, 1.13 does not support -k (keep the original file) (1). Version 1.14 and above supports the option according to the documentation (2). Upgrading your htslib and bgzip should fix the error.
(1) www.htslib.org/doc/1.13/bgzip.html
(2) www.htslib.org/doc/1.14/bgzip.html
Hi Danny, help me:
"I can't find the index file gtf.gz in the Ensembl FTP files. Where can I download it from? My organism is Capsicum annuum."
Hey @conlosjuguetesdemihijo6310 unfortunately that species has no genome or transcriptome information available on Ensembl.
You should probably look for a GTF file from wherever you got the reference genome.
Google brings me to the following page: www.ncbi.nlm.nih.gov/genome?term=AYRZ00000000&cmd=DetailsSearch
Here I see there is a reference genome available: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/878/395/GCF_002878395.1_UCD10Xv1.1/GCF_002878395.1_UCD10Xv1.1_genomic.fna.gz
as well as the transcriptome (in GFF format) at: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/878/395/GCF_002878395.1_UCD10Xv1.1/GCF_002878395.1_UCD10Xv1.1_genomic.gff.gz
You can convert the GFF to GTF using the AGAT tool: agat.readthedocs.io/en/latest/gff_to_gtf.html
@@DannyArends thank you so much
I am using a UTM virtual Ubuntu on my M2 MacBook Air, can I use the softwares directly on MacBook terminal? Because the Ubuntu is taking a lot of space on my system 😢
In theory (and practice) yes, you should be able to get it working on a Mac. There might be some issues with Java versions and/or installation of tools via homebrew instead of apt. Good luck!
Hi! Thanks for the video, very instructive!! I have a quick question: does Picard automatically detect and delete unwanted duplicates from the BAM file? Or is there a risk of losing biological information when performing this step?
Picard can do removal or flagging. Removal produces a BAM without the duplicates, while flagging keeps the reads in the file but puts a duplicate flag in the information field, so downstream tools know this read is highly likely to come from a sequencing artifact.
Any preprocessing step can theoretically remove real biology. However, due to how short-read sequencing preps shear DNA (usually through mechanical force), it's very unlikely to have many reads with exactly the same start and end position. When using e.g. enzymatic digestion the chances of this are higher, and the possibility of removing real biology goes up.
@@DannyArends Thanks! This is helpful! What would be the interpretation if after deduplication with Picard the fold coverage decreases significantly (say from 40 to 10)? Would this be a source of concern? I have seen in the video how this affects specific chromosomes, but I am wondering about the file as a whole. Thanks again!
I've never seen it drop so dramatically, but I guess a fourfold reduction in read coverage globally would indeed point to an issue during library prep, which could cause significant differences in the results/interpretation downstream. Depending on what you're doing: A) in genome sequencing for SNP hunting this might cause "fake" SNPs to appear as a result of PCR-introduced mutations, and B) in RNA-seq it might lead to drop-out of lowly expressed genes, or significantly less power to detect differential expression.
Thanks! That sounds reasonable. I am in the second category so it is probably wise to skip the Picard step (or else do both and compare the outcomes)@@DannyArends
Do not skip the step; run it with and without, and look (using IGV) at a couple of highly expressed genes to figure out whether reads seem equally distributed across the whole length of the gene. If you see a tower of 100s of reads at a single position, all with the same start and end position, that is a clear indicator of a PCR artifact, not real biology.
Thank you very much .. These videos are very helpful.
btw which video games r u playing? ;)
I mostly play strategy/builder games, such as RimWorld, Dwarf Fortress, and I love doing multiplayer project zomboid
Thank you so much for this informative series! I have a question: what if we cannot find the VCF file on Ensembl? Do we skip those parts? Is there somewhere else to look? My organism is Papio anubis. Thank you!
I think you mean the gff/gtf file, since vcf is the output of SNP calling. In case your organism doesn't have an annotated genome, RNA sequencing becomes a lot more challenging. You can still align reads, but you won't have the information on where a gene/its exons start and end.
You can use computational tools to generate a first annotation based on homology, or, if you have RNA sequencing data on multiple tissues, you can generate a genome sequence annotation using the Ensembl annotation pipeline or other tools.
We're currently doing something similar for fruit bats, but this is not a trivial exercise.
Yes, if you don't have a curated database of known SNPs it's not possible to do SNP/Indel realignment.
At minute 29:40, the file that tabix reads is still compressed?
Yes, tabix is able to read uncompressed fasta, as well as bgzip compressed fasta files.
Hi Danny, thank you again for the second series of the tutorial!
I followed your tutorial step-by-step but I encountered an error in generating the index using STAR:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
My input was similar to yours except that I changed --genomeSAindexNbases to 14, as I'm working with a human sample:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ~/genome/STAR --genomeSAindexNbases 14 --genomeFastaFiles ~/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile ~/genome/Homo_sapiens.GRCh38.111.gtf
Correct me if I'm wrong, but based on my googling, it has to do with RAM? The settings of the VM were exactly the same as in your 1st tutorial.
Your reply is highly appreciated, Danny! Thanks
Yes, this is an out of RAM/HDD space error. The human genome is 162 times bigger than the yeast genome. You'll have to adjust the RAM to 64 Gb minimum and allocate 250 times more disk space since all files will be bigger (genome, vcf, index, etc)
@@DannyArends Noted. Will change the RAM size accordingly. Thank you! Cheers
Thanks for your video.. just a quick question... among trimming methods, there is also a package called ngsShoRT besides Trimmomatic... do you think I can use ngsShoRT rather than Trimmomatic for RNA-seq data?
Yep, any trimmer will do; this is the reason pipelines are more or less modular, you can easily swap out one tool for another. Do make sure that besides read trimming based on quality, it will also trim sequencing adapters (just in case they're still in the read data).
@@DannyArends Based on my coverage data, meanmapq is almost 55~60 across samples, but meandepth is 27~39. Do you think that I need Picard to remove duplicates?
Thanks for the great video. When I try to run the script as one it stops after Step 4.1 and does not start Step 5. It just stays with a plus sign, as if something is missing from the code.
The + sign is usually a sign of an unclosed curly bracket { } or an unclosed string " ". Try looking up at where the > turns into a +; the unclosed curly bracket/string is often close by.
To test whether it's a } or a " that is missing, type one of them in R and see which one generates an error, since an unclosed { cannot be closed by a ", and vice versa.
@@DannyArends I can simply copy the rest of the code, run it, and it works. So no idea why there is a random break when I try to copy the entire script and run it in one go.
Interesting, perhaps you're exceeding your maximum clipboard size on virtual box or set by your operating system (this could be causing issues on large scripts). Does it work when you save it and run it from the command line using Rscript?
@@DannyArends I’m not using virtual box. I installed Ubuntu to boot alongside Windows.
I haven’t tried that as yet. Still a bit new to R and all these codes and commands. Not sure how to set up the cmdlineargs like you have in your video.
That will come in the next video or later on. Easiest is to just copy paste it in 2 parts. I'll see if I can replicate the weird behavior you're experiencing Tuesday when I have access to my Ubuntu workstation at work.
Thank you for your tutorial. Is there any way to build the genome index in STAR with less than 8 Gb of RAM (all I could assign to VirtualBox)?
Depending on your organism, there might already be an index available e.g. Human, Mouse, Rat, and other model organisms have an index available for download. However with only 8 Gb of RAM you'll run into very slow alignments during the next steps, since it'll use your HDD space to swap out RAM during alignment.
@@DannyArends Thank you so much. Also, how can I get the VCF SNP reference file for the human genome?
You can get the known SNPs in humans (in vcf format) from the Ensembl ftp site: www.ensembl.org/info/data/ftp/index.html
@@DannyArends thank you again
How to handle if we do not know the exact adapter sequences?
It almost never occurs that the researcher analyzing the data does not know the adapter sequences. However, in case you don't, you can use FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/) to figure out which adapter sequences are present.
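A quick command-line alternative to FastQC for spotting one specific adapter is to grep the sequence lines for the common Illumina adapter prefix AGATCGGAAGAGC (the fastq below is made-up demo data):

```shell
cd "$(mktemp -d)"
# Two made-up reads; only read1 carries the Illumina adapter prefix:
cat > demo.fastq <<'EOF'
@read1
ACGTACGTAGATCGGAAGAGCACAC
+
IIIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTACGTA
+
IIIIIIIIIIIIIIIIIIIIIIIII
EOF
# Count sequence lines (line 2 of every 4-line record) containing the adapter:
awk 'NR%4==2' demo.fastq | grep -c AGATCGGAAGAGC   # prints 1
```

If a large fraction of your reads still contains that sequence, the adapters were not trimmed.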
@@DannyArends Thanks for your quick response! Actually I am stuck with demultiplexing, sorry for the wrong question. Do you have any suggestions for matching the indexes if somehow it is not yielding data, particularly for 6bp indexes?
Oww, sorry for not seeing your follow-up question.
I would first look to see if those 6bp indices are in your fastq file. Just dump the first 1000 reads or so and look at them in a text editor to figure out what is going on (e.g. is there an adapter before the 6bp index? It might be that the adapter trimming step removes part of the index). You can also grep for the 6bp index that you expect to be in the data and then dump the reads containing these indices into separate files.
Normally demultiplexing is done after the chastity filter and before doing any adapter trimming
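The grep idea above can be sketched like this, using a made-up 6bp index ACGTAC and a tiny demo fastq. awk keeps each whole 4-line record together, which plain grep would not:

```shell
cd "$(mktemp -d)"
# Two made-up reads; only read1 starts with the (hypothetical) index ACGTAC:
cat > mixed.fastq <<'EOF'
@read1
ACGTACTTTTGGGGCCCC
+
IIIIIIIIIIIIIIIIII
@read2
GGGGGGTTTTGGGGCCCC
+
IIIIIIIIIIIIIIIIII
EOF
# Keep the full 4-line record of every read whose sequence starts with ACGTAC:
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (s ~ /^ACGTAC/) print h "\n" s "\n" p "\n" $0}' mixed.fastq > ACGTAC.fastq
wc -l < ACGTAC.fastq   # 4 lines: one read survived
```

Repeat with each expected index (or loop over a list of indices) to see which ones actually occur in the data.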
Hi, I'm trying to install IGV on Windows Subsystem for Linux but it's not working. Can anyone help me out?
Hey there, my advice would be to install IGV for windows directly (not under wsl2) and then load in the bam files, which you can access from the network path \\wsl$\
@@DannyArends ok sir I'll try that and get back to you
Hi Danny, I have no words to thank you for these incredible lessons. I have a question: how would the code look if I am using data from a company? The file is a tar.gz. Sorry for the silly question, I am a beginner. Thank you again.
Just extract the raw data from the compressed file, you can extract a tar.gz file by using:
tar -xvzf filename.tar.gz
Just change filename.tar.gz to your file's name and this command will extract it for you.
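If you want to see what's inside the archive before unpacking it, you can list it with -t first. A self-contained sketch with a throw-away demo archive (the filenames are hypothetical):

```shell
cd "$(mktemp -d)"
mkdir rawdata && echo "@read1" > rawdata/sample1.fastq   # fake raw data
tar -czf sequencing.tar.gz rawdata                       # build a demo archive
tar -tzf sequencing.tar.gz    # -t lists the contents without extracting
tar -xvzf sequencing.tar.gz   # x=extract, v=verbose, z=gunzip, f=archive file
```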
Thank you ❤❤.I'll try.
Hello Danny, I have another issue I couldn't solve for a while when using the command for Trimmomatic: an "unable to access java file" error came up. I am using WSL2 and my working directory is a network drive. Thank you for your help.
If wsl2 is unable to find java, you're going to have to install it under wsl2, see for example this thread on stackoverflow: stackoverflow.com/questions/63866813/what-is-the-proper-way-of-using-jdk-on-wsl2-on-windows-10
Good luck
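A quick way to check whether Java is visible at all under WSL2 (the apt package name in the message below is one common choice, not the only one):

```shell
# Is a Java runtime on the PATH?
if command -v java >/dev/null 2>&1; then
  java -version          # prints the installed Java version to stderr
else
  echo "java not found - install one, e.g.: sudo apt install default-jre"
fi
```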
+1
Thank you for this useful resource. I am stuck somewhere around the middle of the project; I hope you might be able to help me.
Upon executing this command, I get "bgzip not found", even though I successfully installed all the software and linked it into a bin folder.
# Compress the fasta file using bgzip (keep original)
cmd
"Command not found" is a path problem, usually through forgetting a symbolic link. Update you $PATH variable to contain the bgzip install directory. Or add a symlink in /home//bin to the bgzip executable, and add that to your $PATH variable
@@DannyArends Thank you for the quick response.
@@DannyArends I have found the reason for the problem and thought I should let you know about it. The reason I was getting the "bgzip: not found" error was that every time I went into R, I typed sudo R. Just typing R and hitting enter resolved the problem.
Yes, this makes a major difference, since with sudo you run R as the administrator account, which is a different user account altogether.
@@DannyArends now I know. Grateful sir!
Hello, I need your advice again.
EXITING because of INPUT ERROR: could not open genomeFastaFile: /home/rahul/genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa
I'm getting this message when I try to generate the index using STAR.
Hi, does the file exist in that folder? The file is generated using the R script, so make sure it's located there, either by setting your working directory there in R (using setwd) or by copying it into the folder afterwards.
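A one-line check to see whether the file is actually where STAR will look (the path is taken from the error message above):

```shell
# Does the fasta exist where STAR expects it?
FASTA="$HOME/genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa"
if [ -f "$FASTA" ]; then
  ls -lh "$FASTA"          # show size and date as a sanity check
else
  echo "Missing: $FASTA"   # copy or regenerate the file into ~/genome/
fi
```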
@@DannyArends Thanks for the information. I had downloaded it into a different folder. It's working fine now.
Good to hear it's working 👍