RNA Sequencing - Setup and Prerequisites

Danny Arends

Published 7 Feb 2025

COMMENTS • 114

@DannyArends 9 months ago ⁺⁴
UPDATE APRIL 2024
Thanks for the engagement, comments and feedback. Due to updates to the STAR and PICARD tools, two additional steps (git checkout) are required to get the versions used in the video. I have updated the "0_installSoftware" script to make sure the correct versions are used. Please let me know if you get stuck on any additional steps.
@BoominGame 9 months ago
From the bottom of my heart, after 30 years of informatics, the sratoolkit cache setup was the most excruciating thing I have ever done. I urgently need a drink now.
@jaypatankar 1 year ago ⁺³
Thanks Danny! Learning at 40 was never easier thanks to your videos!
@DannyArends 1 year ago
Wow, thanks so much. Good to hear you found the series useful and informative.
@john2451 11 months ago ⁺¹
AWESOME video... Thank you so much for taking the time to demonstrate all of this from scratch -- that was **a lot** of work, and it was invaluable to see this process performed in real time. I will be watching Part 2 tomorrow.
@DannyArends 11 months ago ⁺¹
Thanks for leaving a comment. It is indeed a lot of effort; all in all it turned out to be ~8 hours of streaming across 3 sessions. However, if you're doing something you love, it fortunately doesn't feel like work.
@augustinechukwunta680 1 year ago ⁺¹
Thank you Danny for going in-depth with the RNA-seq tutorial - so detailed and easy to replicate for a beginner. This is super helpful.
@DannyArends 1 year ago
Glad it was helpful! Thanks for leaving a comment.
@DannyArends 2 years ago ⁺²
Updated the description with the link to the code on GitHub, and the presentation in PDF on OneDrive:
Code on GitHub: gist.github.com/DannyArends/04d87f5590090dfe0dc6b42e5e1bbe15
Presentation on OneDrive: 1drv.ms/b/s!AtYWSYRMmSHZh4gFsR1904Y-Cce04Q?e=Q8dtRl
@BioinformaticsBase 2 years ago ⁺¹
Thank you for this great video. Newcastle is my hometown, hope you are enjoying it there!
@DannyArends 2 years ago ⁺¹
I'm still discovering new things every day, but from what I've seen I think I'm going to enjoy living here.
@tortora 4 months ago
Great video. Thank you. I really appreciate this walkthrough.
@DannyArends 4 months ago
Glad it was helpful!
@histephenson007 1 year ago ⁺¹
Thank you from the bottom of my heart 🙂
@DannyArends 1 year ago ⁺¹
You're welcome, glad that you're enjoying the content!
@dr.mvhieu 1 year ago ⁺¹
Thank you Danny for your excellent guide. Looking forward to your new lectures =))
@DannyArends 1 year ago
Thanks for leaving a comment :)
@soheilbehravesh3114 5 months ago ⁺¹
Thank you very much Dr. Arends for the tutorials. Just wanted to add something for Ubuntu users, because the folders that are created in Ubuntu are somewhat different from those in CentOS Linux.
In Ubuntu, the path for "./vdb-config --interactive" or "./fasterq-dump" will be similar to this: "/software/sratoolkit/sratoolkit.3.1.1-ubuntu64/bin".
@DannyArends 5 months ago
Thanks for the info, every Linux flavor is slightly different indeed.
@soheilbehravesh3114 5 months ago ⁺¹
@DannyArends True, Dr. Arends. Thank you for providing the chance to learn and share our experience.
@sam3929 2 years ago ⁺²
Thank you for uploading this! Your tutorials are really helpful :)
@DannyArends 2 years ago ⁺¹
Thanks, happy you liked it. In the next one we'll start aligning some sequences; I thought it would be good to show the whole process.
@guihuajia7696 1 year ago ⁺¹
Great course sir, thank you.
@DannyArends 1 year ago
Glad you like it, thanks for leaving a comment!
@TheMagodana 2 years ago ⁺¹
I appreciate your work. I am new to RNA-seq and I am finding it very interesting.
Thank you so much... +Sub
@DannyArends 2 years ago
Awesome, thank you! Good to hear you're enjoying the lectures.
@testforall555 2 years ago ⁺¹
Excellent, excellent, excellent. Thank you, a zillion times. As always, a very instructive and educational style for beginners (I am a biologist who loves programming and coding). Looking forward to tomorrow's session, IN SHAA ALLAH. I have so many (naïve) questions. The first one: is this session (and future ones) for everyone? (Can I replicate it for my students? Also, I wish that one day I will be able to publish a paper on RNA-seq.) I ask this because I am under the impression that you direct these videos at your own students. Pardon me for my ignorance and good luck. Mohamed
@testforall555 2 years ago ⁺¹
Edit:
... Can I replicate it for my students ... Of course, with all the credits to you and your channel and links.
@DannyArends 2 years ago ⁺²
I put the lectures online so everyone can learn from them. I think education should be broadly available to everyone. For this lecture I just start from the very basics, setting up the tools needed for RNA-seq. Tomorrow we'll have part 2, where we'll start building a pipeline for RNA-seq read alignment.
@DannyArends 2 years ago
Of course, feel free to use and resample; credits would be highly appreciated.
@testforall555 2 years ago ⁺¹
@DannyArends Thank you very much for your kind response. What makes you stand out from others is that you explain the command lines. I have watched a lot of YouTube videos (not every one, but many) and you are among the very few who explain what each command means. I do not ask for so many details, because that would be impossible for a public video, but a balance between the two is favored. In addition, people from a biology background are mostly lost in the Linux environment, with so many errors happening (apart from typos).
@DannyArends 2 years ago
Thanks, I try to be as thorough and complete as possible. It's why I avoid blindly using packages like dplyr and such, and tend to focus on teaching people to use for and while loops in R. When someone understands the basics on a fundamental level, more advanced manipulation statements come easier. The same holds for the command line.
@sami9138 1 year ago ⁺¹
Thank you so much, sir.
@thisisanas5164 2 years ago ⁺¹
I appreciate this initiative of yours; it's great. By the way, can you please specify the minimum desktop or laptop configuration, in terms of RAM and processor, needed to perform the RNA-seq analysis standalone? Thanks in advance.
@DannyArends 2 years ago
This depends on what you are sequencing (mostly the size of the genome). For bacteria an i5 with 4 GB RAM would be enough. If you're doing humans, an i7 with 32 GB RAM will be needed to do a handful of samples in a reasonable amount of time. For 100s of samples an HPC cluster is needed so you can distribute jobs to many machines.
@thisisanas5164 2 years ago ⁺¹
@DannyArends Heartfelt thanks for your informative reply. I would like to do RNA-seq analysis on the genomes of various plant species, especially soyabean, common bean, and jute. Is an i5 with 16 GB RAM considered good enough for performing RNA-seq analysis on these crop genomes standalone?
@DannyArends 2 years ago
Should work, but it'll take some time to run the analysis since you'll probably only be able to do one sample at a time.
@thisisanas5164 2 years ago ⁺¹
@DannyArends Thanks from the core of my heart for your enlightening reply.
@aberakenea2528 4 months ago
Thanks! It is a very insightful video, but how can I follow your live sessions online at the time you record them? I am an MSc in Bioinformatics and am interested in following along live.
@DannyArends 4 months ago
If you are subscribed to the channel, you'll be informed about upcoming live streams. Generally I post the stream announcement ~1 week before the actual stream takes place, so people can plan to attend.
@testforall555 2 years ago ⁺¹
Would you please explain the following: when you start installing gatk (at 1:29 hr), you said you prefer to compile it yourself, but due to its size and the time needed you would download and extract it instead. So, what is the difference between the two methods? Thanks in advance. Mohamed
@DannyArends 2 years ago ⁺¹
When compiling it from source, you can more easily update it: just a simple git update (e.g. git pull) followed by a gradle command. The added bonus is 1) you don't have to check the website to see if there is an update, and 2) you have access to the source code when an error occurs, which helps because the documentation online can lag behind.
@testforall555 2 years ago ⁺¹
@DannyArends Thank you very much. Mohamed
@vondhanaramesh4365 5 months ago ⁺¹
Sorry for the disturbance. The link that you have provided for Debian is for 12.6.0, but what you used in the video is 11.5.0. Can you please provide the link for 11.5.0?
@DannyArends 5 months ago
No bother. Yeah, it seems a newer version was released. You can always get the older versions from the archives; a direct link to the 11.5.0 netinst image: cdimage.debian.org/mirror/cdimage/archive/11.5.0/amd64/iso-cd/debian-11.5.0-amd64-netinst.iso
@vondhanaramesh4365 5 months ago ⁺¹
@DannyArends Thanks a ton.
@trickibaba8386 1 year ago ⁺¹
While testing the file I am getting the error "Error: Invalid or corrupt jarfile gatk-4.4.0.0/gatk".
How do I resolve this?
@DannyArends 1 year ago
What is your full command? It seems you're calling java on the folder, not the .jar file. If you are calling it on the .jar file and the error persists, redownload gatk and extract it again; corruption can sometimes occur during download.
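The two causes mentioned in this reply (pointing java at something that is not the .jar file, or a damaged download) can be told apart before running java at all. The following is a minimal Python sketch, not part of the video or the install script. It relies only on the fact that a .jar file is a zip archive; the function name diagnose_jar and its return labels are made up for illustration.

```python
import zipfile
from pathlib import Path


def diagnose_jar(path_str):
    """Return a short label describing a path that is about to be passed to `java -jar`.

    Labels (hypothetical, chosen for this sketch):
      "missing"    - nothing exists at that path
      "directory"  - the path is a folder, not a file
      "not_a_jar"  - a file exists but is not a readable zip archive
                     (wrong file, or a corrupted/incomplete download)
      "ok"         - the file is a readable zip archive
    """
    path = Path(path_str)
    if not path.exists():
        return "missing"
    if path.is_dir():
        return "directory"
    # A .jar is a zip archive, so the standard zip check is enough here
    if not zipfile.is_zipfile(path):
        return "not_a_jar"
    return "ok"


# Example call (the path is only an illustration, adjust it to your own setup):
# print(diagnose_jar("gatk-4.4.0.0/gatk"))
```

An "ok" result only means the archive is readable; it does not prove the jar is the right tool or version.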
@vondhanaramesh4365 5 months ago ⁺¹
Hi Danny, I have 16 GB of RAM in my laptop. Will I be able to do RNA-seq?
@DannyArends 5 months ago
For smaller data sets and genomes, 16 GB will be enough (e.g. yeast, bacteria, bees, some plants). For mouse or human, 16 GB is probably not going to be enough, and 32 / 64 GB is going to be the minimum.
@dbgPwjd 1 year ago ⁺¹
Thank you so much for the wonderful video! I am trying to do this in WSL2, but as I am using a network drive, it is a bit hard to follow the steps... I found out that creating soft links is not allowed on an SMB-connected drive, and WSL2 is very slow when writing to the mounted drives. Would this be critical in the further steps? Thank you in advance!
@DannyArends 1 year ago ⁺¹
I haven't tried this in WSL2, mostly because I dual-boot to Linux to do bioinformatics-related analysis. In theory you could run the whole analysis pipeline in Windows itself, since all tools are available for Windows as well. You could go the WSL2 route *probably*, but it might need some tweaks or workarounds. Even then, like VirtualBox, the performance will not be anywhere near what's needed for real analysis. So, all in all, it's easiest to follow along on Linux/VirtualBox.
I chose VirtualBox for this since my streaming setup is Windows based and installing WSL2 needs a reboot, which breaks the stream. So I decided VirtualBox was the easiest way to do a stream like this.
@dbgPwjd 1 year ago ⁺¹
@DannyArends I need to analyse an actual dataset in the future, so I'll try again with dual-boot and a hard drive. I really appreciate your response!
@dariocosemans8326 10 months ago ⁺¹
I get an error after the "make" for STAR: STAR.cpp:52:45: error: 'parametersDefault' was not declared in this scope, and also STAR.cpp:53:20: error: 'parametersDefault_len' was not declared in this scope. How can I get around this?
@DannyArends 10 months ago ⁺¹
You're going to have to use an older version of the STAR aligner. I've had several reports now that mentioned STAR not compiling. I think it's due to them changing their build based on a newer version of Linux.
So two options:
1) try installing a newer Linux version
2) grab an older binary version of STAR and use that. (Another comment on here mentions the version that still works.)
I'll see if I can figure out what the issue is and make another video with the solution when I do.
@DannyArends 10 months ago ⁺¹
This is the comment I was referring to:
"Seems like the master branch is currently "broken", the quickest solution is to just download the binary distribution from the release page. The latest compiled version for Linux is: github.com/alexdobin/STAR/releases/download/2.7.10a_alpha_220818/STAR_2.7.10a_alpha_220818_Linux_x86_64_static.zip
Just unzip it and put the STAR binary file in your ~/bin folder"
@JessilynGao 1 year ago ⁺¹
Hello! This video is super helpful for a beginner, but I failed to start VirtualBox. The computer runs Windows 7, by the way, and the Debian is 32-bit instead of 64-bit. Is there any way I can avoid this problem?
@DannyArends 1 year ago
VirtualBox runs fine on Windows 7. You do need to install a 64-bit version of Debian, otherwise you're not going to be able to run the tools. 32-bit OS versions are not suitable for large files.
@JessilynGao 1 year ago ⁺¹
@DannyArends Thank you so much for the prompt reply! I will try on my Mac to see how it goes then!
@DannyArends 1 year ago
Good luck!
@DrSisuPark 1 year ago ⁺¹
Hi Prof. Danny, thank you for this excellent video. I have an issue with the update to the bash file: although I added the code at the end of the file, I'm not able to execute the commands. For example, when I execute "STAR" I get "Command 'STAR' not found, but can be installed with:
sudo apt install rna-star".
I tried this after conda deactivate. The command works inside the conda environment, but not outside it.
@DannyArends 1 year ago
Thanks for the compliment. Things with $PATH settings get quite complicated when conda is involved, since it takes over the whole environment. Feel free to send me an email with a copy of your .bashrc file so I can take a look at it.
@thisisanas5164 2 years ago ⁺¹
Hello Professor, knocking on your door with another question. I have upgraded my i5 laptop's RAM from 8 GB to 16 GB, and I am wondering what I should do: dual boot, VirtualBox, WSL in Windows, or standalone Linux for performing RNA-seq analysis on some plant genomes? I am using Windows 10 now. Which option is preferable? Thanks in advance.
@DannyArends 1 year ago
I'd probably go for WSL on Windows 10 for convenience and reasonable performance. Dual boot is nice when you have the HDD space for it (sequencing data is big), and VirtualBox just has too little performance for real genome sizes.
@thisisanas5164 1 year ago ⁺¹
@DannyArends Heartfelt thanks for your prompt response, Professor. And sorry to bother you again. If I go for WSL in Windows 10, will I have my full 16 GB of RAM available for RNA-seq data analysis? My laptop has a 1 TB HDD. Is that enough for my laptop to handle dual booting efficiently?
@DannyArends 1 year ago
WSL allows for full memory usage.
@thisisanas5164 1 year ago ⁺¹
@DannyArends Tons of thanks to you, Professor.
@vondhanaramesh4365 5 months ago
What should I do if the compilation of Trimmomatic did not work?
@DannyArends 5 months ago
In that case just download Trimmomatic v0.39 from here: www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip and extract it. Make sure to update the script to reflect that you're using 0.39, not 0.40-rc1.
@vondhanaramesh4365 5 months ago ⁺¹
@DannyArends Thank you so much. Also, the VirtualBox version you used in the tutorial and the one in the PDF are different. Is that fine?
@DannyArends 5 months ago
The version of VirtualBox should not matter; the important part is to use the same Debian version.
@LiDong-vz6gd 2 years ago ⁺¹
Thank you for your video. When I installed Trimmomatic, the PICARD tools, etc., and then tested whether they were installed, it always showed something like "Unable to access jarfile picard-2.27.5-SNAPSHOT-all.jar". I use an M1 Mac and installed Debian bullseye. How do I fix this problem? Thank you in advance!
@DannyArends 2 years ago
Hi there, if you use the file browser, can you see the picard jar file in the folder? The "unable to access" error generally means it cannot find the file that you're telling it to execute. So make sure you're in the right location and can see the file using ls. Alternatively, you can give the full path to the file: java -jar /home/username/Software/picard/picard.jar
@LiDong-vz6gd 2 years ago ⁺¹
Thank you very much for your quick reply!
@testforall555 2 years ago ⁺¹
Also, sorry again for the continuous bothering, but would it be possible to make videos on topics like 23andMe (or similar services), exome analysis and microarray analysis (or is there a plan for any of these)? If not, that is OK, just making another "bothering naïve" suggestion.
@DannyArends 2 years ago
Some of these topics are covered in the bioinformatics lecture series here on my channel. But I'm always open to suggestions.
@testforall555 2 years ago ⁺¹
Hi Danny, sorry again. I am just adding this comment because it may help someone, or maybe my situation is unusual. In the step of making links, the ln step does not work for tabix and fasterq-dump. (Again, one of the pains for biologists learning Linux.) Anyway, I googled and I guess I found the solution: add f to s, so the command is "ln -sf path". Thanks. Mohamed (I forgot to mention that I am on Ubuntu, dual boot. Also, I think the code for tabix3 is not on GitHub.)
@DannyArends 2 years ago ⁺¹
The f (force) should only be needed when you're linking on top of an already existing file, link, or folder. It's not recommended to just overwrite what was already there, especially since it's relatively common to switch the from and to sides of the command. Perhaps you had tabix/fasterq-dump already linked, and the f was needed to overwrite the existing link?
@testforall555 2 years ago ⁺¹
@DannyArends Thanks a lot. I am not a pro in Linux, but I do understand what you wrote. I tried the normal steps shown in your YouTube videos (sessions #1 and #2), and when I do ls (from within the bin folder), it shows everything in green except fasterq-dump and tabix, which appear in red. When I browse to the folder containing tabix and fasterq-dump, they only work when I type "./tabix". This seems weird. They are there, but the ln command within bin is not recognizing them (I am on Ubuntu 22.10). So I searched for a solution, and that is what I found. I am very sorry if my answer is irrelevant or has nothing to do with your kind answer. But my conclusion is that I'd prefer to use Debian and follow exactly what you do, and that Ubuntu may not be good for some bioinformatics tools. Thanks again. Mohamed
@DannyArends 2 years ago ⁺¹
Generally, them being shown in red means the target of the link doesn't exist. You can check this by doing an ls command with -lathr or something similar; it shows the target location for each link. Make sure the link points to the executable. Delete the red links when the link points to a non-existing path, then link again. If the ln command gives an error or doesn't create a link, 99% of the time it's a typo in the from path.
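The check described in this reply (a red link in ls means the target does not exist) can also be scripted. The following is a minimal Python sketch, not taken from the video or the install script; the function name find_broken_links is made up for illustration, and the ~/bin location in the example comment follows the layout discussed in this thread.

```python
import os
from pathlib import Path


def find_broken_links(folder):
    """Return (link name, target) pairs for symlinks in `folder` whose target is missing."""
    broken = []
    for entry in sorted(Path(folder).iterdir()):
        # is_symlink() inspects the link itself, exists() follows it to the target,
        # so a symlink that "does not exist" is a dangling link
        if entry.is_symlink() and not entry.exists():
            broken.append((entry.name, os.readlink(entry)))
    return broken


# Example call, assuming the links live in ~/bin as in the video:
# for name, target in find_broken_links(Path.home() / "bin"):
#     print(f"{name} -> {target} (target missing)")
```

The printed target is the path exactly as it was typed into the ln command, which makes a typo or a wrong user name in the from path easy to spot.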
@testforall555 2 years ago ⁺¹
@DannyArends Thank you. Will test this and come back. Mohamed
@testforall555 2 years ago
@DannyArends Hi Danny, I followed your steps and it worked. I have no explanation. I first removed the links that I made using -sf, then added them again like you did in the video, and it works. (Really, very strange. I had repeated this before on two computers, and the links to tabix and fasterq-dump did not work on either.) Anyway, thank you very much. Mohamed
@RainBeats 2 years ago
working for today
@DannyArends 2 years ago
Enjoy work!
@liutrvcyrsui 7 months ago ⁺¹
+1
@hnisarbiotech 1 year ago
Hi Professor
I encountered an error, "BUILD FAILED", while running the ./gradlew shadowJar command to install PICARD. Kindly help me solve this.
@DannyArends 1 year ago
The real error should be mentioned before that. The "build failed" is not the real error; it just lets you know it couldn't create the jar file.
I can help you with this, but I would need to see the full build command you used, as well as all output. Please send it by email (my email is listed in the about section of my channel).
@hnisarbiotech 1 year ago
@DannyArends Thanks Professor for your answer. The problem is solved; it was caused by having Java 11. PICARD requires Java 17.
@harshadajadhav7198 1 year ago
@hnisarbiotech How did you solve this problem? How do you get Java 17?
@kudakwashenyambo6023 10 months ago
@hnisarbiotech How did you solve this issue?
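To see whether a machine has the same problem before starting the PICARD build, the installed Java version can be checked first. The following is a minimal Python sketch, not part of the video or the install script. The requirement of Java 17 is taken from the comment above, not verified against the PICARD documentation, and the function names are made up for illustration.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from the text printed by `java -version`.

    Returns None when no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: Java 8 and older report themselves as 1.8, 1.7, ...
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and parse its output; None if java is not installed."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # `java -version` writes to stderr, so check both streams
    return java_major_version(result.stderr + result.stdout)


# Example call (17 is the version reported as required in the comment above):
# major = installed_java_major()
# print("Java found:", major, "- new enough" if major and major >= 17 else "- too old or missing")
```

How to install a newer Java depends on the distribution, so that step is left out here.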
@rahulgopalam6479 2 years ago ⁺¹
Hello, the video was extremely helpful and easy to follow. I installed everything, and at the end, once I open a new terminal to check samtools or STAR, it says bash: samtools: command not found. What's the problem?
Also, I took the Debian ISO initially and not the DVD file that you used.
@rahulgopalam6479 2 years ago ⁺¹
Hi, figured it out. /home/Rahul/software/ is the right one. I had copy-pasted the path directly, which has danny in it. All are working now except STAR, which has a red symbol. Any leads would be helpful.
@DannyArends 2 years ago
Did you update the .bashrc file to add the ~/bin folder to your $PATH? See: gist.github.com/DannyArends/04d87f5590090dfe0dc6b42e5e1bbe15 (0_installSoftware.sh), lines 83 to 97, where we make symbolic links in ~/bin and then use nano to update the .bashrc file.
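A "command not found" message after installation usually comes down to one of two things: ~/bin is not on $PATH in the new terminal, or the tool is not in ~/bin. The following is a minimal Python sketch for checking both, not part of the install script; the function name path_report and the tool names in the example are only illustrations.

```python
import os
import shutil
from pathlib import Path


def path_report(tools, bin_dir, path_value=None):
    """Report whether bin_dir is listed on PATH and where each tool resolves.

    Returns (on_path, resolved), where resolved maps each tool name to the
    full path of the executable that would run, or None if it is not found.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = [Path(p) for p in path_value.split(os.pathsep) if p]
    on_path = Path(bin_dir).expanduser() in entries
    resolved = {tool: shutil.which(tool, path=path_value) for tool in tools}
    return on_path, resolved


# Example call from a freshly opened terminal:
# on_path, resolved = path_report(["STAR", "samtools"], "~/bin")
# print("~/bin on PATH:", on_path)
# for tool, location in resolved.items():
#     print(tool, "->", location or "not found")
```

If ~/bin is reported as missing from PATH, the .bashrc edit did not take effect; if it is present but a tool is not found, the link in ~/bin is missing, broken, or not executable.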
@DannyArends 2 years ago
A red symbol? That probably means the link isn't pointing to the correct location. Remove the link and add it again. Using the tab key to auto-complete paths will prevent some failures like typos and capitalization issues.
@rahulgopalam6479 2 years ago ⁺¹
@DannyArends Thank you so much for the fast response. I did update the .bashrc file initially, but after I updated my name and added all 5 files again, I didn't do it.
@rahulgopalam6479 2 years ago ⁺¹
Looks like I have two STAR folders: one in software and one in home. Should I remove one?
@BoominGame 9 months ago ⁺¹
In Ubuntu you need to run vdb-config --interactive from the bin folder at the root of the extracted archive. That will be inside your sratoolkit folder if you created one with mkdir; otherwise it will be at the root of your /software folder. (Maybe it's just my machine, but it is the most annoying program ever.)
@DannyArends 9 months ago ⁺¹
Thanks for the info, I tend to run a Debian-based OS.
@BoominGame 9 months ago ⁺¹
@DannyArends No worries, it's very similar. I had to interrupt myself because it was a very long install and my day started. Tomorrow I'll resume and try part 2. Thanks for the great work!
@farrkf 11 months ago ⁺¹
Hi Danny, thanks for sharing this video! I'm a beginner in this field and am following your tutorial step-by-step.
However, I'm stuck at the STAR software at the moment. I can't seem to compile the software. Error is as below:
'rm' -f STAR.o Parameters.o
g++ -c -O3 -std=c++11 -fopenmp -D'COMPILATION_TIME_PLACE="2024-03-14T10:26:24+08:00 :/home/farr/software/STAR/source"' -D'GIT_BRANCH_COMMIT_DIFF="On branch master ; commit b1edc1208d91a53bf40ebae8669f71d50b994851 ; diff files: "' -pipe -Wall -Wextra STAR.cpp
STAR.cpp: In function ‘void usage(int)’:
STAR.cpp:52:45: error: ‘parametersDefault’ was not declared in this scope
52 | cout.write(reinterpret_cast(parametersDefault),
| ^~~~~~~~~~~~~~~~~
STAR.cpp:53:20: error: ‘parametersDefault_len’ was not declared in this scope
53 | parametersDefault_len);
| ^~~~~~~~~~~~~~~~~~~~~
make: *** [Makefile:100: STAR.o] Error 1
How do I solve this error?
@DannyArends 11 months ago
Seems like the master branch is currently "broken", the quickest solution is to just download the binary distribution from the release page. The latest compiled version for Linux is: github.com/alexdobin/STAR/releases/download/2.7.10a_alpha_220818/STAR_2.7.10a_alpha_220818_Linux_x86_64_static.zip
Just unzip it and put the STAR binary file in your ~/bin folder.
@jinlingli9728 11 months ago ⁺¹
Hi @DannyArends, thanks for sharing the detailed video. I have set up my own Linux machine for RNA-seq by following your instructions. However, I was wondering if there is a reason why we create the primary_assembly using R?
@DannyArends 11 months ago ⁺¹
The answer is that the Ensembl FTP server doesn't provide a primary assembly for Saccharomyces cerevisiae to download, while it does for e.g. mouse/human and other commonly used model organisms.
For Saccharomyces only the top-level genome build is provided. Top-level builds include all chromosomes (aka the primary assembly), but also regions not assembled into chromosomes (contigs) and N-padded haplotype/patch regions. According to the Ensembl documentation, when no primary assembly is provided it's because the top-level one is complete, so in this case we could have used the top-level one (since it'll be identical to the primary assembly). For most genomes (e.g. mouse) there will be a difference, and for alignment you're going to use the primary assembly in 99% of cases.
If you use the top-level build for alignment, then you're going to have to deal with these additional regions later on in the analysis, which creates additional complexity in the pipeline, and 99% of people ignore these regions anyway.
I just added the step of building it, since it's not difficult and I think it shows how you can use any genome/reference in FASTA format to align against.
(For more info see: ftp.ensembl.org/pub/release-108/fasta/saccharomyces_cerevisiae/dna/README)
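The idea of building a primary assembly from a top-level file comes down to keeping only the FASTA records for the chromosomes you want. The video does this step in R; the following is a minimal Python sketch of the same idea, not the author's script. The function name filter_fasta, the chromosome names and the file names in the example are assumptions for illustration, so check the headers in your own FASTA file before choosing which names to keep.

```python
def filter_fasta(lines, keep_names):
    """Yield only the lines of FASTA records whose sequence name is in keep_names.

    `lines` is any iterable of text lines (for example an open file).
    The sequence name is taken as the first word after '>' in the header.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_names
        if keep:
            yield line


# Example call (file names and chromosome names are hypothetical):
# chromosomes = {"I", "II", "III", "Mito"}
# with open("toplevel.fa") as src, open("primary_assembly.fa", "w") as dst:
#     dst.writelines(filter_fasta(src, chromosomes))
```

Because the function streams line by line, it never holds the whole genome in memory, which matters for larger genomes such as mouse or human.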
@farrkf 10 months ago ⁺¹
@DannyArends Tried the latest compiled version, but the same error appeared. 😞
@DannyArends 10 months ago ⁺¹
If you're using the binary, you can't have this compilation error, since you can skip the compilation (no need to build the binary, since you downloaded it).
Just download the binary, put it in ~/bin and then run STAR from the command line. You can skip the make commands to build STAR.