Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
I believe this has become the standard video anyone recommends when you want to understand different types of count normalizations. I have been recommended this by both my supervisor and my professor on two separate occasions haha
That's awesome! DOUBLE BAM! :)
I love how this man knows his content is irreplaceable so he can mess around in the intro without being concerned about losing viewers
:)
Thank you very much for the explanation! But one thing I should note: the ORDER of division (i.e. the order of steps) doesn't matter. What matters is WHAT you are dividing by - in TPM it's not just the library size (i.e. the raw number of all reads), but the sum of all read counts normalized by length (i.e. the total RPK across all genes). This is the root of the difference between RPKM and TPM.
This is correct!
Couldn't agree more. This video gives the impression that it is changing the order of division that makes the difference, while it's all about total reads vs total normalized counts.
all count of reads that map onto something i suppose
but why do we normalize by the length even for TPM?
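The point made in this thread can be checked numerically. Below is a minimal sketch in plain Python, using made-up read counts and gene lengths for a single sample (the function names `rpkm`, `rpkm_swapped`, and `tpm` are illustrative, not from the video). It assumes the usual definitions: RPKM divides by total reads in millions and by gene length in kb, while TPM divides by gene length and then by the sum of the length-normalized counts in millions.

```python
import math

# Hypothetical toy data for one sample: raw read counts and gene lengths in kb.
counts = {"A": 10, "B": 12, "C": 30}
length_kb = {"A": 2.0, "B": 4.0, "C": 1.0}


def rpkm(counts, length_kb):
    """Divide by total reads (in millions) first, then by gene length."""
    per_million = sum(counts.values()) / 1e6
    return {g: (counts[g] / per_million) / length_kb[g] for g in counts}


def rpkm_swapped(counts, length_kb):
    """The same two divisions as rpkm(), applied in the opposite order."""
    per_million = sum(counts.values()) / 1e6
    return {g: (counts[g] / length_kb[g]) / per_million for g in counts}


def tpm(counts, length_kb):
    """Divide by gene length, then by the sum of the length-normalized counts."""
    rpk = {g: counts[g] / length_kb[g] for g in counts}
    per_million = sum(rpk.values()) / 1e6
    return {g: rpk[g] / per_million for g in rpk}


a = rpkm(counts, length_kb)
b = rpkm_swapped(counts, length_kb)
t = tpm(counts, length_kb)

# Swapping the order of division leaves RPKM unchanged (up to rounding).
same_rpkm = all(math.isclose(a[g], b[g]) for g in counts)
print("RPKM identical in both orders:", same_rpkm)

# TPM sums to one million per sample; RPKM sums to a sample-specific value.
print("sum of TPM: ", sum(t.values()))
print("sum of RPKM:", sum(a.values()))
```

Because every gene in a sample is divided by the same denominator, RPKM and TPM differ only by a per-sample constant. Length normalization is needed in both; only the scaling total changes.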
this guy is awesome.
Thank you! :)
hands down the most perfect explanation on the internet
Thank you! :)
Clearest explanation I have ever heard of TPM/FPKM/RPKM. I don't know why so many PhDs were so confusing when trying to explain this to me before.
+Joshua Starmer thank you so much, I feel like the dust piled up in my mind has been cleaned out, this is perfect!
I've just started my PhD and I'm really glad I found this. Thanks!
Thank you, sir. I have watched this UA-cam RPKM video and it has really blessed me 🙏🙏
My name is Maria Samaloisa, semester 4. Thank you, may the Lord Jesus bless us all 🙏🙏👍
bam! :)
My initial work for my PhD just got soooooo much easier and more fun. Thanks!
I watched this presentation and now I can explain this to even seniors in my place as well !! :)
Your channel is an absolute gem! Please do keep up the good work. We need you!!
Thank you very much for this video! It's really very helpful!
For many biologists who have the thirst to understand the logic behind why certain metrics are the way they are with respect to statistics... this is certainly of immense help.
The best and clearest explanation I've ever seen! Tks
Yet another great explanation StatQuest!!! You keep educating til today!!
Thanks!
RPKM = Reads per kilobase million -> normalizes for read depth (some replicates simply have more read depth, which is technical) and for gene length
- SE RNA-seq = RPKM
- PE RNA-seq = FPKM (rest same)
1. divide the reads per gene by the total number of reads per replicate (or sample, whatever you want to call it), in millions
2. divide by gene length (in kb)
TPM = different order
1. divide by gene length (in kb)
2. divide by the total of the length-normalized counts per replicate, in millions
The result of this difference is that relative expression is more easily comparable, because in TPM the pie charts are all the same size and in RPKM the pies are different sizes
bam!
Thank you so much for this explanation, it is very useful for us biologists who have no background in bioinformatics.
You're welcome! I'm glad to know that the video is helpful. :)
Hi Josh, thank you so much for your videos.
Glad you like them!
Thank you so much!! Could not be explained clearer. Keep up the good work!
Thank you! :)
Concrete explanation .
Concepts explained to the point.
Add more !
Yes, I will be watching all videos
This was great!!! You really explained it clearly! Thanks so much!
Glad it was helpful!
very clear explanation best ever
Thank you!
I find your videos extremely helpful! Thank you so much!!!! Can you share your presentations too?
I love your low-key intros
Thanks!
Thank you so much, Josh Starmer !! It was a very clear explanation. My doubts are totally cleared.
Awesome! Thank you. :)
Thx for a great and easy explanation!
Thank you!
Wonderful explanation.. So informative, yet explained so easily. Thank you very much. It was indeed a great help.
This is very very clear! I wish I've seen this video earlier! Keep it up!
Perfect explanation, however, one thing was left out - TMM. How does TMM fit into the mix?
TMM is similar to what they do in DESeq2. For more details, check out: ua-cam.com/video/UFB993xufUU/v-deo.html
Thank you for the clear explanation. You made it so straightforward and easy!
This video is a life saver! Thanks a Million!
bam! :)
this is fantastic. I'm all over this goddam
Bam! :)
Nice explanation
Thanks! :)
Most Most Useful. I was beating my head to understand these RPKM/FPKM since last 3 days by reading and reading and reading!!! But this 10 min video did it without any confusion. Thank you Very much.
Good Job!
I am going to use this video to explain these concepts in Genomics, a graduate/undergraduate class I teach at Texas A&M University
Thanks!
So good! Thank you, this really helps with my thesis.
Thank you! Finally a nice and easy-to-understand explanation!
Hi Josh, thanks very much for yet another clear and well-explained video. It seems that TPM would be the most appropriate way to measure gene expression between samples. However, internet searches show the contrary. Some are saying that TMM would be the best solution. What do you think of this?
Thank you!
@@statquest Thanks Josh, so what do you think about TMM instead of TPM?
@@SNAKE1375 Unfortunately I haven't been involved with high-throughput sequencing for a long time now, so I don't know the answer.
Awesome! Pretty well explained and coherent.
Thanks!!! :)
Excellent work!!
Thanks a lot!
Great explanation! Thank you!
Thanks! :)
This guy is amazing! So clear!
Thanks! :)
This is awesome!! Thank you!
Glad you like it!
Would you please elaborate on what length one should use if they have gene count instead of transcript count?
Are you talking about the length of the RNA fragments that are sequenced? I don't think it really matters much either way, however, maybe longer fragments are better for transcript-level counting, since you want the fragments to span exons.
Thanks for your quick reply :-), and yes, for example, if a gene has multiple isoforms I wonder which isoform length should be used for the normalization step. I guess, based on what you mentioned, the longest isoform length should be used
If you are just counting reads per gene, I think most people use the longest isoform. However, if you are counting reads per transcript, then you just use that transcript’s length.
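The convention described in the reply above (use the longest isoform when counting reads per gene) can be sketched in a few lines of Python. The annotation tuples and the helper name `longest_isoform_length` are hypothetical, made up for illustration; in practice the lengths would come from an annotation file.

```python
# Hypothetical transcript annotation: (gene, transcript, length in bp).
transcripts = [
    ("geneA", "A-201", 1500),
    ("geneA", "A-202", 2300),
    ("geneB", "B-201", 900),
]


def longest_isoform_length(transcripts):
    """Return {gene: length in bp of its longest annotated transcript}."""
    longest = {}
    for gene, _transcript, length in transcripts:
        longest[gene] = max(length, longest.get(gene, 0))
    return longest


gene_length = longest_isoform_length(transcripts)
print(gene_length)
```

For transcript-level counts this step is unnecessary, since each transcript is normalized by its own length.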
Thumb up every video
Thank you! :)
Thanks for the very explanatory video. It is really helpful for students like me. Could you please post a video on N50 values and how they are used to evaluate an assembly?
very helpful, thanks very much
Thanks! :)
First I want to thank you for this great explanation.
There is a point I'm missing. All these normalization techniques assume that each type of cell analyzed is producing the same amount of RNA and that all the differences we see are due to some variability in the depth of the sequencing. But is this true? Wouldn't it be a better idea to normalize the counts only on some housekeeping genes, like we do with qPCR?
This is a great question. The reality is that when you do statistics on RNA-seq data, the normalization methods often use housekeeping genes. I explain how these normalization methods work in these videos: ua-cam.com/video/UFB993xufUU/v-deo.html and ua-cam.com/video/Wdt6jdi-NQo/v-deo.html
Oh, thanks for the answer(s)
What does sequencing depth really signify? Does having more sequencing depth mean high expression? If so, why is normalization for depth required?
For details on what Sequencing Depth means and why we need to normalize, see: ua-cam.com/video/tlf6wYJrwKY/v-deo.html
Thanks for explaining this in a way I can understand. My only question, how do people manage with R/FPKM if it is so hard to compare between reps?
Hey Joshua, thank you for the great video. Could you please explain why normalized counts should not be used for statistical tests? The absolute values are changed by normalization, but the ranks and the relative expression have not changed... Is it because of isoforms? Thank you!
This was great, thanks
Glad you enjoyed it!
Excellent video!
Thanks! :)
Now you should make a video on why you can’t use these to compare genes between samples and only to compare genes to each other within a sample. Since TPM is a proportion, if one gene goes up in a sample, then the rest of the genes will seem like they are going down, when in reality they really might be at the same level!
I'll keep that in mind.
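The effect described in the comment above can be illustrated with a toy example. This is a minimal sketch with made-up counts (the samples, gene names, and the `tpm` helper are hypothetical), in which only gene C changes between the two samples and all genes have the same length.

```python
def tpm(counts, length_kb):
    """Length-normalize, then rescale so the values sum to one million."""
    rpk = {g: counts[g] / length_kb[g] for g in counts}
    per_million = sum(rpk.values()) / 1e6
    return {g: rpk[g] / per_million for g in rpk}


# Hypothetical data: all genes are 1 kb long, and only gene C differs.
length_kb = {"A": 1.0, "B": 1.0, "C": 1.0}
sample1 = {"A": 100, "B": 100, "C": 100}
sample2 = {"A": 100, "B": 100, "C": 400}

tpm1 = tpm(sample1, length_kb)
tpm2 = tpm(sample2, length_kb)

for gene in sample1:
    print(gene, sample1[gene], sample2[gene], round(tpm1[gene]), round(tpm2[gene]))
```

Genes A and B have identical counts in both samples, yet their TPM values are lower in the second sample, because every TPM value is a share of the same fixed total.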
Thanks
:)
Thanks for the video!
Btw, a paper titled 'Single-cell RNA sequencing technologies and bioinformatics pipelines' published in 2018 seems to borrow your idea as their Fig.3c and without any citation.
What do you think of that figure?
You're totally right. Thanks for pointing that out to me.
Very nice vid!! Thanks!
Thank you! :)
Excellent
Thanks!
I have a question, shouldn't we use the effective length rather than transcript length? could you maybe make a video about that?
I'll keep that in mind.
@@statquest Apologies, effective lengths can only be calculated when the raw data (fastq files) is available. Here you discuss processed data (count data). Regardless, it would be pretty awesome if you could discuss the data processing pipeline.
Can you please make a detailed video on how limma+voom works
I'll keep it in mind.
great video ngl.
Thanks!
nicely explained!
Hi Josh,
Just a quick question regarding the TPM. What am I supposed to use as TPM input? Is it for the given transcript total transcript length (so exons, introns and UTRs) or just length of exons? Many thanks for help!
It depends on how the sequencing is done. That said, most of the time, introns are spliced out of the transcript and are not sequenced, so you can exclude those from the length of the sequence. One sure way to know you're doing it right is to look at the alignments using a genome browser - then you'll see where the reads are mapping to - if it's just exons or exons + UTRs.
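If introns are excluded, as the reply above suggests for most cases, the length is the number of bases covered by exons. The following is a minimal sketch of one way to compute that, assuming 1-based inclusive exon coordinates on a single strand and merging overlapping exons so that shared bases are counted once. The coordinates and the helper name `exonic_length` are hypothetical.

```python
def exonic_length(exons):
    """Count bases covered by at least one exon.

    exons: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping exons are merged so shared bases are counted once.
    """
    total = 0
    covered_to = 0  # highest coordinate already counted
    for start, end in sorted(exons):
        start = max(start, covered_to + 1)  # skip bases already counted
        if end >= start:
            total += end - start + 1
            covered_to = end
    return total


# Hypothetical gene with two overlapping exons and one separate exon.
exons = [(100, 199), (150, 249), (400, 499)]
length_bp = exonic_length(exons)
print(length_bp, "bp, or", length_bp / 1000, "kb")
```

Whether UTRs belong in the exon list depends on where the reads actually map, which is what the genome-browser check is for.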
Josh, could you help answering a question from me?
When normalizing to the total read count (the second step of TPM, after normalizing to gene length), is the total read count the sum of the normalized read counts that are mapped to genes only? What about the reads that are not annotated? Thanks for your help!
statquest is an encyclopedia of ...
bam! :)
Great lecture. Thanks StatQuest! I wonder if DESeq2 automatically normalizes counts based on FPKM or TPM?
For details on how DESeq2 normalizes reads, see: ua-cam.com/video/UFB993xufUU/v-deo.html
@@statquest thanks!
Thanks for the awesome video ! If I understood well, none of these metrics takes into account the amount of unmapped reads. So does comparing TPM across samples that aren't replicates (e.g. a few environmental metagenomes) make any sense ?
You make a very good point. To be honest, TPM, FPKM and RPKM etc. are all just for convenience - they make the data easy to look at and get a general feel for. However, they are not used for any sort of "real" comparisons among samples. For example, DESeq2 and edgeR (and pretty much any other software that looks for differences between sets of "seq" samples) use completely different normalization strategies. These methods take into account that different samples might express different sets of genes - and some samples might not have many reads overall, etc. So, my advice is to use edgeR or DESeq2 to normalize your data for you, rather than doing it by hand. I have videos that show how normalization works in edgeR: ua-cam.com/video/Wdt6jdi-NQo/v-deo.html and DESeq2: ua-cam.com/video/UFB993xufUU/v-deo.html if you would like more information.
@@statquest Hi Josh, thanks a lot for your very helpful answer. I just watched your DeSeq2 video and it looks indeed a lot closer to what I'm looking for than the TPM/RPKM/FPKM metrics. I'll dive into the details and try it on my data.
@@TheBloodyBeat Hooray! :)
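To give a feel for how such a normalization differs from TPM, here is a simplified sketch of the median-of-ratios idea that DESeq2's default size-factor estimation is based on. It is written in plain Python with made-up counts; it is not the package's code, and the function name `size_factors` is an assumption for illustration. Genes with a zero count in any sample are skipped when building the per-gene reference.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: {sample: {gene: raw count}} -> {sample: size factor}."""
    samples = list(counts)
    genes = list(counts[samples[0]])

    # Per-gene log geometric mean across samples (the reference).
    log_geo_mean = {}
    for g in genes:
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_mean[g] = sum(math.log(v) for v in values) / len(values)

    # Each sample's factor is the median ratio of its counts to the reference.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - log_geo_mean[g] for g in log_geo_mean]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical data: s2 was sequenced twice as deeply as s1,
# and gene g4 is also strongly up-regulated in s2.
counts = {
    "s1": {"g1": 10, "g2": 20, "g3": 30, "g4": 40},
    "s2": {"g1": 20, "g2": 40, "g3": 60, "g4": 800},
}
factors = size_factors(counts)
normalized = {s: {g: c / factors[s] for g, c in counts[s].items()} for s in counts}
print(factors)
```

Because the median ignores the one outlying gene, the estimated factors differ by the factor of two in depth, and g1 to g3 come out equal across samples after dividing by them. A total-count scaling would instead be pulled upward by g4.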
awesome explanation....thanks
What are some typical TPM values? I got a lot of zeros in my dataset. However, there are a lot of values between 1 and 2, and some as high as 13. Thanks
Very good man!
So is TPM more like FPKM in the sense that it keeps track of paired end reads?
Great video! Can anyone recommend an R package for TPM normalisation? Thanks a lot in advance!
Joshua Starmer, thanks again!
Hello! I am supposed to do TPM normalisation of my counts matrix. Can I use the steps explained here as they are, or should I use a tool or package?
Usually a package will do this for you, but you can also follow these steps.
Hey Josh, I used DESeq2 and got a list of significantly differentially expressed genes. Then I checked the TPM of those genes. Some genes' TPMs are quite low (< 1), some are quite high (hundreds or thousands). Should I use a TPM cut-off value to filter out the low-expression genes? If I have to do this, which cut-off value do you prefer? Any suggestion is welcome. Thank you!
DESeq2 should do this filtering for you. For more details, see: ua-cam.com/video/Gi0JdrxRq5s/v-deo.html
Awesome 😍 thanks
Thanks!
You are pretty awesome! Thanks.
Hey Joshua,
Thanks so much for this video. I've a follow-up question: suppose I want to compare relative expression levels of gene A between two samples, but the tissue samples vary in size ... do these normalization methods take into account the fact that some samples will have more genes present than others?
As a hypothetical (but easy to visualize) example, suppose I cut off a hand, ground it up, and sequenced the RNA. This is sample 1. For sample 2, I cut off a different hand AND the attached arm, ground them all up, and sequenced the RNA. If I expected gene A expression only in the fingertips, would I be able to compare the two samples to uncover which sample had more expression of gene A, even though sample 2 had more (and more diverse) input tissue than sample 1?
In short, is there a normalization method that accounts for the fact that there may simply be a greater variety of genes being expressed in one sample relative to another?
Thanks again for this video. You explained these concepts better than any other source I've found!
Brilliant. I didn't realize those programs included that kind of normalization ... Thanks a lot, sir. I'm going to watch those videos pronto!
very good!
Thank you! :)
Hi, your video is very helpful! But if I only have the log2RPM data, how can I find the differentially expressed genes? Is there any way to convert the log2RPM data to count data?
Not that I know of.
so good! Thanks.
Why is it that longer genes will have more reads mapping to them? Are longer genes more amplified, or is it because the short fragments of reads can be mismapped?
Imagine I have mRNA transcripts for two different genes, Gene A and Gene B. The mRNA transcripts for Gene A are 300 bp long and the mRNA transcripts for Gene B are 900 bp long. Now, since the sequencer can only sequence 300 bp long fragments, I break all of the mRNA transcripts into pieces that are 300 bp long. That means for each mRNA transcript for Gene A, we get one 300 bp long fragment to sequence. For Gene B, we get 3 fragments to sequence. In other words, we will sequence 3 times as many fragments for every mRNA transcript from Gene B as from Gene A. Does that make sense?
@@statquest it absolutely does!!! Thank you so much for explaining, I completely missed that! I always assumed that you would correct for this. I was under the assumption that not the fragment, but the entire 900 bp, would count as 1 count by default.
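The Gene A / Gene B example above can be turned into a few lines of arithmetic. This is a minimal sketch that assumes both genes have the same number of transcript copies (the value 10 is made up) and that every transcript breaks cleanly into 300 bp fragments, as in the reply.

```python
FRAGMENT_BP = 300  # fragment size from the example above

transcript_bp = {"A": 300, "B": 900}
copies = {"A": 10, "B": 10}  # hypothetical: both genes equally expressed

# Each transcript copy contributes (length / fragment size) fragments.
fragments = {g: copies[g] * (transcript_bp[g] // FRAGMENT_BP) for g in copies}

# Dividing by transcript length (in kb) removes the length effect.
per_kb = {g: fragments[g] / (transcript_bp[g] / 1000) for g in copies}

print("fragments sequenced:", fragments)
print("fragments per kb:   ", per_kb)
```

Gene B yields three times as many fragments as Gene A even though both are expressed equally, and the per-kb values agree once length is divided out.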
love it...
even the corny songs.... because I know something good follows
Thank you very much! :)
the starting felt like the ppap song :P
but everything after that was awesome :D
Well explained, but I don't agree that TPM is better than FPKM
Noted!
thx
For the TPM, why do we normalize by the gene length?
Because the number of reads per gene scales by the length of the gene.
@@statquest Even if I do transcriptome from a sample and I'm interested in how much or how little (if at all) are genes expressed?
@@ejvik3238 yep
@@statquest I just watched one of your videos called "StatQuest: A gentle introduction to RNA-seq" so if I understand that correctly we have to divide by the gene length because we create fragments from the RNA to 200 - 300 bp to be able to even start sequencing. If so my question would be why don't we divide by the number of fragments instead?
@@ejvik3238 The number of reads per gene is a function of the gene's length (because a 1kb long gene will create 5 200bp fragments and a 2kb gene will create 10) and its expression level. By dividing by the length, we can then determine expression level, which is what we are interested in.
Nice!
Sir,
My data do not have any replicates and it is a paired end data.
bummer!
Does the title have a typo? TPM => FPM?
I don't think there is a typo. The title is: "StatQuest: RPKM, FPKM and TPM". RPKM, FPKM and TPM are three (3) different ways to normalize high-throughput sequencing data.
@@statquest that's right, the first slide in the video says FPM, I think the slide has a typo
Ah! You are correct! That's amazing. This video has been online for 4 years and you are the first person to spot that.
@@statquest no problem at all, the videos are very useful, thank you for all the hard work!
Why isn't it TPKM? :/
No idea!
Thank you so much for this, I now no longer look like this when trying to explain this: imgur.com/gallery/iWKad22