RPKM, FPKM and TPM, Clearly Explained!!!

  • Published 17 Nov 2024

COMMENTS • 180

  • @statquest
    @statquest  2 years ago +2

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @asmitaJ
    @asmitaJ 1 month ago +1

    I believe this has become the standard video anyone recommends when you want to understand different types of count normalizations. I have been recommended this by both my supervisor and my professor on two separate occasions haha

    • @statquest
      @statquest  1 month ago +1

      That's awesome! DOUBLE BAM! :)

  • @Anonymous9683
    @Anonymous9683 3 years ago +9

    I love how this man knows his content is irreplaceable so he can mess around in the intro without being concerned about losing viewers

  • @ivantsers9445
    @ivantsers9445 5 years ago +38

    Thank you very much for the explanation! But one thing I should note: the ORDER of division (i.e. the order of steps) doesn't matter. What matters is WHAT you divide by: in TPM it's not just the library size (i.e. the raw number of all reads) but the sum of all read counts normalized by length (i.e. the total RPK across all genes). This is the root of the difference between RPKM and TPM.

    • @nicolaikarcher7186
      @nicolaikarcher7186 4 years ago

      This is correct!

    • @LiptonTiptonTea
      @LiptonTiptonTea 4 years ago +4

      Couldn't agree more. This video gives the impression that changing the order of division is what makes the difference, while it's all about total reads vs. total length-normalized counts.

    • @gabriele223
      @gabriele223 1 year ago

      All counts of reads that map onto something, I suppose.

    • @ejvik3238
      @ejvik3238 3 months ago

      but why do we normalize by the length even for TPM?
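
A minimal Python sketch of the point made in this thread, using made-up counts and gene lengths (the names `counts`, `length_kb`, and the helper functions are assumptions for illustration, not taken from the video). It checks that swapping the two divisions in RPKM changes nothing, and that TPM differs only because it divides by the total of the length-normalized counts instead of the total raw reads:

```python
import math

# Hypothetical toy data: raw read counts and gene lengths in kilobases.
counts = {"A": 10, "B": 20, "C": 5}
length_kb = {"A": 2.0, "B": 4.0, "C": 1.0}

total_reads_millions = sum(counts.values()) / 1e6


def rpkm_depth_first(gene):
    """RPKM: divide by library size first, then by gene length."""
    return counts[gene] / total_reads_millions / length_kb[gene]


def rpkm_length_first(gene):
    """The same two divisors applied in the opposite order."""
    return counts[gene] / length_kb[gene] / total_reads_millions


def tpm(gene):
    """TPM: divide RPK by the sum of RPK over all genes (in millions)."""
    total_rpk_millions = sum(counts[g] / length_kb[g] for g in counts) / 1e6
    return counts[gene] / length_kb[gene] / total_rpk_millions


for gene in counts:
    # Swapping the order of the two divisions leaves RPKM unchanged
    # (up to floating-point rounding).
    assert math.isclose(rpkm_depth_first(gene), rpkm_length_first(gene))

rpkm_total = sum(rpkm_depth_first(g) for g in counts)
tpm_total = sum(tpm(g) for g in counts)
print(f"sum of RPKM: {rpkm_total:.0f}")
print(f"sum of TPM:  {tpm_total:.0f}")
```

With these toy numbers the TPM values add up to one million, while the RPKM values add up to something else that depends on the sample.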

  • @efthymiakokkinou1616
    @efthymiakokkinou1616 5 years ago +50

    this guy is awesome.

  • @solidsnake013579
    @solidsnake013579 6 years ago +7

    hands down the most perfect explanation on the internet

  • @Tiago211287
    @Tiago211287 9 years ago +8

    The clearest explanation of TPM/FPKM/RPKM I have ever heard. I don't know why so many PhDs were so confusing when trying to explain this to me before.

    • @maxfeng4532
      @maxfeng4532 8 years ago

      +Joshua Starm thank you so much, it feels like the dust that had piled up in my mind has been cleaned out. This is perfect!

  • @marekglombik8887
    @marekglombik8887 7 years ago +1

    I've just started my PhD and I'm really glad I found this. Thanks!

  • @MariaSamaloisaMarsa-lw4fk
    @MariaSamaloisaMarsa-lw4fk 8 months ago +1

    Thank you, sir. I have watched this RPKM video on YouTube and it has really blessed me 🙏🙏
    My name is Maria Samaloisa, semester 4. Thank you, and may the Lord Jesus bless us all 🙏🙏👍

  • @TheBlackCarlo
    @TheBlackCarlo 7 years ago

    My initial work for my PhD just got soooooo much easier and more fun. Thanks!

  • @syednajeebashraf4101
    @syednajeebashraf4101 8 years ago +3

    I watched this presentation and now I can explain this to even seniors in my place as well !! :)

  • @tuskofgothos2637
    @tuskofgothos2637 6 years ago +1

    Your channel is an absolute gem! Please do keep up the good work. We need you!!

  • @de_aquila
    @de_aquila 5 years ago +2

    Thank you very much for this video! It's really very helpful!
    For many biologists who have the thirst to understand the logic behind why certain metrics are the way they are with respect to statistics... this is certainly of immense help.

  • @fabioPatroni
    @fabioPatroni 7 years ago

    The best and clearest explanation I've ever seen! Tks

  • @torlarsen2212
    @torlarsen2212 2 years ago +1

    Yet another great explanation StatQuest!!! You keep educating til today!!

  • @louisebuijs3221
    @louisebuijs3221 4 years ago

    RPKM = Reads per kilobase million -> normalizes for read depth (some replicates simply have more read depth, which is technical) and for gene length
    - SE RNAseq
    - PE RNAseq = FPKM (the rest is the same)
    1. divide the reads per gene by the total number of reads (in millions) per replicate (or sample, whatever you want to call it)
    2. divide by gene length (in kilobases)
    TPM = different order
    1. divide by gene length (in kilobases)
    2. divide by the total of the length-normalized counts (in millions) per replicate
    the result of the difference in order is that the relative expression of reads is more easily comparable, because in TPM the pie charts are all the same size and in RPKM the pies are different sizes
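
The "pie size" remark can be sketched in Python as follows. This is only an illustration under assumed toy numbers: the count matrix, the gene lengths, and the helper names (`column`, `rpkm_column`, `tpm_column`) are all hypothetical. Each column is one replicate, and the per-replicate total is the size of its "pie":

```python
import math

# Hypothetical gene lengths (kb) and a genes x replicates count matrix.
gene_length_kb = [2.0, 4.0, 1.0, 10.0]
counts = [
    [10, 12, 30],
    [20, 25, 60],
    [5, 8, 15],
    [0, 0, 1],
]


def column(matrix, j):
    """Return column j (one replicate) of a list-of-lists matrix."""
    return [row[j] for row in matrix]


def rpkm_column(col):
    """Scale by total reads (in millions), then by gene length."""
    per_million = sum(col) / 1e6
    return [c / per_million / kb for c, kb in zip(col, gene_length_kb)]


def tpm_column(col):
    """Scale by gene length, then by total reads-per-kilobase (in millions)."""
    rpk = [c / kb for c, kb in zip(col, gene_length_kb)]
    per_million = sum(rpk) / 1e6
    return [r / per_million for r in rpk]


n_replicates = len(counts[0])
rpkm_totals = [sum(rpkm_column(column(counts, j))) for j in range(n_replicates)]
tpm_totals = [sum(tpm_column(column(counts, j))) for j in range(n_replicates)]

print("RPKM totals per replicate:", [round(t) for t in rpkm_totals])
print("TPM totals per replicate: ", [round(t) for t in tpm_totals])
```

In this sketch every TPM column adds up to one million, so a TPM value reads as a share of the same-sized whole in each replicate, whereas the RPKM column totals vary from replicate to replicate.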

  • @Qaxoontii
    @Qaxoontii 6 years ago +2

    Thank you so much for this explanation, it is very useful for us biologists who have no background in bioinformatics.

    • @statquest
      @statquest  6 years ago

      You're welcome! I'm glad to know that the video is helpful. :)

  • @dreamyagnes
    @dreamyagnes 2 years ago +1

    Hi Josh, thank you so much for your videos.

  • @victorcampos9064
    @victorcampos9064 3 years ago +1

    Thank you so much!! Could not be explained clearer. Keep up the good work!

  • @prachinagpal3112
    @prachinagpal3112 7 years ago

    Concrete explanation .
    Concepts explained to the point.
    Add more !

  • @KeziKing
    @KeziKing 1 year ago +1

    This was great!!! You really explained it clearly! Thanks so much!

  • @asiyazhao3820
    @asiyazhao3820 3 years ago +1

    very clear explanation best ever

  • @priyankamaripuri8249
    @priyankamaripuri8249 6 years ago +1

    I find your videos extremely helpful! Thank you so much!!!! Can you share your presentations too?

  • @Pongant
    @Pongant 4 years ago +1

    I love your low-key intros

  • @sambhavmishra1873
    @sambhavmishra1873 5 years ago +1

    Thank you so much, Josh Starmer !! It was a very clear explanation. My doubts are totally cleared.

    • @statquest
      @statquest  5 years ago

      Awesome! Thank you. :)

  • @Jonix-redhat
    @Jonix-redhat 2 years ago +1

    Thx for a great and easy explanation!

  • @bodhisattwabanerjee8936
    @bodhisattwabanerjee8936 8 years ago +1

    Wonderful explanation.. So informative, yet explained so easily. Thank you very much. It was indeed a great help.

  • @lucyyu2251
    @lucyyu2251 9 years ago +1

    This is very very clear! I wish I had seen this video earlier! Keep it up!

  • @TheLegendOfNiko
    @TheLegendOfNiko 4 years ago +2

    Perfect explanation, however, one thing was left out - TMM. How does TMM fit into the mix?

    • @statquest
      @statquest  4 years ago +2

      TMM is similar to what they do in DESeq2. For more details, check out: ua-cam.com/video/UFB993xufUU/v-deo.html

  • @george543
    @george543 8 years ago

    Thank you for the clear explanation. You made it so straightforward and easy!

  • @Rd-lx8tu
    @Rd-lx8tu 3 years ago +1

    This video is a life saver! Thanks a Million!

  • @kanefoster8780
    @kanefoster8780 4 years ago +2

    this is fantastic. I'm all over this goddam

  • @VenkatNagaraju
    @VenkatNagaraju 4 years ago +1

    Nice explanation

  • @tejasgohil9387
    @tejasgohil9387 8 years ago

    Most, most useful. I had been beating my head trying to understand RPKM/FPKM for the last 3 days by reading and reading and reading!!! But this 10 min video did it without any confusion. Thank you very much.

  • @rodolfoaramayo7392
    @rodolfoaramayo7392 8 years ago

    Good Job!
    I am going to use this video to explain these concepts in Genomics, a graduate/undergraduate class I teach at Texas A&M University.

  • @satu272
    @satu272 7 years ago +1

    So good! Thank you, this really helps with my thesis.

  • @mrnotsoevil
    @mrnotsoevil 8 years ago

    Thank you! Finally a nice and easy-to-understand explanation!

  • @SNAKE1375
    @SNAKE1375 7 months ago +1

    Hi Josh, thanks very much for this, once again, well and clearly explained video. It seems that TPM would be the most appropriate way to measure gene expression between samples. However, internet searches show the contrary. Some say that TMM would be the best solution. What do you think of this?

    • @statquest
      @statquest  7 months ago +1

      Thank you!

    • @SNAKE1375
      @SNAKE1375 7 months ago

      @@statquest Thanks Josh, so what do you think about TMM instead of TPM?

    • @statquest
      @statquest  7 months ago +1

      @@SNAKE1375 Unfortunately I haven't been involved with high-throughput sequencing for a long time now, so I don't know the answer.

  • @glorybasumata7555
    @glorybasumata7555 6 years ago +1

    Awesome! Pretty well explained and coherent.

  • @mrcoolgs100
    @mrcoolgs100 1 year ago +1

    Excellent work!!

  • @taraeicher4241
    @taraeicher4241 5 years ago +2

    Great explanation! Thank you!

  • @williammo4450
    @williammo4450 4 years ago +1

    This guy is amazing! So clear!

  • @rayz1408
    @rayz1408 3 years ago +1

    This is awesome!! Thank you!

  • @rojinsafavi797
    @rojinsafavi797 6 years ago +2

    Would you please elaborate on what length one should use if they have gene count instead of transcript count?

    • @statquest
      @statquest  6 years ago

      Are you talking about the length of the RNA fragments that are sequenced? I don't think it really matters much either way, however, maybe longer fragments are better for transcript-level counting, since you want the fragments to span exons.

    • @rojinsafavi797
      @rojinsafavi797 6 years ago +1

      Thanks for your quick reply :-), and yes, for example, if a gene has multiple isoforms, I wonder which isoform length should be used for the normalization step. I guess, based on what you mentioned, the longest isoform length should be used.

    • @statquest
      @statquest  6 years ago

      If you are just counting reads per gene, I think most people use the longest isoform. However, if you are counting reads per transcript, then you just use that transcript’s length.
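
A small Python sketch of the rule of thumb in the reply above. The gene and transcript identifiers and their lengths are invented for illustration, and "longest isoform" is only the convention the reply describes, not the only possible choice:

```python
# Hypothetical isoform lengths in base pairs, keyed by gene then transcript.
isoform_length_bp = {
    "GENE1": {"GENE1-201": 1500, "GENE1-202": 2300},
    "GENE2": {"GENE2-201": 800},
}


def gene_level_length(gene):
    """Length to use with per-gene counts: the longest isoform."""
    return max(isoform_length_bp[gene].values())


def transcript_level_length(gene, transcript):
    """Length to use with per-transcript counts: that transcript's own length."""
    return isoform_length_bp[gene][transcript]


print(gene_level_length("GENE1"))
print(transcript_level_length("GENE1", "GENE1-201"))
```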

  • @王吉-q4k
    @王吉-q4k 4 years ago +1

    Thumb up every video

  • @steffimatchado8442
    @steffimatchado8442 4 years ago

    Thanks for the very explanatory video. It is really helpful for students like me. Could you please post a video on N50 values and how they are used to evaluate an assembly?

  • @yanggao8840
    @yanggao8840 5 years ago +1

    very helpful, thanks very much

  • @fmetaller
    @fmetaller 6 years ago +1

    First I want to thank you for this great explanation.
    There is a point I'm missing. All these normalization techniques assume that each type of cell analyzed produces the same amount of RNA and that all the differences we see are due to some variability in sequencing depth. But is this true? Wouldn't it be a better idea to normalize the counts only on some housekeeping genes, like we do with qPCR?

    • @statquest
      @statquest  6 years ago +1

      This is a great question. The reality is that when you do statistics on RNA-seq data, the normalization methods often use housekeeping genes. I explain how these normalization methods work in these videos: ua-cam.com/video/UFB993xufUU/v-deo.html and ua-cam.com/video/Wdt6jdi-NQo/v-deo.html

    • @fmetaller
      @fmetaller 6 years ago +1

      Oh, thanks for the answer(s)

  • @sumitkumar-el3kc
    @sumitkumar-el3kc 4 years ago

    What does sequencing depth really signify? Does having more sequencing depth mean higher expression? Then why is normalization for depth required?

    • @statquest
      @statquest  4 years ago

      For details on what Sequencing Depth means and why we need to normalize, see: ua-cam.com/video/tlf6wYJrwKY/v-deo.html

  • @lloydy272
    @lloydy272 8 years ago

    Thanks for explaining this in a way I can understand. My only question, how do people manage with R/FPKM if it is so hard to compare between reps?

    • @maxfeng4532
      @maxfeng4532 7 years ago

      Hey Joshua, thank you for the great video. Could you please explain why normalized counts are not used for statistical tests? The absolute values are changed by normalization, but the ranks or the relative expression have not changed... Is it because of isoforms? Thank you!

  • @jamshidkhorashad1998
    @jamshidkhorashad1998 4 years ago +1

    This was great, thanks

  • @Adelphos0101
    @Adelphos0101 4 years ago +1

    Excellent video!

  • @RonaldCutler
    @RonaldCutler 6 months ago

    Now you should make a video on why you can't use these to compare genes between samples, only to compare genes to each other within a sample. Since TPM is a proportion, if one gene goes up in a sample, then the rest of the genes will seem like they are going down, when in reality they really might be at the same level!

    • @statquest
      @statquest  6 months ago

      I'll keep that in mind.
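
The compositional effect raised in this thread can be illustrated with a short Python sketch. The two samples, the equal gene lengths, and the `tpm` helper are made-up assumptions: genes A and B have identical counts in both samples, and only gene C increases.

```python
import math


def tpm(counts, length_kb):
    """TPM for one sample: reads per kilobase, rescaled to sum to one million."""
    rpk = {g: counts[g] / length_kb[g] for g in counts}
    scale = sum(rpk.values()) / 1e6
    return {g: rpk[g] / scale for g in rpk}


# Hypothetical data: equal gene lengths, only gene C changes between samples.
length_kb = {"A": 1.0, "B": 1.0, "C": 1.0}
sample_1 = {"A": 100, "B": 100, "C": 100}
sample_2 = {"A": 100, "B": 100, "C": 400}

tpm_1 = tpm(sample_1, length_kb)
tpm_2 = tpm(sample_2, length_kb)

for gene in length_kb:
    print(gene, round(tpm_1[gene]), round(tpm_2[gene]))
```

Although the counts for A and B are the same in both toy samples, their TPM values are halved in the second one, simply because C takes up a larger share of the fixed total.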

  • @sanjaisrao484
    @sanjaisrao484 2 years ago +1

    Thanks

  • @guigaolin6825
    @guigaolin6825 3 years ago +1

    Thanks for the video!
    Btw, a paper titled 'Single-cell RNA sequencing technologies and bioinformatics pipelines', published in 2018, seems to borrow your idea for its Fig. 3c without any citation.
    What do you think of that figure?

    • @statquest
      @statquest  3 years ago

      You're totally right. Thanks for pointing that out to me.

  • @Eduardrssl
    @Eduardrssl 4 years ago +1

    Very nice vid!! Thanks!

  • @carlagibbs3223
    @carlagibbs3223 5 years ago +1

    Excellent

  • @blackV199
    @blackV199 2 years ago

    I have a question, shouldn't we use the effective length rather than transcript length? could you maybe make a video about that?

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind.

    • @blackV199
      @blackV199 2 years ago

      @@statquest Apologies, effective lengths can only be calculated when the raw data is available (fastq files). Here you discuss processed data (count data). Regardless, it would be pretty awesome if you could discuss the data processing pipeline.

  • @arpitachoudhury9788
    @arpitachoudhury9788 4 years ago

    Can you please make a detailed video on how limma+voom works

    • @statquest
      @statquest  4 years ago +1

      I'll keep it in mind.

  • @areeniiitd
    @areeniiitd 6 months ago +1

    great video ngl.

  • @rollieize
    @rollieize 8 years ago

    nicely explained!

  • @krzysztofkolmus6936
    @krzysztofkolmus6936 6 years ago +1

    Hi Josh,
    Just a quick question regarding TPM. What am I supposed to use as the length input? For a given transcript, is it the total transcript length (exons, introns and UTRs) or just the length of the exons? Many thanks for the help!

    • @statquest
      @statquest  6 years ago

      It depends on how the sequencing is done. That said, most of the time, introns are spliced out of the transcript and are not sequenced, so you can exclude those from the length of the sequence. One sure way to know you're doing it right is to look at the alignments using a genome browser - then you'll see where the reads are mapping to - if it's just exons or exons + UTRs.

  • @george543
    @george543 7 years ago +1

    Josh, could you help answer a question for me?
    When normalizing to the total read count (the second step of TPM, after normalizing to gene length), is the total read count the sum of the normalized read counts that are mapped to genes only? What about the reads that are not annotated? Thanks for your help!

  • @elzedliew972
    @elzedliew972 3 years ago +1

    statquest is an encyclopedia of ...

  • @leixiao169
    @leixiao169 3 years ago

    Great lecture. Thanks StatQuest! I wonder if DESeq2 automatically normalizes counts based on FPKM or TPM?

    • @statquest
      @statquest  3 years ago +1

      For details on how DESeq2 normalizes reads, see: ua-cam.com/video/UFB993xufUU/v-deo.html

    • @leixiao169
      @leixiao169 3 years ago +1

      @@statquest thanks!

  • @TheBloodyBeat
    @TheBloodyBeat 6 years ago +1

    Thanks for the awesome video ! If I understood well, none of these metrics takes into account the amount of unmapped reads. So does comparing TPM across samples that aren't replicates (e.g. a few environmental metagenomes) make any sense ?

    • @statquest
      @statquest  6 years ago +1

      You make a very good point. To be honest, TPM, FPKM, RPKM, etc. are all just for convenience - they make the data easy to look at and get a general feel for. However, they are not used for any sort of "real" comparisons among samples. For example, DESeq2 and edgeR (and pretty much any other software that looks for differences between sets of "seq" samples) use completely different normalization strategies. These methods take into account that different samples might express different sets of genes - and that some samples might not have many reads overall, etc. So my advice is to use edgeR or DESeq2 to normalize your data for you, rather than doing it by hand. I have videos that show how normalization works in edgeR: ua-cam.com/video/Wdt6jdi-NQo/v-deo.html and DESeq2: ua-cam.com/video/UFB993xufUU/v-deo.html if you would like more information.

    • @TheBloodyBeat
      @TheBloodyBeat 6 years ago +1

      @@statquest Hi Josh, thanks a lot for your very helpful answer. I just watched your DESeq2 video and it indeed looks a lot closer to what I'm looking for than the TPM/RPKM/FPKM metrics. I'll dive into the details and try it on my data.

    • @statquest
      @statquest  6 years ago

      @@TheBloodyBeat Hooray! :)

  • @biotechsampath
    @biotechsampath 7 years ago

    awesome explanation....thanks

  • @MrDeking10
    @MrDeking10 5 years ago

    What are some typical TPM values? I got a lot of zeros in my dataset. However, there are a lot of values between 1 and 2, and some as high as 13. Thanks

  • @LGARCIA20504
    @LGARCIA20504 5 years ago

    Very good man!

  • @easyasperl
    @easyasperl 8 years ago

    So is TPM more like FPKM in the sense that it keeps track of paired end reads?

  • @krzysztofkolmus6936
    @krzysztofkolmus6936 6 years ago

    Great video! Can anyone recommend an R package for TPM normalisation? Thanks a lot in advance!

  • @stemcell1167
    @stemcell1167 7 months ago

    Hello! I am supposed to do TPM normalisation of my count matrix. Can I use the steps explained here as they are, or should I use a tool or package?

    • @statquest
      @statquest  7 months ago

      Usually a package will do this for you, but you can also follow these steps.

  • @nnzhou9493
    @nnzhou9493 4 years ago

    Hey Josh, I used DESeq2 and got the list of significantly differentially expressed genes. Then I checked the TPM of those genes. Some genes' TPMs are quite low (< 1), and some are quite high (hundreds or thousands). Should I use a TPM cut-off value to filter out the low-expression genes? If I have to do this, which cut-off value do you prefer? Any suggestions are welcome. Thank you!

    • @statquest
      @statquest  4 years ago

      DESeq2 should do this filtering for you. For more details, see: ua-cam.com/video/Gi0JdrxRq5s/v-deo.html

  • @eldorado.t
    @eldorado.t 4 years ago +1

    Awesome 😍 thanks

  • @reafdaw01
    @reafdaw01 7 years ago

    You are pretty awesome! Thanks.

  • @johnswenson6699
    @johnswenson6699 7 years ago

    Hey Joshua,
    Thanks so much for this video. I've a follow-up question: suppose I want to compare relative expression levels of gene A between two samples, but the tissue samples vary in size ... do these normalization methods take into account the fact that some samples will have more genes present than others?
    As a hypothetical (but easy to visualize) example, suppose I cut off a hand, ground it up, and sequenced the RNA. This is sample 1. For sample 2, I cut off a different hand AND the attached arm, ground them all up, and sequenced the RNA. If I expected gene A expression only in the fingertips, would I be able to compare the two samples to uncover which sample had more expression of gene A, even though sample 2 had more (and more diverse) input tissue than sample 1?
    In short, is there a normalization method that accounts for the fact that there may simply be a greater variety of genes being expressed in one sample relative to another?
    Thanks again for this video. You explained these concepts better than any other source I've found!

    • @johnswenson6699
      @johnswenson6699 7 years ago

      Brilliant. I didn't realize those programs included that kind of normalization ... Thanks a lot, sir. I'm going to watch those videos pronto!

  • @pythonsun996
    @pythonsun996 6 years ago +2

    very good!

  • @明坤宋
    @明坤宋 3 years ago

    Hi, your video is very helpful! But if I only have the log2RPM data, how can I find the differentially expressed genes? Is there any way to convert the log2RPM data to count data?

  • @zekihi6994
    @zekihi6994 7 years ago

    so good! Thanks.

  • @lilhedayat
    @lilhedayat 4 years ago

    why is it that longer genes will have more reads mapping to them? are longer genes more amplified or is it because the short fragment of reads can be mismapped?

    • @statquest
      @statquest  4 years ago +2

      Imagine I have mRNA transcripts for two different genes, Gene A and Gene B. The mRNA transcripts for Gene A are 300 bp long and the mRNA transcripts for Gene B are 900 bp long. Now, since the sequencer can only sequence 300 bp long fragments, I break all of the mRNA transcripts into pieces that are 300 bp long. That means for each mRNA transcript for Gene A, we get one 300 bp long fragment to sequence. For Gene B, we get 3 fragments to sequence. In other words, we will sequence 3 times as many fragments for every mRNA transcript from Gene B as from Gene A. Does that make sense?

    • @lilhedayat
      @lilhedayat 4 years ago +1

      @@statquest it absolutely does!!! Thank you so much for explaining, I completely missed that! I always assumed that you would correct for this. I was under the assumption that not the fragment, but the entire 900 bp would count as 1 count by default.
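
The 300 bp vs. 900 bp example from this thread, written out as a Python sketch. It rests on idealized assumptions that are mine, not the video's: every transcript is cut into clean, non-overlapping 300 bp pieces, every piece is sequenced, and both genes have the same made-up number of transcripts. The function names are hypothetical.

```python
import math

FRAGMENT_BP = 300  # fragment size used in the example above


def fragments_per_transcript(transcript_bp, fragment_bp=FRAGMENT_BP):
    """Whole fragments obtained from one transcript in the idealized model."""
    return transcript_bp // fragment_bp


def expected_fragments(n_transcripts, transcript_bp):
    """Fragments sequenced for a gene with n_transcripts mRNA copies."""
    return n_transcripts * fragments_per_transcript(transcript_bp)


# Both genes are expressed at the same (made-up) level of 100 transcripts.
gene_a = expected_fragments(100, 300)
gene_b = expected_fragments(100, 900)
print("raw fragments:", gene_a, gene_b)

# Dividing by gene length (in kb) removes the length effect.
per_kb_a = gene_a / (300 / 1000)
per_kb_b = gene_b / (900 / 1000)
print("fragments per kb:", round(per_kb_a, 1), round(per_kb_b, 1))
```

In this idealized model the longer gene yields three times as many fragments at the same expression level, and the per-kilobase values come out equal once length is divided out.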

  • @tinacole1450
    @tinacole1450 3 years ago +1

    love it...

    • @tinacole1450
      @tinacole1450 3 years ago +1

      even the corny songs.... because I know something good follows

    • @statquest
      @statquest  3 years ago

      Thank you very much! :)

  • @omarmohammadibrahim2197
    @omarmohammadibrahim2197 6 years ago

    the start felt like the PPAP song :P
    but everything after that was awesome :D

  • @shichengguo8064
    @shichengguo8064 4 years ago

    Well explained, but I don't agree that TPM is better than FPKM

  • @km2052
    @km2052 4 years ago

    thx

  • @ejvik3238
    @ejvik3238 3 months ago

    For the TPM, why do we normalize by the gene length?

    • @statquest
      @statquest  3 months ago

      Because the number of reads per gene scales by the length of the gene.

    • @ejvik3238
      @ejvik3238 3 months ago

      @@statquest Even if I do transcriptome from a sample and I'm interested in how much or how little (if at all) are genes expressed?

    • @statquest
      @statquest  3 months ago

      @@ejvik3238 yep

    • @ejvik3238
      @ejvik3238 3 months ago

      @@statquest I just watched one of your videos called "StatQuest: A gentle introduction to RNA-seq" so if I understand that correctly we have to divide by the gene length because we create fragments from the RNA to 200 - 300 bp to be able to even start sequencing. If so my question would be why don't we divide by the number of fragments instead?

    • @statquest
      @statquest  3 months ago

      @@ejvik3238 The number of reads per gene is a function of the gene's length (because a 1kb long gene will create 5 200bp fragments and a 2kb gene will create 10) and its expression level. By dividing by the length, we can then determine expression level, which is what we are interested in.

  • @尼安德鲁-n6j
    @尼安德鲁-n6j 9 years ago

    Nice!

  • @anjalipatni2580
    @anjalipatni2580 3 years ago

    Sir,
    My data do not have any replicates and it is a paired end data.

  • @IsaacXinPei
    @IsaacXinPei 5 years ago

    Does the title have a typo? TPM => FPM?

    • @statquest
      @statquest  5 years ago

      I don't think there is a typo. The title is: "StatQuest: RPKM, FPKM and TPM". RPKM, FPKM and TPM are three (3) different ways to normalize high-throughput sequencing data.

    • @IsaacXinPei
      @IsaacXinPei 5 years ago +1

      @@statquest that's right, the first slide in the video says FPM, I think the slide has a typo

    • @statquest
      @statquest  5 years ago

      Ah! You are correct! That's amazing. This video has been online for 4 years and you are the first person to spot that.

    • @IsaacXinPei
      @IsaacXinPei 5 years ago +1

      @@statquest no problem at all, the videos are very useful, thank you for all the hard work!

  • @joshua20199
    @joshua20199 1 month ago

    Why isn't it TPKM? :/

  • @MBCOUGER
    @MBCOUGER 7 years ago +1

    Thank you so much for this, I now no longer look like this when trying to explain this: imgur.com/gallery/iWKad22

  • @fmetaller
    @fmetaller 6 years ago +1

  • @hypno666pl
    @hypno666pl 5 years ago +1