At last I fully understood the concept. Thank you
Amazing explanation, thank you! The graphics make it so much easier to understand and the video is also very entertaining to watch.
Hi Biostatsquid. This is the most straightforward explanation of GSEA I've heard. Thank you for your hard work.
So glad I discovered this channel! Looking forward to all these videos.
Thank you! Glad you enjoyed it:)
Very well explained and easy to follow! I really enjoyed the video
Thank you for your work! Your video helped me better understand a paper I am presenting to my lab. Clear and complete explanations 👏
Thank you so much for this explanation. I was lost trying to understand where the data came from, now I get it. Thank youuuu ❤❤❤ you're a brilliant angel
Thanks so much for both videos, such clear and concise explanations. Please continue making videos.🙃
Thank you so much! Perfect for beginners to quickly grasp it!
I thoroughly enjoy the illustrations. Thank you! :D
love this channel
wow this is such a clear description!
Simply genius :) ... Keep on making videos and entertain us
Your videos are helping me so much!!! Keep it up!!
Amazing, I will definitely recommend it to my colleagues - thanks for such nice work
Great video! Really helpful for getting an understanding of the analysis workflow!
A small critique / suggestion for improvement concerns the terminology, specifically referring to genes in the ranked list as being overrepresented. As you said in the video, one is not filtering any genes, so when looking at your gene set in GSEA, you aren't looking at the proportion of the genes that are part of your list, but rather at where the genes are located in the unfiltered ranked list containing all the genes.
Totally agree! Thanks for your comment:)
That was super helpful, thank you so much!
such an awesome video. informative and clear to follow. thank you so much
Truly amazing videos!
Great explanation! Thanks a lot
thank you very much for the explanation😃
AMAZING!!!
You're the best!! Greetings from Colombia :)
Hi ma'am, could you make a video on the new-generation "topology-based methods" for pathway enrichment analysis that you mentioned in this video at 7:26?
So helpful. Thanks a lot.
Thank you for explaining it so well. Can you please provide information on the inputs needed to perform ssGSEA?
YOU UNDERSTOOD ME, THANK YOU
I love the way you explain the whole concept in simple terms. Could you elaborate more on how to rank the gene list from the FC and p-values of the differential expression? I am trying to make the .rnk file to be imported into GSEA
Thanks for your comment! That is a great question, I think many people will have the same issue. I am working on a GSEA tutorial which will show you exactly how to do it, but consider this a preview of the full script! :P
I work with the package fgsea: bioconductor.org/packages/release/bioc/html/fgsea.html
You can read the documentation for more detailed instructions and examples, but for example, if you want to use the sign of log2FC multiplied by the -log10(pval) as the ranking to order your gene list, you can do something like:
rankings
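A minimal Python sketch of the ranking described above, since the snippet in the original reply is cut off here. This is not the fgsea (R) code from that reply; the function names, gene names and values are made up for illustration, and the two-column tab-separated layout is an assumption about how you want to write the .rnk file.

```python
import math

def rank_genes(de_results):
    """Score each gene as sign(log2FC) * -log10(pval), sorted high to low."""
    ranked = []
    for gene, log2fc, pval in de_results:
        # Guard against p-values of exactly 0, which would give an infinite score.
        pval = max(pval, 1e-300)
        sign = (log2fc > 0) - (log2fc < 0)
        ranked.append((gene, sign * -math.log10(pval)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

def to_rnk(ranked):
    """Format the ranked list as tab-separated 'gene<TAB>score' lines."""
    return "\n".join(f"{gene}\t{score:.4f}" for gene, score in ranked)

# Hypothetical differential expression results: (gene, log2FC, p-value).
de_results = [
    ("GENE_A", 2.5, 1e-8),
    ("GENE_B", -1.2, 1e-5),
    ("GENE_C", 3.0, 0.8),
    ("GENE_D", -0.4, 0.01),
]

ranked = rank_genes(de_results)
rnk_text = to_rnk(ranked)
print(rnk_text)
```

Note that GENE_C has the largest fold change but lands near zero because its p-value is high, which is the point of including the significance in the score.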
Hi! Great video. I have seen that it is very popular to use the fold change to rank the genes... so here, when using FC * -log(p-value), is it a convention? (Sorry if my question is very odd, I am new to this)
Hi David. Not at all, that is a great question! So it depends on what you have. If you rank all genes, you also include genes with a very high p-value (for example, gene X with p-val = 0.8). So yeah, perhaps your gene X has an amazing fold change, meaning there is a big difference between the two groups you are comparing, but with a p-val of 0.8, that big change is just not significant. So using sign(FC) * -log10(pval) is a way of taking this into account. -log10(pval) will transform those p-values (going from 0 to 1) into a more manageable scale (basically instead of pval 0.00000000000000001 you have a -log10(pval) of 17). The sign(FC) just makes that manageable number positive (if upregulated, so log FC > 0) or negative (if downregulated, so log FC < 0). This way, your genes will be ranked from downregulated, SIGNIFICANT genes -> downregulated, less significant genes -> non-significant genes -> upregulated, less significant genes -> upregulated, SIGNIFICANT genes.
Of course, you can also pre-filter your genes to only include significant ones (e.g., using pval < 0.01 or 0.05), and then just sort them by FC without worrying about the significance.
Does this make sense?
Hopefully this helped. Thanks for the question!
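The pre-filtering alternative mentioned above could be sketched like this in Python. It is only an illustration under assumptions: the function name, the 0.05 cutoff as default and the example genes are hypothetical, not taken from the video.

```python
def prefilter_and_sort(de_results, alpha=0.05):
    """Keep genes with pval < alpha, then sort them by log2FC (high to low)."""
    kept = [(gene, log2fc) for gene, log2fc, pval in de_results if pval < alpha]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical results: (gene, log2FC, p-value).
# GENE_X has a big fold change but p-val = 0.8, like the example in the reply.
de_results = [
    ("GENE_A", 2.5, 1e-8),
    ("GENE_B", -1.2, 1e-5),
    ("GENE_X", 3.0, 0.8),
    ("GENE_D", -0.4, 0.01),
]

filtered = prefilter_and_sort(de_results)
print(filtered)
```

Here GENE_X is dropped entirely instead of being pushed towards the middle of the list, so the remaining genes can be ordered by fold change alone.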
@biostatsquid Hi, I tried to work with the formula you present at 3:49 for the gene ALDOB from your table. From my calculation based on your formula, the rank for ALDOB comes to -27.1066. In your orange ranked table at 3:49, I see the ranking is done by just using -log10(pval), but in the next slide at 3:51 ALDOB has a positive ranked value of 11.3. Could you explain what I am doing wrong or missing here?
Also, does it make any sense to use adjusted p-values (FDR) instead of regular p-values for such a ranking calculation? Why or why not?
Thanks for your clarification in advance.
Hi Biostatsquid, thanks for the video! I had a question about the number of genes that these analyses are performed on. In a workshop I did on functional analysis, my input contained around 20,000 genes. Is this normal for GSEA? Or should the input size be around 20 or 100? Thanks again
Hi Arfaa! Great question. 20,000 genes sounds more than fine for GSEA. Actually GSEA makes more sense with many input genes, more than just 20 (in that case it wouldn't take that long to research what each gene does)
Thanks for the great explanation. What if the genes are enriched at both ends of the ranked list and are still significant without random distribution? That is something found in the STRING biological database. How would these terms be biologically interpreted? Some genes from our gene set contribute to the upregulation of the term, and some - to the downregulation?
Thanks for the info! Really helpful 🙌🏻
In my experiment multiple conditions were tested and I used multiple comparison tests. Thus, I have no FC values. Can I simply use the F-statistics (or p-val / adjusted p-val) for my list of genes to perform GSEA? Did you ever have this problem?
Thanks in advance!
up
Hi. Thanks for the amazing depiction! I was wondering if you could clarify the "permutation" step used in GSEA or FCS analysis. Thanks.
Hi Anmol, thank you so much for your comment. As for your question on the gene permutation step, I think the best explanation is the one given by Anthony Castanza in this discussion:
The gene_set permutation mode, which we acknowledge is inferior to the phenotype permutation mode, tests gene sets on the basis of how likely it is that a random gene set of a given size was to be enriched within the given dataset.
The results from this distribution of random enrichment scores calculated as a result of sampling random gene sets that would be the same size as the set of interest, are then compared to the true enrichment score of the identically sized real set to determine if the observed enrichment is more extreme than would be expected if the true set, like the random sets, had no functional connection to a given process.
In this permutation mode, GSEA constructs a "null" distribution of sets that are random and therefore are assumed to have no coordinated biological function, therefore the null hypothesis would be that the given real set has no coordinated biological function within the data, an enrichment more extreme than that observed in the null distribution (sets that we "know" are random and have no coordinated biological function) would allow us to reject the null hypothesis and say that the set does have a coordinated function at [pValue] level of probability.
groups.google.com/g/gsea-help/c/dveYVGQGMS0/m/l5l2sli6CwAJ?
Hope this helped!
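To make the quoted explanation a bit more concrete, here is a rough Python sketch of the gene_set permutation idea: compare the real set's enrichment score against scores from random sets of the same size. It is a simplification under assumptions, not what the GSEA software does internally: the score is an unweighted running sum, the p-value uses absolute scores, and all names and toy data are made up.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at set members, down otherwise.

    Returns the largest deviation from zero (keeping its sign).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size as the real one."""
    rng = random.Random(seed)
    members = set(gene_set) & set(ranked_genes)
    observed = enrichment_score(ranked_genes, members)
    extreme = 0
    for _ in range(n_perm):
        # A random set of the same size: assumed to have no coordinated function.
        random_set = set(rng.sample(ranked_genes, len(members)))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add 1 to both counts so the p-value is never exactly 0.
    return observed, (extreme + 1) / (n_perm + 1)

# Toy data: 100 ranked genes, and a set whose members all sit at the top.
ranked_genes = [f"G{i}" for i in range(100)]
top_set = {f"G{i}" for i in range(8)}

observed, pval = gene_set_permutation_pvalue(ranked_genes, top_set)
print(observed, pval)
```

Because random sets of 8 genes almost never cluster as tightly as the toy set, very few (if any) random scores reach the observed one, and the empirical p-value comes out small.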
Squidtastic!!
How accurate is it to say that in the ranked list we have the most upregulated genes at the top and the most downregulated at the bottom (as you said in the video and image)? I would change it to: at the top we have the most significant upregulated genes, and at the bottom the most significant downregulated genes. After all, the most significant gene (by pval/padj) is not necessarily the most upregulated/downregulated one?
Hi! Great point. If you rank them by sign(log2FC) * -log10(p-val), it's exactly what you said: you'd be ranking them from most significant & upregulated > less significant upregulated > less significant downregulated > most significant downregulated. Does this make sense?
And yes, exactly, maybe the one with the highest sign(log2FC) * -log10(p-val) is not the most upregulated, but rather the most significant :)
The best video
Hi Biostatsquid. What do you use to get the ranked list: p-value or adjusted p-value? If it is p-value, Why?
Hi! Thanks for your comment:) I normally use -log10(p-adj) * sign(log2FC), maybe this will help:
www.biostars.org/p/375584/
www.biostars.org/p/298312/
Dear BioStatquid, thanks for the video, your explanation is really nice. I would like to ask: some online platforms for performing GSEA (e.g. Broad Institute GSEA) require an organism database, and they do not contain a database for bacterial genomes. I have RNA-seq data on which I need to perform GSEA, but I am unable to do so because no database is available in the required input format. Please suggest a solution. Thanks in advance
Hi! Thanks for your feedback.
I have not really worked with prokaryotes, but FUNAGE-Pro could be a possible solution - 'comprehensive web server for gene set enrichment analysis of prokaryotes'
pubmed.ncbi.nlm.nih.gov/35641095/
funagepro.molgenrug.nl/
Hope it works!
@biostatsquid Thanks for your response. I have performed the analysis through FUNAGE-Pro, but its functional enrichment analysis didn't work in my case. I am trying clusterProfiler and goseq, but both need an org database, which I don't have.
Thanks, ma'am! Please upload a GSEA analysis in R.
Great material! Do you know of any topology-based methods that work on single-cell datasets (or pseudo-bulk single-cell data)?
thank u very much
How does KS test answer the question of whether the ranked list is random or not? Isn't that a test of normality of distribution? How can it inform us about randomness or non randomness of a ranked list??
Pls explain
Hi Jayashree, thank you for your question, I will elaborate a bit more than in the video. The KS test checks whether two samples follow the same distribution (or whether one sample follows a given reference distribution). It has many uses, for example, as you mention, to test for normality. In this case, however, we use it to check whether the genes from a certain pathway are distributed randomly across the ranked list or not. So for example, we check the distribution of genes related to 'ATP synthesis' in our ranked list (sorting genes from most to least upregulated). If most of the genes involved in ATP synthesis are upregulated in one condition, they will be located at the top of the list, so their distribution across our ranked list is clearly not random. Therefore, we conclude that ATP synthesis is a differential pathway between our two conditions. The KS test will sort out the statistics for us, giving us p-values to help us decide whether a pathway is statistically significant for our comparison. Hope this was a bit clearer!
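As a rough illustration of that idea, the Python sketch below computes a plain one-sample KS statistic: the largest gap between where a pathway's genes actually sit in the ranked list and where they would sit if spread evenly. This is an assumption-laden simplification (GSEA itself uses a weighted, KS-like running sum and permutation-based p-values), and the function name and positions are made up.

```python
def ks_statistic_vs_uniform(positions, n_genes):
    """Max distance between the empirical CDF of the set's ranks (1-based)
    and the CDF expected if the set were spread evenly over 1..n_genes."""
    positions = sorted(positions)
    k = len(positions)
    d = 0.0
    for i, pos in enumerate(positions):
        # Gap at this rank, where the empirical CDF has just jumped to (i + 1) / k.
        at_rank = abs((i + 1) / k - pos / n_genes)
        # Gap just before this rank, where the empirical CDF is still i / k.
        before_rank = abs(i / k - (pos - 1) / n_genes)
        d = max(d, at_rank, before_rank)
    return d

n_genes = 100
# Hypothetical 'ATP synthesis' genes clustered at the top of the ranked list...
clustered = [1, 2, 3, 5, 8]
# ...versus a hypothetical pathway whose genes are scattered across the list.
scattered = [10, 30, 50, 70, 90]

d_clustered = ks_statistic_vs_uniform(clustered, n_genes)
d_scattered = ks_statistic_vs_uniform(scattered, n_genes)
print(d_clustered, d_scattered)
```

The clustered pathway gives a much larger statistic than the scattered one, which is the signal that its genes are not randomly distributed along the ranked list.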
What does the list of background genes do?
Hi Jenny, thanks so much for your question - I don't think I mentioned it in this video, so sorry for the confusion! In GSEA, we just need a list of all the genes we're interested in, and a list of gene sets. The background genes are used to filter out the genes that were not measured in our experiment from the gene sets, to avoid bias. E.g., if you download cancer hallmark gene sets, some pathways may contain genes that were not measured in your experiment for whatever reason (e.g., if you have liver samples, brain-related genes may be very downregulated or not expressed). So we must remove all those genes from the gene set list we use for our analysis. Hopefully this made sense! You can read more about it in my PEA blogpost/I think I also explain it in the PEA video:)
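The filtering step described above could look roughly like this in Python. It is just a sketch: the function name, the `min_size` parameter, and the gene set and gene names are all hypothetical.

```python
def restrict_gene_sets(gene_sets, measured_genes, min_size=1):
    """Drop genes that were not measured; drop sets left with fewer than min_size genes."""
    measured = set(measured_genes)
    filtered = {}
    for name, genes in gene_sets.items():
        kept = [gene for gene in genes if gene in measured]
        if len(kept) >= min_size:
            filtered[name] = kept
    return filtered

# Hypothetical gene sets and a hypothetical list of genes measured in the experiment.
gene_sets = {
    "PATHWAY_1": ["GENE_A", "GENE_B", "GENE_C"],
    "PATHWAY_2": ["GENE_C", "GENE_Z"],
    "PATHWAY_3": ["GENE_Y", "GENE_Z"],
}
measured_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

filtered_sets = restrict_gene_sets(gene_sets, measured_genes)
print(filtered_sets)
```

PATHWAY_3 disappears because none of its genes were measured, and PATHWAY_2 keeps only the gene that was.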
The necklace makes the whole video
Squidtastic!