Laura, you teach us like we are a bunch of kids. I find it awesome! You are so sweet! This helped me so much, Ma'am! Thank you.
Laura, thank you so much for doing these. Even hard heads like me can follow your tutorials, amazing stuff! The world bows in amazement.
Laura, thank you so much for all of your amazing content. It's helped me so much during my MSc course. Just wanted you to know that all of your hard work is much appreciated!
Thank you so much for your comment, this means a lot! Glad it helped:)
The tutorial is very helpful, even though I have run the enrichment pipeline lots of times before. Your code gave me useful tips!
Wow, so glad I found your channel, very high quality content. I would love to see more workflows using other clusterProfiler functions. Also, it would be cool to have workflow options for generating data visualizations that are good for comparing exposure groups and exposure windows using overlapping significant DEGs. Thank you! Have a squidtastic🦑day!
Thank you so much for your comment! Glad you like the videos. Great suggestions - will definitely add them to my list;) Quick question - what do you mean by 'exposure groups' and 'exposure windows'?
@biostatsquid Hi, so I just mean, for example, when there are, let's say, 3 exposure windows (i.e. 24H, 3 days, 7 days) and 3 exposure groups (i.e. different concentrations of treatment, or possibly different tissue/cell types, etc.). Does that hopefully help explain what I mean lol. And it's so nice to chat with you! Cheers!
@joeyoviedo5202 Oh I see, so like ways to visualise comparisons of DEGs at different time points and possibly groups? That's a really good idea, will definitely add that to my todo list;) Thanks for the suggestion!
This is a great tutorial. I have a question: what if I want to analyze mouse data and GSEA doesn't have a murine KEGG gene set?
I need help, teacher. Since df includes all the genes, not just the differential genes (I saw that some p values are above 0.05), how do I get the whole gene list for my scRNA analysis? Appreciate it.
Has anyone tried changing all of the mouse .gmt files to .RDS? I can get all of them to do it except for the GO CC set. Anyone else run into this problem?
It will read the .gmt file, but when I execute the saveRDS() function, it just doesn't appear in the folder like it did for all of the other .gmt files.
Your channel and videos are great! I liked your website as well, thanks so much for your help.
I have a question. I have conducted differential expression analysis on TCGA-PRAD and a microarray dataset (GPL570) to get differentially expressed genes between Normal and Cancer tissues.
After that I drew a Venn diagram to get the common DEGs between these two datasets. However, my common DEGs are just gene symbols: I don't have logFC or p.value for them (I have these for each of the datasets, but I don't have them after drawing the Venn diagram).
How can I do PEA with clusterProfiler for my common DEGs obtained from the Venn diagram? Thanks in advance.
Hi! Thanks so much for your feedback, I'm glad you found them useful!
I think the best option is to perform PEA independently for each of the two datasets (careful, remember to subset the background genes for the genes present in the datasets separately). Then maybe you can use a similar approach and see which pathways overlap.
Otherwise, you might consider doing GSEA (video coming up soon!) on your selected gene list, ranking them by a consensus metric - e.g., some kind of average (but careful if you are considering log2FC as the sign is also important). This paper on concordant integrative gene set enrichment analysis might help: pubmed.ncbi.nlm.nih.gov/24564564/
Hope this helped!:)
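A minimal sketch of the consensus-ranking idea above, in base R. Everything here is an assumption for illustration: the two data frames, the column names (`gene`, `log2FC`) and the values are made up, and averaging only sign-concordant genes is just one possible consensus metric, not necessarily the method from the linked paper.

```r
# Toy differential expression results for two datasets (made-up values).
tcga <- data.frame(gene = c("A", "B", "C", "D"),
                   log2FC = c(2.0, -1.5, 0.8, 1.2),
                   stringsAsFactors = FALSE)
chip <- data.frame(gene = c("B", "C", "D", "E"),
                   log2FC = c(-0.5, -0.6, 1.8, 3.0),
                   stringsAsFactors = FALSE)

# Join on gene symbol so the common genes keep the statistics from both datasets
# (this recovers the logFC values that a Venn diagram of symbols throws away).
common <- merge(tcga, chip, by = "gene", suffixes = c("_tcga", "_chip"))

# Keep only genes that change in the same direction in both datasets,
# so that averaging does not cancel out opposite signs.
concordant <- common[sign(common$log2FC_tcga) == sign(common$log2FC_chip), ]
concordant$mean_log2FC <- rowMeans(concordant[, c("log2FC_tcga", "log2FC_chip")])

# Named vector sorted in decreasing order: the usual input shape for a ranked GSEA.
ranks <- sort(setNames(concordant$mean_log2FC, concordant$gene), decreasing = TRUE)
print(ranks)
```

Gene C is dropped here because its sign differs between the two datasets; whether to drop or down-weight such genes is a design choice that depends on the question being asked.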
Can I follow the same approach for proteomics data?
Aaaaaaaaawesome!!!!! I've finished watching all your videos about pathway analysis and they really help a lot!! I'm really grateful for your excellent explanation!!!! But I wonder if I could apply GSEA to proteomic analysis? I've got the expression matrix of the proteins, but I don't know if I could match the protein IDs with the gene set... Could you please give me some suggestions? I'd appreciate it a lot!!
Thanks for your comment! Glad you liked the videos:)
Unfortunately, I have never applied GSEA to proteomics (which I believe is called PSEA;) so I cannot give you a sure answer, but here are some suggestions to try out:
- Following the same steps as for GSEA, but before running GSEA, convert gene symbols to protein IDs. There are a few tools to do this within R, or you could also use the UniProt Retrieve/ID Mapping tool. I think this should work if the IDs match, and you use gene sets based on protein-coding genes.
- You might want to check out this publication, presenting PSEA-Quant: www.ncbi.nlm.nih.gov/pmc/articles/PMC5352860/
It allows you to perform PSEA (it's a web-based tool as far as I know) - but most importantly, if you check the methods you might figure out how to download protein sets from the tool itself.
Hope you find a solution! Let me know! Good luck!
@biostatsquid Thanks for your suggestions! I'm sorry to reply so late, it's because I'm not confident in my results.
First I checked the PSEA-Quant article, but I failed to visit the URL they provided.🥲
Then I tried to find out whether there are protein sets that directly match UniProt IDs, so that I lose as little information as possible. But when I tried to run the analysis with UniProt IDs in clusterProfiler, it showed an error. I even tried to make my own gmt file (using UniProt IDs directly) to use in GSEA, but it failed too. And I'm not professional enough to build my own package... (keep learning💪)
Finally, I chose to convert the UniProt IDs into Entrez IDs, and got my results. But I doubt the reliability of this method, because some proteins come from the same gene, and some of them are up and some are down, which may cancel each other out. Fortunately, in my protein set there are only 2 proteins from the same gene and I eliminated them, so to some degree the result still has some value as a reference.
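For the situation described above, where several proteins map to the same gene, here is a small base R sketch of two common ways to handle it. The table, the column names (`uniprot`, `entrez`, `log2FC`) and the protein-to-gene mapping are all made up for illustration, and neither option is claimed to be the "right" one.

```r
# Toy protein-level table; IDs and values are placeholders, not real mappings.
prot <- data.frame(
  uniprot = c("P1", "P2", "P3", "P4"),
  entrez  = c("101", "101", "202", "303"),
  log2FC  = c(1.4, -0.9, 0.3, -2.1),
  stringsAsFactors = FALSE
)

# Genes that are represented by more than one protein.
dup_genes <- unique(prot$entrez[duplicated(prot$entrez)])

# Option 1: drop ambiguous genes entirely (what the comment above did).
dropped <- prot[!prot$entrez %in% dup_genes, ]

# Option 2: per gene, keep the protein with the largest absolute log2FC.
ord <- prot[order(-abs(prot$log2FC)), ]
collapsed <- ord[!duplicated(ord$entrez), ]

print(dropped)
print(collapsed)
```

Option 1 loses genes but avoids picking a direction arbitrarily; option 2 keeps every gene but lets the strongest signal win, which may hide conflicting isoforms.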
Should the code in this chunk:
# Subset to those pathways that have p adj < cutoff and gene count > cutoff (you can also do this in the enricher function)
target_pws genecount_cutoff]) # select only target pathways have p adjusted < 0.05 and at least 6 genes
res_df genecount_cutoff)
be changed, as there are some cases where one of the two directions (up or down) of pathways with the same name does not pass the padj_cutoff? Would directly filtering on the values themselves be more accurate?
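To make the difference in this question concrete, here is a toy base R sketch comparing filtering by pathway ID with filtering row by row. It is not the tutorial's exact code (part of the chunk quoted above was lost); the data frame, the `direction` column and the cutoff values are assumptions, while `ID`, `p.adjust` and `Count` follow the usual clusterProfiler result column names.

```r
padj_cutoff <- 0.05
genecount_cutoff <- 5

# Toy results: each pathway appears once per direction (made-up values).
res_df <- data.frame(
  ID        = c("PW1", "PW1", "PW2", "PW2"),
  direction = c("up", "down", "up", "down"),
  p.adjust  = c(0.01, 0.20, 0.03, 0.04),
  Count     = c(10, 8, 7, 3),
  stringsAsFactors = FALSE
)

# Rows that individually pass both cutoffs.
passes <- res_df$p.adjust < padj_cutoff & res_df$Count > genecount_cutoff

# ID-based filter: keep every row of any pathway that passes in at least one direction.
target_pws <- unique(res_df$ID[passes])
by_id <- res_df[res_df$ID %in% target_pws, ]

# Row-based filter: keep only the rows that pass themselves.
by_row <- res_df[passes, ]

print(by_id)   # still contains PW1 "down" (p.adjust 0.20) and PW2 "down" (Count 3)
print(by_row)  # only the two "up" rows
```

The ID-based version keeps both directions of a pathway side by side, which can be handy for plotting; the row-based version is stricter and only reports rows that are significant themselves.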
Hello! Thank you very much for the video, it has helped me a lot. However, I have a query, as I have run the whole script on my computer with my own SDR data. Everything seems to be correct except when I run the last step "target_pws
I have another query: I have tried to use another data set and I get this result directly when running clusterProfiler: --> No gene can be mapped....
--> Expected input gene ID: HSD11B2,PTPN11,ABCG1,GALE,WASL,PLA2G12A
--> return NULL...
--> No gene can be mapped....
--> Expected input gene ID: APBB1,BID,GALT,NDUFA1,ABCB4,RUNX1
--> return NULL...
It's like my genes don't match...how can that happen?
Thanks in advance!
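One way to investigate a "No gene can be mapped" message like the one above is to compare the input IDs with the IDs the gene set expects. This is a base R sketch: the expected symbols are copied from the message above, while `my_genes` is a hypothetical input using mouse-style capitalisation, which is only one of several possible causes (Ensembl or Entrez IDs instead of symbols would produce the same message).

```r
# Symbols the gene set expects (taken from the message above).
expected <- c("HSD11B2", "PTPN11", "ABCG1", "GALE", "WASL", "PLA2G12A")

# Hypothetical input: same genes, but written in a different style.
my_genes <- c("Hsd11b2", "Ptpn11", "Abcg1")

# How many input genes match exactly? Zero means an ID type or format mismatch.
n_exact <- length(intersect(my_genes, expected))

# Crude check: does ignoring case fix it? If so, the IDs differ only in style.
n_upper <- sum(toupper(my_genes) %in% expected)

cat("exact matches:", n_exact, "\n")
cat("matches after toupper():", n_upper, "\n")
```

Upper-casing is only a diagnostic here; for a real mouse-to-human analysis, a proper ortholog mapping or a mouse gene set would be the safer route.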
When i put in df
That's probably because your file is in a different folder, or not there at all. Make sure to download the file, put it in a folder and then set in_path to the full path of that folder. You can check if the file is there with list.files(in_path), for example. Hope this helps!
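The check suggested above can be sketched like this. `in_path` is pointed at a temporary folder and `example_genesets.RDS` is a hypothetical file name, both used only so the example runs on its own; with real data, `in_path` would be the full path to the folder holding the downloaded files.

```r
# Stand-in folder and file name (assumptions, so the example is self-contained).
in_path <- tempdir()
file_name <- "example_genesets.RDS"

# Write a small object, building the path with file.path() rather than paste().
saveRDS(list(PW1 = c("A", "B")), file.path(in_path, file_name))

# List what is actually in the folder, then check for the specific file.
print(list.files(in_path))
full_path <- file.path(in_path, file_name)
if (!file.exists(full_path)) {
  stop("File not found: ", full_path, " (check in_path and the file name)")
}

# Only read the file once we know it is there.
genesets <- readRDS(full_path)
```

Printing `full_path` before saving or reading is often the quickest way to spot a missing folder, a typo, or a file that was written somewhere other than expected.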
Very informative. I was wondering, if I want to do GSEA for a plant, e.g. soybean, how do I do that, as an ORG.db library is not available for it? Can you please help me with that?
Hi Praveen, thank you for your comment! Actually, I have no experience working with non-model organisms, but I think perhaps another tool might be of more use?
I saw a few people recommend agriGO enrichment tool for plant species -
www.biostars.org/p/112022/
www.biostars.org/p/261449/
but if you want to stick with clusterProfiler, you can always create a custom gene set, as long as you keep the format clusterProfiler needs:)
Good luck!
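As a rough sketch of the custom gene set idea above: clusterProfiler's `enricher()` accepts a `TERM2GENE` data frame with the term in the first column and the gene in the second. The pathway names and gene IDs below are placeholders, not real soybean annotations, and the `enricher()` call is left commented out because it needs clusterProfiler installed plus your own gene list and background.

```r
# Placeholder gene sets: names and IDs are invented for illustration only.
gene_sets <- list(
  pathway_A = c("gene001", "gene002", "gene003"),
  pathway_B = c("gene002", "gene004")
)

# Long-format table: one row per (term, gene) pair.
term2gene <- data.frame(
  term = rep(names(gene_sets), lengths(gene_sets)),
  gene = unlist(gene_sets, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(term2gene)

# With clusterProfiler installed, and my_degs / background defined by you:
# res <- clusterProfiler::enricher(gene = my_degs,
#                                  universe = background,
#                                  TERM2GENE = term2gene)
```

The gene IDs in `term2gene` must use exactly the same ID type as the input gene list, otherwise nothing will map.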
SQUUUUUUUUUUUUIDTAAAAAAASTICCCCCC
Please ma'am, don't use pre-written code, it is not helpful. We need to learn how to write the R script.
Lady, you are good. But you only tell half of the facts.
We easily get confused at certain points, like the gmt files, halfway through.
If you have decided to teach, just do it correctly.
For beginners in bioinformatics, you are mocking us.
A half-truth is worse than lying.
The differential data that you loaded in the R script initially, which has approximately 30 thousand genes and four variables: is it pre-processed data, e.g. with duplicates removed and the p values and log FC adjusted? Or is it the raw tT data saved from the R script?