Convert Ensembl IDs to gene names/symbols in Python by parsing the GTF file
- Published 27 Aug 2024
- This is a very simple way to convert Ensembl IDs to gene names that doesn't require learning an R package like BioMart and works for any organism you can find a GTF file for. It is also often more sensitive and does a better job of mapping IDs than something like BioMart. In total, it only requires 6 lines of Python code.
I have uploaded the notebook: github.com/mou...
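For readers without the notebook at hand, here is a minimal sketch of the idea. It is not the notebook's actual code. It assumes an Ensembl/GENCODE-style GTF with `gene` rows whose ninth column holds attributes such as `gene_id "..."; gene_name "...";`. The function name `build_id_to_name` and the example lines are hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def build_id_to_name(lines):
    """Return a dict mapping gene_id to gene_name from GTF lines."""
    id_to_name = {}
    for line in lines:
        if line.startswith("#"):  # skip header comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene records are needed
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attributes and "gene_name" in attributes:
            id_to_name[attributes["gene_id"]] = attributes["gene_name"]
    return id_to_name


# Made-up example lines in the assumed format.
example = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";\n',
    '1\thavana\tgene\t29554\t31109\t.\t+\t.\tgene_id "ENSG00000243485";\n',
]
mapping = build_id_to_name(example)
```

With a real file, passing an open file handle as `lines` should work the same way. Genes without a `gene_name` attribute are simply left out of the dictionary, so they show up as missing when the dictionary is mapped onto a table.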
This man is an absolute godsend. I can't even begin to count the number of times he has come in clutch with a solution to issues I encounter in my personal projects and during my internship!
Glad I could help!
Hey Sam, somehow after I went through the steps, I got some gene IDs without gene names still. Does this mean my annotation file is rigged?
Epic splitting practice. Thank you so much!
This is very cool! Thank you very much! Your videos help me so much!
Thanks a lot for the video. I have read that there is sometimes not a 1:1 correspondence between gene names and Ensembl IDs, and that sometimes one ID corresponds to different gene names because they don't map the genes the same way. I feel like you're discarding that information here; doesn't it create a problem to do it this way? Thanks
This is why you keep the Ensembl IDs as long as possible and only convert them when you have to do something like making a figure, etc.
I have found that this way is as effective as, or more effective than, using mapping databases like BioMart or Org.Mm. Some GTF files vary in quality, though, so this may not always be the case. I compared the two in my R conversion video, but the video was too long, so I cut that part. The two were almost identical. Interestingly, this method identified some that were not in the other, and the other identified some that were not found in the GTF. But in both cases it was
Thank you so much!
Is there an easy way to map gene IDs to gene symbols with GTF files in R? Thank you!
You’ve done this on a CSV file, but what if we want to do the same on an h5ad file (anndata where ‘adata.var stores Ensembl IDs instead of gene names’)?
adata.var is just a pandas DataFrame; you can map a dictionary onto any column like I did here.
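A minimal sketch of that mapping, using a plain pandas DataFrame as a stand-in for `adata.var`. It assumes the Ensembl IDs are stored in the index; if they are in a column instead, map that column. The dictionary `id_to_name` and the column name `gene_name` are hypothetical.

```python
import pandas as pd

# Stand-in for adata.var, with Ensembl IDs as the index (an assumption).
var = pd.DataFrame(index=["ENSG00000223972", "ENSG00000243485"])

# Hypothetical mapping, e.g. built from the GTF.
id_to_name = {"ENSG00000223972": "DDX11L1"}

# Map the IDs to names; IDs missing from the dictionary become NaN.
var["gene_name"] = var.index.to_series().map(id_to_name)

# Keep the Ensembl ID wherever no name was found.
var["gene_name"] = var["gene_name"].fillna(var.index.to_series())
```

Storing the names in a new column leaves the original IDs in the index untouched, so nothing is lost if a name turns out to be missing or ambiguous.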
Thank you very much for this work, it helped me a lot, I loved it.
No problem! There is a Python version of BioMart, but I find this much simpler to use if you have the GTF.
thanks a lot for your great tutorial
Sir, a few of my datasets are hg19 and a few are hg38, so I used the hg19 GTF file for the hg19 data and the hg38 GTF file for the hg38 data. Is that correct, or do we have to use the hg38 GTF file for both? After converting, many Ensembl IDs have duplicate gene names, and many have NaN (no corresponding gene name). Can we simply remove such duplicates and NaNs, or are there specific criteria to follow for removal? How can I convert the hg19 gene names to hg38? Or can we combine both datasets without converting?
Hmm, good question. I would say you picked the right annotation version for each, but depending on the source it can be annotated differently. It may have worked fine, though; it is hard for me to tell without looking. As for duplicate gene names, there should be a few naturally, but it is hard for me to say whether what you are seeing is due to some error or version mismatch. The best thing you could do in this scenario is rerun both datasets on the same genome build if you have the raw data. That would also remove some potentially confounding variables from your analysis, depending on what you are trying to accomplish. Good luck!
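If dropping rows is not desirable, one possible alternative is to keep every gene and only adjust the labels. The sketch below is an illustration, not a recommendation from the video: it falls back to the Ensembl ID when no name exists and appends the ID to any name shared by several IDs. The function name `label_genes` and all IDs and names are made up, and the input IDs are assumed to be unique.

```python
from collections import Counter


def label_genes(gene_ids, id_to_name):
    """Return one display label per gene ID, unique and never missing."""
    # Fall back to the Ensembl ID when the GTF has no name for it.
    names = [id_to_name.get(gene_id, gene_id) for gene_id in gene_ids]
    counts = Counter(names)
    # Append the ID to names shared by several IDs so labels stay unique.
    return [
        f"{name}_{gene_id}" if counts[name] > 1 else name
        for gene_id, name in zip(gene_ids, names)
    ]


# Made-up IDs and names for illustration.
ids = ["ENSG01", "ENSG02", "ENSG03", "ENSG04"]
id_to_name = {"ENSG01": "GENE_A", "ENSG02": "GENE_B", "ENSG03": "GENE_B"}
labels = label_genes(ids, id_to_name)
```

Whether this is appropriate depends on the downstream analysis; it only avoids silently losing genes at the labeling step.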
Hi there. I have a similar problem. I work with yeasts that are not well annotated, and I will be using GCF_000146045._R64_genomic.gtf to convert. Could you assist with writing the script, since I am not that experienced with coding? Looking forward to your response. Thanks
First look at the file and make sure the gene names you want are in the GTF. If yes, then you can follow what I do in the video, but you will likely have to split the strings on different characters because GTFs can vary.
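One way to "look at the file" programmatically is to count which attribute keys appear on the gene rows, which shows whether a name-like key exists and what it is called. This is a sketch under the assumption that attributes are written as `key "value";` pairs in the ninth column; the function name and the two example lines are illustrative and not copied from any real file.

```python
import re
from collections import Counter

# Matches key "value" pairs in the GTF attribute column.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def count_attribute_keys(lines, feature="gene"):
    """Count how often each attribute key appears on rows of one feature type."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        counts.update(key for key, _ in ATTRIBUTE.findall(fields[8]))
    return counts


# Illustrative lines only; a real file may use different keys.
example = [
    'chrI\tRefSeq\tgene\t1807\t2169\t.\t-\t.\tgene_id "YAL068C"; gene "PAU8";\n',
    'chrI\tRefSeq\tgene\t2480\t2707\t.\t+\t.\tgene_id "YAL067W-A";\n',
]
key_counts = count_attribute_keys(example)
```

If the counts show a key such as `gene` instead of `gene_name`, that key is the one to extract, and a count lower than that of `gene_id` indicates how many genes have no name at all.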
@sanbomics Thank you for the prompt response. I will check it and get back to you. Thanks.
@sanbomics Dear Sir, I have checked the GTF file and my two input files (CSV). Compared to the GTF file, my first input file has more genes, while the second has fewer. I filtered based on the geneid and geneID, and the figures are as follows: the GTF file contains 6465, and the CSV files contain 6575 and 6166, respectively. How do I proceed?
Looking forward to your response.
Best regards
Evelyn
If you try labeling them with it, how many remain unlabeled?
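A small sketch of that check, assuming the GTF has already been turned into a dictionary from gene ID to gene name. The function name `split_by_label` is hypothetical, and the third ID in the example is made up.

```python
def split_by_label(gene_ids, id_to_name):
    """Separate IDs that the GTF can label from those it cannot."""
    labeled = [g for g in gene_ids if g in id_to_name]
    unlabeled = [g for g in gene_ids if g not in id_to_name]
    return labeled, unlabeled


# Small illustrative mapping and ID list.
id_to_name = {"YAL068C": "PAU8", "YAL067C": "SEO1"}
csv_ids = ["YAL068C", "YAL067C", "YAL999W"]  # the third ID is made up
labeled, unlabeled = split_by_label(csv_ids, id_to_name)
print(f"{len(unlabeled)} of {len(csv_ids)} IDs remain unlabeled")
```

Looking at a few entries of `unlabeled` usually reveals whether the leftovers share a pattern, such as a different ID format, or are genuinely absent from the annotation.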
Hi, where did you download your GTF file from?
Ensembl, GENCODE, or UCSC
Thanks a lot for the informative video.
The whole script ran perfectly without any error in my case. But no gene name is retrieved. The whole Gene_name column shows "NaN" for every row. What could be the reason behind this?
Thanks in advance.
It's possible your GTF is not in exactly the same format. You will have to open it and take a look at how the lines are delimited. Alternatively, I have a package that converts the names for you: twitter.com/Sanbomics/status/1597246973903785988?s=20&t=vkEV6XWM8H68F0-fNNV5GQ
I had the same issue. I figured out that I had provided the wrong column header. When you map the gene names, provide the header of the column with the Ensembl IDs.
e.g.: df["gene name"] = df["header with ENSEMBLE ID"].map(gtf)
Then it worked for me.
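To find the right column without guessing, one option is to measure how many values in each column actually occur among the GTF gene IDs. This is a sketch using only the standard library rather than pandas; the function name `column_overlap`, the CSV content, and the mapping are all made up for illustration.

```python
import csv
import io


def column_overlap(csv_text, id_to_name):
    """Return, per column, the fraction of values found among the GTF gene IDs."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    overlap = {}
    for column in rows[0]:
        hits = sum(row[column] in id_to_name for row in rows)
        overlap[column] = hits / len(rows)
    return overlap


# Made-up table and mapping.
csv_text = "gene,ensembl_id,count\nx1,ENSG01,5\nx2,ENSG02,0\n"
id_to_name = {"ENSG01": "GENE_A", "ENSG02": "GENE_B"}
overlap = column_overlap(csv_text, id_to_name)
```

The column with the highest fraction is the one to pass to `.map`. If every column scores zero, the IDs in the table and in the GTF are probably written differently, which points back to the format of the GTF rather than the column choice.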
Kind regards,
Thanks, it worked, but when I want to create the adata I get this error:
could not convert string to float: 'LINC00362'
What do you mean create the adata? What are you trying to do?
@sanbomics I mean creating the adata variable with adata = sc.read_csv("csv_path")
After converting Ensembl IDs into gene symbols, I got the error.
OK, I was just a little confused about how this relates to the Ensembl ID video. Try loading it as a CSV in pandas first, see how it looks, and check whether there is anything odd about the DataFrame.
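The error message suggests that a text column, such as the newly added gene names, sits where numbers are expected. As a rough illustration of what to look for, the sketch below lists columns whose values cannot be parsed as numbers. It uses the standard `csv` module instead of pandas, assumes the first column holds row names, and uses a made-up table and a hypothetical function name.

```python
import csv
import io


def non_numeric_columns(csv_text, skip_first=True):
    """Return headers of columns holding values that float() cannot parse."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    start = 1 if skip_first else 0  # first column is assumed to hold row names
    bad = []
    for row in reader:
        for name, value in zip(header[start:], row[start:]):
            try:
                float(value)
            except ValueError:
                if name not in bad:
                    bad.append(name)
    return bad


# Made-up table: IDs, a gene name column, and one count column.
csv_text = "gene_id,gene_name,sample1\nENSG01,LINC00362,3\nENSG02,GENE_B,7.5\n"
bad_columns = non_numeric_columns(csv_text)
```

Any column reported here would need to be moved out of the numeric part of the table, for example by using it as the row names or dropping it, before the file is read as a count matrix.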