Convert Ensembl ids to gene names/symbols in python by parsing the GTF file

  • Published Aug 27, 2024
  • This is a very simple way to convert Ensembl IDs to gene names that doesn't require learning an R package like BioMart and works for any organism you can find a GTF file for. It is also often more sensitive and does a better job of mapping IDs than something like BioMart. In total, it requires only 6 lines of Python code.
    I have uploaded the notebook: github.com/mou...
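
As a rough illustration of the idea (not the video's exact six lines), a minimal sketch might look like the following. It assumes an Ensembl-style GTF whose ninth column carries `gene_id "..."` and `gene_name "..."` attributes; the sample lines and the function name `gtf_to_mapping` are made up for the example.

```python
import re

# Illustrative Ensembl-style GTF lines (tab-separated, attributes in column 9).
# The third gene is invented and deliberately has no gene_name attribute.
SAMPLE_GTF = (
    '#!genome-build GRCh38\n'
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_biotype "pseudogene";\n'
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1";\n'
    '1\thavana\tgene\t14404\t29570\t.\t-\t.\t'
    'gene_id "ENSG00000227232"; gene_name "WASH7P"; gene_biotype "pseudogene";\n'
    '1\tensembl\tgene\t100\t200\t.\t+\t.\t'
    'gene_id "ENSG00000000001"; gene_biotype "lncRNA";\n'
)

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gtf_to_mapping(lines):
    """Build {gene_id: gene_name} from the 'gene' rows of a GTF."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):  # header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only gene-level rows; assumes the file has them (some GTFs do not).
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:  # genes without a name are skipped
            mapping[gene_id.group(1)] = gene_name.group(1)
    return mapping


id_to_name = gtf_to_mapping(SAMPLE_GTF.splitlines())
print(id_to_name)
```

For a real file you would pass an open file handle instead of the sample text, for example `with open("annotation.gtf") as fh: id_to_name = gtf_to_mapping(fh)`, where the path is a placeholder.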

COMMENTS • 30

  • @danielpintard7382
    @danielpintard7382 2 months ago

    This man is an absolute godsend. I can't even begin to count the number of times he has come in clutch with a solution to issues I encounter in my personal projects and during my internship!

    • @sanbomics
      @sanbomics  2 months ago

      Glad I could help!

  • @user-gd9ul4wg1s
    @user-gd9ul4wg1s 1 hour ago

    Hey Sam, somehow after I went through the steps, I got some gene IDs without gene names still. Does this mean my annotation file is rigged?

  • @ieserbes
    @ieserbes 6 months ago

    Epic splitting practice. Thank you so much!

  • @Dr.UgurComlekcioglu
    @Dr.UgurComlekcioglu 9 months ago

    This is very cool! Thank you very much! Your videos help me so much!

  • @someone_there
    @someone_there 2 years ago +1

    Thanks a lot for the video. I have read that there is not always a 1:1 correspondence between gene names and Ensembl IDs, and that sometimes one ID corresponds to several gene names because the two systems don't map genes the same way. I feel like you're discarding that information here. Doesn't doing it this way create any problems? Thanks

    • @sanbomics
      @sanbomics  2 years ago +1

      This is why you keep the Ensembl IDs as long as possible and only convert them when you have to do something like making a figure, etc.
      I have found that this way is as effective as, or more effective than, doing it with mapping databases like biomart or Org.Mm. Some GTF files vary in quality, though, so this may not always be the case. I had compared the two in my R conversion video, but then the video was too long so I cropped it. The two results were almost identical. Interestingly, this method identified some genes that were not in the other, and the other identified some that were not found in the GTF. But in both cases it was

  • @Hiro_Kobayakawa
    @Hiro_Kobayakawa 1 month ago

    Thank you so much!

  • @chrislee8408
    @chrislee8408 4 months ago

    Is there an easy way to map gene IDs to gene symbols with GTF files in R? Thank you!

  • @mst63th
    @mst63th 1 year ago +1

    You’ve done this on a CSV file, but what if we want to do the same on an h5ad file (anndata, where adata.var stores Ensembl IDs instead of gene names)?

    • @sanbomics
      @sanbomics  1 year ago

      adata.var is just a pandas dataframe, so you can map a dictionary to any column like I did here.
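
A minimal sketch of that suggestion, using a plain pandas DataFrame as a stand-in for `adata.var` so it runs without anndata. It assumes the Ensembl IDs are the index (yours may sit in a column instead), and the `id_to_name` dictionary and the third ID are invented for illustration.

```python
import pandas as pd

# Hypothetical mapping, e.g. built beforehand from a GTF file.
id_to_name = {"ENSG00000223972": "DDX11L1", "ENSG00000227232": "WASH7P"}

# Stand-in for adata.var: one row per gene, indexed by Ensembl id.
var = pd.DataFrame(index=["ENSG00000223972", "ENSG00000227232", "ENSG00000000001"])

ids = var.index.to_series()
# Map ids to names; ids missing from the dictionary keep their Ensembl id.
var["gene_name"] = ids.map(id_to_name).fillna(ids)

print(var)
```

Falling back to the Ensembl ID for unmapped genes is one possible choice here; leaving them as NaN and filtering later works just as well.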

  • @1234567899921344
    @1234567899921344 2 years ago

    Thank you very much for this work, it helped me a lot, I loved it.

    • @sanbomics
      @sanbomics  2 years ago

      No problem! There is a Python version of biomart, but I find this much simpler to use if you have the GTF.

  • @minjun9900
    @minjun9900 7 months ago

    thanks a lot for your great tutorial

  • @Maryashahere
    @Maryashahere 2 years ago +1

    Sir, some of my datasets are hg19 and some are hg38, so I used the hg19 GTF file for the hg19 data and the hg38 GTF file for the hg38 data. Is that correct, or do we have to use the hg38 GTF file for both? After converting, many Ensembl IDs have duplicate gene names, and many are NaN (no corresponding gene name). Can we simply remove such duplicates and NaNs, or are there specific criteria to follow for removal? How can I convert the hg19 gene names to hg38? Or can we combine both datasets without converting?

    • @sanbomics
      @sanbomics  2 years ago

      Hmm, good question. I would say you picked the right annotation version for each, but depending on the source it can be annotated differently. But it may have worked fine. It is hard for me to tell without looking. As for duplicate gene names, there should be a few naturally, but it is hard for me to say if what you are seeing is not due to some error or version mismatch. The best thing you could do in this scenario is rerun both datasets on the same genome build if you have the raw data. Would remove some potentially confounding variables in your analysis too, depending on what you are trying to accomplish. Good luck!
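
Before removing anything, it may help to see how many duplicates and missing names there actually are. This is a small sketch using only the standard library; the IDs and names are invented, and `id_to_name` stands for whatever dictionary was built from the GTF. Suffixing duplicated names with their ID is just one non-destructive option, not a rule from the video.

```python
from collections import Counter

# Invented example data: two ids share a name, one id has no name at all.
id_to_name = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "GENE2"}
dataset_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]

names = {gene_id: id_to_name.get(gene_id) for gene_id in dataset_ids}

# Ids with no gene name in the mapping (these show up as NaN in pandas).
missing = [gene_id for gene_id, name in names.items() if name is None]

# Gene names shared by more than one id.
name_counts = Counter(name for name in names.values() if name is not None)
duplicated = {name: n for name, n in name_counts.items() if n > 1}


def label_for(gene_id):
    """Return a unique label without dropping any row."""
    name = names[gene_id]
    if name is None:
        return gene_id  # fall back to the Ensembl id
    if name in duplicated:
        return f"{name}_{gene_id}"  # disambiguate shared names
    return name


labels = [label_for(gene_id) for gene_id in dataset_ids]
print(missing, duplicated, labels)
```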

  • @Ms_E_
    @Ms_E_ 2 years ago

    Hi there. I have a similar problem: I work with yeasts that are not well annotated, and I will be using GCF_000146045._R64_genomic.gtf for the conversion. Could you help with writing the script, since I am not that experienced with coding? Looking forward to your response. Thanks

    • @sanbomics
      @sanbomics  2 years ago

      First look at the file and make sure the gene names you want are in the GTF. If yes, then you can follow what I do in the video, but you will likely have to split the strings on different characters because GTFs can vary.
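
One way to check which names a GTF actually carries is to parse the attribute column into a dictionary and look at its keys. This is a hedged sketch rather than the splitting approach from the video: the sample attribute string is illustrative, `parse_attributes` is a made-up helper, and the idea that the name sits under a key called `gene` in this kind of file is an assumption to verify against your own GTF.

```python
import re

# Matches pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into {key: value}.

    Unquoted values are ignored, and if a key repeats only the last
    value is kept.
    """
    return dict(ATTRIBUTE.findall(attr_field))


# Illustrative attribute column; open your own file to see the real keys.
sample = 'gene_id "YAL068C"; gene "PAU8"; gene_biotype "protein_coding";'

attrs = parse_attributes(sample)
print(sorted(attrs))  # which keys are available
print(attrs.get("gene_name", attrs.get("gene")))  # try both spellings
```

Once you know the right key, the same dictionary lookup replaces any hand-tuned string splitting.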

    • @Ms_E_
      @Ms_E_ 2 years ago +1

      @sanbomics Thank you for the prompt response. I will check it and get back to you. Thanks.

    • @Ms_E_
      @Ms_E_ 2 years ago

      @sanbomics Dear Sir, I have checked the GTF file and my two input files (CSV). My first input file has more genes than the GTF file, while the second has fewer. I filtered based on the geneid and geneID, and the figures are as follows: the GTF file contains 6465, and the CSV files contain 6575 and 6166 respectively. How do I proceed?
      Looking forward to your response.
      Best regards
      Evelyn

    • @sanbomics
      @sanbomics  1 year ago

      If you try labeling them with it, how many remain unlabeled?
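
Counting what remains unlabeled can be done with a set difference before mapping anything. A minimal sketch with invented IDs; `gtf_ids` and `csv_ids` are hypothetical names for the gene IDs found in the GTF and in one input CSV.

```python
# Invented ids standing in for the real files.
gtf_ids = {"YAL068C", "YAL067C", "YAL066W"}
csv_ids = ["YAL068C", "YAL067C", "YAL999W"]

# Ids in the CSV that the GTF cannot label.
unlabeled = sorted(set(csv_ids) - gtf_ids)

# Ids annotated in the GTF but absent from the CSV (usually harmless).
unused = sorted(gtf_ids - set(csv_ids))

print(f"{len(unlabeled)} of {len(csv_ids)} ids have no GTF entry: {unlabeled}")
print(f"{len(unused)} GTF ids do not appear in the CSV")
```

Looking at a few of the unlabeled IDs by eye often shows whether they differ only in formatting or are genuinely missing from the annotation.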

  • @claudiabaldacci5910
    @claudiabaldacci5910 5 months ago

    Hi, where did you download your GTF file from?

    • @sanbomics
      @sanbomics  4 months ago

      Ensembl, GENCODE, or UCSC

  • @nilanjanamani5571
    @nilanjanamani5571 1 year ago

    Thanks a lot for the informative video.
    The whole script ran without any errors in my case, but no gene names were retrieved. The whole Gene_name column shows "NaN" for every row. What could be the reason behind this?
    Thanks in advance.

    • @sanbomics
      @sanbomics  1 year ago

      It's possible your GTF is not in exactly the same format. You will have to open it and take a look at how the lines are delimited. Alternatively, I have a package that converts the names for you: twitter.com/Sanbomics/status/1597246973903785988?s=20&t=vkEV6XWM8H68F0-fNNV5GQ

    • @Saxi0
      @Saxi0 1 year ago

      I had the same issue. I figured out that I had provided the wrong column header. When you map the gene names, provide the header of the column that holds the Ensembl IDs.
      e.g.: df["gene name"] = df["header with ENSEMBLE ID"].map(gtf)
      Then it worked for me.
      Kind regards,
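
If it is unclear which column holds the IDs, one quick check is to measure how many values in each column appear as keys in the mapping. A small sketch under stated assumptions: `gtf` is the ID-to-name dictionary (named as in the comment above), and the DataFrame contents are invented.

```python
import pandas as pd

# Hypothetical id -> name dictionary built from the GTF.
gtf = {"ENSG00000223972": "DDX11L1", "ENSG00000227232": "WASH7P"}

# Invented table; only one column actually contains Ensembl ids.
df = pd.DataFrame(
    {
        "sample": ["a", "b"],
        "ensembl_id": ["ENSG00000223972", "ENSG00000227232"],
        "count": [5, 7],
    }
)

# Fraction of each column's values that are keys of the mapping.
overlap = {col: df[col].isin(set(gtf)).mean() for col in df.columns}
id_column = max(overlap, key=overlap.get)

# Map from the column that really holds the ids.
df["gene name"] = df[id_column].map(gtf)
print(overlap)
print(df)
```

An overlap near zero for every column suggests the problem is in the dictionary itself (for example, the GTF was split differently) rather than in the column choice.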

  • @mst63th
    @mst63th 1 year ago

    Thanks, it worked, but when I try to create the adata I get this error:
    could not convert string to float: 'LINC00362'

    • @sanbomics
      @sanbomics  1 year ago

      What do you mean create the adata? What are you trying to do?

    • @mst63th
      @mst63th 1 year ago

      @sanbomics I mean creating the adata variable with adata = sc.read_csv("csv_path")
      After converting the Ensembl IDs into gene symbols, I got the error.

    • @sanbomics
      @sanbomics  1 year ago

      OK, I was just a little confused about how this relates to the Ensembl ID video. Try loading it as a CSV in pandas first to see how it looks and whether there is anything weird with the dataframe.
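
A sketch of that check, reading the CSV with pandas and listing any non-numeric columns. The guess that the error comes from a text column (such as the new gene names) sitting among the count columns is an assumption, not something confirmed in the thread; the CSV content and Ensembl IDs below are invented.

```python
import io

import pandas as pd

# Invented CSV resembling a count table after gene names were added.
csv_text = (
    "ensembl_id,gene_name,sample_1,sample_2\n"
    "ENSG00000000001,LINC00362,5,7\n"
    "ENSG00000000002,GENE2,0,3\n"
)

# For a real file, pass the path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.dtypes)

# Columns that cannot be treated as numbers.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# A purely numeric table, with the text columns set aside.
counts = df.drop(columns=non_numeric)
print(non_numeric)
print(counts)
```

If a text column turns up here, keeping it out of the numeric matrix (for example as the index or as separate annotation) is one way to avoid a string-to-float conversion.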