BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)
- Published 28 May 2024
- Proteins are the workhorses of almost all cellular functions and a core component of life. But despite their versatility, all proteins are built as sequences of the same 20 amino acids. These sequences can be analyzed with tools from NLP. This paper investigates the attention mechanism of a BERT model that has been trained on protein sequence data and discovers that the language model has implicitly learned non-trivial higher-order biological properties of proteins.
OUTLINE:
0:00 - Intro & Overview
1:40 - From DNA to Proteins
5:20 - BERT for Amino Acid Sequences
8:50 - The Structure of Proteins
12:40 - Investigating Biological Properties by Inspecting BERT
17:45 - Amino Acid Substitution
24:55 - Contact Maps
30:15 - Binding Sites
33:45 - Linear Probes
35:25 - Conclusion & Comments
Paper: arxiv.org/abs/2006.15222
Code: github.com/salesforce/provis
My Video on BERT: • BERT: Pre-training of ...
My Video on Attention: • Attention Is All You Need
Abstract:
Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at this https URL.
Authors: Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Well done, Yannic!
Overall, the whole video is very descriptive; however, I want to mention that the 3D conformation of proteins is NOT determined by molecular simulations but by physical experimental methods (e.g., X-ray crystallography and cryo-EM). These physical methods are limited because either X-ray crystallography cannot be used at all for a specific protein, or the method is just too expensive, as with cryo-EM. The number of known protein sequences has exploded relative to solved structures thanks to sequencing technology, so there remains a plethora of protein sequences without corresponding physical structures. A huge endeavor in the science community is to predict structure from the protein sequence alone, now that huge datasets (e.g., the Protein Data Bank, UniProt, etc.) and powerful models like BERT have emerged.
Great stuff as always! I'm curious how much time it takes you every day to produce this much high quality content
I think he should also put the "time to read" in the video description.
Good idea :) it really depends on the paper, sometimes 1-2 hours, sometimes 1-2 days
That is funny! I just started using transformers to address a "similar" problem using proteins. I think one of the reasons the model can't predict the binding sites is post-translational modifications. This process happens when other proteins attach modified amino acid versions or sugars to a protein's structure. These modifications can totally change the protein's folding and affect the binding site positions.
Are chaperone proteins included in that group of "other proteins"? Those have to throw a wrench in things too.
Mind-blowing concept! Amazing video.
Thanks, this explanation saved my life in understanding this paper!!! Would it be interesting to do another video on the latest 2021 version of the paper?
your channel is incredible. Thanks!
I guess there will be quite a lot of similar papers coming soon (for example, chromosome close interactions, RNA-DNA interactions, ORF identification, CRISPR gRNA design/evaluation...)
How are the proteins encoded so that they can be consumed by the neural network? Is there a Word2Vec/Glove for proteins?
Are all proteins linear? How do we encode non-linear proteins?
Good paper. The next step would be to gradient descent backwards through the learnt model to generate proteins which meet some criteria.
Transformers have their own embedding table that is jointly learned with the model, so I guess it's just a "vocabulary" of 20 tokens. And yes, all proteins are linear, as far as I know. They are chains of amino acids, which can only make two connections each.
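To make the "vocabulary of 20 tokens" point concrete, here is a minimal sketch (an assumption for illustration, not the paper's actual tokenizer) of encoding a protein sequence as integer token IDs that an embedding table would consume:

```python
# Hypothetical minimal protein tokenizer: map each of the 20 canonical
# amino acids (one-letter codes) to an integer token ID. A real protein
# language model would also add special tokens (CLS, SEP, padding, etc.).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
token_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence):
    """Map a protein sequence to a list of integer token IDs."""
    return [token_id[aa] for aa in sequence]

print(encode("MKV"))  # -> [10, 8, 17]
```

The embedding table then turns each ID into a learned vector, exactly as word embeddings do for text, so no pretrained Word2Vec-style embeddings are required.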
I think there are a lot of people working in that space, but I guess you'd want to build a model that also learns protein structure / drug interaction / etc. in a supervised manner, that's probably going to perform much better.
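The "gradient descent backwards through the model" idea above can be sketched with a toy continuous relaxation: parameterize the sequence as position-wise logits, take a softmax over the 20 amino acids, and ascend the gradient of a scoring model. Everything here is a hypothetical stand-in (the linear `score_weights` scorer is made up); a real design loop would backpropagate through a trained protein model instead.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
L, V = 8, len(AA)                        # sequence length, vocabulary size
logits = rng.normal(size=(L, V))         # continuous relaxation of the sequence
score_weights = rng.normal(size=(L, V))  # made-up differentiable "fitness" scorer

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def score(lg):
    # score = <softmax(logits), W>, summed over positions
    return float((softmax(lg) * score_weights).sum())

initial_score = score(logits)
for _ in range(300):
    p = softmax(logits)
    # analytic gradient of sum(p * W) w.r.t. logits: p * (W - <p, W>)
    grad = p * (score_weights - (p * score_weights).sum(axis=-1, keepdims=True))
    logits += 0.5 * grad                 # gradient ascent on the score

designed = "".join(AA[i] for i in softmax(logits).argmax(-1))
print(initial_score, score(logits), designed)
```

After optimization, the argmax decoding yields a discrete candidate sequence; in practice one would add constraints and validate candidates experimentally.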
Nice information 👌
Keep it up bro 👊
How can I encode epitope sequences for a binary classification task? I tried ProtTrans embeddings, but the accuracy is quite low.
Hey, I can't find the figure 2 you are showing in the paper itself. Do you have a different resource?
Too bad that you are not doing more RL; your videos are so good.
I much prefer the breadth of topics that Yannic covers. Being too myopic is almost certainly bad when trying to birth new memes :)
He's introduced me to a bunch of concepts I wouldn't have found otherwise, and everybody and their grandma has heard of RL.
Yo grandma! Check out my new A3C shit right here!
@@rbain16 Well then, maybe he should make a video about "Accelerating Online Reinforcement Learning with Offline Datasets". And I did not say he should do 100% RL, just one or two a week would be nice. Just to be clear, I am very impressed by Yannic's videos. It's really great work, with or without RL topics!
Grandma is all about them PEARLs dog
Please upload videos on ViLBERT, VisualBERT, and VisualBERT COCO.
But aren't they trying to predict the contact maps? In the equation with the attention, they seem to be adding this f(i, j) as an input; is that part of the training data? 25:45 What is this alignment they speak of?
They are investigating the correlation between contact maps (the f part) and the attention patterns (the alpha part).
@@YannicKilcher Thanks Yannic. I spontaneously notice that the sum ignores all attention terms where f(i, j) is zero.
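The alignment statistic under discussion can be sketched as follows: it is the proportion of total attention weight that falls on residue pairs (i, j) where the indicator f(i, j) is 1 (e.g. the pair is in contact). This is a hedged toy illustration with made-up numbers, not the paper's actual evaluation code, but it shows why pairs with f(i, j) = 0 drop out of the numerator while still counting in the denominator.

```python
import numpy as np

def attention_alignment(alpha, f):
    """Fraction of attention weight alpha landing on pairs where f == 1.

    alpha: attention weights over residue pairs, shape (n, n)
    f:     binary indicator (e.g. contact map), same shape
    """
    return float((alpha * f).sum() / alpha.sum())

# Toy 2x2 example: 0.4 + 0.3 of a total 1.0 attention mass hits contacts.
alpha = np.array([[0.1, 0.4],
                  [0.3, 0.2]])
f = np.array([[0, 1],
              [1, 0]])
print(attention_alignment(alpha, f))  # -> 0.7
```

In the paper this statistic is aggregated over many proteins per attention head, so no contact map is ever fed to the model as input; f is only used post hoc to measure correlation.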
You should be a professor in every university
You make such great content, thank you! It's very promising to see that the model can learn certain things from the raw amino acid sequence. For better performance, I think they should also investigate how nucleic acids interact with proteins, as those also mediate protein folding (source: pubmed.ncbi.nlm.nih.gov/19106620/).
Almost nailed the pronunciations. The Chinese name is pronounced tsai-ming shjong, pretty close.
Sometimes I get lucky :D
With the same naivety, a language model could predict the output of any written program; this can't be done.
As written, this sentence doesn't parse. Care to rephrase?
@@jeremykothe2847 I think they tried to predict the _final_ form of the protein. But to me, this looks like predicting the final output of some Turing-complete process. I think language models are capable of predicting the _next_ step by approximating the underlying equation of the physical process. But for predicting _all_ steps, you need some other (maybe iterative) mechanism.
An unrelated question that keeps bugging me: why do you mention your "Attention is all you need" video in every video you produce? Are you trying to push the number of views there to a maximum, or am I just seeing patterns where there are none? In any case, your videos are as great as always, it just keeps tripping me up in every one of them :D
"Attention is all you need" is the name of the paper that described attention heads. Hence it's referred to when discussing attention heads, which have become popular.
Yes, it's getting repetitive :) but I usually want to give people who have no clue what I'm talking about a quick pointer to where they can learn
@@YannicKilcher I wouldn't worry - it's a really important paper, and I'm sure many new viewers would love to be directed to it rather than have you 'black box' it or try to fully explain attention in every video.
263rd
3rd
8th
4th
2nd
So, biology hahahahahhhaahahah
1st
sorry to interrupt
5th
First
This has no applications; three-dimensional structure and post-translational modifications are the state of the art of protein research. Sequence alone is not interesting anymore.
2nd