XCiT: Cross-Covariance Image Transformers (Facebook AI Machine Learning Research Paper Explained)
- Published 11 Jun 2024
- #xcit #transformer #attentionmechanism
After dominating Natural Language Processing, Transformers have taken over Computer Vision recently with the advent of Vision Transformers. However, the attention mechanism's quadratic complexity in the number of tokens means that Transformers do not scale well to high-resolution images. XCiT is a new Transformer architecture, containing XCA, a transposed version of attention, reducing the complexity from quadratic to linear, and at least on image data, it appears to perform on par with other models. What does this mean for the field? Is this even a transformer? What really matters in deep learning?
OUTLINE:
0:00 - Intro & Overview
3:45 - Self-Attention vs Cross-Covariance Attention (XCA)
19:55 - Cross-Covariance Image Transformer (XCiT) Architecture
26:00 - Theoretical & Engineering considerations
30:40 - Experimental Results
33:20 - Comments & Conclusion
Paper: arxiv.org/abs/2106.09681
Code: github.com/facebookresearch/xcit
Abstract:
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
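To make the abstract's contrast concrete, here is a minimal numpy sketch (my own illustration, not the authors' code) of token-to-token self-attention versus XCA's channel-to-channel attention; the temperature `tau` and the per-channel L2 normalization follow the paper's description, but all shapes and values here are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d = 196, 64                    # N image patches (tokens), d feature channels
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Standard self-attention: the N x N token map costs O(N^2) time and memory.
A_tok = softmax(Q @ K.T / np.sqrt(d))          # (N, N)
out_sa = A_tok @ V                             # (N, d)

# XCA: L2-normalize each channel over the tokens, then attend over channels.
# The d x d cross-covariance map costs O(N * d^2): linear in the token count.
tau = 1.0                                      # learnable in the paper, fixed here
Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)
Kn = K / np.linalg.norm(K, axis=0, keepdims=True)
A_ch = softmax(Kn.T @ Qn / tau)                # (d, d)
out_xca = V @ A_ch                             # (N, d)
```

Both paths map the same (N, d) input to an (N, d) output; only the size of the attention map changes, which is the whole complexity argument.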
Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n - Science & Technology
Hi Yannic, at 15:30 I think you say you're explaining cross-attention, but you're actually explaining XCA.
I love your videos, and I learn a lot from them!
Excellent presentation, Sir! Thank you
Pretty cool that the head feature triggered for the race car's cockpit
Hi Yannic, it would be great to hear from you on what you think makes 'best papers' at some large conferences (e.g. CVPR currently) special? What's the selection process for these awards and do you think it's important to aim for one? Thanks!
We now need more weekends to keep up with Yannic's speed of creating videos.
He's officially passed his speed of being "Yannic Lightspeed Kilcher"
So extracting features based on Gram matrices. What they are doing is exploring equivariances: convs have translation, attention has permutation, this has scale and (to a certain degree) rotation.
I think classic attention is based on Gram matrices, whereas this one is based on Covariance matrices
@@YannicKilcher Covariance matrices are a case of Gram matricies with a linear kernel function.
Biologically speaking, XCiT makes more sense than the original transformer: every XCiTation (see what I did there) of neurons produces some distributed representation, and other neurons listen to these representations in a specific cut (which changes over time if a better one is found). So in a way XCiT is a very, very crude, small, and linear approximation of how actual neurons listen to other neurons (but not an approximation of how they operate, though).
5:35 You should remaster the "Attention Is All You Need" video.
32:45 What is being L2 normalized? All weights or just the weights of the transformer?
35:25 I don't understand the query and key visualizations. Is it a norm across channels? What would be the interpretation in this case? If each channel corresponds to some feature, then a high norm means the neural network found multiple things in the same patch/pixel.
This is essentially learning a generator function for the kernels instead of the kernels themselves.
The queries and keys are L2-normalized. For the queries and keys, you simply look at each channel across the tokens as a vector and then proceed like usual. I think the visualizations are for the classification layer, where it's more "classic" attention, not this XCA. The visualizations are more to show that the network learns to focus on relevant things.
Have any papers played around with stacking transformer blocks width wise? I.e using self attention to determine the keys/values weights of an attention block etc.?
1. To your knowledge, might this model be among the state of the art for image regression tasks (such as regressing an object's position)?
2. If so, what are the pros and cons w.r.t. standard CNNs?
Just as a curiosity, what program do you use to open the PDFs?
One note
Can you review this 2021 paper? "The Affective Growth of Computer Vision", what do you think about it?
So if you apply the XCiT idea to NLP, would you attend to dimensions of the word embedding vectors instead of channels?
yes, exactly
Would be hard to apply to NLP because the QKV and FF matrices would require fixed length sequences.
@@snippletrap yup this is my interpretation too. This combines cross sequence information through 1x1 convolutions (as opposed to cross channel) and can only be used for fixed length sequences.
@@kazz811 You can do a cumulative sum of the covariance, similar to 'Transformers are RNNs'. Might require a different normalization scheme though.
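The cumulative-sum idea mentioned above can be sketched in a few lines. This is my own illustration in the spirit of "Transformers are RNNs" (Katharopoulos et al. 2020), not code from either paper: the softmax is replaced by a positive feature map `phi` (elu + 1 here) so a covariance-like state can be accumulated token by token, which works for sequences of any length.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: keeps features positive so the normalizer stays well-behaved
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Running-state attention: O(N * d^2) time, O(d^2) memory."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(k_t, v_t)
    z = np.zeros(d)                 # running sum of k_t, for normalization
    out = np.zeros_like(V)
    for t in range(N):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
out = causal_linear_attention(Q, K, V)
```

Because the state only grows forward in time, output `t` depends only on tokens up to `t`, which is exactly why the cumulative-sum trick handles variable-length (and streaming) sequences.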
i wonder if in fact "transformers" could be summarized as a form of metalearning or hypernetworks, where the weights are "learned" on the fly. The cross-covariance produces a fresh, single "learned" weight matrix at test time, while standard attention produces a weight matrix per data point, which is perhaps too complex. I am waiting for self-supervision to be applied explicitely on the fly inside the "inner loop" optimization ( "mesa" optimizer)
XCA as a 1x1 convolution: So might be interesting to replicate XCiT replacing the XCA by (PyTorch) `nn.Conv2D(d, d, 1, groups=h)` and comparing the outcome after training from scratch. I still suspect the "implicit" token mixing would provide some boost but I wonder how much.
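The commenter's proposed baseline is easy to state without PyTorch. Below is a pure-numpy equivalent of `nn.Conv2d(d, d, 1, groups=h)` with no bias (my own sketch, not an ablation from the paper): each of the `h` groups gets its own static `(d/h, d/h)` channel-mixing matrix applied identically at every spatial position, i.e. XCA's per-head channel mixing with the data-dependent cross-covariance map swapped for a fixed learned weight.

```python
import numpy as np

def grouped_1x1_conv(x, W):
    """x: (d, H, Wd) feature map; W: (h, d//h, d//h) per-group mixing weights."""
    d, H, Wd = x.shape
    h, g, _ = W.shape
    out = np.empty_like(x)
    for i in range(h):
        # mix only the g channels of group i, uniformly over all positions
        blk = x[i*g:(i+1)*g].reshape(g, H * Wd)
        out[i*g:(i+1)*g] = (W[i] @ blk).reshape(g, H, Wd)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # d=8 channels, 4x4 spatial grid
W = rng.standard_normal((2, 4, 4))   # h=2 groups, 4 channels each
y = grouped_1x1_conv(x, W)
```

Comparing this against XCA after training from scratch, as the comment suggests, would isolate how much of the benefit comes from the channel mixing being data-dependent rather than static.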
Me reading XCiT paper this afternoon: if only he had done a video on this
Won't you lose the positional information of the actual sequence features? I had the same idea, which I applied to a comp biology problem (DNA sequences), but I couldn't recover the attention/interaction distances of sequence features/motifs in the DNA.
yes, but you retain the positional information in the final transformation because you pull each patch through independently.
I suspect they tried smaller blocks, but the added performance either decreased or did not increase enough to outweigh the added FLOPs. Smaller blocks equate to fewer features in the original layer? The entire image becomes the only feature. With 8-by-8 blocks, each entry of the block is a feature (64 features). You could create many features from one long feature, or from a small number of features, with something like a dense layer, but that is not going to give you good performance. That's like making apple pie out of just apples: no flour, no sugar, …
5:30 I think every row represents a different channel, and every single element of the row should represent the probability of a different object (like an eye or a mouth), not a single object. Did I misunderstand anything?
So basically this approach cannot be used for variable-length sequences, since it takes linear combinations along the sequence dimension (instead of along the feature/channel dimension) before attention. Which means that, whatever the image size, we would have to ensure the number of patches is identical. Am I getting this right?
no, it works whatever the sequence length
@@etiennetiennetienne if it applies 1x1 convolutions along the sequence dimension for the query and key vectors instead of along the channel dimension, then I don't think it can. Otherwise, how does this differ from standard attention? In standard attention all linear operations are done cross-channel, with the sequence information coupled by the softmax of the attention matrix.
I think the 1x1 convolution processes token by token; it is not mixing tokens together, only the channels. It is the cross-covariance computation that mixes the tokens together.
I think this is a similar idea to "cSE" (channel Squeeze-and-Excitation), except using more than one channel, and to StyleGAN-like modulated convolution. Dynamic kernels for convolutions were in StyleGAN; I've seen roughly the same idea with small differences in many papers, under different names, like "SPADE" blocks. So this could be named, for example, a cSE-modulated depth-wise separable conv-net. Nothing new, unfortunately.
You’re forgiven for drawing that picture exactly one more time, but no more.
Yadi yadi yada !
So basically it's a fancier Squeeze-Excite layer.
I found some of this a bit confusing, honestly. On one hand, I see it capturing channel-wise interactions across an entire sequence (which is probably a single image); on the other hand, the notation for the cross-covariance matrix suggests it's only for a single data point.
You also kind of pointed out in the video that it does not even matter how we do it, as long as things are contextualized. Works like Non-local Means and Global Context Blocks also provide a nice way to achieve that, I would think.
1st comment. I know I am shameless :p
Convnets are transformers, but at the pixel / small-feature level.
Please correct me if I'm wrong, but to me it seems that all these attention / transposed-attention / dynamic-weights layers are doing is swapping a linear operation for a quadratic or cubic one. Am I wrong?
That is to say, a normal FF layer is just a linear transformation (to which we sometimes add a non-linearity afterwards), and a dynamic-weights / attention layer is one where the weights themselves are a linear transformation of the input x, so the output is a quadratic transformation. If we use queries, keys, and values, we get a cubic transformation (I notice that I ignored the softmax, but the general point holds).
If I am correct, why is it surprising that a higher-degree polynomial will fit better than a linear function? Please help me make sense of this.
Your statements are correct, but I think you're using a different notion of what is quadratic than what we usually talk about with these things. We refer to the operation as quadratic because it computes the interaction term between all pairs of tokens.
@@YannicKilcher I see what you mean, I meant a quadratic function of the input, as opposed to a linear function
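The "cubic in the input" reading above can be checked numerically. A toy sketch (my own, with arbitrary random weights `Wq`, `Wk`, `Wv`): with the softmax stripped out, attention computes `(X Wq)(X Wk)^T (X Wv)`, a degree-3 polynomial in the input `X`, so scaling the input by `c` scales the output by `c**3`, where a plain linear layer would scale by `c`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 3
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attn_no_softmax(X):
    # attention without the softmax: a cubic function of X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return (Q @ K.T) @ V

c = 2.0
cubic_scaling = np.allclose(attn_no_softmax(c * X), c**3 * attn_no_softmax(X))
```

The softmax breaks this exact homogeneity (it is invariant to some rescalings and saturates), which is part of why "higher-degree polynomial" is only a loose description of what attention actually fits.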
If Facebook AI researchers team creating a deep-seated in extensive reinforcement learning with coherential resolutioning in 26k sampling - a little kind of cross-covariance transformer, I'll go pass out.
This is just a plain old linear memory network. You call the two parts participating in the memory construction `q` and `k`, but you could just as well have called them `k` and `v` and nothing would have changed. Same exact formulas. And it makes more intuitive sense, in my honest opinion.
I think these sorts of papers are kind of boring now: people just try a variation of the transformer by changing a couple of formulas minimally, then throw *A LOT* of compute and engineering with little tricks at it to get the same results we are used to getting.
It might just be, as with FFNet, that if you stir the pot of data you get as input and give it years of GPU processing, good performance is bound to happen. Seems more of a side effect of "The Bitter Lesson" by Sutton than anything else.
I had the same overall reaction. But to reframe: it’s like these “hot” techniques, which win notoriety from performance that’s at least as much big compute/data as solid architecture and careful handling, become the excuse to give consideration to basic research. It seems like lazy/obvious permutations to test, but if same work was done without being on the category of a fad, you might call it useful basic work, if boring perhaps.
These papers are bricks in the pyramid of “what do we know about structuring bias into NN architectures”. Indeed seems like enough shaking with some sort of inner structure with a learning signal will perform some kind of useful search/sort. (Duh, maybe?) But what we want to know is what specific choices are good tradeoffs, and longer term, is there something fundamental to understand about it that can be distilled.
So, keep making bricks for now.
Or in other words, what a privilege that we now get to consider this kinda boring, haha. Must be progress of some kind?
@@oncedidactic It's totally fair to have papers that make incremental improvements and try different things in order to explore the space of possibilities, or even to increase our certainty about known results, but hearing a paper like this explained isn't really adding that much to what was already presented many times before.
Maybe some engineering tricks will prove to be very resilient and broadly beneficial (say, batch norm), but that's only something we can see some years after a paper has been published and tried.
Thanks for making this video but very bad explanations 😅