XCiT: Cross-Covariance Image Transformers (Facebook AI Machine Learning Research Paper Explained)

  • Published 11 Jun 2024
  • #xcit #transformer #attentionmechanism
    After dominating Natural Language Processing, Transformers have recently taken over Computer Vision with the advent of Vision Transformers. However, the attention mechanism's quadratic complexity in the number of tokens means that Transformers do not scale well to high-resolution images. XCiT is a new Transformer architecture built around XCA, a transposed version of attention that reduces the complexity from quadratic to linear, and, at least on image data, it appears to perform on par with other models. What does this mean for the field? Is this even a transformer? What really matters in deep learning?
    OUTLINE:
    0:00 - Intro & Overview
    3:45 - Self-Attention vs Cross-Covariance Attention (XCA)
    19:55 - Cross-Covariance Image Transformer (XCiT) Architecture
    26:00 - Theoretical & Engineering considerations
    30:40 - Experimental Results
    33:20 - Comments & Conclusion
    Paper: arxiv.org/abs/2106.09681
    Code: github.com/facebookresearch/xcit
    Abstract:
    Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
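    A minimal single-head sketch of the XCA operation described above, to make the linear-in-tokens claim concrete (simplifications: no multi-head split, no learned projections, a fixed temperature tau instead of the learnable one; the helper name xca and the example shapes are illustrative, not the official repo's API):

      import torch
      import torch.nn.functional as F

      def xca(q, k, v, tau=1.0):
          # q, k, v: (batch B, tokens N, channels d)
          # L2-normalise each channel's length-N column of the queries and keys.
          q_hat = F.normalize(q, dim=1)
          k_hat = F.normalize(k, dim=1)
          # d x d attention map over channels; its cost is linear in N, not quadratic.
          attn = torch.softmax(k_hat.transpose(1, 2) @ q_hat / tau, dim=-1)  # (B, d, d)
          # Every token's channels are re-mixed by the same data-dependent d x d map.
          return v @ attn  # (B, N, d)

      # Example: 196 patch tokens with 64 channels each.
      x = torch.randn(2, 196, 64)
      out = xca(x, x, x)  # shape (2, 196, 64)

    The N x N token-token matrix of standard self-attention never appears; only a d x d channel-channel matrix does, which is where the linear scaling in the number of tokens comes from.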
    Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 52

  • @YannicKilcher
    @YannicKilcher  3 years ago +6

    OUTLINE:
    0:00 - Intro & Overview
    3:45 - Self-Attention vs Cross-Covariance Attention (XCA)
    19:55 - Cross-Covariance Image Transformer (XCiT) Architecture
    26:00 - Theoretical & Engineering considerations
    30:40 - Experimental Results
    33:20 - Comments & Conclusion
    Paper: arxiv.org/abs/2106.09681
    Code: github.com/facebookresearch/xcit

  • @adizhol
    @adizhol 3 years ago +2

    Hi Yannic, at 15:30 I think you say you're explaining cross-attention, but you're actually explaining XCA.
    I love your videos, and I learn a lot from them!

  • @jeyakumarjohnnyjonathan461
    @jeyakumarjohnnyjonathan461 2 years ago

    Excellent presentation, Sir! Thank you

  • @Gazzar19
    @Gazzar19 3 years ago +1

    Pretty cool that the head feature triggered for the race car's cockpit

  • @expirinot8724
    @expirinot8724 3 years ago +5

    Hi Yannic, it would be great to hear what you think makes 'best papers' at large conferences (e.g. CVPR currently) special. What's the selection process for these awards, and do you think it's important to aim for one? Thanks!

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 3 years ago +2

    We now need more weekends to keep up with Yannic's speed of creating videos.
    He's officially surpassed his "Yannic Lightspeed Kilcher" speed.

  • @machinelearningone2635
    @machinelearningone2635 3 years ago +5

    So: extracting features based on Gram matrices. What they are doing is exploring equivariances: convs have translation, attention has permutation, and this has scale and (to a certain degree) rotation.

    • @YannicKilcher
      @YannicKilcher  3 years ago +2

      I think classic attention is based on Gram matrices, whereas this one is based on Covariance matrices

    • @magi-1
      @magi-1 3 years ago +2

      @@YannicKilcher Covariance matrices are a case of Gram matrices with a linear kernel function.
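
      To make the distinction in this thread concrete (notation follows the abstract above; the side-by-side summary is a sketch, not a quote from the paper):

        standard self-attention:  A_tok  = softmax( Q K^T / sqrt(d) )      -- N x N, inner products between token rows
        XCA:                      A_chan = softmax( K_hat^T Q_hat / tau )  -- d x d, inner products between feature columns

      Both are matrices of inner products (Gram-type matrices); the paper calls K^T Q a cross-covariance because the inner products are taken along the token axis rather than the channel axis, after column-wise L2 normalisation of Q and K.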

  • @444haluk
    @444haluk 2 years ago +1

    Biologically speaking, XCiT makes more sense than the original transformer: every XCiTation (see what I did there) to neurons produces some distributed representation, and other neurons listen to these representations in a specific cut (that changes over time if a better one is found). So in a way XCiT is a very, very crude, small and linear approximation of how actual neurons listen to other neurons (but not an approximation of how they operate, though).

  • @herp_derpingson
    @herp_derpingson 3 years ago +7

    5:35 You should remaster the "Attention is all you need" video.
    .
    32:45 What is being L2-normalized? All weights, or just the weights of the transformer?
    .
    35:25 I don't understand the query and key visualizations. Is it a norm across channels? What would be the interpretation in this case? If each channel corresponds to some feature, then a high norm means the neural network found multiple things in the same patch/pixel.
    .
    This is essentially learning a generator function for the kernels instead of the kernels themselves.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      The queries and keys are L2-normalized: you simply look at each channel across the tokens as a vector and then proceed as usual. I think the visualizations are for the classification layer, where it's more "classic" attention, not this XCA. The visualizations are more to show that the network learns to focus on relevant things.

  • @paulcurry8383
    @paulcurry8383 2 years ago

    Have any papers played around with stacking transformer blocks width-wise? I.e., using self-attention to determine the key/value weights of an attention block, etc.?

  • @matteoguida9971
    @matteoguida9971 3 years ago

    1. To your knowledge, might this model be among the state of the art for image regression tasks (such as regressing an object's position)?
    2. If so, what are the pros and cons w.r.t. standard CNNs?

  • @CristianCYAC
    @CristianCYAC 3 years ago +5

    Just out of curiosity, what program do you use to open the PDFs?

  • @ukhu_pacha
    @ukhu_pacha 3 years ago

    Can you review this 2021 paper: "The Affective Growth of Computer Vision"? What do you think about it?

  • @st33lbird
    @st33lbird 3 years ago +8

    So if you apply the XCiT idea to NLP, would you attend to dimensions of the word embedding vectors instead of channels?

    • @YannicKilcher
      @YannicKilcher  3 years ago +2

      yes, exactly

    • @snippletrap
      @snippletrap 3 years ago

      Would be hard to apply to NLP because the QKV and FF matrices would require fixed-length sequences.

    • @kazz811
      @kazz811 3 years ago

      @@snippletrap Yup, this is my interpretation too. This combines cross-sequence information through 1x1 convolutions (as opposed to cross-channel) and can only be used for fixed-length sequences.

    • @seanburton6007
      @seanburton6007 2 years ago

      @@kazz811 You can do a cumulative sum of the covariance, similar to 'Transformers are RNNs'. Might require a different normalization scheme though.

  • @etiennetiennetienne
    @etiennetiennetienne 3 years ago

    I wonder if in fact "transformers" could be summarized as a form of meta-learning or hypernetworks, where the weights are "learned" on the fly. The cross-covariance produces a fresh, single "learned" weight matrix at test time, while standard attention produces a weight matrix per data point, which is perhaps too complex. I am waiting for self-supervision to be applied explicitly on the fly inside the "inner loop" optimization (a "mesa" optimizer).

  • @swazza9999
    @swazza9999 2 years ago

    XCA as a 1x1 convolution: so it might be interesting to replicate XCiT replacing the XCA by (PyTorch) `nn.Conv2d(d, d, 1, groups=h)` and comparing the outcome after training from scratch. I still suspect the "implicit" token mixing would provide some boost, but I wonder how much.
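
    A rough sketch of the ablation proposed above (hypothetical, not from the paper or the linked repo): swap XCA's input-dependent d x d channel mixing for a fixed, learned grouped 1x1 convolution of the same shape and train from scratch. Written here with nn.Conv1d over the flattened token axis; on an H x W patch grid the suggested nn.Conv2d(d, d, 1, groups=h) is equivalent.

      import torch
      import torch.nn as nn

      d, h, N = 192, 4, 196  # embedding dim, heads, tokens -- illustrative values

      # Static per-head channel mixing: one learned block-diagonal d x d map,
      # shared across all inputs, unlike XCA's data-dependent map.
      static_mix = nn.Conv1d(d, d, kernel_size=1, groups=h)

      x = torch.randn(2, N, d)                           # (batch, tokens, channels)
      y = static_mix(x.transpose(1, 2)).transpose(1, 2)  # same output shape as XCA: (2, N, d)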

  • @yimingqu2403
    @yimingqu2403 3 years ago

    Me reading XCiT paper this afternoon: if only he had done a video on this

  • @fahad3802
    @fahad3802 3 years ago +1

    Won't you lose the positional information of the actual sequence features? I had the same idea, which I applied to a computational biology problem (DNA sequences), but I couldn't recover the attention/interaction distances of sequence features/motifs in the DNA.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Yes, but you retain the position information in the final transformation because you pull each patch through independently.

  • @victorrielly4588
    @victorrielly4588 3 years ago

    I suspect they tried smaller blocks, but the added performance either decreased or did not increase enough to outweigh the added FLOPs. Smaller blocks equate to fewer features in the original layer? The entire image becomes the only feature. With 8-by-8 blocks, each entry of the block is a feature (64 features). You could create many features from one long feature, or from a small number of features, with something like a dense layer, but that is not going to give you good performance. That's like making apple pie out of just apples: no flour, no sugar, …

  • @user-dz4qx8kc9j
    @user-dz4qx8kc9j 2 years ago

    5:30 - I think every row represents a different channel, and every single element of the row should represent the probability of a different object, not of one object like an eye or a mouth. Did I misunderstand anything?

  • @kazz811
    @kazz811 3 years ago

    So basically this approach cannot be used for variable-length sequences, since it takes linear combinations along the sequence dimension (instead of along the feature/channel dimension) before attention. Which means that whatever the image size, we would have to ensure that the number of patches is identical. Am I getting this right?

    • @etiennetiennetienne
      @etiennetiennetienne 3 years ago

      No, it works for any sequence length.

    • @kazz811
      @kazz811 3 years ago

      @@etiennetiennetienne If it applies 1x1 convolutions along the sequence dimension for the query and key vectors instead of along the channel dimension, then I don't think it can. Otherwise, how does this differ from standard attention? In standard attention all linear operations are done cross-channel, with the sequence information coupled by the softmax of the attention matrix.

    • @etiennetiennetienne
      @etiennetiennetienne 3 years ago +1

      I think the 1x1 convolution processes token by token; it is not mixing tokens together, only the channels. It is the cross-covariance computation that mixes the tokens together.
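
      A quick way to see why the sequence length is not fixed (a sketch under the same simplifications as the earlier xca snippet: single head, no projections): the attention map lives in channel space, so its d x d shape does not depend on the number of tokens N.

        import torch
        import torch.nn.functional as F

        def channel_attention(q, k, tau=1.0):
            # q, k: (batch, tokens N, channels d) -> (batch, d, d); shape is independent of N
            q_hat, k_hat = F.normalize(q, dim=1), F.normalize(k, dim=1)
            return torch.softmax(k_hat.transpose(1, 2) @ q_hat / tau, dim=-1)

        a_small = channel_attention(torch.randn(1, 196, 64), torch.randn(1, 196, 64))
        a_large = channel_attention(torch.randn(1, 784, 64), torch.randn(1, 784, 64))
        assert a_small.shape == a_large.shape == (1, 64, 64)  # same weights handle any token count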

  • @aidegrod
    @aidegrod 2 years ago

    I think this is a similar idea to "cSE" (channel Squeeze-and-Excitation), except using more than one channel, and to StyleGAN-like modulated convolution. Dynamic kernels for convolutions were in StyleGAN, and I've seen roughly the same idea with small differences in many papers under different names, like "SPADE" blocks. So this could be named, for example, a cSE-modulated depthwise-separable conv-net. Nothing new, unfortunately.

  • @victorrielly4588
    @victorrielly4588 3 years ago

    You’re forgiven for drawing that picture exactly one more time, but no more.

  • @JTMoustache
    @JTMoustache 3 years ago +1

    Yadi yadi yada !

  • @kanfoosj
    @kanfoosj 3 years ago +1

    So basically it's a fancier Squeeze-Excite layer.

  • @sayakpaul3152
    @sayakpaul3152 3 years ago

    I found some of the stuff a bit confusing, honestly. On one hand, I see it capturing channel-wise interactions across an entire sequence (which is presumably a single image); on the other hand, the notation for the cross-covariance matrix says it's only for a single data point.
    You also kind of pointed out in the video that it does not even matter how we do it, as long as things are contextualized. Works like Non-local Means and Global Context Blocks also provide a nice way to achieve that, I would think.

  • @seetj12
    @seetj12 3 years ago +20

    1st comment. I know I am shameless :p

  • @pensiveintrovert4318
    @pensiveintrovert4318 3 years ago

    Convnets are transformers, but at the pixel / small-feature level.

  • @edeneden97
    @edeneden97 3 years ago +1

    Please correct me if I'm wrong, but it seems to me that all these attention / transposed-attention / dynamic-weights layers are doing is swapping a linear operation for a quadratic or cubic one. Am I wrong?
    That is to say, a normal FF layer is just a linear transformation (to which we sometimes add a non-linearity afterwards), and a dynamic-weights/attention layer is one where the weights themselves are a linear transformation of the input x, so the output is a quadratic transformation. If we use queries, keys and values we get a cubic transformation (I notice that I ignored the softmax, but the general point holds).
    If I am correct, why is it surprising that a higher-degree polynomial will fit better than a linear function? Please help me make sense of this.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Your statements are correct, but I think you're using a different notion of what is quadratic than what we usually talk about with these things. We refer to the operation as quadratic because it computes the interaction term between all pairs of tokens.

    • @edeneden97
      @edeneden97 3 years ago

      @@YannicKilcher I see what you mean; I meant a quadratic function of the input, as opposed to a linear function.
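
      A worked form of the point raised in this thread (a sketch that drops the softmax, scaling and normalisation; the simplification is not from the paper). With Q = X W_Q, K = X W_K, V = X W_V:

        feed-forward layer:  Y = X W              -- linear in X
        self-attention:      Y ~ (Q K^T) V        -- degree 3 in the entries of X, with an N x N intermediate
        XCA:                 Y ~ V (K^T Q)        -- same degree in X, but the intermediate is only d x d

      Both attention variants are cubic functions of the input once the weights are fixed; "quadratic" in the usual complexity discussion instead refers to the N x N cost of forming Q K^T over all token pairs.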

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    If Facebook AI researchers team creating a deep-seated in extensive reinforcement learning with coherential resolutioning in 26k sampling - a little kind of cross-covariance transformer, I'll go pass out.

  • @aspergale9836
    @aspergale9836 2 years ago

    This is just plain old linear memory networks. You call the two parts participating in the memory construction `q` and `k`, but you could just as well have called them `k` and `v` and nothing would've changed. Same exact formulas. And it makes more intuitive sense, in my honest opinion.

  • @mgostIH
    @mgostIH 3 years ago +7

    I think this sort of paper is kind of boring now: people just try a variation of the transformer by changing a couple of formulas minimally, then throw *A LOT* of compute and engineering with little tricks at it to get the same results we are used to getting.
    It might just be, as with FFNet, that if you stir the pot of input data and give it years of GPU processing, good performance is bound to happen. It seems more like a side effect of "The Bitter Lesson" by Sutton than anything else.

    • @oncedidactic
      @oncedidactic 3 years ago +1

      I had the same overall reaction. But to reframe: it's like these "hot" techniques, which win notoriety from performance that's at least as much big compute/data as solid architecture and careful handling, become the excuse to give consideration to basic research. It seems like lazy/obvious permutations to test, but if the same work were done without being in the category of a fad, you might call it useful basic work, if perhaps boring.
      These papers are bricks in the pyramid of "what do we know about structuring bias into NN architectures". Indeed, it seems like enough shaking, with some sort of inner structure and a learning signal, will perform some kind of useful search/sort. (Duh, maybe?) But what we want to know is which specific choices are good tradeoffs and, longer term, whether there is something fundamental to understand about it that can be distilled.
      So, keep making bricks for now.

    • @oncedidactic
      @oncedidactic 3 years ago

      Or in other words, what a privilege that we now get to consider this kinda boring, haha. Must be progress of some kind?

    • @mgostIH
      @mgostIH 3 years ago +1

      @@oncedidactic It's totally fair to have papers that make incremental improvements and try different things in order to explore the space of possibilities, or even to increase our certainty about known results, but hearing a paper like this explained isn't really adding that much to what has already been presented many times before.
      Maybe some engineering tricks will turn out to be very resilient and broadly beneficial (say, batch norm), but that is only something we can see some years after a paper has been published and tried.

  • @saeed577
    @saeed577 11 months ago

    Thanks for making this video, but the explanations were very bad 😅