MLP-Mixer: An all-MLP Architecture for Vision (Machine Learning Research Paper Explained)

  • Published 15 May 2024
  • #mixer #google #imagenet
    Convolutional Neural Networks have dominated computer vision for nearly 10 years, and that might finally come to an end. First, Vision Transformers (ViT) have shown remarkable performance, and now even simple MLP-based models reach competitive accuracy, as long as sufficient data is used for pre-training. This paper presents MLP-Mixer, which uses MLPs in a particular weight-sharing arrangement to achieve a competitive, high-throughput model, and it raises some interesting questions about the nature of learning, inductive biases, and their interaction with scale for future research.
    OUTLINE:
    0:00 - Intro & Overview
    2:20 - MLP-Mixer Architecture
    13:20 - Experimental Results
    17:30 - Effects of Scale
    24:30 - Learned Weights Visualization
    27:25 - Comments & Conclusion
    Paper: arxiv.org/abs/2105.01601
    Abstract:
    Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
    Authors: Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
    ERRATA: Here is their definition of what the 5-shot classifier is: "we report the few-shot accuracies obtained by solving the L2-regularized linear regression problem between the frozen learned representations of images and the labels"
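    A minimal PyTorch-style sketch of one Mixer layer as described in the abstract (an illustration only, not the authors' code; layer sizes are arbitrary):

      import torch
      from torch import nn

      class MixerBlock(nn.Module):
          # One Mixer layer: a token-mixing MLP across patches, then a channel-mixing MLP
          # per patch, each with LayerNorm and a skip connection.
          def __init__(self, num_patches, dim, token_hidden, channel_hidden):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim)
              self.token_mlp = nn.Sequential(
                  nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
              self.norm2 = nn.LayerNorm(dim)
              self.channel_mlp = nn.Sequential(
                  nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

          def forward(self, x):                            # x: (batch, patches, channels)
              y = self.norm1(x).transpose(1, 2)            # (batch, channels, patches)
              x = x + self.token_mlp(y).transpose(1, 2)    # mix spatial information across patches
              x = x + self.channel_mlp(self.norm2(x))      # mix per-location features within each patch
              return x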
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 105

  • @YannicKilcher  3 years ago +10

    OUTLINE:
    0:00 - Intro & Overview
    2:20 - MLP-Mixer Architecture
    13:20 - Experimental Results
    17:30 - Effects of Scale
    24:30 - Learned Weights Visualization
    27:25 - Comments & Conclusion

  • @TheOneSevenNine  3 years ago +145

    next up: SOTA on imagenet using [throws dart at wall] polynomial curve fitting

  • @CristianGarcia  3 years ago +95

    Coming soon to an arxiv near you: Random Forests are all you need.

  • @patrickjdarrow  3 years ago +50

    Log scale graphs are all you need (to make your results look competitive)

  • @YannicKilcher  3 years ago +22

    ERRATA: Here is their definition of what the 5-shot classifier is: "we report the few-shot accuracies obtained by solving the L2-regularized linear regression problem between the frozen learned representations of images and the labels"

  • @adamrak7560  3 years ago +15

    It effectively learns to implement a CNN, but it is more general, so it could implement attention-like mechanisms too, on global and local scales.
    This proves that the space of good-enough architectures is very large: basically, if the architecture is general enough, it can do the job.

  • @pensiveintrovert4318  3 years ago +34

    I have come to the conclusion that all large networks are expensive hash tables.

  • @mayankmishra3875  3 years ago +10

    Your explanation of such papers is very easy to understand and highly engaging. A big thank you for your videos.

  • @mrdbourke  3 years ago +14

    MLPs are all you need!

  • @linminhtoo  3 years ago +1

    Thanks for the explanation! Just a little funny how the abstract claims "We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs)." but the pseudo-code has "nn.Conv" (although, yes, it's not really a convolution, just a way to get the "linear embeddings", and probably more efficient in implementation). To me it rather shows how specialized layers still have their place, for reasons of practical efficiency or otherwise.
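    For illustration (a toy check with made-up shapes, not the paper's code): a fully connected layer applied to each patch with shared weights computes exactly the same numbers as a convolution whose kernel size and stride equal the patch size, which is why the stem can be written either way:

      import torch
      from torch import nn

      B, C, H, W, P, D = 2, 3, 32, 32, 8, 64                      # batch, channels, image size, patch size, embed dim
      conv = nn.Conv2d(C, D, kernel_size=P, stride=P)             # the "nn.Conv" stem from the pseudo-code
      fc = nn.Linear(C * P * P, D)
      fc.weight.data = conv.weight.data.reshape(D, -1)            # copy the same parameters into the FC layer
      fc.bias.data = conv.bias.data

      x = torch.randn(B, C, H, W)
      out_conv = conv(x).flatten(2).transpose(1, 2)               # (B, num_patches, D)

      patches = x.unfold(2, P, P).unfold(3, P, P)                 # cut into non-overlapping P x P patches
      patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
      out_fc = fc(patches)                                        # (B, num_patches, D)
      print(torch.allclose(out_conv, out_fc, atol=1e-5))          # True: same operation, different name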

  • @rock_sheep4241  3 years ago +11

    Congratulations on the PhD and on your work :D

  • @amrmartini3935  3 years ago +5

    10:21 are they really patches after MLP1? I feel like we lose the sense of patches, as MLP1 will mix together all patches of a channel. So, the "red" patch does not necessarily map to the "red" patch as shown in the diagram. Also, this mixing via transpose reminds me of normalizing flows that "mix" the limited coordinate updates of previous layers with permutation matrices. Here, the transpose is doing the same thing on flattened feature maps. Anyway, congrats on the PhD!

  • @TeoZarkopafilis  3 years ago +1

    Since different topologies can be equivalent (conv, attention, mlp), both in terms of learning capacity/ability and pure math/computationally-wise, I really like the 'effects of scale' part.

  • @luke2642  3 years ago +2

    A few-shot classifier is a better measure of generalisation, even if it's more relevant for theoretical academic progress than actual product implementations.

  • @Rhannmah  3 years ago +14

    Great video as usual. Although,
    15:49 I can't disagree more with this statement. Real scientific research isn't about a quest for results, it's a quest for answers. What I mean by that is that any answer is beneficial for the common knowledge, whether it be a positive or negative result.
    In the context of machine learning, if your results show that a specific type of architecture or algorithm produces worse results than already established methods, great! By publishing, you can now prevent others from pursuing that specific path of research, so research time gets put elsewhere in more productive areas. The modern requirement for positive results to get published is incredibly stupid and stifles progress in a huge manner.

  • @peterfeatherstone9768  3 years ago +1

    Do you think this architecture could be used for object detection? So have the final fully connected layer predict N x (4+1+C) features, where N is some upper bound on the number of possible objects, let's say 100 (like DETR), 4 corresponds to xywh, 1 to "is object" or "is empty", and C to the possible classes. Then use the Hungarian algorithm for matching targets with predictions (bipartite matching), a CIoU loss for xywh, and binary cross-entropy loss for the rest?
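    As a rough, hypothetical sketch of the head described above (module and parameter names are made up; the Hungarian matching and the CIoU/BCE losses would be applied at training time and are not shown):

      import torch
      from torch import nn

      class MixerDetectionHead(nn.Module):
          # Hypothetical DETR-style head on top of pooled Mixer features:
          # N object slots, each predicting xywh, an objectness logit and C class logits.
          def __init__(self, dim, num_slots=100, num_classes=80):
              super().__init__()
              self.num_slots, self.num_classes = num_slots, num_classes
              self.fc = nn.Linear(dim, num_slots * (4 + 1 + num_classes))

          def forward(self, pooled):                               # pooled: (batch, dim), e.g. mean over patches
              out = self.fc(pooled).view(-1, self.num_slots, 4 + 1 + self.num_classes)
              boxes = out[..., :4].sigmoid()                       # normalized xywh
              objectness = out[..., 4]                             # "is object" / "is empty" logit
              class_logits = out[..., 5:]
              return boxes, objectness, class_logits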

  • @sau002  3 years ago

    Very nicely presented. Thank you

  • @andres_pq  3 years ago +1

    Can't wait for Mixer BERT

  • @scottmiller2591  3 years ago +2

    So apparently they're calling linear (technically affine) transformations "fully connected," instead of "fully connected" meaning a perceptron layer.
    I did like that they put a "Things that did not work" section in the paper.

  • @yangyue5823  3 years ago +2

    One thing is for sure: residual connections are a must, no matter what architecture we are playing with.

    • @quAdxify  3 years ago +2

      But that's just a technicality due to the nature of back propagation. If some other kind of optimization was used (and there are alternatives, just not as fast), residual connections would most likely be useless.

    • @slobodanblazeski0  3 years ago

      @@quAdxify I don't think so. I believe that skip connections are something like going from coarser to finer resolution.

    • @quAdxify  3 years ago

      ​@@slobodanblazeski0 huh, that does not really make sense, you need to explain. Residual connection are all about avoiding vanishing gradients. It's the same concept LSTMs use to avoid vanishing gradients. In classic CNNs the pooling/ striding/ dilating is what allows for multiple resolutions/ scales.

    • @slobodanblazeski0  3 years ago

      @@quAdxify OK in GANs I feel it's something like previous result is more or less correct. Network should do some work in higher dimensional space but when you project in lower dimensional space and add it with intermediate result the next solution shouldn't change that much. For example you first create 8x8, then 16x16 to which you add previously generated and upsampled 8x8. etc If that makes any sense

  • @Cardicardi  3 years ago +4

    Am I getting something wrong, or does this only work for a fixed patch size and a fixed number of patches (i.e. fixed-size images)? Mostly due to the first MLP layer in the mixer layer (MLP 1), which operates on vectors of size "sequence length" (number of patches). It would be interesting to see if something like this could be achieved by replacing the transpose operation with a covariance operation (X.transpose dot X); this would eliminate the dependency on the sequence length and ultimately apply to any number of patches in the image (still a fixed patch size, I guess).
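    A toy version of the covariance suggestion above (my own sketch, not from the paper): X.transpose dot X has shape (channels, channels) regardless of how many patches there are, so anything applied to it no longer depends on the sequence length:

      import torch
      from torch import nn

      dim, hidden = 64, 128
      gram_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

      for num_patches in (49, 196, 784):                 # different image sizes -> different sequence lengths
          x = torch.randn(num_patches, dim)
          gram = x.T @ x / num_patches                   # (dim, dim) covariance-like mixing statistic
          mixed = gram_mlp(gram)                         # shape no longer depends on the number of patches
          print(num_patches, gram.shape, mixed.shape)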

    • @sheggle  3 years ago +4

      Or just use any type of pyramid pooling

  • @reuvper  2 years ago

    What a great channel! Thanks!

  • @youngseokjeon3376  14 days ago

    This seems to be a good technique for attaining a large receptive field with a cheap operation like an MLP, rather than the expensive self-attention mechanism.

  • @peterfeatherstone9768  3 years ago

    So it looks like this doesn't work with dynamic size, correct? Or can you have as many patches as you want since the MLP layers are 1x1 convolutions? In which case, the image needs to have dimensions divisible by the patch sizes? I assume layer norm isn't sensitive to image size...

  • @furkatsultonov9976  3 years ago +13

    " It is gonna be not long video..." - 28:11 mins

    • @emuccino  3 years ago +1

      His videos are often closer to an hour.

  • @konghong3885  3 years ago

    Hype for it being the next big thing in NLP

  • @jeffr_ac  1 year ago

    Great video!

  • @kaixiao2931  3 years ago

    I think there is a mistake in 9:34. The first channel corresponds to the upper left corner of each patch?

  • @sahityayadav9606  2 years ago

    Always amazing video

  • @shengyaozhuang3748  3 years ago +2

    Since it can scale up to encode a very long sequence of input, I'm very curious how this model could be applied to NLP tasks, as transformers such as BERT and the GPTs have a strong limitation on input length. I also noticed that Mixer does not use position embeddings because "the token-mixing MLPs are sensitive to the order of the input tokens, and therefore may learn to represent location"? I don't get why this is the case.

    • @dariodemattiesreyes3788  3 years ago

      I don't get it either, can someone clarify this? Thanks!

    • @my_master55  2 years ago

      I guess it doesn't need positional encoding because we assume that patches are stacked at the beginning in a certain order and then "unstacked" at the end with the same order, preserving their positions by this.
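      A tiny check of the quoted claim (my own toy example, not from the paper): a channel-mixing MLP is permutation-equivariant (shuffle the patches and the outputs just shuffle along), but a token-mixing MLP applies different weights to different patch positions, so it can tell positions apart and no extra position embedding is needed:

        import torch

        torch.manual_seed(0)
        S, C = 4, 8                                     # patches and channels (tiny, for illustration)
        x = torch.randn(S, C)
        perm = torch.tensor([1, 2, 3, 0])               # shuffle the patch order

        token_mix = torch.nn.Linear(S, S, bias=False)   # mixes along the patch dimension
        chan_mix = torch.nn.Linear(C, C, bias=False)    # mixes along the channel dimension

        # Channel mixing: permuting the patches just permutes the outputs (order-agnostic).
        print(torch.allclose(chan_mix(x[perm]), chan_mix(x)[perm]))                   # True
        # Token mixing: permuting the patches changes the result (order-sensitive, so position is encoded).
        print(torch.allclose(token_mix(x[perm].T).T, token_mix(x.T).T[perm]))         # False (almost surely)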

  • @rishikaushik8307  3 years ago

    In the mixer, MLP-1 just sees the distribution of a channel or feature without knowing what that feature is; appending the one-hot index of the channel to the input might help the model learn better.
    Also, can we really say that each channel in the output of MLP-1 after the transpose corresponds to a patch?

    • @arunavaghatak6281  3 years ago +1

      In the output of MLP-1, before transpose, each row alone contains information about the global distribution of some feature but no information regarding what that feature is. But notice that in the resultant matrix, information regarding the identity of the feature is still there. The first row corresponds to first feature, the second row corresponds to the second feature and so on. The feature identity is determined from the row number. So, after the transpose, when MLP-2 is applied, it knows that the first element of the input vector contains information about the global distribution of the first feature, second element contains that of the second feature and so on. So, information regarding the feature's identity is not lost. We don't need one hot index of the channel for that.
      And we can definitely say that each channel in the output of MLP-1 after transpose corresponds to a patch. That's because MLP-1 doesn't output a single scalar value. It outputs a vector in which each element corresponds to some patch. So, MLP-1 outputs information (about global distribution of the feature) needed by the first patch at first element of the output vector, that needed by the second patch at the second element of the output vector and so on. That's why different patches have different feature vectors after the transpose.
      Hope it makes sense. I have not studied deep learning very deeply.
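      To make the shapes concrete (dimensions picked arbitrarily, not the paper's code): the transpose just swaps which axis means "patch" and which means "feature", so neither identity is lost:

        import torch
        from torch import nn

        num_patches, dim, hidden = 196, 512, 256
        x = torch.randn(num_patches, dim)              # row j = patch j, column i = feature i

        # MLP-1 (token mixing) runs along the patch axis, once per feature, with shared weights.
        mlp1 = nn.Sequential(nn.Linear(num_patches, hidden), nn.GELU(), nn.Linear(hidden, num_patches))
        y = mlp1(x.T)                                  # (dim, num_patches): row i = feature i, mixed over all patches
        y = y.T                                        # (num_patches, dim): row j is "patch j" again,
                                                       # now holding globally mixed information for every feature
        print(y.shape)                                 # torch.Size([196, 512])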

  • @sau002  3 years ago

    How does feature detection work if there are no CNN kernels? E.g. at some point you mention corners getting detected in a patch.

  • @belab127  3 years ago +5

    One question:
    You say in the first "per-patch FC" layer the weights are shared between patches.
    In my opinion this would be the same as using a convolution with kernel size = patch size and stride = patch size.
    In short: if I do a lot of weight sharing, I end up building convolutions from scratch; saying it's not a convolution is kind of cheating, as convolutions can be viewed as multiple mini-networks with shared weights.

  • @zhiyuanchen4829  3 years ago +2

    Make MLPs Great Again!

  • @first-thoughtgiver-of-will2456  3 years ago

    I always thought that if transformers can approximate unrolled LSTMs, then a normal fully connected NN with dropout should be able to learn similar branching logic. Skip connections are awesome, but recurrent connections seem to be more difficult to implement efficiently than they are expressive. DAG > DCG for all approximators (in my opinion).

  • @sau002  3 years ago

    Thank you.

  • @andres_pq  3 years ago +4

    It's an axial convolution

  • @mlengineering9541  3 years ago

    It is a PointNet for image patches...

  • @yimingqu2403  3 years ago

    "How about 'an RNN with T=1'". I LIKE THIS.

  • @welcomeaioverlords  3 years ago +1

    In what ways is this *not* a CNN? It seems the core operation is to apply the same parameters to image patches in a translation-equivariant way. What am I missing?

  • @IoannisNousias  3 years ago

    Is the “mixer” kinda like a separable filter?

  • @hannesstark5024  3 years ago +8

    How is the per-patch fully-connected layer at the beginning different from a convolution with the patch size as the stride?

    • @JamesAwokeKnowing  3 years ago

      I think the point is the opposite: that it can be implemented using only MLP (very old, pre-CNN) concepts.

    • @sheggle  3 years ago

      @@JamesAwokeKnowing The point Hannes is trying to make is that by dividing the image into patches, they didn't actually remove all convolutions.

    • @hannesstark5024  3 years ago

      Well, seems that Yann LeCun is asking the same thing twitter.com/ylecun/status/1390419124266938371
      So I'll just guess that there is no difference.

    • @saimitheranj8741  3 years ago +2

      @@hannesstark5024 seems like it, even their pseudocode from the paper has a "nn.Conv", not sure what they mean by it being "convolution-free"

    • @hannesstark5024  3 years ago

      @@saimitheranj8741 Oh nice spot :D
      Do you have a link to the line in the code?

  • @talha_anwar  2 years ago

    In the case of RGB, will there be 3 channels?

  • @ecitslos  3 years ago

    You can implement an MLP with CNN layers, and due to hardware and CUDA magic it can be faster. They are essentially the same operation.

    • @ghostriley22  2 years ago

      Can you explain more or give any links?

    • @my_master55  2 years ago

      @@ghostriley22 1x1 convolutions can be applied instead of MLP.
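      A quick numerical check of that equivalence (toy sizes, my own sketch): a 1x1 convolution carrying the same weights as a Linear layer gives identical outputs at every patch position:

        import torch
        from torch import nn

        B, S, C_in, C_out = 2, 196, 512, 2048               # batch, patches, channels in/out
        x = torch.randn(B, S, C_in)

        fc = nn.Linear(C_in, C_out)
        conv = nn.Conv1d(C_in, C_out, kernel_size=1)
        conv.weight.data = fc.weight.data.unsqueeze(-1)     # (C_out, C_in, 1): same parameters
        conv.bias.data = fc.bias.data

        out_fc = fc(x)                                      # per-patch fully connected layer
        out_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # same thing written as a 1x1 convolution
        print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True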

  • @billykotsos4642  3 years ago +18

    WHAT IS THIS? ITS 1987 ALL OVER AGAIN!!!!!

  • @user-mb3mf2og9k  3 years ago +3

    How about replacing the MLPs with CNNs or transformers in this architecture?

  • @jonatan01i  3 years ago +5

    So
    1 only cares about WHERE, and
    2 only cares about WHAT.

  • @slackstation  3 years ago +2

    Congratulations on the PhD!

  • @NeoShameMan  3 years ago +1

    I'm confused, what does MLP mean here? I thought an MLP was made of binary neurons, i.e. each outputs either 1 or 0, no fancy curves like sigmoid and ReLU...

    • @YannicKilcher  3 years ago +3

      Multilayer Perceptron. A neural net consisting mainly of fully connected layers and nonlinearities

    • @NeoShameMan  3 years ago +1

      @@YannicKilcher thanks!

  • @qidongyang7817  3 years ago

    Pre-training again ?!

  • @JamesAwokeKnowing  3 years ago

    I think you missed an opportunity here. Well, maybe a follow-up "for beginners" video, because the code/process is so small you could get new-to-deep-learning folks to see the full process of SOTA vision. Like, you could illustrate the matrices and vectors, channels, GELU and norm (skip backprop) to show what code/math is necessary to take a 2D image and output a class, with SOTA accuracy.

  • @bluel1ng  3 years ago

    Mix it baby! Gimme dadda, more more dadda ..

  • @XOPOIIIO  3 years ago +4

    These patches can't extract data efficiently; they should at least overlap. To do better you don't need a larger dataset, just augment the existing one with translations.

    • @sheggle  3 years ago

      They do augment, what makes you think that you don't need a larger dataset?

    • @XOPOIIIO  3 years ago +1

      @@sheggle Because of how these patches are stacked together; they're not strided as in a CNN.

  • @NextFuckingLevel  3 years ago +3

    Reject Attention, embrace Perceptron

  • @Rizhiy13  3 years ago

    Now someone needs to try it on NLP)

  • @herp_derpingson  3 years ago

    Friendship ended with CNNs. DNN is my new best friend.

  • @droidcrackye5238  3 years ago

    Convs share parameters over patches; this paper does not.

  • @timhutcheson8714  3 years ago

    What I was doing in 1990...

  • @domenickmifsud  3 years ago +1

    Lightspeed

  • @jamgplus334  3 years ago

    next up: empty is all you need

  • @Mikey-lj2kq  3 years ago

    The per-patch layer looks like a CNN though.

  • @Adhil_parammel  3 years ago

    Can AlphaZero teach Go or chess lessons with the help of GPT-3!?

  • @swordwaker7749  3 years ago

    I feel that the MLP could be replaced with an LSTM or something.

  • @valeria6813  3 years ago

    my little pony my little pony AaAaAAaAa

  • @jameszhang126  3 years ago +1

    These MLPs are just another way of saying 1D CNN.

  • @sangeetamenon8958  3 years ago

    I want to mail you, please give me your ID.

  • @444haluk  2 years ago

    They literally proved nothing has changed after 2012.

  • @WhatsAI  3 years ago

    Hey Yannic! Awesome video as usual, and it's so great that you cover all these new papers so quickly. I love it!
    Also, we share your content on my discord server with over 10'000 members in the field of AI now. If you like chatting with people, I would love it if you joined us! It was built to allow people to share their projects, help each other etc., in the field of AI, and I would be extremely happy to see your name in there sometimes! You can also share your new videos and just talk with others on there. I am not sure if I can paste links in a youtube chat, but the server is Learn AI Together on discord. I could message you the link on Twitter if you'd like!

  • @peterfeatherstone9768  3 years ago +1

    It looks like what makes this good is the transposing of the patches which enhances the receptive field of the network. Maybe something similarly good could be achieved using pixel shuffle (pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html) and regular convs. Maybe pixel shuffling and other tricks like that could do the job just as fine.

    • @da_lime  2 years ago

      Sounds interesting