Involution: Inverting the Inherence of Convolution for Visual Recognition (Research Paper Explained)

  • Published May 15, 2024
  • #involution #computervision #attention
    Convolutional Neural Networks (CNNs) have dominated computer vision for almost a decade by applying two fundamental principles: spatial-agnostic and channel-specific computation. Involution inverts these principles and presents a computation that is spatial-specific and channel-agnostic. The resulting involution operator and RedNet architecture are a compromise between classic convolutions and the newer local self-attention architectures, and perform favorably in terms of the computation-accuracy tradeoff compared to either.
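    For the operator itself, the paper gives PyTorch-style pseudo-code; below is a minimal runnable stride-1 sketch of that idea (class and argument names are mine, not the authors'). Each pixel's feature vector generates its own K x K kernel, which is then shared across all channels within a group:

    import torch
    import torch.nn as nn

    class Involution2d(nn.Module):
        # Minimal stride-1 involution, after the paper's PyTorch pseudo-code.
        # `channels` must be divisible by `groups` and `reduction`.
        def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
            super().__init__()
            self.k, self.g = kernel_size, groups
            # Kernel generation: two 1x1 convs map each pixel's features to
            # its own K*K kernel per group (spatial-specific, channel-agnostic).
            self.reduce = nn.Conv2d(channels, channels // reduction, 1)
            self.span = nn.Conv2d(channels // reduction, kernel_size ** 2 * groups, 1)
            self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2)

        def forward(self, x):
            b, c, h, w = x.shape
            # Gather K*K neighbourhoods: (B, C*K*K, H*W) -> (B, G, C/G, K*K, H, W)
            patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h, w)
            # One kernel per pixel and group: (B, G*K*K, H, W) -> (B, G, 1, K*K, H, W)
            kernel = self.span(self.reduce(x)).view(b, self.g, 1, self.k ** 2, h, w)
            # Multiply-accumulate over the window, then merge groups back.
            return (kernel * patches).sum(dim=3).reshape(b, c, h, w)

    y = Involution2d(64)(torch.randn(2, 64, 32, 32))  # shape stays (2, 64, 32, 32)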
    OUTLINE:
    0:00 - Intro & Overview
    3:00 - Principles of Convolution
    10:50 - Towards spatial-specific computations
    17:00 - The Involution Operator
    20:00 - Comparison to Self-Attention
    25:15 - Experimental Results
    30:30 - Comments & Conclusion
    Paper: arxiv.org/abs/2103.06255
    Code: github.com/d-li14/involution
    Abstract:
    Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision. In this work, we rethink the inherent principles of standard convolution for vision tasks, specifically spatial-agnostic and channel-specific. Instead, we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution, coined as involution. We additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over-complicated instantiation. The proposed involution operator could be leveraged as fundamental bricks to build the new generation of neural networks for visual recognition, powering different deep learning models on several prevalent benchmarks, including ImageNet classification, COCO detection and segmentation, together with Cityscapes segmentation. Our involution-based models improve the performance of convolutional baselines using ResNet-50 by up to 1.6% top-1 accuracy, 2.5% and 2.4% bounding box AP, and 4.7% mean IoU absolutely while compressing the computational cost to 66%, 65%, 72%, and 57% on the above benchmarks, respectively. Code and pre-trained models for all the tasks are available at this https URL.
    Authors: Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 41

  • @YannicKilcher · 3 years ago · +6

    OUTLINE:
    0:00 - Intro & Overview
    3:00 - Principles of Convolution
    10:50 - Towards spatial-specific computations
    17:00 - The Involution Operator
    20:00 - Comparison to Self-Attention
    25:15 - Experimental Results
    30:30 - Comments & Conclusion

  • @yoperator8712 · 3 years ago · +18

    I love your channel, Yannic! Keep up the good work!

  • @EinsteinNewtonify · 3 years ago · +13

    Hello Yannic! First of all, thank you for the work you put into presenting these papers to us. It's obvious that you're getting faster and faster at producing them.
    But here's the thing: you might start a job soon. And maybe these videos are also great advertising for landing exciting positions. Nevertheless, I hope you will continue to find the time to fill your channel with life in the future.
    Take care, Jürgen

  • @bluel1ng · 3 years ago · +4

    Regarding position-specific computations: we could always concatenate position-encoding feature maps, so that the computed kernels would depend not only on content but also on absolute position.
    The video contains a great 10-minute CNN intro/recap! Some notes for completeness:
    - There are {1,2,3,...,N}-D convolutions; 4D weights are the 2D-conv case (the most popular, due to the image use case).
    - The center-output mappings only stay at the same position with proper padding; otherwise the output feature maps will be (w-kw+1) x (h-kh+1) (e.g. a 4x4 input with a 3x3 kernel -> 2x2 output without padding).
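    A quick PyTorch check of that size arithmetic (shapes follow the 4x4 example above):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 4, 4)                       # 4x4 input, 3 channels
    no_pad = nn.Conv2d(3, 8, kernel_size=3)           # (4-3+1) x (4-3+1) = 2x2
    same = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # proper padding keeps 4x4
    print(no_pad(x).shape)  # torch.Size([1, 8, 2, 2])
    print(same(x).shape)    # torch.Size([1, 8, 4, 4])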

  • @Neptutron · 2 years ago · +1

    These guys are presenting at CVPR tomorrow at 6am!

  • @nocomments_s · 3 years ago

    Amazing channel, man! Thank you very much!

  • @sayakpaul3152 · 3 years ago

    This gave me a pretty good convolution revision. Such a lovely one.

  • @NilabhraRoyChowdhury · 3 years ago · +11

    14:35 - "Whatever, we don't do slidey slidey anymore"

  • @spiritcrusha · 3 years ago · +2

    The idea here seems like a straightforward combination of fast weight memory networks and locally connected layers.

  • @nitikanigam287 · 3 years ago

    Love your channel and the way you explain things.
    Please also say something about research: how to do it, and what one needs to do to get into deep learning for computer vision.

  • @CristianGarcia · 3 years ago · +4

    I don't know if standard convolution operations in tensor frameworks support per-location kernels, which might be a barrier for practitioners in the short term. That said, I really like the idea.
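    For what it's worth, per-location kernels can already be emulated with stock ops via unfold plus a broadcast multiply, which is essentially what the paper's pseudo-code does; a bare sketch (all shapes illustrative):

    import torch
    import torch.nn.functional as F

    b, c, h, w, k = 1, 8, 10, 10, 3
    x = torch.randn(b, c, h, w)
    kernels = torch.randn(b, k * k, h, w)            # one 3x3 kernel per pixel, shared across channels
    patches = F.unfold(x, k, padding=1).view(b, c, k * k, h, w)
    y = (patches * kernels.unsqueeze(1)).sum(dim=2)  # per-location conv, (b, c, h, w)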

  • @NelsLindahl · 3 years ago

    Keep rocking the papers! You will be at 100k subscribers before you know it.

  • @rubenpartono · 3 years ago · +1

    About your comment at 24:00: if the pixel also contained spatial information (e.g. RGBXY), wouldn't this then be spatial-specific?

  • @herp_derpingson · 3 years ago

    15:10 The spatial agnosticity of CNNs is counteracted by the fact that the features these kernels extract are propagated further down the layers. Eventually a DNN at the end, or a global max pooling, does some non-spatially-agnostic stuff.
    .
    Nice idea; it would be interesting to see more of these meta-generated neural networks.

  • @hiransarkar1236 · 3 years ago

    Amazing

  • @jfno67 · 3 years ago · +3

    At 24:00 you mention that this "involution kernel" is also spatially agnostic, since the same kernel will be generated for two different pixels if they have the same channel components. Do you think it would be worthwhile to add a positional encoding to the channels to make each "involution kernel" truly position-specific?

    • @linminhtoo · 3 years ago · +3

      That's interesting, but is there really a need for that / do we want that? It would enforce the idea that pixels in the top-left corner of an image are semantically different from pixels elsewhere in the image, when in reality that is not true, since the location of a pixel is more an artifact of whoever took the photograph than of actual semantic meaning. Won't this make it lose translation invariance?
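      One common way to try what @jfno67 suggests is a CoordConv-style concat of coordinate channels before the kernel-generation step; whether the resulting loss of translation invariance helps or hurts, as raised above, is an empirical question. A sketch (helper name is mine):

      import torch

      def with_coords(x):
          # Append normalized y/x coordinate channels so a downstream
          # kernel-generation step sees absolute position, not just content.
          b, _, h, w = x.shape
          ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(b, 1, h, w)
          xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(b, 1, h, w)
          return torch.cat([x, ys, xs], dim=1)  # (b, c + 2, h, w)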

  • @JoshBrownKramer · 3 years ago · +2

    Where do new channels come from and how does information from different channels get fused together?

  • @RonaldoMessina · 3 years ago · +1

    that is some convoluted writing!

  • @nabi7600 · 2 years ago

    What do you use to highlight on the PDF?

  • @ACArchangels · 3 years ago

    What do you think about Kervolutions?

  • @usamanavid2044 · 3 years ago

    Where can one learn about transformers & self-attention?

  • @albertwang5974 · 3 years ago

    Maybe we could just create a bunch of kernels manually, then apply every kernel to every channel. After several layers we would get a channel-tree network, a network with no training needed: we mark every active cell as a connection to the target, and more connections mean more credit to the target.

  • @anishbhanushali · 3 years ago

    Thanks to their pseudo-code, I got that the kernels are not directly learnable weights (in contrast to the normal convolution convention).

  • @priyamdey3298 · 3 years ago · +4

    2:32 - 2:59 This reminded me of Schmidhuber 😀

  • @kimchi_taco · 3 years ago

    I heard there are multiple instances on AWS whose names are yannic_xxxx.

  • @edeneden97 · 3 years ago

    Isn't the first part just a depthwise convolution?

  • @nauman.mustafa · 3 years ago · +8

    Yet another good paper with PyTorch pseudo-code instead of math-heavy LaTeX.

    • @Hugo-ms4mx · 3 years ago · +2

      @@frazuppi4897 What makes the code bad? I've quickly skimmed through it and it doesn't look that bad to me. Also, I guess the original author was being ironic? He would have preferred more math in the paper? Trying to learn here :) thanks!

  • @billykotsos4642 · 3 years ago · +2

    Doesn't channel = feature... roughly ?

    • @pensiveintrovert4318 · 3 years ago · +1

      Only if it turns out to be useful, but these nets are not intended to prune useless, random "features." Ever wondered why BERT transformers have 8 heads? They throw enough compute/storage at any problem and hope some useful feature would float to the top.

    • @YannicKilcher · 3 years ago · +1

      On a per-layer basis, yes, more or less. Channel is the technical name for the dimension, while feature is a more conceptual thing.

  • @yoheikao490 · 3 years ago · +4

    Too bad that the emphasis is on the number of parameters or FLOPs; these are known to be poor proxies for the things that really matter: generalization and computation time. The latter point is a huge disappointment (at least for those still believing in the relevance of FLOPs), as RedNets are actually *slower* (Table 2 of the paper) than ResNets. Oops, the x-axis of those comparison graphs is now irrelevant: how does RedNet then *really* compare to the old ResNet, not to mention newer variants?

    • @datrumart · 3 years ago · +2

      This could be explained by PyTorch calling a very well-optimized NVIDIA cuDNN implementation for the convolution operation used in ResNet, whereas their new operation is written in pure PyTorch. Using computation time is a bad idea for theoretical papers, as results would be even more sensitive to the hardware lottery.
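      If you care about wall-clock rather than FLOPs on your own hardware, a crude benchmark like the following (helper name is mine; CPU only, GPU timing would also need torch.cuda.synchronize()) is enough to see such implementation gaps:

      import time
      import torch
      import torch.nn as nn

      def wall_clock_ms(module, x, iters=50):
          # Average forward-pass time in milliseconds, after one warm-up call.
          with torch.no_grad():
              module(x)
              t0 = time.perf_counter()
              for _ in range(iters):
                  module(x)
          return (time.perf_counter() - t0) / iters * 1e3

      x = torch.randn(1, 64, 56, 56)
      print(wall_clock_ms(nn.Conv2d(64, 64, 3, padding=1), x))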

  • @yoperator8712 · 3 years ago · +1

    first comment!

  • @gamerx1133 · 3 years ago

    second

  • @MstProper · 3 years ago · +1

    I like turtles

  • @freemind.d2714 · 3 years ago · +2

    Free Hong Kong!!!

  • @nigelwan2841 · 3 years ago

    內卷 ("involution")

  • @01FNG · 3 years ago

    These researchers are going too far!!