Involution: Inverting the Inherence of Convolution for Visual Recognition (Research Paper Explained)
- Published May 15, 2024
- #involution #computervision #attention
Convolutional Neural Networks (CNNs) have dominated computer vision for almost a decade by applying two fundamental principles: spatial-agnostic and channel-specific computation. Involution aims to invert these principles and presents a spatial-specific computation that is also channel-agnostic. The resulting involution operator and RedNet architecture are a compromise between classic convolutions and the newer local self-attention architectures, and perform favorably in terms of the computation-accuracy tradeoff compared to either.
OUTLINE:
0:00 - Intro & Overview
3:00 - Principles of Convolution
10:50 - Towards spatial-specific computations
17:00 - The Involution Operator
20:00 - Comparison to Self-Attention
25:15 - Experimental Results
30:30 - Comments & Conclusion
Paper: arxiv.org/abs/2103.06255
Code: github.com/d-li14/involution
Abstract:
Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision. In this work, we rethink the inherent principles of standard convolution for vision tasks, specifically spatial-agnostic and channel-specific. Instead, we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution, coined as involution. We additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over-complicated instantiation. The proposed involution operator could be leveraged as fundamental bricks to build the new generation of neural networks for visual recognition, powering different deep learning models on several prevalent benchmarks, including ImageNet classification, COCO detection and segmentation, together with Cityscapes segmentation. Our involution-based models improve the performance of convolutional baselines using ResNet-50 by up to 1.6% top-1 accuracy, 2.5% and 2.4% bounding box AP, and 4.7% mean IoU absolutely while compressing the computational cost to 66%, 65%, 72%, and 57% on the above benchmarks, respectively. Code and pre-trained models for all the tasks are available at this https URL.
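To make the inverted design concrete: at every pixel, involution generates a K×K kernel from that pixel's own feature vector and shares it across all channels within a group. The following is not the authors' implementation (see the linked repo for that) but a minimal NumPy sketch, assuming the simplest possible kernel-generation function (two linear maps with a ReLU):

```python
import numpy as np

def involution(x, w1, w2, k=3, groups=1):
    """Minimal involution sketch (NumPy, illustrative only).
    x:  (C, H, W) input feature map
    w1: (C_mid, C) and w2: (groups*k*k, C_mid) -- a tiny two-layer
        function that generates the kernel from each pixel's features.
    """
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # 'same' padding
    out = np.zeros_like(x)
    cg = c // groups                       # channels per group
    for i in range(h):
        for j in range(w):
            # spatial-specific: the kernel depends on this pixel's features
            kern = w2 @ np.maximum(w1 @ x[:, i, j], 0.0)  # (groups*k*k,)
            kern = kern.reshape(groups, k, k)
            patch = xp[:, i:i + k, j:j + k]               # (C, k, k)
            for g in range(groups):
                # channel-agnostic: one kernel shared by the whole group
                out[g * cg:(g + 1) * cg, i, j] = (
                    patch[g * cg:(g + 1) * cg] * kern[g]
                ).sum(axis=(1, 2))
    return out
```

In the paper the kernel-generation function is a small bottleneck over the pixel's feature vector; the double loop here is purely for readability (the official code vectorizes this with unfold operations).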
Authors: Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
I love your channel, Yannic! Keep up the good work!
Hello Yannic! First of all, thank you for the work you put into presenting these papers to us. It's obvious that you're getting faster and faster at producing them.
That said, you might start a regular job soon, and these videos are surely great advertising for landing exciting positions. Nevertheless, I hope you will continue to find the time to keep your channel alive in the future.
Take care, Jürgen
Regarding position-specific computations: we could always concatenate position-encoding feature maps, so that the computed kernels would depend not only on content but also on actual position.
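For illustration, the position-encoding idea in this comment (appending coordinate maps as extra channels, as in CoordConv) could be sketched like this; `add_coord_channels` is a hypothetical helper, not something from the paper:

```python
import numpy as np

def add_coord_channels(x):
    """Append normalized X/Y coordinate maps as two extra channels.
    x: (C, H, W) -> (C+2, H, W); coordinates run from -1 to 1.
    Any kernel generated from these features now sees real position.
    """
    c, h, w = x.shape
    ys = np.linspace(-1.0, 1.0, h)[:, None].repeat(w, axis=1)  # (H, W)
    xs = np.linspace(-1.0, 1.0, w)[None, :].repeat(h, axis=0)  # (H, W)
    return np.concatenate([x, ys[None], xs[None]], axis=0)
```

As the discussion further down notes, whether this is desirable is debatable, since it trades away translation invariance.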
The video contains a great 10min CNN intro/recap! Some notes for completeness:
- There are {1, 2, 3, ..., N}-D convolutions; 4-D weights are the 2-D conv case (the most popular, due to the image use case).
- The center-output mappings only stay at the same position with proper padding; otherwise the output feature maps shrink to (w-kw+1) x (h-kh+1) (e.g. a 4x4 input with a 3x3 kernel gives a 2x2 output without padding).
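The shrinkage noted above follows the standard output-size formula `(dim - k + 2*pad) // stride + 1`; a tiny helper (illustrative, `conv_out_size` is not a framework function) makes it easy to check:

```python
def conv_out_size(h, w, kh, kw, pad=0, stride=1):
    """Spatial output size of a 2-D convolution."""
    return ((h - kh + 2 * pad) // stride + 1,
            (w - kw + 2 * pad) // stride + 1)

conv_out_size(4, 4, 3, 3)          # (2, 2): no padding shrinks the map
conv_out_size(4, 4, 3, 3, pad=1)   # (4, 4): 'same' padding preserves size
```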
These guys are presenting at CVPR tomorrow at 6am!
Amazing channel, man! Thank you very much!
My convolution revision went pretty well thanks to this. Such a lovely video!
14:35 - "Whatever, we don't do slidey slidey anymore"
The idea here seems like a straightforward combination of fast weight memory networks and locally connected layers.
Love your channel and the way you explain things.
Please do tell us something about doing research: how to get started, and what one needs to do in deep learning for computer vision.
I don't know if standard convolution operations in tensor frameworks support per-location kernels, which might be a barrier for practitioners in the short term. That said, I really like the idea.
Keep rocking the papers! You will be at 100k subscribers before you know it.
About your comment at 24:00, if the pixel also contained spatial information (e.g. RGBXY), wouldn't this then be spatial-specific?
15:10 The spatial agnosticism of CNNs is counteracted by the fact that the features these kernels extract are propagated further down the layers. Eventually a DNN at the end, or a global max pooling, does some non-spatially-agnostic stuff.
Nice idea; it would be interesting to see more of these meta-generated neural networks.
Amazing
At 24:00 you mention that this "involution kernel" is also spatial-agnostic, since it will generate the same kernel for two different pixels if they have the same channel components. Do you think it would be worthwhile to add a positional encoding to the channels to make each "involution kernel" truly position-specific?
That's interesting, but is there really a need for that, and do we want it? It would enforce the idea that pixels in the top-left corner of an image are semantically different from pixels elsewhere, when in reality that is not true, since the location of a pixel is more an artifact of whoever took the photograph than of actual semantic meaning. Won't this make it lose translation invariance?
Where do new channels come from and how does information from different channels get fused together?
that is some convoluted writing!
what do you use to highlight on the pdf?
What do you think about Kervolutions?
Where to learn about transformers & Self Attention?
Maybe we could just create a bunch of kernels manually and apply every kernel to every channel. After several layers we would get a channel-tree network, a network that needs no training: we mark every activated cell as a connection to the target, and more connections mean more credit for the target.
Thanks to their pseudo-code, I got that the kernels are not directly learnable weights (in contrast to the normal convolution convention).
2:32 - 2:59 This reminded me of Schmidhuber 😀
I heard there are multiple instances on AWS whose names are yannic_xxxx.
Isn't the first part just a depthwise convolution?
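For comparison with the question above, here is a minimal, illustrative NumPy sketch of a depthwise convolution (not the paper's code). Note how it is the mirror image of involution: the kernels are fixed learned weights, one per channel (channel-specific), and the same kernel is slid over every spatial position (spatial-agnostic):

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Depthwise 2-D convolution sketch with 'same' padding.
    x: (C, H, W), kernels: (C, k, k) -- one kernel per channel,
    shared across all spatial positions.
    """
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            # every position uses the same per-channel kernel
            out[:, i, j] = (xp[:, i:i + k, j:j + k] * kernels).sum(axis=(1, 2))
    return out
```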
Yet another good paper with PyTorch pseudo-code instead of math-heavy LaTeX.
@@frazuppi4897 What makes the code bad? I quickly skimmed through it and it doesn't look that bad to me. Also, I guess the original author was being ironic? He would have preferred more math in the paper? Trying to learn here :) thanks!
Doesn't channel = feature... roughly ?
Only if it turns out to be useful, but these nets are not intended to prune useless, random "features." Ever wondered why BERT transformers have 8 heads? They throw enough compute/storage at any problem and hope some useful feature would float to the top.
on a per-layer basis, yes more or less. channel is the technical name for the dimension, while feature is a more conceptual thing
Too bad the emphasis is on the number of parameters or FLOPs; these are known to be poor proxies for the things that really matter: generalization and computation time. The latter point is a huge disappointment (at least for those who still believe in the relevance of FLOPs), as RedNets are actually *slower* (Table 2 of the paper) than ResNets. Oops, the x-axis of those comparison graphs is now irrelevant: how does RedNet then *really* compare to the old ResNet, not to mention newer variants?
This could be explained by the fact that PyTorch calls a very well-optimized NVIDIA cuDNN implementation for the convolution operation used in ResNet, whereas their new operation is written in pure PyTorch. Using computation time is a bad idea for theoretical papers, as the results would be even more sensitive to the hardware lottery.
first comment!
second
I like turtles
Free Hong Kong!!!
內卷 ("involution" in Chinese)
These researchers are going too far!!