Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)

  • Published 28 Apr 2024
  • #perceiver #deepmind #transformer
    Inspired by the fact that biological creatures attend to multiple modalities at the same time, DeepMind releases its new Perceiver model. Based on the Transformer architecture, the Perceiver makes no assumptions on the modality of the input data and also solves the long-standing quadratic bottleneck problem. This is achieved by having a latent low-dimensional Transformer, where the input data is fed multiple times via cross-attention. The Perceiver's weights can also be shared across layers, making it very similar to an RNN. Perceivers achieve competitive performance on ImageNet and state-of-the-art on other modalities, all while making no architectural adjustments to input data.
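
    To make the shapes concrete, here is a minimal, hypothetical PyTorch sketch of that loop (the sizes, module names, and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's code): a small latent array repeatedly cross-attends to the large byte array and then refines itself with latent self-attention, so the expensive step costs O(N*M) rather than O(M*M).

```python
# Minimal sketch under assumed sizes (not DeepMind's implementation): a small
# latent array iteratively cross-attends to a large byte array, then runs
# self-attention among the latents; reusing the same modules every iteration
# corresponds to the RNN-like weight sharing mentioned above.
import torch
import torch.nn as nn

N, D = 256, 512        # latent array: N latents of width D
M, C = 50_176, 259     # byte array: e.g. 224x224 pixels with RGB + position features

cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=1, kdim=C, vdim=C, batch_first=True)
latent_attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

latents = torch.randn(1, N, D)       # a learned parameter in the real model
byte_array = torch.randn(1, M, C)    # the raw input, re-read at every iteration

for _ in range(4):                   # "iterative attention"
    latents, _ = cross_attn(latents, byte_array, byte_array)   # O(N*M) cross-attention
    latents, _ = latent_attn(latents, latents, latents)        # O(N*N) latent transformer
```
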
    OUTLINE:
    0:00 - Intro & Overview
    2:20 - Built-In assumptions of Computer Vision Models
    5:10 - The Quadratic Bottleneck of Transformers
    8:00 - Cross-Attention in Transformers
    10:45 - The Perceiver Model Architecture & Learned Queries
    20:05 - Positional Encodings via Fourier Features
    23:25 - Experimental Results & Attention Maps
    29:05 - Comments & Conclusion
    Paper: arxiv.org/abs/2103.03206
    My Video on Transformers (Attention is All You Need): • Attention Is All You Need
    Abstract:
    Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
    Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 127

  • @YannicKilcher
    @YannicKilcher 3 years ago +13

    OUTLINE:
    0:00 - Intro & Overview
    2:20 - Built-In assumptions of Computer Vision Models
    5:10 - The Quadratic Bottleneck of Transformers
    8:00 - Cross-Attention in Transformers
    10:45 - The Perceiver Model Architecture & Learned Queries
    20:05 - Positional Encodings via Fourier Features
    23:25 - Experimental Results & Attention Maps
    29:05 - Comments & Conclusion

  • @mgostIH
    @mgostIH 3 years ago +71

    This approach is so elegant! Unironically Schmidhuber was right that the more something looks like an LSTM the better 😆

    • @reesejammie8821
      @reesejammie8821 3 years ago +7

      I always thought the human brain is a recurrent neural network with a big hidden state and being constantly fed data from the environment.

    • @6lack5ushi
      @6lack5ushi 3 years ago

      Powerful!!!

  • @srikanthpolisetty7476
    @srikanthpolisetty7476 3 years ago +3

    Congratulations. I'm so glad this channel is growing so well, great to see a channel get the recognition they deserve. Can't wait to see where this channel goes from here.

  • @RS-cz8kt
    @RS-cz8kt 3 years ago +1

    Stumbled upon your channel a couple of days ago, watched a dozen videos since then, amazing work, thanks!

  • @Gorulabro
    @Gorulabro 3 years ago +9

    Your videos are a joy to watch.
    Nothing I do in my spare time is so useful!

  • @jamiekawabata7101
    @jamiekawabata7101 3 years ago +14

    The scissors scene is wonderful!

  • @sanzharbakhtiyarov4044
    @sanzharbakhtiyarov4044 3 years ago +2

    Thanks a lot for the review Yannic! Great work

  • @bardfamebuy
    @bardfamebuy 3 years ago +4

    I love how you did the cutting in front of a green screen and didn't even bother editing it out.

  • @CristianGarcia
    @CristianGarcia 3 years ago +3

    This is VERY nice! I'd love to give it a spin on a toy dataset. 😍
    BTW: Many transformer patterns can be found in the Set Transformers paper, the learned query reduction strategy is termed Pooling by Attention.
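
    For reference, the Set Transformer's pooling-by-attention idea mentioned above can be sketched roughly like this (single head, no feed-forward blocks, illustrative sizes; not the reference implementation): a small set of learned seed vectors queries the input set and returns a fixed-size summary.

```python
# Rough single-head sketch of pooling by attention with learned seed queries
# (illustrative; omits the multi-head and feed-forward parts of the Set Transformer).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=128, num_seeds=4):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(num_seeds, dim))  # learned queries
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (set_size, dim)
        k, v = self.to_k(x), self.to_v(x)
        attn = torch.softmax(self.seeds @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                  # (num_seeds, dim)

summary = AttentionPool()(torch.randn(1000, 128))        # 1000-element set -> 4 vectors
```
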

  • @maxdoner4528
    @maxdoner4528 2 years ago

    Good job, it's pretty great to have these topics explained by someone other than the authors. Keep it up!

  • @timdernedde993
    @timdernedde993 3 years ago +3

    Hey Yannic, great Video as usual :)
    If you want some feedback I feel like you could have covered the results a bit more. I do think the methodology of course is much more important but it helps to have a bit of an overview of how good it performs at what tasks. Maybe give it a few minutes more in the results section next time. But anyways still enjoyed the video greatly. Keep up the great work!

  • @simonstrandgaard5503
    @simonstrandgaard5503 3 years ago

    Excellent walkthrough

  • @emilianpostolache545
    @emilianpostolache545 3 years ago +9

    27:30 - Kant is all you need

  • @JTedam
    @JTedam 2 years ago

    this helps a lot to make research accessible

  • @Coolguydudeness1234
    @Coolguydudeness1234 3 years ago +7

    I lost it when you cut the piece of paper 😂

  • @HuyNguyen-rb4py
    @HuyNguyen-rb4py 2 years ago

    so touching for an excellent video

  • @robboswell3943
    @robboswell3943 1 year ago +4

    Excellent video! A critical question: How exactly are the learned latent arrays being learned? Is there some kind of algorithm used to create the learned latent array by reducing the dimensions of the input "byte array"? They never really go into detail about the exact process they used to do this in the paper. Surprisingly, no online sources on this paper that I have found speak about the exact process either. On pg. 3, it does state, "The model can also be seen as performing a fully end-to-end clustering of the inputs with latent positions as cluster centres..." But this is a pretty generic explanation. Could you please provide a short explanation of the process they used?
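
    For what it's worth, in typical re-implementations the latent array is not computed from the input at all: it is simply a trainable tensor, initialized randomly and updated by backpropagation along with every other weight. A minimal sketch under assumed sizes (not the authors' code):

```python
# Sketch: the latent array is an ordinary learned parameter; gradients from the
# task loss flow into it through the cross-attention (assumed sizes, toy head).
import torch
import torch.nn as nn

class TinyPerceiver(nn.Module):
    def __init__(self, num_latents=256, latent_dim=512, byte_dim=259, num_classes=1000):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        self.to_k = nn.Linear(byte_dim, latent_dim)
        self.to_v = nn.Linear(byte_dim, latent_dim)
        self.head = nn.Linear(latent_dim, num_classes)

    def forward(self, byte_array):                        # byte_array: (M, byte_dim)
        k, v = self.to_k(byte_array), self.to_v(byte_array)
        attn = torch.softmax(self.latents @ k.T / k.shape[-1] ** 0.5, dim=-1)
        z = attn @ v                                      # (num_latents, latent_dim)
        return self.head(z.mean(dim=0))                   # average latents, classify

model = TinyPerceiver()
loss = model(torch.randn(50_176, 259)).sum()
loss.backward()
print(model.latents.grad.shape)   # torch.Size([256, 512]): the latents learn by backprop
```
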

  • @AbgezocktXD
    @AbgezocktXD 3 years ago +119

    One day you will stop explaining how transformers work and I will be completely lost

  • @MsFearco
    @MsFearco 3 years ago

    I just finished this, it's an extremely interesting paper.
    Please review the Swin Transformer next. It's even more interesting :)

  • @hugovaillaud5102
    @hugovaillaud5102 3 years ago +1

    Is this architecture slower than a resnet with a comparable amount of parameters due to the fact that it is somehow recurrent?
    Great video, you explain things so clearly!

  • @Ronschk
    @Ronschk 3 years ago +1

    Really nice idea. I wonder how much improvement it would bring if the incoming data were converted through a "sense". Our brain also doesn't receive images directly, but instead receives signals from our eyes which transform the input image (and use something akin to convolutions?). So you would have this as a generic compute structure, but depending on the modality you would have a converter. I think they had something like this in the "one model to rule them all" paper or so...

  • @bender2752
    @bender2752 3 years ago

    Great video! Consider making a video about DCTransformer maybe? 😊

  • @amirfru
    @amirfru 3 years ago +3

    This is incredibly similar to Tabnet ! but with the attentive blocks changed to attention layers

  • @neworldemancer
    @neworldemancer 3 years ago

    Thanks for the video, Yannic! I would imagine that the attention "lines" @27:00 could indeed be static, but the alternative is that they are input dependent, yet overfitted to the FF, as these lines are a clear artefact.

  • @justindaniels863
    @justindaniels863 1 year ago

    unexpected combination of humour and intelligence!

  • @TheCreativeautomaton
    @TheCreativeautomaton 3 years ago

    Hey, thanks for doing this. I very much like the direction of transformers in ML. I'm newer to NLP and looking at where ML might go next. Once again, thanks.

  • @Daniel-ih4zh
    @Daniel-ih4zh 3 years ago +18

    Things are going so fast in the last year or two.

    • @ssssssstssssssss
      @ssssssstssssssss 3 years ago +2

      I disagree... There haven't really been many major innovations in machine learning in the past two years.

  • @patf9770
    @patf9770 2 years ago

    Something I just noticed about the attention maps: they seem to reflect something about the positional encodings? It looks like the model processes images hierarchically, globally at first and with a progressively finer tooth comb. My understanding is that CNNs tend to have a bias towards local textural information so it'd be really cool if an attention model learned to process images more intuitively

  • @maks029
    @maks029 3 years ago

    Thanks for an amazing video. I didn't really catch what the "latent array" represents? Is it an array of zeros at first?

  • @henridehaybe525
    @henridehaybe525 3 years ago

    It would be nice to see how the Perceiver would perform when the KV of the cross-attentions are not the raw image at each "attend" but the feature maps of a pretrained ResNet. E.g. the first "attend" KV are the raw image, the second KV is the feature maps of the second ResNet output, and so on. A pretrained ResNet would do the trick but it could technically be feasible to train it concurrently. It would be a Parallel-Piped Convolutional-Perceiver model.

  • @48956l
    @48956l 2 years ago

    thank you for that wonderful demonstration with the piece of paper lol

  • @Shan224
    @Shan224 3 years ago

    Thank you yannic

  • @ruroruro
    @ruroruro 3 years ago +7

    Yeah, the attention maps look really really suspicious. Almost like the network only attends to the fourier features after the first layer.
    Also, the whole idea, that they are feeding the same unprocessed image into the network multiple times seems really weird. The keys should basically be a linear combination of r,g,b and the same fourier features each time. How much information can you realistically extract from an image just by attending to the low level color and positional information.
    I would have expected them to at least use a simple resnet or FPN alongside the "thin" attention branch thingy.

    • @reesejammie8821
      @reesejammie8821 3 years ago +1

      Couldn't agree more. It's like the attention maps are far from being content-based. Also agree on the features being too low level, what does it even mean to attend to raw pixels?

  • @jonathandoucette3158
    @jonathandoucette3158 3 years ago +2

    Fantastic video, as always! Around 20:05 you describe transformers as invariant to permutations, but I believe they're more accurately equivariant, no? I.e. permuting the input permutes the output in exactly the same way, as opposed to permuting the input leading to the exact same output. Similar to convolutions being equivariant w.r.t. position

    • @mgostIH
      @mgostIH 3 years ago +1

      You could say those terms are just equivariant to mistakes!

    • @ruroruro
      @ruroruro 3 years ago +2

      Transformers are invariant to key+value permutations and equivariant to query permutations. The reason why they are invariant to k+v permutations is that for each query all the values get summed together and the weights depend only on the keys. So if you permute the keys and the values in the same way, you still get the same weights and the sum is still the same (see the quick numerical check after this thread).

    • @jonathandoucette3158
      @jonathandoucette3158 3 years ago

      @@ruroruro Ahh, thanks for the clarification! In my head I was thinking only of self attention layers, which based on your explanation would indeed be permutation equivariant. But cross-attention layers are more subtle; queries equivariant, keys/values invariant (if they are permuted in the same way).

    • @anonymouse2884
      @anonymouse2884 2 years ago

      I believe that it is permutation invariant: since you are doing a weighted sum of the inputs/context, you should "roughly" (the positional encoder might encode different time indices slightly differently, but this should not matter a lot) get the same results even if you permute the inputs.
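
      A quick numerical check of the distinction discussed in this thread (a toy single-head sketch with no learned projections): permuting keys and values together leaves the output unchanged, while permuting the queries permutes the output rows the same way.

```python
# Toy check: attention is invariant to a joint permutation of keys/values and
# equivariant to a permutation of queries (single head, no projections).
import torch

def attend(q, k, v):
    return torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1) @ v

q, k, v = torch.randn(4, 8), torch.randn(10, 8), torch.randn(10, 8)
pkv, pq = torch.randperm(10), torch.randperm(4)

print(torch.allclose(attend(q, k, v), attend(q, k[pkv], v[pkv]), atol=1e-6))  # True
print(torch.allclose(attend(q, k, v)[pq], attend(q[pq], k, v), atol=1e-6))    # True
```
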

  • @emmanuellagarde2212
    @emmanuellagarde2212 3 years ago

    If the attention maps for layers >2 are not image specific, then this echoes the results of the paper "Pretrained Transformers as Universal Computation Engines" which suggests that there is a universal mode of operation for processing "natural" data

  • @notsure7132
    @notsure7132 3 years ago

    Thank you.

  • @herp_derpingson
    @herp_derpingson 3 years ago +8

    17:30 Since you already bought a green screen, maybe next time put Mars or the Apollo landing in the background. Or a large cheese cake. That's good too.
    .
    All in all. One architecture to rule them all.

  • @TheGreatBlackBird
    @TheGreatBlackBird 3 years ago

    I was very confused until the visual demonstration.

  • @petrroll
    @petrroll 3 years ago

    There's one thing I don't quite understand. How does this model do low-level feature capture / how does it retain that information? I.e. how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features?
    The reason why I don't quite understand it is that the amount of information that flows between the first and second layer of this and e.g. the first and second module of a ResNet is drastically different. In this case it's essentially N*D, which I suppose is way smaller than M*D in the case of ResNet (not quite M because there's some pooling even in the first section of ResNet, but still close), simply on account of N being much smaller than M.

  • @hanstaeubler
    @hanstaeubler 2 years ago

    It would also be interesting to 'interpret' this model or algorithm on the music level as well (I compose music myself for my pleasure)?
    Thanks in any case for the good interpretation of this AI work!

  • @L9X
    @L9X 2 years ago +1

    Could this perhaps be used to model incredibly long distance relationships, i.e. incredibly long term memory? As in, the latent query vector (I'll just call it Q from here) becomes the memory. Perhaps we start off with a randomly initialised latent Q_0 and input KV_0 - let's say the first message sent by a user - to the perceiver which produces latent output Q_1, and we then feed Q_1 back into the perceiver with the next message sent by the user KV_1 as an input and get output Q_2 from the perceiver and so on. Then at every step we take Q_n and feed that to some small typical generative transformer decoder to produce a response to the user's message. This differs from typical conversational models, such as those using GPT-whatever, because they feed the entire conversation back into the model as input, and since the model has a constant size input, the older messages get truncated as enough new messages are given, which means the older memories get totally lost. Could this be a viable idea? We could have M >> N which means we have more memory than input length, but if we keep M on the order of a thousand that gives us 1000 'units' of memory that retain only the most important information.

  • @pvlr1788
    @pvlr1788 2 years ago

    Thanks for the video!
    But I can't understand where the first latent array comes from.

  • @Deez-Master
    @Deez-Master 3 years ago

    Nice video

  • @dr.mikeybee
    @dr.mikeybee 3 years ago

    Even with my limited understanding, this looks like a big game changer.

  • @peterszilvasi752
    @peterszilvasi752 2 years ago

    17:07 - The visual demonstration of how the quadratic bottleneck is solved was a true "Explain Like I'm Five" moment. 😀

  • @jonatan01i
    @jonatan01i 3 years ago +11

    2:44
    "And the image is of not a cat!, a house! What did you think??!.."
    I thought nothing; my mind was empty :(

  • @cptechno
    @cptechno 3 years ago +5

    Yes, I like this type of content. Keep up the good work. Bringing this material to our attention is a prime service. You might consider creating an AI.tv commercial channel. I'll join.

  • @piratepartyftw
    @piratepartyftw 3 years ago +4

    Very cool. I wonder if it works when you feed in multimodal data (e.g. both image and text in the same byte array).

    • @galchinsky
      @galchinsky 3 years ago +1

      Proper positional encodings should somehow work

  • @gz6963
    @gz6963 1 year ago

    4:10 Is this related to the puzzles we have to solve with Google Captcha? Things like "select all the squares containing a boat"

  • @aday7475
    @aday7475 1 year ago +1

    Any chance we can get a compare and contrast between Perceiver, Perceiver IO, and Perceiver AR?

  • @GuillermoValleCosmos
    @GuillermoValleCosmos 3 years ago

    this is clever and cool

  • @NilabhraRoyChowdhury
    @NilabhraRoyChowdhury 3 years ago

    What's interesting is that the model performs better with weight sharing.

  • @axeldroid2453
    @axeldroid2453 3 years ago

    Does it have something to do with sparse sensing? It basically attends to the most relevant data points.

  • @ibrahimaba8966
    @ibrahimaba8966 1 year ago

    17:28 best way to solve the quadratic bottleneck 😄!

  • @xealen2166
    @xealen2166 2 years ago

    i'm curious, how are the queries generated from the latent matrix, how is the latent matrix initially generated?

  • @synthetiksoftware5631
    @synthetiksoftware5631 3 years ago

    Isn't the 'fourier' style positional encoding just a different way to build a scale space representation of the input data? So you are still 'baking' that kind of scale space prior into the system.
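
    For readers wondering what these encodings look like in practice, here is a generic sketch of Fourier-feature position encodings of the kind discussed in the video (the band count, frequency range, and concatenation of the raw coordinates are illustrative choices, not the paper's exact configuration):

```python
# Generic sketch: each coordinate in [-1, 1] is expanded into sin/cos features at
# several frequencies and concatenated with the raw coordinate (sizes illustrative).
import math
import torch

def fourier_features(coords, num_bands=64, max_freq=112.0):
    # coords: (M, d) positions scaled to [-1, 1]
    freqs = torch.linspace(1.0, max_freq, num_bands)              # (num_bands,)
    scaled = coords[..., None] * freqs * math.pi                  # (M, d, num_bands)
    return torch.cat([coords[..., None], scaled.sin(), scaled.cos()], dim=-1).flatten(1)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, 224), torch.linspace(-1, 1, 224), indexing="ij")
pos = torch.stack([ys, xs], dim=-1).reshape(-1, 2)                # (224*224, 2)
feats = fourier_features(pos)                                     # (224*224, 2 * (2*64 + 1))
```
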

  • @Anujkumar-my1wi
    @Anujkumar-my1wi 3 years ago

    Can you tell me why neural nets with many hidden layers require fewer neurons than a neural net with a single hidden layer to approximate a function?

  • @marat61
    @marat61 3 years ago

    Also, you did not talk about the dimension size in the ablation part.

  • @thegistofcalculus
    @thegistofcalculus 3 years ago

    Just a silly question: instead of a big data input vector and a small latent vector, could they have a big latent vector that they use as a summary vector and spoon-feed slices of data in order to achieve some downstream task, such as maybe predicting the next data slice? Would this allow for even bigger input that is summarized (like HD video)?

    • @thegistofcalculus
      @thegistofcalculus 1 year ago

      Looking back it seems that my comment was unclear. It would involve a second cross attention module to determine what gets written into the big vector.

  • @yassineabbahaddou4369
    @yassineabbahaddou4369 2 years ago

    Why did they use a GPT-2 architecture in the latent transformer instead of a BERT architecture?

  • @evilby
    @evilby 9 months ago

    WAHHH... Problem Solved!😆

  • @LaNeona
    @LaNeona 3 years ago

    If I have a gamification model is there anyone you know that does meta analysis on system mechanisms?

  • @cocoarecords
    @cocoarecords 3 years ago +2

    Yannic can you tell us your approach to understand papers quickly?

    • @YannicKilcher
      @YannicKilcher 3 years ago +26

      Look at the pictures

    • @TheZork1995
      @TheZork1995 3 years ago +2

      @@YannicKilcher xD so easy yet so far. Thank you for the good work though. Literally the best youtube channel I ever found!

  • @marat61
    @marat61 3 years ago +3

    I believe there is an error in the paper at 23:07: Q must be MxC, not MxD, otherwise QK.transpose() would be impossible.

  • @azimgivron1823
    @azimgivron1823 3 years ago +1

    Are the query and the latent array in figure 1 of the same dimensions? It is written that Q belongs to the space of real matrices of dimensions MxD, which does not make sense to me. I believe they meant NxD where D=C, since you need to compute a dot product for the cross-attention between the query Q and the keys K ==> Q.Kt, with Kt being the transpose of K, so it implies that the dimensions D and C are equal, isn't that right?
    I am kind of disappointed by the paper because this is the core of what they want to show and they do not make the effort to dive into the math and explain it clearly.
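
    For what it's worth, in a standard cross-attention implementation the apparent mismatch disappears because queries and keys are first projected to a shared dimension, so the latent width D and the input channel count C never have to be equal. A toy sketch with assumed sizes:

```python
# Sketch: Q (from the N x D latent) and K (from the M x C byte array) are projected
# to a common width before the dot product, so D and C need not match.
import torch
import torch.nn as nn

N, D = 256, 512      # latent array
M, C = 50_176, 259   # byte array
qk_dim = 128         # shared query/key width (illustrative)

to_q, to_k, to_v = nn.Linear(D, qk_dim), nn.Linear(C, qk_dim), nn.Linear(C, D)
latents, byte_array = torch.randn(N, D), torch.randn(M, C)

scores = to_q(latents) @ to_k(byte_array).T                              # (N, M)
out = torch.softmax(scores / qk_dim ** 0.5, dim=-1) @ to_v(byte_array)   # (N, D)
```
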

  • @swoletech5958
    @swoletech5958 2 years ago +1

    PointNet++ from 2017 outperformed the Perceiver on point clouds: 91.9 accuracy versus 85.7. See @ 27:19

  • @Kram1032
    @Kram1032 3 years ago +1

    Did the house sit on the mat though

  • @happycookiecamper8101
    @happycookiecamper8101 3 years ago

    nice

  • @TheJohnestOfJohns
    @TheJohnestOfJohns 3 years ago +1

    Isn't this really similar to facebook's DETR with their object queries, but with shared weights?

    • @antoninhejny8156
      @antoninhejny8156 3 years ago

      No, since DETR is just for localising objects from extracted features via some backbone like ResNet, while this is the feature extractor. Furthermore, DETR just puts the features into a transformer, whereas this is like forming an idea about what is in the image while consulting the raw information in the form of RGB. This is however very suspicious, because a linear combination of RGB is just three numbers.

  • @conduit242
    @conduit242 3 years ago +3

    Embeddings are still all you need 🤷

  • @vadimschashecnikovs3082
    @vadimschashecnikovs3082 3 years ago

    Hmm, I think it is possible to add some GLOM-like hierarchy of "words". This could improve the model...

  • @kirtipandya4618
    @kirtipandya4618 3 years ago

    Where can we find source code?

  • @moctardiallo2608
    @moctardiallo2608 3 years ago +1

    Yeah, 30 min is much better!

  • @brll5733
    @brll5733 3 years ago

    Performers already grow entirely linearly, right?

  • @hiramcoriarodriguez1252
    @hiramcoriarodriguez1252 3 years ago +7

    This is huge. I'm not going to be surprised if the "perceiver" becomes the gold standard for CV tasks.

    • @galchinsky
      @galchinsky 3 years ago +1

      The way it is it seems to be classification only

    • @nathanpestes9497
      @nathanpestes9497 3 years ago

      @@galchinsky You should be able to run it backwards for generation. Just say my output (image/point-cloud/text I want to generate) is my latent(as labeled in the diagram), and my input (byte array in the diagram) is some latent representation that feeds into my outputs over several steps. I think this could be super cool for 3D GANs since you don't wind up having to fill 3d grids with a bunch of empty space.

    • @galchinsky
      @galchinsky 3 years ago

      @@nathanpestes9497 won't you get O(huge^2) this way?

    • @nathanpestes9497
      @nathanpestes9497 3 years ago

      ​@@galchinsky I think it would be cross attention o(user defined * huge) same as the paper (different order). Generally we have o(M*N),
      M - the size of input/byte-array,
      N - the size of the latent.
      The paper goes after performance by forcing the latent to be non-huge so M=huge, N=small O(huge * small). Running it backwards you would have small input (which is now actually our latent so a low dimensional random sample if we want to do a gan, perhaps the (actual) latent from another perceiver in a VAE or similar). So backwards you have M=small N=huge so O(small*huge).

    • @galchinsky
      @galchinsky 3 years ago

      @@nathanpestes9497 Thanks for pointing this out. I thought we would get a Huge x Huge attention matrix, but you are right: if we set the Q length to be Huge and K/V to be Small, the resulting complexity will be O(Huge*Small).
      So we want to get a new K/V pair each time, and this approach seems quite natural: (there was an imgur link here, but YouTube seems to hide it).
      So there are 2 parallel stacks of layers. The first stack is like in the article: latent weights, then cross-attention, then a stack of transformers, and so on.
      The second stack consists of your cross-attention layers, so it operates in the byte-array dimension.
      The first Q is the byte-array input, and K,V are taken from the stack of "latent transformers". Then its output is fed as K,V back to the "latent" cross-attention, making a new K,V. So there is an informational ping-pong between the "huge" and "latent" cross-attention layers.

  • @AvastarBin
    @AvastarBin 3 years ago

    +1 For the visual representation of M*N hahah

  • @teatea5528
    @teatea5528 1 year ago

    It's a stupid question, but I want to ask how the authors can claim their method is better than ViT on ImageNet in Appendix A, Table 7, while their accuracy is not higher?

  • @bensums
    @bensums 3 years ago

    So the main point is you can have fewer queries than values? This is obvious even just by looking at the definition of scaled dot-product attention in Attention Is All You Need (Equation 1). From the definition there, the number of outputs equals the number of queries and is independent of the number of keys or values. The only constraints are: 1. the number of keys must match the number of values, 2. the dimension of each query must equal the dimension of the corresponding key. (A toy shape check follows this thread.)

    • @bensums
      @bensums 3 years ago

      (in the paper all queries and keys are the same dimension (d_k), but that's not necessary)
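
    A toy shape check of the point made in this thread: the output of scaled dot-product attention has one row per query, no matter how many key/value pairs there are.

```python
# Toy check: output row count equals query count, independent of key/value count.
import torch

q = torch.randn(3, 16)                                 # 3 queries
k, v = torch.randn(1000, 16), torch.randn(1000, 32)    # 1000 key/value pairs
out = torch.softmax(q @ k.T / 16 ** 0.5, dim=-1) @ v
print(out.shape)                                       # torch.Size([3, 32])
```
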

  • @enriquesolarte1164
    @enriquesolarte1164 3 years ago +1

    haha, I love the scissors...!!!

  • @kenyang687
    @kenyang687 1 year ago

    The "hmm by hmm" is just too confusing lol

  • @martinschulze5399
    @martinschulze5399 3 years ago

    Do you have any PhD positions open? ^^

  • @errrust
    @errrust 2 years ago

    Clearly you are more of a fan of row vectors than column vectors, Yannic (referring to your visual demo :))

  • @timstevens3361
    @timstevens3361 3 years ago +1

    attention looped is consciousness

  • @DistortedV12
    @DistortedV12 2 years ago

    “General architecture”, but can it understand tabular inputs??

  • @TechyBen
    @TechyBen 3 years ago +3

    Oh no, they are making it try to be alive. XD

  • @freemind.d2714
    @freemind.d2714 3 years ago

    Good job, Yannic. But I'm starting to feel like a lot of the papers you cover in videos these days are all about transformers, and frankly they are kind of similar, and most of them are engineering research rather than scientific research. I hope you don't mind talking more about interesting papers on different subjects.

  • @seraphim9723
    @seraphim9723 3 years ago

    The ablation study consists of three points without any error bars and could just be coincidence? One cannot call that "science".

  • @Stefan-bs3gm
    @Stefan-bs3gm 3 years ago +3

    with O(M*M) attention you quickly get to OOM :-P

  • @oreganorx7
    @oreganorx7 1 year ago

    Very similar to MemFormer

  • @NeoShameMan
    @NeoShameMan 3 years ago +3

    So basically it's conceptually close to rapid eye movement, where we refine over time the data we need to resolve recognition...

  • @omegapointil5741
    @omegapointil5741 3 years ago

    I guess curing Cancer is even more complicated than this.

  • @ivangruber7895
    @ivangruber7895 3 years ago

    CAT > HOUSE

  • @insighttoinciteworksllc1005
    @insighttoinciteworksllc1005 2 years ago

    Humans can do the iterative process too. The Inquiry Method is the only thing that requires it. If you add the trial and error element with self-correction, young minds can develop a learning process. Learn How to learn? Once they get in touch with their inner teacher, they connect to the Information Dimension (theory). Humans can go to where the Perceiver can't go. The Inner teacher uses intuition to bring forth unknown knowledge to mankind's consciousness. The system Mr. Tesla used to create original thought. Unless you think he had a computer? The Perceiver will be able to replace all the scientists that helped develop it and the masses hooked on the internet. It will never replace the humans that develop the highest level of consciousness. Thank you, Yeshua for this revelation.

  • @meselfobviouslyme6292
    @meselfobviouslyme6292 3 years ago +1

    second

  • @Vikram-wx4hg
    @Vikram-wx4hg 1 year ago

    17:15

  • @jianjianh_
    @jianjianh_ 3 years ago

    Problem solved! Lmao

  • @pratik245
    @pratik245 2 years ago

    😂😂

  • @guidoansem
    @guidoansem 2 years ago

    algo

  • @allurbase
    @allurbase 2 years ago

    It's kind of dumb to input the same video frame over and over; just go frame by frame. It'll take a bit for it to catch up, but so would you.

  • @mikesl6895
    @mikesl6895 3 years ago +2

    Third