That's not the main reason. RNNs keep folding new embeddings into a single hidden state and hence overwrite information that came before, whereas in a transformer the embeddings of all tokens are there the whole time and attention can pick out the ones that are important.
Exactly!
Note that the decoder in a Transformer outputs one vector at a time as well
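A minimal numpy sketch of the difference this thread is describing (illustrative shapes and random weights, not anything from the video): the RNN squeezes every token into one hidden state, while attention keeps all token embeddings around and mixes them on demand.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))  # 6 token embeddings, dimension 4

# RNN-style: every token is folded into ONE hidden state, so
# information from early tokens gets progressively overwritten.
W = rng.normal(size=(4, 4)) * 0.5
h = np.zeros(4)
for x in tokens:
    h = np.tanh(W @ h + x)  # old contents of h are partially lost each step

# Transformer-style: all embeddings stay available, and attention
# picks whichever ones matter for the current query.
query = tokens[-1]
scores = tokens @ query / np.sqrt(4)
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over all positions
context = weights @ tokens  # weighted mix over ALL tokens, not a single state
```

The key contrast: `h` is a fixed-size bottleneck regardless of sequence length, while `context` is recomputed from every stored embedding.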
That was a great video! I find learning about such things generally easier and more interesting when they are compared to other models/ideas that are similar but not identical.
Thank you for the kind words. And yep, agreed 👍🏽
@@CodeEmporium i guess just like CLIP our brains perform contrastive learning as well xd
This answered a question I didn't have. Thanks!
Always glad to help when not needed!
YouTube, recommend me more videos like this plz
I think LSTMs are more tuned toward keeping the order, because although transformers can assemble embeddings from various tokens, they don't know what follows what in a sentence.
But perhaps with relative positional encoding they are equipped just about enough to understand the order of sequential input.
Your comment came right before GPT blew up, so maybe you wouldn't say this anymore?
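On the order question above: the standard fix is positional encoding. A toy sketch of the classic sinusoidal scheme from "Attention Is All You Need" (illustrative sizes only), which bakes position information into the embeddings so attention can recover token order:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Each position gets a unique, fixed pattern of sines and cosines
    # at geometrically spaced frequencies.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positions(8, 16)
embeddings = np.random.default_rng(0).normal(size=(8, 16))
ordered_input = embeddings + pe  # order info is now part of each embedding
```

Relative positional encodings (as mentioned in the comment) encode offsets between positions instead of absolute indices, but the goal is the same.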
This is a great video!!
Thanks a lot Greg. I try :)
This video is 90% wrong…
But presented confidently and getting praise. Reminds me of ChatGPT. 😂
An important caveat is that transformers like the decoder and GPT models are trained autoregressively, with no context of the words coming after.
Yeah, its masked multi-head attention only attends left-to-right, right?
@@sreedharsn-xw9yi Yes, that's decoder-only transformers, such as GPT-3.5 for example, and any text generation model.
I would like to have a more skeleton-up or foundation-up understanding (to better understand the top down representation of the transformer). Where should I start, linear algebra?
This was cool, but I'm not sure whether it was explained correctly or I just didn't understand it fully. I study transformers, and the global attention mechanism does word prediction by comparing a word to every other past word in the input. How does that predict future words?
You should have put LSTMs as a middle step
Good call. I just bundled them with Recurrent Neural Networks here
This is the best explanation of RNN vs Transformer I've ever seen. Is there a similar video like this for self-attention, by any chance? Thank you
Thanks so much for the kind words. There is a full video on self-attention on the channel. Check out the first video in the playlist “Transformers from scratch”.
The main reason is that RNNs have what we call the exploding and vanishing gradient problem.
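A tiny numerical illustration of the vanishing-gradient point (hypothetical weights, not from the video): backpropagation through T recurrent steps multiplies by roughly the same Jacobian T times, so the gradient signal shrinks (or explodes) geometrically with sequence length.

```python
import numpy as np

# Recurrent weight matrix with spectral radius below 1,
# so repeated multiplication shrinks any vector.
W = np.array([[0.5, 0.1],
              [0.0, 0.4]])

grad = np.ones(2)
norms = []
for t in range(50):
    grad = W.T @ grad  # one step of backprop through time
    norms.append(np.linalg.norm(grad))

# After 50 steps, the gradient contribution from the earliest
# tokens has shrunk to essentially nothing.
print(norms[0], norms[-1])
```

With eigenvalues above 1 the same loop explodes instead, which is the other half of the problem LSTMs were designed to mitigate.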
Don’t transformer models generate one token at a time? It’s just that they’re faster because the calculations can be done in parallel.
Transformers aren't only used for text generation.
But in the case of text generation, the model internally predicts the next token for every token in the sentence.
E.g. the model is trained to map the input to its one-token-shifted target:
This is an example phrase
is an example phrase
So the training requires only a single step.
Text generation models also have a causal mask: tokens can only attend to the tokens that come before them, so the network doesn't cheat during training.
During inference, only one token is generated at a time, indeed.
If I'm not mistaken, there's an optimization (KV caching) to avoid recalculating the previously computed tokens.
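A minimal numpy sketch of the causal mask described above (random embeddings, single head, illustrative only): every position gets its attention restricted to earlier positions, which is what lets one forward pass produce a next-token prediction at every position simultaneously.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))  # e.g. embeddings for "This is an example phrase"

# Raw attention scores between every pair of positions.
scores = x @ x.T / np.sqrt(d)

# Causal mask: block attention to strictly-future positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax; masked entries become exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# One parallel pass yields a context vector (and hence a next-token
# prediction) at EVERY position, so training needs a single step.
context = weights @ x
```

Position 0 can only attend to itself, position 1 to positions 0–1, and so on, which is exactly the "no cheating" property the comment describes.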
Not all transformers use a causal mask. Encoder models like BERT usually don't - it would break the usefulness of the [CLS] token, for starters.
Aren’t most of the transformers in use based on causal self-attention? That doesn’t seem to have the bidirectional aspect to it.
Does a decoder model share these same advantages? Without the attention mapping, wouldn’t it be operating with the same context as an RNN?
Can you do a Fourier Transform replacing the attention head?
Fourier Transform?
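There is in fact published work in this direction: FNet replaces the self-attention sublayer with a parameter-free Fourier transform for token mixing. A toy sketch of the core idea (illustrative only, not a faithful reimplementation of the paper):

```python
import numpy as np

def fourier_mixing(x):
    # FNet-style mixing: a 2D FFT over the sequence and hidden axes,
    # keeping only the real part. This spreads information across all
    # token positions in O(n log n), with no learned attention weights.
    return np.fft.fft2(x).real

tokens = np.random.default_rng(0).normal(size=(6, 8))  # 6 tokens, dim 8
mixed = fourier_mixing(tokens)  # same shape, globally mixed
```

In the full architecture this mixing layer alternates with ordinary feed-forward sublayers, trading some accuracy for speed relative to attention.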
What if you wanted to train a network to take a sequence of images (like in a video) and generate what comes next? Wouldn't that be a case where RNNs and its variations like LSTM and GRUs are better since each image is most closely related to the images coming directly before and after it?
This is done by “GAN” networks, or generative adversarial networks. This would have two CNNs: one is a “discriminator” network and the other a “generator” network.
@@-p2349 I thought that GANs could only generate an image that was similar to those in the dataset (such as a dataset containing faces). Also, how would a GAN deal with the sequential nature of videos?
There is ViT (Vision Transformer), although that predicts parts of an image, and I've seen at least one example of ViT feeding into a Longformer network for video input. But I have no experience using it.
GANs are not the answer to what I read in your question.
What I'm wondering is: why do all APIs charge you credits for input tokens for transformers? For me, it shouldn't make a difference whether a transformer takes 20 tokens as input or 1000 (as long as it's within its maximum context length). Isn't it the case that transformers always pad the input to their maximum context length anyway?
No, the attention layers usually take a padding mask into account and can use smaller matrices. It just makes the implementation a bit more involved.
The actual cost should be roughly quadratic in your input size, but that's probably not something the marketing department would accept.
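A back-of-the-envelope illustration of the quadratic-cost point above (the FLOP formula is a rough sketch, not a real provider's pricing model): self-attention builds an n-by-n score matrix over the actual tokens, so a 50x longer input costs far more than 50x the compute.

```python
import numpy as np

def attention_flops(n_tokens, d_model):
    # Rough cost of Q @ K^T plus weights @ V for one attention layer:
    # two (n x d) @ (d x n)-shaped products, ~2*n*n*d FLOPs each.
    return 2 * (2 * n_tokens * n_tokens * d_model)

short = attention_flops(20, 512)
long_ = attention_flops(1000, 512)
print(long_ / short)  # (1000/20)^2 = 2500x more work, not 50x
```

This is also why implementations use a padding mask and smaller matrices rather than always padding to the maximum context length: padding everything to the max would make every request cost the worst case.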
Is this before or after Mamba?
How can we relate this to the masked multi-head attention concept of transformers? This video kind of conflicts with that. Any expert ideas here, please?
Ty
But there is also a version of RNN with attention.
These RNNs are still worse than Transformers. However, there have been Transformer + LSTM combinations. Such neural networks have the theoretical potential to create extremely long-term chatbots, far beyond 4000 tokens, due to their recurrent nature.
Fantastic!
Thanks so much again :)
I respect the craft! Also, pick up a pop filter
I have p-p-p-predilection for p-p-plosives
I need your help with my NARX neural network, please.
Do LSTMs have any advantage over transformers ?
They work better with less text data, and they also work better as decoders. While LSTMs don't have many advantages, future iterations of RNNs could learn far longer-term dependencies than Transformers. I think LSTMs are more biologically accurate than Transformers, since they incorporate time and are not layered like conventional networks but are instead theoretically capable of simple topological structures.
However, there have been "recurrent Transformers", which are basically Long Short-Term Memory + Transformers. The architecture is literally a transformer layer turned into a recurrent cell, along with gates inspired by the LSTM.
How about the new SSM in Mamba? Mamba is said to be better than the transformer.
For the algo
cool
Many thanks :)
No model understands
Great video! In addition to this, RNNs, due to their sequential nature, are unable to take advantage of transfer learning. Transformers do not have this limitation.