Yannic Kilcher is All You Need
Thats a good one :)
haha, yeah man,
Such information transformer
Yannic is awesome at “bottom-lining” things. Cuts through the abstruse mathematical fog and says “this is all it’s REALLY doing”. This channel is HUGELY valuable to me. There are too many papers that veer off into the implementation maths, IMHO. Yannic helps you filter out all the irrelevancies.
THANK YOU! I actually asked specifically for this one - and man that was fast :)
It's great to learn why transformers work so well (Theorem 4) and how the three vectors (K, Q, and V) can be translated into Hopfield networks. The analysis of layers for patterns reminds me of many BERTology studies in NLP. I remember one paper reporting that most syntactic processing seems to occur by the middle of the 12 layers. It's interesting, and it seems there is still a lot to be learned. Thanks!
I'm still confused by this paper: the original Krotov energy for binary pattern retrieval keeps weights as *sums* over all stored patterns, which means constant storage... this LSE and update rule seem to keep the entire list of stored patterns around. That looks like cheating to me. I am probably missing something.
@@samanthaqiu3416 I kept asking myself this throughout the entire video. Surely I must be missing something? The version of Hopfield "network" described by Yannic in this video just seems like regular CPU storage with a slightly more intelligent retrieval system.
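For what it's worth, the update rule in question really does operate on the raw stored patterns: here's a minimal numpy sketch of the continuous retrieval step (new state = X softmax(beta * X^T xi)), with all names, sizes, and constants my own, not from the paper:

```python
import numpy as np

def retrieve(X, xi, beta=8.0, steps=3):
    """One-or-more continuous Hopfield updates: xi_new = X @ softmax(beta * X.T @ xi).
    X keeps one stored pattern per column, so the raw patterns are retained,
    which is exactly the point raised in this thread."""
    for _ in range(steps):
        a = beta * X.T @ xi
        p = np.exp(a - a.max())   # numerically stable softmax
        p /= p.sum()
        xi = X @ p                # convex combination of stored patterns
    return xi

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))               # 10 stored patterns, d = 64
noisy = X[:, 3] + 0.1 * rng.standard_normal(64)  # corrupted query
out = retrieve(X, noisy)
```

With well-separated random patterns and a large beta, `out` snaps back onto the stored pattern `X[:, 3]` in one or two updates, so it behaves like associative retrieval over an explicitly kept pattern list.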
Regarding Theorem 3:
c has a lower bound that is exponential in d^{-1}, so the guarantee that N grows exponentially seems optimistic. If you include the lower bound on c, it seems the lower bound on N has no exponential dependence on d at all.
If I understand the proof in the appendix correctly, this is just phrased weirdly. Looking at the actual formula for c, its asymptotic behaviour as d goes to infinity is just a constant.
Damn man, the amount of great papers you review is amazing. Great work.
Wow, I found out about your channel a few days ago; today I saw this paper and got interested in it, and now I see you just uploaded! Your channel has been very informative and detailed, quite rare compared to many others that just gloss over details.
You are indeed the most amazing neural network ever :))
A quick Sunday night film :))
Personhood goes well beyond stimulated predictions with evaluatory mechanics.
Great video Yannic!
I loved how he drew a pentagram
Cool work, great to get more insights about Transformer attention!
If there can be so much exponential information embedding within these hopfield networks, does that mean that this is a good architecture type to use in a reinforcement learning task?
possibly yes
@@YannicKilcher how would one transfer the model representation of, e.g., BERT or some other transformer model to an RL framework?
@@sthk1998 You can use Hopfield networks (and transformers) for the episodic memory of the agent. DeepMind has used similar transformer-like attention mechanisms in their latest RL methods, e.g., Agent57.
Also, how resistant would it be to catastrophic forgetting?
@@jaakjpn I wonder if the ontogenic equivalent of the Baldwin effect played a part
Linear algebra is all you need
Can you train Hopfield networks via gradient descent? Can you integrate a Hopfield module inside a typical backprop-trained network?
I guess fast weights can do that
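As far as I can tell the answer to both questions is yes: the retrieval step is built entirely from differentiable ops (matmul, softmax), so it can sit inside a backprop-trained network. A hedged numpy sketch, using a finite difference as a stand-in for what autograd (PyTorch/JAX) would compute; all names and sizes are made up:

```python
import numpy as np

def hopfield_layer(X, xi, beta=1.0):
    # One retrieval step. Every operation here is differentiable, so
    # gradients can flow into both the query xi and the stored patterns X.
    a = beta * X.T @ xi
    p = np.exp(a - a.max())
    p /= p.sum()
    return X @ p

def loss(X, xi, target):
    return 0.5 * np.sum((hopfield_layer(X, xi) - target) ** 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4))       # 4 learnable "stored patterns", d = 8
xi = rng.standard_normal(8)
target = rng.standard_normal(8)

# Central finite difference wrt one entry of X, to show the loss really is
# differentiable in the stored patterns, i.e. they can be trained by SGD.
eps = 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[0, 0] += eps
Xm[0, 0] -= eps
g = (loss(Xp, xi, target) - loss(Xm, xi, target)) / (2 * eps)
```

In an actual network you would make `X` a parameter tensor and let autograd handle this; the sketch only shows that a well-defined gradient exists.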
Very clear! Great job!
Not gonna lie, I have been waiting for this video so I don't have to read it myself :D
These spheres (32:00) are just as in coding theory. Very cool
So, the ALBERT architecture, with its parameter sharing, can be described as a Hopfield network with 12 iterations? ALBERT is a single transformer encoder iterated 12 times.
It's probably more complicated, because transformer layers contain more than just the attention mechanism
"its not also higher, it is also wider" LMAO
It is time to stop giving academia the 'all you need' ultimate.
Modesty is all you need!
Fascinating paper, fantastic video.
You are a genius man!
Love your work! I'm interested in the research journals that you regularly scan. Can you give a list of them? Maybe you can classify them as 1) very often cited 2) less often cited ...
There's not really a system to this
@@YannicKilcher so just the PhD "one paper per day" stuff?
I read about Hopfield nets, thought "why can't they be continuous?", and bang, straight into the cutting edge.
Lol, I was looking for a YouTube video about this paper just 30 min ago and was sad to see that you hadn't uploaded one yet... I was 15 min too early, I guess :D
At this point in the video (ua-cam.com/video/nv6oFDp6rNQ/v-deo.html) you state that if you increase the dimension by 1, the storage capacity increases by 3. However, it increases by c^{1/4}, so by about 1.316 and not 3, correct?
True.
@@YannicKilcher It seems that c is not a constant and depends on d. Given their examples with d=20 and d=75, we get N>7 and N>10 respectively, which looks like quite a slow capacity increase, or did I miss something?
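A quick numeric check of the growth factor being discussed, assuming c = 3 (the value behind the "increases by 3" remark) and the bound scaling like c^{(d-1)/4} as this thread reads it:

```python
# If the capacity bound scales like c^((d-1)/4), adding one dimension
# multiplies the stored-pattern count by c^(1/4), not by c itself.
c = 3  # assumed value from the example in this thread
factor = c ** 0.25
# factor is about 1.316, i.e. roughly a 32% capacity increase per extra dimension
```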
Hopfield Networks is All You Need To Get A Nobel Prize in Physics.
If queries, keys, and values are of the same embedding size, how do you retrieve a pattern of a bigger size, as in your introduction?
good point. you'd need to change the architecture in that case.
Hey man. I literally sit and argue about AI for a job, and I often find myself relying on info or ideas either fully explained or at least lightly touched on by you. This is a great example. It'd be a sin to ever stop. It's obvious to me that training was in no way done, and the constant activity in the middle does not indicate that the same items are going back and forth about the same things.
It looks great, but equivalently expressive networks aren't always equally trainable, are they? Can anyone recommend a paper that tackles measuring the learnability of data or the trainability of networks, maybe linking to P=NP and computational complexity? I understand ill-posed problems, but take cracking encryption, for example: no size of network or quantity of training data will help, because the patterns are too recursive, too deeply buried, and so unlearnable? How is this measured?
Would love some pseudocode!
Both for training and for retrieval
This video is soon gonna boom lol
Concerning these spheres: do they span the whole parameter space, or are there regions not belonging to any particular pattern? There were theorems claiming that the algorithm has to converge; in that case, does getting caught by a particular cluster depend on the initialisation of weights?
Yes, they are only around the patterns. Each pattern has a sphere.
It'd be cool to see the code running on some data set.
Great overview! Definition 1 for stored and retrieved patterns was a little confusing to me. I'm not sure if they meant that the patterns are "on" the surface of the sphere or if they were "inside" the actual sphere. Usually in mathematics, when we say "sphere" we mean just the surface of the sphere and when we say "ball" we mean all points inside the volume that the sphere surrounds. Since they said "sphere" and they used the "element of" symbol, I assume they meant that the patterns should exist on the surface of the sphere itself and not in the volume inside the sphere. They also use the wording "on the sphere" in the text following the definition and in Theorem 3. Assuming that's the intended interpretation, I think the pictures drawn at 33:42 are a bit misleading.
I think I even mention that my pictures are not exactly correct when I draw them :)
I think this paper is pretty solid, just wondering why it was not accepted in any of the major conferences.
Hang on a sec: n nodes, therefore n^2 weights (ish). The weights contain the information for the stored patterns; that's not exponential in size n, more like storage of n patterns of n bits each at best. Continuous is different: each real number can contain infinite information, depending on the accuracy of output required.
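For comparison, the classical binary Hopfield net really is fixed at n^2 weights with only linear capacity. A small numpy sketch of the Hebbian construction (sizes and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_patterns = 100, 5
patterns = rng.choice([-1, 1], size=(num_patterns, n))

# Hebbian storage: every pattern is folded into a single n x n weight matrix,
# so storage stays at n^2 weights no matter how many patterns are added,
# but the classical capacity is only ~0.14 n patterns, nowhere near exponential.
W = sum(np.outer(p, p) for p in patterns) / n
np.fill_diagonal(W, 0)

# Retrieve pattern 0 from a corrupted copy via the sign update rule.
x = patterns[0].copy()
x[:10] *= -1                          # flip 10 of the 100 bits
for _ in range(5):
    x = np.where(W @ x >= 0, 1, -1)   # x should relax back to patterns[0]
```

This is the "constant storage" scheme from the Krotov discussion above, as opposed to the continuous version that keeps the pattern matrix around explicitly.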
Isn't it fair to say that if we have one sentence in the attention mechanism, meaning that each word in the sentence is attending to the words from the same sentence, the strongest signal will always be from any word attending to itself, because in this case the query is identical to the key? Am I missing something here?
Not necessarily, in the case of the Transformer: for example, if the K matrix is -Q matrix, then the attention will be lowest for a position onto itself.
@@charlesfoster6326 True, although based on what I've read on transformers, in the case of a single sentence K == Q. If so, we are multiplying a vector by itself. This is not the case when there are 2 sentences (translation is a good example of that). I haven't seen the case where K == -Q.
@@pastrop2003 I don't know why that would be. To clarify, what I'm calling Q and K are the linear transforms you multiply the token embeddings with prior to performing attention. So q_i = tok_i * Q and k_i = tok_i * K. Then q_i and k_i will only be equal if Q and K are equal. But these are two different matrices, which will get different gradient updates during training.
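A toy numpy illustration of this thread's point (all shapes and matrices are made up): with K == Q each token's query equals its own key and the diagonal of the attention matrix tends to dominate, while with K == -Q the self-attention scores are the most negative and the diagonal is suppressed.

```python
import numpy as np

def attention_weights(tokens, Q, K):
    # Scaled dot-product attention over projected queries and keys.
    q = tokens @ Q
    k = tokens @ K
    scores = q @ k.T / np.sqrt(q.shape[1])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # rows sum to 1

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 16))   # 5 tokens, embedding dim 16
Q = rng.standard_normal((16, 16))

A_same = attention_weights(tokens, Q, Q)    # K == Q: strong self-attention
A_neg = attention_weights(tokens, Q, -Q)    # K == -Q: self-attention suppressed
```

Comparing the diagonals of `A_same` and `A_neg` shows that "a word always attends most to itself" depends entirely on the learned relationship between the Q and K projections.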
It is a breakthrough in understanding immunity and cancer
Doesn't this network take the form of a hub?
Want to see AttentionGAN (or OP-GAN). Does attention work the same way in GANs?
Thanks for sharing. This is very interesting
Hey! Amazing video, love your work.
I'm a beginner in all of this, but I have a question: can bringing up the number of dimensions of the problem lower its "perplexity"?
Higher dimensions mean more information, meaning tighter or more specific "spheres" around a pattern.
My guess is "yes", but sometimes the dimensions of a problem are fixed, so this way of lowering perplexity is impossible.
Does the paper say anything about that, or do you have an educated guess on what the answer could be? :)
If my question is stupid, just say so, I really don't mind!
Thanks for any answer, and thank you for your videos. I'm hoping to make this an activity for high school students to promote science, so thanks a lot!
I am wondering: how many patterns does each transformer head actually store?
good point, it seems that depends on what exactly you mean by pattern and store
Maximum width achieved
Why don't they say "attractor"? Much easier than "circles".
Lots of math in the paper. Got lost in the mathematics portion, but got the gist of it.
Hey,
Can you explain this Research Paper - CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning
(arxiv.org/abs/1903.02351)
It is related to image segmentation, and I am having trouble understanding this paper.
As an ex-PhD student in neuroscience, I am quite interested in this kind of research.
Thx!!
😀👍👊🏻🎉
So kNN is all you need
What really bugs me about all "modern AI" "explanations" is that they do not enable you to actually code it. If you refer to one source, e.g. this paper, you are none the wiser. If you refer to multiple sources, you end up confused because they do not appear to describe the same thing. So it is not rocket science, but people seem to be fond of making it sound like rocket science, maybe to stop people from just implementing it?
Here are a few points that are not clear (at least to me) at all:
1. Can a modern Hopfield network (the one with the exp) be trained step by step, without (externally) retaining the original patterns it learned?
2. Some sources say, there are 2 (or more) layers (feature layer and memory layer). This paper says nothing about that.
3. What are the methods to artificially "enlarge" a network if a problem has more states to store than the natural encoding of a pattern allows (2^(number of nodes) < number of features to store)?
4. What is the actual algorithm to compute the weights if you want to teach a network a new feature vector?
Both the paper and the video seem to fall short in all those points.
Subscribers to the moon!
This is dumb. Floating-point numbers are already represented with 32 bits. THEY ARE BITS! The beauty of Hopfield networks is that I can change every bit independently of the other bits to store a novel representation. If you multiply a floating-point number by 2, the bits all shift left; you have just killed many types of operations/degrees of freedom due to linearity. With 10K bits I can represent many patterns, FAR more than the number of atoms in the universe. I can represent far more with 96 bits than with 3 floats. This paper's network is a very narrow-minded update to the original network.
Too fast, haha.