Yannic Kilcher is All You Need
Thats a good one :)
haha, yeah man,
Such information transformer
Yannic is awesome at “bottom-lining” things. Cuts through the abstruse mathematical fog and says “this is all it’s REALLY doing”. This channel is HUGELY valuable to me. There are too many papers that veer off into the implementation maths, IMHO. Yannic helps you filter out all the irrelevancies.
THANK YOU! I actually asked specifically for this one - and man that was fast :)
It's great to learn why transformers work so well (Theorem 4) and how the three vectors (K, Q, and V) can be translated into Hopfield networks. The analysis of layers for patterns reminds me of many BERTology studies in NLP. I remember one paper reporting that most syntactic processing seems to occur by the middle of the 12 layers. It's interesting, and it seems there is still a lot to be learned. Thanks!
I'm still confused by this paper: the original Krotov energy for binary pattern retrieval keeps weights as *sums* over all stored patterns, which means constant storage... this LSE and update rule seem to keep the entire list of stored patterns around. That looks like cheating to me. I am probably missing something.
@@samanthaqiu3416 I kept asking myself this throughout the entire video. Surely I must be missing something? The version of Hopfield "network" described by Yannic in this video just seems like regular CPU storage with a slightly more intelligent retrieval system.
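For what it's worth, the update rule in question really does operate on the raw stored patterns: here's a minimal numpy sketch of the continuous retrieval step (new state = X softmax(beta * X^T xi)), with all names, sizes, and constants my own, not from the paper:

```python
import numpy as np

def retrieve(X, xi, beta=8.0, steps=3):
    """One-or-more continuous Hopfield updates: xi_new = X @ softmax(beta * X.T @ xi).
    X keeps one stored pattern per column, so the raw patterns are retained,
    which is exactly the point raised in this thread."""
    for _ in range(steps):
        a = beta * X.T @ xi
        p = np.exp(a - a.max())   # numerically stable softmax
        p /= p.sum()
        xi = X @ p                # convex combination of stored patterns
    return xi

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))               # 10 stored patterns, d = 64
noisy = X[:, 3] + 0.1 * rng.standard_normal(64)  # corrupted query
out = retrieve(X, noisy)
```

With well-separated random patterns and a large beta, `out` snaps back onto the stored pattern `X[:, 3]` in one or two updates, so it behaves like associative retrieval over an explicitly kept pattern list.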
Regarding Theorem 3:
c has a lower bound that is exponential in d^{-1}, so the guarantee that N grows exponentially seems optimistic. If you include the lower bound on c, it seems the lower bound on N has no exponential dependence on d at all.
If I understand the proof in the appendix correctly, this is just phrased weirdly. Looking at the actual formula for c, its asymptotic behaviour as d goes to infinity is just a constant.
Damn man, the amount of great papers you review is amazing. Great work.
Wow, I found out about your channel a few days ago; today I saw this paper and got interested in it, and now I see you just uploaded! Your channel has been very informative and detailed, quite rare compared to many others that just gloss over details.
You are indeed the most amazing neural network ever :))
A quick Sunday night film :))
Personhood goes well beyond stimulated predictions with evaluatory mechanics.
Great video Yannic!
I loved how he drew a pentagram
Cool work, great to get more insights about Transformer attention!
If there can be so much exponential information embedding within these hopfield networks, does that mean that this is a good architecture type to use in a reinforcement learning task?
possibly yes
@@YannicKilcher how would one transfer the model representation of, e.g., BERT or some other transformer model to an RL framework?
@@sthk1998 You can use Hopfield networks (and transformers) for the episodic memory of the agent. DeepMind has used similar transformer-like attention mechanisms in their latest RL methods, e.g., Agent57.
Also, how resistant would it be to catastrophic forgetting?
@@jaakjpn I wonder if the ontogenic equivalent of the Baldwin effect played a part
Linear algebra is all you need
Can you train Hopfield networks via gradient descent? Can you integrate a Hopfield module inside a typical backprop-trained network?
I guess fast weights can do that
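As far as I can tell the answer to both questions is yes: the retrieval step is built entirely from differentiable ops (matmul, softmax), so it can sit inside a backprop-trained network. A hedged numpy sketch, using a finite difference as a stand-in for what autograd (PyTorch/JAX) would compute; all names and sizes are made up:

```python
import numpy as np

def hopfield_layer(X, xi, beta=1.0):
    # One retrieval step. Every operation here is differentiable, so
    # gradients can flow into both the query xi and the stored patterns X.
    a = beta * X.T @ xi
    p = np.exp(a - a.max())
    p /= p.sum()
    return X @ p

def loss(X, xi, target):
    return 0.5 * np.sum((hopfield_layer(X, xi) - target) ** 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4))       # 4 learnable "stored patterns", d = 8
xi = rng.standard_normal(8)
target = rng.standard_normal(8)

# Central finite difference wrt one entry of X, to show the loss really is
# differentiable in the stored patterns, i.e. they can be trained by SGD.
eps = 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[0, 0] += eps
Xm[0, 0] -= eps
g = (loss(Xp, xi, target) - loss(Xm, xi, target)) / (2 * eps)
```

In an actual network you would make `X` a parameter tensor and let autograd handle this; the sketch only shows that a well-defined gradient exists.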
Very clear! Great job!
Not gonna lie, I have been waiting for this video so I don't have to read it myself :D
These spheres (32:00) are just as in coding theory. Very cool
So, the ALBERT architecture, with its parameter sharing, can be described as a Hopfield network with 12 iterations? ALBERT is a single transformer encoder iterated 12 times.
It's probably more complicated, because transformer layers contain more than just the attention mechanism
"its not also higher, it is also wider" LMAO
It is time to stop giving academia the 'all you need' ultimate.
Modesty is all you need!
Fascinating paper, fantastic video.
You are a genius man!
Love your work! I'm interested in the research journals that you regularly scan. Can you give a list of them? Maybe you can classify them as 1) very often cited 2) less often cited ...
There's not really a system to this
@@YannicKilcher so just the PhD "one paper per day" stuff?
I read about Hopfield nets, thought "why can't they be continuous?", and bang, straight into the cutting edge.
Lol, I was looking for a YouTube video about this paper just 30 min ago and was sad to see that you hadn't uploaded one yet... I was 15 min too early, I guess :D
At this point in the video (ua-cam.com/video/nv6oFDp6rNQ/v-deo.html) you state that if you increase the dimension by 1, the storage capacity increases by 3. However, it increases by c^{1/4}, so by about 1.316 and not 3, correct?
True.
@@YannicKilcher It seems that c is not a constant and depends on d. Given their examples with d=20 and d=75, we get N>7 and N>10 respectively, which looks like quite a slow capacity increase, or did I miss something?
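A quick numeric check of the growth factor being discussed, assuming c = 3 (the value behind the "increases by 3" remark) and the bound scaling like c^{(d-1)/4} as this thread reads it:

```python
# If the capacity bound scales like c^((d-1)/4), adding one dimension
# multiplies the stored-pattern count by c^(1/4), not by c itself.
c = 3  # assumed value from the example in this thread
factor = c ** 0.25
# factor is about 1.316, i.e. roughly a 32% capacity increase per extra dimension
```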
Hopfield Networks is All You Need To Get A Nobel Prize in Physics.
If queries, keys, and values are of the same embedding size, how do you retrieve a pattern of a bigger size, as in your introduction?
good point. you'd need to change the architecture in that case.
Hey man. I literally sit and argue about AI for a job, and I often find myself relying on info or ideas either fully explained or at least lightly touched on by you. This is a great example. It'd be a sin to ever stop. It's obvious to me that training was in no way done, and the constant activity in the middle does not indicate that the same items are going back and forth about the same things.
It looks great, but equivalently expressive networks aren't always equally trainable, are they? Can anyone recommend a paper that tackles measuring the learnability of data or the trainability of networks, maybe linking to P=NP and computational complexity? I understand ill-posed problems, but take cracking encryption, for example: no size of network or quantity of training data will help, because the patterns are too recursive, too deeply buried, and so unlearnable? How is this measured?
Would love some pseudocode!
Both for training and for retrieval
This video is soon gonna boom lol
Concerning these spheres: do they span the whole parameter space, or are there regions not belonging to any particular pattern? There were theorems claiming that the algorithm has to converge; in that case, does getting caught by a particular cluster depend on the initialisation of weights?
Yes, they are only around the patterns. Each pattern has a sphere.
It'd be cool to see the code running on some data set.
Great overview! Definition 1 for stored and retrieved patterns was a little confusing to me. I'm not sure if they meant that the patterns are "on" the surface of the sphere or if they were "inside" the actual sphere. Usually in mathematics, when we say "sphere" we mean just the surface of the sphere and when we say "ball" we mean all points inside the volume that the sphere surrounds. Since they said "sphere" and they used the "element of" symbol, I assume they meant that the patterns should exist on the surface of the sphere itself and not in the volume inside the sphere. They also use the wording "on the sphere" in the text following the definition and in Theorem 3. Assuming that's the intended interpretation, I think the pictures drawn at 33:42 are a bit misleading.
I think I even mention that my pictures are not exactly correct when I draw them :)
I think this paper is pretty solid, just wondering why it was not accepted in any of the major conferences.
Hang on a sec: n nodes, therefore n^2 weights (ish). The weights contain the information for the stored patterns; that's not exponential in size n, more like storage of n patterns of n bits each at best. Continuous is different: each real number can contain infinite information, depending on the accuracy of output required.
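For comparison, the classical binary Hopfield net really is fixed at n^2 weights with only linear capacity. A small numpy sketch of the Hebbian construction (sizes and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_patterns = 100, 5
patterns = rng.choice([-1, 1], size=(num_patterns, n))

# Hebbian storage: every pattern is folded into a single n x n weight matrix,
# so storage stays at n^2 weights no matter how many patterns are added,
# but the classical capacity is only ~0.14 n patterns, nowhere near exponential.
W = sum(np.outer(p, p) for p in patterns) / n
np.fill_diagonal(W, 0)

# Retrieve pattern 0 from a corrupted copy via the sign update rule.
x = patterns[0].copy()
x[:10] *= -1                          # flip 10 of the 100 bits
for _ in range(5):
    x = np.where(W @ x >= 0, 1, -1)   # x should relax back to patterns[0]
```

This is the "constant storage" scheme from the Krotov discussion above, as opposed to the continuous version that keeps the pattern matrix around explicitly.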
Isn't it fair to say that if we have one sentence in the attention mechanism, meaning that each word in the sentence is attending to the words from the same sentence, the strongest signal will always be from any word attending to itself, because in this case the query is identical to the key? Am I missing something here?
Not necessarily, in the case of the Transformer: for example, if the K matrix is -Q matrix, then the attention will be lowest for a position onto itself.
@@charlesfoster6326 True, although based on what I've read on transformers, in the case of a single sentence K == Q. If so, we are multiplying a vector by itself. This is not the case when there are 2 sentences (translation is a good example of that). I haven't seen the case where K == -Q.
@@pastrop2003 I don't know why that would be. To clarify, what I'm calling Q and K are the linear transforms you multiply the token embeddings with prior to performing attention. So q_i = tok_i * Q and k_i = tok_i * K. Then q_i and k_i will only be equal if Q and K are equal. But these are two different matrices, which will get different gradient updates during training.
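A toy numpy illustration of this thread's point (all shapes and matrices are made up): with K == Q each token's query equals its own key and the diagonal of the attention matrix tends to dominate, while with K == -Q the self-attention scores are the most negative and the diagonal is suppressed.

```python
import numpy as np

def attention_weights(tokens, Q, K):
    # Scaled dot-product attention over projected queries and keys.
    q = tokens @ Q
    k = tokens @ K
    scores = q @ k.T / np.sqrt(q.shape[1])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # rows sum to 1

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 16))   # 5 tokens, embedding dim 16
Q = rng.standard_normal((16, 16))

A_same = attention_weights(tokens, Q, Q)    # K == Q: strong self-attention
A_neg = attention_weights(tokens, Q, -Q)    # K == -Q: self-attention suppressed
```

Comparing the diagonals of `A_same` and `A_neg` shows that "a word always attends most to itself" depends entirely on the learned relationship between the Q and K projections.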
It is a breakthrough in understanding immunity and cancer
Doesn't this network take the form of a hub?
Want to see AttentionGAN (or OP-GAN). Does attention work the same way in GANs?
Thanks for sharing. This is very interesting
Hey! Amazing video, love your work.
I'm a beginner in all of this, but I have a question: can bringing up the number of dimensions of the problem lower its "perplexity"?
Higher dimensions mean more information, meaning tighter or more specific "spheres" around a pattern.
My guess is "yes", but sometimes the dimensions of a problem are fixed, so this way of lowering perplexity is impossible.
Does the paper say anything about that, or do you have an educated guess on what the answer could be? :)
If my question is stupid, just say so, I really don't mind!
Thanks for any answer, and thank you for your videos. I'm hoping to make this an activity for high school students to promote science, so thanks a lot!
I am wondering: how many patterns does each transformer head actually store?
good point, it seems that depends on what exactly you mean by pattern and store
Maximum width achieved
Why don't they say "attractor"? Much easier than "circles".
Lots of math in the paper. Got lost in the mathematics portion, but got the gist of it.
Hey,
Can you explain this Research Paper - CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning
(arxiv.org/abs/1903.02351)
It is related to image segmentation, and I am having trouble understanding this paper.
As an ex-PhD student in neuroscience, I am quite interested in this kind of research.
Thx!!
😀👍👊🏻🎉
So kNN is all you need
What really bugs me about all "modern AI" "explanations" is that they do not enable you to actually code it. If you refer to one source, e.g. this paper, you are none the wiser. If you refer to multiple sources, you end up confused because they do not appear to describe the same thing. So it is not rocket science, but people seem to be fond of making it sound like rocket science, maybe to stop people from just implementing it?
Here are a few points that are not clear (at least to me) at all:
1. Can a modern Hopfield network (the one with the exp) be trained step by step, without (externally) retaining the original patterns it learned?
2. Some sources say, there are 2 (or more) layers (feature layer and memory layer). This paper says nothing about that.
3. What are the methods to artificially "enlarge" a network if a problem has more states to store than the natural encoding of a pattern allows (2^(number of nodes) < number of features to store)?
4. What is the actual algorithm to compute the weights if you want to teach a network a new feature vector?
Both the paper and the video seem to fall short in all those points.
Subscribers to the moon!
This is dumb. Floating-point numbers are already represented with 32 bits. THEY ARE BITS! The beauty of Hopfield networks is that I can change every bit independently of the other bits to store a novel representation. If you multiply a floating-point number by 2, the bits all shift left; you have just killed many types of operations/degrees of freedom due to linearity. With 10K bits I can represent many patterns, FAR more than the number of atoms in the universe. I can represent far more with 96 bits than with 3 floats. This paper's network is a very narrow-minded update to the original network.
Too fast, haha.