Concept Learning with Energy-Based Models (Paper Explained)
- Published 10 May 2020
- This is a hard paper! Energy functions are typically a mere afterthought in current machine learning. A core property of the energy - its smoothness - is usually not exploited at inference time. This paper takes a stab at it. Inferring concepts, world states, and attention masks via gradient descent on a learned energy function leads to an interesting framework with many possibilities.
Paper: arxiv.org/abs/1811.02486
Blog: openai.com/blog/learning-conc...
Videos: sites.google.com/site/energyc...
Abstract:
Many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and capacity for language require the ability to consolidate experience into concepts, which act as basic building blocks of understanding and reasoning. We present a framework that defines a concept by an energy function over events in the environment, as well as an attention mask over entities participating in the event. Given few demonstration events, our method uses an inference-time optimization procedure to generate events involving similar concepts or identify entities involved in the concept. We evaluate our framework on learning visual, quantitative, relational, and temporal concepts from demonstration events in an unsupervised manner. Our approach is able to successfully generate and identify concepts in a few-shot setting, and the resulting learned concepts can be reused across environments. Example videos of our results are available at this http URL
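The core mechanism the abstract describes, generating an event by running gradient descent on a learned energy function, can be sketched in a few lines. This is my own toy illustration, not the paper's code: the quadratic energy `energy(x, w)` is a stand-in for the learned network, and `w` stands in for the concept code.

```python
# Toy sketch of inference-time optimization in an energy-based model.
# The energy here is a stand-in: E(x, w) = (x - w)**2 pulls the event x
# toward the concept code w. In the paper, E is a learned neural network
# and x, a, w are an event, an attention mask, and a concept code.

def energy(x, w):
    return (x - w) ** 2

def grad_energy_x(x, w):
    return 2.0 * (x - w)  # analytic dE/dx for the toy energy

def infer_event(w, x0=0.0, lr=0.1, steps=100):
    """Generate an event x by gradient descent on E(., w)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_energy_x(x, w)
    return x

print(round(infer_event(w=3.0), 3))  # converges toward w = 3.0
```

The same loop, run over `w` or over the attention mask instead of `x`, gives the paper's "identification" mode: which variable you hold fixed and which you descend on determines what is being inferred.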
Authors: Igor Mordatch
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher - Science & Technology
1:16 Overview of energy based models
15:20 Start of the paper
30:05 Experiments
Love your style of presenting papers, it's clear and well-structured. Keep up the good work!
I’m 10 minutes in and so far this is a great summary of what energy-based learning is. I’d heard the name but had no idea what it was before now!
Best video yet, the longer intro was totally worth it. Timestamps would be great though.
Great explanation. It's both a new and challenging concept mathematically. Thank you for the clear explanation.
Once we have GPUs large enough, this would be a game changer in solving abstract reasoning problems and procedurally generated matrices.
Great way of presenting link between current knowledge with problem and solution addressed by paper in a simple way 😃
Nice explanation. I was going to say reminds me a lot of neurosymbolic concept learning, however, just found out this work was published before NS-CL.
Super interesting, thanks for breaking it down!
Great video! This reminds me of the differentiable ODEs paper.
Great presentation. I wonder how much time it took you to understand such a paper (not counting planning out this presentation).
Thank you.
A good example for something mind bending: Imagine a differentiable cat.
This is a great paper and great job explaining it. I kept on wondering while watching this if this is the concept behind the attention mechanism?
28:00
Could you explain what backpropagation through an optimization procedure means?
So if I understand correctly, w is rather arbitrary: it depends entirely on the energy function and how it's trained. I guess if they have n concepts to learn, they make w an n-dimensional vector and encode each concept as a one-hot. This paper does not explore out-of-distribution concepts, but I suppose theoretically you could interpolate them.
In all these problems the elements of x are positionally independent. If you swap the first and last element of x, and swap the first and last of the attention vector, you ought to get the same result. Do they test that this is true in practice? Does this technique require positional independence? Could enforcing positional independence more strictly give performance benefits?
If you make a neural net piecewise linear the entire way through, you can calculate the function of the loss (or energy) with respect to a single parameter completely, and find the minima of that function in a computationally efficient manner. This is the key component of my current research. I wonder if this concept learning would benefit from attempting that instead of gradient descent.
Wouldn't this connect somehow with the recent iMAML paper that you reviewed?
Backpropagating through SGD seemed worth trying
Somehow, I start to think that if this model is further developed and then married up with a lot of compute we may get something looking like AGI?????
Yannic, from your point of view, as a highly experienced researcher and a person who dissects papers like this 'for a living', how hard would it be to write the code for this one? I haven't found anything online, and I wonder whether the reason it wasn't shared is that it might be a bit... difficult or hard to organize?
No I think as long as you have a clear picture in your mind of what is the "dataset", what counts as X and what is Y in each sample, you should be fine. The only engineering difficulty here is backpropagating through the inner optimization procedure.
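The "engineering difficulty" mentioned above, backpropagating through the inner optimization, can be made concrete on a toy energy. This is my own illustration with hypothetical names: frameworks do this with autodiff through the unrolled inner loop, but because each inner step here is linear in w, the chain rule can be written by hand and checked against a finite difference.

```python
# Toy illustration of backprop through an inner optimization procedure:
# unroll K gradient steps on an inner energy E(x; w) = (x - w)**2, then
# differentiate an outer loss with respect to w through the whole chain.

LR, K = 0.1, 20

def inner_opt(w, x0=0.0):
    x = x0
    for _ in range(K):
        x -= LR * 2.0 * (x - w)   # inner gradient step on E(x; w)
    return x

def outer_loss(w, target=1.0):
    return (inner_opt(w) - target) ** 2

def outer_grad(w, target=1.0):
    # Each inner step is x <- (1 - 2*LR)*x + 2*LR*w, so after K steps
    # dx_K/dw = 1 - (1 - 2*LR)**K; then chain rule through the outer loss.
    dxdw = 1.0 - (1.0 - 2.0 * LR) ** K
    return 2.0 * (inner_opt(w) - target) * dxdw

# Sanity check against a central finite difference.
w, eps = 0.3, 1e-6
fd = (outer_loss(w + eps) - outer_loss(w - eps)) / (2 * eps)
print(abs(outer_grad(w) - fd) < 1e-4)  # True
```

In practice the inner energy is a neural network, so the hand-derived `dxdw` is replaced by an autodiff framework retaining the graph of all K inner steps, which is exactly why memory and compute blow up with the number of inner iterations.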
I'm not sure I understand. So a deep neural network is an energy-based model because you want to minimize the loss? Then are deep learning models just energy-based models, with no difference?
Wouldn't a structured SVM framework provide a backprop-able loss that avoids having to backprop through SGD? You just need a solid (or principled) learning framework where a max/min/argmax/argmin is part of the definition of the loss function.
Genius design, beautiful presentation! But there is one thing I don't understand: "why only 11.5K subscribers"?
It's a very exclusive club ;)
Almost quadrupled by now
This paper wasn't published anywhere was it? I see an ICLR workshop version, but the full version doesn't seem to have been accepted at any conference.
Welcome to arXiv
So if it can be considered that inferring an x or a or w from the others, using an existing energy function, is “learning”, then maybe learning the energy function parameters is “meta learning” in a way? But maybe not, and I guess it’s just a less important matter of definition.
That's a good observation! It's maybe a bit tricky to call that learning, as we usually understand learning as something that has a lasting impact over time. Here, once you have inferred an x, you go to the next one and start over.
Interesting concept. Do you know why it has the name "Energy" function? Is it like, the more energy the more unstable it is?
I think it comes from physics. Think of the potential energy of a pendulum, for example. It will converge to the place where this energy is the lowest. I might be very wrong, though.
@@YannicKilcher Oh yeah, of course. Like how Snell's law can be thought about as minimizing the energy during travel of the light ray.
I agree with Yannic. The energy function is positive for all values of X, and close to zero at an equilibrium. The name also implies that the unknown energy function E(x) is differentiable, in contrast to generic objective functions in AI. Generally, in physics, they aim to minimize the potential energy function to find the solution to complex nonlinear problems, also through gradient descent methods. The advantage is that the Hessian (i.e., the second derivative of E(x) w.r.t. X) is positive definite when the energy function is increasing in every direction of X, similar to an elastic spring storing more energy the further you stretch it. An energy function, which is just a definition of a thing with characteristics similar to potential energy, therefore offers good numerical stability and convergence!
@@YannicKilcher It does come from physics, but the lineage is through Hopfield nets and the Ising models that inspired them.
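The spring analogy from the thread above can be made concrete in a few lines. This is a generic illustration, assuming a spring constant `K_SPRING` of my own choosing: gradient descent on the potential energy E(x) = 0.5·k·x² rolls the system to its equilibrium, just as inference in an EBM rolls a sample to a low-energy configuration.

```python
# Gradient descent on a spring's potential energy E(x) = 0.5 * k * x**2
# converges to the equilibrium x = 0, the energy minimum, mirroring how
# EBM inference descends a learned energy toward a plausible configuration.

K_SPRING = 2.0  # spring constant (arbitrary choice for this sketch)

def potential(x):
    return 0.5 * K_SPRING * x ** 2

def relax(x0, lr=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * K_SPRING * x  # dE/dx = k * x
    return x

x_eq = relax(x0=5.0)
print(abs(x_eq) < 1e-6)  # True: the system settles at the energy minimum
```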
But you do perform gradient descent when training the generator in the GANs framework, don't you?
Yes, but the gradient descent isn't part of the model itself.
This is not a new idea; using gradient descent at inference is well established. I've definitely seen classic computer vision algorithms that have done this. Is deep learning now considered classic machine learning lol? I think the main contribution of this paper is the formulation of these concepts. That seems promising.
Yeah structured models and EBMs are full of this. The looped inference is a major bottleneck for any computational learning research in this area. It's why the computational community has moved away from PGMs in the first place.
"you can gradient descent on colors" = 🤯
You can even do that on cats!
Nice video. P.S.: "nice demonstration" of the Discriminator of GAN at 05:44 =)))
447 likes and 0 dislikes - truly incredible!
Found better justifications in a slide deck: when Y is high-dimensional (or simply combinatorial), normalizing becomes intractable. See: cs.nyu.edu/~yann/talks/lecun-20050719-ipam-2-ebm.pdf
Bro I'm laughing so hard at 5:23 rn I'm so sorry for being so immature
5:25 slow down Yannic
EBM is coming to fruition considering the recent leak on Q*
The guy on Twitter who leaked that is stupid
this aged well