Concept Learning with Energy-Based Models (Paper Explained)
- Published 10 May 2020
- This is a hard paper! Energy functions are typically a mere afterthought in current machine learning. A core property of the energy - its smoothness - is usually not exploited at inference time. This paper takes a stab at it. Inferring concepts, world states, and attention masks via gradient descent on a learned energy function leads to an interesting framework with many possibilities.
Paper: arxiv.org/abs/1811.02486
Blog: openai.com/blog/learning-conc...
Videos: sites.google.com/site/energyc...
Abstract:
Many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and capacity for language require the ability to consolidate experience into concepts, which act as basic building blocks of understanding and reasoning. We present a framework that defines a concept by an energy function over events in the environment, as well as an attention mask over entities participating in the event. Given few demonstration events, our method uses an inference-time optimization procedure to generate events involving similar concepts or identify entities involved in the concept. We evaluate our framework on learning visual, quantitative, relational, and temporal concepts from demonstration events in an unsupervised manner. Our approach is able to successfully generate and identify concepts in a few-shot setting, and the resulting learned concepts can be reused across environments. Example videos of our results are available at this http URL
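The core mechanism the abstract describes, generating an event by running gradient descent on a learned energy function, can be sketched in a few lines. This is my own toy illustration, not the paper's code: the quadratic energy `energy(x, w)` is a stand-in for the learned network, and `w` stands in for the concept code.

```python
# Toy sketch of inference-time optimization in an energy-based model.
# The energy here is a stand-in: E(x, w) = (x - w)**2 pulls the event x
# toward the concept code w. In the paper, E is a learned neural network
# and x, a, w are an event, an attention mask, and a concept code.

def energy(x, w):
    return (x - w) ** 2

def grad_energy_x(x, w):
    return 2.0 * (x - w)  # analytic dE/dx for the toy energy

def infer_event(w, x0=0.0, lr=0.1, steps=100):
    """Generate an event x by gradient descent on E(., w)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_energy_x(x, w)
    return x

print(round(infer_event(w=3.0), 3))  # converges toward w = 3.0
```

The same loop, run over `w` or over the attention mask instead of `x`, gives the paper's "identification" mode: which variable you hold fixed and which you descend on determines what is being inferred.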
Authors: Igor Mordatch
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher - Science & Technology
1:16 Overview of energy based models
15:20 Start of the paper
30:05 Experiments
Love your style of presenting papers, it's clear and well-structured. Keep up the good work!
I’m 10 minutes in and so far this is a great summary of what energy-based learning is. I’d heard the name but had no idea what it was before now!
Best video yet, the longer intro was totally worth it. Timestamps would be great though.
Great explanation. It's both a new and challenging concept mathematically. Thank you for the clear explanation.
Once we have GPUs large enough, this would be a game changer in solving abstract reasoning problems and procedurally generated matrices.
Great way of presenting link between current knowledge with problem and solution addressed by paper in a simple way 😃
Nice explanation. I was going to say reminds me a lot of neurosymbolic concept learning, however, just found out this work was published before NS-CL.
Super interesting, thanks for breaking it down!
Great video! This reminds me of the differentiable ODEs paper.
Great presentation. I wonder how much time it took you to understand such a paper (not counting planning out this presentation).
Thank you.
A good example for something mind bending: Imagine a differentiable cat.
This is a great paper and great job explaining it. I kept on wondering while watching this if this is the concept behind the attention mechanism?
28:00
Could you explain what backpropagation through an optimization procedure means?
So if I understand correctly, w is rather arbitrary: it depends entirely on the energy function and how it's trained. I guess if they have n concepts to learn, they make w an n-dimensional vector and encode each concept as a one-hot. This paper does not explore out-of-distribution concepts, but I suppose theoretically you could interpolate them.
In all these problems the elements of x are positionally independent. If you swap the first and last element of x, and swap the first and last of the attention vector, you ought to get the same result. Do they test that this is true in practice? Does this technique require positional independence? Could enforcing positional independence more strictly give performance benefits?
If you make a neural net piecewise linear the entire way through, you can calculate the function of the loss (or energy) with respect to a single parameter completely, and find the minima of that function in a computationally efficient manner. This is the key component of my current research. I wonder if this concept learning would benefit from attempting that instead of gradient descent.
Wouldn't this connect somehow with the recent iMAML paper that you reviewed?
Backpropagating through SGD seemed worth trying
Somehow, I start to think that if this model is further developed and then married up with a lot of compute we may get something looking like AGI?????
Yannic, from your point of view, as a highly experienced researcher and a person who dissects papers like this 'for a living', how hard would it be to write the code for this one? I haven't found anything online, and I wonder whether the reason it wasn't shared is that it might be a bit... difficult or hard to organize?
No I think as long as you have a clear picture in your mind of what is the "dataset", what counts as X and what is Y in each sample, you should be fine. The only engineering difficulty here is backpropagating through the inner optimization procedure.
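The "engineering difficulty" mentioned above, backpropagating through the inner optimization, can be made concrete on a toy energy. This is my own illustration with hypothetical names: frameworks do this with autodiff through the unrolled inner loop, but because each inner step here is linear in w, the chain rule can be written by hand and checked against a finite difference.

```python
# Toy illustration of backprop through an inner optimization procedure:
# unroll K gradient steps on an inner energy E(x; w) = (x - w)**2, then
# differentiate an outer loss with respect to w through the whole chain.

LR, K = 0.1, 20

def inner_opt(w, x0=0.0):
    x = x0
    for _ in range(K):
        x -= LR * 2.0 * (x - w)   # inner gradient step on E(x; w)
    return x

def outer_loss(w, target=1.0):
    return (inner_opt(w) - target) ** 2

def outer_grad(w, target=1.0):
    # Each inner step is x <- (1 - 2*LR)*x + 2*LR*w, so after K steps
    # dx_K/dw = 1 - (1 - 2*LR)**K; then chain rule through the outer loss.
    dxdw = 1.0 - (1.0 - 2.0 * LR) ** K
    return 2.0 * (inner_opt(w) - target) * dxdw

# Sanity check against a central finite difference.
w, eps = 0.3, 1e-6
fd = (outer_loss(w + eps) - outer_loss(w - eps)) / (2 * eps)
print(abs(outer_grad(w) - fd) < 1e-4)  # True
```

In practice the inner energy is a neural network, so the hand-derived `dxdw` is replaced by an autodiff framework retaining the graph of all K inner steps, which is exactly why memory and compute blow up with the number of inner iterations.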
I'm not sure I understand. So a deep neural network is an energy-based model because you want to minimize the loss? Then are deep learning models just energy-based models, with no difference?
Wouldn't a structured SVM framework provide a backprop-able loss that avoids having to backprop through SGD? You just need a solid (or principled) learning framework where a max/min/argmax/argmin is part of the definition of the loss function.
Genius design, beautiful presentation! But there is one thing I don't understand: "why only 11.5K subscribers"?
It's a very exclusive club ;)
Almost quadrupled by now
This paper wasn't published anywhere was it? I see an ICLR workshop version, but the full version doesn't seem to have been accepted at any conference.
Welcome to arXiv
So if it can be considered that inferring an x or a or w from the others, using an existing energy function, is “learning”, then maybe learning the energy function parameters is “meta learning” in a way? But maybe not, and I guess it’s just a less important matter of definition.
That's a good observation! It's maybe a bit tricky to call that learning, as we usually understand learning as something that has a lasting impact over time. Here, once you have inferred an x, you go to the next one and start over.
Interesting concept. Do you know why it has the name "Energy" function? Is it like, the more energy the more unstable it is?
I think it comes from physics. Think of the potential energy of a pendulum, for example. It will converge to the place where this energy is the lowest. I might be very wrong, though.
@@YannicKilcher Oh yeah, of course. Like how Snell's law can be thought about as minimizing the energy during travel of the light ray.
I agree with Yannic. The energy function is positive for all values of X, and close to zero at an equilibrium. The name also implies that the unknown energy function E(x) is differentiable, in contrast to generic objective functions in AI. Generally, in physics, they aim to minimize the potential energy function to find the solution to complex nonlinear problems, also through gradient descent methods. The advantage is that the Hessian (i.e., the second derivative of E(x) w.r.t. X) is positive definite when the energy function is increasing in every direction of X, similar to an elastic spring storing more energy the further you stretch it. An energy function, which is just a definition of a thing with characteristics similar to potential energy, therefore offers good numerical stability and convergence!
@@YannicKilcher It does come from physics, but the lineage is through Hopfield nets and the Ising models that inspired them.
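The spring analogy from the thread above can be made concrete in a few lines. This is a generic illustration, assuming a spring constant `K_SPRING` of my own choosing: gradient descent on the potential energy E(x) = 0.5·k·x² rolls the system to its equilibrium, just as inference in an EBM rolls a sample to a low-energy configuration.

```python
# Gradient descent on a spring's potential energy E(x) = 0.5 * k * x**2
# converges to the equilibrium x = 0, the energy minimum, mirroring how
# EBM inference descends a learned energy toward a plausible configuration.

K_SPRING = 2.0  # spring constant (arbitrary choice for this sketch)

def potential(x):
    return 0.5 * K_SPRING * x ** 2

def relax(x0, lr=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * K_SPRING * x  # dE/dx = k * x
    return x

x_eq = relax(x0=5.0)
print(abs(x_eq) < 1e-6)  # True: the system settles at the energy minimum
```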
But you do perform gradient descent when training the generator in the GANs framework, don't you?
Yes, but the gradient descent isn't part of the model itself.
This is not a new idea; using gradient descent at inference is well established. I've definitely seen classic computer vision algorithms that have done this. Is deep learning now considered classic machine learning lol? I think the main contribution of this paper is the formulation of these concepts. That seems promising.
Yeah structured models and EBMs are full of this. The looped inference is a major bottleneck for any computational learning research in this area. It's why the computational community has moved away from PGMs in the first place.
"you can gradient descent on colors" = 🤯
You can even do that on cats!
Nice video. P.S.: "nice demonstration" of the Discriminator of GAN at 05:44 =)))
447 likes and 0 dislikes - truly incredible!
Found better justifications in a slide deck: when Y is high-dimensional (or simply combinatorial), normalizing becomes intractable. See: cs.nyu.edu/~yann/talks/lecun-20050719-ipam-2-ebm.pdf
Bro I'm laughing so hard at 5:23 rn I'm so sorry for being so immature
5:25 slow down Yannic
EBM is coming to fruition considering the recent leak on Q*
The guy on Twitter who leaked that is stupid
this aged well