OUTLINE:
0:00 - Intro & Overview
18:10 - Interview Start
21:20 - Symmetry priors vs conserved quantities
23:25 - Example: Pendulum
27:45 - Noether Network Model Overview
35:35 - Optimizing the Noether Loss
41:00 - Is the computation graph stable?
46:30 - Increasing the inference time computation
48:45 - Why dynamically modify the model?
55:30 - Experimental Results & Discussion
Paper: arxiv.org/abs/2112.03321
Website: dylandoblar.github.io/noether-networks/
Code: github.com/dylandoblar/noether-networks
Thanks for the great questions Yannic! A pleasure to be on the show :)
A lot of this work on equivariance and symmetry preservation has equations very similar to slow feature analysis (SFA). If we treat x_{i+1} - x_i as a discrete gradient, minimising its square is similar to minimising the Dirichlet energy, and minimising the Dirichlet energy is the variational formulation of a Laplace equation. SFA also has connections to solutions of the generalised eigenvector problem.
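To make the Dirichlet-energy connection concrete, here is a minimal sketch (not from the paper or the video). For a 1D sequence, the sum of squared consecutive differences equals the quadratic form x^T L x, where L is the Laplacian of a path graph. The names `dirichlet_energy` and `path_laplacian` are made up for illustration.

```python
import numpy as np

def dirichlet_energy(x):
    """Sum of squared differences between consecutive entries of x."""
    d = np.diff(x)
    return float(np.sum(d ** 2))

def path_laplacian(n):
    """Graph Laplacian (degree minus adjacency) of a path with n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

# Toy sequence, chosen only for illustration.
x = np.array([0.0, 1.0, 3.0, 2.0])
L = path_laplacian(len(x))
energy = dirichlet_energy(x)   # 1^2 + 2^2 + (-1)^2
quad_form = float(x @ L @ x)   # same quantity written with the Laplacian
```

Both numbers agree, which is why minimising the squared step-to-step change under a normalisation constraint turns into an eigenvector problem for L.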
Super excited to watch this later, awesome paper selection Yannic!
I can only agree with that!
The paper explained format together with the authors is simply excellent. Hope you'll keep 'em coming! I do have a minor comment though, for your consideration: The small red line at the bottom of the thumbnail (glad you aren't doing dumbnails for these) makes it seem like I had already watched it (or 75% of it).
Nicely done. I really like this format where you have the author(s) explain the information that examples and parts of the paper are trying to convey.
Thumbnail suggestion:
I thought that I had already watched the video, since it has a red bar at the bottom.
This format of questioning a paper creator about his paper is the best quality I can imagine.
Love the interview style, it's very helpful in clearing up any misconceptions. Great stuff!
So interesting to see research like this: making a prediction about the object of study (one single sequence) using only the data from that same sequence. I’m sure this type of algorithm will be very useful in the future for filtering out useless sequences from a large dataset of sequences used to make predictions in real life. I wouldn’t be surprised to see such an algorithm used in robots someday. Thank you.
Thank you a lot for the clear explanations!
Feature request: How about making the background static to make watching a bit easier?
Great job Ferran, and your talk at DL BCN was super interesting!
Thanks! :)
It is not clear what the use of this is... Does it make real predictions, or is it like training and testing on the same data? Also, maybe the only thing that is predicted as conserved is the background (r2 mode, sorry).
Great breakdown of the paper! It's interesting to find some similarity between this quantity conservation approach and "no negative sampling" self-supervised learning like BYOL or SwAV. The staggered updates of the inner and outer loops are like the two steps in expectation-maximization algorithms. I guess this kind of restriction on what one update step can do, based on some "prior" on the parameters from the previous training step, is crucial for preventing collapse.
I'm still wondering how this algorithm actually learns informative conserved quantities, though. It seems you could end up with some less useful conserved quantities like frame colour spectrum, yet still plausibly get the same attention map. How do you constrain what kind of symmetries corresponding to the conserved value to learn, when you do not specify which symmetries you want to enforce?
On the face of it, seems like data augmentation would help chip away at the “adversarial* symmetries” you mention, to coin a term.
*er I mean “non robust” or whatever
Not a good full solution but a stepping stone to study the issue.
Great and very interesting presentation.
I have 3 questions:
1) I'm wondering how to make sure that the g network doesn't learn a trivial conserved quantity?
2) Also, if more than one quantity is conserved (like a dynamical system that conserves mass and energy), is there a way to handle it?
3) How would this method compare to PINNs (physics-informed neural networks)?
Injecting the residual of the dynamical equations into the loss function is a way to guide the network to learn conserved quantities.
Does either method have an advantage over the other?
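Question 1 can be illustrated with a toy sketch. It assumes the conservation loss has the form sum_t |g(x_0) - g(x_t)|^2, which is my reading of the paper rather than the authors' code. The names `noether_loss`, `g_constant`, and `g_energy` are hypothetical. A constant g scores zero on any trajectory, so the conservation loss alone cannot rule it out.

```python
import numpy as np

def noether_loss(g, x0, preds):
    """Sum over predicted states of the squared change in g relative to x0."""
    g0 = g(x0)
    return float(sum(np.sum((g(x) - g0) ** 2) for x in preds))

def g_constant(x):
    # Trivial "conserved quantity": ignores the state entirely.
    return np.array([1.0])

def g_energy(x):
    # Energy of a unit-mass, unit-frequency harmonic oscillator, state = (q, p).
    q, p = x
    return np.array([0.5 * (q ** 2 + p ** 2)])

# An exact oscillator rollout (a rotation in phase space) and one whose
# amplitude drifts upward, standing in for an imperfect predictor.
ts = np.linspace(0.1, 1.0, 10)
x0 = np.array([1.0, 0.0])
exact = [np.array([np.cos(t), -np.sin(t)]) for t in ts]
drifting = [(1.0 + 0.1 * t) * np.array([np.cos(t), -np.sin(t)]) for t in ts]

loss_const_exact = noether_loss(g_constant, x0, exact)
loss_const_drift = noether_loss(g_constant, x0, drifting)
loss_energy_exact = noether_loss(g_energy, x0, exact)
loss_energy_drift = noether_loss(g_energy, x0, drifting)
```

The constant g gets zero loss on both rollouts, while the energy g only penalises the drifting one. As I understand it, g is trained through the outer prediction loss, so a g that cannot improve predictions gets no reward there, but that is a reading of the method, not something this sketch demonstrates.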
I just fought for this energy pendulum theorem, it may gains for a unentropic arguments of algorithm, or even this functions of theorem got only differences in the iteration level?
For the pendulum experiments with results shown at 1:06:00, isn't it weird to compare noether+symbolic-regression to no-noether+MLP? Wouldn't you want to compare to no-noether+symbolic-regression, or even all four combinations? I don't see an intrinsic reason to couple the Noether procedure with symbolic regression.
Might be interesting to throw this method at random contexts and see what happens. Perhaps it could find useful conserved quantities in unexpected places!
Scene changes would actually be trivial. Instead of inter-frame L2, do a triple frame median of the square error (or something similar). That would allow for occasional sudden changes while still enforcing sameness most of the time.
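One possible reading of the "triple frame median" idea, as a rough sketch: compute the squared error between each pair of neighbouring frames, then take the median over every sliding window of three such errors. The function names and the toy frames are assumptions for illustration, not anything from the paper.

```python
import numpy as np

def frame_errors(frames):
    """Squared difference between each pair of neighbouring frames."""
    diffs = np.diff(frames, axis=0)
    return np.sum(diffs ** 2, axis=tuple(range(1, frames.ndim)))

def l2_penalty(frames):
    """Plain inter-frame penalty: sum of all squared neighbour differences."""
    return float(frame_errors(frames).sum())

def median3_penalty(frames):
    """Sum of the median squared error over sliding windows of three errors."""
    errs = frame_errors(frames)
    windows = np.stack([errs[:-2], errs[1:-1], errs[2:]], axis=0)
    return float(np.median(windows, axis=0).sum())

# Eight 4-pixel frames with a single hard cut in the middle.
cut = np.concatenate([np.zeros((4, 4)), np.ones((4, 4))], axis=0)
# Eight 4-pixel frames that change by the same amount at every step.
drift = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

cut_l2, cut_med = l2_penalty(cut), median3_penalty(cut)
drift_l2, drift_med = l2_penalty(drift), median3_penalty(drift)
```

With a single cut, the plain L2 penalty is nonzero but the median version is zero, because one outlier never wins a median of three. Steady drift is still penalised by both.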
Nitpicking on how a word was used: I think the way you used the word "behold" in "behold actually writing this down and implementing the [...]" was a nonstandard use of the term? In my experience, "behold" is generally used when the thing can be observed/seen, not just imagined as what it would be. Like, I think it generally implies that the thing is there, not just a hypothetical.
A class is a conserved quantity - dogs are symmetric - DOG IS ENERGY ! 🔥🐶💣
Also... they don't do one gradient step.. there's a loop in the train function. What's going on here? Why the big lie? Conspiracy..
ALSO how do you train G(x) !? Man... Is it only contrastive ?? I will need to read the paper, I have not done that since this channel was created.
Cool stuff tho
I think even though you understand Noether's theorem correctly, you didn't understand the implications for intelligence. Humans are map creators. There is rarely a graph of attributes when it comes to representing a problem. We are almost always doing mapping on manifolds. And that is not just a nice-sounding sentence. We actively act on a map and perceive the world via prediction. Symmetry is encoded in the brain via actions. You can take a route (action) with your eye (using both the extraocular and ciliary muscles), come back to the same point (perception), and that's a symmetry.
Oh c'mon, I have never heard of "approximate conservation". Friction turns into heat. Energy is conserved (unless the gravitational field has been affected). I don't think this works for slow dissipation.
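For what "approximate conservation" could mean here, a small sketch under stated assumptions: a lightly damped pendulum with unit mass, unit length, g = 1, and a made-up damping coefficient of 0.01. Total energy including heat is conserved, but the mechanical energy that a model sees in the state only drifts slowly. None of this is from the paper; the names are illustrative.

```python
import math

def mechanical_energy(theta, omega):
    """Kinetic plus potential energy of a unit-mass, unit-length pendulum (g = 1)."""
    return 0.5 * omega ** 2 + (1.0 - math.cos(theta))

def simulate(theta, omega, damping, t_end, dt=1e-3):
    """Semi-implicit Euler integration of theta'' = -sin(theta) - damping * theta'."""
    steps = int(round(t_end / dt))
    for _ in range(steps):
        omega += dt * (-math.sin(theta) - damping * omega)
        theta += dt * omega
    return theta, omega

theta0, omega0, damping = 0.5, 0.0, 0.01
e0 = mechanical_energy(theta0, omega0)
e_short = mechanical_energy(*simulate(theta0, omega0, damping, t_end=1.0))
e_long = mechanical_energy(*simulate(theta0, omega0, damping, t_end=100.0))
rel_short = (e0 - e_short) / e0   # fraction of energy lost over a short horizon
rel_long = (e0 - e_long) / e0     # fraction of energy lost over a long horizon
```

Over the short horizon the mechanical energy changes by well under 2 percent, while over the long horizon a large fraction is dissipated. So whether a quantity counts as approximately conserved depends on the prediction horizon relative to the dissipation rate.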
Conservation of acceleration? Seems a lot of broken glass agrees.
Strange how the nature of a singularity on f(x)x/x can give 3 or 4 given interchange of integral and sum order, different approaches to a singularity.
Because there are examples where interchanging the order of an infinite summation changes the sum, can it really be assumed that there are only 3 kinds of behaviour near a singularity?
Not that much is approaching a singularity. But that's where completeness lies.