OUTLINE:
0:00 - Intro & Overview
3:00 - Model Overview
7:00 - Interpreter weights and function code
9:40 - Routing data to functions via neural type inference
14:55 - ModLin layers
18:25 - Experiments
21:35 - Interview Start
24:50 - General Model Structure
30:10 - Function code and signature
40:30 - Explaining Modulated Layers
49:50 - A closer look at weight sharing
58:30 - Experimental Results
Paper: arxiv.org/abs/2110.06399
Guests:
Nasim Rahaman: twitter.com/nasim_rahaman
Francesco Locatello: twitter.com/FrancescoLocat8
Waleed Gondal: twitter.com/Wallii_gondal
Yannic, you sound more excited about this concept than about others. Something tells me this has some magic.
I'd been hoping for this sort of approach since 2017! Wonderful to see that you all have fit the pieces together well, to make Mixture of Experts with Attention in a composable fashion! All I did was write a vague essay - "Neural Networks: a Mixture of Experts with Attention" and then I wandered off to something else. Math-life! Thank you for putting the thought and rigor into making this real!
Great minds and all that 🤩
Another great video. I really like you having the authors on so you can have them answer the questions others might have.
Sandbox stability violation error on programblame example url. Stabilize via min span all essentials plus minimal impact cover plus benefit bound bias :D
Nice video.
One point to note is that Waleed tried to add points to the conversation a few times but did not get a chance, e.g. at 1:18:47. It would have been better if every person got equal attention when they wanted to talk.
5 seconds in - oh man - this is great. Having the authors that wrote the paper explain the damn thing. Awesome 🔥🔥🔥🔥🔥🔥
Pretty cool. I get the sense that if they were to scale this up and genuinely capture some kind of causality property of reality within most of the functions then a more sophisticated routing scheme may be required to direct the flow of information, since the functions would only do something useful within a narrow context. So awesome to see causality getting chipped away at just like unsupervised learning became demystified lately.
Man! These interactive discussions are freakin' HOT! Thanks :)
First! Thanks Yannic for the great videos
I’m surprised at the \otimes being element-wise multiplication? I would have thought to use \odot for that?
Like, when I see \otimes , I’m thinking tensor product (which could also be meaningful in that location)
Good pointer (thx!), \odot would have made more sense.
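For reference, the two readings differ as follows. These are the standard definitions, not taken from the paper; the vectors a, b and the dimensions n, m are illustrative assumptions:

```latex
% Element-wise (Hadamard) product: same shape in, same shape out
(a \odot b)_i = a_i \, b_i, \qquad a, b \in \mathbb{R}^n, \quad a \odot b \in \mathbb{R}^n

% Tensor (outer) product: the result gains an index
(a \otimes b)_{ij} = a_i \, b_j, \qquad a \in \mathbb{R}^n,\ b \in \mathbb{R}^m, \quad a \otimes b \in \mathbb{R}^{n \times m}
```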
I have an idea, idk if it makes sense. Can we train a model in which some part is forced to accept and produce binary vectors, then convert that part to native code with bitwise operations, then fine-tune the rest? Like a learned logic circuit, which could also be implemented later on an ASIC.
The model can be decomposed into 3 parts: encoder, logic unit, decoder. Discretized logic layers lose differentiability, so you cannot backpropagate through them and can only fine-tune the decoder part. The encoder can be designed to be sparse, because converting floating-point vectors to bitsets loses information anyway.
The goal is to produce a faster and more compact model. Can this be possible? Was it done already?
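To make the idea concrete, here is a minimal sketch of the pipeline described above, not a known implementation. Every name in it (`binarize`, `pack_bits`, `logic_unit`, `unpack_bits`) and the fixed masks are hypothetical, and the "logic unit" is a hand-picked AND/XOR rather than anything learned:

```python
def binarize(values, threshold=0.0):
    """Stand-in for the encoder output: 1 where a value exceeds the threshold (lossy)."""
    return [1 if v > threshold else 0 for v in values]


def pack_bits(bits):
    """Pack a list of bits (most significant first) into a single integer word."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


def logic_unit(word, and_mask, xor_mask):
    """Hypothetical frozen logic layer: fixed masks applied with bitwise ops only."""
    return (word & and_mask) ^ xor_mask


def unpack_bits(word, width):
    """Unpack an integer word back into a bit list that a decoder could consume."""
    return [(word >> (width - 1 - i)) & 1 for i in range(width)]


features = [0.7, -1.2, 0.1, -0.3]   # made-up floating-point encoder activations
bits = binarize(features)
word = pack_bits(bits)
out = logic_unit(word, and_mask=0b1100, xor_mask=0b0011)
out_bits = unpack_bits(out, width=4)
```

Since the thresholding and the bitwise step have no useful gradient, only whatever consumes `out_bits` could be fine-tuned afterwards, which matches the constraint mentioned in the comment.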
Are they running a second training operation on sets of outputs of early layers? or are they running an internal typeinference(x) model underneath using attention on the results?
... or did I completely misunderstand this one lol?
> "Are they running a second training operation on sets of outputs of early layers?"
We're not, though this should also work.
We messed around with two ways of fine-tuning this:
* Fine-tuning only the function signatures and codes -- think of these as learnable vectors that "instruct" the model what to do with its inputs. They usually won't amount to more than a few thousand parameters, and if there's not a lot of data, this is the way to go. We tested it with as few as 128 samples.
* Fine-tuning everything, like you would any other model. If you have a good amount of data, this is a good place to start.
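The two modes above can be sketched in plain Python as a choice of which parameters receive updates. This is an illustration only, not the authors' code: the parameter names, the name-prefix convention, and the toy gradients are all assumptions.

```python
def select_trainable(params, mode):
    """Return the names of the parameters to update for a given fine-tuning mode."""
    if mode == "signatures_and_codes":
        # Only the small learnable vectors; everything else stays frozen.
        return {name for name in params if name.startswith(("signature", "code"))}
    if mode == "everything":
        return set(params)
    raise ValueError(f"unknown mode: {mode}")


def sgd_step(params, grads, trainable, lr=0.5):
    """One plain gradient step that leaves non-trainable parameters untouched."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)
    return updated


# Hypothetical parameter names and toy values, for illustration only.
params = {
    "signature.0": [0.5, -0.5],
    "code.0": [1.0, 2.0],
    "interpreter.weight": [3.0, 4.0],
}
grads = {name: [1.0, 1.0] for name in params}

trainable = select_trainable(params, "signatures_and_codes")
updated = sgd_step(params, grads, trainable)
```

Switching the mode to `"everything"` makes the same step update all three entries, which corresponds to the second option.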
@@nasimrahaman7886 thanks for clarifying for me :)
I'm really impressed by the communication, you guys rock.
Hi Yannic. Can I ask what software you use for writing notes on these papers?
What if the script is generalizable to graph neural networks with a function in every node?
Will the code be released?
Ah estimated future code line ... maybe useful to feed OoO stats on machine code optimizers. Common factors pulled earlier out of a loop eg. ... what's the outputs? How many errors can accumulate and be reduced to none? The effective S space for a lingo might be interesting.
LOCs? AST statements? Closest valid AST?
Adversarial spare dispercity? Adversarial solute S gravity inversion? Does it lock on a never list deterministic pattern match?
Godelian sandbox creation exception within experimental context. Outer kernel solidity execution precontext add swing. Back inference type stability markations on type for safe extraction of axiomatization of base code.
yo I kind of like where you're going with this but I think you might need to turn your temperature down bro
It sounds like what you're saying is that you could really beef up compilers with this. That does seem plausible to me.
We might be watching the start of a new paradigm here 😀. Has anyone seen the code?
Yannic is missing some of his hairs.
Great stuff Yannic I really enjoy this series w/ author. Did you see Andrej's and Justin's paper review with first author of DALL-E... you might find it intriguing. ua-cam.com/video/PtdpWC7Sr98/v-deo.html
Blessings
Cheap automated replication, differentiation and integration of neural networks is all you need.
Imagine throwing a problem at an AI that decides which scripts to use.