Dynamic Inference with Neural Interpreters (w/ author interview)

  • Published 19 Jan 2025

COMMENTS • 35

  • @YannicKilcher  3 years ago  +2

    OUTLINE:
    0:00 - Intro & Overview
    3:00 - Model Overview
    7:00 - Interpreter weights and function code
    9:40 - Routing data to functions via neural type inference
    14:55 - ModLin layers
    18:25 - Experiments
    21:35 - Interview Start
    24:50 - General Model Structure
    30:10 - Function code and signature
    40:30 - Explaining Modulated Layers
    49:50 - A closer look at weight sharing
    58:30 - Experimental Results
    Paper: arxiv.org/abs/2110.06399
    Guests:
    Nasim Rahaman: twitter.com/nasim_rahaman
    Francesco Locatello: twitter.com/FrancescoLocat8
    Waleed Gondal: twitter.com/Wallii_gondal

  • @zeev  3 years ago  +8

    Yannic, you sound more excited about this concept than usual. Something tells me this has some magic.

  • @anthonyrepetto3474  3 years ago  +7

    I'd been hoping for this sort of approach since 2017! Wonderful to see that you all have fit the pieces together well, to make Mixture of Experts with Attention in a composable fashion! All I did was write a vague essay - "Neural Networks: a Mixture of Experts with Attention" and then I wandered off to something else. Math-life! Thank you for putting the thought and rigor into making this real!

  • @JBoy340a  3 years ago  +4

    Another great video. I really like that you have the authors on so they can answer the questions others might have.

    • @SimonJackson13  3 years ago

      Sandbox stability violation error on programblame example URL. Stabilize via min span all essentials plus minimal impact cover plus benefit bound bias :D

  • @mikejason3822  3 years ago  +3

    Nice video.
    One point to note is that Waleed tried to add points to the conversation a few times but did not get a chance, e.g. at 1:18:47. It would have been better if everyone got an equal chance to speak when they wanted to.

  • @johnpope1473  3 years ago  +3

    5 seconds in - oh man - this is great. Having the authors that wrote the paper explain the damn thing. Awesome 🔥🔥🔥🔥🔥🔥

  • @thegistofcalculus  3 years ago  +3

    Pretty cool. I get the sense that if they were to scale this up and genuinely capture some kind of causality property of reality within most of the functions, then a more sophisticated routing scheme might be required to direct the flow of information, since the functions would only do something useful within a narrow context. So awesome to see causality getting chipped away at, just like unsupervised learning became demystified lately.

  • @Guytron95  3 years ago

    man! these interactive discussions are freakin' HOT! thanks :)

  • @ChristosKyrkou  3 years ago  +5

    First! Thanks Yannic for the great videos

  • @drdca8263  3 years ago  +4

    I’m surprised at the \otimes being element-wise multiplication? I would have thought to use \odot for that?
    Like, when I see \otimes , I’m thinking tensor product (which could also be meaningful in that location)

    • @nasimrahaman7886  3 years ago  +1

      Good pointer (thx!), \odot would have made more sense.
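      For readers following along, a minimal PyTorch sketch of a modulated linear layer written with the element-wise product being discussed (what \odot would denote); the shapes and names are illustrative, not the paper's exact parameterization:

      import torch
      import torch.nn as nn

      class ModLin(nn.Module):
          """Linear layer whose input is modulated element-wise by a code vector."""
          def __init__(self, dim: int, code_dim: int):
              super().__init__()
              self.code_proj = nn.Linear(code_dim, dim)  # map a function code to a modulation vector
              self.linear = nn.Linear(dim, dim)

          def forward(self, x: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
              scale = self.code_proj(code)   # condition-dependent modulation
              return self.linear(x * scale)  # element-wise product (the "\odot"), then a linear map

      x = torch.randn(8, 64)         # a batch of token representations
      code = torch.randn(32)         # a learned function code
      y = ModLin(64, 32)(x, code)    # output has the same shape as x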

  • @alpers.2123  3 years ago  +1

    I have an idea, not sure if it makes sense. Can we train a model where some part of it is forced to accept and produce binary vectors, then convert that part to native code with bitwise operations and fine-tune the rest? Like a learned logic circuit, which could also later be implemented on an ASIC.
    The model can be decomposed into 3 parts: encoder, logic unit, decoder. Discretized logic layers lose differentiability, so you cannot backpropagate through them; you can only fine-tune the decoder part. The encoder can be designed to be sparse, because converting floating-point vectors to bitsets loses information.
    The goal is to produce a faster and more compact model. Could this be possible? Has it been done already?
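    A rough PyTorch sketch of the idea above (all names hypothetical): an encoder feeding a hard-binarized "logic unit", with gradients reaching only the decoder, matching the point that the discretized part cannot be backpropagated through:

    import torch
    import torch.nn as nn

    class EncoderLogicDecoder(nn.Module):
        def __init__(self, in_dim=128, bits=64, out_dim=10):
            super().__init__()
            self.encoder = nn.Linear(in_dim, bits)      # could be made sparse, as suggested
            self.decoder = nn.Sequential(
                nn.Linear(bits, 64), nn.ReLU(), nn.Linear(64, out_dim))

        def forward(self, x):
            with torch.no_grad():                        # gradients stop at the discrete part
                bits = (self.encoder(x) > 0).float()     # binary vector; stand-in for a learned logic circuit
            return self.decoder(bits)                    # only the decoder receives gradients

    model = EncoderLogicDecoder()
    out = model(torch.randn(4, 128))                     # fine-tuning here updates decoder weights only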

  • @paxdriver  3 years ago  +1

    Are they running a second training operation on sets of outputs of early layers? Or are they running an internal typeinference(x) model underneath, using attention on the results?
    ... or did I completely misunderstand this one lol?

    • @nasimrahaman7886  3 years ago  +1

      > "Are they running a second training operation on sets of outputs of early layers?"
      We're not, though this should also work.
      We messed around with two ways of fine-tuning this:
      * Finetuning only the function signatures and codes -- think of these as learnable vectors that "instruct" the model what to do with its inputs. They usually won't amount to more than a few thousand parameters, and if there's not a lot of data, this is the way to go. We tested it with as few as 128 samples.
      * Finetuning everything, like you would any other model. If you have a good amount of data, this is a good place to start.

    • @paxdriver  3 years ago  +1

      @@nasimrahaman7886 thanks for clarifying for me :)
      I'm really impressed by the communication, you guys rock.
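      An illustrative sketch of the first fine-tuning option @nasimrahaman7886 describes above (updating only the function codes and signatures); the parameter names here are made up, the real model's will differ:

      import torch

      def finetune_codes_and_signatures_only(model, lr=1e-3):
          # Freeze everything except the learned function codes and signatures.
          for name, param in model.named_parameters():
              param.requires_grad = ("function_code" in name) or ("function_signature" in name)
          trainable = [p for p in model.parameters() if p.requires_grad]
          return torch.optim.Adam(trainable, lr=lr)  # typically only a few thousand parameters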

  • @arahir1129  3 years ago

    Hi Yannic. Can I ask what software you use for writing notes on these papers?

  • @erickmarin6147  3 years ago

    What if the script is generalizable to graph neural networks, with a function in every node?

  • @ScottzPlaylists  8 months ago

    Will the code be released?

  • @SimonJackson13  3 years ago  +2

    Ah, estimated future code line ... maybe useful to feed OoO stats to machine code optimizers. Common factors pulled earlier out of a loop, e.g. ... what are the outputs? How many errors can accumulate and be reduced to none? The effective S space for a lingo might be interesting.

    • @SimonJackson13  3 years ago  +1

      LOCs? AST statements? Closest valid AST?

    • @SimonJackson13  3 years ago

      Adversarial spare dispercity? Adversarial solute S gravity inversion? Does it lock on a never list deterministic pattern match?

    • @SimonJackson13  3 years ago

      Gödelian sandbox creation exception within experimental context. Outer kernel solidity execution precontext add swing. Back inference type stability markations on type for safe extraction of axiomatization of base code.

    • @laurenpinschannels  3 years ago  +3

      yo I kind of like where you're going with this but I think you might need to turn your temperature down bro

    • @laurenpinschannels  3 years ago

      It sounds like what you're saying is that you could really beef up compilers with this. That does seem plausible to me.

  • @JanBlok  3 years ago

    We might be watching the start of a new paradigm here 😀, anyone seen the code?

  • @444haluk  3 years ago  +1

    Yannic is missing some of his hairs.

  • @amaniarman460  3 years ago  +2

    Great stuff Yannic, I really enjoy this series with the authors. Did you see Andrej's and Justin's paper review with the first author of DALL-E? You might find it intriguing: ua-cam.com/video/PtdpWC7Sr98/v-deo.html
    Blessings

  • @Adhil_parammel  3 years ago  +3

    Cheap automated replication, differentiation, and integration of neural networks is all you need.

  • @erickmarin6147  3 years ago

    Imagine throwing a problem at an AI that decides which scripts to use.

  • @vincent-uh5uo  3 years ago  +2

    2