Mixture of Agents (MoA) BEATS GPT4o With Open-Source (Fully Tested)

  • Published 22 Jun 2024
  • Full test of the Mixture of Agents (MoA) implementation.
    Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy/dell-nvidia-mo... (Only available in North America this time)
    Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Links:
    github.com/togethercomputer/MoA
    Leaderboard - bit.ly/3qHV0X7
  • Science & Technology

COMMENTS • 280

  • @matthew_berman
    @matthew_berman  6 days ago +19

    Should MoA be the default for Open Source now?
    Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy/dell-nvidia-monitor-1 (Only available in North America this time)
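    The layered setup being asked about can be sketched in a few lines of Python. The proposer-layers-plus-aggregator flow follows the MoA repo's description; the stub "models" below are hypothetical placeholders, not real LLM calls.

    ```python
    # Minimal Mixture-of-Agents sketch. Each layer's answers are appended to
    # the prompt for the next layer; a final aggregator synthesizes the result.
    # The lambdas below are hypothetical stand-ins for real LLM calls.

    def run_moa(prompt, layers, aggregator):
        context = prompt
        for layer in layers:
            answers = [model(context) for model in layer]
            # Aggregate-and-synthesize: the next layer sees the original
            # prompt plus all answers from the previous layer.
            context = prompt + "\n\nPrevious answers:\n" + "\n".join(answers)
        return aggregator(context)

    shouty = lambda ctx: ctx.split("\n")[0].upper()      # stub "model" 1
    echo = lambda ctx: ctx.split("\n")[0]                # stub "model" 2
    final = lambda ctx: "FINAL: " + ctx.split("\n")[0]   # stub aggregator

    print(run_moa("hello", [[shouty, echo], [shouty, echo]], final))
    # -> FINAL: hello
    ```

    In the real repo the stubs would be API calls to the four open-source models, and the aggregator prompt asks the final model to synthesize the candidate answers.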

    • @d.d.z.
      @d.d.z. 6 days ago +2

      If I'm outside the US, do I have no chance?

    • @user-ru1qz1bo2q
      @user-ru1qz1bo2q 6 days ago +1

      Generally speaking, the improvements seen here can be achieved with standard open source models by using more effective prompting. The prompts you use for these tests seem specifically designed to make the models work as hard as possible. Better prompting doesn't carry the significant speed or memory costs of the MoA paradigm.

    • @jimmassey140
      @jimmassey140 6 days ago

      I've gotten some models to perform better on the "apple" challenge by increasing the "cost" of getting one wrong. Maybe worth a shot more broadly? E.g.: Please generate 10 sentences that end in the word "apple". If any one of the sentences does NOT end in the word "apple", then you have FAILED the entire task. There is NO credit for partial success. (Llama3 8b and 70b seem to be impacted by this a lot.)
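      The all-or-nothing framing above is easy to grade mechanically. A hedged sketch (the helper names are made up for illustration) that builds the prompt and checks a model's output:

      ```python
      # Build the commenter's "all-or-nothing" prompt and grade the result.
      # Helper names are hypothetical; the prompt wording is the commenter's.

      def high_stakes_prompt(n=10, word="apple"):
          return (f'Please generate {n} sentences that end in the word "{word}". '
                  f'If any one of the sentences does NOT end in the word "{word}", '
                  'then you have FAILED the entire task. '
                  'There is NO credit for partial success.')

      def all_end_with(sentences, word="apple"):
          # Strip trailing punctuation/quotes before checking the last word.
          return all(s.rstrip(".!?\"' ").lower().endswith(word) for s in sentences)

      print(all_end_with(["I ate an apple.", "She bought an apple!"]))  # True
      print(all_end_with(["Apples are red."]))                          # False
      ```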

  • @joe_limon
    @joe_limon 6 days ago +53

    I can't wait for MoA to be smart enough to pull specific models based on what they are good at rather than prompting every single model. This would bring way more value toward training narrower specialized models that outperform at specific tasks.
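    A crude version of that routing idea, with keyword matching standing in for whatever learned router or router-LLM a real system would use (all names below are hypothetical):

    ```python
    # Hypothetical router: send each prompt only to the specialist likely to
    # handle it, instead of querying every model. Keyword matching is just a
    # sketch; a production router would be a classifier or a small LLM.

    SPECIALISTS = {
        "code": lambda p: "code-model answer",
        "math": lambda p: "math-model answer",
    }

    def route(prompt, default=lambda p: "general-model answer"):
        text = prompt.lower()
        if any(k in text for k in ("python", "function", "bug")):
            return SPECIALISTS["code"](prompt)
        if any(k in text for k in ("sum", "integral", "equation")):
            return SPECIALISTS["math"](prompt)
        return default(prompt)

    print(route("Write a Python function"))  # code-model answer
    ```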

    • @matthew_berman
      @matthew_berman  6 days ago +12

      Agreed. This is what the HuggingGPT paper from last year was all about! Finally coming to fruition.

    • @Yipper64
      @Yipper64 6 days ago +4

      So one thing we know is that if you train a small model on data from a bigger model literally just to prompt it, it can work much more like the better model.
      Well MOA allows smaller models to work together to behave like a bigger model.
      Idk if you get diminishing returns, but I feel like you could literally loop this and get something that trains itself.

    • @rayr268
      @rayr268 6 days ago

      Also good for running on smaller devices imo

    • @joe_limon
      @joe_limon 6 days ago

      @@rayr268 and running much faster

    • @14supersonic
      @14supersonic 6 days ago +1

      Most likely, what we would also need is a model that's specifically trained to understand agentic workflows and identify what types of models are typically good at what types of tasks. Then I think we'll be cooking.

  • @klaushermann6760
    @klaushermann6760 6 days ago +85

    Every enterprise now knows anyone is going to ask for the snake game. That is already something so slick that it's not even worth asking anymore.

    • @vio_tio12
      @vio_tio12 6 days ago +16

      fr he should update his benchmarks

    • @netherportals
      @netherportals 6 days ago +1

      Water cooler magic at its best

    • @jichaelmorgan3796
      @jichaelmorgan3796 6 days ago +1

      That's what you call AI general mastery of a task. We have to keep coming up with more general tasks or "skills" for them to master on the march to AGI.

    • @Joe333Smith
      @Joe333Smith 6 days ago

      Exactly, totally 100% useless

    • @matthew_berman
      @matthew_berman  6 days ago +23

      Yet models still can't pass it consistently!

  • @njorgard
    @njorgard 6 days ago +44

    When are you testing Claude Sonnet 3.5?

  • @tvwithtiffani
    @tvwithtiffani 6 days ago +2

    The Killers and Marble answers seem so good that it seems the models might be training on your test questions now.

  • @seanmcgu
    @seanmcgu 6 days ago +4

    Yes, would love to see MoA working together for coding! Thanks for your consideration.

  •  6 days ago +14

    With crewAI you can build a similar setup and also give it instructions to test the code of each iteration.

    • @MrMoonsilver
      @MrMoonsilver 6 days ago

      Do you have a link to that?

    •  6 days ago

      @@MrMoonsilver YT does not like when I post links directly, but if you google "deeplearning crewai" you will find a whole course completely for free.
      Also there are many tutorials here on YT. You can search how to connect different models and multiple agents into a single workflow for crewAI. You can connect local models, or run them in the cloud, or even use APIs from 3rd parties like OpenAI or Groq.

  • @dbishnoi
    @dbishnoi 6 days ago +4

    You delivered Matt. And quickly too. Thank you. This is amazing.

  • @shubharthaksangharsha6248
    @shubharthaksangharsha6248 6 days ago +24

    why are you not doing a video on Sonnet 3.5 bro?

  • @KC_79
    @KC_79 6 days ago +28

    The combination of 4 models is 452B, and I can't even run a 72B model. I guess I will continue to use large models from big companies.

    • @CaponeBlackBusiness
      @CaponeBlackBusiness 6 days ago

      Sell your organs 😏

    • @oratilemoagi9764
      @oratilemoagi9764 6 days ago +8

      72B dawg, I barely run an 8b model

    • @handsanitizer2457
      @handsanitizer2457 6 days ago +1

      Hmm, maybe I'll try it with Phi, Qwen 0.5 to 7b, etc.

    • @Derick99
      @Derick99 6 days ago +2

      Yea 4 x phi3

    • @blisphul8084
      @blisphul8084 6 days ago +1

      ​​​@@Derick99 What about 2Qwen 1 Phi?
      As in 2 Qwen2 1.5b agents and 1 Phi 3 mini model.
      Two 1.5b models and one 4b model could be an interesting MoA option that fits on one GPU.

  • @bosthebozo5273
    @bosthebozo5273 6 days ago +3

    Can't wait for the Sonnet video Matt! So far, I've created about 6 basic games like a simple RTS, strategy card game, jpg puzzle generator, asteroids, endless racer and of course snake... often in one shot. This model is insane in terms of progress.

  • @Quinceybibbs
    @Quinceybibbs 6 days ago +16

    Thank you for this😊 can you please create a follow-up video using code models

    • @wurstelei1356
      @wurstelei1356 6 days ago +1

      Yes, I've been waiting for a MoA coder for a while now.

  • @asastudios7782
    @asastudios7782 6 days ago +4

    GPT-4o nails the apple test
    1. Dude the metaphysical implications of quantum entanglement challenge our understanding of causality dude.
    2. Dude the dialectical synthesis of thesis and antithesis reveals the evolution of human thought dude.
    3. Dude the existential dilemma of free will versus determinism remains an enigma dude.
    4. Dude the ontological argument for the existence of a higher power transcends empirical evidence dude.
    5. Dude the phenomenology of consciousness illuminates the subjective nature of reality dude.
    6. Dude the epistemological pursuit of knowledge questions the limits of human understanding dude.
    7. Dude the ethical relativism in cultural contexts underscores the complexity of moral philosophy dude.
    8. Dude the teleological perspective on the universe suggests an inherent purpose to existence dude.
    9. Dude the interplay between chaos and order is fundamental to the fabric of the cosmos dude.
    10. Dude the hermeneutics of interpreting ancient texts unveils the timelessness of human wisdom dude.

    • @wurstelei1356
      @wurstelei1356 6 days ago

      Dude the balls grow exponentially with each sentence dude.

    • @dulinak6251
      @dulinak6251 3 days ago

      Dude this is art dude

  • @MonkeyBars1
    @MonkeyBars1 6 days ago +5

    Finally the ball didn't end up in the microwave!! 🎉

    • @netherportals
      @netherportals 6 days ago

      "End a sentence with the word apple" "No" "Okay, end a sentence with the word apple" "Apple".

  • @Timotheeee1
    @Timotheeee1 6 days ago +8

    11:40 it just wrote random sentences and added ", apple" at the end of them

    • @marc_frank
      @marc_frank 6 days ago +1

      yeah it's not very smart in that regard

    • @MonkeyBars1
      @MonkeyBars1 6 days ago +1

      fail not pass

    • @matthew_berman
      @matthew_berman  6 days ago +1

      I'll still count it :)

    • @Cine95
      @Cine95 6 days ago +1

      but it is correct

    • @MonkeyBars1
      @MonkeyBars1 6 days ago +1

      @@matthew_berman a sentence is determined by syntax not just punctuation, so your prompt was not fulfilled.

  • @TheAlastairBrown
    @TheAlastairBrown 6 days ago +2

    I'd love to see a collab between Claude 3.5 and GPT-4o, especially with multiple agents that are set to different temperatures, with the final agent being set to low creativity making the final decision. The mixing of temperatures is extremely important: you want the models to be as creative as possible so they come up with amazing solutions, but you also need strict rational enforcers to keep the crazy in check.

  • @BarryMcBangerz
    @BarryMcBangerz 6 days ago +1

    Great vid, would definitely love to see more MoA videos trying out different models and tasks

  •  6 days ago +1

    Very impressive Matt, thank you!

  • @user-qb2jn9zh9i
    @user-qb2jn9zh9i 6 days ago +1

    In a standard situation, where the temperature is set from 0 to 1, setting the temperature to 0.7 means getting a fierce delirium generator out of anything. If the temperature in this “mixture” is determined differently, it is worth talking about this in detail.
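    For reference, temperature rescales the logits before softmax, which is why 0.7 is noticeably more random than a value near 0. A standard-formula sketch of the effect:

    ```python
    import math

    def softmax_with_temperature(logits, t):
        """Scale logits by 1/t before softmax; lower t sharpens the distribution."""
        scaled = [x / t for x in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.0, 0.0]
    cold = softmax_with_temperature(logits, 0.1)
    warm = softmax_with_temperature(logits, 0.7)
    # The top token gets far more probability mass at t=0.1 than at t=0.7,
    # so sampling at 0.7 picks lower-ranked tokens much more often.
    ```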

  • @mediacenter3174
    @mediacenter3174 6 days ago +2

    Claude 3.5:
    Let's think through this step-by-step:
    The person takes a marble.
    They put the marble inside a cup.
    They put the cup upside down on the table.
    They take the cup and put it in the microwave.
    The key point here is step 3: when the cup was turned upside down on the table, the marble would have fallen out onto the table.
    Therefore, the marble is still on the table where the cup was initially placed upside down.
    The cup is now in the microwave, but it's empty - the marble is not in the cup anymore.

  • @spdnova9012
    @spdnova9012 6 days ago +1

    matt posting faster than light speed 😭💀 every time i open youtube there are like 1-2 new videos

  • @kostaspramatias320
    @kostaspramatias320 6 days ago

    Good testing, thanks Matthew

  • @fabiankliebhan
    @fabiankliebhan 6 days ago +14

    Great stuff. I found a great prompt on X that breaks almost every LLM at the moment. Maybe you could consider adding this?
    "A farmer and a sheep are standing on one side of a river. There is a boat with enough room for one human and one animal. How can the farmer get across the river with the sheep in the fewest number of trips?"

    • @TheRysiu120
      @TheRysiu120 6 days ago +2

      I just tested it and surprisingly it really does destroy their logic

    • @jje984
      @jje984 6 days ago +1

      That's so odd, on a single shot attempt both GPT4o and Sonnet 3.5 get it wrong. With a prompt like "why does the boat have to go back" they get it right. But their first answer is broken.

    • @donaldedward4329
      @donaldedward4329 6 days ago +3

      Perhaps this has to do with the fact that sheep is an irregular noun, i.e., both singular and plural are spelled the same.
      I just tried with a dog with Qwen 5GB: broken.
      But Qwen 15GB gets it right.
      Just tried GPT-4; it took 3 trips.

    • @djfremen
      @djfremen 6 days ago

      Write it like this “A farmer and a koala bear are on one side of a river. There is a boat that can carry the farmer and the koala bear at the same time. How many trips are needed for the farmer to get across the river with the koala bear?”

    • @moozooh
      @moozooh 6 days ago

      @@donaldedward4329 Nothing to do with this; almost every model breaks with a wide variety of different entities. I've tried this in the past with Elon Musk and Cybertruck, John Wayne and horse, but the most devious is an Olympic swimmer and a ferryman. Dozens of attempts across dozens of models with hilarious(ly bad) results in the vast majority of cases, with the GPT family being by far the most consistent. The reason why it happens, as far as I understand, is that the biggest models overfit to the _structure_ of the puzzle which is present a LOT of times in their training data, and in the vast majority of cases it has more than two entities as well as some limitation on why they cannot all cross together, and the learned assumption that it _should_ be solved this way overpowers the easy, straightforward answer presented right in the prompt. Some models like Yi will go so far as to invent the third object and insert it in the puzzle just so it could fit its training better. Notably, Codestral is very resilient to this "attack", presumably because of code being its main training corpus (so basic logic learned from the code overpowers structural overfit), although Deepseek-coder fails just as well.

  • @Kram1032
    @Kram1032 6 days ago +1

    executing code at each step sounds like a security nightmare
    very impressive performance tho

  • @nzahmd4117
    @nzahmd4117 6 days ago +1

    Could you provide links to the paper the diagrams are from, in the description or along with the video? Thanks

  • @pedrorafaelnunes
    @pedrorafaelnunes 6 days ago +1

    I have done something close to a mixture of agents, I think.
    I got a bunch of local, OpenAI, and Groq LLMs to respond to the same input.
    Then a voting system to choose the best and most correct output of all.
    It was capable of giving the correct output for almost every question!
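    The voting step described above can be as simple as a normalized majority vote. A sketch, assuming exact-match answers (which only works for short factual outputs, not free-form text):

    ```python
    from collections import Counter

    def majority_vote(answers):
        """Pick the answer most models agree on (ties go to the first seen)."""
        normalized = [a.strip().lower() for a in answers]
        winner, _ = Counter(normalized).most_common(1)[0]
        # Return the original casing of the first answer matching the winner.
        return next(a for a, n in zip(answers, normalized) if n == winner)

    print(majority_vote(["Paris", "paris ", "London"]))  # Paris
    ```

    For free-form answers a real system would instead ask a judge model to rank the candidates, since string equality rarely holds.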

  • @JakobN-zg1st
    @JakobN-zg1st 6 days ago

    Thanks for all the work you put in. And I always appreciate the open source love

  • @aSFADVSrbWETRWEYHTET
    @aSFADVSrbWETRWEYHTET 6 days ago +2

    Hey, could you potentially share the notion page, where you have your benchmarks?

    • @matthew_berman
      @matthew_berman  6 days ago +1

      bit.ly/3qHV0X7 sorry I usually share it! i'll put it in the desc as well

  • @noeservellon
    @noeservellon 6 days ago +3

    can you make an episode on how to run this locally? It would be interesting to see this run with SLMs instead of LLMs

    • @brulsmurf
      @brulsmurf 6 days ago

      locally on your 30000€ GPU?

    • @wurstelei1356
      @wurstelei1356 6 days ago

      I think this is running locally. Still a tutorial on how to run the MoA code from the github repo would be great.

  • @realKytra
    @realKytra 6 days ago

    thanks, your channel is fantastic 👌
    Keep up the good work, very interesting and inspiring 💪

  • @maj373
    @maj373 6 days ago

    Thank you Matthew!

  • @jozitrucker7123
    @jozitrucker7123 6 days ago +2

    We're waiting for the Claude 3.5 test…

  • @dee132456
    @dee132456 6 days ago +2

    Is it really a fair test? Since they are 4 LLMs through 3 layers, it would be like asking ChatGPT-4o 12 questions. To test whether multiple different LLMs are better, you'd have to run MoA using just ChatGPT-4o as all 4 agents.

  • @drlordbasil
    @drlordbasil 6 days ago

    I did ML lobes and different models in my project instead of just different models. Love the progress in everyone's work lately!

  • @ingenierofelipeurreg
    @ingenierofelipeurreg 6 days ago +9

    Pls share cheatsheet for try locally

    • @bodhi.advayam
      @bodhi.advayam 6 days ago

      2x a 70b model... locally... I need to upgrade my computer!

  • @bennyboiii1196
    @bennyboiii1196 6 days ago +1

    I don't really see a super big advantage with MoA in this way. I do like the aggregator model, but I feel like there are better (and faster) ways of doing this kind of thing with a router agent and a verification agent. Basically, instead of pooling a bunch of answers, you would route the prompt to a specific agent, then duplicate said agent to verify the answer, creating an adversarial network that wouldn't spit out an answer until it can verify that it is correct. It would be slow, just like this, but LLMs are quite good at comparison, so boiling down a question of any type of logic to mainly comparison logic would allow the LLM to play to its advantages.
    In crewAI, I did a similar experiment and found that it basically got all questions right, even if the initial answer given on the first round was wrong. This included planning questions. To me this is kind of what MCTSr does but at a higher level. The difference was, I did it with only llama70b, and didn't bother doing the routing thing. It would probably be more accurate if I did the routing.
    Instead of the snake game I asked if it could code a draggable element in a window, as well as other UI elements (i.e. a slider, an internal pane, a context menu, etc.) to give it some curveballs in case it was trained on snake.
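    The generate-then-verify loop described above can be sketched with stub agents; in a real setup `generate` and `verify` would each be backed by an LLM call (e.g. via crewAI), and the verifier's feedback would go into the next generation prompt:

    ```python
    # Generate-and-verify loop with hypothetical stub agents: keep asking the
    # generator until the verifier accepts, up to a round limit.

    def solve_with_verification(question, generate, verify, max_rounds=5):
        feedback = ""
        answer = None
        for _ in range(max_rounds):
            answer = generate(question, feedback)
            ok, feedback = verify(question, answer)
            if ok:
                return answer
        return answer  # best effort after max_rounds

    def make_generator(candidates):
        """Stub generator that proposes each candidate in turn."""
        it = iter(candidates)
        return lambda question, feedback: next(it)

    def verify(question, answer):
        """Stub verifier that only accepts '4'."""
        ok = (answer == "4")
        return ok, "" if ok else "try again"

    print(solve_with_verification("2+2?", make_generator(["3", "5", "4"]), verify))
    # -> 4
    ```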

  • @novantha1
    @novantha1 6 days ago +1

    One thing I noticed about the performance scaling of the scores is that MoA seems to "crush" the performance of models towards the ceiling of all possible scores; GPT 4 involvement wasn't a strong improvement in capability, compared to just the open source models.
    The implication of this to me is that a person could probably actually pull back on model size quite a bit and still get fairly competitive performance. With something like S-Lora (I think this was it, I'm referring to the implementation of LoRA that allows hot-swapping of LoRAs at inference), I think you could possibly hit very strong performance with domain specific tuning in a lot of areas and a single, strong, fairly small model. Imagine something to the effect of...
    Stage 1:
    Llama 3 8B
    L3 8B networking LoRA
    L3 8B database LoRA
    L3 8B frontend LoRA
    Stage 2:
    Llama 3 8B
    L3 8B x86 intrinsics C LoRA
    L3 8B pen tester LoRA
    And so on, so forth.
    I'm pretty sure a smart implementation could have very little memory overhead in the sense that you could possibly keep the base model loaded and "hot swap" the LoRAs in by calculating the impact of the LoRA at every layer, or you could just save the inverse of the XOR of the LoRA and use it to swap back to the base model before applying the next LoRA in the sequence.
    With a setup like this I'm pretty sure you could lose not that much performance but be able to run this on a 4090, for instance, or frankly, even on a CPU.
    Bonus points would be having some form of semantic assessment that let the system pick from hundreds of LoRAs based on the problem at hand, for each stage of the pipeline, so you didn't have to manually set up the pipeline for each individual task.
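    The "hot swap" idea above boils down to adding the low-rank delta B@A to the base weights and subtracting the same delta to restore them (the additive analogue of the commenter's store-the-inverse idea). A toy 2x2 sketch; a real implementation would do this per layer on GPU tensors:

    ```python
    # Toy LoRA hot-swap: W' = W + B@A to load the adapter, W = W' - B@A to
    # restore the base model before applying the next adapter.

    def matmul(b, a):
        return [[sum(b[i][k] * a[k][j] for k in range(len(a)))
                 for j in range(len(a[0]))] for i in range(len(b))]

    def add_inplace(w, delta, sign=1):
        for i in range(len(w)):
            for j in range(len(w[0])):
                w[i][j] += sign * delta[i][j]

    W = [[1.0, 2.0], [3.0, 4.0]]   # base weights
    B = [[1.0], [0.0]]             # rank-1 LoRA factors
    A = [[0.5, 0.5]]
    delta = matmul(B, A)           # [[0.5, 0.5], [0.0, 0.0]]

    add_inplace(W, delta, +1)      # swap the LoRA in
    add_inplace(W, delta, -1)      # swap it back out
    print(W)                       # back to the base weights
    ```

    Note that in low-precision formats the subtraction may not round-trip exactly, which is one reason implementations often keep the base weights and apply deltas on the fly instead.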

  • @eucharisticadoration
    @eucharisticadoration 5 days ago +1

    Yes, please try a local version of local LLMs doing a MoA for Source-Code!

  • @MagnesRUS
    @MagnesRUS 6 days ago

    Thanks! I wonder how they would work in conjunction with proprietary models, as a combination of proprietary models, or as a combination of the best models from the leaderboard at different parameter sizes (8, 72, etc.). Coding would also be interesting to see. An interesting option is combining small models so that they fit into 16-24-48 GB.

  • @fahadxxdbl
    @fahadxxdbl 6 days ago

    I love these evaluations

  • @Bacca839
    @Bacca839 6 days ago

    I found it incredibly interesting to see that it queried gravity for the marble problem considering that you removed that portion of the prompt a while back.

  • @dudedkdk
    @dudedkdk 4 days ago

    I think it would be beneficial to explore more advanced tasks for agentic models to truly demonstrate whether they outperform those that respond to single, one-shot prompts. Tasks could include writing documentation for a large codebase, undertaking more complex, prolonged machine learning training, or other activities that exceed what a single prompt could encompass. It would be very interesting to have different evaluations for the base model and agentic workflow models, highlighting their respective capabilities.
    As always thanks for the vid!

  • @brianWreaves
    @brianWreaves 6 days ago

    Instead of running all 3 steps in parallel, which is similar to CoT, is there a method where in the 2nd step each model evaluates the other 2 models' responses to improve its own 2nd response? Then in the 3rd step they merge all 3 responses into a single response, which is given as the answer in the 4th step. That would be the true value: collaborating on the result just as if you were collaborating with 2 colleagues at work.

  • @emnovoa
    @emnovoa 6 days ago

    Could you give details of the hardware you used to run this example?

  • @marcfruchtman9473
    @marcfruchtman9473 6 days ago

    Thanks for the review. I do think the Mixture of Agents method might be a little difficult for code: how do they come together to decide on the right code without adversely affecting each other?

  • @geonovelty
    @geonovelty 6 days ago

    Can we choose local fine-tuned models or other models from Hugging Face? Or multiple LoRAs instead of having a selected base model?

  • @jonmichaelgalindo
    @jonmichaelgalindo 6 days ago +1

    It just randomly added the word "apple" to the end of the sentences. :-P Well-played, AI.

    • @wurstelei1356
      @wurstelei1356 6 days ago

      Yes, Matt should extend the question to something like: ...10 sentences with the word apple at the end that make sense.

  • @UnchartedDiscoveries
    @UnchartedDiscoveries 4 days ago +1

    interested to see MoA using LLAMA 3, GPT-4o and Sonnet 3.5

  • @ronbridegroom8428
    @ronbridegroom8428 6 days ago

    Yes, I would like to see this with coding related models. Thanks for all the work involved in your videos.

  • @dudufusco
    @dudufusco 5 days ago

    Did you run it all locally? Which hardware is needed to have enough performance for real life applications?

  • @nathanbanks2354
    @nathanbanks2354 6 days ago

    It'll be fun to watch Anthropic and OpenAI et al apply all of these research papers. Plus it will be great to see Meta & various open-source models jump ahead of them again. This also gives me hope for high quality artificial training data.

  • @KodandocomFaria
    @KodandocomFaria 6 days ago

    Have you tried the Microsoft Samba hybrid model?

  • @KurtWoloch
    @KurtWoloch 6 days ago

    So what happens if you compare MoA with the newly released Claude 3.5 Sonnet?

  • @mikezooper
    @mikezooper 6 days ago +1

    Matthew’s millionth video: his AI clone while he’s on the beach sipping cocktails 😀

    • @wurstelei1356
      @wurstelei1356 6 days ago

      Sometimes I think his AI clone is already in the current video...

  • @user-tz7jq9sw4d
    @user-tz7jq9sw4d 4 days ago

    Is your benchmarking focused on single shot accuracy? Between Claude, Gemini and GPT4o, if you pass a script from one LLM to the next asking each to make corrections they get it right by about the 3rd hop

  • @glitch_city_gamer2846
    @glitch_city_gamer2846 6 days ago

    I think the most interesting outcome of this test run was the explanation of the flaws in the more difficult logic reasoning questions and where the LLMs get confused, giving us better insight into how they think about problems. It would be interesting to ask how to write a prompt with the specific information it would need: the marble size and cup size, open-ended, etc. The concept itself is amazing, of course. It would be interesting to create a mixture of experts of code models, and then create an MoE architecture on top of that, using the top 5 open source coding experts as the coding expert in the MoE, with the best closed source LLM as the coordinator, vs. open source. A bit of "how deep does the rabbit hole go".

  • @damienboykin7772
    @damienboykin7772 5 days ago

    Would it be possible to combine this with Nvidia's SCUDA to accelerate the processing speed of querying all the models?

  • @masonweimer5337
    @masonweimer5337 5 days ago

    I would definitely love to see this tested but with models more focused on coding! Keep up the good work!

  • @darwinboor1300
    @darwinboor1300 6 days ago

    Thanks Matthew. Now we need a task-parsing AI to break prompts into tasks and a supervisor AI to iterate and optimize the MoA build for each task. Next, put the crew to work building a factual real-world knowledge base, identifying holes in that knowledge base, and building better versions of the crew and the hardware they run on.
    PS Love your new hardware. Thanks to Dell and Nvidia

  • @romgenie
    @romgenie 6 days ago

    Absolutely would love to see a setup with coding agents (or uniquely as you suggested with testing the code execution).

  • @24-7gpts
    @24-7gpts 6 days ago

    Nice concept, it's just like a diverse group of researchers, not just one

  • @chetanreddy6128
    @chetanreddy6128 6 days ago

    yes, we need a code-specific open-source model agents benchmark video

  • @MeinDeutschkurs
    @MeinDeutschkurs 6 days ago

    What exactly is a sentence? Does a sentence end with a period, question mark, or exclamation mark? Can it end with a comma? Hmmm.

  • @isaach.1135
    @isaach.1135 6 days ago

    So is there a self-hosted option? I could see using lighter-weight models to make it more practical, but checking out the linked GitHub page, it just says to grab an API key...

  • @isg9106
    @isg9106 6 days ago

    I really like the rubric you use to test the models, but I've always felt like it could benefit greatly from just the slightest adjustment in the values you use when presenting the questions. Some models are really good at repeating things verbatim and get tripped up when the numbers are even slightly modified from the original, and I think you've even mentioned the idea of adding this to your rubric in the past. I'm REALLY interested in seeing which models completely fail when given minor changes to the parameters of the problem they were trained on.

  • @snts_andres
    @snts_andres 6 days ago

    What would be the difference of creating the same architecture with multiple layers of the same model? Or creating several responses on the same layer and then a second verification layer? Isn't this basically selection-inference prompting? I know that each model is better at certain tasks but in my opinion this adds a lot of complexity

  • @ahrmiller2003
    @ahrmiller2003 5 days ago

    Great review. Yes, please do one for coding via multi AI. Thank you.

  • @thetrueanimefreak6679
    @thetrueanimefreak6679 6 days ago

    Matt, thanks for the hard work. I think you should try incorporating agentic AI in the mix in your next video with these LLMs. Much love

  • @pigeon_official
    @pigeon_official 2 days ago

    what happens if you use MoA with all of the agents being the same model? Like, could you just take the same model, say for example llama3 70b, and have all 4 models be llama3 70b?

  • @REDULE26
    @REDULE26 3 days ago

    On github they’re talking about MoA lite, is this an implementation with only small models like llama3 8b, phi3 small,… ? I’m kinda curious about how good it could be

  • @hinro
    @hinro 6 days ago

    Have you tried using it with open-interpreter? You might be able to have it test itself with code

  • @rahulnundlall2617
    @rahulnundlall2617 6 days ago

    Very keen to see you test MoA with coding models

  • @danberm1755
    @danberm1755 6 days ago +1

    From my experience it makes 100% sense that agents are MUCH stronger than a single pass for each word through the neural network.
    You have to envision the training data of the Internet.
    We already have AGI, we just need to expand agents. Agents provide critical thinking about random thoughts that pass through an LLM's brain. Just like humans do.

    • @carlosamado7606
      @carlosamado7606 6 days ago +1

      True, imagine giving the first answer that comes to your mind. No source checking, no editing, no deep thought about the subject, etc...

  • @aleksandreliott5440
    @aleksandreliott5440 6 days ago

    I would love to see a "mixture of agents" video for code stuff.

  • @VishnuSashi-yq3tt
    @VishnuSashi-yq3tt 6 days ago

    I've been working on this for 3 months and I see this, ughh

  • @jkcrews09
    @jkcrews09 5 days ago

    Could you run them all individually and combined (MoA) at the same time…?

  • @yrudrc
    @yrudrc 6 days ago

    Amazing 🤩

  • @itamarperez-ryan3654
    @itamarperez-ryan3654 6 days ago

    How can I learn to create agents?

  • @DanielKnoodle
    @DanielKnoodle 5 days ago

    @matthew_berman I would love to see the code version of MoA. What are your current favorite top models for code generation?

  • @miket64
    @miket64 6 days ago

    It would be great to see the results of using more accessible models like Llama 3 8B

  • @MrMiniPilote
    @MrMiniPilote 4 days ago

    New Test: "Given these letters; R, W, I, E, S, Z, please provide all the English 4 letter words that are possible. Each letter can only be used once per word." I haven't found a model yet that answers correctly.
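    The letters puzzle above is easy to brute-force in code, which makes it a good ground-truth check against model answers. A sketch using a tiny illustrative word list (NOT a full dictionary, so the output is only correct relative to this sample):

    ```python
    from itertools import permutations

    # Sample word list only; a real check would load a full dictionary file.
    WORDS = {"wise", "rise", "size", "wire", "sire", "weir"}

    def four_letter_words(letters, dictionary):
        """All 4-letter dictionary words using each given letter at most once."""
        letters = [c.lower() for c in letters]
        return sorted({"".join(p) for p in permutations(letters, 4)} & dictionary)

    print(four_letter_words("RWIESZ", WORDS))
    # -> ['rise', 'sire', 'size', 'weir', 'wire', 'wise']
    ```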

  • @talonfirst
    @talonfirst 4 days ago

    This seems like a nitpick, but wouldn't the answer to the Killers question be FOUR? Just because one of the original three becomes a corpse, he's still a killer. Or is it one of those existential metrics like "A person should not be defined by their profession" or "How did he lose his job? He died"?

  • @positivevibe142
    @positivevibe142 6 days ago

    Guys, any good recommendation for a good and inexpensive laptop to run / play around with Large Language Models (LLMs)? Around $1000 maybe!
    Currently I have an MSI G65 Thin with 40GB RAM but 6GB VRAM, and it can hardly run the 72B models! So slow, and it overheats! 🤨

  • @最新AI应用
    @最新AI应用 5 days ago

    Impressive! But can it beat GPT-4 in a karaoke contest? I'd pay to see that showdown!

  • @christopherroge5621
    @christopherroge5621 6 days ago

    Basically you're running the same prompt through 4 models? Expensive.

  • @merelogics
    @merelogics 6 days ago

    Increasing the token limit when executing the coding prompt would probably output better results. 🤔

  • @robboerman9378
    @robboerman9378 6 days ago

    If you take away the numbers from the “word count”, is it still incorrect? Just wondering if wordcount counted the numbers as words where the MoA did not 🤷‍♂️
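    One way to test that hypothesis is to count the tokens both ways; a tiny sketch (the splitting rule is a simplification, real word counters differ on hyphens and punctuation):

    ```python
    def word_counts(text):
        """Count whitespace tokens two ways: with and without pure numbers."""
        tokens = text.split()
        words_only = [t for t in tokens if not t.strip(".,").isdigit()]
        return len(tokens), len(words_only)

    total, no_numbers = word_counts("The recipe needs 2 eggs and 100 g of flour")
    print(total, no_numbers)  # 10 8
    ```

    If the MoA count matches `no_numbers` while your word counter reports `total`, that would explain the disagreement.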

  • @arnaudjean1159
    @arnaudjean1159 6 days ago

    How much time till they fix the code 😂?? And after?? I bet it will boost the improvement process again

  • @MrMoonsilver
    @MrMoonsilver 6 days ago

    I want to see the code models at work! =)

  • @zippytechnologies
    @zippytechnologies 6 days ago

    Yep and yep

  • @gustavstressemann7817
    @gustavstressemann7817 6 days ago

    You really have to try out different coding models with this approach. I'm sure it's really cool

  • @Sparky_Chipmunk
    @Sparky_Chipmunk 6 days ago

    What I'd like to see is all of AI being on-device instead of in datacenters.

  • @chipcode5538
    @chipcode5538 6 days ago +1

    You're so friendly: yesterday it gave me the correct answer, but on the exam it did not. Let's call this a pass. As for the programming, it can make some programs that were in the training set. I use Copilot every day, and it works in just a minority of cases. Sometimes it produces excellent output; at other times it is complete garbage. At this point AI is not capable of doing real-world programming tasks without human assistance. I think with the examples I have seen of AI programming, a student would be able to get a working program with one internet search. AI is still impressive, but don't get overexcited.

  • @paul1979uk2000
    @paul1979uk2000 5 days ago

    I think this would be a lot more interesting with much smaller models, especially if you can run 2 or even 3 of them on your GPU, or they run fast enough on the CPU.
    These bigger models, and having a few working together, are not practical in most cases, especially if you want to run them locally: they will be too big and slow. So I really wonder how well small models do, anywhere from 2B to 13B, where you might be able to have 2 or 3 running at the same time with performance that shouldn't be too bad. If the results are much better than any of the individual models, it would be worth looking into.

  • @ScottWinterringer
    @ScottWinterringer 6 days ago +3

    post the model.

    • @wurstelei1356
      @wurstelei1356 6 days ago

      Link to the MoA github is in the video description.

  • @NoHandleToSpeakOf
    @NoHandleToSpeakOf 6 days ago

    Isn't 0.7 temp too high for consistency?

  • @paelnever
    @paelnever 6 days ago +1

    Many open source coding tools like opendevin already execute the code and review it to fix issues.

  • @WiseWeeabo
    @WiseWeeabo 6 days ago

    Personally I'm really impressed by the INSIGHTS of Claude 3 Sonnet.
    It's not as polished as GPT-4, so it's not as good at writing code, but when I use both models, GPT-4o and Claude 3, in combination it produces some truly insightful results.

  • @johnbollenbacher6715
    @johnbollenbacher6715 6 days ago

    Here is a simple question that ChatGPT always gets wrong. “How many p’s are there in the word pepper”.
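    For comparison, the check is one line of code, which is why tool-using models handle letter-counting questions easily while pure next-token prediction often stumbles on them:

    ```python
    # Count occurrences of a letter directly instead of asking the model.
    word = "pepper"
    print(word.count("p"))  # -> 3
    ```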

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 6 days ago +1

    Thanks!