Best LLM for Parallel Function Calling: 14 LLMs, 420 Prompts, 1 Winner Benchmark

  • Published Nov 28, 2024

COMMENTS • 91

  • @sd5853
    @sd5853 10 days ago +12

    I don't usually comment, but I want the YouTube algorithm to know I want more stuff like this.

  • @techfren
    @techfren 11 days ago +44

    Would love to see Qwen 2.5 Coder in these videos.

    • @kora5
      @kora5 11 days ago +5

      Agree. Qwen2.5-Coder 32B should work well too.

    • @Techonsapevole
      @Techonsapevole 11 days ago +5

      Me too. I'm more interested in local LLMs.

    • @andyjm2k
      @andyjm2k 10 days ago

      I've been using Qwen2.5 Coder 7B for tool calls with my assistant, and it works great.

    • @lancerben4551
      @lancerben4551 10 days ago +3

      Agreed, it looks promising for a local LLM.

    • @lokeshart3340
      @lokeshart3340 10 days ago +1

      Suiii, you're here too!

  • @MuhammadFaisal_Iqbal
    @MuhammadFaisal_Iqbal 11 days ago +12

    I am SOOO excited for the AI coding course!!! 🎉🤖

  • @callumarul6322
    @callumarul6322 8 days ago +2

    You're a legend; can't wait for the course!

  • @TheAIBlueprint
    @TheAIBlueprint 4 days ago

    Whoa... this is so insanely tense. Really good research and testing; I'd love to watch a tutorial on how you built this.
    Also, the hands moving in the background and the lack of background music had me glued and psychologically stressing out over who was going to win. Awesome work on the video.

  • @solyarisoftware
    @solyarisoftware 11 days ago +11

    Hi Dan,
    Thanks for the interesting benchmark.
    As you mentioned during the video, it would be interesting to see the same benchmarks comparing small-size LLMs on Ollama!
    Giorgio

    • @loudsquad2324
      @loudsquad2324 11 days ago

      Yes please, I'd be excited to see open-source models and SLMs!

    • @senecalouck2335
      @senecalouck2335 11 days ago +1

      Came here to say the same. LM Studio now supports function calling in its latest beta as well.

    • @indydevdan
      @indydevdan  7 days ago +1

      YW! You got it. I'll cover local models in future videos.

  • @HassanAllaham
    @HassanAllaham 10 days ago +2

    Since one of the main targets is a PERSONAL assistant, and since we are talking about agents and function calling, the main LLMs to target should be the ones that can run on the edge, not behind APIs; no one likes the idea of sharing personal data... Anyway, thanks for the good content 🌹

  • @vincentjean6756
    @vincentjean6756 11 days ago +5

    The failure of the new Sonnet is very surprising. I always use Flash now; it's super fast and super cheap, with an epic context size. Good job, Dan! 🎉

    • @indydevdan
      @indydevdan  7 days ago

      ikr I was shocked. Flash is so underrated.

  • @BillBaran
    @BillBaran 8 days ago

    Thank you! THIS is the benchmark that really matters!

  • @MacS7n
    @MacS7n 11 days ago

    I feel like I subscribed to the wrong channel. You really know what you're talking about, with great understanding. I wish I were as smart as you. I love your channel, but I'm not a software engineer, so sometimes I don't really understand; by the end of the video, though, I'm definitely much smarter. I'm part of the new wave of prompt engineers coding with prompts.

  • @brianmorin5547
    @brianmorin5547 11 days ago

    You had me in the first 60 seconds. Precisely. At the moment, all my long chains take structured output from a model and pass it to a function acting as a traffic controller to determine which agent to call next.

  • @IslandDave007
    @IslandDave007 10 days ago +1

    I'd love to see the new Mistral model included too; big updates today.

  • @puneet1977
    @puneet1977 10 days ago

    What an interesting way to benchmark. Thank you for doing this. It matches my personal experience with all of these LLMs when it comes to tool calling, although the failure rate usually goes up when you have a complicated or longer list of parameters.

  • @audioreworkvisions
    @audioreworkvisions 11 days ago +1

    Thank you, IndyDevDan! You build such great things, and almost every time they're of great value to me. Thanks, and we're gonna rock...

  • @ScottLahteine
    @ScottLahteine 1 day ago

    Very useful information, thanks! As we're just beginning to get into agents and tool calling, it will be very important to know which models are trustworthy, so we can tell when to blame our code or the model. It would definitely be helpful to see a follow-up that tests all the most highly rated models we can run in Ollama. Two or three new models dropped just this week, including a new Qwen model called QwQ.

  • @DemetriusZhomir
    @DemetriusZhomir 10 days ago +1

    This benchmark doesn't feel perfect, but it gives some surprising results. I didn't expect Gemini to perform this well!
    And oh boy, I'm so curious about the results that open-source models would bring!

  • @k22marie
    @k22marie 10 days ago +1

    Content so good I watch it at 1x speed.

  • @captaincode6241
    @captaincode6241 10 days ago

    Thanks for doing this benchmarking. You saved me the tens of dollars I was spending on Sonnet. NOW, what I'd love to see is different combinations of that 15-step process. ;)

  • @dustineagar1999
    @dustineagar1999 10 days ago

    Awesome video. I'm diving into building around tool calls and loving your channel and the resources you've been putting out there.
    I noticed in benchy/server/modules/tools.py that the tool descriptions for gemeni_tools_list are somewhat more detailed and expressive than those for openai_tools_list and anthropic_tools_list.
    I'm new to the tools concept and might be way off here, but my understanding is that those descriptions act as a kind of semantic surface for queries to "connect" with and trigger the tool call, and I wonder whether the differences in descriptions might have had a bearing on your test results.
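
A minimal sketch of the idea this comment raises, with two hypothetical descriptions for the same tool (illustrative only, not the actual definitions in benchy's tools.py). The richer description gives the model more semantic surface for a request to match against:

```python
# Terse description: little for a user query to "connect" with.
terse_tool = {
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": "Opens a browser.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}

# Detailed description: names the intents ("visit", "open", "browse") that
# should trigger the call, plus guidance on the parameter.
detailed_tool = {
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": (
            "Open the user's default web browser at a given URL. Use this "
            "whenever the user asks to visit, open, or browse a website."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Full URL, including scheme."}
            },
            "required": ["url"],
        },
    },
}
```

Running the same prompts against both variants would be one way to check whether description richness, rather than the model itself, skews a benchmark.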

  • @perschistence2651
    @perschistence2651 10 days ago +2

    Flash 1.5, in my experience, is great for small contexts, but as soon as you get into 10k+ token context lengths, its performance plummets.

    • @lancerben4551
      @lancerben4551 10 days ago +3

      My experience as well. It gets confused very easily. Even the big model is like that. Once Google figures out how to keep it from hallucinating and improves accuracy and coherence, it will be much better. For now, I'm stuck paying the big price for Claude and o1 models for any complex task.

  • @vladrm1
    @vladrm1 11 days ago +1

    Great benchmark and great video, thank you for sharing!
    What temperature setting did the LLMs run at in this benchmark? Are there any (other) parameters you found relevant for function calling? In my experience, I found that temperature 0 makes a big difference.
    Also, do you have a way to benchmark the quality of the tool input parameters? This is where I found that smaller models struggle and become impractical for function calling in some cases, in scenarios where the tool input params require some reasoning.

    • @indydevdan
      @indydevdan  7 days ago

      YW! Checking the quality of the 'prompt' input param is a great future direction. Intentionally left out for simplicity.
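
A minimal sketch of the temperature-0 setup vladrm1 describes, assuming the OpenAI Python SDK; the model choice and tool definition are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# A single illustrative tool, just to have something for the model to call.
tools = [{
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": "Open the browser at a URL.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

# temperature=0 keeps decoding close to greedy, which tends to make tool
# selection and argument formatting more repeatable across benchmark runs.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": "open hacker news"}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```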

  • @techfren
    @techfren 11 days ago +7

    I've never used function calling; I always use JSON output and then do my own 'function calling' or whatever with that JSON.
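
A minimal sketch of that pattern: ask for plain JSON instead of native tool calls, then route it yourself. This assumes the OpenAI Python SDK's JSON mode; the dispatch table and tool names are hypothetical:

```python
import json
import webbrowser

from openai import OpenAI

client = OpenAI()

# Hypothetical local "tools" the model can name in its JSON output.
DISPATCH = {
    "open_browser": lambda args: webbrowser.open(args["url"]),
    "print_message": lambda args: print(args["text"]),
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # force a JSON object reply
    messages=[{
        "role": "user",
        "content": (
            'Reply with JSON like {"tool": "open_browser", "args": {"url": "..."}}. '
            "Available tools: open_browser(url), print_message(text). "
            "Request: open hacker news"
        ),
    }],
)

# Parse the JSON and do our own "function calling" with it.
payload = json.loads(response.choices[0].message.content)
DISPATCH[payload["tool"]](payload["args"])
```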

  • @jameshizon4861
    @jameshizon4861 4 days ago

    Great LLM analysis. I'm looking forward to applying Gemini soon for building AI applications.

  • @stephenterry6372
    @stephenterry6372 10 days ago

    Great stuff. Using claude-sonnet-3-5 with Cline has been problematic recently, and I'm wondering if Anthropic varies the model when it's busy. It would be good to see results at different times of day.

  • @extremelylucky999
    @extremelylucky999 10 days ago +1

    The ironic thing is, the thumbnail has a typo. Talk about accuracy! 😂

  • @TheAIBlueprint
    @TheAIBlueprint 4 days ago

    I also wonder if you would consider using DSPy for building the system of prompts. Each model reacts differently to the way prompts are written, so it's hard to get accurate benchmarks with one prompt that isn't optimized for each LLM... would that be a possibility for future videos?
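
A minimal sketch of what that could look like, assuming DSPy's current Python API; the signature, tool names, metric, and training set are hypothetical:

```python
import dspy

# Configure one LM per benchmark run; swap the model string per provider.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class PickTool(dspy.Signature):
    """Choose which tool to call for a user request."""
    request: str = dspy.InputField()
    tool_name: str = dspy.OutputField(desc="one of: open_browser, print_message")

program = dspy.Predict(PickTool)
print(program(request="open hacker news").tool_name)

# A prompt optimizer (e.g. dspy.MIPROv2) could then tune the prompt
# separately for each LLM before running the benchmark:
# optimized = dspy.MIPROv2(metric=exact_match).compile(program, trainset=examples)
```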

  • @WenRolland
    @WenRolland 10 days ago

    This is a great tool. I'd love to see how Llama models do, and also how smaller models like 1B, 3B, and 8B would do on local systems, which is a likely scenario for privacy purposes.

  • @johannes-johannsen
    @johannes-johannsen 10 days ago

    Great stuff. It would be interesting to see what sorts of things break the models, and how much more complex tool calls impact the results.

  • @cashvo
    @cashvo 8 days ago

    Cool benchmarks, but I was hoping to learn how to do tool calling in my own agentic code. Do you have a video on how to do that?

  • @saabirmohamed636
    @saabirmohamed636 7 days ago

    This is so good.
    Now I can check first ... check performance and price, then commit. I'm struggling with tool calls; I feel like I'm really having to talk my cheap models into it.

  • @vermitsu
    @vermitsu 11 days ago

    Perfect Accurac-t-y. Nice play on words.

  • @bukitsorrento
    @bukitsorrento 10 days ago

    Oh, and for the AI course, I'd suggest purchasing-power-parity pricing; Gumroad has a feature for this.

  • @joshuafadiji8253
    @joshuafadiji8253 10 days ago

    Great work 👍🏽, love it.
    Could you also add xAI's Grok? It's OpenAI-API-compatible and currently in free beta testing.

  • @caseystar_
    @caseystar_ 10 days ago

    What are a couple of use cases for this? Would it be used in a multi-agent system? How do you use it most effectively?

  • @MuhammadFaisal_Iqbal
    @MuhammadFaisal_Iqbal 10 days ago

    Sir, please make a video on the software engineer roadmap, providing guidance on how to become a proficient engineer in the age of AI.

  • @phanquochung3924
    @phanquochung3924 8 days ago

    Awesome work!

  • @mrpocock
    @mrpocock 11 days ago

    Really nice benchmark. Can you do the same with models that can run locally on a single 12 GB card through Ollama?

  • @NLPprompter
    @NLPprompter 8 days ago

    Dan, can you share your audio recording setup? Your audio quality is so cool; I like it so much that I want all my presentation audio to sound like yours. Please?

  • @toxy805
    @toxy805 11 days ago

    Excited for the AI coding course.

  • @toxy805
    @toxy805 11 days ago

    Hey, can you make an introductory video, from the basics, about the North Star goal of this channel: the benchmarking and everything else you're doing?

  • @jaysonp9426
    @jaysonp9426 11 days ago +1

    GPT-4o mini is my favorite model right now. It's mostly perfect for this kind of stuff and basically free. I'll never trust a Gemini model lol

    • @orthodox_gentleman
      @orthodox_gentleman 10 days ago

      Why wouldn’t you trust a Gemini model?

    • @indydevdan
      @indydevdan  7 days ago +1

      4o-mini is insane - solid choice

    • @jaysonp9426
      @jaysonp9426 7 days ago

      @orthodox_gentleman More of a joke. His tests showed Flash is really good, and Gemini 1.5 is at the top of LMSYS right now. Until this, though, their models have been trash, and they keep acting like they're amazing (looking at you, Gemma).

  • @arekkusub6877
    @arekkusub6877 11 days ago

    How were those tools/functions defined? Model-agnostic, in some sort of ad hoc JSON format?

    • @indydevdan
      @indydevdan  7 days ago

      One set of functions and roughly one JSON schema per model provider. See server/modules/tools.py. LID
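
A minimal sketch of what "one JSON schema per provider" looks like in practice, following the OpenAI and Anthropic tool formats; the tool itself is hypothetical, not the one in tools.py:

```python
# One logical function, declared once per provider.

# OpenAI style: passed in the `tools` parameter of chat.completions.create,
# with the JSON Schema under "parameters".
openai_tool = {
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": "Open the browser at a URL.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}

# Anthropic style: passed in the `tools` parameter of messages.create;
# same JSON Schema body, but under "input_schema" and without the wrapper.
anthropic_tool = {
    "name": "open_browser",
    "description": "Open the browser at a URL.",
    "input_schema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}
```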

  • @MagagnaJayzxui
    @MagagnaJayzxui 10 days ago +1

    Qwen models and Mistral, please.

  • @bukitsorrento
    @bukitsorrento 10 days ago

    A multimodal benchmark: audio, images, and video (frames) as input.

  • @MekMoney79
    @MekMoney79 10 days ago

    Outstanding!

  • @toxy805
    @toxy805 11 days ago

    I'd also like a course on agentic engineering with tests and evals, so we can learn how to run them on our own.

  • @andyb4828
    @andyb4828 9 days ago

    Great! Thanks.

  • @tomaszzielinski4521
    @tomaszzielinski4521 11 days ago

    Order alone is not enough. In my experience, only GPT-4o was able to call tools with multiple arguments (such as filter options for a search query).
    At the same time, order doesn't always matter. For instance, if I want my Notion agent to fetch a number of pages or blocks, it doesn't matter in which order it gets them, as long as all of them end up in the context window.
    Also, I see no need to call that many tools in a row; you'd probably be better off running one or two, validating the output, then proceeding with a prompt chain. Calling 15 tools in a row is no good if you get hallucinations or an incorrect call halfway through.
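
A minimal sketch of that run-a-little, validate, then proceed pattern; the step list and validator are hypothetical, and openai_tool is the OpenAI-style schema from the sketch above:

```python
import json

from openai import OpenAI

client = OpenAI()

# Hypothetical plan: small batches instead of 15 tool calls in one shot.
steps = ["open hacker news", "open reddit", "open lobsters"]

def is_valid(call) -> bool:
    """Reject calls with unknown names or arguments that are not valid JSON."""
    try:
        json.loads(call.function.arguments)
    except json.JSONDecodeError:
        return False
    return call.function.name == "open_browser"

for step in steps:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": step}],
        tools=[openai_tool],
    )
    calls = response.choices[0].message.tool_calls or []
    if not calls or not all(is_valid(c) for c in calls):
        break  # stop the chain instead of compounding a bad call
    # ...execute the validated calls, then continue with the next step...
```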

  • @samizdat_eth
    @samizdat_eth 10 days ago

    Perfect accuracty?

  • @davidcampos9768
    @davidcampos9768 10 days ago

    Please test local LLMs. Obviously these could be cheaper (free) and faster when hosted on a beefy local machine.

  • @lokeshart3340
    @lokeshart3340 10 days ago

    Where is the source code for this, please?

  • @brennan123
    @brennan123 10 days ago

    @IndyDevDan, please add Groq and some of the Llama models to your tests.

    • @indydevdan
      @indydevdan  7 days ago

      Almost added Groq in this video; will add it to the next benchy vid.

  • @mikew2883
    @mikew2883 11 days ago

    Very very cool! 👏

  • @drowningpenguin1588
    @drowningpenguin1588 10 days ago

    Given how poorly Haiku 3 and 4o were both doing, it seems valuable to include haiku-3-json for comparison. It's still not as inexpensive as Flash or 4o-mini, so maybe not cost-effective, but Haiku 3 was used for a lot of aider tooling, so I'm surprised it performs so poorly.

  • @TryingThink
    @TryingThink 10 days ago

    I wonder about Gemini Flash 1.5-8B.

  • @MaJetiGizzle
    @MaJetiGizzle 11 days ago

    Exporting the results would be useful for reporting purposes as well.

  • @AnansiTrading
    @AnansiTrading 10 days ago

    Bravo! 🎉

  • @NooSpheere
    @NooSpheere 11 days ago

    Thanks a lot.

  • @stevensexton5801
    @stevensexton5801 10 days ago

    Hmmm, looks like you need a benchmarking agent.

  • @lancerben4551
    @lancerben4551 11 days ago

    My experience with the Google models is that they hallucinate a lot, are very mistake-prone in their answers, are very forgetful, and aren't very good with instructions. My go-to models are Claude 3.5 and o1-mini for planning or more complex coding. But it's nice to see the Flash model being good at running tools; I will integrate it into my app for certain simpler repetitive tasks. The rate of hallucination is a big problem for me, though, since good reasoning will be essential for more complex tool calls. Another worry of mine is their service agreement; it is extremely strict.

    • @indydevdan
      @indydevdan  7 days ago +1

      "...I will integrate it in my app for certain simpler repetitive tasks." - I think you're spot on here with how to best use flash / 4o-mini like models. Simple repetitive tasks where you can create simple prompts.

  • @Aristocle
    @Aristocle 9 days ago

    Maybe Sonnet uses XML.

  • @SC-ck8pb
    @SC-ck8pb 10 days ago

    Hey dude, don't disable transcripts on your video. I really can't spend 23 minutes right now figuring out which model performs better, so I tried to summarize the transcript of your video, but you have transcripts disabled.

  • @asi_karel
    @asi_karel 11 days ago

    best

  • @bolte5987
    @bolte5987 4 hours ago

    Hate to be that guy, but "perfect accuracty"?
    Hint: Look at your thumbnail

  • @techfren
    @techfren 11 days ago +2

    First! Let's go, new IDD vid!

    • @adriangpuiu
      @adriangpuiu 11 days ago

      that model is expensive bro :D

  • @tom-et-jerry
    @tom-et-jerry 11 days ago

    Blah blah blah, and we never see any results of your prompts... weird.