MASSIVE Leap for LLama3! OpenChat's 3.6 8B Model Obliterates LLama3 8B!

  • Published 28 Oct 2024

COMMENTS • 52

  • @elyakimlev
    @elyakimlev 5 months ago +13

    It's not a 3.6B model. It's just version 3.6. It's still 8B parameters.

    • @GerryPrompt
      @GerryPrompt 5 months ago

      I think he updated the title, no longer says 3.6B

    • @aifluxchannel
      @aifluxchannel 5 months ago +1

      Corrected :)

  • @p1nkfreud
    @p1nkfreud 5 months ago +3

    I prefer Hermes Theta for practically everything.

    • @aifluxchannel
      @aifluxchannel 5 months ago

      Hermes is an incredible model, cannot deny that!

  • @SiCSpiT1
    @SiCSpiT1 5 months ago +1

    I've tested this model quite extensively today and I'm pleasantly surprised at how good the outputs are compared to Llama 3. 3.6's vocabulary seems more verbose and less repetitive, though at some loss of creativity and flexibility. That can be overcome with clever prompting to get better roleplay/simulation, ending up with better creativity and flexibility than Llama 3, again with a bit of effort.
    It seems smarter and less biased than Llama 3. One of the ways I like to test this is by first asking the model to write out the 2nd Amendment of the US Constitution, and if it's correct, including punctuation, I ask it to analyze the sentence structure while ignoring political opinions, and then, from that analysis alone, explain what the 2nd Amendment means. I'm looking for two things: first, whether it recognizes the "militia" and "the right of the people" to be independent of each other (seriously, people with master's degrees get this wrong); second, whether it acknowledges that the right to bear arms could be either a collective right of the people or an individual right, since both readings have historical merit. Then I judge it on the overall quality of its response. Both Llama 3 and OpenChat 3.6 passed with college-level answers, OpenChat 3.6 being slightly better.
    OpenChat 3.6 might become my new go-to model.

  • @Happ1ness
    @Happ1ness 5 months ago +7

    Why censor "perf" (performance, I assume) in the thumbnail?

  • @ChristophBackhaus
    @ChristophBackhaus 5 months ago +1

    Are there any newsletters specific to different areas of AI development?
    I would love a newsletter that only deals with TTS, Extraction, Text Generation, and Image Generation.

    • @aifluxchannel
      @aifluxchannel 5 months ago

      I've been considering adding a weekly newsletter to the AI Flux channel! Potentially even a paywalled version with promotions and tutorials.

  • @mshonle
    @mshonle 5 months ago +1

    You were so close to clicking the like button on that tweet! Share the love!

    • @aifluxchannel
      @aifluxchannel 5 months ago +1

      I made sure to click it this morning!

  • @BHBalast
    @BHBalast 5 months ago +3

    I'm testing the 8-bit GGUF with my usual questions and it does seem a bit better than Llama 3, but that may just be an illusion from wishful thinking or something. For now I can say that it feels different.

    • @mlsterlous
      @mlsterlous 5 months ago +1

      I agree. I'm always skeptical about all those finetunes of the original model. From my tests they are usually worse, especially in reasoning. The only time this one did better was with this question: "I own 50 books. I read five of my books, how many books do I own then?" You can try it yourself :)

    • @msclrhd
      @msclrhd 5 months ago

      You can use tools like promptfoo to automate testing and comparison of the models against these and other questions/conversations.
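      If you'd rather not set up promptfoo, a quick hand-rolled loop works too. A rough sketch, assuming both models are served behind an OpenAI-compatible endpoint (llama.cpp server, Ollama, LM Studio, etc.); the URL and model names below are placeholders:

      ```python
      # Rough comparison harness; the endpoint URL and model names are placeholders.
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

      questions = [
          "I own 50 books. I read five of my books, how many books do I own then?",
          # add the rest of your usual test questions here
      ]

      for model in ["llama-3-8b-instruct", "openchat-3.6-8b"]:
          print(f"=== {model} ===")
          for q in questions:
              reply = client.chat.completions.create(
                  model=model,
                  messages=[{"role": "user", "content": q}],
                  temperature=0,  # deterministic answers make runs comparable
              )
              print(q, "->", reply.choices[0].message.content.strip(), "\n")
      ```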

    • @aifluxchannel
      @aifluxchannel 5 months ago

      It feels a bit more witty, although the fact it's slightly more deterministic isn't surprising given the methods that OpenChat uses.

  • @_SimpleSam
    @_SimpleSam 5 months ago +2

    'L3-8B-Stheno-v3.1' has been knocking it out of the park on some of the things I've been doing.
    Seems to follow the system prompt religiously.

    • @MisterB123
      @MisterB123 5 months ago

      Can I ask what kinds of dialogue it performs really well with?

    • @_SimpleSam
      @_SimpleSam 5 months ago

      @@MisterB123 Chat/assistant, roleplay, but I also had good luck with OpenAI-style markdown responses. I've had better luck with very simple system prompts that are to the point, as opposed to verbose ones. I've also seen some chain-of-thought reasoning without any chain-of-thought in the prompt, in addition to lists and markdown/*actions* without prompting for it.

    • @GerryPrompt
      @GerryPrompt 5 months ago

      Someone should tell him to review!

    • @aifluxchannel
      @aifluxchannel 5 months ago

      Thanks for the ping! I will test this model soon. What did you most like about this specific model in comparison to vanilla llama3?

    • @_SimpleSam
      @_SimpleSam 5 months ago

      @@aifluxchannel Honestly, after using so many models it's hard to compare in any kind of objective way. Maybe it feels wordier and less terse? It feels like I do fewer cycles where I get a bad response and adjust the system prompt, like it extrapolates a bit more from a shorter system prompt? Very interested to see a more objective review 👍 Could be all in my head. 🤣

  • @marcfruchtman9473
    @marcfruchtman9473 5 months ago +1

    Thanks for this video review.

  • @axelesch9271
    @axelesch9271 5 months ago +2

    After testing it, it's 2-3% better than Llama 3 8B, which is really low and not great at all.

    • @aifluxchannel
      @aifluxchannel 5 months ago +1

      It doesn't sound like a lot, but previous finetunes only managed to eke out 1.5% performance improvements at best.

  • @southcoastinventors6583
    @southcoastinventors6583 5 months ago +2

    I tested it; it's good with basic chemical synthesis and weight loss/nutrition questions. It still fails the famous marble question: "A marble is placed inside a cup. The cup is then turned upside down and put on a table, then the cup is picked up and put in a microwave. Where is the marble?"
    Answer: The marble is inside the cup, which is now in the microwave.
    I tried several variations but the output was similar, although if you tell it the marble is not attached to the cup it does answer correctly. For contrast, GPT-4o's answer:
    Answer: The marble would be on the table. When the cup is turned upside down and placed on the table, the marble would have fallen out and remained on the table. Then, when the cup is picked up and put in the microwave, the marble stays on the table.
    From testing I do like this better than stock, so it's on the level of GPT-3.5. Thanks for the link for testing. Will try some other things later.

    • @aifluxchannel
      @aifluxchannel 5 months ago

      Interesting! I definitely want to add some basic chemistry questions and this marble test! I'd actually never heard of this multi-step test, but I like it because it's similar to some programming questions I use that reference a geometric plane / point clouds.

  • @supercurioTube
    @supercurioTube 5 months ago +4

    As a sanity check I've gotten used to asking a variant of: "I'm in front of a door where I can read "PUSH" but it appears mirrored, what should I do to exit?", and this model's replies are nonsensical every time 🙁 while Llama 3 8B tells you to pull most of the time, regardless of how the question is formulated.
    Therefore, not so impressed right away.

    • @WatchNoah
      @WatchNoah 5 months ago +1

      mistral instruct 7b v3 with q8_0 gets this right: "To exit through the mirrored door that says "PUSH," you should pull the door instead. Since the word is mirrored, its opposite action is required."
      EDIT: the reasoning doesn't really make sense though, because if I ask it why, it says that because it is mirrored, the opposite action must be taken. But when asked about a mirrored STOP sign, it says I can go forward since it's mirrored, so it's definitely not perfect either.

    • @supercurioTube
      @supercurioTube 5 months ago +2

      @@WatchNoah oh yeah it's getting the concept of reversing the instruction.
      I tried with Llama3 70B and sometimes it says that the meaning of STOP doesn't change, sometimes it says to go.
      GPT-4o, asked about PUSH and then STOP:
      "If the message "STOP" appears mirrored, it likely means you are seeing the message intended for people on the other side of the door, indicating that the door opens towards you. To exit, you would need to push the door."
      These models are still so dumb it's shocking 😅

    • @WatchNoah
      @WatchNoah 5 months ago +2

      @@supercurioTube I just tried GPT-4o and it gave me nonsense too xD "Remember, the mirrored sign is likely designed to instruct people from the other side. So, pushing the door should generally work."

    • @aifluxchannel
      @aifluxchannel 5 months ago +1

      I like this test! Will be adding it to my official list.

    • @aifluxchannel
      @aifluxchannel 5 months ago +2

      I've found GPT4o to have wildly varying performance. Sometimes it's great, other times it's impossible to keep it focused.

  • @8eck
    @8eck 5 months ago +3

    Those benchmarks are total bull-sh1t.

    • @aifluxchannel
      @aifluxchannel 5 months ago

      I do wonder, given that MMLU was actually lower. Which benchmarks do you trust the most?

    • @8eck
      @8eck 5 months ago +1

      @@aifluxchannel The problem is that I'm pretty sure the LLMs are already aligned to those benchmarks. Basically, they are training those models to score higher, almost putting the benchmarks into the training cycle, if not directly. 😅
      It's like training a model on the training and evaluation datasets together...

  • @GerryPrompt
    @GerryPrompt 5 months ago +1

    Any idea why this seems so quick? I wonder what GPUs they're running the free inference endpoint with?

    • @BHBalast
      @BHBalast 5 months ago +2

      It's just a small model.

    • @WatchNoah
      @WatchNoah 5 months ago +1

      llama.cpp magic

    • @msclrhd
      @msclrhd 5 months ago

      If using a frontend like text-generation-webui to run this (or any other GGUF model) locally:
      1. Choose a GGUF model size that will fit in VRAM (this will be faster, as the GPU will not be reading the model weights from RAM) -- for Llama 3 and derivatives, the F16 model needs a 24GB card, Q8_0 a 10GB card, Q5/Q6 an 8GB card, and Q4 a 6GB card. NOTE: You can still use the larger models with a 50/50 (or more) split between VRAM and RAM, at lower performance.
      2. Set n-gpu-layers to 128 to load the entire model into VRAM. Use 64 for a 50/50 split, etc. Ideally this should be as high as possible for better performance.
      3. Set n_ctx to a reasonable value. NOTE: You may need to reduce this if your card doesn't have enough VRAM. Alternatively, you can reduce n-gpu-layers a bit, sacrificing some performance for a larger context window.
      4. Select the tensorcores option if you have an RTX card, as this will improve speed.
      5. Experiment with the tensor_split option if you have multiple GPUs.
      A 7B/8B model on a 24GB 4090 is very fast with this configuration and a 4096 context window. A 13B model can run on a 24GB card at Q6 with a 2048 context window. Larger models are possible on these cards at lower performance by reducing n-gpu-layers.
      The other things to look for when choosing a card are the number of CUDA cores (which do the work of running the models) and tensor cores (which are used when the tensorcores option is selected). The more of these there are, the more neural-network calculations the card can do per second. As such, the RTX 4090 is currently the best consumer-grade NVIDIA card; I'm not sure about AMD or Intel cards, or any of the TPUs/NPUs (Tensor/Neural Processing Units) such as dedicated AI accelerator cards.
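      For reference, here's roughly the same setup expressed through llama-cpp-python (which the llama.cpp loader in text-generation-webui uses under the hood). A minimal sketch; the GGUF filename and the test question are placeholders:

      ```python
      # Minimal sketch, assuming llama-cpp-python is installed and the GGUF file exists locally.
      from llama_cpp import Llama

      llm = Llama(
          model_path="openchat-3.6-8b.Q5_K_M.gguf",  # placeholder filename
          n_gpu_layers=-1,  # -1 offloads every layer to VRAM; lower it (e.g. 20) for a VRAM/RAM split
          n_ctx=4096,       # context window; reduce this if you run out of VRAM
          # tensor_split=[0.5, 0.5],  # uncomment to spread the weights across two GPUs
      )

      out = llm.create_chat_completion(
          messages=[{"role": "user", "content": "I own 50 books. I read five of my books, how many books do I own then?"}],
          max_tokens=128,
      )
      print(out["choices"][0]["message"]["content"])
      ```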

    • @aifluxchannel
      @aifluxchannel 5 months ago +1

      Big expensive GPUs haha

  • @henkhbit5748
    @henkhbit5748 5 months ago +4

    Context length still 8k?

    • @linusbrendel
      @linusbrendel 5 months ago +1

      Fine-tuning can't increase context length. Context length is determined by the base model.

    • @Rolandfart
      @Rolandfart 5 months ago +2

      @@linusbrendel Dolphin Llama 3 is 256k context. There are Llama 3 versions with 1M context length.

    • @aifluxchannel
      @aifluxchannel 5 months ago

      Yep, simple finetuning doesn't change that.

  • @Ginto_O
    @Ginto_O 4 months ago

    Just use the 70B Llama on Groq.