It's not a 3.6B model. It's just version 3.6. It's still 8B parameters.
I think he updated the title, no longer says 3.6B
Corrected :)
I’m preferring Hermes Theta for practically everything
Hermes is an incredible model, cannot deny that!
I've tested this model quite extensively today and I'm pleasantly surprised by how good the outputs are compared to Llama 3. Compared to Llama 3, 3.6's vocabulary seems more verbose and less repetitive, though at some cost to creativity and flexibility. That can be overcome with clever prompting to get better roleplay/simulation, ultimately offering better creativity and flexibility than Llama 3, again with a bit of effort.
It seems smarter and less biased than Llama 3. One of the ways I like to test this is by first asking models to write out the 2nd Amendment of the US Constitution, and if it's correct, including punctuation, I ask it to analyze the sentence structure, ignoring political opinions: through that analysis of the sentence structure, what does the 2nd Amendment mean? From this I'm looking for two things. First, whether it recognizes the "militia" and "the right of the people" to be independent of each other; seriously, people with master's degrees can get this wrong. Second, I want it to acknowledge that the right to bear arms could be either a collective right of the people or an individual right, since both points have historical merit. Then I judge it on the overall quality of its response. Both Llama 3 and OpenChat 3.6 passed with college-level answers, OpenChat 3.6 being slightly better.
OpenChat 3.6 might become my new go-to model.
Why censor "perf" (performance, I assume) in the thumbnail?
Formatting looks nice ;)
Haha funny
Are there any newsletters specific to different parts of AI developments?
I would love a newsletter that only deals with TTS, Extraction, Text Generation, and Image Generation.
I've been considering adding a weekly newsletter to the AI Flux channel! Potentially even a paywalled version with promotions and tutorials.
You were so close to clicking the like button on that tweet! Share the love!
I made sure to click it this morning!
I'm testing the 8-bit GGUF with my usual questions, and it does indeed seem a bit better than Llama 3, but it may just be an illusion due to wishful thinking or something. For now I can say that it feels different.
I agree. I'm always skeptical about all those finetunes of the original model. From my tests they are usually worse, especially in reasoning. The only time it did better was with this question: "I own 50 books. I read five of my books, how many books do I own then?" You can try it yourself :)
You can use tools like promptfoo to automate testing and comparison of the models against these and other questions/conversations.
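If you want to script that comparison yourself rather than use a UI, here's a minimal sketch of the same idea in plain Python (not promptfoo itself), assuming the models are served behind an OpenAI-compatible local endpoint (llama.cpp's server and text-generation-webui both expose one); the endpoint URL, model names, and questions below are just placeholders:

```python
# Rough sketch: loop a fixed set of test questions over several local models and
# print the answers side by side. Assumes an OpenAI-compatible chat endpoint;
# the URL and model names below are placeholders for your own setup.
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"   # hypothetical local server
MODELS = ["openchat-3.6-8b", "llama-3-8b-instruct"]       # placeholder model names
QUESTIONS = [
    "I own 50 books. I read five of my books, how many books do I own then?",
    "A marble is placed inside a cup. The cup is turned upside down on a table, "
    "then the cup is picked up and put in a microwave. Where is the marble?",
]

for model in MODELS:
    for question in QUESTIONS:
        resp = requests.post(ENDPOINT, json={
            "model": model,
            "messages": [{"role": "user", "content": question}],
            "temperature": 0,   # deterministic sampling keeps runs comparable
        })
        answer = resp.json()["choices"][0]["message"]["content"]
        print(f"[{model}] {question}\n  -> {answer}\n")
```

promptfoo adds the nice parts on top of a loop like this (assertions, diffs, a web viewer), but even this is enough to spot regressions between finetunes.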
It feels a bit more witty, although the fact it's slightly more deterministic isn't surprising given the methods that OpenChat uses.
'L3-8B-Stheno-v3.1' has been knocking it out of the park on some of the things I've been doing.
Seems to follow the system prompt religiously.
Can I ask what kinds of dialogue it performs really well with?
@@MisterB123 Chat/assistant and roleplay, but I also had good luck with OpenAI-style markdown responses. I've had better luck with very simple system prompts that are to the point, as opposed to verbose ones. I've also seen some chain-of-thought reasoning without any chain of thought in the prompt, in addition to lists and markdown/*actions* without prompting for them.
Someone should tell him to review!
Thanks for the ping! I will test this model soon. What did you most like about this specific model in comparison to vanilla llama3?
@@aifluxchannel Honestly, after using so many it's hard to compare in any kind of objective way. Maybe it feels wordier and less terse? It feels like I do fewer cycles where I get a bad response and adjust the system prompt, like it extrapolates a bit more from a shorter system prompt? Very interested to see a more objective review 👍 Could be all in my head. 🤣
Thanks for this video review.
Glad it was helpful!
After testing it, it's 2-3% better than Llama 8B, which is really low and not great at all.
It doesn't sound like a lot, but previous finetunes only managed to eke out 1.5% performance improvements at best.
I tested it; it's good with basic chemical synthesis and weight loss/nutrition questions, but it still fails the famous marble question: "A marble is placed inside a cup. The cup is then turned upside down and put on a table, then the cup is picked up and put in a microwave. Where is the marble?"
Answer: The marble is inside the cup, which is now in the microwave.
I tried several variations but the output was similar. However, if you tell it the marble is not attached to the cup, it does answer the question correctly. For contrast, GPT-4o:
Answer: The marble would be on the table. When the cup is turned upside down and placed on the table, the marble would have fallen out and remained on the table. Then, when the cup is picked up and put in the microwave, the marble stays on the table.
From my testing I do like this better than stock, so it's on the level of GPT-3.5. Thanks for the link for testing; I'll try some other things later.
Interesting! I definitely want to add some basic chemistry questions and this marble test! I'd actually never heard of this multi-step test, but I like it because it's similar to some programming questions I use that reference a geometric plane / point clouds.
As a sanity check I've gotten used to asking a variant of: "I'm in front of a door where I can read "PUSH" but it appears mirrored, what should I do to exit?", and this model's replies are nonsensical every time 🙁 while Llama 8B tells you to pull most of the time, regardless of how the question is formulated.
Therefore, not so impressed right away.
Mistral Instruct 7B v3 with Q8_0 gets this right: "To exit through the mirrored door that says "PUSH," you should pull the door instead. Since the word is mirrored, its opposite action is required."
EDIT: it does seem that the reasoning doesn't make sense, because if I ask it why, it says that because it is mirrored, the opposite action must be taken. But when asking about a mirrored STOP sign, it says I can go forward since it's mirrored, so it's definitely not perfect either.
@@WatchNoah oh yeah it's getting the concept of reversing the instruction.
I tried with Llama3 70B and sometimes it says that the meaning of STOP doesn't change, sometimes it says to go.
GPT4o, PUSH then STOP:
"If the message "STOP" appears mirrored, it likely means you are seeing the message intended for people on the other side of the door, indicating that the door opens towards you. To exit, you would need to push the door."
These models are still so dumb it's shocking 😅
@@supercurioTube I just tried GPT4o and it gave me nonsense too xD "Remember, the mirrored sign is likely designed to instruct people from the other side. So, pushing the door should generally work."
I like this test! Will be adding it to my official list.
I've found GPT4o to have wildly varying performance. Sometimes it's great, other times it's impossible to keep it focused.
Those benchmarks are total bull-sh1t.
I do wonder, given that MMLU was actually lower. Which benchmarks do you trust the most?
@@aifluxchannel the problem is that I'm pretty sure that the LLMs are already aligned to those benchmarks. Basically, they are training those models to have a higher score, almost putting them into the training cycle, if not directly. 😅
It's like training a model on both the training and evaluation datasets together...
Any idea why this seems so quick? I wonder what GPUs they're running the free inference endpoint with?
It's just a small model.
llama.cpp magic
If using a frontend like text-generation-webui to run this (or any other GGUF models) locally:
1. Choose a GGUF model size that will fit in VRAM (this will be faster as the GPU will not be reading from RAM to load the model weights) -- For Llama 3 and derivatives, the F16 model will need a 24GB card, the Q8_0 a 10GB card, Q5/Q6 an 8GB card, and the Q4 a 6GB card. NOTE: You can still use the larger models with a 50/50 split or more between VRAM and RAM for lower performance.
2. Set n-gpu-layers to 128 to load the entire model in VRAM. Use 64 for 50/50, etc. Ideally, this should be as high as possible for better performance.
3. Set n_ctx to a reasonable value. NOTE: You may need to reduce this if your card doesn't have enough VRAM. Alternatively, you can reduce n-gpu-layers a bit, sacrificing some performance for an increased context window size.
4. Select the tensorcores option if you have an RTX card as this will improve the model performance/speed.
5. Experiment with the tensor_split option if you have multiple GPUs.
A 7B/8B model on a 4090 24GB is very fast with this configuration with a 4096 context window. For a 13B model, you can run it on a 24GB card at Q6 with a 2048 context window size. It's possible to run the larger models on these cards but with a lower performance by reducing the n-gpu-layers value.
The other thing to look for when choosing a card is the number of CUDA cores (which do the work of running the models) and tensor cores (which are used when the tensorcores option is selected). The more of these a card has, the more of the neural network calculations the model can do per second. As such, the RTX 4090 is currently the best consumer-grade NVIDIA card; I'm not sure about AMD or Intel cards, nor any of the TPUs/NPUs (Tensor/Neural Processing Units) such as the AI accelerator cards.
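If you'd rather skip the web UI, the same knobs map fairly directly onto llama-cpp-python; here's a minimal sketch under the assumption that you've already downloaded a GGUF file (the path below is a placeholder):

```python
# Minimal llama-cpp-python sketch mirroring the settings described above.
# The model path is a placeholder; point it at your own GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/openchat-3.6-8b.Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,    # -1 (or a high value like 128) offloads all layers to VRAM
    n_ctx=4096,         # context window; lower it if you run out of VRAM
    # tensor_split=[0.5, 0.5],  # uncomment to split weights across two GPUs
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "I own 50 books. I read five of them. How many do I own?"}],
    temperature=0,
)
print(out["choices"][0]["message"]["content"])
```

The trade-off is the same as in the webui: fewer offloaded layers or a smaller n_ctx buys you room for a bigger quant, at the cost of speed.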
Big expensive GPUs haha
Context length still 8k?
Fine tuning can't increase context length. Context length is determined by the model.
@@linusbrendeldolphin There are extended Llama 3 variants with 256k context, and even versions with 1M context length.
Yep, simple finetuning doesn't change that.
Just use Groq's 70B Llama.