LLaMA 3 “Hyper Speed” is INSANE! (Best Version Yet)
- Published 20 Apr 2024
- What happens when you power LLaMA with the fastest inference speeds on the market? Let's test it and find out!
Try Llama 3 on TuneStudio - The ultimate playground for LLMs: bit.ly/llama-3
Referral Code - BERMAN (First month free)
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
groq.com
llama.meta.com/llama3/
about. news/2024/04/met...
meta.ai/
LLM Leaderboard - bit.ly/3qHV0X7 - Science & Technology
Reply Yes/No on this comment to vote on the next video:
How to build Agents with LLaMA 3 powered by Groq.
Yesss
yes
YESSSS
Yes, do it
F Yesssss
The model got the 2a-1=4y question correct just so you know
Yes, if that's a "hard SAT question" then I wish I had taken the SATs
American education is a joke! That's what we solved in 4th standard I guess..!
That’s a different answer from what was shown in the SAT website
The actual SAT question is : "if 2/(a-1) = 4/y , where y isn't 0 and a isn't 1, what is y in terms of a :
and then the answer is:
2/(a-1) = 4/y
2y = 4(a-1)
y = 2(a-1)
y = 2a-2
My guess is he just copied the question wrong
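Both versions can be checked numerically. A quick sketch (the test value a = 5 is arbitrary) confirming that y = 2a - 2 solves the actual SAT equation while y = (2a - 1)/4 solves the mis-copied one:

```python
from fractions import Fraction

a = Fraction(5)  # arbitrary test value, with a != 1 as the question requires

# Actual SAT question: 2/(a-1) = 4/y  =>  y = 2a - 2
y = 2 * a - 2
assert Fraction(2) / (a - 1) == Fraction(4) / y  # both sides equal 1/2

# Mis-copied version: 2a - 1 = 4y  =>  y = (2a - 1)/4, not 2a - 2
y_model = (2 * a - 1) / 4
assert 2 * a - 1 == 4 * y_model        # the model's answer satisfies it
assert 4 * (2 * a - 2) != 2 * a - 1    # "2a - 2" does not (only a = 7/6 works)
```

So both answers are right, just for two different questions.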
@@matthew_berman The model's answer is correct. If the SAT shows a different one, they're wrong. You can do the math yourself to check it
4:55 The answer was actually correct. I don't think you asked the right question because you just need to divide both sides of the equation by 4 to get the answer.
4:55
@@asqu Thanks, I've corrected the mistake
Apparently he wasn't using his brain but just copying & pasting then looking for some answer imprinted in his mind
Ai will remember this treacherous behavior in the future 😂
The model does better when you prompt it twice in the same conversation because it has the first answer in its context window. Without being directly told to do reflection, it seems that it reads the answer, notices its mistake, and corrects it subconsciously (if you could call it that).
Either that, or it just has to do with temperature. According to the Groq documentation, their platform does not implement memory like ChatGPT, and the default temperature on Groq is 1, which is medium and will give varying responses, so I believe it has to do with temperature.
Try again with deterministic results, temperature zero.
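For reference, a deterministic request against Groq's OpenAI-compatible chat completions endpoint would look roughly like this. The payload is only constructed, not sent; the model id and the availability of a `seed` field are assumptions, so treat it as a sketch rather than verified API usage:

```python
import json

# Hypothetical request body for an OpenAI-compatible chat completions
# endpoint such as Groq's. Field names follow the OpenAI convention;
# the model id is an assumption.
payload = {
    "model": "llama3-70b-8192",
    "messages": [
        {"role": "user", "content": "If 2a - 1 = 4y, what is y in terms of a?"}
    ],
    "temperature": 0,  # greedy decoding: same input -> same output
    "seed": 0,         # some backends also accept a seed for reproducibility
}
print(json.dumps(payload, indent=2))
```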
Thank you, I really appreciate your content since it is really setting me up for when I'll get the time to dive into LLMs.
Heck yeah, Matt - let's see a video on using these as Agents. THANK YOU! Keep up the amazing work!
PLEASE MAKE THAT VIDEO! :) This one was also great
Agents, agents, agents! 😄
4:28 You copied the SAT question wrong. This is the *actual* question that has an answer of y = 2a - 2: "If 2/(a − 1) = 4/y , and y ≠ 0 where a ≠ 1, what is y in terms of a?"
Indeed👍
I'm confused. Why is the right answer to the equation question "2a-2"?
If I understand it correctly and that's just an equation, the result should be what the LLM is answering, am I wrong?
I mean:
2a-1=4y
y=(2a-1)/4
y=a/2-1/4
You are correct
Yes, and thanks for sharing.
Randomness is normal. Unless the temperature is set to zero (which is almost never the case), you'll be getting stochastic outputs with an LLM. This is actually a feature, not a bug. By asking the same question 3 times, 5 times, 7 times etc. And then reflecting on it, you'll be getting much better answers than asking just once.
Exactly. I thought this was common knowledge at this point. I guess not.
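A minimal sketch of that sample-then-vote idea (often called self-consistency). The sampler here is a canned stub standing in for a real temperature > 0 API call, so the numbers are illustrative only:

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Ask the same question n times (at temperature > 0) and return the
    most common answer plus its agreement ratio -- simple majority vote."""
    answers = [sample_fn(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Stand-in for a real LLM call; a canned sequence keeps the example
# self-contained and repeatable.
canned = iter(["-18", "-22", "-18", "-18", "-22"])
answer, agreement = self_consistency(lambda p: next(canned), "find c", n=5)
print(answer, agreement)  # -18 0.6
```

With a real model you'd only trust the vote when the agreement ratio is high; a 3/5 split like this one is exactly the "not confident" case people are describing in this thread.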
I think the reason for the alternating right and wrong answers is that it assumes that you asked it again because you weren't happy with the previous answer. It picks the most likely answer based on that.
absolutely a context related issue
Your chat window is "context". That's why it's "learning". We need to see how they have the overflow setting configured, then you'll be able to know if it's a rolling or cut the middle sort of compression.
Love your channel!
Yes, hopefully exploring this 'self-reflection' behavior. It may be less comprehensive than "build me a website" type agents, but showing how to leverage groq's fast inference to make the agents "think before they respond" would be very useful...and provide some practical insights. (Also, estimating cost of some of these examples/tutorials would be a nice-to-know, since it's the first thing I'm asked when discussing LLM use cases). Thank you for your efforts ... great content as usual!
Great video. Nice speed.
4:49 Using '2a-2' implies a = 7/6, via substitution. However, it cannot be incorrect to say y = (2a-1)/4, because that follows directly from the equation; the alternative implication is that all of mathematics is inconsistent.
On the marble and cup prompt: if we consider that Llama 3 recognizes successive prompts as successive events, then it may have interpreted them as follows: (1) inverting the cup on the table, so the marble falls onto the table; the cup goes into the microwave and the marble stays on the table. (2) In a second response to the same prompt, when we turn the cup over, Llama may have interpreted it as "going back upright", so the marble, due to gravity, would be at the bottom of the cup. Then the cup goes into the microwave with the marble inside. And so on.
The hole-digging question was made not to be a maths question, but to see if the model can fathom real-world space restrictions when cramming 50 people into a small hole. The point of the question is to trick the model into saying 50 people can fit into the same hole and work at the same speed, which is not right.
I would personally only consider it a pass if it addresses the space requirements of a hole for that number of people. Think about it: if you said 5,000 people digging a 10-foot hole, it would not take 5 milliseconds. That's not how it works. That's what I would be looking for in that question.
Indeed. The first answer was actually wrong. The second one was better, though not perfect. Although that still means it gave one wrong answer.
Another factor to consider is possible exhaustion. One person working five hours straight is one thing. But if there are more people who can't work simultaneously but on a rotating basis...
The variance on T/s can be explained by using a shared environment. Try the same question repeatedly after clearing the prompt and I bet it ranges from 220 to 280. Also, yes, too lenient on the passes =) Maybe create a Partial Pass to indicate something that doesn't zero shot it? It would be cool to see the pass/fails in a spreadsheet across models, but right now I couldn't trust the "Pass" based on the ones you let pass.
For sure! I'm astonished by the improvements in Llama 3's performance on Groq. Can't wait to discover what revolutionary advancements lie ahead for this technology!
YES! This plus Crew AI!
As always, Matthew, love your videos. This time, though, I followed along running the same prompts on the **Llama 3 8B FP16 Instruct** model on my Mac Studio. I think you'll find this a bit interesting; if not you, then some of your viewers.
When following along, if both your run and mine failed or passed, I'm ignoring them, so you can assume that if I'm not bringing it up here, mine did as well or as badly as the 70B model on Groq, which is saying something! I almost wonder if Groq is running a lower quantization, which may or may not matter, but the 8B model on my Mac being nearly on par with the 70B model is strange to say the least.
The only questions that stick out to me are the Apple prompt, the Diggers prompt, and the complex Math Prompt (Answer is -18).
- The very first time I ran the Apple prompt it gave me the correct answer, and I re-ran it 10 times with only one of them providing me with an error of a single sentence, not ending in Apple.
- Pretty much the same thing with the Diggers prompt. I ran it many times over and got the same answer, except for once, when it came up with a solution saying that digging the hole would not take any less time, which would almost make sense, but the way it explained it was hard to follow and made it seem like 50 people were digging 50 different holes.
- The first time I ran the complex math prompt it got it wrong, close to the same answer you got the first time, but the second time I ran it I got the correct answer. It was bittersweet since I re-ran it another 10 times and could never get the same answer again.
I'm beginning to wonder if some of the prompts you're using are uniquely too hard or too easy for the Llama 3 models regardless of how many parameters they have.
EDIT: When running math problems, I started to change some inference parameters, which seems necessary to me, considering math problems can have a lot of repetitiveness. So I started reducing the temperature, disabling the repeat penalty, and adjusting Min-P and Top-P sampling. I am not getting the right answer, or at least I think I'm not, since I don't know how to complete the advanced math problems; for the complex math prompt where -18 is supposedly the answer, I continue to get -22. Whether or not that is the wrong answer is not my point, but that by reducing the temperature and removing the repetition penalty, it is at least becoming consistent, which seems like what our goal should be for math problems. Through constant testing and research, I THINK the function should be written with the "^" symbol, according to Wolfram, like this: f(x) = 2x^3 + 3x^2 + cx + 8
Groq is set to cache results. Any prompt + chat history gives you the same result for as long as the cache lives. So for your case, both the first and second answer is locked in place by the cache.
Also keep in mind that Groq's default setting is a temperature higher than 0. This means there will be variations in how it answers (assuming no cache). From this we can conclude that it's not really that confident in its answer, as even the small default temperature will trip it.
May I suggest you run these non creative prompts with temperature 0?
Thanks for a great video as always, Matthew! Would you consider running your questions 10 times (not on video) if the inference speed is reasonable of course, to check the percentage of how often it gets questions right/wrong ?
YES! I want to see that video! Please start from the very beginning of the process. Just found you and I would like to set up my first agented AI. (I have an OpenAI pro account, but I am willing to switch to whatever you recommend... looking for AI to help me learn Python, design a database and web app, and design a Kajabi course for indie musicians.) Thanks!
Yes, an autonomous video showing an example using groq and whatever agent model you choose would be awesome
The guys from Rabbit really need the Groq hardware running the LLM on their servers
One important factor to know is the parameter specification. Are they floating point or integer? How many bits: 16, 8, 4, 2?
If fast inference speeds are coming from heavy quantization it could affect the results. This would be fine for many people a lot of the time, but it should also always be disclosed.
Is Groq running full precision?
Re: how to decide which of multiple answers is correct, there's been a lot of research on this. Off the top of my head, there's "use the consensus choice, or failing consensus, choose the answer the LLM assigns the highest confidence score." I recall that approach being used in Google's Gemma paper.
Yess i would love to see that
Which LLM, that can be run on a home computer, would you recommend for helping refine prompts for Stable Diffusion -- text to image?
8:22 Is the marble in the cup, or is the marble on the table: the question of our time 🤣
and the answer is: "Yes!"
Would love to see the Crew ai with Groq idea, I would also love to see more content on using crew ai, agents to be used to train and update models. Great content as always, thank you.
Thank you for the content. Do you think you can point to procedures for running LLaMA 3 on Groq, please? I might have missed something, but why did you fail LLaMA 3 on the question about breaking into a car? I think it told you it cannot provide that info, which is what you want, no?
hi, how did you run the snake python script from Visual Studio? I tried but couldn't get the game screen to pop up. Any hints/help/pointers much appreciated.
I'm just curious: what is the difference in response quality between, for example, Q4 and Q8 models? Does lower quantization mean lower quality or a higher chance of errors?
It's interesting to see an uptick in the "Chain-of-thought" responses coming out of the latest models. Possibly some new fine tuning/agent implementations behind the scenes?
It’s possible you are getting different samples when you prompt twice in the same session/context due to a “repetition penalty” that affects token selection. The kinds of optimizations that groq performs (as you made in reference to your interview video) could also make the repetition penalty heuristic more advanced/nuanced. Cheers!
Did you modify the temperature setting? It defaults to 1 which can increase your variance
Giving the LLM the question twice, I suspect, works because it doesn't want to repeat itself. If you had access to things like the temperature and other params, you could likely get a better idea of why, but that would be my guess.
For the microwave marble problem, would it be helpful if you were explicit in stating that the cup has no lid? Is it possible it doesn't quite understand that the cup is open?
I can't help myself, but I think there are 4 killers in the room: 3 alive and one dead.
"There are 3 red painters in a room. A 4th red painter enters the room and paints one of the painters green."
How many painters are in the room?
vs
How many red painters are in the room?
vs
How many green painters are in the room?
From this perspective you can see there is another property of the killers being checked, whether they are living, that wasn't asked for, and it doesn't specify if a killer stops being a killer upon death.
Perhaps the AI understands about human mortality? Ominous perception.
That’s a valid answer also
For me it is "obvious" that there are only 3 killers. Why? Otherwise we would still count ALL killers that ever lived. When does someone stop counting as a killer? When they have been dead for a week? A year? A hundred years? A million years? Never?
@@henrik.norberg Killers are killers forever, whether dead or alive.
You are not going to say some genocidal historical figure is not a killer because he's dead.
You may use "was" because the person no longer is, but the killer part is unchanged.
Is llama3-70B on Groq running quantized (8-bit?) or F16? To understand if this is the baseline or less.
Thank you, Matthew. Please show us the video of Llama 3 on Groq
Absolutely, I'd like to see the Autogen and Crew ai video ❤
@matthew_berman A quantized version of Llama 3 is available on LM Studio. I'm hoping you get a chance to play with it soon. There was an interesting nuance to your marble question on the 8B Q8 model: "The cup is inverted, meaning the opening of the cup is facing upwards, allowing the marble to remain inside the cup." I wonder how many models assume 'upside down' means the cup opening is up, but just don't say it explicitly?
Yes I would like to see the video you proposed 🙂
Yes, Would love to see you doing this, still getting used to the CrewAI system
wow! amazing
I think when you prompt a second time it's reading the whole chat again, and treating it as context. So, when the context contains an error, there's a conflict which alerts it to respond differently
Here is another criterion for reviewing models: reliability, or consistency. Does the answer change if the prompt is repeated? I mean, if I don't know the answer and have to rely on the model (like with the math problems), how can I be sure the answer is correct? We need STABLE answers! Thank you for your testing!
I've been meaning to comment regarding these multiple different answers:
You need to run the same question 3 times to give a more accurate judgement. But clear it every time and make sure you don't have the same seed number.
What's going on: The inference injects random numbers to prevent it from repeating the same answer every time.
Regarding not clearing, and asking the same question twice: it uses the entire conversation to create the new answer, so it's not really asking the same question; it's ADDING the question to a conversation, and the whole conversation is used to trigger a new inference.
Just remember, there's a lot of randomness too.
Looking forward to the Agent video with Llama 3 🎉!
If you ask the same somewhat hard question 2 times, I think the LLM assumes the first answer was incorrect, so it tries to fix it, leading to an incorrect answer the 2nd time.
Yes, please Matt, I would like to see you put llama three into an agent framework. Thank you.
do you know what quant groq is using? I'd love it if you tested the unquant version :D
Thanks Matthew for the eval. Some thoughts, ideas and comments:
1. For an objective comparison, I always remove the history.
2. If I don't set temp to 0, I run every question multiple times to stochastically get more comparable results, and especially to measure the distribution and get a confidence score for my measured results.
3. Trying exactly the same prompt multiple times over an API like Groq? I doubt they use LLM caching or that temp is set to 0, but better to check twice whether they cache things.
That's so interesting. Even Llama 3 8B gets the "Apple" question right when prompting it twice.
Yes and on the first prompt it only got the 6th sentence wrong!
6. The kids ran through the orchard to pick some Apples.
Not only that question. It's crazy smart overall.
Let’s build the agents!!
In the case where the model gives wrong answers alternating with correct answers: if we give the model an additional prompt like "Please think carefully about your answer to the question," it would be interesting to see what happens to the answer, Mr. Berman.
Yes! Please make the video. Thank you
Could the multi-inference output options be serving you a random version of any one of its answers? That doesn't, however, explain how it's inconsistent when explaining the physics of the marble's movements. Very bizarre...
Would love to see a video on how to setup agents.
I believe that by default the temperature is 0, which means that with the same input you are always going to get the same output. If you ask the question twice, though, the input is different because it contains the original question; that's why the response is different.
If you increase the temperature a bit, the output should be different every time, and then you can use that to generate multiple answers via api, then ask another time to reflect on it, and then provide the best answer.
If you want I can create a quick script to test that out
perhaps the temperature settings are different in the online/groq version,
for math it's probably best to have very low temp, maybe even 0
The reason you get the correct answer after asking a 2nd and 3rd time is the same reason chain of thought, chain of whatever works. The subsequent inference requests are taking the 1st output and using it to reason, finding the mistake and correcting it. This is why the Agent paradigm is so promising. Better than zero-shot reasoning.
I think you are aware of this, though, because you mentioned getting a consensus of outputs. This is the same thing in a different manner.
I'm looking forward to the 8B being put to the test. It's absolutely insane how performant the 8B is for its size.
Check temperature setting. Temperature is adding randomness into the output.
I ran this on Ollama 70B and I get the same behavior. In my case, and not just for this problem but other logic problems too, it would give me the wrong answer. Then I tell it to check the answer, and it always gets it right the second time. This model would definitely benefit from self-reflection before answering.
How can we improve inference speed locally?
Would love to see the AutoGen test. I'm taking a go at it myself at the moment; it would be super helpful
I asked Llama 3 70B in LM Studio on my machine if it is multimodal and it said yes. Please, how can I use it in a multimodal way on my local machine, either with LM Studio or another way?
Out of curiosity, anyone know how much heat groq hardware outputs?
Can it make card games or is that still too advanced? I think card games would be the next step up as it could have a sense of UI, drawing the ascii representations of the cards etc.
I think the problem with the cup is that LLaMA "thinks" that every time you write "placed upside down on a table" you are actually turning the cup upside down, which is the opposite of what it was before.
So, as it were, every other time you put the cup "normally" and every other time upside down.
LLaMA takes into account the context, so if you delete the previous text, the position of the cup "resets".
5:20 The given function f(x)=2×3+3×2+cx+8 is equivalent to f(x)=8+9+cx+8=cx+25. Hence it is linear and can cross the x-axis only once.
Certainly you mean instead: f(x)=2x^3+3x^2+cx+8. This is a cubic function and hence can cross the x-axis 3 times.
When you solve f(-4)=0, you get c=-18.
But when you solve f(12)=0, you get c=-324-8/12. So obviously 12 can't be a root of the function.
The other roots are 2 and 1/2.
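The claims above are easy to verify with exact arithmetic. A quick sketch:

```python
from fractions import Fraction

def f(x, c):
    return 2 * x**3 + 3 * x**2 + c * x + 8

# Solving f(-4) = 0 for c:  -128 + 48 - 4c + 8 = 0  =>  c = -18
c = Fraction(-128 + 48 + 8, 4)
assert c == -18 and f(-4, c) == 0

# The remaining roots of 2x^3 + 3x^2 - 18x + 8 are 2 and 1/2 ...
assert f(2, c) == 0 and f(Fraction(1, 2), c) == 0

# ... and 12 is certainly not one of them: f(12) = 3680, not 0
assert f(12, c) == 3680
```

So the question as presented is internally contradictory: c = -18 makes -4 a root, but nothing makes both -4 and 12 roots at once.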
Marble: I assume that it doesn't clear your context and that the LLM assumes the cup's orientation changes each time. That means on every "even" occasion the orientation of the cup has the opening downwards and hence moving the cup leaves the marble on the table. On every "odd" occasion, the cup has its opening face upwards and hence the marble is held in the cup when the cup is removed. I therefore assume the LLM is interpreting the term "upside down" as a continual oscillation of the orientation of the opening of the cup.
It's so insane that it actually wrote "Flappy Bird" with a GUI. It errored on the first and 2nd output, and the 3rd was flawless. Daang
Can someone explain to me why Groq's responses are different than Meta's responses if it is the same model weights they are using?
the answer for 2a-1=4y is correct as y=(2a-1)/4. The explanation is perfect and answer is correct!
Definitely would like to see this running on AutoGPT or chain/tree of thoughts, etc.
can we use llama vision via groq?
Hi Nooby,
you need to consider the following:
1. Any statements or words added to the context will affect the response, so ensure only directly relevant context.
2. When you ask "How many words are in the response?", the system prompt affects the number given to you. You may request the LLM to count and list the response words, and you will be surprised.
Thx!
1:53 Haha Comic Sans! That was funny.
4:56 Why is y = (2a-1)/4 not the correct answer?
Some ready-made coffee cups have lids, so Llama gambles between the two responses.
I think this is a poorly constructed question, as you point out.
I'm not sure if it's only me, but when trying to log in with a Facebook account, it sent me back to the original page, and when I click "try meta AI" it keeps sending me back to the original page.
Any help with that? Because I do want to save my history with the chat bot
A new logic/reasoning question for you test that is very hard for LLMs:
Solve this puzzle:
Puzzle: There are three piles of matches on a table - Pile A with 7 matches, Pile B with 11 matches, and Pile C with 6 matches. The goal is to rearrange the matches so that each pile contains exactly 8 matches.
Rules:
1. You can only add to a pile the exact number of matches it already contains.
2. All added matches must come from one other single pile.
3. You have only three moves to achieve the goal.
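For what it's worth, the puzzle is small enough to brute-force. A sketch that searches all three-move sequences, where a move (src, dst) doubles pile dst using matches taken from pile src:

```python
from itertools import product

def solve(piles=(7, 11, 6), target=(8, 8, 8), max_moves=3):
    """Brute-force the match puzzle: a move (src, dst) adds to pile dst
    exactly the number of matches it already contains, taken from pile src."""
    moves = [(s, d) for s, d in product(range(3), repeat=2) if s != d]
    for seq in product(moves, repeat=max_moves):
        p = list(piles)
        ok = True
        for src, dst in seq:
            if p[src] < p[dst]:   # not enough matches in src to double dst
                ok = False
                break
            p[src] -= p[dst]
            p[dst] *= 2
        if ok and tuple(p) == target:
            return seq
    return None

print(solve())  # → ((1, 0), (0, 2), (2, 1))
```

That is: double A from B (7,11,6 → 14,4,6), double C from A (→ 8,4,12), then double B from C (→ 8,8,8).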
You had to remove the system prompt from the parameters on Groq, as it pollutes the input and thus affects the output.
Also, your other test with the function is incorrect (or unclear) as well. As a simple proof check that if c = -18, then the function f doesn't have a root at x = 12:
f(12) = 2 · 12^3 + 3 · 12^2 - 18 · 12 + 8 = 3680.
Explanation:
f(-4) = 0 => 2 · (-4)^3 + 3 · (-4)^2 + c · (-4) + 8 = 0 => -72 - 4c = 0, which in and of itself would imply that c = -18.
f(12) = 0 => 2 · 12^3 + 3 · 12^2 + c · 12 + 8 = 0 => 3896 + 12 c = 0 which on the other hand implies that c = -324
Therefore there is a contradiction. This would actually be an interesting test for an LLM, as not even GPT-4 sees it immediately, but the way you present it, it's nonsense.
garbage in, garbage out?
@@Sam_Saraguy That refers to training, not inference.
I've tried creating Snake with zero-shot too. Got pretty much the same result :) Maybe should try testing it by asking to create Tetris :)
@matthew_berman Remember that asking the same question to the same model will give you different answers, because there is randomness to it unless you specify a temperature of zero, which I don't think you are doing here. Also, assuming the inference speed depends on the question you ask is a bit far-fetched. You have to account for the fact that the load on the server will also impact the inference speed. If you ask the same question several times at different periods of the day, you will get different inference speeds. Good science is not about drawing quick conclusions from sparse results.
Thanks. Let's try local agents on llama 3? Also please consider self corrective agents, maybe based on langchain graphs. On llama3 they should be great.
Snake is getting better every week!
The marble thing is probably just the result of reflection. Models often get stuff wrong because an earlier more-or-less-random token pushes them down the wrong path. Models cannot self-correct during inference, but can on a second iteration. So it probably spotted the incorrect reasoning of the first iteration and never generated the early tokens that pushed it down the wrong path again.
With the marble in the cup dilemma, could be that the temperature settings are a little too high on the model leading it to be creative?
That's exactly what it is. Randomness is normal. Unless the temperature is set to zero (which is almost never the case), you'll be getting stochastic outputs with an LLM. This is actually a feature, not a bug. By asking the same question 3 times, 5 times, 7 times, etc., and then reflecting on it, you'll get much better answers than asking just once.
Are there any websites similar to Groq that host LLMs?
My man, in what world is y = 2a - 2 the same expression as 4y = 2a - 1 ? That's not only a super easy question, but the answer you got is painfully obviously wrong!! Moreover I suspect you might be missing part of the question, because the additional information you provide about a and y are completely irrelevant.
I used the answer in the SAT webpage
@@matthew_berman Well, you too can see it's wrong. Also, the other SAT question is wrong too. Look at my other comment
@@matthew_berman This is alarmingly simple math. If you're using the answer from an SAT page, then there are two possibilities: you copied the question incorrectly, or the SAT page is wrong. It's most likely that you copied the question wrong, because the way the second part of the question is worded does not make any sense.
@@dougdouglass6126 Sounds like its worth double checking, but saying things like "this is alarmingly simple math" is a bit disrespectful and assumes Matt has any interest in checking this stuff, no offense but math only becomes interesting when you've got an actual problem to solve, if the answer is already there from the SAT webpage as he said, he's being a total normal person not even looking at it.
@@elwyn14 That's nonsense. Alarming is very fitting, because this problem is so easy it can be checked for correctness at a glance, which is what we all do when we evaluate the model's response. And this is A TEST, meaning the correctness of what we expect as an answer is the only thing that makes it valuable.
I may be mistaken, but on the marble question the previous answer is now part of the context. My guess is that the model reads this answer, sees that it's mistaken, and corrects it.
Please make a video using agents in a graphical interface. It would be really interesting