Meta Llama 3.1-405B Explained: The FUTURE of AI is OPEN-SOURCE!
- Published 12 Sep 2024
- Meta has finally delivered on its promise to revolutionize the open-source large language model world with the release of the most powerful open-source model in history! Llama 3.1 405B is the largest model Meta has ever released, and the performance numbers are equally impressive!
Tell us what you think in the comments below!
Meta Release: llama.meta.com...
HuggingFace Chat Llama3.1 405B: huggingface.co...
Something we haven't tested yet is fine-tuning such a SOTA model on specialized tasks. It's probably the best source of synthetic data for knowledge distillation: 405B is a robust model with better generalization, more stability, and more sample efficiency. Can't wait to see what the community will do with it; we've just unlocked a myriad of new use cases.
I'm waiting to see if this completely surpasses NVIDIA's NeMo model, which they created solely for creating new datasets. 8B could also lead to really interesting agentic abilities! What are you planning to use this model for?
NVIDIA's NeMo supports synthetic data generation and came out weeks ago. Things are moving fast...
Ollama has the 8B model in a 4-bit quantized version (ollama run llama3.1), which I have been playing with on an 8GB M1 Mini. It has a slightly higher memory footprint than the llama3 model, at least in this configuration, probably due to the much larger context window. This causes it to swap and slows down the tokens per second. It also does not produce noticeably better results so far in my initial testing. It may require the 3-bit version to run in this envelope (which would degrade it further), though I wouldn't be surprised if Ollama could manage it, since the model size is about the same (3 vs. 3.1). Having the greater context in such a small model opens all sorts of new possibilities. Fine-tuned versions of this will be very exciting for local use.
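The extra footprint is consistent with the much larger KV cache. A rough sketch of the math, assuming the published Llama 3.1 8B architecture (32 layers, 8 grouped-query KV heads, head dimension 128) and fp16 cache entries; those config values are assumptions, not something stated in this thread:

```python
# Rough KV-cache size estimate for Llama 3.1 8B at its full 128K context.
# Architecture numbers below are assumed from the published model config.
layers = 32          # transformer layers
kv_heads = 8         # grouped-query attention KV heads
head_dim = 128       # dimension per head
ctx_len = 131072     # 128K-token context window
bytes_per_val = 2    # fp16

# Factor of 2 covers both keys and values.
kv_cache_bytes = 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_val
print(f"KV cache at full context: {kv_cache_bytes / 2**30:.1f} GiB")  # → 16.0 GiB
```

16 GiB for the cache alone at full context would certainly swap an 8GB machine; runtimes allocate a smaller default context, but even a fraction of 128K adds noticeably over llama3's old 8K window.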
I've been having a similar experience with my M2 Max macbook pro. I think smaller quants are going to need to be tuned somewhat heavily to see consistent gains across any number of potential use-cases. Context window is everything.
Under Ollama or llama.cpp, the new Llama 3.1 is not fully implemented yet, and because of that it's worse than Llama 3 right now.
Good stuff. Not sure if you got trolled or if you're trolling us with the pronunciation of Claude -- in either case, it's pronounced "CLAWED." As in Claude Rains, Claude Shannon, Claude Monet, Claude Chabrol, Claude Debussy...
Hahaha, given the comments it's a bit of a meme. I understand it's CLAWDE haha.
It’s cloud. As in clouds that rain.
I'd argue RAG is still really useful. You can use it to minimize hallucinations and provide additional context beyond what the model was trained on. Thus even if RAG isn't useful for every situation and has its drawbacks it still will be a huge asset.
It looks powerful ! I’m very excited
Llama 3.1 70B is for now what I'm most impressed with!
Ah yes, the mysterious Clude model 😶🌫
But in all seriousness, Llama 3.1 is BIG news for open source, thanks Meta
Meta delivered and its a great day for progress in open source!
I view benchmark results, especially HumanEval and coding, as graded on a logarithmic scale, in the sense that the higher the scores, the more a gap matters: 89 vs. 92 is a much more drastic 3% difference than, let's say, 50 vs. 53.
This is a great point! I also think these benchmarks fail to convey a lot of finer context about *how* these models generate code, especially when you're comparing full-file generation versus simple completion.
How much expected gpu memory is needed exactly to run this locally on something like ollama when that is available?
Anywhere from 200GB to 1TB+ depending on quantization level
At least 256GB ;)
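As a back-of-envelope check on those figures, memory for the weights alone is roughly parameter count times bits per parameter. This is only a sketch; real runtimes add overhead for the KV cache, activations, and buffers on top of it:

```python
# Approximate weight memory for a 405B-parameter model at common quantization levels.
params = 405e9  # parameter count

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> decimal GB
    print(f"{label}: ~{gb:.0f} GB")
# fp16 lands around 810 GB and 4-bit around 202 GB, matching the 200GB-to-1TB range above.
```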
Ollama's drawback is that you must re-encode an already-working model into Ollama's own format. I don't know what it is or why, but even that preparation takes huge amounts of storage for large models (usually the Ollama people do it for you). So your only way is oobabooga and GGUF backed by system RAM (server motherboards have plenty of RAM slots).
It all depends on the Ollama people's hosting capabilities.
Instead of fine tuning 405B directly, I would be really interested to see a re-implementation of Let's Verify step-by-step and fine tune a smaller model to pick the best approach for things like coding.
It'd get pretty expensive to run, but I wonder how well a best-of-ten or best-of-one-hundred would do on coding problems.
Plus, there's obviously the agentic workflow applications where this really could be your coding assistant.
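The best-of-n idea above can be sketched in a few lines. `generate_candidate` and `verifier_score` below are hypothetical stand-ins for sampling from a large model and for a smaller fine-tuned verifier; dummy implementations are included so the sketch runs:

```python
import random

def generate_candidate(prompt: str) -> str:
    # Hypothetical stand-in for sampling one solution from a large model.
    return f"solution-{random.randint(0, 9)} for {prompt!r}"

def verifier_score(prompt: str, candidate: str) -> float:
    # Hypothetical stand-in for a smaller model fine-tuned to rank solutions
    # (the "Let's Verify Step by Step" role).
    return random.random()

def best_of_n(prompt: str, n: int = 10) -> str:
    # Sample n candidates and keep the one the verifier scores highest.
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))

print(best_of_n("reverse a linked list", n=10))
```

The cost concern is visible here too: best-of-100 means 100 full generations plus 100 verifier passes per problem.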
I think finetunes from 70B will be showing up first just because it's cheaper and will take less time. Agentic finetunes will rarely benefit from larger parameter counts, unless the architecture is wildly different compared to past llamas.
Exo should make that work on Linux; it's dumb to use only Macs, which have the same RAM problem. The only problem I see on the market is that 128GB LRDIMMs are quite expensive and rare, and it could be hard to find identical ones later if I invest money into them. At the same time, 64GB LRDIMMs are plentiful and affordable; I could buy 768GB total, like, tomorrow. DDR4 of course: used server RAM for my server board.
Maybe server farms are migrating to 128s and dumping 64s; there are so many of them.
They have a Linux client, it's just not quite as mature. Initially EXO focused on Macs since the Mac Studio is an incredibly powerful platform and, with proper networking, is actually more cost-effective than using the NVIDIA A100 80GB.
@@aifluxchannel thanks aiflux! 🤗
Very impressive video! Can I use this model for my medical studies instead of GPT and Claude? Does it have better analysis?
Instruct models are generally better for "analysis" or flows that require multi-turn reasoning! What kind of AI do you currently use?
I was a GPT Plus user, but now I use different language models. You know how the companies surprise us every day, like "good morning, we introduce our new language model."
I'm gonna try quantizing into 4-bit GGUFs. I think I'll only need to offload about 70GB to CPU RAM. I'm curious to see what the inference speeds are.
Hoping for anything better than 2-3 tok/s!
3:35 sure I have a hardware, like everyone 😭
We are all GPU poor my friend :)
How can you run the 8b model anywhere?
Technically the smallest quants available can run with as little as 8GB of Vram.
llama 405b is showing up on Groq model list, but on completion it produces: "message": "The model `llama-3.1-405b-reasoning` does not exist or you do not have access to it.",
Models "llama-3.1-8b-instant" and "llama-3.1-70b-versatile", however, do respond just fine. So is "llama-3.1-405b-reasoning" either not there or not free on Groq?
These are Groq-specific quants; I'd imagine they're just having trouble scaling their infra today. I encountered a similar issue with Mistral Codestral, where the API that was "free" was in reality only "free" if you had a form of payment hooked up for their conventional paid APIs. I haven't interacted much with Groq outside of that.
@@Michael330167 Everybody is hitting these models all at once. HuggingChat is busy more often than not, which makes chaining with tool use difficult. But everyone (well everyone with serious hardware) can spin up a 3.1 405b server so it will saturate the market and then we should have no problems anymore (as long as Crowdstrike doesn’t get involved 😜)
clood
🤣
CLAWDE lmao
@@aifluxchannel I have faith you'll nail it sooner or later.
Bro, I've heard you say literally every possible pronunciation of Claude except the correct one... Clood...
Maybe I'm pronouncing to solicit comments correcting me ;)
@@aifluxchannel aktuallly it’s cl-oooorr-da
I like that you’re pronouncing the silent L. To me it’s so weird when people leave out the L.
Opensource AI is an absolute idiotic idea sorry.
While you, me, and your brother down the street might be just fine messing around with this right now.
Giving China, Russia, Iran/Iraq, etc. access to the finished products of massive AI compute LLMs is atrocious.
Yes, this will be a lot of fun and very interesting to mess around with. No the devil doesn't have the same good intentions as you do.
As a programmer of 27 years I doubt having access to this excites people much more than it does me.
But as these AI models approach perfection it becomes more and more necessary to cut off nations who could use these models for harmful purposes.
Cutting these nations off from compute, like we did with China and NVIDIA products, is undermined if we simply give them the finished product like Zuckerberg just did.
TBH this basically confirms to me that his wife has always been a Chinese spy. If you doubt that or think that is nutty you should look into her background.
Should we also classify linear algebra?