Might be a bit late about the Llama-3.2 news as usual, oops. Working on the new Claude 3.5 Sonnet video now (see u in 3 weeks lol)
and check out NVIDIA's Llama-3.2 resources 👇
Jetson AI Lab: nvda.ws/4eO2VFU
Multimodal RAG with Llama 3.2 GitHub: nvda.ws/4eyspY0
NVIDIA NIM for Developers: nvda.ws/4dCXys1
You missed Pixtral 12B, a literal gem when it comes to multimodality. It is miles ahead of Llama 3.2 11B and comparable to Llama 3.2 90B.
Pixtral is actually insane. It flew under the radar so hard it isn't even funny. AFAIK no major inference backend supports a quantized version of it ;-;
NVIDIA as a partner goes crazy, I am happy you are getting the attention you deserve 👍
I assume everyone will have an LLM running on their phones locally in 3-5 years
I run phi 3.5 mini locally on my phone, it is really good for the size and runs pretty well too.
What does it help you do? I’m a noob just getting into this. Would love to know what kind of augmentation it would have on the workflow of an average Joe.
What do you mean "I run it on my phone"?
What interface do you use? How do you manage the models?
@@DriftJunkie they have apps that handle all that now, you're living in the past lol
@@countofst.germain6417 name pls
@@countofst.germain6417 can you gimme the name of the app which I can use to run it on a phone?
RAM on mobile devices isn't actually shared between the CPU and GPU like that. They do have dedicated VRAM, it's just on the order of ~12 KB or so (literally orders of magnitude less than the frame buffer takes up, so it's usually not mentioned). The GPU draws the screen as a sequence of tiles, so it never needs the whole frame buffer at once anyway, but it does spend a ton of time transferring data from RAM to tile memory just to render each screen frame. Some older phones let you drop the resolution to half or a quarter of native to save power, because it drastically reduces the amount of RAM-to-VRAM transfers it takes to render the framebuffer on the GPU.
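The resolution trick is just arithmetic: halving both dimensions quarters the pixels the GPU has to move per frame. A quick back-of-envelope sketch (the panel size, bytes per pixel, and refresh rate here are assumed example numbers, not from the comment):

```python
# Rough arithmetic for why halving resolution cuts framebuffer traffic.
# Assumed numbers: 1080x2400 panel, 4 bytes/pixel (RGBA8888), 60 Hz.

def framebuffer_bytes_per_second(width, height, bytes_per_pixel=4, hz=60):
    """Bytes the GPU must move per second just for the framebuffer."""
    return width * height * bytes_per_pixel * hz

full = framebuffer_bytes_per_second(1080, 2400)   # native resolution
half = framebuffer_bytes_per_second(540, 1200)    # half resolution each axis

# Half the linear resolution means a quarter of the pixels, so a
# quarter of the per-frame transfer traffic.
assert half * 4 == full
print(f"full: {full / 1e6:.0f} MB/s, half-res: {half / 1e6:.0f} MB/s")
```

With these example numbers that's roughly 622 MB/s at native versus ~156 MB/s at half resolution, every second, just to push the framebuffer.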
The > signs are pointing the wrong direction, and Llama 3.2 is the same as 3.1 but with vision: no difference in the text weights, so it's not trained from a larger model
Its like they say: Thats cray cray
that's completely delulu
Pixtral, Qwen VL, Phi, there's so many.
There's an open one that can ingest videos too, forgot the name.
Sadly, ask any of them to OCR Japanese pages and they can't do it properly.
LLAMA in ComfyUI = interesting
Why aren't you covering 1-bit LLMs? It sounds very promising and the tests are testifying to that too.
Because it's been 8 months and there are still no publicly available models that can compete with even old models, not to mention they need to run at native precision plus pay a decompression cost anyway due to lack of hardware support?
@@bits360wastaken
Hmm, you're missing my point though.
Even though I agree about the lack of competitive publicly available models, that's not what I'm asking. I'm talking about coverage of the overall technology, since it was open-sourced with a huge update lately, with great improvements and promising potential.
Granted, besides the llama3-8B-1.58Bit-100B-tokens model, which can literally run on a single core at 6-7 tokens per second, there's no public model as good as mainstream models. But generally, 1-bit quantization has been shown to get close to floating-point precision while being a lot more performant and efficient.
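For anyone wondering what "1.58-bit" means in practice, here's a minimal sketch of ternary weight quantization in the style of BitNet b1.58, where each weight becomes one of {-1, 0, +1} times a single per-tensor scale. This is an illustration of the idea, not any lab's actual code:

```python
# Minimal sketch of 1.58-bit (ternary) weight quantization:
# weights map to {-1, 0, +1} times one absmean scale per tensor.
import numpy as np

def quantize_ternary(w: np.ndarray):
    """Round weights to {-1, 0, 1} after scaling by the mean absolute value."""
    scale = np.mean(np.abs(w)) + 1e-8        # per-tensor "absmean" scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from ternary codes."""
    return q.astype(np.float32) * scale

w = np.array([0.9, -0.05, -1.2, 0.4], dtype=np.float32)
q, s = quantize_ternary(w)       # q == [1, 0, -1, 1]
w_hat = dequantize(q, s)         # approximate reconstruction of w
```

Since each weight needs only log2(3) ≈ 1.58 bits, and matmuls against {-1, 0, 1} reduce to additions and subtractions, that's where the efficiency claims come from.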
I would love to see a video talking about how the hell these AI labs are making multimodal models. It's obviously the direction the industry has been moving in since GPT-4o: increasing scale on text-only data is getting harder and harder, so including other modalities is the next obvious direction to scale in, and it could lead to models that generalise much better.
Definitely do a video on multi-modality, thanks!
In those models, the image tokens are not fed through cross-attention but are instead provided alongside the text as input.
Llama uses cross attention. Qwen is as you said.
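To make the two wiring styles concrete, here's a toy NumPy sketch. The tiny shapes and the attention helper are made up for illustration, not either model's real architecture:

```python
# Toy sketch of the two common ways to feed image features into an LLM.
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention with a softmax over keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))   # 5 text token embeddings
img = rng.standard_normal((3, d))    # 3 image features from a vision encoder

# Style 1 (Llama 3.2 style): cross-attention.
# Text tokens are the queries; image features supply keys/values.
out_cross = attention(text, img, img)        # shape (5, d)

# Style 2 (Qwen-VL style): image tokens are spliced into the input
# sequence and everything self-attends together.
seq = np.concatenate([img, text], axis=0)    # shape (8, d)
out_self = attention(seq, seq, seq)          # shape (8, d)
```

Cross-attention keeps the text stack's sequence length unchanged (extra layers bolted on the side), while the spliced-tokens approach just makes the sequence longer.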
Could you please make a video about running these models in INT8 for local inference? There seems to be no content on the internet about running the 1B and 3B models quantized locally.
they just released some new quants
x.com/AIatMeta/status/1849469912521093360
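Until a video covers it, here's the core of what INT8 weight quantization does: a minimal per-channel absmax sketch in NumPy. Names and shapes are illustrative, not any particular library's API:

```python
# Minimal sketch of symmetric per-channel INT8 weight quantization,
# the basic scheme behind most "run it in INT8 locally" setups.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Absmax quantization to int8, one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float weights: q * scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)  # toy (out, in) weights
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)

# Per-element error is bounded by half a quantization step.
assert np.all(np.abs(w - w_hat) <= (s / 2) + 1e-6)
```

The weights shrink 4x versus FP32 (2x versus FP16); at inference time you either dequantize on the fly or use INT8 matmul kernels where the hardware has them.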
The 11B and 90B aren't distilled; they're the 8B and 70B with vision encoders on top. Yeah, a 20B vision encoder on the big one.
The < should be >. Nice video tho
no, 90B is 70B and 11B is 8B, you didn't pay attention to the papers, dude. the extra parameters are the vision adapter
I'm pretty sure they only said the image adapter is 100B for the 405B model. As for the 90B and 11B, they didn't clarify how they did it.
I just wish they would focus on efficiency and smaller models at some point; we are reaching a point where this is getting out of reach of the common mortal and their hardware
It makes sense for the model to be good at summarizing social media posts, because Meta uses their platforms as data 😂
Qwen 0.5B to 70B all have 130k tokens, I think that's what you haven't heard of 😅
Man 😎 you missed the lord of the underworld, Molmo by Ai2
For the life of me I don't understand why the model sizes go from tiny to huge with nothing in between. Why don't they make models that fully utilize 24GB RTX cards?
I wonder when something is going to convince me to abandon mistral.
Mistral don’t give af 😅
Try questions with wrong preposition. Mistral and Gemma cannot handle them. Llama and Qwen can.
it should've been "Boy Cloud" 😁😹✌
Is the video sponsored by nvidia?
Talk about nvidia new model
In my opinion the 3.2 series is unusable; it's extremely censored to a point that's absurd. I thought for a second that Meta did something cool, but once I saw 3.2: NOPE.
Gosh I'm so sick of this nerd talk