Pixtral is REALLY Good - Open-Source Vision Model
- Published Sep 17, 2024
- Let's test Pixtral, the newest vision model from MistralAI.
Try Vultr yourself when you visit getvultr.com/b... and use promo code "BERMAN300" for $300 off your first 30 days.
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewber...
My Links 🔗
👉🏻 Main Channel: / @matthew_berman
👉🏻 Clips Channel: / @matthewbermanclips
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.ne...
👉🏻 LinkedIn: / forward-future-ai
Need AI Consulting? 📈
forwardfuture.ai/
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Pixtraaal or Pixtral?
Does it deserve triple a?
you nick it Pix T and own that sh1t
Pixtraaaal. Alternatively, you could wear a black beret, a white-and-black striped shirt and hold a cigarette, at which point you can go ahead and pronounce it either way.
Bro but toonblast?! Really man😂. This is awesome
C'est Françaaaaaais?!
😅
Don't forget it is 12b
R.I.P. Captchas 😅
😎🤖
🎉
A lot of sites have already switched to puzzle type captchas, where you must move a piece or slide bar to the appropriate location in the image in order to pass the test. Vision models can't pass these until they're also able to actively manipulate page/popup elements. I haven't seen any models do this yet, but it probably won't be long before some LLM company implements it.
@Justin_Arut Actually, this model busts those too. You saw at the end how it was able to find Wally/Waldo by outputting a coordinate. You could use the same trick with a puzzle captcha to locate the start and end locations, and from there it's trivially easy to automatically control the mouse to drag from the start position to the end position. Throw a little rand() action on that to make the movement intentionally imperfect, more like a human's, and there will be no way for them to tell.
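The rand() idea above can be sketched in plain Python. This is a minimal, hypothetical helper (names are mine, not from any library): it generates a jittered drag path between two points, which a bot would then feed to a mouse-automation tool.

```python
import random

def human_drag_path(start, end, steps=25, jitter=3.0):
    """Generate a drag path from start to end with random
    noise at each step so the movement isn't a perfect line."""
    (x0, y0), (x1, y1) = start, end
    path = [start]
    for i in range(1, steps):
        t = i / steps
        # linear interpolation plus a little rand() wobble
        x = x0 + (x1 - x0) * t + random.uniform(-jitter, jitter)
        y = y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)
        path.append((x, y))
    path.append(end)  # always land exactly on the target
    return path

# e.g. drag a puzzle piece from the coordinates the model output
path = human_drag_path((100, 200), (340, 210))
```

Intermediate points wobble randomly while the endpoints stay exact, which is the "imperfect movement more like a human" trick the comment describes.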
they were RIP'd a few years ago, dude...
we need an uncensored model
For the bill gates one, you put in an image with "bill gates" in the filename! Doesn't that give the model a huge hint as to the content of the photo?
I think you should try giving it a photo of the word "Strawberry" and then ask it to tell you how many letter r's are in the word.
Maybe vision is all we needed to solve the disconnect from tokenization?
But if they used the same tokenization for the marks in a specific image, then it would be the same.
We need AI doctors for everyone on earth
...and then all other forms of AI workers producing value for us.
Just imagine the treatments that an AI "doctor" could hallucinate for you! A "doctor" that can't count the number of words in its treatment plan or R's in "strawberry". A "doctor" that provides false (hallucinated) medical literature references.
AIs will help healthcare providers well before they replace them. They will screen for errors, collect and correlate data, suggest further testing and potential diagnoses, provide up-to-date medical knowledge, and prepare preliminary case documentation. All of this will increase patient safety and will potentially allow providers to spend more time with their patients. HOWEVER, (in the US) these advancements may only lead to healthcare entities demanding that the medical staff see more patients to pay for the AIs. This in turn will further erode healthcare (in the US).
@Thedeepseanomad Producing value for the few rich people who can afford to put them in place. You won't profit from it.
Don't forget AI lawyers
@@earthinvader3517 Dream scenario: no more doctors or lawyers
When are we getting AI presidents?
Presidents that hallucinate
Not sooner than you get a human-intelligence president.
you think biden was real?
Small, specialized models make sense. You don't use your eyes for hearing or your ears for tasting, for good reason.
Bad comparison. Ears and eyes are sensors, i.e., cameras and microphones. Your brain accepts all the senses and interprets them. AI is the brain in the analogy, not the sensors.
they don't sense, they process lol, but still a good point
Seems like constraints instilled by its creators sometimes limit its ability to do the task.
I love your channel, but I really hope that in the future you start moving to some more advanced questions. I understand the difficulty of making sure the questions are followable by your audience, but you're asking 6th-grade questions of something that is theoretically PhD-level. I really wish you would put more work and effort into crafting individualized questions for each model, to probe each model's strengths and weaknesses rather than using a one-size-fits-all group of questions.
Where is GPT-4o live screenshare option?
They're working on it while they showed us the demo lmao
7:50 My iPhone could not read that QR code either.
It's the weirdest QR code I've seen; I don't think he checked whether it works with normal scanners.
Now all we need is a quantized version of this model so we can run it locally. Based on the model size, it looks like Q8 would run on 16GB cards and Q6 would run on 12GB. Although, I'm not sure if quantizing vision models works the same way as traditional LLMs.
Saw someone on Hugging Face saying this uses 60GB unquantized. You sure it reduces that much?
@GraveUypo I was basing my numbers on the Pixtral 12B safetensors file on Hugging Face, which is 25.4GB. I assumed it's an fp16 model. Although, I could be wrong on any or all of that, but the size sounds about right for 12B parameters.
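As a sanity check on those numbers, the usual back-of-the-envelope rule is parameter count times bits per parameter (this ignores inference-time overhead like activations and KV cache, which adds a few more GB):

```python
def model_size_gb(params_billion, bits_per_param):
    """Rough weight-file size: parameters x bits, converted to GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# estimates for a 12B model at different precisions
for label, bits in [("fp16", 16), ("Q8", 8), ("Q6", 6)]:
    print(f"{label}: ~{model_size_gb(12, bits):.0f} GB")
# fp16 ≈ 24 GB, Q8 ≈ 12 GB, Q6 ≈ 9 GB
```

The fp16 estimate (~24GB) is close to the 25.4GB safetensors file mentioned above, and the Q8/Q6 figures match the 16GB/12GB card guesses in the parent comment.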
When you next test vision models you should try giving it architectural floor plans to describe, and also correlate various drawings like a perspective rendering or photo vs a floor plan (of the same building), which requires a lot of visual understanding. I did that with Claude 3.5 and it was extremely impressive.
To ensure the accuracy and reliability of this model, fine-tuning is essential
Funny that the companies actually call the inference "reasoning". Sounds more intelligent than it actually is.
The big question for me is when Pixtral will be available on Ollama, which is my interface of choice... If it works on Ollama, it opens up a world of possibilities.
I use oobabooga, but if it doesn't work there I'll switch to something else that does, idc
Matt, you made a point regarding decent smaller models used for specialized tasks. That comment obviously reminds me of agents, each with their own specialized model for tasks and a facilitator to delegate to them. I think most of us want to see smaller and smaller open-source models getting better and better on benchmarks.
"Great, so captcha's are basically done"
Me as a web dev:
👁👄👁
Lol the drawn image was actually much more difficult to read than the captcha in the beginning.
"Mistral" is (English/American-ised) pronounced with an "el" sound. Pixtral would be similar. So "Pic-strel" would be appropriate. However the French pronunciation is with an "all" sound. Since mistral is a French word for a cold wind that blows across France, I would go with that for correctness. It's actually more like "me-strall", so in this case "pic-strall" should be correct.
At any rate, I look forward to a mixture of agents/experts scenario where pixtral gets mixed in with other low/mid weight models for fast responses.
It's easier to crush a benchmark of 7-8B models when you're a 12B model though :')
Why don't you ever use the BIG PCs you were sent?
Dude is hosting a 12B on 16 CPUs & 184GB RAM! It's probably $2 per hour.
Nemo is an underrated 12B model
This plus open interpreter to monitor camera feeds and multiple desktops, chats, emails
Could you please include object counting tasks in the vision-based model's evaluation? This would be valuable for assessing their ability to accurately count objects in images, such as people in photos or cars in highway scenes. I've noticed that some models, like Gemini, tend to hallucinate a lot on counting tasks, producing very inaccurate counts.
Nonchalantly says Captcha is done. That was good.
(If you ask it to identify an image make sure the filename is obfuscated.)
Thanks for the pixtral video!
Very Impressive for an open source 12B model.
Would be nice for some of these if you could repeat the prompt in a separate query to see if it got it by random chance, like the Waldo one.
Matthew,
I agree many models and many agents are the future. Missing from your system model are the AI prompt interpreter/parser, the agentic system assembler, and the response validator (i.e., the AI supervisor). The money is going to be in truth-based models and in the supervisors. Agents will quickly outnumber humans.
Ask it to do an ARC test... you may just win a million bucks.
there can be a small model good at testing or picking which small model to use for the task 😊
Toonblast? Really?! 😂Love it
Hello Matthew, love your work. Just curious about where you would get all these latest releases info from?
The bill gates one I hope it wasn't reading the file name and drawing from that to identify the person.
very good very niiiiice,
very good, very niiiiiice, a lot of chicken nugget
"119GB being used", followed by "Photos is using 133GB" 🧐
It's funny, you highlight Waldo and I still cannot make him out.
What about multiple pictures as an input? I think this is very important and you didn't address it in the video. It would be cool to test it to, for example, find the differences between multiple pictures, or to find out the VRAM usage when you prompt it with multiple images.
I thought facial recognition was "turned off" in most (some) models on purpose. Didn't Anthropic have that in their system prompt?
Do you think an AGI would be basically these specialised use-case LLMs working as agents for a master LLM?
Uhhh finally. Been waiting for this for years
Ok so VULTR is giving out $300 worth of crack to lure me into a new "needed" expense.
Nice! 😊😂
YUP, Captchas are basically done
We need AI for every job
Note it's pretty much the best in the benchmarks because they didn't show the AIs that beat it in those benchmarks 😂
Very impressive!
Thanks!
Someone should try seeing if it can do the Trachtenberg method.
Would it find the app that is not installed, if you explain the concept of the cloud download icon to it? Like if you tell it "Check for cloud symbols - it means the app is not installed."
I just signed up with Vultr and was wondering if you were going to do any videos on this? Does anyone know of training for this? I want to run my Llama on it.
Tutorial to have a logic-performing LLM query the vision LLM and process the results?
Nice! Can I run this on the CCTV cameras at our safari farm? To identify animals etc.?
I tested the QR code with my phone. My phone doesn't recognize the QR code either. Maybe the contrast in the finder patterns is too subtle?
Matthew Baman Copper
If I send you pictures of insects and plants (with IDs) can you see how good these vision models are at species ID?
I signed up for Vulture using the link you provided but didn't get the $300
Just need one model that can pull from the Internet well and I can unsub from OpenAI.
Can you add a test: given a complex process flow diagram, is the VLM able to convert it into nested JSON?
pixtraál
What's the difference between a set of small models specialized in code, math, etc. and a mixture of agents? Wouldn't an MoE be better?
Comfyui implementation and testing?
Can I run this locally through LM Studio or AnythingLLM?
Hello, let me share my vision of the future based on what I've seen from Meta and from China. I've seen a paper where a Chinese team trained Llama 2 with audio, video, images, and text, and this model broke through many SOTA benchmarks, especially in image recognition etc.
Now add Meta's Chameleon, which has a common encoder and decoder. Why is this important? Because Pixtral still has a separate image encoder and text encoder; Chameleon brings everything into one encoder and decoder.
Fusing both ideas, I think the future will be a Llama 4 with audio (music, voice, and natural sounds), text, video, and images for both inputs and outputs. We might also get a Llama 3.1 or 4 reflection model to compete with o1. Going even further, we might get a Llama 4.1 which fuses Llama 4's multimodality with Llama 4 reflection, with one addition: the model must be able, like us humans, to choose whether a query calls for system 1 or system 2 thinking.
If that happens in a free-to-download 8B to 12B model... boy, that's .......... woaaaaaaaah. If you read this comment and find it interesting, I'm curious about your point of view: do you think I'm dreaming, or is it achievable in the near future?
Can it be adapted to understand video?
For anyone who has claimed the credits: does it ask for a credit card, or is just creating a new account enough?
Tested against Claude 3 Haiku. Why not Claude 3.5 Sonnet?
Can we run it on M1 Macs?
Awesome - thx - these open-source reviews really help keep me up to speed 😎🤟
Very impressive. I'll certainly get some use out of this. Thanks for the info!
Is it censored??
I'm from the future: hackers have figured out how to make people's phones with TikTok on them explode. Zoomers in shambles.
It seems quite strange that it's abysmal at logic and reasoning but quite adept at analyzing and describing images, as if the two are not conceptually related in any way. This seems like some red-flag orthogonality to me.