“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial
- Published 9 Jun 2024
- Explore multimodal language models like LLaVA, which let you reach GPT-4-level multimodal abilities and unlock use cases like chatting with images
🔗 Links
- Follow me on twitter: / jasonzhou1993
- Join my AI email list: www.ai-jason.com/
- My discord: / discord
- LLaVA link: llava-vl.github.io/
⏱️ Timestamps
0:00 Intro
1:03 What is multimodal?
1:23 LLaVA model
2:08 Demo
3:35 Use case: Product development
5:17 Use case: Content curation
6:27 Use case: Medical
7:07 Use case: Captcha
8:09 Use case: Robots
👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
#gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #multimodality #gpt4 #multimodal #llama2 #llama #llava #machinelearning - Science & Technology
Jason, your videos are next level!! Loved the agent that you made for research. I made a similar one using your video and I've been using it to research my work, and it's pretty awesome! Saved me tons of time already!!!
Great videos dude! Love the content and how compressed the info is!
again, great video. Thank you Jason.
This was a great video. Thank you.
Thank you big man for such amazing videos, Thank you !
excellent content bro, keep up the good work
One of the best channels right now
Thank you AI Jason for sharing valuable AI developments. Would love to see in the future how to train the model on our own photos. Nice..
Yup, I really want to know how to train and fine tune it as well ...
i need more content from this channel!
Absolute banger dude your content is actually top tier
I concur 🤖
BEST AI Channel. Thank you Jason.
Great video as always
Woahh, this is probably the best multimodal model I've tried, definitely opens up lots of imagination!
I love it ! Thank you!!!!!!!!!!!!!!!
Legend 🙌🙌 super helpful
This channel will be huge
Great demo!
Nice introduction, thank you for your effort
Good stuff brother.
Thank you, AI Jason!
Best AI content on YouTube. Learned so much from you. Is it plausible to run this on a consumer-grade gaming machine with, for instance, an RTX 4090? Will you do an install/setup video?
awesome
Great video, the 13B multimodal models are doing amazingly well. Would love to see a video for the following use case: say I am an HR manager and have two job positions, JOB-A and JOB-B. Can an LLM filter job resumes based on the requirements of JOB-A and JOB-B with few-shot prompting or fine-tuning? It's a prediction task, much like sentiment analysis...
Like the example use cases. Indeed, it seems LLaVA is not that good at rich-text OCR. Definitely an area for improvement, but still promising. I would love a second episode on Fuyu-8B, or a tutorial on how to further fine-tune LLaVA for a specific use case. Thanks a lot for sharing!
You used non-square images with a crop option, so what it saw was cropped.
Ohhh good catch, I tried again and that definitely solved some of the issues!
LLaVA has been out there a long time already. Great that it's not dead and they added support for Llama 2.
Yea, the results are much better after adding Llama 2 support!
@@AIJasonZ I haven't tested it yet. Is it able to understand/describe images better?
@@AIJasonZ Was it able to describe something that Llama v1 couldn't?
Ok this is crazy! So now you can add more context. It's like us using our 5 senses to interpret information. But this part here @3:42... if it becomes possible for it to build full-stack apps easily, say goodbye to junior developers. At that point anyone can sketch an app with the entire workflow, show the image to the A.I. along with a description like "Build this app you see with React on the front end and Node.js/Express for the backend, create the APIs and connect them to the front end." GAME OVER!!!
It's my understanding that Palm 2 is hooked to Bard. Gemini is the future. Google has to figure out how to mesh Gemini into Palm 2 and Palm 2 into Gemini. Gemini has all the new multimodal features that Palm 2 I assume will pick up if they can learn how to sync it.
Do you think the choice of vector database matters for storing this multimodal data? For example, does Weaviate vs. Chroma offer certain features that might make one optimal for these multimodal vectors?
lol try it.
I wonder how these multimodal models will affect robotics and self-driving.
Is there any Python API for this? I want to use it for image recognition.
Anyone know how to fine-tune it on a custom dataset?
I wonder why it failed the captcha? There’s already AI out there that can crack the captcha easily.
fucker i still fail captchas
This is insane... because it's just a first experimental version at a mere 13B parameters... and it can identify a pretty convoluted colourless picture and make a story out of it... not to mention correctly rate a picture on an arbitrary score and identify what app you're gunning for without being told the kind of app... The future looks pretty scary...
Not a dermatologist, but the foot looks like pityriasis rosea. 🤔
Hahah this is above my level 😝
I bet Google has had and used them all for years, and has a profile of every Android user in its database.
6:45 Well, the completion doesn't mention that it is no expert in the medical domain and that you should see a doctor.
LOL.. In the Elon Musk photo he was smoking a blunt not a cigar. There is a big difference.
it's not a tutorial
First
Can you teach us how to use this model locally 🫠