How do Multimodal AI models work? Simple explanation

  • Published Aug 23, 2024

COMMENTS • 26

  • @emc3000
    @emc3000 8 months ago +8

    Thank you for giving actual application examples of this stuff.

  • @CharlesMacKay88
    @CharlesMacKay88 7 months ago +4

    great video. thanks for condensing this into the most important facts and avoiding any clickbait or annoying stuff.

  • @leastofyourconcerns4615
    @leastofyourconcerns4615 8 months ago +2

    awesome short introduction to the subject! appreciate you guys for those vids!

    • @AssemblyAI
      @AssemblyAI  8 months ago +1

      Thanks for watching!

  • @mystikalle
    @mystikalle 8 months ago +3

    Great video! I would like you to create a similar easy to understand video about the article "What AI Music Generators Can Do (And How They Do It)". Thanks!

  • @faisalron
    @faisalron 3 months ago +1

    Great content, really easy to understand! Thanks.
    Btw, the speaker looks like Nicholas Galitzine... 🤣🤣

  • @PrantikRoychowdhury-e3c
    @PrantikRoychowdhury-e3c 20 days ago

    Great explanation

  • @MiroKrotky
    @MiroKrotky 8 months ago

    I have the same nose as the speaker in the video, a little pushed to the side. Great vid. Best speaker on the channel.

  • @user-pe4xm7cq5z
    @user-pe4xm7cq5z 28 days ago

    Awesome. Thanks!!

  • @asfandiyar5829
    @asfandiyar5829 8 months ago +1

    Thanks for the awesome video! Though I think it was a little too quick given the topic being covered.

  • @pablofe123
    @pablofe123 7 months ago

    Brilliant, only six minutes.

  • @danielegrotti5231
    @danielegrotti5231 8 months ago

    Hi, I saw a few seconds of your new video on the Emergent Abilities of LLMs, but after some hours it disappeared... Could you please re-upload the video? It was so interesting! Thank you so much

    • @AssemblyAI
      @AssemblyAI  7 months ago

      Hi there - the video has been re-uploaded! Here's the link:
      ua-cam.com/video/bQuVLKn10do/v-deo.html

  • @keithwins
    @keithwins 7 months ago

    Great!

  • @andrewdunbar828
    @andrewdunbar828 5 months ago +1

    Ah so they all convert to text in the pipeline? That's disappointing. I was wondering how they did the equivalent of tokenization for the other modalities. Text is rich but it's still inherently lossy or will introduce a certain kind of artefacting.

    • @andrewdunbar828
      @andrewdunbar828 5 months ago +1

      Actually I hunted around, and it seems that multimodal models do in fact tokenize the other modes; the term "patch" is often used as the equivalent of "token" for those modes.
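
      A minimal sketch (not from the video, with illustrative sizes) of what those ViT-style "patches" look like in code: the image is cut into fixed-size squares, and each square is linearly projected into the same embedding space the transformer uses for text tokens.

      import torch
      import torch.nn as nn

      image = torch.randn(1, 3, 224, 224)       # (batch, channels, height, width)
      patch_size, embed_dim = 16, 768

      # Split into non-overlapping 16x16 patches -> (1, 196, 3*16*16)
      patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
      patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

      project = nn.Linear(3 * patch_size * patch_size, embed_dim)
      patch_tokens = project(patches)            # (1, 196, 768): one "token" per patch
      print(patch_tokens.shape)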

  • @joelmaiza
    @joelmaiza 7 months ago

    For text there are LLMs?
    For images there are...?

    • @xspydazx
      @xspydazx 4 months ago

      from transformers import (
          VisionEncoderDecoderModel,
          SpeechEncoderDecoderModel,
          AutoFeatureExtractor,
          AutoTokenizer,
      )
      print('Add Vision...')
      # Add vision head
      # Combine a pre-trained image encoder and a pre-trained text decoder into a Seq2Seq model
      Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
          "google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
      )
      _Encoder_ImageProcessor = Vmodel.encoder
      _Decoder_ImageTokenizer = Vmodel.decoder
      _VisionEncoderDecoderModel = Vmodel
      # Attach the vision head to the base language model (LM_MODEL is an already-loaded LLM)
      LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
      # Add sub-components
      LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
      LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
      LM_MODEL
      This is how you add vision to an LLM (you can embed the head inside it).
      print('Add Audio...')
      # Add audio head
      # Combine a pre-trained speech encoder and a pre-trained text decoder into a Seq2Seq model
      _AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small")
      _AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
      _SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
          "openai/whisper-small", "openai/whisper-small"
      )
      # Add pad tokens
      _SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id
      _SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id
      LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder
      # Add sub-components
      LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer
      LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor
      LM_MODEL
      This is how you add sound. (Make sure device = CPU: it takes at least 19 GB of RAM just to create the vision model from its config, plus the models already in memory. They take about a minute to run; if you start a new Mistral model it also generates weights for each layer in memory, so it takes a few minutes.)

  • @wealthassistant
    @wealthassistant 8 months ago

    How can ChatGPT decode images? It's mind-bogglingly good at recognizing text in photos. I don't see how you get that capability from training on images of cats and dogs.

    • @AssemblyAI
      @AssemblyAI  8 months ago +1

      Unfortunately, no paper for GPT-4 has been published, so it is unknown. It could somehow combine Optical Character Recognition with something like a Vision Transformer to be able to understand images and read text so well!
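
      (GPT-4's image pipeline is not public, so the following is only a hedged illustration of the idea above: TrOCR, an open model in Hugging Face transformers, pairs a Vision Transformer image encoder with a text decoder and can transcribe printed text from a photo. The image path is a placeholder.)

      from PIL import Image
      from transformers import TrOCRProcessor, VisionEncoderDecoderModel

      processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
      model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

      image = Image.open("photo_with_text.png").convert("RGB")  # placeholder image path
      pixel_values = processor(images=image, return_tensors="pt").pixel_values
      generated_ids = model.generate(pixel_values)
      print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])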

    • @xspydazx
      @xspydazx 4 months ago

      When training, it learns captions for images (hence, when inputting them, you should give the most detailed description possible for each image); it then maps the image to its associated caption. It's not a database, so it has to see many images of a cat to recognize a cat image. Using Haar cascades you can pick individual items out of an image, so for object detection you would create a dataset from a model that uses Haar cascades to identify, say, eyes in a picture (boxed); those recognized regions can then be fed into the model with their descriptions (a minimal sketch of that detection step follows below).
      For medical imagery, a whole case history and file can be added with an image; by being very detailed, later images can bring the same detailed information back to the surface.
      As a machine learning problem, we need to remember how we trained networks to recognize pictures!
      We also have OCR, so those pre-trained OCR images can be labelled as well!
      Once we have such data, we can selectively take information from a single image: its description as well as the other objects in the picture. Captions do not include color information, so for color usage and actual image understanding we have a diffuser... hence Stable Diffusion! With color understanding we can generate similar images using a fractal!
      Hence FULL STACK MODEL DEVELOPMENT,
      and that's not the RAG! (Which, they will soon realize, is a system that needs to be converted into an ETL process: the LLM is the long-term memory, the RAG the working memory, and the chat history the short-term memory.) So an ETL process will be required to fold the local information back into the main model, using the same tokenizer to tokenize the data into the DB so it can later be loaded more quickly into the LLM during fine-tuning, clearing the RAG, which should be performed as a backup, i.e. monthly or annually!
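
      (A minimal sketch, with a placeholder input path, of the Haar-cascade step described above: box a feature such as eyes with OpenCV's pre-trained cascade, then save each crop so it can be paired with a text description when building a training set.)

      import cv2

      cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
      img = cv2.imread("face.jpg")                       # placeholder input image
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

      boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
      for i, (x, y, w, h) in enumerate(boxes):
          crop = img[y:y + h, x:x + w]
          cv2.imwrite(f"eye_{i}.png", crop)              # each crop gets a caption/description later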

  • @sereneThePity
    @sereneThePity 4 months ago

    backstreet freestyle

  • @QuintinMassey
    @QuintinMassey 4 months ago +1

    A Woman, questionable (it is 2024 after all). A Female a little more certain (same reason) 😂

  • @seakyle8320
    @seakyle8320 8 months ago +1

    1:59 "concept of a woman"? ask woke people.