How do Multimodal AI models work? Simple explanation

  • Published Aug 23, 2024

COMMENTS • 26

  • @emc3000
    @emc3000 8 months ago +8

    Thank you for giving actual application examples of this stuff.

  • @CharlesMacKay88
    @CharlesMacKay88 7 months ago +4

    great video. thanks for condensing this into the most important facts and avoiding any clickbait or annoying stuff.

  • @leastofyourconcerns4615
    @leastofyourconcerns4615 8 months ago +2

    awesome short introduction to the subject! appreciate you guys for those vids!

    • @AssemblyAI
      @AssemblyAI  8 months ago +1

      Thanks for watching!

  • @mystikalle
    @mystikalle 8 months ago +3

    Great video! I would like you to create a similar easy to understand video about the article "What AI Music Generators Can Do (And How They Do It)". Thanks!

  • @faisalron
    @faisalron 3 months ago +1

    Great content, really easy to understand! Thanks.
    Btw, the speaker looks like Nicholas Galitzine... 🤣🤣

  • @PrantikRoychowdhury-e3c
    @PrantikRoychowdhury-e3c 20 days ago

    Great explanation

  • @MiroKrotky
    @MiroKrotky 8 months ago

    I have the same nose as the speaker in the video, a little pushed to the side. Great vid. Best speaker on the channel.

  • @user-pe4xm7cq5z
    @user-pe4xm7cq5z 28 days ago

    Awesome. Thanks!!

  • @asfandiyar5829
    @asfandiyar5829 8 months ago +1

    Thanks for the awesome video! Though I think it was a little too quick given the topic being covered.

  • @pablofe123
    @pablofe123 7 months ago

    Brilliant, only six minutes.

  • @danielegrotti5231
    @danielegrotti5231 8 months ago

    Hi, I saw a few seconds of your new video on the Emergent Abilities of LLMs, but after some hours it disappeared... Could you please re-upload the video? It was so interesting! Thank you so much

    • @AssemblyAI
      @AssemblyAI  7 months ago

      Hi there - the video has been re-uploaded! Here's the link:
      ua-cam.com/video/bQuVLKn10do/v-deo.html

  • @keithwins
    @keithwins 7 months ago

    Great!

  • @andrewdunbar828
    @andrewdunbar828 5 months ago +1

    Ah so they all convert to text in the pipeline? That's disappointing. I was wondering how they did the equivalent of tokenization for the other modalities. Text is rich but it's still inherently lossy or will introduce a certain kind of artefacting.

    • @andrewdunbar828
      @andrewdunbar828 5 months ago +1

      Actually I hunted around, and it seems that multimodal models do in fact tokenize the other modes; the term "patch" is often used as the equivalent of "token" for those modes.
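
      A minimal sketch (not from the video, with illustrative sizes) of what those ViT-style "patches" look like in code: the image is cut into fixed-size squares, and each square is linearly projected into the same embedding space the transformer uses for text tokens.

      import torch
      import torch.nn as nn

      image = torch.randn(1, 3, 224, 224)       # (batch, channels, height, width)
      patch_size, embed_dim = 16, 768

      # Split into non-overlapping 16x16 patches -> (1, 196, 3*16*16)
      patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
      patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

      project = nn.Linear(3 * patch_size * patch_size, embed_dim)
      patch_tokens = project(patches)            # (1, 196, 768): one "token" per patch
      print(patch_tokens.shape)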

  • @joelmaiza
    @joelmaiza 7 months ago

    For text there are LLMs?
    For images there are...?

    • @xspydazx
      @xspydazx 4 months ago

      from transformers import (
          VisionEncoderDecoderModel,
          SpeechEncoderDecoderModel,
          AutoFeatureExtractor,
          AutoTokenizer,
      )
      print('Add Vision...')
      # Add vision head
      # Combine a pre-trained image encoder and a pre-trained text decoder into a Seq2Seq model
      Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
          "google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
      )
      _Encoder_ImageProcessor = Vmodel.encoder
      _Decoder_ImageTokenizer = Vmodel.decoder
      _VisionEncoderDecoderModel = Vmodel
      # Attach the vision head to the base language model (LM_MODEL is an already-loaded LLM)
      LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
      # Add sub-components
      LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
      LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
      LM_MODEL
      This is how you add vision to an LLM (you can embed the head inside it).
      print('Add Audio...')
      # Add audio head
      # Combine a pre-trained speech encoder and a pre-trained text decoder into a Seq2Seq model
      _AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small")
      _AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
      _SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
          "openai/whisper-small", "openai/whisper-small"
      )
      # Add pad tokens
      _SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id
      _SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id
      LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder
      # Add sub-components
      LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer
      LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor
      LM_MODEL
      This is how you add sound. (Make sure device = CPU: it takes at least 19 GB of RAM just to create the vision model from its config, plus the models already in memory. They take about a minute to run; if you start a new Mistral model it also generates weights for each layer in memory, so it takes a few minutes.)

  • @wealthassistant
    @wealthassistant 8 months ago

    How can ChatGPT decode images? It's mind-bogglingly good at recognizing text in photos. I don't see how you get that capability from training on images of cats and dogs.

    • @AssemblyAI
      @AssemblyAI  8 months ago +1

      Unfortunately, no paper for GPT-4 has been published, so it is unknown. It could somehow combine Optical Character Recognition with something like a Vision Transformer to be able to understand images and read text so well!
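
      (GPT-4's image pipeline is not public, so the following is only a hedged illustration of the idea above: TrOCR, an open model in Hugging Face transformers, pairs a Vision Transformer image encoder with a text decoder and can transcribe printed text from a photo. The image path is a placeholder.)

      from PIL import Image
      from transformers import TrOCRProcessor, VisionEncoderDecoderModel

      processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
      model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

      image = Image.open("photo_with_text.png").convert("RGB")  # placeholder image path
      pixel_values = processor(images=image, return_tensors="pt").pixel_values
      generated_ids = model.generate(pixel_values)
      print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])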

    • @xspydazx
      @xspydazx 4 months ago

      When training, it learns captions for images (hence, when inputting them, you should give the most detailed description possible for each image); it then maps the image to its associated caption. It's not a database, so it has to see many images of a cat to recognize a cat image. Using Haar cascades you can pick individual items out of an image, so for object detection you would create a dataset from a model that uses Haar cascades to identify, say, eyes in a picture (boxed); those recognized regions can then be fed into the model with their descriptions (a minimal sketch of that detection step follows below).
      For medical imagery, a whole case history and file can be added with an image; by being very detailed, later images can bring the same detailed information back to the surface.
      As a machine learning problem, we need to remember how we trained networks to recognize pictures!
      We also have OCR, so those pre-trained OCR images can be labelled as well!
      Once we have such data, we can selectively take information from a single image: its description as well as the other objects in the picture. Captions do not include color information, so for color usage and actual image understanding we have a diffuser... hence Stable Diffusion! With color understanding we can generate similar images using a fractal!
      Hence FULL STACK MODEL DEVELOPMENT,
      and that's not the RAG! (Which, they will soon realize, is a system that needs to be converted into an ETL process: the LLM is the long-term memory, the RAG the working memory, and the chat history the short-term memory.) So an ETL process will be required to fold the local information back into the main model, using the same tokenizer to tokenize the data into the DB so it can later be loaded more quickly into the LLM during fine-tuning, clearing the RAG, which should be performed as a backup, i.e. monthly or annually!
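
      (A minimal sketch, with a placeholder input path, of the Haar-cascade step described above: box a feature such as eyes with OpenCV's pre-trained cascade, then save each crop so it can be paired with a text description when building a training set.)

      import cv2

      cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
      img = cv2.imread("face.jpg")                       # placeholder input image
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

      boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
      for i, (x, y, w, h) in enumerate(boxes):
          crop = img[y:y + h, x:x + w]
          cv2.imwrite(f"eye_{i}.png", crop)              # each crop gets a caption/description later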

  • @sereneThePity
    @sereneThePity 4 months ago

    backstreet freestyle

  • @QuintinMassey
    @QuintinMassey 4 months ago +1

    A Woman, questionable (it is 2024 after all). A Female a little more certain (same reason) 😂

  • @seakyle8320
    @seakyle8320 8 months ago +1

    1:59 "concept of a woman"? ask woke people.