OpenAI’s CLIP explained! | Examples, links to code and pretrained model

  • Published 2 Jun 2024
  • Ms. Coffee Bean explains
    ❓ how OpenAI's CLIP works,
    ❔ what it can and cannot do
    ⁉️ and what people have been up to using CLIP in awesome applications!
    ➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
    Patreon: / aicoffeebreak
    Ko-fi: ko-fi.com/aicoffeebreak
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    📺 DALL-E explained by Ms. Coffee Bean: • OpenAI's DALL-E explai...
    Outline:
    * 00:00 CLIP and DALL-E
    * 02:14 What can CLIP do?
    * 04:22 How does CLIP work?
    * 09:12 Limitations
    * 12:39 Applications
    📄 CLIP paper: Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. cdn.openai.com/papers/Learnin... (last accessed: 24.01.2021)
    📚 CLIP blog: openai.com/blog/clip/
    💻 Use CLIP with huggingface: huggingface.co/docs/transform...
    I've removed the Colab link since it was deprecated and no longer working. A lot is happening in ML, and it is now even easier to use CLIP with Hugging Face 🤗. Link above!
    Other links:
    📄 Switch Transformers: Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv preprint arXiv:2101.03961.
    arxiv.org/pdf/2101.03961.pdf
    📚 Check out the section “10 Novel Applications using Transformers”: paperswithcode.com/newsletter/3
    🐦Links to the application showcase in Twitter:
    1. / 1352630033832140800
    2. / 1351271379103002632
    3. / 1350030541467258881
    4. / 1351830997403254785
    ------------------------------------------
    🔗 Links:
    YouTube: / aicoffeebreak
    Twitter: / aicoffeebreak
    Reddit: / aicoffeebreak
    #AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
    Thumbnail contains the tick emoji designed by OpenMoji - the open-source emoji and icon project. License: CC BY-SA 4.0

COMMENTS • 57

  • @SinanAkkoyun
    @SinanAkkoyun 1 year ago +7

    It's so cute that the coffee bean takes a pause when you take a breath!
    Thank you, this video was more conclusive than anything I've seen on CLIP. It really explained the intuition behind vector embeddings of image-text pairs and what that means.

  • @ravivarma5703
    @ravivarma5703 3 years ago +11

    This channel is Gold - Excellent

  • @satishgoda
    @satishgoda 1 month ago +1

    Thank you so much for this succinct and action packed overview of CLIP.

    • @AICoffeeBreak
      @AICoffeeBreak  1 month ago +2

      Thank you for visiting! Hope to see you again.

  • @pixoncillo1
    @pixoncillo1 3 years ago +10

    Wow, Letiția, what a piece of gold! Love your channel

  • @EpicGamer-ux1tu
    @EpicGamer-ux1tu 1 year ago +3

    Amazing video! This definitely deserves more views/likes. Congratulations. Much love.

  • @anilaxsus6376
    @anilaxsus6376 8 months ago +1

    I like the fact that you talked about the Ingredients they used, thank you very much for that.

  • @nasibullah1555
    @nasibullah1555 3 years ago +8

    Great job again. Thanks to Ms. Coffee Bean ;-)

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +4

      Our pleasure! Or rather Ms. Coffee Bean's pleasure. I am just strolling along. 😅

  • @romeoleon1118
    @romeoleon1118 3 years ago +4

    Amazing content ! Thanks for sharing :)

  • @cogling57
    @cogling57 2 years ago +4

    Wow, such amazing clear, succinct explanations!

  • @OguzAydn
    @OguzAydn 3 years ago +5

    underrated channel

  • @user-vm4sv5cf9y
    @user-vm4sv5cf9y 2 months ago +1

    Thanks! Very clear!👍

  • @vince943
    @vince943 3 years ago +4

    Thank you for your continued research. ☕😇

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +4

      Any time! Except when I do not have the time to make a video. 🤫

  • @talk2yuvraj
    @talk2yuvraj 2 years ago +3

    This is an excellent video, congrats.

  • @TheNilianne
    @TheNilianne 3 years ago +5

    Thanks for the video :)

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +3

      As always, it was Ms. Coffee Bean's pleasure! 😉

  • @harumambaru
    @harumambaru 2 years ago +3

    Thanks for teaching me something new today! I will try to return the favour and point out that "dog race" should be "dog breed" :) But as a non-native English speaker, it made perfect sense to me.

    • @harumambaru
      @harumambaru 2 years ago +2

      Hunderasse is a pretty good word :)

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      You're right! Hunderasse was a false friend to me, thanks for uncovering it for me. 😅
      Do you also speak German?

    • @harumambaru
      @harumambaru 2 years ago +2

      @@AICoffeeBreak I am only learning it. After I moved to Israel for work and learned Hebrew, I decided not to stop the fun and to keep learning new languages. I made a bold guess that living in Heidelberg makes you speak German, then I went to the Wikipedia page for "Dog breed", found the German version of the page, and my guess was confirmed.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      @@harumambaru True detective work! :) It's great you are curious and motivated enough to learn new languages. Keep going!

  • @lewingtonn
    @lewingtonn 1 year ago +1

    LEGENDARY!!!

  • @mishaelthomas3176
    @mishaelthomas3176 2 years ago +5

    Thank you very much, ma'am, for such an insightful video tutorial. But I have one doubt. Suppose I train the CLIP model on a dataset consisting of two classes, i.e. dog and cat. After training, I test my model on two new classes, for example horses and elephants, in the same way as described in OpenAI's CLIP blog. Will it give me a satisfactory result, as you said it can perform zero-shot learning?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +5

      Hi Mishael, this is a little more complicated than that. If you train CLIP from scratch on two classes (dog and cat), it will not recognize elephants, no.
      The zero-shot capabilities of CLIP do not come from a magical understanding of the world and generalization capabilities, but from the immense amounts of data CLIP has seen during pretraining. In my humble opinion, true zero-shot does not exist in current models (yet). It is just our human surprise that "the model has learned how to read", combined with our ignorance of the fact that the model had a lot of optical character recognition (reading) to do during pre-training. Or: look, it can make something out of satellite images, while its training data was full of those, just with a slightly different objective.
      The current state of zero-shot in machine learning is that you have trained on task A (e.g. aligning images containing text with the text transcription) and the model can then do another, but similar, task B (e.g. distinguishing writing styles or fonts).
      I am sorry this didn't come across so well in the video and that it left the impression that zero-shot is more than it is. Experts in the field know these limitations very well but like to exaggerate a little bit to get funding and papers accepted; but also because even this limited type of zero-shot merits enthusiasm, since models were not capable of this at all until recently.
      I might make a whole video about "how zero-shot is zero-shot". A tangential video on the topic is this one ua-cam.com/video/xqdHfLrevuo/v-deo.html where it becomes clear how a wrong interpretation of the "magic of zero-shot" led to mislabeling some behavior of CLIP as an "adversarial attack", which it is not.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +5

      One thing to add: if you *fine-tune* CLIP on dogs and cats and test the model on elephants, it might recognize elephants; not because of your fine-tuning, but because of all the pre-training that has been done beforehand. But even this is not guaranteed: during fine-tuning, the model might catastrophically forget everything from pre-training.

  • @mikewise992
    @mikewise992 4 months ago +1

    Thanks!

  • @henkjekel4081
    @henkjekel4081 10 days ago +1

    Thank you so much :) So the vectors T from the text encoder and I from the image encoder are the latent representations of the last word/pixel from the encoders that would normally be used to predict the next word?

    • @AICoffeeBreak
      @AICoffeeBreak  10 days ago +2

      Not the last word/image region, but a summary of the entire picture / text sentence. :)

    • @henkjekel4081
      @henkjekel4081 10 days ago +1

      @@AICoffeeBreak Thank you for your quick reply :) Let me see, so I do understand the transformer architecture very well. The text encoder just consists of the decoder part of a transformer. Due to all the self-attention going on, the latent representation of the last word at the end of the decoder contains the meaning of the entire sentence. That is why the model is able to predict the next word based on just the last word's latent representation. So could you elaborate on what you mean by the summary of the text sentence? Which latent representations are you talking about?

    • @henkjekel4081
      @henkjekel4081 9 days ago +1

      Hmm, I'm reading something about a CLS token, maybe that's it?

    • @AICoffeeBreak
      @AICoffeeBreak  9 days ago +2

      @@henkjekel4081 Yes, exactly! So, the idea of CLIP is that they need a summary vector for the image and one for the text to compare them via inner product.
      It is a bit architecture-dependent how exactly to get them. CLIP in its latest versions uses a ViT where the entire image is summarised in the CLS token. But the authors experimented with convolutional backbones as well: the original implementation had two variants, one using a ResNet image encoder and the other using a Vision Transformer. The ViT variant became more widely available and popular.
      And yes, the text encoder happens to be a decoder-only autoregressive (causal attention) LLM, but it could just as well have been a bidirectional encoder transformer. The authors chose a decoder LLM to make future variants of CLIP able to generate language too.
      But for CLIP as it is in the paper, all one needs is a neural net that outputs an image summary vector, and another one that outputs a text summary vector of the same dimensionality as the image vector.
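      For illustration, a minimal sketch of this summary-vector-plus-inner-product idea, assuming the Hugging Face transformers CLIP API linked in the description (the checkpoint name, image file, and caption are placeholders):

```python
# Sketch: one summary vector per image and per text, compared via inner product.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")                    # placeholder image file
text = ["a photo of a dog playing in the park"]  # placeholder caption

with torch.no_grad():
    img_vec = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_vec = model.get_text_features(**processor(text=text, return_tensors="pt", padding=True))

# Normalise so the inner product is a cosine similarity in [-1, 1].
img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)
print((img_vec @ txt_vec.T).item())
```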

  • @gkaplan93
    @gkaplan93 5 months ago +1

    Couldn't find the link to the Colab that lets us experiment, can you please attach it to the description?

    • @AICoffeeBreak
      @AICoffeeBreak  5 months ago +1

      The Colab link has become obsolete since the video has been up (a lot is happening in ML). Now you can use CLIP much more easily, since it has been integrated into Hugging Face: huggingface.co/docs/transformers/model_doc/clip This is the link included in the description right now.
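      To make that concrete, here is a rough usage sketch following the Hugging Face documentation linked above (the checkpoint name comes from those docs; the image path and candidate labels are made-up placeholders):

```python
# Sketch: zero-shot image classification with CLIP via Hugging Face transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("my_photo.jpg")  # placeholder path
labels = ["a photo of a dog", "a photo of a cat", "a photo of an elephant"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```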

  • @andresredondomercader2023
    @andresredondomercader2023 2 years ago +3

    Hello, CLIP is impressive :) Is there a listing of all possible tags/results it can return?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      CLIP can compute image-text similarity for any piece of text you input, as long as it is built from the (sub)words it has seen during training. I do not know the entire list exactly, but you can think of at least 30k English words.

    • @andresredondomercader2023
      @andresredondomercader2023 2 years ago +1

      @@AICoffeeBreak Many thanks Letitia. I think we are trying to use CLIP the other way around: It seems that the algorithm is great if you provide keywords to identify images containing objects related to those keywords. But we are trying to obtain keywords from a given image, and then categorise those keywords to understand what is in the image. Maybe I'm a bit lost in how CLIP works?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @@andresredondomercader2023 CLIP computes similarities between image and text. So what you can do is take the image and compute similarities to every word of interest. When the similarity is high, the image is likely to contain what that word describes, and you have an estimate of what is in the image, right?

    • @andresredondomercader2023
      @andresredondomercader2023 2 years ago +1

      @@AICoffeeBreak Thanks so much for taking the time to respond. In our project, we have about 300 categories: "Motor", "Beauty", "Electronics", "Sports"... Each category could be defined by a series of keywords; for instance, "Sports" is made up of keywords like "Soccer", "Basketball", "Athlete"..., whilst "Motor" is made of keywords such as "Motorbike", "Vehicle", "Truck"... Our goal would be to take an image and obtain the related keywords (items in the image) that would help us associate the image with one or more categories.
      I guess we could invert the process, i.e. push into CLIP the various keywords we have for each category and then analyse the results to see which sets of keywords resulted in the highest probability, hence identifying the related category, but that seems very inefficient, since for each image we'd do 300 iterations (we have 300 categories).
      However, if given an image CLIP returned the matching keywords that are most appropriate to it, we could then more easily match those keywords returned by CLIP with our category keywords.
      Not sure if I'm missing something, or maybe CLIP is just not suitable in this case.
      Thanks so much!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @@andresredondomercader2023 You are right, 300 iterations per image would be inefficient if one used CLIP completely out of the box without changing anything.
      But I would argue that:
      1. inference is not that costly, and you can do the following optimizations:
      2. For one image: since the image stays the same during the 300 queries, you only have to run the visual branch once, which saves you a lot of compute. Then you only have to encode the text 300 times for the 300 labels, but that is quite fast because your textual sequence length is so small (mostly one word).
      3. For all images: you only have to compute the textual representations (run the textual branch) 300 times in total. Then you have the encodings.
      So a tip would be to compute the 300 textual representations (vectors) once and store them. For each image, run the visual backbone and take the dot product of the image representation with the 300 stored textual representations, as sketched below.
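      A small sketch of this precompute-and-reuse tip, assuming the Hugging Face transformers CLIP API (the keyword list, file names, and checkpoint are placeholders, not the actual 300 categories):

```python
# Sketch: encode the category keywords once, then per image run only the
# visual branch plus a dot product against the cached text vectors.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

keywords = ["soccer", "basketball", "motorbike", "truck", "lipstick"]  # placeholder list

# 1) Run the text branch once for all keywords and cache the vectors.
with torch.no_grad():
    text_vecs = model.get_text_features(**processor(text=keywords, return_tensors="pt", padding=True))
text_vecs = text_vecs / text_vecs.norm(dim=-1, keepdim=True)
torch.save(text_vecs, "keyword_vectors.pt")  # store once, reuse for every image

# 2) Per image: run the visual branch once, then compare against all keywords.
image = Image.open("some_image.jpg")  # placeholder path
with torch.no_grad():
    img_vec = model.get_image_features(**processor(images=image, return_tensors="pt"))
img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)

scores = (img_vec @ text_vecs.T).squeeze(0)
top = scores.topk(3)
for i, s in zip(top.indices.tolist(), top.values.tolist()):
    print(keywords[i], round(s, 3))
```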

  • @compilations6358
    @compilations6358 1 year ago +2

    What's that music at the end?

  • @y.l.deepak5107
    @y.l.deepak5107 8 months ago +1

    The Colab isn't working, ma'am, please kindly check it once.

    • @AICoffeeBreak
      @AICoffeeBreak  7 months ago +1

      Thanks for noticing! The Colab link has become obsolete since the video has been up (a lot is happening in ML). Now you can use CLIP much more easily, since it has been integrated into Hugging Face: huggingface.co/docs/transformers/model_doc/clip
      I've updated the video description as well. :)

  • @joaquinpunales4365
    @joaquinpunales4365 2 years ago +6

    Hi everybody :), we have been working locally with CLIP and exploring what we can achieve with the model. However, we are still not sure if CLIP can be used in a production environment, I mean commercial usage. We have read CLIP's licence doc, but it's still not clear, so if someone has a clear idea whether that's allowed or not, I'd be more than grateful!

  • @arrozenescau1539
    @arrozenescau1539 4 months ago

    Great video

  • @renanmonteirobarbosa8129
    @renanmonteirobarbosa8129 2 years ago +3

    Make LSTMs great again, they are sad :/

    • @AICoffeeBreak
      @AICoffeeBreak  7 months ago +2

      I've been prompted by someone to think about whether LSTMs should still be part of neural network fundamentals courses. What do you think?
      Is it CNNs and then Transformers directly? Or are LSTMs more than a historical digression?

    • @renanmonteirobarbosa8129
      @renanmonteirobarbosa8129 7 months ago +1

      @@AICoffeeBreak The concepts, and understanding why it works, are more important. LSTMs are fun.

  • @ashikkamal7912
    @ashikkamal7912 2 years ago

    Subscribed, brother.