OpenAI CLIP: Connecting Text and Images (Paper Explained)

  • Published May 20, 2024
  • #ai #openai #technology
    Paper Title: Learning Transferable Visual Models From Natural Language Supervision
    CLIP trains on 400 million images scraped from the web, along with text descriptions to learn a model that can connect the two modalities. The core idea is a contrastive objective combined with a large batch size. The resulting model can be turned into arbitrary zero-shot classifiers for new image & text tasks.
    OUTLINE:
    0:00 - Introduction
    3:15 - Overview
    4:40 - Connecting Images & Text
    9:00 - Building Zero-Shot Classifiers
    14:40 - CLIP Contrastive Training Objective
    22:25 - Encoder Choices
    25:00 - Zero-Shot CLIP vs Linear ResNet-50
    31:50 - Zero-Shot vs Few-Shot
    35:35 - Scaling Properties
    36:35 - Comparison on different tasks
    37:40 - Robustness to Data Shift
    44:20 - Broader Impact Section
    47:00 - Conclusion & Comments
    Paper: cdn.openai.com/papers/Learnin...
    Blog: openai.com/blog/clip/
    Code: github.com/openai/CLIP
    Abstract:
    State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
    Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 93

  • @hmate1119 3 years ago +103

    This channel is insanely good. Deserves even more recognition. Great work! Subscribed

  • @MachineLearningStreetTalk 3 years ago +42

    This is a really important paper, I suggest people pay particular attention to Yannic's "robustness to data shift" section if you are short on time. I hope we can get the authors on to discuss this!

  • @jonatan01i 3 years ago +8

    Thank you so much for this,
    especially for not keeping the promise on cutting the video short!

  • @ghostlv4030 3 years ago +18

    The idea is so simple and so hard to believe it is this effective! Okay, I see, NLP is so useful in vision now.

  • @user-yx5nh4tm9n 1 year ago +2

    Man, you have a talent for explaining hard things! And your English is awesome!!

  • @aminasadi1040 1 year ago

    Thanks a lot for this awesome video! The explanations are very digestible even for a beginner.

  • @bukovelby 1 year ago

    Just a Brilliant overview!

  • @jenishah9825 3 years ago

    I can't thank you enough for making such useful videos.

  • @ashrafg4668 2 years ago

    Thank you for the explanation!

  • @naifalkhunaizi7847 7 months ago

    Truly great explanation!

  • @user-jx5pm9nx8p 10 months ago

    Excellent! Thank you a lot!

  • @growthmpsfunnels3358 1 year ago

    Dude you are doing a great job. Perfect for the work..

  • @oflasch 2 years ago +1

    Great explanation! 👍

  • @MeatFingerSteam 3 years ago +7

    Absolutely loved the Alec meme, thanks!

  • @maryamaghili1148 3 years ago

    Thank you for your great work! So is there any way we could find the actual label (text) they have used for training? I need to use this model for some classification tasks that I have, but I am wondering how to organize labels? I have only images with no annotation.

  • @G12GilbertProduction 3 years ago

    In a 8 of 20 examples presented in this paper review is really measured by different compilers of models, but not only this same in 20, 45, 60 bites for a 1mm³ pixel outer the third output layer.

  • @ShivamSingh-xf8nb 1 year ago

    Amazing explanation!

  • @ophir1080 2 years ago +2

    Great video, thanks for sharing! Just one question of mine:
    why are we 100% sure that all these old known datasets are not just subsets of the images CLIP was trained on?

  • @florianhonicke5448 3 years ago

    New video from yannic!!! Saved my day :D

  • @shengyaozhuang3748 3 years ago +6

    Interestingly, similar training methods have been explored in the field of information retrieval for searching relevant documents to the given query. So, probably a good application of CLIP could be searching a wanted photo on the internet by using a text query.
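
The text-to-image search idea in the comment above can be sketched as a nearest-neighbour lookup in the shared embedding space. This is only an illustration under stated assumptions: the vectors are hand-made toy values standing in for CLIP encoder outputs, and `search`, `image_index`, and the file names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, image_index, top_k=2):
    """Rank indexed images by similarity to the text query embedding."""
    scored = [(cosine(query_embedding, emb), name) for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Toy image embeddings (assumption: a real system would get these from an image encoder).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "dog_park.jpg": [0.1, 0.9, 0.2],
    "puppy.jpg": [0.2, 0.8, 0.1],
}
# Toy embedding for a text query such as "a photo of a dog" (assumption).
query = [0.1, 1.0, 0.1]

results = search(query, image_index)
print(results)
```

In practice the image embeddings would be computed once and stored, so that only the text query needs to be encoded at search time.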

  • @srinathtangudu4899 1 year ago

    Your videos are so good. Thanks:)

  • @prabhupadpradhan489 2 years ago

    Is the dataset that was used for pretraining the model (called WebImageText in the paper) available for public use?

  • @florisas.7557 3 years ago +1

    thanks yannic, great video! but the biggest question i have is how they got this dataset with images+descriptions 🤔

  • @dl569 1 year ago

    thank you a lot!

  • @frankd1156 3 years ago

    Very good Yannic......

  • @44Kokoloko 2 years ago

    Am I understanding this right:
    The CLIP training results in having both a text and image encoder that are able to numerically represent the proximity between words and image representations, with vectors. These encoders can then be used on different datasets to good effect.
    In other words, it relies on the findings related to text embeddings (word2vec) to train corresponding "image embeddings" in a way that allows matching an image embedding to a text embedding. Since text embeddings have proved able to encode relations between concepts in vector space (king - man + woman = queen), you can then move between text and image representations of these concepts. Does that sound right?
    Also, what is the pretraining done on?

  • @key2thacity87 10 months ago

    Hey @YannicKilcher /all, it seems like OpenAI is only referring to performance on the class of bananas at 39:05 (figure 13) not that zero-shot CLIP outperforms resnet in general on ImageNet. Earlier in the paper (8:15) they achieve 40% accuracy on ImageNet. Is 39:05, (figure 13) showing 72% accuracy on bananas or overall?

  • @xingjian417 3 months ago

    thanks for sharing

  • @GiangNguyen-of4qf 2 years ago

    best ever video explained Yannic :)

  • @raphaelsaeed 6 months ago

    Well explained

  • @vsiegel 3 years ago +6

    Trained on "the internet" - so technically speaking, it is a porn classifier, right? Except if it used a separate algorithm for "adult image filtering". Fascinating! (And funny!)

  • @eliteari 2 months ago

    great video

  • @jeshweedleon3960 3 years ago +1

    Imagine this but with more sensory data - audio, video, text, hell any string of bytes even. Wild...

  • @theocachet6496 2 years ago

    Do they check that T_i ≠ T_j for i ≠ j, with (i, j) being indexes within a minibatch? If that is not the case, then the contrastive loss may sometimes conflict with itself (maximizing and minimizing the similarity to the same text T_i in the same computation). Do we agree?
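
The concern in the comment above can be illustrated with a small worked example. Whether duplicate captions are filtered within a batch is not stated here; the sketch only assumes a toy 3×3 matrix of similarity logits in which captions 0 and 1 are the same string, so their columns are identical. The numbers are made up for illustration.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one row of logits and one target index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Rows are images, columns are captions (toy values, not real model outputs).
# Captions 0 and 1 are assumed identical, so columns 0 and 1 are equal.
logits = [
    [9.0, 9.0, 0.0],  # image 0, "correct" caption is index 0
    [7.0, 7.0, 1.0],  # image 1, "correct" caption is index 1
    [0.0, 0.0, 8.0],  # image 2, "correct" caption is index 2
]

# Image-to-text loss: image k should select caption k.
losses = [cross_entropy(row, k) for k, row in enumerate(logits)]
floor = math.log(2)  # best achievable when two candidate captions tie

for k, loss in enumerate(losses):
    print(f"image {k}: loss = {loss:.4f}")
print(f"floor for tied captions: {floor:.4f}")
```

Because the two identical captions always receive the same logit, the softmax can give the "correct" one at most half of the probability mass, so the loss for images 0 and 1 cannot drop below ln 2 no matter how good the encoders are. The unaffected image 2 can still reach a loss close to zero.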

  • @simonstrandgaard5503 3 years ago

    Mindblown again.

  • @morkovija 3 years ago

    Chuckled at that narrator cut! x)

  • @Abdulazizab2 2 years ago

    Great explanation! But I wonder how they measure the accuracy of zero-shot prediction, is it by containing the original word of the label only? or some sort of combination as the output of zero-shot CLIP would be a sentence I assume.

    • @gocomputing8529 11 months ago

      It is a bit too late, but I'll answer for the future people.
      From the video the classification is performed by creating a prompt. For example, if you know they are photos, you would say 'a photo of {label}'. As the video shows, the prompt you choose is really important for some applications (datasets)
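
The prompt-based classification described in this reply can be sketched as follows. This is a minimal illustration, not the released implementation: the lookup table `TOY_TEXT_EMBEDDINGS` is a hypothetical stand-in for the text encoder, the image vectors are hand-made, and the `scale` value is an assumed temperature. The model never outputs a sentence; it scores each candidate prompt, and accuracy is the fraction of images whose best-scoring label matches the ground truth.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-in for a text encoder: prompt -> toy embedding.
TOY_TEXT_EMBEDDINGS = {
    "a photo of dog": [0.9, 0.1, 0.0],
    "a photo of cat": [0.1, 0.9, 0.0],
    "a photo of banana": [0.0, 0.1, 0.9],
}

def zero_shot_classify(image_embedding, labels, template="a photo of {label}", scale=100.0):
    """Turn each label into a prompt, score it against the image, pick the best."""
    prompts = [template.format(label=label) for label in labels]
    logits = [scale * cosine(image_embedding, TOY_TEXT_EMBEDDINGS[p]) for p in prompts]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

def accuracy(image_embeddings, true_labels, labels):
    """Fraction of images whose predicted label equals the ground-truth label."""
    hits = sum(
        zero_shot_classify(emb, labels)[0] == truth
        for emb, truth in zip(image_embeddings, true_labels)
    )
    return hits / len(true_labels)

labels = ["dog", "cat", "banana"]
toy_images = [[0.1, 0.2, 0.95], [0.8, 0.3, 0.1]]  # toy image embeddings (assumption)
toy_truth = ["banana", "dog"]

prediction, probs = zero_shot_classify(toy_images[0], labels)
print(prediction, accuracy(toy_images, toy_truth, labels))
```

Changing `template` changes the text embeddings being compared against, which is why the choice of prompt matters for some datasets.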

  • @uniqued4ve 2 years ago

    I'm missing a bit your critique points here! But thanks, good intro to CLIP

  • @akhilezai 3 years ago +4

    Hey Yannic! I wanna know what software you use to "extend" your PDF with empty space that you use to write notes. Please tell us

    • @tsunamidestructor 3 years ago

      OneNote, afaik

    • @fayeq1745 3 years ago +1

      I was also wondering about that and figured out it might be OneNote.

    • @akhilezai 3 years ago

      So I found another way to do it: using LaTeX's \includepdf.

    • @tsunamidestructor 3 years ago

      @@akhilezai you could also use LiquidText if you have an iPad

    • @akhilezai 3 years ago

      @@tsunamidestructor thanks! I was sure it was possible on some apps on iPad, but I own Samsung tab s7+

  • @h3rtc 3 years ago +1

    that Alec meme is fire haha!

  • @Xaelum 3 years ago +1

    Just imagine a version of CLIP trained on random YouTube video frames + Title or Subtitles.

  • @p.z.8355 3 years ago +3

    Do they have a specific strategy for sampling the batches? Maybe sampling totally unrelated captions initially, e.g. of dogs and planes, then at a later stage in training sampling more subtly differing captions, e.g. of different breeds of dogs.

    • @YannicKilcher 3 years ago

      I think it's just pure random.

    • @p.z.8355 3 years ago

      @@YannicKilcher merci :)

  • @chandrahmmouleb9611 1 year ago

    super Hit

  • @yaka169 2 years ago

    Does it work similarly to a Siamese network, or how? I'm quite confused.

  • @ranam 3 years ago

    Can I make an OCR text recognizer with it?

  • @black-snow 3 years ago +2

    "random GoPro fallen into a bunch of bananas" xD

  • @bhavikdhandhalya1853 1 month ago

    I thought you would explain how the images and words are processed so that they have some connection. No problem.

  • @emilyme9478 1 year ago

    👍👍

  • @Kram1032 3 years ago +2

    Can't wait for this to be done to, like, entire movies.
    "Just" take the actual movie scripts as text input and the entire resulting movies (the frames) as image input, and add the modality of sound on top.
    Could also add a bunch of other production data if available (such as, say, concept art, or voices and music unmixed or even making-of documentaries and interviews or entire books which those movies are based on etc.)
    Between (such versions of) CLIP and Dall-E you probably could make entire movies from scratch with just writing out scripts, and then refine them by giving some concept art or something.
    I mean that level is a long ways off I expect - mostly due to how much data needs to be fit into a model that has to be long-time coherent etc. - just the memory requirements as of right now would be quite insane.
    But *in principle* I think this could be possible.
    Resource-needs aside, I suspect adding a sound modality wouldn't even be that difficult in CLIP, right? You'd basically do the same symmetric contrastive classification but add a third concept to it dealing with sound.

  • @ThetaPhiPsi 2 years ago

    just for ppl watching this lately: They revised the results for STL-10 in another version of the paper. On p. 40 they write "We updated the STL10 scores from the previous version of this paper after fixing a CUDA-related bug."

  • @jonatan01i 3 years ago

    29:38
    Voice borrowed from Josh from Let's Game It Out

    • @morkovija 3 years ago +1

      funny how we all watch same channels

  • @imranq9241 2 years ago

    Is it zero-shot if you consider image captioning as a single task?

    • @DajesOfficial 1 year ago

      it is zero shot in terms of not using dataset-specific data. Otherwise it is obviously heavily trained

  • @Qumeric 3 years ago +3

    It's weird that ImageNet-A performance is higher than ordinary ImageNet performance.

    • @norik1616 3 years ago +2

      Could it be, because the images are more artistic ≈ closer to labeled images ppl put on the internet?

  • @willrazen 3 years ago

    "We'll forgive it"

  • @GUINTHERKOVALSKI 11 months ago

    24:55 "i think prompt engineering will become quite a bit more relevant"

  • @florisas.7557 3 years ago

    shouldn't this easily beat ImageNet state of the art if you actually finetune it on the full ImageNet dataset?

  • @herp_derpingson 3 years ago +1

    18:20 This symmetric classification looks like a good idea. I wonder if we can use this for all classification tasks in general.
    28:40 If you look at the datasets it is weak at, they involve some form of arithmetic.
    This paper is a big deal. Kudos to the authors.

    • @YannicKilcher 3 years ago +2

      good thought, but if you apply this to standard classification, you always have the same N labels, which would just reduce to the classic cross-entropy loss
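
The symmetric classification discussed in this thread can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: embeddings are hand-made 2-D vectors, `scale` is an assumed temperature, and the function names are hypothetical. Each image is classified over the captions in the batch (rows), each caption is classified over the images (columns), and the two cross-entropy terms are averaged.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one list of logits and one target index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def symmetric_contrastive_loss(image_embs, text_embs, scale=10.0):
    """Average of image-to-text and text-to-image cross-entropy; pair k matches pair k."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[scale * dot(i, t) for t in txts] for i in imgs]
    # Image -> text: row k should select column k.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image: column k should select row k.
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

In ordinary classification the "text" side is the same fixed set of N class labels for every batch rather than one caption per image, so only the image-to-label term applies, and that term is the usual softmax cross-entropy, which is the reduction the reply describes.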

  • @p.z.8355 3 years ago +1

    Why do you even need a prompt? Can't you just use the original label set?

    • @DajesOfficial 1 year ago

      They show in the paper, and it is demonstrated in the video, that prompt engineering adds 5 percentage points to accuracy.

  • @Lee-vs5ez 3 years ago

    Better with vision to do nlp

  • @harinkumar1073 3 years ago

    44:00 "human model" lmao

  • @nakshatrasingh9202 3 years ago

    Switch transformer, Google. Video please 😭😭🙏🙏🙏

  • @jointcc2 1 year ago

    "logit" XDDDD