OpenAI CLIP - Connecting Text and Images | Paper Explained
- Published May 28, 2024
- ❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
In this video, I cover the CLIP paper - Learning Transferable Visual Models from Natural Language Supervision.
You'll learn about:
✔️ How the contrastive learning behind CLIP works
✔️ All the nitty-gritty details behind the paper
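As a taste of the first point: CLIP's contrastive objective is a symmetric cross-entropy over the cosine-similarity matrix of a batch of image and text embeddings, where matching pairs sit on the diagonal. A minimal numpy sketch (the fixed temperature here is a stand-in; the paper learns it as a parameter):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Cosine-similarity logits between every image and every text in the batch
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the i-th image matches the i-th text (diagonal)

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Symmetric loss: image->text over rows, text->image over columns
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

A correctly aligned batch should score a much lower loss than a misaligned one, which is exactly the signal the contrastive objective provides.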
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ The Bitter Lesson by Sutton: www.incompleteideas.net/IncIde...
✅ CLIP paper: cdn.openai.com/papers/Learnin...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 OpenAI's CLIP
02:10 Detailed explanation of the method
06:00 Comparison with SimCLR
12:55 How does the zero-shot part work
20:45 WIT dataset
21:30 Why this method, hint efficiency
28:35 Zero-shot - generalizing to new tasks
31:30 Prompt programming and ensembling
34:00 Zero-shot performance
36:20 Few-shot comparison with best baselines
38:20 How good is the zero-shot classifier?
40:45 Compute error correlation
41:20 Quality of CLIP's embedding space
43:05 Robustness to distribution shift
49:10 Limitations (MNIST failure)
50:30 A short recap
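The prompt-ensembling step (31:30) averages the text embeddings of many prompt templates per class ("a photo of a {}.", "a drawing of a {}.", ...) into a single classifier weight. A minimal sketch, where `encode_text` is a hypothetical stand-in for CLIP's text encoder:

```python
import numpy as np

def ensemble_class_embedding(encode_text, templates, class_name):
    # encode_text: stand-in for CLIP's text encoder (hypothetical signature)
    embs = np.stack([encode_text(t.format(class_name)) for t in templates])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit vectors
    mean = embs.mean(axis=0)          # average over prompt templates
    return mean / np.linalg.norm(mean)  # renormalize the ensemble weight
```

The paper reports that this kind of ensembling gives a noticeable zero-shot accuracy boost over a single prompt.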
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► / theaiepiphany
One-time donation:
www.paypal.com/paypalme/theai...
Much love! ❤️
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and, in general, a stronger focus on geometric and visual intuition rather than algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► / aleksagordic
Twitter ► / gordic_aleksa
Instagram ► / aiepiphany
Facebook ► / aiepiphany
👨👩👧👦 JOIN OUR DISCORD COMMUNITY:
Discord ► / discord
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► aiepiphany.substack.com/
💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► github.com/gordicaleksa
📚 FOLLOW ME ON MEDIUM:
Medium ► / gordicaleksa
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#clip #openai #nlpsupervision
I love that OpenAI is pushing towards these more general methods in computer vision as well.
Unsupervised learning is about to become super mainstream in CV.
In fact, they call it natural language "supervised" learning (Section 2.1).
Yes, it is a kind of parameterized classification: the embedding vectors from the text input act as the final fully connected layer of the image classification network.
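To illustrate the point in the comment above: the class text embeddings play the role of the weight matrix of a linear classifier applied to the image embedding. A minimal sketch with made-up shapes (embedding dimension and class count are arbitrary here):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    # class_text_embs: (C, D) text embeddings, one per class name/prompt.
    # They act as the weight matrix of a linear classifier over image features.
    W = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb)
    scores = W @ v               # cosine similarity to each class
    return int(np.argmax(scores))  # predicted class index
```

Swapping in a new set of class names just swaps the "weight matrix", which is what makes the classifier zero-shot.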
One of my favorite projects I've seen in a long time. Thanks for covering it.
It's awesome I agree!
Geez man do you ever take a holiday lol. I appreciate all your videos.
I do! As a matter of fact now! 😂 But since it's this fantastic pandemic+winter combo I'm making the most out of it.
Thanks!!
very well presented - greatly helped me improve my understanding. Thank you very much.
Great job! thanks!
Thank you for also giving the background on ConVIRT
Hi. Much of the value of a NN depends on the signal we give it. In this example, as you said (and very clever of the authors), by treating it as a classification task we give the net a signal that some text belongs to some image and not to other images. Do that across millions of examples and the net learns something like intelligence.
How does the fine-tuning actually work here? They don't talk about it in the paper, right?
what is the app you use to annotate these papers btw?
Thanks for the video. I was wondering if you would also cover visual SLAM / visual odometry as well. Thanks
Hey! You mean like the more classical CV algos or you have something more concrete in mind?
@@TheAIEpiphany I meant deep SLAM, such as SuperPoint or SuperGlue
Yes I agree, hope you have some time to review some vSLAM methods that use DL, especially feature extraction and depth estimation with monocular cameras
@@abudawood_phd Yes would be really interesting! SLAM is a whole other beast
What is actually the main contribution of the CLIP paper? Data augmentation, zero shot, and using large datasets?
Thanks for this nice explanation. Let me please ask the following question: at 05:40 you mentioned that the CLIP paper was heavily inspired by the ConVIRT paper, but the ConVIRT approach is only mentioned twice and does not appear in the references. Did they intentionally not reference it?
Apologies. I found the reference in the introduction. My Ctrl+F could not find "ConVIRT" because it was split across a line break in the long paragraph.
I love you
Aleksa really looks like the Khal in the GoT series XD
And this Khal is reaaaaaly smart!!