Analysis and Insights from Holistic Evaluation on Video Foundation Models | Multimodal Weekly 65
- Published Feb 9, 2025
- In the 65th session of Multimodal Weekly, we had Lucas Lee from the Twelve Labs Science team present our recent work on evaluating video foundation models.
Connect with Lucas: hyeongminlee.g...
Check out the following resources about TWLV-I:
- Blog Post: www.twelvelabs...
- arXiv: arxiv.org/abs/...
- HuggingFace: huggingface.co...
- GitHub: github.com/twe...
Timestamps:
00:15 Introduction
03:05 Lucas starts
03:22 What should we call video foundation models?
04:05 The most representative image feature extractors (VGGNet, CLIP)
05:10 Image retrieval is not equivalent to image embedding
06:13 Image representation
07:35 Image foundation models
07:50 DINO(v2)
09:28 MAE (Masked Autoencoder)
10:40 I-JEPA
11:40 How about videos?
12:36 CLIP4Clip
14:00 The video foundation model architecture that we want
15:10 A vision transformer that can capture motions
15:40 New structures and supervisions for videos
15:52 VideoMAE
16:45 UMT (Unmasked Teacher)
18:10 V-JEPA
19:15 Video is not just a sequence of images
19:50 Kinetics-400 vs Something-Something v2
21:53 Motion vs Appearance in V-JEPA and VideoGLUE
22:36 Motion vs Appearance in TWLV-I
23:05 TWLV-I is Twelve Labs' first technical report on video foundation models
24:10 TWLV-I proposes a better evaluation framework
25:00 Directional motion distinguishability
26:13 TWLV-I code is available on GitHub!
27:10 Q&A with Lucas
Join the Multimodal Minds community to receive invites to future webinars: / discord