Analysis and Insights from Holistic Evaluation on Video Foundation Models | Multimodal Weekly 65

  • Published 9 Feb 2025
  • In the 65th session of Multimodal Weekly, we had Lucas Lee from the Twelve Labs Science team present our recent work on evaluating video foundation models.
    Connect with Lucas: hyeongminlee.g...
    Check out the following resources about TWLV-I:
    - Blog Post: www.twelvelabs...
    - arXiv: arxiv.org/abs/...
    - HuggingFace: huggingface.co...
    - GitHub: github.com/twe...
    Timestamps:
    00:15 Introduction
    03:05 Lucas starts
    03:22 What should we call video foundation models?
    04:05 The most representative image feature extractors (VGGNet, CLIP)
    05:10 Image retrieval is not equivalent to image embedding
    06:13 Image representation
    07:35 Image foundation models
    07:50 DINO(v2)
    09:28 MAE (Masked Autoencoder)
    10:40 I-JEPA
    11:40 How about videos?
    12:36 CLIP4Clip
    14:00 The video foundation model architecture that we want
    15:10 A vision transformer that can capture motion
    15:40 New structures and supervisions for videos
    15:52 VideoMAE
    16:45 UMT (Unmasked Teacher)
    18:10 V-JEPA
    19:15 Video is not just a sequence of images
    19:50 Kinetics-400 vs Something-Something v2
    21:53 Motion vs Appearance in V-JEPA and VideoGLUE
    22:36 Motion vs Appearance in TWLV-I
    23:05 TWLV-I is Twelve Labs' first technical report on video foundation models
    24:10 TWLV-I proposes a better evaluation framework
    25:00 Directional motion distinguishability
    26:13 TWLV-I code is available on GitHub!
    27:10 Q&A with Lucas
    Join the Multimodal Minds community to receive an invite for future webinars: / discord
