Long-Form Video Reasoning and Question-Answering | Multimodal Weekly 55

  • Published Feb 9, 2025
  • In the 55th session of Multimodal Weekly, we had three Ph.D. candidates from Stony Brook University working on long-form video understanding under Michael Ryoo.
    ✅ Jongwoo Park will introduce LVNet - a video question-answering framework with optimal strategies for keyframe selection and sequence-aware captioning.
    Connect with Jongwoo: / jongwpark
    LVNet: github.com/jon...
    ✅ Kumara Kahatapitiya will bring up LangRepo - a Language Repository for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation.
    Follow Kumara: www3.cs.stonyb...
    LangRepo: github.com/kka...
    ✅ Kanchana Ranasinghe will discuss MVU - an LLM-based framework for long-video question-answering benchmarks - and share several surprising findings (a minimal likelihood-selection sketch follows this speaker list).
    Follow Kanchana: kahnchana.gith...
    MVU: kahnchana.gith...
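    For context on the "Likelihood selection" step listed at 51:18 in Kanchana's timestamps, here is a minimal, generic sketch of likelihood-based answer selection for multiple-choice QA. It assumes a HuggingFace causal LM (gpt2 as a stand-in) and uses frame captions as a proxy for video context; the prompt format, model, and helper names are illustrative assumptions, not the speakers' MVU implementation.

    # Minimal sketch of likelihood selection for multiple-choice QA (illustrative only,
    # NOT the MVU implementation): pick the answer option whose tokens are most probable
    # under a causal LM, conditioned on a prompt built from captions and the question.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def option_log_likelihood(prompt: str, option: str) -> float:
        """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits  # (1, seq_len, vocab)
        # log-probs at position t predict token t+1
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        targets = full_ids[:, 1:]
        token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # keep only the positions belonging to the option continuation
        # (approximate: assumes tokenization of the prompt is unchanged by the suffix)
        option_len = full_ids.shape[1] - prompt_ids.shape[1]
        return token_lls[0, -option_len:].sum().item()

    def likelihood_select(context: str, question: str, options: list[str]) -> str:
        """Return the option with the highest log-likelihood under the LM."""
        prompt = f"{context}\nQuestion: {question}\nAnswer:"
        scores = [option_log_likelihood(prompt, opt) for opt in options]
        return options[max(range(len(options)), key=scores.__getitem__)]

    # Hypothetical usage, with captions standing in for sampled video frames:
    caption_context = "Captions: a person chops vegetables, then stirs a pot on the stove."
    print(likelihood_select(caption_context, "What is the person doing?",
                            ["cooking a meal", "fixing a bicycle", "painting a wall"]))

    Scoring options by likelihood (rather than free-form generation) keeps the evaluation deterministic and works with any causal LM that exposes token logits.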
    Timestamps:
    00:11 Introduction
    03:05 Jongwoo starts
    04:05 Why do we need keyframes in a very long-form video?
    05:46 LVNet - Overview
    06:58 LVNet - Detailed View
    07:50 LVNet - Temporal Scene Clustering
    09:32 LVNet - Coarse Keyframe Detection
    11:22 LVNet - Fine Keyframe Detector
    13:03 Performance on long-video reasoning - EgoSchema
    14:00 Performance on long-video reasoning - NExT-QA
    14:48 Performance on long-video reasoning - IntentQA
    15:13 Open-ended responses with LVNet
    18:40 Open-ended responses with Uniform Sampling
    21:10 LVNet - Summary
    24:03 Kumara starts
    24:45 Observations from LLMs for long-video
    26:45 LangRepo - Overview
    28:06 LangRepo - Detailed View
    29:20 LangRepo - write-to-repo
    31:20 LangRepo - read-from-repo
    31:50 Performance on long-video reasoning - EgoSchema
    32:24 Performance on long-video reasoning - NExT-QA
    32:47 Interesting observations
    34:33 Qualitative examples - single temporal scale
    35:18 Qualitative examples - multiple temporal scales
    37:52 Kanchana starts
    38:38 Spurious behaviors in ML models
    40:37 Extending to video tasks
    41:48 Multimodal language model baseline?
    45:17 Findings
    46:03 Let's inject video-specific information into the best naive baseline
    46:52 "Object centric" video modalities
    47:53 Detailed overview
    51:18 Likelihood selection
    51:58 Performance on EgoSchema Question Answering
    52:45 Zero-shot generality
    54:12 Q&A session with the speakers
    01:02:36 Conclusion
    Join the Multimodal Minds community to receive an invite for future webinars: / discord
