Long-Form Video Reasoning and Question-Answering | Multimodal Weekly 55
- Published Feb 9, 2025
- In the 55th session of Multimodal Weekly, we hosted three Ph.D. candidates from Stony Brook University working on long-form video understanding, advised by Michael Ryoo.
✅ Jongwoo Park will introduce LVNet - a video question answering framework with optimal strategies for keyframe selection and sequence-aware captioning (illustrative sketch below).
Connect with Jongwoo: / jongwpark
LVNet: github.com/jon...
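The talk walks through LVNet's pipeline of temporal scene clustering, coarse keyframe detection, and a fine keyframe detector (see the timestamps below). The snippet here is only a toy illustration of that general keyframe-selection idea using placeholder embeddings and similarity thresholds; it is not LVNet's actual implementation.

```python
# Toy sketch: temporal scene clustering -> coarse candidates -> question-aware fine pick.
# Features, thresholds, and scoring are placeholder assumptions for illustration only.
import numpy as np

def temporal_scene_clustering(frame_feats, threshold=0.85):
    """Group consecutive frames into scenes wherever frame-to-frame similarity drops."""
    scenes, start = [], 0
    for t in range(1, len(frame_feats)):
        sim = frame_feats[t] @ frame_feats[t - 1] / (
            np.linalg.norm(frame_feats[t]) * np.linalg.norm(frame_feats[t - 1]) + 1e-8)
        if sim < threshold:                      # similarity dropped -> scene boundary
            scenes.append(range(start, t))
            start = t
    scenes.append(range(start, len(frame_feats)))
    return scenes

def coarse_keyframes(frame_feats, scenes):
    """Pick one representative frame per scene (closest to the scene centroid)."""
    picks = []
    for scene in scenes:
        feats = frame_feats[list(scene)]
        centroid = feats.mean(axis=0)
        picks.append(scene[int(np.argmin(np.linalg.norm(feats - centroid, axis=1)))])
    return picks

def fine_keyframes(frame_feats, candidates, question_feat, k=4):
    """Re-rank coarse candidates by relevance to the question, keep temporal order."""
    scores = [frame_feats[i] @ question_feat for i in candidates]
    top = np.argsort(scores)[::-1][:k]
    return sorted(candidates[i] for i in top)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(5, 512))             # five distinct "scenes"
    frames = np.repeat(base, 60, axis=0) + 0.05 * rng.normal(size=(300, 512))
    question = rng.normal(size=512)              # stand-in for an encoded question
    scenes = temporal_scene_clustering(frames)
    coarse = coarse_keyframes(frames, scenes)
    final = fine_keyframes(frames, coarse, question)
    print(f"{len(scenes)} scenes -> {len(coarse)} coarse -> keyframes {final}")
```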
✅ Kumara Kahatapitiya will present LangRepo - a Language Repository for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation (illustrative sketch below).
Follow Kumara: www3.cs.stonyb...
LangRepo: github.com/kka...
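LangRepo is built around write-to-repo and read-from-repo operations (covered in the talk). The class below is a deliberately simplified, all-textual illustration of that interface: naive word-overlap deduplication stands in for LLM-based rewriting on write, and overlap ranking stands in for retrieval on read. It is not the paper's implementation.

```python
# Simplified sketch of a "language repository": concise textual entries with
# write-to-repo (condense/prune) and read-from-repo (retrieve relevant text).
class LanguageRepository:
    def __init__(self, dedup_overlap: float = 0.6):
        self.entries: list[str] = []          # human-readable text entries
        self.dedup_overlap = dedup_overlap

    @staticmethod
    def _overlap(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, min(len(wa), len(wb)))

    def write(self, captions: list[str]) -> None:
        """write-to-repo: keep only captions that add new information."""
        for cap in captions:
            if all(self._overlap(cap, e) < self.dedup_overlap for e in self.entries):
                self.entries.append(cap)

    def read(self, query: str, k: int = 3) -> list[str]:
        """read-from-repo: return the k entries most related to the query."""
        ranked = sorted(self.entries, key=lambda e: self._overlap(query, e), reverse=True)
        return ranked[:k]

if __name__ == "__main__":
    repo = LanguageRepository()
    repo.write([
        "A person chops vegetables on a cutting board.",
        "The person chops vegetables on the cutting board.",   # near-duplicate, pruned
        "The person places a pan on the stove.",
    ])
    print(repo.read("What is the person cooking with?"))
```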
✅ Kanchana Ranasinghe will discuss MVU - an LLM-based framework for long-video question answering benchmarks that uncovers multiple surprising results (illustrative sketch below).
Follow Kanchana: kahnchana.gith...
MVU: kahnchana.gith...
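One component of the talk is likelihood selection (see the 51:18 timestamp): each multiple-choice option is scored by the log-likelihood a language model assigns to it given the prompt, and the highest-scoring option is selected. The snippet below is a minimal sketch of that generic technique with a small stand-in model and a made-up prompt format; MVU's actual models, prompts, and video inputs differ.

```python
# Minimal sketch of likelihood selection for multiple-choice QA with a causal LM.
# "gpt2" and the prompt format are placeholders, not MVU's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def answer_log_likelihood(prompt: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    targets = full_ids[0, prompt_len:]
    return sum(log_probs[p, t].item() for p, t in zip(positions, targets))

def likelihood_selection(prompt: str, options: list[str]) -> str:
    """Pick the option the model finds most likely as a continuation of the prompt."""
    return max(options, key=lambda o: answer_log_likelihood(prompt, o))

if __name__ == "__main__":
    context = "Frame captions: a person kneads dough, rolls it flat, and cuts circles."
    question = "Question: What is the person most likely preparing? Answer:"
    options = ["cookies", "a salad", "a smoothie"]
    print(likelihood_selection(context + "\n" + question, options))
```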
Timestamps:
00:11 Introduction
03:05 Jongwoo starts
04:05 Why do we need keyframes in a very long-form video?
05:46 LVNet - Overview
06:58 LVNet - Detailed View
07:50 LVNet - Temporal Scene Clustering
09:32 LVNet - Coarse Keyframe Detection
11:22 LVNet - Fine Keyframe Detector
13:03 Performance on long-video reasoning - EgoSchema
14:00 Performance on long-video reasoning - NExT-QA
14:48 Performance on long-video reasoning - IntentQA
15:13 Open-ended responses with LVNet
18:40 Open-ended responses with Uniform Sampling
21:10 LVNet - Summary
24:03 Kumara starts
24:45 Observations from LLMs for long-video
26:45 LangRepo - Overview
28:06 LangRepo - Detailed View
29:20 LangRepo - write-to-repo
31:20 LangRepo - read-from-repo
31:50 Performance on long-video reasoning - EgoSchema
32:24 Performance on long-video reasoning - NExT-QA
32:47 Interesting observations
34:33 Qualitative examples - single temporal scale
35:18 Qualitative examples - multiple temporal scales
37:52 Kanchana starts
38:38 Spurious behaviors in ML models
40:37 Extending to video tasks
41:48 Multimodal language model baseline?
45:17 Findings
46:03 Let's inject video-specific information into the best naive baseline
46:52 "Object centric" video modalities
47:53 Detailed overview
51:18 Likelihood selection
51:58 Performance on EgoSchema Question Answering
52:45 Zero-shot generality
54:12 Q&A session with the speakers
01:02:36 Conclusion
Join the Multimodal Minds community to receive an invite for future webinars: / discord