How Medusa Works
- Published Jul 14, 2024
- This week we cover "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", a method that uses multiple decoding heads to predict multiple subsequent tokens in parallel using a tree-based attention mechanism.
--
Get Oxen 🐂 oxen.ai/
Oxen.ai makes versioning your datasets as easy as versioning your code! Even with millions of unstructured images, we quickly handle any type of data so you can build cutting-edge AI.
--
Medusa 📜 arc.net/l/quote/vvqinsvi
Medusa Notes 📜 www.oxen.ai/blog/arxiv-dives-...
Join Arxiv Dives 🤿 oxen.ai/community
Discord 🗿 / discord
--
Chapters
0:00 Introducing Daniel Varoli from Zapata.ai
2:00 The Problem with LLMs Today
3:45 How We Can Solve These Problems
8:30 Normal vs. Speculative Architecture
14:24 Speculative Decoding Example
15:35 Introducing Medusa
16:53 Medusa’s Decoding Heads
17:32 Generating Tokens With Medusa Heads
22:30 Verifying Candidates With Medusa
24:15 What if We Mess Up?
25:09 Rejection Sampling for Accepting Candidates
29:11 Considering Many Completion Candidates at Once
31:56 Tree Attention Diagrams
40:00 How to Integrate Medusa Into an LLM
48:10 Results
Great job, Daniel! Thanks for linking to that Reddit comment.