How Medusa Works

  • Published Jul 14, 2024
  • This week we cover the paper "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", a method that uses multiple decoding heads to predict multiple subsequent tokens in parallel, using a tree-based attention mechanism.
    --
    Get Oxen 🐂 oxen.ai/
    Oxen.ai makes versioning your datasets as easy as versioning your code! Even with millions of unstructured images, we quickly handle any type of data so you can build cutting-edge AI.
    --
    Medusa 📜 arc.net/l/quote/vvqinsvi
    Medusa Notes 📜 www.oxen.ai/blog/arxiv-dives-...
    Join Arxiv Dives 🤿 oxen.ai/community
    Discord 🗿 / discord
    --
    Chapters
    0:00 Introducing Daniel Varoli from Zapata.ai
    2:00 The Problem with LLMs Today
    3:45 How we Can Solve These Problems
    8:30 Normal vs. Speculative Architecture
    14:24 Speculative Decoding Example
    15:35 Introducing Medusa
    16:53 Medusa’s Decoding Heads
    17:32 Generating Tokens With Medusa Heads
    22:30 Verifying Candidates With Medusa
    24:15 What if we Mess Up?
    25:09 Rejection Sampling For Accepting Candidates
    29:11 Considering Many Completion Candidates at Once
    31:56 Tree Attention Diagrams
    40:00 How to Integrate Medusa Into an LLM
    48:10 Results
  • Science & Technology

COMMENTS • 1

  • @420_gunna 3 days ago

    Great job daniel! Thanks for linking to that reddit comment.