Fast Inference of Mixture-of-Experts Language Models with Offloading

  • Published Nov 7, 2024

COMMENTS • 6

  • @jacksonmatysik8007
    9 months ago +1

    I have been looking for a channel like this for ages, as I hate reading.

  • @winterclimber7520
    10 months ago +5

    Very exciting work! The speed the paper reports won't break any land speed records (2–3 tokens per second), but in my experience one of the most productive and practical applications of LLMs is prompting them with multiple-choice questions, which require only a single token (a minimal sketch of this single-token scoring appears after the comments below).
    This paper (and the provided code!) bringing GPT-3.5 levels of inference to local consumer hardware is a huge breakthrough, and I'm excited to give it a try!

  • @fernandos-bs6544
    8 months ago

    I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.

  • @ameynaik2743
    7 months ago

    I believe this is applicable only to a single request? If the active experts change across requests, you will most likely have many experts active for the various requests. Is my understanding correct? Thank you.

  • @PaulSchwarzer-ou9sw
    10 months ago

    Thanks! ❤
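
A minimal sketch of the single-token multiple-choice scoring mentioned in @winterclimber7520's comment, assuming a Hugging Face transformers MoE checkpoint with automatic offloading; the model name, prompt, and answer letters are illustrative assumptions, not taken from the video or the paper's own offloading code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed MoE checkpoint; any causal LM the hardware can hold (with offloading) works.
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # lets accelerate place/offload weights across GPU, CPU, and disk
)

prompt = (
    "Which planet is known as the Red Planet?\n"
    "A) Venus\nB) Mars\nC) Jupiter\nD) Saturn\n"
    "Answer with a single letter:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# One forward pass; only the logits at the final position matter,
# because the answer is a single token.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Score only the candidate answer letters and pick the highest-scoring one.
choices = ["A", "B", "C", "D"]
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
best = choices[int(torch.argmax(next_token_logits[choice_ids]))]
print("Model's answer:", best)
```

Because only a single token is decoded, the 2–3 tokens/s decoding rate mentioned in the comment is far less of a bottleneck than it would be for long generations, though prompt prefill still takes time.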