Deep Dive: Optimizing LLM inference

  • Published 13 Oct 2024

COMMENTS • 26

  • @cybermanaudiobooks3231
    @cybermanaudiobooks3231 7 months ago +1

    Thanks Julien. Your recent series of videos have been top quality.
    A future video you might consider making is one about the different prompts required when fine-tuning. Why does llama2 differ from mistral? Left-padding for some, right-padding for others; how does trl help simplify things? What is the history of this? What is chatml? Guanaco? Etc.
    A video that builds a solid foundation for navigating this area would be helpful, to say the least!

    • @juliensimonfr
      @juliensimonfr  7 months ago +1

      Hi Cyberman, thank you for the kind words. I have a bit of an inference obsession at the moment, but I'll come back to training and fine-tuning after that. Your suggestions sound good, I'll add them to the list :)

  • @jiegong529
    @jiegong529 3 months ago

    Thanks so much for the crystal clear explanations! You understand them so well and it's even more amazing how you show them in bullet points and graphs to make your audience understand as well!

  • @mourady5588
    @mourady5588 2 months ago

    Thank you very much Julien for this high-quality excerpt!
    Could you please attach the slides in the description, as well as under the other videos?

    • @juliensimonfr
      @juliensimonfr  2 months ago

      Hi, you'll find the slides at fr.slideshare.net/slideshow/julien-simon-deep-dive-optimizing-llm-inference/270920916. I'll share the other ones in the next week or so.

    • @mourady5588
      @mourady5588 2 months ago

      @@juliensimonfr thanks a lot!

  • @sheikhshafayat6984
    @sheikhshafayat6984 1 month ago

    The explanation was excellent. Thanks a lot!

  • @alexis91459
    @alexis91459 2 months ago

    Super cool! Just one question: why is the validation step performed by the bigger model faster in speculative decoding? I don't understand how validation works.

    • @juliensimonfr
      @juliensimonfr  2 months ago +1

      Good question. The main reason is that verification by the larger model only requires a single forward pass per candidate sequence. This is much faster than the usual text generation process, which requires one forward pass per new token.
      If the larger model disagrees on a particular token, it generates a better one itself and continues from there. However, all the tokens accepted up to that point from the smaller model are used as-is. So, in the end, we get large-model generation quality, only quicker :)
      Makes sense? Here's a detailed example: huggingface.co/blog/whisper-speculative-decoding
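
      To make the verification idea above concrete, here is a minimal sketch using the transformers assisted-generation API (the `assistant_model` argument of `generate()`); the model names are placeholders, and any target/draft pair that shares a tokenizer should work:

      ```python
      # Minimal sketch of speculative (assisted) decoding with transformers.
      # Model names are placeholders; device_map="auto" assumes accelerate is installed.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      target_id = "meta-llama/Llama-2-7b-hf"            # larger target model (placeholder)
      draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # smaller draft model (placeholder)

      tokenizer = AutoTokenizer.from_pretrained(target_id)
      target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
      draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

      inputs = tokenizer("Speculative decoding works because", return_tensors="pt").to(target.device)

      # The draft model proposes a few tokens per step; the target model verifies
      # them in a single forward pass and keeps the ones it agrees with.
      outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      ```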

  • @billykotsos4642
    @billykotsos4642 3 months ago

    Very informative as always!

  • @justwest
    @justwest 7 months ago +1

    How does the big LLM handle the "predicted" tokens? I mean, how does it check whether they are good or not?

    • @juliensimonfr
      @juliensimonfr  7 months ago +1

      Detailed explanation in huggingface.co/blog/assisted-generation. In a nutshell, you can run a forward pass with the large model on a speculative sequence and retrieve the logits (i.e. the probabilities) for the next token at each position in the sequence. If a speculative token does have the highest logit, then it's the token the large model would have generated.

    • @justwest
      @justwest 7 months ago

      @@juliensimonfr Thanks a lot (for the vids in general also, of course). I think I am still missing a point here: why do you get the logits *at each position in the sequence*? Isn't the output of the model just the probabilities for the *next* token? If I wanted them *at each position*, wouldn't I have to run multiple forward passes? Thanks!
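
      Regarding the verification step described above and the follow-up question about logits at each position, here is a minimal sketch (assuming any Hugging Face causal LM; the model and the draft text are placeholders) showing that a single forward pass returns logits for every position, which is what lets all draft tokens be checked at once:

      ```python
      # A causal LM returns logits of shape [batch, seq_len, vocab_size] in one
      # forward pass, so each draft token can be compared with the model's own
      # prediction at that position. Greedy (argmax) verification for simplicity.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "gpt2"  # placeholder, any causal LM works
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(model_id)

      prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
      draft_ids = tokenizer(" Paris, a city", add_special_tokens=False, return_tensors="pt").input_ids
      sequence = torch.cat([prompt_ids, draft_ids], dim=-1)

      with torch.no_grad():
          logits = model(sequence).logits            # [1, seq_len, vocab_size]

      # Logits at position i predict the token at position i+1, so the predictions
      # for the draft tokens come from the preceding positions.
      n_draft = draft_ids.shape[-1]
      predictions = logits[0, -n_draft - 1:-1].argmax(dim=-1)
      matches = (predictions == draft_ids[0]).long()
      accepted = int(matches.cumprod(dim=0).sum())   # accept until the first mismatch
      print(f"accepted {accepted} of {n_draft} draft tokens")
      ```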

  • @RoyAAD
    @RoyAAD 5 months ago

    Very interesting. Can we batch with a single copy of the LLM, or do we need multiple copies loaded on the GPU in order to batch? And how can we estimate the throughput, say if I have a 70B model on 2 A100s?

    • @juliensimonfr
      @juliensimonfr  5 months ago +2

      This works even on a single GPU. Here's the paper if you want to dive deeper: www.usenix.org/conference/osdi22/presentation/yu. Regarding benchmarks, I suggest you run your own to find the right latency/throughput trade-off for your application. This should help: www.databricks.com/blog/llm-inference-performance-engineering-best-practices

    • @RoyAAD
      @RoyAAD 5 months ago

      @@juliensimonfr Thanks. Do you have a link that explains how to calculate the feasibility for an LLM?
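
      In the spirit of the benchmarking advice above, here is a rough latency/throughput measurement sketch against a TGI endpoint; the endpoint URL, prompt, and concurrency levels are assumptions to adapt to your own deployment:

      ```python
      # Rough latency/throughput benchmark against a (assumed) local TGI endpoint.
      import time
      from concurrent.futures import ThreadPoolExecutor
      from huggingface_hub import InferenceClient

      client = InferenceClient("http://localhost:8080")   # assumed TGI endpoint URL
      prompt = "Explain continuous batching in one paragraph."
      max_new_tokens = 128

      def one_request(_):
          start = time.perf_counter()
          client.text_generation(prompt, max_new_tokens=max_new_tokens)
          return time.perf_counter() - start

      # Sweep concurrency to see where the latency/throughput trade-off sits.
      for concurrency in (1, 4, 16, 64):
          with ThreadPoolExecutor(max_workers=concurrency) as pool:
              t0 = time.perf_counter()
              latencies = list(pool.map(one_request, range(concurrency)))
          elapsed = time.perf_counter() - t0
          tokens_per_s = concurrency * max_new_tokens / elapsed
          p50 = sorted(latencies)[len(latencies) // 2]
          print(f"concurrency={concurrency:3d}  p50 latency={p50:.2f}s  ~{tokens_per_s:.0f} tok/s")
      ```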

  • @rbrowne4255
    @rbrowne4255 6 months ago

    Thanks for the video, great job!!! In terms of speculative decoding, can you provide any additional feedback on its impact on GPU performance/memory, i.e. KV-cache usage or overall GPU memory resources?

    • @juliensimonfr
      @juliensimonfr  6 months ago

      The only overhead is the assistant model, which can share layers with the large model. For example, see huggingface.co/blog/whisper-speculative-decoding, which reports only an 8% RAM overhead.
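
      As a back-of-envelope illustration of that overhead, here is a tiny sketch assuming fp16 weights and illustrative parameter counts (layer sharing, as in the Whisper example, shrinks the draft's footprint further):

      ```python
      # Rough estimate of the extra weight memory a separate draft model adds.
      # Parameter counts are illustrative assumptions, not measurements.
      main_params = 7e9        # e.g. a 7B target model
      draft_params = 1e9       # e.g. a 1B draft model
      bytes_per_param = 2      # fp16

      main_gb = main_params * bytes_per_param / 1e9
      draft_gb = draft_params * bytes_per_param / 1e9
      print(f"target weights ≈ {main_gb:.0f} GB, draft adds ≈ {draft_gb:.0f} GB "
            f"(~{100 * draft_gb / main_gb:.0f}% overhead, before the KV cache)")
      ```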

  • @bibiworm
    @bibiworm 4 months ago +1

    Would you mind sharing the slides please Sir? Thank you!

    • @juliensimonfr
      @juliensimonfr  2 months ago +1

      Hi, you can find the slides on Slideshare at fr.slideshare.net/slideshow/julien-simon-deep-dive-model-merging/270921708

  • @徐迟-i2t
    @徐迟-i2t 4 months ago

    You explained it clearly in no time, impressive! Love from China.

  • @Gerald-xg3rq
    @Gerald-xg3rq 5 months ago

    Hi, great video! How should I set WAITING_SERVED_RATIO, MAX_BATCH_SIZE, MAX_BATCH_TOTAL_TOKENS, MAX_BATCH_PREFILL_TOKENS, etc. for the highest throughput? Looking at llama2-7b-chat and llama3-8b-instruct on an NVIDIA A10.

    • @juliensimonfr
      @juliensimonfr  5 months ago

      Hi, the doc is available at huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher. I would increase batch size and measure.
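
      A sketch of that "increase and measure" loop, assuming the text-generation-launcher binary is installed and that the env-var forms of the settings mentioned above are picked up by the launcher; the model id and values are illustrative, tune them for your A10:

      ```python
      # Launch TGI with explicit batching settings, then benchmark and iterate.
      # Env-var names follow the settings mentioned above; values are illustrative.
      import os
      import subprocess

      settings = {
          "MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",   # assumed model id
          "MAX_BATCH_PREFILL_TOKENS": "4096",
          "MAX_BATCH_TOTAL_TOKENS": "16384",
          "WAITING_SERVED_RATIO": "1.2",
      }

      # Start the server with the chosen settings...
      server = subprocess.Popen(["text-generation-launcher", "--port", "8080"],
                                env={**os.environ, **settings})
      # ...then run a load test against http://localhost:8080 (for example the
      # throughput sketch earlier in these comments), note tokens/s, stop the
      # server, raise the batch budget, and repeat.
      server.wait()
      ```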