Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

  • Published Feb 10, 2025
  • LLM inference is not your typical deep learning model deployment, nor is it trivial to manage at scale for performance and cost. Effectively sizing a production-grade LLM deployment requires an understanding of the model(s), the compute hardware, quantization and parallelization methods, KV cache budgets, input and output token length predictions, model adapter management, and much more.
    If you want to deeply understand these topics and their effects on LLM inference cost and performance, you will enjoy this talk.
    This talk will cover the following topics:
    Why LLM inference is different from standard deep learning inference
    Current and future NVIDIA GPU overview - which GPU(s) for which models and why
    Understanding the importance of building inference engines
    Deep recap of the attention mechanism, along with the popular attention variants used in production
    Deep dive on the KV cache and managing KV cache budgets to increase throughput per model deployment (see the sizing sketch after this list)
    Parallelism (reducing latency) - mainly tensor parallelism, but data, sequence, pipeline, and expert parallelism will also be highlighted
    Quantization methods for weights, activations, and the KV cache to reduce engine sizes for more effective GPU utilization
    Increasing throughput with in-flight batching and other techniques
    Detailed performance analysis of LLM deployments, looking at time to first token (TTFT), inter-token latency (ITL), LLM deployment characterization, and more that can help reduce deployment costs (a simple latency-and-cost model follows below)
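    To make the KV cache budget concrete, here is a minimal back-of-envelope sizing sketch. The Llama-3-8B-like model shape and the helper name kv_cache_bytes are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope KV cache sizing -- illustrative assumptions, not from the talk.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache footprint: 2 (one K and one V vector) * layers *
    kv_heads * head_dim * bytes per element, per token, per sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Assumed Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_seq = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1)
print(f"KV cache per 8k-token sequence: {per_seq / 2**30:.2f} GiB")   # 1.00 GiB

# Memory left after loading the engine bounds how many sequences run concurrently.
budget_gib = 40   # assumed free memory after weights on an 80 GiB GPU
print("max concurrent 8k sequences:", (budget_gib * 2**30) // per_seq)   # 40
```

    Note that quantizing the KV cache to FP8 (bytes_per_elem=1) halves the per-token footprint, roughly doubling the number of concurrent sequences under the same memory budget.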
    The main inference engine referenced in the talk is TRT-LLM, paired with the open-source inference server NVIDIA Triton.
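    As a rough illustration of the latency metrics above, end-to-end request latency decomposes into TTFT (the prefill phase) plus one ITL per remaining generated token. A minimal sketch with made-up numbers; the latencies, concurrency, and GPU price below are assumptions, not measurements from the talk:

```python
# Rough latency and cost model for one LLM request -- all numbers are assumed.

def e2e_latency_s(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """End-to-end latency = time to first token + ITL for each remaining token."""
    return ttft_s + itl_s * (output_tokens - 1)

ttft, itl, out_toks = 0.35, 0.025, 512      # assumed: 350 ms TTFT, 25 ms/token
lat = e2e_latency_s(ttft, itl, out_toks)
print(f"end-to-end latency:  {lat:.2f} s")                # ~13.13 s
print(f"per-user throughput: {out_toks / lat:.1f} tok/s")  # ~39 tok/s

# With in-flight batching, many streams share the GPU, so cost per token falls
# roughly in proportion to aggregate throughput.
gpu_dollars_per_hr = 4.00                    # assumed GPU rental price
fleet_tok_per_s = 40 * (1.0 / itl)           # assume 40 concurrent streams
cost_per_m_tok = gpu_dollars_per_hr / (fleet_tok_per_s * 3600) * 1e6
print(f"cost per 1M output tokens: ${cost_per_m_tok:.2f}")   # ~$0.69
```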
    Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at www.ai.enginee... & join us at the AI Engineer World's Fair in 2025! Get your tickets today at ai.engineer/2025
    About Mark
    Dr. Mark Moyou is a Senior Data Scientist at NVIDIA on the Retail team, focused on enabling scalable machine learning for the nation's top retailers. Before NVIDIA, he was a Data Science Manager in the Professional Services division at Lucidworks, an enterprise search and recommendations company. Prior to Lucidworks, he was a founding Data Scientist at Alstom Transportation, where he applied data science to the US railroad industry. Mark holds a PhD and MSc in Systems Engineering and a BSc in Chemical Engineering. On the side, Mark hosts The AI Portfolio Podcast, The Caribbean Tech Pioneers, and the Progress Guaranteed Podcast, and is Director of the Southern Data Science Conference in Atlanta.

COMMENTS • 7

  • @SamBeera
    @SamBeera 1 month ago +1

    Great presentation, Dr. Moyou. You broke down the complex theory and math into visuals that explain the under-the-hood activity in simple terms. Loved it

  • @mindfuel-ness
    @mindfuel-ness 20 days ago

    This channel is a godsend ❤

  • @IkechiGriffith
    @IkechiGriffith 1 month ago +2

    🇹🇹🇹🇹🇹🇹. Great talk and great breakdown at the start

  • @himanshusamariya9810
    @himanshusamariya9810 24 days ago

    Great presentation
    Cleared up many things on inference

  • @anshulgupta4
    @anshulgupta4 4 hours ago

    Can we get the slides for this presentation?

  • @ricardofonseca7810
    @ricardofonseca7810 1 month ago +1

    Sluggish