Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

  • Published Feb 10, 2025
  • LLM inference is not your typical deep learning model deployment, nor is it trivial to manage at scale for performance and cost. Effectively sizing a production-grade LLM deployment requires an understanding of the model(s), the compute hardware, quantization and parallelization methods, KV cache budgets, input and output token length predictions, model adapter management, and much more.
    If you want to deeply understand these topics and their effects on LLM inference cost and performance, you will enjoy this talk.
    This talk will cover the following topics:
    Why LLM inference is different from standard deep learning inference
    Current and future NVIDIA GPU overview - which GPU(s) for which models and why
    Understanding the importance of building inference engines
    Deep recap of the attention mechanism, along with the popular attention variants used in production
    Deep dive on the KV cache and managing KV cache budgets to increase throughput per model deployment (see the sizing sketch after this list)
    Parallelism (reducing latency) - mainly tensor parallelism, but data, sequence, pipeline, and expert parallelism will also be highlighted
    Quantization methods for weights, activations, and the KV cache to reduce engine sizes for more effective GPU utilization
    Increasing throughput with in-flight batching and other techniques
    Detailed performance analysis of LLM deployments, looking at time to first token (TTFT), inter-token latency (ITL), LLM deployment characterization, and more that can help reduce deployment costs (a simple latency-and-cost model follows below)
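    To make the KV cache budget concrete, here is a minimal back-of-envelope sizing sketch. The Llama-3-8B-like model shape and the helper name kv_cache_bytes are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope KV cache sizing -- illustrative assumptions, not from the talk.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache footprint: 2 (one K and one V vector) * layers *
    kv_heads * head_dim * bytes per element, per token, per sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Assumed Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_seq = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1)
print(f"KV cache per 8k-token sequence: {per_seq / 2**30:.2f} GiB")   # 1.00 GiB

# Memory left after loading the engine bounds how many sequences run concurrently.
budget_gib = 40   # assumed free memory after weights on an 80 GiB GPU
print("max concurrent 8k sequences:", (budget_gib * 2**30) // per_seq)   # 40
```

    Note that quantizing the KV cache to FP8 (bytes_per_elem=1) halves the per-token footprint, roughly doubling the number of concurrent sequences under the same memory budget.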
    The main inference engine referenced in the talk is TRT-LLM, paired with the open-source inference server NVIDIA Triton.
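    As a rough illustration of the latency metrics above, end-to-end request latency decomposes into TTFT (the prefill phase) plus one ITL per remaining generated token. A minimal sketch with made-up numbers; the latencies, concurrency, and GPU price below are assumptions, not measurements from the talk:

```python
# Rough latency and cost model for one LLM request -- all numbers are assumed.

def e2e_latency_s(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """End-to-end latency = time to first token + ITL for each remaining token."""
    return ttft_s + itl_s * (output_tokens - 1)

ttft, itl, out_toks = 0.35, 0.025, 512      # assumed: 350 ms TTFT, 25 ms/token
lat = e2e_latency_s(ttft, itl, out_toks)
print(f"end-to-end latency:  {lat:.2f} s")                # ~13.13 s
print(f"per-user throughput: {out_toks / lat:.1f} tok/s")  # ~39 tok/s

# With in-flight batching, many streams share the GPU, so cost per token falls
# roughly in proportion to aggregate throughput.
gpu_dollars_per_hr = 4.00                    # assumed GPU rental price
fleet_tok_per_s = 40 * (1.0 / itl)           # assume 40 concurrent streams
cost_per_m_tok = gpu_dollars_per_hr / (fleet_tok_per_s * 3600) * 1e6
print(f"cost per 1M output tokens: ${cost_per_m_tok:.2f}")   # ~$0.69
```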
    Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at www.ai.enginee... & join us at the AI Engineer World's Fair in 2025! Get your tickets today at ai.engineer/2025
    About Mark
    Dr. Mark Moyou is a Senior Data Scientist at NVIDIA on the Retail team, focused on enabling scalable machine learning for the nation's top retailers. Before NVIDIA, he was a Data Science Manager in the Professional Services division at Lucidworks, an enterprise search and recommendations company. Prior to Lucidworks, he was a founding Data Scientist at Alstom Transportation, where he applied data science to the US railroad industry. Mark holds a PhD and MSc in Systems Engineering and a BSc in Chemical Engineering. On the side, Mark hosts The AI Portfolio Podcast, The Caribbean Tech Pioneers, and the Progress Guaranteed Podcast, and is Director of the Southern Data Science Conference in Atlanta.

COMMENTS • 7

  • @SamBeera
    @SamBeera 1 month ago +1

    Great presentation, Dr. Moyou. You broke down the complex theory and math into visuals that explain the under-the-hood activity in simple terms. Loved it

  • @mindfuel-ness
    @mindfuel-ness 20 days ago

    This channel is a godsend ❤

  • @IkechiGriffith
    @IkechiGriffith 1 month ago +2

    🇹🇹🇹🇹🇹🇹. Great talk and great breakdown at the start

  • @himanshusamariya9810
    @himanshusamariya9810 24 days ago

    Great presentation
    Cleared up many things on inference

  • @anshulgupta4
    @anshulgupta4 4 hours ago

    Can we get the slides for this presentation?

  • @ricardofonseca7810
    @ricardofonseca7810 1 month ago +1

    Sluggish