Understanding the LLM Inference Workload - Mark Moyou, NVIDIA

PyTorch

6,959 views

Published 10 Feb 2025
Effectively sizing a production-grade LLM deployment requires understanding the model(s), the compute hardware, quantization and parallelization methods, KV cache budgets, input and output token length predictions, model adapter management, and much more.
- Why LLM inference is different from standard deep learning inference
- Current and future NVIDIA GPU overview: which GPU(s) for which models and why
- Understanding the importance of building inference engines
- Deep recap of the attention mechanism, along with the popular attention variants used in production
- Deep dive on the KV cache and managing KV cache budgets
- Parallelism (reducing latency): mainly tensor parallelism, but data, sequence, pipeline, and expert parallelism will be highlighted
- Quantization methods for weights, activations, and the KV cache to reduce engine sizes for more effective GPU utilization
- Increasing throughput with in-flight batching and other techniques
- Detailed performance analysis of LLM deployments, looking at time to first token, inter-token latencies, LLM deployment characterizations, and more that can help reduce deployment costs
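
As a rough illustration of the "KV cache budget" topic in the description, the sketch below estimates KV cache memory with the standard per-token accounting for a decoder-only transformer: one key and one value vector per layer and per KV head. It is not taken from the talk. The function name and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical values chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper, not from the talk).

    seq_len counts input plus generated tokens per sequence.
    bytes_per_elem is 2 for FP16/BF16 and 1 for an 8-bit KV cache.
    """
    # Factor of 2: every layer stores one K tensor and one V tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed model shape, for illustration only.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)
int8 = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16, bytes_per_elem=1)

print(f"FP16 KV cache:  {fp16 / 1024**3:.1f} GiB")
print(f"8-bit KV cache: {int8 / 1024**3:.1f} GiB")
```

Under these assumptions each token costs 128 KiB of cache in FP16, so 16 concurrent sequences of 4,096 tokens need about 8 GiB on top of the model weights, and an 8-bit KV cache halves that figure. This is why predicted input and output token lengths feed directly into deployment sizing.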

COMMENTS • 2

@kavitachavda9818 5 days ago
Excellent! You are an amazing teacher. Thank you Mark!
@balasubramaniam8697 3 months ago
Awesome Inference, Thank you Mark
