There seems to be a mistake in the cost estimate at 21:53. It uses the price for the A10 but the throughput of the H100. I believe the actual cost estimate would be $48, not $15.
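For what it's worth, here is a back-of-the-envelope sketch of why that mix-up matters. The token count, throughputs, and hourly prices below are placeholders I made up, not the video's figures; only the structure of the calculation is the point.

```python
# Serving cost is roughly (tokens to generate) / (tokens per second) * (GPU price per hour),
# so pairing one GPU's hourly price with another GPU's throughput skews the estimate.

def cost_usd(total_tokens, tokens_per_sec, price_per_hour):
    """Dollar cost to generate total_tokens at a given throughput and hourly price."""
    hours = total_tokens / tokens_per_sec / 3600
    return hours * price_per_hour

# Placeholder values, purely illustrative:
TOTAL_TOKENS = 1_000_000_000                              # 1B output tokens
A10  = dict(tokens_per_sec=1_000,  price_per_hour=1.0)    # hypothetical A10 figures
H100 = dict(tokens_per_sec=10_000, price_per_hour=4.0)    # hypothetical H100 figures

mixed = cost_usd(TOTAL_TOKENS, H100["tokens_per_sec"], A10["price_per_hour"])   # A10 price, H100 throughput
h100  = cost_usd(TOTAL_TOKENS, H100["tokens_per_sec"], H100["price_per_hour"])  # consistent H100 numbers
print(f"mixed estimate: ${mixed:,.0f}   consistent H100 estimate: ${h100:,.0f}")
```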
This is gold!!!
This is awesome. Thanks for sharing, super useful.
The math around 6:50 for the A100 batch size isn't working out. It would be great if the values used to calculate the 400 batch size were provided.
Based on the equations given for compute time and model load time, the point of intersection is FLOPS / (2 * MemoryBandwidth), NOT (2 * FLOPS) / MemoryBandwidth as shown in the video.
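For anyone who wants to redo it, here is a minimal sketch of the napkin math as I understand it. The parameter count, peak FLOPs, bandwidth, and the "~2 FLOPs per parameter per token" / "fp16 weights streamed once per step" assumptions are all mine, not necessarily the talk's. Note that the factor of 2 moves depending on whether the memory term counts weight bytes or raw parameters, which may be where the disagreement comes from.

```python
# Hedged napkin math for the compute-bound vs memory-bound crossover batch size.
# My assumptions (not necessarily the video's): ~2 FLOPs per parameter per
# generated token, and the fp16 weights streamed from HBM once per decode step.

P     = 7e9      # parameters (Mistral-7B-ish)
FLOPS = 312e12   # A100 peak bf16/fp16 dense, FLOP/s
BW    = 2.0e12   # A100 HBM bandwidth, bytes/s (~2 TB/s)
BYTES_PER_PARAM = 2   # fp16/bf16 weights

def compute_time(batch):      # seconds per decode step if compute-bound
    return 2 * P * batch / FLOPS

def load_time():              # seconds per decode step if memory-bound
    return BYTES_PER_PARAM * P / BW

# Setting the two equal: 2*P*B/FLOPS = BYTES_PER_PARAM*P/BW
#   =>  B* = BYTES_PER_PARAM * FLOPS / (2 * BW)
b_star = BYTES_PER_PARAM * FLOPS / (2 * BW)
assert abs(compute_time(b_star) - load_time()) < 1e-12
print(f"B* ~= {b_star:.0f}")  # ~156 with these numbers; counting parameters
                              # instead of bytes in the memory term halves it.
```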
I believe it was just a piece of napkin math: in reality he didn't account for the KV cache at all in the P / memory-bandwidth line, and the cache is a function of sequence length. That seems like the biggest approximation error I see here.
For the second line he discounted the attention FLOPs and used just the MLP FLOPs; the error of that approximation grows with sequence length and depends on the model size, e.g. for a 7B model with a long sequence that term might actually be important (rough numbers in the sketch below).
Additionally, the peak FLOPs is a function of the data type and the operation you're executing; he's assuming bf16/fp16, which is what Mistral 7B uses, and that gives you ~312 TFLOP/s on an A100.
All in all this is useful if you understand exactly the assumptions he's making.
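To put rough numbers on the attention-FLOPs point, here is a sketch under assumptions I'm adding (a 7B-ish dense model with d_model = 4096 and 32 layers, ~2 FLOPs per weight per token). It compares the per-token attention score/value FLOPs, which grow with context length, to the weight FLOPs, which don't.

```python
# Rough sketch (my assumptions, not the talk's) of when the attention term
# the napkin math drops starts to rival the weight FLOPs for a 7B-ish model.

P        = 7e9    # parameters
D_MODEL  = 4096   # hidden size (Mistral-7B-like)
N_LAYERS = 32

def weight_flops_per_token():
    return 2 * P                      # matmuls against the weights: ~2 FLOPs per param

def attn_flops_per_token(context_len):
    # QK^T and softmax(QK^T)V against the cached context: ~4*d_model FLOPs per
    # cached token per layer (roughly independent of GQA, which mainly shrinks the KV cache).
    return 4 * N_LAYERS * D_MODEL * context_len

for s in (512, 4096, 16384, 32768):
    ratio = attn_flops_per_token(s) / weight_flops_per_token()
    print(f"context {s:6d}: attention FLOPs ~= {ratio:4.0%} of weight FLOPs")
```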
@TheAIEpiphany Yes, I was looking for KV cache as well. Your explanation makes sense.
Great talk! Is there a link to the slides for this talk?
Guys, do you have a SQL AI inference model? I've been checking around but can't seem to find any.
It's possible that I'm misunderstanding, but given our use of a fairly large key-value cache (2 GB multiplied by the batch size), can we still assert that the memory-bandwidth term is driven solely by the model's weights?
The KV cache's size follows directly from the size of the attention layers, which in turn is proportional to the total number of model weights.
So the model weights still proportionally determine the KV cache size, hence the statement.
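A rough sizing sketch, with assumptions I'm adding (fp16 cache, 32 layers, head dim 128, either 32 KV heads for plain multi-head attention or 8 for Mistral-style GQA). The 4k-token MHA case lands near the 2 GB figure mentioned above.

```python
# Rough KV-cache sizing under assumed 7B-ish model dimensions; not the talk's exact numbers.

N_LAYERS, HEAD_DIM, BYTES = 32, 128, 2   # fp16/bf16 -> 2 bytes per element

def kv_cache_bytes(seq_len, n_kv_heads):
    # K and V, per layer, per cached token
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES * seq_len

GiB = 1024**3
print(f"MHA (32 KV heads), 4096 tokens: {kv_cache_bytes(4096, 32)/GiB:.1f} GiB per sequence")
print(f"GQA ( 8 KV heads), 4096 tokens: {kv_cache_bytes(4096, 8)/GiB:.1f} GiB per sequence")
# Multiply by the batch size for the total cache -- the "2 GB x batch size"
# term in the question above.
```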
Hi, what benchmark did he run to generate the plots? Any open-source GitHub links?
@5:40 Why do we need to load the entire model all the time? Can't we just load it once? If so, we might lower the need for memory movement, and the intersection would shift left.
I guess "memory movement" mean movement from GPU memory(HBM) to GPU computing component.
Model parameter stored in GPU memory not in compute component. So for computing model parameter moved from HBM to compute component every forward pass.
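To attach a rough number to that (my assumptions: ~14 GB of fp16 weights for a 7B model and ~2 TB/s of HBM bandwidth on an A100-class card):

```python
# Rough per-step cost of re-reading the weights from HBM (assumed numbers).
WEIGHT_BYTES = 7e9 * 2    # 7B params in fp16 ~= 14 GB
HBM_BW       = 2.0e12     # ~2 TB/s

step_ms = WEIGHT_BYTES / HBM_BW * 1e3
print(f"~{step_ms:.0f} ms just to stream the weights once per forward pass")
# Every request in a batch shares that same read, which is why batching
# amortizes the memory movement rather than eliminating it.
```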
Yes, it needs to be loaded from GPU memory every time. Advanced users optimize their applications by moving as many bytes per cycle as the memory system allows, to keep memory utilization as high as possible.
The right continuous profiling solution can help you find B* (7:23) with much less effort. 18:23 is where the power of low-level tracing with eBPF comes in; otherwise, the performance overhead is simply too high.
Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
Great talk!
Is that Ryan Gosling ❤
What a horrible unethical response on the ethics of training data