GTC 2022 - How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA

Christopher Hollinworth

15,384 views

Published 3 Oct 2024

COMMENTS • 16

@citizensmith3074 2 years ago ⁺¹⁶
This video is pure gold: thanks so much for uploading, I've learnt so much from it. I may have to watch it several times though!!! A great overview and introduction to so many areas for further study.
@TheAIEpiphany 1 year ago ⁺¹
1) One thing that's confusing: if reading from a memory location in a different row is 3x slower than reading from a memory location in the same row, how come we get a 13x slowdown? In the worst case (if you're deliberately reading from a different row each time) one would expect a 3x slowdown?
What am I missing out on? Is it the burst mode?
2) You're using float2 type so that means your thread is loading 4 bytes (for 2 points) not 8 bytes? Which would put the 4 warps into 512B loading territory instead of the optimal 1024? -> EDIT: ok, I just saw that p1 & p2 are actually float pointers so that does make sense.
3) How can we guarantee that p1 & p2 arrays (holding the points) are adjacent, i.e. in the same physical row in memory?
Great video! The sound quality is a bit off though.
@brady1123 9 months ago ⁺²
It's 3x slower for reading a single value, but it gets worse when reading many contiguous values where the burst column read can read many values in one operation.
For example, let's say that we're reading two sets of 10 values, one set of which are all contiguous in a row, and one set that are all on different rows. And you have the three ops in the video: LOAD a row, READ a column, STORE the row back.
For the contiguous values: time = LOAD + BURST READ + STORE = 3 ops
For the disjoint values: time = (LOAD + READ + STORE)*10 = 30 ops
That's how you get the 10x speed-up.
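The arithmetic in the comment above can be written out as a toy cost model. This is only a sketch under that comment's assumptions: LOAD, READ and STORE each cost one "op", and a single burst read covers every contiguous value in the open row. The function names are hypothetical and are not taken from the talk.

```python
# Toy cost model for the row-access arithmetic in the comment above.
# Assumptions (not from the talk): each of LOAD, READ and STORE costs one
# "op", and one burst read covers all contiguous values in the open row.

LOAD = 1   # load (open) a row
READ = 1   # read one column, or one burst over contiguous columns
STORE = 1  # store the row back


def contiguous_cost(n_values: int) -> int:
    """All values share one row: load once, burst-read once, store once."""
    # n_values does not change the cost in this simplified model.
    return LOAD + READ + STORE


def disjoint_cost(n_values: int) -> int:
    """Every value lives in a different row: pay all three ops per value."""
    return n_values * (LOAD + READ + STORE)


n = 10
slowdown = disjoint_cost(n) / contiguous_cost(n)
print(contiguous_cost(n), disjoint_cost(n), slowdown)  # 3 30 10.0
```

In this model the slowdown equals the number of values read, so it is not capped at 3x; real hardware has a finite burst length, which this sketch ignores.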
@steveHoweisno1 1 year ago
Excellent. For the matrix multiply, you’re reusing the same row multiple times but the columns would have to be loaded in every time. So how do you increase compute intensity of the columns?
@webgpu 10 months ago
Christopher, do you think the long time it takes for ram to be accessed could be decreased by embedding a basic cpu in those ram modules?
@christopherhollinworth7405 10 months ago ⁺¹
Good question, I don't know!
@GeorgePaul82 7 months ago
Is there a chance you can do a video about why AMD's version isn't as good as NVIDIA's?
@christopherhollinworth7405 7 months ago
I've not got an AMD gfx card. ZLUDA means it does not really matter: www.phoronix.com/review/radeon-cuda-zluda
@ryderbrooks1783 6 months ago
AMD's issue is tooling and the general software ecosystem. The hardware is reasonably close.
@codingmachine2817 1 year ago ⁺³
33:10 FlashAttention proved this wrong
@brady1123 9 months ago
"Occupancy is the most powerful tool that you have for tuning a program. **Once you're doing your best for memory access patterns** there's pretty much no algorithmic optimization that you can do that'll speed your program up by as much as 33%"
I thought FlashAttention's major contribution was optimizing memory access patterns, namely reducing the number of HBM loads/stores.
@ChimiChanga1337 7 months ago
can you please explain this a bit more? I'm trying to teach myself flash attention's cuda code.
@TaklaciPidgeotto 2 months ago
He literally said "Once you are doing your best for memory access patterns", and Flash Attention is a MEMORY ACCESS algorithm: it reduces the number of accesses to GPU HBM.
@dGooddBaddUgly 6 months ago ⁺³
Looks like Intel is out of the question here.
@christopherhollinworth7405 5 months ago
They are getting better in terms of energy efficiency and performance: www.cnbc.com/2024/04/09/intel-unveils-gaudi-3-ai-chip-as-nvidia-competition-heats-up-.html
@SavageBits 2 months ago
@christopherhollinworth7405 The limitation of Gaudi is that it is a less flexible, fixed-function matrix math accelerator. The general-purpose compute engine in the Hopper/Blackwell architecture can better support rapidly evolving LLM algos. Another issue is interconnect bandwidth: NVLINK5 absolutely crushes PCIE5.
