Can TornadoVM run Matrix Multiply faster than OpenCL Native?

  • Published 31 Dec 2024

COMMENTS • 2

  • @reinerfranke5436 4 days ago

    Looking only at the metric of relative operations per second is misleading, because both GPUs and CPUs are limited far more by memory bandwidth than by compute. For large problem sizes there is, depending on the architecture, a characteristic number: the ratio of compute throughput to memory throughput (sketched below). If your problem is far away from this number, the chosen architecture is inefficient. By the way, memory bandwidth grows very slowly, as it is bound to packaging technology.
    In the field of circuit simulation, GPUs have never surpassed CPUs on the sparse-matrix part: memory latency and cache-block partitioning are a poor fit for the GPU. Model-state updates, however, with a few hundred lines of code across 100k and more instances, fit the GPU very well. Nevertheless, the whole compound process is then limited by PCIe bandwidth. An interesting new development is that upcoming APUs with powerful GPUs and shared CPU memory have surpassed older discrete CPU/GPU combos.
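
    A back-of-the-envelope sketch of that compute-to-memory ratio, in Java (the hardware peaks here are hypothetical placeholders, not measurements):

        // Roofline-style check for a dense N x N matrix multiply.
        public class RooflineSketch {
            public static void main(String[] args) {
                int n = 4096;
                double flops = 2.0 * n * n * n;           // one multiply + one add per inner step
                double bytes = 3.0 * n * n * Float.BYTES; // A and B read, C written (ideal caching)
                double intensity = flops / bytes;         // FLOPs per byte moved

                double peakGflops = 40_000.0;             // hypothetical compute peak (GFLOP/s)
                double peakGBs = 1_000.0;                 // hypothetical memory bandwidth (GB/s)
                double balance = peakGflops / peakGBs;    // machine balance (FLOP/byte)

                System.out.printf("arithmetic intensity: %.1f FLOP/byte%n", intensity);
                System.out.printf("machine balance:      %.1f FLOP/byte%n", balance);
                // intensity >> balance: compute-bound; intensity << balance: memory-bound
            }
        }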

    • @juanfumero 4 days ago +3

      Thanks for your comment and for bringing up these important considerations in GPU programming. You are correct that memory bandwidth and PCIe limitations can significantly impact performance, especially for large datasets and certain types of algorithms, such as those with irregular memory access patterns or frequent data transfers.
      However, the main focus of this video is to demonstrate how TornadoVM can achieve performance comparable to, or even exceeding, hand-optimized OpenCL code for a specific matrix multiplication implementation. As I also acknowledged in the video, this example might not be fully representative of all applications in TornadoVM, and there are further optimizations to explore, such as exploiting shared (local) memory.
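      For reference, a kernel of this kind looks roughly like the following in TornadoVM (a sketch; the exact code from the video may differ). The @Parallel annotations mark the loops that TornadoVM maps onto the device's parallel dimensions:

          import uk.ac.manchester.tornado.api.annotations.Parallel;
          import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

          public class MxM {
              // Naive dense matrix multiply over row-major n x n matrices.
              public static void mxm(FloatArray a, FloatArray b, FloatArray c, int n) {
                  for (@Parallel int i = 0; i < n; i++) {
                      for (@Parallel int j = 0; j < n; j++) {
                          float sum = 0.0f;
                          for (int k = 0; k < n; k++) {
                              sum += a.get(i * n + k) * b.get(k * n + j);
                          }
                          c.set(i * n + j, sum);
                      }
                  }
              }
          }
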
      My goal here is to showcase TornadoVM's ability to automatically apply compiler and runtime optimizations, which can simplify GPU programming and potentially lead to performance gains. While memory bandwidth and other architectural factors are crucial for overall performance, this video specifically highlights the potential benefits of TornadoVM's optimization strategies. As such, the main part of the video (minutes 14-45) measures performance using total end-to-end runtime, including kernel time, Java runtime scheduling, and data transfers (copy-outs excluded, since TornadoVM caches read-only data, as I also explained in the video). Thus, if I run TornadoVM on a discrete GPU (e.g., an NVIDIA RTX 4090), I should compare against native code running on the same discrete GPU in order to understand the performance differences between the two.
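      To make that measurement setup concrete, here is a sketch of the scheduling side, assuming the TornadoVM 1.x API: the read-only inputs are transferred once and then cached on the device (FIRST_EXECUTION), the result is copied back on every run in this sketch (the video's measurement excludes that copy), and the timer wraps the whole execute() call:

          import uk.ac.manchester.tornado.api.TaskGraph;
          import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
          import uk.ac.manchester.tornado.api.enums.DataTransferMode;
          import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

          public class MxMBenchmark {
              public static void main(String[] args) {
                  int n = 1024;
                  FloatArray a = new FloatArray(n * n);
                  FloatArray b = new FloatArray(n * n);
                  FloatArray c = new FloatArray(n * n);
                  a.init(1.0f);
                  b.init(2.0f);

                  TaskGraph tg = new TaskGraph("s0")
                      .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b) // cached read-only inputs
                      .task("t0", MxM::mxm, a, b, c, n)
                      .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

                  TornadoExecutionPlan plan = new TornadoExecutionPlan(tg.snapshot());
                  long start = System.nanoTime();
                  plan.execute(); // end-to-end: runtime scheduling + transfers + kernel
                  long end = System.nanoTime();
                  System.out.printf("end-to-end: %.3f ms%n", (end - start) / 1e6);
              }
          }
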
      Thanks again for your input!