Looking only at relative operations per second is misleading, because GPUs as well as CPUs run into memory bandwidth limits long before they run into compute limits. For large problem sizes, each architecture has a characteristic number: the ratio of compute throughput to memory throughput. If your problem is far away from this number, the chosen architecture is inefficient. Btw, memory bandwidth grows very slowly, as it is bound by packaging technology.
In the field of circuit simulation, GPUs never come out ahead for the sparse matrix part: memory latency dominates, and cache block partitioning is unfit for GPUs. The model state update, on the other hand, with a few hundred lines of code applied to 100k or more instances, fits the GPU very well. Nevertheless, the whole combined process is then limited by PCIe bandwidth. There is an interesting new development here: upcoming APUs with powerful GPUs and shared CPU memory have surpassed older CPU/GPU combos.
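The "characteristic number" in the comment above is often called machine balance, and it can be put into a rough roofline-style estimate. The sketch below is only illustrative: the class and method names are made up for this example, the hardware figures (80 TFLOP/s, 1 TB/s device memory, 32 GB/s host link) are hypothetical round numbers rather than measurements, and the intensity formulas assume ideal caching.

```java
// Roofline-style estimate: compare a kernel's arithmetic intensity (FLOP/byte)
// with the machine balance of the device. All hardware figures are assumptions.
final class Roofline {

    /** Machine balance: FLOPs the device can execute per byte it can move. */
    static double machineBalance(double peakFlops, double bytesPerSecond) {
        return peakFlops / bytesPerSecond;
    }

    /** Attainable throughput is capped by compute or by intensity * bandwidth. */
    static double attainableFlops(double intensity, double peakFlops, double bytesPerSecond) {
        return Math.min(peakFlops, intensity * bytesPerSecond);
    }

    /** Dense n x n FP32 matmul with ideal reuse: 2n^3 FLOPs over three matrices of 4n^2 bytes. */
    static double denseMatmulIntensity(int n) {
        return (2.0 * n * n * n) / (3.0 * 4.0 * n * n);
    }

    /** CSR sparse matrix-vector product: 2 FLOPs per nonzero, at least 8 bytes
     *  (FP32 value + 32-bit column index) read per nonzero; vector traffic ignored. */
    static double csrSpmvIntensity() {
        return 2.0 / 8.0;
    }

    public static void main(String[] args) {
        double peak = 80e12; // assumed FP32 peak, FLOP/s
        double mem = 1e12;   // assumed device memory bandwidth, bytes/s
        double pcie = 32e9;  // assumed host link bandwidth, bytes/s

        System.out.printf("machine balance: %.1f FLOP/byte%n", machineBalance(peak, mem));
        System.out.printf("dense matmul (n=1024): %.1f FLOP/byte, attainable %.2e FLOP/s%n",
                denseMatmulIntensity(1024),
                attainableFlops(denseMatmulIntensity(1024), peak, mem));
        System.out.printf("CSR SpMV from device memory: attainable %.2e FLOP/s%n",
                attainableFlops(csrSpmvIntensity(), peak, mem));
        System.out.printf("CSR SpMV streamed over the host link: attainable %.2e FLOP/s%n",
                attainableFlops(csrSpmvIntensity(), peak, pcie));
    }
}
```

Under these assumptions a large dense matmul sits above the machine balance and can be compute-bound, while a sparse matrix-vector product sits far below it and is capped by bandwidth, which matches the point about the sparse matrix part and the PCIe link.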
Thanks for your comment and for bringing up these important considerations in GPU programming. You are correct that memory bandwidth and PCIe limitations can significantly impact performance, especially for large datasets and certain types of algorithms such as the ones requiring irregular memory access or those requiring frequent data transfers.
However, the main focus of this video is to demonstrate how TornadoVM can achieve performance comparable to, or even exceeding, hand-optimized OpenCL code for a specific matrix multiplication implementation. As I mentioned in the video, this example might not be fully representative of all TornadoVM applications, and there are further optimizations to explore, such as exploiting shared memory (or local memory).
My goal here is to showcase TornadoVM's ability to automatically apply compiler and runtime optimizations, which can simplify GPU programming and potentially lead to performance gains. While memory bandwidth and other architectural factors are crucial for overall performance, this video specifically highlights the potential benefits of TornadoVM's optimization strategies. As such, the main part of the video (minutes 14-45) measures performance using the total end-to-end runtime, including kernel time, Java runtime scheduling, and data transfers (copy outs excluded, as TornadoVM caches read-only data, as I also explained in the video). Thus, if I run TornadoVM on a discrete GPU (e.g., NVIDIA RTX 4090), I should compare against native code running on the same discrete GPU in order to understand the performance differences between the two.
Thanks again for your input!
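For readers who want to try an end-to-end measurement like the one described in the reply, here is a minimal sketch of a wall-clock harness. It is not the harness used in the video: `EndToEndTimer`, `medianNanos`, and the plain-Java `matmul` stand-in are hypothetical names for this example, and no TornadoVM or OpenCL calls are shown. In a real comparison the workload passed in would be whatever launches the accelerated version, including its data transfers.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Generic end-to-end timer: warm up first so JIT compilation and one-time
// setup do not dominate, then report the median of several timed runs.
final class EndToEndTimer {

    static <T> long medianNanos(Supplier<T> workload, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            workload.get();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            T result = workload.get();
            samples[i] = System.nanoTime() - start;
            // Touch the result so the call cannot be optimized away.
            if (result == null) {
                throw new IllegalStateException("workload returned null");
            }
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    /** Stand-in workload: naive row-major n x n matrix multiplication on the CPU. */
    static float[] matmul(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 256;
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        Arrays.fill(a, 1f);
        Arrays.fill(b, 2f);
        long nanos = medianNanos(() -> matmul(a, b, n), 5, 11);
        System.out.printf("median end-to-end time: %.3f ms%n", nanos / 1e6);
    }
}
```

Timing both the managed and the native version through the same kind of harness, on the same device, keeps the comparison like for like.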