What a great shop and tour! I LOVE the detail and the thought process of creating something that will last for many, many decades.
Hats off to u bro....
Well done explanation
It would also be interesting to know about the new schedulers used for dispatching ray-tracing routines (e.g. closest-hit/any-hit, which are dynamically scheduled). Are they accessible directly (or at least indirectly) from CUDA (C++) code?
I haven't tried using any of the new Volta/Turing features with CUDA. From what I understand, the RT cores can only accelerate ray-triangle intersection tests and BVH tree traversal. So you would have to write a kernel that effectively does the ray-tracing, but is scheduled partly on the RT cores. As for how to actually write such a kernel, it looks like you basically create special CUDA kernels which run under an rtcontext (the OptiX API), and that context can share memory pointers with a normal CUDA context. Also, there appears to be no way to combine the RT kernels and CUDA kernels directly, so any communication needs to be handled on the host side (one of the biggest flaws of CUDA in my opinion - the inability to run a kernel continually without direct control from the host).
The API for programming the RT cores ships alongside CUDA 10 (as part of the OptiX SDK), which is freely accessible from NVidia's website.
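If you want a rough idea of what that looks like, here is a minimal host-side sketch using the OptiX 6 C API (my own illustration, not from the video; error checking is omitted, and the PTX file name and program name are hypothetical placeholders):

```
#include <optix.h>

// Minimal sketch: the driver/RT cores handle BVH traversal and the dynamic
// scheduling of the closest-hit/any-hit programs; the host only sets up the
// context and launches the ray-generation entry point.
void launch_rt(unsigned int width, unsigned int height) {
    RTcontext ctx;
    rtContextCreate(&ctx);
    rtContextSetRayTypeCount(ctx, 1);      // one ray type (e.g. radiance)
    rtContextSetEntryPointCount(ctx, 1);   // one ray-generation entry point

    // The device-side programs are ordinary CUDA code compiled to PTX.
    RTprogram raygen;
    rtProgramCreateFromPTXFile(ctx, "raygen.ptx", "raygen_program", &raygen);
    rtContextSetRayGenerationProgram(ctx, 0, raygen);

    rtContextLaunch2D(ctx, 0, width, height);  // dispatch one ray per pixel
    rtContextDestroy(ctx);
}
```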
Hopefully that was helpful.
Great job 👍
Fantastic overview. Any chance of a follow-up with some CUDA C samples?
Thanks! I could have sworn someone else asked a similar question. The CUDA C samples can be found on NVidia's website. But the gist is (quick sketch below):
1) you have to first prime the tensor cores via wmma::load_matrix_sync
2) then you can perform the operation via wmma::mma_sync
3) you can read back the result via wmma::store_matrix_sync
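As a concrete, minimal sketch of those three steps: a single warp computing one 16x16x16 FP16 tile with the WMMA API (the pointer names and matrix layouts here are just placeholder assumptions; launch with one warp, e.g. <<<1, 32>>>):

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 half-precision tile pair and accumulates
// into float. A and C are row-major, B is column-major, leading dim 16.
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);            // clear the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);        // 1) prime the tensor cores
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);                 // 2) C += A*B
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);   // 3) read back
}
```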
Thanks for the explanation. Is there a source you can recommend about warp scheduling and SMs?
It would depend on what information you want specifically, and how far down the rabbit hole you want to go.
There's very little detail on NVidia's part though, and most of what's out there is educated speculation from the architecture community, much of which is not collected in a single source or written down, as it's considered "common knowledge" or "obvious".
Regarding warp scheduling, there's even less detail. It's mostly a resource allocation problem where entire SMs are allocated warps by the scheduling engine in the GPU (it's not clear if it's a software scheduler running on a small CPU, or a hardware scheduler that iterates over a descriptor list - my guess would be a software scheduler).
Systolic Array multiplier, like the TPU's MXU unit.
Thanks! I didn't realize that it was called a Systolic Array. The layout of a systolic array makes a little more sense, but I believe the implementation presented in the video is functionally equivalent. Additionally, both implementations have the same bandwidth limitations.
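To illustrate the "functionally equivalent" point, here's a tiny software model of a 4x4 output-stationary systolic array (my own toy sketch, not from the video): each PE does one MAC per cycle on operands skewed in from its left/top neighbors, and after 3N-2 cycles it holds the same C = A*B that a flat array of MAC units would produce.

```
#include <cstdio>

int main() {
    const int N = 4;
    float A[N][N], B[N][N], C[N][N] = {};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + 1; B[i][j] = j + 1; }

    // Skewed schedule: PE (i,j) sees A[i][k] and B[k][j] at cycle t = i + j + k.
    for (int t = 0; t < 3 * N - 2; ++t)            // total latency is 3N-2 cycles
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                int k = t - i - j;                 // operand pair arriving this cycle
                if (k >= 0 && k < N)
                    C[i][j] += A[i][k] * B[k][j];  // local MAC inside PE (i,j)
            }

    printf("C[0][0] = %g (expect 4)\n", C[0][0]);  // same result as a plain MAC array
    return 0;
}
```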
@@RTLEngineering You're welcome... Yeah, even the implementation in the video would give the same result... Anyways, thanks for the video 👍
Funny to think how we see tessellation as triangles when it’s a triangle representing a pyramid, representing points.
Typo: should be ...+ A[0,3]*B[3,0]... at 1:32
Thanks for pointing that out!
Procrastination Is All You Need: Exponent Indexed Accumulators for Floating Point, Posits and Logarithmic Numbers
A bfloat16 MAC (one addition and one multiplication per clock): ~100 LUTs + 1 DSP48E2 at >600 MHz, with the result accumulated in >256 bits.
A Tensor Core needs 64 of these => ~6,400 LUTs + 64 DSP48E2s.
It's on LinkedIn, and eventually on arXiv. YT is not letting me post more, not sure why.
Good video 😍
So why don't you just say "matrix operation core" or "matrix multiplication core"? Why make things complicated with differing terminology like "tensor"?
Probably because the association was for AI/ML workloads which work with tensors (matrices are a special case of the more general tensor object). Though I am not sure why "Tensor Core" was chosen as the name since other AI/ML architectures call them "Matrix Cores" or "MxM Cores" (for GEMM). It might just be a result of marketing.
I would say "MFU" or "Matrix Function Unit" would be the most descriptive term, but that doesn't sound as catchy.
damn
I commented on another video about it sounding like a computer speaking. This video sounds like a human, but the mic quality is much lower.
I did the actual voice recording for this video several years ago. It was a lengthy editing process, which I got tired of, causing me to stop producing videos for 2 years. The thing that got me back into producing them was the AI speech synthesis. For me, it's a tradeoff between time commitment and production value, and I don't think the increased value of recording audio is worth increasing the production time by 10x (especially considering all of the time spent researching and verifying the video material beforehand).
For normies Tensor Core = DLSS + raid tracing = BETTER GAMING
For machine Learning = Tensor Core = Better And Faster output
I see two different worlds....
I think you meant Ray Tracing. As I understand it, the Tensor Cores are not used for RT acceleration, and are only used for DLSS. DLSS is a Convolutional Neural Network (CNN) evaluation on an image (so basically a bunch of filter passes), which is what the Tensor Cores are really good at doing.
The interesting thing in terms of machine learning is that it's not clear how the architecture of the Tensor Cores is set up internally (I doubt NVidia will let us know). Though, if you look at the NVidia example code, you load the matrices first and then do the matrix-matrix multiply. So in order to get the most usage, you probably need to be doing a data-stationary or weight-stationary operation. If you need to change both data and weights, then using the FP32 units will probably yield better performance. So not necessarily faster for ML either.
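To make the "weight stationary" point concrete, here's a hedged sketch (the pointer names, tile count, and layouts are my own assumptions): the weight fragment is loaded once and reused across many activation tiles, so the load cost is amortized and the MMA units stay busy.

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Weight-stationary use of WMMA: B (the weights) stays resident in a
// fragment while a stream of A tiles is multiplied against it.
__global__ void weight_stationary(const half *A_tiles, const half *B,
                                  float *C_tiles, int num_tiles) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(b_frag, B, 16);           // load weights once
    for (int t = 0; t < num_tiles; ++t) {
        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A_tiles + t * 16 * 16, 16);  // next data tile
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C_tiles + t * 16 * 16, c_frag, 16,
                                wmma::mem_row_major);
    }
}
```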
@@RTLEngineering Hmmm... thanks, that "ray" was wrong due to predictive text. Also, I've heard the general claim that Tensor Cores will "improve deep/machine learning performance". I don't know if it's true or not, but what are your thoughts? I'm going to buy a laptop for machine/deep learning purposes, and I was deeply interested in Tensor Cores due to their supposed deep learning performance. So I'm a bit confused whether to spend some money and get an RTX card or go with a default GTX card.
Please reply. I would really like to know whether it will make any difference.
Also, I absolutely loved your video. Even though I'm not a pure computer science student and started ML as a hobby, I was able to get about 85-90% of the Tensor Core concept. Thank you.
That will entirely depend on the application - both on the underlying algorithm and on how the code is written. For example, if you are using some ML code that doesn't use the TC instructions, then the TCs won't improve anything. Similarly, if the code isn't written to take full advantage of the TCs, then they may yield no improvement at best, and could result in a reduction in performance at worst.
If the ML code uses some higher-level library like TensorFlow, then I'm not sure if the underlying kernels will take advantage of the TCs at all (I would imagine that they have added such features, but that may not be practical).
If the cost difference doesn't matter to you, I would go with an RTX card just to have the TCs if you need them / to play around with, but keep in mind that the VRAM matters too. To get the most performance, you want to be continually feeding the compute units with data; however, if you can only fit a single data set into VRAM, or if you are only processing a single data set at a time, then you may see no improvement over a GTX card. So it depends on the application. Personally, I went with RTX cards in my most recent builds so that I had the TCs and RTs to play around with, though I have yet to use them.
@@RTLEngineering Hmmm... okay, so for now, or for a few years, TCs won't have much effect on ML areas. So it's better to go with a cheaper GTX for now. Anyways, thanks for your advice.
That isn't what I said. I said that it will depend on the application being run... It's like traction control in a car: it's really only useful if you go off-road (whether you need it depends on where you plan to drive). I don't know what you plan on doing that's ML, so I can't suggest anything further... If you plan on writing your own CUDA kernels for ML, then you can make sure to make use of the TCs. If you are using someone else's code, then it depends on how they wrote their code.
Tensors and Matrices are not the same mathematical objects. There is some confusion in there
Partially correct: Matrices are a type of Rank-2 Tensor, so they are a subset. Some Tensors are Matrices, and all Matrices are Tensors, but not all Tensors are Matrices.
It would be more accurate to call it a "Matrix Core", but that doesn't sound as catchy. You could also call it a "Systolic Array Processor", but that's also not as catchy. I suspect they were thinking about the fact that you can evaluate Rank-N tensor operations with Rank-2 operations (technically you can do it with Rank-1 operations, as in SIMD). Anyway, blame the name confusion on marketing.
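(To make the Rank-N-from-Rank-2 point concrete with a worked example: a Rank-3 contraction C[i,j,k] = sum over l of A[i,j,l]*B[l,k] is, for each fixed i, just the ordinary matrix product C_i = A_i * B, so chaining Rank-2 matrix operations is enough to cover the higher-rank cases.)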