Your hardware is good, and close to optimal I think. But to get a longer response you need to change the model settings in LM Studio (the max tokens per response). By now you have probably already discovered it yourself or with the help of Google. I was surprised that you tried such small models, which can run even on (your) CPU at an acceptable speed of about 8-12 tokens/sec (that's what I get on my 2699v3 or even 2670v3). But on two P40s with 24 GB of VRAM each I would expect you to run smarter, much bigger models, around 36 GB. Their responses will be better, but slower. If both P40s together can do at least 24 tokens/sec, that would be nice.
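In case it helps others find that setting: besides the slider in the LM Studio UI, the LM Studio local server speaks an OpenAI-compatible API, so you can set the response-length cap per request. A minimal sketch, assuming the default port 1234 and a placeholder model name:

# Minimal sketch: raise the response-length cap when calling
# LM Studio's OpenAI-compatible local server (default port 1234).
# The model name is a placeholder; LM Studio answers with whatever
# model is currently loaded.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # placeholder
        "messages": [{"role": "user", "content": "Explain GPU offloading."}],
        "max_tokens": 2048,  # the "tokens to respond" limit
    },
)
print(resp.json()["choices"][0]["message"]["content"])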
That unit is still a work in progress, and I'm looking forward to learning with it. Thanks for watching :)
You may have to change the settings in Windows Task Manager to show CUDA; by default it doesn't.
ME FIRST !!! WOO HOOO !!!
As far as I know, GPUs, as implied by their name, are meant to receive graphics-rendering tasks from the CPU, in order to free up the CPU to do regular computational tasks. GPUs are NOT meant or designed to do regular, non-graphics-related tasks. And yet, over and over I see people, you included, Joe (for example, with these Tesla cards), installing GPUs in their server for the purpose of increasing the computational, non-graphical capacity of their server. Does this even work? Won't the CPU offload only graphical tasks to the GPU?
LLMs do not strictly require video cards to perform their computations; they can run on a CPU alone. However, video cards can provide a large amount of extra computational power that dramatically improves their speed.
To achieve this, LLM runtimes use a technique called "GPU computing": certain computational tasks, mostly large matrix multiplications, are offloaded from the CPU to the graphics card, which is designed to handle exactly that kind of highly parallel math.
By using GPU computing, LLMs can significantly reduce the time required to perform their calculations, which means faster, more responsive generation. So while video cards may not be necessary for the operation of LLMs, they are a valuable tool for improving their performance. Thanks for watching.
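To make the offloading concrete: runtimes like llama.cpp (which LM Studio builds on) let you choose how many transformer layers live on the GPU versus the CPU. A minimal sketch using the llama-cpp-python bindings, with a hypothetical local GGUF file path:

# Minimal sketch of CPU -> GPU offloading with llama-cpp-python.
# The model path is hypothetical; point it at any local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU; 0 keeps it all on the CPU
    n_ctx=4096,       # context window size
)

out = llm("Q: Why do LLMs run faster on GPUs? A:", max_tokens=128)
print(out["choices"][0]["text"])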
The P40 has CUDA capabilities specifically designed for deep learning. Also, I wouldn't use the P40 for graphical computing.
GPUs are also general purpose compared to FPGA cards, which are further specialized for a particular AI workload. They are used for a variety of advanced computational tasks. However, specialized chips aren't currently popular because these LLM architectures change so often that by the time you fabricate the chips, the architecture might already be obsolete. A GPU lets us change the architecture, at some cost in performance.
GPUs are used for gaming and graphics, but by the time work is sent to the GPU, it's essentially just the mathematical computation needed to calculate each triangle. We use a GPU over a CPU because it can perform many more simultaneous operations.
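A quick way to see those "many more simultaneous operations" for yourself: time the same large matrix multiply (the core operation behind both triangle math and transformer layers) on the CPU and on a CUDA GPU. A rough sketch using PyTorch:

# Rough sketch: the same large matrix multiply on CPU vs. GPU.
# On a CUDA-capable card the GPU run is typically far faster,
# because thousands of cores work on the output in parallel.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.perf_counter()
torch.matmul(a, b)
print(f"CPU: {time.perf_counter() - t0:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()  # wait for the transfers to finish
    t0 = time.perf_counter()
    torch.matmul(a_gpu, b_gpu)
    torch.cuda.synchronize()  # wait for the kernel to finish
    print(f"GPU: {time.perf_counter() - t0:.3f}s")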
@hammadusmani7950 Thanks for the information.