This tutorial is so underrated! Hands down the clearest and most in-depth explanation of DDP for someone who doesn't know multiprocessing in PyTorch. I came across this after watching 4-5 other videos. Strongly recommend this one.
I think the questions are excellent
Thanks a lot. Really enjoyed it. God bless you all
21:19 Where does the averaging of gradients happen? On the CPU, as shown in the animation? Or do all the GPUs talk to each other directly, so the averaging happens on each GPU?
It depends on the hardware you have and the backend you are using. As far as I know, with NVIDIA servers and the nccl backend it all happens between the GPUs without CPU involvement; the communication is device-to-device.
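For anyone curious, here's a minimal sketch of what that looks like (assuming you launch with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK). The all_reduce below is the same collective DDP uses under the hood; with the nccl backend it runs entirely on the GPUs:

```python
import os
import torch
import torch.distributed as dist

def main():
    # nccl = GPU-to-GPU collectives, no CPU staging of the tensors
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # DDP's gradient averaging is equivalent to this: sum across
    # ranks with an all-reduce, then divide by the world size.
    grad = torch.ones(4, device=local_rank) * dist.get_rank()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # executes on the GPUs via NCCL
    grad /= dist.get_world_size()                # every rank now holds the mean

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```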
thanks
Sir, if I have more data, say more than 100 GB, which cannot be stored in Google Colab, how should I approach training my model on the whole dataset?
I have a question. The train function runs on each process independently of the train functions running on the other processes. Within train, an epoch may finish at a different time on each process. How does PyTorch distributed know when it is time to synchronize gradients? BTW - this is the best lecture I have seen on this topic :+1:
All processes synchronize at every gradient update: the all-reduce that DDP launches during backward() blocks until every rank has contributed its gradients, so faster ranks simply wait for slower ones at that point.
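To make that concrete, here is a minimal DDP training-loop sketch (model, loader, and device are placeholders for your own objects, and it assumes init_process_group was already called as above). The synchronization point is loss.backward(), where DDP's autograd hooks fire the all-reduce:

```python
import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, device):
    # Wrapping in DDP registers gradient hooks on every parameter.
    ddp_model = DDP(model.to(device), device_ids=[device])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = F.cross_entropy(ddp_model(x), y)
        # DDP's hooks launch the all-reduce here; every rank blocks
        # until all ranks reach this point, so the averaged gradients
        # are identical everywhere before step() runs.
        loss.backward()
        opt.step()
        opt.zero_grad()
```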
Thanks a lot for this, helped with my interview prep!
Really good and clear, thank you for this video!
Thank you very much. Very good presentation, comprehensive and clear.
So clear and well-explained. Thank you very much!
super clear! Thanks!
So clear, so great!