Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

  • Published 27 Jun 2024
  • A complete tutorial on how to train a model on multiple GPUs or multiple servers.
    I first describe the difference between Data Parallelism and Model Parallelism. Next, I explain the concept of gradient accumulation (including all the maths behind it). Then we get to the practical tutorial: first we create a cluster on Paperspace with two servers (each with two GPUs), and then we train a model in a distributed manner on the cluster.
    We will explore collective communication primitives: Broadcast, Reduce and All-Reduce and the algorithm behind them.
    I also provide a template on how to integrate DistributedDataParallel in your existing training loop.
    In the last part of the video we review advanced topics, like bucketing and computation-communication overlap during backpropagation.
    Code: github.com/hkproj/pytorch-tra...
    PDF slides: github.com/hkproj/pytorch-tra...
    Chapters
    00:00:00 - Introduction
    00:02:43 - What is distributed training?
    00:04:44 - Data Parallelism vs Model Parallelism
    00:06:25 - Gradient accumulation
    00:19:38 - Distributed Data Parallel
    00:26:24 - Collective Communication Primitives
    00:28:39 - Broadcast operator
    00:30:28 - Reduce operator
    00:32:39 - All-Reduce
    00:33:20 - Failover
    00:36:14 - Creating the cluster (Paperspace)
    00:49:00 - Distributed Training with TorchRun
    00:54:57 - LOCAL RANK vs GLOBAL RANK
    00:56:05 - Code walkthrough
    01:06:47 - No_Sync context
    01:08:48 - Computation-Communication overlap
    01:10:50 - Bucketing
    01:12:11 - Conclusion
  • Science & Technology

COMMENTS • 49

  • @amishasomaiya9891
    @amishasomaiya9891 1 month ago +2

    Starting to watch my 3rd video on this channel, after transformer from scratch and quantization. Thank you for the great content and also for the code and notes to look back again. Thank you.

  • @karanacharya18
    @karanacharya18 1 month ago +2

    Super high quality lecture. You have a gift of teaching, man. Thank you!

  • @chiragjn101
    @chiragjn101 6 months ago +9

    Great video, thanks for creating this. I have used DDP quite a lot, but seeing the visualizations for communication overlap helped me build a very good mental model.
    Would love to see more content around distributed training - Deepspeed ZeRO, Megatron DP + TP + PP

  • @user-td8vz8cn1h
    @user-td8vz8cn1h 2 months ago +2

    This is the second video I've watched from this channel, after "quantization". Frankly, I wanted to express my gratitude for your work, as it is very easy to follow and the level of abstraction makes it possible to understand the concepts holistically.

  • @abdallahbashir8738
    @abdallahbashir8738 3 months ago +3

    I really love your videos. You have a natural talent for simplifying logic and code, in the same capacity as Andrej.

  • @vasoyarutvik2897
    @vasoyarutvik2897 1 month ago +2

    This channel is a hidden gem

  • @svkchaitanya
    @svkchaitanya 2 days ago +1

    You rock always 😂

  • @oliverhitchcock8436
    @oliverhitchcock8436 6 months ago +3

    Another great video, Umar. Nice work

  • @631kw
    @631kw 6 months ago +3

    Amazing content! Thanks for sharing

  • @prajolshrestha9686
    @prajolshrestha9686 6 months ago +1

    Thank you so much for this amazing video. It is really informative.

  • @nova2577
    @nova2577 4 months ago +1

    You deserve many more likes and subscribers!

  • @810602jay
    @810602jay 6 months ago +1

    Amazing learning material! Thanks a lot! 🥰🥰🥰

  • @user-wm5xv5ei8o
    @user-wm5xv5ei8o 3 months ago +1

    very nice and informative video. Thanks

  • @user-jf6li8mn3l
    @user-jf6li8mn3l 5 months ago

    The video was very interesting and useful. Please make a similar video on DeepSpeed functionality. And in general, how to train large models (for example LLaMa SFT) on distributed systems (Multi-Server) when GPUs are located on different PCs.

  • @d.s.7857
    @d.s.7857 6 months ago +1

    Thank you so much for this

  • @felipemello1151
    @felipemello1151 1 month ago +1

    I wish I could like it twice

    • @umarjamilai
      @umarjamilai 1 month ago

      You can share it on social media. That's the best way to thank me 😇

    • @felipemello1151
      @felipemello1151 1 month ago

      @umarjamilai not sure if it's in your plans, but if you are open to suggestions, I would love to watch a video on multimodal models. Again, awesome work!

  • @loong6127
    @loong6127 3 months ago +1

    Great video

  • @user-od3ig9qt6h
    @user-od3ig9qt6h 6 months ago +2

    Thank you very much for your wonderful video. Could you make a video on how to use the accelerate library with DDP?

  • @Yo-rw7mq
    @Yo-rw7mq 2 months ago +1

    Great!

  • @manishsharma2211
    @manishsharma2211 6 months ago +1

    you teach soooooooo good

  • @rohollahhosseyni8564
    @rohollahhosseyni8564 3 months ago +1

    great video

  • @user-el4uh3uk2k
    @user-el4uh3uk2k 3 months ago +1

    fantastic

  • @Engrbilal143
    @Engrbilal143 4 months ago

    Awesome video. Please make a tutorial on FSDP as well

  • @sounishnath513
    @sounishnath513 6 months ago +1

    SUUUPERRRR

  • @tryit-wv8ui
    @tryit-wv8ui 6 months ago

    another banger

  • @mdbayazid6837
    @mdbayazid6837 6 months ago +1

    Federated learning basics please.❤

  • @hellochli
    @hellochli 6 months ago +1

    Thanks!

    • @umarjamilai
      @umarjamilai 6 months ago

      Thank you! Let's connect on LinkedIn.

  • @mandarinboy
    @mandarinboy 5 months ago

    Great intro video. Do you have any plans to also cover other kinds of parallelism: Model, Pipeline, Tensor, etc.?

  • @riyajatar6859
    @riyajatar6859 3 months ago +1

    In the broadcast, if we send a copy of the file from the rank 0 and rank 4 nodes to the other nodes, how is the total time still 10 seconds? I still have the same internet speed of 1 MB/s.
    Could anyone explain? I am a bit confused.
    Also, what happens if I have an odd number of nodes?
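
The question above can be explored with a minimal sketch, assuming a doubling scheme in which every rank that already holds the data sends it to one rank that does not, and all transfers within a round run in parallel over separate links. The function name `broadcast_schedule` and the pairing order are illustrative assumptions, not the schedule shown in the video. Under these assumptions the number of rounds grows as ceil(log2(N)) instead of N - 1, and an odd node count simply leaves some senders idle in the last round.

```python
def broadcast_schedule(num_nodes):
    """Return the rounds of a doubling broadcast starting from rank 0.

    Each round is a list of (sender, receiver) pairs that can run in parallel.
    This is a hypothetical helper for illustration only.
    """
    have = [0]          # ranks that already hold the data
    rounds = []
    while len(have) < num_nodes:
        missing = [r for r in range(num_nodes) if r not in have]
        # zip stops at the shorter list, so extra senders stay idle
        pairs = list(zip(have, missing))
        rounds.append(pairs)
        have = have + [receiver for _, receiver in pairs]
    return rounds


for n in (8, 5):
    schedule = broadcast_schedule(n)
    print(f"{n} nodes need {len(schedule)} rounds")
    for i, pairs in enumerate(schedule, start=1):
        print(f"  round {i}: {pairs}")
```

If each round takes roughly the time of one transfer, the total time depends on the number of rounds rather than on the number of receivers, because every sender uses its own link.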

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 6 months ago

    If time permits, please make a video covering GPUs and TPUs and how to use them effectively, since most of us don't know.
    Please create a PyTorch playlist for beginners and intermediates.
    Thanks for reading.

  • @waynelau3256
    @waynelau3256 2 months ago

    Working with fsdp and megatron now and I really want to figure this out from scratch haha, it sounds fun but a big headache

  • @ramprasath6424
    @ramprasath6424 6 months ago +1

    Please do something related to large audio models like Conformers, QuartzNet, etc.

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 6 months ago

    Hi Umar, great video and I enjoyed it thoroughly, but I have one question: why are we using the approach of sum(grad1+grad2+....+gradN)? Why can't we use the average of the gradients?

    • @umarjamilai
      @umarjamilai 6 months ago +2

      Of course you can (but you don't have to) use the average of the gradients. Actually, people usually take the average of the gradients. The reason we use the average is because we want the loss to be (more or less) the same as the non-distributed model, so you can compare the plots of the two. I don't know if PyTorch internally automatically takes the average of the gradients, I'd have to check the documentation/source.

    • @madhusudhanreddy9157
      @madhusudhanreddy9157 6 months ago

      @umarjamilai thanks for the info.
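
To make the sum-versus-average point in this thread concrete, here is a minimal pure-Python sketch that simulates an All-Reduce over per-rank gradient lists. The helper names `all_reduce_sum` and `average_gradients` are hypothetical and do not call any PyTorch API. For reference, the PyTorch `DistributedDataParallel` documentation states that gradients are averaged across processes during the backward pass, which is the second variant below.

```python
def all_reduce_sum(per_rank_grads):
    """Simulate All-Reduce (sum): every rank ends up with the element-wise sum."""
    total = [sum(values) for values in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]


def average_gradients(per_rank_grads):
    """All-Reduce (sum), then divide by the number of ranks (the world size)."""
    world_size = len(per_rank_grads)
    summed = all_reduce_sum(per_rank_grads)
    return [[g / world_size for g in grads] for grads in summed]


# Two ranks, each holding a local gradient for the same two parameters.
local_grads = [[1.0, 2.0], [3.0, 6.0]]
print(all_reduce_sum(local_grads))
print(average_gradients(local_grads))
```

The two variants differ only by a constant factor equal to the number of ranks, so summing with a learning rate divided by that factor takes the same step as averaging.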

  • @Allen-TAN
    @Allen-TAN 6 months ago +1

    Always great to watch your video, excellent work

  • @khoapham7303
    @khoapham7303 6 months ago +2

    I'm always confused between DP and DDP. Can you please tell me the difference between them, given that both are data parallelism methods?

    • @umarjamilai
      @umarjamilai 6 months ago +6

      DP only works on a single machine, while DDP can work on multiple machines. However, PyTorch now recommends using DDP also for single-machine setup.

    • @khoapham7303
      @khoapham7303 6 months ago

      @umarjamilai thank you for your reply

  • @user-fw5sg5mx4m
    @user-fw5sg5mx4m 1 month ago

    Could you provide more videos on model parallelism and pipeline parallelism? Thanks.

  • @Erosis
    @Erosis 6 months ago

    Wouldn't the accumulated gradient need to be divided by the total number of individual gradients summed (or the learning rate needs to be divided by this value) to make it equivalent?

    • @umarjamilai
      @umarjamilai 6 months ago +2

      Yes, if you want to treat the "cumulative gradient" as a big batch, then you'd usually divide it by the number of items to keep it equivalent to the single-item setup. But it's not mandatory: as a matter of fact, loss functions in PyTorch have a "reduction" parameter, which is usually set to "mean" (so dividing the loss by the number of items) but can also be set to "sum".
      One reason we usually calculate the "mean" loss is that we want to make comparisons between models with different hyperparameters (batch size), so the loss should not depend on the batch size.
      But remember that mathematically you don't have to.
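
The equivalence discussed in this thread can be checked with a small sketch. It assumes a one-parameter model y_pred = w * x with a squared-error loss and equal-sized micro-batches; the function names and the toy data are hypothetical and chosen only for illustration. Scaling each micro-batch gradient by 1 / num_steps reproduces the gradient of the mean loss over the full batch, while dropping the scaling gives the same direction multiplied by num_steps.

```python
import math


def mean_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for y_pred = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size, scale=True):
    """Accumulate micro-batch gradients, optionally scaled by 1 / num_steps."""
    starts = range(0, len(xs), micro_batch_size)
    num_steps = len(starts)
    total = 0.0
    for i in starts:
        xb = xs[i:i + micro_batch_size]
        yb = ys[i:i + micro_batch_size]
        grad = mean_grad(w, xb, yb)
        total += grad / num_steps if scale else grad
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mean_grad(w, xs, ys)
scaled = accumulated_grad(w, xs, ys, micro_batch_size=2)
unscaled = accumulated_grad(w, xs, ys, micro_batch_size=2, scale=False)

print(full, scaled, unscaled)
print(math.isclose(full, scaled))  # scaled accumulation matches the full batch
```

The unscaled result is larger by a factor of num_steps, which is why dividing the learning rate by the same factor is an alternative to dividing the gradient.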

  • @ai__76
    @ai__76 2 months ago

    How do you do this in Kubernetes? Please explain it.

  • @user-ze3ok8hh6c
    @user-ze3ok8hh6c 6 months ago

    Do you have a Discord channel?

  • @milonbhattacharya4097
    @milonbhattacharya4097 4 months ago

    Shouldn't the loss be accumulated? loss += (y_pred - y_actual)^0.5

    • @user-pt7gs2ei1r
      @user-pt7gs2ei1r 4 months ago

      In my understanding, yes, the loss is theoretically accumulated over one batch, and the gradients are computed from this accumulated loss too. But in the parallel implementation, both the loss calculated in the forward pass and the gradients calculated in backpropagation are computed in parallel. Here @umarjamilai uses a for loop to illustrate the de facto parallel mechanism.