Scale and Accelerate the Distributed Model Training in Kubernetes Cluster

  • Published Aug 16, 2023
  • Speaker:
    Jack Jin, Lead ML Infra Engineer, Zoom
    Jack Jin is a lead machine learning infrastructure engineer at Zoom AI/ML. He designed and built an end-to-end, Kubernetes-based ML platform for multiple Zoom ML teams on a shared GPU resource pool, running distributed model training such as PyTorch DDP with Kubeflow PyTorchJob and accelerating multi-GPU, multi-node training performance with RDMA, RoCE, and SR-IOV. He also designed and built the data ETL, data exploration, big data processing, and ML exploration systems and infrastructure. Before joining Zoom, Jack was the MLOps Cloud Lead at Roche Genentech and a cloud consultant at IBM/Taos, where he helped build an ML platform serving 500 Roche data scientists globally.
    Abstract:
    Kubernetes offers a compelling solution for orchestrating deep learning workloads that scale across multiple GPUs and nodes. With Kubernetes and the Kubeflow PyTorchJob operator, we can easily schedule and track distributed training jobs on a single node with multiple GPUs, or across multiple GPU nodes, in a shared GPU resource pool. To accelerate deep learning training at Zoom, we enable RDMA over Converged Ethernet (RoCE) to bypass the CPU kernel network stack and offload TCP/IP protocol processing. We apply this technology in Kubernetes with SR-IOV via the NVIDIA Network Operator in a heterogeneous GPU cluster of 4-GPU and 8-GPU servers, and reach a near-linear performance increase as the number of GPUs and worker nodes grows. A minimal sketch of the kind of training entrypoint involved follows below.
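
    For context, the sketch below shows a minimal PyTorch DDP entrypoint of the kind a Kubeflow PyTorchJob worker pod would run. The Kubeflow training operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into each pod, so the process group can be initialized from the environment; the model, data, and hyperparameters here are illustrative assumptions, not the setup described in the talk.

    # Minimal DDP training entrypoint, assumed to run inside a Kubeflow
    # PyTorchJob worker pod. The training operator injects MASTER_ADDR,
    # MASTER_PORT, WORLD_SIZE, and RANK, so init_method="env://" works as-is.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # NCCL can use GPUDirect RDMA / RoCE transports when the fabric supports them.
        dist.init_process_group(backend="nccl", init_method="env://")

        # Assumes one GPU per pod; LOCAL_RANK falls back to 0 if the operator does not set it.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for step in range(100):                                 # placeholder training loop
            inputs = torch.randn(32, 1024, device=local_rank)   # synthetic batch
            loss = model(inputs).sum()
            optimizer.zero_grad()
            loss.backward()                                      # gradients are all-reduced across workers
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

    The same script scales from a single multi-GPU node to multiple nodes without code changes, since the operator controls the world size and rank assignment per pod.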

COMMENTS • 1

  • @user-pz2vs2ht6c • 7 months ago

    Thanks for the very good overview of distributed training on Kubernetes; would love to see more detailed information on making all the pieces fit together!