Scale and Accelerate the Distributed Model Training in Kubernetes Cluster

  • Published Aug 16, 2023
  • Speaker:
    Jack Jin, Lead ML Infra Engineer, Zoom
    Jack Jin is a lead machine learning infrastructure engineer at Zoom AI/ML. He designed and built an end-to-end, Kubernetes-based ML platform for multiple Zoom ML teams on a shared GPU resource pool, running distributed model training such as PyTorch DDP with Kubeflow PyTorchJob and accelerating multi-GPU, multi-node training performance with RDMA, RoCE, and SR-IOV. He also designed and built the data ETL, data exploration, big data processing, and ML exploration systems and infrastructure. Before joining Zoom, Jack was the MLOps Cloud Lead at Roche Genentech and a cloud consultant at IBM/Taos, where he helped build an ML platform serving 500 Roche data scientists globally.
    Abstract:
    Kubernetes offers a compelling solution for orchestrating deep learning workloads that scale across multiple GPUs and nodes. With Kubernetes and the Kubeflow PyTorchJob operator, we can easily schedule and track distributed training jobs on a single node with multiple GPUs, or across multiple GPU nodes, in a shared GPU resource pool. To accelerate deep learning training at Zoom, we enable RDMA over Converged Ethernet (RoCE) to bypass the CPU kernel network stack and offload TCP/IP protocol processing. We apply this technology in Kubernetes with SR-IOV via the NVIDIA Network Operator in a heterogeneous GPU cluster of 4-GPU and 8-GPU servers, and reach a near-linear performance increase as the number of GPUs and worker nodes grows. A minimal sketch of the kind of training entrypoint involved follows below.
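
    For context, the sketch below shows a minimal PyTorch DDP entrypoint of the kind a Kubeflow PyTorchJob worker pod would run. The Kubeflow training operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into each pod, so the process group can be initialized from the environment; the model, data, and hyperparameters here are illustrative assumptions, not the setup described in the talk.

    # Minimal DDP training entrypoint, assumed to run inside a Kubeflow
    # PyTorchJob worker pod. The training operator injects MASTER_ADDR,
    # MASTER_PORT, WORLD_SIZE, and RANK, so init_method="env://" works as-is.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # NCCL can use GPUDirect RDMA / RoCE transports when the fabric supports them.
        dist.init_process_group(backend="nccl", init_method="env://")

        # Assumes one GPU per pod; LOCAL_RANK falls back to 0 if the operator does not set it.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for step in range(100):                                 # placeholder training loop
            inputs = torch.randn(32, 1024, device=local_rank)   # synthetic batch
            loss = model(inputs).sum()
            optimizer.zero_grad()
            loss.backward()                                      # gradients are all-reduced across workers
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

    The same script scales from a single multi-GPU node to multiple nodes without code changes, since the operator controls the world size and rank assignment per pod.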

COMMENTS • 1

  • @user-pz2vs2ht6c • 7 months ago

    Thanks for the very good overview of distributed training on Kubernetes; would love to see more detailed information on making all the pieces fit together!