Is Your GPU Really Working Efficiently in the Data Center? N Ways to Imp... Xiao Zhang & Wu Ying Jun
- Published Nov 8, 2024
Is Your GPU Really Working Efficiently in the Data Center? N Ways to Improve GPU Usage - Xiao Zhang, DaoCloud & Wu Ying Jun, China Mobile
AI has spread across many industries, and companies have bought large numbers of expensive AI GPUs for training and inference.
Is their MFU (Model FLOPs Utilization) actually high?
Are GPU cards being monopolized by many applications that rarely use them?
Do these AI devices work efficiently 24/7?
This session draws on our large-scale production experience to summarize N ways to improve the MFU of AI accelerators.
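For readers unfamiliar with the metric, MFU is commonly estimated as the model FLOPs actually achieved per second divided by the hardware's theoretical peak. The sketch below is not from the talk: it assumes the widely used approximation of about 6 FLOPs per parameter per trained token for dense transformers (attention FLOPs ignored), and every number in the usage example is hypothetical.

```python
def model_flops_utilization(params, tokens_per_second,
                            num_accelerators, peak_flops_per_accelerator):
    """Estimate training MFU as achieved model FLOPs/s over peak FLOPs/s.

    Assumption: a dense transformer costs roughly 6 * params FLOPs per
    trained token (forward + backward), ignoring attention FLOPs.
    """
    achieved_flops_per_second = 6.0 * params * tokens_per_second
    peak_flops_per_second = num_accelerators * peak_flops_per_accelerator
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical cluster: 100B-parameter model, 1024 accelerators,
# 312 TFLOP/s peak each, 200k tokens/s of measured training throughput.
mfu = model_flops_utilization(
    params=100e9,
    tokens_per_second=200_000,
    num_accelerators=1024,
    peak_flops_per_accelerator=312e12,
)
print(f"Estimated MFU: {mfu:.1%}")
```

Note that the estimate depends only on measured throughput, so time lost to checkpointing, restarts, or idle cards lowers it directly.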
We will share our experience training LLMs with hundreds of billions of parameters on a large-scale K8s cluster with thousands of AI accelerators (GPUs or NPUs), including model parallelism, switch-affinity scheduling, checkpoint efficiency optimization, and recovery from checkpoints.
We will also show how to improve MFU through GPU sharing, handle tidal scenarios (workloads whose demand rises and falls over the day) with hybrid training-inference solutions, and raise GPU utilization by grouping nodes and pairing training and inference applications.
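As a rough illustration of the tidal idea, the sketch below splits a fixed GPU pool hour by hour: inference gets what its demand requires plus a small safety margin, and training backfills the rest. This is a toy model under stated assumptions, not the speakers' scheduler; the function name, the `headroom_gpus` parameter, and the demand figures are all hypothetical.

```python
def plan_hybrid_pool(total_gpus, inference_demand_by_hour, headroom_gpus=4):
    """Return (hour, inference_gpus, training_gpus) for each hour.

    Assumption: inference demand per hour is known in whole GPUs, and
    training jobs can elastically grow or shrink to fill the remainder.
    """
    plan = []
    for hour, demand in enumerate(inference_demand_by_hour):
        # Reserve demand plus a safety margin, capped at the pool size.
        inference_gpus = min(total_gpus, demand + headroom_gpus)
        training_gpus = total_gpus - inference_gpus
        plan.append((hour, inference_gpus, training_gpus))
    return plan


# Hypothetical demand curve: quiet, ramp-up, daytime peak, evening drop.
plan = plan_hybrid_pool(total_gpus=100, inference_demand_by_hour=[10, 40, 80, 20])
for hour, inference_gpus, training_gpus in plan:
    print(f"hour {hour}: inference={inference_gpus} training={training_gpus}")
```

In this toy model no GPU sits idle at any hour, which is the point of mixing the two workload types on one pool.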