Scaling RoCE Networks for AI Training | Adi Gangidi

  • Published 21 Aug 2024
  • In this talk, we provide an overview of Meta's RDMA deployment, based on the RoCEv2 transport, that supports our production AI training infrastructure. We shed light on how we designed the infrastructure to maximize both raw performance and the consistency that is fundamental for the workload. We discuss the challenges we solved in the routing, transport, and hardware layers along the way to scale our infrastructure. We also touch on the opportunities that remain in this space for further progress over the next few years.

COMMENTS • 3

  • @lolcat6294 1 month ago +2

    🎯 Key points for quick navigation:
    00:19 *Meta transitioned AI training from horizontal to vertical scaling, requiring a dedicated RDMA network over converged Ethernet.*
    01:00 *RDMA fabrics at Meta support tens of thousands of GPUs, handling diverse AI training use cases.*
    03:11 *AI training involves complex, recursive processes that scale vertically with HPC-style parallel processing.*
    05:32 *RDMA with RoCEv2 enables high-bandwidth, low-latency GPU communication crucial for AI training.*
    08:30 *Meta's network design for AI training includes balanced topologies and traffic patterns accommodating hierarchical and full mesh models.*
    12:44 *Load balancing challenges in RDMA deployments at Meta involve adapting to uneven distribution of server destinations across IP prefixes.*
    16:52 *Issues with slow receivers impacting network performance at Meta are often related to GPU memory allocation pressures, causing PCIe and network bottlenecks.*
    Made with HARPA AI

  • @jagsinghbrar 8 months ago +1

    Adi, that was a good talk. I enjoyed watching it. Lots of useful info. Thank you! - Jag

  • @aaa-hw2ty 18 days ago

    Each spine switch connects to 256 ToR switches and some uplink switches. Which types of spine switch can support nearly 300 × 400 Gbps ports?
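
As a rough check on the question above, the aggregate bandwidth implied by the commenter's figures can be worked out directly. This is a minimal sketch: the 256 ToR links and the 400 Gbps port speed come from the comment, "nearly 300" ports is the commenter's own estimate, and the constant and function names are hypothetical. It says nothing about which switch hardware is actually deployed.

```python
# Figures taken from the comment above; "nearly 300" is the commenter's estimate.
TOR_DOWNLINKS = 256          # ToR switches attached to one spine switch
APPROX_TOTAL_PORTS = 300     # commenter's rounded total port count (assumption)
PORT_SPEED_GBPS = 400        # per-port speed quoted in the comment


def aggregate_tbps(ports: int, speed_gbps: int) -> float:
    """Return one-direction aggregate port bandwidth in Tbps."""
    return ports * speed_gbps / 1000


downlink_tbps = aggregate_tbps(TOR_DOWNLINKS, PORT_SPEED_GBPS)
total_tbps = aggregate_tbps(APPROX_TOTAL_PORTS, PORT_SPEED_GBPS)
implied_uplinks = APPROX_TOTAL_PORTS - TOR_DOWNLINKS

print(f"ToR-facing bandwidth: {downlink_tbps} Tbps")
print(f"Total at ~300 ports:  {total_tbps} Tbps")
print(f"Implied uplink ports: {implied_uplinks}")
```

Under these assumptions, the question amounts to asking which spine platforms offer on the order of 120 Tbps of port bandwidth per direction.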