Source Routing for AI Fabrics

  • Published Sep 11, 2024
  • Presented by Kishore Atreya (Marvell) | Prathyaya Bhandarkar (Marvell)
    For large-scale, multi-tenant AI clusters that rely on Ethernet fabrics, high tail latency and jitter increase job completion time, because training elements sit idle while waiting for data to arrive. Multiple congestion avoidance approaches have been proposed to address this challenge, including enhanced congestion control and load balancing, packet spraying, and fabric scheduling.
    These options share two drawbacks: they are complex, and their behavior is unpredictable.
    This presentation details an approach to scheduling AI workloads in an Ethernet fabric. We propose a simplified source routing framework, enabled by SAI, that predetermines flow paths and programs them across the access nodes of an AI fabric, taking advantage of the predictability of AI training flows.
    In the proposed framework, software controllers engineer traffic flows between training elements, optimizing for bandwidth utilization, load, and latency. This approach reduces network cost and power requirements compared with fabric scheduling.
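
The idea of a controller predetermining flow paths can be illustrated with a minimal sketch. It assumes a two-tier leaf-spine fabric and a greedy least-loaded placement, neither of which is stated in the abstract; the names `Flow`, `assign_paths`, and `link_gbps` are hypothetical. The sketch only computes paths. It makes no SAI calls, and programming the resulting paths into access nodes is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A predictable training flow between two leaf (access) switches."""
    src_leaf: str
    dst_leaf: str
    gbps: float


def assign_paths(flows, spines, link_gbps):
    """Pick one spine per flow so the busiest link on its path stays least loaded.

    Hypothetical greedy heuristic: flows are placed largest first, and each
    flow takes the spine whose uplink/downlink pair is currently least loaded.
    Returns (routes, load): routes maps the flow index to its hop list, and
    load maps each directed link to its assigned Gb/s.
    """
    load = {}    # (leaf, spine) uplink or (spine, leaf) downlink -> Gb/s
    routes = {}  # flow index -> [src_leaf, spine, dst_leaf]

    # Largest flows first; sorted() is stable, so ties keep input order.
    order = sorted(range(len(flows)), key=lambda i: -flows[i].gbps)
    for i in order:
        flow = flows[i]

        def worst(spine):
            # Load on the busier of the two links this flow would use.
            up = load.get((flow.src_leaf, spine), 0.0)
            down = load.get((spine, flow.dst_leaf), 0.0)
            return max(up, down)

        best = min(spines, key=worst)
        if worst(best) + flow.gbps > link_gbps:
            raise ValueError(f"no spine has capacity for flow {i}")

        up_link = (flow.src_leaf, best)
        down_link = (best, flow.dst_leaf)
        load[up_link] = load.get(up_link, 0.0) + flow.gbps
        load[down_link] = load.get(down_link, 0.0) + flow.gbps
        routes[i] = [flow.src_leaf, best, flow.dst_leaf]

    return routes, load


if __name__ == "__main__":
    demo = [Flow("leaf1", "leaf2", 100.0) for _ in range(4)]
    routes, load = assign_paths(demo, ["s1", "s2"], link_gbps=400.0)
    for idx in sorted(routes):
        print(idx, " -> ".join(routes[idx]))
```

Because the paths are fixed before the job starts, the per-link load is known in advance, which is what makes the behavior predictable. A real controller would also weigh latency and handle failures, which this sketch ignores.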
