Enabling Cost-Efficient LLM Serving with Ray Serve

  • Published Aug 20, 2024
  • Ray Serve is the cheapest and easiest way to deploy LLMs, and has served billions of tokens in Anyscale Endpoints. This talk discusses how Ray Serve reduces cost via fine-grained autoscaling, continuous batching, and model parallel inference, as well as the work we've done to make it easy to deploy any Hugging Face model with these optimizations.
    Takeaways:
    • Learn how Ray Serve saves costs by using fewer GPUs with fine-grained autoscaling and by integrating with libraries like vLLM to maximize GPU utilization.
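The fine-grained autoscaling mentioned above can be sketched with Ray Serve's public deployment API. This is a minimal illustrative fragment, not code from the talk: the class name, replica bounds, and target load are assumptions, and the exact autoscaling keys may vary across Ray versions.

```python
# Hypothetical sketch: per-deployment autoscaling so idle replicas
# (and their GPUs) are released when traffic drops.
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    autoscaling_config={
        "min_replicas": 0,               # scale to zero when idle: no idle GPU cost
        "max_replicas": 4,               # cap GPU spend under bursty load
        "target_ongoing_requests": 2,    # add/remove replicas to hold this load
    },
)
class LLMServer:
    def __init__(self):
        # e.g. load a Hugging Face model here
        ...

    async def __call__(self, request):
        # run inference and return generated text
        ...


app = LLMServer.bind()
# serve.run(app)  # deploys onto a running Ray cluster
```

With `min_replicas: 0`, a deployment that receives no traffic holds no GPUs at all, which is where the cost savings for spiky LLM workloads come from.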
    About Anyscale
    ---
    Anyscale is the AI Application Platform for developing, running, and scaling AI.
    www.anyscale.com/
    If you're interested in a managed Ray service, check out:
    www.anyscale.c...
    About Ray
    ---
    Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
    docs.ray.io/en...
    #llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
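The continuous batching the talk credits for high GPU utilization can be illustrated with a small scheduling simulation. This is a hypothetical sketch, not Ray Serve or vLLM code: sequence lengths and batch size are made-up numbers, and one "step" stands in for one decode iteration on the GPU.

```python
# Compare static batching (a batch runs until its longest sequence
# finishes) with continuous batching (new requests join the batch as
# soon as any slot frees up, every decode iteration).
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total decode steps when each fixed batch runs to completion."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # short sequences wait for the longest
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when finished slots are refilled immediately."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # admit waiting requests into any free batch slots
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration for the whole batch
        # each request needs `length` iterations; drop the ones that finished
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2]  # tokens left to generate per request
print(static_batching_steps(lengths, batch_size=2))      # → 10
print(continuous_batching_steps(lengths, batch_size=2))  # → 8
```

The static scheduler wastes iterations keeping short sequences' slots idle while the long one finishes; the continuous scheduler backfills those slots, so the same work completes in fewer GPU steps.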

COMMENTS • 3

  • @elephantum
    @elephantum 1 month ago

    It should be noted that since this talk, Anyscale has deprecated Ray LLM and now recommends vLLM.

  • @yukewang3164
    @yukewang3164 5 months ago +3

    awesome talk, with useful insights!

  • @MrEmbrance
    @MrEmbrance 15 days ago

    no thanks