Maximizing Query Speed and Minimizing Costs in Data Lakes with Open-Source Caching

Поділитися
Вставка
  • Опубліковано 18 вер 2024
  • Presented by Beinan Wang at OSA Con 2023
    As data lakes scale in complexity and size, companies face challenges with slow and inconsistent data access, rapidly growing storage costs, and high operation costs when migrating to the cloud. In this talk, we discuss an open-source caching framework we designed to improve performance by 1.5x and reduce storage costs by millions per year. The framework leverages tools like Hadoop, Parquet, Hudi, and Alluxio and applies to both on-prem and cloud environments.
    By the end of the session, you will learn:
    The challenges of performance and cost in large-scale data lakes
    Architectural considerations for caching to accelerate queries, maximize cache hit rates, and reduce storage costs
    Leveraging open-source tools like Alluxio for performant caching on high-density HDDs
    Advanced techniques like segmented file caching, affinity-based scheduling, and cache filtering to optimize cache usage
    Monitoring cache usage and working sets with metrics and traces

КОМЕНТАРІ •