OnehouseHQ
Apache XTable Brings Interoperability for Hudi, Iceberg, and Delta Lake
Watch our highlights video describing how Apache XTable (Incubating) - formerly OneTable - brings interoperability to all major data table formats. This video comes from a session at Open Source Data Summit in November 2023.
Transcript
Tim Brown
All right, so to kick things off, I think we need to start from a shared understanding of what these table formats are. At their core, these formats are really just great abstractions that allow us to think about large sets of data as tables. We can do inserts, updates, and deletes, manage the tables, and really just make our lives a lot easier when dealing with this data at scale. There are three main players in this space right now, but when you think about these tables at rest, when they're just sitting in your S3 bucket or GCS or your Azure account, they're not all that different: they're metadata plus a set of Parquet files, or whatever other format you're using to store that data. So with that in mind, the OneTable project is really just omnidirectional interop for these table formats.
It's a lightweight translation layer over this metadata that fundamentally allows us to read the tables as any format and work with any tools that may require a certain format. The project itself is not a separate table format, and it's not adding any new metadata; it's just building on top of what's already provided by these three existing projects.
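
As a concrete sketch of what this translation layer looks like in practice: XTable is driven by a small dataset config that names a source format, the target formats to generate, and the tables to translate. The YAML below follows the layout of the project's getting-started docs, but the bucket path, table name, and partition column are hypothetical placeholders.

    # Hypothetical XTable dataset config: expose one Hudi table as Delta and Iceberg.
    sourceFormat: HUDI
    targetFormats:
      - DELTA
      - ICEBERG
    datasets:
      - tableBasePath: s3://my-bucket/warehouse/trips   # placeholder path
        tableName: trips
        partitionSpec: city:VALUE                       # placeholder partition column

Running the bundled utilities jar against this config (for example, java -jar xtable-utilities-bundled.jar --datasetConfig my_config.yaml; the exact jar name varies by release) writes the Delta and Iceberg metadata in place, next to the existing Parquet files.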
So there are a lot of engineering decisions that have to be made when you're building out your data lake, and a lot of time can be spent tinkering with different formats, trying to understand which features of each format fit your use cases best. It's possible that one format won't fit every use case, and there may also be vendors you're considering that only support one of these three formats. You don't want to be lying awake at night wondering what life would have been like if you had just chosen the other format. So with that, we present OneTable, which helps you write once and query everywhere.
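
To make "write once and query everywhere" concrete, here is a minimal PySpark sketch, assuming the table above has already been synced by XTable and that Spark was launched with the Hudi and Delta connector packages; the path is the same placeholder used in the config earlier.

    from pyspark.sql import SparkSession

    # Assumes Spark was started with the Hudi and Delta connectors on the classpath,
    # e.g. --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1,io.delta:delta-spark_2.12:3.1.0
    spark = SparkSession.builder.appName("xtable-demo").getOrCreate()

    path = "s3://my-bucket/warehouse/trips"  # placeholder, same table as above

    # One copy of the data on storage, read through two different format front ends.
    hudi_df = spark.read.format("hudi").load(path)
    delta_df = spark.read.format("delta").load(path)

    assert hudi_df.count() == delta_df.count()

Iceberg reads work the same way once the synced table is registered in, or loaded through, an Iceberg catalog.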
Ashvin Agrawal
But this is just the beginning. There are a lot of interesting problems yet to be solved, including merge-on-read delete vectors and Apache Paimon, a new project coming from the Alibaba team that is getting integrated. We are super focused on improving the performance and footprint of OneTable so that it can live hidden in the infrastructure and automatically generate these target table formats. We want to work on various deployment modes and also bring it as close to the engine integrations as possible. And in the long run, there are many more exciting features, both for researchers and for industry: for example, active-writer synchronization, commit timestamps, and so on.
I'm super excited, and there are a lot of problems we want to work on with you. I welcome you all to join the project: come and share, contribute, and give us your insights.
Views: 193

Videos

NOW Insurance Brings ML/AI to Life with Onehouse
Views: 28 · 3 months ago
NOW Insurance delivers complex insurance policies, well, now. A policy that used to take weeks to price and deliver can now be priced and sold in as little as three minutes. All developed and run quickly, easily, and affordably, with the power of Onehouse. What is NOW Insurance? We primarily do insurance for medical professionals. So think like physicians, nurses, nurse practitioners. In this i...
The Universal Data Lakehouse: User Journeys Part 2
Views: 22 · 4 months ago
In this second of two videos, Balaji Varadarajan, Karthik Natarajan, and Satya Narayan continue their conversation with Vinoth Chandar of Onehouse.
The Universal Data Lakehouse: User Journeys Part 1
Views: 34 · 4 months ago
In this first of two videos, data engineering leaders from Robinhood, Uber, and Walmart discuss the choices they've made and benefits they've achieved from the universal data lakehouse.
Step by Step Guide for Change Data Capture from PostgreSQL to the Onehouse Universal Data Lakehouse
Views: 186 · 4 months ago
Often, operational databases become overburdened with analytics workloads. Watch this tutorial to learn how to integrate PostgreSQL for fully managed change data capture (CDC) to the Onehouse Universal Data Lakehouse.
Scaling and Governing Robinhood’s Data Lakehouse with Apache Hudi
Views: 83 · 4 months ago
Robinhood manages tens of millions of customer accounts and petabytes of data, used for a multitude of purposes. Robinhood uses Apache Hudi to power a Universal Data Lakehouse architecture, handling data management with high performance and at low cost.
OSS in Today's Data Architectures
Views: 66 · 5 months ago
The OSS panel discussion at Open Source Data Summit 2023 was led by CEO Vinoth Chandar of Onehouse, featuring data technology all-stars from Confluent, Google, LinkedIn, Microsoft, Starburst, and Uber. Speakers appear in the following order in the accompanying video:
* Vinoth Chandar, CEO at Onehouse
* Raghu Ramakrishnan, CTO for Data at Microsoft
* Justin Borgman, Chairman and CEO at Starburst...
Diving into Uber's Cutting-Edge Data Infrastructure
Views: 169 · 6 months ago
As an astoundingly successful global transportation provider, Uber has a voracious appetite for up-to-the-minute data. In response to this demand, Apache Hudi sprang from Uber nearly a decade ago, and they have not stopped innovating yet.
Notion Handles 10x Data Growth with Apache Hudi
Views: 112 · 7 months ago
Notion's data team watched as their PostgreSQL database was overwhelmed by the data volumes generated by the company's business success. Sharding was a short-term fix, but Apache Hudi and a universal data lakehouse architecture offer a long-term solution.
Implementing Apache Hudi at Walmart
Views: 132 · 7 months ago
In this brief video from Open Source Data Summit, speakers Ankur Ranjan and Ayush Bijawat describe the strategic shift from data lake to lakehouse architecture at Walmart. Key points include the slow and clumsy classical data lake approach to updates vs. the efficient approach enabled by Apache Hudi, and a summary of all the benefits delivered by Hudi, including schema enforcement, ACID transa...
Hudi 1.0 Preview, Part 2: Key Features and Implementation Specifics
Views: 84 · 7 months ago
In this video from Open Source Data Summit, speakers Bhavani Sudha Saktheeswaran and Sagar Sumit describe key features of Hudi 1.0 as it seeks to provide "A Database Experience on the Data Lake." Features include: using LSM trees to unlock infinite database time travel; using functional indexes to eliminate partitions (!); the development of a new filegroup reader for improved performance; and ...
Hudi 1.0 Preview, Part 1: Hudi Background and Goals for Hudi 1.0
Views: 109 · 7 months ago
In this video from Open Source Data Summit, speakers Bhavani Sudha Saktheeswaran and Sagar Sumit describe the background of the Hudi project and goals for the upcoming major release, Hudi 1.0. Specifics include the origins of Hudi at Uber, the widespread use of Hudi today, and goals for Hudi 1.0 as it seeks to provide "A Database Experience on the Data Lake."
Introducing Onehouse and the Universal Data Lakehouse
Views: 527 · 8 months ago
Onehouse delivers the universal data lakehouse as a fully-managed cloud service. The universal data lakehouse frees your data from traditional data silos:
* Deliver minute-level data freshness: Ingest up-to-the-minute data from event streams, databases, and file storage
* Simplify data operations: Reduce data engineering burden with a fully-managed cloud service
* Reduce costs: Cut your data inf...
Introducing the Onehouse Universal Data Lakehouse
Views: 1.2K · 9 months ago
The data lakehouse is a modern architecture that combines the benefits of data warehouses and data lakes without the drawbacks and limitations. Onehouse builds on this architecture with a universal approach that makes data from all your favorite sources - streams, databases, and cloud storage, for example - available to all the common query engines and languages. Learn more at www.onehouse.ai.
Data Lakehouse Deep Dive: Hudi, Iceberg, and Delta Lake
Views: 3.5K · 1 year ago
On Tuesday, August 22nd, Onehouse presented a LinkedIn Live webinar on Apache Hudi, Apache Iceberg, and Delta Lake. You can see the video here or on LinkedIn: www.linkedin.com/events/7095484265877950465/ The video was based on Kyle Weller's popular blog post on the three major lakehouse projects: www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison You c...
Deep Dive: Hudi, Iceberg, and Delta Lake
Views: 855 · 1 year ago
Ingest Postgres CDC data into the lakehouse with Onehouse and Confluent
Views: 335 · 1 year ago
Full Workshop Recap: Build a ride-share lakehouse platform
Views: 333 · 1 year ago
AWS and Apache Hudi Workshop Overview: Build a ride share lakehouse platform
Views: 247 · 1 year ago

COMMENTS

  • @yuweixiao1943 · 3 months ago

    Will existing data in Postgres be synced to Hudi too? Or just changes since the creation of the streaming?

  • @padam_discussion · 3 months ago

    Interesting video... great

  • @SoumilShah · 5 months ago

    great video

    • @onehouseHQ · 5 months ago

      Glad you enjoyed it!

  • @HoorayforOranges · 6 months ago

    Thank you so much for this. This is the only video I could find that takes a real deep dive into the data without propaganda towards any one candidate.

  • @JG-zu6nq · 1 year ago

    Mistake at 22:41: there's no limitation that you 'can't cross over the boundary' in a query when you do partition evolution in Iceberg.

    • @kjweller · 1 year ago

      You can cross the boundary, but the query predicates need to be right to get the same performance across both partition schemes.

    • @JG-zu6nq · 1 year ago

      @@kjweller what exactly does that mean? One just has to write select * from table where ts > timestamp '2023-08-21 00:00:00', and even if the partitioning was evolved from, say, daily to hourly on 08/25, that will work and prune the partitions.

    • @kjweller · 1 year ago

      @@JG-zu6nq take an example: if you were partitioning by date daily and you want to evolve this to partition by userId, or vice versa, a query with only one of the predicates will be efficient just for that section of the partitioned data. It works great for evolving partitioning within different aggregate levels of the same value, but struggles across different values.

    • @paulfunigga · 1 year ago

      @@kjweller what about schema evolution? In your article it says that Hudi's schema evolution is good only on Spark SQL. What if I use Hudi with Trino? Is schema evolution going to be bad? Also, is Hudi good with Trino at all? In Trino's Slack channel they said that they prioritize Iceberg.

    • @paulfunigga · 1 year ago

      @@kjweller also, in your "which format to choose" section, why didn't you add another point: Hudi's table services are managed, compared to Iceberg and Delta Lake? I think it's a big thing.