Create on premise Data Lakehouse with Apache Iceberg | Nessie | MinIO | Lakehouse

  • Published 22 Jul 2024
  • In this video we cover the data lakehouse. A data lakehouse is a concept that combines elements of both data lakes and data warehouses to bring us the best of both worlds. It aims to provide a unified platform for storing, managing, and analyzing both unstructured and structured data.
    What is a Data Lake? aws.amazon.com/big-data/datal...
    Link to GitHub repo: github.com/hnawaz007/pythonda...
    Link to Data Lake Video:
    On-premise: • How to build on-premis...
    AWS: • How to create an AWS S...
    💥Subscribe to our channel:
    / haqnawaz
    📌 Links
    -----------------------------------------
    #️⃣ Follow me on social media! #️⃣
    🔗 GitHub: github.com/hnawaz007
    📸 Instagram: / bi_insights_inc
    📝 LinkedIn: / haq-nawaz
    🔗 / hnawaz100
    -----------------------------------------
    #dataanalytics #datalakehouse #opensource
    Topics covered in this video:
    ==================================
    0:00 - Introduction to Data Lakehouse
    0:53 - Data Lakehouse prominent Features
    1:50 - Data Lake from Previous session
    2:31 - Data Lakehouse Overview
    3:34 - Tech Stack of on-premise Data Lakehouse
    3:44 - Start Docker Containers
    4:02 - MinIO (S3) Buckets, File(s) & Keys
    4:56 - Configure Dremio
    5:07 - Add MinIO (S3) Source
    5:57 - Add Nessie Catalog
    6:38 - Format File
    7:33 - Create Iceberg Table (see the SQL sketch after this list)
    7:59 - Copy Data to Table
    8:35 - SQL DML Operations
    9:47 - Table History and Time Travel
    10:29 - Coming Soon
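    A minimal Dremio SQL sketch of the 7:33–9:47 steps. The source, schema, and table names below are illustrative assumptions, not the exact ones used in the video:

    -- Create an Iceberg table in the Nessie catalog source (hypothetical names)
    CREATE TABLE nessie.sales.orders (
        order_id   INT,
        customer   VARCHAR,
        amount     DOUBLE,
        order_date DATE
    );

    -- Copy data from a file already formatted in the MinIO (S3) source
    INSERT INTO nessie.sales.orders
    SELECT order_id, customer, amount, order_date
    FROM s3.lakehouse."orders.parquet";

    -- DML runs directly against the Iceberg table
    UPDATE nessie.sales.orders
    SET amount = amount * 1.1
    WHERE order_date < DATE '2024-01-01';

    DELETE FROM nessie.sales.orders
    WHERE customer IS NULL;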
  • Science & Technology

COMMENTS • 13

  • @BiInsightsInc
    @BiInsightsInc  8 months ago

    Link to the Data Lake videos, on-premise and AWS:
    ua-cam.com/video/DLRiUs1EvhM/v-deo.html&t
    ua-cam.com/video/KvtxdF7b_l8/v-deo.html

    • @hungnguyenthanh4101
      @hungnguyenthanh4101 8 months ago

      Can you try another project with Delta Lake and Hive Metastore?

  • @andriifadieiev9757
    @andriifadieiev9757 8 months ago

    Great video, thank you!

  •  3 months ago

    Amazing!

  • @hungnguyenthanh4101
    @hungnguyenthanh4101 8 months ago

    very good!

  • @rafaelg8238
    @rafaelg8238 2 months ago +2

    Great video, congrats.
    If possible, bring an end-to-end architecture with streaming data ingested directly into the lakehouse.
    Also something related to the integration of the data lake and the data lakehouse.

    • @BiInsightsInc
      @BiInsightsInc  2 months ago +1

      That’s a great idea 💡. I will put something together that combines data streaming and the data lake. This will give an end-to-end implementation.

  •  3 months ago

    Today I use Apache NiFi to retrieve data from APIs and DBs, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending Parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?

    • @BiInsightsInc
      @BiInsightsInc  3 months ago

      Hi there, Dremio is a SQL query engine like Trino and Presto, so you do not insert/ingest data into Dremio directly. The S3 layer is where you store your data, and Apache Iceberg provides the lakehouse management (upsert/merge) for the objects in the catalog. I'd advise handling upsert/merge in the catalog layer rather than in S3; that is the sole reason for Iceberg's presence in this stack. Here is an article on how to handle upserts using SQL.
      medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf
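      A minimal sketch of such an upsert as a MERGE INTO against the Iceberg table; the table and column names are hypothetical, and the same statement shape works from Spark SQL as in the article:

      -- Upsert: update matching rows, insert the rest (hypothetical tables)
      MERGE INTO nessie.sales.orders AS t
      USING nessie.staging.orders_updates AS s
        ON t.order_id = s.order_id
      WHEN MATCHED THEN
        UPDATE SET customer = s.customer, amount = s.amount
      WHEN NOT MATCHED THEN
        INSERT (order_id, customer, amount)
        VALUES (s.order_id, s.customer, s.amount);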

  •  3 months ago

    This is so insane. Is it also possible to query data from a specific version state directly, instead of only the metadata? I am wondering if this would be suitable for bigger datasets. Have you ever benchmarked this stack with a big dataset? If the version control is scalable with bigger datasets and a higher change frequency, this would be a crazy good solution to implement.

    • @BiInsightsInc
      @BiInsightsInc  3 months ago +1

      Yes, it is possible to query data using a specific snapshot ID. We can time travel using an available snapshot ID to view our Iceberg data from a different point in time; see Time Travel Queries. Processing large datasets depends on your setup: if you have multiple nodes with enough RAM/compute power, you can process large data, or you can leverage a cloud cluster that you scale up or down depending on your needs.
      SELECT COUNT(*)
      FROM s3.ctas.iceberg_blog
      AT SNAPSHOT '4132119532727284872';
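      If you need to look up the available snapshot IDs first, Dremio's Iceberg metadata functions can list them; a quick sketch using the same table as above:

      -- List the table's snapshots; pick a snapshot_id for the AT SNAPSHOT clause
      SELECT *
      FROM TABLE(table_snapshot('s3.ctas.iceberg_blog'));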

  • @nicky_rads
    @nicky_rads 8 months ago

    Nice video! Data lakehouses offer a lot of functionality at an affordable price. It seems like Dremio is the platform that allows you to aggregate all of these services together? Could you go a little more in depth on some of the services?

    • @BiInsightsInc
      @BiInsightsInc  8 months ago

      Thanks. Yes, Dremio's engine brings the various services together to offer data lakehouse functionality. I will be going over Iceberg and Project Nessie in the future.