Create an on-premise Data Lakehouse with Apache Iceberg | Nessie | MinIO | Lakehouse
- Published 22 Jul 2024
- In this video we cover the data lakehouse. A data lakehouse is a concept that combines elements of both data lakes and data warehouses to bring us the best of both worlds. It aims to provide a unified platform for storing, managing, and analyzing both unstructured and structured data.
What is a Data Lake? aws.amazon.com/big-data/datal...
Link to GitHub repo: github.com/hnawaz007/pythonda...
Links to Data Lake Videos:
On-premise: • How to build on-premis...
AWS: • How to create an AWS S...
💥Subscribe to our channel:
/ haqnawaz
📌 Links
-----------------------------------------
#️⃣ Follow me on social media! #️⃣
🔗 GitHub: github.com/hnawaz007
📸 Instagram: / bi_insights_inc
📝 LinkedIn: / haq-nawaz
🔗 / hnawaz100
-----------------------------------------
#dataanalytics #datalakehouse #opensource
Topics covered in this video:
==================================
0:00 - Introduction to Data Lakehouse
0:53 - Data Lakehouse prominent Features
1:50 - Data Lake from Previous session
2:31 - Data Lakehouse Overview
3:34 - Tech Stack of on-premise Data Lakehouse
3:44 - Start Docker Containers
4:02 - MinIO (S3) Buckets, File(s) & Keys
4:56 - Configure Dremio
5:07 - Add MinIO (S3) Source
5:57 - Add Nessie Catalog
6:38 - Format File
7:33 - Create Iceberg Table
7:59 - Copy Data to Table
8:35 - SQL DML Operations
9:47 - Table History and Time Travel
10:29 - Coming Soon
Links to Data Lake Videos (On-premise and AWS):
ua-cam.com/video/DLRiUs1EvhM/v-deo.html&t
ua-cam.com/video/KvtxdF7b_l8/v-deo.html
Can you try another project with Delta Lake and Hive Metastore?
Great video, thank you!
Amazing!
very good!
great video, congrats.
If possible, bring an end-to-end architecture with streaming data ingested directly into the lakehouse. Also something related to the integration of the data lake and the data lakehouse.
That’s a great idea 💡. I will put something together that combines data streaming and the data lake. This will give an end-to-end implementation.
Today I use Apache NiFi to retrieve data from APIs and DBs, and MariaDB is my main DW. I've been testing Dremio/Nessie/MinIO using docker-compose, and I still have doubts about the best way to ingest data into Dremio. There are databases and APIs that cannot be connected to it directly. I tested sending Parquet files directly to the storage, but the upsert/merge is very complicated, and the JDBC connection with NiFi didn't help me either. What would you recommend for these cases?
Hi there, Dremio is a SQL query engine like Trino and Presto; you do not insert/ingest data into Dremio directly. The S3 layer is where you store your data, and Apache Iceberg provides the lakehouse management service (upsert/merge) for the objects in the catalog. I'd advise handling upsert/merge in the catalog layer rather than in S3; that is the sole reason for Iceberg's presence in this stack. Here is an article on how to handle upserts using SQL.
medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf
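As a minimal sketch of the catalog-layer upsert described above: Dremio supports MERGE INTO on Apache Iceberg tables, so staged rows can be folded into a table in one statement. The catalog, table, and column names below are hypothetical, not from the video.

```sql
-- Hypothetical upsert: fold staged rows into an Iceberg table via the catalog.
-- Table and column names are illustrative only.
MERGE INTO nessie.sales.customers AS t
USING s3.staging.customers_stage AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN
  INSERT (id, name, email) VALUES (s.id, s.name, s.email);
```

Note that this only works against tables in an Iceberg catalog (e.g. Nessie); writing Parquet files straight into the S3 bucket bypasses the table metadata that makes the merge possible.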
This is so insane. Is it also possible to query data from a specific version state directly, instead of only the metadata? I am wondering if this would be suitable for bigger datasets. Have you ever benchmarked this stack with a big dataset? If the version control is scalable with bigger datasets and higher change frequency, this would be a crazy good solution to implement.
Yes, it is possible to query data using a specific snapshot id. We can time travel using an available snapshot id to view our Iceberg data from a different point in time; see Time Travel Queries. Processing large datasets depends on your setup: if you have multiple nodes with enough RAM/compute power, then you can process large data, or leverage a cloud cluster that you can scale up or down depending on your needs.
SELECT COUNT(*)
FROM s3.ctas.iceberg_blog
AT SNAPSHOT '4132119532727284872';
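To find the snapshot ids used for time travel, a sketch assuming Dremio's table_history metadata function and its AT TIMESTAMP syntax (the timestamp below is an arbitrary example, not from the video):

```sql
-- List the snapshots recorded for the table (Dremio metadata function).
SELECT * FROM TABLE(table_history('s3.ctas.iceberg_blog'));

-- Time travel by timestamp instead of snapshot id; the timestamp is illustrative.
SELECT COUNT(*)
FROM s3.ctas.iceberg_blog
AT TIMESTAMP '2024-07-22 12:00:00';
```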
Nice video! Data lakehouses offer a lot of functionality at an affordable price. It seems like Dremio is the platform that allows you to aggregate all of these services together? Could you go a little more in depth on some of the services?
Thanks. Yes, Dremio's engine brings various services together to offer data lakehouse functionality. I will be going over Iceberg and Project Nessie in the future.