How to build on-premise Data Lake? | Build your own Data Lake | Open Source Tools | On-Premise
- Published Jul 8, 2024
- In this video, we cover the exciting world of data lakes. A data lake is an essential component of the modern data stack. We have previously developed a data lake in the AWS environment using AWS S3, Glue, and Athena. What if we want to deploy our own data lake with open-source tools on our own infrastructure? Here we deploy an on-premise data lake using open-source technologies. This way we learn the technologies behind a data lake, and most cloud offerings use these same technologies under the hood.
What is Data Lake? aws.amazon.com/big-data/datal...
Link to GitHub repo: github.com/hnawaz007/pythonda...
💥Subscribe to our channel:
/ haqnawaz
📌 Links
-----------------------------------------
#️⃣ Follow me on social media! #️⃣
🔗 GitHub: github.com/hnawaz007
📸 Instagram: / bi_insights_inc
📝 LinkedIn: / haq-nawaz
🔗 / hnawaz100
-----------------------------------------
#dataanalytics #datalake #opensource
Topics covered in this video:
==================================
0:00 - Introduction to Data Lake
1:36 - Tech Stack of on-premise Data Lake
1:49 - Docker Containers Overview
3:26 - Data Lake Configurations
4:48 - Start Docker Containers
5:59 - MinIO (S3) Bucket and File(s)
6:53 - File mapping to SQL Table
7:12 - Trino Cluster
7:32 - Trino SQL Engine Connection
8:37 - Create Schema
9:03 - Create Table
9:36 - Query External Table
10:12 - SQL Analysis
10:29 - Data Lake Tech Review
11:51 - Coming Soon
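The Create Schema, Create Table, and Query External Table steps in the chapter list boil down to a few Trino statements. A hedged sketch, assuming a Hive catalog named minio and a bucket named datalake (catalog, schema, table, and column names here are assumptions, not the exact names used in the video):

```sql
-- Create a schema whose tables live under the MinIO bucket
CREATE SCHEMA minio.sales
WITH (location = 's3a://datalake/sales/');

-- Map a CSV file already in the bucket to an external table.
-- Note: the Hive connector requires all columns of a CSV table to be VARCHAR.
CREATE TABLE minio.sales.orders (
    order_id VARCHAR,
    amount   VARCHAR
)
WITH (
    external_location = 's3a://datalake/sales/orders/',
    format = 'CSV'
);

-- Query the file through the SQL engine
SELECT count(*) FROM minio.sales.orders;
```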
One of the best tutorials on YouTube, thank you so much!
Thank you very much for this short but very useful video!
I love all your videos Haq. Great work. :)
great video, congrats.
Thanksss!!!
Thank you for the video, great as always 🎉
I want to ask: in this video, when we use Trino as the query engine, can we use DML and even DDL on that external table, or can we only select from it?
Thank you
There are a number of limitations to doing DML on Hive. Please read the documentation link for more details: cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's recommended not to use DML on Hive managed tables; especially if the data volume is huge, these operations become too slow. DML operations are considerably faster when done on a partition/bucket instead of the full table. Nevertheless, it is better to handle the edits in the file and do a full refresh via the external table, and only use DML on managed tables as a last resort. We define the table via DDL, so yes, DDL is supported.
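As a sketch of the trade-off described above (the table names and the partition column are hypothetical; Trino's Hive connector only allows DELETE when the predicate matches whole partitions):

```sql
-- Partition-scoped DML: considerably faster than a full-table operation,
-- and the only kind of DELETE the Hive connector accepts
DELETE FROM sales_managed WHERE sale_date = DATE '2024-07-01';

-- Preferred pattern: fix the source file in MinIO itself; the external
-- table reflects the change on the next query, with no DML needed
SELECT count(*) FROM sales_external;
```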
This is very informative !!! Thank you...
Can you please also make a video about creating an open source version of Amazon Forecast?
Amazon Forecast is a time-series forecasting service based on machine learning (ML). We can certainly do it using open source. I will cover time-series forecasting in the future. In the meantime, check out the ML predictive analytics in the following video: ua-cam.com/video/TR6vn4lZ3Mo/v-deo.html&t
Thank you. Can we build a transactional data lake using Iceberg/Hudi on this MinIO storage?
Yes, you can build a data lake using Iceberg and MinIO. Here is a guide that showcases both of these tools in conjunction.
resources.min.io/c/lakehouse-architecture-with-iceberg-minio
Can we directly connect Trino with S3, with no Hive in between? I want to install Trino on EC2.
I’m afraid not. Trino needs the table schema/metadata, and that is managed by the Hive metastore. Alternatively, we can use Apache Iceberg, but we still need the table mappings before the Trino query engine can access the data stored in S3.
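The metastore dependency shows up directly in the Trino catalog file for the Hive connector. A minimal sketch, assuming a metastore reachable at hive-metastore:9083 and MinIO at minio:9000 with default credentials (all hostnames, ports, and credentials here are assumptions for a local Docker setup):

```properties
# etc/catalog/minio.properties -- hypothetical catalog file name
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
# MinIO expects the bucket in the path, not the hostname
hive.s3.path-style-access=true
```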
How to build data lakehouse? Please next video with topic Lakehouse.❤
Yes, data lake house is on my radar. I will cover it in the future videos.
Thank you for the video
Hi sir, if I want to use Spark to save data to the data lake you built, how do I do that? (I just started learning about data lakes and Spark.)
Below is sample code to write data to a MinIO bucket with Spark.
package com.medium.scala.sparkbasics

import org.apache.spark.sql.SparkSession

// Reads a CSV file from a MinIO bucket and writes it back as Parquet.
// Endpoint and credentials below are the MinIO defaults for a local setup.
object MinIORead_Medium extends App {

  lazy val spark = SparkSession.builder()
    .appName("MinIOTest")
    .master("local[*]")
    .getOrCreate()

  val s3accessKeyAws = "minioadmin"
  val s3secretKeyAws = "minioadmin"
  val connectionTimeOut = "600000"
  // Plain HTTP for a local MinIO instance; enable SSL only if MinIO runs with TLS
  val s3endPointLoc: String = "http://127.0.0.1:9000"

  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.endpoint", s3endPointLoc)
  hadoopConf.set("fs.s3a.access.key", s3accessKeyAws)
  hadoopConf.set("fs.s3a.secret.key", s3secretKeyAws)
  hadoopConf.set("fs.s3a.connection.timeout", connectionTimeOut)
  // MinIO requires path-style access (bucket in the path, not the hostname)
  hadoopConf.set("fs.s3a.path.style.access", "true")
  hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")

  val yourBucket: String = "minio-test-bucket"
  val inputPath: String = s"s3a://$yourBucket/data.csv"
  // Parquet output is a directory, not a single file
  val outputPath = s"s3a://$yourBucket/output_data"

  // Read the CSV with a header row, then write it back as Parquet
  val df = spark.read
    .option("header", "true")
    .csv(inputPath)

  df.write
    .mode("overwrite")
    .parquet(outputPath)
}
@@BiInsightsInc Hi sir, I did everything as in your video and it worked fine, but when I remove the schema 'sales' I get an 'Access Denied' error?
@@akaile2233 You cannot delete objects from the Trino engine. You can do so in the Hive metastore. In this example, we're using MariaDB, so you can connect to it and delete objects from there. The changes will be reflected in the mappings you see in Trino.
@@BiInsightsInc Sorry to bother, there are too many tables in metastore_db, which ones should I delete?
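For reference, the Hive metastore keeps the relevant mappings in a small set of tables. A hedged sketch of where a schema's entries live (these are the standard metastore table names, but verify against your metastore version before deleting anything):

```sql
-- Databases (schemas) are rows in DBS
SELECT DB_ID, NAME FROM DBS WHERE NAME = 'sales';

-- Tables belonging to that schema are rows in TBLS,
-- with their storage locations referenced in SDS
SELECT TBL_ID, TBL_NAME FROM TBLS
WHERE DB_ID = (SELECT DB_ID FROM DBS WHERE NAME = 'sales');
```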
Sorry for the noob question, but can I create a Data Lake like this inside a PowerEdge T550 server instead of my desktop or laptop? Without resorting to paid cloud services?
Yes, you can create this setup on your server. This way you will use your own infrastructure and avoid paid services and data exposure to outside services.
@@BiInsightsInc Hi! Thank you very much for answering my question. And would you be able to tell me what RAM, cache and SSD requirements I need to have on the server to implement this setup, without slowing down processing for Data Science?
@@juliovalentim6178 The hardware requirements ultimately depend on the amount of data you are processing, and you can tweak them once you perform tests with actual data. Anyway, here are some recommendations from MinIO. The first is an actual data lake deployment you can use for reference. The second link covers a production-scale data lake. Hope this helps.
blog.min.io/building-an-on-premise-ml-ecosystem-with-minio-powered-by-presto-r-and-s3select-feature/
min.io/product/reference-hardware
@@BiInsightsInc Of course it helped! Thank you so much again. Congratulations on the excellent content of your channel. I will always be following. Best Regards!
You mentioned to someone that Apache Iceberg could be an alternative to Hive. Would you be interested in recording a new video about it?
@zera215 I have covered Apache Iceberg and how to use it in a similar setup in the following video: ua-cam.com/video/vnNHDylGtEk/v-deo.html
@@BiInsightsInc I am looking for a fully open-source solution. Do you know if I can just swap Hive for Iceberg in the architecture of this video?
@@zera215 You can use Hive and Iceberg together. You still need a metastore in order to work with Iceberg. Here is an example of how to use them together.
iceberg.apache.org/hive-quickstart/
@@BiInsightsInc Thank you, and congrats for your great work =-D
Hi, Haq. Nice video. I'm trying to make it work but cannot load the MinIO catalog.
You are not able to connect to MinIO in DBeaver? What's the error you receive there?
@@BiInsightsInc Hi Haq, thanks. I finally connected MinIO and Trino. But I have a question: how deep into a directory tree can Trino read Parquet files? I am trying to read Parquet files from MinIO with the directory structure s3a://datalake/bronze/erp/customers/. Inside the customers folder I have folders for each year/month/day. When I try to read the files, Trino returns 0 rows.
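For nested year/month/day directories like this, the usual fixes are either to declare the directory levels as partition columns and register them, or to enable recursive directory listing in the Hive catalog. A hedged sketch (catalog, schema, and column names are assumptions):

```sql
-- Declare the directory levels as partition columns (they must come last)
CREATE TABLE minio.bronze.customers (
    customer_id VARCHAR,
    year        VARCHAR,
    month       VARCHAR,
    day         VARCHAR
)
WITH (
    external_location = 's3a://datalake/bronze/erp/customers/',
    format = 'PARQUET',
    partitioned_by = ARRAY['year', 'month', 'day']
);

-- Register the partition directories that already exist in the bucket.
-- Note: this expects Hive-style names (year=2024/month=07/day=01);
-- plain nested folders need hive.recursive-directories=true in the
-- catalog properties instead.
CALL minio.system.sync_partition_metadata('bronze', 'customers', 'FULL');
```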