How to build on-premise Data Lake? | Build your own Data Lake | Open Source Tools | On-Premise
- Published Jul 8, 2024
- In this video, we cover the exciting world of data lakes. A data lake is an essential component of the modern data stack. We have previously developed a data lake in the AWS environment using AWS S3, Glue, and Athena. What if we want to deploy our own data lake with open-source tools on our own infrastructure? Here we deploy an on-premise data lake using open-source technologies. This way we learn the technologies behind a data lake, and most cloud offerings use these same technologies under the hood.
What is Data Lake? aws.amazon.com/big-data/datal...
Link to GitHub repo: github.com/hnawaz007/pythonda...
💥Subscribe to our channel:
/ haqnawaz
📌 Links
-----------------------------------------
#️⃣ Follow me on social media! #️⃣
🔗 GitHub: github.com/hnawaz007
📸 Instagram: / bi_insights_inc
📝 LinkedIn: / haq-nawaz
🔗 / hnawaz100
-----------------------------------------
#dataanalytics #datalake #opensource
Topics covered in this video:
==================================
0:00 - Introduction to Data Lake
1:36 - Tech Stack of on-premise Data Lake
1:49 - Docker Containers Overview
3:26 - Data Lake Configurations
4:48 - Start Docker Containers
5:59 - MinIO (S3) Bucket and File(s)
6:53 - File mapping to SQL Table
7:12 - Trino Cluster
7:32 - Trino SQL Engine Connection
8:37 - Create Schema
9:03 - Create Table
9:36 - Query External Table
10:12 - SQL Analysis
10:29 - Data Lake Tech Review
11:51 - Coming Soon
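The Create Schema, Create Table, and Query External Table steps in the chapter list boil down to a few Trino statements. A hedged sketch, assuming a Hive catalog named minio and a bucket named datalake (catalog, schema, table, and column names here are assumptions, not the exact names used in the video):

```sql
-- Create a schema whose tables live under the MinIO bucket
CREATE SCHEMA minio.sales
WITH (location = 's3a://datalake/sales/');

-- Map a CSV file already in the bucket to an external table.
-- Note: the Hive connector requires all columns of a CSV table to be VARCHAR.
CREATE TABLE minio.sales.orders (
    order_id VARCHAR,
    amount   VARCHAR
)
WITH (
    external_location = 's3a://datalake/sales/orders/',
    format = 'CSV'
);

-- Query the file through the SQL engine
SELECT count(*) FROM minio.sales.orders;
```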
One of the best tutorials on YouTube, thank you so much!
Thank you very much for this short but very useful video!
I love all your videos Haq. Great work. :)
great video, congrats.
Thanksss!!!
Thank you for the video, great as always 🎉
I want to ask: in this video, when we use Trino as the query engine, can we use DML and even DDL on that external table, or can we only select from it?
Thank you
There are a number of limitations to doing DML on Hive. Please read the documentation link for more details: cwiki.apache.org/confluence/display/Hive/Hive+Transactions. It's recommended not to use DML on Hive managed tables; especially if the data volume is huge, these operations become too slow. DML operations are considerably faster when done on a partition/bucket instead of the full table. Nevertheless, it is better to handle the edits in the file and do a full refresh via the external table, and only use DML on managed tables as a last resort. We define the table via DDL, so yes, DDL is supported.
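As a sketch of the trade-off described above (the table names and the partition column are hypothetical; Trino's Hive connector only allows DELETE when the predicate matches whole partitions):

```sql
-- Partition-scoped DML: considerably faster than a full-table operation,
-- and the only kind of DELETE the Hive connector accepts
DELETE FROM sales_managed WHERE sale_date = DATE '2024-07-01';

-- Preferred pattern: fix the source file in MinIO itself; the external
-- table reflects the change on the next query, with no DML needed
SELECT count(*) FROM sales_external;
```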
This is very informative !!! Thank you...
Can you please also make a video about creating an open source version of Amazon Forecast?
Amazon Forecast is a time-series forecasting service based on machine learning (ML). We can certainly do it using open source. I will cover time-series forecasting in the future. In the meantime, check out the ML predictive analytics in the following video: ua-cam.com/video/TR6vn4lZ3Mo/v-deo.html&t
Thank you. Can we build a transactional data lake using Iceberg/Hudi on this MinIO storage?
Yes, you can build a data lake using Iceberg and MinIO. Here is a guide that showcases both of these tools in conjunction.
resources.min.io/c/lakehouse-architecture-with-iceberg-minio
Can we directly connect Trino with S3, with no Hive in between? I want to install Trino on EC2.
I’m afraid not. Trino needs the table schema/metadata, and that is managed by the Hive metastore. Alternatively, we can use Apache Iceberg, but we still need the table mappings before the Trino query engine can access the data stored in S3.
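The metastore dependency shows up directly in the Trino catalog file for the Hive connector. A minimal sketch, assuming a metastore reachable at hive-metastore:9083 and MinIO at minio:9000 with default credentials (all hostnames, ports, and credentials here are assumptions for a local Docker setup):

```properties
# etc/catalog/minio.properties -- hypothetical catalog file name
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
# MinIO expects the bucket in the path, not the hostname
hive.s3.path-style-access=true
```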
How to build data lakehouse? Please next video with topic Lakehouse.❤
Yes, data lake house is on my radar. I will cover it in the future videos.
Thank you for the video
Hi sir, if I want to use Spark to save data to the data lake you built, how do I do that? (I just started learning about data lakes and Spark.)
Below is sample code to write data to a MinIO bucket with Spark.
package com.medium.scala.sparkbasics

import org.apache.spark.sql.SparkSession

// Reads a CSV file from a MinIO bucket and writes it back as Parquet.
// Endpoint and credentials below are the MinIO defaults for a local setup.
object MinIORead_Medium extends App {

  lazy val spark = SparkSession.builder()
    .appName("MinIOTest")
    .master("local[*]")
    .getOrCreate()

  val s3accessKeyAws = "minioadmin"
  val s3secretKeyAws = "minioadmin"
  val connectionTimeOut = "600000"
  // Plain HTTP for a local MinIO instance; enable SSL only if MinIO runs with TLS
  val s3endPointLoc: String = "http://127.0.0.1:9000"

  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.endpoint", s3endPointLoc)
  hadoopConf.set("fs.s3a.access.key", s3accessKeyAws)
  hadoopConf.set("fs.s3a.secret.key", s3secretKeyAws)
  hadoopConf.set("fs.s3a.connection.timeout", connectionTimeOut)
  // MinIO requires path-style access (bucket in the path, not the hostname)
  hadoopConf.set("fs.s3a.path.style.access", "true")
  hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")

  val yourBucket: String = "minio-test-bucket"
  val inputPath: String = s"s3a://$yourBucket/data.csv"
  // Parquet output is a directory, not a single file
  val outputPath = s"s3a://$yourBucket/output_data"

  // Read the CSV with a header row, then write it back as Parquet
  val df = spark.read
    .option("header", "true")
    .csv(inputPath)

  df.write
    .mode("overwrite")
    .parquet(outputPath)
}
@@BiInsightsInc Hi sir, I did everything as in your video and it worked fine, but when I remove the schema 'sales' I get an 'Access Denied' error?
@@akaile2233 You cannot delete objects from the Trino engine. You can do so in the Hive metastore. In this example, we're using MariaDB, so you can connect to it and delete objects from there. The changes will be reflected in the mappings you see in Trino.
@@BiInsightsInc Sorry to bother, there are too many tables in metastore_db, which ones should I delete?
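For reference, the Hive metastore keeps the relevant mappings in a small set of tables. A hedged sketch of where a schema's entries live (these are the standard metastore table names, but verify against your metastore version before deleting anything):

```sql
-- Databases (schemas) are rows in DBS
SELECT DB_ID, NAME FROM DBS WHERE NAME = 'sales';

-- Tables belonging to that schema are rows in TBLS,
-- with their storage locations referenced in SDS
SELECT TBL_ID, TBL_NAME FROM TBLS
WHERE DB_ID = (SELECT DB_ID FROM DBS WHERE NAME = 'sales');
```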
Sorry for the noob question, but can I create a Data Lake like this inside a PowerEdge T550 server instead of my desktop or laptop? Without resorting to paid cloud services?
Yes, you can create this setup on your server. This way you will use your own infrastructure and avoid paid services and data exposure to outside services.
@@BiInsightsInc Hi! Thank you very much for answering my question. And would you be able to tell me what RAM, cache and SSD requirements I need to have on the server to implement this setup, without slowing down processing for Data Science?
@@juliovalentim6178 The hardware requirements ultimately depend on the amount of data you are processing, and you can tweak them once you perform tests with actual data. Anyway, here are some recommendations from MinIO. The first is an actual data lake deployment you can use for reference. The second link covers a production-scale data lake. Hope this helps.
blog.min.io/building-an-on-premise-ml-ecosystem-with-minio-powered-by-presto-r-and-s3select-feature/
min.io/product/reference-hardware
@@BiInsightsInc Of course it helped! Thank you so much again. Congratulations on the excellent content of your channel. I will always be following. Best Regards!
You mentioned to someone that Apache Iceberg could be an alternative to Hive. Would you be interested in recording a new video about it?
@zera215 I have covered Apache Iceberg and how to use it in a similar setup in the following video: ua-cam.com/video/vnNHDylGtEk/v-deo.html
@@BiInsightsInc I am looking for a fully open-source solution. Do you know if I can just swap Hive for Iceberg in the architecture of this video?
@@zera215 You can use Hive and Iceberg together. You still need a metastore in order to work with Iceberg. Here is an example of how to use them together.
iceberg.apache.org/hive-quickstart/
@@BiInsightsInc Thank you, and congrats for your great work =-D
Hi, Haq. Nice video. I'm trying to make it work but cannot load the MinIO catalog.
You are not able to connect to MinIO in DBeaver? What's the error you receive there?
@@BiInsightsInc Hi Haq, thanks. I finally connected MinIO and Trino. But I have a question: how deep into a directory tree can Trino read Parquet files? I am trying to read Parquet files from MinIO with the directory structure s3a://datalake/bronze/erp/customers/. Inside the customers folder I have folders for each year/month/day. When I try to read the files, Trino returns 0 rows.
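For nested year/month/day directories like this, the usual fixes are either to declare the directory levels as partition columns and register them, or to enable recursive directory listing in the Hive catalog. A hedged sketch (catalog, schema, and column names are assumptions):

```sql
-- Declare the directory levels as partition columns (they must come last)
CREATE TABLE minio.bronze.customers (
    customer_id VARCHAR,
    year        VARCHAR,
    month       VARCHAR,
    day         VARCHAR
)
WITH (
    external_location = 's3a://datalake/bronze/erp/customers/',
    format = 'PARQUET',
    partitioned_by = ARRAY['year', 'month', 'day']
);

-- Register the partition directories that already exist in the bucket.
-- Note: this expects Hive-style names (year=2024/month=07/day=01);
-- plain nested folders need hive.recursive-directories=true in the
-- catalog properties instead.
CALL minio.system.sync_partition_metadata('bronze', 'customers', 'FULL');
```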