AWS Glue: Read CSV Files From AWS S3 Without Glue Catalog

  • Published 28 Aug 2022
  • This video is about how to read CSV data files stored in AWS S3 with AWS Glue when your data is not defined in the AWS Glue Catalog. This video uses the create_dynamic_frame_from_options method.
    AWS Documentation: docs.aws.amazon.com/glue/late...
    Code example: github.com/AdrianoNicolucci/d...
    #aws, #awsglue

COMMENTS • 54

  • @Diminishstudioz
    @Diminishstudioz A year ago +1

    I am so happy that I found this channel

  • @akshitha2110
    @akshitha2110 11 months ago +1

    Thank you. This is very helpful. My use case is to take CSV files from S3, perform data quality checks, and output them in Parquet format. I was planning to use PySpark in AWS, and I think this is a simple procedure I can follow to do the same.

    • @DataEngUncomplicated
      @DataEngUncomplicated  10 months ago

      No problem! Yup, this approach would work. Why do you need to use PySpark, though? Are you analyzing millions of records? If it's only thousands or hundreds of thousands, Lambda functions or just a Glue Python shell job might be sufficient.

  • @priyanka2309
    @priyanka2309 A year ago

    Excellent

  • @sumanranjan6597
    @sumanranjan6597 7 months ago

    Hi, I'm getting an error while running the first default code. Please provide the IAM role used to launch the notebook in AWS Glue.

  • @vvkk-vl9jw
    @vvkk-vl9jw A year ago

    Thank you very much for this video playlist. Please upload new videos on multiple conditions.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Thanks, can you elaborate on what videos would be helpful on multiple conditions?

    • @vvkk-vl9jw
      @vvkk-vl9jw A year ago

      @@DataEngUncomplicated Thank you for replying. I'd like new videos on 1) using triggers for crawlers and connecting to the SNS service for messages, and similar things, and 2) connecting an Oracle database to Glue for querying. I really appreciate your efforts. 💟

  • @tiktok4372
    @tiktok4372 A year ago

    Which is the better option: reading via the Glue Catalog or directly from S3?
    I'm working on a project where new data files are loaded into an S3 bucket every day (right now mostly Parquet files, but in the future there may be other formats). When the files are already in S3, we trigger an AWS Glue job to read (via the Glue Catalog), transform, and write the data to another S3 bucket. But before starting the Glue job, we need to run the related crawlers to crawl the new files (register new partitions, update the schema if there is any change, ...). Because of that, we need to create many crawlers and orchestrate them based on the event of the corresponding file being loaded into S3, and waiting for the crawlers to finish running also takes time and costs money. Do you think we should keep doing that or just read the files directly from S3? Is there any risk or performance issue between the 2 methods, or any other recommendation? Thank you very much

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey, sorry for the late reply.
      Whether to read from GlueCatalog or directly from S3 depends on the specific requirements and constraints of your project. Here are some factors to consider:
      Performance: Reading data directly from S3 can be faster than reading through GlueCatalog, as GlueCatalog adds an additional layer of metadata management. However, the performance difference may not be significant, especially if you use partitioning and indexing in GlueCatalog to optimize queries.
      Schema evolution: If your data schema is likely to change frequently or unpredictably, using GlueCatalog can provide a more flexible and automated way to manage schema evolution. GlueCatalog can automatically detect schema changes and update table definitions, which can save you from having to manually update your code.
      Cost: Using GlueCatalog can add some additional cost to your AWS bill, as you are paying for the metadata management and indexing that GlueCatalog provides. However, the cost may be small compared to the benefits of using GlueCatalog for your specific use case.
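      For reference, here is a rough sketch of both reading options (not from the video; the S3 path, database, and table names are placeholders):

      # Option 1: read directly from S3, no crawler or catalog table needed
      dyf_s3 = glueContext.create_dynamic_frame_from_options(
          connection_type="s3",
          connection_options={"paths": ["s3://your-bucket/input/"]},
          format="parquet",
      )

      # Option 2: read through the Glue Data Catalog (requires a crawled/registered table)
      dyf_catalog = glueContext.create_dynamic_frame.from_catalog(
          database="your_database",
          table_name="your_table",
      )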

  • @PRI_Vlogs_Australia
    @PRI_Vlogs_Australia A year ago

    Thank you for this awesome explanation. Can I please request you to make a video about 'How to implement Change Data Capture' using Python? And secondly, how to automate Python pipelines to load data into the AWS cloud, say S3. Thanks.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago +1

      Thanks! Sure, I will add change data capture to my video suggestion list. I have a couple of videos on writing data to S3 using the AWS Lambda service and AWS Glue you can check out. Also see this AWS blog post on CDC with AWS Glue: aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/
      This might be helpful if you can leverage the Iceberg file format.

  • @udaynayak4788
    @udaynayak4788 A year ago

    Just came across a new scenario, can you please create one on UDFs in PySpark AWS Glue? Needed the most.

  • @devanshaggarwal2627
    @devanshaggarwal2627 11 months ago

    What IAM role should I choose while creating an ETL job in a Jupyter notebook to write this code?

    • @DataEngUncomplicated
      @DataEngUncomplicated  10 months ago

      There isn't an existing role that will give you everything. You also need to add permissions for your S3 bucket if you are using it for reading and writing data.

  • @joelluis4938
    @joelluis4938 A year ago

    Is there any reason to avoid the Catalog? I'm just learning about Glue and I use the Catalog.
    I have another question: I've tried to run a crawler to take one CSV file from my S3 bucket, but when I check the new table, it doesn't recognize the column names. It shows col0, col1, col2, col3. Do you know why this happens, or how to solve it?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      No reason to avoid the Catalog. I made this video because you might need to read files with AWS Glue that have not already been configured in the AWS Glue Catalog.
      Is your schema defined in the first row (header) of your CSV files? That is one reason I can think of why it's not showing up in the Catalog.

  • @alejandrasilva8008
    @alejandrasilva8008 A year ago

    Hello, great video. Thank you.
    So, a question.
    When I run the code .printSchema()
    the notebook returns:
    root
    ++
    ||
    ++
    ++
    I reviewed the file and it has a header. What happened?
    And thank you for your answer.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey, thank you. I've seen this when no data is being returned... can you confirm there are no records being returned?

  • @shashankreddy8390
    @shashankreddy8390 A year ago

    Hi buddy, this is a nice video, but everyone creates videos on reading and writing from S3.
    1. Can you create a video on how to use a Glue Studio notebook (interactive session) to read data from the AWS Glue Catalog and write the results to S3?
    2. Please include every step, i.e. what kind of permissions we need to create to read and write.
    (I am getting a lot of permission denied errors.)
    Also, I'd recommend doing a video on the Athena notebook editor reading data from the Glue Catalog using PySpark.
    (Please also include detailed permission steps.)

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hi Shashank, these are great video suggestions, I will add them to my list. I have broken my videos down into smaller segments, but having an end-to-end video might be beneficial, especially with the permission challenges.

    • @shashankreddy8390
      @shashankreddy8390 A year ago

      @@DataEngUncomplicated what number is my request on your list 😅😅😅😅

  • @powerspan
    @powerspan 5 months ago

    Hello there, my CSV has a lot of non-UTF-8 characters. How can I ignore them while uploading, since it's throwing the error "unable to parse the file"?

    • @DataEngUncomplicated
      @DataEngUncomplicated  4 months ago

      In AWS Glue, you can use PySpark to read a CSV file and ignore non-UTF-8 characters. Here's an example if you convert your DynamicFrame into a PySpark DataFrame:
      from pyspark.sql.functions import col

      # Cast every column to a string so malformed values don't break later processing
      for column in df.columns:
          df = df.withColumn(column, col(column).cast("string"))

  • @himanshusingh-nv5wn
    @himanshusingh-nv5wn 4 months ago

    I am getting an iam:PassRole error, failed to start the session.
    I do have the Glue console full access policy attached to the IAM role.

  • @malvika2011
    @malvika2011 A year ago

    Thank you for this video. I am getting an error that glueContext is not defined, even though when starting a notebook in AWS Glue it gets imported automatically.
    Thank you

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago +1

      Hi Malvika, it sounds like you did not define glueContext correctly. I would check to make sure you included the template Python code that comes when you first create a new Glue job.
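      For reference, a minimal sketch of the template boilerplate a new Glue PySpark job or notebook session typically starts with (exact lines may vary by Glue version):

      import sys
      from awsglue.utils import getResolvedOptions
      from pyspark.context import SparkContext
      from awsglue.context import GlueContext
      from awsglue.job import Job

      # Create the Spark and Glue contexts that the rest of the script relies on
      sc = SparkContext.getOrCreate()
      glueContext = GlueContext(sc)
      spark = glueContext.spark_session
      job = Job(glueContext)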

    • @malvika2011
      @malvika2011 A year ago

      @@DataEngUncomplicated Thank you 😊 I will check and get back. Thank you for the response. Merry Christmas and a Happy New Year to you !

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Thanks! Merry Christmas and happy new year!

  • @jomymcet
    @jomymcet 9 months ago

    Can anyone please help me? I have some non-ASCII characters in my file placed in S3. How can I remove those junk characters from that file in S3 using AWS Glue? Please help.

    • @DataEngUncomplicated
      @DataEngUncomplicated  9 months ago +1

      Hi, try posting on AWS repost, you might get a quicker response for this particular problem.

  • @patilharss
    @patilharss A year ago

    How can I update the file and store it again in S3?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey Harsh, do you want to replace the same data on AWS S3? There is a parameter on write that will overwrite the partition, which could be an option.
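      As a rough illustration (not from the video), one way to do this is with the plain Spark writer after converting the DynamicFrame; the output path and partition columns below are placeholders:

      # Assumption: dyf is the DynamicFrame read earlier
      df = dyf.toDF()

      # Only overwrite the partitions present in the new data
      spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

      (df.write
         .mode("overwrite")
         .partitionBy("year", "month")   # hypothetical partition columns
         .parquet("s3://your-bucket/output/"))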

  • @muralichiyan
    @muralichiyan A month ago

    Are Databricks and Glue the same?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A month ago

      If you're asking whether Databricks and Glue are the same, then no, they definitely are not.

  • @yagnasivasai
    @yagnasivasai A year ago

    Do you have any course related to the content?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey, unfortunately I don't have a formal course, but I am building out a YouTube playlist related to AWS Glue for reading, transforming, and writing data: ua-cam.com/play/PL7bE4nSzLSWci0WpYafgTOBcqpdtO3cdY.html

  • @denmur77
    @denmur77 A year ago

    Thanks for your valuable videos! I'm working on an interesting task. I need to use Kinesis Data Streams as a source in AWS Glue (without Lambda or other AWS services) and put data into RDS Aurora PostgreSQL. I can NOT do that for some reason. Do you think it's possible?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Yes, you can! AWS Glue actually supports a streaming mode that supports Kinesis as a data source!
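      As a rough sketch based on the Glue streaming docs (option names may vary by Glue version; all ARNs, endpoints, and credentials below are placeholders):

      # Read the Kinesis stream as a streaming DataFrame in a Glue streaming job
      kinesis_df = glueContext.create_data_frame.from_options(
          connection_type="kinesis",
          connection_options={
              "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/your-stream",
              "startingPosition": "TRIM_HORIZON",
              "inferSchema": "true",
              "classification": "json",
          },
          transformation_ctx="kinesis_df",
      )

      def process_batch(batch_df, batch_id):
          # Write each micro-batch to Aurora PostgreSQL over JDBC
          batch_df.write.format("jdbc").options(
              url="jdbc:postgresql://your-aurora-endpoint:5432/yourdb",
              dbtable="public.your_table",
              user="your_user",
              password="your_password",
          ).mode("append").save()

      glueContext.forEachBatch(
          frame=kinesis_df,
          batch_function=process_batch,
          options={"windowSize": "100 seconds", "checkpointLocation": "s3://your-bucket/checkpoints/"},
      )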

    • @denmur77
      @denmur77 A year ago

      @@DataEngUncomplicated Did you try it?

    • @denmur77
      @denmur77 A year ago

      @@DataEngUncomplicated I can put data from KDS into S3, or grab data from S3 and put it into RDS, but I can't do it directly from KDS.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      It says it supports it: docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html

    • @denmur77
      @denmur77 A year ago

      @@DataEngUncomplicated I saw that. Unfortunately it doesn't work.

  • @shashankemani1609
    @shashankemani1609 3 months ago

    Could you please let me know why you are using glueContext, as you are not using any of the Glue ETL functionalities, and why you are using a dynamic frame, as you are not dealing with semi-structured or unstructured data? Any specific reason?

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 months ago

      Hi there, although in this tutorial I am not using any Glue transformation methods, I am using the create_dynamic_frame_from_options method to load the data, which is from the GlueContext class. This is why we need to use glueContext. Dynamic frames can be used for structured data as well, not only semi-structured or unstructured data.
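      For reference, a minimal sketch of that call for CSV files in S3 (the path is a placeholder, not from the video):

      dyf = glueContext.create_dynamic_frame_from_options(
          connection_type="s3",
          connection_options={"paths": ["s3://your-bucket/csv-data/"]},
          format="csv",
          format_options={"withHeader": True, "separator": ","},
      )
      dyf.printSchema()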

  • @bk3460
    @bk3460 2 months ago

    Sorry, what is wrong with df = spark.read.csv(path)?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 months ago

      That works too, but it's not using the AWS Glue library to do it.

    • @bk3460
      @bk3460 2 months ago

      @@DataEngUncomplicated Sorry, I'm new to Spark and Glue. Would you mind elaborating on which Glue library you are referring to? I know about the Glue Data Catalog, but it is not affected when I use df = spark.read.csv(path).

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 months ago

      Give the AWS Glue API and the transformations that come with it a read: docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html
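      As a rough illustration of a Glue-specific transform that plain spark.read.csv does not give you (column names here are hypothetical):

      from awsglue.transforms import ApplyMapping

      # Rename and retype columns on the DynamicFrame read earlier (dyf)
      mapped = ApplyMapping.apply(
          frame=dyf,
          mappings=[
              ("id", "string", "id", "int"),
              ("name", "string", "customer_name", "string"),
          ],
      )
      mapped.printSchema()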