AWS Glue: Read CSV Files From AWS S3 Without Glue Catalog

  • Published 28 Aug 2022
  • This video is about how to read CSV data files stored in AWS S3 with AWS Glue when your data is not defined in the AWS Glue Catalog. This video uses the create_dynamic_frame_from_options method.
    AWS Documentation: docs.aws.amazon.com/glue/late...
    Code example: github.com/AdrianoNicolucci/d...
    #aws, #awsglue

COMMENTS • 54

  • @Diminishstudioz
    @Diminishstudioz A year ago +1

    I am so happy that I found this channel

  • @akshitha2110
    @akshitha2110 11 months ago +1

    Thank you. This is very helpful. My use case is to take CSV files from S3, perform data quality checks, and output them in Parquet format. I was planning to use PySpark in AWS, and I think this is a simple procedure I can follow to do the same.

    • @DataEngUncomplicated
      @DataEngUncomplicated  10 months ago

      No problem! Yup, this approach would work. Why do you need to use PySpark, though? Are you analyzing millions of records? If it's only thousands or hundreds of thousands, Lambda functions or just a Glue Python shell job might be sufficient.

  • @priyanka2309
    @priyanka2309 A year ago

    Excellent

  • @sumanranjan6597
    @sumanranjan6597 7 months ago

    Hi, I'm getting an error while running the first default code. Please provide the IAM role used to launch the notebook in AWS Glue.

  • @vvkk-vl9jw
    @vvkk-vl9jw A year ago

    Thank you very much for this video playlist. Please upload new videos on multiple conditions.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Thanks, can you elaborate on what videos would be helpful on multiple conditions?

    • @vvkk-vl9jw
      @vvkk-vl9jw A year ago

      @@DataEngUncomplicated Thank you for replying. I'd like new videos on 1) using triggers for crawlers and connecting to the SNS service for messages, and similar things, and 2) connecting an Oracle database to Glue for querying. I really appreciate your efforts. 💟

  • @tiktok4372
    @tiktok4372 A year ago

    Which is the better option: reading via the Glue Catalog or directly from S3?
    I'm working on a project where new data files are loaded into an S3 bucket every day (right now mostly Parquet files, but in the future there may be other formats). When the files are already in S3, we trigger an AWS Glue job to read (via the Glue Catalog), transform, and write the data to another S3 bucket. But before starting the Glue job, we need to run the related crawlers to crawl the new files (register new partitions, update the schema if there is any change, ...). Because of that, we need to create many crawlers and orchestrate them based on the event of the corresponding file being loaded into S3, and waiting for the crawlers to finish running also takes time and costs money. Do you think we should keep doing that or just read the files directly from S3? Is there any risk or performance issue between the 2 methods, or any other recommendation? Thank you very much

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey, sorry for the late reply.
      Whether to read from GlueCatalog or directly from S3 depends on the specific requirements and constraints of your project. Here are some factors to consider:
      Performance: Reading data directly from S3 can be faster than reading through GlueCatalog, as GlueCatalog adds an additional layer of metadata management. However, the performance difference may not be significant, especially if you use partitioning and indexing in GlueCatalog to optimize queries.
      Schema evolution: If your data schema is likely to change frequently or unpredictably, using GlueCatalog can provide a more flexible and automated way to manage schema evolution. GlueCatalog can automatically detect schema changes and update table definitions, which can save you from having to manually update your code.
      Cost: Using GlueCatalog can add some additional cost to your AWS bill, as you are paying for the metadata management and indexing that GlueCatalog provides. However, the cost may be small compared to the benefits of using GlueCatalog for your specific use case.
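      For reference, here is a rough sketch of both reading options (not from the video; the S3 path, database, and table names are placeholders):

      # Option 1: read directly from S3, no crawler or catalog table needed
      dyf_s3 = glueContext.create_dynamic_frame_from_options(
          connection_type="s3",
          connection_options={"paths": ["s3://your-bucket/input/"]},
          format="parquet",
      )

      # Option 2: read through the Glue Data Catalog (requires a crawled/registered table)
      dyf_catalog = glueContext.create_dynamic_frame.from_catalog(
          database="your_database",
          table_name="your_table",
      )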

  • @PRI_Vlogs_Australia
    @PRI_Vlogs_Australia A year ago

    Thank you for this awesome explanation. Can I please request you to make a video about 'How to implement Change Data Capture' using Python? And secondly, how to automate Python pipelines to load data into the AWS cloud, say S3. Thanks.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago +1

      Thanks! Sure, I will add change data capture to my video suggestion list. I have a couple of videos on writing data to S3 using the AWS Lambda service and AWS Glue you can check out. Also see this AWS blog post on CDC with AWS Glue: aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/
      This might be helpful if you can leverage the Iceberg file format.

  • @udaynayak4788
    @udaynayak4788 A year ago

    Just came across a new scenario, can you please create one on UDFs in PySpark AWS Glue? Needed the most.

  • @devanshaggarwal2627
    @devanshaggarwal2627 11 months ago

    What IAM role should I choose while creating an ETL job in a Jupyter notebook to write this code?

    • @DataEngUncomplicated
      @DataEngUncomplicated  10 months ago

      There isn't an existing role that will give you everything. You also need to add permissions for your S3 bucket if you are using it for reading and writing data.

  • @joelluis4938
    @joelluis4938 A year ago

    Is there any reason to avoid the Catalog? I'm just learning about Glue and I use the Catalog.
    I have another question: I've tried to run a crawler to take one CSV file from my S3 bucket, but when I check the new table, it doesn't recognize the column names. It shows col0, col1, col2, col3. Do you know why this happens, or how to solve it?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      No reason to avoid the Catalog. I made this video because you might need to read files with AWS Glue that have not already been configured in the AWS Glue Catalog.
      Is your schema defined in the first row (header) of your CSV files? That is one reason I can think of why it's not showing up in the Catalog.

  • @alejandrasilva8008
    @alejandrasilva8008 A year ago

    Hello, great video. Thank you.
    So, a question.
    When I run the code .printSchema()
    the notebook returns:
    root
    ++
    ||
    ++
    ++
    I reviewed the file and it has a header. What happened?
    And thank you for your answer.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey, thank you. I've seen this when no data is being returned... can you confirm there are no records being returned?

  • @shashankreddy8390
    @shashankreddy8390 A year ago

    Hi buddy, this is a nice video, but everyone creates videos on reading and writing from S3.
    1. Can you create a video on how to use a Glue Studio notebook (interactive session) to read data from the AWS Glue Catalog and write the results to S3?
    2. Please include every step, i.e. what kind of permissions we need to create to read and write.
    (I am getting a lot of permission denied errors.)
    Also, I'd recommend doing a video on the Athena notebook editor reading data from the Glue Catalog using PySpark.
    (Please also include detailed permission steps.)

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hi Shashank, these are great video suggestions, I will add them to my list. I have broken my videos down into smaller segments, but having an end-to-end video might be beneficial, especially with the permission challenges.

    • @shashankreddy8390
      @shashankreddy8390 A year ago

      @@DataEngUncomplicated what number is my request on your list 😅😅😅😅

  • @powerspan
    @powerspan 5 months ago

    Hello there, my CSV has a lot of non-UTF-8 characters. How can I ignore them while uploading, since it's throwing the error "unable to parse the file"?

    • @DataEngUncomplicated
      @DataEngUncomplicated  4 months ago

      In AWS Glue, you can use PySpark to read a CSV file and ignore non-UTF-8 characters. Here's an example if you convert your DynamicFrame into a PySpark DataFrame:
      from pyspark.sql.functions import col

      # Cast every column to a string so malformed values don't break later processing
      for column in df.columns:
          df = df.withColumn(column, col(column).cast("string"))

  • @himanshusingh-nv5wn
    @himanshusingh-nv5wn 4 months ago

    I am getting an iam:PassRole error, failed to start the session.
    I do have the Glue console full access policy attached to the IAM role.

  • @malvika2011
    @malvika2011 A year ago

    Thank you for this video. I am getting an error that glueContext is not defined, even though when starting a notebook in AWS Glue it gets imported automatically.
    Thank you

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago +1

      Hi Malvika, it sounds like you did not define glueContext correctly. I would check to make sure you included the template Python code that comes when you first create a new Glue job.
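      For reference, a minimal sketch of the template boilerplate a new Glue PySpark job or notebook session typically starts with (exact lines may vary by Glue version):

      import sys
      from awsglue.utils import getResolvedOptions
      from pyspark.context import SparkContext
      from awsglue.context import GlueContext
      from awsglue.job import Job

      # Create the Spark and Glue contexts that the rest of the script relies on
      sc = SparkContext.getOrCreate()
      glueContext = GlueContext(sc)
      spark = glueContext.spark_session
      job = Job(glueContext)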

    • @malvika2011
      @malvika2011 A year ago

      @@DataEngUncomplicated Thank you 😊 I will check and get back. Thank you for the response. Merry Christmas and a Happy New Year to you !

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Thanks! Merry Christmas and happy new year!

  • @jomymcet
    @jomymcet 9 months ago

    Can anyone please help me? I have some non-ASCII characters in my file placed in S3. How can I remove those junk characters from that file in S3 using AWS Glue? Please help.

    • @DataEngUncomplicated
      @DataEngUncomplicated  9 months ago +1

      Hi, try posting on AWS repost, you might get a quicker response for this particular problem.

  • @patilharss
    @patilharss A year ago

    How can I update the file and store it again in S3?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey Harsh, do you want to replace the same data on AWS S3? There is a parameter on write that will overwrite the partition, which could be an option.
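      As a rough illustration (not from the video), one way to do this is with the plain Spark writer after converting the DynamicFrame; the output path and partition columns below are placeholders:

      # Assumption: dyf is the DynamicFrame read earlier
      df = dyf.toDF()

      # Only overwrite the partitions present in the new data
      spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

      (df.write
         .mode("overwrite")
         .partitionBy("year", "month")   # hypothetical partition columns
         .parquet("s3://your-bucket/output/"))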

  • @muralichiyan
    @muralichiyan A month ago

    Are Databricks and Glue the same?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A month ago

      If you're asking whether Databricks and Glue are the same, then no, they definitely are not.

  • @yagnasivasai
    @yagnasivasai A year ago

    Do you have any course related to the content?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Hey, unfortunately I don't have a formal course, but I am building out a YouTube playlist related to AWS Glue for reading, transforming, and writing data: ua-cam.com/play/PL7bE4nSzLSWci0WpYafgTOBcqpdtO3cdY.html

  • @denmur77
    @denmur77 A year ago

    Thanks for your valuable videos! I'm working on an interesting task. I need to use Kinesis Data Streams as a source in AWS Glue (without Lambda or other AWS services) and put data into RDS Aurora PostgreSQL. I can NOT do that for some reason. Do you think it's possible?

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      Yes, you can! AWS Glue actually supports a streaming mode that supports Kinesis as a data source!
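      As a rough sketch based on the Glue streaming docs (option names may vary by Glue version; all ARNs, endpoints, and credentials below are placeholders):

      # Read the Kinesis stream as a streaming DataFrame in a Glue streaming job
      kinesis_df = glueContext.create_data_frame.from_options(
          connection_type="kinesis",
          connection_options={
              "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/your-stream",
              "startingPosition": "TRIM_HORIZON",
              "inferSchema": "true",
              "classification": "json",
          },
          transformation_ctx="kinesis_df",
      )

      def process_batch(batch_df, batch_id):
          # Write each micro-batch to Aurora PostgreSQL over JDBC
          batch_df.write.format("jdbc").options(
              url="jdbc:postgresql://your-aurora-endpoint:5432/yourdb",
              dbtable="public.your_table",
              user="your_user",
              password="your_password",
          ).mode("append").save()

      glueContext.forEachBatch(
          frame=kinesis_df,
          batch_function=process_batch,
          options={"windowSize": "100 seconds", "checkpointLocation": "s3://your-bucket/checkpoints/"},
      )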

    • @denmur77
      @denmur77 A year ago

      @@DataEngUncomplicated Did you try it?

    • @denmur77
      @denmur77 A year ago

      @@DataEngUncomplicated I can put data from KDS into S3, or grab data from S3 and put it into RDS, but I can't do it directly from KDS.

    • @DataEngUncomplicated
      @DataEngUncomplicated  A year ago

      It says it supports it: docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html

    • @denmur77
      @denmur77 A year ago

      @@DataEngUncomplicated I saw that. Unfortunately it doesn't work.

  • @shashankemani1609
    @shashankemani1609 3 months ago

    Could you please let me know why you are using glueContext, as you are not using any of the Glue ETL functionalities, and why you are using a dynamic frame, as you are not dealing with semi-structured or unstructured data? Any specific reason?

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 months ago

      Hi there, although in this tutorial I am not using any Glue transformation methods, I am using the create_dynamic_frame_from_options method to load the data, which is from the GlueContext class. This is why we need to use glueContext. Dynamic frames can be used for structured data as well, not only semi-structured or unstructured data.
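      For reference, a minimal sketch of that call for CSV files in S3 (the path is a placeholder, not from the video):

      dyf = glueContext.create_dynamic_frame_from_options(
          connection_type="s3",
          connection_options={"paths": ["s3://your-bucket/csv-data/"]},
          format="csv",
          format_options={"withHeader": True, "separator": ","},
      )
      dyf.printSchema()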

  • @bk3460
    @bk3460 2 months ago

    Sorry, what is wrong with df = spark.read.csv(path)?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 months ago

      That works too, but it's not using the AWS Glue library to do it.

    • @bk3460
      @bk3460 2 months ago

      @@DataEngUncomplicated Sorry, I'm new to Spark and Glue. Would you mind elaborating on which Glue library you are referring to? I know about the Glue Data Catalog, but it is not affected when I use df = spark.read.csv(path).

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 months ago

      Give the AWS Glue API and the transformations that come with it a read: docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html
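      As a rough illustration of a Glue-specific transform that plain spark.read.csv does not give you (column names here are hypothetical):

      from awsglue.transforms import ApplyMapping

      # Rename and retype columns on the DynamicFrame read earlier (dyf)
      mapped = ApplyMapping.apply(
          frame=dyf,
          mappings=[
              ("id", "string", "id", "int"),
              ("name", "string", "customer_name", "string"),
          ],
      )
      mapped.printSchema()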