AWS Data Wrangler: Write Parquet to AWS S3

  • Published Oct 6, 2024

COMMENTS • 58

  • @youngzproduction7498
    @youngzproduction7498 2 years ago +1

    I’m new to AWS, but your tutorial makes all the hardship go away. Thanks.

  • @Lukas-wm8dy
    @Lukas-wm8dy 1 year ago

    Great content, two comments:
    1. I think what would be helpful is if you would briefly mention the purpose of the dataset parameter - what is the difference between a "Parquet" and an "Ordinary" dataset to the user and in terms of functionality? In which scenarios does one provide benefits over the other?
    2. Also, is it required to write out the datatypes? Generally Glue crawlers work very effectively for datasets stored in Parquet, so I'm unclear about the purpose
    Thanks for doing these videos!

    • @affanahmed7064
      @affanahmed7064 1 year ago

      Hey Lukas, the datatypes were very useful for me. Since I did a COPY from S3 to Redshift, I can use the datatypes parameter to match my table schema in Redshift. Without it, I ran into errors because the python dataframe had datatypes like 'int64' which were not compatible with Redshift's 'int'.
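
      For illustration, a minimal sketch of that dtype idea, assuming a dataframe df and hypothetical bucket, column, and type names (the dtype values are Athena/Glue types, not pandas dtypes):

          import awswrangler as wr

          # Cast columns to explicit Glue/Athena types at write time so the parquet
          # schema lines up with the target Redshift table before a COPY.
          wr.s3.to_parquet(
              df=df,                                  # existing pandas DataFrame
              path="s3://my-bucket/staging/orders/",  # hypothetical S3 prefix
              dataset=True,
              dtype={"order_id": "int", "amount": "double", "order_date": "date"},
          )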

  • @lilmune
    @lilmune 3 years ago +1

    lifesaver video! Thank you so much :)

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago

      You're welcome, Alissa! Thanks for leaving a comment and I'm glad it was helpful!

  • @thalanayarswaminathansaipr7703
    @thalanayarswaminathansaipr7703 2 years ago +1

    Very good. It will be really helpful.

  • @kanchidoshi6907
    @kanchidoshi6907 2 years ago

    Thanks for the video. You mentioned a default database. What dialect is it, and what are the credentials? Also, how can I load the parquet into Redshift?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 years ago

      Hi Kanchi, the "default" database in the Glue catalog is not actually an RDS database with credentials; according to the description it's a Hive database, which is just a grouping of whatever tables you assign to it. You will need to add your Redshift database connection to the Glue catalog so you can write to it with the awswrangler.redshift.connect method. Check out my video on how to add a database connection; it's not on Redshift but it might help. ua-cam.com/video/UUoQAe_NzaA/v-deo.html
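
      For illustration, a rough sketch of that approach, assuming a Glue catalog connection named "my-redshift-connection" and hypothetical schema, table, and bucket names:

          import awswrangler as wr

          # Open a connection defined in the Glue catalog.
          con = wr.redshift.connect("my-redshift-connection")  # hypothetical connection name

          # Stage the dataframe in S3 and load it into Redshift with a COPY.
          wr.redshift.copy(
              df=df,                                    # existing pandas DataFrame
              path="s3://my-bucket/redshift-staging/",  # hypothetical temporary S3 prefix
              con=con,
              schema="public",
              table="orders",
              mode="append",
              # depending on your setup you may also need iam_role="arn:aws:iam::...:role/..."
          )
          con.close()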

  • @gonzalonistal2765
    @gonzalonistal2765 3 years ago

    great video, explained very simply. Thank you

  • @AlDamara-x8j
    @AlDamara-x8j 5 months ago

    Thanks for this tutorial! I have a question: how do I store one big parquet file in S3 without running into kernel-dying issues? I have already used df.convert_dtypes() but it is still failing. My file has over 1.5 million rows. Files below 900k rows are not failing! Thanks

  • @himanshu090783
    @himanshu090783 2 years ago

    Very useful, I was looking for this concrete info. Good work... Can I get info on an external source like Teradata to S3?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 years ago

      Hi Himanshu, thanks for the comment!
      The AWS Data Wrangler library helps load and read data within a variety of AWS services. If you want to do this with Python, you will have to look at another library to read the Teradata data into Python.

  • @gayathrichakravarthy1056
    @gayathrichakravarthy1056 3 years ago +1

    Very informative! I am breaking up a 50 GB CSV into multiple parquets using Data Wrangler. However, some partitions have different datatypes (something about pandas, perhaps?), making my Spectrum query crash. Is there something in Data Wrangler that will help overcome this issue? Thanks

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago

      Thanks for watching! Hey Gayathri, I suggest you use the pandas dataframe "astype" method to transform the data to have the same datatypes when you write it to S3. So instead of just transforming your data to parquet, make sure you transform your column schema to be the same across your dataset.
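
      For illustration, a minimal sketch of that idea, assuming the CSV is read in chunks and using hypothetical column names:

          import awswrangler as wr
          import pandas as pd

          # One explicit schema applied to every chunk so all parquet files agree.
          schema = {"customer_id": "int64", "amount": "float64", "region": "string"}

          for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
              chunk = chunk.astype(schema)
              wr.s3.to_parquet(
                  df=chunk,
                  path="s3://my-bucket/analytics/",  # hypothetical prefix
                  dataset=True,
                  mode="append",
              )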

    • @gayathrichakravarthy1056
      @gayathrichakravarthy1056 3 years ago

      @@DataEngUncomplicated Thank you. We have nearly 700 cols, so going down the astype route makes the code lengthy, but it appears that's the only choice left :)

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago

      One table has 700 columns?!?

    • @gayathrichakravarthy1056
      @gayathrichakravarthy1056 3 years ago

      @@DataEngUncomplicated :) Yes, it does. It's an Analytics dataset and as you can tell we are struggling to load it all up!

  • @bhavanis9300
    @bhavanis9300 3 years ago +1

    thanks for this video

  • @yorkstbd
    @yorkstbd 3 years ago

    Great video!

  • @MJ-em_jay
    @MJ-em_jay 3 years ago

    This is super interesting... thanks!

  • @tanapatyoonuch5427
    @tanapatyoonuch5427 2 years ago

    Great video, thanks!
    Question: at 7:35, what would happen if some of the date data are NA/null?

  • @anishachoudhury_
    @anishachoudhury_ 2 years ago

    Hi, I want to convert my dataframe into a table in Timestream using this awswrangler. But the problem is, where do I give the path of my already created database and table?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 years ago

      Hi Anisha, check out the documentation to find where this parameter is. In this video I only covered writing to S3.
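
      For illustration, a hedged sketch of writing a dataframe to Timestream with awswrangler, assuming an already created database and table and hypothetical column names (check the documentation for the exact parameters in your version):

          import awswrangler as wr

          # Timestream is addressed by database and table name, not by an S3 path.
          rejected = wr.timestream.write(
              df=df,                          # existing pandas DataFrame
              database="my_database",         # hypothetical, already created
              table="my_table",               # hypothetical, already created
              time_col="time",                # column holding the timestamps
              measure_col="value",            # column holding the measurement
              dimensions_cols=["device_id"],  # dimension columns
          )
          print(f"Rejected records: {len(rejected)}")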

  • @SP-db6sh
    @SP-db6sh 2 years ago

    A partition is like an index or primary key, isn't it? What's the difference?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 years ago

      Neither of those; a partition is a directory in S3. It could be a date, for example date=2013-03-01.
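
      For illustration, a minimal sketch of how partition_cols maps to S3 directories, assuming a dataframe with a "date" column and a hypothetical bucket:

          import awswrangler as wr

          wr.s3.to_parquet(
              df=df,                              # existing pandas DataFrame with a "date" column
              path="s3://my-bucket/my-dataset/",  # hypothetical prefix
              dataset=True,
              partition_cols=["date"],
          )
          # Produces keys such as:
          #   s3://my-bucket/my-dataset/date=2013-03-01/<uuid>.snappy.parquet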

  • @petarulev6977
    @petarulev6977 1 year ago

    Hello! If I have a script which runs hourly, is it fine to have database, table and partition_cols specified each time? Isn't it going to run some crawlers to create new partitions (for example, date) and thus cause additional AWS charges? Or is it just fine to use these three arguments on each call?

    • @DataEngUncomplicated
      @DataEngUncomplicated  1 year ago

      I'm not sure what you are trying to do exactly or what AWS charges you are talking about. This method doesn't trigger a Glue crawler, but if you have data to write for new partitions, it will create new partitions.
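
      For illustration, a sketch of such an hourly write (hypothetical names); passing database, table and partition_cols on every call simply registers any new partitions in the Glue catalog as part of the write, with no crawler involved:

          import awswrangler as wr

          # Safe to run every hour: existing partitions are left alone and
          # new ones are added to the Glue catalog during the write.
          wr.s3.to_parquet(
              df=df,                          # this hour's data
              path="s3://my-bucket/events/",  # hypothetical prefix
              dataset=True,
              mode="append",
              database="default",             # Glue database
              table="events",                 # Glue table
              partition_cols=["date"],
          )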

  • @zealkapadiya1096
    @zealkapadiya1096 2 years ago

    Could you please make a video on how to write parquet files to S3 using Java (without using Spark)?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 years ago

      Hi Zeal, AWS Data Wrangler is a Python library. Sorry, I don't program in Java.

    • @surajmalthumkar7171
      @surajmalthumkar7171 1 year ago

      Hi Zeal, did you find a way to write parquet files to S3 using Java? Please ping me if you do.

  • @Adithiscorner
    @Adithiscorner 3 years ago

    Hi, can you run the same code multiple times and show us? I am getting an access denied exception when trying to overwrite the existing file in S3.

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago +1

      Hi Avanthi, sounds like you have an S3 permission issue. Your role probably doesn't have permission to overwrite an existing file.

    • @Adithiscorner
      @Adithiscorner 3 years ago

      @@DataEngUncomplicated We have full S3 permissions. Could it be because of the Glue table overwrite permission?

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago

      There could be a deny statement that stops that role from overwriting a file. It's possible; I would check the Glue permissions as well.

  • @varshawaingankar1492
    @varshawaingankar1492 3 years ago

    Is there a way to convert mzXML to parquet?

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago +1

      That's a very unique file format you don't see too often! I did a quick search and couldn't find a way to convert it with AWS Data Wrangler.

    • @varshawaingankar1492
      @varshawaingankar1492 3 years ago

      @@DataEngUncomplicated Thank you - this video was very helpful. Maybe I can try mzXML to CSV and CSV to parquet.

  • @haliltaylan6181
    @haliltaylan6181 3 years ago

    Hi, my wrangler times out on Lambda.
    I used use_threads=True but it didn't work much better.
    I set chunksize to True.
    Could you give me any tips?
    Thanks!

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago +1

      Is your Lambda running inside a VPC? I had this timeout error happen to me before. If so, you need to add a VPC endpoint for the resource you are trying to connect to so the Lambda can access it. There is a function in AWS Data Wrangler to set the VPC endpoint.
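
      For illustration, a hedged sketch of pointing awswrangler at a VPC endpoint through its global configuration, assuming your version exposes wr.config.s3_endpoint_url (the endpoint URL below is a placeholder):

          import awswrangler as wr

          # Route S3 calls through the interface VPC endpoint instead of the public endpoint.
          wr.config.s3_endpoint_url = "https://bucket.vpce-0123456789abcdef0.s3.us-east-1.vpce.amazonaws.com"

          wr.s3.to_parquet(df=df, path="s3://my-bucket/my-dataset/", dataset=True)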

    • @haliltaylan6181
      @haliltaylan6181 3 years ago

      @@DataEngUncomplicated Thanks for the fast reply.
      It's not a connectivity issue. I have a 192 MB Parquet which I want to partition twice.
      I have set concurrent_partitioning=True,
      but it is still too slow.
      Could you tell me how I could increase the speed?
      Thanks

    • @haliltaylan6181
      @haliltaylan6181 3 years ago +1

      @@DataEngUncomplicated For a small file size it works just fine.

    • @DataEngUncomplicated
      @DataEngUncomplicated  3 years ago +1

      Have you experienced your Lambdas crashing, or just taking way too long? 192 MB is a huge file to throw at a Lambda function, which only has a maximum memory of 10 GB and a timeout of 15 minutes. How much RAM is allocated to your Lambda? I found my own Lambdas were crashing when the file was larger than 100 MB.
      Do you have partitions on your dataset? One suggestion I have is to only query the partitions you need instead of the entire dataset; this will save you from reading the entire dataset into memory. Another thing you can do is query your dataset through AWS Athena via AWS Data Wrangler, so Athena does all the heavy lifting rather than your Lambda reading in all that data.
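
      For illustration, hedged sketches of both suggestions, assuming a dataset partitioned by a "date" column and hypothetical bucket, table, and database names:

          import awswrangler as wr

          # 1) Read only the partitions you need instead of the whole dataset.
          df = wr.s3.read_parquet(
              path="s3://my-bucket/my-dataset/",
              dataset=True,
              partition_filter=lambda p: p["date"] == "2023-03-01",
          )

          # 2) Or let Athena do the heavy lifting and return only the result set.
          df = wr.athena.read_sql_query(
              "SELECT * FROM my_table WHERE date = '2023-03-01'",
              database="default",
          )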

    • @haliltaylan6181
      @haliltaylan6181 3 years ago

      @@DataEngUncomplicated Thank you :)
      I tried the Athena approach, but Athena can't insert into more than 100 partitions.
      I will try to solve this issue on Monday.
      Have a nice weekend

  • @clock0500
    @clock0500 8 months ago

    How did you set up this environment?

    • @clock0500
      @clock0500 8 months ago

      I used Boto3 in a Jupyter cell. I hope that's good.

    • @DataEngUncomplicated
      @DataEngUncomplicated  8 months ago

      What is your definition of environment? Just the script?

    • @DataEngUncomplicated
      @DataEngUncomplicated  8 months ago +1

      I was running this locally in a Jupyter notebook for this example.

    • @clock0500
      @clock0500 8 months ago

      @@DataEngUncomplicated I meant the way you connected Jupyter to AWS.

    • @DataEngUncomplicated
      @DataEngUncomplicated  8 months ago +1

      @@clock0500 I configured my AWS account user with the AWS CLI.
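
      For illustration, a minimal sketch of how a notebook picks up those CLI credentials, assuming a profile configured with the AWS CLI (the profile and region below are placeholders):

          import boto3
          import awswrangler as wr

          # awswrangler uses the default boto3 session (your AWS CLI credentials)
          # unless you pass an explicit session.
          session = boto3.Session(profile_name="default", region_name="us-east-1")

          wr.s3.to_parquet(
              df=df,
              path="s3://my-bucket/my-dataset/",  # hypothetical prefix
              dataset=True,
              boto3_session=session,
          )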

  • @venkatabora1459
    @venkatabora1459 2 years ago

    Thanks for the video. Is it possible to add a valid name to the parquet file within the partition, instead of saving it as 99c6134rtgyh5.snappy.parquet? Appreciate your help.

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 years ago

      Hi Venkata, yes this should be possible! There is a parameter that should control adding a prefix to the file name. Check out the API documentation.
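
      For illustration, a hedged sketch of that parameter (filename_prefix) with hypothetical names; note that awswrangler still appends its own generated suffix after the prefix:

          import awswrangler as wr

          wr.s3.to_parquet(
              df=df,
              path="s3://my-bucket/my-dataset/",  # hypothetical prefix
              dataset=True,
              partition_cols=["date"],
              filename_prefix="sales_",  # files come out as sales_<uuid>.snappy.parquet
          )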

    • @venkatabora1459
      @venkatabora1459 2 years ago

      @@DataEngUncomplicated Thanks, I added the filename_prefix parameter, but it just concatenated that prefix to the existing file name,
      e.g. filename_prefix99c6134rtgyh5.snappy.parquet. I am looking for filename_prefix.parquet.