I’m new to AWS, but your tutorial makes all the hardship go away. Thanks.
Thanks! I'm glad it was helpful!
Great content, two comments:
1. I think what would be helpful is if you would briefly mention the purpose of the dataset parameter - what is the difference between a "Parquet" and an "Ordinary" dataset to the user and in terms of functionality? In which scenarios does one provide benefits over the other?
2. Also, is it required to write out the datatypes? Generally Glue crawlers work very effectively for datasets stored in Parquet, so I'm unclear about the purpose
Thanks for doing these videos!
Hey Lukas, the datatypes were very useful for me. Since I did a COPY from S3 to Redshift, I could use the datatypes parameter to match my table schema in Redshift. Without it, I ran into errors because the Python dataframe had datatypes like 'int64' which were not compatible with Redshift's 'int'.
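To illustrate the idea in the comment above, here is a minimal stdlib-only sketch of mapping pandas dtypes to Redshift-compatible column types. The mapping dict is an assumption for illustration, not an official awswrangler table, and the actual `dtype` parameter of `awswrangler.s3.to_parquet` expects Athena/Glue type strings, so check the docs for the exact names your table needs.

```python
# Illustrative mapping from pandas dtype names to Redshift column types.
# The exact target types depend on your table schema; this dict is an
# assumption, not an official awswrangler mapping.
PANDAS_TO_REDSHIFT = {
    "int64": "BIGINT",
    "int32": "INTEGER",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "VARCHAR(256)",
}

def redshift_dtypes(pandas_dtypes: dict) -> dict:
    """Translate a {column: pandas dtype name} mapping into Redshift types,
    falling back to VARCHAR for anything unrecognized."""
    return {
        col: PANDAS_TO_REDSHIFT.get(dt, "VARCHAR(256)")
        for col, dt in pandas_dtypes.items()
    }

print(redshift_dtypes({"id": "int64", "name": "object"}))
# → {'id': 'BIGINT', 'name': 'VARCHAR(256)'}
```

The resulting dict is the sort of thing you would pass as the `dtype` argument when writing the dataset before the COPY into Redshift.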
lifesaver video! Thank you so much :)
you're welcome Alissa, thanks for leaving a comment and I'm glad it was helpful!
Very good. It will be really helpful.
Thanks, I'm glad it was helpful!
Thanks for the video. You mentioned a default database. What dialect is it, and what would the credentials be? Also, how can I load the parquet to Redshift?
Hi Kanchi, the "default" database in the Glue catalog is not actually an RDS database with credentials; it's a Hive database according to the description, which is just a grouping of whatever tables you assign to it. You will need to add your Redshift database connection to the Glue catalog so you can write to it with the awswrangler.redshift.connect method. Check out my video on how to add a database connection; it's not on Redshift but it might help: ua-cam.com/video/UUoQAe_NzaA/v-deo.html
great video, explained very simply. Thank you
You're welcome! Thanks for leaving me a comment!
Thanks for this tutorial! I have a question: how do I store one big parquet file in S3 without running into kernel-dying issues? I have already used df.convert_dtypes() but it is still failing. My file has over 1.5 million rows; files below 900k rows are not failing! Thanks
Very useful, I was looking for this exact info. Good work! Can I get info on an external source like Teradata to S3?
Hi Himanshu, thanks for the comment!
The AWS Data Wrangler library helps load and read data across a variety of AWS services. If you want to do this from another source, you will have to look at a different library to read it into Python first.
Very informative! I am breaking up a 50 GB CSV into multiple Parquet files using Data Wrangler. However, some partitions have different datatypes (something about pandas, perhaps?), making my Spectrum query crash. Is there something in Data Wrangler that will help overcome this issue? Thanks
Thanks for watching! Hey Gayathri, I suggest you use the pandas dataframe "astype" method to transform the data to have the same datatypes when you write it to S3. So instead of just transforming your data to parquet, make sure you transform your column schema to be the same across your whole dataset.
@@DataEngUncomplicated Thank you. We have nearly 700 cols, so going down the astype route makes the code lengthy, but it appears that's the only choice left :)
One table has 700 columns?!?
@@DataEngUncomplicated :) Yes, it does. It's an Analytics dataset and as you can tell we are struggling to load it all up!
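For a table that wide, the astype mapping can be built programmatically instead of hand-writing hundreds of entries. A minimal sketch of the idea from the thread above, with made-up column names, forcing two partitions that pandas inferred differently onto one target schema before writing:

```python
import pandas as pd

# Two partitions of the same logical table that pandas inferred
# differently (column names are made up for illustration).
part_a = pd.DataFrame({"user_id": [1, 2], "score": [0.5, 0.7]})
part_b = pd.DataFrame({"user_id": [3.0, 4.0], "score": [1, 2]})

# Build one target schema programmatically (e.g. from a schema file or
# a naming convention) instead of typing 700 astype entries by hand,
# then apply it to every partition before writing to S3.
target = {"user_id": "int64", "score": "float64"}
part_a = part_a.astype(target)
part_b = part_b.astype(target)

# Both partitions now share one column schema.
assert list(part_a.dtypes.astype(str)) == list(part_b.dtypes.astype(str))
```

With a consistent schema across partitions, a Spectrum query over the dataset no longer sees conflicting column types.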
thanks for this video
You're welcome!
Great video!
This is super interesting... thanks!
You're welcome! Thanks for watching and leaving a comment!
great video, thanks!
Question at 7:35: what would happen if some of the date data are NA/null?
Thanks Tanapat, it should write it as null to parquet.
Hi, I want to convert my dataframe into a table in Timestream using awswrangler. But the problem is, where do I give the path of my already created db and table?
Hi Anusha, check out the documentation to find where this parameter is. In this video I only covered writing to S3.
A partition is like an index or primary key, isn't it? What's the difference?
Neither, a partition is a directory in S3. It could be a date, for example date=2013-03-01.
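To make that concrete, here is a small stdlib-only sketch of how a Hive-style partition directory is built into an S3 key prefix from a record. The bucket, prefix, and column names are illustrative:

```python
def partition_key(prefix: str, record: dict, partition_cols: list) -> str:
    """Build a Hive-style S3 key prefix like '.../date=2013-03-01/'
    from the partition-column values of one record."""
    parts = [f"{col}={record[col]}" for col in partition_cols]
    return "/".join([prefix.rstrip("/")] + parts) + "/"

print(partition_key("s3://my-bucket/data",
                    {"date": "2013-03-01", "country": "CA"},
                    ["date"]))
# → s3://my-bucket/data/date=2013-03-01/
```

This is the layout `partition_cols` produces when writing a dataset: each distinct partition value becomes its own directory, so a query engine can skip whole directories it doesn't need.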
Hello! If I have a script which runs hourly, is it fine to have database, table and partition_cols specified each time? Isn't it going to run some crawlers to create new partitions (for example, date) and thus cause additional AWS charges? Or it is just fine to use these three arguments at each call?
I'm not sure what you are trying to do exactly and what aws charges you are talking about. This method doesn't trigger a glue crawler. But if you have data to write for new partitions, it will create new partitions.
Could you please make a video on how to write parquet files to S3 using Java (without using Spark)?
Hi Zeal, AWS Data Wrangler is a Python library. Sorry, I don't program in Java.
Hi Zeal, did you find a way to write parquet files to S3 using Java? Please ping me if you do.
Hi, can you run the same code multiple times and show us? I am getting an access denied exception when trying to overwrite the existing file in S3.
Hi Avanthi, it sounds like you have an S3 permission issue. Your role probably doesn't have permission to overwrite an existing file.
@@DataEngUncomplicated We have full S3 permissions. Could it be because of Glue table overwrite permissions?
There could be a deny statement stopping that role from overwriting a file. It's possible; I would check the Glue permissions as well.
Is there a way to convert mzXML to parquet?
That's a very unusual file format you don't see too often! I did a quick search and couldn't find a way to convert it with AWS Data Wrangler.
@@DataEngUncomplicated Thank you, this video was very helpful. Maybe I can try mzXML to CSV and CSV to parquet.
Hi, my wrangler times out on Lambda.
I used use_threads=True but it didn't work much better.
I set chunksize to True.
Could you give me any tips?
Thanks!
Is your Lambda running inside a VPC? I had this timeout error happen to me before. If so, you need to add a VPC endpoint for the resource you are trying to connect to so it can access it. There is a function in AWS Data Wrangler to set the VPC endpoint.
@@DataEngUncomplicated thanks for the fast reply.
It's not a connectivity issue. I have a 192 MB Parquet file which I want to partition twice.
I have set concurrent_partitioning=True.
But it is still too slow.
Could you tell me how I could increase the speed?
Thanks
@@DataEngUncomplicated For a small file size it's working just fine.
Have you experienced your Lambdas crashing, or just taking way too long? 192 MB is a huge file to throw at a Lambda function, which has a maximum memory of 10 GB and a timeout of 15 minutes. How much RAM is allocated to your Lambda? I found my Lambdas were crashing when the file was larger than 100 MB myself.
Do you have partitions on your dataset? One suggestion I have is to only query the partitions you need instead of the entire dataset; this will save you from reading the entire dataset into memory. Another thing you can do is query your dataset through Amazon Athena via AWS Data Wrangler, so Athena does all the heavy lifting rather than your Lambda reading in all that data.
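The "only query the partitions you need" idea can be sketched without any AWS calls: given Hive-style object keys, select the subset a filter accepts before reading anything. (awswrangler exposes a similar `partition_filter` callable on its read functions; treat the exact signature as something to verify in the docs.) Everything below is stdlib-only with made-up keys:

```python
def prune_partitions(keys, predicate):
    """Keep only the object keys whose partition values pass `predicate`.
    Partition values are parsed from Hive-style path segments such as
    'date=2013-03-01'."""
    selected = []
    for key in keys:
        parts = dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)
        if predicate(parts):
            selected.append(key)
    return selected

keys = [
    "data/date=2013-03-01/a.parquet",
    "data/date=2013-03-02/b.parquet",
]
print(prune_partitions(keys, lambda p: p["date"] == "2013-03-02"))
# → ['data/date=2013-03-02/b.parquet']
```

In a Lambda this means only the matching files are ever downloaded, instead of the whole dataset being pulled into memory.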
@@DataEngUncomplicated thank you :)
I tried the Athena approach, but Athena can't insert into more than 100 partitions.
I will try to solve this issue on Monday.
Have a nice weekend
How did you set up this environment?
I used Boto3 on a Jupyter cell. I hope that's good.
What is your definition of environment? Just the script?
I was running this locally in a Jupyter notebook for this example.
@@DataEngUncomplicated I meant the way you connected Jupyter to AWS.
@@clock0500 I configured my AWS account user with the AWS CLI.
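For anyone following along, running `aws configure` writes a credentials file that boto3 (and therefore awswrangler in a local Jupyter notebook) picks up automatically. The values below are placeholders:

```
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-east-1
```

With the default profile in place, no credentials need to appear in the notebook itself.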
Thanks for the video. Is it possible to give the parquet file within the partition a meaningful name? Instead of saving it as 99c6134rtgyh5.snappy.parquet, can we add a proper name? Appreciate your help
Hi Venkata, yes, this should be possible! There is a parameter that should control adding a prefix to the file name. Check out the API documentation.
@@DataEngUncomplicated Thanks, I added the parameter filename_prefix but it just concatenated that parameter to the existing file name,
e.g. filename_prefix99c6134rtgyh5.snappy.parquet. I am looking for filename_prefix.parquet
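For what it's worth, in dataset mode the library appends a generated suffix by design, so an exact file name likely needs a non-dataset write to a full object path. This is a hedged sketch under that assumption; verify the behavior against the awswrangler docs before relying on it. The bucket and file names are illustrative:

```python
def single_file_key(bucket: str, name: str) -> str:
    """Build the exact S3 path for a single-file (non-dataset) write."""
    return f"s3://{bucket}/{name}.parquet"

def write_named_parquet(df, bucket: str, name: str):
    # Assumption: with dataset=False (the default) and a path ending in
    # .parquet, awswrangler writes one file at exactly that key, with no
    # generated suffix. Requires AWS credentials, so not called here.
    import awswrangler as wr
    wr.s3.to_parquet(df=df, path=single_file_key(bucket, name))

print(single_file_key("my-bucket", "daily_export"))
# → s3://my-bucket/daily_export.parquet
```

The trade-off is that a non-dataset write skips partitioning and Glue catalog registration, so it only fits when a single, exactly named file is the goal.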