AWS Tutorials - Interactively Develop Glue Job using Jupyter Notebook
- Published 5 Mar 2022
- One can use Jupyter Notebook with AWS Glue Studio to develop Glue jobs interactively. You can write and test job code line by line and, once done, simply save it as a Glue job. Learn how to do interactive Glue job development in Glue Studio using Jupyter Notebook.
- Science & Technology
Amazing video! Keep making and sharing such educational videos.
Hello sir, I am using an AWS Python shell job, and when I write "from awsglue.context import GlueContext" it gives an error saying no module named awsglue.context. Can you help me in this regard? Thank you.
Is there any method for autosaving the notebook work?
Thanks for this tutorial. Can you show us how you defined your dojo-glue-role?
How do you configure BOOKMARKS?
I tried this and it didn't work:
{
"job bookmark": "enable"
}
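For anyone who lands here later: as far as I know, Glue does not read a "job bookmark" key; bookmarks are controlled by the --job-bookmark-option job argument. A sketch for an interactive session (the %%configure magic must run before the session starts):

```
%%configure
{
    "--job-bookmark-option": "job-bookmark-enable"
}
```

Bookmarks also require calling job.init(...) at the start of the script and job.commit() at the end, otherwise the bookmark state is never advanced.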
I'm getting an error while writing the file to the output folder.
It would be better if you always add a set of links with the resources used in the material. For example, add a location for the example data and links to video materials if you did it earlier. It would save your audience time, I believe.
Another idea. Add a CloudFormation code for your example. This would be something wonderful.
Great suggestion! I will try to add as much resources as possible going forward.
great video, I tried running some exploratory analysis using Glue Studio, however, after some time I am getting a 'Security Token Expired' message. I did not set any AWS credentials, I am able to access all other services in the AWS console. "when calling the GetSession operation: The security token included in the request is expired " Can you please help?
Sorry, found the answer - when initiating the glue notebook, the role attached to the notebook attempts to use existing AssumeRole session credentials - default is set to 1 hour, you can change it in the IAM role - Summary section to a higher value.
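In case it saves someone a click, the same change can be made from the CLI; a sketch (the role name is a placeholder, the duration is in seconds, maximum 43200):

```
aws iam update-role --role-name dojo-glue-role --max-session-duration 14400
```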
Great. Sorry I could not reply earlier. I have taken a few weeks off.
Is there a roadmap when this feature will be available also in other AWS regions?
I am sure it will be generally available at some point in time. Not sure about the timeline.
Hi! Can you do a video on how to create Spark UI for dev endpoint? It's not very straightforward and it's useful for debugging.
Sure. I will plan for it.
Sir, I need your advice.
I have created a Glue job which should read from Oracle RDS and insert data into Redshift. It works for small tables, but for a large table, when I create a dynamic frame from options with the connection option as a query, it does not work.
How many records are in the large table? And what do you mean by "not working"? Do you see any error?
@@AWSTutorialsOnline Sir, this large table has 3 months of data and I want only data since sysdate-1. I am getting the error: An error occurred while calling o95.getDynamicFrame. key not found: location
@@amitpatil7828 The error does not seem related to data size. Are you sure your small and large tables have the same structure? It looks more like a missing field or something.
@@AWSTutorialsOnline Yes, no issue with data size. The issue is with the query option which I am using in the dynamic frame from options:
connection_type="oracle",
connection_options={
    "url": "",
    "query": "select * from table where updated_date >= trunc(sysdate-1)",
    "username": "xyz",
    "password": "****",
}
This is giving an error.
How can I deploy this job in a reusable fashion, so a single notebook handles ingestion/transformation for multiple tables?
You will need to parameterize the job and pass reusable parameters when executing the job.
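To sketch the idea in plain Python (so it runs anywhere): in an actual Glue job you would read the parameter with getResolvedOptions from awsglue.utils; the table names, database names, and the build_job_config helper below are made up for illustration.

```python
# Sketch: one parameterized job serving many tables.
# In a real Glue job you would do something like:
#   args = getResolvedOptions(sys.argv, ["TABLE_NAME"])
#   table_name = args["TABLE_NAME"]
# Here a plain function stands in so the pattern is testable anywhere.

def build_job_config(table_name, source_db="oracle_rds", target_db="redshift"):
    """Return the per-table settings a generic ingestion job would use."""
    return {
        "source": f"{source_db}.{table_name}",
        "target": f"{target_db}.{table_name}",
        "query": f"select * from {table_name} where updated_date >= trunc(sysdate-1)",
    }

# The same job code serves every table; only the parameter changes per run.
for tbl in ["orders", "customers"]:
    cfg = build_job_config(tbl)
    print(cfg["source"], "->", cfg["target"])
```

When triggering the saved job, you would pass the table name as a job argument (e.g. --TABLE_NAME orders) instead of hard-coding it in the notebook.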
How do I get the second cell, in which configuring and starting an interactive session is available? For me, it's not coming automatically.
It should appear automatically. Can you please try again?
How do I choose between a Glue notebook and an EMR notebook (both are serverless now) for setting up Apache PySpark ETL code?
I think we can use Glue dev endpoints, Glue ETL jobs, or Glue Studio notebooks for small transformations with DynamicFrames, and EMR notebooks for complex transformations with DataFrames that can't be done with DynamicFrames, while still interacting with the Glue Data Catalog.
Also, Glue only supports Apache Spark, but EMR supports many other big data frameworks and platforms.
@@AWSTutorialsOnline Yes.
But we go to EMR if we hit some limitation using Spark within Glue, right?
EMR = open-source PySpark DataFrames.
Glue = open-source PySpark DynamicFrames (converting a DataFrame to a DynamicFrame is very costly!).
I'm confused about when to use Glue Python vs Lambda Python in some use cases, like ingesting data from outside the cloud or loading data into Redshift using the COPY command.
Use Lambda when ingestion would in no case go beyond 15 minutes, and when your file is not so big that it will exhaust the memory allocated to Lambda.
@@AWSTutorialsOnline Yeah, exactly. The biggest difference I see is that Glue jobs have much longer maximum runtimes than Lambda functions. A Lambda function can run for at most 15 minutes before it's terminated. Glue jobs have a default timeout of 2,880 minutes, or 48 hours. If you have a long-running data retrieval task, Python shell jobs are a much better tool than Lambda functions.
Amazing video! Can you make a video on how to ingest data from Salesforce into an S3 bucket? Thanks in advance.
Could you dive deeper into Glue/PySpark? How can one learn to code here? Thanks.
Please check this tutorial. Hope it helps. ua-cam.com/video/vT9vu3NMsk4/v-deo.html
Useful video.
Hello,
How do I add a Python library in this notebook?
I have tried:
%additional_python_module s2cell
but this is not working.
Would you please help me add the library?
What error do you get? Have you zipped all Python files in the root directory, uploaded the zip to an S3 bucket, and then mentioned that location in %additional_python_modules? Also check whether the notebook IAM role has access to the S3 location where your Python code zip is stored. Hope it helps.
@@AWSTutorialsOnline Thank you so much. It works! I really appreciate your help and the work you're doing for AWS learners!
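For anyone else hitting this, a sketch of the magic as I understand it (note the plural name; it accepts PyPI package names and S3 zip paths, comma-separated, and must run before the first cell that starts the session; the bucket path below is a placeholder):

```
%additional_python_modules s2cell,s3://my-bucket/libs/my_modules.zip
```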
Could you please let me know which policy you have used in your IAM role? I'm getting an error when creating the Glue session, as permission is denied.
Hopefully you fixed your error! In case anyone else is having a similar issue, always include what the error message says when asking these questions. Every job is different and can require different policies.
@@Confusedcapybara8772 Yeah, somehow I fixed it. Thanks for the reply.
How can we do SSH tunneling in an AWS Glue job?
I don't think you have access to the Apache Spark environment which Glue uses. It is serverless.
@@AWSTutorialsOnline Thank you!
I mean, can we run the below code?
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ssh_address_or_host=(hostname, 22),
    ssh_username='username',
    ssh_pkey='CB.pem',
    remote_bind_address=(ip, 5432),
    local_bind_address=('', 45432)
) as server:
    server.start()
    print(server.local_bind_address)
    server.close()
I think you can. You need to configure a Glue Connection (network type) to establish a network path to the host, and then associate the connection with the Glue job.
@@AWSTutorialsOnline I would be thankful if you could make a video on this.
What are the permissions you have for your dojogluejob IAM role?
AWSGlueServiceRole + AmazonS3FullAccess
@@AWSTutorialsOnline Failed to authenticate user due to missing information in request.
@@STLEON Can you please share the exact error text?
@@AWSTutorialsOnline - My IAM role has both of these policies, but I get the following error when trying to run the second block in my notebook:
An error occurred (AccessDeniedException) when calling the CreateSession operation: User: assumed-role/AWSGlueServiceRoleDefault/GlueJobRunnerSession is not authorized to perform: iam:PassRole on resource: AWSGlueServiceRoleDefault because no identity-based policy allows the iam:PassRole action. What did I do wrong?
@@dwaram1 Not sure why I am not facing this error. Use a custom IAM role with the following permissions: AWSGlueServiceRole + AmazonS3FullAccess + a custom policy (with the iam:PassRole permission). Use this role. It is documented here - docs.aws.amazon.com/glue/latest/dg/attach-policy-iam-user.html
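In case it helps others, a sketch of the custom iam:PassRole statement (the account ID and role name are placeholders; scope the Resource to your own role):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault",
            "Condition": {
                "StringLike": { "iam:PassedToService": "glue.amazonaws.com" }
            }
        }
    ]
}
```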