AWS Tutorials - Interactively Develop Glue Job using Jupyter Notebook
- Published 5 Mar 2022
- One can use Jupyter Notebook with AWS Glue Studio to develop Glue jobs interactively. You can write and test job code line by line and, once done, simply save it as a Glue job. Learn how to do interactive Glue job development in Glue Studio using Jupyter Notebook.
- Science & Technology
Amazing video! Keep making and sharing such educational videos.
Hello sir, I am using an AWS Python shell job, and when I write "from awsglue.context import GlueContext" it gives an error saying no module named awsglue.context. Can you help me in this regard? Thank you.
Is there any method for autosaving the notebook work?
Thanks for this tutorial. Can you show us how you defined your dojo-glue-role?
How do you configure BOOKMARKS?
I tried this and it didn't work:
{
"job bookmark": "enable"
}
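For anyone who lands here later: as far as I know, Glue does not read a "job bookmark" key; bookmarks are controlled by the --job-bookmark-option job argument. A sketch for an interactive session (the %%configure magic must run before the session starts):

```
%%configure
{
    "--job-bookmark-option": "job-bookmark-enable"
}
```

Bookmarks also require calling job.init(...) at the start of the script and job.commit() at the end, otherwise the bookmark state is never advanced.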
I'm getting an error while writing the file to the output folder.
It would be better if you always add a set of links with the resources used in the material. For example, add a location for the example data and links to video materials if you did it earlier. It would save your audience time, I believe.
Another idea. Add a CloudFormation code for your example. This would be something wonderful.
Great suggestion! I will try to add as much resources as possible going forward.
great video, I tried running some exploratory analysis using Glue Studio, however, after some time I am getting a 'Security Token Expired' message. I did not set any AWS credentials, I am able to access all other services in the AWS console. "when calling the GetSession operation: The security token included in the request is expired " Can you please help?
Sorry, found the answer - when initiating the glue notebook, the role attached to the notebook attempts to use existing AssumeRole session credentials - default is set to 1 hour, you can change it in the IAM role - Summary section to a higher value.
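In case it saves someone a click, the same change can be made from the CLI; a sketch (the role name is a placeholder, the duration is in seconds, maximum 43200):

```
aws iam update-role --role-name dojo-glue-role --max-session-duration 14400
```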
Great. Sorry I could not reply earlier. I have taken a few weeks off.
Is there a roadmap when this feature will be available also in other AWS regions?
I am sure it will be generally available at some point in time. Not sure about the timeline.
Hi! Can you do a video on how to create Spark UI for dev endpoint? It's not very straightforward and it's useful for debugging.
Sure. I will plan for it.
Sir, I need your advice.
I have created a Glue job which should read from Oracle RDS and insert data into Redshift. It works for small tables, but for a large table, when I create a dynamic frame from options with the connection option as a query, it does not work.
How many records are in the large table? And what do you mean by "not working"? Do you see any error?
@@AWSTutorialsOnline Sir, this large table has 3 months of data and I want only data since sysdate-1. I am getting the error: An error occurred while calling o95.getDynamicFrame. key not found: location
@@amitpatil7828 The error does not seem related to data size. Are you sure your small and large tables have the same structure? It looks more like a missing field or something.
@@AWSTutorialsOnline Yes, no issue with data size. The issue is with the query option which I am using in the dynamic frame from options:
connection_type="oracle",
connection_options={
    "url": "",
    "query": "select * from table where updated_date >= trunc(sysdate-1)",
    "username": "xyz",
    "password": "****",
}
This is giving an error.
How can I deploy this job in a reusable fashion, so a single notebook handles ingestion/transformation for multiple tables?
You will need to parameterize the job and pass reusable parameters when executing the job.
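To sketch the idea in plain Python (so it runs anywhere): in an actual Glue job you would read the parameter with getResolvedOptions from awsglue.utils; the table names, database names, and the build_job_config helper below are made up for illustration.

```python
# Sketch: one parameterized job serving many tables.
# In a real Glue job you would do something like:
#   args = getResolvedOptions(sys.argv, ["TABLE_NAME"])
#   table_name = args["TABLE_NAME"]
# Here a plain function stands in so the pattern is testable anywhere.

def build_job_config(table_name, source_db="oracle_rds", target_db="redshift"):
    """Return the per-table settings a generic ingestion job would use."""
    return {
        "source": f"{source_db}.{table_name}",
        "target": f"{target_db}.{table_name}",
        "query": f"select * from {table_name} where updated_date >= trunc(sysdate-1)",
    }

# The same job code serves every table; only the parameter changes per run.
for tbl in ["orders", "customers"]:
    cfg = build_job_config(tbl)
    print(cfg["source"], "->", cfg["target"])
```

When triggering the saved job, you would pass the table name as a job argument (e.g. --TABLE_NAME orders) instead of hard-coding it in the notebook.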
How do I get the second cell, in which configuring and starting an interactive session is available? For me, it's not coming automatically.
It should appear automatically. Can you please try again?
How do I choose between a Glue notebook and an EMR notebook (both are serverless now) for setting up Apache PySpark ETL code?
I think we can use Glue dev endpoints, Glue ETL jobs, or Glue Studio notebooks for small transformations with DynamicFrames, and EMR notebooks for complex transformations with DataFrames that can't be done with DynamicFrames, while still interacting with the Glue Data Catalog.
Also, Glue only supports Apache Spark, but EMR supports many other big data frameworks and platforms.
@@AWSTutorialsOnline Yes.
But we go to EMR if we hit some limitation using Spark within Glue, right?
EMR = open-source PySpark DataFrames.
Glue = open-source PySpark DynamicFrames (converting a DataFrame to a DynamicFrame is very costly!).
I'm confused about when to use Glue Python vs Lambda Python in some use cases, like ingesting data from outside the cloud or loading data into Redshift using the COPY command.
Use Lambda when ingestion would in no case go beyond 15 minutes, and when your file is not so big that it will exhaust the memory allocated to Lambda.
@@AWSTutorialsOnline Yeah, exactly. The biggest difference I see is that Glue jobs have much longer maximum runtimes than Lambda functions. A Lambda function can run for at most 15 minutes before it's terminated. Glue jobs have a default timeout of 2,880 minutes, or 48 hours. If you have a long-running data retrieval task, Python shell jobs are a much better tool than Lambda functions.
Amazing video! Can you make a video on how to ingest data from Salesforce into an S3 bucket? Thanks in advance.
Could you dive deeper into Glue/PySpark? How can one learn to code here? Thanks.
Please check this tutorial. Hope it helps. ua-cam.com/video/vT9vu3NMsk4/v-deo.html
Useful video.
Hello,
How do I add a Python library in this notebook?
I have tried:
%additional_python_module s2cell
but this is not working.
Would you please help me add the library?
What error do you get? Have you zipped all Python files in the root directory, uploaded the zip to an S3 bucket, and then mentioned that location in %additional_python_modules? Also check whether the notebook IAM role has access to the S3 location where your Python code zip is stored. Hope it helps.
@@AWSTutorialsOnline Thank you so much. It works! I really appreciate your help and the work you're doing for AWS learners!
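For anyone else hitting this, a sketch of the magic as I understand it (note the plural name; it accepts PyPI package names and S3 zip paths, comma-separated, and must run before the first cell that starts the session; the bucket path below is a placeholder):

```
%additional_python_modules s2cell,s3://my-bucket/libs/my_modules.zip
```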
Could you please let me know which policy you have used in your IAM role? I'm getting an error when creating the Glue session, as permission is denied.
Hopefully you fixed your error! In case anyone else is having a similar issue, always include what the error message says when asking these questions. Every job is different and can require different policies.
@@Confusedcapybara8772 Yeah, somehow I fixed it. Thanks for the reply.
How can we do SSH tunneling in an AWS Glue job?
I don't think you have access to the Apache Spark environment which Glue uses. It is serverless.
@@AWSTutorialsOnline Thank you!
I mean, can we run the below code?
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ssh_address_or_host=(hostname, 22),
    ssh_username='username',
    ssh_pkey='CB.pem',
    remote_bind_address=(ip, 5432),
    local_bind_address=('', 45432)
) as server:
    server.start()
    print(server.local_bind_address)
    server.close()
I think you can. You need to configure a Glue Connection (network type) to establish a network path to the host, and then associate the connection with the Glue job.
@@AWSTutorialsOnline I would be thankful if you could make a video on this.
What are the permissions you have for your dojogluejob IAM role?
AWSGlueServiceRole + AmazonS3FullAccess
@@AWSTutorialsOnline Failed to authenticate user due to missing information in request.
@@STLEON Can you please share the exact error text?
@@AWSTutorialsOnline - My IAM role has both of these policies, but I get the following error when trying to run the second block in my notebook:
An error occurred (AccessDeniedException) when calling the CreateSession operation: User: assumed-role/AWSGlueServiceRoleDefault/GlueJobRunnerSession is not authorized to perform: iam:PassRole on resource: AWSGlueServiceRoleDefault because no identity-based policy allows the iam:PassRole action. What did I do wrong?
@@dwaram1 Not sure why I am not facing this error. Use a custom IAM role with the following permissions: AWSGlueServiceRole + AmazonS3FullAccess + a custom policy (with the iam:PassRole permission). Use this role. It is documented here - docs.aws.amazon.com/glue/latest/dg/attach-policy-iam-user.html
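In case it helps others, a sketch of the custom iam:PassRole statement (the account ID and role name are placeholders; scope the Resource to your own role):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault",
            "Condition": {
                "StringLike": { "iam:PassedToService": "glue.amazonaws.com" }
            }
        }
    ]
}
```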