Dynamic ETL from AWS S3 to Amazon Redshift with AWS Lambda
- Published 22 Jan 2022
This video shows a complete end-to-end demo of dynamic ETL using the AWS Lambda (Python) service, where data is loaded from an AWS S3 bucket into Amazon Redshift. It includes a clean architectural explanation, a part-by-part walkthrough, and a demo of configuring the complete automation setup.
Please find the relevant files in the repo:
github.com/RekhuGopal/PythonH...
#etl #amazonredshift #s3 #lambda - Science & Technology
Thanks for sharing knowledge. Looking forward to seeing more videos from you. 🙏🙏🙏
Thank you for watching my videos.
Thank you so much for sharing this video
Thank you for watching my videos.
Glad that it helped you.
Thank you for the video. Can you please let me know how we can change the data type? My table has other data types, whereas my CSV file gives only VARCHAR. It would be helpful if you could show a demo on a table that has data types other than VARCHAR.
Thank you for watching my videos.
Sure, I shall create a new video on it ASAP. Stay tuned.
Thank you for the video.
I am facing one issue: I am unable to see host, db-name, pwd, or table-name in the Lambda environment variables (28:44).
Did I miss anything? Please help.
Thank you for watching my videos.
Did you check the code file that I shared in the description? It has logic to read the environment variables, and you need to set those variables on the Lambda as well.
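A minimal sketch of that environment-variable logic. The variable names here (host, db_name, user, pwd, table_name) are assumptions based on the comment above; use whatever names the shared code actually expects.

```python
import os

def get_redshift_config():
    """Collect Redshift connection settings from Lambda environment variables.

    Fails fast with a clear message when a variable has not been set
    in the function's configuration, instead of a bare KeyError later.
    """
    required = ["host", "db_name", "user", "pwd", "table_name"]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise KeyError(f"Missing Lambda environment variables: {missing}")
    return {name: os.environ[name] for name in required}
```

If any of these names are not visible in the Lambda console, they simply have not been created yet: they are set manually under Configuration → Environment variables, not by the code.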
Excellent tutorial.
Thank you for watching my videos.
Happy learning.
The Lambda is in a VPC and S3 is outside the VPC, so my code is giving a timeout error. What should I do?
Thank you for watching my videos.
Maybe you need to check whether a VPC endpoint for S3 is enabled here, and also make sure that the AWS Lambda function is integrated with the VPC.
Hello, how do I get that S3 key? Please clarify; I am a little bit confused.
Thank you for watching my videos.
Those are the IAM user access key and secret; please check the description for the shared files.
How did you generate the AWS secret and access key, and how do you use them in the Lambda function with the role?
Thank you for watching my videos.
What are the acess_secrete and acess_key? Are these IAM role creds (if yes, then an IAM role for what)?
Thank you for watching my videos.
Those are the IAM user's acess_secrete and acess_key; the required access policy is attached to that user.
Sorry to sound like a broken record. I am still very confused.
Basically,
If I am reading it correctly, we need to create an IAM user with Redshift access and pass the creds for that IAM user to the query string.
Do we need to attach some special role to the Lambda itself?
Does that IAM role need to have some permission over Lambda?
When uploading a new file, the table is not being updated with the values from the new file. Can you please help?
Thank you for watching my videos.
Can you please check whether the Lambda is getting triggered when you upload the file to the S3 bucket?
There you could figure out the reason the data is not being loaded into the Redshift cluster. The error log might help me to help you.
How did you create the role, access key, secret key, and so on?
Thank you for watching my videos.
Is there a Redshift Serverless version of this tutorial?
Thank you for watching my videos.
Please check my latest video on Redshift; I have used Serverless there.
I have tried to implement the same but am getting an error in Lambda. I think my Redshift is not able to connect to S3. How can I figure this out?
Thank you for watching my videos.
May I know what error you are seeing here?
@@cloudquicklabs Thanks for the reply. Pasting the error right here; it appears after the query:
[ERROR] InternalError_: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid S5RM15AXTZNQ248Z,ExtRid 7K6TdTJUNhNS+fKGY2HgUYjq40HjYsI6Hdes69txVhXESjZkqZM8myV5tQcagVwJh/uvB4564IM=,CanRetry 1
DETAIL:
-----------------------------------------------
error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid S5RM15AXTZNQ248Z,ExtRid 7K6TdTJUNhNS+fKGY2HgUYjq40HjYsI6Hdes69txVhXESjZkqZM8myV5tQcagVwJh/uvB4564IM=,CanRetry 1
code: 8001
Please upload a video on your Lambda and Redshift setup.
Thank you for watching my videos. I shall create a video on this.
Thanks!
Thank you for watching my videos.
I followed the same steps, but it seems that when I upload a CSV to S3, the trigger does not invoke the Lambda function. What could be the reason?
Thank you for watching my videos.
The following could be the reasons:
1. Check whether a trigger for S3 bucket object creation has been added.
2. Check whether the Lambda role has a trust relationship with the AWS Lambda service.
@@cloudquicklabs I will double-check, thanks much. I also got an error saying psycopg2 does not have the attribute 'connect' while testing the Lambda. I am using the Python 3.7 runtime; I am not sure what caused this issue.
You need to create a Lambda layer. I have attached the zip file of the psycopg2 module; please create a Lambda layer from it.
@@cloudquicklabs I did create the Lambda layer using the Python 3.7 runtime.
@@cloudquicklabs Finally, I successfully loaded the data from S3 into the Redshift table. However, when printing print(curs.fetchmany(3)), it gives [ERROR] ProgrammingError: no results to fetch, even though the data is available in the Redshift table. What could be the reason?
Thank you for sharing the video; this works. Please let me know how I can ignore the header while using this code, and also how I can load a TSV (tab-separated) file with it. That would be helpful.
Thank you for watching my videos.
As we are loading the data with Lambda, you can skip the first row in code.
To load TSV files, first convert them to CSV files, which is easy, and then follow the same steps mentioned in this video.
www.google.com/amp/s/www.geeksforgeeks.org/python-convert-tsv-to-csv-file/amp/
I got confused between the IAM role and the IAM user. Can you please help?
Thank you for watching my videos.
Here the IAM role is the AWS Lambda service role, used when the function is invoked by a file upload to the S3 bucket. The IAM user credentials are for the SQL 'COPY' command, which copies the data from the S3 bucket into the Amazon Redshift cluster table. Watch again; I hope you get it.
Happy learning.
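To make that split concrete: the Lambda's execution role lets S3 invoke the function, while the IAM user's keys travel inside the COPY statement itself. A hedged sketch of building that statement (table, bucket, and key names are placeholders; the CREDENTIALS clause is standard Redshift COPY syntax):

```python
def build_copy_command(table, bucket, key, access_key_id, secret_access_key):
    """Build the Redshift COPY statement that pulls a CSV object from S3.

    The IAM *user* credentials are embedded in the CREDENTIALS clause;
    the Lambda's own IAM *role* plays no part in this SQL.
    """
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"CREDENTIALS 'aws_access_key_id={access_key_id};"
        f"aws_secret_access_key={secret_access_key}' "
        "CSV;"
    )
```

The Lambda code would then execute this string over a psycopg2 connection to the cluster.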
I don't see Python 3.7; I selected Python 3.8 and get this error: [ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'psycopg2._psycopg' Traceback (most recent call last):
Thank you for watching my videos.
I am creating a new video on this topic in which I will use the latest Python version; it will be released next weekend.
Hi bro, I am doing a similar thing for a course project, but multiple CSV files are uploaded each time, and there are three different tables that need separate queries. However, it takes 40 seconds to insert into each table, which is too long. Would a connection pool make it quicker? What is your opinion on solving this? Thank you.
Thank you for watching my videos.
There are multiple options here:
1. Even though multiple files are uploaded, you will still know the file names, so you could iterate over them in a loop and load each into its table (identifying which table is something you could do with code logic and naming conventions; write three COPY commands as shown in my video and execute them accordingly).
2. I believe time is less of an issue here, as Lambda can run for up to 15 minutes per invocation.
3. If it still exceeds 15 minutes, you could do it with something like what is shown in ua-cam.com/video/uf8heaG0IhU/v-deo.html
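A sketch of option 1, assuming the target table can be inferred from a key prefix (the prefix-to-table mapping here is invented for illustration):

```python
def route_records(event, prefix_to_table):
    """Map each uploaded object in an S3 event to its target table.

    Returns (table, bucket, key) triples; keys with unknown
    prefixes are skipped rather than raising.
    """
    jobs = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        for prefix, table in prefix_to_table.items():
            if key.startswith(prefix):
                jobs.append((table, bucket, key))
                break
    return jobs
```

Each resulting triple would drive one COPY command, so a single invocation handles all the files delivered in the event.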
I am getting an error message in the CloudWatch log events after uploading my CSV file to S3: "[ERROR] InternalError_: The specified S3 prefix 'Weather+readings+Analysis.csv' does not exist".
Please enable a VPC endpoint for the S3 service, of type Gateway. Then it should work.
Because the VPC here is a private network and S3 is public, they are not able to communicate; I believe adding the endpoint will solve the issue.
Do let me know if you are still facing issues after that.
@@cloudquicklabs Still getting the same error even after adding the endpoints.
Okay, it looks like a file-name issue. Is it possible to change the file name to 'WeatherReadingAnalysis.csv'? It looks like the spaces in the name within the S3 URL are the issue.
@@cloudquicklabs Thank you, sir; the error is resolved.
Wow, happy to know that your issue is resolved.
Thank you so much for watching my videos.
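For reference, S3 event notifications URL-encode object keys, which is why 'Weather readings Analysis.csv' arrived as 'Weather+readings+Analysis.csv'. Rather than renaming files, the Lambda can decode the key before using it; a small sketch:

```python
from urllib.parse import unquote_plus

def decode_s3_key(raw_key):
    """Decode the URL-encoded object key delivered in an S3 event,
    turning '+' back into spaces and %XX escapes into characters."""
    return unquote_plus(raw_key)
```

Decoding the key this way lets files with spaces in their names load without renaming, as long as the decoded key is what goes into the COPY statement's S3 path.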
I am getting an error while testing the code in the Lambda function:
{
"errorMessage": "'Records'",
"errorType": "KeyError",
"stackTrace": [
" File \"/var/task/lambda_function\", line 7, in lambda_handler
for record in event['Records']:
"
]
}
Thank you for watching my video.
It is a dictionary key mismatch issue. First print the event, check the dictionary's contents, and adjust the key access accordingly.
@@cloudquicklabs I printed the event:
import os
def lambda_handler(event, context):
print(event)
When I print the event with the above code, I get this:
Test Event Name
test
Response
null
Function Logs
START RequestId: 753eb791-2312-447a-bf06-3becf8f90926 Version: $LATEST
{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}
END RequestId: 753eb791-2312-447a-bf06-3becf8f90926
REPORT RequestId: 753eb791-2312-447a-bf06-3becf8f90926 Duration: 1.69 ms Billed Duration: 2 ms Memory Size: 128 MB Max Memory Used: 36 MB Init Duration: 109.30 ms
Request ID
753eb791-2312-447a-bf06-3becf8f90926
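The log above shows the cause: the console's default test event is {'key1': 'value1', ...}, which has no 'Records' key, so event['Records'] raises a KeyError. A minimal hand-written test event shaped like a real S3 put notification (bucket and key values are placeholders) avoids this:

```python
# A minimal S3 "ObjectCreated:Put" test event; paste it as JSON in the
# Lambda console's test-event editor, or pass it to the handler directly.
s3_put_test_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-demo-bucket"},
                "object": {"key": "sample.csv"},
            },
        }
    ]
}

# The same access pattern the Lambda uses at line 7 of lambda_function.py.
for record in s3_put_test_event["Records"]:
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
```

Real events carry many more fields, but these are the ones the handler in this demo reads.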
Is this an incremental data load into the table?
Thank you for watching my videos.
This uploads the data of the new files generated on the S3 side each time, but it is not an incremental upload.
This works with an S3 bucket, but what about an S3 folder? I mean a folder inside the bucket.
Thank you for watching my videos.
As it is event-driven, it should work in that case as well. But if there are any errors, they can be handled in code. Let me know if you face any errors in these cases.
@@cloudquicklabs Yes, it does work. Cool ETL. Do you have any videos about ETL jobs that are hard to build, like real-life jobs?
Could you please provide the sample data CSV file in the GitHub repo?
Thank you for watching my videos.
Yes, I am sharing it soon.
Please stay tuned.
Thanks. Could you make a video with AWS Redshift Serverless?
Thank you for watching my videos.
Sure, I shall create a video on it.
@@cloudquicklabs thanks in advance waiting ✋️ 😊
Hi sir,
First of all, I would like to say thank you for sharing such a wonderful tutorial video with the code; it is a great video lecture. Looking forward to seeing some more videos on AWS Glue, EMR, Athena, and EC2 as well.
I am unable to download the zip file from the Git repo; is there anything you can suggest?
Thank you for watching my videos.
Indeed, I shall make more videos on the mentioned services.
You could clone the repo to download the .zip file.
@@cloudquicklabs thank you sir
Unable to import module 'lambda_function': No module named 'psycopg2._psycopg'
What should we do about it?
Thank you for watching my videos.
Please check the video description, where I have given the link to the 'psycopg2' package for creating the required Lambda layer; that layer makes the module importable. I encourage you to go through the video again.
Is the CSV file not uploaded to the GitHub repo?
Thank you for watching my videos.
In this video, the CSV files are uploaded directly to the S3 bucket.
For anyone watching the video, you just have to upload any csv file with 3 columns to the s3 bucket. No special format needed. You're welcome :)
How do I provide the access_secrets and access_key?
Thank you for watching my videos.
Here I have used acess_secrete and acess_key as Lambda environment variables.
You can try following docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
Kindly add a Glue job so that the schema can also be figured out automatically.
Thank you for watching my videos.
I shall create a new video on it soon.
Thanks for the video. I have the following problem when executing the Lambda function; I get this error: "Unable to import module 'LambdaTest::LambdaTest': No module named 'LambdaTest::LambdaTest'"
Thank you for watching my videos.
I believe you have created the Lambda as suggested in the video. Please try to recreate the Lambda with the Python 3.7 runtime, create the Lambda layer using the same python.zip folder that I shared, and attach that layer to the Lambda you created.
@@cloudquicklabs I have done everything as in the video, but it keeps throwing the same error that the LambdaTest module is missing.
Try creating the Lambda first with the 'Author from scratch' option in the AWS console, then copy and paste the code I have given in the repo link. It should work this way.
@@cloudquicklabs I found the problem: in the runtime configuration, the handler was wrong. I configured it as in the video and it worked perfectly. I only have the VPC part left. Thank you very much.
Happy that it worked for you now.
Happy learning ahead.
YouTube should allow us to give a DOUBLE LIKE, because that's what you deserve.
Thank you for watching my videos.
And thank you very much for the encouraging note. Keep watching; happy learning.
How do I load a JSON file into Redshift using Lambda? Can anyone please tell me?
Thank you for watching my videos.
In the JSON file case, the Lambda code may have to be changed to download, read, and upload to Redshift.
May I know what your JSON file size would be?
@@cloudquicklabs Less than 100 MB; I am working on live-stream data.
If you have code, please share it.
Thank you for coming back on this.
The code should be very simple in this case; I shall create a new video on this. Please stay tuned.
@@cloudquicklabs Please share the link when it's completed.
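For context, Redshift's COPY can ingest JSON directly with FORMAT AS JSON 'auto', so for files in this size range the Lambda may only need a different COPY statement rather than new download-and-parse logic. A hypothetical builder (table, bucket, and key names are placeholders):

```python
def build_json_copy(table, bucket, key, creds):
    """Build a COPY statement that loads a JSON object from S3,
    letting Redshift match top-level JSON fields to columns
    automatically via the 'auto' option."""
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"CREDENTIALS '{creds}' "
        "FORMAT AS JSON 'auto';"
    )
```

For JSON layouts that do not map one-to-one onto columns, COPY also accepts a jsonpaths file in place of 'auto'.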
Nice effort, but a very unorganized flow. It would have been at best a 20-minute video if you had spoken concisely and in the correct order.
Thank you for watching my videos.
I shall take your valuable feedback and work on it.
@@cloudquicklabs Please don't mind, brother. I know it takes a lot of time to prepare, record, and edit, and what you are explaining is very resourceful.
But I guess that instead of explaining things on the fly, it is better to script, so that you can minimize the words and avoid jumping between topics. That was my experience:
I wanted to watch and learn but could not maintain that attention span.
Thank you very much for this valuable input. I shall apply it in my next videos.
I CANNOT UNDERSTAND ANYTHING YOU SAY!
Thank you for watching my videos.
Apologies that you are not able to follow the video. Could you please watch it again, perhaps pausing and resuming as you go?
@@cloudquicklabs Eh?
While I am testing the code in Lambda, I am getting an error like:
Test Event Name
(unsaved) test event
Response
{
"errorMessage": "could not connect to server: No such file or directory
\tIs the server running locally and accepting
\tconnections on Unix domain socket \"/tmp/.s.PGSQL.5439\"?
",
"errorType": "OperationalError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 25, in lambda_handler
password = password)
",
" File \"/opt/python/psycopg2/__init__.py\", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
"
]
}
What is the problem?
Thank you for watching my videos.
Could you please check whether you have enabled a VPC endpoint for the S3 service on the VPC where the Redshift cluster is hosted? And please make sure that you have integrated your Lambda with the same VPC as well.
@@cloudquicklabs May I know how to check and integrate it?
I am getting the below error:
{
unable to import module 'lambda function'
}
START RequestId: 40274256-8381-4eaf-84b9-51d8c833df43 Version: $LATEST
Unable to import module 'lambda_function': No module named 'psycopg2._psycopg'
I created the layer per the instructions, uploaded the python.zip, and selected runtime version Python 3.6.
I am using this for PostgreSQL.
Thank you for watching my videos.
Please try to use the 3.7 or greater runtime versions, as 3.6 has already been decommissioned by AWS.
Do follow all the steps as shown in the video.