Add RDS Data Source In AWS Glue

  • Published 15 Oct 2024

COMMENTS • 88

  • @BeABetterDev 3 years ago +2

    Glue has so much depth to it. Great video!

    • @DataEngUncomplicated 3 years ago +1

      Thank you! There are many components to AWS Glue. I will be making more videos and tutorials about Glue soon!

    • @fabian-manzano 3 years ago

      @DataEngUncomplicated I was also wondering: would the next steps be to add a transform node and a Data Catalog output node? I did this, but I am getting an error: An error occurred while calling o106.pyWriteDynamicFrame. ERROR: duplicate key value violates unique constraint

    • @DataEngUncomplicated 3 years ago

      @fabian-manzano Yes, the next steps would be to add a transform node and an output node, depending on what you are trying to do in your workflow. The error on write suggests that, if you are writing to a database, you attempted to write a record that violates a unique constraint. Perhaps you have a duplicate record in your dataset.
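
A minimal sketch of the deduplication idea in the reply above, written in plain Python so it runs anywhere. The function name `drop_duplicate_keys` and the sample rows are hypothetical. In a Glue job, the rough equivalent would be converting the DynamicFrame with `toDF()` and calling Spark's `dropDuplicates(["id"])`. Note that this only removes duplicates inside the batch; it would not help if the key already exists in the target table.

```python
def drop_duplicate_keys(rows, key):
    """Keep only the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # a later row with the same key would violate the constraint
        seen.add(value)
        unique.append(row)
    return unique


# Hypothetical sample data: two rows share id 1.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
]
deduplicated = drop_duplicate_keys(rows, "id")
print(deduplicated)
```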

  • @shovan3112 1 year ago +1

    You are doing an amazing job simplifying things for people who don't have an AWS background. Please keep it up. Your channel will get millions of subscribers over time for sure. Good luck, brother.

  • @imransadiq5851 1 year ago

    Thank you for the superb video. How do I create a connection if my RDS SQL Server DB instance is in a different AWS account from the one where I am creating the connection?

  • @jovidog9573 5 months ago

    Hello. I made a Glue Job that performs ETL changes to data in an S3 Bucket and exports the changed data to a Redshift database, but now I'm thinking of changing from Redshift to PostgreSQL. I know this video is for importing RDS data into Glue, but if I follow the video's instructions, would I also be able to export it back into RDS?

    • @DataEngUncomplicated 5 months ago

      Hi, this video is only about how to add an RDS data source like Postgres to the AWS Glue Catalog. Once you establish your Postgres database connection, you should be able to both read from and write to it.

  • @maheshmushyam8153 10 months ago

    Can you make a video on adding the endpoint to connect a publicly accessible RDS instance with Glue?

  • @code1530 1 year ago

    Awesome! Easy-to-follow instructions. One question: is it possible to crawl data from RDS with the table classification set to csv? It outputs postgresql by default.

    • @DataEngUncomplicated 1 year ago +1

      Thanks, I'm not sure actually.

    • @code1530 1 year ago

      Got it, bro. I want to query the crawl results in Athena without a Glue job.

    • @DataEngUncomplicated 1 year ago

      Yeah, you don't need a Glue job just to crawl. You can use the Glue crawler to crawl Postgres.
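
A hedged sketch of what a crawler definition for a JDBC source might look like when created through boto3 instead of the console. The helper only builds the request dictionary for `create_crawler`; every name, ARN, and path below is a placeholder, and the include path assumes the usual `database/schema/%` pattern for Postgres.

```python
def jdbc_crawler_request(name, role_arn, catalog_database, connection_name, include_path):
    """Build keyword arguments for the Glue create_crawler call (JDBC target)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": catalog_database,  # Glue Data Catalog database that receives the tables
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": include_path}
            ]
        },
    }


# All values are hypothetical placeholders.
request = jdbc_crawler_request(
    name="postgres-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    catalog_database="example_catalog_db",
    connection_name="example-postgres-connection",
    include_path="etldemo/public/%",
)
print(request["Targets"])

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("glue").create_crawler(**request)
```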

  • @vierminus 2 years ago

    Intro so on point, very nice 😅

  • @SafaaSelim 3 years ago

    Nice video. One question: what if the RDS instance is in a different account?

    • @DataEngUncomplicated 3 years ago

      Good question! There are two AWS Glue methods for granting cross-account access to a resource:
      use a Data Catalog resource policy, or
      use an IAM role.

  • @gus882008 3 years ago

    Hi there! Great video, but I have a question. We have a role for connecting to an S3 bucket. Does this role also need permissions in Redshift, or is that not necessary? Thanks.

    • @DataEngUncomplicated 3 years ago

      Hi Gustavo, in AWS Glue you need to set up a database connection so you can read data from Redshift. This video was specifically about RDS databases, which do not include Redshift. You will need to pass the username and password of the Redshift user to the database connection so you can read data from this database.

  • @javiermadriz7834 3 months ago

    My database is in the default VPC, but an error occurred that mentioned an S3 endpoint. Why do I need an S3 endpoint if my database is in the same VPC?

  • @iamdare 2 years ago

    Hi! Great job. When setting up "access to your data store", how did you create that "ETLDEMO" instance?

    • @DataEngUncomplicated 2 years ago

      Hi Dare, for this example I just manually created my Postgres instance in the RDS console.

  • @mangeshxjoshi 3 years ago

    Good explanation. Does the AWS Glue ETL tool support a change data capture (CDC) transformation into an RDS database? Assume S3 files are loaded initially into a PostgreSQL DB and later incremental S3 files (delta files) are applied as updates to PostgreSQL.
    Or does custom code need to be written to handle the deltas? I did not see any transformation in AWS Glue that handles CDC data.

    • @DataEngUncomplicated 3 years ago

      Thanks Mangesh!
      Glue has a bookmarking feature which keeps track of what records have been processed previously. I would look into this to see if it meets your use case. If you have bookmarking enabled, you don't need to write custom code because it will keep track of what records you have processed previously and won't process these records again.
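
A rough illustration of the bookmarking idea in the reply above. The first helper returns the job argument that switches bookmarks on (`--job-bookmark-option` with the value `job-bookmark-enable`); in a real script the sources also need a `transformation_ctx` and the job must call `job.commit()`. The second helper, `rows_after_bookmark`, is a hypothetical plain-Python stand-in that mimics the behavior with a high-water mark on a key column. As I understand it, bookmarks on JDBC sources track new keys rather than updates to existing rows, so they may not cover every CDC case.

```python
def bookmark_job_arguments(enabled=True):
    """Default arguments that enable or disable Glue job bookmarks."""
    option = "job-bookmark-enable" if enabled else "job-bookmark-disable"
    return {"--job-bookmark-option": option}


def rows_after_bookmark(rows, key, bookmark):
    """Return rows whose key is greater than the bookmark, plus the new bookmark."""
    fresh = sorted(
        (row for row in rows if bookmark is None or row[key] > bookmark),
        key=lambda row: row[key],
    )
    new_bookmark = fresh[-1][key] if fresh else bookmark
    return fresh, new_bookmark


# First run sees everything; the second run only sees the row added since.
first_run, mark = rows_after_bookmark([{"id": 1}, {"id": 2}], "id", None)
second_run, mark = rows_after_bookmark([{"id": 1}, {"id": 2}, {"id": 3}], "id", mark)
print(bookmark_job_arguments())
print(second_run)
```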

  • @DanielWeikert 2 years ago

    This does not work for me due to VPC routing and NAT gateway issues. Do you have a video on how to configure this?

  • @melanijagerasimovska6152 2 years ago

    Great video :) I have one question: why do we add an endpoint for the S3 service and not for the RDS service (RDS is our source)?

    • @DataEngUncomplicated 2 years ago +1

      Thanks Melanija, I could have done a better job explaining the reason why in my video and it's been so long that I forgot the reason. I'm going to re-create the connection to see why and get back to you.

    • @quinnmichael2657 2 years ago

      @DataEngUncomplicated Hey, checking in on this. We're seeing "data previews" for the source and the transformation steps, but the target (S3) is blank. Thanks in advance!

  • @ajinkyarajane917 2 years ago

    Hi @DataEngUncomplicated, I have a question: why did we use JDBC as the node type (data source)? Can't we directly select RDS as the node type or data source?

    • @DataEngUncomplicated 2 years ago

      Hi Ajinkya, I believe this wasn't an option when I made this video, but it appears it is now, so go ahead and use it.

  • @sriramkrishnaswamy5595 1 year ago

    Why do you need an AWS VPC gateway endpoint? It's a bit confusing. We are trying to connect RDS to Glue. Shouldn't the endpoint be for Glue and not S3?

    • @DataEngUncomplicated 10 months ago

      Sorry for the delay in response.
      We need an S3 VPC endpoint, rather than a Glue VPC endpoint, when connecting Glue to an RDS database because Glue stores its scripts and temporary files in an S3 bucket. Even though the Glue job connects to RDS, it still needs to access S3 for these files.
      Setting up an S3 VPC endpoint provides private connectivity between the VPC and S3, without exposing the connection to the public internet. This allows Glue to securely access the S3 bucket.
      Some key points:
      Glue stores scripts and temp files in S3, so it needs access to S3 even if the job connects to RDS.
      A VPC endpoint for S3 enables private connectivity from the VPC to S3 over the AWS network, without a public IP address.
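
The S3 gateway endpoint described above can also be created programmatically. The sketch below only builds the arguments for the EC2 `create_vpc_endpoint` call; the VPC ID, region, and route table ID are placeholders. A gateway endpoint is attached to route tables, so the route table used by the database's subnet is the one to pass in.

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (S3 gateway endpoint)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


# Placeholder identifiers; substitute the VPC and route table of the RDS subnet.
request = s3_gateway_endpoint_request(
    vpc_id="vpc-0123456789abcdef0",
    region="us-east-1",
    route_table_ids=["rtb-0123456789abcdef0"],
)
print(request)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("ec2", region_name="us-east-1").create_vpc_endpoint(**request)
```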

  • @sumanbhattacharjee8839 2 years ago

    Do you create the VPC endpoint for the S3 service or for Glue? Glue is what connects to the database, right? Do you have a video on how to create the endpoint?

    • @DataEngUncomplicated 2 years ago +1

      Hi Suman, I created a VPC endpoint for the S3 service, not the Glue service. Sorry, I don't have a video on adding a VPC endpoint, but I will add it to my list of future videos if you think it would be helpful for others. Let me know what you think.

    • @sumanbhattacharjee8839 2 years ago +1

      @DataEngUncomplicated Hi, thank you for responding. I was facing an issue with this. I have an RDS MySQL database and I'm not able to connect to it from Glue. The DB is accessible from local tools like DBeaver. Even Lambda can connect to the same database. All the services are in the same VPC, region, and security group. So a video on creating this endpoint could help me solve the issue...

    • @giancarlopoemape5041 1 year ago

      @sumanbhattacharjee8839 Hi, I'm facing the same issue. Have you resolved it?

    • @AronBergara 1 year ago

      @giancarlopoemape5041 On the VPC page, look for the Endpoints menu option, then create a new endpoint for S3 of type "Gateway" in the same VPC as the DB instance, associated with the route table of the DB instance's subnet.

  • @redolfmahlaule9893 3 years ago +1

    Hi sir, after reading data from my PostgreSQL database using AWS Glue, how do I write it to S3? I would appreciate your reply.

    • @DataEngUncomplicated 3 years ago

      You have many AWS service options to achieve this depending on your data size and type of data you are working with. A popular method of building data pipelines is using AWS glue. If you want a no code option to develop a glue job, check out my glue studio overview video to learn more: ua-cam.com/video/NuGqN3Aj07M/v-deo.html
      If you code in python and are a fan of working with pandas, another option could be leveraging the python library aws data wrangler to do this: ua-cam.com/video/5pVpFnvRDW4/v-deo.html

    • @redolfmahlaule9893 3 years ago

      Hi sir, can you help me with how to trace a Glue job in X-Ray using the X-Ray daemon and SDK?

  • @AJEETKUMAR-yj8tv 1 year ago

    Hi sir,
    I have created a MySQL instance and a few tables with sample data, and then created a database in the Data Catalog. Now, when I try to create a connection to the database in AWS Glue, it throws an "invalid parameter" error. I am unable to fix this error; please help me.

    • @DataEngUncomplicated 1 year ago

      Hi there, try posting on AWS re:Post or contacting AWS Support with more information about your issue to see if someone can help you out!

  • @aneeshmarathe7269 3 years ago

    Amazing explanation, thank you for this. I followed the same method as yours and I am able to get the tables attached to the database using crawlers. I also tried building a Spark script using Glue Studio. However, I am still not able to connect to RDS from Glue. I have tried every way I can think of to debug this. Please help me out here.

    • @DataEngUncomplicated 3 years ago

      Hi Aneesh, thanks for the comment! Was your crawler able to successfully crawl the database and find the tables? One common issue I see is that the RDS instance is usually in a VPC, so you will need to add a VPC endpoint so your database can communicate with the AWS Glue service. If this is not the issue, are there any error messages that come up?

    • @aneeshmarathe7269 3 years ago

      @DataEngUncomplicated Thanks for your reply. Yes, my crawler was easily able to find the tables associated with the database. The issue I am facing occurs when connecting to or migrating data from AWS Glue via Python/PySpark scripts. Below is the error I am getting:
      ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): InvocationTargetException java.lang.reflect.InvocationTargetException
      Exception in User Class java.lang.reflect.UndeclaredThrowableException
      Caused by: java.net.ConnectException: Connection refused

    • @DataEngUncomplicated 3 years ago

      One suggestion I have is to create a very simple Glue Studio job that reads from this database. If you can read successfully, you can rule out an issue with the VPC. If you still have an issue, then you might have a problem in your PySpark code.

    • @aneeshmarathe7269 3 years ago

      @DataEngUncomplicated Thank you. As per your suggestion, I tried just reading the table data from Glue Studio but ended up with the error below:
      Py4JJavaError: An error occurred while calling o64.getDynamicFrame. : com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host xxxxx, port 1433 has failed. Error: "Connection timed out: no further information
      However, I am not facing any issues when connecting to RDS using a SQL Server tool, Python, or AWS crawlers.
      I don't understand what I am missing here.

    • @aneeshmarathe7269 3 years ago +1

      @DataEngUncomplicated Thank you for your suggestion. It was a VPC issue. It took some time to figure out, but I was able to transfer data from S3 to RDS. Thanks for all your help :)
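
The "Connection refused" and "Connection timed out" errors in this thread point to different network situations, and a small TCP probe can help tell them apart. The sketch below uses only the standard library; `probe_tcp` is a hypothetical helper name. It reports the path from wherever it runs, so a success from a laptop does not prove that Glue's subnet and security group can reach the database.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout, or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all: often a security group, route, or NACL dropping packets.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call with a placeholder endpoint (SQL Server listens on 1433 by default):
# print(probe_tcp("example-db.xxxxx.rds.amazonaws.com", 1433))
```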

  • @helovesdata8483 1 year ago

    I keep getting "test connection failed" with no additional information. I created the VPC endpoint to S3 with route tables, and my VPC is publicly accessible 🤯 It's not creating the tables.

    • @DataEngUncomplicated 1 year ago

      Hey, does your IAM role have sufficient permissions?

    • @helovesdata8483 1 year ago

      @DataEngUncomplicated Yes, I created a role to give Glue access to S3. I was thinking that if permissions were the issue, it would give me an error about access. It only says "test connection failed".

  • @ManojKumar-vp1zj 1 year ago

    Hi, my instance is not showing up in the Instance section. Can you please guide me on how to fix this? Thanks in advance.

    • @DataEngUncomplicated 1 year ago

      Hi, are you in the same region as your instance?

    • @ManojKumar-vp1zj 1 year ago

      @DataEngUncomplicated Sorted, bro... You are doing amazing work. Please create more video tutorials. I have watched all your videos many times in the last 2 days.

    • @DataEngUncomplicated 1 year ago

      @ManojKumar-vp1zj Thanks for the kind words! I'm working on it!

  • @sebasfavaron 3 years ago

    I get an error code 30 when testing the connection. Could I be testing with wrong credentials? I've run out of ideas to debug it

  • @chitraalavanthar3729 3 years ago +1

    How would you load a partitioned table into a data lake?

    • @DataEngUncomplicated 3 years ago

      When selecting the "data target" node to write your data, make sure to add your partition into the "Partition keys" parameter.
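
To make the "Partition keys" setting above more concrete: partitioned writes lay the data out in Hive-style `key=value/` folders under the target path. The sketch below is plain Python that computes that prefix for one record; `partition_prefix`, the bucket name, and the column names are all hypothetical. In a Glue script the same keys would typically be passed as `partitionKeys` in the S3 sink's `connection_options`.

```python
def partition_prefix(record, partition_keys):
    """Return the Hive-style folder prefix a record would be written under."""
    return "/".join(f"{key}={record[key]}" for key in partition_keys) + "/"


# Shape of the options a Glue S3 sink would take (placeholder bucket and columns).
connection_options = {
    "path": "s3://example-bucket/sales/",
    "partitionKeys": ["year", "month"],
}

record = {"year": 2024, "month": 10, "amount": 5}
prefix = partition_prefix(record, connection_options["partitionKeys"])
print(connection_options["path"] + prefix)
```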

  • @codingbreak8032 1 year ago

    What if I want to connect it to EC2? Is that possible?

    • @DataEngUncomplicated 1 year ago

      Hi, do you mean to a database on an EC2 machine?

    • @codingbreak8032 1 year ago

      @DataEngUncomplicated Yes, can AWS Glue connect to a Postgres database hosted on an EC2 instance? To be specific, the Postgres is a legacy version, 9.3.

    • @DataEngUncomplicated 1 year ago +1

      I did a quick check for you, and yes, it's possible! You need to add a new database connection and, instead of choosing "RDS", select "JDBC" as the "connection type", and it should work!

    • @codingbreak8032 1 year ago

      @DataEngUncomplicated Thank you! I will try this after my vacation. Keep it up, bro!

    • @codingbreak8032 1 year ago

      @DataEngUncomplicated Hi, what should I enter in the JDBC URL? Just the IP address of the host?
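
On the JDBC URL question above: a PostgreSQL JDBC URL generally needs more than the IP address, following the pattern `jdbc:postgresql://host:port/database`. The sketch below builds such a URL and the `ConnectionInput` dictionary that Glue's `create_connection` call accepts. All hosts, IDs, and credentials are placeholders, and whether Glue's bundled driver works with Postgres 9.3 is not something this example verifies.

```python
def postgres_jdbc_url(host, database, port=5432):
    """Build a PostgreSQL JDBC URL from host, port, and database name."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def jdbc_connection_input(name, url, username, password,
                          subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput structure for Glue create_connection (JDBC type)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Network placement so Glue can reach a database inside a VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# Placeholder values for a Postgres database on an EC2 host with a private IP.
url = postgres_jdbc_url("10.0.1.25", "exampledb")
connection_input = jdbc_connection_input(
    "example-ec2-postgres", url, "example_user", "example_password",
    "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"], "us-east-1a",
)
print(url)

# With AWS credentials configured, the connection could be created like this:
# import boto3
# boto3.client("glue").create_connection(ConnectionInput=connection_input)
```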

  • @kowshicnatarajan 2 years ago

    Hi,
    When dealing with a PostgreSQL table whose primary key is a column named "Id", it's impossible for any Glue job to reference it.
    If we dig into the error log, here is the following exact error:
    ERROR: column "Id" does not exist

    • @DataEngUncomplicated 2 years ago +1

      Strange, can you see the id column in the AWS glue catalog table?

    • @kowshicnatarajan 2 years ago

      @DataEngUncomplicated Yes.

    • @DataEngUncomplicated 2 years ago

      That's strange, I haven't seen this before. I wonder if the issue is that it is being looked up in lower case rather than mixed case.

    • @kowshicnatarajan 2 years ago

      @DataEngUncomplicated All my other column names are mixed case and I have no problem referencing them. This only occurs when the column is called "Id" and is the primary key.
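
One thing worth ruling out for the case-sensitivity theory above: PostgreSQL folds unquoted identifiers to lowercase, so a column created as "Id" only matches when it is written in double quotes. The sketch below shows a hypothetical quoting helper for hand-written queries. Whether this is the actual cause in this thread is an assumption, since the commenter reports that other mixed-case columns work.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def select_columns(table, columns):
    """Build a SELECT statement with every identifier quoted to preserve case."""
    column_list = ", ".join(quote_ident(column) for column in columns)
    return f"SELECT {column_list} FROM {quote_ident(table)}"


# Hypothetical table and columns for illustration.
query = select_columns("users", ["Id", "Name"])
print(query)
```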

  • @aaddiis45021 3 years ago

    I am not getting any instance options in the instance selection.
    Edit: I used the JDBC option and was able to get it working.

    • @ViniciusCassalesDev 3 years ago

      I have the same problem. I haven't been able to solve it yet using the RDS connection type.

    • @aaddiis45021 3 years ago

      @ViniciusCassalesDev Use a JDBC connection. Google the JDBC database link.

    • @ViniciusCassalesDev 3 years ago +1

      @aaddiis45021 I need to do it with an RDS connection.

    • @DataEngUncomplicated 3 years ago

      Is your database in the same region as your Glue catalog?

    • @DataEngUncomplicated 3 years ago

      Is your RDS database in a VPC? If so, you will need to add a VPC endpoint.

  • @admiralbenbow7677 4 months ago

    I guess you forgot to show how to make a connection in pgAdmin first.

    • @DataEngUncomplicated 4 months ago

      Can you explain why you think you need to make a connection in pgadmin first? I walked through how to create the database connection in the glue catalog.

    • @admiralbenbow7677 4 months ago

      @DataEngUncomplicated Sorry, I am new to this, so forgive me if I am asking silly questions. Isn't the data stored locally on your computer, so you have to make a connection there first? If not, how can Glue find where it is, and how did it automatically recognize etldemo?

    • @admiralbenbow7677 4 months ago

      @DataEngUncomplicated Oh, silly me, I got confused with data migration. My bad 😅

  • @chitraalavanthar3729 3 years ago

    How would you load a partitioned table into a data lake?

    • @DataEngUncomplicated 3 years ago

      There are many different ways this can be achieved. Using AWS data wrangler for example, there is a parameter to specify the partition columns you want to use.