AWS Glue PySpark: Upserting Records into a Redshift Table

  • Published 12 Apr 2023
  • This video is a step-by-step guide on how to upsert records into a Redshift table from a dynamic DataFrame using PySpark. It uses a file from S3 containing both new and existing records that we want to upsert into our Redshift table (a sketch of this first step follows below).
    github: github.com/AdrianoNicolucci/d...
    Related videos: • Add Redshift Data Sour...
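
For readers following along, a minimal sketch of the first step described above: reading the S3 file of new and existing records into a DynamicFrame. The bucket, path, and CSV options are placeholders, not the ones from the video or the linked repo.

```python
# Minimal sketch (placeholder names): read the S3 file containing new and
# existing records into a Glue DynamicFrame.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

incoming_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/incoming/records.csv"]},
    format="csv",
    format_options={"withHeader": True},
)
incoming_dyf.printSchema()
```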

COMMENTS • 26

  • @mariumbegum7325
    @mariumbegum7325 1 year ago

    Great explanation 😀

  • @tuankyou9158
    @tuankyou9158 10 months ago

    Thanks for sharing your solution 😍😍

  • @ashishsinha5338
    @ashishsinha5338 1 year ago

    Good explanation regarding the staging approach.

  • @critical11creator
    @critical11creator 1 year ago +1

    Amazing tutorials! Truly haven't seen such drilled-down content in a while. Is there a native PySpark course, perhaps, in the making? :) I'm certain it would be much appreciated by many if such a course existed on this channel.

    • @DataEngUncomplicated
      @DataEngUncomplicated  1 year ago

      Thank you for your kind words! I have been slowly adding PySpark-related content, but I don't have a full course in the making; I wish I had more time!

  • @asfakmp7244
    @asfakmp7244 2 months ago

    Thanks for the video! I've tested the entire workflow, but I'm encountering an issue with the section on creating a DynamicFrame from the target Redshift table in the AWS Glue Data Catalog and displaying its schema. While I can see the updated schema reflected in the Glue catalog table, the code you provided still prints the old schema.
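
For reference, the step this comment refers to looks roughly like the sketch below; the catalog database, table name, and temporary S3 directory are placeholders.

```python
# Minimal sketch (placeholder names): create a DynamicFrame from the target
# Redshift table registered in the AWS Glue Data Catalog and print its schema.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

target_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_database",
    table_name="my_redshift_target_table",
    redshift_tmp_dir="s3://my-example-bucket/temp/",
)
target_dyf.printSchema()
```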

  • @vivek2319
    @vivek2319 1 year ago

    Please make more such diverse videos with what-if scenarios.

  • @datagufo
    @datagufo 7 months ago

    Hi Adriano, first of all thanks for the amazing series of tutorials. They are really clear and detailed.
    I am trying to implement the UPSERT into Redshift using AWS Glue, but I am getting what seems to be an odd problem.
    If I run my glue script from the notebook (it is actually a copy-paste from your notebook, with minor adaptations to make it work with my data and setup), when writing to Redshift the "preactions" and "postactions" are ignored, meaning that I end up with just a `staging` table that never gets deleted and to which data are simply appended. And no `target` table is ever created.
    Have you ever had such a problem? I could not find any solution online, and I do not understand why your code works for you but not in my case.
    Thanks again!
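
For context, the preactions/postactions pattern being discussed looks roughly like the sketch below. The connection name, schema, table names, merge key, and SQL are illustrative placeholders, not the exact statements from the video.

```python
# Minimal sketch (placeholder names): write the incoming records to a staging
# table, then merge them into the target and drop staging via SQL that Glue
# runs as preactions/postactions around the Redshift write.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# incoming_dyf is the DynamicFrame read from S3 (see the earlier sketch).

pre_query = """
DROP TABLE IF EXISTS public.staging_table;
CREATE TABLE public.staging_table (LIKE public.target_table);
"""

post_query = """
BEGIN;
DELETE FROM public.target_table
USING public.staging_table
WHERE public.target_table.id = public.staging_table.id;
INSERT INTO public.target_table SELECT * FROM public.staging_table;
DROP TABLE public.staging_table;
END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=incoming_dyf,
    catalog_connection="my-redshift-connection",
    connection_options={
        "database": "dev",
        "dbtable": "public.staging_table",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir="s3://my-example-bucket/temp/",
)
```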

    • @DataEngUncomplicated
      @DataEngUncomplicated  7 months ago

      Ciao Alberto, thanks!
      Hmm, I think I might have had this happen to me before. Can you check to make sure you haven't misspelt any of the parameters? I think if there is an error, it would just ignore the preactions.

    • @datagufo
      @datagufo 7 months ago

      Ciao Adriano (@@DataEngUncomplicated)!
      Thanks a lot for your reply. I also thought that might be the case, but it does not seem like it is.
      I really tried to copy & paste your code. Moreover, the same thing happens with code generated by the Visual Editor, which I assume has the correct syntax.
      I was wondering whether it could be related to the permissions of the role used to run the script, but I do not see why it would be allowed to write data to the table but not to run the SQL preaction ...
      In the meantime, I really enjoyed your other video about local development; it really helps keep dev costs down and significantly speeds up the development cycle.

    • @DataEngUncomplicated
      @DataEngUncomplicated  7 months ago

      Did you check to make sure your user in the database has permissions to create and drop a table? Maybe your user only has read/write access?

  • @rambandi4330
    @rambandi4330 1 year ago

    Thanks for the video. Does this work for RDS Oracle?

    • @DataEngUncomplicated
      @DataEngUncomplicated  1 year ago

      I'm not sure. I haven't worked with RDS Oracle, but in theory it should.

    • @rambandi4330
      @rambandi4330 1 year ago

      @@DataEngUncomplicated Thanks for the response👍

  • @muralikrishnavattikunta8466
    @muralikrishnavattikunta8466 11 days ago

    Can you do a simple video on S3 to Oracle data migration?

  • @mohammadfatha7740
    @mohammadfatha7740 1 year ago

    I followed the same steps, but it's throwing an error along the lines of: the id column is integer and the value being queried is varying.

    • @DataEngUncomplicated
      @DataEngUncomplicated  1 year ago

      Hey, it sounds like you might have mixed data types in your column. You perhaps think it's an int, but there are actually some strings in there.
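
If the mismatch is inside the DynamicFrame itself (a column that Glue infers as a choice of int and string), one way to force a single type before writing is resolveChoice; the column name and target type below are assumptions for illustration only.

```python
# Minimal sketch: cast the "id" column to a single type before writing to
# Redshift. "id" and "long" are placeholder assumptions for illustration.
# incoming_dyf is the DynamicFrame read from S3, as in the earlier sketch.
resolved_dyf = incoming_dyf.resolveChoice(specs=[("id", "cast:long")])
resolved_dyf.printSchema()
```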

  • @NasimaKhatun-jb7qo
    @NasimaKhatun-jb7qo 4 months ago

    Hi, where are you running the code?

    • @DataEngUncomplicated
      @DataEngUncomplicated  4 months ago

      Hi, I'm running my code locally using an interactive Glue session.

    • @NasimaKhatun-jb7qo
      @NasimaKhatun-jb7qo 4 months ago

      I am trying my hand at running the code locally. Can you create a video on how to run Glue jobs locally (notebook version), covering setup and configuration?

    • @DataEngUncomplicated
      @DataEngUncomplicated  4 months ago

      I actually have many videos on this; for example, see this one: ua-cam.com/video/__j-SyopVBs/v-deo.html. You can set up Docker to run Glue locally, or use interactive sessions, but interactive sessions will cost compute in AWS since you are just connecting to the cluster remotely. Either way, you can use a Jupyter notebook to do this.
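
As a rough illustration of the interactive-session route mentioned above, the first cell of a local Jupyter notebook using the AWS Glue interactive sessions kernel might look like the sketch below. The magic values are illustrative only; the session still runs on AWS compute.

```python
# Minimal sketch: first cell of a local Jupyter notebook connected to an AWS
# Glue interactive session. The magic values below are illustrative only.
%glue_version 4.0
%idle_timeout 30
%worker_type G.1X
%number_of_workers 2

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
```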

    • @NasimaKhatun-jb7qo
      @NasimaKhatun-jb7qo 4 months ago

      Yes, I have seen that video and had the same impression about cost.
      I am trying to set up a local environment where I can use local Spark (AWS Glue) with, say, a Jupyter notebook, and also connect to S3 and other services from my local machine.
      Do you recommend another way to work locally? Also, how can this setup be done? I've been trying for a long time without success.