AWS Tutorials - Using Job Bookmarks in AWS Glue Jobs

  • Published Sep 6, 2024
  • The exercise URL - aws-dojo.com/e...
    AWS Glue uses job bookmarks to track the processing of data so that data processed in a previous job run does not get processed again. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.

COMMENTS • 50

  • @victorfeight9644
    @victorfeight9644 1 year ago

    Best explanation of maxBand I have heard.

  • @VishalSharma-hv6ks
    @VishalSharma-hv6ks 2 years ago +2

    Hi Sir,
    Thanks a lot for this wonderful video.
    I have a doubt. I am using AWS Glue as an ETL tool, reading data every day from an Oracle RDBMS.
    But in Oracle I have updates and deletes along with inserts. You mentioned that we can do incremental reads using bookmarking, but what about the deletes and updates on the Oracle side?
    How can we handle this situation?
    Thank you in advance, sir.

  • @sivahanuman4466
    @sivahanuman4466 1 year ago +1

    Excellent, sir. Very useful.

  • @howards5205
    @howards5205 11 months ago

    This is a great video. The visualization helped a lot also. Thank you so much!

  • @yusnardo
    @yusnardo 2 years ago +2

    Can I run the workflow recursively? I use boundedSize in my Glue job, so I need to run the job multiple times every month until the bookmark is done.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      A job can start another instance of the same job from the job code, as long as concurrency allows. But it is not a true recursive call - so think about the exit condition when doing so.
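
      For reference, a minimal sketch of that pattern with boto3 (the job name and the has_more_data check are hypothetical):

      import boto3

      def maybe_restart_self(job_name: str, has_more_data: bool) -> None:
          """Start another run of the same Glue job while there is still
          unprocessed data -- the exit condition that stops the chain."""
          if not has_more_data:
              return  # exit condition: stop once the backlog is drained
          glue = boto3.client("glue")
          # Succeeds only if the job's MaxConcurrentRuns allows another run.
          glue.start_job_run(JobName=job_name)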

  • @harishnttdata2325
    @harishnttdata2325 3 years ago +2

    Very useful video. A time saver.

  • @tiktok4372
    @tiktok4372 2 years ago +1

    Thank you for the video. I have a question: does the job bookmark work with a DataFrame? Suppose I use glueContext.create_data_frame_from_catalog, then do some transformations on the DataFrame and write the DataFrame to an S3 bucket.
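
    The thread leaves this one open. As far as I know, bookmarks are tracked through the transformation_ctx of DynamicFrame reads and writes, not through plain DataFrames, so the usual pattern is to read as a DynamicFrame, convert, and convert back. A minimal sketch (database, table, column, and bucket names are hypothetical):

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read as a DynamicFrame so bookmark state is recorded via transformation_ctx.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",                    # hypothetical database name
        table_name="my_table",               # hypothetical table name
        transformation_ctx="read_my_table",  # the handle the bookmark is keyed on
    )

    df = dyf.toDF()                           # transform with the DataFrame API
    df = df.filter(df["status"] == "active")  # hypothetical transformation

    # Convert back and write with a transformation_ctx of its own.
    out = DynamicFrame.fromDF(df, glueContext, "out")
    glueContext.write_dynamic_frame.from_options(
        frame=out,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/"},  # hypothetical bucket
        format="parquet",
        transformation_ctx="write_output",
    )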

  • @tylerdurden8692
    @tylerdurden8692 1 year ago +1

    When I try to specify multiple keys in jobBookmarkKeys, it's not working; it always takes only the primary key of the JDBC table. Even when there are modifications to existing records, they are not picked up - it processes them again. Am I missing anything here?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago +1

      You can use multiple keys as long as their values are increasing or decreasing. Is that happening in the table?

    • @tylerdurden8692
      @tylerdurden8692 1 year ago

      @@AWSTutorialsOnline No - so you are saying the key field should be an auto-increment kind of field?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      @@tylerdurden8692 Yes, incrementing or decrementing. Please check this link; it has the rules for JDBC - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
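
      Per the docs page linked above, multiple keys go in additional_options on the catalog read. A minimal sketch, with hypothetical database, table, and column names:

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glueContext = GlueContext(SparkContext.getOrCreate())

      # Multiple bookmark keys for a JDBC source; per the linked docs,
      # the key values must be strictly increasing or decreasing.
      dyf = glueContext.create_dynamic_frame.from_catalog(
          database="my_db",      # hypothetical
          table_name="orders",   # hypothetical
          transformation_ctx="read_orders",
          additional_options={
              "jobBookmarkKeys": ["order_id", "version_no"],  # hypothetical columns
              "jobBookmarkKeysSortOrder": "asc",
          },
      )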

  • @veerachegu
    @veerachegu 2 years ago +1

    Thank you so much - the explanation is very clear-cut.

  • @abir95571
    @abir95571 2 months ago

    How does the job bookmark scale on a massive data set?

  • @pulakhazra5792
    @pulakhazra5792 2 years ago +1

    Very clear and helpful.

  • @mohdshoeb5101
    @mohdshoeb5101 3 years ago +1

    How can I manage multiple joined tables through bookmarks? When joining tables I don't have a unique key, so I concatenate multiple IDs to get one. I need to set the bookmark with multiple keys. Please tell me how we can do this.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Apologies for the late response due to my summer break.
      Joining tables for the bookmark is not possible. You might want to create an ETL Glue job which merges these datasets together and creates a primary key, then run bookmark-based processing on the merged dataset. Hope it helps.
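
      A rough sketch of that merge-then-key idea in PySpark (table, column, and path names are hypothetical; whether a concatenated string key satisfies the increasing/decreasing rule for bookmark keys depends on the data):

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.getOrCreate()

      orders = spark.table("my_db.orders")        # hypothetical source tables
      customers = spark.table("my_db.customers")

      # Merge the datasets, then build a single surrogate key by
      # concatenating the individual IDs.
      merged = (
          orders.join(customers, on="customer_id")
          .withColumn("merged_key", F.concat_ws("-", "customer_id", "order_id"))
      )

      # Persist the merged dataset; a later bookmark-enabled job can then
      # read it with merged_key as the bookmark key.
      merged.write.mode("overwrite").parquet("s3://my-bucket/merged/")  # hypothetical path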

  • @deepakshrikanttamhane285
    @deepakshrikanttamhane285 2 years ago +2

    Hi Sir, it's very helpful, but how do I configure an S3 timestamp-based job bookmark instead of using a bookmark key?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      I think when you just enable the job bookmark without mentioning any key, it uses the timestamp for the bookmark. Please check this link - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

    • @deepakshrikanttamhane285
      @deepakshrikanttamhane285 2 years ago

      Great, it works!
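
      For reference, the timestamp-based behaviour described above needs no key at all: bookmarks just have to be enabled on the job, and the S3 read given a transformation_ctx. A minimal sketch (paths are hypothetical):

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glueContext = GlueContext(SparkContext.getOrCreate())

      # Bookmarks are switched on at the job level (console: Job details ->
      # Job bookmark, or the job argument --job-bookmark-option job-bookmark-enable).
      dyf = glueContext.create_dynamic_frame.from_options(
          connection_type="s3",
          connection_options={"paths": ["s3://my-bucket/input/"]},  # hypothetical
          format="csv",
          # With no bookmark keys, S3 objects are tracked by their
          # last-modified timestamps under this context name.
          transformation_ctx="read_input",
      )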

  • @YogithaVenna
    @YogithaVenna 3 years ago +1

    Where is the state information stored? Is it persisted in any data store? What happens behind the scenes?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      That information is not public, so I cannot say with confidence.

  • @mylikeskeyan2055
    @mylikeskeyan2055 1 year ago

    Please put up a demo of JDBC with bookmarking on a table that shows only the daily updated records in the output.

  • @abdulhaseeb4980
    @abdulhaseeb4980 3 years ago +1

    Hi, I hope you are doing great. Currently I save entries for new files to SQS and then read those files from Glue, but now I want to use the bookmark option. I am using a Python shell job and it's not supported there. I will now move to a Spark job, but I will not use the Spark context there. Can you please guide me on how I can do this?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      In order to use the job bookmark, you have to program in a certain way using the Spark context. This link might help - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
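
      That "certain way" is, as far as I can tell, the standard Glue Spark job skeleton: initialize a Job, give every read and write a transformation_ctx, and call job.commit() at the end so the bookmark state is saved. A minimal sketch (database, table, and path names are hypothetical):

      import sys
      from awsglue.utils import getResolvedOptions
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from pyspark.context import SparkContext

      args = getResolvedOptions(sys.argv, ["JOB_NAME"])
      glueContext = GlueContext(SparkContext.getOrCreate())
      job = Job(glueContext)
      job.init(args["JOB_NAME"], args)  # loads the bookmark state for this job

      dyf = glueContext.create_dynamic_frame.from_catalog(
          database="my_db",                  # hypothetical
          table_name="events",               # hypothetical
          transformation_ctx="read_events",  # bookmark is tracked per context name
      )

      # ... transformations ...

      glueContext.write_dynamic_frame.from_options(
          frame=dyf,
          connection_type="s3",
          connection_options={"path": "s3://my-bucket/out/"},  # hypothetical
          format="parquet",
          transformation_ctx="write_events",
      )

      job.commit()  # persists the new bookmark; without this, nothing advances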

  • @creativeminds7397
    @creativeminds7397 2 years ago +1

    Hello,
    Your videos are simply superb 👌. I have PGP-encrypted files in S3 and I need to implement bookmarks. Can you tell me whether that will work or not? If not, is there another approach to follow?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Hi, sorry, I have never worked with PGP files. Hard to say without testing.

  • @vishalrajmane7649
    @vishalrajmane7649 3 years ago +1

    Do you have any video on incremental load in AWS Glue for newly inserted, updated, and deleted data from source to target?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      I don't have any video on this. But if you are ingesting data from a relational database, there are two methods that can work: 1) using a Lake Formation Blueprint, or 2) using the Amazon Database Migration Service (DMS) to move data to S3.
      I have videos about blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel.

    • @vishalrajmane7649
      @vishalrajmane7649 3 years ago

      Thanks for the help. I will check the options you have suggested. 🙂

  • @deepakbhutekar5450
    @deepakbhutekar5450 1 year ago

    Sir, how do we handle updated records using the job bookmark? How does jobBookmarkKey identify that a given record has been updated? Once a particular record is processed and bookmarked, if for some reason that record gets updated in the source table, how do we handle the situation using the job bookmark?

  • @joseabzum3073
    @joseabzum3073 3 years ago +1

    What if I want to delete a .csv? Can some process automatically delete the parquet file?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      You need to use the boto3 S3 API to delete the file. Please check this link - boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_object
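
      A minimal sketch of that delete call (bucket and key are hypothetical; a Parquet dataset written by Glue usually has several part files under one prefix):

      import boto3

      s3 = boto3.client("s3")
      # Deletes a single object; repeat (or list the prefix first) for
      # every part file that belongs to the dataset.
      s3.delete_object(Bucket="my-bucket", Key="output/part-00000.parquet")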

    • @joseabzum3073
      @joseabzum3073 3 years ago

      @@AWSTutorialsOnline Hi, but how can I know which parquet file belongs to a deleted .csv?

  • @user-gs5bl9jm9k
    @user-gs5bl9jm9k 1 year ago

    Hello, how can we reset the Glue job state?
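
    The question goes unanswered in the thread. Assuming the bookmark state is what's meant, Glue exposes a ResetJobBookmark API; a minimal sketch (the job name is hypothetical):

    import boto3

    glue = boto3.client("glue")
    # Clears the bookmark so the next run reprocesses everything;
    # the CLI equivalent is: aws glue reset-job-bookmark --job-name my-etl-job
    glue.reset_job_bookmark(JobName="my-etl-job")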

  • @selvaganesh2529
    @selvaganesh2529 3 years ago

    Hi, when I try to reset the bookmark I am getting "EntityNotFoundException: continuation for job not found". The source is S3, and I have not altered the transformation_ctx either. What might be the error?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Not sure - I have never come across this error. Can you share more details about what you are doing, so that I can reproduce it?

    • @selvaganesh2529
      @selvaganesh2529 3 years ago

      @@AWSTutorialsOnline I fixed the issue. It was due to the job_name, which I had passed as a parameter and which shouldn't be given, as per the AWS documentation.

  • @kumark3176
    @kumark3176 2 years ago

    Hi Sir,
    Thanks for sharing the information on bookmarks.
    I have a task to build bookmark functionality using PySpark, with the bookmark kept in DynamoDB.
    I am new to the big data framework technologies, and we're moving from Glue bookmarking to our own customized code (written in PySpark or Java).
    Can you please suggest any material or sample code I can use as a reference? We're trying to update based on lastUpdatedTime and DelayTime, as mentioned by you in this tutorial. Please reply and help me. Thank you.
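
    No reply in the thread, but one common shape for a hand-rolled bookmark like this is a single DynamoDB item holding the last processed timestamp, with a delay window so late-arriving records settle before they are read. A rough sketch (the table name, key names, source table, and delay value are all hypothetical):

    from datetime import datetime, timedelta, timezone

    import boto3
    from pyspark.sql import SparkSession, functions as F

    DELAY = timedelta(minutes=15)       # hypothetical settle window
    ddb = boto3.resource("dynamodb")
    state = ddb.Table("etl_bookmarks")  # hypothetical state table

    # Read the previous high-water mark (an ISO-8601 string).
    item = state.get_item(Key={"job_name": "my-etl-job"}).get("Item")
    low = item["last_updated_time"] if item else "1970-01-01T00:00:00+00:00"
    high = (datetime.now(timezone.utc) - DELAY).isoformat()

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("my_db.events")    # hypothetical source

    # Process only the window (low, high]; the delay keeps rows whose
    # timestamps are still being written out of this run.
    batch = df.filter(
        (F.col("last_updated_time") > low) & (F.col("last_updated_time") <= high)
    )
    # ... transform and write `batch` ...

    # Advance the bookmark only after the write succeeds.
    state.put_item(Item={"job_name": "my-etl-job", "last_updated_time": high})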

  • @vishalrajmane7649
    @vishalrajmane7649 3 years ago +1

    If you have one, please provide me the link.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +1

      I don't have any video on incremental updates. But if you are ingesting data from a relational database, there are two methods that can work: 1) using a Lake Formation Blueprint, or 2) using the Amazon Database Migration Service (DMS) to move data to S3.
      I have videos about blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel and go through the AWS documentation to understand the incremental update part.