Advancing Spark - Rethinking ETL with Databricks Autoloader

Поділитися
Вставка
  • Опубліковано 24 січ 2025

КОМЕНТАРІ • 64

  • @raulescamilla3328
    @raulescamilla3328 3 місяці тому

    Awesome explanation, It was just what I was looking for.

  • @theenduser1037
    @theenduser1037 4 роки тому +1

    I read all the comments and I watched the full video too coz it’s very much informative. I had a interview question on the same context. I wish I had known about this video earlier.

  • @OliverVallack
    @OliverVallack 4 роки тому +2

    Great video as always, keep them coming!

    • @theenduser1037
      @theenduser1037 4 роки тому

      I desperately need help to write json file to azure table storage from Databricks

  • @HiYurd
    @HiYurd Рік тому

    Really good job on this tutorial. TY

  • @meghapatil6426
    @meghapatil6426 4 роки тому +1

    This was extremely helpful!

  • @shivaniverma2881
    @shivaniverma2881 2 роки тому

    How to read comma in data itself in CSV file?

  • @JackPickle
    @JackPickle 4 роки тому +4

    Great tutorial, many thanks. Is there any chance you could post your code?

  • @briancuster7355
    @briancuster7355 3 роки тому

    I've used autoloader before and had some problems with it loading parquet files. I never figured out what the problem was though.

  • @guptaashok121
    @guptaashok121 3 роки тому

    I tried to implement this, using csv as source, though it worked. I am having issue with header. Header record is getting written In my target delta table with each file. I am not sure where I can give that option. I checked all options of cloud file and did not find any.

  • @davidsou
    @davidsou 3 роки тому +1

    What's the difference between Autoloader and achieving the same result with Spark Streaming in Databricks?

    • @konradpaniec6038
      @konradpaniec6038 3 роки тому

      He answered that question at the bottom of the comment section

  • @AlejandraLemmo
    @AlejandraLemmo 3 роки тому

    Hello and thank you for this video. We have implemented autoloader in the PRD environment but we are experiencing that the process takes longer. He went from running in 4 minutes to 10 minutes. I see a connection with that the folder where the files arrive contains many files ...

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому

      Are you using cloudFiles.useNotifications? If not, then yes, it'll be using directory listing mode, which will read the entire contents of the blob and compare it to the checkpoint to work out what files need to be loaded - this process shouldn't take a super lengthy time, but can start hitting performance problems if you have many thousands of files!
      If that's not the case, then it's more likely a cluster sizing issue with the streaming query, which shouldn't be impacted by existing files!
      Simon

  • @skipa9906
    @skipa9906 8 місяців тому

    Great video. Where can we get the sample files?

  • @nsudeesh601
    @nsudeesh601 3 роки тому

    How to configure autoloader on AWS. Pls share vedio link if you have.
    Thanks

  • @bhaveshpatelaus
    @bhaveshpatelaus 4 роки тому +1

    What if the files get deleted in the blob storage? Does this affect the dataframe row count. Do you have to have all files present in blob storage at all times. Just thinking in terms of cost savings. Can a regular clean up of blob storage run to save storage costs.

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +3

      Hey - the files in blob are ignored once triggered. When a blob lands, it triggers a read in the dataframe, after which there is no permanent link between that blob and the data in your delta table.
      So yes, you can absolutely clear down your blob table once the files have been processed. Just bear in mind, if you load a file with the same name, this will be ignored by autoloader (it will assume you have already processed this file!)

    • @bhaveshpatelaus
      @bhaveshpatelaus 4 роки тому +3

      @@AdvancingAnalytics Great. Thanks for the response. Loving the work you are doing. Keep it up..

  • @vinayakakm-n5v
    @vinayakakm-n5v Рік тому

    I'm getting this error "
    com.databricks.sql.cloudfiles.errors.CloudFilesIllegalArgumentException: Found notification-setup authentication options for the (default) directory"

  • @JD-xd3xp
    @JD-xd3xp 3 роки тому

    Can i use Auto Loader to load Shape file (GeoSpatial file)

  • @morrolan
    @morrolan 3 роки тому +2

    I see that the default file discover mode is directory listing. This removes the need for the event grid and storage queue complexity. I am surprised that you didn't mention this or talk about it. Wouldn't this be ideal for many use cases?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому +3

      The directory listing approach is the same used in standard Spark Streaming over a file source, I guess I was focusing on the notification queues as the major autoloader benefit at that time, avoiding slow-down over large directories.
      Autoloader has had a bunch of features added since that give it additional benefits, such as the new schema evolution elements. Probably a very different conversation now than when it was released :)
      Simon

    • @777ashwin
      @777ashwin 3 роки тому +1

      You should avoid using file discovery mode in production, the storage costs are gonna be very high because it will continuously list the files in the directory.

  • @jayjaysize
    @jayjaysize 4 роки тому +1

    Thanks for the video. How can you get an audit of who uploaded the file? Is it possible to pass the meta data of the user's name, who uploaded the file, so that you can log that in the delta lake as well?

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +2

      Good question! Don't know - there's metadata available in the queue but I've not looked into detail. Certainly worth checking out!
      Simon

  • @Rothbardo
    @Rothbardo 4 роки тому +1

    So should any processes that update data based on data changes happen at the data lake layer, then it'll automatically get updated witih autoloader?
    What if an file gets overwritten?

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +4

      It's fairly blunt about that - it ONLY pushes through new files. Files that are appended to/overwritten don't trigger a new write. Definitely narrows the use case - basically if you're receiving transactional data into a regular location with unique file names, Autoloader will be great. If you have a more complex ingestion pattern, its probably not appropriate.
      For the later layers of the lake, I wouldn't use Autoloader - ie: you wouldn't point it at a delta table source and trigger based on new files there. For a chain of updates, you can stream directly from a delta table in a similar way, which won't pick up duplicate records.
      Definitely some downsides to the tool, doesn't solve everything. But if your data fits the expected pattern, could be a very nice way of doing early-stage ingestion.
      Simon

    • @Rothbardo
      @Rothbardo 4 роки тому +1

      @@AdvancingAnalytics Thanks for your reply!
      Love you content, man. Great work!

  • @sid0000009
    @sid0000009 4 роки тому

    Thanks for the video. I believe it does not support immutable source files. If we use Execute only once the only advantage we get is that we don’t need to determine which files are new and to be processed each time. This kind of confused me as checkpoints are meant to determine the same? What difference it would make if we don't use auto-loader and use execute once option with checkpoint..wont be determine self what files were processed already.. thank you!

  • @dineshbandaru5821
    @dineshbandaru5821 4 роки тому

    Can we write data into Data Warehouse using this method? and if so how?

  • @rakeshraushan6573
    @rakeshraushan6573 2 роки тому

    Can we use autoloader with delta source? is it recommended or am I headed in the wrong direction? My scenario involves transformation of new records arriving in the delta table. Please suggest. Thanks

  • @akankshagupta5067
    @akankshagupta5067 3 роки тому

    Thank you for the video but I have a question, I see that writeStream has spike in the dashboard each time file is loaded in the BLOB but why does row count is still 0 when I am checking in the table as I am directly write into the table?
    df.writeStream.format("csv").option("delimiter", ",").option("checkpointLocation", "mnt/path/path/_checkpoint").table("table1")

  • @mc1912
    @mc1912 3 роки тому +1

    For the first 3 files I think you don't see anything in the queue because the messages are already consumed at the time you query the queue.
    When you upload the next 2 files the stream is stopped, so you see the messages in the queue because they are not consumed.
    After you trigger once the stream those messages are consumed and dissaper from the queue.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому +1

      I'd need to rewatch the vid (it's been a while!) but the initialisation of an autoloader run (ie: when there is no existing checkpoint) will use directory listing as the event grid setup will only trigger messages for new/added files. Otherwise yep, messages will stay in the queue until flagged to dequeue once the stream has processed them, can sometimes be tricky to see what messages are in the queue as it's processing through them!

  • @jhoop2002
    @jhoop2002 3 роки тому

    Cool video!!!

  • @qweasdzxc2007
    @qweasdzxc2007 3 роки тому

    Thank you! What is a way of loading multible tables using Autoloader? Should i create separate stream for each directory or there is a convinient way of doing ELT for several sources using one Autoloader stream?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому

      Each separate source/destination is a separate streaming query. We usually build out generic, parameterised autoloader processes then trigger the notebooks in parallel to run various streams simultaneously

    • @jhansle1
      @jhansle1 2 роки тому

      @@AdvancingAnalytics Basically took the same approach but can't get it working with excel xlsx files. Any guidance on this?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 роки тому

      @@jhansle1 Hrm - I've not seen Autoloader working with the Excel reader, given that's a 3rd party driver. Were you doing direct file streaming, or trying to get it work work within the cloudFiles context?
      Simon

  • @werner6816
    @werner6816 Рік тому

    Does autoloader+delta tables work with serverless SQL warehouses, too ?

    • @AdvancingAnalytics
      @AdvancingAnalytics  Рік тому

      Yep, should do. Although we would tend to use autoloader via a normal spark cluster to load the data, then use DBX Serverless to query the delta tables - don't have to use a single cluster for everything :)

  • @vinayakakm-n5v
    @vinayakakm-n5v Рік тому

    Where can i get the client id and client secret?

    • @AdvancingAnalytics
      @AdvancingAnalytics  Рік тому

      Thats the Service Principal - it's an "application" created within Azure Active Directory (soon to be renamed as Microsoft Entra ID), you can get those parameters from the app object itself through AAD

    • @vinayakakm-n5v
      @vinayakakm-n5v Рік тому

      @@AdvancingAnalytics Does the redirect url has to be something relevant or can it be ignored?

    • @vinayakakm-n5v
      @vinayakakm-n5v Рік тому

      @@AdvancingAnalytics I used the correct keys but im getting this error "com.databricks.sql.cloudfiles.errors.CloudFilesIllegalArgumentException: Found notification-setup authentication options for the (default) directory"

  • @abhradwipmukherjee3697
    @abhradwipmukherjee3697 Рік тому

    Great as usual. If my streaming data is getting appended in a single file, will this work?

  • @gamachu2000
    @gamachu2000 4 роки тому

    what is the difference between autoloader and event trigger from Azure data Factory. Are the same. Thanks

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +3

      Hi! They're two different implementations that both use EventGrid behind the scenes. So the actual technology that triggers is the same.
      However, an ADF file trigger will trigger for each and every file that lands in blob. With Autoloader, the queue is held separately so you have the ability to offset and trigger periodically to process files that have come in since it last ran. For me, Autoloader has more flexibility, but it depends on your scenario

  • @kunalr_ai
    @kunalr_ai 4 роки тому +1

    Keep it up

  • @morris2582
    @morris2582 4 роки тому

    Good Video ! Thanks
    When I Try Autoloader find that is not support readStream XML file format , isn't it? When I try Autoloader got this exception.
    java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому

      Yeah, we've found Autoloader can be a little fussy about the files that it can read, and it can sometimes hide away the errors that would normally be returned by datareader failures. One thing is definite - if the stream reader can't read the file type, then nor can Autoloader!

  • @deepjyotimitra1340
    @deepjyotimitra1340 2 роки тому

    thank you so much for the video. its going to reduce pain of lot of data engineers.
    one quick question from my side - does the autoloader support data lake gen1? and also if i don't want to create any event subscription, can i achieve it ?

  • @IanSkelton
    @IanSkelton 4 роки тому

    Great video! I had this set up successfully reading json files from ADLS v2 with DRB version 7.0. However, when upgrading to DBR 7.2 or 7.3 it creates the event grid objects but fails to read the json files. Have you had any success with these versions?

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +1

      Yep! Currently using it on-project with a client on Runtime 7.3LTS. Autoloader over CSV/Parquet not JSON but should be fine.
      I did find it suddenly demanded I provided a schema, even for parquet files, but once provided all seemed absolutely fine!
      Simon

    • @IanSkelton
      @IanSkelton 4 роки тому

      Thanks Simon.
      Turns out the nullability constraints in the schema definiton are ignored in DBR 7.0.
      In 7.2 / 7.3, if nulls are encountered and your schema is set not allow them the stream fails with a generic "file could not be read" error. This had me stumped for ages!

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 роки тому +1

      ​@@IanSkelton Yeah, it's not a great error message! We've come to know the "file could not be read" as "someone messed up the schema definition" - nullability would definitely come into it!

  • @marcocaviezel2672
    @marcocaviezel2672 3 роки тому

    Hi Simon!
    Once again an awesome video!
    I followed your instructions and used ADLS Gen2 as source and sink.
    Somehow for me it only worked if I used .option(„cloudFiles.validateOptions“, „false“) in the readStream command, but everything worked as expected in the end. (Except that I don‘t see the messages in the queue.
    Do you have any idea why I need this option and did you find a solution how to show queues?

    • @marcocaviezel2672
      @marcocaviezel2672 3 роки тому

      Hi again,
      I just found the solution.
      I had Databricks only as a Storage Blob Data Contributor in the blob.
      I changed this to the Contributor role and now it works fine. (Except that the queues aren‘t displayed)

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 роки тому

      @@marcocaviezel2672 if the queues aren't being created, it might be in directory listing mode. If you're got cloudfiles.useNotifications set to True, then it will be creating the event grid topic & queue according to your config? The dequeue time is also pretty quick, so it's possible the messages are getting to the queue, then being read & deleted before you see them in there :)
      Simon

    • @marcocaviezel2672
      @marcocaviezel2672 3 роки тому

      @@AdvancingAnalytics Hi Simon,
      Thanks for your help!
      I toggle cloudfiles.useNotifications to true, but it definitely could be that the data is already deleted, since I‘m testing with a really small dataset. I thought that they might be saved and I can use them as a history, but if I understand correctly that‘s not the case.
      Have a great day!
      Marco