Advancing Spark - Data Lakehouse Star Schemas with Dynamic Partition Pruning!

  • Published 8 Jan 2025

COMMENTS • 19

  • @krishnakomanduri9432
    @krishnakomanduri9432 4 years ago +3

    Hey!
    I have been watching YouTube videos for ages, but this is the first time I am commenting on a YouTube video. Your content is awesome and I mean it! Just buy a professional mic and an HD camera and never stop making videos like this one. I'd love to see more practical demonstrations on your channel. Good job!

  • @ConnorRoss311
    @ConnorRoss311 4 years ago +2

    Great video! Can't wait for the git project either

  • @loganboyd
    @loganboyd 4 years ago

    Really like your videos. YouTube is NOT the best source for good detailed Spark content but watching videos is better than reading :)
    We are moving to Cloudera CDP from an HDP platform in the next couple of months. Spark 3.0 and its new features look cool and should be very helpful.
    Am I understanding the DPP feature correctly if I say it only provides partition pruning when these two things are true:
    1. you have a predicate on a column of a smaller dimension table that is joined to a larger fact table
    2. the join key on the fact table side is an existing partition column

    • @AdvancingAnalytics
      @AdvancingAnalytics 4 years ago

      Hey Logan - yep, I believe that's correct. This means you'll need to have tied your partitioning strategy to a foreign key of some sort to get maximum benefit from this approach, otherwise you'll never be hitting it... that said, I'm now questioning myself, I'll have a quick play over the next couple of days and confirm that it's only when the join key is your partition column. Lemme get back to you with a definitive!
      Simon
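
The two conditions discussed in this thread can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: the tables, the `partition_by` and `pruned_join` helpers, and the column names are all hypothetical, and the dimension key is assumed to be unique.

```python
from collections import defaultdict

# Hypothetical toy data: a small date dimension and a larger fact table.
dim_date = [
    {"date_key": 20200101, "year": 2020, "month": 1},
    {"date_key": 20200201, "year": 2020, "month": 2},
    {"date_key": 20210101, "year": 2021, "month": 1},
]

fact_sales = [
    {"date_key": 20200101, "amount": 10.0},
    {"date_key": 20200101, "amount": 5.0},
    {"date_key": 20200201, "amount": 7.5},
    {"date_key": 20210101, "amount": 3.0},
]


def partition_by(rows, column):
    """Group rows by a column, mimicking one folder per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def pruned_join(dim_rows, fact_partitions, dim_predicate, key):
    """Join filtered dimension rows to the fact table, skipping partitions
    whose key cannot match. Returns (joined_rows, partitions_scanned)."""
    # Condition 1: a predicate on the dimension yields a small set of keys.
    matching_dim = {row[key]: row for row in dim_rows if dim_predicate(row)}
    joined, scanned = [], []
    # Condition 2: the join key is the fact table's partition column, so
    # whole partitions can be skipped without reading their rows.
    for part_key, rows in fact_partitions.items():
        if part_key not in matching_dim:
            continue
        scanned.append(part_key)
        for row in rows:
            joined.append({**matching_dim[part_key], **row})
    return joined, sorted(scanned)


fact_partitions = partition_by(fact_sales, "date_key")
joined, scanned = pruned_join(
    dim_date,
    fact_partitions,
    lambda d: d["year"] == 2020 and d["month"] == 1,
    "date_key",
)
```

If the fact table were partitioned on some other column, the loop would have no partition key to test against the dimension's keys and every partition would have to be read, which is the point Logan's second condition makes.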

    • @karol2614
      @karol2614 2 years ago

      @AdvancingAnalytics
      Do you have an answer to Logan's question?

  • @gardnmi
    @gardnmi 3 years ago

    I'm not sure if there have been updates to how Spark handles data partitioning since this video, but when I tried out your example on a Delta table it actually managed to filter the date partition using the calculated date field within the fact table (see below). However, when I tested it with a non-date dimension that was partitioned, such as organization_id, and filtered on organizational_name, it was not able to filter the partitions, so the dynamic partition pruning join with an organizational_dim table outperformed the filter on the fact table.
    PartitionFilters: [isnotnull(service_from_date#299881), (date_format(cast(service_from_date#299881 as timestamp), y...,

  • @divyanshjain6679
    @divyanshjain6679 3 years ago

    Hi!
    I have gone through the AQE video and found it very interesting.
    Coming to DPP, I'm totally new to Delta Lake and don't know much about the concept. Can you please share the block of code you used to load data into the Delta table? Also, which Databricks dataset did you load? I can see multiple folders inside the "nyctaxi" dataset.
    Thanks

  • @karol2614
    @karol2614 2 years ago

    What is the best partitioning strategy for a star schema warehouse? This structure has big fact tables that relate to a large number of dimensions, so once you partition on one join key, queries that use another key will be suboptimal.

  • @EvgenAnufriev
    @EvgenAnufriev 2 years ago

    Could you share your opinion on whether the Data Vault methodology is a good fit for implementation with Databricks Spark and/or Spark Streaming (Azure Cloud) and Delta tables? Data size is in the tens of GBs/TBs.

  • @mohitsanghai5455
    @mohitsanghai5455 3 years ago

    Great video... Just have a few questions. You applied a filter on the Date dimension table, and Spark filters the data, converts it into a hash table and broadcasts it. At the same time it applies partition pruning on the Sales fact table and only picks up the required records. Does it broadcast those records as well? Does subquery broadcast mean those records? What if the filtered data is also huge? Will Spark still broadcast it, or use a SortMerge join in that case?

    • @AdvancingAnalytics
      @AdvancingAnalytics 3 years ago +1

      Broadcast join just means that one of the two joining tables is small enough to be broadcast. So if one side of the join is huge, each worker will only have the RDD blocks it needs, but it will pull a whole copy of the smaller table onto each worker so that it can satisfy all joins.
      If both sides of the query are huge, then yeah, it'll revert to a SortMerge etc., but at least it will still have pushed the partition filter back down to the file system
      Simon
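
The behaviour Simon describes can be sketched as a toy model in plain Python. It is a simplification under stated assumptions, not Spark's planner: `choose_join_strategy` and `broadcast_hash_join` are hypothetical helpers, the only rule modelled is the size threshold (Spark's documented default for `spark.sql.autoBroadcastJoinThreshold` is 10 MB), and real plans also depend on join type, hints and statistics.

```python
# Spark's documented default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
AUTO_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=AUTO_BROADCAST_THRESHOLD_BYTES):
    """Simplified rule: broadcast if the smaller side fits under the threshold."""
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


def broadcast_hash_join(small_rows, large_partitions, key):
    """Build one hash table from the small side, then probe it per partition."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    # Each inner list stands in for the blocks held by one worker; every
    # worker probes the same full copy of the small table.
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical toy data: a tiny dimension and a fact table split in two.
dim_rows = [{"date_key": 1, "year": 2020}, {"date_key": 2, "year": 2021}]
fact_partitions = [
    [{"date_key": 1, "amount": 2.0}],
    [{"date_key": 2, "amount": 3.0}, {"date_key": 1, "amount": 4.0}],
]

strategy_small = choose_join_strategy(4 * 1024, 50 * 1024**3)
strategy_large = choose_join_strategy(20 * 1024**3, 50 * 1024**3)
joined_rows = broadcast_hash_join(dim_rows, fact_partitions, "date_key")
```

Here a 4 KB dimension against a 50 GB fact table picks the broadcast path, while two multi-gigabyte sides fall back to the sort-merge label. The fact rows never move between partitions in the broadcast case; only the small table is copied.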

    • @mohitsanghai5455
      @mohitsanghai5455 3 years ago

      @AdvancingAnalytics Thanks for clearing up some doubts... But what was the subquery broadcast?

  • @ravisamal3533
    @ravisamal3533 3 years ago

    Hey, can you index your Spark videos playlist?

  • @adrestrada1
    @adrestrada1 3 years ago

    Hi Simon, do you have a GitHub account I can start following?

    • @AdvancingAnalytics
      @AdvancingAnalytics 3 years ago +1

      Not really! I have a git account for slides/demos from conference talks, but the examples on YouTube are all very quick & hardcoded to my env. We're looking at ways of sharing the notebooks in a more sustainable way!

  • @flixgpt
    @flixgpt 3 years ago +1

    Your accent is irritating and even the subtitles are not able to pick it up... it's hard to follow

    • @Advancing_Terry
      @Advancing_Terry 3 years ago +5

      If you press ALT+F4, YouTube will change the accent. It's a cool feature

    • @bittu007ize
      @bittu007ize 3 years ago

      @Advancing_Terry awesome feature

    • @curiouslycally
      @curiouslycally 6 months ago

      your comment is irritating