Advancing Fabric - Lakehouse vs Warehouse

  • Published 29 Sep 2024
  • Microsoft Fabric has been released, and there's a flood of information out in the world for you to start digesting! Fabric has different experiences and different workload types, but the underlying data is held as Delta tables.
    With the T-SQL Synapse Warehouse workload and the Spark-based Data Engineering workload, it's a confusing choice. Does it matter? Are there pros/cons to each side? Why are there two experiences??
    In this video Simon walks through the differences and explains where he would use the two workloads!
    If you're looking at starting a Microsoft Fabric journey, and need some real Lakehouse experts, talk to Advancing Analytics!

COMMENTS • 58

  • @effortLus 1 year ago +24

    Watching your videos is such a time saver that it actually feels like I'm cheating within the industry

  • @SurajSingh-ct9dy 1 year ago +4

    Now do you choose Databricks or Microsoft Fabric to build a Lakehouse!?

  • @ItsNotAboutTheCell 1 year ago +8

    Fabricators!!!

  • @alisabenesova 3 months ago

    Did someone try to run the same analysis queries in Dbx and in Fabric? Is there any real comparison? I mean... if I am going to choose an F16, is it equivalent to what, 16 cores and 32 GB RAM, or???

  • @AndreasBergstedt 1 year ago +7

    Good take!
    Just for reference, I've built dynamic T-SQL-based metadata frameworks for longer than I would like to admit, going back to 2011. Just to highlight that data engineering is not uniquely an Apache Spark paradigm.

    • @AdvancingAnalytics 1 year ago +6

      Yep, absolutely - I've built way too many SQL frameworks myself. But I differentiate between the code generators (my engine builds SQL at runtime/deployment), dynamic SQL (my code cannot be followed without lighting a candle and performing a ritual sacrifice), and actual dynamic, parameterised workflows in a programming language (scala, python, C#, whatever). I'm likely using the wrong terms, but I think it's important to differentiate the step change of how we think about our ETL between those approaches.
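
A minimal sketch of the "parameterised workflow in a programming language" idea described above - the table names and metadata list are hypothetical, and `spark` is assumed to be an existing session (e.g. in a Fabric or Databricks notebook):

```python
from pyspark.sql import DataFrame

# Hypothetical metadata describing what to load; in practice this might come from a config table.
tables_to_load = [
    {"source": "raw.customers", "target": "silver.customers", "keys": ["customer_id"]},
    {"source": "raw.orders",    "target": "silver.orders",    "keys": ["order_id"]},
]

def deduplicate(df: DataFrame, keys: list) -> DataFrame:
    # One shared, testable transformation applied to every table, instead of generated SQL strings.
    return df.dropDuplicates(keys)

for cfg in tables_to_load:
    df = spark.read.table(cfg["source"])
    deduplicate(df, cfg["keys"]).write.mode("overwrite").saveAsTable(cfg["target"])
```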

    • @chasedoe2594 1 year ago

      I have a metadata-driven framework as well. It makes maintaining the pipeline way easier for me.

    • @AndreasBergstedt 1 year ago +3

      @@AdvancingAnalytics This "my code cannot be followed without lighting a candle and performing a ritual sacrifice" made me literally roll on the floor laughing. Black magic and all taken into consideration, I'll jump back to my DOS scripting days %1 🤣

  • @alexischicoine2072 7 months ago

    Is Fabric Spark enough cheaper to offset the loss of performance you'd get from Databricks optimizations and Photon?

  • @Phoenixspin 1 year ago +2

    "Agonizing indecision" - that's my life.

  • @oscarzapatajr2 1 year ago +1

    Good analysis, but you didn't include Data Factory in the equation... You can use dataflows (just Power Query Online) to do some fairly complex data transformations and automate when those dataflows run to load data into a data warehouse...
    For someone with a background in data analysis and Power BI, this is a more natural choice, because we are already used to Power Query

  • @PhilKoay 1 year ago +2

    Great work! But should the scalability of the Spark engine also be highlighted, and arguably that Polaris is truly serverless? Is there much in the way of pros and cons around this? Cheers

  • @BeArt1980 1 year ago +2

    One of the things that excited me in the preview was Dataflow Gen2! Are you going to touch on that as well?

  • @dandical 1 year ago +2

    This was exactly the question I needed answering after the announcement.

  • @yoavlubo 1 year ago +1

    I love the simplicity of the approach!
    Rare to find these days, thank you!
    It all comes down to your team's/consumers' skillsets, but if they have all the skillsets, it doesn't matter at all

  • @alexischicoine2072 7 months ago

    Transactional support is probably the one thing you'd gain over Databricks.

    • @alexischicoine2072 7 months ago

      It's something I've leveraged in Snowflake in the past to be able to modify two tables in a transaction and have a rollback if it fails. In reality though, it's not that hard to live without it.
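
For illustration, a minimal sketch of the two-table transaction pattern mentioned above, using pyodbc against a T-SQL endpoint - the connection string and table names are hypothetical:

```python
import pyodbc

# Hypothetical connection string; autocommit is off so we control the transaction ourselves.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=<endpoint>;Database=<db>;Authentication=ActiveDirectoryInteractive",
    autocommit=False,
)
try:
    cur = conn.cursor()
    cur.execute("UPDATE dbo.accounts SET balance = balance - 100 WHERE id = ?", 1)
    cur.execute("UPDATE dbo.accounts SET balance = balance + 100 WHERE id = ?", 2)
    conn.commit()    # both updates become visible together
except Exception:
    conn.rollback()  # neither update is applied if either statement fails
    raise
finally:
    conn.close()
```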

  • @FakhrudinJivanjee 1 year ago +1

    Looking forward to it. I would be more interested to see the cost impact (bang for buck) for organisations:
    - Wanting to move from traditional ADF, Synapse, and ADLS to Fabric
    - Wanting to make a choice between Fabric and Databricks; as much as they complement each other, they can also be used exclusively.
    And does storage taxonomy go out the window, with whoever has access able to write to OneLake in whatever way they feel? (I could be incorrect about this assumption.)

    • @oscarzapatajr2 1 year ago

      Could you explain how Databricks could be used instead of Fabric? Sounds really interesting, but I don't know that framework...

  • @bassplayer6 1 year ago +1

    Thoughts on how SSAS / "Direct Lake" will work/impact (or not) when used via VS "tabular models" and PBI? I am assuming SSAS multidimensional is mostly dead at this point, correct?

    • @AdvancingAnalytics 1 year ago +1

      Oooh, that's a Microsoft question. I can take a guess - given SSAS is largely used to get around model size limits in Power BI, and DirectLake pretty much removes any model size limits... The argument for SSAS doesn't look great. But I honestly don't know, there might be other factors!

    • @paulmaksimovic9235 1 year ago

      I agree - we haven't carried out an SSAS implementation for a couple of years. PPU removed any need for it. Fabric seems to put the nail in the coffin. The only potential consideration is where Microsoft offered x number of Pro licenses as a consequence of another Microsoft implementation (Dynamics).

  • @BajkeTheSerb 9 months ago

    Hi, are you sure about the "full T-SQL support" in the Warehouse? (Some features, e.g. the NVARCHAR data type, temporary tables, etc., are NOT supported.)

  • @NeumsFor9 1 year ago +1

    Marketing exercises. Muscle up, baby! RIP dedicated SQL pool.....😂

    • @chasedoe2594 1 year ago +1

      I was expecting something new for dedicated SQL pool, but didn't expect something this big, where I have to move out of Azure.

    • @NeumsFor9 1 year ago

      @@chasedoe2594 To be fair, the dedicated pool technology base is from the MPP DATAllegro appliance......going back to the early-to-mid 2000s. Simon could keep me honest on that date range though.....with Teradata, I believe, being the originator of the MPP tech.

    • @EricLawson-Clyde 1 year ago

      @@NeumsFor9 I'm fairly sure Tandem NonStop got in there first with MPP. During the '90s, it already had disk processes that performed SQL predicate filtering (selection & projection) using dedicated memory cache, and it had an X & Y communications bus fabric for dual-path reads and writes. That was baked in for fault tolerance, but came to the fore when the SQL engine was deployed into the OS after flat files (Enscribe, the equivalent of VSAM). All of this gave it the unique ability to hot-swap CPUs and add them in without the need to reboot. You could have 16 locally configured paired CPUs, and if I recall it was technically feasible to run that out to many hundreds of distributed nodes.
      This is the timeline, not sure where Teradata fitted in, would be good to know.
      1970 - Information retrieval whitepaper by EF Codd published
      1970 to 1980 - Ingres research project at University of California, Berkeley. The codebase, via BSD license, spawned Sybase, Microsoft SQL Server & Tandem NonStop SQL
      1979 - Oracle was the first commercially available product
      1981 - IBM SQL/DS, now known as DB2, spawned from System R
      1987 - NonStop SQL released by Tandem Computers, ft. parallel processing & fault tolerance
      1989 - SQL Server 1.0 released by Microsoft, Sybase & Ashton-Tate
      1995 - SQL Server 6.0, the first version without Sybase involvement

  • @qkktech 11 months ago

    Fabric will solve a lot on the ODS side of the data warehouse, but there are gaps that are not filled, and they are hard to fill too - like some cleaning and modelling issues. When you rely only on the current state of data, a lakehouse may fit, but when versioning comes into play, the data warehouse wins. There are few occasions when lakehouse reporting is the right choice. But going to the Synapse side, it has problems supporting a modern data warehouse, like indexing random MD5 hashes that are the backbone of the hashdiffs used in a Data Vault data warehouse. Also, when taking data into SQL Server there are limitations, like a 50-column limit. That is another thing that MS is doing differently, and it will be an MS-against-the-rest-of-the-world thing. Fabric actually has worse support for getting data out of its own products, like D365, than other cloud-based solutions. When I want to get D365 data into a warehouse, Fabric may fit once they get rid of the 50-column limitation; then I apply dbt-fabric, get a Data Vault model out of dbt, and build Unified Star Schema-based marts on top of it. The current tooling of Fabric is not optimized for getting out diffs - like what changed in the data when I made a report of 23Q2 today versus 3 months ago, and why I get different results when the calculations have not changed. It is a lot more complicated.
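
For context on the hashdiff technique mentioned above, a minimal PySpark sketch of computing an MD5 hashdiff over a set of attribute columns - the staging table and column names are hypothetical:

```python
from pyspark.sql import functions as F

attr_cols = ["name", "email", "segment"]           # hypothetical satellite attributes
stg = spark.read.table("staging.customer")         # hypothetical staging table

stg_hashed = stg.withColumn(
    "hash_diff",
    # Concatenate the attributes with a delimiter (nulls coalesced to empty strings), then MD5 them.
    F.md5(F.concat_ws("||", *[F.coalesce(F.col(c).cast("string"), F.lit("")) for c in attr_cols])),
)
# Rows whose hash_diff differs from the latest satellite record are the changes to load.
```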

  • @alexischicoine2072 7 months ago

    I enjoyed using Data Factory in the past; it's great for ingesting or moving data around and for orchestration. I wouldn't use Fabric for transformations at scale though.

  • @antonyliokaizer 1 year ago

    I have two tables in a Lakehouse and I want them also in the Data Warehouse. Should I use "Copy Data" in a pipeline to clone them, or is there any other way to create a "Shortcut" from the Warehouse to the Lakehouse??? Thanks.

  • @preetipics 1 year ago

    Great comparison, thanks. In my org, the Power BI tenant is managed by Europe and they don't know if they're going to be activating Fabric for the UK to test. Will MS make us move to Fabric at some point anyway?

  • @jpcst 5 months ago

    Great and simple explanation! Thanks for sharing this!

  • @gauravchaturvedi3615 6 months ago

    Could you cover the following Fabric topics in your next video? 1) Is the storage used by OneLake deployed to the customer's Azure account or Microsoft's own Azure account? 2) Do we have the capability to define a data lifecycle management policy for data on OneLake, like we have for ADLS Gen2? 3) Is the cost of OneLake storage the same as the cost of Blob or ADLS Gen2 storage?

    • @alisabenesova 3 months ago

      It should be, since it is just an abstraction layer on top of ADLS Gen2. Also, the storage is probably created on the fly as you use data, the same as Databricks, but when you do pricing, you can estimate the gigabytes necessary and see. It is very cheap, like $0.024 for 1 GB.
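
As a rough worked example of estimating storage cost from the per-GB rate quoted above (the rate and volume are illustrative only):

```python
gb_stored = 500                 # estimated OneLake footprint in GB
price_per_gb_month = 0.024      # rate quoted in the comment above
print(f"~${gb_stored * price_per_gb_month:,.2f} per month")   # ~$12.00 per month
```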

  • @samuelbrown797 1 year ago

    Good and clear explanation, thank you. Learned lots. 😎

  • @johnfromireland7551 1 year ago

    Of course! Fabricator makes a LOT of sense!!

  • @jordanfox470 1 year ago

    Still gonna use Dbx. Just gonna store my Unity Catalog data in OneLake. Get all the benefits of Databricks and all the benefits of Direct Lake without all the choppiness of Microsoft data engineering products.

    • @AdvancingAnalytics 1 year ago

      And that's a perfectly valid approach! Hoping we'll see closer integration with Unity & OneLake as both sides mature!

  • @krishnakoirala2088 1 year ago

    Can we apply MERGE INTO between two tables (one in a Lakehouse, the other in a Warehouse)?

    • @AdvancingAnalytics 1 year ago

      Yep. Each of the experiences can read data from the other. So you can read from a Warehouse and merge it into a Lakehouse, as the write is only being performed against one of the two types.
      What you can't do is write into both a Warehouse and a Lakehouse at the same time, using the same workload.
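
A minimal sketch of the pattern described in the reply above - reading a Warehouse table and merging it into a Lakehouse Delta table with Spark SQL. It assumes both items are reachable from the same Fabric Spark session; all names are hypothetical:

```python
# Read side: a Warehouse table (assumed visible to the Spark session, e.g. via a cross-item reference).
src = spark.read.table("MyWarehouse.dbo.customers_staging")
src.createOrReplaceTempView("src")

# Write side: a Delta table in the attached Lakehouse.
spark.sql("""
    MERGE INTO customers AS tgt
    USING src
    ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```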

  • @srbasha74 1 year ago

    In Dbx, I can insert data into tables using both SQL and Spark. Yeah, the SQL is slightly different from T-SQL, but that's not a big deal for data warehousing, is it?

    • @AdvancingAnalytics 1 year ago +1

      Exactly, so if you're happy with Spark SQL and don't need full T-SQL, the Lakehouse side of the fence does what you need

  • @norimashiron3635 7 months ago

    Great video.

  • @chasedoe2594 1 year ago

    Could you tell me about Fabric Synapse Warehouse T-SQL compatibility compared to Dedicated SQL Pool?
    I'm just scared about how much refactoring we need to do for migration.
    And since it is stored as Delta, not database storage anymore, is there a trade-off using Synapse Warehouse? Can we utilize all the Delta metadata from the SQL Warehouse? Or is it just plain basic Delta, like in Stream Analytics?

    • @AdvancingAnalytics 1 year ago

      There's a full migration path to help move your code from Dedicated into Fabric Warehouse; from what I've heard, it's a big ambition to make the migration as smooth as possible.
      And the Delta from the Warehouse is full Delta, with all the transaction log.
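
A small sketch of what "full Delta, with all the transaction log" allows - reading the Warehouse table's Delta files directly with Spark and inspecting its history, assuming the table's OneLake location is readable from your Spark session. The path is a hypothetical placeholder:

```python
warehouse_table_path = "<onelake-path-to-the-warehouse-table>"          # hypothetical placeholder

df = spark.read.format("delta").load(warehouse_table_path)              # read the table as plain Delta
spark.sql(f"DESCRIBE HISTORY delta.`{warehouse_table_path}`").show()    # inspect the transaction log
```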

    • @chasedoe2594 1 year ago +1

      @@AdvancingAnalytics So the sun is setting on dedicated then. This is not unexpected though, given that MS has rarely put out any updates for dedicated SQL Pool lately, but having to migrate out of Azure entirely is quite a hard sell.
      Will there be an equivalent product released in Azure as well? Because the way PBI governance and permission control works is not the best fit for IT infrastructure IMHO.

  • @Splinta100 1 year ago

    good stuff

  •  1 year ago

    Simon, what I still don't understand and don't see answers to in the documentation or even here is probably the most important issue: pricing and performance. I'm wondering about the fact that the entire platform is attached to a Power BI Premium licence (with monthly commitment) or an Azure subscription (paid pay-as-you-go). Within them, we have various SKUs, and each one has specific computing power. So how does it work in the case of the Fabric Synapse Lakehouse and Warehouse? Can we still talk about choosing the size of Spark clusters, paying for the amount of processed data (like in Synapse serverless), or for compute time (like in dedicated pool)? And will the differences lie precisely in the money we have to spend, or the time we have to spend on data processing? They say "We SaaS'ified all the platforms", but again they do not share details on pricing and performance. Do you know anything about it? Maybe I just missed something in the docs? :/
    Thanks.
    And of course thank you for the content. You and John Savill are my single source of Azure & Data truth :)

    • @AdvancingAnalytics 1 year ago +2

      Hola! We'll do a video digging into how capacities are managed, but at the simplest level - you pay for a certain size of SKU, similar to Power BI Premium, P1 now being F64. This gives you a certain number of v-cores you can use, averaged out across the day. For an F64 that's 128 cores (edited: originally said 32 cores, forgot it was double not half!).
      So you might use a mix of spark, SQL and power BI throughout the day. Or you might use a huge amount of spark for an hour or two, then nothing for the rest of the day. The capacity is measured out across 24 hours to determine that you're within your capacity.
      You can put guiderails in to determine how many executors you're using in spark etc, but it all goes against your central capacity allocation.
      You can also separate capacities out - give the business one capacity for their reporting etc, whilst the engineers use a different capacity to produce the datasets.
      There's more to it, with other consumption methods and configuration, but that'll wait for the video!
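
A small illustration of the averaging idea described in the reply above - not official billing maths, just the shape of it, with illustrative numbers:

```python
capacity_vcores = 128                      # F64, per the reply above
daily_budget = capacity_vcores * 24        # 3,072 v-core-hours available across the day

spark_burst = 512 * 2                      # e.g. 512 v-cores of Spark for 2 hours
background = 32 * 22                       # lighter SQL / Power BI use for the remaining 22 hours
used = spark_burst + background            # 1,728 v-core-hours

print(used <= daily_budget)                # True - within capacity despite the burst
```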

    • @AdvancingAnalytics 1 year ago +2

      Quick update - edited the previous reply. I'd put F64 as having 32 cores; it's actually 128, i.e. double the cores, not half! Serves me right for answering questions while on holiday :D

    •  1 year ago

      @@AdvancingAnalytics Thanks, Simon. I really appreciate your (your team's) dedication, professionalism in dealing with details, and straightforwardness in commenting, along with the humor you bring to it :)
      Also, the fact that you go back to correct statements in your comments shows that you care about quality and brand. Hats off to you.
      I will definitely try my best to understand the "power" and division of this "power" in terms of the entire capacity. Benchmarks will be crucial in assessing what we can achieve with each SKU and how long it will take to process this and that. And, of course, how to design mixed workloads to avoid interference, queues, resource QoS battles, and similar issues. There's still a lot ahead of us, and there's still a lot ahead for Microsoft. We'll see how it ends :)

  • @vjraitila 1 year ago

    Did you deliberately avoid mentioning the SQL endpoint in the context of a Data Engineering workload (lakehouse side)?

    • @AdvancingAnalytics 1 year ago +1

      Hah, nope, nothing deliberate, just didn't have time to go into connectivity for each object :)

    • @vjraitila 1 year ago

      @@AdvancingAnalytics Okay :). As you probably already figured out, I just find it to be another point of confusion for people trying to make out a difference between the warehouse and lakehouse workloads.

    • @oscarzapatajr2 1 year ago

      As far as I know, the SQL endpoint in the Lakehouse is read-only, while the one in the Warehouse is read and write.