Advancing Fabric - Lakehouse vs Warehouse
- Published 29 Sep 2024
- Microsoft Fabric has been released, there's a flood of information out in the world for you to start digesting! Fabric has different experiences, different workload types, but the underlying data is held as Delta tables.
With the T-SQL Synapse Warehouse workload, and the Spark-based Data Engineering workload, it's a confusing choice. Does it matter? Are there pros/cons to each side? Why are there two experiences??
In this video Simon walks through the differences and explains where he would use the two workloads!
If you're looking at starting a Microsoft Fabric journey, and need some real Lakehouse experts, talk to Advancing Analytics!
Watching your videos is such a time saver it actually feels like I'm cheating within the industry
Now do you choose Databricks or Microsoft Fabric to build a Lakehouse!?
Fabricators!!!
Count me in as a Fabricators!!
Mount up ....
Did someone try to run the same analysis queries in Dbx and Fabric? Is there any real comparison? I mean... if I am going to choose an F16, is it equivalent to, what, 16 cores and 32 GB RAM, or???
Good take!
Just for reference, I've built dynamic T-SQL based metadata frameworks for longer than I would like to admit, going back to 2011. Just to highlight that data engineering is not uniquely an Apache Spark paradigm.
Yep, absolutely - I've built way too many SQL frameworks myself. But I differentiate between the code generators (my engine builds SQL at runtime/deployment), dynamic SQL (my code cannot be followed without lighting a candle and performing a ritual sacrifice), and actual dynamic, parameterised workflows in a programming language (Scala, Python, C#, whatever). I'm likely using the wrong terms, but I think it's important to differentiate the step change in how we think about our ETL between those approaches.
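To make the "code generator" idea above concrete, here's a minimal sketch of a metadata-driven framework in Python: table metadata drives SQL string generation, rather than the SQL being hand-written per table. All table and column names here are hypothetical, invented for illustration.

```python
# Minimal sketch of a metadata-driven code generator: a list of table
# definitions drives MERGE statement generation at deploy/run time.
# Table and column names are made up for illustration.

TABLES = [
    {"name": "sales", "key": "sale_id", "columns": ["sale_id", "amount", "sold_at"]},
    {"name": "customers", "key": "customer_id", "columns": ["customer_id", "email"]},
]

def generate_merge_sql(meta):
    """Generate an upsert MERGE statement for one table from its metadata."""
    cols = ", ".join(meta["columns"])
    updates = ", ".join(
        f"t.{c} = s.{c}" for c in meta["columns"] if c != meta["key"]
    )
    return (
        f"MERGE INTO {meta['name']} AS t "
        f"USING staging_{meta['name']} AS s ON t.{meta['key']} = s.{meta['key']} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({cols});"
    )

# One generated statement per table in the metadata.
statements = [generate_merge_sql(m) for m in TABLES]
```

Adding a table to the pipeline then means adding one metadata entry, not writing another stored procedure, which is the maintainability win the comments above are describing.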
I have a metadata-driven framework as well. It makes maintaining the pipeline way easier for me.
@@AdvancingAnalytics This "my code cannot be followed without lighting a candle and performing a ritual sacrifice" made me literally roll on the floor laughing. Black magic and all taken into consideration, I'll jump back to my DOS scripting days %1 🤣
Is Fabric Spark cheap enough to offset the loss of performance you'd get from Databricks' optimizations and Photon?
"Agonizing indecision" - that's my life.
Good analysis, but you didn't include Data Factory in the equation... You can use dataflows (just Power Query Online) to do some fairly complex data transformations and automate when those dataflows run to load data into a data warehouse...
For someone with a background in data analysis and Power BI, this is a more natural choice, because we are already used to Power Query.
Great work! But shouldn't the scalability of the Spark engine also be highlighted, and arguably that Polaris is truly serverless? Is there much in the way of pros and cons around this? Cheers
One of the things that excited me in the preview was Dataflow Gen2! Are you going to touch that as well?
This was exactly the question I needed answering after the announcement.
I love the simplicity approach!
Rare to find these days, thank you!
It all comes down to your team's/consumers' skillsets, but if they have all the skillsets, it doesn't matter at all.
Transactional support is probably the one thing you’d gain over databricks.
It’s something I’ve leveraged in snowflake in the past to be able to modify two tables in a transaction and have a rollback if it fails. In reality though it’s not that hard to live without it.
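The multi-table transaction pattern described above, where either both writes land or neither does, can be illustrated with Python's stdlib `sqlite3` (standing in for a warehouse engine like Snowflake or the Fabric Warehouse; the tables and values are invented for the example):

```python
import sqlite3

# Sketch of a multi-statement transaction with rollback on failure,
# using stdlib sqlite3 rather than a real warehouse. Tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE audit_log (account_id INTEGER, delta INTEGER);
    INSERT INTO accounts VALUES (1, 100);
""")

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 1")
        conn.execute("INSERT INTO audit_log VALUES (1, -40)")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# Both statements were rolled back together: the balance is untouched
# and the audit row never landed.
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
audit_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
```

Without engine-level transaction support, that "all or nothing" guarantee has to be rebuilt by hand (e.g. compensating deletes), which is the inconvenience being weighed in the comment above.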
Looking forward to it... I would be more interested to see the cost impact (bang for buck) for organisations:
- Wanting to move from traditional ADF, Synapse, ADLS to using Fabric
- Wanting to make a choice between Fabric and Databricks. As much as they complement each other, they can also be used exclusively.
And does storage taxonomy go out the window, with whoever has access able to write to OneLake in whatever way they feel? (I could be incorrect about this assumption.)
Could you explain how Databricks could be used instead of Fabric? Sounds really interesting, but I don't know that framework...
Thoughts on how SSAS "Direct Lake" will work/impact (or not) when used via Visual Studio tabular models and Power BI? I am assuming SSAS multidimensional is mostly dead at this point, correct?
Oooh, that's a Microsoft question. I can take a guess - given SSAS is largely used to get around model size limits in Power BI, and DirectLake pretty much removes any model size limits... The argument for SSAS doesn't look great. But I honestly don't know, there might be other factors!
I agree - we haven't carried out an SSAS implementation for a couple of years. PPU removed any need for it. Fabric seems to put the nail in the coffin. The only potential consideration is where Microsoft offered x number of Pro licenses as a consequence of another Microsoft implementation (Dynamics).
Hi, are you sure about the "full T-SQL support" in the Warehouse? (Some features, e.g. the NVARCHAR data type, temporary tables etc., are NOT supported.)
Marketing exercises. Muscle up, baby! RIP dedicated sql pool.....😂
I was expecting something new for Dedicated SQL Pool, but didn't expect a change so big that I'd have to move out of Azure.
@@chasedoe2594 To be fair, the Dedicated Pool technology base is from the MPP DATAllegro appliance... going back to the early-to-mid 2000s. Simon could keep me honest on that date range though... with Teradata, I believe, being the originator of the MPP tech.
@@NeumsFor9 I'm fairly sure Tandem NonStop got in there first with MPP. During the 90s it already had disk processes that performed SQL predicate filtering (selection & projection) using a dedicated memory cache, and it had an X & Y communications bus fabric for dual-path reads and writes. That was baked in for fault tolerance, but came to the fore when the SQL engine was deployed into the OS after flat files (Enscribe, the equivalent of VSAM). All of this gave it the unique ability to hot-swap CPUs and add them in without the need to reboot. You could have 16 locally configured paired CPUs, and if I recall it was technically feasible to run that out to many hundreds of distributed nodes.
This is the timeline, not sure where Teradata fitted in, would be good to know.
1970 - E. F. Codd's relational model paper ("A Relational Model of Data for Large Shared Data Banks") published
1970 to 1980 - Ingres research project at University of California, Berkeley. The codebase, via BSD license, spawned Sybase, Microsoft SQL Server & Tandem NonStop SQL
1979 - Oracle was first commercially available product
1981 - IBM SQL/DS, now known as DB2, spawned from System R
1987 - NonStop SQL released by Tandem Computers, ft. parallel processing & fault tolerance
1989 - SQL Server 1.0 released by Microsoft, Sybase & Ashton Tate
1995 - SQL Server 6.0 first version without Sybase involvement
Fabric will solve a lot on the ODS side of the data warehouse, but there are gaps that are not filled, and that are hard to fill: some cleaning and modelling issues, for instance. When you rely only on the current state of the data, a lakehouse may fit, but when versioning comes into play, the data warehouse wins; there are few occasions to use lakehouse reporting. And on the Synapse side there are problems supporting a modern data warehouse, like indexing random MD5 hashes that are the backbone of the hashdiffs used in a Data Vault warehouse. Also, when taking data into SQL Server there are limitations, like the 50-column limit. That is another thing MS is doing differently, and it will be an MS-against-the-rest-of-the-world thing. Fabric actually has worse support for getting data out of MS's own products, like D365, than other cloud-based solutions do. When I want to get D365 data into a warehouse, Fabric may fit once they get rid of the 50-column limitation; then I'd apply dbt-fabric, get a Data Vault model out of dbt, and build Unified Star Schema based marts on top of it. The current Fabric tooling is also not optimized for getting out diffs: like what changed in the data when I made a report for 23Q2 today versus 3 months ago, and why I get different results when the calculations haven't changed. It is a lot more complicated.
I enjoyed using Data Factory in the past; it's great for ingesting or moving data around and for orchestration. I wouldn't use Fabric for transformations at scale though.
I have two tables in a Lakehouse and I want them also in the Data Warehouse. Should I use "Copy Data" in a pipeline to clone them, or is there another way, like creating a "Shortcut" from the Warehouse to the Lakehouse??? Thanks.
Great comparison, thanks. In my org, the Power BI tenant is managed by Europe and they don't know if they're going to be activating Fabric for the UK to test. Will MS make us move to Fabric at some point anyway?
Great and simple explanation! Thanks for sharing this!
Could you cover the following Fabric topics in your next video? 1) Is the storage used by OneLake deployed to the customer's Azure account or Microsoft's own Azure account? 2) Do we have the capability to define a data lifecycle management policy for data on OneLake, like we have with ADLS Gen2? 3) Is the cost of OneLake storage the same as the cost of Blob or ADLS Gen2 storage?
It should be: it is just an abstraction layer on top of ADLS Gen2. The storage is probably also created on the fly as you use data, same as Databricks, but when you do pricing you can estimate the GBs necessary and see. It is very cheap, something like $0.024 for 1 GB.
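Using the per-GB figure from the comment above (roughly $0.024 per GB per month, which should be verified against current Azure/Fabric pricing), a back-of-envelope storage estimate looks like:

```python
# Rough OneLake storage cost estimate, using the ~$0.024/GB/month figure
# quoted in the comment above (check current pricing before relying on it).
PRICE_PER_GB_MONTH = 0.024

def monthly_storage_cost(gb):
    """Estimated monthly storage cost in dollars for `gb` gigabytes."""
    return gb * PRICE_PER_GB_MONTH

cost_1tb = monthly_storage_cost(1024)  # 1 TiB at this rate
```

At that rate even a terabyte of Delta tables costs on the order of tens of dollars a month, which is why storage is rarely the deciding factor between the two workloads.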
Good and clear explanation, thank you. Learned lots. 😎
Of course! Fabricator makes a LOT of sense!!
Still gonna use dbx. Just gonna store my unity catalog in onelake. Get all the benefits of databricks and all the benefits of direct lake without all the choppiness of Microsoft data engineering products.
And that's a perfectly valid approach! Hoping we'll see closer integration with Unity & OneLake as both sides mature!
Can we apply MERGE INTO between two tables (one in a Lakehouse, the other in a Warehouse)?
Yep. Each of the experiences can read data from the other. So you can read from a Warehouse and merge it into a Lakehouse, as only one of the two is the target of the write.
What you can't do is write into both a Warehouse and a Lakehouse at the same time using the same workload.
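As a sketch of that pattern (not verified against a live Fabric workspace, and with all object names invented for illustration): a MERGE run from the Lakehouse's Spark side can use a Warehouse table as its read-only source, while the write lands only in the Lakehouse.

```sql
-- Hedged sketch: executed from the Lakehouse (Spark) side.
-- my_warehouse.dbo.customer_updates is only read;
-- my_lakehouse.dim_customer is the single write target.
MERGE INTO my_lakehouse.dim_customer AS t
USING my_warehouse.dbo.customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```

The key point from the reply above is that the statement has exactly one write target; the cross-workload part is only ever on the read side.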
In Dbx, I can insert data into tables using both SQL and Spark. Yeah, the SQL is slightly different from T-SQL, but that's not a big deal for data warehousing, is it?
Exactly, so if you're happy with Spark SQL and don't need full T-SQL, the Lakehouse side of the fence does what you need
Great video .
Could you tell me Fabric Synapse Warehouse T-SQL compatibility compared to Dedicated SQL Pool?
I'm just scared about how much refactoring we'll need to do for the migration.
And since it is stored in Delta, not database storage anymore, is there a trade-off to using the Synapse Warehouse? Can we utilize all the Delta metadata from the SQL Warehouse? Or is it just plain basic Delta, like in Stream Analytics?
There's a full migration path to help move your code from Dedicated into the Fabric Warehouse; from what I've heard there's a big ambition to make the migration as smooth as possible.
And the Delta from the Warehouse is full Delta, with the complete transaction log.
@@AdvancingAnalytics So the sun is setting on Dedicated then. This is not unexpected though, given that MS has rarely put out any updates for Dedicated SQL Pool lately, but having to migrate out of Azure entirely is quite a hard sell.
Will there be an equivalent product released in Azure as well? Because the way Power BI governance and permission control works is not the best fit for IT infrastructure, IMHO.
good stuff
Simon, what I still don't understand, and don't see answers to in the documentation or even here, is probably the most important issue: pricing and performance. I'm wondering about the fact of attaching the entire platform to the Power BI Premium license (monthly commitment) or an Azure subscription (pay-as-you-go). Within them we have various SKUs, and each one has specific computing power. So how does it work in the case of the Fabric Synapse Lakehouse and Warehouse? Can we still talk about choosing the size of Spark clusters, paying for the amount of processed data (like in Synapse Serverless), or for compute time (like in Dedicated Pool)? And will the differences lie precisely in the money we have to spend, or the time we have to spend on data processing? They say "We SaaS'ified all the platforms", but again they do not share details on pricing and performance. Do you know anything about it? Maybe I just missed something in the docs? :/
Thanks.
And of course thank you for the content. You and John Savill are my single source of Azure & Data truth :)
Hola! We'll do a video digging into how capacities are managed, but at the simplest level: you pay for a certain size of SKU, similar to Power BI Premium, with P1 now being F64. This gives you a certain number of v-cores you can use, averaged out across the day. For an F64 that's 128 cores (edited: originally said 32 cores, forgot it was double not half!)
So you might use a mix of spark, SQL and power BI throughout the day. Or you might use a huge amount of spark for an hour or two, then nothing for the rest of the day. The capacity is measured out across 24 hours to determine that you're within your capacity.
You can put guiderails in to determine how many executors you're using in spark etc, but it all goes against your central capacity allocation.
You can also separate capacities out - give the business one capacity for their reporting etc, whilst the engineers use a different capacity to produce the datasets.
There's more to it, with other consumption methods and configuration, but that'll wait for the video!
Quick update - edited the previous reply. I'd put F64 as having 32 cores; it's actually 128, i.e. double the cores, not half! Serves me right for answering questions while on holiday :D
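The "averaged out across the day" idea from the replies above can be illustrated with some rough arithmetic. This is only a sketch of the smoothing concept, not Microsoft's actual billing formula, and the hourly usage numbers are invented:

```python
# Sketch of capacity smoothing: the replies above describe an F64 as
# 128 v-cores averaged across the day. Real billing is more involved;
# this just illustrates the averaging idea with made-up workloads.

CAPACITY_VCORES = 128  # F64, per the corrected reply above

def average_usage(hourly_vcores):
    """Average v-core usage across a 24-hour window."""
    assert len(hourly_vcores) == 24
    return sum(hourly_vcores) / 24

# Bursty pattern: a big Spark job uses 512 v-cores for 2 hours, then idle.
burst = [512, 512] + [0] * 22
# Steady pattern: 100 v-cores of mixed Spark/SQL/Power BI all day.
steady = [100] * 24

within_burst = average_usage(burst) <= CAPACITY_VCORES
within_steady = average_usage(steady) <= CAPACITY_VCORES
```

Under this averaging view, the two-hour 512-core burst averages out to well under 128 v-cores over the day, which matches the description above of using "a huge amount of Spark for an hour or two, then nothing" while staying within capacity.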
@@AdvancingAnalytics Thanks, Simon. I really appreciate your (your team's) dedication, professionalism in dealing with details, and straightforwardness in commenting, along with the humor you bring to it :)
Also, the fact that you go back to correct statements in your comments shows that you care about quality and brand. Hats off to you.
I will definitely try my best to understand the "power" and division of this "power" in terms of the entire capacity. Benchmarks will be crucial in assessing what we can achieve with each SKU and how long it will take to process this and that. And, of course, how to design mixed workloads to avoid interference, queues, resource QoS battles, and similar issues. There's still a lot ahead of us, and there's still a lot ahead for Microsoft. We'll see how it ends :)
Did you deliberately avoid mentioning the SQL endpoint in the context of a Data Engineering workload (lakehouse side)?
Hah, nope, nothing deliberate, just didn't have time to go into connectivity for each object :)
@@AdvancingAnalytics Okay :). As you probably already figured out, I just find it to be another point of confusion for people trying to make out a difference between the warehouse and lakehouse workloads.
As far as I know the SQL endpoint in the Lakehouse is read-only, while the one in the Warehouse is read and write.