Amazing way of explaining with granular details. Thanks
Great presentation.
For me, working in data modeling practice, my strawman view of Data Mesh is Mr. Kimball's galaxy schema design (i.e., a cluster of star schemas).
You nailed it about the hurdles: the technologies have been there for a long time, but the discipline required to build it, or even start it, is astronomical. Often it dies from departmental politics and budgeting. Even with the rise in power of the CTO, it is still not enough to get all domains (departments) to comply.
Fun thing about buzzwords: when hybrid cloud started selling, I jokingly coined "Data Vaporization" and "Data Condensation" for interop tasks.
Superb representation as always James, I would echo with 1 comment in your blog - "The more I read about Data Mesh, the more concerns I have"
To me, Data Mesh is purely good data governance over existing centralized architectures. I also see your Type 1 architecture as viable for most clients: a data lakehouse with domains owning their own containers. Let them also build Power BI reports and ML models over their containers. This way we fix clear responsibility for data and tackle performance issues, and we don't have to deal with disparate technologies for each domain. That would drive people nuts in the long run.
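To make the "domains owning their own containers" idea concrete, here is a minimal sketch in plain Python. The domain names, storage account, and path convention are all hypothetical, not from the talk; the point is only that every data product resolves to a container owned by exactly one domain.

```python
# Hypothetical sketch: one lakehouse, one storage container per domain.
# Domain names, storage account, and path layout are made up for illustration.

DOMAIN_CONTAINERS = {
    "sales":     "abfss://sales@lakehouse.dfs.core.windows.net/",
    "marketing": "abfss://marketing@lakehouse.dfs.core.windows.net/",
    "finance":   "abfss://finance@lakehouse.dfs.core.windows.net/",
}

def container_for(domain: str) -> str:
    """Resolve a domain to the container it owns; unknown domains fail fast."""
    try:
        return DOMAIN_CONTAINERS[domain]
    except KeyError:
        raise ValueError(f"No container registered for domain '{domain}'")

def product_path(domain: str, product: str) -> str:
    """Every data product lives under its owning domain's container."""
    return container_for(domain) + f"products/{product}"

print(product_path("sales", "orders_gold"))
```

A domain's Power BI reports and ML models would then read and write only under its own `product_path`, which is what makes the responsibility boundary enforceable rather than just a convention on a wiki page.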
I agree with you. The good thing about Data Mesh is the decentralized governance and the shift of Data Product management to each specialized domain (with, of course, the possibility of cross-domain data engineering functions). On the other hand, I'm not convinced about decentralized data storage. A lakehouse with shared centralized functionalities (data catalog, CI/CD, computing power, storage, etc.) is, I think, more efficient. Of course, in very large companies we can envisage several lakehouses, but that's not the majority of companies.
Hi James, thanks for this video; it aligns well with my thoughts. FYI, I have had (or still have) all three Mesh Types in different production environments. Mesh Type 1's uniformity and centralization are good for starting fast from scratch as a Data Mesh core. Mesh Type 2's distribution, with total uniformity of domains and storage, is too optimistic for real life and has limited application. Mesh Type 3 is too agile and diverse, so it may introduce a high total cost of ownership if selected as the core of a Data Mesh. However, if there is a hybrid cloud, a third-party SaaS API, a custom HPC cluster, a "good enough" legacy system, or security/privacy constraints, IMHO it is reasonable to use Mesh Type 1 as the core and combine it with Mesh Type 3 plugins, where each Mesh Type 3 plugin mimics the Mesh Type 1 approach for its input and output data products. I prefer Mesh Type 1 + 3 because it offers centralized control, cost efficiency, and fast development: the Mesh Type 1 core is the default, with the flexibility to extend it with diverse Mesh Type 3 domain plugins, giving us agile capability.
Hi Denis...thanks for the feedback! I'll definitely expand on the 3 mesh types in the next edition of my book 🙂
Thank you sir, it's much clearer now.
Crystal clear explanation.
Congratulations, this video was very well explained, thanks for preparing this!
It would be great to have a follow up that includes Databricks SQL Warehouse that addresses shortcomings mentioned about Data Lakehouse.
a well-structured and informative presentation. thanks
Great presentation! Thank you man
Kudos to #palantir #foundry for having solved all the problems with the lakehouse that were highlighted here. They even have column-level and row-level security, and schema and referential integrity checks enforced in the "check" phase of their job, before a Spark job fires up.
Also, about power users not liking Spark SQL vs. T-SQL: Palantir actually built a drag-and-drop layer for business users that writes Spark directly underneath.
They get a Power BI-like experience on top of S3 bucket files, with Spark underneath!
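The "check phase before the Spark job fires up" idea can be sketched in plain Python (Foundry's actual implementation is proprietary; the schema, rows, and function names below are illustrative only): validate the schema and referential integrity of the input, and only launch the expensive job if both gates pass.

```python
# Illustrative pre-flight gate: validate inputs before launching a Spark job.
# Schema, column names, and sample rows are made up; rows are plain dicts.

EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def check_schema(rows, schema=EXPECTED_SCHEMA):
    """Fail fast if any row is missing a column or has the wrong type."""
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if col not in row:
                raise ValueError(f"row {i}: missing column '{col}'")
            if not isinstance(row[col], typ):
                raise TypeError(f"row {i}: '{col}' should be {typ.__name__}")

def check_referential_integrity(fact_rows, dim_keys, fk="customer_id"):
    """Every foreign key in the fact rows must exist in the dimension."""
    orphans = {r[fk] for r in fact_rows} - set(dim_keys)
    if orphans:
        raise ValueError(f"orphan {fk} values: {sorted(orphans)}")

orders = [{"order_id": 1, "customer_id": 10, "amount": 9.99}]
check_schema(orders)
check_referential_integrity(orders, dim_keys=[10, 11])
print("checks passed; safe to launch the Spark job")
```

The design point is that a bad input fails in seconds, in the cheap check phase, instead of minutes into a cluster job.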
Great video! Can you please post the links in your Data Mesh section (46:37) or the presentation as a whole? Thanks.
Glad you like it! I posted the deck link in the description.
@31:00 - The four principles aren't "negotiable." If you're not doing all four, you're NOT doing Data Mesh. You're doing something of your own creation. Zhamak has been very purposeful about each of these concepts, and their interlocking natures. It's based on domain driven design in application architecture. You don't get to "pick and choose" and call it Data Mesh. She came up with the overarching idea, she gets to decide what's in and what's out.
With regard to your beef about "technical scaling," my 30+ years in data in Fortune 100 and 500 companies would suggest you're kidding yourself if you think we've "solved technical scaling" with previous approaches. When you're centralized, and you go ask for a $50M investment in infrastructure (or if you're in the cloud, a 50% increase in monthly billing) to help scale some centralized warehouse or lake, it typically doesn't get approved. Period. Full stop. So, then you're trying to hack stuff to stay performant, which ultimately collapses in on itself. You can argue that that's "just a process problem," but that's sort of irrelevant, because it's *viewed* as a technical failure from the people who write the checks, and if we're customer oriented, that's whose opinion matters.
I think you're overselling the notion of business-driven data mesh transformation as well. Nowhere does Zhamak say that's a requirement (it's not one of the four principles). Although it does involve org change, there's nothing to say that IT isn't part of the picture. What I see happening is a mirroring of ownership being defined in the business (which is really the fulfillment of what we've been saying about data governance for years) with IT professionals who support that domain WITH the business owners (basically the same thing we've seen them do with applications since the dawn of time, but instead of "screens" being the focal point, "data domains" are the focal point).
With regard to your aggregated domain notion, it starts at the root question: should there be, or should there NOT be an OWNER for that domain (someone who gets to speak definitively about what that domain offers to the outside world?) If the answer is no, then aggregating it is *at best* a tactical choice for some performance concern, and at worst, *something you should not do*. DDD is about OWNERSHIP and getting straight on who gets to decide what about data. It's a HUGE obstacle in nearly every company, because it's very fuzzy and unclear in most (or it's haplessly tied to application screen ownership). For decades we've lamented that data is more important than applications, and then when someone comes along and says "I agree, let's architect that way," we see countless old-school data people peeing all over the solution.... kind of mind boggling.
Based on this point of view, I just want to add one opinion:
Business domains are still at the application level, not the enterprise level. Several domains may use the services provided by one business object (or business component); in other words, business domain and business object (component) are two different dimensions. Data cohesion is the key consideration in determining how to form a business object. No matter which data architecture stage a company is in (as opposed to enterprise data architecture), it still needs enterprise-level data modeling.
Hi JR, Thanks for the comments! I'll respond to each paragraph:
#1: If all four principles are needed to call a solution a data mesh, then I don't think anyone has built a "data mesh" yet. There is no technology yet for principle #3. And then you have the problem of how much you have to comply with each principle. If I only require each domain to follow a standard regulation, can I say I'm following principle #4? I think we need a "minimum viable product" definition of a data mesh that we all agree on. If a solution follows three of the principles, what should we call it? One of the problems with the data mesh hype is that everyone is trying to call their solution/product a data mesh, when sometimes it does not adhere to any of the principles.
#2: Current approaches handle scaling, but I would not say we have "solved" it, as there is a small percentage of use cases that struggle with current technology. But I have seen a ton of centralized solutions that handle petabytes of data very well. I'm not following your example of a technical failure if you are denied funds to expand a centralized solution. Building a data mesh costs much more than any centralized solution, and you could just as well be denied money to expand the domains for a data mesh. Any data architecture would be a "failure" if it were denied money to expand and performance suffered.
#3: I have always believed IT would be involved in a data mesh, just less so. You certainly need IT for principles #3 and #4. There will be a ton of org changes in a data mesh, which is the biggest problem with trying to implement a data mesh, as most companies are not ready for such a massive change. Most people don't like change.
#4: I don't see how an aggregate domain could be built and maintained without an owner. Someone has to define how the data will be pulled into the domain, how it will be joined and aggregated, then build it and maintain it. That to me is the "owner". The problem I see with an aggregate domain is you are creating more ETL and centralizing some of the data, which goes against what the data mesh is trying to prevent.
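To make the aggregate-domain point concrete, here is a minimal sketch (the domain products, columns, and figures are hypothetical): someone has to decide how two domain products are joined and rolled up, and that extra ETL is exactly the centralization being debated above.

```python
# Hypothetical aggregate domain: joins the (made-up) "sales" and "customers"
# domain products and rolls revenue up by segment. Someone had to decide the
# join key and the aggregation -- that someone is, in effect, the owner.

sales = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": 2, "amount": 75.0},
]
customers = [
    {"customer_id": 1, "segment": "enterprise"},
    {"customer_id": 2, "segment": "smb"},
]

def build_aggregate(sales_rows, customer_rows):
    """Join sales to customers on customer_id, then total revenue per segment."""
    segment_of = {c["customer_id"]: c["segment"] for c in customer_rows}
    totals = {}
    for s in sales_rows:
        seg = segment_of[s["customer_id"]]
        totals[seg] = totals.get(seg, 0.0) + s["amount"]
    return totals

print(build_aggregate(sales, customers))  # {'enterprise': 150.0, 'smb': 75.0}
```

Every line of this pipeline is new ETL living outside the two source domains, which is why the thread keeps circling back to who owns and maintains it.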
>There is no technology yet for principle #3.
Disagree. Snowflake, various data virtualization tools, and even custom engineering can be combined/engineered to produce something that satisfies principle 3. Platforms don't have to be provided by a third party.
>Building a data mesh costs much more than any centralized solution
But since the request never comes as a single ask, and is always justified by something needed *close* to the demand area, it's more likely to be approved, which is the point. There's a political / org dynamics aspect that's pretty fundamental underlying data mesh (Zhamak is synthesizing two ideas borrowed from application engineering -- domain driven design as originated by Eric Evans, and an Inverse Conway Maneuver as originated within ThoughtWorks.)
>I have always believed IT would be involved in a data mesh, just less so.
Okay, but the video implies that it's a requirement to get IT out of the picture. Not required or expected.
>That to me is the "owner".
That's not an owner as Zhamak defines it. Aggregation doesn't automatically produce a domain owner. It would be nice to be at that level of organizational data maturity, but that's *precisely* where we're at with data governance -- it's NOT mature.
@@Calphool222 If you're having to combine many technologies to satisfy principle #3, that means a ton of work for each company. So, sure, it does not have to be provided by a third party, but building a data mesh will take a lot more time and money. My point is that a data mesh takes a lot more time and cost to build compared to other architectures, so you have to be very aware of that if you go down that road.
The companies I have seen that are building a data mesh ask for a lot of money to get started and are aware that there will be a lot of requests for more money down the road. I have seen the same thing when building a solution using other data architectures. If there is some better political approach that makes a data mesh get more approval for money compared to other data architectures, that is great but I have not seen that in my experience. But I'm just one person :-)
Sorry if I caused confusion; I'll clarify in future presentations that IT will still be involved, and have done so recently when talking about how certain things are still centralized in a data mesh (principles #3 and #4, and in many cases storage). My point was that data mesh is trying to reduce IT's involvement.
As far as aggregation domains, I'm confused about the difference between a domain owner and "the set of people who build and maintain a domain". This is the big problem with a data mesh - it is very confusing!
Great conversation!
@@jamserra
>This is the big problem with a data mesh - it is very confusing!
It's not that confusing if you're familiar with the underlying ideas Zhamak is building from. They're well established application engineering techniques. She's taking those as a base, and altering them as necessary to apply to the data domain.
Lovely video!
Great video. Very informative.
Thanks for this! For someone who is not a data architect or data scientist: what, then, is the difference between a data lakehouse and a data platform/pipeline?
Thanks for the excellent explanation. Can you please elaborate on Data Access vs API on slide at 16:10 (data fabric).
Hi Shashank, this just means you can create APIs to access the data instead of using other methods, such as connection strings.
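A minimal sketch of the difference (endpoint shape, dataset, and handler names are all hypothetical; a real version would sit behind a web framework such as FastAPI or Flask, plus authentication): the consumer calls an endpoint and gets JSON back, instead of holding a database connection string.

```python
# Concept sketch: data access via an API handler instead of a connection
# string. The dataset and route shape are made up for illustration.
import json

_PRODUCTS = {  # stands in for governed storage behind the API
    ("sales", "orders"): [{"order_id": 1, "amount": 9.99}],
}

def handle_get(domain: str, product: str) -> str:
    """The logic behind a hypothetical GET /domains/{domain}/products/{product}."""
    rows = _PRODUCTS.get((domain, product))
    if rows is None:
        return json.dumps({"error": "not found"})
    return json.dumps({"domain": domain, "product": product, "rows": rows})

print(handle_get("sales", "orders"))
```

The design benefit over a connection string is that the API owner can version, throttle, log, and secure access in one place, without ever handing out direct database credentials.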
Great! Thank u.
1. Business domains do not produce data; applications do. Why this high focus on the domains?
2. Data Mesh is a model. Is there a single purpose/context/problem? I can imagine that for corporate reporting we need one architecture, for archiving another, and for exchange yet another. I do not believe in a general Swiss Army knife approach here.
I have seen Type 3. I think it's done in large tech companies.
ua-cam.com/video/VYmjJe2gR1A/v-deo.html Unity Catalog in Databricks has solved RLS and column masking. A very impressive and interesting video; for me it served as a good recap :) Cheers, thank you
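For readers new to the terms, here is what row-level security (RLS) and column masking mean, sketched in plain Python. Unity Catalog applies the equivalent declaratively at the table level; the users, rows, and masking rule below are made up purely to illustrate the concept, not the Databricks API.

```python
# Concept sketch of row-level security and column masking in plain Python.
# Sample rows, regions, and the masking rule are made up for illustration.

ROWS = [
    {"region": "EU", "customer": "Acme", "ssn": "123-45-6789"},
    {"region": "US", "customer": "Beta", "ssn": "987-65-4321"},
]

def visible_rows(rows, user_region):
    """Row-level security: users only see rows for their own region."""
    return [r for r in rows if r["region"] == user_region]

def mask_columns(rows, user_is_admin):
    """Column masking: non-admins get a redacted ssn column."""
    if user_is_admin:
        return rows
    return [{**r, "ssn": "***-**-" + r["ssn"][-4:]} for r in rows]

result = mask_columns(visible_rows(ROWS, "EU"), user_is_admin=False)
print(result)  # [{'region': 'EU', 'customer': 'Acme', 'ssn': '***-**-6789'}]
```

The advantage of having the platform enforce this (as Unity Catalog does) rather than every report author re-implementing it is that the policy travels with the table, no matter which tool queries it.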
BY FAR THE SINGLE MOST EFFECTIVE EXPLANATION EVER ❤