Thank you!!! Finally, someone talks about the Medallion Architecture as a framework and not the rule of you need a bronze, silver, and gold layer to your data lakehouse. For years I always received that basic explanation as the rule and "Best Practice" even from all of the so called experts. After learning from experience with implementations it's nice to hear affirmation - it's what you decide is appropriate to the context of your team and data product to determine the various landing zones of our data. You provide a good way to communicate that without just saying "It depends" which it does and nobody likes that answer 😂
Finally somebody talking common sense. Its just the layer structure that always have exists. Whether you call it bronze silver gold, or stage help dim fact etc., its a potato, potato discussion. In other words semantics.
@@Milhouse77BS Why? ETL (and ELT) are processes. ETL can take raw data from others systems and upload it into a Bronze layer. ETL can take raw Bronze data, make cleaned Silver data out of it, then upload it to your Silver layer. ETL can take cleaned Silver data, join it with other Silver datasets, to make a Gold dataset, then upload it to your Gold layer. ETL can exist entirely outside of medallion architecture and take whatever data you want from a bunch of source systems and produce custom reports for a business unit, director, c-suite, etc. Think of it like these layers are the nouns, and ETL (or ELT) is a verb that can turn parts of one layer into another. Or it can be used entirely outside of a medallion architecture.
A key point omitted here is that the medallion architecture is also referred to as a 'multi-hop' architecture. I always talk about a multi-hop design approach with medallion as a starting point. In data warehousing we used to talk about 'staging'. This was never a single hop transformation to the dimensional model, but many smaller hops - in a similar way to medallion zones.
Some 12 years ago, at the far forgotten times of Powercenter and Teradata, we had a Data Warehouse divided into three parts: Staging area, Work area and DWH. One for raw, flat files, other for curated data and the last one for aggregated data. I have realized that, in this market, we recycle old concepts with new names, so we keep the environment fresh.
Simon! I am so glad you made this video. It’s been a constant challenge at different clients explaining that medallion isn’t strictly 3 layers. The databricks blogs and documentation gets vague and prescriptive at the same time.
Thanks for this. Having worked on DW ETLs for decades, I wondered what was new about 'Medallion architecture'... answer: Not a lot. Same series of steps, with a different name. The new bit - albeit not that new these days - is the reporting model sitting downstream of the DW / Analytical Model.
such a nice way of explaining this. always wondered why everyone nowadays is pushing for this "architecture" which can make sense in a lot of scenarios but I think not as many as people tend to think
100% Agreed - I think the medallion "architecture" is really just an extremely poor analogy because it has caused confusion rather than simplifying a concept. Bronze, Silver, Gold makes non-technical users think of Good, Great, Best - but data is not an Olympic event. Bronze, Silver, Gold is telling me what exactly? Is the Bronze data 3rd best? Is the Silver data only slightly less good than Gold? Raw, Prep, Land, Stage, Star, Semantic, BI - words like this have semantic meanings that relate to the intended purpose and integrity of those layers. Thanks for illuminating the source of this abomination. 😁
Hi Simon. I found your video very insightful. I have a question, though. How do you model the Silver layer? Would you use a set of flat tables? If not Data Vault, what would you suggest for the conformed data?
Thank you! In which stage do we transform data into 1) Parquet Files (Schema Val after transforming from Raw (Json to Parq) and Then 2) to Delta tables in your schema? (In Silver Layer on Cleansed and Validated Stage when Data transformed to Delta Tables).
Great overview. For something like Power BI which would like a star schema, would there be some overlap between the curated and semantic layer?? Or for some items in the curated, it would be a duplication of the object into the semantic layer? with other objects extended by the metrics added where needed??
Yes, for example your Date dimension in curated layer may be copied as is into the semantic layer. In Microsoft Fabric it would be a link or shortcut, so data not actually copied.
I definitely have build "Data solutions" with this sort of thinking. The one other one I'd include here that wasn't included was around if you have heavy PII and especially PHI/HIPAA data as often you have to have a governance structure for limiting access to certain users and only certain environments/tools. This is where you may have STG/DEV/TEST/PROD environments. It gets complex, the key is making sure it's clear however and wherever the data is at and can be found by users.
Thanks for the video. I have pretty much same vision (and apply it with other clients), and this is very well-structured. Yes, the oversimplified concept of this "medallion" architecture does indeed greatly confuse clients for the sake of catchy marketing. I have two remarks what I see slightly differently: 1) The "base tables" have schema enforced, changes applied, it is a 1:1 replica of the source system's table, essentially the lowest "queryable" dataset for any consumption but before any logical conformations and transformations; In my view, this should be part of the Bronze "grade" (let's not call it a layer) while you are depicting it as a Silver. With any kind of incremental data extraction pattern, the first two layers ("landing" and "raw" in your slides) are append-only and contain change-sets. Which makes them practically unqueryable. We use last data portion as part of an ETL job to properly apply changes towards the "base" table - but any other user or downstream job can't consume data from these layers, as would require enormous complexity and compute to replay the changeset from the beginning and reconstruct the current state of the data. In many cases it's not even possible. If we saying that the "base" table (the first actually queryable) is "Silver" then our "Bronze" is nothing more than an unqueryable data dumpster. This is very misleading to the client as while they are stuck with the Medallion misnomer it leaves them with only two "broadly recognized" metals for their data layers which is not enough. My take is that "base" == Bronze, what's below it (any kind of stacked append-only changeset datasets, whether unstructured or structured), if retained at all, is pre-bronze. Don't you see it this way? 2) The layer naming you suggested is consistent and makes sense, but I see a lot of confusion surrounding the term "raw". The data engineering world understands it closer to the landing (raw landed data) like you do. But the business data consumers use and understand "bronze" / "base" and "raw" terms refer interchangeably, to them this is all "raw" (as untransformed) source system data, same as in the source system itself. At least I see this perception A LOT in the field. If I was starting a new data platform from scratch today, I would reserve the word "raw" to refer to the same as "base" tables (and also map it a Bronze to map in the medallion paradigm), while calling both of the leftmost layers (landed unstructured files and schema validated) as landing & staging accordingly; Of which depending on the ETL pattern only one or both (or neither) may exist.
Awesome breakdown! Im glad I understood the concept very early on and not end up stuck with that misconception. The way the medallion architecture has been presented is oversimplified. It's okay to present it as is as a start. BUT, it being more of a logical layer should be emphasized. That and business logic and needs is also a major consideration.
Care to outline what the major differences between the two gold layers in this examples are? Are you building all the different dimensional models your business needs in curated, using spark and lake, and then another layer with basically the same model but with the addition of a semantic model on top with measures and such?
Yep, exactly. Just wanted to capture that there is the logic stored in the lake via materialised delta tables, then additional context-aware measures in a BI tool such as Power BI semantic models!
While I agree in principle I think the value from using Bronze, Silver and Gold is that users understand generally what to expect of the data in each layer. I talk to users about Bronze data so often as a kind of warning - it's a mirror of source so don't expect it to be clean or conformed. So often we forget to define what is meant by various terms and a new person joining an organization from outside where they may have worked with these terms can immediately understand if we propagate their usage. So yeah, it's nothing new, but using the terms consistently has value in itself. Thanks for the video, as usual, your channel always has quality content - it's the Gold layer of analytics videos :-)
In many cases, a file comes with perfect data. Would you still move it and store it in all layers, or skip couple of those and have that file in for example Landing, Raw and Semantic. If you have terabytes of data it would not be efficient to move and store data through all layers. And if you decide not to move through all layers and skip some for some layers, then you end up in situation where you need to "join" two tables from different layers, because one of the table does not exist in one of the layers. How do you deal then with such anomality?
Great video, thanks. I had a question about the base layer, please. I was wondering what use cases you typically saw for storing the base layer? I’ve typically seen the base layer in the form of views that allow the flow from raw into enriched. I assume you store the base layer as you see use cases around downstream users/consumers wanting a place to source clean untransformed data? I was just thinking from a wider enterprise perspective would it not be better for downstream consumers in this scenario to take data from enriched which is often closer to the companies logical data model? This promotes a more uniformed understanding enterprise wide and can make integration easier? Thanks.
Regarding the names of the layers, I'm still on the fence between airlock code of the planet Druidia, or the album titles of Led Zeppelin Joking aside, love this video, I've recommended it a bunch already!
The misconception here is too many peoples think of Bronze, Silver, Gold as physical layer, and when it should be seems as logical layers! And too many think about it a bit too literally. I’m not sure even Databricks ever mentioned you’ll have only one step of refinement in your Silver layer or Gold layer. The number of hops should be tight to a business cases and not defined and predefined in advance. All that can now be managed with tags on your tables, or databases in most of all modern data warehouses.
So, when using UC in Databricks, do you create a catalog per layer or do you divide it using schemas? Always in doubt on what the best approach is, especially with DLT, where only one destination catalog is possible.
I generally see catalogs named dev, stg, prod with similar schemas and cloud storage folder structures in each. This makes it easier to embrace CICD processes since the catalog name can be set as a default in the workspace admin settings and the code can run unchanged as it moves across the environments. Sometimes people want to organize bronze tables in a special schema (iot_bronze) and the silver and gold objects in another (iot). This makes it easier to hide the bronze tables from analysts and dsml engineers if desired. There are also line of business strategies for catalogs. I think more can be written about the patterns and pros/cons to help architects setting these things up.
This usually comes down to a size thing - if you get to a certain size of business, having a single "prod" catalog can be too limiting trying to do all security at the schema level. We've seen dev/test/prod or dev_raw/test_raw/prod_raw where you then want to break raw down into further data product oriented schemas, rather than have a single catalog that encapsulates all layers for all data products. You could well extend that to the product level if you had large, org-wide security that needed to be secured at the workspace level using catalog-workspace bindings. Essentially - design it to the granularity of security you require, balanced with usability for your data consumers
This aligns pretty well with how I have been designing lakes. I have had internal debates about whether structured coming into the lake should be "unstrcutured" but I see your logic. My question relates to when importing large quantitties of data from external OLTP source. Do you ETL only daily changes or is Raw a full copy? (which doesn't work when receiving few million records a day) Main question: where do you first filter to stop inserting a few terrabyte each day?
We try and ensure all loads from source systems are incremental, but this relies on us having the option - not all sources have change data capture / timestamps / change feeds etc. So we parameterise & config up the data extracts, so we can do incremental where possible, then date partition/full snapshot where we have no other choice!
That's from the olden days when Spark wasn't great at joins & star-schema queries. These days I don't see a driver for adding another tech into the architecture so we keep everything lakehouse-based!
@@AdvancingAnalytics Saw a different video about this subject before seeing yours, good video by the way, you confirmed some of my thoughts about medallion. What he said is that if you need granular security on your tables and columns (RLS supported in both lake and DW) then DW is where you need to put your data facing the consumer. I didn't catch any other advantages other than perhaps personas working with dimensional modelling being more accustomed to T-SQL, which may or may not be a real case.
Simon is there a set of file formats that you conform to in the raw layer? For example if the data source is CSV or Excel do you transform that into delta in the pipeline from data source to the raw layer? Also you store rest related data as json in raw, do you transform it to delta at a later stage?
Once you've picked it up from Landing, it's in a spark dataframe. When it's in a spark dataframe, land it as Delta, always (unless exporting data for a very specific integration)
@@AdvancingAnalytics so if I understand correctly you land the data in the data lake in its original file format and thereafter all data will be conformed to delta in the next layer?
@@patronspatron7681Land the data in the landing zone as what ever format json, parquet, csv. When writing hour notebook create a spark df on the landing data and then save it as delta in the raw or bronze zone.
A very interesting perspective on consolidated analytics and ETL, although I may not completely agree with it. I must emphasize that the medallion architecture's contribution to data processing is invaluable, offering numerous benefits across the enterprise. It is simple, comprehensible, and, like any best practice, should be encouraged to be followed.
Compression in a lake can force reading to be single-threaded, so we tend to avoid traditional compression (ie: .zip) as it'll end up going slower! However, other types of compression (such as parquet with snappy!) are what's known as splittable compression, which CAN be read in parallel, but also has the benefits of smaller reads
Simon, when you say that your raw layer (schema validated) is APPEND ONLY, do you mean that you're not checking for garbage and duplicate records? I'm assuming that you're still delta merging to avoid unnecessary duplication from inserting the same record over and over again, correct? Thanks for the great content!
Hey - nope, with RAW we literally do a straight append, so raw will absolutely have duplicates if we are sent the same records repeatedly (ideally we avoid this during source data extraction). We merge into the BASE layer and deal with deduplication etc there
Initially this sounded lot like early days Data Warehouse with landing/staging table and main tables. With Lake house We can create multiple tables/views from same sources without much risk need not to be bronze, silver, gold.
Anyone doing data mgmt for a while knows there are stages to any data management solution. It's a marketing tool. The key is defining all the sublayers and data management tasks that have to happen such as reference data management, ingestion, intermediate reshaping, etc. The biggest issue I have with the medallion architecture? Most people that know metallurgy would more than slightly shudder if they were ever explained data engineering 😂😂 At the same time, one should not just dismiss someone else suggesting the architecture. It may just be the way they learned. My suggestion is to inquire about where folks think all the subtasks should be.
To me even the gold layer could be the Kimball base star schema for ad hoc reporting. The Platinum layer can be fed by a semantic model. Then, the advanced Kimball concepts such as second level consolidated fact tables or other materialization of drill across result can then feed dashboards......if indeed that is the part of requirements. The reality is speed to market and skill sets of a team will short circuit this approach. That's when you have big trouble and an evolving mess ......
Nope - these are all of the hops a customer might go to, they are absolutely not ALL necessary to ALL organisations. Design the layers to your required processes. PII - depends on your use case. If you don't need it in your lake, strip it out when first performing structural validation. If you need it but need to obfuscate it, then do that during data cleansing and lock down your RAW layer accordingly
@@AdvancingAnalytics Somebody already asked when you typically transition to Delta. However, there is also a third angle regarding PII specifically. One might need - and have a justification - for using the actual data, unobfuscated. However, there should still be a mechanism for "surgically" removing it when a legal basis for processing no longer exists (e.g. a person is exercising a right to erasure / right to be forgotten). This becomes potentially problematic when persisting data across ALL the stages. How do you typically handle this? Transient landing, but persistent RAW - the latter in Delta to enable more targeted deletions / anonymization?
If you use Databricks, Unity Catalog offers capabilities to help manage PII data when you need to retain it for authorized people/purposes: Volumes let's you control who can see raw data files containing PII before its ingested into Delta, you can also create reusable functions in Unity Catalog to obfuscate and filter PII data from unauthorized people, and Delta and Deletion Vectors help you implement the GDPR "right to be forgotten" without introducing performance issues in your tables.
@@jimthorstad6226 Those are all good points and advertisement of features within Databricks, but Simon used the age-old argument of "retaining all data (in all stages), because storage is cheap in the cloud". However, cost is not the only factor when deciding what data to retain - particularly in its raw form. Data engineers are not some special people who are privileged to handle PII, when the company as a whole does not have the right to process it (anymore).
Rewatching this Simon missed a pun with the "layers of the lake." He could have said different depths or depths of the data to lean into the lake analogies. :)
Criticizing the medallion architecture and present a more complex architecture 😅 For us we only have the schema validated and then analytical layer which is star schema You conviced me about the raw files as we do a lot of back filing and this should solve the issue Overall great video i l love this type of content
YES! It's not an architecture! I see so many old ideas rehashed with new names for marketing purposes. Layer approach, columnstores. These things have been around for decades
I was asked this medallion architecture in job interview and i didnt know except ive heard the name. The interviewer explained it enthusiastically and i just commented that its the same as any other process, he didnt seem to like my response. I didnt get picked for the role and also didnt want to as the interviewer seemed arrogant. Its not rocket science.
There is so much confusion happening with the definition of what goes in a medallion layers, least of all because people don't necessarily the different type of modelling you do whether its 3rd normal form or dimensional models, the aggregation aspect is confusing to folks, these days many data engineers don't really understand what a dimensional model is, in the old days a bi person was end to end but now the raw to clean conformed is done by data engineers, that's my recent experience, and think you need to aggregate your facts in gold, when actually that's not true because you would mostly always have detailed facts to the lowest grain possible and that should sit in gold, or is this just splitting hairs, at the end of the day like you say as long as we all on the same page it doesn't matter, but as the teams and companies get bigger, i can see a lot of mismatches happening, just like ive seen so many data warehouses built incorrectly the definition of what you doing needs to be clear understood and communicated. Your layers make more sense to me as a long time end to end BI person because it's quite clear what it means.
Agree. All this mention of "aggregates" to me is a waste of breath. Capture facts at lowest grain and optionally add aggregates if you want. But you shouldn't add aggregates for performance. Performance should be solved in the platform. IMHO the best case for aggregates is combining two facts of different grains to make analysis simpler.
Hah, not at all - advocating for "design the layers as appropriate for your company, not accept the 3-layer medallion default". Sure, I've built many 6-layer, rigorous architectures for global FinServ, but if you don't need that, don't build that? Wanna go for 2 layers with a virtual logic layer? Go nuts, as long as you're making that choice that's right for the org
Thank you!!! Finally, someone talks about the Medallion Architecture as a framework and not the rule of you need a bronze, silver, and gold layer to your data lakehouse. For years I always received that basic explanation as the rule and "Best Practice" even from all of the so called experts. After learning from experience with implementations it's nice to hear affirmation - it's what you decide is appropriate to the context of your team and data product to determine the various landing zones of our data. You provide a good way to communicate that without just saying "It depends" which it does and nobody likes that answer 😂
Finally somebody talking common sense. Its just the layer structure that always have exists. Whether you call it bronze silver gold, or stage help dim fact etc., its a potato, potato discussion. In other words semantics.
I call it ETL: Extract (Bronze), Transform (Silver), Load (Gold)
@@Milhouse77BS what if you ELT? the point here is that BSG covers several transformation types and in some cases several copies of data.
BSG I think of "Battle Star Galactica" first :)
@@FR-099
@@Milhouse77BS Why? ETL (and ELT) are processes.
ETL can take raw data from others systems and upload it into a Bronze layer.
ETL can take raw Bronze data, make cleaned Silver data out of it, then upload it to your Silver layer.
ETL can take cleaned Silver data, join it with other Silver datasets, to make a Gold dataset, then upload it to your Gold layer.
ETL can exist entirely outside of medallion architecture and take whatever data you want from a bunch of source systems and produce custom reports for a business unit, director, c-suite, etc.
Think of it like these layers are the nouns, and ETL (or ELT) is a verb that can turn parts of one layer into another. Or it can be used entirely outside of a medallion architecture.
@@Milhouse77BS 😂😂😂, nice sense of humor
This is the kind of content that deserves a love, insightful and celebrate button.
A key point omitted here is that the medallion architecture is also referred to as a 'multi-hop' architecture. I always talk about a multi-hop design approach with medallion as a starting point. In data warehousing we used to talk about 'staging'. This was never a single hop transformation to the dimensional model, but many smaller hops - in a similar way to medallion zones.
Some 12 years ago, at the far forgotten times of Powercenter and Teradata, we had a Data Warehouse divided into three parts:
Staging area, Work area and DWH.
One for raw, flat files, other for curated data and the last one for aggregated data.
I have realized that, in this market, we recycle old concepts with new names, so we keep the environment fresh.
Simon! I am so glad you made this video. It’s been a constant challenge at different clients explaining that medallion isn’t strictly 3 layers. The databricks blogs and documentation gets vague and prescriptive at the same time.
Thanks for this. Having worked on DW ETLs for decades, I wondered what was new about 'Medallion architecture'... answer: Not a lot. Same series of steps, with a different name.
The new bit - albeit not that new these days - is the reporting model sitting downstream of the DW / Analytical Model.
such a nice way of explaining this.
always wondered why everyone nowadays is pushing for this "architecture" which can make sense in a lot of scenarios but I think not as many as people tend to think
100% Agreed - I think the medallion "architecture" is really just an extremely poor analogy because it has caused confusion rather than simplifying a concept.
Bronze, Silver, Gold makes non-technical users think of Good, Great, Best - but data is not an Olympic event. Bronze, Silver, Gold is telling me what exactly? Is the Bronze data 3rd best? Is the Silver data only slightly less good than Gold?
Raw, Prep, Land, Stage, Star, Semantic, BI - words like this have semantic meanings that relate to the intended purpose and integrity of those layers. Thanks for illuminating the source of this abomination. 😁
yeah when I first heard a management consultant use those colors I wept a tear for the death of meaning and nuance I knew would follow
Thank you. I've been struggling internally with this and have more clarity and assurance now.
Thoughtful, articulate, and impactful - as usual! Thank you for this.
Hi Simon. I found your video very insightful. I have a question, though. How do you model the Silver layer? Would you use a set of flat tables? If not Data Vault, what would you suggest for the conformed data?
Thank you! In which stage do we transform data into 1) Parquet Files (Schema Val after transforming from Raw (Json to Parq) and Then 2) to Delta tables in your schema? (In Silver Layer on Cleansed and Validated Stage when Data transformed to Delta Tables).
Wise Words! I think "Make it your own" is the message. Great video.
Great overview. For something like Power BI which would like a star schema, would there be some overlap between the curated and semantic layer?? Or for some items in the curated, it would be a duplication of the object into the semantic layer? with other objects extended by the metrics added where needed??
Yes, for example your Date dimension in curated layer may be copied as is into the semantic layer. In Microsoft Fabric it would be a link or shortcut, so data not actually copied.
Ah yes shortcuts in Fabric! That’s a nice idea!
I definitely have build "Data solutions" with this sort of thinking. The one other one I'd include here that wasn't included was around if you have heavy PII and especially PHI/HIPAA data as often you have to have a governance structure for limiting access to certain users and only certain environments/tools.
This is where you may have STG/DEV/TEST/PROD environments. It gets complex, the key is making sure it's clear however and wherever the data is at and can be found by users.
What a great video - again! You're a legend. Well spoken and explained.
Thanks for the video. I have pretty much same vision (and apply it with other clients), and this is very well-structured. Yes, the oversimplified concept of this "medallion" architecture does indeed greatly confuse clients for the sake of catchy marketing.
I have two remarks what I see slightly differently:
1) The "base tables" have schema enforced, changes applied, it is a 1:1 replica of the source system's table, essentially the lowest "queryable" dataset for any consumption but before any logical conformations and transformations; In my view, this should be part of the Bronze "grade" (let's not call it a layer) while you are depicting it as a Silver.
With any kind of incremental data extraction pattern, the first two layers ("landing" and "raw" in your slides) are append-only and contain change-sets. Which makes them practically unqueryable. We use last data portion as part of an ETL job to properly apply changes towards the "base" table - but any other user or downstream job can't consume data from these layers, as would require enormous complexity and compute to replay the changeset from the beginning and reconstruct the current state of the data. In many cases it's not even possible.
If we saying that the "base" table (the first actually queryable) is "Silver" then our "Bronze" is nothing more than an unqueryable data dumpster. This is very misleading to the client as while they are stuck with the Medallion misnomer it leaves them with only two "broadly recognized" metals for their data layers which is not enough.
My take is that "base" == Bronze, what's below it (any kind of stacked append-only changeset datasets, whether unstructured or structured), if retained at all, is pre-bronze.
Don't you see it this way?
2) The layer naming you suggested is consistent and makes sense, but I see a lot of confusion surrounding the term "raw". The data engineering world understands it closer to the landing (raw landed data) like you do. But the business data consumers use and understand "bronze" / "base" and "raw" terms refer interchangeably, to them this is all "raw" (as untransformed) source system data, same as in the source system itself. At least I see this perception A LOT in the field.
If I was starting a new data platform from scratch today, I would reserve the word "raw" to refer to the same as "base" tables (and also map it a Bronze to map in the medallion paradigm), while calling both of the leftmost layers (landed unstructured files and schema validated) as landing & staging accordingly; Of which depending on the ETL pattern only one or both (or neither) may exist.
Awesome breakdown! Im glad I understood the concept very early on and not end up stuck with that misconception.
The way the medallion architecture has been presented is oversimplified. It's okay to present it as is as a start. BUT, it being more of a logical layer should be emphasized. That and business logic and needs is also a major consideration.
Care to outline what the major differences between the two gold layers in this examples are? Are you building all the different dimensional models your business needs in curated, using spark and lake, and then another layer with basically the same model but with the addition of a semantic model on top with measures and such?
Yep, exactly. Just wanted to capture that there is the logic stored in the lake via materialised delta tables, then additional context-aware measures in a BI tool such as Power BI semantic models!
While I agree in principle I think the value from using Bronze, Silver and Gold is that users understand generally what to expect of the data in each layer. I talk to users about Bronze data so often as a kind of warning - it's a mirror of source so don't expect it to be clean or conformed. So often we forget to define what is meant by various terms and a new person joining an organization from outside where they may have worked with these terms can immediately understand if we propagate their usage.
So yeah, it's nothing new, but using the terms consistently has value in itself.
Thanks for the video, as usual, your channel always has quality content - it's the Gold layer of analytics videos :-)
Hi Advancing Analytics Team, At what layer do you start to use the delta format?
As soon as we can get the data into a coherent structure - so for us it's Delta from RAW onwards!
Briliant! Thanks for making a simple and clear video about the Medallion Architecture. 🙂
Great great video. Every minute was to the point
Brilliantly said Simon! Thanks for this, keen for more videos like this in the future.
At which point are you doing any required de-duplication? Base?
Thank you for sharing. The message I hear and champion is; governance 😊. Make sure it is clear what can be found where 😊
In many cases, a file comes with perfect data. Would you still move it and store it in all layers, or skip couple of those and have that file in for example Landing, Raw and Semantic.
If you have terabytes of data it would not be efficient to move and store data through all layers. And if you decide not to move through all layers and skip some for some layers, then you end up in situation where you need to "join" two tables from different layers, because one of the table does not exist in one of the layers. How do you deal then with such anomality?
Great video, thanks. I had a question about the base layer, please. I was wondering what use cases you typically saw for storing the base layer? I’ve typically seen the base layer in the form of views that allow the flow from raw into enriched. I assume you store the base layer as you see use cases around downstream users/consumers wanting a place to source clean untransformed data? I was just thinking from a wider enterprise perspective would it not be better for downstream consumers in this scenario to take data from enriched which is often closer to the companies logical data model? This promotes a more uniformed understanding enterprise wide and can make integration easier? Thanks.
Thank you! Appreciate the effort. It actually makes way more sense
@simon does this also means that we have 6 layer of same data storage in data lake (some raw, some cleaned, some agreegated) ?
Yep, potentially. If you require it and there are business use cases for each materialised layer.
Regarding the names of the layers, I'm still on the fence between airlock code of the planet Druidia, or the album titles of Led Zeppelin
Joking aside, love this video, I've recommended it a bunch already!
The misconception here is too many peoples think of Bronze, Silver, Gold as physical layer, and when it should be seems as logical layers! And too many think about it a bit too literally.
I’m not sure even Databricks ever mentioned you’ll have only one step of refinement in your Silver layer or Gold layer.
The number of hops should be tight to a business cases and not defined and predefined in advance. All that can now be managed with tags on your tables, or databases in most of all modern data warehouses.
So, when using UC in Databricks, do you create a catalog per layer or do you divide it using schemas? Always in doubt on what the best approach is, especially with DLT, where only one destination catalog is possible.
I generally see catalogs named dev, stg, prod with similar schemas and cloud storage folder structures in each. This makes it easier to embrace CICD processes since the catalog name can be set as a default in the workspace admin settings and the code can run unchanged as it moves across the environments. Sometimes people want to organize bronze tables in a special schema (iot_bronze) and the silver and gold objects in another (iot). This makes it easier to hide the bronze tables from analysts and dsml engineers if desired. There are also line of business strategies for catalogs. I think more can be written about the patterns and pros/cons to help architects setting these things up.
This usually comes down to a size thing - if you get to a certain size of business, having a single "prod" catalog can be too limiting trying to do all security at the schema level. We've seen dev/test/prod or dev_raw/test_raw/prod_raw where you then want to break raw down into further data product oriented schemas, rather than have a single catalog that encapsulates all layers for all data products. You could well extend that to the product level if you had large, org-wide security that needed to be secured at the workspace level using catalog-workspace bindings. Essentially - design it to the granularity of security you require, balanced with usability for your data consumers
This aligns pretty well with how I have been designing lakes. I have had internal debates about whether structured coming into the lake should be "unstrcutured" but I see your logic. My question relates to when importing large quantitties of data from external OLTP source. Do you ETL only daily changes or is Raw a full copy? (which doesn't work when receiving few million records a day)
Main question: where do you first filter to stop inserting a few terrabyte each day?
We try and ensure all loads from source systems are incremental, but this relies on us having the option - not all sources have change data capture / timestamps / change feeds etc. So we parameterise & config up the data extracts, so we can do incremental where possible, then date partition/full snapshot where we have no other choice!
I have seen some design where gold layer is kept separate into a database rather than in databricks what can be the thinking behind this?
That's from the olden days when Spark wasn't great at joins & star-schema queries. These days I don't see a driver for adding another tech into the architecture so we keep everything lakehouse-based!
@@AdvancingAnalytics Saw a different video about this subject before seeing yours, good video by the way, you confirmed some of my thoughts about medallion. What he said is that if you need granular security on your tables and columns (RLS supported in both lake and DW) then DW is where you need to put your data facing the consumer. I didn't catch any other advantages other than perhaps personas working with dimensional modelling being more accustomed to T-SQL, which may or may not be a real case.
Thoroughly enjoyed that Simon, thanks!
Simon is there a set of file formats that you conform to in the raw layer? For example if the data source is CSV or Excel do you transform that into delta in the pipeline from data source to the raw layer? Also you store rest related data as json in raw, do you transform it to delta at a later stage?
Once you've picked it up from Landing, it's in a spark dataframe. When it's in a spark dataframe, land it as Delta, always (unless exporting data for a very specific integration)
@@AdvancingAnalytics so if I understand correctly you land the data in the data lake in its original file format and thereafter all data will be conformed to delta in the next layer?
@@patronspatron7681Land the data in the landing zone as what ever format json, parquet, csv. When writing hour notebook create a spark df on the landing data and then save it as delta in the raw or bronze zone.
Thank you. You solved a million dollar question I had.
A very interesting perspective on consolidated analytics and ETL, although I may not completely agree with it. I must emphasize that the medallion architecture's contribution to data processing is invaluable, offering numerous benefits across the enterprise. It is simple, comprehensible, and, like any best practice, should be encouraged to be followed.
It offers nothing new. It's just layers. Layers is a concept that has been around for decades
What about zipped/ compressed files? Land as .zip or their uncompressed version?
Compression in a lake can force reading to be single-threaded, so we tend to avoid traditional compression (ie: .zip) as it'll end up going slower!
However, other types of compression (such as parquet with snappy!) are what's known as splittable compression, which CAN be read in parallel, but also has the benefits of smaller reads
This is really helpful - thanks!
This video makes me so so happy.
Simon, when you say that your raw layer (schema validated) is APPEND ONLY, do you mean that you're not checking for garbage and duplicate records? I'm assuming that you're still delta merging to avoid unnecessary duplication from inserting the same record over and over again, correct?
Thanks for the great content!
Hey - nope, with RAW we literally do a straight append, so raw will absolutely have duplicates if we are sent the same records repeatedly (ideally we avoid this during source data extraction). We merge into the BASE layer and deal with deduplication etc there
Excellent synthesis, thank you
I have a question: where data is really large, like truely billion record, how much of these layers can be materialised vs how much should be logical?
Regarding the semantic layer: usually how do you implement it with lake house? It is purely PowerBi managed or you use Azure Analysis Services?
Power BI data model has pretty much replaced Azure Analysis Services. You shouldn't use AAS for anything new
@@nickmcdermaid3015 thanks for answer!
It's just about logical concepts that require abstractions based on your business cases. Period!
Excellent vid! Thank you.
Initially this sounded lot like early days Data Warehouse with landing/staging table and main tables. With Lake house We can create multiple tables/views from same sources without much risk need not to be bronze, silver, gold.
Anyone doing data mgmt for a while knows there are stages to any data management solution. It's a marketing tool. The key is defining all the sublayers and data management tasks that have to happen such as reference data management, ingestion, intermediate reshaping, etc.
The biggest issue I have with the medallion architecture? Most people that know metallurgy would more than slightly shudder if they were ever explained data engineering 😂😂
At the same time, one should not just dismiss someone else suggesting the architecture. It may just be the way they learned. My suggestion is to inquire about where folks think all the subtasks should be.
Brilliant rant - loved it
To me even the gold layer could be the Kimball base star schema for ad hoc reporting. The Platinum layer can be fed by a semantic model. Then, the advanced Kimball concepts such as second level consolidated fact tables or other materialization of drill across result can then feed dashboards......if indeed that is the part of requirements.
The reality is speed to market and skill sets of a team will short circuit this approach. That's when you have big trouble and an evolving mess ......
Simon, are all those hops really necessary? How do you managed your PII data with all those hops?
Nope - these are all of the hops a customer might go to, they are absolutely not ALL necessary to ALL organisations. Design the layers to your required processes.
PII - depends on your use case. If you don't need it in your lake, strip it out when first performing structural validation. If you need it but need to obfuscate it, then do that during data cleansing and lock down your RAW layer accordingly
@@AdvancingAnalytics Somebody already asked when you typically transition to Delta. However, there is also a third angle regarding PII specifically. One might need - and have a justification - for using the actual data, unobfuscated. However, there should still be a mechanism for "surgically" removing it when a legal basis for processing no longer exists (e.g. a person is exercising a right to erasure / right to be forgotten).
This becomes potentially problematic when persisting data across ALL the stages. How do you typically handle this? Transient landing, but persistent RAW - the latter in Delta to enable more targeted deletions / anonymization?
If you use Databricks, Unity Catalog offers capabilities to help manage PII data when you need to retain it for authorized people/purposes: Volumes let's you control who can see raw data files containing PII before its ingested into Delta, you can also create reusable functions in Unity Catalog to obfuscate and filter PII data from unauthorized people, and Delta and Deletion Vectors help you implement the GDPR "right to be forgotten" without introducing performance issues in your tables.
@@jimthorstad6226 Those are all good points and advertisement of features within Databricks, but Simon used the age-old argument of "retaining all data (in all stages), because storage is cheap in the cloud". However, cost is not the only factor when deciding what data to retain - particularly in its raw form.
Data engineers are not some special people who are privileged to handle PII, when the company as a whole does not have the right to process it (anymore).
@@vjraitila on top of that you need to remember information kept in previous versions of your delta table used for time travel.
Mazing video - subbed
Oomph... love me a metric layer
Rewatching this Simon missed a pun with the "layers of the lake." He could have said different depths or depths of the data to lean into the lake analogies. :)
So how do I call the delta between my Delta tables now 🙂
Alpha Delta Zeta. Pretend it's a fraternity.
Criticizing the medallion architecture and present a more complex architecture 😅
For us we only have the schema validated and then analytical layer which is star schema
You conviced me about the raw files as we do a lot of back filing and this should solve the issue
Overall great video i l love this type of content
Neat and concrete! 🎉
a BIG thank you !
YES! It's not an architecture! I see so many old ideas rehashed with new names for marketing purposes. Layer approach, columnstores. These things have been around for decades
Spot on buddy!
Preach it! Sober up everyone, this is real world.
Show some examples. 16 min in and still talking hypothetical.
I was asked this medallion architecture in job interview and i didnt know except ive heard the name. The interviewer explained it enthusiastically and i just commented that its the same as any other process, he didnt seem to like my response. I didnt get picked for the role and also didnt want to as the interviewer seemed arrogant. Its not rocket science.
One request, can you please use a better mic? your content is so good but needs a better sound quality
Spot on
There is so much confusion happening with the definition of what goes in a medallion layers, least of all because people don't necessarily the different type of modelling you do whether its 3rd normal form or dimensional models, the aggregation aspect is confusing to folks, these days many data engineers don't really understand what a dimensional model is, in the old days a bi person was end to end but now the raw to clean conformed is done by data engineers, that's my recent experience, and think you need to aggregate your facts in gold, when actually that's not true because you would mostly always have detailed facts to the lowest grain possible and that should sit in gold, or is this just splitting hairs, at the end of the day like you say as long as we all on the same page it doesn't matter, but as the teams and companies get bigger, i can see a lot of mismatches happening, just like ive seen so many data warehouses built incorrectly the definition of what you doing needs to be clear understood and communicated. Your layers make more sense to me as a long time end to end BI person because it's quite clear what it means.
Agree. All this mention of "aggregates" to me is a waste of breath. Capture facts at lowest grain and optionally add aggregates if you want. But you shouldn't add aggregates for performance. Performance should be solved in the platform. IMHO the best case for aggregates is combining two facts of different grains to make analysis simpler.
Agree 100%
this is gold
*Curated ;)
You need it because without the terminology pipelines behave differently
Is this guy actually advocating for copying data 6 separate times? Should rename this video “How to bankrupt you company through ETL”
Hah, not at all - advocating for "design the layers as appropriate for your company, not accept the 3-layer medallion default". Sure, I've built many 6-layer, rigorous architectures for global FinServ, but if you don't need that, don't build that? Wanna go for 2 layers with a virtual logic layer? Go nuts, as long as you're making that choice that's right for the org