Best explanation of moving from dimensional modelling towards Data Vault in a step-by-step manner.
The moment you reverse the business key to reverse the relation... it is a pragmatic way to show the evolution. Loved this.
Crisp, clear, and short introductory explanation of how Data Vault has evolved from dimensional modeling. Personal thanks to the engaging and excellent presenter.
Dimensional modelling was SAP: a complete company without any idea how data models work at all.
Wow this was so easy to understand. Great story telling and explanation. Thank you!!
Thank you. This is the best explanation of the problem that Data Vault actually solves.
This is one of the most important videos for anyone who comes from the Kimball world...
Best video on adding new source attributes
Brilliant! So helpful in helping me to understand *both* modeling techniques.
Thanks for sharing the concept. My queries are:
How is this different from a Snowflake schema?
The detail lies in the implementation: more joins, and hence worse performance for fast reads.
Data Vault is typically used for the main DW layer. This layer is integrated, non-volatile, time-variant, and subject-oriented (per Inmon), as well as auditable, agile, atomic, etc. The business meaning implied by the "conformed" dimension in this example represents the common business-rule transformations. These are intended to be completed within the DW layer as part of the target or BDW area. In the same layer there is also a RAW area (for traceability).
The data marts remain dimensional models.
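For readers who want to see that layer split concretely, here is a minimal sketch of the RAW-area structures, assuming invented table and column names (hub_customer, sat_customer_details, link_sale); real implementations add hash keys, one satellite per source and rate of change, and so on:

```python
import sqlite3

# Minimal Raw Vault sketch; names and columns are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
-- Hub: just the business key plus load metadata (auditable, atomic).
CREATE TABLE hub_customer (
    customer_bk TEXT PRIMARY KEY,  -- business key from the source
    load_date   TEXT NOT NULL,
    record_src  TEXT NOT NULL
);

-- Satellite: descriptive attributes, versioned by load date (time-variant).
CREATE TABLE sat_customer_details (
    customer_bk TEXT NOT NULL REFERENCES hub_customer(customer_bk),
    load_date   TEXT NOT NULL,
    city        TEXT,
    PRIMARY KEY (customer_bk, load_date)
);

-- Link: relates two hubs, e.g. customer and product via a sale.
CREATE TABLE link_sale (
    customer_bk TEXT NOT NULL,
    product_bk  TEXT NOT NULL,
    load_date   TEXT NOT NULL,
    PRIMARY KEY (customer_bk, product_bk, load_date)
);
""")
```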
Mr. Hans, your explanations and videos are always better than those of other experts in DV. However, I have a question: how does the ETL differ in the case of dimension-model appends vs. DV satellite additions?
Short, straightforward, and objective, as all great work should be. Congrats!
Thank you. This made it real and practical for me.
Can you please help me with these? It would be great if you could explain with some examples.
1. Why is a link-to-link relationship not recommended in the Raw Data Vault (RDV)?
2. If the BDV bridge table stores only hash keys (not natural keys), then how do we get the natural keys into the fact/dimension?
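On question 2, one standard answer (general DV practice, not something this video states): the hub is exactly the place where a hash key and its natural/business key are paired, so a bridge carrying only hash keys can always be joined back to the hubs to recover the natural keys. A rough sketch with invented names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- The hub pairs each hash key with its natural (business) key.
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_bk TEXT);
CREATE TABLE hub_product  (product_hk  TEXT PRIMARY KEY, product_bk  TEXT);
CREATE TABLE bridge_sales (customer_hk TEXT, product_hk TEXT);

INSERT INTO hub_customer VALUES ('a1b2', 'CUST-001');
INSERT INTO hub_product  VALUES ('f9e8', 'PROD-042');
INSERT INTO bridge_sales VALUES ('a1b2', 'f9e8');
""")

# Recover natural keys for the fact/dimension by joining back to the hubs.
print(con.execute("""
    SELECT hc.customer_bk, hp.product_bk
    FROM bridge_sales b
    JOIN hub_customer hc ON hc.customer_hk = b.customer_hk
    JOIN hub_product  hp ON hp.product_hk  = b.product_hk
""").fetchall())  # [('CUST-001', 'PROD-042')]
```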
Fantastic! It just solved the DV-to-DM question for the dimensions!!! The dimension is a plain join of the satellites on the hub!
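As a rough illustration of that idea (table names invented for the example, satellite history ignored): rebuilding a customer dimension is just the hub joined to each of its satellites on the business key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY);
CREATE TABLE sat_customer_name (customer_bk TEXT, name TEXT);
CREATE TABLE sat_customer_city (customer_bk TEXT, city TEXT);
INSERT INTO hub_customer VALUES ('C1');
INSERT INTO sat_customer_name VALUES ('C1', 'Alice');
INSERT INTO sat_customer_city VALUES ('C1', 'Melbourne');
""")

# dim_customer = hub joined to each satellite on the business key.
# A real build would also pick one satellite version per key.
dim = con.execute("""
    SELECT h.customer_bk, n.name, c.city
    FROM hub_customer h
    LEFT JOIN sat_customer_name n ON n.customer_bk = h.customer_bk
    LEFT JOIN sat_customer_city c ON c.customer_bk = h.customer_bk
""").fetchall()
print(dim)  # [('C1', 'Alice', 'Melbourne')]
```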
I love this great presentation.
Awesome explanation. Great. Thank you
Brilliant explanation !!!! Thank you very much
Absolutely stunning explanation. Thanks a lot!
Very good! Very clear story on DV modeling - thanks a lot for sharing.
Thanks for the great introduction to Data Vault. The videos in the academy are great. I would appreciate it if you could arrange the videos into a logical structure that can be followed sequentially.
Great explanation, thanks a lot!
How can we design, in Data Vault, the dimension tables that are already connected to the fact table? You were only explaining the extended dimensions with the vault.
Pretty good explanation of DV. Though it talks about hubs and satellites, it doesn't seem to talk about links. Is there a separate video on that? Covering links would make this topic complete, in my opinion.
This particular topic didn't concern links; however, there are other videos on this channel that do. You can also find more on the Genesee Academy YouTube channel.
Thanks very much for this clear explanation. Could you please also create a video explaining how FACTS are derived from DV or vice versa? Thanks
I could see this becoming a performance nightmare ...
Great and easy-to-watch video, big thanks!
Very clear explanation. Thanks! Simple question: if I aggregate sales upwards, via the link tables, to a unique customer contained within the hub, how do I know which version of a satellite record to use? A unique hub record can now have multiple satellite records with different load dates.
An SCD2 is painful to maintain and use, but very effective at tracking history, as new versions have unique surrogate keys that can be associated with transactions. The hub-satellite model does not associate different satellite versions for the same hub record with a transaction, only with the unique hub record.
So if I use a satellite table to track changes to a specific customer contained in the hub, I cannot see which transactions occurred when during the life of the customer.
Is my understanding correct?
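A common way to handle this (standard Data Vault practice rather than anything covered in this video) is a point-in-time join: each transaction carries its own date, and for every transaction you pick the satellite row with the latest load date at or before that date. A minimal sketch, assuming invented table names and ISO-formatted dates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (customer_bk TEXT, load_date TEXT, city TEXT);
CREATE TABLE link_sale (customer_bk TEXT, sale_date TEXT, amount REAL);
INSERT INTO sat_customer VALUES ('C1', '2020-01-01', 'Sydney');
INSERT INTO sat_customer VALUES ('C1', '2021-06-01', 'Melbourne');
INSERT INTO link_sale VALUES ('C1', '2020-03-15', 100.0);
INSERT INTO link_sale VALUES ('C1', '2022-02-01', 250.0);
""")

# For each sale, use the satellite version in effect on the sale date:
# the row with the greatest load_date that is <= sale_date.
rows = con.execute("""
    SELECT l.sale_date, l.amount, s.city
    FROM link_sale l
    JOIN sat_customer s ON s.customer_bk = l.customer_bk
    WHERE s.load_date = (SELECT MAX(s2.load_date)
                         FROM sat_customer s2
                         WHERE s2.customer_bk = l.customer_bk
                           AND s2.load_date <= l.sale_date)
""").fetchall()
print(rows)  # [('2020-03-15', 100.0, 'Sydney'), ('2022-02-01', 250.0, 'Melbourne')]
```

So the history is there; it is resolved at query time (or materialized in a point-in-time table) instead of being stamped on the transaction as an SCD2 surrogate.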
Very well said... I really enjoyed this session and learned a lot.
Best 11.23 minutes I spent today.
Very good. Thank you.
Really well explained, great example!
Awesome!
Great explanation!
This approach is in no way fundamentally different from dimensional modeling; it is just a variation. If a new hub comes, you still have to change all the link tables. Small changes may be easier, but Ralph never said you must pile everything into a giant dimension. This is more or less what DV improves: isolating attributes that share the same frequency of change (satellites) and linking them to a dimension (hub) instead of linking directly to the fact tables as classical MDM does. Makes sense, but packaging this as a new modeling paradigm and calling it all different 😂😂😂😂. The presentation would benefit from talking about grouping by frequency of change; that is the essential part.
It is really a great lecture
Very good. I have just started reading up on Data Vault.
I didn't catch whether all, or any, of the smaller customer dimensions are type 2 dimensions?
They are all type 2. :-)
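For anyone wondering what "type 2" means here: a satellite keeps history the way an SCD2 does, but insert-only; a changed attribute becomes a new row with a new load date, and old rows are never updated in place. A tiny sketch with made-up data:

```python
# A satellite is effectively SCD2 without updates: each change is a new row.
sat_customer_city = [
    # (customer_bk, load_date, city)
    ("C1", "2020-01-01", "Sydney"),
    ("C1", "2021-06-01", "Melbourne"),  # C1 moved; the old row is kept
]

# Current value = the row with the latest load date for the business key.
current = max((r for r in sat_customer_city if r[0] == "C1"), key=lambda r: r[1])
print(current)  # ('C1', '2021-06-01', 'Melbourne')
```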
Wow!!! Simple and perfect explanation.
This is great and quickly digestible.
It's very similar to the EAV (Entity-Attribute-Value) model.
THANKS!!!
Is this really the evolution of star or snowflake schema'd DWH design? A Data Vault modeled DWH would require far more joins and would likely perform poorly. What about that? A single dimensional model, by contrast, has fewer tables, is easier to partition, and is faster to query.
No, no, no, no... No on so many levels. And people cannot see that the lecturer is making mistakes, because he gives no examples. The final structure you are building turns the dimension table into a fact table, and you would have a record there for every customer. So if you have a million customers, you would have a million records there. If you have half a million customers from Melbourne, you would end up writing "Melbourne" half a million times. Good luck aggregating data by value.
But that is only half the problem.
Say you want to add a "gender" value to your customers. In your initial structure, you would need to go through a million customer records and populate a new field, and you say that is difficult. The easier way you offer is to create a new table and add the gender there. So you create a new table with a key for every customer and the value of their gender; that takes exactly the same amount of computation, one million records processed. Where is the simplification?
Melbourne would not be written a million times. What this video did not mention is that there would be an intermediate link table. That is, there would be a customer-location link table with a million records linking those customers to the same (integer) id. Then there is a final location SAT table that contains a single record associating that id with "Melbourne".
As for your second point, the benefits come from modularity. Instead of having to modify a "dim_customers" file every time you want to add new attributes, you can write brand-new "sat_customer_location" and "link_customer_location" files without modifying the original files. In a fast-paced agile environment, writing new files is much easier than modifying existing files.
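To make that concrete, here is a minimal sketch (table and column names invented for the example): every Melbourne customer links to the same location id, and the string "Melbourne" is stored exactly once in the satellite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_location (location_id INTEGER PRIMARY KEY);
CREATE TABLE sat_location (location_id INTEGER, name TEXT);
CREATE TABLE link_customer_location (customer_bk TEXT, location_id INTEGER);

INSERT INTO hub_location VALUES (1);
INSERT INTO sat_location VALUES (1, 'Melbourne');  -- stored once
-- Every Melbourne customer points at the same location id:
INSERT INTO link_customer_location VALUES ('C1', 1), ('C2', 1), ('C3', 1);
""")

# Aggregating by value is a join plus GROUP BY, not a scan of repeated strings.
print(con.execute("""
    SELECT s.name, COUNT(*) AS customers
    FROM link_customer_location l
    JOIN sat_location s ON s.location_id = l.location_id
    GROUP BY s.name
""").fetchall())  # [('Melbourne', 3)]
```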
Perfect, but it seems like a 3NF-style hybrid approach.
Are there any modeling tools that support Data Vault?
I am not convinced that Data Vault modeling is the evolution of dimensional modeling. The main purpose of dimensional modeling (denormalizing) is to reduce the number of joins, make queries less expensive, and hence gain performance when querying reports in real time. With the above Data Vault approach, you are again increasing the number of joins (it seems like you are again somewhat normalizing). Please let me know if my logic is correct.
Really late answer, but the dimensional-vs-vault discussion is not relevant. Many times, if not always, we need a semantic dimensional layer on top of a vault model. Vault is not just a modelling technique but also an architectural direction. It really only works well with CDC and business keys on the source. The beautiful thing about Vault is that it somewhat integrates MDM into the model part and stores history almost by default.
Good one
Thank You
Well done.
Don't you understand how ridiculous dimensional modelling is compared to 3NF? Dimensional modelling is simply nonsense. Data Vault is a fairly good concept, but very related to 3NF. I like Data Vault because you can do things in parallel, but dimensional modelling is simply pure nonsense, one of those unprecedented failures in IT.
wonderfully explained!!