I’m writing this blog post as an additional comment and reflection on the whole discussion that broke loose as a result of the article that Frank Habers wrote in XR magazine.
Before I continue, I want to make the following very clear:
- I am an independant BI consultant with almost 15 years of experience
- Most implementations I did or worked on are using Kimball’s approach
- I am NOT a Certified Data Vault Modeler, but that does not mean that I haven’t read a lot of material that Dan Linstedt and others wrote about Data Vault (such as “Supercharge Your Data Warehouse”)
- I have little practical experience in using Data Vault
- The largest BI implementation (in terms of volume of data) I encountered was for a mobile telephone company
- I have never (unfortunately) worked with an MPP database
- I am not trying to sell anyone anything based on this post
My original comment on Frank’s article was only on one specific point he made about the difference in performance between a Dimensional and Data Vault model, with respect to the joins. I mentioned that it was missing enough clarification to make his point. Something he admitted in his own comment and to which he gave some more clarification already, which I appreciate.
However the whole discussion in the comments on his article could easily turn into a “war”, which is not very helpful as stated by Ronald Damhof and Rick van der Lans in their comments on twitter.
I find the article that Dan Linstedt wrote on his own blog to counter Frank’s article also a bit of an overheated response even though Dan’s makes it clear that he has nothing against Frank personally. For some part I can understand that. There is nothing wrong in correcting statements made that are false or not entirely true. And of course Data Vault is still Dan’s “baby” and we all know how we react if someone does something wrong to our children. But I do think that NOT being a Certified Data Vault Modeler doesn’t mean you can’t discuss it or don’t know anything about it. There isn’t such a thing as being a Certified Dimensional Modeler either…
But we must make sure we don’t actually start a war. We have done that before with Inmon’s and Kimball’s approach. It doesn’t lead anywhere in the end. Having a sound and constructive discussion in which we elaborate on pro’s and con’s of certain approaches is however a good thing. As Ronald Damhof mentioned in his comments, it all depends on the context (of the client).
And whether Frank’s article may have some commercial background in it or not, the approach he discusses is a good approach, but again, it depends on the context.
Benefits of Data Vault
Based on my limited experience with Data Vault, there are some benefits that I can see in its modeling aspect that are less obvious in Dimensional Modeling. The whole idea of having hubs and links and the fact that you have many-to-many relationships does help in at least two ways:
- Understanding the business and creating a sound model of the business processes
- Getting possibly crappy data from the source into your data warehouse and show the business that they may have an issue
Note that the above does not mean you can’t accomplish this with Dimensional Modeling. Let me elaborate.
Understanding the business
When discussing business processes and the data that is used or produced by these processes, I have come to the conclusion that a Dimensional Model is fairly easy to understand by the business. However, creating the bus architecture with many fact tables (at the most detailed grain possible) and conformed dimensions can also easily result in losing the complete overview, even when you only present the entities without the attributes. Secondly, I find it more difficult to understand possible relationships that exists between fact tables.
Does a Data Vault model solve this? Yes and no. If you present a complete Data Vault model with all satellites and possible reference tables, you’re lost as well (both IT and business). But if you limit it to the hubs and links only, it becomes much clearer.
I can hear you say already: “this doesn’t help”. Partially you are right. In many cases there is not much difference between a Data Vault and Dimensional Model. Let’s look at the following simple example:
Where as in a Dimensional Model you would have two dimensions and one fact, in a Data Vault model you would have two hubs, two satellites linked to those hubs, one link and one satellite linked to that link table. Leave out the satellites and you get (basically) the same as with the Dimensional Model: two hubs and one link, representing two dimensions and one fact.
But if you need to introduce a many-to-many relationship between dimensions, there are basically two ways of solving it:
- You use a factless fact table to capture that
- You alter the grain of an existing fact table by adding the additional dimension
With the second approach you will give yourself a headache when there is already data present, but it can be done.
The first approach, using the factless fact, is much easier. But wait, isn’t that the same as creating another link table between two hubs in a Data Vault model? Sure it is! But to me it feels more natural in a Data Vault model to use a link between hubs than to use a factless fact in a Dimensional Model. The reason for this is only psycological because of terminology: a factless fact. You’re registering a fact without it being a fact. Weird terminology if you ask me. Maybe it should have been called an attribute-less fact.
So in many cases there may not be much of a difference after all between a Dimensional Model and a Data Vault model, but I find a Data Vault model easier in terms of evolution. The “divide and conquer” is much easier to apply to it than to a Dimensional Model.
Another issue that I sometimes encounter with a Dimensional Model is the possibly changing cardinality of a relationship between dimensions. In a true Dimensional Model, snowflaking should be prevented (there are always exceptions), meaning you flatten or denormalize your table. Great if there is a hierarchy present that is one-to-many. But a nightmare when this changes to a many-to-many relationship (in which case having snowflaked it would give you easier means to recover).
Getting crappy data from your source in your data warehouse
Let’s be honest, we all have encountered it. If not, let me know. There is a lot of crappy data in source systems. Data that does not represent the cardinality rules given by the business. And all kinds of other data (quality) issues.
Having a Data Vault model with its many-to-many relationships provides a guarantee that you can at least load that crappy data into your data warehouse (maybe with a few exceptions). Having it there will of course still give you a headache when you need to process and present it to the business in a layer more suitable for presentation, either virtualized or with a Dimensional Model on top of your Data Vault.
But it does become much easier to confront the business with the fact that they have crappy data in their source!
I find it easier with a Data Vault model than with a HSA that is modeled as the source model. In fact, how often haven’t you been in the situation that the source model is much of a blackbox and you only receive extracts from it. In such a case, the HSA is probably modeled after the extract, which may not be the actual source model.
Often when using a Dimensional Model, this crappy data is hidden because it is being cleaned by the (complex) ETL along the way from source to presentation to the business. You lose some track of it and the business is possibly not even aware of it.
But Data Vault does not solve this, it only helps you to make it more visible. In the end, there is still work to be done to clean it, either in the source itself or along the way to the presentation layer (whether that be a Dimensional Model, cube or something else).
Con’s of Data Vault
This is probably the part that may get readers and experts “excited”, to say the least 😉 Due to my limited experience, these con’s could be false in some cases. Please correct me if I am wrong, I want to learn from the experts in the field.
One of the con’s is that Data Vault indeed does result in more tables and possibly more joins, which can make it more complex to maintain from the DBA’s point of view.
Secondly I do have some doubts on performance as well, but especially (and only) in the following situation: if you create a virtualized Dimensional Model suitable for presentation on top of the Data Vault model using views and when you do this on a plain non-MPP database that doesn’t use column stores. If even a physically implemented Dimensional Model already gives performance issues, than using views with more joins on top of a Data Vault model on the same configuration won’t be any quicker.
Thirdly… well, this is not related to the Data Vault Model and Methodology as such, but more to the advocacy of it. With any new or just evolutionary approach, there is a hurdle to tackle. We are afraid of change. Sometimes Data Vault is presented as the holy grail. That’s not true, period. It doesn’t even depend on the context. The holy grail has never been found. Data Vault can help you solve particular issues that we encounter now and maybe in the next ten years. But by then, we may have evolved in such a way in handling data, that even Data Vault doesn’t provide a solution for the issues we encounter.
I also have issues with the continous hammering on getting certified in Data Vault. What is really the benefit of it? Of course, I can show off with it on my CV. Increase my hourly rate a bit so that I can earn it back. I can see a benefit for Dan and Hans. They make money out of it. Those are valid reasons of course, but do I really get much more knowlegde by following the training and certification class, or by getting experience in the field with the theory based on the articles, blogpost, books, (free) advice from experts (yes, I did get free advice) that are certified and paying close attention to reviews done by a certified expert.
So what conclusion is there? Did I make a strong point somewhere? No, I just wanted to reflect on the discussion that Frank’s article started.
Data Vault can be useful, for sure, I can see that. But I have doubts as well. The most important thing is that we help our clients and choose the best approach given the context of those clients. Make them evolve.
I hope this post invites you to give your comments on my reflection. Please help me learn and evolve. Correct me if I am wrong etc.
And thanks for coming all the way down here to this last line, it means the post wasn’t boring 😉