ERwin data modeller plug-in MODGEN for DataVault generation

Thanks to George McGeachie, my attention was drawn to the following article on the blog of ERwin, a well-known data modelling tool.

The article is about DataVault in general and how a data modelling tool like ERwin can help.

More interesting is the fact that the German company heureka e-Business GmbH has written a plug-in for ERwin called MODGEN that is able to generate a DataVault model from another data model.

I will certainly contact them and see whether they will do their webcast again, or whether a recorded version is available for offline viewing.

Who knows, this may be one step further in automating DataVault.

How to load a #datavault hub from multiple sources

In this blogpost I will describe a few scenarios for loading a datavault hub from multiple sources, each with their pros and cons. Many thanks to Martijn Evers (@DM_Unseen) and @RonaldKunenborg for giving their input.

The main reason for this blogpost is to share knowledge about Data Vault, so that it becomes accessible to a broader audience.

Disclaimer: I am not a Certified Data Vault Modeler, but the aforementioned persons are.

Some considerations

One of the “principles” in the Data Vault Methodology is that most of the loading can be done in parallel. First the hubs are loaded in parallel, then the links and finally the satellites (although satellites belonging to hubs can of course be loaded as soon as the hubs are loaded; one doesn’t need to wait for the links to be loaded).

It’s exactly at this point that I was initially somewhat confused regarding the loading of a hub that has multiple sources. If you load that hub sequentially, as explained in scenario 2 below, aren’t you defying this principle of parallel loading?

On the other hand, the idea is to load your data as soon as it becomes available. This poses a problem when using a union construct as explained in scenario 1 below. If one of the sources is ready a lot later, you have to sit and wait before you can load. Precious time is lost.

Scenario 1: using a union between sources

In this scenario the business keys from the different sources will be unioned together, while keeping information about the record source of each business key.

The following pseudo-SQL provides the basis for this scenario.

select distinct a.col1 as business_key
     , 'table1.col1' as record_source
     , load_cycle_ts() as load_dts
from table1 a
union
select distinct b.col2 as business_key
     , 'table2.col2' as record_source
     , load_cycle_ts() as load_dts
from table2 b

Note that the above is not entirely correct, as it can result in duplicate business keys due to the inclusion of the record_source. This can, however, be handled with most ETL tools, or in SQL itself, as sketched below.
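One way to handle it in SQL is to rank the unioned rows per business key and keep only one row, preferring the master source. This is only a minimal sketch, assuming table1 is the master source and reusing the pseudo-function load_cycle_ts() from the example above:

select business_key, record_source, load_dts
from (
    select u.business_key, u.record_source, u.load_dts
         -- keep one row per business key, preferring the master source (table1)
         , row_number() over (
               partition by u.business_key
               order by case u.record_source when 'table1.col1' then 1 else 2 end
           ) as rn
    from (
        select distinct a.col1 as business_key
             , 'table1.col1' as record_source
             , load_cycle_ts() as load_dts
        from table1 a
        union
        select distinct b.col2 as business_key
             , 'table2.col2' as record_source
             , load_cycle_ts() as load_dts
        from table2 b
    ) u
) ranked
where rn = 1

This also covers the “appoint the master source” point from the cons below, at the cost of an extra window function.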

A typical setup for this union scenario using Pentaho Data Integration would be like:

Pros

  • All the sources of a particular hub are handled in one transformation, which can give a better overview
  • Principle of parallel loading is maintained

Cons

  • Not easy to generate via automation
  • Difficult to maintain if additional sources are needed
  • Additional steps are needed to prevent duplicate keys
  • Additional constructs are needed to appoint the master source
  • Synchronization between sources is required: you have to wait until all sources are ready before you can start loading

Scenario 2: sequentially for each source

In this scenario the business keys from the different sources will be loaded sequentially, starting with the master source.

Step 1: master source first

select distinct a.col1 as business_key
     , 'table1.col1' as record_source
     , load_cycle_ts() as load_dts
from table1 a

Step 2: next source

select distinct b.col2 as business_key
     , 'table2.col2' as record_source
     , load_cycle_ts() as load_dts
from table2 b
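In both steps the selected business keys should only end up in the hub when they are not already there. A minimal sketch of that lookup-and-insert pattern for step 1 (the hub name hub_example and its columns are my own assumptions, and the hub’s surrogate key is assumed to be generated automatically by an identity column, a sequence or the ETL tool):

insert into hub_example (example_bk, record_source, load_dts)
select s.business_key, s.record_source, s.load_dts
from (
    select distinct a.col1 as business_key
         , 'table1.col1' as record_source
         , load_cycle_ts() as load_dts
    from table1 a
) s
-- only insert business keys that are not already in the hub
where not exists (
    select 1
    from hub_example h
    where h.example_bk = s.business_key
)

Step 2 is the same statement with table2.col2 as the source, run after step 1 has finished.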

A typical setup for each of these steps in Pentaho Data Integration would be like:

With this kind of setup, you can also use “micro-batches” to load the data. This idea is explained by Martijn Evers (@DM_Unseen) in his reply to my original question about this subject on LinkedIn (you need to be a member of the Data Vault Discussions group to view it).

Pros

  • Can easily be generated via automation
  • Adding new sources is easy
  • None of the Cons of scenario 1

Cons

  • Slightly defies the principle of parallel loading of hubs, but this is really of minor importance

Conclusion

It should be clear that the second scenario, loading the hub sequentially from its sources, is normally the best one to choose. However, to quote @RonaldDamhof, “it depends on the context”. You can always come across a situation where another way is better…

Waiting for DataVault implementation classes

I’m anxious to learn more about the implementation part of a DataVault, something that is not covered in detail in Dan Linstedt’s book “Supercharge Your Data Warehouse”.

For a while now, Dan Linstedt has announced that he is working on classes that cover the implementation. You can register to be notified about them here.

Unfortunately, I keep receiving emails from that list with all kinds of info about DataVault – but not about the implementation classes – telling me that I can be notified when the implementation classes are ready. But that’s why I registered in the first place. If I already registered, why is that still in the emails I keep receiving?

Just now, I saw another tweet from him:

Now I find it interesting that he’s working on multiple things, but it would be nice if something actually got finished. According to that tweet, the implementation classes are almost ready for production and release. We all know what this means in IT. Either something is ready, or it’s not. “Almost” doesn’t exist…

Dan, get those implementation classes out now. People are waiting for them. You can always fine-tune them later…

Data Vault and other approaches, my reflection on Frank Habers’ article

Intro

I’m writing this blog post as an additional comment and reflection on the whole discussion that broke loose as a result of the article that Frank Habers wrote in XR magazine.

Before I continue, I want to make the following very clear:

  • I am an independent BI consultant with almost 15 years of experience
  • Most implementations I did or worked on are using Kimball’s approach
  • I am NOT a Certified Data Vault Modeler, but that does not mean that I haven’t read a lot of material that Dan Linstedt and others wrote about Data Vault (such as “Supercharge Your Data Warehouse”)
  • I have little practical experience in using Data Vault
  • The largest BI implementation (in terms of volume of data) I encountered was for a mobile telephone company
  • I have never (unfortunately) worked with an MPP database
  • I am not trying to sell anyone anything based on this post

My original comment on Frank’s article was only about one specific point he made about the difference in performance between a Dimensional and a Data Vault model, with respect to the joins. I mentioned that it lacked the clarification needed to make his point. This is something he admitted in his own comment, in which he already gave some more clarification, which I appreciate.

However, the whole discussion in the comments on his article could easily turn into a “war”, which is not very helpful, as stated by Ronald Damhof and Rick van der Lans in their comments on Twitter.

I find the article that Dan Linstedt wrote on his own blog to counter Frank’s article also a bit of an overheated response, even though Dan makes it clear that he has nothing against Frank personally. To some extent I can understand that. There is nothing wrong in correcting statements that are false or not entirely true. And of course Data Vault is still Dan’s “baby”, and we all know how we react if someone does something wrong to our children. But I do think that NOT being a Certified Data Vault Modeler doesn’t mean you can’t discuss it or don’t know anything about it. There isn’t such a thing as being a Certified Dimensional Modeler either…

But we must make sure we don’t actually start a war. We have done that before with Inmon’s and Kimball’s approaches. It doesn’t lead anywhere in the end. Having a sound and constructive discussion in which we elaborate on the pros and cons of certain approaches is however a good thing. As Ronald Damhof mentioned in his comments, it all depends on the context (of the client).

And whether Frank’s article has some commercial background to it or not, the approach he discusses is a good one, but again, it depends on the context.

Benefits of Data Vault

Based on my limited experience with Data Vault, there are some benefits that I can see in its modeling aspect that are less obvious in Dimensional Modeling. The whole idea of having hubs and links and the fact that you have many-to-many relationships does help in at least two ways:

  1. Understanding the business and creating a sound model of the business processes
  2. Getting possibly crappy data from the source into your data warehouse and showing the business that they may have an issue

Note that the above does not mean you can’t accomplish this with Dimensional Modeling. Let me elaborate.

Understanding the business

When discussing business processes and the data that is used or produced by these processes, I have come to the conclusion that a Dimensional Model is fairly easy for the business to understand. However, creating the bus architecture with many fact tables (at the most detailed grain possible) and conformed dimensions can also easily result in losing the complete overview, even when you only present the entities without the attributes. Secondly, I find it more difficult to understand possible relationships that exist between fact tables.

Does a Data Vault model solve this? Yes and no. If you present a complete Data Vault model with all satellites and possible reference tables, you’re lost as well (both IT and business). But if you limit it to the hubs and links only, it becomes much clearer.

I can hear you say already: “this doesn’t help”. Partially you are right. In many cases there is not much difference between a Data Vault and Dimensional Model. Let’s look at the following simple example:

  • Customer
  • Shop
  • Sales

Whereas in a Dimensional Model you would have two dimensions and one fact, in a Data Vault model you would have two hubs, two satellites linked to those hubs, one link and one satellite linked to that link table. Leave out the satellites and you get (basically) the same as with the Dimensional Model: two hubs and one link, representing two dimensions and one fact.
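To make the comparison a bit more tangible, here is a minimal sketch of the Data Vault side of this example. All table and column names are my own assumptions and the structures are simplified (no hash keys, indexes or sequences):

create table hub_customer (
    customer_sk   integer      primary key,   -- surrogate key
    customer_bk   varchar(50)  not null,      -- business key
    load_dts      timestamp    not null,
    record_source varchar(100) not null
);

create table hub_shop (
    shop_sk       integer      primary key,
    shop_bk       varchar(50)  not null,
    load_dts      timestamp    not null,
    record_source varchar(100) not null
);

-- one satellite per hub holds the descriptive attributes (sat_shop omitted, same pattern)
create table sat_customer (
    customer_sk   integer      not null references hub_customer (customer_sk),
    load_dts      timestamp    not null,
    name          varchar(100),
    city          varchar(100),
    record_source varchar(100) not null,
    primary key (customer_sk, load_dts)
);

-- the link captures the (many-to-many) relationship between the hubs
create table link_sales (
    sales_sk      integer      primary key,
    customer_sk   integer      not null references hub_customer (customer_sk),
    shop_sk       integer      not null references hub_shop (shop_sk),
    load_dts      timestamp    not null,
    record_source varchar(100) not null
);

-- the satellite on the link holds the measures
create table sat_sales (
    sales_sk      integer      not null references link_sales (sales_sk),
    load_dts      timestamp    not null,
    sales_amount  decimal(12,2),
    quantity      integer,
    record_source varchar(100) not null,
    primary key (sales_sk, load_dts)
);

Leave out sat_customer, sat_shop (not shown) and sat_sales and what remains is indeed just the two hubs and the link.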

But if you need to introduce a many-to-many relationship between dimensions, there are basically two ways of solving it:

  1. You use a factless fact table to capture that
  2. You alter the grain of an existing fact table by adding the additional dimension

With the second approach you will give yourself a headache when there is already data present, but it can be done.

The first approach, using the factless fact, is much easier. But wait, isn’t that the same as creating another link table between two hubs in a Data Vault model? Sure it is! But to me it feels more natural in a Data Vault model to use a link between hubs than to use a factless fact in a Dimensional Model. The reason for this is purely psychological, because of the terminology: a factless fact. You’re registering a fact without it being a fact. Weird terminology if you ask me. Maybe it should have been called an attribute-less fact.
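For comparison, a sketch of what such a factless fact could look like on the Dimensional side (dimension and column names are again my own assumptions); structurally it is the same idea as the link_sales table sketched above, only expressed with dimension keys instead of hub keys:

-- "factless" fact: nothing but the keys of the related dimensions
create table fact_customer_shop (
    customer_key integer not null,   -- FK to the customer dimension
    shop_key     integer not null,   -- FK to the shop dimension
    primary key (customer_key, shop_key)
);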

So in many cases there may not be much of a difference after all between a Dimensional Model and a Data Vault model, but I find a Data Vault model easier in terms of evolution. The “divide and conquer” principle is much easier to apply to it than to a Dimensional Model.

Another issue that I sometimes encounter with a Dimensional Model is the possibly changing cardinality of a relationship between dimensions. In a true Dimensional Model, snowflaking should be avoided (there are always exceptions), meaning you flatten or denormalize your tables. Great if there is a one-to-many hierarchy present. But a nightmare when this changes to a many-to-many relationship (in which case having snowflaked would give you an easier means to recover).

Getting crappy data from your source in your data warehouse

Let’s be honest, we all have encountered it. If not, let me know. There is a lot of crappy data in source systems. Data that does not represent the cardinality rules given by the business. And all kinds of other data (quality) issues.

Having a Data Vault model with its many-to-many relationships provides a guarantee that you can at least load that crappy data into your data warehouse (maybe with a few exceptions). Having it there will of course still give you a headache when you need to process and present it to the business in a layer more suitable for presentation, either virtualized or with a Dimensional Model on top of your Data Vault.

But it does become much easier to confront the business with the fact that they have crappy data in their source!

I find this easier with a Data Vault model than with an HSA (historical staging area) that is modeled after the source model. In fact, how often haven’t you been in the situation where the source model is pretty much a black box and you only receive extracts from it? In such a case, the HSA is probably modeled after the extract, which may not be the actual source model.

Often when using a Dimensional Model, this crappy data is hidden because it is being cleaned by the (complex) ETL along the way from source to presentation to the business. You lose some track of it and the business is possibly not even aware of it.

But Data Vault does not solve this, it only helps you to make it more visible. In the end, there is still work to be done to clean it, either in the source itself or along the way to the presentation layer (whether that be a Dimensional Model, cube or something else).

Cons of Data Vault

This is probably the part that may get readers and experts “excited”, to say the least 😉 Due to my limited experience, these cons could be false in some cases. Please correct me if I am wrong; I want to learn from the experts in the field.

One of the cons is that Data Vault does indeed result in more tables and possibly more joins, which can make it more complex to maintain from the DBA’s point of view.

Secondly, I do have some doubts about performance as well, but especially (and only) in the following situation: if you create a virtualized Dimensional Model suitable for presentation on top of the Data Vault model using views, and you do this on a plain non-MPP database that doesn’t use column stores. If even a physically implemented Dimensional Model already gives performance issues, then using views with more joins on top of a Data Vault model on the same configuration won’t be any quicker.
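To illustrate where those extra joins come from, here is a sketch of such a virtualized layer on top of the hub/link/satellite tables sketched earlier (names remain my own assumptions, and selecting the current satellite row is simplified to a correlated subquery on load_dts):

create view dim_customer as
select h.customer_sk, h.customer_bk, s.name, s.city
from hub_customer h
join sat_customer s
  on s.customer_sk = h.customer_sk
-- only the most recent satellite row per customer
where s.load_dts = (
    select max(s2.load_dts)
    from sat_customer s2
    where s2.customer_sk = s.customer_sk
);

create view fact_sales as
select l.customer_sk, l.shop_sk, ss.sales_amount, ss.quantity
from link_sales l
join sat_sales ss
  on ss.sales_sk = l.sales_sk
-- only the most recent satellite row per link entry
where ss.load_dts = (
    select max(ss2.load_dts)
    from sat_sales ss2
    where ss2.sales_sk = ss.sales_sk
);

Every dimension or fact the presentation layer queries then carries at least one extra hub-or-link-to-satellite join plus the latest-row filter, which is exactly the extra work a plain row-store database has to do at query time.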

Thirdly… well, this is not related to the Data Vault Model and Methodology as such, but more to the advocacy of it. With any new or merely evolutionary approach, there is a hurdle to overcome. We are afraid of change. Sometimes Data Vault is presented as the holy grail. That’s not true, period. It doesn’t even depend on the context. The holy grail has never been found. Data Vault can help you solve particular issues that we encounter now and maybe in the next ten years. But by then, we may have evolved so much in handling data that even Data Vault doesn’t provide a solution for the issues we encounter.

I also have issues with the continuous hammering on getting certified in Data Vault. What is really the benefit of it? Of course, I can show off with it on my CV. Increase my hourly rate a bit so that I can earn it back. I can see a benefit for Dan and Hans: they make money out of it. Those are valid reasons of course, but do I really get much more knowledge from following the training and certification class than from gaining experience in the field with the theory based on the articles, blogposts, books and (free) advice from certified experts (yes, I did get free advice), and from paying close attention to reviews done by a certified expert?

Conclusion

So what conclusion is there? Did I make a strong point somewhere? No, I just wanted to reflect on the discussion that Frank’s article started.

Data Vault can be useful, for sure, I can see that. But I have doubts as well. The most important thing is that we help our clients and choose the best approach given the context of those clients. Make them evolve.

I hope this post invites you to give your comments on my reflection. Please help me learn and evolve. Correct me if I am wrong etc.

And thanks for coming all the way down here to this last line, it means the post wasn’t boring 😉