How to explain data architecture to a teenager

Yesterday I attended the initial awareness session of the “Full Scale” Data Architects. We had an open discussion on what it is, could be or should be.

One of the questions raised during that session and afterwards on LinkedIn was how to explain what we – data architects – do.

Although data architecture and architecture (in construction) have many differences, I still see an analogy.

When asked what I do, I also make that analogy. It doesn’t cover everything, of course, but it is often enough for a first introduction.

I design “something”, make the blueprint and lay the foundation.

And that while taking into account all wishes, (legal) requirements, environmental factors, durability, change and – although in data architecture we try to be as technology-independent as possible – the available “building material”. It’s basically finding the right balance, as Ronald Damhof put it.

In practice the architect may also be the contractor who takes the lead in the construction. This can be an incentive for some but not for others¹.

But I always keep an eye on whether the construction goes according to plan – or delegate that task. When necessary I even change the plan (because of external changes or the available building materials).

I should therefore have an overview and be part of a whole team that I can trust.

And I shouldn’t make it more complex than strictly necessary, certainly not when I try to explain it to someone else.

Of course it definitely helps if you have ever constructed things yourself – and I have – but mainly from the perspective of the problems you can run into. Otherwise you risk starting with a technical bias (yes, it still happens to me occasionally).

  1. Another question was how to make data architecture attractive to teenagers so that they will study it, and whether there are any real study programmes for it.

Impressions of a BIReady demo – DWH automation

This afternoon Gertjan Vlug gave me a demo of BIReady, a product for automating the generation and loading of a data warehouse. In the remainder of this post, I’ll give my impression of its possibilities.

What does it do?

BIReady generates data models and ETL code for:

  • Staging Area
  • Enterprise Data Warehouse (EDW)
  • Data Marts

For the EDW it uses Data Vault modeling, for the Data Marts it uses Dr. R. Kimball’s star schemas.

The ETL code depends on the target database, but is essentially ANSI SQL that is run on the database itself. This means it executes as fast as your database engine can run it; the tool itself only orchestrates the parallelism of the instructions that need to be executed.
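
To give a feel for the style of code such a tool produces (purely my own illustrative sketch, not actual BIReady output), a set-based load of a Data Vault hub could look something like this:

  -- Illustrative sketch only: load new business keys into a hub,
  -- set-based, executed straight on the target database.
  -- (Assuming the hub's surrogate key is generated automatically.)
  INSERT INTO dv.hub_customer (customer_bk, load_dts, record_source)
  SELECT DISTINCT s.customer_id, CURRENT_TIMESTAMP, 'staging.customers'
  FROM   staging.customers s
  WHERE  NOT EXISTS (SELECT 1
                     FROM   dv.hub_customer h
                     WHERE  h.customer_bk = s.customer_id);

Because statements like these are plain set-based SQL, the database does the heavy lifting and the tool only has to decide what can run in parallel.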

What doesn’t it do?

BIReady doesn’t do custom integration, cleansing and other things that cannot be automated easily. However, you can still use existing ETL, data quality and cleansing tools to handle that part.

The starting point: a business data model

Unlike several competing products¹, BIReady uses a business data model as the starting point for generating the data warehouse. This is basically an Entity Relationship Diagram (ERD) in third normal form that reflects the business (data) model. You should not confuse this with a Business Process Model (modelled in BPMN, for example).
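
As a minimal illustration (the table and column names are mine, not taken from the demo), a fragment of such a third normal form business data model could look like this:

  -- Hypothetical 3NF business data model fragment (illustration only).
  CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
  );

  CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    order_date  DATE NOT NULL
  );

The point is that these entities and relationships describe the business, independent of how any particular source system happens to store its data.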

The business data model can be imported from CA ERwin or PowerDesigner. BIReady also has some built-in modeling facilities, but those are of course limited compared to the aforementioned data modeling tools.

When a business data model is not available, you can start by reverse engineering one or more source data models, just like many competing products do.

I like the fact that the business data model is taken as the starting point, because it makes it much more likely that the data will actually be integrated, compared to using source data models. By using a business data model, you bridge the semantic gap that Ronald Damhof referred to in the presentation he gave at the Data Vault Automation conference last year.

Demo on Northwind database

We are all familiar with Microsoft’s Northwind database, which is used in many, many examples. Gertjan used it for his demo. The good thing about it is that it is a well-documented and properly designed (business) data model. Gertjan explained the steps and showed the most important options of the tool, and one and a half hours later the staging area, EDW and data mart were generated and loaded. The reason it took that long was that I kept interrupting him with questions…

Conclusion

I was very impressed by the ease of use and speed. I will get a demo license of the product to play with it a bit, and based on that I will probably write another post with some more details. Contact BIReady for a demo if you want to know more.


  1. Most competing products use the source data models as a starting point.

Data Vault and other approaches: my reflection on Frank Habers’ article

Intro

I’m writing this blog post as an additional comment and reflection on the whole discussion that broke loose as a result of the article that Frank Habers wrote in XR magazine.

Before I continue, I want to make the following very clear:

  • I am an independent BI consultant with almost 15 years of experience
  • Most implementations I did or worked on are using Kimball’s approach
  • I am NOT a Certified Data Vault Modeler, but that does not mean that I haven’t read a lot of material that Dan Linstedt and others wrote about Data Vault (such as “Supercharge Your Data Warehouse”)
  • I have little practical experience in using Data Vault
  • The largest BI implementation (in terms of volume of data) I encountered was for a mobile telephone company
  • I have never (unfortunately) worked with an MPP database
  • I am not trying to sell anyone anything based on this post

My original comment on Frank’s article was only about one specific point he made: the difference in join performance between a Dimensional model and a Data Vault model. I mentioned that it lacked enough clarification to make his point – something he admitted in his own comment and has since clarified further, which I appreciate.

However, the whole discussion in the comments on his article could easily turn into a “war”, which is not very helpful, as Ronald Damhof and Rick van der Lans stated in their comments on Twitter.

I find the article that Dan Linstedt wrote on his own blog to counter Frank’s article also a bit of an overheated response, even though Dan makes it clear that he has nothing against Frank personally. To some extent I can understand that. There is nothing wrong with correcting statements that are false or not entirely true. And of course Data Vault is still Dan’s “baby”, and we all know how we react if someone does something wrong to our children. But I do think that NOT being a Certified Data Vault Modeler doesn’t mean you can’t discuss it or don’t know anything about it. There isn’t such a thing as a Certified Dimensional Modeler either…

But we must make sure we don’t actually start a war. We have done that before with Inmon’s and Kimball’s approaches, and it doesn’t lead anywhere in the end. Having a sound and constructive discussion in which we elaborate on the pros and cons of certain approaches is, however, a good thing. As Ronald Damhof mentioned in his comments, it all depends on the context (of the client).

And whether or not Frank’s article has a commercial background, the approach he discusses is a good one, but again, it depends on the context.

Benefits of Data Vault

Based on my limited experience with Data Vault, there are some benefits I see in its modeling approach that are less obvious in Dimensional Modeling. The whole idea of hubs and links, and the fact that relationships are modeled as many-to-many, helps in at least two ways:

  1. Understanding the business and creating a sound model of the business processes
  2. Getting possibly crappy data from the source into your data warehouse and showing the business that they may have an issue

Note that the above does not mean you can’t accomplish this with Dimensional Modeling. Let me elaborate.

Understanding the business

When discussing business processes and the data that is used or produced by these processes, I have come to the conclusion that a Dimensional Model is fairly easy for the business to understand. However, creating the bus architecture with many fact tables (at the most detailed grain possible) and conformed dimensions can also easily result in losing the overall overview, even when you only present the entities without the attributes. Secondly, I find it more difficult to understand the possible relationships that exist between fact tables.

Does a Data Vault model solve this? Yes and no. If you present a complete Data Vault model with all satellites and possible reference tables, you’re lost as well (both IT and business). But if you limit it to the hubs and links only, it becomes much clearer.

I can already hear you say: “this doesn’t help”. You are partially right. In many cases there is not much difference between a Data Vault and a Dimensional Model. Let’s look at the following simple example:

  • Customer
  • Shop
  • Sales

Whereas in a Dimensional Model you would have two dimensions and one fact, in a Data Vault model you would have two hubs, two satellites linked to those hubs, one link, and one satellite attached to that link table. Leave out the satellites and you get (basically) the same as with the Dimensional Model: two hubs and one link, representing two dimensions and one fact.
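
To make that concrete, here is a minimal sketch of the Data Vault skeleton for this example (all names are my own, and I leave out record sources and other technical columns for brevity):

  -- Illustrative names only. Two hubs for the business keys of Customer and Shop.
  CREATE TABLE hub_customer (
    customer_hk BIGINT PRIMARY KEY,    -- surrogate key
    customer_bk VARCHAR(50) NOT NULL,  -- business key
    load_dts    TIMESTAMP NOT NULL
  );

  CREATE TABLE hub_shop (
    shop_hk  BIGINT PRIMARY KEY,
    shop_bk  VARCHAR(50) NOT NULL,
    load_dts TIMESTAMP NOT NULL
  );

  -- Sales becomes a link between the two hubs (many-to-many by design).
  CREATE TABLE link_sales (
    sales_hk    BIGINT PRIMARY KEY,
    customer_hk BIGINT NOT NULL REFERENCES hub_customer (customer_hk),
    shop_hk     BIGINT NOT NULL REFERENCES hub_shop (shop_hk),
    load_dts    TIMESTAMP NOT NULL
  );

  -- Descriptive attributes live in satellites, for example one on hub_customer.
  -- Similar satellites would hang off hub_shop and off link_sales (the latter
  -- holding the sales measures), but those are not shown here.
  CREATE TABLE sat_customer (
    customer_hk   BIGINT NOT NULL REFERENCES hub_customer (customer_hk),
    load_dts      TIMESTAMP NOT NULL,
    load_end_dts  TIMESTAMP,           -- NULL marks the current row
    customer_name VARCHAR(100),
    customer_city VARCHAR(100),
    PRIMARY KEY (customer_hk, load_dts)
  );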

But if you need to introduce a many-to-many relationship between dimensions, there are basically two ways of solving it:

  1. You use a factless fact table to capture that
  2. You alter the grain of an existing fact table by adding the additional dimension

With the second approach you will give yourself a headache when there is already data present, but it can be done.

The first approach, using the factless fact, is much easier. But wait, isn’t that the same as creating another link table between two hubs in a Data Vault model? Sure it is! But to me it feels more natural to use a link between hubs in a Data Vault model than to use a factless fact in a Dimensional Model. The reason for this is purely psychological, because of the terminology: a factless fact. You’re registering a fact without it being a fact. Weird terminology if you ask me. Maybe it should have been called an attribute-less fact.
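
To show how close these two options really are (again with illustrative names of my own), the “dimensional” and the “Data Vault” way of capturing a many-to-many relationship between customers and shops look almost identical:

  -- Dimensional: a factless fact table between the two dimensions
  -- (illustrative; the key columns would reference dim_customer and dim_shop, not shown here).
  CREATE TABLE fact_customer_shop_registration (
    customer_key INTEGER NOT NULL,
    shop_key     INTEGER NOT NULL,
    PRIMARY KEY (customer_key, shop_key)
  );

  -- Data Vault: just another link between the two hubs from the sketch above.
  CREATE TABLE link_customer_shop_registration (
    customer_shop_hk BIGINT PRIMARY KEY,
    customer_hk      BIGINT NOT NULL REFERENCES hub_customer (customer_hk),
    shop_hk          BIGINT NOT NULL REFERENCES hub_shop (shop_hk),
    load_dts         TIMESTAMP NOT NULL
  );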

So in many cases there may not be much of a difference after all between a Dimensional Model and a Data Vault model, but I find a Data Vault model easier in terms of evolution. “Divide and conquer” is much easier to apply to it than to a Dimensional Model.

Another issue that I sometimes encounter with a Dimensional Model is the possibly changing cardinality of a relationship between dimensions. In a true Dimensional Model, snowflaking should be avoided (there are always exceptions), meaning you flatten or denormalize your dimension tables. That is great when the hierarchy is one-to-many, but a nightmare when it changes into a many-to-many relationship (in which case having snowflaked it would have given you an easier way to recover).

Getting crappy data from your source into your data warehouse

Let’s be honest, we have all encountered it. If not, let me know. There is a lot of crappy data in source systems: data that does not respect the cardinality rules defined by the business, and all kinds of other data (quality) issues.

Having a Data Vault model with its many-to-many relationships provides a guarantee that you can at least load that crappy data into your data warehouse (maybe with a few exceptions). Having it there will of course still give you a headache when you need to process and present it to the business in a layer more suitable for presentation, either virtualized or with a Dimensional Model on top of your Data Vault.

But it does become much easier to confront the business with the fact that they have crappy data in their source!
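
For example, with an (assumed) business rule that every customer should be registered at exactly one shop, a single query on the link table sketched earlier already lists the offenders:

  -- Customers linked to more than one shop, violating the assumed
  -- "one customer, one shop" rule coming from the business.
  SELECT   customer_hk,
           COUNT(DISTINCT shop_hk) AS shop_count
  FROM     link_customer_shop_registration
  GROUP BY customer_hk
  HAVING   COUNT(DISTINCT shop_hk) > 1;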

I find this easier with a Data Vault model than with a historical staging area (HSA) that is modeled after the source model. In fact, how often have you been in the situation where the source model is pretty much a black box and you only receive extracts from it? In such a case, the HSA is probably modeled after the extract, which may not be the actual source model.

Often when using a Dimensional Model, this crappy data is hidden because it is cleaned by the (complex) ETL along the way from source to presentation. You somewhat lose track of it, and the business is possibly not even aware of it.

But Data Vault does not solve this; it only helps you make it more visible. In the end, there is still work to be done to clean it, either in the source itself or along the way to the presentation layer (whether that be a Dimensional Model, a cube or something else).

Cons of Data Vault

This is probably the part that may get readers and experts “excited”, to say the least 😉 Due to my limited experience, these cons could be off the mark in some cases. Please correct me if I am wrong; I want to learn from the experts in the field.

One of the cons is that Data Vault does indeed result in more tables and possibly more joins, which can make it more complex to maintain from the DBA’s point of view.

Secondly, I do have some doubts about performance as well, but specifically (and only) in the following situation: when you create a virtualized Dimensional Model for presentation on top of the Data Vault model using views, and you do this on a plain non-MPP database without column stores. If even a physically implemented Dimensional Model already gives performance issues, then views with more joins on top of a Data Vault model on the same configuration won’t be any quicker.
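
To illustrate what such a virtualized layer looks like, here is a sketch of a customer dimension as a view on top of the Data Vault tables from the earlier example (current satellite rows only, names still my own):

  -- Illustrative virtualized dimension: hub joined to its current satellite rows.
  CREATE VIEW dim_customer_v AS
  SELECT h.customer_hk AS customer_key,
         h.customer_bk AS customer_number,
         s.customer_name,
         s.customer_city
  FROM   hub_customer h
  JOIN   sat_customer s
    ON   s.customer_hk = h.customer_hk
  WHERE  s.load_end_dts IS NULL;  -- assuming end-dated satellite rows

Every extra satellite or link you fold into such a view adds another join, which is exactly where the performance doubts on a plain row-store database come from.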

Thirdly… well, this is not related to the Data Vault model and methodology as such, but more to the way it is advocated. With any new or evolutionary approach, there is a hurdle to overcome: we are afraid of change. Sometimes Data Vault is presented as the holy grail. That’s not true, period. It doesn’t even depend on the context: the holy grail has never been found. Data Vault can help you solve particular issues that we encounter now and maybe in the next ten years. But by then, the way we handle data may have evolved so much that even Data Vault no longer provides a solution for the issues we encounter.

I also have issues with the continuous hammering on getting certified in Data Vault. What is really the benefit of it? Of course, I can show off with it on my CV and increase my hourly rate a bit so that I can earn it back. I can see a benefit for Dan and Hans: they make money out of it. Those are valid reasons of course, but do I really gain much more knowledge from following the training and certification class than from getting field experience with the theory from the articles, blog posts, books and (free) advice from certified experts (yes, I did get free advice), and from paying close attention to reviews done by a certified expert?

Conclusion

So what conclusion is there? Did I make a strong point somewhere? No, I just wanted to reflect on the discussion that Frank’s article started.

Data Vault can be useful, for sure, I can see that. But I have doubts as well. The most important thing is that we help our clients and choose the best approach given the context of those clients. Make them evolve.

I hope this post invites you to give your comments on my reflection. Please help me learn and evolve. Correct me if I am wrong etc.

And thanks for coming all the way down here to this last line, it means the post wasn’t boring 😉

What’s your opinion on feeding a data warehouse from a message queue?

Some organizations have decided that their data warehouse(s) should be fed with data in the same way as the operational systems that are part of their Enterprise Application Integration (EAI) strategy.

Depending on your information needs, this might be a good idea, especially when you have a (near) realtime data warehouse. Short bursts of transactional data can be processed quickly and your data warehouse will be up to date.

But what if your information needs are different? What if you only need a snapshot of the data at the end of the day and just don’t need all changes that happened during the day in your transactional systems?

In a batch-oriented data warehouse that only needs data once a day, or maybe a few times per day but nowhere close to (near) realtime, is having a message queue that feeds it really the way to go?

In one of my previous projects we had a batch-oriented data warehouse that was fed by a message queue. The “data” pushed to it consisted of 75% overhead, just because of all the XML tags that were needed; only 25% was data we really needed. In the end, it was decided to create a special XML schema for us with shorter tags to get rid of the overhead.

Secondly, a buffer area needed to be developed with just one service: create a consolidated daily snapshot of all changes received via the message queue. Without going into all the details of the problems we faced, it was a messy solution and just didn’t feel right.
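
Conceptually, the consolidation itself boils down to keeping the last received change per business key per day. Here is a sketch with hypothetical table and column names (assuming every message from the queue ends up as one row in a change table):

  -- Hypothetical change table filled from the queue: one row per received message.
  -- Keep only the last change per customer per day as the daily snapshot.
  INSERT INTO buffer.customer_daily_snapshot
         (snapshot_date, customer_id, customer_name, customer_city)
  SELECT CAST(c.received_at AS DATE), c.customer_id, c.customer_name, c.customer_city
  FROM   buffer.customer_changes c
  WHERE  c.received_at = (SELECT MAX(c2.received_at)
                          FROM   buffer.customer_changes c2
                          WHERE  c2.customer_id = c.customer_id
                          AND    CAST(c2.received_at AS DATE) = CAST(c.received_at AS DATE));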

Do you recognize this or have any experience with it? Then please leave your opinion and thoughts in the comment section.

Much appreciated and have a nice day.