The “catch” with data warehouse automation tools

While evaluating several data warehouse automation tools, such as BIReady, Quipu and RapidAce, I have come to a conclusion that is crucial to using these tools successfully.

As most of these tools take the source data models as a starting point, you had better make sure those models are correct. Even with BIReady, which takes a “business” model, you need to have a good model. By correctly modeled, I mean that your source is preferably modeled in 3NF. When reverse engineering an existing database model, make sure primary keys and foreign keys are defined.

If not, you can be sure that the generated data warehouse (Data Vault) models are pretty worthless.

I noticed this when using some of the tools on a source model that I have at hand from a client. This model is basically flat, wide files loaded into (flat, wide) tables. Primary keys are sometimes not defined. Foreign keys hardly exist at all. Nothing is normalized.

You can argue whether this is a true source model. It is not, admittedly. But it is all we have, and that is a situation you will probably encounter very often.
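A quick way to see how far a source schema is from that ideal is to list the tables that lack primary or foreign keys before pointing an automation tool at it. The following is a minimal sketch of such a check, not something any of the tools above provide. It assumes a SQLite database purely so the example is self-contained, and the table names (`customer`, `orders`, `flat_export`) and the helper `key_report` are hypothetical. For another database you would query its own catalog instead.

```python
import sqlite3


def key_report(conn):
    """Return {table: (has_primary_key, has_foreign_key)} for every user table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    report = {}
    for table in tables:
        # PRAGMA table_info: the sixth field ("pk") is non-zero for key columns.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        has_pk = any(col[5] > 0 for col in columns)
        # PRAGMA foreign_key_list returns one row per foreign key column.
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        report[table] = (has_pk, len(fks) > 0)
    return report


# Hypothetical source schema: two modeled tables and one flat, wide export.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id)
    );
    CREATE TABLE flat_export (col1 TEXT, col2 TEXT, col3 TEXT);
""")

report = key_report(conn)
for table, (has_pk, has_fk) in sorted(report.items()):
    print(table, "PK" if has_pk else "no PK", "FK" if has_fk else "no FK")
```

A table like `flat_export`, with neither kind of key, gives a generator nothing to derive business keys or relationships from.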

Mac OS automator workflow for getting direct link to @CloudApp uploads

BIReady evaluation continues…

The issue that I had with the ODBC connection has been solved. I was using a 64-bit driver and should have been using the 32-bit driver for MySQL.

Thanks to Jan Vos of BIReady for helping me out! I will now continue my evaluation and post an update soon.

However, I’m under NDA, so I need to check what I can and cannot post here.

BIReady evaluation – first update

Last weekend I received the demo license for BIReady and tried to play with it.

However, I wanted to make it a real case study and not use the included demo databases which were already shown to me.

And there I stumbled on the first issue. BIReady’s repository must be an MS Access database, something that will be solved in a future version according to a mail I received from their support.

So I decided to go for the MS Access repository, but then I got stuck adding a MySQL database as the DWH. Something seems to be wrong with the ODBC connection. Strange, as the connection tests just fine with other tools.

Anyway, I’m waiting for BIReady support to be able to continue.

Until further notice…

Impression BIReady demo – DWH automation

This afternoon Gertjan Vlug gave me a demo of BIReady, a product for automating the generation and loading of a data warehouse. In the remainder of this post, I’ll give my impression of its possibilities.

What does it do?

BIReady generates data models and ETL code for:

  • Staging Area
  • Enterprise Data Warehouse (EDW)
  • Data Marts

For the EDW it uses Data Vault modeling, for the Data Marts it uses Dr. R. Kimball’s star schemas.

The ETL code is dependent on the target database, but is essentially ANSI SQL that is run on the database itself. This means it executes as fast as your database engine can run it. The tool itself only handles the parallelism of the instructions that need to be executed.
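To give an idea of what “SQL that is run on the database itself” looks like, here is a minimal sketch of a set-based load from a staging table into a Data Vault hub. It is not BIReady’s generated code, which I cannot show. The table and column names (`stg_customer`, `hub_customer`, `customer_no`, `load_date`) are assumptions, SQLite stands in for the target database, and the parallel scheduling of statements is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_no TEXT, name TEXT);
    CREATE TABLE hub_customer (
        customer_no TEXT PRIMARY KEY,
        load_date   TEXT NOT NULL
    );
    INSERT INTO stg_customer VALUES ('C1', 'Alice'), ('C2', 'Bob'), ('C1', 'Alice');
""")

# One set-based statement: the database does all the work, and the
# calling program only submits it. Business keys already in the hub are skipped.
LOAD_HUB = """
    INSERT INTO hub_customer (customer_no, load_date)
    SELECT DISTINCT s.customer_no, :load_date
    FROM stg_customer AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM hub_customer AS h WHERE h.customer_no = s.customer_no
    )
"""


def load_hub(conn, load_date):
    """Run the hub load and return the number of rows now in the hub."""
    conn.execute(LOAD_HUB, {"load_date": load_date})
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]


rows_after_first_load = load_hub(conn, "2013-01-01")
rows_after_second_load = load_hub(conn, "2013-01-02")  # re-running adds nothing
print(rows_after_first_load, rows_after_second_load)
```

Because no rows pass through the calling program, the speed of the load depends on the database engine rather than on the tool.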

What doesn’t it do?

BIReady doesn’t do custom integration, cleansing and other things that cannot be automated easily. However, you can still use existing ETL tools, data quality and cleansing tools to handle this part.

The starting point: a business data model

Unlike several other competing products[1], BIReady uses a business data model as a starting point for the generation of the data warehouse. This is basically an Entity Relationship Diagram (ERD) in third normal form that reflects the business (data) model. You should not confuse this with a Business Process Model, which models processes using BPMN.

The business data model can be imported from CA ERWin or PowerDesigner. BIReady also has some built-in modeling facilities, but those are of course limited compared to the aforementioned data modeling tools.

When a business data model is not present, you can start by reverse engineering one or more source data models, just like many of the competing products do.

I like the fact that the business data model is taken as a starting point, because it is much more likely to integrate the data than source data models are. By using a business data model, you are bridging the semantic gap that Ronald Damhof referred to in the presentation he gave at the Data Vault Automation conference last year.

Demo on Northwind database

We are all familiar with Microsoft’s Northwind database that is used in many, many examples. Gertjan used it for his demo. The good thing about it is that it is a well-documented and properly designed (business) data model. Gertjan explained the steps and showed the most important options of the tool, and an hour and a half later the staging area, EDW and data mart were generated and loaded. The reason it took that long was because I interrupted him with some questions…

Conclusion

I was very impressed by the ease of use and speed. I will get a demo license of the product to play a bit with it. Based on that I will probably write another post containing some more details. Contact BIReady for a demo if you want to know more.


  1. Most of the competing products use the source data models as a starting point.

Datawarehouse Automation

Yesterday I attended the DWH Automation conference in Leuven (Belgium), hosted by BI-Community.

Organizers

It was organized by Ronald Damhof and Tom Breur.

Sponsors

Among the sponsors were Centennium, WhereScape, Qosqo, BIReady, Genesee Academy, timeXtender and TripWire Solutions.[1]

General presentations

The presentations given by Ronald Damhof and Tom Breur were largely the same as last year at the Data Vault Automation conference in Utrecht (The Netherlands). They both focus on Agile BI and the importance of DWH automation in Agile BI. In that respect, Data Vault modeling for the Enterprise Data Warehouse (EDW) seems to be the only methodology that truly supports Agile BI, and it is the easiest to automate due to its patterns of hubs, links and satellites.
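To illustrate why those patterns lend themselves to automation, here is a minimal sketch of a template-based generator: one small function per pattern, driven by a bit of metadata. It does not show how any of the products mentioned in this post work. The naming conventions (`hub_`, `link_`, `sat_`, the `_hk` key suffix) and the example entities are my own assumptions, and SQLite is used only to check that the generated DDL is valid.

```python
import sqlite3


def hub_ddl(entity, business_key):
    """Hub: one row per business key."""
    return (
        f"CREATE TABLE hub_{entity} (\n"
        f"    {entity}_hk TEXT PRIMARY KEY,\n"
        f"    {business_key} TEXT NOT NULL UNIQUE,\n"
        f"    load_date TEXT NOT NULL,\n"
        f"    record_source TEXT NOT NULL\n"
        f");"
    )


def link_ddl(name, entities):
    """Link: a relationship between two or more hubs."""
    refs = "".join(
        f"    {e}_hk TEXT NOT NULL REFERENCES hub_{e}({e}_hk),\n" for e in entities
    )
    return (
        f"CREATE TABLE link_{name} (\n"
        f"    {name}_hk TEXT PRIMARY KEY,\n"
        f"{refs}"
        f"    load_date TEXT NOT NULL,\n"
        f"    record_source TEXT NOT NULL\n"
        f");"
    )


def satellite_ddl(entity, attributes):
    """Satellite: descriptive attributes of a hub, historized by load date."""
    cols = "".join(f"    {name} {sql_type},\n" for name, sql_type in attributes)
    return (
        f"CREATE TABLE sat_{entity} (\n"
        f"    {entity}_hk TEXT NOT NULL REFERENCES hub_{entity}({entity}_hk),\n"
        f"    load_date TEXT NOT NULL,\n"
        f"{cols}"
        f"    record_source TEXT NOT NULL,\n"
        f"    PRIMARY KEY ({entity}_hk, load_date)\n"
        f");"
    )


# Hypothetical metadata: two entities, one relationship, one set of attributes.
model_ddl = [
    hub_ddl("customer", "customer_no"),
    hub_ddl("product", "product_code"),
    link_ddl("purchase", ["customer", "product"]),
    satellite_ddl("customer", [("name", "TEXT"), ("birthdate", "TEXT")]),
]

conn = sqlite3.connect(":memory:")
for statement in model_ddl:
    conn.execute(statement)  # fails loudly if a template produces invalid SQL

created = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(created)
```

Every hub, link and satellite has the same shape, so adding an entity means adding metadata, not writing new code.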

Hans Hultgren from Genesee Academy gave a very interesting presentation about the meaning of Data Warehousing. Nowadays there are a lot of different terms in Data Warehousing, and some of these have different meanings depending on who you ask to define them. He stressed the importance of talking about the meaning of a term, instead of the term itself. Depending on the meaning, several layers can be defined in a Data Warehouse solution, each of which has a specific purpose that can be (partially) automated or not.

Frederik Naessens from K25 gave a short presentation on how to use the ERWin data modeling tool to generate the various models, such as a Data Vault model for the EDW. It’s a poor man’s solution aimed at creating awareness of the need for DWH automation tools.

Product presentations

The following companies presented their DWH Automation solution with “SlideWare”[2]:

  • TripWire Solutions
  • DWhite
  • BIReady

While the other companies gave a live demo of their products:

  • WhereScape
  • Qosqo

TripWire Solutions

Dirk Vermeiren from TripWire Solutions focuses on Oracle and presented their accelerators. They gave a complete overview of the layers they implement in a Data Warehouse solution and which of those layers can be automated. Data Vault modeling is used for the layer that contains the EDW.

My 2 cents: looks promising for a specific market (Oracle).

DWhite

DWhite presented a solution that is not yet commercialized, because it is still in the works, but already used at a particular client. It is focused on Microsoft BI at the moment, but should support “everything” independent of the modeling used.

My 2 cents: I got the impression it is a one-man show and the goal is set pretty high, so I don’t think this will make it.

BIReady

Gertjan Vlug from BIReady, as the last presenter of the day, decided to give a kind of wrap-up of the day and explained where their product fits in. They are one of the pioneers, and their product focuses on using a business model from which the rest can be automated, instead of using source models. BIReady can handle any type of modeling and can (but does not have to) use Data Vault.

My 2 cents: I want to see a demo, looks very promising and I really like the fact that it starts with a business model instead of a source model.

WhereScape

Robert gave a stunning Star Wars introduction that made sure that he got everyone’s attention. It was funny but still hit the spot. After that Terry took over and gave a small demo of WhereScape 3D and WhereScape RED. They had been demoing already at their booth, so it was kept short.

WhereScape RED is a stunning product. It really seems to do it all. It also takes care of the ETL itself, scheduling etc.

My 2 cents: WhereScape really knows its business and has a great product.

Qosqo

Jeroen Klep from Qosqo gave a demo of their Quipu product. It is open source and can be adapted to your needs by changing the templates. It is not meant to replace it all, but to be complementary to investments already made. Quipu is still young, but also looks very promising. It can automate design of staging, EDW and data marts. The EDW is based on Data Vault.

My 2 cents: Quipu has to be taken seriously and could become a true competitor for the other players such as WhereScape and BIReady.

Conclusion

Agile BI is not only “hot”, but also necessary in a changing world, and you cannot get there without automating large parts of it. There are some great players in the market that can help you with that.


  1. Don’t shoot me if I forgot one of the sponsors 

  2. They did not give a live demo, but only presented the solution using slides. 

Interpretation in dimensional modeling vs data vault

In my previous post I mentioned that there is less interpretation for the designer in data vault modeling than in dimensional modeling.

Let me elaborate on that with an example.

Dimensional modeling

The question that I received in the workshop was what to do with the age of the customer at time of the transaction. Is it a fact or a dimensional attribute?

The age is something that could be calculated by using the customer’s birthdate and the date of the transaction. In this case, the dimensional attribute would be birthdate. The fact table should hold a reference to the time dimension representing the transaction date.

Depending on the number of fact records however, there could be a negative impact on performance when calculating the age on the fly, but this can probably be neglected.

But what if the age is supposed to be used to determine if it falls in a certain range? What if these ranges are variable and only known at query time? And what about the average age of the customers buying a particular product within a certain time frame?

Could the age be a fact measure in those cases?

Some designers would argue that it is a derived measure that doesn’t need to be stored in your fact table. I agree, but it does require your query reporting tool to be able to handle it all.

Others would argue that the age is a measure that you had better store in the fact table. It wouldn’t exactly be an additive measure, but at least you could answer the “average age” question somewhat more easily.

You could even argue that it is both something dimensional and factual. And that’s true as well. The birthdate of the customer is dimensional and the age could be used for filtering facts.

It should be clear that it depends on the context. If there is no business question about the age yet, I would at least store the birthdate as a dimensional attribute of the customer. Adapting the fact table can be done later when the question arises, but it does require reengineering.
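To make the “derived measure” option concrete, here is a minimal sketch that calculates the age on the fly from the birthdate (dimension attribute) and the transaction date, and then answers the range and average questions at query time. The sample rows, the 18 to 35 range and the helper `age_at` are made up for illustration. In practice this logic would live in your query or reporting tool.

```python
from datetime import date
from statistics import mean


def age_at(birthdate, on_date):
    """Age in full years on a given date."""
    had_birthday = (on_date.month, on_date.day) >= (birthdate.month, birthdate.day)
    return on_date.year - birthdate.year - (0 if had_birthday else 1)


# Hypothetical fact rows joined to the customer dimension:
# (customer birthdate, transaction date)
sales = [
    (date(1980, 6, 15), date(2013, 6, 14)),   # day before the 33rd birthday
    (date(1980, 6, 15), date(2013, 6, 15)),   # on the 33rd birthday
    (date(1995, 1, 1), date(2013, 3, 1)),
    (date(1950, 12, 31), date(2013, 1, 1)),
]

# Derived at query time, nothing stored in the fact table.
ages = [age_at(birthdate, sold_on) for birthdate, sold_on in sales]

# The range is only known at query time, so it is just a parameter here.
low, high = 18, 35
ages_in_range = [age for age in ages if low <= age <= high]

average_age = mean(ages)

print(ages)           # [32, 33, 18, 62]
print(ages_in_range)  # [32, 33, 18]
print(average_age)    # 36.25
```

Storing the age in the fact table instead would replace the `age_at` call with a plain column, at the cost of the reengineering mentioned above.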

Data vault modeling

In data vault modeling you don’t suffer from this interpretation problem. The birthdate of the customer is a satellite attribute linked to your customer hub. There is no question about it.

You divide and conquer.

The interpretation will only be needed once you get the business question. Based on that, you can determine how to model it in the presentation layer, whether it be dimensional or any other form needed.

Wait, did you hear me mention something about Agile BI here? Well, sort of 😉