The Art of Data Modeling – follow up

In my previous post on data modeling, I wrote that it is an art and not pure science. In this post, I’ll elaborate on that.

Today I gave a small workshop on dimensional modeling for the business intelligence team that I am currently part of. The main objectives of this workshop were to get the team aligned on a standard way of working and to refresh their memory of dimensional modeling.

As preparation I had written a small summary about dimensional modeling, mostly based on Dr. Ralph Kimball’s book The Data Warehouse Toolkit.

Before I got into presenting some example models for the project I’m working on, I had a small Q&A session about the preparation material that I had provided.

Well, I thought it would be small…

I received many design questions that I could not answer… at least not in the way they thought I would, i.e. with a straight answer that would settle it once and for all.

That is exactly what designing a data model is all about, especially when doing dimensional modeling. To quote Ronald Damhof:

it all depends on the context

That is also exactly why there are some drawbacks in dimensional modeling compared to other modeling techniques such as data vault.

Is a particular attribute dimensional or a measure that belongs in the fact table? It just depends on how it will be used. There is no scientific answer to it.
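
To make this concrete, here is a small, made-up sketch (the table and column names are mine, not from the workshop). The same attribute, say a product’s standard cost, can live as a descriptive attribute in a dimension or as a measure in the fact table, depending on whether the business wants to filter and group by it or aggregate it:

-- option 1: standard_cost as a descriptive attribute of the product dimension,
-- used to filter or group ("all products with a standard cost above 100")
create table dim_product (
  product_key   integer primary key,
  product_name  varchar(100),
  standard_cost numeric(10,2)
);

-- option 2: standard_cost captured as a measure in the fact table,
-- used to aggregate ("total standard cost of everything sold today")
create table fact_sales (
  date_key      integer,
  product_key   integer,
  quantity_sold integer,
  standard_cost numeric(10,2)
);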

That’s why I say, data modeling is an art. Art represents the artist’s interpretation or view.

It is also the reason why I prefer data vault over dimensional modeling for the EDW. There is less left up to interpretation. But that’s another story…

The Art of Data Modeling

For reasons still not quite clear to me, I come across a lot of organizations that perform pretty poorly when it comes to data modeling for their business intelligence solutions.

In some cases the organizations don’t even have tools for data modeling. Strangely enough, a lot of them have an Oracle database, so why not use Oracle SQL Data Modeler, which comes free of charge?

Of course the tools are just a means to an end and don’t guarantee that your data model design is correct. But bringing some kind of structure to your data that is visualized in a data model will help you and the stakeholders to determine its correctness.

Data modeling is an art, not pure science. In a follow up post, I’ll elaborate on that.

OmniFocus and Basecamp

My thoughts on the Datavault Certification Training

Last Thursday and Friday, I joined the datavault certification training, organised by Genesee Academy and Centennium BI expertisehuis.

In this post, I want to give my reflections on this training and the certification exam that is part of it. There are good points and some less good points. I will start with a short description of how the training is set up.

Setup of the training

The training starts with a set of online videos that are made available two weeks before the actual two-day classroom training. They give a good introduction and even some in-depth information about Data Vault modeling and methodology.

The two-day classroom training uses a slightly adapted set of the slides that are presented in the online videos. You receive those as handouts in a nice binder.

During the first day, a lot of the slides are presented and additional context is given. Two small modeling cases need to be worked out and are discussed. When time permits, a third – slightly bigger – case is started.

On the second day, the third case is worked out and discussed in detail, as are any questions raised during the first day. The last slides are presented and a recap is given before the 2.5-hour exam starts.

The exam consists of true/false questions, open questions where you have to write down the answer, and some modeling. You need to answer 90% of the questions correctly to be granted the title of “Certified Data Vault Data Modeler”.

How I rate it

In the next sections I’ll give my thoughts on it. Of course this is highly subjective. I should also add that I had already read a lot about Data Vault beforehand and had even started an implementation.

The Bad

Actually, it isn’t that bad; it’s just a figure of speech. Here it comes:

  • the slides in the classroom training don’t add any value over the ones presented in the online videos
  • there is more theoretical information in the online videos than in the classroom training

The Ugly

  • the fact that you need to have a score of 90% to get certified is not mentioned upfront
  • some of the true/false questions in the exam are ambiguous and require you to write down your assumptions to clarify your answer
  • some slides are already a little outdated, given the ongoing insights in the field

The Good

As with any good presentation, it is the material that isn’t in the slides that really adds value. This training is no different: the trainer provides the context around the material in the slides and explains it in detail.

  • in-depth discussion on specific topics based on feedback and questions from the participants
  • the cases provide plenty of room for discussion on how to model in certain circumstances and why to do it like that

Conclusion

The classroom part provides an interaction that can’t be matched by slides or online videos, even though some information is repeated. But that is part of how learning works: through repetition. You do need to watch the online videos beforehand if you are really new to the subject.

I still think it was worth the money.

Summary mindmap of book Rework by 37signals

I’m currently reading the book Rework, written by the guys from 37signals.com, makers of Basecamp and other very nice collaboration tools.

This book is really a must-read for everyone who runs or is planning to run their own business.

Included is an interactive mindmap in PDF format (with embedded Flash, so it needs to be viewed with Adobe Reader) that summarizes the book. I don’t know the original author of this mindmap, but all credit should go to him/her.

Use the download link to get it, because Posterous cannot convert this type of PDF to something viewable.

Download: Rework_by_37_Signals.pdf

Limiting your number of inboxes using ifttt.com and OmniFocus

When applying Getting Things Done, you have at least one inbox to capture your stuff. Most of the time, you probably have more than one, especially digital ones. The more inboxes you have to manage, the less productive you are likely to be.

This post will show some small examples of how ifttt.com can come to the rescue in limiting the number of digital inboxes you have.

About ifttt.com

ifttt.com stands for “if-this-then-that”. It helps you create tasks that are triggered by a particular event. This event can be a mail you receive, a new blog post you wrote, a tweet you favorited, a Google Reader item you starred, and many more. See this great article on Lifehacker.com for more information.

Your OmniFocus inbox setup

There are two ways you can setup your automated OmniFocus inbox:

  1. Using a Mac running as a kind of server with a mail rule (using Mail.app) and running OmniFocus
  2. Using the send-to-omnifocus@omnigroup.com mail address (which requires an additional manual step)

The first method is explained in the OmniFocus help on your Mac.

With the second method, a mail is returned containing a special URL that uses the omnifocus:// scheme. Clicking on it opens OmniFocus (even on your iDevices) and adds the task to your inbox once you confirm. I am using this setup for the moment, but will switch to the first one soon.

Typical use-case scenarios

  1. Add tweets you favorite to your OmniFocus inbox to process them later
  2. Add Google reader items you star to your OmniFocus inbox to process them later

There are many more cases you can think of; just take a look at the possibilities on ifttt.com.

Recipe for scenario 1: in the “To address” field, just fill in send-to-omnifocus@omnigroup.com

The rest is up to you…

How to load a #datavault hub from multiple sources

In this blog post I will describe a few scenarios for loading a Data Vault hub from multiple sources, each with its pros and cons. Many thanks to Martijn Evers (@DM_Unseen) and @RonaldKunenborg for giving their input.

The main reason for this blogpost is to share knowledge about Data Vault, so that it becomes accessible to a broader audience.

Disclaimer: I am not a Certified Data Vault Modeler, but the aforementioned persons are.

Some considerations

One of the “principles” in the Data Vault methodology is that most of the loading can be done in parallel. First the hubs are loaded in parallel, then the links and finally the satellites (although satellites belonging to hubs can of course be loaded as soon as the hubs are loaded; there is no need to wait for the links).

It’s exactly at this point that I was initially somewhat confused about loading a hub that has multiple sources: if you load that hub sequentially, as explained in scenario 2 below, aren’t you defying this principle of parallel loading?

On the other hand, the idea is to load your data as soon as it becomes available. This poses a problem when using a union construct as explained in scenario 1 below. If one of the sources is ready a lot later, you have to sit and wait before you can load. Precious time is lost.

Scenario 1: using a union between sources

In this scenario the business keys from the different sources will be unioned together, while keeping information about the record source of each business key.

The following pseudo-SQL provides the basis for this scenario.

select distinct a.col1 as business_key
     , 'table1.col1'   as record_source
     , load_cycle_ts() as load_dts
from   table1 a
union
select distinct b.col2 as business_key
     , 'table2.col2'   as record_source
     , load_cycle_ts() as load_dts
from   table2 b

Note that the above is not entirely correct, as it can result in duplicate business keys due to the inclusion of the record_source. This can, however, be handled with most ETL tools, or in the SQL itself, as sketched below.
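
As an illustration, and sticking with the pseudo-SQL above, one way to end up with a single row per business key is to wrap the union and let one source win when a key occurs in both. The min() tie-breaker is my own assumption; it only “appoints” table1 as the master here because 'table1.col1' happens to sort before 'table2.col2':

-- sketch only: keep one row per business key, preferring the master source
select all_keys.business_key
     , min(all_keys.record_source) as record_source
     , min(all_keys.load_dts)      as load_dts
from (
       select distinct a.col1 as business_key
            , 'table1.col1'   as record_source
            , load_cycle_ts() as load_dts
       from   table1 a
       union
       select distinct b.col2 as business_key
            , 'table2.col2'   as record_source
            , load_cycle_ts() as load_dts
       from   table2 b
     ) all_keys
group by all_keys.business_key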

A typical setup for the above can be built as a single transformation in Pentaho Data Integration.

Pros

  • All the sources of a particular hub are handled in one transformation, which can give a better overview
  • Principle of parallel loading is maintained

Cons

  • Not easy to generate via automation
  • Difficult to maintain if additional sources are needed
  • Additional steps are needed to prevent duplicate keys
  • Additional constructs are needed to appoint the master source
  • Synchronization between sources is required: you can only start loading once all sources are ready

Scenario 2: sequentially for each source

In this scenario the business keys from the different sources will be loaded sequentially, starting with the master source.

Step 1: master source first

select distinct a.col1 as business_key
     , 'table1.col1'   as record_source
     , load_cycle_ts() as load_dts
from   table1 a

Step 2: next source

select distinct b.col2 as business_key
     , 'table2.col2'   as record_source
     , load_cycle_ts() as load_dts
from   table2 b
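
The selects above only extract the candidate keys. As a sketch of what a complete hub load for one source could look like in plain SQL – the hub table hub_example, its columns and the sequence hub_example_seq are my own assumptions, and most ETL tools would handle the lookup and surrogate key for you – each step boils down to inserting only the keys that are not yet in the hub:

-- hypothetical hub-load statement for the master source; only new business keys are inserted
insert into hub_example (hub_example_sqn, business_key, record_source, load_dts)
select hub_example_seq.nextval
     , src.business_key
     , src.record_source
     , src.load_dts
from (
       select distinct a.col1 as business_key
            , 'table1.col1'   as record_source
            , load_cycle_ts() as load_dts
       from   table1 a
     ) src
where not exists (
       select 1
       from   hub_example h
       where  h.business_key = src.business_key
     )

For the next source, the same statement is repeated with the select on table2, so a business key that was already loaded from the master source is simply skipped.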

Each of these steps can be implemented as its own small transformation in Pentaho Data Integration.

With this kind of setup, you can also use “micro-batches” to load the data. This idea is explained by Martijn Evers (@DM_Unseen) in his reply to my original question about this subject on LinkedIn (you need to be a member of the Data Vault Discussions group to view it).

Pros

  • Can easily be generated via automation
  • Adding new sources is easy
  • None of the Cons of scenario 1

Cons

  • Slightly defies the principle of parallel loading of hubs, but this is really of minor importance

Conclusion

It should be clear that the second scenario, loading the hub sequentially from its sources, is normally the best one to choose. However, to quote @RonaldDamhof, “it depends on the context”. You can always come across a situation where another way is better…

How to link Evernote notes to OmniFocus

Sven Fechner (@simplicitybliss) is one of my major resources when it comes to using OmniFocus. He wrote about linking OmniFocus and Evernote together in at least two blog posts:

  1. Get Evernote and OmniFocus talking
  2. Linking Evernote notes to OmniFocus tasks

In the second post, there is an important comment made by Bryan and Diego: you can now link even more easily by using Evernote’s ability to copy a note link (or several note links at once).

You get this note link by selecting one or more Evernote notes, right-clicking to open the context menu, and choosing “Copy Note Link” (or “Copy Note Links” when more than one note is selected):

Copy Note Links in context menu

I prefer to assign a hot key to this in System Preferences. I use the same key for both the menu item “Copy Note Link” (when one note is selected) and the menu item “Copy Note Links” (when multiple notes are selected).

Short cut keys

After pressing ⌘K, you can paste the links into the note of your OmniFocus task. Simply clicking a link will reveal your Evernote note.

Waiting for DataVault implementation classes

I’m anxious to learn more about the implementation part of a Data Vault, something that is not covered in detail in Dan Linstedt’s book “Supercharge Your Data Warehouse”.

For a while now, Dan Linstedt has been announcing that he is working on classes that cover the implementation. You can register to be notified about them here.

Unfortunately, I keep receiving emails from that list with all kinds of info about Data Vault – but not about the implementation classes – telling me that I can register to be notified when the implementation classes are ready. But that’s why I registered in the first place. If I already did, why is that still in the emails I keep receiving?

Just now, I saw another tweet from him:

Now I find it interesting that he’s working on multiple things, but it would be nice if something actually got finished. The implementation classes are almost ready for production and release. We all know what this means in IT: either something is ready, or it’s not. “Almost” doesn’t exist…

Dan, get those implementation classes out now. People are waiting for them. You can always fine-tune them later…