Data Vault and other approaches, my reflection on Frank Habers’ article

Intro

I’m writing this blog post as an additional comment and reflection on the whole discussion that broke loose as a result of the article that Frank Habers wrote in XR magazine.

Before I continue, I want to make the following very clear:

  • I am an independent BI consultant with almost 15 years of experience
  • Most implementations I did or worked on are using Kimball’s approach
  • I am NOT a Certified Data Vault Modeler, but that does not mean that I haven’t read a lot of material that Dan Linstedt and others wrote about Data Vault (such as “Supercharge Your Data Warehouse”)
  • I have little practical experience in using Data Vault
  • The largest BI implementation (in terms of volume of data) I encountered was for a mobile telephone company
  • I have never (unfortunately) worked with an MPP database
  • I am not trying to sell anyone anything based on this post

My original comment on Frank’s article addressed only one specific point: his claim about the difference in performance between a Dimensional and a Data Vault model with respect to joins. I mentioned that the claim lacked the clarification needed to support it. He admitted this in his own comment and has already added some clarification, which I appreciate.

However, the whole discussion in the comments on his article could easily turn into a “war”, which is not very helpful, as Ronald Damhof and Rick van der Lans stated in their comments on Twitter.

I also find the article that Dan Linstedt wrote on his own blog to counter Frank’s article a bit of an overheated response, even though Dan makes it clear that he has nothing against Frank personally. To some extent I can understand that. There is nothing wrong with correcting statements that are false or not entirely true. And of course Data Vault is still Dan’s “baby” and we all know how we react if someone does something wrong to our children. But I do think that NOT being a Certified Data Vault Modeler doesn’t mean you can’t discuss it or don’t know anything about it. There isn’t such a thing as being a Certified Dimensional Modeler either…

But we must make sure we don’t actually start a war. We have done that before with Inmon’s and Kimball’s approaches. It doesn’t lead anywhere in the end. Having a sound and constructive discussion in which we elaborate on the pros and cons of certain approaches is, however, a good thing. As Ronald Damhof mentioned in his comments, it all depends on the context (of the client).

And whether Frank’s article may have some commercial background in it or not, the approach he discusses is a good approach, but again, it depends on the context.

Benefits of Data Vault

Based on my limited experience with Data Vault, there are some benefits that I can see in its modeling aspect that are less obvious in Dimensional Modeling. The whole idea of having hubs and links and the fact that you have many-to-many relationships does help in at least two ways:

  1. Understanding the business and creating a sound model of the business processes
  2. Getting possibly crappy data from the source into your data warehouse and showing the business that they may have an issue

Note that the above does not mean you can’t accomplish this with Dimensional Modeling. Let me elaborate.

Understanding the business

When discussing business processes and the data that is used or produced by these processes, I have come to the conclusion that a Dimensional Model is fairly easy for the business to understand. However, creating the bus architecture with many fact tables (at the most detailed grain possible) and conformed dimensions can easily result in losing the overview, even when you only present the entities without the attributes. Secondly, I find it more difficult to understand the possible relationships that exist between fact tables.

Does a Data Vault model solve this? Yes and no. If you present a complete Data Vault model with all satellites and possible reference tables, you’re lost as well (both IT and business). But if you limit it to the hubs and links only, it becomes much clearer.

I can hear you say already: “this doesn’t help”. Partially you are right. In many cases there is not much difference between a Data Vault and Dimensional Model. Let’s look at the following simple example:

  • Customer
  • Shop
  • Sales

Whereas in a Dimensional Model you would have two dimensions and one fact, in a Data Vault model you would have two hubs, two satellites linked to those hubs, one link and one satellite linked to that link table. Leave out the satellites and you get (basically) the same as with the Dimensional Model: two hubs and one link, representing two dimensions and one fact.
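To make the Customer, Shop and Sales example a bit more tangible, here is a minimal sketch in Python using an in-memory SQLite database. All table and column names are made up for illustration, and a real Data Vault would add more standard columns (such as record sources), so treat this as a rough picture rather than a reference model.

```python
import sqlite3

# Hypothetical table and column names, chosen only for this illustration.
DDL = """
CREATE TABLE hub_customer (customer_id INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE);
CREATE TABLE hub_shop     (shop_id     INTEGER PRIMARY KEY, shop_bk     TEXT UNIQUE);
CREATE TABLE sat_customer (customer_id INTEGER REFERENCES hub_customer, load_date TEXT, name TEXT);
CREATE TABLE sat_shop     (shop_id     INTEGER REFERENCES hub_shop,     load_date TEXT, city TEXT);
CREATE TABLE link_sale    (sale_id     INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES hub_customer,
                           shop_id     INTEGER REFERENCES hub_shop);
CREATE TABLE sat_sale     (sale_id     INTEGER REFERENCES link_sale, load_date TEXT, amount REAL);
"""

def build_model() -> sqlite3.Connection:
    """Create the example Data Vault tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn

def tables_by_role(conn: sqlite3.Connection) -> dict:
    """Group table names by their prefix: hub, link or sat."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    roles = {"hub": [], "link": [], "sat": []}
    for (name,) in rows:
        roles[name.split("_")[0]].append(name)
    return roles

conn = build_model()
roles = tables_by_role(conn)
print({role: len(names) for role, names in roles.items()})
# {'hub': 2, 'link': 1, 'sat': 3}
```

Ignoring the three satellites leaves two hubs and one link, which is the same shape as two dimensions and one fact.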

But if you need to introduce a many-to-many relationship between dimensions, there are basically two ways of solving it:

  1. You use a factless fact table to capture that
  2. You alter the grain of an existing fact table by adding the additional dimension

With the second approach you will give yourself a headache when there is already data present, but it can be done.

The first approach, using the factless fact, is much easier. But wait, isn’t that the same as creating another link table between two hubs in a Data Vault model? Sure it is! But to me it feels more natural in a Data Vault model to use a link between hubs than to use a factless fact in a Dimensional Model. The reason for this is purely psychological and comes from the terminology: a factless fact. You’re registering a fact without it being a fact. Weird terminology if you ask me. Maybe it should have been called an attribute-less fact.

So in many cases there may not be much of a difference after all between a Dimensional Model and a Data Vault model, but I find a Data Vault model easier in terms of evolution. The “divide and conquer” principle is much easier to apply to it than to a Dimensional Model.

Another issue that I sometimes encounter with a Dimensional Model is the possibly changing cardinality of a relationship between dimensions. In a true Dimensional Model, snowflaking should be prevented (there are always exceptions), meaning you flatten or denormalize your table. Great if there is a hierarchy present that is one-to-many. But a nightmare when this changes to a many-to-many relationship (in which case having snowflaked it would give you easier means to recover).

Getting crappy data from your source in your data warehouse

Let’s be honest, we all have encountered it. If not, let me know. There is a lot of crappy data in source systems. Data that does not represent the cardinality rules given by the business. And all kinds of other data (quality) issues.

Having a Data Vault model with its many-to-many relationships provides a guarantee that you can at least load that crappy data into your data warehouse (maybe with a few exceptions). Having it there will of course still give you a headache when you need to process and present it to the business in a layer more suitable for presentation, either virtualized or with a Dimensional Model on top of your Data Vault.

But it does become much easier to confront the business with the fact that they have crappy data in their source!
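As a small illustration of that last point, here is a sketch in Python with SQLite. The link table, the business keys and the business rule (“a customer belongs to exactly one home shop”) are all assumptions made up for this example. The link table accepts the offending rows without complaint, and a simple query then shows the business where their rule is broken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical link table. The assumed business rule says a customer
# belongs to exactly one home shop, but the link does not enforce it.
conn.execute(
    "CREATE TABLE link_customer_shop (customer_bk TEXT, shop_bk TEXT, "
    "PRIMARY KEY (customer_bk, shop_bk))"
)

# Made-up source rows: customer C1 appears with two shops.
source_rows = [("C1", "S1"), ("C2", "S1"), ("C1", "S2")]
conn.executemany("INSERT INTO link_customer_shop VALUES (?, ?)", source_rows)

def rule_violations(conn):
    """Return customers linked to more than one shop, with the shop count."""
    return conn.execute(
        "SELECT customer_bk, COUNT(*) FROM link_customer_shop "
        "GROUP BY customer_bk HAVING COUNT(*) > 1 ORDER BY customer_bk"
    ).fetchall()

print(rule_violations(conn))  # [('C1', 2)]
```

In a flattened dimension with a single shop column per customer, one of those two rows would have had to be dropped or overwritten during loading.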

I find it easier with a Data Vault model than with an HSA (historical staging area) that is modeled after the source model. In fact, how often have you been in the situation where the source model is pretty much a black box and you only receive extracts from it? In such a case, the HSA is probably modeled after the extract, which may not be the actual source model.

Often when using a Dimensional Model, this crappy data is hidden because it is being cleaned by the (complex) ETL along the way from source to presentation to the business. You lose some track of it and the business is possibly not even aware of it.

But Data Vault does not solve this, it only helps you to make it more visible. In the end, there is still work to be done to clean it, either in the source itself or along the way to the presentation layer (whether that be a Dimensional Model, cube or something else).

Cons of Data Vault

This is probably the part that may get readers and experts “excited”, to say the least 😉 Due to my limited experience, these cons could be false in some cases. Please correct me if I am wrong, I want to learn from the experts in the field.

One of the cons is that Data Vault indeed does result in more tables and possibly more joins, which can make it more complex to maintain from the DBA’s point of view.

Secondly, I do have some doubts about performance as well, but especially (and only) in the following situation: you create a virtualized Dimensional Model suitable for presentation on top of the Data Vault model using views, and you do this on a plain non-MPP database that doesn’t use column stores. If even a physically implemented Dimensional Model already gives performance issues, then using views with more joins on top of a Data Vault model on the same configuration won’t be any quicker.
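For readers who haven’t seen such a virtualized layer, here is a minimal sketch in Python with SQLite. The table names, columns and sample rows are invented for the example. The view presents a customer dimension by joining each hub row to its most recent satellite row, which hints at where the extra work comes from: every query against the view repeats the join and the “latest row” lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_id INTEGER PRIMARY KEY, customer_bk TEXT);
CREATE TABLE sat_customer (customer_id INTEGER, load_date TEXT, name TEXT);

-- Virtual dimension: each hub row joined to its most recent satellite row.
CREATE VIEW dim_customer AS
SELECT h.customer_id, h.customer_bk, s.name
FROM hub_customer h
JOIN sat_customer s ON s.customer_id = h.customer_id
WHERE s.load_date = (SELECT MAX(load_date) FROM sat_customer
                     WHERE customer_id = h.customer_id);

-- Made-up sample data: one customer whose name was corrected later.
INSERT INTO hub_customer VALUES (1, 'C1');
INSERT INTO sat_customer VALUES (1, '2012-01-01', 'Jon');
INSERT INTO sat_customer VALUES (1, '2012-02-01', 'John');
""")

rows = conn.execute("SELECT customer_bk, name FROM dim_customer").fetchall()
print(rows)  # [('C1', 'John')]
```

A physical dimension table would pay this cost once at load time instead of at every query, which is the trade-off I have doubts about on a plain row-store database.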

Thirdly… well, this is not related to the Data Vault Model and Methodology as such, but more to the advocacy of it. With any new or just evolutionary approach, there is a hurdle to tackle. We are afraid of change. Sometimes Data Vault is presented as the holy grail. That’s not true, period. It doesn’t even depend on the context. The holy grail has never been found. Data Vault can help you solve particular issues that we encounter now and maybe in the next ten years. But by then, we may have evolved in such a way in handling data, that even Data Vault doesn’t provide a solution for the issues we encounter.

I also have issues with the continuous hammering on getting certified in Data Vault. What is really the benefit of it? Of course, I can show off with it on my CV and increase my hourly rate a bit so that I can earn it back. I can see a benefit for Dan and Hans: they make money out of it. Those are valid reasons of course, but do I really gain much more knowledge by following the training and certification class than by getting experience in the field, with the theory based on the articles, blog posts, books and (free) advice from certified experts (yes, I did get free advice), and by paying close attention to reviews done by a certified expert?

Conclusion

So what conclusion is there? Did I make a strong point somewhere? No, I just wanted to reflect on the discussion that Frank’s article started.

Data Vault can be useful, for sure, I can see that. But I have doubts as well. The most important thing is that we help our clients and choose the best approach given the context of those clients. Make them evolve.

I hope this post invites you to give your comments on my reflection. Please help me learn and evolve. Correct me if I am wrong etc.

And thanks for coming all the way down here to this last line, it means the post wasn’t boring 😉

 

 

Kanban With Evernote: A Household Example

In my previous article Setting Up Kanban With Evernote I wrote about a simple setup for Kanban using Evernote.

In that article I didn’t give all the details on how you can eventually use such a setup and what it really looks like. In this article I will go a little further and give an example with screenshots, and I will share that notebook for public viewing.

For the examples I will use the Evernote web interface, but you can also do this with the desktop or mobile clients.

Assumptions

The workflow consists of the following states (represented by tags):

  • todo
  • doing
  • done

This household consists of three people (represented by tags):

  • John
  • Mary
  • Junior

To make it even a little bit more interesting, I will introduce some “areas of responsibility”, also represented by tags:

  • Cleaning
  • Payments
  • Shopping

Why not make it even a bit more interesting and add some contexts borrowed from David Allen’s GTD? A context can be a location or a tool you need to accomplish the task (in fact a context can be much more than just that, but I keep it simple for this example):

  • @Hardware Store
  • @Supermarket
  • @Home
  • @Computer

Your setup will look like this:

(Screenshot: Setup tags)

Note that I put the tags in groups of tags. This is not necessary, I did it just for illustration purposes to make things clearer.

Workflow

As mentioned before, the workflow is simple in this case and a task will go through the following states, in the order specified:

  1. todo
  2. doing
  3. done

Creating tasks

You can enter a new task by simply creating a new note and giving it a title that describes what needs to be done. You assign it the todo tag, possibly the tag of the person who needs to do it (if known upfront), and a context tag if you know upfront where you need to do it or what tool you need.

The following example shows “Buy bread”, which is assigned the following tags:

  • todo
  • @Supermarket
  • Shopping

(Screenshot: Entering a task)

As you can see, you still need to buy the bread, you need to do it at the supermarket and the area of responsibility is shopping. Anyone can do it, as you haven’t assigned it to anyone in particular.

Now enter some other tasks. I will not give all the details here in the text, but you will be able to see them in the next screenshot:

(Screenshot: Tasks in snippet view)

However, to have a better overview in the web client, choose the View Options in the notes and show them as a list. This will immediately show the tags assigned to the notes as well, as can be seen below:

(Screenshot: Tasks in list view)

You can see, however, that with long tag names not all tags may show. For example, for Buy hammer you don’t see the todo tag. I haven’t been able to change the width of the columns in the web interface, but there are alternatives that I will address later.

Doing tasks: changing the tags

When someone in the household is ready to start a task, it merely involves changing the note’s tags.

When John picks up the Pay bills task, the todo tag is removed from the note and the doing tag is assigned instead. When the task is done, the doing tag is removed and replaced with the done tag.

The Buy bread task hadn’t been assigned to a specific person upfront, so anyone in the household can do it. If Mary decides to do it, she assigns her own tag Mary to it and changes the todo tag into doing.
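The tag changes described above boil down to a tiny state machine. The following Python sketch models a note’s tags as a plain set; it does not talk to Evernote at all, and the function name advance is just something I made up for the illustration.

```python
WORKFLOW = ["todo", "doing", "done"]

def advance(tags, person=None):
    """Return a new tag set with the state tag moved one step forward.

    If a person is given, that person's name tag is added as well,
    which mirrors someone picking up an unassigned task.
    """
    tags = set(tags)
    current = [state for state in WORKFLOW if state in tags]
    if len(current) != 1:
        raise ValueError("a task must carry exactly one state tag")
    position = WORKFLOW.index(current[0])
    if position == len(WORKFLOW) - 1:
        raise ValueError("task is already done")
    tags.remove(current[0])
    tags.add(WORKFLOW[position + 1])
    if person is not None:
        tags.add(person)
    return tags

# Mary picks up the unassigned "Buy bread" task.
buy_bread = {"todo", "@Supermarket", "Shopping"}
buy_bread = advance(buy_bread, person="Mary")
print(sorted(buy_bread))  # ['@Supermarket', 'Mary', 'Shopping', 'doing']
```

In Evernote itself you do exactly this by hand: remove one state tag, add the next one, and add your name tag if the task had no owner yet.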

More advanced views

You can use Evernote’s standard features to have more control over your workflow, by filtering the notes on one or more specific tags.

Let’s assume that John finished paying the bills and that Mary is buying the bread. Filtering on the todo tag will now show only the following tasks:

  1. Buy hammer
  2. Do homework for school
  3. Clean shower in bathroom

(Screenshot: Remaining todo)

Likewise, when you filter on the doing tag, it will show only the Buy bread task:

(Screenshot: Tasks in progress)

And when you filter on the done tag, you would only see the Pay bills task (not shown here).

Suppose one of the members of the household wants to see which tasks are still todo and are either assigned to him/her or not assigned to anyone specific (i.e. available to anyone). This is a very valid use case. Let’s say Mary wants to see this.

The filter for this is easy to set up and will show todo tasks not assigned to John or Junior (i.e. assigned to Mary or to no one):

  • Notebook:“Household Kanban”
  • Tag:todo
  • -Tag:John
  • -Tag:Junior

(Screenshot: Remaining tasks that Mary can decide to do)
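The logic of this filter is simply “must have todo, must not have John or Junior”. Here is a small Python sketch of that logic, again with tags modeled as sets. The tag assignments for the tasks other than Buy bread are my assumptions for the example (the details are only visible in the screenshots), and the helper name matches is made up.

```python
def matches(tags, required=(), excluded=()):
    """True if all required tags are present and no excluded tag is."""
    return all(t in tags for t in required) and not any(t in tags for t in excluded)

# Assumed tag assignments, for illustration only.
tasks = {
    "Buy hammer": {"todo", "@Hardware Store", "Shopping", "John"},
    "Do homework for school": {"todo", "@Home", "Junior"},
    "Clean shower in bathroom": {"todo", "@Home", "Cleaning"},
    "Buy bread": {"doing", "@Supermarket", "Shopping", "Mary"},
    "Pay bills": {"done", "@Computer", "Payments", "John"},
}

for_mary = [
    title
    for title, tags in tasks.items()
    if matches(tags, required=["todo"], excluded=["John", "Junior"])
]
print(for_mary)  # ['Clean shower in bathroom']
```

Excluding the other people’s tags, rather than requiring Mary’s tag, is what keeps the unassigned tasks in the result.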

As these types of filters will be used often, I recommend storing them as a Saved Search in Evernote, so that you can easily apply them again without having to write them from scratch.

(Screenshot: Saved search)

Conclusion

This is just a simple setup, but it gives enough hints for further extension and other applications.

I have shared the notebook for this setup publicly for viewing only. This means that you won’t be able to create, modify or delete notes/tasks. This notebook will remain shared until the end of March 2012.

It is shared via the following public URL: http://www.evernote.com/pub/estrenuo/householdkanban

One more thing

Want to have a “real” Kanban board like view? Try something like the following… ;–)

(Screenshot: Using multiple browsers for a Kanban board view)

Setting Up Kanban With Evernote

This article describes how you can use Evernote to set up a simple, yet easy to use Kanban “system” to manage your projects, workflows and tasks using (shared) notebooks, tags and notes. For optimal use at least one premium subscriber Evernote account is needed.

Evernote was not intended to be used for this, so there are some drawbacks of course. The most important drawback is that you will be missing the typical visual representation of a Kanban board, with its vertical lanes that represent a state in the workflow.

What is Kanban?

Kanban, very simply put, is a way to manage and optimize workflow. It was originally invented by Toyota for their manufacturing, but nowadays it is also applied to software development and other kinds of processes, such as household tasks.

For more details, just search for Kanban or use Wikipedia as a starting point to learn more about it. I strongly suggest you do some initial reading on this, so that you can easily understand the rest of the article and see the benefit of a setup using Evernote.

Minimalistic setup in Evernote

The most minimalistic setup is just for one person. This can be a free account, but in that case the normal limitations apply. With a free account you can only attach PDFs and images to a note. With a premium subscriber account you can also attach Word documents and basically any other type of attachment. PDFs will be searchable and even text in images is searchable.

What do you need?

You need the following:

  • an Evernote account (free or premium)
  • a single (synchronized) notebook
  • tags to represent stages in the workflow, such as for example:
    • todo
    • doing
    • done
  • notes representing tasks (these are the Kanban cards)

The setup described will work with any modern browser. You can also use any of the Evernote desktop clients (Mac/Windows) or one of the mobile apps (iPhone, Android, BlackBerry, Windows Phone).

Note that a synchronized notebook is not the same as a shared notebook. A synchronized notebook created in one of the desktop clients syncs with your online Evernote account. With the desktop client you can also create local notebooks, however. These notebooks are not synced with your online Evernote account and will not be accessible with a browser or one of the mobile apps.

How does it work?

The notebook you create basically represents the Kanban board, but without the same visual representation of it. It is the placeholder for your notes that represent the Kanban cards, where each note/card represents a task or any kind of item that you want it to represent, as long as it fits within the Kanban way of working.

Once you created the notebook, you can start adding notes that represent your tasks, such as:

  • buy bread
  • bring out the trash
  • clean garage

Each of these notes will be assigned one or more tags. Of the example tags given above, a task can only have one at a time, because the states of the workflow are mutually exclusive.

Initially, assuming you aren’t doing any of those tasks yet, all these notes will be tagged with todo. When you then decide to take up a task, you change the note, remove the todo tag and assign it the doing tag. And when you’re done, well, you remove the doing tag and assign it the done tag. After a while, you can decide to remove the notes that have the done tag, as you may not want to keep those forever.

Based on your tags, you can easily see in which state a particular note is and when it may be ready to be pulled into the next state of your workflow.

That’s it!

A more advanced setup: other people in the game

Setting up a Kanban approach just for you is nice, but could be a bit of overkill. It is very useful however when more people come into play. In case of the household related tasks given earlier, it might be that other people in your household/family add new tasks or do them. So how would you do that?

One “shared” account

The simplest setup here is to use just one Evernote account that is shared with the other people in your household. They all know the account user name and password. You just create an extra set of tags representing the names of your household/family members, for example:

  • John
  • Mary
  • Junior

When you create a new task note, you assign it both the todo and one of the name tags if you already know upfront who is supposed to do it. But you can also leave it “blank”, i.e. you don’t assign a name tag to it, meaning that anyone can do it. In that case, if someone picks it up to do it, he/she would remove the todo tag and assign it the doing tag and his/her name tag, for example John.

Sounds easy, doesn’t it?

However, there are situations where you don’t want the other people to use your account. You may have other notebooks in your account that you don’t want other people peeking into, not even when they are your family members. Even if there is nothing confidential, there is always the risk that another member deletes notes or changes them just for fun (you don’t hear me laughing, however).

But there is an alternative to that, just read on…

One premium subscriber account and multiple other accounts

In this case each person involved needs his/her own Evernote account, but one of them needs to be a premium subscriber. The reason for this is that only a premium subscriber account can share a notebook with individuals who are then able to create, modify or delete notes in that shared notebook. A free account can only share a notebook for viewing, which is not what you want in this case.

So how does this work?

The premium subscriber account needs to create a notebook as normal and then share it with individuals. Basic information on how to share a notebook from the desktop client of Evernote can be found here.

When you want to share a notebook from the desktop client, right click on it and choose to share it. You will be presented with the following screen (or something similar):

(Screenshot: Sharing Notebooks)

Now choose Share with individuals and enter the email addresses of the persons you want to share the notebook with.

Don’t forget to check the Modify this notebook setting and the Require log in to Evernote setting:

(Screenshot: Settings shared notebook)

The invited people will receive an email with a link to the shared notebook, which they can either access online with a browser or integrate within the desktop client. Note that if you want to access this shared notebook with one of the mobile apps, you need to integrate it in the desktop client first and sync, otherwise it won’t show up. Further details are left up to the reader to find out.

That’s all!

Additional thoughts

The above more advanced setup can of course be further extended. If you are working within a software development team, you could think of the following:

  • Using multiple shared notebooks to represent different teams
  • Using multiple shared notebooks to represent different projects (not recommended, see next item)
  • Using tags to identify a project
  • Using specially tagged notes to describe the projects, tagged with charter
  • Using tags for bug, incident, release, feature, story etc. (yes, the hint to Scrum is intentional ;–))
  • Adding comments in the body of a note to describe whatever you like
  • Attaching files to notes with additional information
  • Creating saved searches to quickly filter on specific tags
  • Creating “template” notes for specific entries that are often needed, pre-tagged

Shortcomings

The above setup still has a lot of shortcomings:

  • You don’t get a nice visual representation of the Kanban board
  • It’s a manual process to set the tags (and ownership of a task)
  • In fact it is all manual…
  • No other advanced features that some of the online tools have to offer

Another shortcoming is that there are a lot of companies that have their firewall block access to Evernote (and other cloud-based storage services such as Dropbox).

Advanced alternatives

If you need something more advanced, take a look at the following online services:

Or look at this article which lists 15 tools for Kanban.

What’s your opinion on feeding a data warehouse from a message queue?

There are organizations that have decided that their data warehouse(s) should be fed with data in the same way as their operational systems that are part of their Enterprise Application Integration (EAI) strategy.

Depending on your information needs, this might be a good idea, especially when you have a (near) realtime data warehouse. Short bursts of transactional data can be processed quickly and your data warehouse will be up to date.

But what if your information needs are different? What if you only need a snapshot of the data at the end of the day and just don’t need all changes that happened during the day in your transactional systems?

In a batch oriented data warehouse that only needs data once a day, or maybe only a few times per day but not anywhere close to (near) realtime, is having a message queue that feeds it really the way to go?

In one of my previous projects we had a batch oriented data warehouse that was fed by a message queue. About 75% of the “data” pushed to it was overhead, just because of all the XML tags that were needed. Only 25% was the data we really needed. In the end, it was decided to make a special XML schema for us with shorter tags to get rid of the overhead.
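To give a feeling for how tag names drive that overhead, here is a rough Python sketch that estimates which fraction of an XML message is markup rather than payload. The two sample messages and their element names are invented for this example and are not the messages from the project; the measure itself (everything that is not element text or an attribute value counts as overhead) is also just one simple way to look at it.

```python
import xml.etree.ElementTree as ET

def markup_overhead(xml_text: str) -> float:
    """Fraction of the message's characters that is not payload.

    Payload is counted as element text plus attribute values;
    everything else (tags, attribute names, quotes) is overhead.
    """
    root = ET.fromstring(xml_text)
    payload = 0
    for element in root.iter():
        payload += len((element.text or "").strip())
        payload += sum(len(value) for value in element.attrib.values())
    return 1 - payload / len(xml_text)

# Invented sample messages carrying the same two values.
verbose = (
    "<CustomerAccountBalance>"
    "<CustomerIdentifier>42</CustomerIdentifier>"
    "<BalanceAmount>10.50</BalanceAmount>"
    "</CustomerAccountBalance>"
)
compact = "<b><c>42</c><a>10.50</a></b>"

print(round(markup_overhead(verbose), 2), round(markup_overhead(compact), 2))
```

Shorter tags reduce the overhead, but as the compact message shows, they do not make it disappear, and the messages become a lot harder to read.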

Secondly, a buffer area needed to be developed with just one service: create a consolidated daily snapshot of all changes received via the message queue. Without going into all the details of the problems we faced, it was a messy solution and just didn’t feel right.
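Conceptually, such a buffer service reduces a stream of change messages to the last known state per key. The Python sketch below shows only that idea. The message layout (key, received_at, data and an optional deleted flag) is entirely my assumption for the illustration and not the format we actually had.

```python
from typing import Iterable

def daily_snapshot(messages: Iterable[dict]) -> dict:
    """Collapse change messages into the last received state per key.

    Messages are applied in order of their 'received_at' value; a
    message flagged as deleted removes the key from the snapshot.
    """
    snapshot = {}
    for message in sorted(messages, key=lambda m: m["received_at"]):
        if message.get("deleted"):
            snapshot.pop(message["key"], None)
        else:
            snapshot[message["key"]] = message["data"]
    return snapshot

# Made-up messages received during one day.
messages = [
    {"key": "C1", "received_at": "2012-03-01T09:00", "data": {"balance": 10}},
    {"key": "C2", "received_at": "2012-03-01T10:00", "data": {"balance": 5}},
    {"key": "C1", "received_at": "2012-03-01T17:30", "data": {"balance": 25}},
    {"key": "C2", "received_at": "2012-03-01T18:00", "deleted": True},
]

snapshot = daily_snapshot(messages)
print(snapshot)  # {'C1': {'balance': 25}}
```

Four messages go in and one row comes out, which is the kind of effort spent on changes a batch oriented data warehouse never needed in the first place.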

Do you recognize this or have any experience with it? Then please leave your opinion and thoughts in the comment section.

Much appreciated and have a nice day.

The Art Of Delegation

One of the possibilities in the GTD workflow is to delegate an action if there is someone more up to the task than you are, for whatever reason.

However, delegation isn’t always as simple as it looks. Why? I have assembled a few things that I’ve encountered. Please leave your comments if you recognize these or better, if you’ve come across things I haven’t listed. Of course suggestions for improvement are more than welcome as well.

Here is the list of reasons why delegation may be difficult:
– you don’t want to delegate at all, because you can do it better, or at least you think you can
– the person you delegate to may think differently than you do about who is the “better” one to handle the action
– the delegated action may not be part of the responsibilities or job description of the person you delegate to
– your “waiting for” list may become very large and difficult to follow up on
– delegating an action to your manager may feel inappropriate
– delegating an action to your peers may feel inappropriate

As you can see, these are just a few of the issues you may encounter when delegating actions. But I’m pretty sure that some of these are recognizable.

A few tips to handle these issues:
– you may be wrong in thinking you are the best to handle it
– keep a clear and current “waiting for” list with all delegated actions
– talk to the persons you delegate to in person or by phone; are they the right ones?
– follow up as much as needed to get your projects going

Any comments or suggestions are much appreciated.

My productivity apps

(Image: My productivity apps)

This is an overview of most applications I am currently using to help me to be productive. It has been created using one of the apps I am using, MindManager. Some are free, some are paid for (most of them).

My top 5:
  1. OmniFocus
  2. MindManager (when @MacBookPro) and iThoughts (when @iPhone)
  3. Read It Later
  4. iCal (when @MacBookPro) and Agenda (when @iPhone)
  5. Evernote

GTD, NLP and the 7 Habits of Highly Effective People

Lately I have been reading (or listening to audio books) about GTD, NLP and the Seven Habits of Highly Effective People. GTD is the Getting Things Done methodology (or systematic approach) as promoted by David Allen. The Seven Habits… is written by Stephen R. Covey and NLP stands for Neuro Linguistic Programming.

NLP is promoted by a lot of different people, such as Anthony Robbins, but also Frank Bruining and many more. Now let me be clear. I’m into GTD, have some notions of NLP and am not even halfway through reading/listening to the Seven Habits. But I am no expert at all in any of these different areas.

Different? Really? That is exactly what strikes me at the moment. There are so many similarities, so much overlap between each of these.

So is it just the same thing under another name? Each with a little twist around it to justify the different name? Or is it just a confirmation that the basics of each of these topics are really the same and that we can’t ignore them? Please let me know your thoughts on this. I don’t have an answer yet and maybe never will, but at least I will continue to read and listen and post more on this in the future.

Stuff in your head

Do you know that feeling that you can’t sleep because of all the stuff
that keeps popping up in your head? It’s happening to me right now.
Most of the time you’re probably thinking of the same thing over and
over again in that case.

So how do you get rid of these thoughts and go back to your well
deserved night’s rest? One solution could be to actually take action
on what you’re thinking about.

But in true GTD (Getting Things Done) style, just write it down
somewhere as a reminder for later and toss it in your inbox.

(Photo)