Modeling business concepts using DataVault – part 2 of 2

In part 1 of this series, I showed three approaches to modeling the business concepts. The third attempt, shown below, is the best of the three, and I’ll explain why.

The business key issues

After discussions with the business and a look at the sources, the business key of a “study site” turned out to be composed of the following parts:

  • Study number
  • Region abbreviation
  • Site sequence number

This is a legacy business key, however, because the concept of “master site” was introduced later. It also poses some issues for the first and second attempts:

  1. Region isn’t represented anywhere in the model, but could potentially be modeled as a hub of its own;
  2. The site sequence number is meaningless on its own and can’t be modeled as a hub, so it ends up as a degenerate column in the “study site” link, as it’s still part of the key.

The second pattern is still valid according to Dan’s book1, but it requires a slightly different loading (and thus automation) pattern than a standard link that is purely the intersection of actual hubs.

If the region isn’t actually important enough to the business to be modeled as its own hub, it would also need to be modeled as a degenerate column, bringing us back to the second issue.
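To make this concrete, here is a minimal sketch of what such a “study site” link with degenerate columns could look like in SQL DDL (table and column names are my own illustration, not taken from the actual model):

create table link_study_site (
    link_study_site_hk   char(32)      not null  -- hash key of the link
  , hub_study_hk         char(32)      not null  -- reference to the “study” hub (study number)
  , region_abbreviation  varchar(10)   not null  -- degenerate column: no hub of its own
  , site_sequence_nr     int           not null  -- degenerate column: meaningless on its own
  , load_dts             timestamp     not null
  , record_source        varchar(100)  not null
);

The degenerate columns are part of the link’s grain, which is exactly why the loading pattern deviates from that of a standard hub-intersection link.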

The extensibility issues

Our model isn’t complete yet. During discussions with the business, it turned out there are also “subjects” and “findings”.

Subject

A subject is a person (though not identifiable as such) taking part in a study at a particular study site. The same person taking part in another study (probably at another site) is a different subject.

The business key is nothing more than a sequence number (within the “study site”).

Finding

A finding is a side effect that a subject exhibited during the study. Although not synonymous, think of it as a symptom.

The business key is nothing more than a sequence number or a date within the subject.

Do you see the problem here? Everything seems to have some kind of hierarchical relationship. Using the denormalization technique, you would end up with the following model:

To me this looks a bit messy, with too many dependencies. I know this is subjective again, but if you need to extend the model even further, it will only become messier.

The alternative extension

If we instead follow the idea2 behind the third attempt in the previous post, subject and finding are just new hubs. Adding those to the model is easy, see below:

This model is much cleaner and easier to extend. You can argue, however, that some information is lost, because you can’t immediately see the dependencies that lie within the business keys. But the question is whether these are really business keys, or just a technical solution in the source that the business adopted as business keys (because they had no other choice).

Even if the composition of these business keys changes, the definition of each of the hubs remains the same. The relationships between these hubs most likely won’t change either. And if they do, you can easily create other links between the hubs, without further impact on the rest of the model.
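As a sketch, extending the model then comes down to adding a hub and a link per new concept (again, all names are my own illustration):

create table hub_subject (
    hub_subject_hk  char(32)      not null  -- hash of the composed business key
  , subject_bk      varchar(100)  not null  -- study site key + sequence number
  , load_dts        timestamp     not null
  , record_source   varchar(100)  not null
);

create table link_study_site_subject (
    link_study_site_subject_hk  char(32)      not null
  , hub_study_site_hk           char(32)      not null
  , hub_subject_hk              char(32)      not null
  , load_dts                    timestamp     not null
  , record_source               varchar(100)  not null
);

“Finding” follows the same pattern: one hub, and one link connecting it to the subject hub.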

Furthermore, and this is only a recent development in the DataVault community (not necessarily supported by Dan Linstedt and DV2.0), there is then no need for link satellites that contain descriptive attributes.

Conclusion

In reality there is no conclusion; it’s partially a matter of preference. If you want to stick to pure DV2.0 recommendations, go for the denormalization technique, as link-to-link is still not recommended.

But if you truly understand the modeling part of DataVault, go for the alternative2. It’s more flexible and easier to extend. And the DV2.01 book doesn’t forbid you from modeling an event or transaction as a hub either, so you don’t have to worry about that.


  1. “Building a scalable data warehouse with DataVault 2.0”, http://a.co/aynqjSK ↩︎
  2. “Modeling the agile data warehouse with DataVault”, http://a.co/a0hyHWh ↩︎

Modeling business concepts using DataVault – part 1 of 2

In this series of two articles I’m going to discuss different approaches to modeling business concepts using DataVault. It is based on past discussions at one of my clients in the pharmaceutical industry.

The backbone of a DataVault

The backbone of a DataVault consists of hubs and links. Satellites are part of the core concepts but of less significance for this series of articles.

Let me start by giving the definitions of hubs and links, as stated in literature about this subject.

Hubs

“Hubs are defined using a unique list of business keys and provide a soft-integration point of raw data that is not altered from the source system, but is supposed to have the same semantic meaning.1

“The Hub represents the key of a Core Business Concept and is established the first time a new instance of that concept’s Business Key is introduced to the data warehouse.2

Links

“The link entity type is responsible for modeling transactions, associations, hierarchies, and redefinitions of business terms. The next sections of this chapter define Data Vault links more formally. A link connects business keys; therefore links are modeled between hubs. Links capture and record the past, present, and future relationships between data elements at the lowest possible granularity.1

“A Link represents an association between core concepts and is established the first time this new unique association is presented to the EDW. Just as a Hub is based on a core business concept, the Link is based on a natural business relationship between business concepts.2

Business definitions

Now that the core constructs are defined, let’s define the business concepts that form the basis of this article and the data model.

Healthcare Facility

“A place where healthcare is being practiced. This can be a hospital, a department of a hospital, a laboratory or another place.”

Healthcare Professional

“A person that has followed some form of medical studies and practices healthcare.”

Master Site

“A Master Site is the assignment of a Healthcare Professional to a Healthcare Facility.”

Study

“A Study is a formally followed research process in the development of medicine.”

Study Site

“A Study Site is the assignment of a Master Site to a Study.”

A first attempt to model the business concepts

The following data model uses the “colors” of the DataVault as introduced in “Modeling the agile data warehouse with DataVault”2.

  • Hubs are blue
  • Links are green
  • Satellites are yellow

Based on the business definitions given above, there is probably no doubt that “healthcare facility”, “healthcare professional” and “study” should be represented as hubs.

As you can see, both “master site” and “study site” are modeled as links. The reason is that the definitions of these concepts describe a kind of association, and links are used to represent associations.

But this also poses an immediate problem. We now have a link-to-link relation in the model. This is not recommended practice: “This dependency does not scale nor perform well in a high volume, high velocity (big data) situation. The problem with link-to-link entities is that a change to the parent link requires changes to all dependent child links.”1
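To illustrate what that link-to-link relation boils down to, here is a simplified sketch (names are my own, and the technical columns are reduced to a minimum):

create table link_master_site (
    link_master_site_hk             char(32)   not null
  , hub_healthcare_facility_hk      char(32)   not null
  , hub_healthcare_professional_hk  char(32)   not null
  , load_dts                        timestamp  not null
);

create table link_study_site (
    link_study_site_hk   char(32)   not null
  , hub_study_hk         char(32)   not null
  , link_master_site_hk  char(32)   not null  -- reference to another link instead of a hub
  , load_dts             timestamp  not null
);

Any change to link_master_site (adding a hub reference, for example, which changes its grain) ripples into link_study_site. That is exactly the dependency the quote warns about.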

A second attempt to model the business concepts

One way to get rid of the link-to-link relation is by using (a kind of) denormalization1.

If you apply that principle, you’ll get this:

Even though this is a correct approach, I have two problems with it:

  1. It starts to look like a dimensional model rather than a DataVault model, which is more fractal-like. This is of course very subjective, but it just doesn’t feel right to me;
  2. The model is harder to extend than with other approaches.

A third attempt to model the business concepts

Another approach is to take a closer look at the following statement: “This understanding is crucial to data vault modeling. A Link – by itself – cannot represent a business concept. So a Link – by itself – cannot represent an Event, a Transaction, or a Sale. Each of these event-oriented business concepts must be entire data vault constellations – including a Hub, Satellite(s) and Link(s).”2

Now think about that for a minute…

Both “master site” and “study site” are assignments, which are a kind of event. But both are business concepts too. In fact, these business concepts each have their own (composed) business key. So according to the statement, they should be modeled as (keyed instance) hubs, not as links.
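As a sketch, a keyed-instance hub for “master site” simply carries its own (composed) business key (names are my own illustration):

create table hub_master_site (
    hub_master_site_hk  char(32)      not null  -- hash of the composed business key
  , master_site_bk      varchar(100)  not null  -- the composed key identifying the assignment
  , load_dts            timestamp     not null
  , record_source       varchar(100)  not null
);

The relationships to “healthcare facility”, “healthcare professional” and “study” then become ordinary hub-to-hub links.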

Let’s try again:

In part 2 I will elaborate on why this third attempt is the better option.


  1. “Building a scalable data warehouse with DataVault 2.0”, http://a.co/aynqjSK ↩︎
  2. “Modeling the agile data warehouse with DataVault”, http://a.co/a0hyHWh ↩︎

ERwin data modeller plug-in MODGEN for DataVault generation

Thanks to George McGeachie, my attention was drawn to the following article on the blog of Erwin, a well-known data modelling tool.

The article is about DataVault in general and how a data modelling tool like Erwin can help.

More interesting is the fact that the German company heureka e-Business GmbH has written a plug-in for Erwin called MODGEN that is able to generate a DataVault model from another data model.

I will certainly contact them to see if they will run their webcast again, or whether a recorded version is available for offline viewing.

Who knows, this may be one step further in automating DataVault.

My 2 cents on DataVault standards (evolution)

Generating #datavault models & Issues to address | Accelerated Business Intelligence

Waiting for DataVault implementation classes

I’m anxious to learn more about the implementation part of a DataVault, something that is not covered in detail in Dan Linstedt’s book “Supercharge Your Data Warehouse”.

For a while now, he has announced that he is working on classes that cover the implementation. You can register to be notified about them here.

Unfortunately, I keep receiving emails from that list with all kinds of info about DataVault (but not about the implementation classes), including the suggestion that I can register to be notified when the implementation classes are ready. But that’s exactly why I registered in the first place. If I already did, why is that still in the emails I keep receiving?

Just now, I saw another tweet from him:

Now I find it interesting that he’s working on multiple things, but it would be nice if something actually got finished. Apparently the implementation classes are almost ready for production and release. We all know what that means in IT: either something is ready, or it’s not. “Almost” doesn’t exist…

Dan, get those implementation classes out now. People are waiting for them. You can always fine-tune them later…

How to load a #datavault hub from multiple sources

In this blogpost I will describe a few scenarios for loading a datavault hub from multiple sources, each with its pros and cons. Many thanks to Martijn Evers (@DM_Unseen) and @RonaldKunenborg for giving their input.

The main reason for this blogpost is to share knowledge about Data Vault, so that it becomes accessible to a broader audience.

Disclaimer: I am not a Certified Data Vault Modeler, but the aforementioned persons are.

Some considerations

One of the “principles” in the Data Vault Methodology is that most of the loading can be done in parallel. First the hubs are loaded in parallel, then the links, and finally the satellites (although satellites belonging to hubs can of course be loaded as soon as the hubs are loaded; one doesn’t need to wait for the links).

It’s exactly at this point that I was initially somewhat confused about loading a hub that has multiple sources. If you load that hub sequentially, as explained in scenario 2 below, aren’t you defying this principle of parallel loading?

On the other hand, the idea is to load your data as soon as it becomes available. This poses a problem when using a union construct as explained in scenario 1 below: if one of the sources is ready much later, you have to sit and wait before you can load. Precious time is lost.

Scenario 1: using a union between sources

In this scenario the business keys from the different sources will be unioned together, while keeping information about the record source of each business key.

The following pseudo-SQL provides the basis for this scenario.

select distinct
    a.col1 as business_key
  , 'table1.col1' as record_source
  , load_cycle_ts() as load_dts
from table1 a
union
select distinct
    b.col2 as business_key
  , 'table2.col2' as record_source
  , load_cycle_ts() as load_dts
from table2 b

Note that the above is not entirely correct, as it can still result in duplicate business keys due to the inclusion of the record_source. However, this can be handled with most ETL tools.
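Purely in SQL, one way to both prevent the duplicates and appoint the master source is to rank the rows per business key. A sketch, assuming table1 is the master source (the ranking expression is my own illustration, not part of any standard pattern):

select business_key
     , record_source
     , load_dts
from (
    select u.*
         , row_number() over (
               partition by business_key
               -- the master source wins when a key occurs in both sources
               order by case record_source when 'table1.col1' then 1 else 2 end
           ) as rn
    from (
        select distinct a.col1 as business_key, 'table1.col1' as record_source, load_cycle_ts() as load_dts from table1 a
        union
        select distinct b.col2 as business_key, 'table2.col2' as record_source, load_cycle_ts() as load_dts from table2 b
    ) u
) ranked
where rn = 1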

A typical setup for this scenario using Pentaho Data Integration would be like:

Pros

  • All the sources of a particular hub are in one transformation, which can give a better overview
  • Principle of parallel loading is maintained

Cons

  • Not easy to generate via automation
  • Difficult to maintain if additional sources are needed
  • Additional steps are needed to prevent duplicate keys
  • Additional constructs are needed to appoint the master source
  • Synchronization between sources is required, because loading can’t start until all sources are ready

Scenario 2: sequentially for each source

In this scenario the business keys from the different sources will be loaded sequentially, starting with the master source.

Step 1: master source first

select distinct
    a.col1 as business_key
  , 'table1.col1' as record_source
  , load_cycle_ts() as load_dts
from table1 a

Step 2: next source

select distinct
    b.col2 as business_key
  , 'table2.col2' as record_source
  , load_cycle_ts() as load_dts
from table2 b
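Each of these selects then feeds an insert that only adds business keys not yet present in the hub. A minimal pseudo-SQL sketch of that insert for the second source (the hub table hub_concept and its columns are my own illustration):

insert into hub_concept (business_key, record_source, load_dts)
select distinct
    b.col2 as business_key
  , 'table2.col2' as record_source
  , load_cycle_ts() as load_dts
from table2 b
where not exists (
    select 1
    from hub_concept h
    where h.business_key = b.col2
);

Because the master source is loaded first, a key that occurs in both sources keeps the master’s record_source: the behavior scenario 1 needed additional constructs for.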

A typical setup for each of these steps in Pentaho Data Integration would be like:

With this kind of setup, you can also use “micro-batches” to load the data. This idea is explained by Martijn Evers (@DM_Unseen) in his reply to my original question about this subject on LinkedIn (you need to be a member of the Data Vault Discussions group to view it).

Pros

  • Can easily be generated via automation
  • Adding new sources is easy
  • None of the Cons of scenario 1

Cons

  • Slightly defies the principle of parallel loading of hubs, but this is really of minor importance

Conclusion

It should be clear that the second scenario, loading the hub sequentially from its sources, is normally the best one to choose. However, to quote @RonaldDamhof, “it depends on the context”. You can always come across a situation where another way is better…