How to load a #datavault hub from multiple sources

In this blogpost I will describe a few scenario’s on how to load a datavault hub from multiple sources, each with their pros and cons. Many thanks to Martijn Evers (@DM_Unseen) and @RonaldKunenborg for giving their input.

The main reason for this blogpost is to share knowledge about Data Vault, so that it becomes accessible to a broader audience.

Disclaimer: I am not a Certified Data Vault Modeler, but the fore mentioned persons are.

Some considerations

One of the “principles” in the Data Vault Methodology is that most of the loading can be done in parallel. First the hubs are loaded in parallel, then the links and finally the satellites (although satellites belonging to hubs can of course be loaded after the hubs are loaded, one doesn’t need to wait for the links to be loaded).

It’s exactly at this point where I initially was somewhat confused regarding the loading of a hub that has multiple sources. If you would load that hub sequentially, as explained in scenario 2 below, wouldn’t you be defying this principle of parallel loading?

On the other hand, the idea is to load your data as soon as it becomes available. This poses a problem when using a union construct as explained in scenario 1 below. If one of the sources is ready a lot later, you have to sit and wait before you can load. Precious time is lost.

Scenario 1: using a union between sources

In this scenario the business keys from the different sources will be unioned together, while keeping information about the record source of each business key.

The following pseudo-SQL provides the basis for this scenario.

select distinct a.col1 as business_key  , 'table1.col1' as record_source , load_cycle_ts() as load_dts from table1 a  union  select distinct b.col2 as business_key , 'table2.col2' as record_source , load_cycle_ts() as load_dts from table2 b

Note that the above is not entirely correct, as it can result in duplicate business keys due to the inclusion of the record_source. This can be handled however with most ETL tools.

A typical setup for the above using Pentaho Data Integration would be like:

Pros

  • All the sources of a particular hub in one transformation which can give a better overview
  • Principle of parallel loading is maintained

Cons

  • Not easy to generate via automation
  • Difficult to maintain if additional sources are needed
  • Additional steps are needed to prevent duplicate keys
  • Additional constructs are needed to appoint the master source
  • Synchronization between sources, because you need to wait to start loading when all sources are ready

Scenario 2: sequentially for each source

In this scenario the business keys from the different sources will be loaded sequentially, starting with the master source.

Step 1: master source first

select distinct a.col1 as business_key  , 'table1.col1' as record_source , load_cycle_ts() as load_dts from table1 a

Step 2: next source

select distinct b.col2 as business_key , 'table2.col2' as record_source , load_cycle_ts() as load_dts from table2 b

A typical setup for each of these steps in Pentaho Data Integration would be like:

With this kind of setup, you can also use “micro-batches” to load the data. This idea is explained by Martijn Evers (@DM_Unseen) in his reply to my original question about this subject on LinkedIn (you need to be a member of the Data Vault Discussions group to view it).

Pros

  • Can easily be generated via automation
  • Adding new sources is easy
  • None of the Cons of scenario 1

Cons

  • Slightly defies the principle of parallel loading of hubs, but this is really of minor importance

Conclusion

It should be clear that the second scenario of loading the hub sequentially from it sources is normally the best one to choose. However, to quote @RonaldDamhof, “it depends on the context”. You can always come across a situation where another way is better…