My thoughts on 9 Skills You Need to Become a Data Modeler

Ronald van Loon has written an interesting article on the 9 Skills You Need to Become a Data Modeler. In this post I want to express my thoughts on that article, because I don’t agree with all the statements made in the article.

The good

I certainly agree that data modeling is one of the best skills to have in today’s information-driven industries, but I would like to add that it hasn’t recently emerged. Data modeling has been around for decades (more than half a century, in fact), but seems to have been buried under a lot of misconceptions. However, it is finally, if slowly, being recognized again.

Data modeling indeed helps in understanding how the data neurons connect with each other, which is crucial. It doesn’t define per se how the data is generated, nor does the data need to reside in a computer system. Data modeling mostly determines the definitions of, the structure of, and the relations between data, but not the processes. It should be based on facts that can, and need to, be verified.

Stepping into a career as a modeler, you’ll have to work with data analysts and architects to identify key dimensions and facts to support the system requirements of your client or company. […]

As long as dimensions and facts don’t refer to the concepts of dimensional / star modeling as popularized and extensively taught by Ralph Kimball, I agree with this part. It also implicitly contains the number one skill you need as a data modeler: communication skills.

The bad

The career path for becoming a data modeler starts with specific education in the data science field […]

I really don’t agree on this point. Data science as we know it today was never part of my education. In fact, I am not really good at some of the underlying mathematical aspects involved in data science. Still, there are plenty of “colleagues” who acknowledge that I am good at data modeling.

The ugly

In general, my opinion is that the article mixes up a few things and focuses too much on technology. The “definition” of data modeling in the article is too narrow, because data modeling is not focused solely on database management systems.

Data modeling in itself is not the same as data management. It is a part of data management, as described by the DAMA-DMBOK2 and the DAMA Wheel1.

Data modeling serves as a means to complement business modeling and to work towards generating a sufficient database.

Again, I don’t agree that it serves to work towards generating a sufficient database. Data doesn’t need to reside in a database. In my opinion, data modeling serves as a way of communicating, structuring, interpreting and understanding data. That structuring can go down to the implementation level, where databases, file systems or other forms are used to store (and retrieve) the data.

The process for designing a database includes the production of three major schemas: conceptual, logical and physical. […] A Data Definition Language is used to convert these schemas into an active database. A data model that is fully attributed and covers all major aspects includes detailed descriptions for every entity contained within it.

I guess the author meant three major models, not schemas. Schemas are a technical way of separating data structures in particular database implementations. A Data Definition Language (DDL) is generated from a physical data model. The physical model, however, no longer talks about entities and attributes, but about tables and columns (assuming that a relational database management system is the target).
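To make that last step concrete: a hypothetical “Customer” entity from the logical model becomes a table with columns in the physical model, and the DDL generated from it is what turns the model into an active database. A minimal sketch using SQLite (all names invented for illustration):

```python
import sqlite3

# DDL as it might be generated from a physical data model:
# the logical entity "Customer" with its attributes has become
# a table with columns and physical data types.
ddl = """
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,  -- attribute "Customer Id" became a column
    full_name     TEXT NOT NULL,
    date_of_birth TEXT
);
"""

# Executing the DDL is what makes the database "active".
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['customer']
```

Note that nothing in this DDL talks about entities or attributes any more; that vocabulary belongs to the conceptual and logical models.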

The skills

This part is basically the essence of the article and exactly the part that I think is the ugliest of all.

You must exhibit the following skills before pursuing a career in data modeling:

1. Digital logic
2. Computer architecture and organization
3. Data representation
4. Memory architecture
5. Familiarity with numerous modeling tools that are currently in place within organizations
6. Directions in computing
7. SQL language and its implementing
8. Exemplary communication skills that will help you in making your way around organizations with an intricate hierarchy
9. Sufficient experience using Teradata or Oracle database systems

The most important skill you need to possess is number 8. Steve Hoberman, too, ranks communication skills as the number one skill for data modelers.

Skills one to four on the list are irrelevant until you get to physical data modeling, and even then they are questionable to a certain degree. Skill number 5 is something you can learn on the job. And beware: lots of data modeling tools in the form of software tend to focus on only certain aspects of data modeling, not all of them. A data model can be as simple as a set of post-its on a whiteboard with lines between them. In fact, that is most likely the data model best understood by people without the technical background that most of the listed skills refer to. Skill number 7 comes in handy at the point where the physical data model is implemented on a platform that actually supports SQL. What about graph databases that don’t?

I can’t comment on skill number 6 as I don’t understand what is meant by it.

And skill number 9… well, sorry Microsoft and all other database vendors. Seems like you have all just been wiped out of business…

Training and certification

Getting sufficient data modeling training and staying up-to-date with the evolution of the industry is indeed very important.

Certifications are crucial when it comes to data modeling in the formal setting. Companies agree it’s important for their data modelers to obtain reputable certifications that prove their expertise and also enhances their skills. These certifications include Big Data and Data Science courses, Big Data Architect Master’s Programs, Big Data Hadoop Training, and Data Science with R, among others.

I really am missing the relevance of these certifications to data modeling. I’m pretty sure they have their value, but in an entirely different area.

  1. If the DAMA organization doesn’t want me to include the picture of the DAMA Wheel in this post, please let me know and I will remove it. 

Is the traditional data warehouse still alive? Who cares?

Over the years and especially recently when I check my Twitter or LinkedIn timeline, I see questions like:

Is the traditional data warehouse dead?

I think it is a non-question, badly formulated and posed in a way that the answer is likely to be no, even without having read the actual article. It’s just marketing bullshit, clickbait.

First of all, what does traditional mean in this context? Secondly, what is meant by data warehouse? And last but not least, what is meant by dead?

Let’s start with the last one. If dead means deceased, no longer existing, then obviously the answer would be no to the original question, no matter what traditional data warehouse really means. We all know that these are still around. If dead means “no longer created in the same way”, then the answer could be yes, but is still probably no in lots of cases.

And that’s exactly where we come to traditional. What is that? Something that has been around for ages, or for decades? Data warehouses have existed for decades already, not for ages. Even traditions are known to evolve, albeit slowly, very slowly sometimes. To be honest, I have no clue what it means in this context. If you find that strange, let me ask you another question: is the traditional car dead? Same problem here. What is that? A Ford Model T, for example? Yes, those are no longer made or in use (although I’m not sure about the latter part; maybe they are still used for special occasions). On the other hand, a Tesla still has four wheels, a steering wheel, an engine (or two), etc. It’s still a car, just different technology.

Finally, what is a data warehouse? Is it a technical thing? Certainly not! It’s a concept that addresses particular concerns regarding – but not limited to – the way data is gathered, stored, processed, managed, governed, made available etc. Sure, you can also take the definition given by Barry Devlin, Bill Inmon or Ralph Kimball. And when they coined the term and definition – decades ago -, there were certain technological restrictions on how to build and implement it. Let’s be honest, even though that all has evolved, we still face technological restrictions. If you don’t believe me, let’s look back in 15 years from now (maybe even shorter).

So is the traditional data warehouse dead? Or is it still alive? Or doesn’t the “question” really matter at all? My answers: no, yes, no. In no particular order 😉

Move to the Netherlands

The past few weeks and the coming ones are busy times. Apart from my daily work for my client, I’m moving to the Netherlands, both personally and professionally.

My company in Belgium will cease to exist by the end of the first quarter of 2018, while a Dutch company is being set up as we speak. The name will stay the same, as will the services delivered.

Meanwhile, I’m looking for a permanent residence in The Hague, where I used to live before I moved to Belgium.

I’m not going to bother you with all the administrative details involved, but it’s a bit overwhelming. It’s also my main focus at the moment; other things are on hold until everything has been arranged.

This site will stay and be maintained in the future but I will drop the .be domain. The .com domain will remain.

See you soon!

Modeling business concepts using DataVault – part 2 of 2

In part 1 of this series, I showed three approaches to model the business concepts. The third attempt, shown below, is the better one and I’ll explain why.

The business key issues

After discussions with the business and by looking at the sources, the business keys of the concepts turned out to be as follows for a “study site”:

  • Study number
  • Region abbreviation
  • Site sequence number

This is a legacy business key, however, because the concept of “master site” was introduced later. But it does pose some issues with the first and second attempts:

  1. Region isn’t accounted for anywhere, but could potentially be modeled as a hub of its own;
  2. The site sequence number is meaningless on its own and can’t be modeled as a hub as such, resulting in a degenerate column in the “study site” link as it’s still part of the key.

The second issue is still valid according to Dan’s book1, but it requires a slightly different pattern for loading (and thus automation) than a standard link that is the intersection of actual hubs.

If the region isn’t actually important enough to the business to be modeled as its own hub, it would also need to be modeled as a degenerate column, bringing us back to the second issue.
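As an aside, a composite business key like the one above is typically collapsed into a single hub hash key when loading. A minimal sketch of that common DV2.0-style pattern (the normalization rules and sample values are illustrative assumptions, not taken from the project):

```python
import hashlib

def hub_hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Hash a (composite) business key the way DV2.0-style loaders
    commonly do: trim and uppercase each part, join with a delimiter
    (so parts can't bleed into each other), then hash the result."""
    normalized = delimiter.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The composite "study site" business key from the text:
# study number, region abbreviation, site sequence number.
key = hub_hash_key("ST-001", "eu", "42")
print(key)
```

Because of the normalization, differently cased or padded source values (`"eu"` vs `" EU "`) still produce the same key, which is exactly the behaviour you want when the same business key arrives from multiple sources.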

The extensibility issues

Our model isn’t complete yet. During discussions with the business, it turns out there are “subjects” and “findings”.

Subject

A subject is a person (but not identifiable as such) taking part in a study at a particular study site. The same person taking part in another study (probably at another site) is a different subject.

The business key is nothing more than a sequence number (within the “study site”).

Finding

A finding is a side effect that a subject exhibited during the study. Although not synonymous, think of it as a symptom.

The business key is nothing more than a sequence number or a date within the subject.

Do you see the problem here? Everything seems to have some kind of hierarchical relationship. Using the denormalization technique, you would end up with the following model:

To me this looks a bit messy, with too many dependencies. I know this can be subjective again, but if you need to extend even further, it will become even messier.

The alternative extension

If, however, we follow the idea2 of the third attempt in the previous post, subject and finding are just new hubs. Adding them to the model is easy, see below:

This model is much cleaner and easier to extend. You can argue, however, that some information is lost, because you can’t immediately see the dependencies that lie within the business keys. But the question is whether these are really business keys, or just a technical solution in the source that the business adopted as business keys (because they had no other choice).

Even if the composition of these business keys changes, the definition of each of the hubs remains the same. The relationships between these hubs most likely don’t change either. And if they do, you can easily create other links between the hubs, without further impact on the rest of the model.

Furthermore, although this reflects only recent developments in the DataVault community (not necessarily supported by Dan Linstedt and DV2.0), there is no necessity for link satellites that contain descriptive attributes.


In reality there is no conclusion; it’s partially a matter of preference. If you want to stick to pure DV2.0 recommendations, go for the denormalization technique, as link-to-link is still not recommended.

But if you truly understand the modeling part of DataVault, go for the alternative2. It’s more flexible and easier to extend. And the DV2.01 book doesn’t forbid you to model an event or transaction as a hub either, so you don’t have to worry about that.

  1. “Building a scalable data warehouse with DataVault 2.0”, ↩︎
  2. “Modeling the agile data warehouse with DataVault”, ↩︎

Modeling business concepts using DataVault – part 1 of 2

In this series of two articles I’m going to discuss different approaches to modeling business concepts using DataVault. It is based on past discussions at one of my clients in the pharmaceutical industry.

The backbone of a DataVault

The backbone of a DataVault consists of hubs and links. Satellites are part of the core concepts but of less significance for this series of articles.

Let me start by giving the definitions of hubs and links, as stated in literature about this subject.


“Hubs are defined using a unique list of business keys and provide a soft-integration point of raw data that is not altered from the source system, but is supposed to have the same semantic meaning.1

“The Hub represents the key of a Core Business Concept and is established the first time a new instance of that concept’s Business Key is introduced to the data warehouse.2


“The link entity type is responsible for modeling transactions, associations, hierarchies, and redefinitions of business terms. The next sections of this chapter define Data Vault links more formally. A link connects business keys; therefore links are modeled between hubs. Links capture and record the past, present, and future relationships between data elements at the lowest possible granularity.1

“A Link represents an association between core concepts and is established the first time this new unique association is presented to the EDW. Just as a Hub is based on a core business concept, the Link is based on a natural business relationship between business concepts.2
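The definitions above can be sketched as table structures. The following is an illustrative (not canonical) shape for one hub and one link, using two of the concepts that appear later in this post; all column names are assumptions on my part, using SQLite for the DDL:

```python
import sqlite3

# A hub stores the unique business key of a core business concept;
# a link stores the association between hubs, referencing their keys.
ddl = """
CREATE TABLE hub_study (
    hub_study_hk   TEXT PRIMARY KEY,       -- hash of the business key
    study_number   TEXT NOT NULL UNIQUE,   -- the business key itself
    load_dts       TEXT NOT NULL,
    record_source  TEXT NOT NULL
);

CREATE TABLE hub_master_site (
    hub_master_site_hk TEXT PRIMARY KEY,
    master_site_bk     TEXT NOT NULL UNIQUE,
    load_dts           TEXT NOT NULL,
    record_source      TEXT NOT NULL
);

-- The link is the intersection of the two hubs: it holds no
-- descriptive attributes, only the related hub keys.
CREATE TABLE link_study_site (
    link_study_site_hk TEXT NOT NULL PRIMARY KEY,
    hub_study_hk       TEXT NOT NULL REFERENCES hub_study,
    hub_master_site_hk TEXT NOT NULL REFERENCES hub_master_site,
    load_dts           TEXT NOT NULL,
    record_source      TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)  # ['hub_master_site', 'hub_study', 'link_study_site']
```

Satellites, which would carry the descriptive attributes and their history, are deliberately left out here, as they are of less significance for this series.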

Business definitions

Now that the core constructs are defined, let’s define the business concepts that form the basis of this article and the data model.

Healthcare Facility

“A place where healthcare is being practiced. This can be a hospital, a department of a hospital, a laboratory or another place.”

Healthcare Professional

“A person that has followed some form of medical studies and practices healthcare.”

Master Site

“A Master Site is the assignment of a Healthcare Professional to a Healthcare Facility.”

Study

“A Study is a formally followed research process in the development of medicine.”

Study Site

“A Study Site is the assignment of a Master Site to a Study.”

A first attempt to model the business concepts

The following data model uses the “colors” of the DataVault as introduced in “Modeling the agile data warehouse with DataVault”2.

  • Hubs are blue
  • Links are green
  • Satellites are yellow

Based on the business definitions given above, there is probably no doubt that “healthcare facility”, “healthcare professional” and “study” should be represented as hubs.

As you can see, both “master site” and “study site” are modeled as links. This is done because the definitions of these concepts describe a kind of association, and links are used for representing associations.

But this also poses an immediate problem. We now have a link-to-link relation in the model. This is not recommended practice: “This dependency does not scale nor perform well in a high volume, high velocity (big data) situation. The problem with link-to-link entities is that a change to the parent link requires changes to all dependent child links.”1

A second attempt to model the business concepts

One way to get rid of the link-to-link relation is by using (a kind of) denormalization1.

If you apply that principle, you’ll get this:

Even though this is a correct approach, I have two problems with it:

  1. It starts to look like a dimensional model and not like a DataVault model, which is more fractal-like. This is of course very subjective, but it just doesn’t feel right to me;
  2. The extensibility of the model is more difficult than with other approaches.

A third attempt to model the business concepts

Another approach is to take a closer look at the following statement: “This understanding is crucial to data vault modeling. A Link – by itself – cannot represent a business concept. So a Link – by itself – cannot represent an Event, a Transaction, or a Sale. Each of these event-oriented business concepts must be entire data vault constellations – including a Hub, Satellite(s) and Link(s).”2

Now think about that for a minute…

Both “master site” and “study site” are assignments, which are a kind of event. But both are business concepts too. In fact, these business concepts each have their own (composite) business keys. So according to the statement, they should be modeled as (keyed instance) hubs, not as links.

Let’s try again:

In part 2 I will elaborate on why this third attempt is the better option.

  1. “Building a scalable data warehouse with DataVault 2.0”, ↩︎
  2. “Modeling the agile data warehouse with DataVault”, ↩︎

When your e-bike (data) goes… anywhere

This post should be taken a bit lightly but is nevertheless true.

Recently I bought an e-bike. You know, a bicycle with an electric motor that assists you. This e-bike, though a relatively simple (i.e. cheap) one, has some nice features:

– built-in GPS tracking
– a SIM card somewhere in the frame to send your e-bike data over the air to a server owned by the manufacturer (I guess)
– other sensors that detect movement of some kind

So it comes with an app. Of course. Everything needs to be smart and connected to the internet nowadays.

The app shows me data about:
– what route I drove, using the GPS data and a map
– how many calories I burned
– how much CO2 I did not throw into the air as I took the bike and not the car
– mileage
– average speed

For the above data I can see today, yesterday, last days, last month etc.

The sensors also emit signals that the app will receive and show me as notifications on my phone or watch, such as:
– whether movement was detected
– whether the bike has left a predetermined geofence
– whether the bike fell over
– whether the bike is moving faster than 50 kilometres an hour

Wow, great! Isn’t it?

Well, not all of it, because:
– the sensors can’t tell that it’s me moving the bike
– the sensors can’t tell that it was me who fitted the bike onto my car (hence the notification for going faster than 50) and not some stranger driving a white van who isn’t the bike repairman either
– the sensors don’t know the difference between falling over and a speed bump that you ride over at about 20 km/h

As the bike is parked in a building with lots of concrete and other buildings close by, I also sometimes get a notification that the bike has left its geofence. Yesterday it did 13 km on its own. But I didn’t get a movement notification. In fact, the bike never moved at all. Nobody borrowed it. I was at work.

So that data is a little inaccurate…
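Under the hood, such a geofence notification is presumably little more than a distance check against a centre point, so a GPS fix that drifts (as it easily does among concrete and nearby buildings) can trip it while the bike sits still. A minimal sketch, with entirely hypothetical coordinates:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS fixes."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def outside_geofence(fix, centre, radius_m):
    return haversine_m(*fix, *centre) > radius_m

centre = (52.08, 4.31)       # hypothetical geofence centre
parked = (52.0805, 4.3102)   # true position of the parked bike
drifted = (52.0812, 4.3125)  # same bike after a couple of hundred metres of GPS drift

print(outside_geofence(parked, centre, 100))   # False: still inside the fence
print(outside_geofence(drifted, centre, 100))  # True: drift alone trips the alarm
```

The sensor never lied about the coordinates it received; the data quality problem is upstream, in the fixes themselves.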

Which brings me to the reporting of that data in the app. It tells me I saved an amount of CO2. But compared to what? To a car, says the help page. But what car? A Tesla doesn’t produce CO2 at all (apart from when you have a flat tire). Not very useful.

So my key points of this post are:
– is your sensor data trustworthy?
– is your reporting telling the right story?
– is your data lineage clear?
– do you know how your data relates to each other?

If the answer is no to most points, I wish you all the best interpreting your (big) data. You know where to find me to help out…

How to explain data architecture to a teenager

Yesterday I attended the initial awareness session of the “Full Scale” Data Architects. We had an open discussion on what it is, could be or should be.

One of the questions raised during that session and afterwards on LinkedIn was how to explain what we – data architects – do.

Although data architecture and architecture (in construction) have many differences, I still see an analogy.

When asked what I do, I also make that analogy. It doesn’t cover it completely of course but it is often enough for the first introduction.

I design “something”, make the blueprint and lay the foundation.

And that while taking into account all wishes, (legal) requirements, environmental factors, durability, change and – although in data architecture we try as much as possible to be technology-independent – available “building material”. It’s basically finding the right balance as Ronald Damhof put it.

In practice the architect may also be the contractor that takes the lead in the construction. This can be an incentive for some but not for others1.

But I always keep an eye on the construction, or delegate that, to make sure it goes according to plan. When necessary I even change the plan (due to external changes or available building materials).

I should therefore have an overview and be part of a whole team that I can trust.

And I shouldn’t make it more complex than strictly necessary, certainly not when I try to explain it to someone else.

Of course it can definitely help if you have ever constructed things yourself – and I have – but mainly from the point of view of the problems you can run into. Otherwise you risk starting with a technical bias (yes, it does happen to me occasionally).

  1. Another question was how to make data architecture attractive to teenagers so that they will go on to study it, and whether there are any real study programs for it.

ERwin data modeller plug-in MODGEN for DataVault generation

Thanks to George McGeachie, my attention was drawn to the following article on the blog of Erwin, a well-known data modeling tool.

The article is about DataVault in general and how a data modeling tool like Erwin can help.

More interesting is the fact that the German company heureka e-Business GmbH has written a plug-in for Erwin called MODGEN that is able to generate a DataVault model from another data model.

I will certainly contact them to see if they will do their webcast again, or whether a recorded version is available for offline viewing.

Who knows, this may be one step further in automating DataVault.

My 2 cents on DataVault standards (evolution)

Generating #datavault models & Issues to address | Accelerated Business Intelligence