Concept maps and graph data modeling: techniques in the “data modeler toolbox”

As a data modeler I’m always searching for ways to learn more about data modeling. The more I read and learn about it – on top of what I have already learned and used – the more sense it all makes and the easier it gets. It’s a bit like learning foreign languages: the more languages you’ve learned, the easier it gets to learn a new one, especially when that language has a similar origin.

As such, I have been studying Concept Maps and Graph Data Modeling. It’s surprising (well, it isn’t, really) how closely these two are related. And graph databases are “hot” (at the time of writing).

I’m not going into all the details, because it makes no sense to repeat all that has been written by others – people who are far more knowledgeable about these subjects.

Concept map

A concept map or conceptual diagram is a diagram that depicts suggested relationships between concepts. It is a graphical tool that instructional designers, engineers, technical writers, and others use to organize and structure knowledge.

This definition was taken from Wikipedia and there are more interesting links from that article that you should read, such as “The Theory Underlying Concept Maps and How to Construct and Use Them”.

The main idea is that you start with a “focus question”. This is the question you try to answer with the concept map. Without it, your concept map could include a lot more than is necessary or relevant. And you’re not trying to model the entire universe…

What I like about concept maps is their simplicity. There is not really anything technical about them and they are easy to read for almost everyone. Of course, from a modeling perspective they don’t cover all concerns you eventually want to cover, but it is a nice starting point for brainstorming over a particular knowledge domain.

Certainly, there are other, more advanced modeling techniques (covering more concerns), such as FCO-IM, but these come with a steeper learning curve as well.

An example concept map

I created the following concept map myself, using CmapTools. It tries to answer the following focus question:

What data is relevant to a recruiting company?

Note that I have no particular domain knowledge regarding recruitment and I didn’t consult anyone. I just had a look at my own CV.

While working on it, I discovered more concepts than I originally included in my first sketch, and more importantly, more relations between those concepts as well. Each of those relations must have a “linking phrase” so that a proposition can be formed, such as:

  • Person has Hobby
  • Hobby requires Skill
  • Education teaches Skill
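Such propositions are essentially (concept, linking phrase, concept) triples. A minimal sketch in Python, using the three propositions above:

```python
# A proposition is a (concept, linking phrase, concept) triple.
propositions = [
    ("Person", "has", "Hobby"),
    ("Hobby", "requires", "Skill"),
    ("Education", "teaches", "Skill"),
]

def as_sentence(triple):
    """Read a proposition back as a plain sentence."""
    source, phrase, target = triple
    return f"{source} {phrase} {target}"

sentences = [as_sentence(p) for p in propositions]
# e.g. "Person has Hobby"
```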

Of course there are concerns not being addressed here, such as:

  • None of the concepts has an actual definition (and you should have those!)
  • There is no explicit view on the cardinality between the concepts (can a person have just one or multiple hobbies?)
  • There are no attributes defined for the concepts or relationships

And there are a few flaws as well in my example:

  • A relationship in a concept map is uni-directional, but some of them are bi-directional in the real world, so the “way back” is missing
  • It’s not complete (but that was not the intention)
  • Even though you know in which countries an organisation is located and which assignment belongs to which organisation (well, the latter is actually missing due to the uni-directional relationship drawn from the organisation to the assignment), you don’t know for sure in which country the assignment took place

Graph Data Modeling

That example concept map looks “surprisingly” like the structure of a Graph Data Model. This is no coincidence. Thomas Frisendal has written an excellent book on this subject called “Graph Data Modeling for NoSQL and SQL: Visualize Structure and Meaning”. Visit his website that accompanies the book.

Taken from that website:

In the graph world the “property graph” style of graphing makes it possible to rethink the representation of data models. Graph Data Modeling sets a new standard for visualization of data models based on the property graph approach. Property graphs are graph data models consisting of nodes and relationships. The properties can reside with the nodes and / or the relationships. Accordingly the property graph model consists of just 3 simple types, as laid out in this property graph representation of the meta model:
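Those three types can be sketched in a few lines of Python (the node labels and property names below are illustrative, not taken from the book):

```python
# Property graph: nodes and relationships, both of which can carry properties.
nodes = {
    "n1": {"label": "Person", "properties": {"name": "Alice"}},
    "n2": {"label": "Organisation", "properties": {"name": "Acme"}},
}
relationships = [
    {"from": "n1", "to": "n2", "label": "works_for",
     "properties": {"since": 2018}},  # properties on the relationship itself
]
```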

This diagramming style is very close to what people – by intuition – draw on whiteboards. Rather than modeling data architecture on complex mathematics, we should focus on the psychology of the end user. If we do so, then engineering could be replaced with relevant business processes. In short, to achieve more logical and efficient results, we need to return to data modeling’s roots.

Even a ten-year-old can spot the resemblance between a graph data model and a concept map (I actually checked this with my son and he confirmed it, even though he didn’t like the fact that I disturbed him during his Fortnite game play, in which he got shot due to my distraction). Even more interesting is the fact that concept maps were originally created by Joseph D. Novak to assist in the teaching of children.

So your ten-year-old might be a data modeler without knowing it yet! Maybe (s)he’ll be interested if you sit down together to build a concept map around Fortnite…

Anyway, the graph data modeling technique as Thomas Frisendal presents it in his book addresses at least one more concern: properties or attributes of concepts and relationships (technically speaking, these could be part of a concept map too).

Graph databases

Databases that support property graphs, such as Neo4j, can translate these models basically one-to-one1. These graph databases are one category of the so-called NoSQL databases and are extremely good at answering questions about data that is highly connected through a plethora of relationships.

However, graph databases are also schema-less. That is, Neo4j is able to show you a schema – consisting of nodes and relationships, but not their properties – yet doesn’t enforce it. So the application developer can go wild and do anything. As far as I can tell, Neo4j bases the schema on the labels you assign to the nodes and relationships, but it does so based on actual instances. If you don’t have an instance of a particular relationship between two nodes in your data, you won’t see that it could exist when looking at the schema. That’s where your concept map can be used as a reference when developing.
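The instance-based nature of such a derived schema can be illustrated with a small Python sketch (hypothetical data, not Neo4j itself):

```python
# A derived schema only contains what the data instances exhibit:
# a relationship type appears only if at least one instance of it exists.
instances = [
    ("Person", "HAS", "Hobby"),
    ("Person", "HAS", "Hobby"),     # duplicates collapse in the schema
    ("Hobby", "REQUIRES", "Skill"),
]

schema = sorted(set(instances))
# ("Education", "TEACHES", "Skill") never occurs in the data, so the
# derived schema cannot tell you that this relationship could exist.
```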

Another thing to consider: because properties are not part of the schema and there is no schema enforcement, a particular instance of a node label or relationship label (i.e. the type) can have totally different properties than another instance of that same label. Flexibility has its downside as well.

An example of a graph database

Based on the concept map example earlier, I started to play around with Neo4j on my desktop. Looking at my own CV, I created a few nodes and relationships, but far from all of those covered in the concept map.

When I queried the database to return what I had created, the following was shown:

Quite impressive already, and there is hardly any data present. I checked the schema against the concept map and only saw one thing that is probably just a glitch in Neo4j: there was a relationship label used_in that was self-referencing the Organisation node label, but I couldn’t find any data that actually does that.

The example also shows that having the data only – without a schema, concept map or graph data model made upfront – quickly results in a situation where you can’t see the forest for the trees (pun intended, as a tree is a special kind of graph).


No conclusions from me – you are on your own here to draw them. Further reading is recommended if you see potential in this as part of your “data modeler toolbox”, or if you want to deepen your understanding of data modeling in any other way.

  1. Not entirely one-to-one. Can you spot which part of the concept map can’t be translated as such? Hint: look at the relationship from Vacancy towards Education.

Data modeling for multi-structured data, nothing new

In today’s world there seems to be a lot of focus on the technology to handle multi-structured data. And the advances made in technology certainly support various aspects of data management. However, this technology is only there to support higher goals.

Schema-less and schema-on-read are terms that have often been advocated in the recent past. The headaches of that approach have surfaced quite a bit as well; just search Google for relevant articles.

The fact is, there is no such thing as unstructured data. Rick van der Lans has repeatedly pointed out that there is always some structure present, albeit not known at first. He therefore prefers the terms semi-structured or multi-structured.

Let’s be honest, without some kind of structure we – humans – are not able to make sense out of data. The fact that we may not know the structure beforehand doesn’t matter. But once we know the structure, things start to get interesting. For some reason, even though we can start making sense out of our data, many fail to properly document this (multi-)structure. Why would you anyway, right?

Well, why wouldn’t you? Documenting the structure in a “data” model brings many benefits:

  • Communication with others that need to deal with the data in some way becomes a lot easier
  • You need to do it because of regulatory requirements, saving you from fines
  • You get a better overview and more insight into how data relates to each other and where the gaps are
  • You are able to model and implement validations that are needed
  • Etc.

And you know what, data modeling has been around for at least half a century now. A lot of principles are old but still apply. The younger technology-focused generation just seems to have either forgotten them (best case) or has never learned them (worst case).

Data modeling comes on many different levels, not just the physical database – that is just a possible end point. While I was diving into even more around it, I came across a book called “Data Model Patterns – Conventions of Thought”, written by David C. Hay… in 1995. Yes, you read that correctly: 1995. This book is a must-read for anyone who deals with data. It’s old and still extremely valid. Apart from the many patterns that apply to a lot of organizations, it also shows abstraction and generalization. Even better, it contains examples of how to deal with multi-structured data.

The following data model is taken from chapter 4 of the book and gives a perfect example of how to model multi-structured data. In this case it relates to “products” that may have many, but varying, descriptive attributes.

Now compare this with the “key-value pair” databases around nowadays. Ring a bell1? Remember, the book was written in 1995 – long before the internet hype, long before we started talking about big data, long before specialized databases supported this particular kind of data.
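Hay’s pattern boils down to storing (product, attribute, value) rows instead of fixed columns – exactly the shape of today’s key-value stores. A toy sketch (the product and attribute names are mine, not Hay’s):

```python
# Entity-attribute-value: products with varying descriptive attributes,
# stored as rows of (product, attribute, value) instead of fixed columns.
attribute_values = [
    ("chair-1", "colour", "red"),
    ("chair-1", "height_cm", "95"),
    ("cable-7", "length_m", "3"),   # a cable has no colour or height here
]

def attributes_of(product):
    """Collect the attributes recorded for one product."""
    return {attr: value for prod, attr, value in attribute_values
            if prod == product}
```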

As said in the title of this post, there is nothing new here. Old wine in new bottles…

  1. If it doesn’t ring a bell, please read the book and start looking for training on data modeling.

My thoughts on 9 Skills You Need to Become a Data Modeler

Ronald van Loon has written an interesting article on the 9 Skills You Need to Become a Data Modeler. In this post I want to express my thoughts on that article, because I don’t agree with all the statements made in the article.

The good

I certainly agree that data modeling is one of the best skills to have in the current information-driven industries, but I would like to add that it hasn’t only recently emerged. Data modeling has been around for decades (in fact, more than half a century), but seems to have been buried under a lot of misconceptions. However, it is finally, slowly, being recognized again.

Data modeling indeed helps in understanding how the data neurons connect with each other, which is crucial. It doesn’t define per se how the data is generated, nor does the data need to be in a computer system. Data modeling mostly determines the definitions of, structure of and relations between data, but not the processes. It should be based on facts that can and need to be verified.

Stepping into a career as a modeler, you’ll have to work with data analysts and architects to identify key dimensions and facts to support the system requirements of your client or company. […]

As long as dimensions and facts don’t refer to the concepts of dimensional / star modeling as was popularized and extensively taught by Ralph Kimball, I agree with this part and it also implicitly contains the number one skill you need as a data modeler: communication skills.

The bad

The career path for becoming a data modeler starts with specific education in the data science field […]

I really don’t agree on this point. Data science as we know it today was never part of my education. In fact, I am not really good at some of the underlying mathematical aspects that are involved in data science. Still, there are lots of “colleagues” who acknowledge I am good at data modeling.

The ugly

In general, my opinion is that the article mixes up a few things and focuses too much on technology. The “definition” of data modeling in the article is too narrow, because data modeling is not just focused at database management systems.

Data modeling in itself is not the same as data management. It is a part of data management, as described by the DAMA-DMBOK2 and the DAMA Wheel1.

Data modeling serves as a means to complement business modeling and to work towards generating a sufficient database.

Again, I don’t agree that it serves to work towards generating a sufficient database. Data doesn’t need to reside in a database. In my opinion, data modeling serves as a way of communicating, structuring, interpreting and understanding data. That structuring can go down to the implementation level, where databases, file systems or other forms are used to store (and retrieve) the data.

The process for designing a database includes the production of three major schemas: conceptual, logical and physical. […] A Data Definition Language is used to convert these schemas into an active database. A data model that is fully attributed and covers all major aspects includes detailed descriptions for every entity contained within it.

I guess the author meant three major models, not schemas. Schemas are a technical way of separating data structures in particular database implementations. A Data Definition Language (DDL) is generated from a physical data model. The physical model, however, no longer talks about entities and attributes, but about tables and columns (assuming that we are talking about a relational database management system as the target).
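That generation step is mechanical: the physical model already talks in tables and columns, and the DDL follows from it. A toy generator to illustrate the idea (not a real modeling tool):

```python
# A tiny physical model: tables with typed columns.
physical_model = {
    "customer": [("customer_id", "INTEGER"), ("birthdate", "DATE")],
}

def to_ddl(model):
    """Derive CREATE TABLE statements from the physical model."""
    statements = []
    for table, columns in model.items():
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
        statements.append(f"CREATE TABLE {table} ({cols});")
    return statements

ddl = to_ddl(physical_model)
# ["CREATE TABLE customer (customer_id INTEGER, birthdate DATE);"]
```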

The skills

This part is basically the essence of the article and exactly the part that I think is the ugliest of all.

You must exhibit the following skills before pursuing a career in data modeling:

1. Digital logic
2. Computer architecture and organization
3. Data representation
4. Memory architecture
5. Familiarity with numerous modeling tools that are currently in place within organizations
6. Directions in computing
7. SQL language and its implementing
8. Exemplary communication skills that will help you in making your way around organizations with an intricate hierarchy
9. Sufficient experience using Teradata or Oracle database systems

The most important skill you need to possess is number 8. Communication skills are also ranked as the number one skill by Steve Hoberman.

Skills one to four on the list are irrelevant until you get to physical data modeling, and even then they could be questionable to a certain degree. Skill number 5 is something you can learn on the job. And beware: lots of data modeling tools in the form of software tend to focus only on certain aspects of data modeling, not even all of them. A data model can be as simple as a set of post-its on a whiteboard with lines between them. In fact, that is most likely the data model best understood by people without the technical background referred to by most of the skills listed. Skill number 7 comes in handy at the point where the physical data model is implemented on a platform that actually supports SQL. But what about graph databases that don’t support it?

I can’t comment on skill number 6 as I don’t understand what is meant by it.

And skill number 9… well, sorry Microsoft and all other database vendors. Seems like you have all just been wiped out of business…

Training and certification

Getting sufficient data modeling training and staying up-to-date with the evolution of the industry is indeed very important.

Certifications are crucial when it comes to data modeling in the formal setting. Companies agree it’s important for their data modelers to obtain reputable certifications that prove their expertise and also enhances their skills. These certifications include Big Data and Data Science courses, Big Data Architect Master’s Programs, Big Data Hadoop Training, and Data Science with R, among others.

I really am missing the importance of these certifications regarding data modeling. I’m pretty sure they have their value, but in an entirely different area.

  1. If the DAMA organization doesn’t want me to include the picture of the DAMA Wheel in this post, please let me know and I will remove it. 

Interpretation in dimensional modeling vs data vault

In my previous post I mentioned that there is less interpretation for the designer in data vault modeling than in dimensional modeling.

Let me elaborate on that with an example.

Dimensional modeling

The question that I received in the workshop was what to do with the age of the customer at time of the transaction. Is it a fact or a dimensional attribute?

The age is something that could be calculated using the customer’s birthdate and the date of the transaction. In that case, the dimensional attribute would be the birthdate. The fact table should hold a reference to the time dimension representing the transaction date.
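Calculating the age on the fly from those two dates is a small piece of logic (a sketch; a real implementation must also mind data quality and edge cases):

```python
from datetime import date

def age_at(birthdate, transaction_date):
    """Age in whole years at the moment of the transaction."""
    # Subtract one if the birthday hadn't occurred yet that year.
    had_birthday = ((transaction_date.month, transaction_date.day)
                    >= (birthdate.month, birthdate.day))
    return transaction_date.year - birthdate.year - (0 if had_birthday else 1)

age = age_at(date(1980, 6, 15), date(2019, 3, 1))
# 38: the customer had not yet turned 39 on 2019-03-01
```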

Depending on the number of fact records however, there could be a negative impact on performance when calculating the age on the fly, but this can probably be neglected.

But what if the age is supposed to be used to determine whether it falls in a certain range? What if these ranges are variable and only known at query time? And what about the average age of the customers buying a particular product within a certain time frame?

Could the age be a fact measure in those cases?

Some designers would argue that it is a derived measure that doesn’t need to be stored in your fact table. I agree, but it does require your query reporting tool to be able to handle it all.

Others would argue that the age is a measure that you better store in the fact table. It wouldn’t be exactly an additive measure, but at least you could answer the “average age” somewhat easier.

You could even argue that it is both something dimensional and factual. And that’s true as well. The birthdate of the customer is dimensional and the age could be used for filtering facts.

It should be clear that it depends on the context. If there is no business question about the age yet, I would at least store the birthdate as a dimensional attribute of the customer. Adapting the fact table when the question arises can be done later, but does require reengineering.

Data vault modeling

In data vault modeling you don’t suffer from this interpretation problem. The birthdate of the customer is a satellite attribute linked to your customer hub. There is no question about it.

You divide and conquer.

The interpretation will only be needed once you get the business question. Based on that, you can determine how to model it in the presentation layer, whether it be dimensional or any other form needed.

Wait, did you hear me mention something about Agile BI here? Well, sort of 😉

The Art of Data Modeling – follow up

In my previous post on data modeling, I wrote that it is an art and not pure science. In this post, I’ll elaborate on that.

Today I gave a small workshop on dimensional modeling for the business intelligence team that I am currently part of. The main objectives of this workshop were to get the team into a standard way of working and to refresh their memory on dimensional modeling.

As preparation I had written a small summary about dimensional modeling, mostly based on Dr. Ralph Kimball’s book The Data Warehouse Toolkit.

Before I got into presenting some example models for the project I’m working in, I had a small Q&A session about the preparation material that I had provided.

Well, I thought it would be small…

I received many design questions that I could not answer… at least not in the way they thought I would answer them, i.e. as a straight answer that would settle it once and for all.

That is exactly what designing a data model is all about, especially when doing dimensional modeling. To quote Ronald Damhof:

it all depends on the context

That is also exactly why there are some drawbacks in dimensional modeling compared to other modeling techniques such as data vault.

Is a particular attribute dimensional, or a measure that belongs in the fact table? It just depends on how it will be used. There is no scientific answer to it.

That’s why I say, data modeling is an art. Art represents the artist’s interpretation or view.

It is also the reason why I prefer data vault over dimensional modeling for the EDW. There is less left up to interpretation. But that’s another story…