It’s been a while since I briefly looked into graph databases for the purpose of using them for metadata. As I have a metadata “problem” at hand, I decided to look into it again and just as last time, opted for Neo4j.
The problem I have is relatively simple. The BI solution at a client has many documents describing business data domains, data models (well, a sort of data models), mapping documents, mapping rules etc. These documents are stored on a file server in several directories, often arranged by change request. There is no good procedure in place yet – but in the works – to make sure there is a central place to store the documentation that actually describes the production state.
To make things worse, documents exist in different versions that were sometimes incorrectly “forked” into a newer version, resulting in changes getting lost from other versions. On top of the cake, the intend is to move all documentation over to Microsoft SharePoint (may the force be with us…).
In addition we have DDL in files and ETL in Informatica PowerDesigner. Due to the fact that naming conventions were sometimes a bit forgotten, it is pretty difficult to tie all documentation together. There are mapping rules/functions described in Word documents that are referenced in mapping specs in Excel sheets, but nobody knows which ones etc. etc. You get the drift.
So as we work our way in making a decent inventory of this metadata (at least that is how I see these documents, DDL and ETL code) I thought it would be a good time to make the relationships between this metadata explicit (as the way SharePoint is going to be used won’t help us here).
And that’s exactly where I see a good use of a graph database, simply because these are very strong in describing the relationships between things.
My first rough thoughts, without really thinking it all through, is a graph model like the following: