Disclaimer: These are the thoughts of someone who was introduced to graph databases very recently, so please correct me if I have some fundamental misunderstanding.
Problems solved by the relational model, RDF/OWL, and streaming pipelines
The reason the relational model took over the world is the scalability it offers. The strong guarantees it provides give people the comforting feeling of mathematically certain safety while they make changes to arbitrarily complex databases.
Now our data needs are outgrowing relational databases in two ways:
- The foreign key model can’t provide semantic information, which forces people to rely heavily on data dictionaries to understand the data. The schema constraints of relational databases are rudimentary, which forces developers to write unit tests to enforce expected relationships. This is the heart of what Terminus and OWL in general aim to solve.
- The fundamental structure of a database-driven enterprise is wrong. I work on a 20-person data warehouse team in a major hospital, and our jobs exist solely because when a nurse enters a new patient into a system, there are 10 departments with 20 different applications that want to know about it. One open-source project that’s tackling this issue head-on is Apache Kafka. The Kafka philosophy is to treat the changes to the data as first-class data - ‘facts’ - and stream them throughout the entire enterprise, allowing any number (even thousands) of applications to process facts of interest in real time. When taken to its logical conclusion, we can do without any centralized databases entirely. Every individual application will hold on to its own cache of aggregated data that it can reconstruct from the log of past streamed messages if necessary. Information can flow from source to destination directly in an efficient and elegant manner.
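The "facts as first-class data" idea above can be sketched in a few lines: each change is an immutable event appended to a log, and any consumer rebuilds its own cache purely by replaying that log. This is only an illustration of the pattern — the names (`Fact`, `rebuild_view`, the patient identifiers) are hypothetical, and a plain Python list stands in for a durable Kafka topic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """An immutable change event, like a message on a Kafka topic."""
    entity: str     # e.g. a patient identifier
    attribute: str  # what changed
    value: str      # the new value

log: list[Fact] = []  # stand-in for a durable, append-only Kafka topic

def emit(fact: Fact) -> None:
    """Producers append facts; they never mutate shared state directly."""
    log.append(fact)

def rebuild_view(facts: list[Fact]) -> dict[str, dict[str, str]]:
    """Any consumer can reconstruct its local cache from the fact log alone."""
    view: dict[str, dict[str, str]] = {}
    for f in facts:
        view.setdefault(f.entity, {})[f.attribute] = f.value
    return view

emit(Fact("patient:123", "name", "Ada Lovelace"))
emit(Fact("patient:123", "ward", "cardiology"))
emit(Fact("patient:123", "ward", "oncology"))  # a later fact wins on replay

view = rebuild_view(log)
```

Because the log is the source of truth, a consumer that loses its cache (or a brand-new application added years later) gets an identical `view` by replaying from the start — no central database required.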
I have come to appreciate the streaming model that Apache Kafka offers more and more, but it actually suffers from the lack of semantic information even more acutely than a centralized database does. Whereas before you had one central database with one data dictionary, you now potentially have tens or hundreds or thousands of applications both producing and consuming messages. No one person or team can keep track of all the meanings and constraints of all of this data.
A stream-based distributed graph database
I think there’s a real opportunity for Terminus to solve both problems and become more pure and elegant in the process. I envision a future where any change made by a user of any application is encoded as an RDF triple and sent to a relevant stream where any number of specialized Terminus-powered applications can access it. Each of these applications will:
- Receive the RDF triple
- Enrich it with schema information from one or several schema servers
- Optionally make a change to its internal triple store based on the semantic information encoded in the triple
- Optionally perform some action
- Optionally produce some output to another stream as another RDF triple
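The five steps above can be condensed into a single processing function. Everything here is an illustrative stand-in — the triple format, the schema-server lookup (a plain dict), and the action rule are all hypothetical, not an existing Terminus API.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical schema-server data: predicate -> expected range/type
SCHEMA = {"ex:name": "xsd:string", "ex:ward": "ex:Ward"}

store: set[Triple] = set()      # the application's internal triple store
out_stream: list[Triple] = []   # stand-in for a downstream topic

def enrich(triple: Triple) -> dict:
    """Step 2: attach schema information for the triple's predicate."""
    _, predicate, _ = triple
    return {"triple": triple, "range": SCHEMA.get(predicate)}

def process(triple: Triple) -> None:
    """Steps 1-5: receive, enrich, store, act, and forward."""
    enriched = enrich(triple)           # enrich with schema information
    if enriched["range"] is None:
        return                          # unknown predicate: drop the message
    store.add(triple)                   # update the internal triple store
    subject, predicate, obj = triple
    if predicate == "ex:ward":          # perform an action of interest...
        # ...and emit a derived triple to another stream
        out_stream.append((subject, "ex:notified", obj))

process(("patient:123", "ex:ward", "oncology"))
process(("patient:123", "ex:unknown", "x"))  # dropped: not in the schema
```

The point of routing every decision through the schema lookup is that two applications which have never met can still interoperate, as long as they agree on (or can translate between) the published schemas.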
This will enable people to build arbitrarily complex distributed applications based on Terminus. Even applications built by completely different teams without any contact with each other will be able to talk to each other thanks to the power of OWL. If this becomes possible, I would have a good chance of successfully pitching a Terminus pilot to our data warehouse director as a potential way of replacing our entire current ETL process.
Concrete proposed changes to Terminus
What concrete changes would need to be made to the Terminus structure and philosophy for this to happen?
- Not just the data but the inputs and outputs need to become RDF. WOQL/JSON-LD would be replaced with pure RDF, and the outputs of selects and any other operations that produce data would be RDF as well.
- It should be easy to write a Terminus-powered application in any language that specifies the rules for what to do with triples coming in over some stream. Specifically, how should the internal data store evolve and what triples should be sent to other streams based on the semantics of the data received?
- Terminus will guarantee that the incoming and outgoing messages and the internal data store satisfy all conditions in both the incoming message’s schema and the internal schema.
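The dual-schema guarantee in the last point can be sketched as a gate that every message passes through before touching the data store. The predicate-to-type maps below are toy stand-ins for real OWL schemas, and the function names are hypothetical.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)

# Toy schemas: predicate -> declared range. In reality these would be OWL
# documents fetched from the sender's and receiver's schema servers.
incoming_schema = {"ex:ward": "ex:Ward"}                           # sender's contract
internal_schema = {"ex:ward": "ex:Ward", "ex:name": "xsd:string"}  # receiver's own

def conforms(triple: Triple, schema: dict[str, str]) -> bool:
    """A triple conforms (in this toy model) if its predicate is declared."""
    _, predicate, _ = triple
    return predicate in schema

def accept(triple: Triple) -> bool:
    """Enforce both the incoming message's schema and the internal schema."""
    return conforms(triple, incoming_schema) and conforms(triple, internal_schema)

ok = accept(("patient:123", "ex:ward", "oncology"))  # declared in both schemas
bad = accept(("patient:123", "ex:name", "Ada"))      # sender never declared ex:name
```

A real implementation would of course check ranges, cardinalities, and the other OWL constraints rather than mere predicate membership, but the shape is the same: conformance to both schemas is verified before any write or forward happens.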
Thanks for reading, and I would love to hear others’ feedback and thoughts.