Terminus as a stream-based distributed graph database

Disclaimer: These are the thoughts of someone who was introduced to graph databases very recently, so please correct me if I have some fundamental misunderstanding. :slight_smile:

Problems solved by the relational model, RDF/OWL, and streaming pipelines

The reason the relational model took over the world is the scalability it offers. The strong guarantees it provides give people the comforting feeling of mathematically certain safety while they make changes to arbitrarily complex databases.

Now our data needs are outgrowing relational databases in two ways:

  1. The foreign key model can’t provide semantic information, which forces people to rely heavily on data dictionaries to understand the data. The schema constraints of relational databases are rudimentary, which forces developers to write unit tests to enforce expected relationships. This is the heart of what Terminus and OWL in general aim to solve.
  2. The fundamental structure of a database-driven enterprise is wrong. I work on a 20-person data warehouse team in a major hospital, and our jobs exist solely because when a nurse enters a new patient into a system, there are 10 departments with 20 different applications that want to know about it. One open-source project that’s tackling this issue head-on is Apache Kafka. The Kafka philosophy is to treat the changes to the data as first-class data - ‘facts’ - and stream them throughout the entire enterprise, allowing any number (even thousands) of applications to process facts of interest in real time. Taken to its logical conclusion, this means we can do without any centralized databases entirely. Every individual application will hold on to its own cache of aggregated data that it can reconstruct from the log of past streamed messages if necessary. Information can flow from source to destination directly in an efficient and elegant manner. (A minimal sketch of publishing such a fact to a stream follows this list.)
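
To make the "facts as first-class data" idea concrete, here is a minimal sketch of a nurse's action being published once as an immutable fact. It assumes a local Kafka broker and the kafka-python client; the topic name and event shape are illustrative placeholders, not part of any existing system.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The admission is published once as a fact; any number of downstream
# applications can subscribe to the topic and react to it independently.
fact = {
    "event": "PatientAdmitted",
    "patient_id": "12345",
    "ward": "cardiology",
    "timestamp": "2020-07-01T10:15:00Z",
}
producer.send("patient-admissions", value=fact)
producer.flush()
```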

I’m growing to appreciate the streaming model that Apache Kafka offers more and more, but it actually suffers even more from the lack of semantic information than a centralized database. Whereas before you had one central database with one data dictionary, you now potentially have tens or hundreds or thousands of applications both producing and consuming messages. No one person or team can keep track of all the meanings and constraints of all of this data.

A stream-based distributed graph database

I think there’s a real opportunity for Terminus to solve both problems and become more pure and elegant in the process. I envision a future where any change made by a user of any application is encoded as an RDF triple and sent to a relevant stream, where any number of specialized Terminus-powered applications can access it. These applications will (a rough sketch follows the list):

  1. Receive the RDF triple
  2. Enrich it with schema information from one or several schema servers
  3. Optionally make a change to its internal triple store based on the semantic information encoded in the triple
  4. Optionally perform some action
  5. Optionally produce some output to another stream as another RDF triple
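
As a rough sketch of those five steps, here is what such an application could look like in Python, assuming kafka-python for the streams and rdflib for the triple stores. The topic names, the schema-server URL, and the filtering rule are all hypothetical placeholders.

```python
from kafka import KafkaConsumer, KafkaProducer
from rdflib import Graph

consumer = KafkaConsumer("facts-in", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

store = Graph()   # the application's internal triple store

# Step 2: fetch schema information from a (hypothetical) schema server.
schema = Graph()
schema.parse("http://schema-server.example.org/hospital-schema.ttl", format="turtle")

for message in consumer:                                   # Step 1: receive the triple(s)
    incoming = Graph()
    incoming.parse(data=message.value.decode("utf-8"), format="nt")

    for s, p, o in incoming:
        # Only act on triples whose predicate is declared in the schema.
        if (p, None, None) in schema:
            store.add((s, p, o))                           # Step 3: update the internal store
            print("stored", s, p, o)                       # Step 4: perform some action
            # Step 5: forward the triple to another stream as N-Triples.
            line = f"{s.n3()} {p.n3()} {o.n3()} .\n"
            producer.send("facts-out", line.encode("utf-8"))
```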

This will enable people to build arbitrarily complex distributed applications based on Terminus. Even applications built by completely different teams without any contact with each other will be able to talk to each other thanks to the power of OWL. If this becomes possible, I would have a good chance of successfully pitching a Terminus pilot to our data warehouse director as a potential way of replacing our entire current ETL process.

Concrete proposed changes to Terminus

What concrete changes would need to be made to the Terminus structure and philosophy for this to happen?

  1. Inputs and outputs need to become RDF, not just the stored data: WOQL/JSON-LD queries should be replaced with pure RDF, and the outputs of selects and any other operations that produce data need to be RDF as well.
  2. It should be easy to write a Terminus-powered application in any language that specifies the rules for what to do with triples coming in over some stream. Specifically, how should the internal data store evolve and what triples should be sent to other streams based on the semantics of the data received?
  3. Terminus will guarantee that the incoming and outgoing messages and the internal data store satisfy all conditions in both the incoming message’s schema and the internal schema. (A validation sketch follows this list.)
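
Item 3 is about rejecting messages that violate schema constraints before they reach the internal store. OWL reasoning is the mechanism the post has in mind; as a hedged stand-in, here is a sketch of the same gatekeeping step using SHACL shapes via pySHACL. The message and shapes files are hypothetical.

```python
from rdflib import Graph
from pyshacl import validate

message = Graph().parse("incoming_message.ttl", format="turtle")
shapes = Graph().parse("internal_schema_shapes.ttl", format="turtle")

conforms, report_graph, report_text = validate(message, shacl_graph=shapes)
if not conforms:
    # Reject the message (or route it to a dead-letter stream) before it
    # can reach the internal triple store.
    print(report_text)
```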

Thanks for reading, and I would love to hear others’ feedback and thoughts.

  1. JSON-LD is actually a format for serialising RDF. In version 1.0 of TerminusDB, WOQL’s underlying JSON format was not JSON-LD, so it was not properly a serialisation of RDF. However, in version 2.0, which has just been released as a Canary release, WOQL is JSON-LD and so all data is in fact RDF. Other translations and serialisations will be possible in the future, including Turtle and XML. (A small example of JSON-LD as RDF follows this list.)

  2. Should already be possible. You can write a library for WOQL by simply generating the appropriate JSON-LD. Two example libraries already exist - one in JavaScript and one in Python.
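
To illustrate point 1, here is a minimal sketch showing that a JSON-LD document is just another serialisation of an RDF graph. It assumes rdflib 6+ (which ships with built-in JSON-LD support); the document itself is a made-up example.

```python
from rdflib import Graph

doc = """
{
  "@context": {"name": "http://schema.org/name"},
  "@id": "http://example.org/patient/42",
  "name": "Jane Doe"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")

for s, p, o in g:
    print(s, p, o)
# -> http://example.org/patient/42 http://schema.org/name Jane Doe

# The same graph can be re-serialised as Turtle, N-Triples, or RDF/XML.
print(g.serialize(format="turtle"))
```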

In terms of streaming, the database is really designed to be a versioned branching store based on delta encodings. All of our strengths come from this underlying design. However, these deltas are quite compact and can be communicated in a distributed fashion.
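
Conceptually, a delta over an RDF graph can be thought of as a pair of (additions, deletions) triple sets that can be shipped over a stream and applied anywhere. The sketch below, using rdflib, illustrates that idea only; it is not TerminusDB's internal delta format.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

def apply_delta(graph, additions, deletions):
    """Apply a delta to a graph in place: remove old triples, add new ones."""
    for triple in deletions:
        graph.remove(triple)
    for triple in additions:
        graph.add(triple)
    return graph

g = Graph()
g.add((EX.patient42, EX.ward, Literal("oncology")))

# The patient is moved to another ward: one deletion, one addition.
delta = {
    "add": [(EX.patient42, EX.ward, Literal("cardiology"))],
    "del": [(EX.patient42, EX.ward, Literal("oncology"))],
}
apply_delta(g, delta["add"], delta["del"])
```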

Thanks for replying @gavin!

That’s really cool, I hadn’t realized that JSON-LD is an RDF format!

Give me a couple of days to do more research on all this, come up with a concrete use case for Terminus, and ask follow-up questions if I can’t find a capability I need.

Aram

I recently created the delightful project and will be adding another curated list one of these days, called delightful linked data. Currently it just has a bunch of open standards on it, but one of them is interesting in light of this topic:

https://www.w3.org/TR/json-ld11-streaming/

Just mentioning :slight_smile:

This is awesome @aschrijver, thank you so much for the link! I literally just designed a JSON-LD schema for a Kafka-based distributed application at work, so this couldn’t be more relevant.

I’ve also been tinkering with the idea of an event-sourcing, graph-based communication platform. I don’t have anything concrete yet, but interesting things arise when an application broadcasts all of its state changes as RDF graphs for other applications to consume. A special application called a topology can allow multiple applications to consume each other’s state changes. The topology application itself broadcasts all of its internal changes too, so topologies can be nested.

That is cool. Do you have some pointers to a bit of background regarding this topology concept?

Got the idea from the Kafka Streams documentation:

a processor topology is a graph of stream processors (nodes) that are connected by streams (edges)

I think it’s a very useful general concept. Pipes in Unix can be thought of as a processor topology too.
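
The pipes analogy can be made concrete with a tiny Python sketch: each processor is a generator (a node in the topology), and chaining them together forms the streams (the edges). The event strings and processor names are purely illustrative.

```python
def source():
    # Emits raw events, like the first command in a Unix pipeline.
    yield from ["admit patient 42", "discharge patient 7", "admit patient 99"]

def filter_admissions(events):
    # Passes through only the events this processor cares about.
    for e in events:
        if e.startswith("admit"):
            yield e

def annotate(events):
    # Enriches each event before passing it downstream.
    for e in events:
        yield f"{e} (seen by billing)"

# Equivalent to: source | filter_admissions | annotate
for event in annotate(filter_admissions(source())):
    print(event)
```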

Interesting, thanks! This looks a bit like an actor model to me at first sight, or could be combined with one.
