Terminus and Event Sourcing, CQRS, DDD

Even though I am still in the early stages, I’ll mention the use case for which I am investigating TerminusDB:

I am looking to start two fediverse (ActivityPub / ActivityStreams) based applications: one a collaboration server, the other a federation of (relatively) small-scale social networks.

Besides these standards I will combine them with the richness of linked data (similar to SemApps), but probably server-side, based on TypeScript / NestJS.

So here I already get quite excited about TerminusDB’s features for the ‘web era’, and about WOQL/JSON-LD as a replacement for SPARQL queries.

But I also intend to investigate Event Sourcing and Command/Query Responsibility Segregation (CQRS), treating incoming ActivityPub messages as commands that trigger events in the business logic, which are then stored in Terminus as the event store. I don’t know how well Terminus is suited for that, but with its git-for-data concept I imagine quite well. I read that Terminus is read-optimized, and I expect to have far more reads than writes anyway. Terminus acting as an event store can also keep the semantic relationships and allow for great query composition on the read side.
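To sketch the flow I have in mind (all types and names below are invented for illustration; none of this is an existing TerminusDB API):

```typescript
// Hypothetical sketch of the command -> event flow, with TerminusDB as the
// event store. The EventStore interface stands in for whatever commit
// operation the client library actually exposes.
interface CreateNoteCommand {
  kind: "CreateNote";
  actorId: string;  // the ActivityPub actor sending the activity
  content: string;  // the body of the Note
}

interface NoteCreatedEvent {
  kind: "NoteCreated";
  actorId: string;
  content: string;
  occurredAt: string; // ISO timestamp
}

interface EventStore {
  append(event: NoteCreatedEvent): Promise<void>; // e.g. a commit to Terminus
}

async function handleCreateNote(cmd: CreateNoteCommand, store: EventStore): Promise<void> {
  // Business rules would be checked here (is the actor allowed to post, etc.).
  await store.append({
    kind: "NoteCreated",
    actorId: cmd.actorId,
    content: cmd.content,
    occurredAt: new Date().toISOString(),
  });
  // Read models (timelines, social graphs) are projected from events separately.
}
```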

Finally, with all this in place I intend to go the Domain-Driven Design route, making it ever easier to translate domain concepts to an ever-growing linked data model.

Some questions I now have:

  • Do you intend to provide TypeScript support?

  • One problem with event sourcing is complying with the GDPR Right to be Forgotten, and I think Terminus, with its forward-only immutable storage concept, will have this problem too: https://duckduckgo.com/?q=eventsourcing+gdpr&t=canonical
    I wonder what your approach is in this regard?

3 Likes

GDPR Right to be Forgotten is definitely a thing we intend to support, but as of yet, we do not.
Our plan for supporting it is to introduce a history-rewriting filter operation. This would rewrite your history using a filter query, ensuring that any data you’re no longer supposed to know about is deleted. So unlike an ordinary delete, this wouldn’t just append a new layer with deletions, but would actually go over all the old layers, and create a new layer out of them which doesn’t include the offending data. After that, the unreferenced layers can be garbage collected (another operation which isn’t there yet, but which we’re working on including soon). Git also has an operation like this, called filter-branch.

Note though that this would only take care of the history on one server. If you’re using the data sharing operations that we’ll start pushing out from the next release, you may end up with copies of this data in multiple places. Much like with a git repository, you can’t force everyone who downloaded it to also delete the sensitive data.
Also, this operation would leave you with a completely new history, so it is rather intrusive. Anyone who has cloned this particular database would have to know the rewrite has happened before attempting to push new data, as all common history will be lost.

That said, if you’re dealing with personal data, you’re probably just storing it on one server, or a set of trusted servers, and so the rewrite operation may be a viable strategy for you.

I hope this answers your GDPR related questions. As I’m not very familiar with the plans regarding supported client languages, I’ll leave that one for someone else to answer.

3 Likes

Thanks for the elaborate answer, @matthijs! Yes, this is a tough nut to crack: keeping the data consistent after such an operation, especially if part of the data represents application state.

1 Like

Some more background that you may find interesting…

So I’m interested in looking into combining Domain-Driven Design + Linked Data for the fediverse apps I’m elaborating on. This DDD + LD approach is a bit odd, and there is hardly any information in the wild on the combination of these two fields. Usually LD brings you to the more academic, data-science corners of the web, while DDD leads more towards enterprise business applications.

This combo is interesting, I think, in order to make rich semantic models available to the masses in well-designed (clean architecture) applications. Event sourcing, CQRS and DDD have gotten better tool, framework and library support, to the extent that they are now within easy reach of a large part of the developer community. Many large, production-ready projects use ES and CQRS, and DDD is now following along.

Like I said, I want to use the NestJS framework, which has a CQRS module, a DDD Node.js starter kit and great articles introducing DDD + Nest.
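As a taste of how little ceremony that module needs, a minimal command handler with @nestjs/cqrs could look like this (the command and event are invented for the example):

```typescript
import { CommandHandler, ICommandHandler, EventBus } from "@nestjs/cqrs";

// Invented command and event for the example.
export class LikeNoteCommand {
  constructor(public readonly actorId: string, public readonly noteId: string) {}
}

export class NoteLikedEvent {
  constructor(public readonly actorId: string, public readonly noteId: string) {}
}

// NestJS discovers this handler via the decorator and wires it to the CommandBus.
@CommandHandler(LikeNoteCommand)
export class LikeNoteHandler implements ICommandHandler<LikeNoteCommand> {
  constructor(private readonly eventBus: EventBus) {}

  async execute(command: LikeNoteCommand): Promise<void> {
    // Domain logic goes here; on success, publish the resulting event.
    this.eventBus.publish(new NoteLikedEvent(command.actorId, command.noteId));
  }
}
```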

On the Linked Data side of things, developer support unfortunately does not look all that good yet: many unwieldy, complicated or near-abandoned tools, horrible UX designs, overly technical approaches, and a small ecosystem in general.

I’ve sold my heart to the Fediverse, which, based on some initial specs, has been grown organically by people in the right spirit of FOSS and openness. This has come at the cost of some incompatibilities between federated applications, which fediverse communities (specifically SocialHub and Feneas) are now tackling. There is still a risk that some practically-minded devs want to drop the JSON-LD in favor of plain JSON, but I think things are going in the right direction. The fediverse has proven to be production-ready, and most of the applications have proper UX (which is further improving all the time).

PS. I think the Fediverse is a great place to find contributors to TerminusDB and would heartily recommend you create a Mastodon account.

1 Like

I’m also interested in event streaming applications of Terminus.

From what I understand so far, since Terminus uses this ‘git-for-data’ concept, as you say, its history can be represented as a ‘forward-only immutable’ log of update events.

I am imagining a Terminus database that uses Apache Kafka as its persistent storage. Then any update to a Terminus database would be published to an Apache Kafka topic, which would allow multiple applications on other servers to ‘subscribe’ to Terminus update events and invoke business logic based on what kind of update event was just published to the topic.
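A minimal sketch of the subscriber side, using the kafkajs client (the topic name and event shape are assumptions for the example):

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "projection-service", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "projections" });

async function run(): Promise<void> {
  await consumer.connect();
  // "terminus-updates" is a hypothetical topic that Terminus updates would be published to.
  await consumer.subscribe({ topic: "terminus-updates", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const update = JSON.parse(message.value!.toString());
      // Dispatch to business logic based on what kind of update event this is.
      console.log("received update:", update);
    },
  });
}

run().catch(console.error);
```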

2 Likes

That is interesting, @aram.

Are you intending to write the persistence layer yourself then (in Rust, probably), or to use Terminus as-is and persist based on db events in some existing Kafka infrastructure?

I am curious myself how @luke and @kevin view the ES / CQRS / DDD use cases on top of TerminusDB?

I am thinking that even if, on the write side (in a CQRS design), Terminus might not be the best choice for some reason (e.g. being read-optimized), it may still be ideal on the read side for creating semantic projections of the data and powerful querying facilities. If Terminus is a good fit for the write side - which with git-for-data seems to be the case - maybe there should still be two databases for the read + write sides, but other setups may be better still.

1 Like

To be honest, writing a new persistence layer seems hard. :man_shrugging:

An easier and more immediate way to test the concept would be to have each application that updates your Terminus database also write the JSON-LD it used to make the request to a publish-subscribe system like Kafka. This would achieve the same effect of both propagating events and being able to easily restore a database from the logs of the publish-subscribe system.
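Roughly like this, again with kafkajs and the same hypothetical topic as in the consumer sketch above:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "app-writer", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Publish the same JSON-LD body the application sent to TerminusDB, so the
// topic doubles as a replayable log of all updates.
async function publishUpdate(jsonLd: object): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "terminus-updates", // hypothetical topic name
    messages: [{ value: JSON.stringify(jsonLd) }],
  });
}
```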

I’m new to Terminus too and am just thinking out loud here. Would also love to see the opinions of core Terminus developers on ES and CQRS.

2 Likes

Sounds like an interesting experiment. I’d be interested to know at what scale / rate of events TerminusDB would struggle. I always thought a good Kafka/TerminusDB combination would be event summarisation: filtering all those low-level events into more meaningful objects. I never thought of using Kafka as a document cache - sounds interesting. The new WOQL has WOQL.read_document("doc:id", "v:JSONLD"), which gives you a dump of the whole document into a WOQL variable, so it should be super easy to write the caching logic.
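For instance, a sketch (assuming the JavaScript client; the connection details and result handling are placeholders):

```typescript
import TerminusClient from "@terminusdb/terminusdb-client";

const client = new TerminusClient.WOQLClient("http://localhost:6363", {
  user: "admin",
  key: "root", // placeholder credentials
});
const WOQL = TerminusClient.WOQL;

// Read a whole document as JSON-LD and hand it to a publish step, e.g. the
// Kafka producer sketched earlier in this thread.
async function cacheDocument(docId: string, publish: (doc: object) => Promise<void>): Promise<void> {
  const result = await client.query(WOQL.read_document(`doc:${docId}`, "v:JSONLD"));
  const jsonLd = result.bindings[0]["v:JSONLD"]; // assumed binding shape
  await publish(jsonLd);
}
```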

3 Likes

In the fediverse + event sourcing case there is some interesting mapping to do, to see how things fit. The ActivityStreams Vocabulary defines Activities, which have an ‘object’ property (Object is at the root of the data model, like Document in TerminusDB). The ActivityPub standard has two parts, client-to-server and server-to-server. Most fediverse apps only implement S2S.

The Activities can be seen as Commands when they are sent by a client; when sent for federation, they are an Event for the sender and a Command to the recipient federated instance (but that last bit is not important when considering just the server application at first).

Toots (similar to tweets), their comments, boosts (similar to retweets) and likes trigger Create{Note}, Announce{Note} and Like{Note} activities, but much more complex structures are possible too. There is no command/event separation, in the sense that there is no separate Like (command) / Liked (event) version of an activity. Users are Actors that have a bunch of collections, such as ‘followers’ and ‘following’ (of other actors), as well as an ‘inbox’ and ‘outbox’ of activities. From all this, all kinds of social graphs and timelines need to be constructed.
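To make that concrete, a Create{Note} as it travels over the wire is just a small JSON-LD document, per the ActivityStreams vocabulary (the IRIs here are invented for the example):

```typescript
// A minimal ActivityStreams Create{Note} activity: a command when POSTed by a
// client (C2S), and an event for the sender / command for the recipient when
// federated between servers (S2S).
const createNote = {
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Create",
  "id": "https://example.social/activities/1",   // invented IRI
  "actor": "https://example.social/users/alice", // invented IRI
  "object": {
    "type": "Note",
    "id": "https://example.social/notes/1",      // invented IRI
    "content": "Hello, fediverse!",
  },
};
```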

The current fediverse still uses a rather small domain model with very few activities, actor types and object types in total, and with a rather limited set of properties on each. It gets interesting when combining this with more of the richness of linked data for creating more domain-specific federated applications (for a list of current fediverse apps, see the fediverse watchlist).

1 Like

@kevin - In theory, it would be simple to just store the layers in a Kafka topic instead of in a file. In practice, it would probably have to be done by adding a storage backend to the terminusdb-store source code, right? Honestly, even the code for file.rs looks pretty long and intimidating, and I don’t see any unit tests anywhere. I’d be afraid to touch anything in that codebase. Are you saying there’s another (easier) way to do it?

@aschrijver - Do you want to use Terminus as a publish-subscribe service as well as an event store?

terminusdb-store definitely does have unit tests, 131 of them at this time. file.rs indeed doesn’t contain them, because all that is in there is a bunch of traits and data transfer objects. Better to look at directory.rs, where the file-based backend is actually implemented.

Implementing a new persistent backend could be done by providing a new implementation of PersistentLayerStore, which in turn requires some kind of File type implementing the FileLoad and FileStore traits.

1 Like

Sorry @matthijs, I see the tests now. I was expecting to see a separate ‘tests’ directory - didn’t realize the tests were in the source code files themselves.

1 Like

I can’t tell yet. There is a lot to figure out, and I will start from small beginnings. The projects are FOSS and have a potentially very large scope, with organic growth shaping their direction. There is also an actor model that I am looking at, but I don’t know if that makes sense for NodeJS (I still have the option of a Java backend on top of vertx.io, though, which is actor-based).

1 Like

I thought of creating a new topic for this, but since I just wanted to throw a link at you as an FYI, I might as well do it here. There are many challenges on my path (though luckily, given my bounded context, I hope to avoid many of them for a long while). There is a lot of work being done all over the place on decentralized semantic solutions. One such club has brought a lot of these things together, and it relates in many cases to concepts that TerminusDB, and maybe particularly Terminus Hub, will also have to deal with. So here I throw you the link:

https://infocentral.org/

:wink:

2 Likes