Materialized Views in TerminusDB

I work in data warehousing. When you strip away all the fluff and jargon, a data warehouse is just a set of materialized views. Our work consists of taking data from tables and storing the same data in tables but in a different shape.

The problem of efficiently updating materialized views is as old as the 1986 paper of the same name. The problem is simple to state: A change in a source system table must be converted to a set of changes in the materialized view tables. 30+ years and hundreds of papers later, a general-purpose relational solution is nowhere in sight. Data warehouses still rely on an army of administrators using point-and-click tools to move data in. I think we can confidently say that the relational model is just not up to the task.

Intuitively, I feel that the problem of updating materialized views is solvable with Terminus because Terminus’ query language can be represented as data itself and reasoned about. I’m currently trying to show it, but it will take me a while.

I would appreciate any thoughts or feedback on this topic.

I think Terminus is uniquely positioned to solve this problem once and for all. There’s quite a bit of money on the line too. I would estimate hundreds of millions of dollars are spent annually on point-and-click ETL tools like Informatica PowerCenter. Every sane data warehouse architect hates these tools and knows there must be a better solution…

3 Likes

Materialised view update is very closely related to the study of bidirectional transformation and lenses. The reason for this is that we need a relation that allows us to establish that the original antecedents would yield the modified consequent. This will only be true if we can genuinely produce the same solutions in either direction.

The study of lenses, bidirectional transformations and indeed its use on materialised views (Relational Lenses) has progressed somewhat in the language theory community but it has never found wide adoption. There have also been developments specifically in terms of graphs (see: Grace Report for a survey of some of the literature here).

We have taken the view that we should try to distinguish the inferred from the basic as much as possible for the sake of simplicity and view the creation of altered materialisations primarily as a strategy of changing the antecedents appropriately and then cache invalidating the “view”. The reason is primarily to avoid the complexity entailed by the lens solution, which while it seems theoretically ideal, would require a fair bit of work to make feasible in practice.

3 Likes

Thanks @gavin, this is a great set of papers and will keep me busy reading for some time.

I was hoping that it would be easy to apply a bidirectional transformation in WOQL due to its graph nature, but I guess that’s not the case… I’ll keep studying this question.

Aram

2 Likes

I just finished skimming through the Relational Lenses paper. It’s very impressive theoretically, but almost completely useless in the real world, because the bread and butter of data warehousing is grouping and aggregations of both integers (as sums) and strings (as semicolon-separated lists). Select-join-project alone is not enough for us.

This is why I’m more excited about WOQL since it already has grouping and aggregation. For example, I was able to do the following in the bike share database:

/* Group journey bikes and ends by journey starts */
group_by("v:Start", ["v:Bike", "v:End"], "v:BikeEnds")
    .triple("v:Journey", "start_station", "v:Start")
    .triple("v:Journey", "end_station", "v:End")
    .triple("v:Journey", "journey_bicycle", "v:Bike")

I was hoping that if I treated the WOQL JSON-LD itself as data, it would be possible to figure out at least which rows in the query will be updated when the source data is updated.

I have seen much more promising developments in papers exploring this problem from a graph database perspective. I’ll read the Grace Report next.

4 Likes

In doing research on this subject, I came across the Rust libraries Timely Dataflow and Differential Dataflow. Differential dataflow in particular seems like it just about solves the materialized views problem I presented here, and I highly recommend anyone interested to check it out (there’s a paper, a video, a book, and a bunch of other stuff I’m sure). There’s also a company called Materialize that uses these technologies to provide fast differential updates for almost arbitrary SQL queries, which is really cool, and something I thought was impossible.

3 Likes

Differential dataflow is a fantastic development I should have mentioned in the original post :smiley:

3 Likes

Here’s another project about Lenses:

And their Typescript project: https://github.com/inkandswitch/cambria

3 Likes

wow - that is interesting. Thanks for sharing.