Indexes in the database?

Hi, and thanks for open sourcing this cool project!

I was wondering how indexing is handled - is there “a full table scan” in every query? Any ways to define indexes?

It looks like you have not based the code on the rdf-package in swi-prolog, am I right?

Petter

2 Likes

The triples are stored in a self-index based on succinct data-structures. The precise index depends on the mode of access. Forward search uses succinct arrays. Going backwards entails the use of wavelet trees. Technical details can be found in our white-paper here: https://github.com/terminusdb/terminusdb-server/blob/master/docs/whitepaper/terminusdb.pdf.

We originally based it on the rdf-package in swi-prolog but this did not scale to billions of triples well. We chose the current approach after trying to load several billions of triples for a business intelligence application.

We intend to alter the storage of literals to improve performance such that they are in type-prefixed blocks in sorted order. This should improve range based search significantly and will reduce storage overhead.

We may look at including other special-purpose indexes in the future. Indexes in our world are on immutable datastructures so many of the optimisations which are designed to make update work are not relevant so there is a lot of interesting research to be done!

2 Likes

The triples are stored in a self-index based on succinct data-structures. The precise index depends on the mode of access. Forward search uses succinct indexes. Going backwards entails the use of wavelet trees.

To expand on this, and correct slightly…

The triples are indeed stored in a self-index, meaning, they’re stored ordered, as two adjacency lists. This makes it very quick to do a triple lookup if you have the subject, and even quicker if you also know the predicate.
If you do not query using the subject, we use an index structure to retrieve your data. In addition to the self-index, there’s two more indexes:

  • an index from object to subject-predicate pairs. This index is also an adjacency list, only in the other direction compared to the self-index. Looking up by object is slightly slower than by subject, because there’s less data locality.
  • an index from predicate to subject. This index is a wavelet tree, which compared to the adjacency list has worse lookup time but better space efficiency. Since most datasets will have much less distinct predicates than distinct subjects/objects, this is a worthwhile tradeoff.

So what structure is used depends on what information is known in the lookup. We try to go for the self-index, failing that we try to use the object to subject-predicate-pairs index, and failing that, we use the predicate index.

The only time a full scan is done is when you query for every triple.

3 Likes

Okay, I see - sounds very good to me!

Do all data need to fit into memory or is there any index functionality when it comes to the persisted data as well?

Ps - looking forward to use the prolog api when it is finished :slight_smile:

1 Like

Yes, all data needs to fit in memory. While the storage backend does provide some ways to stream the data structures without loading them (including the indexes), this is not used at the moment, and is not exposed in the server. In the future, we’ll probably use these to reduce the memory footprint of some operations, such as merges.

We believe that modern computers have sufficient memory for most data sets that you’re likely to run into or construct yourself. This is why we went with an in-memory approach for all queries. Doing so is much faster than reading from disk on demand, and it greatly simplifies our design, which means less obscure bugs. Should the need ever arise, we’ll investigate alternative query mechanisms of course (such as streaming the data structures from disk), but so far this hasn’t come up, and we expect it likely won’t come up unless your needs are niche.

4 Likes

Yes, thanks for good answers.

I fully agree in you design choices - a memory based solution would do very well these days.

2 Likes