Data in memory, a few questions

As I was reading the Indexes in the database? thread about indexes, the fact that all data must fit in memory kind of sank in for me:

  • what happens if the dataset grows bigger than available server memory, say if I’m hosting terminus on a VPS where data could be contributed by users (via an API for instance)?
  • is there some magic formula to estimate the memory size required, even approximately? I imagine the number of commits/documents/relations/triples would be inputs to it?

Good questions. If the dataset grows bigger than the available memory, there are two ways to deal with it.

If the growth is through accumulation of transactions and the resulting growth in commit histories, delta roll-ups allow you to create a new ground state, and you can throw away or archive the old commit histories. This is not currently implemented, but it is on our very near-term development roadmap, as it is a feature that can become urgently important. With delta roll-ups, you can keep the memory consumption of your database as close as you want to the heavily compressed size of all the data in it. An interesting question is what the optimal default roll-up strategy is: even in cases where you want to keep all commit histories, you can create roll-up layers within commit chains, which improves performance by reducing the number of deltas you have to traverse to answer a query. I'm pretty sure there's a reasonably simple algorithm for automated roll-ups that would suit 99% of use cases, but I have no idea what that algorithm is :slight_smile: If anybody is interested in contributing to the answer, we're all ears - we definitely don't know the answer right now, and it's an interesting problem.
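
To make that concrete, here is a purely hypothetical sketch (in Python) of the kind of heuristic I have in mind; nothing here is implemented, and the `Layer` shape and the thresholds are invented for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Layer:
    """One delta (or roll-up) layer in a commit chain. Illustrative only."""
    commit_id: str
    triples_added: int
    triples_removed: int
    is_rollup: bool = False

def should_rollup(chain: List[Layer],
                  max_deltas: int = 32,
                  churn_ratio: float = 0.5) -> bool:
    """Decide whether to squash the deltas accumulated since the last roll-up.

    Two triggers, either of which fires a roll-up:
      * the chain of un-rolled-up deltas is longer than max_deltas
        (query cost grows with the number of layers to traverse), or
      * the triples removed by those deltas exceed churn_ratio of the
        triples they added (dead weight that a roll-up would drop).
    """
    # Look only at layers newer than the most recent roll-up.
    last_rollup = max((i for i, layer in enumerate(chain) if layer.is_rollup),
                      default=-1)
    pending = chain[last_rollup + 1:]

    if len(pending) > max_deltas:
        return True

    added = sum(layer.triples_added for layer in pending)
    removed = sum(layer.triples_removed for layer in pending)
    return added > 0 and removed / added > churn_ratio
```

With those made-up defaults, a chain of 40 un-rolled-up commits would trigger a roll-up regardless of churn, and a shorter chain would trigger one once half of what it added has since been removed.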

The second approach is for situations where accumulation of data points, rather than transactions, is the cause of growth. In this case the solution is to partition the database by class. As long as you are aware that predicates which cross databases can cross servers, and are therefore orders of magnitude slower to follow than local predicates (especially recursively), you can partition the databases across servers and still query fine. We have done some of the groundwork on hot-loading graph fragments from distributed file systems, S3 and the like, which would allow, in theory at least, auto-scaling to infinity. In practice, of course, once you go to disk you still lose a lot of query performance compared to in-memory.
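
For illustration only, partitioning by class might look something like the sketch below. None of these names come from our API; it just shows why a predicate that stays inside one partition is cheap to follow while one that hops servers is not:

```python
from urllib.parse import urlparse

# Hypothetical partition map: each document class lives in its own database,
# possibly on its own server. The URLs and class names are made up.
PARTITIONS = {
    "Person":  "https://server-a.example.com/people",
    "Order":   "https://server-b.example.com/orders",
    "Product": "https://server-b.example.com/products",
}

def database_for(doc_class: str) -> str:
    """Return the database holding documents of the given class."""
    return PARTITIONS[doc_class]

def same_server(class_a: str, class_b: str) -> bool:
    """A predicate between these classes stays local only if both classes
    map to databases on the same server."""
    return (urlparse(database_for(class_a)).netloc
            == urlparse(database_for(class_b)).netloc)

# Order -> Product stays on server-b (fast, local to follow);
# Order -> Person crosses to server-a, costing a network round trip per hop,
# which is what makes recursive cross-partition traversals slow.
```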

Which brings me to your second question: unfortunately we do not yet have a properly reliable formula for memory and resource usage expectations. We're still running tests to simulate the kind of usage we are likely to see under different usage patterns, once we factor in remote repositories and the growth of commit histories over time. This is high on our agenda, and we will have much better answers in the next few weeks. What I can say is that, in our experience, the memory footprint on ingestion of data from a raw CSV file is generally less than 20% of the size of that file, and the size of the database grows relatively slowly with transaction volume. The core database guys may have better numbers, but that's what I go on. There are still some rough edges around very large databases, upwards of 10 billion records; we tend to hit limits on stack size or choicepoints before we hit memory limits, though. It shouldn't be long before we have much better numbers here as well.
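
Just to put that rule of thumb in code form (the 20% ratio is the only input from experience here; everything else is a back-of-the-envelope approximation, not a guarantee):

```python
import os

def estimate_ingest_memory_bytes(csv_paths, footprint_ratio=0.20):
    """Back-of-the-envelope memory estimate for ingesting raw CSV files.

    footprint_ratio reflects the observation above that the in-memory
    footprint is generally *less than* ~20% of the raw CSV size, so treat
    the result as a rough upper bound for the data itself. It ignores
    commit-history growth, heavily connected data, and general server
    overhead, so leave plenty of headroom.
    """
    raw_bytes = sum(os.path.getsize(path) for path in csv_paths)
    return raw_bytes * footprint_ratio

# Example: ~50 GB of raw CSV -> on the order of 10 GB of RAM for the data.
```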


Thanks a lot for the thorough reply, Kevin!
That 20%-of-raw-CSV estimate is already very useful for getting a sense of what to expect. I imagine it would be a bit more for heavily connected data, but at least we're not talking multiples of the raw input, which is great!
On your comment about hot-loading data from S3: can I extrapolate to loading graph fragments from a local SSD on the server, as something in between memory and S3 in terms of performance? If so, assuming I'm willing to sacrifice a few ms on queries, could I actually grow the dataset beyond the size of my memory?

Yes, there are all sorts of interesting new approaches coming out of blending regular volatile main memory with somewhat less volatile storage like SSD. I think I saw a recent article saying Intel can address multiple petabytes - something like 4 PB - as if it were main memory, at about 75% of regular main-memory performance (figures from memory - not reliable).


This is great. Thank you for the insights!