Good questions. If the dataset grows bigger than the available memory, there are two ways to deal with it. If the growth comes from the accumulation of transactions and the resulting increase in commit histories, delta roll-ups let you create a new ground state, and you can then throw away or archive the old commit histories. This is not currently implemented, but it is on our very near-term development roadmap, as it is a feature that can become urgently important. With delta roll-ups, you can keep the memory consumption of your database as close as you want to the heavily compressed size of all the data in the database. An interesting question is what the optimal default roll-up strategy is - even in cases where you want to keep all commit histories, you can create roll-up layers within commit chains, which improves performance by reducing the number of deltas you have to traverse to answer a query. I'm pretty sure there's a reasonably simple algorithm for automated roll-ups that would suit 99% of use cases, but I have no idea what that algorithm is. If anybody is interested in contributing to the answer, we're all ears - we definitely don't know the answer right now, and it's an interesting problem.
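To make the "roll-up layers within commit chains" idea concrete, here is a minimal sketch of one possible threshold-based heuristic. All the names here (Commit, rollup, ROLLUP_EVERY, maybe_rollup) are hypothetical and not part of any existing API - this is just one way the "fewer deltas to traverse" trade-off could be automated, not a proposal for the optimal strategy.

```python
# Hypothetical sketch: insert a roll-up layer every N commits, so that a read
# never has to replay more than N deltas to reach a materialised state.
# None of these names correspond to a real API.

from dataclasses import dataclass

ROLLUP_EVERY = 100  # assumed threshold; picking this well is the open question


@dataclass
class Commit:
    delta: dict                     # key -> value changes in this commit (simplified)
    parent: "Commit | None" = None
    rollup: "dict | None" = None    # materialised state, if a roll-up layer exists here


def commits_since_last_rollup(head: Commit) -> int:
    """Count raw deltas between the head and the nearest roll-up (or the root)."""
    n, c = 0, head
    while c is not None and c.rollup is None:
        n, c = n + 1, c.parent
    return n


def maybe_rollup(head: Commit) -> None:
    """If the chain of raw deltas is too long, materialise a roll-up at the head."""
    if commits_since_last_rollup(head) < ROLLUP_EVERY:
        return
    chain, c = [], head
    while c is not None and c.rollup is None:
        chain.append(c)
        c = c.parent
    state: dict = dict(c.rollup) if c is not None else {}  # start from previous roll-up
    for commit in reversed(chain):                          # replay deltas oldest-first
        state.update(commit.delta)
    head.rollup = state  # queries from here read one layer instead of N deltas
```

A real strategy would also have to account for the storage cost of each roll-up layer and for branches in the commit graph, which is part of why we consider it an open problem.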
The second approach is for situations where the accumulation of data points, rather than transactions, is the cause of growth. In this case the solution is to partition the database by class. As long as you are aware that predicates which cross databases can cross servers, and are therefore orders of magnitude slower to follow than local predicates, especially recursively, you can partition the databases across servers and still query fine. We have done some of the groundwork on hot-loading graph fragments from distributed file systems, S3 and the like, which would allow, in theory at least, auto-scaling to infinity. In practice, of course, once you go to disk you're still losing a lot of query performance compared to in-memory.
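As a rough illustration of what partition-by-class means for query cost, here is a hypothetical sketch from the client's point of view. The class-to-server mapping and the helper functions are invented for the example and are not an existing API; the only point is that a predicate staying inside one partition is a local lookup, while a cross-partition predicate becomes a network hop on every traversal.

```python
# Hypothetical sketch of class-based partitioning. The mapping and helpers
# below are illustrative only, not an existing API.

CLASS_TO_SERVER = {
    "Person":  "https://server-a.example.com/people_db",
    "Order":   "https://server-b.example.com/orders_db",
    "Product": "https://server-b.example.com/products_db",
}


def server_for(class_name: str) -> str:
    """Route a query to the database partition that holds instances of this class."""
    return CLASS_TO_SERVER[class_name]


def predicate_cost(subject_class: str, object_class: str) -> str:
    """Classify a predicate traversal as local or cross-server."""
    if server_for(subject_class) == server_for(object_class):
        return "local hop (in-memory on one server)"
    # Crossing partitions means a network round trip per hop, which is why
    # recursive predicates across partitions are orders of magnitude slower.
    return "remote hop (network round trip)"


print(predicate_cost("Order", "Product"))   # local hop
print(predicate_cost("Person", "Order"))    # remote hop
```

The practical consequence is that you want to choose partition boundaries so that the predicates you follow most often, and especially the ones you follow recursively, stay inside a single partition.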
Which brings me to your second question - unfortunately we do not yet have a properly reliable formula for memory and resource usage expectations. We're still running tests to simulate the kinds of usage we are likely to see under different usage patterns, once we factor in remote repositories and the growth of commit histories over time. This is high on our agenda - we will have much better answers in the next few weeks. What I can say is that, in our experience, the memory footprint on ingestion of data from a raw CSV file is generally less than 20% of the size of that file, and the size of the database grows relatively slowly with transaction volume. The core database guys may have much better numbers, but that's what I go on. There are still some rough edges around very large databases - 10 billion records and up. We tend to hit limits in stack size or choicepoints before we hit memory limits, though. It should not be long before we have much better numbers here as well.
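For a back-of-envelope estimate based on that roughly-20% observation (an observed figure from our experience, not a guaranteed formula), the arithmetic looks like this:

```python
# Back-of-envelope estimate using the ~20% rule of thumb mentioned above.
# The ratio is an observation, not a reliable formula.

def estimate_ingest_memory_gb(csv_size_gb: float, ratio: float = 0.20) -> float:
    """Rough upper bound on the in-memory footprint after ingesting a raw CSV."""
    return csv_size_gb * ratio


# e.g. a 50 GB CSV would generally land somewhat under ~10 GB in memory.
print(estimate_ingest_memory_gb(50))  # 10.0
```

Treat that as a sizing starting point only; we'll publish better-grounded numbers once the usage-pattern tests are done.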