I understand that each tablet is served by a DocDB instance and you can control the number of tablets when a table is initially created.
That is correct. The configurable knob is the number of tablets per node for each table. So if that number is 8 (our default for nodes with more than two CPUs), then irrespective of the number of nodes in the cluster, each node will host 24 tablet replicas for each table (assuming a replication factor of 3).
Our leader balancing then ensures that each node is the leader for 8 tablets of that table and a follower for 16.
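To make the replica arithmetic concrete, here is a minimal sketch in Python (the function and parameter names are illustrative, not actual YugaByte DB flags):

```python
# Sketch of the per-table tablet arithmetic described above.
# Names are illustrative, not actual YugaByte DB flag names.

def replicas_per_node(tablets_per_node: int, replication_factor: int) -> int:
    """Tablet replicas each node hosts for one table.

    An N-node cluster gets tablets_per_node * N tablets for the table, and
    each tablet has replication_factor copies, so the cluster holds
    tablets_per_node * N * replication_factor replicas overall. Spread
    evenly, that is tablets_per_node * replication_factor per node,
    independent of N.
    """
    return tablets_per_node * replication_factor

total = replicas_per_node(tablets_per_node=8, replication_factor=3)
leaders = 8                  # leader balancing gives each node 8 leaders
followers = total - leaders
print(total, leaders, followers)  # 24 8 16
```

Note that the per-node replica count depends only on the per-node tablet setting and the replication factor, which is why the cluster size drops out of the calculation.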
In a typical large cluster, what is the average number of DocDB instances per node?
[See above point.] The size of the cluster doesn't quite matter. The number of tablets (or DocDB instances) per node depends on the number of tables, the replication factor, and the setting for the number of tablets per node.
Does this number reach something as high as hundreds of instances, in which case you would encounter issues multiplexing the CPU and memory needs of all these instances?
Yes, depending on the number of tables, this number can get pretty high. But in most well-designed applications, the number of "active" tables, especially "write-intensive" tables, does not tend to be very large. The metadata/config-type tables, while they might increase the count, usually tend to be read-heavy and do not complicate matters much.
The main impact of too many "write-active" tables or tablets is that there are that many active RAFT logs, so group-commit efficiency can decrease.
But for efficiency, we are considering multiplexing the RAFT logs of multiple tablets into a single physical log in the future.
Otherwise, from the DocDB perspective (and the LSM-style SSTable files it does large sequential writes to), there aren't any complications. We have trimmed away (don't use) the RocksDB WAL, because the RAFT WAL serves as our transaction log.
Have you encountered a scenario where your RocksDB customizations to support multiple instances per node (the shared memory and thread pools) turn out to be insufficient? How many instances can your customized RocksDB implementation scale to?
The customizations we have made, using a shared block cache across all the DocDB instances, making the cache scan-resistant, enforcing global memstore thresholds across all the individual DocDB memstores, and so on, enable us to scale to a very large number of DocDB instances.
We have tested up to about 1000 tablets per server so far (though not all of them were write-heavy).
The global memstore and block cache design really helps in cases where different tables (and hence their tablets) have very different IO patterns.
Perhaps table1 is read-heavy and table2 is write-heavy. Global memstore and block cache thresholds nicely allow table1 to use much of the block cache and table2 to use much of the memstore, as opposed to a scheme where you would have to split the available RAM for block cache and memstore equally among the tables (and their tablets). The latter approach would be really hard to tune when new tables are created dynamically.
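The benefit of a global budget over a static per-table split can be sketched with some toy numbers (this is an illustrative model, not YugaByte DB's actual memory accounting; all sizes and demands are made up):

```python
# Illustrative sketch: contrast a static per-table RAM split with a single
# global budget shared across tables. Numbers are hypothetical.

RAM = 8 * 1024  # MB available for block cache + memstores combined

# Demands in MB: table1 is read-heavy (wants cache),
# table2 is write-heavy (wants memstore).
demand = {
    "table1": {"cache": 3000, "memstore": 200},
    "table2": {"cache": 300,  "memstore": 3000},
}

# Static scheme: split RAM equally per table, then equally between
# block cache and memstore within each table.
per_table = RAM // len(demand)   # 4096 MB per table
half = per_table // 2            # 2048 MB each for cache and memstore
static_served = sum(
    min(d["cache"], half) + min(d["memstore"], half)
    for d in demand.values()
)

# Global scheme: one shared cache pool and one shared memstore pool,
# capped only by total RAM; skewed tables draw what they actually need.
total_demand = sum(d["cache"] + d["memstore"] for d in demand.values())
global_served = min(total_demand, RAM)

print(static_served, global_served)  # 4596 6500
```

Under the static split, each table's unused half is stranded (table1's idle memstore quota, table2's idle cache quota), while the shared pools serve the full skewed demand.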
Additionally, we also balance the tablets (both the RAFT WAL and the SSTable files) on a per-table basis across the available SSDs on each node.
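One simple way to picture this kind of per-table balancing is a greedy least-loaded assignment of tablets to disks (a hypothetical sketch, not the actual placement code):

```python
# Hypothetical sketch of per-table disk balancing: place each tablet's data
# (its RAFT WAL and SSTable files) on the currently least-loaded SSD.

def assign_disks(tablet_ids, disks):
    """Greedy least-loaded placement; returns {tablet_id: disk_name}."""
    load = {d: 0 for d in disks}        # tablets placed on each disk so far
    placement = {}
    for t in tablet_ids:
        disk = min(load, key=load.get)  # least-loaded disk (ties: first)
        placement[t] = disk
        load[disk] += 1
    return placement

placement = assign_disks(range(6), ["ssd0", "ssd1", "ssd2"])
print(placement)
```

Because the assignment is done per table, each table's tablets end up spread evenly across the SSDs, rather than one hot table landing entirely on a single disk.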
How would the performance of the system vary as we increase the number of tablets per node? (I can imagine a scenario where the RocksDB instances compete for compaction and memtable-flush threads, thereby leading to write stalls.)
The question to ask is why the number of tablets per node is going up. Beyond a certain point, increasing the number of tablets per node is not very useful.
We also don't let each DocDB/RocksDB instance make an independent decision on when to compact or when to flush; that can cause a storm. The compaction and memstore-flush thread pools are global and configurably throttled. Depending on the number of CPUs on the node, YugaByte DB picks automatic defaults for the number of concurrent compactions and flushes, but these are overridable.
Additionally, we have reserved slots for minor compactions (as opposed to major compactions); this avoids major compactions taking up all the available compaction threads and prevents a build-up of SSTable files that could degrade performance.
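The reserved-slot idea can be modeled with two semaphores (an illustrative sketch, not YugaByte DB's actual scheduler): majors draw from a smaller pool of shared slots, while minors may use any slot, including the reserved ones.

```python
# Illustrative model of a global compaction pool with slots reserved for
# minor compactions, so major compactions can never occupy every thread.
import threading

class CompactionPool:
    def __init__(self, total_slots: int, reserved_minor: int):
        assert 0 < reserved_minor < total_slots
        # Majors may only use the non-reserved slots; minors may use any slot.
        self.major_slots = threading.BoundedSemaphore(total_slots - reserved_minor)
        self.any_slot = threading.BoundedSemaphore(total_slots)

    def acquire(self, is_major: bool) -> None:
        if is_major:
            self.major_slots.acquire()  # blocks once majors fill shared slots
        self.any_slot.acquire()

    def release(self, is_major: bool) -> None:
        self.any_slot.release()
        if is_major:
            self.major_slots.release()

pool = CompactionPool(total_slots=4, reserved_minor=1)
for _ in range(3):
    pool.acquire(is_major=True)   # three majors fill all shared slots
pool.acquire(is_major=False)      # a minor still gets the reserved slot
print("minor compaction admitted")
```

A fourth major compaction would block until one of the running majors releases its slot, while minor compactions continue to be admitted, which keeps the SSTable file count from building up behind long-running majors.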