Thanks for your question @sliu.
A high-level overview of YugaByte DB’s storage format is explained here: https://docs.yugabyte.com/latest/architecture/docdb/persistence/
We have built this on top of RocksDB (a log-structured key-to-value storage engine), and extended RocksDB to efficiently support a document/row storage model. [See also: https://blog.yugabyte.com/how-we-built-a-high-performance-document-store-on-rocksdb/]
Regarding your question: for rows that share the same prefix of the primary key (such as `metric` in your example), that data is logically repeated on every row. However, YugaByte DB uses a two-level compression scheme.
The block cache format is "prefix compressed," so rows with a common prefix incur very little space overhead in memory. Additionally, on disk, the blocks in SSTable files are stored with Snappy compression.
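To make "prefix compressed" concrete, here is a minimal sketch (illustrative only, not YugaByte or RocksDB code) of the general technique: within a block, each sorted key stores only the length of the prefix it shares with the previous key, plus its unique suffix. The key values below are made up for the example.

```python
# Illustrative prefix (delta) encoding of sorted keys within a block.
# Each entry stores (shared-prefix length with previous key, remaining suffix).

def prefix_encode(sorted_keys):
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))  # (shared length, suffix)
        prev = key
    return encoded

def prefix_decode(encoded):
    keys, prev = [], b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

# Hypothetical time-series keys sharing the same "metric|node" prefix.
keys = [b"cpu_usage|node1|05:00",
        b"cpu_usage|node1|06:00",
        b"cpu_usage|node1|07:00"]
enc = prefix_encode(keys)
assert prefix_decode(enc) == keys  # lossless round-trip

raw = sum(len(k) for k in keys)
compressed = sum(1 + len(suffix) for _, suffix in enc)  # assume 1 byte per length
print(raw, compressed)
```

Because the long common prefix is stored once per block rather than once per row, the space overhead of repeating the prefix is mostly eliminated before Snappy compression even runs.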
Regarding your point about Apache Cassandra:

> In Cassandra, if I understand it correctly, I would expect the partition key `metric` to only appear once in an sstable after compaction.
You are probably correct; I am not 100% sure. I believe this design causes Apache Cassandra problems when a single partition key has lots and lots of rows: the entire partition may need to be brought into memory in an all-or-nothing manner, making memory usage inefficient and also causing GC issues (because of the Java implementation).
With YugaByte DB, a single partition key can have lots and lots of rows spanning several database blocks. Not all of them even need to be brought into memory if the query is interested only in a slice (e.g., the time range 5pm to 10pm); only the blocks covering the matching time range need to be read.
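A rough sketch of why that works (hypothetical structure and names, not YugaByte internals): SSTable blocks are sorted and indexed by their first key, so a range query can binary-search the block index and load only the blocks that could contain matching rows.

```python
# Sketch of range scans over a sorted block index: only blocks that can
# overlap the requested range are "brought into memory."
import bisect

class SSTableSketch:
    def __init__(self, blocks):
        # blocks: list of (first_key, rows), sorted by first_key
        self.first_keys = [first for first, _ in blocks]
        self.blocks = blocks
        self.blocks_loaded = 0  # counts blocks read into memory

    def scan(self, lo, hi):
        # Start at the last block whose first key is <= lo.
        i = max(bisect.bisect_right(self.first_keys, lo) - 1, 0)
        out = []
        while i < len(self.blocks) and self.first_keys[i] <= hi:
            self.blocks_loaded += 1  # this block is loaded; later ones may be skipped
            out.extend(r for r in self.blocks[i][1] if lo <= r <= hi)
            i += 1
        return out

# One partition's rows (hypothetical timestamps) spread across three blocks.
table = SSTableSketch([
    ("05:00", ["05:00", "05:30"]),
    ("06:00", ["06:00", "06:30"]),
    ("22:00", ["22:00", "23:00"]),
])
rows = table.scan("05:00", "06:45")
print(rows, table.blocks_loaded)  # the "22:00" block is never loaded
```

The query touching only the 05:00-06:45 slice reads two of the three blocks; the rest of the partition stays on disk, which is what avoids the all-or-nothing memory behavior described above.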