How is data stored internally in YugaByte

I have a question regarding the internal storage format of YugaByteDB, compared with Cassandra.
Take this table in KairosDB for example:

CREATE TABLE IF NOT EXISTS row_key_time_index (
    metric text,
    table_name text,
    row_time timestamp,
    value text,
    PRIMARY KEY ((metric), table_name, row_time)
);

In Cassandra, if I understand it correctly, I would expect the partition key metric to appear only once in an SSTable after compaction. Is this also true for YugaByteDB?

Thanks for your question @sliu.

A high-level overview of YugaByte DB’s storage format is explained here: Persistence in YugabyteDB | YugabyteDB Docs

We have built this on top of RocksDB (a log-structured key-value storage engine), and extended RocksDB to efficiently support a document/row storage model. [See also: How We Built a High Performance Document Store on RocksDB? | Yugabyte]

Regarding your question: for rows that share the same prefix of the primary key (such as metric in your example), that data is logically repeated on every row. However, YugaByte DB uses a two-level compression scheme, so the repetition costs little in practice.

  • The block cache format is “prefix compressed”. So rows with a common prefix don’t actually incur much space overhead in memory.

  • Additionally, on-disk, in SSTable files, the blocks are stored after Snappy compression.
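To illustrate the first level, here is a minimal sketch of prefix (delta) compression over sorted keys. It is only an illustration, not YugaByte DB's or RocksDB's actual block encoding (which, among other things, also uses periodic restart points). The key layout `metric|table_name|row_time` and the helper names are assumptions made for this example.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_compress(sorted_keys):
    """Encode each key as (shared_len, suffix) relative to the previous key."""
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = shared_prefix_len(prev, key)
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decompress(encoded):
    """Rebuild the full keys from (shared_len, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical flattened keys: metric | table_name | zero-padded row_time.
keys = sorted(
    f"cpu.load|data_points|{ts:010d}".encode() for ts in (1000, 2000, 3000)
)
encoded = prefix_compress(keys)
raw_bytes = sum(len(k) for k in keys)
stored_bytes = sum(len(suffix) for _, suffix in encoded)
print(encoded)
print(f"raw key bytes: {raw_bytes}, stored suffix bytes: {stored_bytes}")
```

Only the first key is stored in full; each later key stores just the bytes that differ from its predecessor, so the repeated metric and table_name cost almost nothing.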

You also mentioned Apache Cassandra: << In Cassandra, if I understand it correctly. I would expect the partition key metric only appear once in an sstable after compaction.>>

You are probably correct, though I am not 100% sure. I think this design causes issues for Apache Cassandra when a single partition key has lots and lots of rows. Likely, the entire partition needs to be brought into memory in an all-or-nothing manner, which makes memory usage inefficient and also causes GC issues (because of the Java implementation).

With YugaByte DB, a single partition key can have lots and lots of rows that span several database blocks. They do not all need to be brought into memory if the query is interested only in a slice (e.g. the time range 5pm to 10pm). Only the blocks that overlap the requested time range need to be read into memory.
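The block-level slicing can be sketched roughly as follows. This is a simplified model under stated assumptions, not YugaByte DB's actual read path: rows are sorted by key, grouped into fixed-size blocks, and a small index holds the first key of each block. The block size, the `(metric, hour)` key shape, and the function names are all hypothetical.

```python
import bisect

BLOCK_SIZE = 4  # rows per block; a hypothetical value for illustration


def build_blocks(sorted_rows, block_size=BLOCK_SIZE):
    """Split sorted (key, value) rows into blocks and index each block's first key."""
    blocks = [
        sorted_rows[i:i + block_size]
        for i in range(0, len(sorted_rows), block_size)
    ]
    index = [block[0][0] for block in blocks]
    return blocks, index


def range_scan(blocks, index, lo, hi):
    """Return rows with lo <= key <= hi, plus the numbers of the blocks read."""
    # The last block whose first key is <= lo may still contain lo.
    start = max(bisect.bisect_right(index, lo) - 1, 0)
    touched, out = [], []
    for b in range(start, len(blocks)):
        if index[b] > hi:
            break  # this block and all later ones start past the range
        touched.append(b)
        out.extend(row for row in blocks[b] if lo <= row[0] <= hi)
    return out, touched


# One partition key ("cpu.load") with one row per hour of the day.
rows = [(("cpu.load", hour), f"v{hour}") for hour in range(24)]
blocks, index = build_blocks(rows)

# Slice for 5pm to 10pm (hours 17 through 22).
result, touched = range_scan(blocks, index, ("cpu.load", 17), ("cpu.load", 22))
print(f"blocks read: {touched} of {len(blocks)}")
print([key[1] for key, _ in result])
```

In this model the partition spans six blocks, but the slice reads only the two that overlap the requested range; the rest never need to be loaded.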