What is the purpose of SST in yugabyte?

We have a 6 node cluster.

The Yugaware interface shows “SST” and “uncompressed SST” columns for every node (as shown below):
[Screenshot: Screen Shot 2022-03-02 at 11.10.37 AM]

What is SST? And what is uncompressed SST?

Hi, tables and indexes are stored in SST (Sorted Sequence Table) files. “Total SST Files Size” is the size of those files on disk, so the total is the physical size of the database. There are various levels of data compression, and “Uncompressed SST Files Size” is the size before compression, that is, the logical volume of the data (tables and indexes).
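As a rough illustration of how the two columns relate, the ratio between them gives the effective compression ratio for a node. The sketch below is only arithmetic; the function name and the byte counts are hypothetical and are not taken from the screenshot or from any YugabyteDB API.

```python
def compression_ratio(total_sst_bytes: int, uncompressed_sst_bytes: int) -> float:
    """Return how many logical bytes are stored per physical byte on disk."""
    if total_sst_bytes <= 0:
        raise ValueError("total_sst_bytes must be positive")
    return uncompressed_sst_bytes / total_sst_bytes


# Hypothetical figures for one node (assumed values, for illustration only).
total_sst = 40 * 1024**3          # "Total SST Files Size": 40 GiB on disk
uncompressed_sst = 100 * 1024**3  # "Uncompressed SST Files Size": 100 GiB logical

ratio = compression_ratio(total_sst, uncompressed_sst)
print(f"compression ratio: {ratio:.2f}x")  # compression ratio: 2.50x
```

A ratio close to 1 would suggest the data is barely compressible (or compression is not in effect), while a higher ratio means the logical data volume is that many times larger than what sits on disk.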

An example looking at the size of tables here:

@FranckPachot
When a query is submitted to YugabyteDB by the GoCQL driver,
in terms of usage (for query processing), how is RAM different from SST files?

The client driver or language doesn’t matter. Irrespective of the client language (whether it is Go, PHP, Python, or Java), YugabyteDB needs to bring the relevant portions of data into RAM. The data living in SSTable files (both metadata, such as indexes and bloom filters, and the user data or rows) is composed of smaller blocks (say 32K or 64K sized blocks). YugabyteDB has an on-demand block cache and doesn’t require all data to be resident in RAM; it automatically keeps only the hot data in cache. Therefore, YugabyteDB can generally work quite well even if the RAM size is much smaller than the data set size. Even for workloads that do fairly random reads, in a way that the entire data set doesn’t fit well in RAM, YugabyteDB is pretty optimal in terms of the number of disk seeks it ends up doing. See for example: Achieving Sub-ms Latencies on Large Datasets in Public Clouds | Yugabyte.
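To make the idea of an on-demand block cache more concrete, here is a minimal sketch in Python. It assumes a simple least-recently-used (LRU) eviction policy and a fake disk read; the class `BlockCache`, the function `fake_disk_read`, and the file name are hypothetical and do not represent YugabyteDB's actual implementation.

```python
from collections import OrderedDict

BLOCK_SIZE = 32 * 1024  # assumed block size, matching the "say 32K" above


class BlockCache:
    """Toy on-demand LRU cache of SST blocks (illustrative only)."""

    def __init__(self, capacity_blocks, read_block_from_disk):
        self.capacity = capacity_blocks
        self.read_block_from_disk = read_block_from_disk
        self.blocks = OrderedDict()  # (file_name, block_no) -> bytes
        self.hits = 0
        self.misses = 0

    def get(self, file_name, block_no):
        key = (file_name, block_no)
        if key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        # Cache miss: the block is read from the SST file only when needed.
        self.misses += 1
        data = self.read_block_from_disk(file_name, block_no)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data


def fake_disk_read(file_name, block_no):
    # Stand-in for a real disk read; returns an empty block of BLOCK_SIZE bytes.
    return bytes(BLOCK_SIZE)


cache = BlockCache(capacity_blocks=2, read_block_from_disk=fake_disk_read)
for block_no in [0, 1, 0, 0, 2, 0, 1]:
    cache.get("000123.sst", block_no)

print(cache.hits, cache.misses)  # 3 4
```

Block 0 is read often, so it stays in the cache even though the cache holds only two blocks, while the colder blocks 1 and 2 are evicted and re-read from disk. That is the sense in which only the hot data needs to fit in RAM.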

@FranckPachot
Is the SST file in YugabyteDB similar to the WAL file of TSDB (in Prometheus)?

I just had a quick look at TSDB. It doesn’t map exactly, but SST Files are not WAL.

  • In short, in YugabyteDB tables and indexes are split (aka sharded, aka horizontally partitioned) into what we call ‘tablets’. Maybe you can think of them as TSDB blocks when range sharding on time.
  • Those tablets are stored in an LSM Tree (log-structured merge-tree) where all changes are appended as a log of the new version (and, for HA, those logs are replicated through the Raft protocol to multiple tablet peers).
  • The LSM Tree has “levels”; this storage is inherited from RocksDB. The first level is in memory, in a MemTable, for fast writes and fast read access to the latest changes. As it is in memory, it is protected by Write-Ahead Logging (WAL, which other databases call the redo log). The WAL is not really the storage, but a way to get the in-memory storage back in case of a crash.
  • This MemTable has a maximum size, depending on the available RAM in the node, and is flushed to disk at some point. This creates SST Files (Sorted Sequence Table). Their main property is that once written (sequentially, coming from the flush) they are immutable. This makes them optimal for writing to SSD and makes it easy to take snapshots for backup.
  • As SST Files accumulate with new writes, and then new flushes, they grow in number. A background compaction runs on them when some thresholds are reached, reading a range of SST Files and writing them into a new one, which is smaller because there are many intermediate versions that are not needed after a while (versioning retention). There’s also some compression happening during compaction.
    In addition to that:
  • WALs are needed for a short time only (to protect the MemTable), so they are recycled quickly.
  • Tablets actually have two LSM Trees when transactions are involved: one of them is for transaction intents, which go to the regular structure at commit.
  • The RocksDB storage has been enhanced with structures to accelerate point and range reads (DocDB performance enhancements to RocksDB | YugabyteDB Docs).

So, if you create a table with a few rows, you will still see a zero size for SST Files because it is all in the MemTable, and WALs are not counted there. After some more writes, you will see the SST Files increasing, in number and then in size. And at some point, the total size will decrease after compaction. But it is roughly the size of your data on disk. The uncompressed size gives an idea of their size if they were not compressed, so roughly the logical size of your data.
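The lifecycle described above (writes land in the MemTable, flushes create immutable SST Files, compaction merges them) can be sketched with a toy model. This is a simplified illustration under stated assumptions, not how DocDB or RocksDB is implemented: the class `ToyLSM`, its thresholds, and the choice to drop all older versions at compaction (ignoring any retention period) are hypothetical.

```python
class ToyLSM:
    """Toy LSM tree: a WAL, a MemTable, immutable SST 'files', and compaction."""

    def __init__(self, memtable_limit=3, compaction_trigger=3):
        self.memtable_limit = memtable_limit          # assumed flush threshold (keys)
        self.compaction_trigger = compaction_trigger  # assumed threshold (SST files)
        self.wal = []        # append-only log that protects the MemTable
        self.memtable = {}   # latest in-memory value per key
        self.sst_files = []  # immutable sorted tuples of (key, value), newest last

    def put(self, key, value):
        self.wal.append((key, value))  # log first, then apply in memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the MemTable out as one sorted, immutable SST file.
        self.sst_files.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.wal = []  # the WAL is no longer needed once the data is in an SST file
        if len(self.sst_files) >= self.compaction_trigger:
            self.compact()

    def compact(self):
        # Merge all SST files oldest to newest so newer versions overwrite older ones.
        merged = {}
        for sst in self.sst_files:
            merged.update(sst)
        self.sst_files = [tuple(sorted(merged.items()))]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sst_files):  # newest file first
            for k, v in sst:
                if k == key:
                    return v
        return None

    def sst_entries(self):
        return sum(len(sst) for sst in self.sst_files)


db = ToyLSM()
db.put("a", 1)
db.put("b", 1)
history = [(len(db.sst_files), db.sst_entries())]  # nothing flushed yet
for version in range(2, 5):
    for key in ("a", "b", "c"):
        db.put(key, version)
    history.append((len(db.sst_files), db.sst_entries()))

print(history)  # [(0, 0), (1, 3), (2, 6), (1, 3)]
```

Each tuple is (number of SST files, total entries in them). The first stays at zero while everything is still in the MemTable, then the files grow in number and size with each flush, and the total drops once compaction discards the older versions.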