How to check the per-row on-disk space usage in YugabyteDB?

For a simple table with 4 bigint columns, 2 of which form the primary key:

 CREATE TABLE test(
           a bigint,
           b bigint,
           c bigint,
           d bigint,
      PRIMARY KEY((a), b)
 );

Tried to measure the on-disk space usage on a replication factor 1
setup (created using yb-ctl), running version 2.1.6 on Linux.

./bin/yb-ctl destroy
./bin/yb-ctl start

Then loaded the above table with 1 million rows (1000 distinct values for column “a”, each with 1000 distinct values for “b” and random values for c and d).
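For reference, a load like the one described can be done with a single INSERT ... SELECT over generate_series (a sketch, not the exact statement used; the random-value range for c and d is an assumption):

```sql
-- 1000 distinct values of "a", each paired with 1000 distinct values
-- of "b"; c and d get random bigint values.
INSERT INTO test
SELECT a, b,
       (random() * 1e9)::bigint,  -- random value for c
       (random() * 1e9)::bigint   -- random value for d
FROM generate_series(1, 1000) AS a,
     generate_series(1, 1000) AS b;
```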

Then looked at space utilized:

% du -sh ./yb-data/tserver
410M    ./yb-data/tserver

This seems to suggest that each row takes about 410 bytes. Note that the system does not have any other tables.
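The per-row figure is just the total divided by the row count (treating 410M as 410 * 10^6 bytes for a round number; du's "M" is technically mebibytes, which would give a slightly higher figure):

```shell
# Back-of-the-envelope per-row estimate from the du output above.
total_bytes=410000000   # 410M reported by du, taken as 410 * 10^6
rows=1000000
echo $(( total_bytes / rows ))   # bytes per row
```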

This seems on the high side. Is this a correct way of measuring the space utilized?

The directories specified by the --fs_data_dirs flag contain both the write-ahead logs (WAL) and the SSTable files.

Quite likely much of the data is still in WAL and memtables and not yet flushed to SSTable format on disk. And the WAL files are not in compressed format. To get a better indication, we can either load a lot more data (such that the WAL portion is insignificant) or force a manual flush of the table, and then inspect the size of just the data directories.

Tried a similar table with 1,000,000 (1M) rows.

After inserting the rows, but before forcing a flush:

$ du -hs yb-data/tserver/data
60K     yb-data/tserver/data

This confirms that most of the data is still in WAL and memtables.

After forcing a flush, using:

./bin/yb-admin --master_addresses 127.0.0.1 flush_table <keyspace> <table>

the data directory size is as follows:

$ du -hs yb-data/tserver/data
48M     yb-data/tserver/data

So the approximate on-disk size per row for the given schema is about 48 bytes. This includes the overheads for metadata (indexes, bloom filters, internal timestamps, etc.).
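To put the 48-byte figure in perspective: the raw column payload of four bigints is 32 bytes per row, so (taking 48M as 48 * 10^6 bytes for simplicity, and ignoring encoding details) roughly 16 bytes per row goes to the overheads mentioned above:

```shell
total_bytes=48000000      # 48M from du, taken as 48 * 10^6
rows=1000000
raw_per_row=$(( 4 * 8 ))  # four bigint columns, 8 bytes each
per_row=$(( total_bytes / rows ))
echo "per-row on disk: ${per_row} bytes"
echo "raw payload:     ${raw_per_row} bytes"
echo "overhead:        $(( per_row - raw_per_row )) bytes"
```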