For a simple table with 4 bigint columns, 2 of which form the primary key:
CREATE TABLE test(
    a bigint,
    b bigint,
    c bigint,
    d bigint,
    PRIMARY KEY((a), b)
);
Tried to measure the on-disk space usage using a replication factor 1 setup (created using yb-ctl), running version 2.1.6 on Linux.
./bin/yb-ctl destroy
./bin/yb-ctl start
Then loaded the above table with 1 million rows (1000 distinct values for column “a”, each with 1000 distinct values for “b” and random values for c and d).
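A minimal loader sketch for this step, using the Python Cassandra driver (YCQL speaks the Cassandra protocol); the contact point, port, and keyspace name k are assumptions based on yb-ctl defaults, not details from the original setup:

from cassandra.cluster import Cluster
import random

cluster = Cluster(contact_points=['127.0.0.1'], port=9042)  # yb-ctl YCQL default port
session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS k")          # keyspace name is an assumption
session.set_keyspace('k')
session.execute("""CREATE TABLE IF NOT EXISTS test(
    a bigint, b bigint, c bigint, d bigint,
    PRIMARY KEY((a), b))""")

insert = session.prepare("INSERT INTO test (a, b, c, d) VALUES (?, ?, ?, ?)")
for a in range(1000):        # 1000 distinct partition keys
    for b in range(1000):    # 1000 clustering values under each partition key
        session.execute(insert, (a, b,
                                 random.randint(0, 2**62),
                                 random.randint(0, 2**62)))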
Then looked at space utilized:
% du -sh ./yb-data/tserver
410M ./yb-data/tserver
This seems to suggest that each row takes about 410 bytes (410 MB / 1M rows). Note that the system does not have any other tables.
This seems on the high side. Is this a correct way of measuring the space utilized?
The directory (or directories) specified by --fs_data_dirs contains both the transaction logs (WAL) and the SSTable files.
Quite likely, much of the data is still in the WAL and memtables and has not yet been flushed to SSTable format on disk. Also, the WAL files are not stored in compressed form. To get a better indication, we can either load a lot more data (so that the WAL portion becomes insignificant) or force a manual flush of the table, and then inspect the size of just the data directories.
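Since yb-ctl keeps the WALs and the SSTable data in separate subdirectories, the two sizes can be compared directly. For example (paths here follow the default yb-ctl layout; adjust to your --fs_data_dirs setting):

$ du -hs yb-data/tserver/wals yb-data/tserver/data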
Tried a similar table with 1,000,000 (1M) rows.
After inserting the rows, but before forcing a flush:
$ du -hs yb-data/tserver/data
60K yb-data/tserver/data
This confirms that most of the data is still in WAL and memtables.
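The flush can be forced with yb-admin's flush_table command. A sketch, assuming the default yb-ctl master address and a keyspace named k (both are assumptions, not details from the original setup):

$ ./bin/yb-admin -master_addresses 127.0.0.1:7100 flush_table k test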
After forcing the flush:

$ du -hs yb-data/tserver/data
48M yb-data/tserver/data
So the approximate on-disk size per row for the given schema is about 48 bytes (48 MB / 1M rows). This includes the overheads for metadata (indexes, bloom filters, internal timestamps, etc.).
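For a rough sense of the overhead: the four bigint columns account for 4 × 8 = 32 bytes of raw data per row, so the remaining ~16 bytes per row is the amortized, post-compression cost of the key encoding and the metadata listed above.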