For a simple table with 4 bigint columns, 2 of which form the primary key:
CREATE TABLE test(
    a bigint,
    b bigint,
    c bigint,
    d bigint,
    PRIMARY KEY((a), b)
);
Tried to measure the on-disk space usage using a replication factor 1 setup (created using yb-ctl), running version 2.1.6 on Linux.
./bin/yb-ctl destroy
./bin/yb-ctl start
Then loaded the above table with 1 million rows (1000 distinct values for column “a”, each with 1000 distinct values for “b” and random values for c and d).
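A minimal loader sketch for this step, using the Python Cassandra driver (YCQL speaks the Cassandra protocol); the contact point, port, and keyspace name k are assumptions based on yb-ctl defaults, not details from the original setup:

from cassandra.cluster import Cluster
import random

cluster = Cluster(contact_points=['127.0.0.1'], port=9042)  # yb-ctl YCQL default port
session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS k")          # keyspace name is an assumption
session.set_keyspace('k')
session.execute("""CREATE TABLE IF NOT EXISTS test(
    a bigint, b bigint, c bigint, d bigint,
    PRIMARY KEY((a), b))""")

insert = session.prepare("INSERT INTO test (a, b, c, d) VALUES (?, ?, ?, ?)")
for a in range(1000):        # 1000 distinct partition keys
    for b in range(1000):    # 1000 clustering values under each partition key
        session.execute(insert, (a, b,
                                 random.randint(0, 2**62),
                                 random.randint(0, 2**62)))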
Then looked at space utilized:
% du -sh ./yb-data/tserver
410M ./yb-data/tserver
This seems to suggest that each row takes about 410 bytes (410 MB / 1M rows). Note that the system does not have any other tables.
This seems on the high side. Is this a correct way of measuring the space utilized?
The directory (or directories) specified by --fs_data_dirs contains both the transaction logs (WAL) and the SSTable files.
Quite likely, much of the data is still in the WAL and memtables and has not yet been flushed to SSTable format on disk. Also, the WAL files are not stored in compressed form. To get a better indication, we can either load a lot more data (so that the WAL portion becomes insignificant) or force a manual flush of the table, and then inspect the size of just the data directories.
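Since yb-ctl keeps the WALs and the SSTable data in separate subdirectories, the two sizes can be compared directly. For example (paths here follow the default yb-ctl layout; adjust to your --fs_data_dirs setting):

$ du -hs yb-data/tserver/wals yb-data/tserver/data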
Tried a similar table with 1,000,000 (1M) rows.
After inserting the rows, but before forcing a flush:
$ du -hs yb-data/tserver/data
60K yb-data/tserver/data
This confirms that most of the data is still in WAL and memtables.
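The flush can be forced with yb-admin's flush_table command. A sketch, assuming the default yb-ctl master address and a keyspace named k (both are assumptions, not details from the original setup):

$ ./bin/yb-admin -master_addresses 127.0.0.1:7100 flush_table k test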
After forcing the flush:

$ du -hs yb-data/tserver/data
48M yb-data/tserver/data
So the approximate on-disk size per row for the given schema is about 48 bytes (48 MB / 1M rows). This includes the overheads for metadata (indexes, bloom filters, internal timestamps, etc.).
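For a rough sense of the overhead: the four bigint columns account for 4 × 8 = 32 bytes of raw data per row, so the remaining ~16 bytes per row is the amortized, post-compression cost of the key encoding and the metadata listed above.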