Tserver: unbalanced disk utilization

Hi,

we are currently running some tests with YugabyteDB 2.20.7.
The cluster consists of 3 masters and 3 tablet servers.
It is deployed with Helm on Kubernetes, with the automatic tablet splitting feature enabled.
Each tablet server has two 74 GB disks.
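
For reference, a deployment along these lines can be expressed with the standard yugabyte Helm chart roughly as follows; the release name, namespace and exact flag placement are illustrative, not our literal command:

# Sketch: 3 masters, 3 tservers, two 74Gi data volumes per tserver,
# and automatic tablet splitting enabled via a master gflag.
helm install yb-demo yugabytedb/yugabyte \
  --namespace yb-demo \
  --set replicas.master=3 \
  --set replicas.tserver=3 \
  --set storage.tserver.count=2 \
  --set storage.tserver.size=74Gi \
  --set gflags.master.enable_automatic_tablet_splitting=true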

We are using the YCQL API and created a test table like this:

CREATE TABLE IF NOT EXISTS ks.test(id text, secondary_id uuid, data1 text, data2 blob, ts timestamp, PRIMARY KEY (id, secondary_id));

We then loaded some test data (11,883,854 rows).
As expected, more tablets were created over time.
At the end of the test, the table had 10 tablets.
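
For background, the number of tablets produced by automatic splitting is governed by the master's tablet_split_* gflags; a rough way to check their effective values (the master hostname and port are illustrative for a Kubernetes setup):

# List the effective auto-splitting flags via the master web UI endpoint.
curl -s "http://yb-master-0.yb-masters:7000/varz" \
  | grep -iE "enable_automatic_tablet_splitting|tablet_split_(low|high)_phase"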

We noticed that the tablets were not evenly distributed across the available disks.
On each tserver, more space is used on disk-0 than on disk-1; see the numbers below and the measurement sketch after them:

tserver-1:

  • disk-0: 58.6 GB
  • disk-1: 30.7 GB

tserver-2:

  • disk-0: 59.2 GB
  • disk-1: 42.2 GB

tserver-3:

  • disk-0: 55.3 GB
  • disk-1: 28.5 GB
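
These numbers can be reproduced with something like the following inside each tserver pod (the /mnt/diskN mount points are the Helm chart defaults; namespace, pod and container names are illustrative):

# Per-disk usage of the YugabyteDB data directories on one tserver.
kubectl exec -n yb-demo yb-tserver-0 -c yb-tserver -- \
  sh -c 'du -sh /mnt/disk0/yb-data /mnt/disk1/yb-data'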

We had a look at the tablets of the table:

Tablet-UUID                       Range                                                          Leader-IP           Leader-UUID
17a7c4c8bc884ab78a7067d52fa96b87  partition_key_start: "" partition_key_end: "\031U"             10.240.10.50:9100   66789c933308415d96fbdffe094a6e68
6e56377eb86e4c2388788a6a4f482bf1  partition_key_start: "\031U" partition_key_end: "4\201"        10.240.2.166:9100   391a9a3cf1904782853499d4677258f9
1b5b6e2289eb4bdd86819bce3ec0a22e  partition_key_start: "4\201" partition_key_end: "L^"           10.240.10.50:9100   66789c933308415d96fbdffe094a6e68
6ffcfc10fc1f4c2aadce41d58db9538d  partition_key_start: "L^" partition_key_end: "UU"              10.240.13.226:9100  625e32c70d6149ff8ade08aded841e17
cf3e0593abdc46cd925c9da5217c5f93  partition_key_start: "UU" partition_key_end: "p\010"           10.240.2.166:9100   391a9a3cf1904782853499d4677258f9
465c68ff2ab34554934cfb97219bdfdf  partition_key_start: "p\010" partition_key_end: "\212\274"     10.240.2.166:9100   391a9a3cf1904782853499d4677258f9
e9b064458236489096b449c163114d61  partition_key_start: "\212\274" partition_key_end: "\252\252"  10.240.13.226:9100  625e32c70d6149ff8ade08aded841e17
2abb97341c134bf8a2c95e38e2e71fae  partition_key_start: "\252\252" partition_key_end: "\304["     10.240.10.50:9100   66789c933308415d96fbdffe094a6e68
7d02316c432b4b11b8d82f0a02e06f29  partition_key_start: "\304[" partition_key_end: "\337\225"     10.240.2.166:9100   391a9a3cf1904782853499d4677258f9
6de2baf421664b48820576af307ab9fc  partition_key_start: "\337\225" partition_key_end: ""          10.240.13.226:9100  625e32c70d6149ff8ade08aded841e17
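
A listing like the one above can be produced with yb-admin; the master addresses are placeholders, 0 asks for all tablets, and depending on the version the keyspace may need to be given as ks or ycql.ks:

# Show every tablet of the test table with its partition range and leader.
yb-admin -master_addresses yb-master-0:7100,yb-master-1:7100,yb-master-2:7100 \
  list_tablets ycql.ks test 0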

And looked at how the tablets are distributed across the disks on tserver-1 (a directory-scan sketch follows the lists):

disk-0:

  • tablet-17a7c4c8bc884ab78a7067d52fa96b87
  • tablet-6e56377eb86e4c2388788a6a4f482bf1
  • tablet-1b5b6e2289eb4bdd86819bce3ec0a22e
  • tablet-6ffcfc10fc1f4c2aadce41d58db9538d
  • tablet-cf3e0593abdc46cd925c9da5217c5f93
  • tablet-465c68ff2ab34554934cfb97219bdfdf
  • tablet-e9b064458236489096b449c163114d61

disk-1:

  • tablet-2abb97341c134bf8a2c95e38e2e71fae
  • tablet-7d02316c432b4b11b8d82f0a02e06f29
  • tablet-6de2baf421664b48820576af307ab9fc
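
The mapping above can be checked with a directory scan such as this (the yb-data layout shown is the usual one; namespace, pod name and table-ID glob are illustrative):

# Each tablet lives in a tablet-<uuid> directory on exactly one disk,
# so listing both mounts shows the distribution.
kubectl exec -n yb-demo yb-tserver-0 -c yb-tserver -- \
  sh -c 'ls -d /mnt/disk*/yb-data/tserver/data/rocksdb/table-*/tablet-*'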

Is there a way to distribute the tablets evenly across the available tserver disks?

We also noticed that the tservers had different numbers of leaders. We are not sure if this is related to the unbalanced disk usage, as the cluster status is shown as balanced.

The web UI shows this info for the tablets of the test table:

Hi,

The table would have started with one tablet to begin with, and as data was ingested, it would have gotten split. Tablet splits are implemented using hard links, and as a result the child tablets continue to stay on the same drive as the original tablet. The load balancer component only balances tablets across the tservers, but not necessarily across the disks within a tserver. So your observation is expected.
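
If you want to observe the hard-link sharing directly, SST files that are still shared between a parent tablet and its children have a link count greater than one; a rough check, with paths and pod names as in the sketches above, purely illustrative:

# SST files with more than one hard link are still shared between a parent
# tablet and its post-split children.
kubectl exec -n yb-demo yb-tserver-0 -c yb-tserver -- \
  sh -c 'find /mnt/disk*/yb-data/tserver/data/rocksdb -name "*.sst*" -links +1'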

Is this resulting in skewed performance characteristics for your workload?

Hi @MichiganL

Can you trigger yb-admin (see "yb-admin - command line tool for advanced YugabyteDB administration" in the YugabyteDB Docs) on the table?
What are the results after it finishes?
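
Assuming the linked command was a manual full compaction of the table, it would be invoked roughly like this (master addresses are placeholders):

# Trigger a full compaction of the YCQL table ks.test.
yb-admin -master_addresses yb-master-0:7100,yb-master-1:7100,yb-master-2:7100 \
  compact_table ycql.ks test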

@Raghavendra_Thallam,

Although right after a split the child tablets initially point to their respective portions of the parent tablet’s data, doesn’t a post-split compaction get scheduled to decouple the child tablets from the parent? At that point, the tablets should be free to be load-balanced.

I suppose here, given that this is an RF=3 cluster and there are only 3 tablet servers, there isn’t much for the load balancer to do in terms of inter-node load balancing.

But yes, we don’t initiate intra-node (inter-disk) load balancing for this case.
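
One way to confirm that the cross-tserver balancer itself has nothing left to move (separate from the per-disk placement discussed here) is the idle check; master addresses are placeholders:

# Returns whether the cluster load balancer currently has no pending moves.
yb-admin -master_addresses yb-master-0:7100,yb-master-1:7100,yb-master-2:7100 \
  get_is_load_balancer_idle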

@MichiganL – is it a hard requirement for you to have multiple drives per node? Or can you just have 1 larger drive per node?

Hi,

thank you for the information and suggestions.

@Raghavendra_Thallam
We just started with a first test run and ran out of disk space on disk-0 while disk-1 still had quite a bit of space available, so I can’t say much about it at the moment.

@dorian_yugabyte
Unfortunately, compaction did not rebalance the tablets within the tserver. The tablet directories are still located on the same disks.

@kannan
We did some other load tests before this one and got better results when using two disks (or more) instead of one. For those load tests we did not use automatic tablet splitting yet and created the table with a larger number of tablets (24-48) from the outset. The reason we are hesitant to use only one disk is that we already have YugabyteDB clusters in production where each tserver has 2 disks. The load tests we are currently performing are for a new table which our app will need in the future. If possible, we would like to avoid migrating live data. So far we have kept tablet splitting disabled, but were considering enabling it.

Would it help to create the table with more tablets to begin with? We expect a high volume of data in production, but wanted to rely on automatic tablet splitting to create the optimal number of tablets for us. We didn’t want to create too many tablets at first, as we wouldn’t be able to merge them again. With an initial minimum number of tablets (e.g. 12), could automatic tablet splitting still lead to unbalanced disk utilization?
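
For the "more tablets up front" option, pre-splitting is just a YCQL table property; a sketch of what this would look like, with 24 as an example value only:

# Create the table pre-split into 24 tablets instead of relying on
# automatic splitting.
ycqlsh -e 'CREATE TABLE IF NOT EXISTS ks.test (
  id text, secondary_id uuid, data1 text, data2 blob, ts timestamp,
  PRIMARY KEY (id, secondary_id)
) WITH tablets = 24;'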

At the moment we are thinking of keeping automatic tablet splitting disabled and configuring a higher number of tablets when creating the table. If needed, we would split the tablets manually.
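
And for the manual-split fallback, a sketch (the tablet ID is taken from the listing above; master addresses are placeholders):

# Flush the table so the split point is based on current data, then split
# one of its tablets in two.
yb-admin -master_addresses yb-master-0:7100,yb-master-1:7100,yb-master-2:7100 \
  flush_table ycql.ks test
yb-admin -master_addresses yb-master-0:7100,yb-master-1:7100,yb-master-2:7100 \
  split_tablet 17a7c4c8bc884ab78a7067d52fa96b87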

Are there plans to support inter-disk balancing in the future?

Yes, it should. But will the new data be spread across these tablets equally?

It depends on how the data is spread among them.

It hasn’t been a priority yet. We only consider multiple drives when moving tablets as part of the load balancer or at table creation time.