Tablets are not split evenly

Hi
I have a 5-node cluster where tservers are running, automatic tablet splitting is on, and the number of shards per tserver is 1.
I have a partitioned table. The main table's tablet leaders were distributed evenly; however, one child table's tablet leaders were not distributed evenly at first.
After deleting that child table and recreating it, the tablet leaders were distributed evenly.

I could not understand why the uneven distribution happened at first. Please help me to understand.

What do you mean by “uneven”?
The dynamic tablet splitting mechanism should pick a midpoint based on the data, so if you have a lot of data on the lower end of the spectrum, the split point will be lower.
Is the table hash partitioned or range partitioned?
When recreating the table, did you use CREATE TABLE AS or bulk load everything immediately, compared to how the data was loaded before?

@jason Please find the tablets list below:
…/yugabyte-2.20.1.1# bin/yb-admin --master_addresses xx.xx.xx…214:7100,xx.xx.xx…215:7100,xx.xx.xx…216:7100 list_tablets ysql.mydb mytable_202402 include_followers
Tablet-UUID Range Leader-IP Leader-UUID Followers
258f5346bf8041adad65baafb2b6f44f partition_key_start: “” partition_key_end: “\010\010” xx.xx.xx.191:9100 0a964fb757a541979a67eca750b5f3e8 xx.xx.xx.194:9100,xx.xx.xx…192:9100
e0b32395a4f24c98be0b4575d60c7ffc partition_key_start: “\010\010” partition_key_end: “\017\225” xx.xx.xx.192:9100 1d10523685ba4fbb9874ea3369713828 xx.xx.xx.190:9100,xx.xx.xx…191:9100
f95943ab3f50462badb3b0292c71668a partition_key_start: “\017\225” partition_key_end: “\026\213” xx.xx.xx.192:9100 1d10523685ba4fbb9874ea3369713828 xx.xx.xx.194:9100,xx.xx.xx…190:9100
a3031ad01249401baa100b17e93c53bb partition_key_start: “\026\213” partition_key_end: “\034\210” xx.xx.xx.190:9100 4e65a702b3dd4b3d97621f53b4d16adf xx.xx.xx.192:9100,xx.xx.xx…194:9100
252a3b8f6a7c48baa8d95d97888a82b8 partition_key_start: “\034\210” partition_key_end: “"\353” xx.xx.xx.190:9100 4e65a702b3dd4b3d97621f53b4d16adf xx.xx.xx.191:9100,xx.xx.xx…192:9100
acaa3f1cc38d4474a09e1a4d62e07777 partition_key_start: “"\353” partition_key_end: “(\213” xx.xx.xx.194:9100 b089665ff27d47b288ddbdf8102c350e xx.xx.xx.190:9100,xx.xx.xx…192:9100
0644ecab35da468686f747b605c815f0 partition_key_start: “(\213” partition_key_end: “.\t” xx.xx.xx.192:9100 1d10523685ba4fbb9874ea3369713828 xx.xx.xx.190:9100,xx.xx.xx…191:9100
d5f812c8acc74fc38f2bd684083cdb34 partition_key_start: “.\t” partition_key_end: “33” xx.xx.xx.190:9100 4e65a702b3dd4b3d97621f53b4d16adf xx.xx.xx.194:9100,xx.xx.xx…191:9100
d35006f4da2f4b54a7b0638fb170460f partition_key_start: “33” partition_key_end: “:\335” xx.xx.xx.190:9100 4e65a702b3dd4b3d97621f53b4d16adf xx.xx.xx.194:9100,xx.xx.xx…191:9100
68969746437a41e0840ebf44bc2dceea partition_key_start: “:\335” partition_key_end: “A\344” xx.xx.xx.190:9100 4e65a702b3dd4b3d97621f53b4d16adf xx.xx.xx.194:9100,xx.xx.xx…191:9100

The 5 servers' IPs are xx.xx.xx.190, xx.xx.xx.191, xx.xx.xx.192, xx.xx.xx.193, xx.xx.xx.194

It can be seen that, among the 5 tserver nodes, 193 is completely absent from the list, and most tablets are concentrated on node 190. But when we dropped the table and created it once again, the distribution was fine after that. After loading data, the tablet splits were fine and the distribution was also proper.

This table is a child table of a range-partitioned table. It is created with CREATE TABLE <child_table_name> PARTITION OF <main_partitioned_table> FOR VALUES FROM (<dt_start>) TO (<dt_end>);
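For reference, a minimal sketch of this kind of DDL with illustrative column names and date bounds (not our exact schema; only the shape matters):

  -- parent table, range partitioned on the timestamp column
  CREATE TABLE mytable (
      id BIGINT,
      event_time TIMESTAMP NOT NULL,
      payload TEXT,
      PRIMARY KEY (id, event_time)
  ) PARTITION BY RANGE (event_time);

  -- one monthly child partition (bounds are illustrative)
  CREATE TABLE mytable_202402 PARTITION OF mytable
      FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');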

Streaming data is being loaded into those tables through a Kafka consumer.

Our questions here are:

  1. Under what circumstances can this happen?
  2. If this type of skewed distribution resurfaces in the future (production), how do we make sure that it gets rebalanced properly again?

Hi dipanjanghos,

If you use range-partitioned tables, the default number of shards/tablets is 1 (on the 2.20.1 builds that you are using). As data is loaded into the tablet, it can get split. Automatic tablet splitting works at a per-tablet level and splits the tablet into 2; the split point is currently derived from the data that has been flushed to disk, so data that has already been inserted into the table but not yet flushed might not influence the split point. The net result is that it can sometimes produce uneven splits. We are looking at a couple of cases of uneven splits, especially when the input data arrives in ascending or descending order. The chances of uneven splits are low if the input data is shuffled. Also, if you know the split points, you can specify them at table creation time, as shown in the example in Tablet splitting | YugabyteDB Docs.
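For example, a minimal sketch of presplitting a range-sharded table at creation time (the table name, column, and split values here are illustrative, not from your schema):

  -- creates 4 tablets, with boundaries at the three listed timestamps
  CREATE TABLE events (
      event_time TIMESTAMP,
      payload TEXT,
      PRIMARY KEY (event_time ASC)
  ) SPLIT AT VALUES (('2024-02-08'), ('2024-02-15'), ('2024-02-22'));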

Are you concerned about traffic not being spread across the nodes in the above case (when the split was uneven)?

Hi @Raghavendra_Thallam ,
The problems we faced are:
On one occasion, one of the tservers didn't have any leaders.
On another occasion, although other things were correct, one of the tservers didn't have any of the tablets, including indexes (the report is shared in the post above).

Yes, the traffic distribution is also skewed. Due to this uneven split, the throughput goes down drastically. With even distribution (after truncating the tables) we get a steady ~8k inserts per second, even with 40+ crore (400+ million) rows in the table, whereas with the uneven distribution we were struggling to reach 2.2k inserts per second.

What I am concerned about is a way to make it uniform again. In the long run it can happen that one tablet server is down for a long time and the skew crops up with newly created tables. What would be the remedy to bring everything back in line again?

Thanks & regards,
Dipanjan

@dipanjanghos :

Do you have any logs from yb-master from when 1 t-server did not have any leaders or tablets? I'd be interested in seeing if the Load Balancer was trying to balance things out to ensure an even distribution of tablets.

Could you also share the output of http://master-ip:7000/tablet-servers and http://master-ip:7000/cluster-config?
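For example (master-ip is a placeholder for one of your yb-master hosts; both pages are served by the master webserver on port 7000):

  # save both pages so they can be attached here
  curl -s http://master-ip:7000/tablet-servers > tablet-servers.html
  curl -s http://master-ip:7000/cluster-config > cluster-config.html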

No. We truncated the records and moved ahead. But we are still facing the skewed distribution issue. At different times we have observed one tserver with no leaders (the tablet distribution was OK there), and among the 5 tservers one is getting the maximum number of leaders. We have added more nodes to force rebalancing, but it seems to have had no effect. The issue is that the ingestion rate is dropping significantly.

Actually, this will be a bigger installation of YugabyteDB, replacing Hive, and it should support more than 50k TPS.

We will try to explain the case here.

Thanks ,
Dipanjan

Note that you're describing 3 different issues here, and you need to provide details on all of them so we can see what's going wrong.

This isn't skewed distribution; something is wrong with the yb-tserver or cluster config.
Skewed distribution would be if some tablets are bigger than others (we try to balance by the number of leaders/peers per node).

Great!

Please do.

Hi Dorian
We are using YugabyteDB version 2.20.1.3. Our tserver config is as below:
./bin/yb-tserver \
  --tserver_master_addrs xx.xx.xx.71:7100,xx.xx.xx.72:7100,xx.xx.xx.73:7100 \
  --rpc_bind_addresses xx.xx.xx.74:9100 \
  --enable_ysql \
  --pgsql_proxy_bind_address xx.xx.xx.74:5433 \
  --webserver_port 9001 \
  --fs_data_dirs "/data1/ybtdata1,/data2/ybtdata2" \
  --fs_wal_dirs "/data3/waldir" \
  --log_dir "/applog/ybtlog" \
  --placement_cloud=cloud1 \
  --placement_region=cloud1a \
  --placement_zone=region1 \
  --stream_compression_algo=3 \
  --ysql_enable_packed_row=true \
  --enable_automatic_tablet_splitting=true \
  --ysql_num_shards_per_tserver=1 > /applog/ybtserver-4.out 2>&1 &
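For reference, the tservers registered with the masters can be listed with something like the following (a sketch; master addresses as in the flags above):

  # list every tserver registered with the masters
  ./bin/yb-admin --master_addresses xx.xx.xx.71:7100,xx.xx.xx.72:7100,xx.xx.xx.73:7100 \
      list_all_tablet_servers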

Initially we started with 5 Tservers and we observed tablet distribution as below:

Then we added another 4 servers, and after load balancing the distribution was as below:

And when we checked the write speed, it showed as below:

We want to understand:

  1. What are we doing wrong that the tablet leaders did not distribute evenly between nodes in all zones?
  2. Is there anything specific we can set in the tserver config to ensure even tablet leader distribution between nodes in all zones? Like modify_placement_info (see the sketch after this list)?
  3. If yes, is it possible to add that config to an already running cluster?
  4. We see that the write speed is higher on some nodes than others. What can we do to make it more even?
  5. Do we need to follow anything specific while adding nodes to a cluster?
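To illustrate question 2, here is a minimal sketch of the yb-admin placement override (the cloud/region/zone values are taken from our tserver flags above; the replication factor of 3 is an assumption, not a recommendation):

  # set the live placement to the single zone cloud1.cloud1a.region1 with replication factor 3
  ./bin/yb-admin --master_addresses xx.xx.xx.71:7100,xx.xx.xx.72:7100,xx.xx.xx.73:7100 \
      modify_placement_info cloud1.cloud1a.region1 3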
Thank you for your help in advance.

Can you show a screenshot of http://master-ip:7000/tablet-servers + http://master-ip:7000/cluster-config?

Can you explain what your requirements are in the deployment (multi-az/region, az-region failover, row-level geo-replication, etc)?

We should be able to fix an existing cluster, so yes.

After we get the screenshots, we’ll look at the schema/tables/indexes.

Example: maybe you have a global range index on a timestamp column so all new writes will also write to a single tablet (since new writes will have the latest timestamp).

Or maybe you shard a table by client_id, and 1 client is very heavy, resulting in many writes to its tablet leader compared to other tablets.
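To illustrate the timestamp-index case above (table and column names are made up): a range-sorted index on a timestamp funnels every new row into the tablet holding the highest values, whereas a hash-sharded index on the same column spreads those writes out.

  -- range-sorted: new (latest) timestamps all land in the last tablet, a potential hotspot
  CREATE INDEX idx_events_by_time_range ON events (event_time ASC);

  -- hash-sharded: writes for new timestamps are spread across tablets
  CREATE INDEX idx_events_by_time_hash ON events (event_time HASH);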

Depends on the topology. If it’s just a single az, no.


We can also get on a call and look at the same things together. But we’ll still need the info requested above.

Hi @subh14 ,

The following will really help us understand why you are experiencing an imbalance of leaders as well as tablet peers in the cluster:

  1. Screenshot of http://master-ip:7000/tablet-servers : This will give us a few pieces of information, including the Cloud/Region/Zone for each of your T-servers, which is what the load balancer uses to distribute tablet leaders and followers. This information typically comes from your t-server startup arguments. See the example at YB-TServer manual start | YugabyteDB Docs.

  2. Screenshot of http://master-ip:7000/cluster-config : This has any specific placement configuration that you have provided for your tables. You can find an example of this here where a custom placement configuration has been provided: yb-admin - command line tool for advanced YugabyteDB administration | YugabyteDB Docs

  3. Lastly, we wanted to check if the Load Balancer is trying to resolve the situation or not. While this will be evident from the master logs, a quick view of the http://master-ip:7000 page shows 2 pieces of information: “Load Balancer Enabled” and “Is Load Balanced?”, which will be useful for understanding the issue.
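The same check is also possible from the command line; a minimal sketch, assuming your yb-admin build includes get_is_load_balancer_idle (master addresses are placeholders):

  # reports whether the load balancer is idle (i.e. finished moving tablets)
  ./bin/yb-admin --master_addresses xx.xx.xx.71:7100,xx.xx.xx.72:7100,xx.xx.xx.73:7100 \
      get_is_load_balancer_idle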

Hi Sandeep
Yes, I will share the details soon.

Hi Sandeep
PFB the screenshots related to points 1 and 3. For point 2, we do not have any specific placement configuration in the yb-tserver and yb-master conf. Please suggest if I am doing anything wrong.

Thanks
Subhankar

@subh14

You are missing the rest of the page in the screenshot:

Please paste the full page, including the table of yb-tservers.

There are 2 things that I see from the screenshots:

  1. You see the ‘x’ mark for “Is Load Balanced?”. This indicates that the load balancer is still attempting to balance the load on the cluster. It would be a green tick mark if the cluster were fully balanced. You also see a message above the first screenshot saying “Cluster is not Balanced”.

  2. The first screenshot also seems to show that each zone has an almost equal number of tablets and leaders, so it looks like the load is almost evenly distributed across the zones.

Could you share the cluster config (http://master-ip:7000/cluster-config) and the complete list of T-servers from the page?

Hi Sandeep
Sorry for the late reply. PFB the screenshots of the tablet servers. Although the load balancer shows as on, the cluster load has still not been balanced for a long time.



Cluster config details:

Also, I am seeing something like the below in the log.

W0402 12:53:48.218786 1195022 transaction.cc:1913] 00000000-0000-0000-0000-000000000000: Send heartbeat failed: Timed out (yb/rpc/rpc.cc:220): UpdateTransaction: tablet_id: “283749be8dd84a84abc17b33251405ec” state { transaction_id: “[\356\005\016\005\231F\031\247\272\204Z\2233\033\216” status: CREATED host_node_uuid: “5dd6d75260cd4796a7849a7226dcf6a6” } propagated_hybrid_time: 7012607685246177291, retrier: { task_id: -1 state: kRunning deadline: 4746192.209s } passed its deadline 4746192.209s (passed: 5.063s): Illegal state (yb/consensus/consensus.cc:161): Not the leader (tablet server error 15), txn state: kRunning

In addition, I found the below in the replica info. Is this indicating something wrong?

You need to paste a screenshot of port 7000, not 7100.

See previous reply from Sandeep:

@subh14 I think you’re missing this, going over the previous replies.

Hi Dorian
I started the masters on port 7100, instead of 7000, as
--master_addresses=ip:7100,ip2:7100,ip3:7100

When I try curl -v http://master-ip:7000/cluster-config, it shows connection refused as below.

Thanks
Subhankar