Tablets are not split evenly

Note that the default port for internode RPC communication is 7100 (see: Default ports reference | YugabyteDB Docs).

For the yb-master UI, it’s 7000 (see: Default ports reference | YugabyteDB Docs).

So you need to use the yb-master UI port to get the info that Sandeep requested.
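For example, something like this should return the cluster config page (xx.xx.xx.71 is a placeholder for one of your master hosts, matching the redacted addresses later in this thread):

  # The yb-master web UI listens on 7000; pages such as /cluster-config and /tablet-servers are served here.
  curl http://xx.xx.xx.71:7000/cluster-config
  # Port 7100 is the internode RPC port and will not serve these HTML pages.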

Maybe it’s blocked somehow? But it’s the same port as when you got “tablet servers” here:

Sorry, I got it wrong. PFB the screenshot. Our webserver interface is working on 7020.

Thanks
Subhankar

Please take a screenshot from the browser; what you screenshotted is HTML markup.

Hi Dorian
Maybe it is blocked; I could not get it from the browser. However, from the Yuga UI I could not see any cluster config option, only the xCluster config option.


Is there anything that can be done from here?

How can it be blocked? You already opened it with curl. If it is somehow blocked, just save the curl response to a .html file, open it with Firefox, and use the “Fireshot” plugin to make a full-page screenshot (or upload the .html response here).
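Roughly like this, for example (xx.xx.xx.71 is a placeholder for your master host):

  # Save the master UI page that curl already returned into an HTML file.
  curl -o cluster-config.html http://xx.xx.xx.71:7000/cluster-config
  # Open it in Firefox and take a full-page screenshot (e.g. with Fireshot), or just upload the file here.
  firefox cluster-config.html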

Hi Dorian
I have redirected it to an HTML file, and below is the screenshot of the page.

Thanks
Subhankar

Looking at your latest screenshots, you have 3 availability zones, and the tablets look mostly load balanced.

Note that load balancing works by balancing the number of tablet leaders & peers so they are equal across the servers.
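If you want to watch those counts converge outside the UI, the same Tablet Servers page you screenshotted can be pulled with curl, something like this (placeholder host):

  # The yb-master Tablet Servers page lists the tablet peer and leader counts per tserver.
  curl http://xx.xx.xx.71:7000/tablet-servers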

Do you mean that it took a long time to get to this state?

Or do you mean that some servers are more loaded than others (this could be hot data skew, etc.), like you said below:

Some servers have more tablet peers/leaders. Insertion is skewed, though, but the cluster load has not been balanced for too long.
Also noticed the transactions table showing as Leaderless Tablets under replica info. Is it something to be worried about?

Thanks
Subhankar

Also noticed the transactions table showing as Leaderless Tablets under replica info. Is it something to be worried about?

There are cases where it is misreported. You can check in Tables > system.transactions that all tablets have a leader and followers.
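For example, roughly like this (placeholder host; take the real table id from the /tables listing on your cluster):

  # List all tables and find the id of system.transactions in the output.
  curl http://xx.xx.xx.71:7000/tables
  # The table detail page lists every tablet with its Raft leader and followers.
  curl "http://xx.xx.xx.71:7000/table?id=<table-id-from-previous-step>"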

Thanks @subh14 for sharing the screenshots. I can see that the load balancer is active and attempting some operations, but I can’t infer from the screenshots why it isn’t able to converge. Would it be possible to share the log of the master leader?

If that is not an option, could we hop on a call together to debug further? Without the logs, it would be hard to tell what the load balancer is stuck on.

Note that this is fixed in Issue #20919 · yugabyte/yugabyte-db · GitHub (“[Master] Leaderless tablets endpoint has a bug for RF1 clusters where it reports leaderless tablets even when the node hosting the tablets is up and all tablets are running”) and happens only in RF1 clusters & some specific versions.
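If you want to re-check it outside the UI, the replica info / leaderless tablets view should also be reachable directly on the master UI, roughly like this (placeholder host; the exact path can vary by version, so adjust if it 404s):

  # Dump the page behind the "Tablet Replication" / leaderless tablets view.
  curl http://xx.xx.xx.71:7000/tablet-replication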

Hi Sandeep
Thanks for your reply. I have checked the yb-master log of the leader and found the lines below repeating.

I0416 18:18:25.131608 1406131 cluster_balance.cc:705] tablet server 25901c23d9684e338dc2029f47f5a2bc has a pending delete for tablets [c2d5cfb588144417950446e5ebf4ff6d]
W0416 18:17:39.883428 1406461 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet 84fed376dd814b55a326fb70ff2d93fa not found
W0416 18:17:39.883441 1406461 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet 816ac2ecb46e4969a51f6125c8890f64 not found
W0416 18:17:39.883452 1406461 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet e5abae4c5dc24404b868a5b517b157ac not found
W0416 18:17:39.883477 1406461 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet 9cd34d9cd68c4b72ace16c2e3a187174 not found
W0416 18:17:39.883495 1406461 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet 2887965d7fd14c39865f11524fe70888 not found
W0416 18:17:39.883507 1406461 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet a980ea3d2fc8440386766fdfc6135460 not found
I0416 18:17:40.507510 1406131 cluster_balance.cc:438] Total pending adds=1, total pending removals=0, total pending leader stepdowns=0
I0416 18:17:40.507557 1406131 cluster_balance.cc:705] tablet server 25901c23d9684e338dc2029f47f5a2bc has a pending delete for tablets [c2d5cfb588144417950446e5ebf4ff6d]
I0416 18:17:41.522347 1406131 cluster_balance.cc:438] Total pending adds=1, total pending removals=0, total pending leader stepdowns=0
I0416 18:17:41.522392 1406131 cluster_balance.cc:705] tablet server 25901c23d9684e338dc2029f47f5a2bc has a pending delete for tablets [c2d5cfb588144417950446e5ebf4ff6d]

W0416 18:18:25.595906 1445955 catalog_manager.cc:13051] ProcessTabletReplicaFullCompactionStatus: Not found (yb/master/catalog_manager.cc:3115): Tablet 85cea09d797340029c73fa2c85ea1135 not found
I0416 18:18:26.144498 1406131 cluster_balance.cc:438] Total pending adds=1, total pending removals=0, total pending leader stepdowns=0
I0416 18:18:26.144531 1406131 cluster_balance.cc:705] tablet server 25901c23d9684e338dc2029f47f5a2bc has a pending delete for tablets [c2d5cfb588144417950446e5ebf4ff6d]
I0416 18:18:27.159514 1406131 cluster_balance.cc:438] Total pending adds=1, total pending removals=0, total pending leader stepdowns=0
I0416 18:18:27.159595 1406131 cluster_balance.cc:705] tablet server 25901c23d9684e338dc2029f47f5a2bc has a pending delete for tablets [c2d5cfb588144417950446e5ebf4ff6d]
I0416 18:18:27.171049 1406131 catalog_manager.cc:3135] Got tablet to split: c22cc12c730d47e9a1bb863d3a5aa8be, is manual split: 0
I0416 18:18:27.171327 1406131 tablet_split_manager.cc:442] Scheduled split for tablet_id: c22cc12c730d47e9a1bb863d3a5aa8be.
W0416 18:18:27.172641 1758785 async_rpc_tasks.cc:1482] Get Tablet Split Key RPC for tablet 0x000017d1bab3e000 -> c22cc12c730d47e9a1bb863d3a5aa8be (table _rmtmmidmobile_202404_idx [id=00004300000030008000000000004f17]) (_rmtmmidmobile_202404_idx [id=00004300000030008000000000004f17]) (task=0x000017d1be627820, state=kRunning): TS 112babe077c44d4897c4337dcbe13541: GetSplitKey (attempt 1) failed for tablet c22cc12c730d47e9a1bb863d3a5aa8be with error code TABLET_SPLIT_KEY_RANGE_TOO_SMALL: Illegal state (yb/tablet/tablet.cc:3946): Failed to detect middle key for tablet c22cc12c730d47e9a1bb863d3a5aa8be (key_bounds: "47E149" - ""): got "47E149".: TABLET_SPLIT_KEY_RANGE_TOO_SMALL (tablet server error 31)
I0416 18:18:27.172710 1758785 catalog_manager.cc:3162] Tablet key range is too small to split, disabling splitting temporarily.
I0416 18:18:27.172735 1758785 async_rpc_tasks.cc:387] Get Tablet Split Key RPC for tablet 0x000017d1bab3e000 → c22cc12c730d47e9a1bb863d3a5aa8be (table _rmtmmidmobile_202404_idx [id=00004300000030008000000000004f17]) (_rmtmmidmobile_202404_idx [id=00004300000030008000000000004f17]) (task=0x000017d1be627820, state=kFailed): No reschedule for this task: kFailed

And the load balancer status has been “rebalancing” for the last 8 to 10 days.

Thanks
Subhankar

Hi Dorian
I have a replication factor of 3 inside master.conf. And while checking the system transactions table from the UI as below: although it says it has a leader, it does not show it.

I have a separate mount point configured for WALs inside tserver.conf. It was getting full, so I deleted some of the older WAL files a few days back.
Could something have gone wrong because of deleting the WALs?
Is it safe to delete old WAL files?

Thanks
Subhankar

Additionally, I have now used the modify_placement_info command. Here we have 12 Tservers and 3 Masters, with replication factor 3 in master.conf.
Since the cluster load was not balanced for a long time, I used modify_placement_info as below to check.

bin/yb-admin
-master_addresses xx.xx.xx.71:7100,xx.xx.xx.72:7100,xx.xx.xx.73:7100
modify_placement_info 3\ -pvt-cloud.-chn.-chn-1d:4,-pvt-cloud.-chn.-chn-2d:4,-pvt-cloud.-chn.-chn-3d:4

After 5 minutes of running this command, from the UI I see the cluster load is balanced. However, I see too many Underreplicated tables, as below:

The image shows only one, but there are many. Also, the cluster config now shows as below:

  1. Is this approach correct?
  2. Will it impact insertion or updates in the table?
  3. I have automatic_tablet_splitting enabled; will this command impact it?
  4. Is there a way to get back to the previous state before running this command? I am not sure what the previous state was either.
  5. Also, when I added 3 additional Tservers to make it a total of 12 Tservers, I see the System Tablet-peers/Leaders on the newly added servers are 0/0. Is there a way to balance this? Or is it fine if it stays this way?

Thanks
Subhankar

Hi, for replication factor 3 on 3 zones, it should be :1 for each, not :4.
modify_placement_info defines the minimum number of replicas in each zone.
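So, roughly like this, keeping your own cloud.region.zone names (the zone names below are just the redacted ones from your earlier command, and the argument order, placement info first and the replication factor last, follows the yb-admin docs):

  bin/yb-admin \
    -master_addresses xx.xx.xx.71:7100,xx.xx.xx.72:7100,xx.xx.xx.73:7100 \
    modify_placement_info -pvt-cloud.-chn.-chn-1d:1,-pvt-cloud.-chn.-chn-2d:1,-pvt-cloud.-chn.-chn-3d:1 3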

I have rectified it to :1. After that, I see too many under-replicated tablets and cannot create new tables now.

Any suggestion on how to resolve this?

Thanks
Subhankar