We recently had an issue on our YugabyteDB cluster that looked like an election storm.
We saw far too many tablet leader elections, and yb-master's cluster balancer was also triggering a lot of leader rebalancing, which caused even more elections and made the situation worse.
We were able to stabilise the cluster by identifying a tserver with a large number of threads compared to the other tservers and stopping it. [We don't know what started the election storm.]
As an action item to keep this from happening again, we were wondering whether changing leader_balance_threshold from its default of 0 to a higher value would help.
YugabyteDB version: 2024.2.0.0
We tried this config change on our test environment.
Before: [User Tablet-Peers / Leaders]
- tserver1: 445 / 148
- tserver2: 445 / 148
- tserver3: 445 / 149
Steps:
- updated leader_balance_threshold to 3 on all yb-master VMs
- blacklisted tserver1 as a leader using the change_leader_blacklist command
- this made tserver1: 445 / 0 and increased the leader counts on the other 2 tservers
- removed tserver1 from the leader blacklist
What I expected:
- tserver1: 445 / 147
- tserver2: 445 / 149
- tserver3: 445 / 149
What I got:
- tserver1: 445 / 40
- tserver2: 445 / 202
- tserver3: 445 / 203
I repeated the whole procedure with leader_balance_threshold set to 2 and got the same result.
I repeated it again with leader_balance_threshold set to 1, and the result was:
- tserver1: 445 / 90
- tserver2: 445 / 177
- tserver3: 445 / 178
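
To make the gap between expectation and result concrete, here is a small tally of the leader counts reported above. It only uses numbers from this post; the helper name `summarize` is just for illustration and is not part of any YugabyteDB tooling.

```python
# Leader counts per tserver (tserver1, tserver2, tserver3) as reported in this post.
observations = {
    "before": (148, 148, 149),
    "expected after un-blacklisting": (147, 149, 149),
    "got with threshold=3": (40, 202, 203),
    "got with threshold=1": (90, 177, 178),
}


def summarize(counts):
    """Return the total leader count and the max-min spread across tservers."""
    return {"total": sum(counts), "spread": max(counts) - min(counts)}


for label, counts in observations.items():
    print(label, summarize(counts))
```

The total stays at 445 leaders in every case, so no leaders are lost. The cluster-wide spread between the most and least loaded tserver is 163 with threshold 3 and 88 with threshold 1, far larger than the configured threshold in both cases.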
I am wondering whether the leader balancer works at the table level, balancing the tablets of each table and honouring the threshold per table, rather than at the cluster level. The documentation does not say anything about this.
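
To illustrate what I mean by per-table balancing, here is a toy model. It is only my guess at the behaviour, not YugabyteDB's actual algorithm: I assume each table is balanced independently, and that the balancer stops moving leaders for a table once the gap between its most and least loaded tserver is within the threshold. The table sizes are made up, and the function names are hypothetical.

```python
def rebalance_table(leaders, threshold):
    """Toy per-table balancer (assumption, not the real YugabyteDB logic).

    Moves one leader at a time from the most loaded to the least loaded
    tserver until the per-table gap (max - min) is within the allowed gap.
    """
    leaders = list(leaders)  # copy so the caller's data is not modified
    # Assumption: threshold 0 means "as even as possible", i.e. a gap of at most 1.
    allowed_gap = max(threshold, 1)
    while max(leaders) - min(leaders) > allowed_gap:
        src = leaders.index(max(leaders))
        dst = leaders.index(min(leaders))
        leaders[src] -= 1
        leaders[dst] += 1
    return leaders


def cluster_totals(tables, threshold):
    """Balance every table on its own, then sum the leaders per tserver."""
    balanced = [rebalance_table(t, threshold) for t in tables]
    return [sum(per_tserver) for per_tserver in zip(*balanced)]


# Made-up workload right after tserver1 is removed from the leader blacklist:
# 10 small tables (6 tablets) and 5 larger tables (24 tablets), with
# tserver1 holding no leaders yet.
tables = [[0, 3, 3]] * 10 + [[0, 12, 12]] * 5

for threshold in (0, 3):
    print(threshold, cluster_totals(tables, threshold))
```

In this toy model, threshold 0 ends at 60 / 60 / 60, while threshold 3 ends at 30 / 75 / 75: the small tables are never touched because their per-table gap is already within the threshold, and those leftover gaps add up across tables. It does not reproduce my exact numbers, but the shape matches what I am seeing, which is why I suspect the threshold is applied per table.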