Failed to trigger leader election: Illegal state

My cluster cannot recover after a disaster. Two of five tservers failed, and the remaining ones did not have enough free space. One of the remaining servers reached 100% disk usage and stopped working. I deleted the pg_data and yb-data/tserver directories and restarted the tserver.
The cluster appears alive now, and most of the tablets are replicated and load-balanced (the two failed tservers are running again at the moment), but some tablets are stuck at RF 2 or RF 1.

Manual bootstrapping does not help:

# ./yb-ts-cli --server_address=yb1.example.com:9100  remote_bootstrap yb3.example.com:9100 2f4c9f653ba64fae97004697873d8592
Successfully started remote bootstrap for tablet <2f4c9f653ba64fae97004697873d8592> from server <yb3.example.com:9100>
# grep 2f4c9f653ba64fae97004697873d8592 /var/log/yugabyte/yb-tserver.yb1.yugabyte.log.WARNING.20231217-212631.343572 | tail -n1
W1220 06:05:43.376397 348779 raft_consensus.cc:1033] T 2f4c9f653ba64fae97004697873d8592 P 2b15b4b1a857416ebe7117a3567f1a66 [term 61 NON_PARTICIPANT]: Failed to trigger leader election: Illegal state (yb/consensus/raft_consensus.cc:599): Not starting pre-election: Node is currently a non-participant in the raft config: opid_index: 115326 peers { permanent_uuid: "dcc5c3ef7a6a45cdb137f6355159b0cf" member_type: VOTER last_known_private_addr { host: "yb3.example.com" port: 9100 } last_known_broadcast_addr { host: "yb3.example.com" port: 9100 } cloud_info { ... } }

What can I do?

Is the yb-master quorum ok? What is the original RF?

On which server? On one of the remaining 3?

So how many yb-tservers do you have in total now?

The web UI shows all 3 masters and 5 tservers up and running. RF is 3.

On which server? On one of the remaining 3?

yes

Can you upload some logs from all the yb-tservers & yb-masters?

You even have tablets on RF2 that can’t go to RF3?

Yes, RF2 can’t go to RF3.

Some logs? There are too many repeating log lines. Some of them are:
master:

W1220 20:02:59.852226 1425712 catalog_manager.cc:11728] Expected replicas 3 but found 2 for tablet 50627749b1f041dcbadb8e37a6eddfba: tablet_id: "50627749b1f041dcbadb8e37a6eddfba" replicas { ts_info { permanent_uuid: "51fcf5730bac4237b9377ce5ce73f47b" private_rpc_addresses { host: "yb1.example.com" port: 9100 } cloud_info { ... } placement_uuid: "" broadcast_addresses { host: "yb1.example.com" port: 9100 } capabilities: 2189743739 capabilities: 1427296937 capabilities: 2980225056 } role: LEADER member_type: VOTER } replicas { ts_info { permanent_uuid: "2b15b4b1a857416ebe7117a3567f1a66" private_rpc_addresses { host: "yb4.example.com" port: 9100 } cloud_info { ... } placement_uuid: "" broadcast_addresses { host: "yb4.example.com" port: 9100 } capabilities: 2189743739 capabilities: 1427296937 capabilities: 2980225056 } role: FOLLOWER member_type: VOTER } stale: false partition { partition_key_start: "\277\375" partition_key_end: "" } table_id: "00004000000030008000000000004017" table_ids: "00004000000030008000000000004017" split_depth: 0 expected_live_replicas: 3 expected_read_replicas: 0 split_parent_tablet_id: "" [suppressed 3 similar messages]
W1220 20:02:59.869972 563601 cluster_balance.cc:452] Skipping load balancing 0000400000003000800000000000403e: Leader not ready to serve requests (yb/master/cluster_balance_util.cc:191): Master leader has not yet received heartbeat from ts 51fcf5730bac4237b9377ce5ce73f47b, either master just became leader or a network partition.

tserver:

W1220 19:47:03.803236 1565928 tablet_rpc.cc:464] Timed out (yb/rpc/rpc.cc:222): Failed Read(tablet: 50627749b1f041dcbadb8e37a6eddfba, num_ops: 1, num_attempts: 862, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 50627749b1f041dcbadb8e37a6eddfba on tablet server { uuid: 2b15b4b1a857416ebe7117a3567f1a66 private: [host: "yb4.example.com" port: 9100] public: [host: "yb4.example.com" port: 9100] cloud_info: ... after 862 attempt(s): Read(tablet: 50627749b1f041dcbadb8e37a6eddfba, num_ops: 1, num_attempts: 862, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) passed its deadline 12563103.609s (passed: 602.216s): Not found (yb/tserver/ts_tablet_manager.cc:1865): Tablet 50627749b1f041dcbadb8e37a6eddfba not found (tablet server error 6) 
W1220 19:50:12.300014 1594432 leader_election.cc:283] T 287c7e4016d147a89bb9e1e6771595cd P dcc5c3ef7a6a45cdb137f6355159b0cf [CANDIDATE]: Term 53 pre-election: Tablet error from VoteRequest() call to peer 51fcf5730bac4237b9377ce5ce73f47b: Invalid argument (yb/tserver/service_util.h:94): RequestConsensusVote: Wrong destination UUID requested. Local UUID: 2b27dc641c094010944689cdd2afd798. Requested UUID: 51fcf5730bac4237b9377ce5ce73f47b (tablet server error 16)
W1220 19:50:14.354085 1565915 tablet_rpc.cc:464] Timed out (yb/rpc/rpc.cc:222): Failed Read(tablet: 50627749b1f041dcbadb8e37a6eddfba, num_ops: 1, num_attempts: 861, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 50627749b1f041dcbadb8e37a6eddfba on tablet server { uuid: 2b15b4b1a857416ebe7117a3567f1a66 private: [host: "yb4.example.com" port: 9100] public: [host: "yb4.example.com" port: 9100] cloud_info: ... after 861 attempt(s): Read(tablet: 50627749b1f041dcbadb8e37a6eddfba, num_ops: 1, num_attempts: 861, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) passed its deadline 12563293.624s (passed: 602.748s): Not found (yb/tserver/ts_tablet_manager.cc:1865): Tablet 50627749b1f041dcbadb8e37a6eddfba not found (tablet server error 6)
W1220 19:50:14.354159 1565915 yb_rpc.cc:341] Call yb.tserver.PgClientService.Perform 111.222.333.444:44214 => 111.222.333.555:9100 (request call id 115) took 602752ms (client timeout 02000ms).

@WhiteWind please compress some log files from all yb-tservers & yb-masters, and upload them to a file-sharing site.
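For example, something like this could gather them on each node (a sketch only; the /var/log/yugabyte path is assumed from the grep earlier in this thread, so adjust it to your setup):

# Sketch: archive the master and tserver logs on one node before uploading.
# The log directory is an assumption based on the paths shown earlier; adjust as needed.
tar -czf "yb-logs-$(hostname).tar.gz" \
  /var/log/yugabyte/yb-master* \
  /var/log/yugabyte/yb-tserver*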

Also upload full-page screenshots of http://yb-master-ip:7000/cluster-config, http://yb-master-ip:7000/tablet-servers, and http://yb-master-ip:7000/tablet-replication.

Logs and screenshots are here:
https://drive.google.com/drive/folders/1t0j8HaHyctL6hlWBz91Oqaq2w7RCfXI4?usp=sharing

Do you have any ideas?

Hi @WhiteWind

I’ve asked internally, but it’s the Christmas break and everyone is off.

I really need your help

Hi @WhiteWind

I will get you a response today with some steps to follow.

It seems the user had a majority failure and then wiped the data directory on 2 nodes, so they will need to do an unsafe_config_change. I believe we have a KB article to help with this.

Will get access today and share it with you.

Hi, do you have any new info?

Is this KB article publicly available?

Hi @WhiteWind, can you join our community Slack and DM me?
The support solution is too unsafe to put on a public forum, with the risk of people doing more damage. Or email fpachot@yugabyte.com if you can’t join Slack. Thanks

Adding a summary of what we have done with @WhiteWind - the work was done not through the public forum, because the procedure can be risky and unsafe if used in another context.

The cluster was set up as RF3 on 5 nodes with the default config, which means tolerance of a single node failure, but two nodes had their storage lost. The consequence is that some tablets are under-replicated. When there’s only one replica out of three, there’s no quorum to guarantee no data loss.

We used unsafe_config_change to reconfigure all Raft groups (including the yb-master) to what is available, which means that some tablets are now RF1; those may be missing the latest changes. Ideally, a ysql_dump would be imported into a new, clean cluster, but this one is several terabytes.
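For reference, such a dump-and-reload into a fresh cluster would look roughly like the sketch below (hostnames and the database name are placeholders); it was not practical here because of the data volume:

# Sketch: dump from the damaged cluster and load into a new, clean cluster.
# Hostnames and the database name are placeholders.
ysql_dump -h yb1.example.com -d mydb -f mydb.sql
ysqlsh -h new-yb1.example.com -d mydb -f mydb.sql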

Some tablets remained incorrectly reported as under-replicated; this is related to a tablespace that was incorrectly defined (RF2 with two identical placement blocks) and to issue #20657. Each tablespace has its own transaction table to accelerate local transactions, and this is the case for this tablet (transactions_fd383052-48c1-477d-8b2e-d232980cb73c).
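For context, a tablespace placement is declared with a replica_placement JSON. A sketch of the problematic shape described above (two identical placement blocks adding up to RF2; the cloud/region/zone values are placeholders) would be:

# Sketch of the misconfigured tablespace shape described above; placement values are placeholders.
# A correct definition would list distinct zones and a num_replicas matching the cluster RF.
ysqlsh -h yb1.example.com <<'SQL'
CREATE TABLESPACE ts_rf2 WITH (replica_placement='{
  "num_replicas": 2,
  "placement_blocks": [
    {"cloud":"cloud1","region":"region1","zone":"zone1","min_num_replicas":1},
    {"cloud":"cloud1","region":"region1","zone":"zone1","min_num_replicas":1}
  ]}');
SQL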

The cluster looks good with all nodes up, and the yb-master reports “Cluster Load Is Balanced”.

One problem remains, and we are not sure whether it is due to the unsafe procedure or to other reasons (like not enough disk space to rebalance): when stopping one node (yb5) and waiting 15 minutes for new replicas to be created on the remaining nodes, the cluster still shows “Cluster Load Is Balanced” but reports some under-replicated tablets, and not only the tablespace transactions one.

(@WhiteWind, I’ll let you post the screenshot that can go on the public forum.)

@dorian_yugabyte let’s continue on the forum now. As the configuration looks good when all nodes are up, there may be something else (which may have been there before the failure) that is stopping load balancing.

I switched off one server (yb5) to check that the cluster functions properly.
After 3 hours, rebalancing stopped:


But the data was not distributed evenly, and hundreds of tablets remained under-replicated:

Can you check the master log (http://master:7000/logs) to see what happened during rebalancing? For a tablet, I expect to see a sequence like:

cluster_balance.cc:1479] Moving replica
async_rpc_tasks.cc:1142] AddServer ChangeConfig RPC
catalog_manager.cc:7841] Tablet: ...reported consensus state change
catalog_manager.cc:7851] Tablet server ...sent incremental report
cluster_balance.cc:1507] Moving leader of
async_rpc_tasks.cc:859] DeleteTablet RPC for tablet

Are there other errors?
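Something like this can filter the master log for that sequence (a sketch only; the log directory is an assumption based on the paths shown earlier in this thread):

# Sketch: search the yb-master log for the expected rebalancing sequence.
grep -E 'Moving replica|AddServer ChangeConfig|reported consensus state change|sent incremental report|Moving leader of|DeleteTablet RPC' \
  /var/log/yugabyte/yb-master*INFO*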

In the latest versions, there’s also a nice view to see which tables may be under-replicated: http://master:7000/load-distribution

After switching off yb5, there are only lines like these in the master log (--minloglevel=1):

W0117 09:37:26.987874 3522724 async_rpc_tasks.cc:330] DeleteTablet RPC for tablet c5043676f24c4179bff0d1e12666ddb5 (vp_2022_01 [id=0000400100003000800000000000428c]) on TS=10c18e9e818d441dae6ab6e99d4f08f5 (task=0x00005574fdcacc58, state=kRunning): TS 10c18e9e818d441dae6ab6e99d4f08f5: Delete Tablet RPC failed for tablet c5043676f24c4179bff0d1e12666ddb5: Network error (yb/util/net/socket.cc:540): recvmsg error: Connection refused (system error 111)
W0117 09:37:26.987938 3522724 async_rpc_tasks.cc:334] DeleteTablet RPC for tablet c5043676f24c4179bff0d1e12666ddb5 (vp_2022_01 [id=0000400100003000800000000000428c]) on TS=10c18e9e818d441dae6ab6e99d4f08f5 (task=0x00005574fdcacc58, state=kRunning): TS 10c18e9e818d441dae6ab6e99d4f08f5: delete failed for tablet c5043676f24c4179bff0d1e12666ddb5. TS is DEAD. No further retry.
W0117 09:37:26.987968 3522724 catalog_manager.cc:7609] Pending delete for tablet c5043676f24c4179bff0d1e12666ddb5 in ts 10c18e9e818d441dae6ab6e99d4f08f5 doesn't exist
W0117 09:37:28.693120 3522732 catalog_manager.cc:7609] Pending delete for tablet 1425629339c043058dbc27db205158be in ts cea62d4226e64c6193dda4b6da5f8cf6 doesn't exist
W0117 09:37:31.775065 3522740 catalog_manager.cc:7609] Pending delete for tablet 066dc70f4bcd422ca7dac11b7f8615f5 in ts cea62d4226e64c6193dda4b6da5f8cf6 doesn't exist
W0117 09:37:43.122076 3522769 catalog_manager.cc:7609] Pending delete for tablet 5239bf0ea39f4d71a241c21c78621865 in ts cea62d4226e64c6193dda4b6da5f8cf6 doesn't exist
W0117 09:40:19.002000 3523262 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 70d5ff3d94db490aa66a29c1c00d1c63 (vp_2022_11_object_id_param_id_device_time_time_id_value_idx [id=000040010000300080000000000042cd]) on TS=2b27dc641c094010944689cdd2afd798 (task=0x000055750b462018, state=kRunning): TS 2b27dc641c094010944689cdd2afd798: delete failed for tablet 70d5ff3d94db490aa66a29c1c00d1c63 with error code TABLET_NOT_RUNNING: Already present (yb/tserver/ts_tablet_manager.cc:1456): State transition of tablet 70d5ff3d94db490aa66a29c1c00d1c63 already in progress: {70d5ff3d94db490aa66a29c1c00d1c63, remote bootstrapping tablet from peer 2b15b4b1a857416ebe7117a3567f1a66}

When I switched yb5 back on, there were some errors:

W0118 15:19:18.857050 3600239 async_rpc_tasks.cc:1129] RemoveServer ChangeConfig RPC for tablet 9d0290975c0e45c3a05df3d432d4ee80 (vp_2022_02_object_id_param_id_device_time_time_id_value_idx [id=000040000000300080000000000040f1]) on peer 2b27dc641c094010944689cdd2afd798 with cas_config_opid_index 117959677 (task=0x0000557520c06d38, state=kRunning): ChangeConfig() failed on leader 2b27dc641c094010944689cdd2afd798. No further retry: Illegal state (yb/consensus/replica_state.cc:267): Replica 2b27dc641c094010944689cdd2afd798 is not leader of this config. Role: FOLLOWER. Consensus state: current_term: 59 leader_uuid: "" config { opid_index: 117959677 peers { permanent_uuid: "dcc5c3ef7a6a45cdb137f6355159b0cf" member_type: VOTER last_known_private_addr { host: "yb3.example.com" port: 9100 } last_known_broadcast_addr { host: "yb3.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "2b15b4b1a857416ebe7117a3567f1a66" member_type: VOTER last_known_private_addr { host: "yb4.example.com" port: 9100 } last_known_broadcast_addr { host: "yb4.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "2b27dc641c094010944689cdd2afd798" member_type: VOTER last_known_private_addr { host: "yb1.example.com" port: 9100 } last_known_broadcast_addr { host: "yb1.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "10c18e9e818d441dae6ab6e99d4f08f5" member_type: VOTER last_known_private_addr { host: "yb5.example.com" port: 9100 } last_known_broadcast_addr { host: "yb5.example.com" port: 9100 } cloud_info { ... } } }

And before and after, there are many lines like this:

W0118 15:19:31.590631 1425719 catalog_manager.cc:7670] Stale heartbeat for Tablet 0425c764f7b84502816f194b69e82eee (table vp_2021_12 [id=000040000000300080000000000040e0]) on TS 10c18e9e818d441dae6ab6e99d4f08f5 cstate=current_term: 108config { opid_index: 162377924 peers { permanent_uuid: "2b15b4b1a857416ebe7117a3567f1a66" member_type: VOTER last_known_private_addr { host: "yb4.example.com" port: 9100 } last_known_broadcast_addr { host: "yb4.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "2b27dc641c094010944689cdd2afd798" member_type: VOTER last_known_private_addr { host: "yb1.example.com" port: 9100 } last_known_broadcast_addr { host: "yb1.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "cea62d4226e64c6193dda4b6da5f8cf6" member_type: VOTER last_known_private_addr { host: "yb2.example.com" port: 9100 } last_known_broadcast_addr { host: "yb2.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "10c18e9e818d441dae6ab6e99d4f08f5" member_type: PRE_VOTER last_known_private_addr { host: "yb5.example.com" port: 9100 } last_known_broadcast_addr { host: "yb5.example.com" port: 9100 } cloud_info { ... } } }, prev_cstate=current_term: 108 leader_uuid: "2b15b4b1a857416ebe7117a3567f1a66" config { opid_index: 162377925 peers { permanent_uuid: "2b15b4b1a857416ebe7117a3567f1a66" member_type: VOTER last_known_private_addr { host: "yb4.example.com" port: 9100 } last_known_broadcast_addr { host: "yb4.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "2b27dc641c094010944689cdd2afd798" member_type: VOTER last_known_private_addr { host: "yb1.example.com" port: 9100 } last_known_broadcast_addr { host: "yb1.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "cea62d4226e64c6193dda4b6da5f8cf6" member_type: VOTER last_known_private_addr { host: "yb2.example.com" port: 9100 } last_known_broadcast_addr { host: "yb2.example.com" port: 9100 } cloud_info { ... } } peers { permanent_uuid: "10c18e9e818d441dae6ab6e99d4f08f5" member_type: VOTER last_known_private_addr { host: "yb5.example.com" port: 9100 } last_known_broadcast_addr { host: "yb5.example.com" port: 9100 } cloud_info { ... } } }

There are no ‘Moving replica’, ‘AddServer’, ‘reported consensus state change’, ‘incremental report’ or ‘Moving leader’ messages in the master logs.

OK, for the under-replicated tables you can check their replication info (in the master UI → tables → tablet).
Maybe some of them remain in the tablespace that was set to RF2?

If there’s no replication info, the default should be RF3 across all nodes, but it is better to set it explicitly with yb-admin ... modify_placement_info cloud.region.zone 3

Be sure to set cloud/region/zone to the same values as defined in your cluster.

This is to be sure that the replication configuration is correct so the load balancer can do its work.
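For example (a sketch only; replace the master addresses and cloud.region.zone values with this cluster’s actual ones):

# Sketch: set the cluster-wide placement to RF3 across three zones.
# Master addresses and placement values are placeholders.
yb-admin --master_addresses yb1.example.com:7100,yb2.example.com:7100,yb3.example.com:7100 \
  modify_placement_info cloud1.region1.zone1,cloud1.region1.zone2,cloud1.region1.zone3 3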

No, those tables have RF 3, and when I switched yb5 back on, their RF was restored.
I don’t want to waste a whole day testing it again; will yb-ts-cli delete_tablet help me reproduce this issue?

And now I’m thinking of migrating this cluster to more resilient hardware. What is the best method to move the data while leaving behind this erroneous replication state (besides ysql_dump)?