Hi, I’ve encountered an issue while deploying YugabyteDB in a Kubernetes environment. For the record, I’m not using the Helm charts / operator; this is a hand-rolled installation for my environment, but the issue would be the same on VMs.
So, one or more of the tserver data directories was wiped (a PVC failed), and the VM / Pod was recreated with the same DNS name/IP.
Now the master tries to reach the tserver by its old UUID, while the tserver no longer has the old data. In the UI the tserver is marked ALIVE / ONLINE, but some tablets fail to elect a leader:
W1118 09:08:35.778658 874275 leader_election.cc:281] T 614a400f26994c31ae465df0ecc2b309 P bcaf0fe5bc6d4f809f3a6ac0d79a6234 [CANDIDATE]: Term 14 pre-election: Tablet error from VoteRequest() call to peer 1b5ff68dc983423786a4ca88cf263346: Invalid argument (yb/tserver/service_util.h:96): RequestConsensusVote: Wrong destination UUID requested. Local UUID: d54d09ae8e3b47ccb7800101fac907c1. Requested UUID: 1b5ff68dc983423786a4ca88cf263346 (tablet server error 16)
As I understand it, the master stores tserver state somewhere. After the rapid data wipe, the tserver was never registered as dead, so the master still tries to request the old data from the now-empty tserver instead of treating it as a fresh node and rebalancing/replicating data onto it.
In this case, 1b5ff68dc983423786a4ca88cf263346 is the UUID of the server that was wiped, and d54d09ae8e3b47ccb7800101fac907c1 is the current UUID of that tserver.
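To tell how widespread the problem is, I scan the tserver logs for these warnings and pull out the mismatched UUID pairs. This is a minimal sketch, assuming the log format matches the single line quoted above; the regex and helper name are my own, not anything from YugabyteDB:

```python
import re

# Assumed log format, based on the warning line quoted in this post:
#   ... Wrong destination UUID requested. Local UUID: <hex>. Requested UUID: <hex> ...
WRONG_UUID_RE = re.compile(
    r"Wrong destination UUID requested\. "
    r"Local UUID: (?P<local>[0-9a-f]+)\. "
    r"Requested UUID: (?P<requested>[0-9a-f]+)"
)

def find_uuid_mismatches(log_lines):
    """Yield (local_uuid, requested_uuid) for each wrong-destination warning."""
    for line in log_lines:
        m = WRONG_UUID_RE.search(line)
        if m:
            # local = UUID the tserver has now; requested = stale UUID the peer expects
            yield m.group("local"), m.group("requested")

# Example against the warning from this post (truncated prefix):
sample = ("W1118 09:08:35.778658 874275 leader_election.cc:281] ... "
          "Wrong destination UUID requested. "
          "Local UUID: d54d09ae8e3b47ccb7800101fac907c1. "
          "Requested UUID: 1b5ff68dc983423786a4ca88cf263346 "
          "(tablet server error 16)")
for local, requested in find_uuid_mismatches([sample]):
    print(f"stale peer {requested} is now {local}")
```

Each distinct (requested, local) pair corresponds to one wiped-and-recreated tserver identity.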
Note: I’m evaluating how to handle failure states in YugabyteDB and am deliberately breaking it in various ways.