Clean old tserver information from masters

Hi, I’ve encountered an issue while deploying YugabyteDB in a Kubernetes environment. For the record, I’m not using the Helm charts or the operator; this is a hand-rolled installation for my environment, but the same issue would apply to VMs.

So, the data directory of one or more tservers was wiped (the PVC failed) and the VM/pod was recreated with the same DNS name/IP.
Now the master tries to reach the tserver using its old UUID, while the tserver no longer has any of the old data. In the UI the tserver is marked as ALIVE/ONLINE, but some tablets fail to elect a leader.

W1118 09:08:35.778658 874275 leader_election.cc:281] T 614a400f26994c31ae465df0ecc2b309 P bcaf0fe5bc6d4f809f3a6ac0d79a6234 [CANDIDATE]: Term 14 pre-election: Tablet error from VoteRequest() call to peer 1b5ff68dc983423786a4ca88cf263346: Invalid argument (yb/tserver/service_util.h:96): RequestConsensusVote: Wrong destination UUID requested. Local UUID: d54d09ae8e3b47ccb7800101fac907c1. Requested UUID: 1b5ff68dc983423786a4ca88cf263346 (tablet server error 16)

As I understand it, the master stores tserver state somewhere. Because the data wipe and restart happened quickly, the tserver was never registered as dead; the master still tries to request the old data from the now-empty tserver instead of treating it as a fresh node and rebalancing/re-replicating data onto it.

In this case, 1b5ff68dc983423786a4ca88cf263346 was the UUID of the server that was wiped, and d54d09ae8e3b47ccb7800101fac907c1 is the current UUID of that tserver.
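For reference, this is roughly how I compared what the masters think they know against what is actually running; the master address and port are placeholders for my environment, and the exact output columns may differ between versions:

```
# List tablet servers as registered in the masters: the UUID, host:port,
# and status (ALIVE/DEAD) columns show whether the stale UUID is still there.
yb-admin -master_addresses master-0:7100,master-1:7100,master-2:7100 \
  list_all_tablet_servers
```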

Note: I’m evaluating how to manage failure states in YugabyteDB and deliberately breaking it in various ways.

It looks like the instance_uuid_override flag (All YB-TServer flags | YugabyteDB Docs) is one option for resolving this situation, but it would also be great to be able to remove the old UUID from the master’s data, for cases where the old tserver UUID can’t be recovered.
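A minimal sketch of what I mean, assuming the old UUID is still known: start the replacement tserver on a clean data directory and force it to register under the UUID the masters already expect. Paths, addresses, and other flags below are placeholders for my setup:

```
# Hypothetical invocation: fresh data dir, same DNS/IP, but reuse the old
# instance UUID so the masters keep talking to "the same" tserver.
yb-tserver \
  --fs_data_dirs=/mnt/disk0 \
  --tserver_master_addrs=master-0:7100,master-1:7100,master-2:7100 \
  --instance_uuid_override=1b5ff68dc983423786a4ca88cf263346
```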

And in general, when does an FQDN/IP address become safe to re-use after a master/tserver replacement?

Hi @hispebarzu

Can you follow the guide in Change cluster configuration | YugabyteDB Docs to remove the old server (with cleanup) and then add it again with the same IP?

Kill it, do the cleanup, wipe its data, then start it again and add it back.
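Roughly, the tserver part of that guide looks like the following sketch; master addresses and the tserver host:port are placeholders, and you should follow the linked guide for the exact steps in your version:

```
MASTERS=master-0:7100,master-1:7100,master-2:7100

# Blacklist the broken tserver so its tablet replicas are moved off of it.
yb-admin -master_addresses $MASTERS change_blacklist ADD old-tserver:9100

# Poll until the data move reports 100% complete.
yb-admin -master_addresses $MASTERS get_load_move_completion

# Stop the tserver process, wipe its data directories, then remove it from
# the blacklist and start it again so it re-registers (with a fresh UUID).
yb-admin -master_addresses $MASTERS change_blacklist REMOVE old-tserver:9100
```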

It looks like I had broken something else in this instance (probably deleted more than one tserver), as I wasn’t able to reproduce this issue on another cluster. Sorry for taking your time. I’ll update the thread if I encounter this issue again.