Yugabyted master status not clearly reported

Here is an interesting situation: a 3-node cluster via yugabyted.

3 Nodes, 3 T-Servers, 3 T-Masters

All 3 are working correctly.

2 Nodes, 2 T-Servers, 2 T-Masters

We took down node3 for a day. The UI sees that node3 is down, and everything keeps working. Great.

3 Nodes, 3 T-Servers, 2 T-Masters

We start node3 again with the same CLI command. We see the T-Server is online, and the UI reports all 3 nodes as active on the main page. Our tables are synced and all active. Great …

But the 3rd master did not come online and shows up as red in the nodes listing.

We check the :7000 port on node3: nothing. We go to a live node's :7000 and check.

xxx.xxx.xxx.xxx:7100 UNKNOWN_ROLE ERROR: Timed out (yb/rpc/outbound_call.cc:647): Unable to get registration information for peer ([xxx.xxx.xxx.xxx:7100]) id (8287ba4d41ad4d2c96729b1f1c3dcb6a): GetMasterRegistration RPC (request call id 1265511) to xxx.xxx.xxx.xxx:7100 timed out after 1.500s

Yep … she is down, Captain.

3 Nodes, 3 T-Servers, 2 T-Masters

We waited a night, assuming there might be some timeout / retry mechanism every 1500s. We checked the next day. Still down.

3 Nodes, 3 T-Servers, 2 T-Masters

We take down node3 again, and start it up before the UI detects that node3 is down. Same as before: the T-Server is online, the T-Master is down.

3 Nodes, 3 T-Servers, 3 T-Masters

We take down node3 again, and wait until the UI shows that we are down to 2 nodes. Start node3 again … drumroll … both the T-Server and T-Master are showing in the node listing.

It's a bit odd. We also checked the firewall; port 7100 was open between the nodes, as before.
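For anyone verifying the same thing, a minimal connectivity sanity check from one of the healthy nodes might look like this (the `db3` hostname is a placeholder for the failed node's address):

```shell
# Verify the failed node's master ports are reachable from a peer:
nc -zv db3 7100    # yb-master RPC port (inter-node traffic)
nc -zv db3 7000    # yb-master web UI port
# A 200 here only tells you the web server answers, not that the master is healthy:
curl -s -o /dev/null -w '%{http_code}\n' http://db3:7000/
```

Note that an open port proves reachability, not process health: in this incident 7100 was open but the master never joined the Raft group.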

An improvement to the interface:

Please also add the T-Master status to the main UI. It is now very easy to be down to 2 T-Masters without realizing that your system is in a critical state (one more failure and your system goes down), unless you check the nodes status page.

So if you did maintenance assuming you had 3 T-Masters up, and restarted the server/VPS of one of the 2 live T-Masters, you just took down your entire DB. Oops …
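The reason 2/3 masters is a critical state is Raft quorum arithmetic: a group of N masters needs a majority, floor(N/2) + 1, alive, so with 3 masters you can tolerate exactly one failure. A small sketch of that check:

```shell
# Majority needed for a Raft group of size n: floor(n/2) + 1
quorum_needed() { echo $(( $1 / 2 + 1 )); }

# Can the cluster lose one more master and still keep quorum?
can_lose_one_more() {
  local total=$1 alive=$2
  if [ $(( alive - 1 )) -ge "$(quorum_needed "$total")" ]; then
    echo yes
  else
    echo no
  fi
}

quorum_needed 3          # -> 2
can_lose_one_more 3 3    # -> yes (3 alive, can drop to 2)
can_lose_one_more 3 2    # -> no  (already at minimum quorum)
```

This is exactly the scenario described above: at 2/3 masters the cluster still works, but the next failure drops it below majority.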

Also, if possible, expose more of the 7000 / 9000 UIs directly via the yugabyted interface. It's a bit clumsy to have 3 different interfaces going on.

Are you using yugabyted or just starting them individually?

Which UI are you referring to, the yb-master or the yugabyted UI?
The yb-master UI home page shows the state of all masters. You can also use the list_all_masters command.
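For example (a sketch; substitute your own master addresses and install path):

```shell
# Lists each master's UUID, RPC address, state, and Raft role (LEADER/FOLLOWER):
./bin/yb-admin \
  --master_addresses node1:7100,node2:7100,node3:7100 \
  list_all_masters
```

A master stuck in the state described in this thread would show up here with an error or an unknown role instead of LEADER/FOLLOWER.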

Can you check the master err and log files of the node in trouble? Any error should be reported there. It's likely it failed to start due to some transient issue, since it worked on the retry.
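Based on the data-dir layout visible in the WAL paths later in this thread, the master logs for a yugabyted node should live somewhere like the following (the DATA_DIR value is an assumption; match it to your own setup):

```shell
# DATA_DIR is a placeholder; use the data directory of your `yugabyted start`.
DATA_DIR="$HOME/yuga/node/data"
ls "$DATA_DIR/yb-data/master/logs/"
# glog-style symlinks: yb-master.INFO / .WARNING / .ERROR, plus *.err for stderr
tail -n 200 "$DATA_DIR/yb-data/master/logs/yb-master.ERROR"
```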

You don't have to wait a day; if it does not come up within a couple of minutes, it means there is an issue.

Are you using yugabyted or just starting them individually?

yugabyted

Which UI are you referring to, the yb-master or the yugabyted UI?

yugabyted (overview page). We can see the “node status” page, but tend to have the overview page or the performance page open. So it's easy to miss / not know that a T-Master is not active.

We consider the T-Masters as integral to the DB's working as the T-Servers, and think this information needs to be displayed more prominently: how many T-Servers are active, how many T-Masters … As it currently works, only a T-Server being down is indicated as a node failure on the overview page UI.

Can you check the master err and log files of the node in trouble? Any error should be reported there. It’s likely it failed to start due to some transient issue, since it worked on the retry.

See below. This is from the time we started it in the evening and “let it run”.

W0401 00:02:10.084316 415358 catalog_manager.cc:1655] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:11898): Node 8287ba4d41ad4d2c96729b1f1c3dcb6a peer not initialized.
W0401 00:02:10.084671 415355 async_client_initializer.cc:94] Failed to initialize client: Illegal state (yb/client/client-internal.cc:2823): Could not locate the leader master: Unable to determine master addresses
W0401 00:02:10.095928 415365 catalog_manager.cc:1655] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:11898): Node 8287ba4d41ad4d2c96729b1f1c3dcb6a peer not initialized.
W0401 00:02:10.096002 415362 async_client_initializer.cc:94] Failed to initialize client: Illegal state (yb/client/client-internal.cc:2823): Could not locate the leader master: Unable to determine master addresses
W0401 00:02:10.145489 415363 log_util.cc:242] Could not read footer for segment: /home/xxxxxxxx/yuga/node/data/yb-data/master/wals/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/wal-000000004: Not found (yb/consensus/l>
W0401 00:02:10.147444 415363 log_reader.cc:217] T 00000000000000000000000000000000 P 8287ba4d41ad4d2c96729b1f1c3dcb6a: Log segment /home/xxxxxxxx/yuga/node/data/yb-data/master/wals/table-sys.catalog.uuid/tablet-00000000000000000000000>
W0401 00:02:10.159518 415376 master-path-handlers.cc:228] Illegal state (yb/tablet/tablet_peer.cc:1497): Unable to list masters during web request handling: Tablet peer T 00000000000000000000000000000000 P 8287ba4d41ad4d2c96729b1f1c3d>
W0401 00:02:10.161201 415363 transaction_participant.cc:1887] T 00000000000000000000000000000000 P 8287ba4d41ad4d2c96729b1f1c3dcb6a: Transaction not found: 0eb3868b-eea7-45e8-8939-a2814690758c, for: pre apply
W0401 00:02:10.161252 415363 transaction_participant.cc:883] T 00000000000000000000000000000000 P 8287ba4d41ad4d2c96729b1f1c3dcb6a: Apply of unknown transaction: { leader_term: -1 transaction_id: 0eb3868b-eea7-45e8-8939-a2814690758c a>
W0401 00:02:10.169905 415363 remove_intents_task.cc:58] Remove intents task failed: Aborted (yb/tablet/tablet_peer.cc:1700): Thread pool not ready
W0401 00:02:10.253392 415383 catalog_manager_bg_tasks.cc:194] Catalog manager background task thread going to sleep: Service unavailable (yb/master/scoped_leader_shared_lock.cc:91): Catalog manager is not initialized. State: 1
W0401 00:02:11.755707 415450 long_operation_tracker.cc:125] UpdateReplica running for 1.000s in thread 415449:
    @     0xffffb60cb6f3  (unknown)
    @     0xffffb60ce22f  pthread_cond_wait
    @     0xaaaab73e36f3  std::__1::__assoc_sub_state::__sub_wait()
    @     0xaaaab754c327  std::__1::__assoc_state<>::copy()
    @     0xaaaab896e827  yb::tablet::RunningTransaction::SendStatusRequest()
    @     0xaaaab896e11f  yb::tablet::RunningTransaction::RequestStatusAt()
    @     0xaaaab8a4a4cf  yb::tablet::TransactionParticipant::MinRunningHybridTime()
    @     0xaaaab8982207  yb::tablet::Tablet::ApplyIntents()
    @     0xaaaab8a535ef  yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
    @     0xaaaab89619e7  yb::tablet::UpdateTxnOperation::DoReplicated()
    @     0xaaaab8954bc3  yb::tablet::Operation::Replicated()
    @     0xaaaab89571df  yb::tablet::OperationDriver::ReplicationFinished()
    @     0xaaaab794ea1b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0xaaaab799c873  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0xaaaab799bbef  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0xaaaab7985e57  yb::consensus::RaftConsensus::UpdateReplica()
W0401 00:02:16.705066 415608 yb_rpc.cc:362] Call yb.consensus.ConsensusService.UpdateConsensus xxx.xxx.xxx.211:34211 => xxx.xxx.xxx.78:7100 (request call id 1163055) took 3004ms (client timeout 3000ms).

The last line repeats a few thousand times. IPs removed for privacy.

xxx.xxx.xxx.211 = db2 node
xxx.xxx.xxx.78 = db3 node (this failed master)

You don't have to wait a day; if it does not come up within a couple of minutes, it means there is an issue.

Noted. We are not sure how YugabyteDB works internally, so everything is a learning experience in how it reacts (especially with so many moving parts).

What is the db version?

Can you please file a GH issue for the UI ask? We try to hide the master vs. T-Server distinction in yugabyted since it leads to confusion for new users, but you have a legitimate case where it would be helpful to at least show a warning.

Latest version: 2.25

Can you please file a GH issue for the UI ask?

Will do.

The process’s status is reported on the nodes page. You must navigate to Edit Columns and select the Processes status.

However, I do agree that when one of the processes is down, we should report a warning on the overview page and direct the users to navigate to the nodes page with the process’s status displayed. Please file a GH issue for this.

Thanks,
Nikhil Chandrappa

@Benjiro Can you please let us know the version of YBDB you’re using for the tests.

Also, can you pls send the yugabyted.conf file of the node that is not coming back up. Or better, you could run yugabyted collect_logs and send us the .gz file.
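For reference, the collect_logs invocation looks roughly like this (`--base_dir` is whatever directory was passed to `yugabyted start`; the flag name may differ between versions, so check `yugabyted collect_logs --help`):

```shell
# Bundles logs and config into a compressed archive for support:
./bin/yugabyted collect_logs --base_dir ~/yuga/node
```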

Thanks,
Nikhil

Can you please let us know the version of YBDB you’re using for the tests.

Latest 2.25

Also, can you pls send the yugabyted.conf file of the node that is not coming back up. Or better, you could run yugabyted collect_logs and send us the .gz file.

Logs have been submitted. Support ticket: 13678