One node crashed suddenly

Hi,
While testing ingestion with 9 tablet servers, we observed that one of the tservers suddenly crashed after printing some FATAL logs. I am sharing the logs here to help understand the issue; none of the other servers show this problem.


F20240221 21:22:24 …/…/src/yb/gutil/casts.cc:21] Bad narrow cast: -1 < 0
@ 0x55b1ac30ae47 google::LogMessage::SendToLog()
@ 0x55b1ac30bacd google::LogMessage::Flush()
@ 0x55b1ac30c149 google::LogMessageFatal::~LogMessageFatal()
@ 0x55b1acc5d6ba yb::BadNarrowCast()
@ 0x55b1ad69503e std::__1::__function::__func<>::operator()()
@ 0x55b1ad5a5148 yb::tablet::Operation::Replicated()
@ 0x55b1ad5a6ea0 yb::tablet::OperationDriver::ReplicationFinished()
@ 0x55b1ac788b3d yb::consensus::ConsensusRound::NotifyReplicationFinished()
@ 0x55b1ac7d62ef yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
@ 0x55b1ac7d5669 yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
@ 0x55b1ac7b2465 yb::consensus::RaftConsensus::UpdateMajorityReplicated()
@ 0x55b1ac77c891 yb::consensus::PeerMessageQueue::NotifyObserversOfMajorityReplOpChangeTask()
@ 0x55b1adc90599 yb::ThreadPool::DispatchThread()
@ 0x55b1adc8c754 yb::thread::SuperviseThread()
@ 0x7f55f8152694 start_thread
@ 0x7f55f844f41d __clone

Please advise.

Thanks & regards,
Dipanjan

Hi @dipanjanghos

Can you share some more details about the situation:

  1. the version of YugabyteDB
  2. cluster setup/topology
  3. the core dump (please upload it; one should have been generated)
  4. a bit more of the surrounding logs
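To check whether a core dump could have been generated at all, here is a quick sketch (Linux-specific; on many distros cores are piped to systemd-coredump rather than written to disk):

```shell
# "0" means core dumps are disabled in this environment.
ulimit -c
# Where the kernel writes core files; a leading "|" means they are piped
# to a handler such as systemd-coredump (then inspect with `coredumpctl list`).
cat /proc/sys/kernel/core_pattern
```

If `ulimit -c` reports 0 for the service's environment, no core file will be produced even on a crash.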

Sorry for the long delay. We tested YugabyteDB YSQL with a variety of workloads for data lake use cases, at about 40k inserts per second.
It was fine until recently, when more than one tserver crashed suddenly. The reasons appear to be different in each case.
First, let me answer your questions:

yugabyte-2.20.1.3

I'm not sure I understood this question. We are using YSQL with 3 masters and 12 tservers.

I didn't find any core dump, but there are some FATAL and ERROR logs in the log directory. How do I upload them here?

Trying to upload the latest crash instance and some of the logs.

vi yb-tserver.HYNRABMPRFRMYDN01.yugabyte.log.FATAL.20240331-145012.2662058

Log file created at: 2024/03/31 14:50:12
Current UTC time: 2024/03/31 14:50:12
Running on machine: HYNRABMPRFRMYDN01
Application fingerprint: version 2.20.1.3 build 3 revision 4887e12d8a3d85d68b3bce96a9f8ec0edebb9f00 build_type RELEASE built at 23 Jan 2024 20:15:52 UTC
Node information: { hostname: 'HYNRABMPRFRMYDN01', rpc_ip: '10.81.88.74', webserver_ip: '10.81.86.74', uuid: '874102d787974a9c937a9eb304f759c9' }
Running duration (h:mm:ss): 242:15:16
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0331 14:50:12.791029 623605 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(7b72199ed9714f649d10be762212950d) Op before destroying it.
F0331 14:50:12.862526 2687864 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(f281b80a4a7c44ce8103fe10cc66538a) Op before destroying it.

cat yb-tserver.HYNRABMPRFRMYDN01.yugabyte.log.ERROR.20240321-123500.2662058

Log file created at: 2024/03/21 12:35:00
Current UTC time: 2024/03/21 12:35:00
Running on machine: HYNRABMPRFRMYDN01
Application fingerprint: version 2.20.1.3 build 3 revision 4887e12d8a3d85d68b3bce96a9f8ec0edebb9f00 build_type RELEASE built at 23 Jan 2024 20:15:52 UTC
Node information: { hostname: 'HYNRABMPRFRMYDN01', rpc_ip: '10.81.88.74', webserver_ip: '10.81.86.74', uuid: '874102d787974a9c937a9eb304f759c9' }
Running duration (h:mm:ss): 0:00:04
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0321 12:35:00.173357 2662137 tablet_metadata.cc:768] T 19c7a13766074e65937c658afdb47bec P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-19c7a13766074e65937c658afdb47bec: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-19c7a13766074e65937c658afdb47bec/LOCK: No such file or directory
E0321 12:35:05.160899 2662136 tablet_metadata.cc:768] T 0a0c0c4e8a344178a81969f323e8c12f P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-0a0c0c4e8a344178a81969f323e8c12f: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-0a0c0c4e8a344178a81969f323e8c12f/LOCK: No such file or directory
E0321 12:35:10.167415 2662141 tablet_metadata.cc:768] T 9456670f5bb1422b8b62972947394c88 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-9456670f5bb1422b8b62972947394c88: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-9456670f5bb1422b8b62972947394c88/LOCK: No such file or directory
E0321 12:35:14.167564 2662129 tablet_metadata.cc:768] T 33a03d566f1f402a8aa860792d9bb4a1 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-33a03d566f1f402a8aa860792d9bb4a1: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-33a03d566f1f402a8aa860792d9bb4a1/LOCK: No such file or directory
E0321 12:35:19.176136 2662146 tablet_metadata.cc:768] T 2e8bc41f9d1e4103b31c42f8dde1566a P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-2e8bc41f9d1e4103b31c42f8dde1566a: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-2e8bc41f9d1e4103b31c42f8dde1566a/LOCK: No such file or directory
E0321 12:35:24.182878 2662145 tablet_metadata.cc:768] T e304ae86742f43198ff9ecc9d06bf1ff P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-e304ae86742f43198ff9ecc9d06bf1ff: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-e304ae86742f43198ff9ecc9d06bf1ff/LOCK: No such file or directory
E0321 12:35:30.192399 2662142 tablet_metadata.cc:768] T 79deda933fdd45458d5e560779243530 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-79deda933fdd45458d5e560779243530: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-79deda933fdd45458d5e560779243530/LOCK: No such file or directory
E0321 12:35:35.196211 2662131 tablet_metadata.cc:768] T 57b5da0ebd1f450497d1d6eed0665f4f P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57b5da0ebd1f450497d1d6eed0665f4f: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57b5da0ebd1f450497d1d6eed0665f4f/LOCK: No such file or directory
E0321 12:35:41.206802 2662147 tablet_metadata.cc:768] T 57575a8f3df34b319596e8bccc93954d P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57575a8f3df34b319596e8bccc93954d: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57575a8f3df34b319596e8bccc93954d/LOCK: No such file or directory
E0321 12:35:46.209388 2662129 tablet_metadata.cc:768] T 57131472e0be4019a2e2caae0f8771fd P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57131472e0be4019a2e2caae0f8771fd: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57131472e0be4019a2e2caae0f8771fd/LOCK: No such file or directory
E0321 19:08:56.738435 2687868 tablet_metadata.cc:768] T 1bea1cf942e54c358327224176ccd045 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004019/tablet-1bea1cf942e54c358327224176ccd045: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004019/tablet-1bea1cf942e54c358327224176ccd045/LOCK: No such file or directory
E0321 19:09:56.612288 2687777 remote_bootstrap_service.cc:522] Remote bootstrap session 37386f98eddb429799c410b21787df08-627ea847713247dba63ef5d763cddecf-2117963.547s on tablet 627ea847713247dba63ef5d763cddecf with peer 37386f98eddb429799c410b21787df08 failed. session_succeeded = 0
E0321 19:23:56.507210 2687757 remote_bootstrap_service.cc:522] Remote bootstrap session 58a3fb0063004648867312bf4c1e8c30-c7252a0d800d4e368b5428dabeff7cf4-2118802.707s on tablet c7252a0d800d4e368b5428dabeff7cf4 with peer 58a3fb0063004648867312bf4c1e8c30 failed. session_succeeded = 0
E0322 11:39:08.106683 2701076 consensus_queue.cc:1641] Invalid peer UUID: f940a72013f24e338e5d5b597f2c0d38
E0323 14:22:28.450532 2697222 consensus_queue.cc:1641] Invalid peer UUID: 8f8b311cfa1a463eac277d4e8a4f4cfe
E0324 12:48:45.509325 2688198 consensus_queue.cc:1641] Invalid peer UUID: 58a3fb0063004648867312bf4c1e8c30
E0325 06:23:01.238265 2689684 ts_tablet_manager.cc:1708] T 1c2a7dd3b08e40efbe6d1d3dba3c4f34 P 874102d787974a9c937a9eb304f759c9: Tablet failed to bootstrap: Illegal state (yb/tablet/tablet_bootstrap.cc:1561): Failed log replay. Reason: WAL files missing, or committed op id is incorrect. Expected both term and index of prev_op_id to be greater than or equal to the corresponding components of committed_op_id. prev_op_id=1.824963, committed_op_id=4.1809577
E0325 06:23:01.539403 2687847 tablet_metadata.cc:768] T 1c2a7dd3b08e40efbe6d1d3dba3c4f34 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-0000400500003000800000000000402d/tablet-1c2a7dd3b08e40efbe6d1d3dba3c4f34: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-0000400500003000800000000000402d/tablet-1c2a7dd3b08e40efbe6d1d3dba3c4f34/LOCK: No such file or directory
E0325 06:39:00.085647 2662144 remote_bootstrap_service.cc:522] Remote bootstrap session 7b7f524af3cb4bc893f536c9d49b2bd3-cc480425e6da42b982f8cd38811671bb-2418497.977s on tablet cc480425e6da42b982f8cd38811671bb with peer 7b7f524af3cb4bc893f536c9d49b2bd3 failed. session_succeeded = 0
E0325 08:15:29.954365 2687696 consensus_queue.cc:1641] Invalid peer UUID: 37386f98eddb429799c410b21787df08
E0326 04:34:02.665969 2687773 tablet_metadata.cc:768] T 0d69b82c07204f50845b1cf1fd1e2b7c P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004033/tablet-0d69b82c07204f50845b1cf1fd1e2b7c: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004033/tablet-0d69b82c07204f50845b1cf1fd1e2b7c/LOCK: No such file or directory
E0326 06:55:01.792387 2687865 tablet_metadata.cc:768] T 73990c9d38a44caea851f6d9df4734eb P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004038/tablet-73990c9d38a44caea851f6d9df4734eb: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004038/tablet-73990c9d38a44caea851f6d9df4734eb/LOCK: No such file or directory
E0326 13:22:41.691524 2688198 consensus_queue.cc:1641] Invalid peer UUID: 58a3fb0063004648867312bf4c1e8c30
E0326 14:07:53.396988 2689689 consensus_queue.cc:1641] Invalid peer UUID: 8f8b311cfa1a463eac277d4e8a4f4cfe
E0327 08:45:29.665184 3666015 consensus_queue.cc:1641] Invalid peer UUID: 8f8b311cfa1a463eac277d4e8a4f4cfe
E0331 14:50:12.790409 623605 tablet_peer.cc:203] T 7b72199ed9714f649d10be762212950d P 874102d787974a9c937a9eb304f759c9 [state=RUNNING]: TabletPeer not fully shut down.
E0331 14:50:12.790900 623605 strand.cc:43] Strand 0x00001788ab1201c0 has not been fully shut down, running: 0, closing: 0, active_tasks: 0
F0331 14:50:12.791029 623605 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(7b72199ed9714f649d10be762212950d) Op before destroying it.
E0331 14:50:12.853185 2696805 tablet_peer.cc:203] T 7b72199ed9714f649d10be762212950d P fed73e3f60ec4b9a8b1037afdc85350d [state=RUNNING]: TabletPeer not fully shut down.
E0331 14:50:12.862305 2687864 tablet_peer.cc:203] T f281b80a4a7c44ce8103fe10cc66538a P 874102d787974a9c937a9eb304f759c9 [state=RUNNING]: TabletPeer not fully shut down.
E0331 14:50:12.862354 2687864 strand.cc:43] Strand 0x000017c53767f780 has not been fully shut down, running: 0, closing: 0, active_tasks: 0
F0331 14:50:12.862526 2687864 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(f281b80a4a7c44ce8103fe10cc66538a) Op before destroying it.

The postgresql-2024-03-31_000000.log has the following entries:

I0331 14:44:20.875800 2176246 pg_client.cc:158] Using TServer host_port: 10.81.88.74:9100
I0331 14:44:20.877102 2176246 pg_client.cc:165] Session id 19456: Session id acquired. Postgres backend pid: 2176246
I0331 14:44:30.954257 2176523 mem_tracker.cc:254] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0331 14:44:30.955114 2176523 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0331 14:44:30.955724 2176523 pg_client.cc:158] Using TServer host_port: 10.81.88.74:9100
I0331 14:44:30.957034 2176523 pg_client.cc:165] Session id 19457: Session id acquired. Postgres backend pid: 2176523
I0331 14:44:34.806078 2172264 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:45:00.142319 2176613 mem_tracker.cc:254] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0331 14:45:00.143167 2176613 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0331 14:45:00.143788 2176613 pg_client.cc:158] Using TServer host_port: 10.81.88.74:9100

.

.

I0331 14:45:25.065631 2176729 pg_client.cc:165] Session id 19460: Session id acquired. Postgres backend pid: 2176729
I0331 14:45:25.514719 2172456 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:45:32.956957 2172484 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:45:34.217507 2172502 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 1

.

.

2024-03-31 14:50:13.260 UTC [2176729] FATAL: terminating connection due to administrator command
2024-03-31 14:50:13.260 UTC [2176613] FATAL: terminating connection due to administrator command
2024-03-31 14:50:13.260 UTC [2176698] FATAL: terminating connection due to administrator command

.

.

I0331 14:50:13.261905 2176616 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:50:13.262017 2176701 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0

.

.

2024-03-31 14:50:13.271 UTC [2662208] WARNING: server process (PID 2175806) exited with exit code 1
2024-03-31 14:50:13.271 UTC [2662208] WARNING: server process (PID 2176613) exited with exit code 1

2024-03-31 14:50:13.283 UTC [2662260] LOG: shutting down
2024-03-31 14:50:13.331 UTC [2662208] LOG: database system is shut down

Please shed some light on this.

Thanks & Regards,
DG

Meaning CPU/memory/SSD space of the servers, number of servers, and any cluster configuration (single/multi AZ/region, row-level geo-replication, etc.).

You can upload a .zip to Google Drive or a file-sharing site and link it here.
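For collecting the files, a minimal sketch (the directory and file names below are illustrative stand-ins; point it at your actual tserver log directory):

```shell
# Bundle the FATAL/ERROR logs into a single archive for sharing.
LOGDIR=/tmp/yb-logs-demo            # stand-in for your tserver log directory
mkdir -p "$LOGDIR"
touch "$LOGDIR/yb-tserver.demo.log.FATAL.20240331" \
      "$LOGDIR/yb-tserver.demo.log.ERROR.20240321"
# tar + gzip keeps the upload small; list the contents to verify.
tar czf /tmp/yb-tserver-logs.tar.gz -C "$LOGDIR" .
tar tzf /tmp/yb-tserver-logs.tar.gz
```

A .zip of the same files works equally well; the point is one self-contained archive covering the FATAL and ERROR logs around the crash time.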

On-premises, bare metal
cpu: 64 core
memory: 756 GB
SSD: two data mount points of 4 TB each (SAN storage, in RAID 5).
3 master, 9 tserver (total 12 servers)
Single region, three racks as AZs

If possible, you shouldn't use external storage (the DB already does replication, and you already have persistent storage on-prem, unlike cloud VMs).

You shouldn't use RAID 5 (expensive to write and rebuild); RAID 10 or RAID 1 would be better (in YugabyteDB, ideally no RAID at all).

Got it. But can you explain the error: why does it occur, and in which situations?

Could you access the zip file in

Yes.

I haven’t encountered it before. It seems to be happening in the WAL GC code path.
I’ve asked a dev internally who works on this subproject.

Seems like a corner case when bootstrapping from the closest peer. This issue was created to track the fix: [DocDB] TServer crash - Check failed: !manager_.get() · Issue #22007 · yugabyte/yugabyte-db · GitHub

As a temporary workaround, `remote_bootstrap_from_leader_only` could be set to true on yb-tserver to disable the feature and avoid hitting the bug.
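For example, assuming a flag-file based deployment (the path below is an illustrative stand-in; edit your tserver's actual flag file and restart the process for the flag to take effect):

```shell
# Append the workaround flag to the yb-tserver flag file (path is a stand-in).
FLAGFILE=/tmp/yb-tserver.flags
echo '--remote_bootstrap_from_leader_only=true' >> "$FLAGFILE"
# Verify the flag is present.
grep 'remote_bootstrap_from_leader_only' "$FLAGFILE"
# Then restart the tserver so the flag takes effect, e.g.:
# systemctl restart yb-tserver    # or your deployment's restart command
```

Remember to remove the flag again once you upgrade to a build containing the fix, so remote bootstrap from the closest peer is re-enabled.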