Hi,
While testing ingestion with 9 tablet servers we suddenly observed that one of the tservers crushed after printing some FATAL logs. I am just sharing the logs here to understand the issue. All others don’t have such issues.
Sorry for the long delay. Actually we tested with YugabyteDB YSQL with all sorts of workloads according to data lake use cases with 40k insertion per second.
It was fine until recently we faced some sudden crushing of more than one tservers . The reasons seem to be different in those cases.
Now first of all let me answer your question:
yugabyte-2.20.1.3
not sure that i understood this question. We are using YSQL with 3 master and 12 tserver.
I didn’t found any coredump . but there are some FATAL and ERROR logs in the directory. But how to upoad here?
Trying to upload the latest instance of cursh and some of the logs.
vi yb-tserver.HYNRABMPRFRMYDN01.yugabyte.log.FATAL.20240331-145012.2662058
Log file created at: 2024/03/31 14:50:12
Current UTC time: 2024/03/31 14:50:12
Running on machine: HYNRABMPRFRMYDN01
Application fingerprint: version 2.20.1.3 build 3 revision 4887e12d8a3d85d68b3bce96a9f8ec0edebb9f00 build_type RELEASE built at 23 Jan 2024 20:15:52 UTC
Node information: { hostname: 'HYNRABMPRFRMYDN01', rpc_ip: '10.81.88.74', webserver_ip: '10.81.86.74', uuid: '874102d787974a9c937a9eb304f759c9' }
Running duration (h:mm:ss): 242:15:16
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0331 14:50:12.791029 623605 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(7b72199ed9714f649d10be762212950d) Op before destroying it.
F0331 14:50:12.862526 2687864 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(f281b80a4a7c44ce8103fe10cc66538a) Op before destroying it.
Log file created at: 2024/03/21 12:35:00
Current UTC time: 2024/03/21 12:35:00
Running on machine: HYNRABMPRFRMYDN01
Application fingerprint: version 2.20.1.3 build 3 revision 4887e12d8a3d85d68b3bce96a9f8ec0edebb9f00 build_type RELEASE built at 23 Jan 2024 20:15:52 UTC
Node information: { hostname: 'HYNRABMPRFRMYDN01', rpc_ip: '10.81.88.74', webserver_ip: '10.81.86.74', uuid: '874102d787974a9c937a9eb304f759c9' }
Running duration (h:mm:ss): 0:00:04
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0321 12:35:00.173357 2662137 tablet_metadata.cc:768] T 19c7a13766074e65937c658afdb47bec P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-19c7a13766074e65937c658afdb47bec: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-19c7a13766074e65937c658afdb47bec/LOCK: No such file or directory
E0321 12:35:05.160899 2662136 tablet_metadata.cc:768] T 0a0c0c4e8a344178a81969f323e8c12f P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-0a0c0c4e8a344178a81969f323e8c12f: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-0a0c0c4e8a344178a81969f323e8c12f/LOCK: No such file or directory
E0321 12:35:10.167415 2662141 tablet_metadata.cc:768] T 9456670f5bb1422b8b62972947394c88 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-9456670f5bb1422b8b62972947394c88: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-9456670f5bb1422b8b62972947394c88/LOCK: No such file or directory
E0321 12:35:14.167564 2662129 tablet_metadata.cc:768] T 33a03d566f1f402a8aa860792d9bb4a1 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-33a03d566f1f402a8aa860792d9bb4a1: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-33a03d566f1f402a8aa860792d9bb4a1/LOCK: No such file or directory
E0321 12:35:19.176136 2662146 tablet_metadata.cc:768] T 2e8bc41f9d1e4103b31c42f8dde1566a P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-2e8bc41f9d1e4103b31c42f8dde1566a: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-2e8bc41f9d1e4103b31c42f8dde1566a/LOCK: No such file or directory
E0321 12:35:24.182878 2662145 tablet_metadata.cc:768] T e304ae86742f43198ff9ecc9d06bf1ff P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-e304ae86742f43198ff9ecc9d06bf1ff: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-e304ae86742f43198ff9ecc9d06bf1ff/LOCK: No such file or directory
E0321 12:35:30.192399 2662142 tablet_metadata.cc:768] T 79deda933fdd45458d5e560779243530 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-79deda933fdd45458d5e560779243530: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-79deda933fdd45458d5e560779243530/LOCK: No such file or directory
E0321 12:35:35.196211 2662131 tablet_metadata.cc:768] T 57b5da0ebd1f450497d1d6eed0665f4f P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57b5da0ebd1f450497d1d6eed0665f4f: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57b5da0ebd1f450497d1d6eed0665f4f/LOCK: No such file or directory
E0321 12:35:41.206802 2662147 tablet_metadata.cc:768] T 57575a8f3df34b319596e8bccc93954d P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57575a8f3df34b319596e8bccc93954d: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57575a8f3df34b319596e8bccc93954d/LOCK: No such file or directory
E0321 12:35:46.209388 2662129 tablet_metadata.cc:768] T 57131472e0be4019a2e2caae0f8771fd P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57131472e0be4019a2e2caae0f8771fd: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-d71e983cd28d407a9eeaa0ecdc1adbd5/tablet-57131472e0be4019a2e2caae0f8771fd/LOCK: No such file or directory
E0321 19:08:56.738435 2687868 tablet_metadata.cc:768] T 1bea1cf942e54c358327224176ccd045 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004019/tablet-1bea1cf942e54c358327224176ccd045: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004019/tablet-1bea1cf942e54c358327224176ccd045/LOCK: No such file or directory
E0321 19:09:56.612288 2687777 remote_bootstrap_service.cc:522] Remote bootstrap session 37386f98eddb429799c410b21787df08-627ea847713247dba63ef5d763cddecf-2117963.547s on tablet 627ea847713247dba63ef5d763cddecf with peer 37386f98eddb429799c410b21787df08 failed. session_succeeded = 0
E0321 19:23:56.507210 2687757 remote_bootstrap_service.cc:522] Remote bootstrap session 58a3fb0063004648867312bf4c1e8c30-c7252a0d800d4e368b5428dabeff7cf4-2118802.707s on tablet c7252a0d800d4e368b5428dabeff7cf4 with peer 58a3fb0063004648867312bf4c1e8c30 failed. session_succeeded = 0
E0322 11:39:08.106683 2701076 consensus_queue.cc:1641] Invalid peer UUID: f940a72013f24e338e5d5b597f2c0d38
E0323 14:22:28.450532 2697222 consensus_queue.cc:1641] Invalid peer UUID: 8f8b311cfa1a463eac277d4e8a4f4cfe
E0324 12:48:45.509325 2688198 consensus_queue.cc:1641] Invalid peer UUID: 58a3fb0063004648867312bf4c1e8c30
E0325 06:23:01.238265 2689684 ts_tablet_manager.cc:1708] T 1c2a7dd3b08e40efbe6d1d3dba3c4f34 P 874102d787974a9c937a9eb304f759c9: Tablet failed to bootstrap: Illegal state (yb/tablet/tablet_bootstrap.cc:1561): Failed log replay. Reason: WAL files missing, or committed op id is incorrect. Expected both term and index of prev_op_id to be greater than or equal to the corresponding components of committed_op_id. prev_op_id=1.824963, committed_op_id=4.1809577
E0325 06:23:01.539403 2687847 tablet_metadata.cc:768] T 1c2a7dd3b08e40efbe6d1d3dba3c4f34 P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-0000400500003000800000000000402d/tablet-1c2a7dd3b08e40efbe6d1d3dba3c4f34: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-0000400500003000800000000000402d/tablet-1c2a7dd3b08e40efbe6d1d3dba3c4f34/LOCK: No such file or directory
E0325 06:39:00.085647 2662144 remote_bootstrap_service.cc:522] Remote bootstrap session 7b7f524af3cb4bc893f536c9d49b2bd3-cc480425e6da42b982f8cd38811671bb-2418497.977s on tablet cc480425e6da42b982f8cd38811671bb with peer 7b7f524af3cb4bc893f536c9d49b2bd3 failed. session_succeeded = 0
E0325 08:15:29.954365 2687696 consensus_queue.cc:1641] Invalid peer UUID: 37386f98eddb429799c410b21787df08
E0326 04:34:02.665969 2687773 tablet_metadata.cc:768] T 0d69b82c07204f50845b1cf1fd1e2b7c P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004033/tablet-0d69b82c07204f50845b1cf1fd1e2b7c: IO error (yb/rocksdb/util/env_posix.cc:315): /data1/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004033/tablet-0d69b82c07204f50845b1cf1fd1e2b7c/LOCK: No such file or directory
E0326 06:55:01.792387 2687865 tablet_metadata.cc:768] T 73990c9d38a44caea851f6d9df4734eb P 874102d787974a9c937a9eb304f759c9: Failed to destroy regular DB at: /data2/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004038/tablet-73990c9d38a44caea851f6d9df4734eb: IO error (yb/rocksdb/util/env_posix.cc:315): /data2/yugabyte/yb-data/tserver/data/rocksdb/table-00004005000030008000000000004038/tablet-73990c9d38a44caea851f6d9df4734eb/LOCK: No such file or directory
E0326 13:22:41.691524 2688198 consensus_queue.cc:1641] Invalid peer UUID: 58a3fb0063004648867312bf4c1e8c30
E0326 14:07:53.396988 2689689 consensus_queue.cc:1641] Invalid peer UUID: 8f8b311cfa1a463eac277d4e8a4f4cfe
E0327 08:45:29.665184 3666015 consensus_queue.cc:1641] Invalid peer UUID: 8f8b311cfa1a463eac277d4e8a4f4cfe
E0331 14:50:12.790409 623605 tablet_peer.cc:203] T 7b72199ed9714f649d10be762212950d P 874102d787974a9c937a9eb304f759c9 [state=RUNNING]: TabletPeer not fully shut down.
E0331 14:50:12.790900 623605 strand.cc:43] Strand 0x00001788ab1201c0 has not been fully shut down, running: 0, closing: 0, active_tasks: 0
F0331 14:50:12.791029 623605 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(7b72199ed9714f649d10be762212950d) Op before destroying it.
E0331 14:50:12.853185 2696805 tablet_peer.cc:203] T 7b72199ed9714f649d10be762212950d P fed73e3f60ec4b9a8b1037afdc85350d [state=RUNNING]: TabletPeer not fully shut down.
E0331 14:50:12.862305 2687864 tablet_peer.cc:203] T f281b80a4a7c44ce8103fe10cc66538a P 874102d787974a9c937a9eb304f759c9 [state=RUNNING]: TabletPeer not fully shut down.
E0331 14:50:12.862354 2687864 strand.cc:43] Strand 0x000017c53767f780 has not been fully shut down, running: 0, closing: 0, active_tasks: 0
F0331 14:50:12.862526 2687864 maintenance_manager.cc:101] Check failed: !manager_.get() You must unregister the LogGCOp(f281b80a4a7c44ce8103fe10cc66538a) Op before destroying it.
The postgresql-2024-03-31_000000.log has the following entries:
I0331 14:44:20.875800 2176246 pg_client.cc:158] Using TServer host_port: 10.81.88.74:9100
I0331 14:44:20.877102 2176246 pg_client.cc:165] Session id 19456: Session id acquired. Postgres backend pid: 2176246
I0331 14:44:30.954257 2176523 mem_tracker.cc:254] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0331 14:44:30.955114 2176523 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0331 14:44:30.955724 2176523 pg_client.cc:158] Using TServer host_port: 10.81.88.74:9100
I0331 14:44:30.957034 2176523 pg_client.cc:165] Session id 19457: Session id acquired. Postgres backend pid: 2176523
I0331 14:44:34.806078 2172264 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:45:00.142319 2176613 mem_tracker.cc:254] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0331 14:45:00.143167 2176613 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0331 14:45:00.143788 2176613 pg_client.cc:158] Using TServer host_port: 10.81.88.74:9100
.
.
I0331 14:45:25.065631 2176729 pg_client.cc:165] Session id 19460: Session id acquired. Postgres backend pid: 2176729
I0331 14:45:25.514719 2172456 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:45:32.956957 2172484 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:45:34.217507 2172502 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 1
.
.
2024-03-31 14:50:13.260 UTC [2176729] FATAL: terminating connection due to administrator command
2024-03-31 14:50:13.260 UTC [2176613] FATAL: terminating connection due to administrator command
2024-03-31 14:50:13.260 UTC [2176698] FATAL: terminating connection due to administrator command
.
.
I0331 14:50:13.261905 2176616 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0331 14:50:13.262017 2176701 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:78): Scheduler is shutting down (system error 108)
I0
.
,
2024-03-31 14:50:13.271 UTC [2662208] WARNING: server process (PID 2175806) exited with exit code 1
2024-03-31 14:50:13.271 UTC [2662208] WARNING: server process (PID 2176613) exited with exit code 1
2024-03-31 14:50:13.283 UTC [2662260] LOG: shutting down
2024-03-31 14:50:13.331 UTC [2662208] LOG: database system is shut down
On-premises, bare metal
cpu: 64 core
memory: 756 GB
SSD: two data mount point of 4 TB (SAN storage, in RAID 5) each.
3 master, 9 tserver (total 12 servers)
Single region, three RACKs as AZ
we have seen a similar issue, but are not sure if it is really the same.
We are running yugabyte version 2.20.4.0-b50. The cluster was running fine for some weeks.
Today we noticed that one of the tablet servers restarted with an error message.
Fortunately, the tablet server could recover quickly and seems to be running without problems again.
This is the error we saw in the logs of the tablet server:
@dorian_yugabyte We are using Alma Linux 8.10 and the cpu architecture is x86_64.
@Hari_yb The tserver started at: 2024-08-08 13:39:51 and crashed at 2024-09-06 06:40:52. So it ran for about 28.7 days.
I can provide an excerpt of the logs, but I could not find a log message starting with “Log file created at:”. Maybe because we activated “–logtostderr”.
But I could find “Fatal failure details written to /mnt/disk0/yb-data/tserver/logs/yb-tserver.FATAL.details.2024-09-06T06_40_52.pid29.txt”
I uploaded the log excerpt to yb-logs/bad_narrow_cast/yb-tserver-3-20240906.txt at main · MichiganL/yb-logs · GitHub
@MichiganL thank you for that information. Just to double check can you also run this on the machine and provide us the output please? $ date;uptime
Also, if you have the /mnt/disk0/yb-data/tserver/logs/yb-tserver.FATAL.details.2024-09-06T06_40_52.pid29.txt still around that will also be useful to get.
We believe this has something to do with the edge cases around CLOCK_MONOTONIC_COARSE which internally uses current_kernel_time which is why the CPU (type, clock rate), and OS info is useful here.
@Hari_yb Thank you for looking into it.
We deployed yugabyte on kubernetes. I used the container status to see when the container was created and restarted.
Unfortunately I do not have direct access to the pods to execute date;uptime. I will try to get access. Once I get access I will also post more info about the cpu and kernel.
The file /mnt/disk0/yb-data/tserver/logs/yb-tserver.FATAL.details.2024-09-06T06_40_52.pid29.txt contains the logs I posted earlier. It did not contain more lines.
@MichiganL it appears this is happening due to NTP clock corrections.
Do you have any special NTP settings? What is your CPU model number (if you are able to get it)?
We have a private build with the fix. Note that this is not an official release so please DO NOT use it in production. You can use docker image yugabytedb/yugabyte:2024.1.1.20335-b2 for your testing. Official release will be available soon.
@Hari_yb I get an unauthorized error when trying to pull the image from quay.io. I guess it is a private registry. Would it be possible to publish the image on docker hub?