I’m running YugabyteDB in Kubernetes on a bare-metal on-premises installation and have hit a problem with yb-tserver suddenly crashing: the process exits with code 1 and leaves no ERROR or FATAL message in the logs.
The cluster runs on three dedicated nodes (AMD Ryzen 9 7950X: 16 cores × 4.5 GHz, 128 GB DDR5, 2 × 2 TB NVMe M.2 SSD), with three yb-masters and three yb-tservers deployed as one yb-master and one yb-tserver per node. The yb-master pods are fine, but the yb-tserver pods have a high restart count with Reason: Error - exit code: 1.
There are no new messages in the ERROR or FATAL logs (FATAL logs do not exist at all) before or after a restart, so I don’t know where to start digging.
The last message in the container log is:
2024-02-26 09:47:59,247 [INFO] k8s_parent.py: core_pattern is: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
Traceback (most recent call last):
  File "/home/yugabyte/tools/k8s_parent.py", line 236, in <module>
    files_copied = copy_cores(cores_dir)
  File "/home/yugabyte/tools/k8s_parent.py", line 124, in copy_cores
    dir_path = get_core_dump_dir()
  File "/home/yugabyte/tools/k8s_parent.py", line 65, in get_core_dump_dir
    raise ValueError("core_pattern starts with |, can't do anything useful")
ValueError: core_pattern starts with |, can't do anything useful
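If I read that traceback right, k8s_parent.py bails out because the host’s kernel.core_pattern pipes core dumps to systemd-coredump, so there is no plain directory it could copy cores from. A minimal sketch of that check, run on the node itself (it only mirrors my understanding of the k8s_parent.py logic from the traceback above):

```python
# core_pattern_check.py - minimal sketch, run on the node.
# Assumption: this only mirrors the check k8s_parent.py appears to do;
# the "starts with |" interpretation comes from the traceback above.

CORE_PATTERN = "/proc/sys/kernel/core_pattern"

with open(CORE_PATTERN) as f:
    pattern = f.read().strip()

print(f"kernel.core_pattern = {pattern!r}")

if pattern.startswith("|"):
    # Cores are piped to a helper binary (here systemd-coredump) instead of
    # being written into a directory the container could copy them from.
    helper = pattern.lstrip("|").split()[0]
    print(f"core dumps are piped to {helper}; on the host they should be "
          f"visible with `coredumpctl list`")
else:
    print("core dumps are written to files matching this pattern")
```

If that is correct, any core from a crashing yb-tserver should end up on the node (listable there with coredumpctl) rather than inside the pod.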
By the way, there are a lot of warning messages before the stop:
W0226 09:44:33.717324 135 long_operation_tracker.cc:125] Read running for 1.001s in thread 614:
@ 0x7fe94d4b6b39 __lll_lock_wait
@ 0x7fe94d4b16e2 __GI___pthread_mutex_lock
@ 0x5643bccc254a rocksdb::(anonymous namespace)::ShardedLRUCache::Lookup()
@ 0x5643bcc95ea6 rocksdb::BlockBasedTable::GetDataBlockFromCache()
@ 0x5643bcc95317 rocksdb::BlockBasedTable::RetrieveBlock()
@ 0x5643bcc9802f rocksdb::BlockBasedTable::NewDataBlockIterator()
@ 0x5643bccbeeda rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
@ 0x5643bccbe6ef rocksdb::(anonymous namespace)::TwoLevelIterator::Seek()
@ 0x5643bccacef8 rocksdb::MergingIterator::Seek()
@ 0x5643bcc04b6e rocksdb::DBIter::Seek()
@ 0x5643bc166fba yb::docdb::BoundedRocksDbIterator::Seek()
@ 0x5643bc22ef1c yb::docdb::(anonymous namespace)::SeekPossiblyUsingNext()
@ 0x5643bc22f66f yb::docdb::PerformRocksDBSeek()
@ 0x5643bc222103 yb::docdb::IntentAwareIterator::Seek()
@ 0x5643bc1da0d1 yb::docdb::DocRowwiseIterator::Seek()
@ 0x5643bc1e151b yb::docdb::DocRowwiseIteratorBase::SeekTuple()
W0226 09:46:34.203969 135 long_operation_tracker.cc:125] Read running for 1.000s in thread 855:
@ 0x7fe94d79de62 __GI___pread
@ 0x5643bd500261 yb::PosixRandomAccessFile::Read()
@ 0x5643bd4ff1ba yb::RandomAccessFile::ReadAndValidate()
@ 0x5643bcccd4f8 rocksdb::RandomAccessFileReader::ReadAndValidate()
@ 0x5643bcca075f rocksdb::ReadBlockContents()
@ 0x5643bcc962db rocksdb::block_based_table::ReadBlockFromFile()
@ 0x5643bcc95409 rocksdb::BlockBasedTable::RetrieveBlock()
@ 0x5643bcc9802f rocksdb::BlockBasedTable::NewDataBlockIterator()
@ 0x5643bccbeeda rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
@ 0x5643bccbe6ef rocksdb::(anonymous namespace)::TwoLevelIterator::Seek()
@ 0x5643bccacef8 rocksdb::MergingIterator::Seek()
@ 0x5643bcc04b6e rocksdb::DBIter::Seek()
@ 0x5643bc166fba yb::docdb::BoundedRocksDbIterator::Seek()
@ 0x5643bc22ef1c yb::docdb::(anonymous namespace)::SeekPossiblyUsingNext()
@ 0x5643bc22f66f yb::docdb::PerformRocksDBSeek()
@ 0x5643bc222103 yb::docdb::IntentAwareIterator::Seek()
W0226 09:46:59.360953 135 long_operation_tracker.cc:125] Read running for 1.000s in thread 817:
@ 0x5643bcc7bb53 rocksdb::DecodeRestartEntry()
@ 0x5643bcc7b445 rocksdb::BlockIter::Seek()
@ 0x5643bccab3cb rocksdb::IteratorWrapper::Seek()
@ 0x5643bccaadab rocksdb::MultiLevelIterator::Seek()
@ 0x5643bccbe6e3 rocksdb::(anonymous namespace)::TwoLevelIterator::Seek()
@ 0x5643bccacef8 rocksdb::MergingIterator::Seek()
@ 0x5643bcc04b6e rocksdb::DBIter::Seek()
@ 0x5643bc166fba yb::docdb::BoundedRocksDbIterator::Seek()
@ 0x5643bc22ef1c yb::docdb::(anonymous namespace)::SeekPossiblyUsingNext()
@ 0x5643bc22f66f yb::docdb::PerformRocksDBSeek()
@ 0x5643bc222103 yb::docdb::IntentAwareIterator::Seek()
@ 0x5643bc1da0d1 yb::docdb::DocRowwiseIterator::Seek()
@ 0x5643bc1e151b yb::docdb::DocRowwiseIteratorBase::SeekTuple()
@ 0x5643bc24e6cd yb::docdb::(anonymous namespace)::FilteringIterator::FetchTuple()
@ 0x5643bc24aacd yb::docdb::PgsqlReadOperation::Execute()
@ 0x5643bcecb0f5 yb::tablet::Tablet::HandlePgsqlReadRequest()
W0226 09:47:00.436651 135 long_operation_tracker.cc:125] Read running for 1.224s in thread 806:
Thread did not respond: maybe it is blocking signals
W0226 09:47:00.711908 869 long_operation_tracker.cc:155] Read took a long time: 1.487s
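Those stacks look like reads stalling either on the RocksDB block-cache mutex or on pread() against the local NVMe drives. To correlate the “Read running for N s” warnings with device-level latency, I sketched a small sampler to run on the node (it assumes psutil is installed there; nvme0n1/nvme1n1 are placeholder device names for whatever backs the local PVs):

```python
# disk_read_latency.py - rough sketch for correlating the "Read running for N s"
# warnings with device-level read latency. Assumes psutil is available on the
# node; the device names below are placeholders and must be adjusted.
import time
import psutil

DISKS = ["nvme0n1", "nvme1n1"]  # hypothetical names, adjust to the node
INTERVAL = 5  # seconds between samples

def snapshot():
    counters = psutil.disk_io_counters(perdisk=True)
    return {d: counters[d] for d in DISKS if d in counters}

prev = snapshot()
while True:
    time.sleep(INTERVAL)
    cur = snapshot()
    for disk, now in cur.items():
        before = prev[disk]
        reads = now.read_count - before.read_count
        # read_time is the total time spent reading, in milliseconds (Linux).
        ms = now.read_time - before.read_time
        avg = ms / reads if reads else 0.0
        print(f"{disk}: {reads} reads, avg latency {avg:.2f} ms over {INTERVAL}s")
    prev = cur
```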
yb-tserver uses sig-storage-local-static-provisioner as the storage class, so all data is stored locally on the same server.
YugabyteDB was installed with the YugabyteDB Helm chart; the current container image is yugabytedb/yugabyte:2.20.1.3-b3 (for both master and tserver).
I thought it could be related to OOM kills caused by resource limits, but containers stopped by the OOM killer exit with code 137, not 1. There are also some questions about memory consumption, but that is a subject for a separate topic.
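For completeness, this is roughly how I check what the container last terminated with (a sketch using the official kubernetes Python client; the pod name and namespace are placeholders for my release). An OOM kill would show up here as reason OOMKilled with exit code 137:

```python
# last_termination.py - sketch using the kubernetes Python client to confirm
# how the yb-tserver container last terminated (OOMKilled would show up here).
# Pod name and namespace are assumptions; adjust them to your release.
from kubernetes import client, config

POD_NAME = "yb-tserver-0"   # hypothetical, depends on the Helm release
NAMESPACE = "yb-demo"       # hypothetical namespace

config.load_kube_config()
v1 = client.CoreV1Api()
pod = v1.read_namespaced_pod(POD_NAME, NAMESPACE)

for status in pod.status.container_statuses or []:
    terminated = status.last_state.terminated
    if terminated:
        print(f"{status.name}: restarts={status.restart_count} "
              f"reason={terminated.reason} exit_code={terminated.exit_code}")
```

In my case kubectl already shows reason Error with exit code 1 rather than OOMKilled/137, which is why I ruled the OOM killer out.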
Please advise how to debug this situation with the unexpected “exit code: 1” of yb-tserver.