Tserver in dead status after bad narrow cast

I wanted to test yugabytedb before going on the server. I start docker container with such a settings:
docker run -d --name yugabyte --cpus=8 -m 16g -p7000:7000 -p9001:9000 -p15433:15433 -p5433:5433 -p9042:9042 yugabytedb/yugabyte:latest bin/yugabyted start --background=false

I have prepared the spring-boot based application which insert or update records in the single table. It executes sql queries in 20 threads.
After I’ve limited resources to 8 cpus and 16gb of ram, the tserver is in the status of dead.
I see that it happens after 3 error connected with Bad narrow cast: -32 < 0. It could recover after previous two

F1121 08:26:56.614046  1274 casts.cc:21] Bad narrow cast: -1 < 0
F1121 08:26:56.614130  1203 casts.cc:21] Bad narrow cast: -1 < 0

On docker container, the process of tserver looks as follow:
root 1341 70.5 0.0 0 0 ? Z 08:26 51:50 [yb-tserver] <defunct>

The error happens after conflict resolutions:

I1121 08:36:11.590842  1667 tablet_rpc.cc:495] Operation failed. Try again (yb/docdb/conflict_resolution.cc:954): Failed Write(tablet: 0e643bc9b3694aaf842526104bb49dce, num_ops: 1, num_attempts: 1, txn: f7958969-d31a-4efd-8f48-b2e6a9e15421, subtxn: [none]) to tablet 0e643bc9b3694aaf842526104bb49dce on tablet server { uuid: 9d74615089e341018f3c26172ade8d16 private: [host: "172.17.0.2" port: 9100] public: [host: "172.17.0.2" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 1 attempt(s): Conflict with concurrently committed data. Value write after transaction start: doc ht ({ physical: 1732178171581104 }) >= read time ({ physical: 1732178171581104 }): kConflict (transaction error 3)
I1121 08:36:11.590842  1500 tablet_rpc.cc:495] Operation failed. Try again (yb/docdb/conflict_resolution.cc:954): Failed Write(tablet: dbfe391c83b14857a9f22f5e739d8414, num_ops: 1, num_attempts: 1, txn: f7958969-d31a-4efd-8f48-b2e6a9e15421, subtxn: [none]) to tablet dbfe391c83b14857a9f22f5e739d8414 on tablet server { uuid: 9d74615089e341018f3c26172ade8d16 private: [host: "172.17.0.2" port: 9100] public: [host: "172.17.0.2" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 1 attempt(s): Conflict with concurrently committed data. Value write after transaction start: doc ht ({ physical: 1732178171581104 }) >= read time ({ physical: 1732178171581104 }): kConflict (transaction error 3)
F1121 08:36:11.930039  1633 casts.cc:21] Bad narrow cast: -32 < 0

Hi @yugi

What version are you using? Can you do select version(); and paste the output?

Can you please share the FATAL file?

It’s most likely the same as [DocDB] Bad narrow cast FATAL occurred in the tserver, caused by CoarseMonoClock() going back in time. · Issue #20335 · yugabyte/yugabyte-db · GitHub

Fix is in versions higher than or equal to 2.23.1.0, 2.20.7.0, and 2024.1.3.0

PostgreSQL 11.2-YB-2.23.0.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 17.0.6 (GitHub - yugabyte/llvm-project: The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. Note: the repository does not accept github pull requests at this moment. Please submit your patches at http://reviews.llvm.org. 9b881774e40024e901fc6f3d313607b071c08631), 64-bit

sure. All fatal files:

yb-tserver.d4651acfa5bc.root.log.FATAL.20241121-082443.102

Log file created at: 2024/11/21 08:24:43
Current UTC time: 2024/11/21 08:24:43
Running on machine: d4651acfa5bc
Application fingerprint: version 2.23.0.0 build 710 revision a31a4eba31527467192adc816e771ac09c28bb1a build_type RELEASE built at 30 Aug 2024 21:07:28 UTC
Node information: { hostname: 'd4651acfa5bc', rpc_ip: '172.17.0.2', webserver_ip: '172.17.0.2', uuid: '9d74615089e341018f3c26172ade8d16' }
Running duration (h:mm:ss): 0:23:45
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F1121 08:24:43.054159   474 casts.cc:21] Bad narrow cast: -1 < 0
F1121 08:24:43.054395   501 casts.cc:21] Bad narrow cast: -1 < 0

yb-tserver.d4651acfa5bc.root.log.FATAL.20241121-082656.960

Log file created at: 2024/11/21 08:26:56
Current UTC time: 2024/11/21 08:26:56
Running on machine: d4651acfa5bc
Application fingerprint: version 2.23.0.0 build 710 revision a31a4eba31527467192adc816e771ac09c28bb1a build_type RELEASE built at 30 Aug 2024 21:07:28 UTC
Node information: { hostname: 'd4651acfa5bc', rpc_ip: '172.17.0.2', webserver_ip: '172.17.0.2', uuid: '9d74615089e341018f3c26172ade8d16' }
Running duration (h:mm:ss): 0:02:10
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F1121 08:26:56.614046  1274 casts.cc:21] Bad narrow cast: -1 < 0
F1121 08:26:56.614130  1203 casts.cc:21] Bad narrow cast: -1 < 0

yb-tserver.d4651acfa5bc.root.log.FATAL.20241121-083611.1341


Log file created at: 2024/11/21 08:36:11
Current UTC time: 2024/11/21 08:36:11
Running on machine: d4651acfa5bc
Application fingerprint: version 2.23.0.0 build 710 revision a31a4eba31527467192adc816e771ac09c28bb1a build_type RELEASE built at 30 Aug 2024 21:07:28 UTC
Node information: { hostname: 'd4651acfa5bc', rpc_ip: '172.17.0.2', webserver_ip: '172.17.0.2', uuid: '9d74615089e341018f3c26172ade8d16' }
Running duration (h:mm:ss): 0:09:14
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F1121 08:36:11.930039  1633 casts.cc:21] Bad narrow cast: -32 < 0

yb-tserver.FATAL.details.2024-11-21T08_24_43.pid102.txt

F20241121 08:24:43 ../../src/yb/gutil/casts.cc:21] Bad narrow cast: -1 < 0
    @     0x55e52c68ae27  google::LogMessage::SendToLog()
    @     0x55e52c68bd5d  google::LogMessage::Flush()
    @     0x55e52c68c3a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x55e52d07c2c7  yb::BadNarrowCast()
    @     0x55e52dc31eb3  std::__1::__function::__func<>::operator()()
    @     0x55e52db2fe98  yb::tablet::Operation::Replicated()
    @     0x55e52db3214d  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x55e52cb75a2b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0x55e52cbc452f  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0x55e52cbc3899  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x55e52cb9f040  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0x55e52cb69e21  yb::rpc::StrandTaskWithErrorFunc<>::Run()
    @     0x55e52da8e3a9  yb::rpc::Strand::Done()
    @     0x55e52da97fcb  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x55e52e307373  yb::Thread::SuperviseThread()
    @     0x7f09881081ca  start_thread
    @     0x7f09883598d3  __GI___clone

yb-tserver.FATAL.details.2024-11-21T08_26_56.pid960.txt

F20241121 08:26:56 ../../src/yb/gutil/casts.cc:21] Bad narrow cast: -1 < 0
    @     0x560221d87e27  google::LogMessage::SendToLog()
    @     0x560221d88d5d  google::LogMessage::Flush()
    @     0x560221d893a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x5602227792c7  yb::BadNarrowCast()
    @     0x56022332eeb3  std::__1::__function::__func<>::operator()()
    @     0x56022322ce98  yb::tablet::Operation::Replicated()
    @     0x56022322f14d  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x560222272a2b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0x5602222c152f  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0x5602222c0899  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x56022229c040  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0x560222266e21  yb::rpc::StrandTaskWithErrorFunc<>::Run()
    @     0x56022318b3a9  yb::rpc::Strand::Done()
    @     0x560223194fcb  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x560223a04373  yb::Thread::SuperviseThread()
    @     0x7fe7ecc181ca  start_thread
    @     0x7fe7ece698d3  __GI___clone

yb-tserver.FATAL.details.2024-11-21T08_36_11.pid1341.txt

F20241121 08:36:11 ../../src/yb/gutil/casts.cc:21] Bad narrow cast: -32 < 0
    @     0x5636418abe27  google::LogMessage::SendToLog()
    @     0x5636418acd5d  google::LogMessage::Flush()
    @     0x5636418ad3a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x56364229d2c7  yb::BadNarrowCast()
    @     0x563642e52eb3  std::__1::__function::__func<>::operator()()
    @     0x563642d50e98  yb::tablet::Operation::Replicated()
    @     0x563642d5314d  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x563641d96a2b  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0x563641de552f  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0x563641de4899  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0x563641dc0040  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0x563641d8ae21  yb::rpc::StrandTaskWithErrorFunc<>::Run()
    @     0x563642caf3a9  yb::rpc::Strand::Done()
    @     0x563642cb8fcb  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x563643528373  yb::Thread::SuperviseThread()
    @     0x7fb13bd231ca  start_thread
    @     0x7fb13bf748d3  __GI___clone

Thanks for the info @yugi.
The stack looks identical to the bug.
Fix is available in 2.23.1.0.

Thanks, so I will try with latest version.

1 Like

I can confirm that I’ve not been able to reproduce this issue on the latest version.