Hi!
I’m testing YB and have come across an interesting failure. The test consists of writing key/value pairs into a table (the value is a big binary blob). The data is read from a SATA drive by a single process, and it is a very slow source (the disk maxes out at 200MB/s read; network throughput monitoring shows 50MB/s of transfers), while the target, a cluster of 6 nodes (RF=3), has NVMe disks and a 10G network.
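For reference, the loader is essentially doing the following (a simplified sketch, not my exact script; the connection host and source directory are illustrative):

# Simplified sketch of the loader: read blobs from the slow SATA drive
# and insert them into bigtable over a single YSQL connection.
import os
import psycopg2

SRC_DIR = "/mnt/sata/blobs"  # illustrative path to the SATA source

conn = psycopg2.connect(
    host="yb-tservers.yb.svc.cluster.local",  # illustrative; any YSQL endpoint of the cluster
    port=5433,
    dbname="ybload",
    user="yugabyte",
)
conn.autocommit = True

with conn.cursor() as cur:
    for name in os.listdir(SRC_DIR):
        with open(os.path.join(SRC_DIR, name), "rb") as f:
            cur.execute(
                "INSERT INTO bigtable (key, value) VALUES (%s, %s)",
                (name, psycopg2.Binary(f.read())),
            )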
After ~40 minutes of continuous writes, the cluster starts losing leaders, initiates elections, and then never recovers.
W1021 02:15:01.308610 192 consensus_peers.cc:477] T e4726c67e2204455a2f67b053b1ebde2 P 0cae6b527c0a4d45acfe983ef4334acb -> Peer 33670157cd2f4fdeabb92f5524adfefa ([host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100], [host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100]): Couldn't send request. Status: Network error (yb/util/net/socket.cc:551): recvmsg got EOF from remote (system error 108). Retrying in the next heartbeat period. Already tried 1 times. State: 2
I1021 02:15:01.362828 30 tcp_stream.cc:321] { local: 10.96.80.137:48873 remote: 10.96.22.232:9100 }: Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
I1021 02:15:02.163990 30 tcp_stream.cc:321] { local: 10.96.80.137:44025 remote: 10.96.22.232:9100 }: Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
I1021 02:15:02.791898 30 tcp_stream.cc:321] { local: 10.96.80.137:55597 remote: 10.96.22.232:9100 }: Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
I1021 02:15:02.951280 29 transaction_status_resolver.cc:66] T f13720ab0d3d4331b28798a598606caa P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.951340 27 transaction_status_resolver.cc:66] T e4726c67e2204455a2f67b053b1ebde2 P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.951364 28 transaction_status_resolver.cc:66] T ded486df38514249bbd80744895f4e21 P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.951280 26 transaction_status_resolver.cc:66] T a2ce44f310d64fa5ae0e31d701b64a1c P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.966351 27 transaction_status_resolver.cc:66] T afe3e1841a954f9799d64121276a2e51 P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.972729 30 tablet_rpc.cc:143] Unable to pick leader for 1505cf6be2244ea084fd6bc7bed65b02, replicas: [0x0000000001749590 -> { uuid: 0cae6b527c0a4d45acfe983ef4334acb private: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000001495ef0 -> { uuid: 17cda0c0b7c24c9698f602e6272cc750 private: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000001495ef0 -> { uuid: 17cda0c0b7c24c9698f602e6272cc750 private: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15) time: 0.003s }}, {0x0000000001749590 -> { uuid: 0cae6b527c0a4d45acfe983ef4334acb private: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15) time: 0.011s }}] [suppressed 9 similar messages]
I1021 02:15:03.369033 31 tcp_stream.cc:321] { local: 10.96.80.137:36447 remote: 10.96.22.232:9100 }: Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
All the while, monitoring shows ongoing network traffic between the tablet servers (so clearly, they can reach each other).
The first whiff of a warning appears only a few ms before the connection error (not sure if relevant):
W1021 02:14:59.813429 162 long_operation_tracker.cc:113] UpdateReplica running for 1.000s in thread 245:
@ 0x7f10eb93611f (unknown)
@ 0x7f10ec2b100c __pthread_cond_wait
@ 0x7f10f365a7ac rocksdb::port::CondVar::Wait()
@ 0x7f10f36cf22b rocksdb::InstrumentedCondVar::Wait()
@ 0x7f10f3592258 rocksdb::DBImpl::DelayWrite()
@ 0x7f10f35b82ee rocksdb::DBImpl::WriteImpl()
@ 0x7f10f35b90f3 rocksdb::DBImpl::Write()
@ 0x7f10fa946990 yb::tablet::Tablet::WriteToRocksDB()
@ 0x7f10fa94c424 yb::tablet::Tablet::ApplyKeyValueRowOperations()
@ 0x7f10fa94c96e yb::tablet::Tablet::ApplyOperation()
@ 0x7f10fa94cc32 yb::tablet::Tablet::ApplyRowOperations()
@ 0x7f10fa9e5958 yb::tablet::WriteOperation::DoReplicated()
@ 0x7f10fa9d61bb yb::tablet::Operation::Replicated()
@ 0x7f10fa9dbaa0 yb::tablet::OperationDriver::ApplyTask()
@ 0x7f10fa9dc1ee yb::tablet::OperationDriver::ReplicationFinished()
@ 0x7f10fa5eb3c7 yb::consensus::ReplicaState::NotifyReplicationFinishedUnlocked()
For about one minute (all the while my script is sending data) the following dance continues:
I1021 02:15:03.890039 315 leader_election.cc:241] T 3a8bde9e1c544b589e9c4b2bfde51536 P 0cae6b527c0a4d45acfe983ef4334acb [CANDIDATE]: Term 14 pre-election: Election decided. Result: candidate lost.
I1021 02:15:03.890080 200 raft_consensus.cc:3232] T 3a8bde9e1c544b589e9c4b2bfde51536 P 0cae6b527c0a4d45acfe983ef4334acb [term 13 FOLLOWER]: Snoozing failure detection for 3.196s
I1021 02:15:03.890102 200 raft_consensus.cc:3078] T 3a8bde9e1c544b589e9c4b2bfde51536 P 0cae6b527c0a4d45acfe983ef4334acb [term 13 FOLLOWER]: Leader Pre-election lost for term 14. Reason: None given. Originator:
I1021 02:15:03.892127 312 leader_election.cc:366] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [CANDIDATE]: Term 16 election: Vote granted by peer 17cda0c0b7c24c9698f602e6272cc750
I1021 02:15:03.892161 312 leader_election.cc:241] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [CANDIDATE]: Term 16 election: Election decided. Result: candidate won.
I1021 02:15:03.892213 321 raft_consensus.cc:3232] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 FOLLOWER]: Snoozing failure detection for 3.248s
I1021 02:15:03.892251 321 raft_consensus.cc:3139] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 FOLLOWER]: Leader election won for term 16
I1021 02:15:03.892259 321 replica_state.cc:1344] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 FOLLOWER]: SetLeaderNoOpCommittedUnlocked(0), committed: 15.21, received: 15.21
I1021 02:15:03.892344 321 consensus_meta.cc:298] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb: Updating active role from FOLLOWER to LEADER. Consensus state: current_term: 16 leader_uuid: "0cae6b527c0a4d45acfe983ef4334acb" config { opid_index: 14 peers { permanent_uuid: "17cda0c0b7c24c9698f602e6272cc750" member_type: VOTER last_known_private_addr { host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100 } last_known_broadcast_addr { host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "33670157cd2f4fdeabb92f5524adfefa" member_type: VOTER last_known_private_addr { host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100 } last_known_broadcast_addr { host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "0cae6b527c0a4d45acfe983ef4334acb" member_type: VOTER last_known_private_addr { host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100 } last_known_broadcast_addr { host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
I1021 02:15:03.892361 321 raft_consensus.cc:3178] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 LEADER]: Calling mark dirty synchronously for reason code NEW_LEADER_ELECTED
Then, suddenly, the tablet server dies or restarts (it is not killed by k8s; the container stays running), and the client fails with:
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
So if the log is accurate, perhaps no new connections could be opened until something restarted the yb-tserver, while the old connections kept working fine? I’m not sure where to look for more logs on the yb-tserver or how to debug this properly…
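Next time it happens I could test that hypothesis with a quick TCP probe against the tserver RPC port (9100, taken from the log lines above), something like:

# Quick check: can a brand-new TCP connection still be opened to the tserver RPC port?
import socket

def can_connect(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as e:
        print(f"connect to {host}:{port} failed: {e}")
        return False

print(can_connect("yb-tserver-0.yb-tservers.yb.svc.cluster.local", 9100))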
I’ve also observed a different failure mode with this test: the client would be rejected due to backpressure (and this, too, was associated with losing quorum).
I’ve implemented these recommendations: https://support.yugabyte.com/hc/en-us/articles/360058731252-How-to-optimize-and-resolve-common-memory-errors-in-Yugabyte
The backpressure is gone, but things got objectively worse: the yb-tserver now dies.
Also, and this might be related: YB failed to count the rows in time, even though there are fewer than 100K rows in the table.
yugabyte=# \c ybload;
You are now connected to database "ybload" as user "yugabyte".
ybload=# \d bigtable;
Table "public.bigtable"
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+---------
key | character varying(255) | | not null |
value | bytea | | |
Indexes:
"bigtable_pkey" PRIMARY KEY, lsm (key HASH)
ybload=# select count(*) from bigtable;
ERROR: Timed out: Read RPC (request call id 851) to 10.96.70.87:9100 timed out after 60.000s
ybload=# select count(*) from bigtable;
ERROR: Timed out: Read RPC (request call id 859) to 10.96.70.87:9100 timed out after 60.000s
and the tablet server logs (kubectl logs --tail 100 yb-tserver-2 yb-tserver -n yb):
I1021 01:09:09.874341 31 reactor.cc:459] TabletServer_R001: DEBUG: Closing idle connection: Connection (0x0000000036590490) server 10.96.22.232:41124 => 10.96.80.137:9100 - it has been idle for 65.1s
W1021 01:10:53.811908 162 long_operation_tracker.cc:113] Read running for 1.000s in thread 288:
@ 0x7f10eb93611f (unknown)
@ 0x7f10f31fa6f9 snappy::RawUncompress()
@ 0x7f10f31faa70 snappy::RawUncompress()
@ 0x7f10f36766b9 rocksdb::UncompressBlockContents()
@ 0x7f10f367726c rocksdb::ReadBlockContents()
@ 0x7f10f367121d rocksdb::block_based_table::ReadBlockFromFile()
@ 0x7f10f366cfdf rocksdb::BlockBasedTable::NewDataBlockIterator()
@ 0x7f10f36a85d0 rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
@ 0x7f10f36a8767 rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward()
@ 0x7f10f368db3a rocksdb::MergingIterator::Next()
@ 0x7f10f35dc9ba rocksdb::DBIter::Next()
@ 0x7f10f995c832 yb::docdb::SeekPossiblyUsingNext()
@ 0x7f10f995dbd9 yb::docdb::PerformRocksDBSeek()
@ 0x7f10f995e009 yb::docdb::SeekForward()
@ 0x7f10f995e055 yb::docdb::SeekOutOfSubKey()
@ 0x7f10f9994946 yb::docdb::IntentAwareIterator::SeekOutOfSubDoc()
W1021 01:10:53.849444 162 long_operation_tracker.cc:113] Read running for 1.024s in thread 290:
@ 0x7f10eb93611f (unknown)
@ 0x7f10ec2b4592 __GI_pread64
@ 0x7f10f093b741 yb::PosixRandomAccessFile::Read()
@ 0x7f10f36cabb6 rocksdb::RandomAccessFileReader::ReadAndValidate()
@ 0x7f10f3676d5d rocksdb::ReadBlockContents()
@ 0x7f10f367121d rocksdb::block_based_table::ReadBlockFromFile()
@ 0x7f10f366cfdf rocksdb::BlockBasedTable::NewDataBlockIterator()
@ 0x7f10f36a85d0 rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
@ 0x7f10f36a8767 rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward()
@ 0x7f10f36a88cc rocksdb::(anonymous namespace)::TwoLevelIterator::Seek()
@ 0x7f10f368d3c5 rocksdb::MergingIterator::Seek()
@ 0x7f10f35dc175 rocksdb::DBIter::Seek()
@ 0x7f10f9903edc yb::docdb::BoundedRocksDbIterator::Seek()
@ 0x7f10f995c85e yb::docdb::SeekPossiblyUsingNext()
@ 0x7f10f995dbd9 yb::docdb::PerformRocksDBSeek()
@ 0x7f10f995e009 yb::docdb::SeekForward()
W1021 01:11:06.949064 288 long_operation_tracker.cc:142] Read took a long time: 14.164s
I1021 01:11:06.949167 288 yb_rpc.cc:440] Call yb.tserver.TabletServerService.Read 10.96.22.232:41820 => 10.96.80.137:9100 (request call id 858) took 14163ms. Trace:
I1021 01:11:06.949174 288 yb_rpc.cc:441]
W1021 01:11:43.283773 290 long_operation_tracker.cc:142] Read took a long time: 50.496s
W1021 01:11:43.283990 290 yb_rpc.cc:426] Call yb.tserver.TabletServerService.Read 10.96.22.232:41820 => 10.96.80.137:9100 (request call id 857) took 50497ms (client timeout 60000ms).
The volume YB writes to is a persistent volume backed by an NVMe drive, with read/write speeds (for large chunks) in excess of 1.5GB/s, so disk I/O should not be the limiting factor. Also, YB is the only significant container running on those nodes.
There are no dropped/retransmitted packets (I’ll attach a screenshot if I’m allowed to post pics).
This is a 6-node cluster (RF=3) deployed with Helm: single rack, 10G network, 8GB RAM per machine, NVMe disks.
I’m very new to YB, so it is hard for me to judge what is ‘normal’, but the only other odd thing I’m noticing (besides the error :)) is that YB seems rather slow. Disk throughput below 450MB/s per machine (assuming YB is reading big blocks) seems on the low side… but as I said, the problem could well be on my side/setup.