Lost leader, then the yb-tserver - while testing long-running (low throughput) writes

Hi!
I’m testing YB and have come across some interesting failure. Test consists of writing key/value into a table (value is a big binary blob). Data are read from SATA drive, single process, it is very slow disk (disk 200MB/s max read, network throughput monitoring shows 50MB/s transfers) while the target, cluster of 6 nodes (RF=3) has NVME disks and 10G network

After ~40mins of continuous writes, the cluster starts loosing leader, will initiate election and then never recovers.

W1021 02:15:01.308610   192 consensus_peers.cc:477] T e4726c67e2204455a2f67b053b1ebde2 P 0cae6b527c0a4d45acfe983ef4334acb -> Peer 33670157cd2f4fdeabb92f5524adfefa ([host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100], [host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100]): Couldn't send request.  Status: Network error (yb/util/net/socket.cc:551): recvmsg got EOF from remote (system error 108). Retrying in the next heartbeat period. Already tried 1 times. State: 2
I1021 02:15:01.362828    30 tcp_stream.cc:321] { local: 10.96.80.137:48873 remote: 10.96.22.232:9100 }:  Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
I1021 02:15:02.163990    30 tcp_stream.cc:321] { local: 10.96.80.137:44025 remote: 10.96.22.232:9100 }:  Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
I1021 02:15:02.791898    30 tcp_stream.cc:321] { local: 10.96.80.137:55597 remote: 10.96.22.232:9100 }:  Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)
I1021 02:15:02.951280    29 transaction_status_resolver.cc:66] T f13720ab0d3d4331b28798a598606caa P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.951340    27 transaction_status_resolver.cc:66] T e4726c67e2204455a2f67b053b1ebde2 P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.951364    28 transaction_status_resolver.cc:66] T ded486df38514249bbd80744895f4e21 P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.951280    26 transaction_status_resolver.cc:66] T a2ce44f310d64fa5ae0e31d701b64a1c P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.966351    27 transaction_status_resolver.cc:66] T afe3e1841a954f9799d64121276a2e51 P 0cae6b527c0a4d45acfe983ef4334acb: Start, queues: 1
I1021 02:15:02.972729    30 tablet_rpc.cc:143] Unable to pick leader for 1505cf6be2244ea084fd6bc7bed65b02, replicas: [0x0000000001749590 -> { uuid: 0cae6b527c0a4d45acfe983ef4334acb private: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000001495ef0 -> { uuid: 17cda0c0b7c24c9698f602e6272cc750 private: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000001495ef0 -> { uuid: 17cda0c0b7c24c9698f602e6272cc750 private: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15) time: 0.003s }}, {0x0000000001749590 -> { uuid: 0cae6b527c0a4d45acfe983ef4334acb private: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15) time: 0.011s }}] [suppressed 9 similar messages]
I1021 02:15:03.369033    31 tcp_stream.cc:321] { local: 10.96.80.137:36447 remote: 10.96.22.232:9100 }:  Recv failed: Network error (yb/util/net/socket.cc:557): recvmsg error: Connection refused (system error 111)

all the while, monitoring shows ongoing network traffic between table servers (so clearly, they have connection)

The first whiff of a warning happens only few ms before the connection error (not sure if relevant):

W1021 02:14:59.813429   162 long_operation_tracker.cc:113] UpdateReplica running for 1.000s in thread 245:
    @     0x7f10eb93611f  (unknown)
    @     0x7f10ec2b100c  __pthread_cond_wait
    @     0x7f10f365a7ac  rocksdb::port::CondVar::Wait()
    @     0x7f10f36cf22b  rocksdb::InstrumentedCondVar::Wait()
    @     0x7f10f3592258  rocksdb::DBImpl::DelayWrite()
    @     0x7f10f35b82ee  rocksdb::DBImpl::WriteImpl()
    @     0x7f10f35b90f3  rocksdb::DBImpl::Write()
    @     0x7f10fa946990  yb::tablet::Tablet::WriteToRocksDB()
    @     0x7f10fa94c424  yb::tablet::Tablet::ApplyKeyValueRowOperations()
    @     0x7f10fa94c96e  yb::tablet::Tablet::ApplyOperation()
    @     0x7f10fa94cc32  yb::tablet::Tablet::ApplyRowOperations()
    @     0x7f10fa9e5958  yb::tablet::WriteOperation::DoReplicated()
    @     0x7f10fa9d61bb  yb::tablet::Operation::Replicated()
    @     0x7f10fa9dbaa0  yb::tablet::OperationDriver::ApplyTask()
    @     0x7f10fa9dc1ee  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x7f10fa5eb3c7  yb::consensus::ReplicaState::NotifyReplicationFinishedUnlocked()

For about one minute (all the while my script is sending data) the following dance continues:

I1021 02:15:03.890039   315 leader_election.cc:241] T 3a8bde9e1c544b589e9c4b2bfde51536 P 0cae6b527c0a4d45acfe983ef4334acb [CANDIDATE]: Term 14 pre-election: Election decided. Result: candidate lost.
I1021 02:15:03.890080   200 raft_consensus.cc:3232] T 3a8bde9e1c544b589e9c4b2bfde51536 P 0cae6b527c0a4d45acfe983ef4334acb [term 13 FOLLOWER]: Snoozing failure detection for 3.196s
I1021 02:15:03.890102   200 raft_consensus.cc:3078] T 3a8bde9e1c544b589e9c4b2bfde51536 P 0cae6b527c0a4d45acfe983ef4334acb [term 13 FOLLOWER]: Leader Pre-election lost for term 14. Reason: None given. Originator: 
I1021 02:15:03.892127   312 leader_election.cc:366] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [CANDIDATE]: Term 16 election: Vote granted by peer 17cda0c0b7c24c9698f602e6272cc750
I1021 02:15:03.892161   312 leader_election.cc:241] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [CANDIDATE]: Term 16 election: Election decided. Result: candidate won.
I1021 02:15:03.892213   321 raft_consensus.cc:3232] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 FOLLOWER]: Snoozing failure detection for 3.248s
I1021 02:15:03.892251   321 raft_consensus.cc:3139] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 FOLLOWER]: Leader election won for term 16
I1021 02:15:03.892259   321 replica_state.cc:1344] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 FOLLOWER]: SetLeaderNoOpCommittedUnlocked(0), committed: 15.21, received: 15.21
I1021 02:15:03.892344   321 consensus_meta.cc:298] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb: Updating active role from FOLLOWER to LEADER. Consensus state: current_term: 16 leader_uuid: "0cae6b527c0a4d45acfe983ef4334acb" config { opid_index: 14 peers { permanent_uuid: "17cda0c0b7c24c9698f602e6272cc750" member_type: VOTER last_known_private_addr { host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100 } last_known_broadcast_addr { host: "yb-tserver-1.yb-tservers.yb.svc.cluster.local" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "33670157cd2f4fdeabb92f5524adfefa" member_type: VOTER last_known_private_addr { host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100 } last_known_broadcast_addr { host: "yb-tserver-0.yb-tservers.yb.svc.cluster.local" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "0cae6b527c0a4d45acfe983ef4334acb" member_type: VOTER last_known_private_addr { host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100 } last_known_broadcast_addr { host: "yb-tserver-2.yb-tservers.yb.svc.cluster.local" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
I1021 02:15:03.892361   321 raft_consensus.cc:3178] T 3b46800050df491ea22dcab868596b01 P 0cae6b527c0a4d45acfe983ef4334acb [term 16 LEADER]: Calling mark dirty synchronously for reason code NEW_LEADER_ELECTED

And then, suddenly, the tablet server dies or restarts (it is not killed by k8s – the container stays running). And the client fails with:

psycopg2.OperationalError: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

So if the log is true, perhaps no new connections could be opened until something restarted the yb-tserver? But old connections kept working fine… I’m not sure where to look for more logs on the yb-tserver, how to debug it properly…

I’ve observed a different failure mode associated with this test. A client would be refused because of backpressure (and this was also associated with losing quorum)

I’ve implemented these recommendations: https://support.yugabyte.com/hc/en-us/articles/360058731252-How-to-optimize-and-resolve-common-memory-errors-in-Yugabyte

The backpressure gone, but it got objectively worse – the yb-tserver is now dying.

Also, and this might be related; YB failed to count values on time – there is less than 100K rows in the table.

yugabyte=# \c ybload;
You are now connected to database "ybload" as user "yugabyte".
ybload=# \d bigtable;
                     Table "public.bigtable"
 Column |          Type          | Collation | Nullable | Default 
--------+------------------------+-----------+----------+---------
 key    | character varying(255) |           | not null | 
 value  | bytea                  |           |          | 
Indexes:
    "bigtable_pkey" PRIMARY KEY, lsm (key HASH)

ybload=# select count(*) from bigtable;
ERROR:  Timed out: Read RPC (request call id 851) to 10.96.70.87:9100 timed out after 60.000s
ybload=# select count(*) from bigtable;
ERROR:  Timed out: Read RPC (request call id 859) to 10.96.70.87:9100 timed out after 60.000s

and logs for t-server k logs --tail 100 yb-tserver-2 yb-tserver -n yb:

I1021 01:09:09.874341    31 reactor.cc:459] TabletServer_R001: DEBUG: Closing idle connection: Connection (0x0000000036590490) server 10.96.22.232:41124 => 10.96.80.137:9100 - it has been idle for 65.1s
W1021 01:10:53.811908   162 long_operation_tracker.cc:113] Read running for 1.000s in thread 288:
    @     0x7f10eb93611f  (unknown)
    @     0x7f10f31fa6f9  snappy::RawUncompress()
    @     0x7f10f31faa70  snappy::RawUncompress()
    @     0x7f10f36766b9  rocksdb::UncompressBlockContents()
    @     0x7f10f367726c  rocksdb::ReadBlockContents()
    @     0x7f10f367121d  rocksdb::block_based_table::ReadBlockFromFile()
    @     0x7f10f366cfdf  rocksdb::BlockBasedTable::NewDataBlockIterator()
    @     0x7f10f36a85d0  rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
    @     0x7f10f36a8767  rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward()
    @     0x7f10f368db3a  rocksdb::MergingIterator::Next()
    @     0x7f10f35dc9ba  rocksdb::DBIter::Next()
    @     0x7f10f995c832  yb::docdb::SeekPossiblyUsingNext()
    @     0x7f10f995dbd9  yb::docdb::PerformRocksDBSeek()
    @     0x7f10f995e009  yb::docdb::SeekForward()
    @     0x7f10f995e055  yb::docdb::SeekOutOfSubKey()
    @     0x7f10f9994946  yb::docdb::IntentAwareIterator::SeekOutOfSubDoc()
W1021 01:10:53.849444   162 long_operation_tracker.cc:113] Read running for 1.024s in thread 290:
    @     0x7f10eb93611f  (unknown)
    @     0x7f10ec2b4592  __GI_pread64
    @     0x7f10f093b741  yb::PosixRandomAccessFile::Read()
    @     0x7f10f36cabb6  rocksdb::RandomAccessFileReader::ReadAndValidate()
    @     0x7f10f3676d5d  rocksdb::ReadBlockContents()
    @     0x7f10f367121d  rocksdb::block_based_table::ReadBlockFromFile()
    @     0x7f10f366cfdf  rocksdb::BlockBasedTable::NewDataBlockIterator()
    @     0x7f10f36a85d0  rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
    @     0x7f10f36a8767  rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward()
    @     0x7f10f36a88cc  rocksdb::(anonymous namespace)::TwoLevelIterator::Seek()
    @     0x7f10f368d3c5  rocksdb::MergingIterator::Seek()
    @     0x7f10f35dc175  rocksdb::DBIter::Seek()
    @     0x7f10f9903edc  yb::docdb::BoundedRocksDbIterator::Seek()
    @     0x7f10f995c85e  yb::docdb::SeekPossiblyUsingNext()
    @     0x7f10f995dbd9  yb::docdb::PerformRocksDBSeek()
    @     0x7f10f995e009  yb::docdb::SeekForward()
W1021 01:11:06.949064   288 long_operation_tracker.cc:142] Read took a long time: 14.164s
I1021 01:11:06.949167   288 yb_rpc.cc:440] Call yb.tserver.TabletServerService.Read 10.96.22.232:41820 => 10.96.80.137:9100 (request call id 858) took 14163ms. Trace:
I1021 01:11:06.949174   288 yb_rpc.cc:441] 
W1021 01:11:43.283773   290 long_operation_tracker.cc:142] Read took a long time: 50.496s
W1021 01:11:43.283990   290 yb_rpc.cc:426] Call yb.tserver.TabletServerService.Read 10.96.22.232:41820 => 10.96.80.137:9100 (request call id 857) took 50497ms (client timeout 60000ms).

the data where YB is writing, is a persistent volume backed by NVME drive - read/write speeds (for large chunks) in excess of 1.5GB/s – so the disk i/o is not the factor. Also, YB is the only significant container running on those nodes

There are no dropped/retransmitted packets (I’ll attach a screenshot if I’m allowed to post pics).

This is 6node cluster (RF=3), deployed with helm, single rack, 10G network, 8GB RAM per machine, NVME disks.

I’m very new to YB, the only weird thing I’m noticing (besides the error :)) is that YB is rather slow, but I am very new and it is hard for me to understand what is ‘normal’. The disk throughput being <450MB/s per machine (assuming YB is reading big blocks) is on the lower side… but as said: the problems could be on my side/setup

this image shows some metrics: the humps (in order) correspond to:

  1. timeout doing SELECT COUNT(*) from bigtable
  2. delete from bigtable
  3. running the test (described above)

Hi @roman

Can you state the spec of each node/container in memory & number of (v)cpus ?
Can you also explain the workload/benchmark that you’re running in the cluster ? It looks something like:

  1. Import big file with copy
  2. Select count(*);
  3. delete from table;
  4. repeat ?

Hi @dorian_yugabyte,

4x: AMD Ryzen 3200G, 8GB RAM (2GB reserved for graphics), 4 physical cores (4 threads)
1x: AMD Ryzen 5 3600X, 64GB RAM, 6 cores, 12 threads

It’s deployed with helm, so the default resource requirements (2Gbi per master, 4Gbi ram per tserver); there is no other load than YB (+ monitoring, networking, and logging k8* pods) - the bigger node will have both tmaster and tserver containers running

The test is basically a loop, it reads from a file which contains:

key1\tfilepath1
key2\tfilepath

For each line we open the filepath, read its contents into memory and insert it (and files are scientific papers, so nothing humongous - 100kB on average). Once the execution exits the inner with block, the files are closed/memory freed – I’m saying it to be explicit, this client code is executed on one of the nodes which runs the tserver container

For illustration (simplified):

        sql = "INSERT INTO {} VALUES(%s, %s) ON CONFLICT DO NOTHING".format(BigTable.__tablename__)
        i = 0
        with open(location, 'r') as fi:
            for line in fi:
                key,filepath = line.split('\t')
                with open(filepath, 'rb') as input:
                    # Perform the insertions
                    cursor.execute(sql, (key, psycopg2.Binary(input.read())))
                    i += 1
                    if i % 100 == 0:
                        connection.commit()
                        # plus close cursor, get new connection (not shown here)

The count(*) and delete events that I described, those happened manually - the read timeout seemed interesting to be mentioned, though it’s not related to the test

I’ve since tried to execute the same test against single postgres - in a very non-scientific fashion, but perhaps it might be useful to mention too: PG will finish the same import in ~45mins, where YB ingests ~40K files in 40mins (before the failure). YB is distributed, but it seems to be only taking twice as long – from that I’m concluding that the test is not overloading any of the instances. But the long-running nature might be perhaps tripping over some wires…

I have tried to execute the same test on YB multiple times, the longest I observed it running before the failure was ~2h – the shortest was 15mins.

I’ll be happy to rerun/retry various options it it helps YB, if I could get some guidance. Thank you.

Have you checked the CPU utilization of the cluster while running the benchmark to verify this?
In both scenarios, I believe you can add more concurrency/threads when inserting. (unless you were CPU limited)

Some other notes:

  1. Did you use synchronized replication in PostgreSQL?
  2. We haven’t optimized the db for small nodes Deployment checklist for YugabyteDB clusters | YugabyteDB Docs
  3. You should spread connections on all nodes of the cluster.

Can you explain your other use cases? Like tables,indexes,types of queries,etc and I can help to better model the db.

Yes, I verified the CPU - during the test it barely went above 15%. Just to be clear: speed is not my primary concern, it is the backpressure (refusing to accept more data) which, after changing the startup parameters, transformed itself into table server restart (or something that looks like it)

Increasing concurrency would sound like hitting the tserver harder. But reading the link, I think I’ll try to start the server with list of IPs rather DNS names (for other peers) and double check clock; other than that I’m out of ideas

@dorian_yugabyte Could you elaborate on spreading the connections please? My client (python script) establishes single connection to a tb-server; which can be assigned randomly (when load balanced) but stays the same during its lifetime. I could establish multiple connections too, but I’d really like to understand why that should be necessary. But perhaps I’m missing something important (also, YB when installed with Helm exposes only one service with port 5432 – so I picked one instance from the pods backing that service; in practice that should be the same as connecting through that service - even if load balanced)

thanks!

Since you have big rows, I’d suggest also having smaller transactions, say 20 rows.

Ok, then it’s fine.

So the work is also distributed on different servers(the yb-tserver you connect to will coordinate the transaction). Assuming you’re running multiple scripts or requests and each one gets a random server, it should be fine.

Note that delete is a heavy operation(a “delete” mark is inserted, the rows are still there and will be overhead until compaction time). It’s better to use truncate when you want to empty the table.

A faster way would be to send 1 query where you write multiple rows for the whole batch instead of a single query for each row.

Can you explain your use case a little more?

I have discovered something interesting while retrying commit frequency (10/50/100) – they had no effect, but the failure always occurred around the same watermark (45600-45700 files ingested) which had a large PDF documents (>8MB)

So I told the client to ignore all files bigger than 7MBs; and managed to complete the test run (ingesting ~98K files). YB finished in 68mins, which is good compared to the same test run but against share nothing PG without any replication; PG did the same in ~60mins (I have to get better statistics reporting, sorry about that)

Note however that PG was able to insert those all 100K blobs without any timeouts.

This is an encouraging development: is there some buffer limit I’m hitting? Is there a size limit per object stored or some recommendations?

I’ve collected the large files and am testing the import of just those; I’m now getting 60s RPC call timeouts (and no errors/warnings in tserver log) – will try to play more and report back once I see something interesting.

Apropos the usecases: one situation we need to deal with is storage of scientific papers and associated data - we have ~40TB of those and projecting need for (100-150TB). Most papers are <1MB but clearly there are outliers. So the idea is to have this data available together with metadata we are curating. However, I’d also like to evaluate YB as a storage for metrics/logstream – so basically OLTP + OLAP. That would be really lovely, but I know these have different usage patterns.

Btw: the hardware I’m using is intentionally undersized, I find that I can get more signal that way. And the import of the 100K files is just an initial phase.

What error and log did you get in this failure?

Storing the actual paper.pdf in a separate system that is optimized for files is better. It will be more efficient. Something like S3,seaweedfs,normal filesystem,etc. This is true with any database for this specific usecase.

I believe you lose more than you gain this way. It’s not like you can “process” the data, since it’s binary blob(ie pdf).

Just explain the usage patterns.

I forgot to update the post, sorry…

The problem is specific to my k8s cluster: I’m running with swap enabled; with 4gb of ram given to tablet server, the container would go over the cgroup limit and be killed. That is with those recommended config params: https://support.yugabyte.com/hc/en-us/articles/360058731252-How-to-optimize-and-resolve-common-memory-errors-in-Yugabyte

I tested various commit sizes; with 4gb ram – I could import all files (80GB) if I issued commits before sending 15MB of binary data. Anything bigger would eventually cause the cgroup kill.

With 8gb of ram given to the tablet - could send 200MB of binary data before every commit is issued, without problems. I haven’t noted the exact time, but single-thread performance is very near to how long it takes PG to ingest the same – mind you, this is just one process and reading sequentially from a spinning drive, so nothing nowhere close to ‘high effort’, but it is a useful factoid nevertheless

You should use our recommended hardware for production: Deployment checklist for YugabyteDB clusters | YugabyteDB Docs

Note that for this use case, storing the raw files in another system meant for that will be a lot more efficient like I mentioned above.

This is probably how fast your single client is. PostgreSQL should be faster since it’s not doing any replication.