Weird Cql Request timeout problem

Setup: Three nodes, replication factor 3, yugabyte 2.5.1.0, datastax php wrapper, with datastax C driver 2.1.5

Requests fail in weird pattern: Request imeout always happens on 56, 38, or 29th second. It looks like this depends on the server where request failed. If we look at the set with 56th second , then requests alway fail on multiplier of 5 minutes. This may happen with insert or select query. Does anyone know what happens every 5 minutes on each node? Logs show normal activity. Any idea how to debug this?

[2021-01-03 14:34:56] app.ERROR: |Request timed out [] []
[2021-01-03 14:39:56] app.ERROR: |Request timed out [] []
[2021-01-03 14:44:56] app.ERROR: |Request timed out [] []
[2021-01-03 16:09:56] app.ERROR: |Request timed out [] []
[2021-01-03 16:12:29] app.ERROR: |Request timed out [] []
[2021-01-03 16:19:56] app.ERROR: |Request timed out [] []
[2021-01-03 16:29:56] app.ERROR: |Request timed out [] []
[2021-01-03 16:34:56] app.ERROR: |Request timed out [] []
[2021-01-03 16:44:56] app.ERROR: |Request timed out [] []
[2021-01-03 17:44:38] app.ERROR: |Request timed out [] []
[2021-01-03 17:49:38] app.ERROR: |Request timed out [] []
[2021-01-03 17:54:38] app.ERROR: |Request timed out [] []
[2021-01-03 17:59:38] app.ERROR: |Request timed out [] []
[2021-01-03 18:04:38] app.ERROR: |Request timed out [] []
[2021-01-03 18:09:38] app.ERROR: |Request timed out [] []
[2021-01-03 18:29:38] app.ERROR: |Request timed out [] []
[2021-01-03 18:49:56] app.ERROR: |Request timed out [] []
[2021-01-03 18:59:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:24:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:29:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:39:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:44:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:49:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:54:56] app.ERROR: |Request timed out [] []
[2021-01-03 19:59:56] app.ERROR: |Request timed out [] []
[2021-01-03 20:04:56] app.ERROR: |Request timed out [] []
[2021-01-03 20:39:38] app.ERROR: |Request timed out [] []
[2021-01-03 20:39:56] app.ERROR: |Request timed out [] []
[2021-01-03 20:49:38] app.ERROR: |Request timed out [] []
[2021-01-03 20:49:56] app.ERROR: |Request timed out [] []
[2021-01-03 20:54:56] app.ERROR: |Request timed out [] []
[2021-01-03 20:59:38] app.ERROR: |Request timed out [] []
[2021-01-03 21:04:38] app.ERROR: |Request timed out [] []
[2021-01-03 21:04:56] app.ERROR: |Request timed out [] []
[2021-01-03 21:14:38] app.ERROR: |Request timed out [] []
[2021-01-03 21:19:56] app.ERROR: |Request timed out [] []
[2021-01-03 21:24:56] app.ERROR: |Request timed out [] []
[2021-01-03 21:29:56] app.ERROR: |Request timed out [] []
[2021-01-03 21:44:38] app.ERROR: |Request timed out [] []
[2021-01-03 21:54:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:04:56] app.ERROR: |Request timed out [] []
[2021-01-03 22:14:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:19:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:19:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:24:56] app.ERROR: |Request timed out [] []
[2021-01-03 22:29:56] app.ERROR: |Request timed out [] []
[2021-01-03 22:34:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:34:56] app.ERROR: |Request timed out [] []
[2021-01-03 22:39:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:39:56] app.ERROR: |Request timed out [] []
[2021-01-03 22:44:56] app.ERROR: |Request timed out [] []
[2021-01-03 22:49:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:54:38] app.ERROR: |Request timed out [] []
[2021-01-03 22:54:56] app.ERROR: |Request timed out [] []
[2021-01-03 23:04:38] app.ERROR: |Request timed out [] []
[2021-01-03 23:14:38] app.ERROR: |Request timed out [] []
[2021-01-03 23:14:56] app.ERROR: |Request timed out [] []
[2021-01-03 23:37:29] app.ERROR: |Request timed out [] []
[2021-01-03 23:39:38] app.ERROR: |Request timed out [] []
[2021-01-03 23:49:38] app.ERROR: |Request timed out [] []
[2021-01-03 23:59:38] app.ERROR: |Request timed out [] []
[2021-01-04 00:17:29] app.ERROR: |Request timed out [] []
[2021-01-04 00:32:29] app.ERROR: |Request timed out [] []
[2021-01-04 00:42:29] app.ERROR: |Request timed out [] []
[2021-01-04 00:52:29] app.ERROR: |Request timed out [] []
[2021-01-04 00:57:29] app.ERROR: |Request timed out [] []
[2021-01-04 01:27:29] app.ERROR: |Request timed out [] []
[2021-01-04 01:37:29] app.ERROR: |Request timed out [] []
[2021-01-04 01:52:29] app.ERROR: |Request timed out [] []
[2021-01-04 12:54:56] app.ERROR: |Request timed out [] []
[2021-01-04 13:04:56] app.ERROR: |Request timed out [] []
[2021-01-04 13:09:56] app.ERROR: |Request timed out [] []
[2021-01-04 13:14:56] app.ERROR: |Request timed out [] []
[2021-01-04 13:19:56] app.ERROR: |Request timed out [] []
[2021-01-04 16:59:56] app.ERROR: |Request timed out [] []
[2021-01-04 17:14:56] app.ERROR: |Request timed out [] []
[2021-01-04 17:29:56] app.ERROR: |Request timed out [] []
[2021-01-04 17:34:56] app.ERROR: |Request timed out [] []
[2021-01-04 17:49:56] app.ERROR: |Request timed out [] []
[2021-01-04 17:54:56] app.ERROR: |Request timed out [] []
[2021-01-04 17:59:56] app.ERROR: |Request timed out [] []
[2021-01-04 18:04:56] app.ERROR: |Request timed out [] []
[2021-01-04 18:54:56] app.ERROR: |Request timed out [] []
[2021-01-04 19:09:56] app.ERROR: |Request timed out [] []
[2021-01-04 19:14:56] app.ERROR: |Request timed out [] []
[2021-01-04 19:19:56] app.ERROR: |Request timed out [] []
[2021-01-04 19:34:56] app.ERROR: |Request timed out [] []
[2021-01-04 20:39:56] app.ERROR: |Request timed out [] []
[2021-01-04 20:49:56] app.ERROR: |Request timed out [] []
[2021-01-05 11:44:38] app.ERROR: |Request timed out [] []

Hi @alfabetagama

Welcome to YugabyteDB Forum!

Can you upload some logs from the yb-tservers just to check ?

Can you show what query is happening in this requests that time out ? (increasing debug level in php?)

Or can you make a simple script that replicates this ?

Regards,
Dorian
Technical Support Engineer

Hi Dorian, which logs do you want? We are testing migration of records from postgres to yugabyte so logs are pretty big. I can post specific event and all tserver logs around it if it is ok?

Yes, just upload compressed.

Also paste the queries (with table schema) that are timing out ?

Hi i have data and sample test, but zip upload is not allowed.

Yes my bad. Can you upload on external file hosting and paste the link ?

Included: test and three log samples when test was running (INFO log).

Timeouts happened on:
2021-01-06 15:10:58|0.0043
2021-01-06 15:20:58|0.0035
2021-01-06 15:15:33|0.0016

Last number is measured operation time in seconds, which is much smaller than request timeout.

And this is driver debug log for one event. At the end sockets disonnects and request times out.

    07-01-2021 22:10:28 UTC [INFO] New node 172.31.6.28 added event (control_connection.cpp:700)
    07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.peers WHERE peer = '172.31.6.28' (control_connection.cpp:379)
    07-01-2021 22:10:28 UTC [INFO] New node 172.31.3.118 added event (control_connection.cpp:700)
    07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.peers WHERE peer = '172.31.3.118' (control_connection.cpp:379)
    07-01-2021 22:10:28 UTC [INFO] New node 172.31.15.240 added event (control_connection.cpp:700)
    07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.local WHERE key='local' (control_connection.cpp:379)
    07-01-2021 22:10:28 UTC [INFO] Node 172.31.15.240 moved event (control_connection.cpp:713)
    07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.local WHERE key='local' (control_connection.cpp:379)
    07-01-2021 22:10:28 UTC [WARN] Attempting to add host 172.31.6.28 that we already have (cluster.cpp:578)
    07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.6.28 as DOWN, but it doesn't exist (round_robin_policy.cpp:75)
    07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.6.28 as DOWN, but it doesn't exist (round_robin_policy.cpp:75)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d00031d0) to host 172.31.6.28 closed (socket.cpp:377)
    07-01-2021 22:10:28 UTC [DEBUG] Connected to host 172.31.6.28 on socket(0x7f03d8002c50) (socket_connector.cpp:253)
    07-01-2021 22:10:28 UTC [WARN] Invalid address format for peer address (address_factory.cpp:32)
    07-01-2021 22:10:28 UTC [WARN] Invalid address format for peer address (address_factory.cpp:32)
    07-01-2021 22:10:28 UTC [DEBUG] Successfully prepared all on host 172.31.3.118 (prepare_all_handler.cpp:58)
    07-01-2021 22:10:28 UTC [WARN] Attempting to add host 172.31.3.118 that we already have (cluster.cpp:578)
    07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.3.118 as DOWN, but it doesn't exist (round_robin_policy.cpp:75)
    07-01-2021 22:10:28 UTC [DEBUG] Connected to host 172.31.3.118 on socket(0x7f03d8002c50) (socket_connector.cpp:253)
    07-01-2021 22:10:28 UTC [DEBUG] Successfully prepared query "INSERT INTO test.log (did,month,message_time,fid,aid,oid,jid,owid,lid,rid,sid,cr,la,ver,pa,message,type) VALUES (?, ?, ?, ?, ?,?,?,?,?,?,?,?,?,?,?,?,?)" on host 172.31.6.28 while preparing all queries (prepare_host_handler.cpp:163)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d8005980) to host 172.31.6.28 closed (socket.cpp:377)
    07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.3.118 as DOWN, but it doesn't exist (round_robin_policy.cpp:75)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d0002f40) to host 172.31.3.118 closed (socket.cpp:377)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d0002f40) to host 172.31.6.28 closed (socket.cpp:377)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d0003530) to host 172.31.15.240 closed (socket.cpp:377)
    07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.15.240 as DOWN, but it doesn't exist (round_robin_policy.cpp:75)
    07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.15.240 as DOWN, but it doesn't exist (round_robin_policy.cpp:75)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d8005460) to host 172.31.15.240 closed (socket.cpp:377)
    07-01-2021 22:10:28 UTC [DEBUG] Successfully prepared query "INSERT INTO test.log (did,month,message_time,fid,aid,oid,jid,owid,lid,rid,sid,cr,la,ver,pa,message,type) VALUES (?, ?, ?, ?, ?,?,?,?,?,?,?,?,?,?,?,?,?)" on host 172.31.3.118 while preparing all queries (prepare_host_handler.cpp:163)
    07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d80056f0) to host 172.31.3.118 closed (socket.cpp:377)

Looking a little at your php code, you’re always building the same <monthid, did>, which all values will end up in 1 partition(tablet).

But all your queries in the script are efficient by primary-key and not any full table scans.

And to confirm, the queries that are timing out is the INSERT or UPDATE that is inside the php file, correct ?

Is there a possibility that you may be inserting or updating the same row hundreds of times ?

Hi, this test is for demonstration purposes. Access pattern will be:

  • realtime insert <month, did, message_time> at most once per second per primary key
  • range selects by primary key
  • occasional/rare update of rows by primary key <month, did> for historic data

Timeouts also happen on SELECTS. But as far as i can see the problem is really not request timeout. The problem is connection drop that results in request timeout.
What i am worried about is the flood of disconnects/reconnects which happpen every couple of minutes

Do you have any load balancer or firewall between the servers ?
Can you replicate these connection disconnects using a local setup of 1 node yugabyted reference | YugabyteDB Docs ?

This only happens on cluster, we were unable to replicate on local setup. All nodes are setup as Amazon EC2 instances, and they communicate only via private network. There is no connectivity allowed to outside world from the cluster per our policy. Test server communicates with cluster via same private network. Acces to instances is only allowed from test server.

I am really not familiar with amazon network infrastructure so i can not say anything about balancers and firewalls outside.

I am not an expert, but if i would have to speculate i would say the problem is on realtion Datastax C driver->Yugabyte.

Problems start at this point:
07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.peers WHERE peer = ‘172.31.3.118’ (control_connection.cpp:379)
07-01-2021 22:10:28 UTC [INFO] New node 172.31.15.240 added event (control_connection.cpp:700)
07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.local WHERE key=‘local’ (control_connection.cpp:379)
07-01-2021 22:10:28 UTC [INFO] Node 172.31.15.240 moved event (control_connection.cpp:713)
07-01-2021 22:10:28 UTC [DEBUG] Refresh node: SELECT * FROM system.local WHERE key=‘local’ (control_connection.cpp:379)
07-01-2021 22:10:28 UTC [WARN] Attempting to add host 172.31.6.28 that we already have (cluster.cpp:578)
07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.6.28 as DOWN, but it doesn’t exist (round_robin_policy.cpp:75)
07-01-2021 22:10:28 UTC [DEBUG] Attempted to remove or mark host 172.31.6.28 as DOWN, but it doesn’t exist (round_robin_policy.cpp:75)
07-01-2021 22:10:28 UTC [DEBUG] Socket(0x7f03d00031d0) to host 172.31.6.28 closed (socket.cpp:377)
07-01-2021 22:10:28 UTC [DEBUG] Connected to host 172.31.6.28 on socket(0x7f03d8002c50) (socket_connector.cpp:253)
07-01-2021 22:10:28 UTC [WARN] Invalid address format for peer address (address_factory.cpp:32)
07-01-2021 22:10:28 UTC [WARN] Invalid address format for peer address (address_factory.cpp:32)

@alfabetagama can you tell me which driver (with link) and which php version ?

PHP version 7.4 on Ubuntu 20.04

Php driver : GitHub - datastax/php-driver: [MAINTENANCE ONLY] DataStax PHP Driver for Apache Cassandra v 1.3.2

Compiled with Cpp driver 2.15.3: GitHub - datastax/cpp-driver at 2.15.3

I a also tried to compile against yugabyte cpp driver but your cpp driver does not compile.

Can you paste the error ?

What was the error here ? Our c++ driver compiles.

mkdir build
pushd build
cmake .. -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_INSTALL_PREFIX:PATH=/usr -DCASS_BUILD_STATIC=OFF   -DCASS_BUILD_SHARED=ON -DCMAKE_BUILD_TYPE=RELEASE -DCASS_USE_ZLIB=ON
make
make install
popd

Please paste the text of the ouput and not a photo.

Sure, here you go:

/home/vagrant/cassandra-cpp-driver/src/third_party/sparsehash/src/sparsehash/dense_hash_map:264:59:   required from ‘sparsehash::dense_hash_map<Key, T, HashFcn, EqualKey, Alloc>::data_type& sparsehash::dense_hash_map<Key, T, HashFcn, EqualKey, Alloc>::operator[](const key_type&) [with Key = std::__cxx11::basic_string<char>; T = cass::SharedRefPtr<const cass::PreparedMetadata::Entry>; HashFcn = std::hash<std::__cxx11::basic_string<char> >; EqualKey = std::equal_to<std::__cxx11::basic_string<char> >; Alloc = sparsehash::libc_allocator_with_realloc<std::pair<const std::__cxx11::basic_string<char>, cass::SharedRefPtr<const cass::PreparedMetadata::Entry> > >; sparsehash::dense_hash_map<Key, T, HashFcn, EqualKey, Alloc>::data_type = cass::SharedRefPtr<const cass::PreparedMetadata::Entry>; sparsehash::dense_hash_map<Key, T, HashFcn, EqualKey, Alloc>::key_type = std::__cxx11::basic_string<char>]’
/home/vagrant/cassandra-cpp-driver/src/prepared.hpp:113:26:   required from here
/home/vagrant/cassandra-cpp-driver/src/third_party/sparsehash/src/sparsehash/internal/libc_allocator_with_realloc.h:68:40: error: ‘void* realloc(void*, size_t)’ moving an object of non-trivially copyable type ‘struct std::pair<const std::__cxx11::basic_string<char>, cass::SharedRefPtr<const cass::PreparedMetadata::Entry> >’; use ‘new’ and ‘delete’ instead [-Werror=class-memaccess]
   68 |     return static_cast<pointer>(realloc(p, n * sizeof(value_type)));
      |                                 ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/c++/9/utility:70,
                 from /usr/include/c++/9/algorithm:60,
                 from /home/vagrant/cassandra-cpp-driver/src/hash.hpp:20,
                 from /home/vagrant/cassandra-cpp-driver/src/address.hpp:20,
                 from /home/vagrant/cassandra-cpp-driver/src/serialization.hpp:20,
                 from /home/vagrant/cassandra-cpp-driver/src/buffer.hpp:21,
                 from /home/vagrant/cassandra-cpp-driver/src/request.hpp:20,
                 from /home/vagrant/cassandra-cpp-driver/src/batch_request.hpp:23,
                 from /home/vagrant/cassandra-cpp-driver/src/batch_request.cpp:17:
/usr/include/c++/9/bits/stl_pair.h:208:12: note: ‘struct std::pair<const std::__cxx11::basic_string<char>, cass::SharedRefPtr<const cass::PreparedMetadata::Entry> >’ declared here
  208 |     struct pair
      |            ^~~~
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/cassandra.dir/build.make:128: CMakeFiles/cassandra.dir/src/batch_request.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:124: CMakeFiles/cassandra.dir/all] Error 2
make: *** [Makefile:152: all] Error 2
root@ubuntu2004:/home/vagrant/cassandra-cpp-driver/build#