No node was available to execute the query

I have installed YugabyteDB with the Helm chart in a K8s cluster and then updated the Helm release to set an additional GFlag. The master pods restarted, but the tserver pods kept running.

My application also kept running, as I expect a no-downtime experience when doing rolling updates on a K8s cluster. However, the smart drivers apparently did not deal well with this situation. Here’s some log output:

[s0-admin-0] WARN com.datastax.oss.driver.internal.core.pool.ChannelPool - [s0|/10.42.7.40:9042] Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=Java driver for YugabyteDB, DRIVER_VERSION=4.6.0-yb-11, CLIENT_ID=0377f192-43f4-47a1-b0d5-3204cb112427}): failed to send request (java.nio.channels.NotYetConnectedException))
[s0-admin-0] WARN com.datastax.oss.driver.internal.core.pool.ChannelPool - [s0|yb-tservers.yb.svc.cluster.local/10.42.8.41:9042] Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=Java driver for YugabyteDB, DRIVER_VERSION=4.6.0-yb-11, CLIENT_ID=0377f192-43f4-47a1-b0d5-3204cb112427}): failed to send request (java.nio.channels.NotYetConnectedException))
[XNIO-1 task-4] WARN com.zaxxer.hikari.pool.PoolBase - HikariPool-1 - Failed to validate connection com.yugabyte.jdbc.PgConnection@606e89ac (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
[XNIO-1 task-4] WARN com.zaxxer.hikari.pool.PoolBase - HikariPool-1 - Failed to validate connection com.yugabyte.jdbc.PgConnection@a607f1f (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
[XNIO-1 task-4] WARN com.zaxxer.hikari.pool.PoolBase - HikariPool-1 - Failed to validate connection com.yugabyte.jdbc.PgConnection@1ed7a416 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
[XNIO-1 task-4] WARN com.zaxxer.hikari.pool.ProxyConnection - HikariPool-1 - Connection com.yugabyte.jdbc.PgConnection@41513084 marked as broken because of SQLSTATE(57P01), ErrorCode(0)
com.yugabyte.util.PSQLException: FATAL: terminating connection due to administrator command
        at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
        at com.yugabyte.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
        at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355)

After a while, this:

Caused by: com.yugabyte.util.PSQLException: ERROR: Aborted: Shutdown connection
        at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.yugabyte.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.yugabyte.jdbc.PgStatement.executeInternal(PgStatement.java:490) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.yugabyte.jdbc.PgStatement.execute(PgStatement.java:408) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.yugabyte.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:162) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.yugabyte.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114) ~[jdbc-yugabytedb-42.3.4.jar!/:42.3.4]
        at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52) ~[HikariCP-4.0.3.jar!/:?]
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java) ~[HikariCP-4.0.3.jar!/:?]
        at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:57) ~[hibernate-core-5.6.14.Final.jar!/:5.6.14.Final]
        ... 176 more

2023-01-18 17:32:06.238  WARN  [XNIO-1 task-1]  SQL Error: 0, SQLState: XX000
2023-01-18 17:32:06.238 ERROR  [XNIO-1 task-1]  ERROR: Aborted: Shutdown connection
2023-01-18 17:32:06.238  WARN  [XNIO-1 task-1]  SQL Error: 0, SQLState: 57P01
2023-01-18 17:32:06.238 ERROR  [XNIO-1 task-1]  FATAL: terminating connection due to administrator command
2023-01-18 17:32:06.238  WARN  [XNIO-1 task-1]  SQL Error: 0, SQLState: 08006
2023-01-18 17:32:06.238 ERROR  [XNIO-1 task-1]  An I/O error occurred while sending to the backend.

Then this:

[s0-admin-1] WARN com.datastax.oss.driver.internal.core.pool.ChannelPool - [s0|/10.42.4.44:9042]  Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=Java driver for YugabyteDB, DRIVER_VERSION=4.6.0-yb-11, CLIENT_ID=0377f192-43f4-47a1-b0d5-3204cb112427}): failed to send request (io.netty.channel.StacklessClosedChannelException))
2023-01-18 17:37:19.158 ERROR  [XNIO-1 task-1]  Cannot remove Data

com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
        at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40) ~[java-driver-core-4.6.0-yb-11.jar!/:?]
        at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:151) ~[java-driver-core-4.6.0-yb-11.jar!/:?]
        at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53) ~[java-driver-core-4.6.0-yb-11.jar!/:?]
        at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30) ~[java-driver-core-4.6.0-yb-11.jar!/:?]
        at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230) ~[java-driver-core-4.6.0-yb-11.jar!/:?]
        at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:53) ~[java-driver-core-4.6.0-yb-11.jar!/:?]

Am I expecting something from the Java drivers that they cannot possibly do? I had to completely restart my application to get it operational again.

This is how I connect to YSQL:
jdbc:yugabytedb://yb-tservers.my-yb-namespace.svc.cluster.local:5433/myDB

And to YCQL:
yb-tservers.my-yb-namespace.svc.cluster.local:9042
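
For completeness, here is roughly how the two clients are wired up in code. This is a minimal sketch; the credentials, pool settings, and local datacenter name are placeholders, not my exact configuration:

import com.datastax.oss.driver.api.core.CqlSession;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.net.InetSocketAddress;

public class YbConnections {
    // YSQL: Hikari pool over the YugabyteDB JDBC driver.
    public static HikariDataSource ysqlPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:yugabytedb://yb-tservers.my-yb-namespace.svc.cluster.local:5433/myDB");
        config.setUsername("yugabyte"); // placeholder credentials
        config.setPassword("yugabyte"); // placeholder credentials
        return new HikariDataSource(config);
    }

    // YCQL: DataStax-derived driver pointed at the tserver service.
    public static CqlSession ycqlSession() {
        return CqlSession.builder()
                .addContactPoint(new InetSocketAddress(
                        "yb-tservers.my-yb-namespace.svc.cluster.local", 9042))
                .withLocalDatacenter("datacenter1") // placeholder DC name
                .build();
    }
}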

Does the cluster have a master-leader?

For the YSQL part, maybe the connection died somehow in the middle. Your error SQLSTATE(57P01) is admin_shutdown in the PostgreSQL error codes list (PostgreSQL 11 documentation, Appendix A: PostgreSQL Error Codes).

I haven’t looked at the internals of the Java driver to see whether it catches this exception and retries the query on a new connection. cc @Amogh_Yugabyte since you committed.

But because transactions are client-controlled, I wouldn’t expect the driver to retry automatically, since it can’t know the whole context. So I would expect some pseudo-code like the following (with a concrete Java sketch after it):

while tries < 3:
    do some queries
    do some logic
    other queries in the same transaction
    connection breaks
    # we have to start from scratch; we can't just "retry" the latest query
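
In concrete Java terms, a minimal sketch of that loop; the DataSource, statement, and values are made up for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class TxRetry {
    // The application owns the transaction, so it must redo the whole unit
    // of work from the top when the connection breaks mid-transaction.
    public static void doTransfer(DataSource ds) throws SQLException {
        SQLException last = null;
        for (int tries = 0; tries < 3; tries++) {
            try (Connection conn = ds.getConnection()) {
                conn.setAutoCommit(false);
                try (PreparedStatement stmt = conn.prepareStatement(
                        "UPDATE accounts SET balance = balance - ? WHERE id = ?")) {
                    stmt.setInt(1, 100);
                    stmt.setLong(2, 1L);
                    stmt.executeUpdate();
                }
                // ... more queries and application logic in the same transaction ...
                conn.commit();
                return; // success
            } catch (SQLException e) {
                last = e; // start over from scratch with a fresh connection
            }
        }
        throw last;
    }
}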

You can do catch-all automatic retry in YCQL, for example, because each query is totally separate.
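
For instance, with the 4.x driver you can mark a statement idempotent, which tells the driver it is safe to retry it transparently on another node. A sketch; the keyspace, table, and session variable are made up:

import com.datastax.oss.driver.api.core.cql.SimpleStatement;

// Assumes an existing CqlSession named "session".
SimpleStatement stmt = SimpleStatement
        .newInstance("SELECT * FROM my_keyspace.my_table WHERE id = ?", 42)
        .setIdempotent(true); // safe for the driver to retry transparently
session.execute(stmt);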

You mean all connections broke? You couldn’t get a clean connection even if you tried several times?

So you were also using YCQL in the same app?

cc @nmalladi

Yes, I am using YSQL as well as YCQL in a Spring Boot application with a Hikari connection pool (for YSQL), and all the pool connections were broken after the master servers restarted. Usually the pool evicts non-working connections and requests new ones.
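
For reference, these are the Hikari knobs that govern how quickly dead connections are evicted and replaced; the values below are illustrative, not my production config:

import com.zaxxer.hikari.HikariConfig;

HikariConfig config = new HikariConfig();
config.setMaxLifetime(300_000);      // recycle each connection after 5 minutes
config.setKeepaliveTime(60_000);     // ping idle connections every minute
config.setValidationTimeout(5_000);  // give up quickly on validating a dead connection
config.setConnectionTimeout(10_000); // fail fast when no connection can be acquired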

I was under the impression that for YCQL the driver itself does client-side load balancing, so it would check all nodes and, if none worked, request a new list of nodes. Apparently in this case it got an empty list of nodes, hence the NoNodeAvailableException. But I would expect it to recover later on, after the master servers restarted, by fetching a fresh list of nodes.

No. The YCQL client never connects to yb-master directly. It must always have access to a yb-tserver node, which it queries for the list of available nodes. Maybe you restarted all yb-tserver nodes, got new IPs, and the client didn’t know about any of them?
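
Under the hood, that discovery amounts to reading the system tables on a tserver the driver is already connected to, conceptually something like the following (the driver does this internally; "session" is an existing CqlSession):

// Node discovery: the driver refreshes cluster topology from system tables.
com.datastax.oss.driver.api.core.cql.ResultSet peers =
        session.execute("SELECT peer, rpc_address FROM system.peers");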

Well, IPs constantly change in K8s; that’s why DNS is used. My app connects to yb-tservers.my-yb-namespace.svc.cluster.local, which I believed would always be available. IIRC the tservers kept running, but the masters restarted. But I suppose I need to reproduce the case and be prepared to capture the relevant log files at the time of the crash.

Does the Java driver always make a fresh DNS request to that endpoint with no caching, thereby always resolving to an IP that is online? Or does it do some caching and keep looking at the old, killed IP?
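
For what it’s worth, the JVM itself also caches DNS lookups; its TTL can be tightened with a standard JDK security property. A sketch:

// Lower the JVM-wide DNS cache TTL to 10 seconds; must be set early, before lookups happen.
java.security.Security.setProperty("networkaddress.cache.ttl", "10");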

cc @Amogh_Yugabyte @nmalladi