Frequent timeouts / connection losses

Hi,
I'm trying to learn how to use YB by integrating it into an existing project, replacing a pre-existing Postgres DB.
I've successfully deployed YB on an AWS test node following the Helm/Kubernetes tutorial on the YB site, and I can connect to the yb-master UI and to the DB via the pgAdmin client.
I'm experiencing weird problems on a frequent basis:

  1. pgAdmin disconnects from the DB server at an annoying rate
  2. my WildFly AS often loses its connection during a transaction

I've tried to change the DB timeouts (like the tcp_keepalives_idle parameter), but running the following command
./yb-ctl restart --tserver_flags ysql_pg_conf="tcp_keepalives_idle=1200"
raised a "Cannot form yb-admin command without knowing master addresses" error.
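I suspect yb-ctl only manages local clusters that it created itself, which would explain why it cannot find the master addresses of the Helm deployment. My guess is that the flag has to be passed through the chart's gflags values instead, roughly like this (untested; the key names are my reading of the standard yugabyte chart docs and "yb-demo" is a placeholder release name):

    helm upgrade yb-demo yugabytedb/yugabyte \
      --namespace default \
      --set gflags.tserver.ysql_pg_conf="tcp_keepalives_idle=1200"
    # quoting/escaping of the embedded "=" may need adjusting for helm --set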
Are there any how-tos or documentation I can follow to work through this issue?

Thank you in advance

Hi @m3lkor

What's "WildFly AS"? Is it JBoss? (from a quick Google search)

What type of hardware are you using?

What command did you use to start the server(s)?

Hi Dorian,
Yes, WildFly is JBoss.
I'm not using any bare-metal hardware; I'm using an m5a.2xlarge instance on AWS, with the specs suggested in the YugabyteDB documentation.
I'm using Kubernetes, so I'm starting the pods in a minimal configuration (1 master and 1 tserver) just to get a DB running.

Can you paste some logs from the server when you’re having the disconnects? (how to get logs)

Is the server under pressure? (meaning CPU utilization)

Hi @dorian_yugabyte,
Thanks for your reply.
As far as I can see, the server is just fine: I see CPU utilization of around 25%, with very low usage on the volume side.
In postgresql…log I can see a lot of:

2021-11-15 15:43:35.829 UTC [5387] ERROR: Query error: Restart read required at: { read: { physical: 1636991015802541 } local_limit: { physical: 1636991015802541 } global_limit: in_txn_limit: serial_no: 0 }

And a looping log like:

I1115 15:45:38.241672 6872 mem_tracker.cc:282] MemTracker: hard memory limit is 3.398438 GB
I1115 15:45:38.241786 6872 mem_tracker.cc:284] MemTracker: soft memory limit is 2.888672 GB
I1115 15:45:38.242923 6872 thread_pool.cc:171] Starting thread pool { name: pggate_ybclient queue_limit: 10000 max_workers: 1024 }
I1115 15:45:38.243638 6872 pggate.cc:161] Reset YSQL bind address to yb-tserver-0.yb-tservers.default.svc.cluster.local:5432
I1115 15:45:38.243737 6872 server_base_options.cc:137] Updating master addrs to {yb-master-0.yb-masters.default.svc.cluster.local:7100}
I1115 15:45:38.245041 6872 pggate.cc:101] InitTServerSharedObject: 0, 0, 6
I1115 15:45:38.245199 6872 pg_client.cc:59] Using TServer host_port: yb-tserver-0.yb-tservers.default.svc.cluster.local:9100
I1115 15:45:38.245208 6879 async_initializer.cc:82] Starting to init ybclient
I1115 15:45:38.245386 6879 client-internal.cc:2355] New master addresses: [yb-master-0.yb-masters.default.svc.cluster.local:7100]
I1115 15:45:38.245981 6879 async_initializer.cc:86] Successfully built ybclient
I1115 15:45:38.745290 6875 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)

Can you check if this article helps https://support.yugabyte.com/hc/en-us/articles/4403469712397-Database-Transactions-errors-out-with-Restart-read-required- ?

Can you add more resources to the pods/containers? See our hardware recommendations: Deploy | YugabyteDB Docs

Hi Dorian.

I've tried adding resources by using a c5a.4xlarge instance instead of the 2xlarge, but as I stated before, the CPU was not stressed: peak usage was around 25% on the 2xlarge and around 18% on the 4xlarge… I can try to split the load by starting more than one instance, leaving YB alone on one pod and the AS on another, to maximize node usage by the YB pod.
I've tried setting the transaction isolation level to SERIALIZABLE READ ONLY DEFERRABLE, but I couldn't observe any improvement in the runs I did. It also creates an issue for my test architecture, because if the pod is destroyed I have to log in again to run that statement.
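In any case, I guess a cleaner way to experiment with those defaults would be to make them server-side settings through the chart, so they survive pod restarts instead of having to re-run the statement after every reconnect. Roughly like this, though I haven't tried it yet (key and flag names are my reading of the chart and flag docs; "yb-demo" is a placeholder release name):

    # yb-overrides.yaml (sketch): pass the PostgreSQL defaults through the
    # tserver's ysql_pg_conf flag; the tcp_keepalives_* settings from my first
    # post could go in the same comma-separated list.
    gflags:
      tserver:
        ysql_pg_conf: "default_transaction_isolation=serializable,default_transaction_read_only=on,default_transaction_deferrable=on"

    # roll it out to the existing release:
    helm upgrade yb-demo yugabytedb/yugabyte --namespace default -f yb-overrides.yaml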

@m3lkor can you tell me the command you used to run it?
I'm assuming 1 server and RF 1 using yb-ctl, correct?

Also, can you explain the workload that you’re running?

Thanks for your reply, Dorian.
I'm using Helm version 2.9.1,
which starts YB with the following commands (master first, then the tserver):

        command:
          - "/bin/bash"
          - "-c"
          - |
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="$(HOSTNAME).yb-masters.$(NAMESPACE).svc.cluster.local" \
                --port="7100"
            fi && \
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="$(HOSTNAME).yb-masters.$(NAMESPACE).svc.cluster.local:7100" \
                --port="7100"
            fi && \
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="0.0.0.0" \
                --port="7000"
            fi && \
            exec /home/yugabyte/bin/yb-master \
              --fs_data_dirs=/mnt/disk0 \
              --master_addresses= \
              --replication_factor=0 \
              --enable_ysql=true \
              --metric_node_name=$(HOSTNAME) \
              --memory_limit_hard_bytes=1824522240 \
              --stderrthreshold=0 \
              --num_cpus=2 \
              --undefok=num_cpus,enable_ysql \
              --default_memory_limit_to_ram_ratio="0.85" \
              --rpc_bind_addresses=$(HOSTNAME).yb-masters.$(NAMESPACE).svc.cluster.local \
              --server_broadcast_addresses=$(HOSTNAME).yb-masters.$(NAMESPACE).svc.cluster.local:7100 \
              --webserver_interface=0.0.0.0

        command:
          - "/bin/bash"
          - "-c"
          - |
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local" \
                --port="9100"
            fi && \
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100" \
                --port="9100"
            fi && \
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="0.0.0.0" \
                --port="9000"
            fi && \
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local" \
                --port="9042"
            fi && \
            if [ -f /home/yugabyte/tools/k8s_preflight.py ]; then
              PYTHONUNBUFFERED="true" /home/yugabyte/tools/k8s_preflight.py \
                --addr="0.0.0.0:5433" \
                --port="5433"
            fi && \
            exec /home/yugabyte/bin/yb-tserver \
              --fs_data_dirs=/mnt/disk0 \
              --tserver_master_addrs= \
              --metric_node_name=$(HOSTNAME) \
              --memory_limit_hard_bytes=3649044480 \
              --stderrthreshold=0 \
              --num_cpus=2 \
              --undefok=num_cpus,enable_ysql \
              --use_node_hostname_for_local_tserver=true \
              --rpc_bind_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local \
              --server_broadcast_addresses=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local:9100 \
              --webserver_interface=0.0.0.0 \
              --enable_ysql=true \
              --pgsql_proxy_bind_address=0.0.0.0:5433 \
              --cql_proxy_bind_address=$(HOSTNAME).yb-tservers.$(NAMESPACE).svc.cluster.local

For now, I'm trying a simple workload: a REST service that receives requests and saves what it receives as input into a few different tables. I have a little shell script that creates the requests and sends them to my REST service, which is deployed on that JBoss AS and connected to YB through a connection pool. I've tried connecting my AS to a different DB (a Postgres instance) without any problem; for this test I've just changed the address and port to YugabyteDB's.

Can you try any of our benchmarking tools? See Benchmark YugabyteDB | YugabyteDB Docs.
Otherwise we'd need a way to reproduce the workload and see what goes wrong and why.

Hi Dorian,
After some more testing with the application (rather than the benchmarking tools, which I haven't had a chance to try yet), it seems that our issue concerns concurrent transactions. I've tried starting YugabyteDB with the isolation flag for YSQL (both SERIALIZABLE and REPEATABLE READ, trying to get a READ COMMITTED-like isolation level), but forcing SERIALIZABLE seems to make the issue worse, because we have a lot of transactions running in parallel.
From my analysis, it seems that a transaction locks data for a longer time span than expected, and concurrent transactions find that data locked and receive an error instead of some "queuing" of their statements.
I was expecting behaviour like "if data is uncommitted, a concurrent transaction can only enqueue what it wants to do on the same data, and those statements will be executed in order after the first transaction commits", but it seems I cannot achieve this. What am I missing?

Thank you

In our case you have to retry the transaction that is aborted, because we support optimistic transactions, as explained here: Explicit locking | YugabyteDB Docs. Of 2 conflicting transactions, 1 gets aborted and needs to be retried by the client.
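As a rough illustration of that retry loop (shown here with ysqlsh and a SQL file; your WildFly/JBoss client would do the equivalent by catching the conflict error, which typically surfaces as SQLSTATE 40001, rolling back, and re-running the transaction):

    #!/usr/bin/env bash
    # Sketch only: re-run a transaction when YSQL aborts it because of a conflict.
    # Host and file names are placeholders; a real client should also inspect the
    # error code instead of retrying on any failure.
    HOST=yb-tserver-0.yb-tservers.default.svc.cluster.local
    for attempt in 1 2 3 4 5; do
      if ysqlsh -h "$HOST" -p 5433 -U yugabyte \
           -v ON_ERROR_STOP=1 --single-transaction -f my_txn.sql; then
        echo "committed on attempt $attempt"
        exit 0
      fi
      echo "transaction aborted, retrying (attempt $attempt)" >&2
      sleep "$attempt"   # crude linear backoff
    done
    exit 1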

Hi @m3lkor are you doing these in batches? If so, what size?

Hi @Marko_Rajcevic,

we're not doing batch operations, but it's something like a batch. We have long transactions, with a lot of operations performed by our application within a single transaction. We have many threads running with a high degree of parallelism, so you could see it as a number of batches running in parallel.

So, if I understand what you're saying correctly, there's no way to use YB in a context where we have long transactions lasting several seconds without having to handle transaction clashes by resubmitting the entire transaction? We're using vanilla Postgres without any issue; maybe there's some configuration that can be applied, or is this something that YB simply doesn't support at the moment?

Thanks

Hello,

I am evaluating YB and I am facing the same issues described by @m3lkor: basically, we have several long-running transactions (several seconds, with hundreds of updates per transaction, so mini-batches if you want to call them that) distributed over several parallel threads. We are experiencing the same errors described in this thread due to write clashes ("Restart read required", "Aborted transaction", etc.), and if we have to roll back and retry every clashed transaction, it would introduce an unbearable overhead.

Please note that the same application works well with native PostgreSQL (and also with Oracle DB), but we are evaluating YB with horizontal scalability as the main driver (we're not very interested in geographic distribution for now).

I second @m3lkor's question about whether there is a configuration in YB to support this kind of workload: basically, let the DB manage these parallel updates, maybe by taking some locks, but avoid raising exceptions that force the client to retry the whole transaction.

Any suggestion would be really appreciated.

Thanks and regards,

Alessandro

Can you post the error message you’re getting here?

Hi @Alessandro

Can you open a separate issue? Then we can look at each error one by one and see what we can do.

No worries, Dorian. I finally realized that @m3lkor is a colleague of mine and we are working on the same evaluation!

The error we are receiving is “Restart read required”.

We made some progress in the analysis, and it turned out that the issue appeared when we had some UPDATE statements that were not using indexes and therefore performed the updates with a full table scan.

We added the missing indexes, and now the failure rate is really low; almost zero, I'd say.

What do you think, could this be a valid root cause?

So let's say I have an UPDATE statement with a WHERE condition that is not indexed: could the resulting full table scan also "lock" records that are not of interest, so that some parallel queries receive the "Restart read required" error?

Thanks and regards,
Alessandro

Yes, it should be.

Yes. In this case, since the update takes too long, it's possible that the data underneath has changed (from compactions) since the transaction started.

This and other slow queries can be amplified a lot in distributed scenarios (for example, by the overhead of synchronous replication over multiple regions) compared to a single PostgreSQL/RDBMS instance with streaming replication.

In this case, you can either show us your tables along with the most frequent queries and we can work out a better structure, or you can paste here any slow query together with its EXPLAIN ANALYZE output and the schema, and we can help you structure it better.

In this example, a big UPDATE with no index is an anti-pattern and will bite you sooner or later, depending on the data size.
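For example, something along these lines (table and column names are invented) shows whether the WHERE clause is served by an index or by a sequential scan:

    # Sketch: run the plan through ysqlsh; a "Seq Scan" row in the output is the
    # full-table-scan case discussed above, while an index on the WHERE column
    # (e.g. CREATE INDEX ON orders (customer_id)) should switch it to an index scan.
    # Note that EXPLAIN ANALYZE actually executes the UPDATE; plain EXPLAIN only shows the plan.
    ysqlsh -h yb-tserver-0.yb-tservers.default.svc.cluster.local -p 5433 -U yugabyte \
      -c "EXPLAIN ANALYZE UPDATE orders SET status = 'done' WHERE customer_id = 42;"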