Tserver docker CPU load in idle

victor · February 8, 2020, 9:38pm

Each tserver Docker node shows 25% CPU load while idle.

karthik · February 9, 2020, 6:40am

Interesting. I wonder if there is some background task running. Could you please answer a few questions about your setup? That would help us identify what is going on:

How many vCPUs on each yb-tserver pod?
How many tables/tablets do you have in the system? You can find this in the master UI: http://yb-master-0:7000 (the tablet-servers page shows how many tablets are on each yb-tserver).
Did you just finish loading data into any of the tables?
Could you also look at any of the tserver logs to see if anything is reported there?

victor · February 9, 2020, 6:45pm

I use docker-compose to set up the cluster in a dev environment (Win10 + Docker); Docker has 8 CPUs for all services. There are 6 empty tables (created when the db updater starts). Yes. Logs for tserver1 are attached.

Logs:

…

W0209 18:34:10.465245 19 reactor.cc:380] TabletServer_R007: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 6 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:12.184116 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 7 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:12.184180 15 reactor.cc:380] TabletServer_R003: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 2 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:12.198673 17 reactor.cc:380] TabletServer_R005: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 4 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:12.198679 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 0 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:12.244817 141 replica_state.cc:733] T e5645f251c394ca282b38125e1f10386 P 36f9a55694454f28941011b16cffc067 [term 3 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 3
W0209 18:34:13.625566 16 reactor.cc:380] TabletServer_R004: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 2 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:13.625627 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:23.969054 18 reactor.cc:380] TabletServer_R006: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 4 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:23.969092 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 0 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:23.983862 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:23.983912 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 7 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:24.004087 141 replica_state.cc:733] T 435f874f25924e2e9055c968cac30dba P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2
W0209 18:34:25.972113 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:25.972182 15 reactor.cc:380] TabletServer_R003: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 1 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:25.986752 15 reactor.cc:380] TabletServer_R003: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 2 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:25.986797 19 reactor.cc:380] TabletServer_R007: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 5 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:26.008317 140 replica_state.cc:733] T 7dc71901204a459585ce9828ba10c0c3 P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2
W0209 18:34:30.978746 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 7 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:30.978780 12 reactor.cc:380] TabletServer_R000: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 7 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:30.992595 17 reactor.cc:380] TabletServer_R005: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 3 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:30.992645 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:31.010702 140 replica_state.cc:733] T 00f717bfad7847e7838bc59769cb0601 P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2
W0209 18:34:31.980079 12 reactor.cc:380] TabletServer_R000: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 6 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:31.980187 19 reactor.cc:380] TabletServer_R007: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 6 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:31.994076 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 0 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:31.994104 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 -> tcp }
W0209 18:34:32.012917 140 replica_state.cc:733] T 08584550feec4f70b729ebf72d0aca09 P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2

karthik · February 10, 2020, 3:44am

Something seems off here. cc @bogdan @sanketh - could you folks please take a look?

sanketh · February 11, 2020, 10:08pm

Hello Victor,

We recently updated our reference docker-compose file at Deploy | YugabyteDB Docs. Please make sure your compose config looks similar to the reference.
25% usage on idle for 6 user tables does not match normal behavior. Were you running a benchmark or other workload prior to this? Could you roughly describe what the workload was doing at a high level? Or did you just create the 6 tables?
Could you collect a performance profile from the tserver that would help us understand the idle usage? Here are the commands you would need to run in the yb-tserver container to collect the profile.

$ yum install -y perl graphviz

$ /home/yugabyte/bin/pprof --text http://localhost:9000/pprof/profile --seconds 30 > /mnt/tserver/tserver-profile.txt

$ /home/yugabyte/bin/pprof --ps http://localhost:9000/pprof/profile --seconds 30 > /mnt/tserver/tserver-profile.ps

Please then copy the two files, /mnt/tserver/tserver-profile.txt and /mnt/tserver/tserver-profile.ps, out of the container and attach them to this thread.
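
One possible way to copy the files out is docker cp, sketched below. The copy_profiles helper and the container name in the usage comment are hypothetical placeholders; the DOCKER variable is only an assumption added so the commands can be previewed with echo before running them for real.

```shell
# Copy both profile files from a tserver container to a host directory.
# Usage (container name is a placeholder; list real names with `docker ps`):
#   copy_profiles <tserver-container> .
copy_profiles() {
  local container=$1 dest=${2:-.} f
  for f in tserver-profile.txt tserver-profile.ps; do
    # DOCKER defaults to the real docker CLI; set DOCKER=echo for a dry run.
    ${DOCKER:-docker} cp "$container:/mnt/tserver/$f" "$dest/$f"
  done
}
```

Running it with DOCKER=echo prints the two docker cp commands instead of executing them.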

Thanks for helping us debug this issue!

victor · February 12, 2020, 11:24pm

Hi.

See the attachments.
After start, the db updater just created empty tables: no API activity, no inserts/selects, etc. I have 3 instances of api1 and 3 instances of api2, each connected to each of the 6 tservers, so 6 clients are connected to each tserver.

Thanks.

Regards,

Victor

(Attachment docker-compose.yml is missing)

(Attachment tserver-profile.ps is missing)

(Attachment tserver-profile.txt is missing)

dorian_yugabyte · February 13, 2020, 1:24pm

@victor your attachments are missing. Can you try again?

victor · February 13, 2020, 2:01pm

Hi.

Use this link to download reports: Perf_report

P.S. I can attach only images to this thread.

victor · February 18, 2020, 5:46pm

The same issue on Ubuntu 18.04 VM:

dorian_yugabyte · February 22, 2020, 2:24pm

@victor

Can you please provide a working docker-compose file?
For example, yours is missing the version at the top, is missing services, has an undefined network, etc.

I tried yb-docker-ctl with multiple replicas and nodes and it worked fine:

Can you send the full docker-compose.yaml and try the guide above?

victor · February 24, 2020, 10:51pm

Hi.

My docker-compose

After start: CPU load = 5-6%.
1 table: CPU = 8%.
2 tables: CPU = 12%.
4 tables: CPU = 18%.
6 tables: CPU = 25%.

P.S. Tables are empty, just created.
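
The figures above grow roughly linearly with the number of tables. A minimal sketch of that estimate, assuming a plain least-squares fit over the four reported points (1, 2, 4 and 6 tables); the fit_cpu_per_table function is hypothetical and not part of any YugabyteDB tooling:

```shell
# Least-squares fit of CPU% against table count.
# Input: one "tables cpu_percent" pair per line on stdin.
fit_cpu_per_table() {
  awk '
    { n++; sx += $1; sy += $2; sxy += $1 * $2; sxx += $1 * $1 }
    END {
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
      intercept = (sy - slope * sx) / n
      printf "slope=%.2f intercept=%.2f\n", slope, intercept
    }'
}

# The four measurements reported in this post.
printf '%s\n' "1 8" "2 12" "4 18" "6 25" | fit_cpu_per_table
# prints: slope=3.34 intercept=4.90
```

With these numbers the fit gives a little over 3% CPU per additional empty table, and an intercept close to the 5-6% measured right after start.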

dorian_yugabyte · February 25, 2020, 1:13pm

I’m using Ubuntu 18.04 (desktop) in a VirtualBox VM with 8 GB RAM and 1 vCPU (Windows host).

I have 20 tables, each created like create table t1(id bigint primary key, a text, b text, c text, d text); which results in 240 tablets/tserver.

I’m getting 10% CPU for each container:

^^^ Running docker stats itself adds 30% to my VM’s CPU usage.

In total, the Ubuntu VM shows 10% CPU (not 10% * 3 containers):

After adding 300 tablets, resulting in 540 tablets/tserver, CPU went to 35%+.
This looks similar to [docdb] Scale to 1k+ tables · Issue #1317 · yugabyte/yugabyte-db · GitHub (raft groups, which we are working on).

Because this is a single-core, virtualized Ubuntu desktop (running Firefox etc.), it looks like it has a bigger overhead than normal.

Solution:

Since you are in a development environment, one way to work around this issue is to set the --yb_num_shards_per_tserver flag to 1.

For production, you should follow the Deploy | YugabyteDB Docs guide.
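
To see why lowering the flag helps, here is a rough sketch of the tablet arithmetic. It assumes that each table is split into yb_num_shards_per_tserver times (number of tservers) tablets, and that every tablet is replicated replication-factor times and spread evenly across tservers. The function name is hypothetical, and the values 4 shards per tserver, 3 tservers and replication factor 3 are assumptions chosen only because they reproduce the 240 tablets/tserver reported above for 20 tables.

```shell
# Tablet replicas hosted per tserver, assuming an even spread:
#   tablets per table = shards_per_tserver * tservers
#   replicas in total = tables * tablets per table * replication factor
#   replicas per tserver = replicas in total / tservers
tablets_per_tserver() {
  local tables=$1 shards_per_tserver=$2 tservers=$3 rf=$4
  echo $(( tables * shards_per_tserver * tservers * rf / tservers ))
}

tablets_per_tserver 20 4 3 3   # assumed setup from above: prints 240
tablets_per_tserver 20 1 3 3   # same tables with the flag set to 1: prints 60
```

Under these assumptions, setting the flag to 1 cuts the number of tablet replicas each tserver has to manage to a quarter.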

dorian_yugabyte · February 27, 2020, 11:14am

@victor

Can you tell me more about your dev setup?
How many tables, databases, indexes, etc.?

victor · February 27, 2020, 12:06pm

Hi.

Ubuntu 18.04 on windows hyper-v and amazon EC2.
1 db, 8 empty tables.

It looks like the tservers hang on replication between tables.

dorian_yugabyte · February 27, 2020, 3:41pm

The user was working with the YCQL API.
He had 6 tservers on 1 node, 3 replicas, and 3 masters, and created tables resulting in a large number of tablets.
The overhead came from the many tablets, from running 3 masters + 6 tservers on the same node, and from virtualization.
Fixed by using the WITH tablets clause when creating tables in dev environments.
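
As an illustration of that fix, the sketch below prints a YCQL CREATE TABLE statement with an explicit tablet count. The helper name, the keyspace/table name dev.t1 and the columns are hypothetical placeholders, not the user's actual schema.

```shell
# Print a YCQL CREATE TABLE statement pinned to a given tablet count.
# Arguments: table name, tablet count (defaults to 1).
ycql_create_table() {
  local name=$1 tablets=${2:-1}
  printf 'CREATE TABLE %s (id bigint PRIMARY KEY, a text) WITH tablets = %d;\n' \
    "$name" "$tablets"
}

ycql_create_table dev.t1 1
# prints: CREATE TABLE dev.t1 (id bigint PRIMARY KEY, a text) WITH tablets = 1;
```

The printed statement can then be run in the YCQL shell that ships with the installation.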
