Tserver docker CPU load in idle

Each tserver docker node has 25% cpu load in idle.

Hi @victor,

Interesting. Wondering if there is some background task going on. Could you please answer a few questions about your setup, would help us identify what is going on:

  • How many vCPUs on each yb-tserver pod?
  • How many tables/tablets do you have in the system? You can find this from the master ui: http://yb-master-0:7000 (tablet-servers page will give you how many tablets on each yb-tserver).
  • Did you just finish loading data into any of the tables?
  • Could you also look at any of the tserver logs to see if anything is reported there?

I use docker-compose to setup cluster in dev environment (win10 + docker) - docker has 8 cpus for all services. 6 empty tables (created on db updater start). Yes. Logs in attach for tserver1.

Logs:

…

W0209 18:34:10.465245 19 reactor.cc:380] TabletServer_R007: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 6 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:12.184116 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 7 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:12.184180 15 reactor.cc:380] TabletServer_R003: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 2 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:12.198673 17 reactor.cc:380] TabletServer_R005: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 4 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:12.198679 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 0 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:12.244817 141 replica_state.cc:733] T e5645f251c394ca282b38125e1f10386 P 36f9a55694454f28941011b16cffc067 [term 3 LEADER]: Can’t advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 3
W0209 18:34:13.625566 16 reactor.cc:380] TabletServer_R004: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 2 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:13.625627 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:23.969054 18 reactor.cc:380] TabletServer_R006: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 4 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:23.969092 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 0 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:23.983862 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:23.983912 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 7 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:24.004087 141 replica_state.cc:733] T 435f874f25924e2e9055c968cac30dba P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can’t advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2
W0209 18:34:25.972113 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:25.972182 15 reactor.cc:380] TabletServer_R003: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 1 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:25.986752 15 reactor.cc:380] TabletServer_R003: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 2 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:25.986797 19 reactor.cc:380] TabletServer_R007: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 5 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:26.008317 140 replica_state.cc:733] T 7dc71901204a459585ce9828ba10c0c3 P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can’t advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2
W0209 18:34:30.978746 13 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 7 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:30.978780 12 reactor.cc:380] TabletServer_R000: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 7 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:30.992595 17 reactor.cc:380] TabletServer_R005: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 3 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:30.992645 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:31.010702 140 replica_state.cc:733] T 00f717bfad7847e7838bc59769cb0601 P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can’t advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2
W0209 18:34:31.980079 12 reactor.cc:380] TabletServer_R000: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 6 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:31.980187 19 reactor.cc:380] TabletServer_R007: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 6 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:31.994076 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.10:9100 idx: 0 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:31.994104 14 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 172.18.0.7:9100 idx: 1 protocol: 0x00007f927e0a8f00 β†’ tcp }
W0209 18:34:32.012917 140 replica_state.cc:733] T 08584550feec4f70b729ebf72d0aca09 P 36f9a55694454f28941011b16cffc067 [term 2 LEADER]: Can’t advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: { term: 1 index: 1 }, New majority replicated is: term: 1 index: 1, Current term is: 2

Something seems off here. cc @bogdan @sanketh - could you folks please take a look?

Hello Victor,

  1. We have updated our reference docker-compose file at Deploy | YugabyteDB Docs recently. Please make sure your compose config looks similar to the reference.

  2. 25% usage on idle for 6 user tables does not match normal behavior. Were you running a benchmark or other workload prior to this? Could you roughly describe what the workload was doing at a high level? Or did you just create the 6 tables?

  3. Could you collect a performance profile from the tserver that would help us understand the idle usage? Here are the commands you would need to run in the yb-tserver container to collect the profile.

$ yum install -y perl graphviz

$ /home/yugabyte/bin/pprof --text http://localhost:9000/pprof/profile --seconds 30 > /mnt/tserver/tserver-profile.txt

$ /home/yugabyte/bin/pprof --ps http://localhost:9000/pprof/profile --seconds 30 > /mnt/tserver/tserver-profile.ps

Please then copy the two files /mnt/tserver/tserver-profile.txt /mnt/tserver/tserver-profile.ps out of the container and attach them to this thread.

Thanks for helping us debug this issue!

Hi.

  1. See attach.
  2. After start – db updater just created empty tables, no api activities, no inserts/selects, etc. I have api1x3 instances connected to each of 6 tservers and api2x3 instances connected to each of 6 tservers – so, 6 clients connected to each tserver.
  3. See attach.

Thanks.

Regards,

Victor

(Attachment docker-compose.yml is missing)

(Attachment tserver-profile.ps is missing)

(Attachment tserver-profile.txt is missing)

@victor you attachments are missing. Can you try again ?

Hi.

Use this link to download reports: Perf_report

P.S. I can attach only images to this thread.

The same issue on Ubuntu 18.04 VM:

@victor

Can you please provide a working docker-compose file ?
Example: it misses version on top, misses services, has undefined network, etc.

I tried with yb-docker-ctl multiple replicas and nodes and it worked fine:

Can you send the full docker-compose.yaml and try the guide above ?

Hi.

My docker-compose

After start cpu load = 5-6%.
1 table - cpu = 8%.
2 tables - cpu = 12%


4 tables - cpu = 18%
6 tables - 25%
image

P.S. Tables are empty, just created.

I’m using ubuntu 18.04 (desktop) on virtual box vm, setting 8GB ram and 1 vcpu. (windows host)

I have 20 tables create table t1(id bigint primary key, a text, b text, c text, d text); resulting in 240 tablets/tserver.

While I’m getting 10% for each container:


^^^ running docker stats adds 30% to my vm cpu

Total on the ubuntu vm I’m getting 10% total (not 10% * 3 containers):

After adding 300 tablets, resulting in 540 tablets/tserver, cpu went 35%+.
Looks like this is similar to [docdb] Scale to 1k+ tables Β· Issue #1317 Β· yugabyte/yugabyte-db Β· GitHub (raft group, which we are working on).

Because this is single-core, virtualized ubuntu-desktop (running firefox etc), looks like it’s having a bigger overhead than normally.

Solution:

Since you are in development environment, a way to work around this issue is setting --yb_num_shards_per_tserver to 1 configuration.

While for production you should follow Deploy | YugabyteDB Docs guide.

@victor

Can you tell me more regarding your dev setup ?
How many tables,databases,indexes etc.

Hi.

  1. Ubuntu 18.04 on windows hyper-v and amazon EC2.
  2. 1 db, 8 empty tables.

Looks like tservers hangs on replication between tables.

The user was working with YCQL api.
He had 6 tservers in 1 node, 3 replicas, 3 masters, and created tables resulting in a big number of tablets.
Had overhead because many tablets, 3 masters + 6tservers in same node and virtualization.
Fixed by using with tablets when creating tables in dev environments.

1 Like