Underreplicated Tablets

Description:
Upgrading Kubernetes from 1.17 to 1.18 to 1.19 leads to underreplicated tablets in YugabyteDB. I would expect YugabyteDB to ensure that the expected number of replicas is maintained.

Any ideas how to fix the underreplicated tablets?

A short snippet from the tablet-replication page:

Underreplicated Tablets
Table Name         Table UUID                        Tablet ID                         Tablet Replication Count
cassandrakeyvalue  614f8afa1bd744b69079a6e0f1c64f55  fddb6ca3592a4bc6bb0013cd29714e44  2
cassandrakeyvalue  614f8afa1bd744b69079a6e0f1c64f55  a92c1a8a1e4746acb5c421be4a891bbf  2
cassandrakeyvalue  614f8afa1bd744b69079a6e0f1c64f55  b55b6ebb93f74617b121b665b81d5492  2
cassandrakeyvalue  614f8afa1bd744b69079a6e0f1c64f55  91ea6f835e4248438530b8972bfbd18c  2
...

Cluster Setup

  • Number of Master Nodes: 3
  • Replication Factor: 3
  • Number of Nodes (TServers): 11
  • Number of User Tables: 1
  • Is Load Balanced? No
  • YugabyteDB Version: 2.5.3.1
  • Build Type: RELEASE
  • Installation: Helm single-DC cluster on Kubernetes 1.19 and Istio 1.9 (rough install command sketched below)
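
For reference, the install was roughly along the lines below. This is only a sketch; the chart repo URL and the replicas.* values keys are assumptions based on the standard YugabyteDB Helm chart, not the exact command used here.

# Sketch only: chart repo and values keys are assumptions, not the exact command used.
helm repo add yugabytedb https://charts.yugabyte.com
helm install yugabytedb yugabytedb/yugabyte \
  --namespace yugabytedb \
  --set replicas.master=3 \
  --set replicas.tserver=11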

@beatrausch are there any yb-master logs from during this time? Those might show why it's not re-replicating.

The same question for the yb-tserver logs?

Hi, I'll try to get some logs, but I can't promise anything, as this happens on our dev stage.
Is there a way to instruct YugabyteDB to create the missing replicas, e.g. via the yb-admin tool?
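
For example, would something along these lines be the right way to nudge it? I'm guessing at the relevant yb-admin subcommands here, so treat this as a sketch:

# Master addresses use the in-cluster service names on the RPC port 7100.
MASTERS=yb-master-0.yb-masters.yugabytedb.svc.cluster.local:7100,yb-master-1.yb-masters.yugabytedb.svc.cluster.local:7100,yb-master-2.yb-masters.yugabytedb.svc.cluster.local:7100

# List the tablet servers the masters currently know about:
yb-admin -master_addresses $MASTERS list_all_tablet_servers

# Make sure the load balancer (which also drives re-replication) is enabled:
yb-admin -master_addresses $MASTERS set_load_balancer_enabled 1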

Unfortunately, there are no logs from the exact point in time. But from the current leader I get the following logs:

I0325 13:59:21.342280    32 catalog_manager.cc:5529] d3d1bdabb110481e928163ca9608eb56 finished first full report: 45 tablets.
W0325 13:59:21.794284   128 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:22.391330   130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:22.906467   130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:23.351598   130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:23.796777   130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
…
W0325 14:00:18.731916   154 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 14:00:18.732136    49 consensus_peers.cc:471] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 -> Peer 58aa0a091a1242988bf9a13da9ab6077 ([host: "yb-master-1.yb-masters.yugabytedb.svc.cluster.local" port: 7100], [host: "yb-master-1.yb-masters.yugabytedb.svc.cluster.local" port: 7100]): Couldn't send request.  Status: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative). Retrying in the next heartbeat period. Already tried 118 times. State: 2
W0325 14:00:19.314105   154 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 14:00:19.831149   154 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
I0325 14:00:19.836176    41 raft_consensus.cc:2707] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 [term 11 LEADER]:  Leader pre-election vote request: Denying vote to candidate 58aa0a091a1242988bf9a13da9ab6077 for term 12 because replica is either leader or believes a valid leader to be alive. Time left: 9222576832.789s
I0325 14:00:20.337493    41 consensus_queue.cc:1165] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 [LEADER]: Connected to new peer: { peer: 58aa0a091a1242988bf9a13da9ab6077 is_new: 0 last_received: 10.1736 next_index: 1737 last_known_committed_idx: 1736 is_last_exchange_successful: 0 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 0 last_applied: 10.1736 }
I0325 14:00:20.338557    41 log_cache.cc:323] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0: Successfully read 1 ops from disk.
I0325 14:00:20.349576   154 replica_state.cc:1356] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 [term 11 LEADER]: Revoked old leader 58aa0a091a1242988bf9a13da9ab6077 ht lease: { physical: 1616680759229453 }
I0325 14:00:25.064741    21 reactor.cc:450] Master_R000: DEBUG: Closing idle connection: Connection (0x00000000040bca30) server 127.0.0.1:33612 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:29.164598    23 reactor.cc:450] Master_R002: DEBUG: Closing idle connection: Connection (0x00000000040bcc70) server 127.0.0.1:33898 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:29.164695    22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x00000000040bcb50) server 127.0.0.1:33896 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:47.365151    22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x00000000040bceb0) server 127.0.0.1:34650 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:47.365190    24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x00000000040bcd90) server 127.0.0.1:34648 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:58.965286    23 reactor.cc:450] Master_R002: DEBUG: Closing idle connection: Connection (0x0000000003710a30) server 127.0.0.1:35076 => 127.0.0.1:7100 - it has been idle for 65.0007s
I0325 14:00:59.065079    22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x00000000040bcfd0) server 127.0.0.1:35074 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:03.664491    21 reactor.cc:450] Master_R000: DEBUG: Closing idle connection: Connection (0x0000000003672eb0) server 127.0.0.1:35222 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:03.965220    24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x00000000036730f0) server 127.0.0.1:35256 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:06.865044    22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x0000000003673e70) server 127.0.0.1:35348 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:10.164521    24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x0000000003673570) server 127.0.0.1:35494 => 127.0.0.1:7100 - it has been idle for 65.0007s
I0325 14:01:13.964634    24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x0000000003793450) server 127.0.0.1:35620 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:17.564548    23 reactor.cc:450] Master_R002: DEBUG: Closing idle connection: Connection (0x000000000386b690) server 127.0.0.1:35210 => 127.0.0.1:7100 - it has been idle for 65.0998s
W0325 14:01:20.451894    36 cluster_balance.cc:342] Skipping load balancing 614f8afa1bd744b69079a6e0f1c64f55: Leader not ready to serve requests. (../../src/yb/master/cluster_balance_util.h:341): Master leader has not yet received heartbeat from ts d443ab177bf3481cb0786cbfd3b254f6, either master just became leader or a network partition.

Can you check if all the yb-tserver pods are running? kubectl get pods -n <namespace> should give us that list. Are there any crashing yb-tservers? Also, it seems like the leader is failing to resolve yb-master-1; can you check if that pod is running?
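
For example, something along these lines (the namespace and the yb-master container name are guesses here, and this assumes nslookup is available in the image):

# List all YugabyteDB pods with their status and restart counts:
kubectl get pods -n yugabytedb

# Check whether yb-master-1's DNS name (from the log lines above) resolves from another master pod:
kubectl exec -n yugabytedb yb-master-0 -c yb-master -- nslookup yb-master-1.yb-masters.yugabytedb.svc.cluster.local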

Hi,

kubectl -n yugabyte get pods returns the following:

NAME           READY   STATUS    RESTARTS   AGE
yb-master-0    3/3     Running   0          24d
yb-master-1    3/3     Running   0          3d17h
yb-master-2    3/3     Running   1          25d
yb-tserver-0   3/3     Running   0          3d17h
yb-tserver-1   3/3     Running   1          24d
yb-tserver-2   3/3     Running   0          22d
yb-tserver-3   3/3     Running   1          25d
yb-tserver-4   3/3     Running   1          25d
yb-tserver-5   3/3     Running   0          23d
yb-tserver-6   3/3     Running   1          25d
yb-tserver-7   3/3     Running   0          3d17h

Right now it seems that all relevant pods are running, but there was an issue with yb-master-1 earlier.