Description:
Upgrading Kubernetes from 1.17 to 1.18 to 1.19 leads to underreplicated tablets in YugabyteDB. I would expect YugabyteDB to ensure that the expected number of replicas is maintained.
Any ideas how to fix the underreplicated tablets?
A short snippet from tablet-replication:
Underreplicated Tablets
| Table Name | Table UUID | Tablet ID | Tablet Replication Count |
| --- | --- | --- | --- |
| cassandrakeyvalue | 614f8afa1bd744b69079a6e0f1c64f55 | fddb6ca3592a4bc6bb0013cd29714e44 | 2 |
| cassandrakeyvalue | 614f8afa1bd744b69079a6e0f1c64f55 | a92c1a8a1e4746acb5c421be4a891bbf | 2 |
| cassandrakeyvalue | 614f8afa1bd744b69079a6e0f1c64f55 | b55b6ebb93f74617b121b665b81d5492 | 2 |
| cassandrakeyvalue | 614f8afa1bd744b69079a6e0f1c64f55 | 91ea6f835e4248438530b8972bfbd18c | 2 |
...
Cluster Setup
- Number Master Nodes: 3
- Replication Factor: 3
- Num Nodes (TServers): 11
- Num User Tables: 1
- Is Load Balanced?: No
- YugabyteDB Version: 2.5.3.1
- Build Type: RELEASE
- Installation: Helm single-DC cluster on Kubernetes 1.19 and Istio 1.9
@beatrausch are there any yb-master logs from during this time? They might show why it's not re-replicating.
The same for the yb-tserver logs?
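If it's easier, something along these lines should pull them from the pods (the container names and the namespace here are assumptions based on a standard Helm install, so adjust as needed):
kubectl logs yb-master-0 -c yb-master -n yugabytedb
kubectl logs yb-tserver-0 -c yb-tserver -n yugabytedb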
Hi, I will try to get some logs, but I cannot promise anything, as this happened on our dev stage.
Is there a way to instruct YugabyteDB to create the missing replicas, e.g. via the yb-admin tool?
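For context, I assume the current replicas of one of the affected tablets can at least be inspected with yb-admin, e.g. (the master addresses below are guesses based on the service names in the logs further down, so treat them as placeholders):
yb-admin -master_addresses yb-master-0.yb-masters.yugabytedb.svc.cluster.local:7100,yb-master-1.yb-masters.yugabytedb.svc.cluster.local:7100,yb-master-2.yb-masters.yugabytedb.svc.cluster.local:7100 list_tablet_servers fddb6ca3592a4bc6bb0013cd29714e44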
Unfortunately, there are no logs from the exact point in time. But from the current leader I get the following logs:
I0325 13:59:21.342280 32 catalog_manager.cc:5529] d3d1bdabb110481e928163ca9608eb56 finished first full report: 45 tablets.
W0325 13:59:21.794284 128 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:22.391330 130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:22.906467 130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:23.351598 130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 13:59:23.796777 130 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
…
W0325 14:00:18.731916 154 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 14:00:18.732136 49 consensus_peers.cc:471] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 -> Peer 58aa0a091a1242988bf9a13da9ab6077 ([host: "yb-master-1.yb-masters.yugabytedb.svc.cluster.local" port: 7100], [host: "yb-master-1.yb-masters.yugabytedb.svc.cluster.local" port: 7100]): Couldn't send request. Status: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative). Retrying in the next heartbeat period. Already tried 118 times. State: 2
W0325 14:00:19.314105 154 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
W0325 14:00:19.831149 154 proxy.cc:269] Resolve yb-master-1.yb-masters.yugabytedb.svc.cluster.local failed: Network error (yb/util/net/dns_resolver.cc:64): Resolve failed yb-master-1.yb-masters.yugabytedb.svc.cluster.local: Host not found (authoritative)
I0325 14:00:19.836176 41 raft_consensus.cc:2707] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 [term 11 LEADER]: Leader pre-election vote request: Denying vote to candidate 58aa0a091a1242988bf9a13da9ab6077 for term 12 because replica is either leader or believes a valid leader to be alive. Time left: 9222576832.789s
I0325 14:00:20.337493 41 consensus_queue.cc:1165] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 [LEADER]: Connected to new peer: { peer: 58aa0a091a1242988bf9a13da9ab6077 is_new: 0 last_received: 10.1736 next_index: 1737 last_known_committed_idx: 1736 is_last_exchange_successful: 0 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 0 last_applied: 10.1736 }
I0325 14:00:20.338557 41 log_cache.cc:323] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0: Successfully read 1 ops from disk.
I0325 14:00:20.349576 154 replica_state.cc:1356] T 00000000000000000000000000000000 P 2104441140464eba97f09e8e442509d0 [term 11 LEADER]: Revoked old leader 58aa0a091a1242988bf9a13da9ab6077 ht lease: { physical: 1616680759229453 }
I0325 14:00:25.064741 21 reactor.cc:450] Master_R000: DEBUG: Closing idle connection: Connection (0x00000000040bca30) server 127.0.0.1:33612 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:29.164598 23 reactor.cc:450] Master_R002: DEBUG: Closing idle connection: Connection (0x00000000040bcc70) server 127.0.0.1:33898 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:29.164695 22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x00000000040bcb50) server 127.0.0.1:33896 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:47.365151 22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x00000000040bceb0) server 127.0.0.1:34650 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:47.365190 24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x00000000040bcd90) server 127.0.0.1:34648 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:00:58.965286 23 reactor.cc:450] Master_R002: DEBUG: Closing idle connection: Connection (0x0000000003710a30) server 127.0.0.1:35076 => 127.0.0.1:7100 - it has been idle for 65.0007s
I0325 14:00:59.065079 22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x00000000040bcfd0) server 127.0.0.1:35074 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:03.664491 21 reactor.cc:450] Master_R000: DEBUG: Closing idle connection: Connection (0x0000000003672eb0) server 127.0.0.1:35222 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:03.965220 24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x00000000036730f0) server 127.0.0.1:35256 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:06.865044 22 reactor.cc:450] Master_R001: DEBUG: Closing idle connection: Connection (0x0000000003673e70) server 127.0.0.1:35348 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:10.164521 24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x0000000003673570) server 127.0.0.1:35494 => 127.0.0.1:7100 - it has been idle for 65.0007s
I0325 14:01:13.964634 24 reactor.cc:450] Master_R003: DEBUG: Closing idle connection: Connection (0x0000000003793450) server 127.0.0.1:35620 => 127.0.0.1:7100 - it has been idle for 65.0997s
I0325 14:01:17.564548 23 reactor.cc:450] Master_R002: DEBUG: Closing idle connection: Connection (0x000000000386b690) server 127.0.0.1:35210 => 127.0.0.1:7100 - it has been idle for 65.0998s
W0325 14:01:20.451894 36 cluster_balance.cc:342] Skipping load balancing 614f8afa1bd744b69079a6e0f1c64f55: Leader not ready to serve requests. (../../src/yb/master/cluster_balance_util.h:341): Master leader has not yet received heartbeat from ts d443ab177bf3481cb0786cbfd3b254f6, either master just became leader or a network partition.
Can you check if all the yb-tserver pods are running? kubectl get pods -n <namespace>
should give us that list. Are there any crashing yb-tservers? Also, it seems like the leader is failing to resolve yb-master-1; can you check if that pod is running?
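One way to check the DNS resolution from inside the cluster might be something like the following, assuming the yugabytedb namespace from the DNS names in the logs and that nslookup is available in the tserver image:
kubectl exec -n yugabytedb yb-tserver-0 -c yb-tserver -- nslookup yb-master-1.yb-masters.yugabytedb.svc.cluster.local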
Hi,
kubectl -n yugabyte get pods
returns the following:
NAME READY STATUS RESTARTS AGE
yb-master-0 3/3 Running 0 24d
yb-master-1 3/3 Running 0 3d17h
yb-master-2 3/3 Running 1 25d
yb-tserver-0 3/3 Running 0 3d17h
yb-tserver-1 3/3 Running 1 24d
yb-tserver-2 3/3 Running 0 22d
yb-tserver-3 3/3 Running 1 25d
yb-tserver-4 3/3 Running 1 25d
yb-tserver-5 3/3 Running 0 23d
yb-tserver-6 3/3 Running 1 25d
yb-tserver-7 3/3 Running 0 3d17h
Right now it seems that all relevant pods are running, but there was an issue with yb-master-1.
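If it helps, more details on what happened to yb-master-1 could be pulled with something like:
kubectl -n yugabyte describe pod yb-master-1
kubectl -n yugabyte get events | grep yb-master-1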