11:35 upgrade errand
11:36 master down
11:38 master up
11:38 rolling upgrade of worker nodes starts; app read/write throughput drops to 0.2 reads/sec, 0.2 writes/sec
11:46 worker node upgrade done
11:55 errand finished
11:56 read/write throughput resumes to around 500 reads/sec, 280 writes/sec; left running until the target of 14400000 reads and 7200000 writes completes
00:22 7200000 writes done, 719998 records in the table. Sample from log:
Read: 475.31 ops/sec (2.10 ms/op), 2641092 total ops | Write: 0.00 ops/sec (0.00 ms/op), 720000 total ops | Uptime: 6346154 ms | maxWrittenKey: 719999 | maxGeneratedKey: 720000 |
01:04 Stopped the app. Still 719998 records in the table.
Can I get an explanation of the observations above, specifically the performance degradation and the lost records?
To confirm: when you say k8s rolling upgrade, this is an upgrade of the k8s software itself, rather than an upgrade of the YB software to a newer release on a k8s cluster, correct?
Are you using anti-affinity, where each yb-tserver is on a different VM/node, so that multiple yb-tservers/yb-masters are not down at the same time while each k8s node is rolled?
I used the enterprise version without specifying anti-affinity rules, but each pod does get scheduled on a different k8s worker node both before and after the upgrade.
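(For reference, such anti-affinity rules would look roughly like the sketch below, placed in the yb-tserver StatefulSet pod template. The label app: yb-tserver is an assumption; the actual label depends on the chart/operator in use.)

```yaml
# Rough sketch only: hard pod anti-affinity so that no two yb-tserver pods
# land on the same k8s node. Assumes the pods carry the label app: yb-tserver.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - yb-tserver
        topologyKey: kubernetes.io/hostname
```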
Yes, all pods are rescheduled and restarted successfully.
Agreed that the ops/sec decrease is not a problem. I'm more concerned about the lost/missing data. I'm not sure whether it's caused by YB itself or by an error introduced by the app.
Updating this thread with the summary of an offline conversation.
It looks like there could be a window during a rolling restart of the underlying Kubernetes cluster where all nodes are offline at the same time.
@qshao mentioned it would be good to look at Pod Disruption Budgets, since, at least in theory, they can better protect the quorum during a PKS cluster rolling upgrade (see the sketch after the two points below). A key difference between a rolling update of a StatefulSet and a rolling upgrade of the k8s cluster nodes is the following:
StatefulSets wait for the previous pod to be running before the next pod is upgraded/updated. This ensures that the quorum has at most one non-operational member at any point in time.
A k8s node upgrade does not wait for the previous pod to come back, which can result in multiple pods being down at the same time. So, if any pod has a longer startup time, there is a chance of losing the majority, which is more likely when a workload is running.
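A minimal sketch of such a PodDisruptionBudget, assuming a 3-member master quorum and the label app: yb-master (names and labels are assumptions and will vary by chart/operator):

```yaml
# Rough sketch only: keep at least 2 of 3 yb-master pods running during
# voluntary disruptions (e.g. the node drains performed by a cluster upgrade).
# Use apiVersion policy/v1beta1 on older Kubernetes releases.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: yb-master-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: yb-master
```

A similar budget with maxUnavailable: 1 for the yb-tserver pods would protect the tablet replication quorums in the same way. Note that PDBs only apply to voluntary disruptions that go through the eviction API (which kubectl drain and most managed node upgrades use); disruptions that bypass eviction are not covered.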