Observations of YugaByte during a k8s rolling upgrade

Hi there,

I was testing YugaByte through a rolling upgrade of the k8s cluster it’s deployed on, and here are some findings:

  • Read/write performance degrades during the k8s upgrade (see Journal below).
  • Records are missing: the app claimed 720,000 write ops were done, but there are only 719,998 rows in the table.

Command I used:

export TSERVER=XXX.XXX.XXX.XXX
export WRITE_NUM=720000
export READ_NUM=14400000
java -jar yb-sample-apps.jar \
  --workload CassandraKeyValue \
  --num_unique_keys ${WRITE_NUM} \
  --num_writes ${WRITE_NUM} \
  --num_threads_write 1 \
  --num_reads ${READ_NUM} \
  --num_threads_read 1 \
  --nodes ${TSERVER}:9042 > test.log

Journal

11:35 upgrade errand starts
11:36 master down
11:38 master up
11:38 worker node rolling upgrade starts; app read/write speed drops to 0.2 reads/sec, 0.2 writes/sec
11:46 nodes upgrade done
11:55 errand finished
11:56 read/write speed recovers to around 500 reads/sec, 280 writes/sec; left running until the target reads (14,400,000) and writes (720,000) finish
00:22 720,000 writes done, 719,998 records in the table. Sample from log:
Read: 475.31 ops/sec (2.10 ms/op), 2641092 total ops | Write: 0.00 ops/sec (0.00 ms/op), 720000 total ops | Uptime: 6346154 ms | maxWrittenKey: 719999 | maxGeneratedKey: 720000 |
01:04 Stopped the app. Still 719,998 records in the table.
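
For reference, one way to double-check the row count from a YCQL shell is shown below. The keyspace/table name is an assumption; the sample app’s default may differ, so it is worth confirming with DESCRIBE TABLES first.

# Assumption: the CassandraKeyValue workload wrote to ybdemo_keyspace.cassandrakeyvalue
cqlsh ${TSERVER} 9042 -e "SELECT COUNT(*) FROM ybdemo_keyspace.cassandrakeyvalue;"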

Could I get some explanation of the observations above, in terms of the performance degradation and the missing records?

Thanks

Hi @qshao,

Some quick questions:

  1. To confirm: when you say k8s rolling upgrade, this is an upgrade of the k8s software itself, rather than an upgrade of the YB software to a newer release on a k8s cluster, correct?

  2. Are you using anti-affinity, so that each yb-tserver is on a different VM/node and we don’t have multiple yb-tservers/yb-masters down at the same time during a roll of each k8s node?

regards,
Kannan

Hi @qshao,

One more question: what do you observe when one yb-tserver pod is restarted?

What we have seen is that if the nodes are restarted with enough of a time window in between the restarts, the app recovers gracefully.
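
As a quick way to try this, a single yb-tserver pod can be restarted manually while the workload is running. This assumes the default StatefulSet pod naming from the Helm chart; the pod name and namespace may differ in your deployment.

# Delete one yb-tserver pod; the StatefulSet controller recreates it
kubectl delete pod yb-tserver-0
# Watch the pod come back while the sample app keeps running
kubectl get pods -w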

Note that the ops/sec going lower is not a problem in real life; these ops will all succeed, but will appear to have higher latency.

  1. Yes
  2. I used the Enterprise version without specifying anti-affinity rules, but each pod does get scheduled on a different k8s worker node both before and after the upgrade (a quick check is shown below).
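
One way to verify that placement (the grep pattern assumes the default pod names from the chart):

# Show which worker node each YugaByte pod is scheduled on
kubectl get pods -o wide | grep yb-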

Yes, all pods are rescheduled and restarted successfully.
I agree that the ops/sec decrease is not a problem; I’m more concerned about the missing/lost data. I’m not sure whether it’s caused by YB itself or by an error introduced by the app.

Updating this thread with the summary of an offline conversation.

It looks like there could be a window during a rolling restart of the underlying Kubernetes cluster in which all nodes are offline.

@qshao mentioned it would be good to look at Pod Disruption Budgets (a sketch is included after the list below), since in theory they can protect the quorum better in the scenario of a PKS cluster rolling upgrade. A key difference between a rolling update of a StatefulSet and an upgrade of the k8s cluster nodes is the following:

  • StatefulSets wait for the previous pod to be running before the next pod is upgraded/updated. This ensures that the quorum has at most one non-operational member at any point in time.
  • A k8s node upgrade does not wait for the previous pod, which can result in multiple pods being down at the same time. So, if any pod has a longer startup time, there is a chance of losing the majority, which could be the case if there is a workload running.
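
As a sketch, a Pod Disruption Budget for the yb-tserver pods could be created along these lines. This assumes the pods carry an app=yb-tserver label, which depends on how the chart labels them, and note that a PDB only limits voluntary disruptions such as node drains.

# Allow at most one yb-tserver to be unavailable due to voluntary disruptions (e.g. node drains)
kubectl create poddisruptionbudget yb-tserver-pdb \
  --selector=app=yb-tserver \
  --max-unavailable=1

With a budget like this in place, draining a worker node waits until the previously evicted yb-tserver is back and ready before evicting the next one, which keeps the Raft quorum intact, provided the upgrade tooling drains nodes rather than deleting them outright.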