[RFC] Running YugaByte DB on Kubernetes using StatefulSets

This topic assumes you are familiar with the architecture of YugaByte DB (worth reviewing first in case you are new).

Goal

This is the first step in a series of steps to get YugaByte DB running natively on Kubernetes. A lot of this is based on my understanding of Kubernetes, which is not extensive by any means :slight_smile:.

Specifically, here are the basic scenarios we would like to address first:

  1. Create a YB universe on Kubernetes with a replication factor (rf) of 1, 3, 5, etc.
  2. Scale the universe up (the equivalent of adding nodes)
  3. Scale the universe down (the equivalent of removing nodes)
  4. Survive failures (up to the fault tolerance implied by the replication factor)

Design

There will be two services built using StatefulSets: YB-Master and YB-Tserver.

YB-Master

  • This will be the first StatefulSet
  • In addition to the StatefulSet, there will be a YB-Master headless service (needed for StatefulSets) and an endpoint Service (seemingly needed if other services need access to this service for RPC calls)
  • The pods will be started in parallel using podManagementPolicy: "Parallel", as we want to trigger a leader election right away
  • The masters should discover each other using the CNAME of the YB-Master headless service, which would list all the pods in that service. This is an enhancement needed in YB. A sketch of the manifests follows the quoted documentation below.

This is based on the following Kubernetes documentation:

The CNAME of the headless service points to SRV records (one for each Pod that is Running and Ready). The SRV records point to A record entries that contain the Pods’ IP addresses.
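
To make this concrete, here is a minimal sketch of what the YB-Master headless Service and StatefulSet could look like. The service name, namespace, image, and flag values are assumptions for illustration, and pointing --master_addresses at the headless service CNAME is the proposed YB enhancement above, not current behavior:

```yaml
# Headless Service: gives each master pod a stable DNS name and makes the
# service CNAME resolve to the ready master pods (assumed name: yb-masters).
apiVersion: v1
kind: Service
metadata:
  name: yb-masters
spec:
  clusterIP: None                # headless
  selector:
    app: yb-master
  ports:
    - name: rpc
      port: 7100                 # YB-Master RPC port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-master
spec:
  serviceName: yb-masters        # ties the pods to the headless service above
  replicas: 3                    # rf = 3
  podManagementPolicy: Parallel  # start all masters at once to trigger leader election
  selector:
    matchLabels:
      app: yb-master
  template:
    metadata:
      labels:
        app: yb-master
    spec:
      containers:
        - name: yb-master
          image: yugabytedb/yugabyte   # assumed image
          command:
            - "/home/yugabyte/bin/yb-master"
            # Assumption: the masters can resolve each other through the
            # headless service CNAME (the YB enhancement mentioned above)
            # instead of a static comma-separated host list.
            - "--master_addresses=yb-masters.default.svc.cluster.local:7100"
            - "--replication_factor=3"
          ports:
            - name: rpc
              containerPort: 7100
```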

YB-Tserver

  • This will be the second StatefulSet
  • As before, there will be a YB-Tserver headless service (needed for StatefulSets) and an endpoint Service (seemingly needed if other services need access to this service for RPC calls), in addition to the StatefulSet
  • The podManagementPolicy is less important here, but the plan is to set it to podManagementPolicy: "Parallel"
  • The tservers will discover the masters using the CNAME of the YB-Master headless service (see the sketch below)
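
A corresponding sketch for the YB-Tserver StatefulSet, again with assumed names and image. The key point is that --tserver_master_addrs points at the YB-Master headless service name rather than a static list of master hosts:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-tserver
spec:
  serviceName: yb-tservers       # assumed name of the tserver headless service
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: yb-tserver
  template:
    metadata:
      labels:
        app: yb-tserver
    spec:
      containers:
        - name: yb-tserver
          image: yugabytedb/yugabyte   # assumed image
          command:
            - "/home/yugabyte/bin/yb-tserver"
            # Discover the masters via the YB-Master headless service CNAME.
            - "--tserver_master_addrs=yb-masters.default.svc.cluster.local:7100"
          ports:
            - name: rpc
              containerPort: 9100      # YB-Tserver RPC port
```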

Open Questions

Planning to figure out some of these answers. Any help appreciated!

  1. When a new StatefulSet service is created, the CNAME of the headless service seems to list the pods that are running. If some pods come up more slowly, will the CNAME return only a partial list of pods?

  2. In the above scenario, is there some mechanism to wait for all the pods to come up before starting the YB-Master? If not, is there a state change event for each new pod coming up?

  3. Can we start the master StatefulSet before the tserver StatefulSet from a single YAML file? In other words, can we make some services dependent on other services being up?

  4. Does NTP work inside containers? If so, can we assume containers on the same physical host have synchronized clocks, while the remote ones would need NTP synchronization?

  5. When a pod is scheduled to be decommissioned, YugaByte DB drains the data and iops from it before removing it from the universe (and before k8s destroys it). How can this be achieved using Kubernetes? (One possible approach is sketched after this list.)

  6. This may be a basic one but seems very hard for me to figure out: What is the paradigm for exposing StatefulSet pods to the outside world? The YugaByte smart client supports node-level locality when performing IO for low-latency operations - it “knows” the correct node to go to based on the key, as opposed to asking a random node.
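
For question 5, one candidate mechanism (an assumption on my part, not a verified answer) is a preStop lifecycle hook: Kubernetes runs it before sending SIGTERM to the container, and terminationGracePeriodSeconds bounds how long the drain may take. The drain-data.sh script below is hypothetical; YugaByte DB would need to expose a command that moves tablets and iops off the node and blocks until the drain completes:

```yaml
# Fragment of the yb-tserver pod template (question 5).
spec:
  terminationGracePeriodSeconds: 600   # allow up to 10 minutes for the drain
  containers:
    - name: yb-tserver
      lifecycle:
        preStop:
          exec:
            # drain-data.sh is a placeholder for a YB-provided drain command.
            command: ["/home/yugabyte/bin/drain-data.sh"]
```

One caveat: if the drain takes longer than the grace period, Kubernetes kills the pod anyway, so the timeout would have to be generous.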

Future Work

There is a lot more on this list, but just noting these as concerns to address down the road once the basic service is working.

  • Cross-AZ, multi-region and multi-cloud deployments
  • Running in public clouds with anti-affinity rules
  • Changing the replication factor of a running universe