[RFC] Running YugaByte DB on Kubernetes using StatefulSets

This topic assumes you are familiar with the architecture of YugaByte DB (worth reviewing first in case you are new).

Goal

This is the first step in a series of steps to get YugaByte DB running natively on Kubernetes. A lot of this is based on my understanding of Kubernetes, which is not extensive by any means :slight_smile:.

Specifically, here are the basic scenarios we would like to address first:

  1. Create a YB universe on Kubernetes with a replication factor (rf) of 1, 3, 5, etc.
  2. Scale the universe up (the equivalent of adding nodes)
  3. Scale the universe down (the equivalent of removing nodes)
  4. Survive failures (up to the fault tolerance implied by the replication factor)

Design

There will be two services built using StatefulSets: YB-Master and YB-Tserver.

YB-Master

  • This will be the first StatefulSet
  • In addition to the StatefulSet, there will be a YB-Master headless service (needed for StatefulSets) and an endpoint Service (seemingly needed if other services need access to this service for RPC calls)
  • The pods will be started in parallel using podManagementPolicy: "Parallel", as we want to trigger a leader election right away
  • The masters should discover each other using the CNAME of the YB-Master headless service, which would list all the pods in that service. This is an enhancement needed in YB. A sketch of the manifests follows the quoted documentation below.

This is based on the following Kubernetes documentation:

The CNAME of the headless service points to SRV records (one for each Pod that is Running and Ready). The SRV records point to A record entries that contain the Pods’ IP addresses.
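
To make this concrete, here is a minimal sketch of what the YB-Master headless Service and StatefulSet could look like. The service name, namespace, image, and flag values are assumptions for illustration, and pointing --master_addresses at the headless service CNAME is the proposed YB enhancement above, not current behavior:

```yaml
# Headless Service: gives each master pod a stable DNS name and makes the
# service CNAME resolve to the ready master pods (assumed name: yb-masters).
apiVersion: v1
kind: Service
metadata:
  name: yb-masters
spec:
  clusterIP: None                # headless
  selector:
    app: yb-master
  ports:
    - name: rpc
      port: 7100                 # YB-Master RPC port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-master
spec:
  serviceName: yb-masters        # ties the pods to the headless service above
  replicas: 3                    # rf = 3
  podManagementPolicy: Parallel  # start all masters at once to trigger leader election
  selector:
    matchLabels:
      app: yb-master
  template:
    metadata:
      labels:
        app: yb-master
    spec:
      containers:
        - name: yb-master
          image: yugabytedb/yugabyte   # assumed image
          command:
            - "/home/yugabyte/bin/yb-master"
            # Assumption: the masters can resolve each other through the
            # headless service CNAME (the YB enhancement mentioned above)
            # instead of a static comma-separated host list.
            - "--master_addresses=yb-masters.default.svc.cluster.local:7100"
            - "--replication_factor=3"
          ports:
            - name: rpc
              containerPort: 7100
```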

YB-Tserver

  • This will be the second StatefulSet
  • As before, there will be a YB-Tserver headless service (needed for StatefulSets) and an endpoint Service (seemingly needed if other services need access to this service for RPC calls), in addition to the StatefulSet
  • The podManagementPolicy is less important here, but the plan is to set it to podManagementPolicy: "Parallel"
  • The tservers will discover the masters using the CNAME of the YB-Master headless service (see the sketch below)
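
A corresponding sketch for the YB-Tserver StatefulSet, again with assumed names and image. The key point is that --tserver_master_addrs points at the YB-Master headless service name rather than a static list of master hosts:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-tserver
spec:
  serviceName: yb-tservers       # assumed name of the tserver headless service
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: yb-tserver
  template:
    metadata:
      labels:
        app: yb-tserver
    spec:
      containers:
        - name: yb-tserver
          image: yugabytedb/yugabyte   # assumed image
          command:
            - "/home/yugabyte/bin/yb-tserver"
            # Discover the masters via the YB-Master headless service CNAME.
            - "--tserver_master_addrs=yb-masters.default.svc.cluster.local:7100"
          ports:
            - name: rpc
              containerPort: 9100      # YB-Tserver RPC port
```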

Open Questions

Planning to figure out some of these answers. Any help appreciated!

  1. When a new StatefulSet service is created, the CNAME of the headless service seems to list the pods that are running. If some pods come up more slowly, will the CNAME return only a partial list of pods?

  2. In the above scenario, is there some mechanism to wait for all the pods to come up before starting the YB-Master? If not, is there a state change event for each new pod coming up?

  3. Can we start the master StatefulSet before the tserver StatefulSet from a single YAML file? In other words, can we make some services dependent on other services being up?

  4. Does NTP work inside containers? If so, can we assume containers on the same physical host have synchronized clocks, while the remote ones would need NTP synchronization?

  5. When a pod is scheduled to be decommissioned, YugaByte DB drains the data and iops from it before removing it from the universe (and before k8s destroys it). How can this be achieved using Kubernetes? (One possible approach is sketched after this list.)

  6. This may be a basic one but seems very hard for me to figure out: What is the paradigm for exposing StatefulSet pods to the outside world? The YugaByte smart client supports node-level locality when performing IO for low-latency operations - it “knows” the correct node to go to based on the key, as opposed to asking a random node.
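
For question 5, one candidate mechanism (an assumption on my part, not a verified answer) is a preStop lifecycle hook: Kubernetes runs it before sending SIGTERM to the container, and terminationGracePeriodSeconds bounds how long the drain may take. The drain-data.sh script below is hypothetical; YugaByte DB would need to expose a command that moves tablets and iops off the node and blocks until the drain completes:

```yaml
# Fragment of the yb-tserver pod template (question 5).
spec:
  terminationGracePeriodSeconds: 600   # allow up to 10 minutes for the drain
  containers:
    - name: yb-tserver
      lifecycle:
        preStop:
          exec:
            # drain-data.sh is a placeholder for a YB-provided drain command.
            command: ["/home/yugabyte/bin/drain-data.sh"]
```

One caveat: if the drain takes longer than the grace period, Kubernetes kills the pod anyway, so the timeout would have to be generous.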

Future Work

There is a lot more on this list, but just noting these as concerns to address down the road once the basic service is working.

  • Cross-AZ, multi-region and multi-cloud deployments
  • Running in public clouds with anti-affinity rules
  • Changing the replication factor of a running universe