Can anyone help me troubleshoot this issue?
I have 3 Kubernetes clusters, and in each one I deployed YugabyteDB with 1 master and 1 tserver, then connected all three into a single universe.
Each master is advertised through a Kubernetes Service LoadBalancer IP.
However, I am seeing extremely slow master RPC calls. For example, this command takes a long time to return:
yb-admin --master_addresses 10.96.196.75:7100,10.96.226.47:7100,10.91.193.215:7100 list_all_masters
Output:
Master UUID RPC Host/Port State Role Broadcast Host/Port
91e8753df07b43bd8e0f5130e97ee56c 10.42.26.21:7100 ALIVE FOLLOWER 10.96.196.75:7100
1ed16710628543d7b351fd4a0c2e3018 10.250.1.219:7100 ALIVE FOLLOWER 10.96.226.47:7100
d08514bcae854809aeb9b5e9487e897d 10.250.1.135:7100 ALIVE LEADER 10.91.193.215:7100
Additional details
- Broadcast Host/Port values are routable private IPs (we are running on a private cloud).
- RPC Host/Port values are pod IPs, which cannot be reached from the other clusters. Perhaps they do not need to be reachable from the other Kubernetes clusters?
As a result:
- list_all_masters and other yb-admin commands are extremely slow
- psql connections disconnect constantly
- General cross-cluster communication is unstable
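To confirm which of these addresses are actually reachable across clusters, I can run a quick TCP check from a pod in a different cluster. This is just a sketch using bash's built-in /dev/tcp (so it works even in minimal pod images); the IPs are the pod IPs and LoadBalancer IPs from the yb-admin output above:

```shell
#!/usr/bin/env bash
# check_port: attempt a TCP connect with a short timeout.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} UNREACHABLE"
    return 1
  fi
}

# Pod IPs (RPC Host/Port) vs. LoadBalancer IPs (Broadcast Host/Port)
# taken from the yb-admin output above -- substitute your own.
for addr in 10.42.26.21:7100 10.250.1.219:7100 10.250.1.135:7100 \
            10.96.196.75:7100 10.96.226.47:7100 10.91.193.215:7100; do
  check_port "${addr%%:*}" "${addr##*:}" || true
done
```

If the pod IPs show UNREACHABLE while the LoadBalancer IPs connect, that would match the slow-RPC symptom: callers first try the unreachable RPC address and only fall back after a timeout.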
One of my master configurations looks like this:
--fs_data_dirs=/mnt/disk0
--master_addresses=10.96.196.75:7100,10.96.226.47:7100,10.91.193.215:7100
--replication_factor=3
--enable_ysql=true
--master_enable_metrics_snapshotter=true
--metrics_snapshotter_tserver_metrics_whitelist=handler_latency_yb_tserver_TabletServerService_Read_count,handler_latency_yb_tserver_TabletServerService_Write_count,handler_latency_yb_tserver_TabletServerService_Read_sum,handler_latency_yb_tserver_TabletServerService_Write_sum,disk_usage,cpu_usage,node_up
--metric_node_name=${EXPORTED_INSTANCE}
--memory_limit_hard_bytes=1824522240
--stderrthreshold=0
--num_cpus=2
--max_log_size=256
--undefok=num_cpus,enable_ysql
--use_node_hostname_for_local_tserver=true
--rpc_bind_addresses=${HOSTNAME}.yb-masters.${NAMESPACE}.svc.cluster.local
--server_broadcast_addresses=${HOSTNAME}.yb-masters.${NAMESPACE}.svc.cluster.local:7100
--webserver_interface=0.0.0.0
--default_memory_limit_to_ram_ratio=0.85
--leader_failure_max_missed_heartbeat_periods=10
--max_clock_skew_usec=10000000
--placement_cloud=rancher
--placement_region=ca-west-1
--placement_zone=A
--rpc_bind_addresses=${POD_IP}
--server_broadcast_addresses=10.96.196.75:7100
--use_node_hostname_for_local_tserver=false
--use_private_ip=never
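Note that some flags (rpc_bind_addresses, server_broadcast_addresses, use_node_hostname_for_local_tserver) appear twice in that list. To double-check which value the running master actually uses, I pull the effective settings from the master web UI. A sketch, assuming the default web UI port 7000 and that gflags-based servers keep the last occurrence of a repeated flag:

```shell
#!/usr/bin/env bash
# extract_flag: print the last --<name>=<value> occurrence from stdin
# (with repeated gflags, the last occurrence is normally the one in effect).
extract_flag() {
  grep -oE -- "--$1=[^[:space:]]*" | tail -n 1
}

# Usage, run from somewhere the master web UI is reachable
# (10.96.196.75 is one LoadBalancer IP from above -- substitute your own):
#   curl -s http://10.96.196.75:7000/varz | extract_flag rpc_bind_addresses
#   curl -s http://10.96.196.75:7000/varz | extract_flag server_broadcast_addresses
```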
I suspect the combination of:
- pod IPs being used as the RPC Host/Port (not reachable across clusters),
- LoadBalancer IPs being used for broadcast,
- and --use_private_ip=never
is causing the extremely slow RPC behavior.
Any suggestions on how to correctly configure cross-cluster masters/tservers or how to fix the RPC routing would be greatly appreciated.