list_all_masters is very slow across k8s clusters joined by LoadBalancer IPs

Can anyone help me troubleshoot this issue?

I have 3 Kubernetes clusters, and in each one I deployed YugabyteDB with 1 master and 1 tserver, then connected all three into a single universe.

Each master is advertised through a Kubernetes Service LoadBalancer IP.

However, I am seeing extremely slow master RPC calls. For example, this command takes a long time to return:

yb-admin --master_addresses 10.96.196.75:7100,10.96.226.47:7100,10.91.193.215:7100 list_all_masters

Output:

Master UUID                             RPC Host/Port       State     Role      Broadcast Host/Port
91e8753df07b43bd8e0f5130e97ee56c        10.42.26.21:7100    ALIVE     FOLLOWER  10.96.196.75:7100
1ed16710628543d7b351fd4a0c2e3018        10.250.1.219:7100   ALIVE     FOLLOWER  10.96.226.47:7100
d08514bcae854809aeb9b5e9487e897d        10.250.1.135:7100   ALIVE     LEADER    10.91.193.215:7100

🔹 Additional details

  • Broadcast Host/Port values are routable private IPs (we are running on a private cloud).

  • RPC Host/Port values are pod IPs, which cannot be reached from the other clusters. Do they even need to be reachable from the external k8s clusters?

As a result:

  • list_all_masters and other yb-admin commands are extremely slow

  • psql connections disconnect constantly

  • General cross-cluster communication is unstable

🔹 One of my master configurations looks like this:

--fs_data_dirs=/mnt/disk0
--master_addresses=10.96.196.75:7100,10.96.226.47:7100,10.91.193.215:7100
--replication_factor=3
--enable_ysql=true
--master_enable_metrics_snapshotter=true
--metrics_snapshotter_tserver_metrics_whitelist=handler_latency_yb_tserver_TabletServerService_Read_count,handler_latency_yb_tserver_TabletServerService_Write_count,handler_latency_yb_tserver_TabletServerService_Read_sum,handler_latency_yb_tserver_TabletServerService_Write_sum,disk_usage,cpu_usage,node_up
--metric_node_name=${EXPORTED_INSTANCE}
--memory_limit_hard_bytes=1824522240
--stderrthreshold=0
--num_cpus=2
--max_log_size=256
--undefok=num_cpus,enable_ysql
--use_node_hostname_for_local_tserver=true
--rpc_bind_addresses=${HOSTNAME}.yb-masters.${NAMESPACE}.svc.cluster.local
--server_broadcast_addresses=${HOSTNAME}.yb-masters.${NAMESPACE}.svc.cluster.local:7100
--webserver_interface=0.0.0.0
--default_memory_limit_to_ram_ratio=0.85
--leader_failure_max_missed_heartbeat_periods=10
--max_clock_skew_usec=10000000
--placement_cloud=rancher
--placement_region=ca-west-1
--placement_zone=A
--rpc_bind_addresses=${POD_IP}
--server_broadcast_addresses=10.96.196.75:7100
--use_node_hostname_for_local_tserver=false
--use_private_ip=never

I suspect the combination of:

  • pod IPs being used as RPC Host/Port (not reachable across clusters),

  • LoadBalancer IPs being used for broadcast,

  • and --use_private_ip=never

is causing the extremely slow RPC behavior.
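
Related to this, I noticed that several address flags appear twice in the config above (--rpc_bind_addresses, --server_broadcast_addresses, --use_node_hostname_for_local_tserver). Assuming gflags keeps the last value when a flag is repeated, the effective address configuration on this master should be roughly:

--rpc_bind_addresses=${POD_IP}
--server_broadcast_addresses=10.96.196.75:7100
--use_node_hostname_for_local_tserver=false
--use_private_ip=never

which would match the output above: pod IPs under RPC Host/Port, LoadBalancer IPs under Broadcast Host/Port.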

Any suggestions on how to correctly configure cross-cluster masters/tservers or how to fix the RPC routing would be greatly appreciated.

Hi @yulintan

It’s probably these unreachable RPC addresses. Please configure the DBs so they can connect to each other directly.
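
As a quick check (just a sketch; I'm assuming nc is available inside the yb-master image, and yb-master-0 / <namespace> are placeholders for your pod and namespace names), you could test raw TCP reachability from one master pod to the leader's broadcast address and to its RPC (pod) address:

kubectl -n <namespace> exec yb-master-0 -- nc -zv -w 5 10.91.193.215 7100   # leader's Broadcast Host/Port (LoadBalancer IP)
kubectl -n <namespace> exec yb-master-0 -- nc -zv -w 5 10.250.1.135 7100    # leader's RPC Host/Port (pod IP)

If the second connect hangs or times out while the first succeeds, that would point at the pod-IP routing rather than YugabyteDB itself.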

What logs are you getting on the yb-masters & yb-tservers?

Thanks,
I also suspect it’s caused by the RPC IPs (pod IPs) not being reachable from each other.

There are no logs.

Unfortunately, our private cloud does not support exposing pod IPs across clusters. Is there any way to work around this limitation? I can expose each pod using a LoadBalancer IP, but the issue is that RPC cannot bind to a LoadBalancer IP because it’s not a real network interface.
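
In case it helps, this is the kind of per-pod exposure I had in mind (a rough sketch only; the Service name, namespace, and LoadBalancer IP are placeholders, and I'm assuming kubectl expose picks up the per-pod statefulset.kubernetes.io/pod-name label so the Service targets just that one pod):

kubectl -n <namespace> expose pod yb-master-0 --name=yb-master-0-lb --type=LoadBalancer --port=7100 --target-port=7100

and then on that master keep the bind on the pod IP and only broadcast the LoadBalancer IP:

--rpc_bind_addresses=${POD_IP}
--server_broadcast_addresses=<LB IP of yb-master-0-lb>:7100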