Timeout executing DDL

I deployed YugabyteDB 2.9.0 to five on-prem servers (96 cores, 377 GB of memory, 8 SSDs):
3 masters and 6 t-servers using 6 SSDs for fs_data_dirs, 1 SSD for fs_wal_dirs, and 1 SSD for log_dir

I connect using ysqlsh, and when I attempt to create a new schema or table, the client times out. I have tried raising the timeout to 3 minutes, but it still times out.

Here are the commands for the yb-master servers:

./bin/yb-master \
--master_addresses node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node1:7100 \
--server_broadcast_addresses node1:7100 \
--durable_wal_write=true

./bin/yb-master \
--master_addresses node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node2:7100 \
--server_broadcast_addresses node2:7100 \
--durable_wal_write=true

./bin/yb-master \
--master_addresses node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node3:7100 \
--server_broadcast_addresses node3:7100 \
--durable_wal_write=true

All master servers start up and from the Admin UI I can see all three masters (1 leader and 2 followers).

And the t-servers (port 9100 was in use so I opted to use 9200):

./bin/yb-tserver \
--tserver_master_addrs node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node1:9200 \
--server_broadcast_addresses node1:9200 \
--durable_wal_write=true

./bin/yb-tserver \
--tserver_master_addrs node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node2:9200 \
--server_broadcast_addresses node2:9200 \
--durable_wal_write=true

./bin/yb-tserver \
--tserver_master_addrs node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node3:9200 \
--server_broadcast_addresses node3:9200 \
--durable_wal_write=true

./bin/yb-tserver \
--tserver_master_addrs node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node4:9200 \
--server_broadcast_addresses node4:9200 \
--durable_wal_write=true

./bin/yb-tserver \
--tserver_master_addrs node1:7100,node2:7100,node3:7100 \
--fs_data_dirs "/data01,/data02,/data03,/data04,/data05,/data06" \
--fs_wal_dirs /data07 \
--log_dir /data08 \
--rpc_bind_addresses node5:9200 \
--server_broadcast_addresses node5:9200 \
--durable_wal_write=true

As I add each t-server, I can see in the master logs that each new t-server registers. However, once I get to the 3rd t-server, the master logs complain about “Delete Tablet RPC failed for tablet e0849349****: Network error: Connect timeout, passed: 15s” and “Create Tablet RPC failed for tablet ****: Network error: Connect timeout, passed: 15s.”, as well as “Hinted Leader Start Election RPC failed” with a similar message.

Any ideas as to what my issue could be? These are CentOS servers.

Forgot to mention that if I only use one master and one t-server things work as expected… I can create a new schema and table.

Hi @Rocco

I believe you can remove fs_wal_dirs.
The WAL will then be spread across all the SSDs, which should be faster than keeping it on a single SSD.

Also remove log_dir, since you don’t need to dedicate one SSD just to logs.
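
As a rough sketch of what a t-server launch might look like with both flags dropped: the helper below only prints the command line, it does not start anything. Folding /data07 and /data08 into fs_data_dirs is an assumption (the thread does not say what to do with the two freed SSDs), and tserver_cmd with its port argument is a hypothetical helper for illustration. As far as I know, the YugabyteDB flag reference lists fs_data_dirs as the default for both fs_wal_dirs and log_dir when they are not set, but check the docs for your version.

```shell
#!/usr/bin/env bash
# Sketch only: prints a yb-tserver command line, does not run it.

# Comma-separated lists must not contain spaces, or the shell splits them
# into separate arguments.
MASTERS="node1:7100,node2:7100,node3:7100"
# Assumption: /data07 and /data08 are reused as ordinary data dirs.
DATA_DIRS="/data01,/data02,/data03,/data04,/data05,/data06,/data07,/data08"

# tserver_cmd NODE PORT (hypothetical helper): print the command for one node.
# No --fs_wal_dirs or --log_dir is emitted, so both use their defaults.
tserver_cmd() {
  node="$1"
  port="$2"
  echo "./bin/yb-tserver" \
    "--tserver_master_addrs ${MASTERS}" \
    "--fs_data_dirs ${DATA_DIRS}" \
    "--rpc_bind_addresses ${node}:${port}" \
    "--server_broadcast_addresses ${node}:${port}" \
    "--durable_wal_write=true"
}

# Example: tserver_cmd node1 9100
# (9100 is the default t-server RPC port; substitute whichever port is open.)
```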

Can you paste a screenshot of http://node1:7000 and http://node1:7000/tablet-servers ?

Stop everything. Just clean the data-dirs for all processes and start the 3 masters again. Post the first screenshot. And then start the yb-tservers and post the 2nd screenshot. And then try creating the table…
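
The clean-up step could be scripted along these lines. This is a cautious sketch, not an official procedure: it assumes every yb-master and yb-tserver process has already been stopped, and that each process keeps its state in a yb-data subdirectory of the directories passed to fs_data_dirs and fs_wal_dirs (worth verifying on disk first). clean_yb_data and the CONFIRM variable are made-up names. By default it only prints what it would delete.

```shell
#!/usr/bin/env bash
# clean_yb_data DIR... (hypothetical helper)
# Dry run by default: lists the yb-data subdirectory of each DIR.
# With CONFIRM=1 it deletes those subdirectories instead.
# Assumption: YugabyteDB state lives in DIR/yb-data; check before deleting.
clean_yb_data() {
  for dir in "$@"; do
    target="${dir}/yb-data"
    # Skip directories that hold no yb-data subdirectory.
    [ -d "$target" ] || continue
    if [ "${CONFIRM:-0}" = "1" ]; then
      rm -rf -- "$target" && echo "removed ${target}"
    else
      echo "would remove ${target}"
    fi
  done
}

# Example, on each node after stopping all processes:
#   clean_yb_data /data01 /data02 /data03 /data04 /data05 /data06 /data07
#   CONFIRM=1 clean_yb_data /data01 /data02 /data03 /data04 /data05 /data06 /data07
```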

You can use this guide: Deploy | YugabyteDB Docs

I figured out my issue… port 9200 was not opened on my t-servers so I chose an open port to use and now it all works as expected.
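
For anyone hitting the same “Connect timeout” errors, a quick reachability check from each node to every other node's RPC ports can confirm a blocked port before digging through logs. Below is a minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command. check_port and check_nodes are hypothetical helpers, and the node names and ports follow the thread.

```shell
#!/usr/bin/env bash
# check_port HOST PORT [TIMEOUT_SECONDS]
# Succeeds only if a TCP connection to HOST:PORT can be opened in time.
check_port() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_nodes PORT HOST...
# Prints OPEN or CLOSED for PORT on each host. CLOSED covers both a
# refused connection and a timeout (for example, a firewall dropping packets).
check_nodes() {
  port="$1"
  shift
  for host in "$@"; do
    if check_port "$host" "$port"; then
      echo "OPEN   ${host}:${port}"
    else
      echo "CLOSED ${host}:${port}"
    fi
  done
}

# Example, run from every node in turn:
#   check_nodes 7100 node1 node2 node3                # master RPC
#   check_nodes 9200 node1 node2 node3 node4 node5    # t-server RPC as configured above
```

If a port shows CLOSED even though the process is listening on it, a host firewall is a likely suspect. On CentOS with firewalld, something like `firewall-cmd --permanent --add-port=9200/tcp` followed by `firewall-cmd --reload` should open it, assuming firewalld is what is blocking the port.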

Thanks for your time.