Cannot add read replica node

I’m trying to add one read replica in a west region to an existing YugabyteDB cluster in an east coast region.
I generated the certificate for the new replica’s IP on the master node; the master host is xx.xx.xx.96.

/opt/yugabyte/bin/yugabyted cert generate_server_certs --hostnames=xx.xx.xx.46

I copied the certs to the replica’s certs dir and tried to start the replica the same way I started the other nodes, but it errored out.

yugabyted start --secure --certs_dir=/db/yugabyte/certs --cloud_location=xx.xx.rr-1a --base_dir=/db/yugabyte --advertise_address=xx.xx.xx.46 --join=xx.xx.xx.96 --fault_tolerance=region

On the master, I ran the yb-admin command:

/opt/yugabyte/bin/yb-admin --certs_dir_name /db/yugabyte/certs --master_addresses xx.xx.xx.96:7100,xx.xx.xx.101:7100,xx.xx.xx.93:7100  add_read_replica_placement_info xx.xx.rr-1a:1 1 rr

The yugabyted start command on the replica failed with:

Fetching configs from join IP...
Starting yugabyted...
/ Starting the YugabyteDB Processes...Failed to setup master. Exception: Traceback (most recent call last):
  File "/opt/yugabyte/bin/yugabyted", line 4565, in setup_master
    master_uuids = retry_op_with_argument(self.get_master_uuids, master_addrs)

The log references the list_all_masters command, and it looks like that command could not run.

[yugabyted start]  INFO:  | 189.5s | run_process: ['/opt/yugabyte-2024.1.1.0/bin/yb-admin', '--certs_dir_name=/db/yugabyte/certs', 
'--master_addresses', 'xx.xx.xx.96:7100,xx.xx.xx.101:7100,xx.xx.xx.93:7100', 'list_all_masters'] timeout expired for command:
[yugabyted start] 2024-11-06 14:54:07,450 ERROR:  | 189.5s | Failed to setup master. Exception: Traceback (most recent call last):

If I run the list_all_masters command on the replica, I get:

Timed out (yb/rpc/rpc.cc:223): Unable to establish connection to leader master at [xx.xx.xx.96:7100,xx.xx.xx.101:7100,xx.xx.xx.93:7100]. Please verify the addresses and check if server is up,
 or if you're missing --certs_dir_name.
: Could not locate the leader master: GetLeaderMasterRpc(addrs: [10.29.196.96:7100, 10.29.196.101:7100, 10.161.129.93:7100], num_attempts: 249) passed its deadline 15146.954s 
(passed: 60.031s): Network error (yb/rpc/secure_stream.cc:914): Handshake failed: Network error (yb/rpc/secure_stream.cc:1129): Unverified certificate: certificate signature failure, 
address: xx.xx.xx.101, hostname: xx.xx.xx.101
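
The "Unverified certificate: certificate signature failure" part of this error suggests the TCP connection is being made but the peer's certificate does not chain to the CA the replica trusts. A minimal sketch for checking a certificate against a CA file with openssl verify follows. The function name verify_node_cert is made up for illustration, and the file names in the usage comment (ca.crt, node.<ip>.crt) are assumptions about the certs dir layout; adjust them to what is actually on disk.

```shell
#!/bin/sh
# verify_node_cert CA_FILE CERT_FILE   (hypothetical helper name)
# Exits 0 only if CERT_FILE validates against the CA certificate in CA_FILE.
verify_node_cert() {
  openssl verify -CAfile "$1" "$2" >/dev/null 2>&1
}

# Example (paths and file names are assumptions):
# copy a node certificate from an existing cluster node to /tmp on the replica,
# then check it against the ca.crt the replica is using.
#   verify_node_cert /db/yugabyte/certs/ca.crt /tmp/node.xx.xx.xx.101.crt \
#     && echo "trusted by the replica's CA" \
#     || echo "NOT signed by the replica's CA"
```

If the check fails for a certificate taken from a working node, the replica's ca.crt most likely comes from a different root CA than the rest of the cluster.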

Hi @tryyuga

What version are you using?

I think you used the incorrect command compared to the docs: yugabyted reference | YugabyteDB Docs

Did you get a more complete error here?

Hi dorian_yugabyte,
Version is YugabyteDB Version 2024.1.1.0.

"I think you used the incorrect command compared to docs: yugabyted reference | YugabyteDB Docs"
You are right. I reran the command with the missing parameter, but now I get the error below:
yugabyted start --secure --certs_dir=/db/yugabyte/certs --cloud_location=xx.xx.rr-1a --base_dir=/db/yugabyte --advertise_address=xx.xx.xx.46 --join=xx.xx.xx.96 --read_replica

This time the failure is on list_all_tablet_servers instead of list_all_masters:

[yugabyted start] 2024-11-07 09:08:13,445 INFO:  | 180.7s | run_process: ['/opt/yugabyte-2024.1.1.0/bin/yb-admin', '--certs_dir_name=/db/yugabyte/certs', '--master_addresses', 'xx.xx.xx.93:7100,xx.xx.xx.101:7100,xx.xx.xx.96:7100', 'list_all_tablet_servers'] timeout expired for command:
[yugabyted start] 2024-11-07 09:08:13,455 INFO:  | 180.7s | wait_tserver: exception: Traceback (most recent call last):
  File "/opt/yugabyte/bin/yugabyted", line 4658, in wait_tserver
    if retry_op_with_argument(self.is_tserver_up, cluster_type, timeout):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/yugabyte/bin/yugabyted", line 8569, in retry_op_with_argument
    raise RuntimeError("Failed after retrying operation for {} secs.".format(
RuntimeError: Failed after retrying operation for 180.06600880622864 secs.

I ran the list_all_tablet_servers command directly on the new replica and got the same error; on the existing nodes it ran fine.
Is this a network issue? The admin told me they don’t see any firewall issues.
I’m not sure how to debug this.
/opt/yugabyte/bin/yb-admin --certs_dir_name /db/yugabyte/certs --master_addresses xx.xx.xx.96:7100,xx.xx.xx.196.99:7100,xx.xx.xx.93:7100 list_all_tablet_servers
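
One way to separate a plain network problem from a TLS problem is to test raw TCP reachability of the master RPC port (7100, as used in the commands above) from the replica. A rough sketch, assuming bash and the coreutils timeout command are installed; port_open is a made-up helper name, not a YugabyteDB tool. Note that a "Handshake failed" error like the one earlier in this thread already implies the TCP connection itself was established.

```shell
#!/bin/sh
# port_open HOST PORT   (hypothetical helper name)
# Exits 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash must be available.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example: check each master's RPC port from the replica.
#   for h in xx.xx.xx.96 xx.xx.xx.101 xx.xx.xx.93; do
#     port_open "$h" 7100 && echo "$h:7100 reachable" || echo "$h:7100 NOT reachable"
#   done
```

If every port is reachable but yb-admin still times out, the certificates are a more likely culprit than the firewall.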

Please paste both the full command & full response/error that you got.

command:

  /opt/yugabyte/bin/yugabyted start --secure --certs_dir=/db/yugabyte/certs --cloud_location=xx.xx.west_rr-1a --base_dir=/db/yugabyte --advertise_address=xxx.46 --join=xxx.96 --read_replica

response:

Fetching configs from join IP...
Starting yugabyted...
❌ Database failed to start
Failed to start tserver yugabyted
For more information, check the logs in /db/yugabyte/logs

/db/yugabyte/logs/yugabyted.log:

[yugabyted start] 2024-11-08 10:03:36,454 INFO:  | 180.7s | run_process: ['/opt/yugabyte-2024.1.1.0/bin/yb-admin', '--certs_dir_name=/db/yugabyte/certs', '--master_addresses', 'xxx.93:7100,xxs.101:7100,xxx.96:7100',
 'list_all_tablet_servers'] timeout expired for command:
[yugabyted start] 2024-11-08 10:03:36,463 INFO:  | 180.7s | wait_tserver: exception: Traceback (most recent call last):
  File "/opt/yugabyte/bin/yugabyted", line 4658, in wait_tserver
    if retry_op_with_argument(self.is_tserver_up, cluster_type, timeout):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/yugabyte/bin/yugabyted", line 8569, in retry_op_with_argument
    raise RuntimeError("Failed after retrying operation for {} secs.".format(
RuntimeError: Failed after retrying operation for 180.0666904449463 secs.

[yugabyted start] 2024-11-08 10:03:36,463 INFO:  | 180.7s | Failed to wait for tserver.
[yugabyted start] 2024-11-08 10:03:36,552 ERROR:  | 180.8s | Failed to start tserver yugabyted
For more information, check the logs in /db/yugabyte/logs
[yugabyted start] 2024-11-08 10:03:36,552 INFO:  | 180.8s | Shutting down...

@tryyuga can you please run yugabyted collect_logs --base_dir=/db/yugabyte on your read replica node, upload the resulting archive to cloud storage, and share the link?

dorian_yugabyte, here is the link. Let me know if you can access it.

The file is essentially empty (127 bytes). Maybe the upload went wrong?

OK, sorry to waste your time, but I figured it out after completely redoing my test nodes.
I had created the certs on a different “master” than the one I was supposed to use. That may be why the replica couldn’t reach the cluster; it did give me cert errors sporadically. I created the certs on xxx.xxx.96, copied them over to the replica, ran the same start command for the replica, and it started.
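
For anyone hitting the same thing: a quick way to catch certs generated on the wrong node is to confirm that the root CA file on the replica is byte-identical to the one on the existing nodes. A small sketch, assuming the CA file is named ca.crt inside the certs dir (an assumption about the layout) and that sha256sum is available; same_file_digest is a made-up helper name.

```shell
#!/bin/sh
# same_file_digest FILE_A FILE_B   (hypothetical helper name)
# Exit 0 if both files have the same SHA-256 digest, 1 if they differ,
# 2 if either file is missing or unreadable.
same_file_digest() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  a=$(sha256sum "$1" | cut -d' ' -f1)
  b=$(sha256sum "$2" | cut -d' ' -f1)
  [ "$a" = "$b" ]
}

# Example (paths are assumptions): copy ca.crt from an existing node to
# /tmp/ca.from-master.crt on the replica, then compare.
#   same_file_digest /db/yugabyte/certs/ca.crt /tmp/ca.from-master.crt \
#     && echo "same root CA" || echo "different root CA (or file missing)"
```

Running sha256sum on ca.crt on each node and comparing the output by eye does the same job without copying files around.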
