Can only connect to 1 box in a 3 box cluster

YugaByte Version 1.2.6
Server: Ubuntu Server 16.04
Client: https://docs.yugabyte.com/latest/develop/build-apps/csharp/

I have a 3 box cluster which I created with the steps from the tutorial here:

https://docs.yugabyte.com/latest/deploy/manual-deployment/start-tservers/

The problem is that I can only connect (on the YCQL endpoint) to one specific box. Is there any extra configuration I need to do to be able to connect to all 3 boxes?

Here is the behavior I’m seeing:

  • With all 3 boxes up and running (both master and tserver).
    ** 192.168.0.101
    ** 192.168.0.102
    ** 192.168.0.103

  • I can connect to all 3 when logged into a box locally

  • However, I can only connect to 192.168.0.103 externally

  • If I stop the services on 192.168.0.103
    ** Then I connect just fine to 192.168.0.102 externally

There is no network issue and the box configuration is the same.

Are there any known issues or anything specific that needs to be done to be able to connect to all 3 boxes?

Note: I don’t see any error. The client tries to connect but just spins and hangs; no response is returned.
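One way to separate a network-level problem from a protocol-level hang is to check whether a plain TCP connection to the YCQL port opens at all. The following is a minimal sketch, not part of the original post: the helper names `can_open_tcp` and `report` are hypothetical, and port 9042 is taken from the `cql_proxy_bind_address` values shown later in this thread.

```python
import socket


def can_open_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def report(hosts, port=9042, timeout=3.0):
    """Build one status line per host; 9042 is the YCQL port used in this thread."""
    lines = []
    for host in hosts:
        state = "open" if can_open_tcp(host, port, timeout) else "unreachable"
        lines.append(f"{host}:{port} {state}")
    return lines
```

Calling `report(["192.168.0.101", "192.168.0.102", "192.168.0.103"])` from the client machine would show whether every node accepts connections. If all three ports open but the driver still hangs, the problem is more likely above the TCP layer.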

hi @Exocomp:

That’s odd. Can you share the output of the full command line for the yb-tserver process (including any gflags or contents of a gflag file you are passing to it)?

For example, from one of our test clusters:

[yugabyte@yb-15-yugabyte-adoption-3-n1 ~]$ ps auxww | grep yb-tserver
yugabyte  3710  0.0  0.0 112680   672 pts/0    D+   03:24   0:00 grep --color=auto yb-tserver
yugabyte 22883 18.4 55.1 13780488 4129524 ?    Sl   Apr26 8250:29 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf

where the gflags file is something like:

[yugabyte@yb-15-yugabyte-adoption-3-n1 ~]$ cat /home/yugabyte/tserver/conf/server.conf
--tserver_master_addrs=10.150.0.45:7100,10.150.0.46:7100,10.150.0.50:7100
--webserver_port=9000
--placement_cloud=gcp
--placement_region=us-west1
--max_log_size=256
--placement_zone=us-west1-b
--placement_uuid=4d9834cc-6d6e-4dc4-89ef-6a8590f59f43
--rpc_bind_addresses=10.150.0.46:9100
--cql_proxy_bind_address=10.150.0.46:9042
--fs_data_dirs=/mnt/d0,/mnt/d1
--webserver_interface=10.150.0.46
--redis_proxy_bind_address=10.150.0.46:6379
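When comparing flag files across nodes, it can help to load them into a dictionary and look at the bind-address flags side by side. This is a minimal sketch under stated assumptions: the function names `parse_flagfile` and `bind_hosts` are hypothetical, the file is assumed to contain one `--name=value` flag per line as in the sample above, and lines starting with `#` are assumed to be comments.

```python
def parse_flagfile(text):
    """Parse '--name=value' lines into a dict; blank lines and '#' comments are skipped."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Drop the leading dashes, then split on the first '=' only.
        name, _, value = line.lstrip("-").partition("=")
        flags[name] = value
    return flags


def bind_hosts(flags):
    """Return the host part of each bind-address flag that is present."""
    keys = (
        "rpc_bind_addresses",
        "cql_proxy_bind_address",
        "redis_proxy_bind_address",
    )
    # A value may be 'host' or 'host:port'; keep only the host.
    return {key: flags[key].split(":")[0] for key in keys if key in flags}
```

On a correctly configured node, all values returned by `bind_hosts` should be that node's own address, and the address should differ from node to node.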

regards,
Kannan

Additionally, while connected to any one node via cqlsh, could you also share the contents of the system.local and system.peers tables?

For example, something like this:

[yugabyte@yb-15-yugabyte-adoption-3-n1 ~]$ ~/tserver/bin/cqlsh 10.150.0.45
Connected to local cluster at 10.150.0.45:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> select * from system.local;

 key   | bootstrapped | broadcast_address | cluster_name  | cql_version | data_center | gossip_generation | host_id | listen_address | native_protocol_version | partitioner                                 | rack       | release_version | rpc_address | schema_version                       | thrift_version | tokens | truncated_at
-------+--------------+-------------------+---------------+-------------+-------------+-------------------+---------+----------------+-------------------------+---------------------------------------------+------------+-----------------+-------------+--------------------------------------+----------------+--------+--------------
 local |    COMPLETED |       10.150.0.45 | local cluster |       3.4.2 |    us-west1 |                 0 |    null |    10.150.0.45 |                       4 | org.apache.cassandra.dht.Murmur3Partitioner | us-west1-a |    3.9-SNAPSHOT | 10.150.0.45 | 00000000-0000-0000-0000-000000000000 |         20.1.0 |  {'0'} |         null

(1 rows)
cqlsh> select * from system.peers;

 peer        | data_center | host_id                              | preferred_ip | rack       | release_version | rpc_address | schema_version                       | tokens
-------------+-------------+--------------------------------------+--------------+------------+-----------------+-------------+--------------------------------------+-------------------------
 10.150.0.46 |    us-west1 | b12f820e-f972-d2b7-bb46-7361fb8eaf1c |  10.150.0.46 | us-west1-b |            null | 10.150.0.46 | 00000000-0000-0000-0000-000000000000 |                   {'0'}
 10.150.0.50 |    us-west1 | dfb11d7f-7eb9-52a6-6d4a-90eb4ccd368b |  10.150.0.50 | us-west1-c |            null | 10.150.0.50 | 00000000-0000-0000-0000-000000000000 | {'6148820866244280320'}

(2 rows)

@kannan

Thanks for the help. Here is what you requested:

administrator@box-100:~$ ps auxww | grep yb-tserver
adminis+ 179593  2.0  4.7 1268864 95792 ?       Ssl  23:05   0:13 /opt/yugabyte/yugabyte-1.2.6.0/bin/yb-tserver --flagfile /etc/yugabyte/tserver.conf --log_dir /mnt/log/yugabyte

--tserver_master_addrs=192.168.0.101:7100,192.168.0.102:7100,192.168.0.103:7100
--rpc_bind_addresses=192.168.0.101
--cql_proxy_bind_address=192.168.0.101:9042
--redis_proxy_bind_address=192.168.0.101:6379
--fs_data_dirs=/mnt/data/yugabyte

administrator@box-100:~$ /opt/yugabyte/yugabyte-1.2.6.0/bin/cqlsh 192.168.0.101
Connected to local cluster at 192.168.0.101:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> select * from system.local;

 key   | bootstrapped | broadcast_address | cluster_name  | cql_version | data_center | gossip_generation | host_id | listen_address | native_protocol_version | partitioner                                 | rack  | release_version | rpc_address   | schema_version                       | thrift_version | tokens | truncated_at
-------+--------------+-------------------+---------------+-------------+-------------+-------------------+---------+----------------+-------------------------+---------------------------------------------+-------+-----------------+---------------+--------------------------------------+----------------+--------+--------------
 local |    COMPLETED |     192.168.0.101 | local cluster |       3.4.2 | datacenter1 |                 0 |    null |  192.168.0.101 |                       4 | org.apache.cassandra.dht.Murmur3Partitioner | rack1 |    3.9-SNAPSHOT | 192.168.0.101 | 00000000-0000-0000-0000-000000000000 |         20.1.0 |  {'0'} |         null

(1 rows)

NOTE: the other boxes are configured as above but with their respective IP addresses.

I just noticed that using cqlsh from the boxes I can connect to any other box. However, using the C# client (https://docs.yugabyte.com/latest/develop/build-apps/csharp/), I’m seeing the issue I described. Here is the relevant piece of code from the C# client:

var db = Cluster.Builder()
	.AddContactPoint("192.168.0.102")
	.Build();
db.Connect("mycluster");

It hangs on .Connect.

However, as I mentioned, if I stop the master and tserver processes on 192.168.0.103, then I can connect just fine to 192.168.0.102. So it’s not a network issue.

hi @Exocomp

Can you also share the output of this query when connected to 192.168.0.101

select * from system.peers;

And perhaps the same two queries from another node, say:

/opt/yugabyte/yugabyte-1.2.6.0/bin/cqlsh 192.168.0.102

select * from system.local;
select * from system.peers;

Hi @kannan,

Here is the output of those commands:

administrator@box100:~$ /opt/yugabyte/yugabyte-1.2.6.0/bin/cqlsh 192.168.0.101
Connected to local cluster at 192.168.0.101:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> select * from system.peers;

 peer          | data_center | host_id                              | preferred_ip  | rack  | release_version | rpc_address   | schema_version                       | tokens
---------------+-------------+--------------------------------------+---------------+-------+-----------------+---------------+--------------------------------------+-------------------------
 192.168.0.103 | datacenter1 | 0e7a6228-ec9f-4ac3-a292-6825afc3bca0 | 192.168.0.103 | rack1 |            null | 192.168.0.103 | 00000000-0000-0000-0000-000000000000 |                   {'0'}
 192.168.0.102 | datacenter1 | 29ae68ef-f58e-4c14-ac5a-26084739ae13 | 192.168.0.102 | rack1 |            null | 192.168.0.102 | 00000000-0000-0000-0000-000000000000 | {'6148820866244280320'}

(2 rows)

administrator@box101:~$ /opt/yugabyte/yugabyte-1.2.6.0/bin/cqlsh 192.168.0.102
Connected to local cluster at 192.168.0.102:9042.
[cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> select * from system.local;

 key   | bootstrapped | broadcast_address | cluster_name  | cql_version | data_center | gossip_generation | host_id | listen_address | native_protocol_version | partitioner                                 | rack  | release_version | rpc_address   | schema_version                       | thrift_version | tokens | truncated_at
-------+--------------+-------------------+---------------+-------------+-------------+-------------------+---------+----------------+-------------------------+---------------------------------------------+-------+-----------------+---------------+--------------------------------------+----------------+--------+--------------
 local |    COMPLETED |     192.168.0.102 | local cluster |       3.4.2 | datacenter1 |                 0 |    null |  192.168.0.102 |                       4 | org.apache.cassandra.dht.Murmur3Partitioner | rack1 |    3.9-SNAPSHOT | 192.168.0.102 | 00000000-0000-0000-0000-000000000000 |         20.1.0 |  {'0'} |         null

(1 rows)
cqlsh> select * from system.peers;

 peer          | data_center | host_id                              | preferred_ip  | rack  | release_version | rpc_address   | schema_version                       | tokens
---------------+-------------+--------------------------------------+---------------+-------+-----------------+---------------+--------------------------------------+--------------------------
 192.168.0.103 | datacenter1 | 0e7a6228-ec9f-4ac3-a292-6825afc3bca0 | 192.168.0.103 | rack1 |            null | 192.168.0.103 | 00000000-0000-0000-0000-000000000000 |                    {'0'}
 192.168.0.101 | datacenter1 | 4c256ad5-0eef-4523-a34b-ca1bcd3facbb | 192.168.0.101 | rack1 |            null | 192.168.0.101 | 00000000-0000-0000-0000-000000000000 | {'-6149102341220990976'}

(2 rows)

NOTE: Using cqlsh, I can connect from any box to any other box. The issue occurs only with the C# client (https://docs.yugabyte.com/latest/develop/build-apps/csharp/): with all 3 boxes running, I can only connect to 192.168.0.103, and when I stop master and tserver on 192.168.0.103, I can then connect to 192.168.0.102 (from the client).
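The two sets of output above can be cross-checked mechanically: each node's own address plus its system.peers rows should describe the same cluster. This is a minimal sketch using only the addresses pasted above; the names `cluster_view` and `views_agree` are hypothetical, and the view from 192.168.0.103 is left out because it was not posted.

```python
def cluster_view(local_addr, peer_addrs):
    """Combine a node's own address with its system.peers rows into one member set."""
    return frozenset([local_addr, *peer_addrs])


def views_agree(views):
    """Return True when every queried node reports the same set of members."""
    return len(set(views.values())) <= 1


# Addresses copied from the cqlsh output above.
views = {
    "192.168.0.101": cluster_view("192.168.0.101", ["192.168.0.103", "192.168.0.102"]),
    "192.168.0.102": cluster_view("192.168.0.102", ["192.168.0.103", "192.168.0.101"]),
}

print(views_agree(views))
```

For the values above the two views match, which suggests that the nodes agree on cluster membership and that the hang comes from somewhere else.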

@kannan

Is there a way to increase the logging level of YugaByte? The INFO, WARNING, and ERROR logs don’t show anything when I connect to one of the boxes with the issue.

Also, when I can’t connect to the boxes I mentioned, the client CPU spikes as if it is stuck in a loop or doing heavy work internally (from the client’s perspective I just see it stuck in .Connect). So it seems that it does connect but doesn’t like what it receives from YugaByte.

hi @Exocomp

Could this be an issue similar to https://datastax-oss.atlassian.net/browse/CSHARP-480 (our YCQL C# driver is a fork of the Apache Cassandra driver)?

Could you try enabling tracing as recommended here, and see if we can learn anything from the logs:

https://datastax-oss.atlassian.net/browse/CSHARP-480?focusedCommentId=32400&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-32400

regards,
Kannan

hi @Exocomp

I think we have a handle on this … and it is something specific to the YugaByte and C# driver combo. Will keep you posted as soon as we have the fix, hopefully in the next few days.

regards,
Kannan

Tracking the issue here: https://github.com/YugaByte/yugabyte-db/issues/1467

@Exocomp

We have done both a YugaByte C# driver side fix (https://github.com/YugaByte/cassandra-csharp-driver) and a server-side fix (https://github.com/YugaByte/yugabyte-db/issues/1467).

Either fix (using the new driver, or waiting for the release with the above server-side fix) should help avoid the problem.

Can you please give this a try with the 3.7.1 version of the YugaByte C# Driver (https://www.nuget.org/packages/YugaByteCassandraCSharpDriver)?

regards,
Kannan

@kannan

I tried 3.7.1, that resolved the issue.

Looking over the commit that fixed it, it looks like it bypasses the token map generation, so it sounds like the driver was previously going down a code path it should not have. https://github.com/YugaByte/cassandra-csharp-driver/commit/6fc2772a140379bd2aaaed789781179196c732c8

Thanks for the quick fix and glad I could contribute.
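Since the fix touches token map generation, it may be worth looking at the tokens column in the cqlsh output pasted earlier in this thread. The sketch below only inspects those pasted values and is not a confirmed root cause: the thread does not state what the driver did with the tokens, the view from 192.168.0.103 was never posted, and the helper name `shared_tokens` is hypothetical.

```python
from collections import defaultdict


def shared_tokens(view):
    """Return tokens claimed by more than one node in a single client-side view.

    `view` maps node address -> list of token strings, as a client would read
    them from system.local (the contact point) and system.peers (the others).
    """
    owners = defaultdict(list)
    for node, tokens in view.items():
        for token in tokens:
            owners[token].append(node)
    return {token: sorted(nodes) for token, nodes in owners.items() if len(nodes) > 1}


# Values copied from the output posted when connected to 192.168.0.101.
view_from_101 = {
    "192.168.0.101": ["0"],                    # system.local
    "192.168.0.103": ["0"],                    # system.peers
    "192.168.0.102": ["6148820866244280320"],  # system.peers
}

# Values copied from the output posted when connected to 192.168.0.102.
view_from_102 = {
    "192.168.0.102": ["0"],                     # system.local
    "192.168.0.103": ["0"],                     # system.peers
    "192.168.0.101": ["-6149102341220990976"],  # system.peers
}

print(shared_tokens(view_from_101))
print(shared_tokens(view_from_102))
```

In both posted views, the contact point and 192.168.0.103 report the same token `0`. Whether that overlap is what the driver tripped over is an assumption, not something the thread confirms.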