Timeouts on 16 node cluster due to bad schema design

16 nodes in a region

The application in the central region is configured to talk to the master (leader) in the central region

The application is using the YCQL driver (gocql), so the table is currently hash-partitioned


Given the requirement of heavy SELECT traffic (query below) and much lighter INSERT & UPDATE traffic:

SELECT userID, col3, col4, col5, col6, col7
  FROM table1
 WHERE domainID = :domain_id AND userID IN :user_id_set

The database currently has the schema below, with hash-based partitioning:

CREATE TABLE keySpaceName.table1 (
    domainID VARCHAR,
    userID VARCHAR,
    col3 VARCHAR,
    col4 VARCHAR,
    col5 VARCHAR,
    col6 VARCHAR,
    col7 VARCHAR,
    PRIMARY KEY ((userID, domainID), col3, col4));

Note: One domain can have multiple users. Thousands of users in a domain.

For historical reasons, our schema was badly partitioned using PRIMARY KEY((userID, domainID), col3, col4), which has led to SELECT queries hitting huge timeouts on the 16-node cluster hosting the hash-partitioned table1. We need to restart the application regularly.


So we are refactoring PRIMARY KEY((userID, domainID), col3, col4) to PRIMARY KEY((domainID), userID) WITH CLUSTERING ORDER BY (userID ASC);. The table then gets created with hash partitioning on the key domainID, and the clustering key userID is ordered ascending.

The partition key domainID would enhance the performance of the filter domainID = :domain_id

and

the clustering key userID would enhance the performance of the filter userID IN :user_id_set


  1. Please let me know if this refactoring makes sense to enhance WHERE clause performance in the SELECT query

  2. How do INSERT queries (to add a new row) and UPDATE queries (to update an existing row) perform after this refactoring?

Hi @sham_yuga

This is a single region/AZ deployment, correct?

How big (CPU, memory, SSD) is each of these nodes?

Can you be more specific what you mean here? Do you mean you connect to only 1 node?

How big is the :user_id_set on each query? How many values?

Can there be domains with a huge number of users? So many that they can outgrow a single node? You can think of this in terms of how many rows, how many GBs, and how many selects/inserts per second.

The INSERT would work the same. What's important are my questions above, to see the effects of everything happening.

@dorian_yugabyte

2 Regions, 3 AZs, 24 nodes.
16 nodes in Central region, 8 nodes in east region
Application can talk to any of the 16 nodes in central region, to submit SELECT/INSERT/UPDATE queries
Users cannot outgrow a single node.

Please answer all my questions.

I was talking about “domain”. How many users for a domain and what would be the biggest domain?

@dorian_yugabyte
As mentioned here,
Partitioning a table into tablets is done by YugabyteDB; assigning each tablet to a node is not in our control. If there are n tablets (say, medium size each) assigned to a specific node (nodeX), a tablet out of those n tablets can outgrow the node, because nodeX has multiple tablets assigned.

In any case, there is a very good chance of a domain outgrowing a node in our setup.

@dorian_yugabyte
Did I answer all the questions?

As for your question, "what is the biggest domain?":

Does knowing the size of the biggest domain help us ensure that a node of that capacity (GB) will own that one tablet? Is this in our control? There can be more than one tablet owned by one node. Correct me if I'm wrong.

@sham_yuga please, answer ALL my questions, just go 1 by 1. If you don’t know the answer, answer with a ballpark estimate or say you don’t know.

This answer doesn't help. Let me explain. You can answer "I really don't know how big they can get, maybe hundreds of GBs."

But if you answer "They can grow to 50GB" and "My nodes have 8GB memory", then we can keep a domain to 1 node, you can grow nodes up to 128GB+ memory, and it will work.

@dorian_yugabyte

Here are the answers for your questions:

  1. This is a single region/AZ deployment, correct?

    No.
    2 Regions, 3 AZs,
    24 nodes in total
    16 nodes in Central region, 8 nodes in east region

  2. How big (CPU, memory, SSD) is each of these nodes?

    16 cores (soon to be 32 cores)
    Ram 64GB
    Disk 1TB

  3. Can you be more specific what you mean here? Do you mean you connect to only 1 node?

    My bad, I gave wrong info on this in previous comments. Out of the 16 nodes in the central region, the GoCQL driver is currently configured with host info for all 16 tablet servers. Below is the driver code that takes those 16 hosts:

        func NewCluster(hosts ...string) *ClusterConfig {
            cfg := &ClusterConfig{
                Hosts: hosts, // ...other defaults elided...
            }
            return cfg
        }
  4. How big is the :user_id_set on each query? How many values?

    On average, we see a set size of 10 user IDs per SELECT query

  5. Can there be domains that have a huge number of users? So much that they can outgrow a single node? You can think of this in how many rows, how many GBs…

    ~3000 domains; this number is not expected to grow, maybe 10 new domains per year max
    750,000 users per domain(that has huge number of users)
    3 items at 1 store = 1KB
    3,000 items at 1 store = 1MB
    3,000,000 items at 1 store = 1GB
    No, each node will have 1TB disk and it’s not expected to outgrow 1 node

  6. How many SELECT queries per second? INSERT queries per sec? UPDATE queries per sec?

    10k SELECT ops per second
    ~2-5k UPDATEs every hour or so
    INSERTs are rare

@sham_yuga I don't understand how "domain, user" are connected to "items in a store"?
Is a "store" a "domain"? And is a "user in a domain" an "item in a store"?

Have you tried making single-row queries? So domain=x and user=y, making 10 queries in parallel (ex: 10 goroutines)?

Make sure to also configure with cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("dc1")) where “dc1” is the name of --placement-cloud of your central region.

And make sure you’re using prepared queries.

Can you try these changes?

@dorian_yugabyte

3 users in 1 domain = 1KB
3,000 users in 1 domain = 1MB
3,000,000 users in 1 domain = 1GB

Before we try single-row queries:
How do we see the partition key, given this scenario?

Since you expect at most 750K users in a domain, then it means less than 1GB.
This means it won’t be a problem for memory size, the whole domain would stay in memory.
The remaining question would be the peak QPS for a given domain, since the 10K QPS figure is for the whole cluster, correct?

Both should work. I suggest trying my suggestion first before changing the schema since it’s easier.

@dorian_yugabyte
Sure, will try your suggestion and let you know