Multi-region deployment with a single cluster/universe

Hi There,

Here is my requirement for a multi-region deployment. It should give me balanced data across AZs and let me scale up by adding more nodes to individual AZs or by adding more AZs.

Requirement:

Create a tablespace in a 3x3 cluster (3 nodes per region across ms-1, mr-1, and st-1, totaling 9 nodes) with 5 replicas and a region-level survival goal:

  • 2 replicas in ms-1 across any 2 AZs of ms-1a, ms-1b, ms-1c.
  • 2 replicas in mr-1 across any 2 AZs of mr-1a, mr-1b, mr-1c.
  • 1 replica in st-1 across any 1 AZ of st-1a, st-1b, st-1c.
  • Balanced data distribution to avoid zones like ms-1c having minimal data.

I am getting an error with zone='*' or zone='AZ1,AZ2,AZ3' while creating the tablespace.

My requirement is a feature like the tablespace below, where Yugabyte takes care of distributing the data/tablets evenly across all AZs in each region. CockroachDB has this feature.

CREATE TABLESPACE geo_partitioned_leader_aware WITH (
  replica_placement='{
    "num_replicas": 5,
    "placement_blocks": [
      { "cloud": "aodc", "region": "ms-1", "zone": "*", "min_num_replicas": 2 },
      { "cloud": "aodc", "region": "mr-1", "zone": "*", "min_num_replicas": 2 },
      { "cloud": "aodc", "region": "st-1", "zone": "*", "min_num_replicas": 1 }
    ]
  }'
);

Even the following doesn't work; it looks like min_num_replicas can't be 0:

ms_testdb=# CREATE TABLESPACE geo_partitioned_leader_aware_3 WITH (
ms_testdb(# replica_placement='{
ms_testdb'# "num_replicas": 5,
ms_testdb'# "placement_blocks": [
ms_testdb'# {"cloud": "aws", "region": "ms-1", "zone": "ms-1a", "min_num_replicas": 1},
ms_testdb'# {"cloud": "aws", "region": "ms-1", "zone": "ms-1b", "min_num_replicas": 0},
ms_testdb'# {"cloud": "aws", "region": "ms-1", "zone": "ms-1c", "min_num_replicas": 0},
ms_testdb'# {"cloud": "aws", "region": "mr-1", "zone": "mr-1a", "min_num_replicas": 1},
ms_testdb'# {"cloud": "aws", "region": "mr-1", "zone": "mr-1b", "min_num_replicas": 0},
ms_testdb'# {"cloud": "aws", "region": "mr-1", "zone": "mr-1c", "min_num_replicas": 0},
ms_testdb'# {"cloud": "aws", "region": "st-1", "zone": "st-1a", "min_num_replicas": 1},
ms_testdb'# {"cloud": "aws", "region": "st-1", "zone": "st-1b", "min_num_replicas": 0},
ms_testdb'# {"cloud": "aws", "region": "st-1", "zone": "st-1c", "min_num_replicas": 0}
ms_testdb'# ],
ms_testdb'# "num_replicas_per_region": [
ms_testdb'# {"cloud": "aws", "region": "ms-1", "num_replicas": 2},
ms_testdb'# {"cloud": "aws", "region": "mr-1", "num_replicas": 2},
ms_testdb'# {"cloud": "aws", "region": "st-1", "num_replicas": 1}
ms_testdb'# ]
ms_testdb'# }'
ms_testdb(# );
ERROR: Invalid type/value for some key in placement block. Placement policy: {
"num_replicas": 5,
"placement_blocks": [
{"cloud": "aws", "region": "ms-1", "zone": "ms-1a", "min_num_replicas": 1},
{"cloud": "aws", "region": "ms-1", "zone": "ms-1b", "min_num_replicas": 0},
{"cloud": "aws", "region": "ms-1", "zone": "ms-1c", "min_num_replicas": 0},
{"cloud": "aws", "region": "mr-1", "zone": "mr-1a", "min_num_replicas": 1},
{"cloud": "aws", "region": "mr-1", "zone": "mr-1b", "min_num_replicas": 0},
{"cloud": "aws", "region": "mr-1", "zone": "mr-1c", "min_num_replicas": 0},
{"cloud": "aws", "region": "st-1", "zone": "st-1a", "min_num_replicas": 1},
{"cloud": "aws", "region": "st-1", "zone": "st-1b", "min_num_replicas": 0},
{"cloud": "aws", "region": "st-1", "zone": "st-1c", "min_num_replicas": 0}
],
"num_replicas_per_region": [
{"cloud": "aws", "region": "ms-1", "num_replicas": 2},
{"cloud": "aws", "region": "mr-1", "num_replicas": 2},
{"cloud": "aws", "region": "st-1", "num_replicas": 1}
]
}

Hi @ranjanp

Did you follow any guide on our docs regarding this?

Can you show a full page screenshot of http://yb-master:7000/tablet-servers ?

Also, what version are you using?

Regards,
Dorian

The issue is that num_replicas_per_region is not a supported parameter in our tablespace declaration. This should be sufficient:

CREATE TABLESPACE multi_region_tablespace WITH (
  replica_placement='{"num_replicas": 5, "placement_blocks":
  [{"cloud":"aws","region":"ms-1","zone":"ms-1a","min_num_replicas":1},
   {"cloud":"aws","region":"ms-1","zone":"ms-1b","min_num_replicas":1},
   {"cloud":"aws","region":"mr-1","zone":"mr-1a","min_num_replicas":1},
   {"cloud":"aws","region":"mr-1","zone":"mr-1b","min_num_replicas":1},
   {"cloud":"aws","region":"st-1","zone":"st-1a","min_num_replicas":1}]}'
);
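
A table can then be placed in this tablespace with the usual TABLESPACE clause; for example (the table name and columns here are just illustrative):

CREATE TABLE orders (
  order_id bigint PRIMARY KEY,
  details  jsonb
) TABLESPACE multi_region_tablespace;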

version 2.25.1.0

Thanks Dorian. I can create the tablespace with min_num_replicas=1.

As for the requirement I described above: is it something you plan to add, where Yugabyte takes care of distributing the tablets across the AZs in a region (3 AZs in this case, with 2 replica copies)? That way I can add more nodes in case I want to scale up.

Hi @ranjanp -

This improvement is being tracked under GitHub issue GH# 26671. It is currently undergoing code review and further testing. Once it has landed, I would expect it to be available in an upcoming drop of 2.25 in the next few weeks.

Thanks a lot, this helps.

And I hope YugabyteDB would choose the AZs evenly. In this case, with 2 zones chosen out of the 3 AZs available, will all 3 nodes end up with even data?

Hi @ranjanp

By adding this feature to CREATE TABLESPACE, the idea is for the tablespace to follow the zones where there are nodes in the cluster. The cloud=aws, region=ms-1, zone=* designation says "match any zone in that region that has nodes present in the cluster."
The placement choice is actually not ours to make if you are the one designating where the nodes are, logically or physically. If there is no node present in a given AZ, we cannot place any data there, in full or in part.
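
If it helps, one quick way to confirm which zones actually have nodes (and therefore where a wildcard placement can land) is to list the servers from ysqlsh; the http://yb-master:7000/tablet-servers page mentioned earlier shows the same placement information:

-- Lists each node in the cluster with its cloud, region, and zone placement.
SELECT * FROM yb_servers();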

Hope this clears things up.
–Alan

I think I understand your point, but my question is a little different.

Say I create a geo-partitioned table on the tablespace below (a sketch of such a table follows the tablespace definition). As you mentioned above, for region "us-east-1" you would choose 2 out of the 3 AZs available (us-east-1a, us-east-1b, and us-east-1c). This part you clarified.

Since you are choosing 2 out of 3 AZs randomly, there is a possibility of one or more nodes being overused in terms of tablets. My question is: do you plan to distribute the tablets evenly across the 3 AZs, or is it completely random?

CREATE TABLESPACE geo_partitioned_auto_zone WITH (
  replica_placement='{
    "num_replicas": 5,
    "placement_blocks": [
      { "cloud": "aws", "region": "us-east-1", "zone": "*", "min_num_replicas": 2 },
      { "cloud": "aws", "region": "eu-west-1", "zone": "*", "min_num_replicas": 2 },
      { "cloud": "aws", "region": "ap-southeast-2", "zone": "*", "min_num_replicas": 1 }
    ]
  }'
);
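
For context, this is roughly the kind of geo-partitioned table I have in mind (the table, column, and partition names are made up; in practice each region's partition would point at its own tablespace):

CREATE TABLE user_accounts (
  user_id    bigint NOT NULL,
  geo_region text   NOT NULL,
  payload    jsonb,
  PRIMARY KEY (user_id, geo_region)
) PARTITION BY LIST (geo_region);

-- One partition per region; shown here attached to the tablespace above.
CREATE TABLE user_accounts_us PARTITION OF user_accounts
  FOR VALUES IN ('us-east-1') TABLESPACE geo_partitioned_auto_zone;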

Hi @ranjanp

Each node has its own storage for data (whether that be local NVME/SSD or EBS-style storage). A node can only exist in 1 zone. Since you have called for 2 “copies” to reside in us-east-1, you will only have 2 zones. The cluster load balancer will balance the number of leaders across the nodes in the zone, so there is not much possibility of a node being overused in terms of tablets from that perspective. Suboptimal data modeling choices may result in some tablets getting too much activity (hot shards), but there are methods to identify that condition when it occurs.

Hope this helps.
–Alan

@ranjanp: There is one caveat with using wildcard zones like this:

CREATE TABLESPACE wildcard_zone_rf5 WITH (
  replica_placement='{
    "num_replicas": 5,
    "placement_blocks": [
      { "cloud": "cloud1", "region": "r1", "zone": "*", "min_num_replicas": 2 },
      { "cloud": "cloud1", "region": "r2", "zone": "*", "min_num_replicas": 2 },
      { "cloud": "cloud1", "region": "r3", "zone": "*", "min_num_replicas": 1 }
    ]
  }'
);

when we have servers in

r1 - z1, z2, z3
r2 - z1, z2, z3
r3 - z1, z2, z3

If zone r1.z1 fails, the placement allows all of these tablets to be automatically moved to the remaining two AZs in r1 - r1.z2 and r1.z3. These two AZs will now see ~33% more load in terms of both disk usage and CPU workload from these tablets, and they should be "over-provisioned" to handle this outage event. Note that if we had instead pinned the tablet copies to specific AZs like the following (a tablespace sketch for this pinned placement follows the list):

r1.z1 : min_num_replicas=1,
r1.z2 : min_num_replicas=1,
r2.z1 : min_num_replicas=1,
r2.z2 : min_num_replicas=1,
r3.z1 : min_num_replicas=1
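
Expressed as a tablespace, the pinned placement above would look roughly like this (the tablespace name is just for illustration):

CREATE TABLESPACE pinned_zone_rf5 WITH (
  replica_placement='{
    "num_replicas": 5,
    "placement_blocks": [
      { "cloud": "cloud1", "region": "r1", "zone": "z1", "min_num_replicas": 1 },
      { "cloud": "cloud1", "region": "r1", "zone": "z2", "min_num_replicas": 1 },
      { "cloud": "cloud1", "region": "r2", "zone": "z1", "min_num_replicas": 1 },
      { "cloud": "cloud1", "region": "r2", "zone": "z2", "min_num_replicas": 1 },
      { "cloud": "cloud1", "region": "r3", "zone": "z1", "min_num_replicas": 1 }
    ]
  }'
);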

When r1.z1 fails, there should be no data movement to other AZs, though the data is now under-replicated at 4 copies out of 5. In this case, the operator must manually intervene to bring up r1.z3, change the placement, and cause tablets to be restored to RF5. Prior to this manual restoration to 5 copies, there is no additional disk usage on the remaining AZs, though there will be additional CPU workload depending on the exact leader preference used.