Fault tolerance

Is my logic correct in assuming (with all nodes having plenty of free space):

YugabyteDB with 3 nodes:

  • Single-node failure: replication will redistribute among the two remaining nodes, and reads/writes remain available. If another node then dies, the cluster enters read-only mode.
  • Dual-node failure (both becoming unavailable at once): the remaining node goes into read-only mode.

YugabyteDB with 4 nodes:

  • Single-node failure: replication will redistribute among the three remaining nodes. If the re-replication succeeds, YugabyteDB effectively enters a “3-node” state (see the 3-node example for how future failures are handled).
  • Dual-node failure (both failing within 60 seconds)? Does YugabyteDB go into read-only mode?

YugabyteDB with 5 nodes:

  • Single-node failure: replication will redistribute among the four remaining nodes. If the re-replication succeeds, YugabyteDB enters 4-node behavior (see the 4-node example for how future failures are handled).
  • Dual-node failure (both failing within 60 seconds)? Replication will start between the 3 remaining nodes; if successful, YugabyteDB enters 3-node behavior (see the 3-node example for how future failures are handled) and is able to handle another failure.
  • Triple-node failure (all 3 failing within 60 seconds)? Does YugabyteDB go into read-only mode?

Is this understanding correct? If yes, please also update the documentation, because a lot of DBs like YugabyteDB and CRDB that talk about node failures do not really distinguish “instantaneous” failures from “slow” failures (with a chance to rebuild and enter a different node setup). And the effects of instant vs. slow failures on 3-, 4-, 5-, 6-, and 7-node infrastructures are mostly glossed over.

Hi @Ben_Jiro

Welcome to YugabyteDB Forum!

Are we assuming replication factor = 3 in all those cases?

Regards,
Dorian
Technical Support Engineer

Indeed. A replication factor of 3.

Assuming RF=3 in all scenarios.
Assuming free disk space in all scenarios.

  1. If a node fails, the tablets whose leaders resided on the failed node won’t accept writes until new leaders have been chosen on the other nodes, after --leader_failure_max_missed_heartbeat_periods (roughly 3 seconds with the defaults; see the timing sketch after this list).

  2. If a node is down for --follower_unavailable_considered_failed_sec (default 15 minutes), the node is considered dead and its data starts replicating to the other machines.

  3. If we lose 2 peers of a tablet, that tablet is at best read-only: reads are possible ONLY when the surviving peer was already the leader (it can’t elect a new leader because there’s no quorum) OR when using follower reads. If the lost peers come back online, they will resume replicating. If we lost them forever (15+ minutes), we must do manual recovery of those tablets.

  4. If we lose all 3 peers of a tablet, that tablet is unavailable for reads and writes; no replicas exist. You must resurrect at least one of those nodes. If only one is resurrected, the “lose 2 peers” case above applies.

  5. Generally, if RF is n, YugabyteDB can survive (n - 1) / 2 failures without compromising correctness or availability of data (see the quorum sketch after this list).

  6. Replication happens per tablet. Example: assume a transaction that needs data in 2 tablets; if both tablets have leaders, the transaction can continue.
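
To put rough numbers on points 1 and 2, here is a back-of-the-envelope Python sketch. The gflag names are the ones quoted above; the 500 ms Raft heartbeat interval and the 6-period leader-failure threshold are assumed defaults, so verify them on your own cluster (for example, in the yb-tserver web UI).

```python
# Back-of-the-envelope failover timing, derived from the gflags above.
# The values below are assumed defaults; check your cluster's settings.

RAFT_HEARTBEAT_INTERVAL_MS = 500    # raft_heartbeat_interval_ms (assumed default)
LEADER_FAILURE_MAX_MISSED = 6       # leader_failure_max_missed_heartbeat_periods (assumed default)
FOLLOWER_UNAVAILABLE_SEC = 15 * 60  # follower_unavailable_considered_failed_sec (default 900 s)

# Point 1: tablets whose leader was on the dead node pause writes until
# the surviving peers elect a new leader.
failover_window_s = RAFT_HEARTBEAT_INTERVAL_MS * LEADER_FAILURE_MAX_MISSED / 1000
print(f"writes to affected tablets pause for roughly {failover_window_s:.1f} s")

# Point 2: only after this much downtime is the node declared dead and
# its data re-replicated to the remaining servers.
print(f"re-replication starts after {FOLLOWER_UNAVAILABLE_SEC // 60} minutes of downtime")
```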
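And a minimal sketch of the per-tablet quorum arithmetic behind points 3 to 6. It is a simplification: as point 3 notes, a lone surviving peer can still serve reads if it happened to be the leader, or via follower reads.

```python
# Minimal model of per-tablet quorum availability (illustrative only).

def has_quorum(rf: int, failed_peers: int) -> bool:
    """A tablet stays read/write available while a majority of its
    rf peers are alive, i.e. survivors > rf // 2."""
    return (rf - failed_peers) > rf // 2

def max_tolerated_failures(rf: int) -> int:
    """Point 5: with replication factor n, a tablet tolerates
    (n - 1) // 2 simultaneous peer failures."""
    return (rf - 1) // 2

for rf in (3, 5, 7):
    print(f"RF={rf}: tolerates {max_tolerated_failures(rf)} failed peer(s)")

# Point 6: a transaction touching several tablets needs a live leader
# (i.e. a surviving quorum) for every tablet involved.
failed_peers_per_tablet = [1, 0]  # e.g. a transaction spanning 2 tablets, RF=3
print("transaction can proceed:",
      all(has_quorum(3, f) for f in failed_peers_per_tablet))
```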

3 nodes, single-node failure: see 1.

Also, there is no reason to have 2 replicas of the same tablet on one server (3 replicas distributed over 2 servers), so no re-replication is needed.
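
A quick way to see this: a re-replication target must be a live server that does not already hold a replica of the tablet, and on a 3-node RF=3 cluster no such server exists after a failure. A toy sketch (the helper is purely illustrative, not a YugabyteDB API):

```python
# Toy placement check: valid re-replication targets are live servers
# that do not already hold a replica of the tablet.

def rereplication_targets(live_servers: set, replica_locations: set) -> set:
    return live_servers - replica_locations

replicas = {"n1", "n2"}  # surviving replicas of one tablet after n3 dies

# RF=3 on a 3-node cluster: both survivors already hold a replica.
print(rereplication_targets({"n1", "n2"}, replicas))        # set(): nowhere to go

# Same failure on a 4-node cluster: n4 is a valid target.
print(rereplication_targets({"n1", "n2", "n4"}, replicas))  # {'n4'}
```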

3 nodes, dual-node failure: see 3 above.

4 nodes, single-node failure: see 1, 2.

4 nodes, dual-node failure: see 1, 2, 3.

5 nodes, single-node failure: see 1, 2.

5 nodes, dual-node failure: see 1, 2, 3.

5 nodes, triple-node failure: see 1, 2, 3, 4.

Reads won’t be served either until a new tablet leader is chosen for those tablets that had their leaders on the node that went down, right?

Yes, you won’t be able to read from those leaders since they are down.
In YCQL (soon in YSQL) you can also read from tablet peers.
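
A toy sketch of those two read paths (the function is purely illustrative, not a YugabyteDB API). Default reads go to the tablet leader; follower reads, available in YCQL as noted above, can be served by any live peer at the cost of possibly slightly stale data.

```python
# Illustrative model of the two read paths described above.

def can_read(leader_alive: bool, live_peers: int, follower_reads: bool) -> bool:
    if leader_alive:
        return True                           # normal, strongly consistent read
    return follower_reads and live_peers > 0  # possibly stale follower read

# The leader's node just died, 2 followers are still up. Without follower
# reads, reads pause only for the leader-failover window (a new leader is
# elected within seconds since a quorum survives).
print(can_read(leader_alive=False, live_peers=2, follower_reads=False))  # False
print(can_read(leader_alive=False, live_peers=2, follower_reads=True))   # True
```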