Faster recovery after Kubernetes outage

Hi,
I was running YugabyteDB 2.14.3.1 on Kubernetes using the YugabyteDB Helm chart in a testing environment.
By accident, multiple Kubernetes nodes were shut down simultaneously, which resulted in a forceful shutdown of all YugabyteDB pods.
The StatefulSet for the masters recovered quickly.
The tablet server pods took a long time to recover (roughly 11 hours). Not much data was stored in YugabyteDB (I can't say exactly, but it should be less than 1 GB).
The tablet server pods shut down and restarted many times because of a CRC mismatch in the WAL.

I was wondering if there is a way to recover more quickly in such a scenario.
For example, would it be possible for the tablet server to not shut down on a CRC mismatch in the WAL?
Could the tablet server immediately initiate the bootstrap of the tablet instead?

I also looked at this guideline: Recover YB-TServer from crash loop | YugabyteDB Docs
However, I am not sure it can help in this scenario, because all tablet servers were constantly restarting, not just one.

In the tablet server logs I saw that a CRC mismatch was detected in the WAL, which resulted in the shutdown and restart of the servers:

F0217 07:22:31.294332   119 consensus_queue.cc:694] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1 [LEADER]: Error reading the log while preparing peer request: Corruption (yb/consensus/log_util.cc:808): Failed to read ops 16..16: Failed to read LogEntry for index 16 from log segment 7 offset 168: Invalid checksum in log entry head header: found=3071432632, computed=356876958. Destination peer: 8ce7b8036f2c4fd698a101daaf45229d

After a restart of the tablet server, the affected tablet could be bootstrapped successfully:

I0217 07:23:38.081457    63 log_util.cc:615] Ignoring partially flushed segment in write ahead log /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000005 because there are no log entries following this one. The server probably crashed in the middle of writing an entry to the write-ahead log or downloaded an active log via remote bootstrap. Error detail: Corruption (yb/consensus/log_util.cc:808): Invalid checksum in log entry head header: found=2790768357, computed=933096183: Failed trying to read batch #3 at offset 320 for log segment /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000005: Prior batch offsets: 168 292 320; Last log entries read: [REPLICATE (8.12)] [REPLICATE (9.13)] [REPLICATE (10.14)]
I0217 07:25:47.156235    53 ts_tablet_manager.cc:1399] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrapping tablet

I0217 07:25:47.169693    53 version_set.cc:2852] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1 [R]: Recovered from manifest file:/mnt/disk1/yb-data/tserver/data/rocksdb/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/MANIFEST-000036 succeeded,manifest_file_number is 36, next_file_number is 38, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x0000000007dbe540 -> { op_id: 3.7 hybrid_time: { physical: 1675070444910877 logical: 1 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> primary_schema_version: <NULL> cotable_schema_versions: [] }

I0217 07:25:47.196981    53 log_util.cc:692] Scanning /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000007 for valid entry headers following offset 256...

I0217 07:25:47.197877    53 log_util.cc:615] Ignoring partially flushed segment in write ahead log /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000006 because there are no log entries following this one. The server probably crashed in the middle of writing an entry to the write-ahead log or downloaded an active log via remote bootstrap. Error detail: Corruption (yb/consensus/log_util.cc:808): Invalid checksum in log entry head header: found=2356042956, computed=1918334030: Failed trying to read batch #3 at offset 256 for log segment /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000006: Prior batch offsets: 168 228 256; Last log entries read: [REPLICATE (11.15)]

I0217 07:25:47.197899    53 log.cc:616] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Opened existing logs. Last segment is /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000008
I0217 07:25:47.197909    53 tablet_bootstrap.cc:383] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap optimizer (skip_flushed_entries): op_id_replay_lowest=3.7 (regular_op_id=3.7, intents_op_id=-1.-1, has_intents_db=0)
W0217 07:25:47.197918    53 tablet_bootstrap.cc:1166] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap optimizer (skip_flushed_entries): Could not read the first entry's metadata of log segment /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000008. Simply continuing to earlier segments to determine the first segment to start the replay at. The error was: Not found (yb/consensus/log_util.cc:673): No entries found
I0217 07:25:47.197939    53 tablet_bootstrap.cc:1209] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Segment /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000007 cannot be used as the first segment to start replay with according to our OpId and retention criteria. Continuing to earlier segments.op_id=12.16, op_id_replay_lowest=3.7, first_op_time=8320133443.780s, min_seconds_to_retain_logs=660.000s, replay_from_this_or_earlier_time_was_initialized=0, *replay_from_this_or_earlier_time=8320132783.780s, is_first_op_id_low_enough=0, is_first_op_time_early_enough=0
I0217 07:25:47.198016    53 tablet_bootstrap.cc:1209] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Segment /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000006 cannot be used as the first segment to start replay with according to our OpId and retention criteria. Continuing to earlier segments.op_id=11.15, op_id_replay_lowest=3.7, first_op_time=8320133440.716s, min_seconds_to_retain_logs=660.000s, replay_from_this_or_earlier_time_was_initialized=1, *replay_from_this_or_earlier_time=8320132783.780s, is_first_op_id_low_enough=0, is_first_op_time_early_enough=0
I0217 07:25:47.198062    53 tablet_bootstrap.cc:1209] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Segment /mnt/disk1/yb-data/tserver/wals/table-6e839e00f08242eab15ea679d0b88a29/tablet-76e3528156c3461ab36650a2b59d2f4f/wal-000000005 cannot be used as the first segment to start replay with according to our OpId and retention criteria. However, this is already the earliest segment so we have to start replay here. We should probably investigate how we got into this situation. op_id=8.12, op_id_replay_lowest=3.7, first_op_time=8320133437.768s, min_seconds_to_retain_logs=660.000s, replay_from_this_or_earlier_time_was_initialized=1, *replay_from_this_or_earlier_time=8320132783.780s, is_first_op_id_low_enough=0, is_first_op_time_early_enough=0
I0217 07:25:47.198083    53 tablet_bootstrap.cc:1222] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap optimizer (skip_flushed_entries): will replay all segments starting from the very first one.
I0217 07:25:47.200260    53 tablet_bootstrap_if.cc:96] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap replayed 1/4 log segments. Read operations: 3, overwritten operations: 0. Pending: 1 replicates. Last read committed op id: 10.14, last entry metadata: { entry_time: 8320133437.768s offset: 168 active_segment_sequence_number: 5 }, last read entry op id: 10.14
I0217 07:25:47.200313    53 tablet_bootstrap_if.cc:96] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap replayed 2/4 log segments. Read operations: 4, overwritten operations: 0. Pending: 1 replicates. Last read committed op id: 11.15, last entry metadata: { entry_time: 8320133440.716s offset: 168 active_segment_sequence_number: 6 }, last read entry op id: 11.15
I0217 07:25:47.200338    53 tablet_bootstrap_if.cc:96] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap replayed 3/4 log segments. Read operations: 5, overwritten operations: 0. Pending: 1 replicates. Last read committed op id: 12.16, last entry metadata: { entry_time: 8320133443.780s offset: 168 active_segment_sequence_number: 7 }, last read entry op id: 12.16
I0217 07:25:47.200347    53 tablet_bootstrap_if.cc:96] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap replayed 4/4 log segments. Read operations: 5, overwritten operations: 0. Pending: 1 replicates. Last read committed op id: 0.0, no entries in last segment
I0217 07:25:47.200354    53 tablet_bootstrap.cc:1377] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Dumping replay state to log at the end of PlaySegments
I0217 07:25:47.200366    53 tablet_bootstrap.cc:1074] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: ReplayState: Previous OpId: 12.16, Committed OpId: 12.16, Pending Replicates: 0, Flushed Regular: 3.7, Flushed Intents: -1.-1
I0217 07:25:47.200382    53 tablet_bootstrap.cc:1074] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Log entries applied to RocksDB: 5
I0217 07:25:47.200394    53 tablet_bootstrap.cc:1388] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Number of orphaned replicates: 0, last id: 12.16, committed id: 12.
I0217 07:25:47.211081    53 tablet_bootstrap_if.cc:96] T 76e3528156c3461ab36650a2b59d2f4f P 7b31bf647fbf40a88a0897948ae224d1: Bootstrap complete.

Can you provide some more logs from the yb-tservers, so we can see the crash loop in the logs?
You could zip the files and upload them somewhere.

Hi @dorian_yugabyte

I reproduced the outage with a minikube cluster and uploaded the logs to a GitHub repo: GitHub - MichiganL/yugabytedb-logs: Logs of the yugabyte database to analyze

I noticed that the durable_wal_write / require_durable_wal_write flags seem to influence the recovery process.
When these flags are enabled, the CRC mismatch in the WAL is detected, which causes the restarts.
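
As a quick sanity check (a sketch; the pod name and the default web UI port 9000 come from the chart defaults), the effective flag values can be verified on a running tablet server via its /varz page:

kubectl port-forward yb-tserver-0 9000:9000
# then open http://localhost:9000/varz and search for durable_wal_write / require_durable_wal_write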

I used the following steps to reproduce the cluster outage and the YugabyteDB recovery process:

Create a Kubernetes cluster (minikube).
Use the yugabyte Helm chart (yugabyte 2.14.3 · yugabyte/yugabyte), version 2.14.3, with the image 2.14.3.1-b1.
Enable durable_wal_write and require_durable_wal_write.
Start a minimal YugabyteDB cluster (3 masters, 3 tablet servers); a hedged install sketch follows this list.
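
For reference, this is roughly what such an install could look like. It is only a sketch: the release name yb-demo and the namespace are placeholders, and the exact image.tag, replicas.* and gflags.* value keys should be checked against the chart's values.yaml.

helm repo add yugabytedb https://charts.yugabyte.com
helm repo update
# 3 masters + 3 tablet servers, pinned image, durable WAL writes enabled via tserver gflags
helm install yb-demo yugabytedb/yugabyte --version 2.14.3 \
  --namespace yb-demo --create-namespace \
  --set image.tag=2.14.3.1-b1 \
  --set replicas.master=3,replicas.tserver=3 \
  --set gflags.tserver.durable_wal_write=true,gflags.tserver.require_durable_wal_write=true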

Simulate some client activity:

ycqlsh:

create keyspace ks;
create table ks.test1(key text primary key, value text);
create table ks.test2(key text primary key, value text);
create table ks.test3(key text primary key, value text);
create table ks.test4(key text primary key, value text);

generate some scripts:

for i in `seq 1 1000`; do echo "insert into ks.test1(key, value) values('k$i', 'v1');"; done > insert1.cql
for i in `seq 1 1000`; do echo "insert into ks.test2(key, value) values('k$i', 'v1');"; done > insert2.cql
for i in `seq 1 1000`; do echo "insert into ks.test3(key, value) values('k$i', 'v1');"; done > insert3.cql
for i in `seq 1 1000`; do echo "insert into ks.test4(key, value) values('k$i', 'v1');"; done > insert4.cql

run the scripts:

ycqlsh -f insert1.cql &
ycqlsh -f insert2.cql &
ycqlsh -f insert3.cql &
ycqlsh -f insert4.cql &

kill all tablet servers:

kubectl exec yb-tserver-0 -c yb-tserver -- kill 1
kubectl exec yb-tserver-1 -c yb-tserver -- kill 1
kubectl exec yb-tserver-2 -c yb-tserver -- kill 1

watch the recovery process
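
For example (a hedged sketch; the app=yb-tserver pod label is an assumption based on the chart's defaults):

kubectl get pods -l app=yb-tserver -w
kubectl logs -f yb-tserver-0 -c yb-tserver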

@MichiganL can you also paste a screenshot of http://yb-master-ip:7000/tablet-servers?
I'm looking to see how many tablets you have on these tables.
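
If the master UI is not exposed outside the cluster, a port-forward sketch (pod name assumed from the chart defaults):

kubectl port-forward yb-master-0 7000:7000
# then open http://localhost:7000/tablet-servers in a browser and take the screenshot there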

Also, what is the hardware (vCPU, memory, SSD/HDD) of the servers?

The master servers run with 4 GB of memory, 2 CPUs, and 2 disks each.
The tablet servers run with 8 GB of memory, 4 CPUs, and 2 disks each.
The disks are remotely attached Azure Premium SSDs for IO-intensive workloads.

Hi @dorian_yugabyte
Sorry, I forgot to mention you in my last response.
Did you have the opportunity to look further into this?

Hi @MichiganL

Checked this internally with our devs:

My expectation is that the CRC mismatch might happen if we DO NOT use durable writes, e.g., if some data doesn't get fsynced correctly.
Also, the uploaded logs capture just a 20-minute window, which includes the FATAL but not the roughly 11-hour recovery.

Can you upload the full logs?

Can you also upload the WAL files?
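
Something along these lines should pull them off the pods (only a sketch: the WAL path is taken from the log excerpts above, and the log directory is an assumption to verify against the pod's --log_dir flag):

for i in 0 1 2; do
  # WAL segments (path visible in the log excerpts above)
  kubectl cp yb-tserver-$i:/mnt/disk1/yb-data/tserver/wals ./yb-tserver-$i-wals -c yb-tserver
  # tserver logs (verify the directory against the pod's --log_dir flag)
  kubectl cp yb-tserver-$i:/mnt/disk0/yb-data/tserver/logs ./yb-tserver-$i-logs -c yb-tserver
done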

Hi @dorian_yugabyte

Sorry for the late reply; I was occupied with some other tasks.

I reproduced the issue and saved the WAL files after stopping the tablet servers.
As far as I have seen, the restarts only occur when both encryption at rest and durable_wal_write are active.
This time YugabyteDB fully recovered within 3 hours.

All tablet server logs, including the restarts, are available this time, as well as the key used for encryption at rest.
For each restart of a yb-tserver, a new log file with the start time is created.
All logs and tablet server data, as well as the encryption key, can be found here: GitHub - MichiganL/yugabytedb-logs: Logs of the yugabyte database to analyze

I hope you can have another look at this.

Hi @MichiganL

Thanks for getting back!

A developer will take a look.

Does the same thing also happen for you on the latest version (stable or preview)?

Hi @dorian_yugabyte

Yes, I could also reproduce this with the latest preview (2.17.3.0-b152) and stable (2.18.0.0-b65) versions.

Hi @dorian_yugabyte

I think this GitHub issue describes a similar problem: [DocDB] Incorrect WAL footer start offset causes WAL entry header corruption error · Issue #17574 · yugabyte/yugabyte-db · GitHub

The root cause might be the same.

Hi @MichiganL, I don't think they are the same issue, because the GitHub issue is specifically related to a WAL feature we have that saves the WAL index in the file before closing it. However, I just checked the 2.14.3.1 code, and this feature is not implemented in that version.