Kubernetes Example Not Working As Expected

Hello,

We recently decided to evaluate using Yugabyte with Kubernetes in GKE; we are using yugabytedb-2.3.0.0-b176.

As we could not use Helm directly, we instead used the `helm template` command to generate the YAML files and added our business-level annotations. We are using the GKE zonal-ssd storage class.
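For context, the flow was roughly the following; the release name, namespace, and output file are illustrative, not our exact invocation:

```shell
# Render the chart to plain YAML instead of installing it directly
# (release/namespace names are placeholders):
helm template yb-demo yugabytedb/yugabyte \
  --namespace yb-demo \
  --set storage.master.storageClass=zonal-ssd \
  --set storage.tserver.storageClass=zonal-ssd \
  > yugabyte.yaml

# We then added our business-level annotations to yugabyte.yaml and applied it:
kubectl apply -f yugabyte.yaml
```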

The cluster starts up fine and we are able to see the UI; everything reports as up, and there are no errors in the log files.

However, as soon as we try to connect using ysqlsh after port forwarding the yb-tserver, the following error is shown:
ysqlsh: FATAL: Not found: Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000
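For reference, the steps to reproduce are along these lines; the pod name and ports are the chart defaults (5433 being the YSQL port):

```shell
# Forward the YSQL port of the first tserver pod to the local machine:
kubectl port-forward yb-tserver-0 5433:5433

# In another terminal:
ysqlsh -h 127.0.0.1 -p 5433
```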

From the UI I can see this table listed under the system tables, and I have not spotted any errors. Please note we have not modified the YAML other than the storage class and adding some custom annotations for our infrastructure.

EDIT: After a teardown, deleting the PVCs, and recreating, I sometimes see

E0923 13:16:01.532014    31 master.cc:267] Master@10.242.31.7:7100: Unable to init master catalog manager: Illegal state (yb/tablet/tablet.cc:697): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
F0923 13:16:01.532127     1 master_main.cc:131] Illegal state (yb/tablet/tablet.cc:697): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
Fatal failure details written to /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2020-09-23T13_16_01.pid1.txt
F20200923 13:16:01 ../../src/yb/master/master_main.cc:131] Illegal state (yb/tablet/tablet.cc:697): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
    @     0x7ffbf3238f6c  yb::LogFatalHandlerSink::send()
    @     0x7ffbf2419376  google::LogMessage::SendToLog()
    @     0x7ffbf24167da  google::LogMessage::Flush()
    @     0x7ffbf24198a9  google::LogMessageFatal::~LogMessageFatal()
    @           0x409476  yb::master::MasterMain()
    @     0x7ffbee06f825  __libc_start_main
    @           0x408829  _start
    @              (nil)  (unknown)

in the master. However, this seems to be transient, i.e. it occurs in only 1 of the 3 master pod replicas.

Even in the situation where I get a clean startup, I still get the "Table 000000010000300080000000000004ec not found" error.

Any suggestions on how to diagnose the root cause of this would be appreciated, many thanks!
Jason

Hi Jason,

The ysqlsh fatal looks like our currently open issue https://github.com/yugabyte/yugabyte-db/issues/4398. You may want to take a look and add to it.

tablet-00000000000000000000000000000000 is the system catalog tablet on master, and it holds various system tables.

Does the following directory exist on the affected master, and what does it contain?

```
/mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/
```

Thanks for the report, Jason. Will look into how this could happen.

Please note that the 2.3 series is our “latest”/“unstable” series (https://docs.yugabyte.com/latest/releases/releases-overview/) and you might want to try out the stable 2.2 series to check if it also has the same issue.

If you have the full log from the master, please do share. I would be especially interested in the logs in the PVC from the very first run (it may appear as an *INFO*.gz file in the logs directory). You could also share the logs over DM in the Yugabyte Slack channel - https://yugabyte.slack.com/.

Ah, good to know, thanks; I just went with the default Helm chart. I will try 2.2 tomorrow (I’m in the UK) and report back, thanks!

Sounds good.

To use the stable 2.2 version, you need to set an image tag override for the Helm chart: `--set image.tag=2.2.2.0-b15`. You will want to start with clean persistent volumes when you go back to an older version.
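A minimal sketch of that invocation, with placeholder release and namespace names:

```shell
helm install yb-demo yugabytedb/yugabyte \
  --namespace yb-demo \
  --set image.tag=2.2.2.0-b15
```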

Hello,

I decided to do this today instead :slight_smile:
UI now says:

version 2.2.2.0 build 15 revision 525f00c97da67be442cf36e364bd025806e15fff build_type RELEASE built at 18 Aug 2020 21:13:58 UTC
server uuid 0a2cfff69afe45b7a0525650f99ff732

But I’m now getting this issue every time on at least one master, even after deleting PVC.

I0923 20:15:29.360877 32 db_impl.cc:646] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250 [R]: Shutting down RocksDB at: /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000
I0923 20:15:29.361042 32 db_impl.cc:756] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250 [R]: Shutdown done
E0923 20:15:29.361111 32 tablet.cc:680] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: Failed to open a RocksDB database in directory /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
I0923 20:15:29.365432 32 tablet_bootstrap.cc:539] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: Time spent opening tablet: real 0.012s user 0.002s sys 0.007s
I0923 20:15:29.365478 32 tablet.cc:868] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: StartShutdown
I0923 20:15:29.365705 32 transaction_participant.cc:147] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: Shutdown
I0923 20:15:29.365743 32 tablet.cc:894] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: CompleteShutdown(0)
I0923 20:15:29.365757 32 tablet.cc:868] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: StartShutdown
W0923 20:15:29.366041 32 tablet_bootstrap.cc:1611] T N/A Tablet bootstrap failed: Illegal state (yb/tablet/tablet.cc:685): IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
E0923 20:15:29.366151 32 master.cc:266] Master@10.242.26.84:7100: Unable to init master catalog manager: Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
F0923 20:15:29.366236 1 master_main.cc:120] Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
Fatal failure details written to /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2020-09-23T20_15_29.pid1.txt
F20200923 20:15:29 ../../src/yb/master/master_main.cc:120] Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
@ 0x7fc4886135fc yb::LogFatalHandlerSink::send()
@ 0x7fc4877f3376 google::LogMessage::SendToLog()
@ 0x7fc4877f07da google::LogMessage::Flush()
@ 0x7fc4877f38a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x409224 yb::master::MasterMain()
@ 0x7fc483449825 __libc_start_main
@ 0x4085d9 _start
@ (nil) (unknown)

*** Check failure stack trace: ***
@ 0x7fc4886119e1 yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7fc4877f0c8d google::LogMessage::Fail()
@ 0x7fc4877f2dfd google::LogMessage::SendToLog()
@ 0x7fc4877f07da google::LogMessage::Flush()
@ 0x7fc4877f38a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x409224 yb::master::MasterMain()
@ 0x7fc483449825 __libc_start_main
@ 0x4085d9 _start
@ (nil) (unknown)
*** Aborted at 1600892129 (unix time) try "date -d @1600892129" if you are using GNU date ***
PC: @ 0x7fc48345d5a6 __GI_abort
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7fc4935f4b80) from PID 0; stack trace: ***
@ 0x7fc483ddaba0 (unknown)
@ 0x7fc48345d5a6 __GI_abort
@ 0x7fc488611a34 yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7fc4877f0c8c google::LogMessage::Fail()
@ 0x7fc4877f2dfc google::LogMessage::SendToLog()
I0923 20:15:29.437857 27 encryption_manager.cc:52] Responding to GetUniverseKeyRegistry request with key ids:
@ 0x7fc4877f07d9 google::LogMessage::Flush()
@ 0x7fc4877f38a8 google::LogMessageFatal::~LogMessageFatal()
@ 0x409223 yb::master::MasterMain()
@ 0x7fc483449825 __libc_start_main
@ 0x4085d9 _start
@ 0x0 (unknown)

Update with more logs checking missing manifest:

bash-4.2$ cat /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2020-09-24T08_17_05.pid6.txt
F20200924 08:17:05 ../../src/yb/master/master_main.cc:120] Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
@ 0x7f65d74405fc yb::LogFatalHandlerSink::send()
@ 0x7f65d6620376 google::LogMessage::SendToLog()
@ 0x7f65d661d7da google::LogMessage::Flush()
@ 0x7f65d66208a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x409224 yb::master::MasterMain()
@ 0x7f65d2276825 __libc_start_main
@ 0x4085d9 _start
@ (nil) (unknown)

bash-4.2$ ls /mnt/disk1/yb-data/master/data
rocksdb
bash-4.2$ ls /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/
tablet-00000000000000000000000000000000 tablet-00000000000000000000000000000000.intents tablet-00000000000000000000000000000000.snapshots
bash-4.2$ ls -la /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011
ls: cannot access /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
bash-4.2$ ls -la /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/
total 976
drwxrwsr-x 2 9347 9107 4096 Sep 24 07:50 .
drwxrwsr-x 5 9347 9107 4096 Sep 24 07:50 ..
-rw-rw-r-- 1 9347 9107 983040 Sep 24 07:50 000010.sst.sblock.0
-rw-rw-r-- 1 9347 9107 16 Sep 24 07:50 CURRENT
-rw-rw-r-- 1 9347 9107 37 Sep 24 07:50 IDENTITY
-rw-rw-r-- 1 9347 9107 0 Sep 24 07:50 LOCK

For a pod that succeeds I can see a successful log entry for the manifest creation, but never for the broken pod:

I0924 08:17:33.134510 30 tablet.cc:518] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e: Creating RocksDB database in dir /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000
I0924 08:17:33.137562 30 tablet.cc:676] Opening RocksDB at: /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000
I0924 08:17:33.140027 30 version_set.cc:2839] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Recovered from manifest file:/mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000014 succeeded,manifest_file_number is 14, next_file_number is 16, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is
I0924 08:17:33.140074 30 version_set.cc:2847] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Column family [default] (ID 0), log number is 13
I0924 08:17:33.140251 30 event_logger.cc:67] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: EVENT_LOG_v1 {"time_micros": 1600935453140222, "job": 1, "event": "recovery_started", "log_files": [15]}
I0924 08:17:33.140311 30 db_impl.cc:1505] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Recovering log #15 mode 0 skip-recovery 0
I0924 08:17:33.140357 30 version_set.cc:2280] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Creating manifest 17
I0924 08:17:33.141409 30 version_set.cc:3346] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Writing version edit: comparator: “leveldb.BytewiseComparator”

For the transient issue of the MANIFEST not being created, I suspected some form of volume readiness race condition. Therefore I added an artificial wait before starting the yb-master, which seems to have resolved that part: the manifest is now always created.
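The wait is essentially a wrapper along these lines (a sketch only; the function name and retry limits are my own, and the real change was made in the StatefulSet command):

```shell
# Poll until the data directory is actually writable before exec'ing yb-master.
wait_for_mount() {
  _dir="$1"; _tries=0
  # Probe writability by touching a marker file; back off and retry on failure.
  until touch "$_dir/.ready" 2>/dev/null; do
    _tries=$((_tries + 1))
    [ "$_tries" -ge 30 ] && return 1   # give up after roughly 60 seconds
    sleep 2
  done
  rm -f "$_dir/.ready"
  return 0
}

# Usage in the container command (illustrative):
# wait_for_mount /mnt/disk0 && exec /home/yugabyte/bin/yb-master "$@"
```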

However, even though the masters/tservers all now report themselves in a good state, when I try to connect with ysqlsh I am back to the original error:

[mbp-jasont yugabyte]$ kc exec -it yb-tsrv-devel-0 bash

Defaulting container name to yb-tsrv-devel.
bash-4.2$ /home/yugabyte/bin/ysqlsh
ysqlsh: FATAL: Not found: Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000

Hi @Jason_Taylor:

We’ll take a look and may reach out to debug together.

But thinking aloud here:

It is possible that, because of the volume readiness issue, the initdb step didn’t quite finish at cluster create time. After you added the artificial wait, did you just try restarting yb-master, or did you try a fresh cluster create? I suspect you might need to create the cluster fresh (it seems to me that it might not have been cleanly created in the first place).

Just for good measure, I deleted the PVCs each time and rebuilt, with the same issue presenting.
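The teardown between attempts looked roughly like this; the manifest file name and label selectors are illustrative and depend on the chart version:

```shell
# Remove the rendered manifests, then delete the PVCs so the next attempt
# starts from empty disks:
kubectl delete -f yugabyte.yaml
kubectl delete pvc -l app=yb-master
kubectl delete pvc -l app=yb-tserver
```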

On Monday I might also try rolling back to 2.1. Other suggestions welcome of course.

Or, if anyone wants to try deploying to Google Kubernetes Engine with the default Helm chart, they could see whether it reproduces there. As mentioned, I’m using the zonal pd-ssd storage class (I might retry with non-SSD too).

Thanks

@Jason_Taylor:

  1. Can you share the StorageClass config you are using? I set up an SSD storage class like the one below and saw no issues bringing up all 3 masters.

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: yb-ssd-test
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
```

```
helm3 install yb-test1 yugabytedb/yugabyte --namespace yb-ssd-test \
  --set storage.master.storageClass=yb-ssd-test --set storage.tserver.storageClass=yb-ssd-test \
  --set resource.master.requests.memory=1Gi --set resource.tserver.requests.memory=1Gi \
  --set resource.master.requests.cpu=1 --set resource.master.limits.cpu=1 \
  --set resource.tserver.requests.cpu=1 --set resource.tserver.limits.cpu=1
```

  2. Can you share all the INFO logs (.tar.gz and the regular text file) you find in the pod when you first encounter this issue on a clean PVC? If it is easier, feel free to drop by our Slack channel; it may be quicker to debug this over a Slack conversation.

Hello,

All 3 of my masters are up too. The problem is when I try to run:

/home/yugabyte/bin/ysqlsh

I get the error mentioned above. Are you able to run this command successfully without the same error from a tserver?
The UI and the masters all report as healthy.

The storage class is defined as per https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/ssd-pd

Yes, I was able to run ysqlsh. Here is a log of the commands

The storage class definition looks identical to what I used.

FWIW, we test this scenario nightly in our CI/CD jobs (with the default storage class).

Hello,

I’ll move to the yugabyte-db slack channel with an aim to share my full logs there.

Thanks

The problem was resolved over Slack.

It looks like yugabyte wants to write to

/home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes

However, when I check the permissions on that path, I see:

bash-4.2$ ls -la /home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes
-rw------- 1 root root 226844 Aug 18 22:17 /home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes

I believe this might be a restriction imposed by our security team (we don’t normally use third-party images); I’m looking through the YB code path to see whether this dependency on writing to /home can currently be avoided with flags.
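As a stopgap sketch (not the official fix), something like the following could be run from an initContainer with root privileges before the server starts; the function name is my own:

```shell
# Relax the mode on the sys catalog snapshot file so the non-root server
# process can access it (a+rw is deliberately broad, for illustration only).
relax_snapshot_perms() {
  chmod a+rw "$1"
}

# Example target path, taken from the ls output above:
# relax_snapshot_perms /home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes
```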

After this commit https://github.com/yugabyte/yugabyte-db/commit/db2287c48596c6fde0dd46a5b1785de65b362e36, the Docker image was republished and the issue is fixed.