Kubernetes Example Not Working As Expected

Hello,

We recently decided to evaluate using Yugabyte with Kubernetes in GKE; we are using yugabytedb-2.3.0.0-b176.

As we could not use Helm directly, we instead used the `helm template` command to generate the YAML files and added our business-level annotations. We are using the GKE zonal-ssd storage class.
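For context, the flow was roughly the following; the release name, namespace, and output file are illustrative, not our exact invocation:

```shell
# Render the chart to plain YAML instead of installing it directly
# (release/namespace names are placeholders):
helm template yb-demo yugabytedb/yugabyte \
  --namespace yb-demo \
  --set storage.master.storageClass=zonal-ssd \
  --set storage.tserver.storageClass=zonal-ssd \
  > yugabyte.yaml

# We then added our business-level annotations to yugabyte.yaml and applied it:
kubectl apply -f yugabyte.yaml
```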

The cluster starts up fine and we are able to see the UI; everything reports as up, and there are no errors in the log files.

However, as soon as we try to connect using ysqlsh after port forwarding the yb-tserver, the following error is shown:
ysqlsh: FATAL: Not found: Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000
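For reference, the steps to reproduce are along these lines; the pod name and ports are the chart defaults (5433 being the YSQL port):

```shell
# Forward the YSQL port of the first tserver pod to the local machine:
kubectl port-forward yb-tserver-0 5433:5433

# In another terminal:
ysqlsh -h 127.0.0.1 -p 5433
```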

From the UI I can see this table listed under the system tables, and I have not spotted any errors. Please note we have not modified the YAML other than the storage class and adding some custom annotations for our infrastructure.

EDIT: After a teardown, deleting the PVCs, and recreating, I sometimes see

E0923 13:16:01.532014    31 master.cc:267] Master@10.242.31.7:7100: Unable to init master catalog manager: Illegal state (yb/tablet/tablet.cc:697): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
F0923 13:16:01.532127     1 master_main.cc:131] Illegal state (yb/tablet/tablet.cc:697): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
Fatal failure details written to /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2020-09-23T13_16_01.pid1.txt
F20200923 13:16:01 ../../src/yb/master/master_main.cc:131] Illegal state (yb/tablet/tablet.cc:697): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
    @     0x7ffbf3238f6c  yb::LogFatalHandlerSink::send()
    @     0x7ffbf2419376  google::LogMessage::SendToLog()
    @     0x7ffbf24167da  google::LogMessage::Flush()
    @     0x7ffbf24198a9  google::LogMessageFatal::~LogMessageFatal()
    @           0x409476  yb::master::MasterMain()
    @     0x7ffbee06f825  __libc_start_main
    @           0x408829  _start
    @              (nil)  (unknown)

in the master. However, this seems to be transient, i.e. it occurs in only 1 of the 3 master pod replicas.

Even in the situation where I get a clean startup, I still get the "Table 000000010000300080000000000004ec not found" error.

Any suggestions on how to diagnose the root cause of this would be appreciated, many thanks!
Jason

Hi Jason,

The ysqlsh fatal looks like our currently open issue https://github.com/yugabyte/yugabyte-db/issues/4398. You may want to take a look and add to it.

tablet-00000000000000000000000000000000 is the system catalog tablet on master, and it holds various system tables.

Does the following directory exist on the affected master, and what does it contain?

```
/mnt/disk0/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/
```

Thanks for the report, Jason. Will look into how this could happen.

Please note that the 2.3 series is our “latest”/“unstable” series (https://docs.yugabyte.com/latest/releases/releases-overview/) and you might want to try out the stable 2.2 series to check if it also has the same issue.

If you have the full log from the master, please do share. I would be especially interested in the logs in the PVC from the very first run (it may appear as an *INFO*.gz file in the logs directory). You could also share the logs over DM in the Yugabyte Slack channel - https://yugabyte.slack.com/.

Ah, good to know, thanks; I just went with the default Helm chart. I will try 2.2 tomorrow (I’m in the UK) and report back, thanks!

Sounds good.

To use the stable 2.2 version, you need to set an image tag override for the Helm chart: `--set image.tag=2.2.2.0-b15`. You will want to start with clean persistent volumes when you go back to an older version.
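A minimal sketch of that invocation, with placeholder release and namespace names:

```shell
helm install yb-demo yugabytedb/yugabyte \
  --namespace yb-demo \
  --set image.tag=2.2.2.0-b15
```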

Hello,

I decided to do this today instead :slight_smile:
UI now says:

version 2.2.2.0 build 15 revision 525f00c97da67be442cf36e364bd025806e15fff build_type RELEASE built at 18 Aug 2020 21:13:58 UTC
server uuid 0a2cfff69afe45b7a0525650f99ff732

But I’m now getting this issue every time on at least one master, even after deleting PVC.

I0923 20:15:29.360877 32 db_impl.cc:646] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250 [R]: Shutting down RocksDB at: /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000
I0923 20:15:29.361042 32 db_impl.cc:756] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250 [R]: Shutdown done
E0923 20:15:29.361111 32 tablet.cc:680] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: Failed to open a RocksDB database in directory /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
I0923 20:15:29.365432 32 tablet_bootstrap.cc:539] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: Time spent opening tablet: real 0.012s user 0.002s sys 0.007s
I0923 20:15:29.365478 32 tablet.cc:868] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: StartShutdown
I0923 20:15:29.365705 32 transaction_participant.cc:147] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: Shutdown
I0923 20:15:29.365743 32 tablet.cc:894] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: CompleteShutdown(0)
I0923 20:15:29.365757 32 tablet.cc:868] T 00000000000000000000000000000000 P 458bce3ef6cb47febb4c11d22a96d250: StartShutdown
W0923 20:15:29.366041 32 tablet_bootstrap.cc:1611] T N/A Tablet bootstrap failed: Illegal state (yb/tablet/tablet.cc:685): IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
E0923 20:15:29.366151 32 master.cc:266] Master@10.242.26.84:7100: Unable to init master catalog manager: Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
F0923 20:15:29.366236 1 master_main.cc:120] Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
Fatal failure details written to /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2020-09-23T20_15_29.pid1.txt
F20200923 20:15:29 ../../src/yb/master/master_main.cc:120] Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
@ 0x7fc4886135fc yb::LogFatalHandlerSink::send()
@ 0x7fc4877f3376 google::LogMessage::SendToLog()
@ 0x7fc4877f07da google::LogMessage::Flush()
@ 0x7fc4877f38a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x409224 yb::master::MasterMain()
@ 0x7fc483449825 __libc_start_main
@ 0x4085d9 _start
@ (nil) (unknown)

*** Check failure stack trace: ***
@ 0x7fc4886119e1 yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7fc4877f0c8d google::LogMessage::Fail()
@ 0x7fc4877f2dfd google::LogMessage::SendToLog()
@ 0x7fc4877f07da google::LogMessage::Flush()
@ 0x7fc4877f38a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x409224 yb::master::MasterMain()
@ 0x7fc483449825 __libc_start_main
@ 0x4085d9 _start
@ (nil) (unknown)
*** Aborted at 1600892129 (unix time) try "date -d @1600892129" if you are using GNU date ***
PC: @ 0x7fc48345d5a6 __GI_abort
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7fc4935f4b80) from PID 0; stack trace: ***
@ 0x7fc483ddaba0 (unknown)
@ 0x7fc48345d5a6 __GI_abort
@ 0x7fc488611a34 yb::(anonymous namespace)::DumpStackTraceAndExit()
@ 0x7fc4877f0c8c google::LogMessage::Fail()
@ 0x7fc4877f2dfc google::LogMessage::SendToLog()
I0923 20:15:29.437857 27 encryption_manager.cc:52] Responding to GetUniverseKeyRegistry request with key ids:
@ 0x7fc4877f07d9 google::LogMessage::Flush()
@ 0x7fc4877f38a8 google::LogMessageFatal::~LogMessageFatal()
@ 0x409223 yb::master::MasterMain()
@ 0x7fc483449825 __libc_start_main
@ 0x4085d9 _start
@ 0x0 (unknown)

Update with more logs checking missing manifest:

bash-4.2$ cat /mnt/disk0/yb-data/master/logs/yb-master.FATAL.details.2020-09-24T08_17_05.pid6.txt
F20200924 08:17:05 ../../src/yb/master/master_main.cc:120] Illegal state (yb/tablet/tablet.cc:685): Unable to initialize catalog manager: Failed to initialize sys tables async: IO error (yb/rocksdb/util/env_posix.cc:601): /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
@ 0x7f65d74405fc yb::LogFatalHandlerSink::send()
@ 0x7f65d6620376 google::LogMessage::SendToLog()
@ 0x7f65d661d7da google::LogMessage::Flush()
@ 0x7f65d66208a9 google::LogMessageFatal::~LogMessageFatal()
@ 0x409224 yb::master::MasterMain()
@ 0x7f65d2276825 __libc_start_main
@ 0x4085d9 _start
@ (nil) (unknown)

bash-4.2$ ls /mnt/disk1/yb-data/master/data
rocksdb
bash-4.2$ ls /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/
tablet-00000000000000000000000000000000 tablet-00000000000000000000000000000000.intents tablet-00000000000000000000000000000000.snapshots
bash-4.2$ ls -la /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011
ls: cannot access /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000011: No such file or directory
bash-4.2$ ls -la /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/
total 976
drwxrwsr-x 2 9347 9107 4096 Sep 24 07:50 .
drwxrwsr-x 5 9347 9107 4096 Sep 24 07:50 ..
-rw-rw-r-- 1 9347 9107 983040 Sep 24 07:50 000010.sst.sblock.0
-rw-rw-r-- 1 9347 9107 16 Sep 24 07:50 CURRENT
-rw-rw-r-- 1 9347 9107 37 Sep 24 07:50 IDENTITY
-rw-rw-r-- 1 9347 9107 0 Sep 24 07:50 LOCK

For a pod that succeeds I can see a successful log entry for the manifest creation, but never for the broken pod:

I0924 08:17:33.134510 30 tablet.cc:518] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e: Creating RocksDB database in dir /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000
I0924 08:17:33.137562 30 tablet.cc:676] Opening RocksDB at: /mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000
I0924 08:17:33.140027 30 version_set.cc:2839] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Recovered from manifest file:/mnt/disk1/yb-data/master/data/rocksdb/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/MANIFEST-000014 succeeded,manifest_file_number is 14, next_file_number is 16, last_sequence is 1125899906842624, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is
I0924 08:17:33.140074 30 version_set.cc:2847] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Column family [default] (ID 0), log number is 13
I0924 08:17:33.140251 30 event_logger.cc:67] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: EVENT_LOG_v1 {"time_micros": 1600935453140222, "job": 1, "event": "recovery_started", "log_files": [15]}
I0924 08:17:33.140311 30 db_impl.cc:1505] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Recovering log #15 mode 0 skip-recovery 0
I0924 08:17:33.140357 30 version_set.cc:2280] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Creating manifest 17
I0924 08:17:33.141409 30 version_set.cc:3346] T 00000000000000000000000000000000 P 8b22690fcbd74c9486dc9a263924116e [R]: Writing version edit: comparator: “leveldb.BytewiseComparator”

For the transient issue of the MANIFEST not being created, I suspected some form of volume readiness race condition. Therefore I added an artificial wait before starting the yb-master, which seems to have resolved that part: the manifest is now always created.
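The wait is essentially a wrapper along these lines (a sketch only; the function name and retry limits are my own, and the real change was made in the StatefulSet command):

```shell
# Poll until the data directory is actually writable before exec'ing yb-master.
wait_for_mount() {
  _dir="$1"; _tries=0
  # Probe writability by touching a marker file; back off and retry on failure.
  until touch "$_dir/.ready" 2>/dev/null; do
    _tries=$((_tries + 1))
    [ "$_tries" -ge 30 ] && return 1   # give up after roughly 60 seconds
    sleep 2
  done
  rm -f "$_dir/.ready"
  return 0
}

# Usage in the container command (illustrative):
# wait_for_mount /mnt/disk0 && exec /home/yugabyte/bin/yb-master "$@"
```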

However, even though the masters/tservers all now report themselves in a good state, when I try to connect with ysqlsh I am back to the original error:

[mbp-jasont yugabyte]$ kc exec -it yb-tsrv-devel-0 bash

Defaulting container name to yb-tsrv-devel.
bash-4.2$ /home/yugabyte/bin/ysqlsh
ysqlsh: FATAL: Not found: Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000

Hi @Jason_Taylor:

We’ll take a look and may reach out to debug together.

But thinking aloud here:

It is possible that, because of the volume readiness issue, the initdb step didn’t quite finish at cluster create time. After you added the artificial wait, did you just try restarting yb-master, or did you try a fresh cluster create? I suspect you might need to create the cluster fresh (it seems to me that it might not have been cleanly created in the first place).

Just for good measure, I deleted the PVCs each time and rebuilt, with the same issue presenting.
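The teardown between attempts looked roughly like this; the manifest file name and label selectors are illustrative and depend on the chart version:

```shell
# Remove the rendered manifests, then delete the PVCs so the next attempt
# starts from empty disks:
kubectl delete -f yugabyte.yaml
kubectl delete pvc -l app=yb-master
kubectl delete pvc -l app=yb-tserver
```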

On Monday I might also try rolling back to 2.1. Other suggestions welcome of course.

Or, if anyone wants to try deploying to Google Kubernetes Engine with the default Helm chart, they could see whether it reproduces there. As mentioned, I’m using the zonal pd-ssd storage class (I might retry with non-SSD too).

Thanks

@Jason_Taylor:

  1. Can you share the StorageClass config you are using? I set up an SSD storage class like the one below and saw no issues bringing up all 3 masters.

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: yb-ssd-test
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
```

```
helm3 install yb-test1 yugabytedb/yugabyte --namespace yb-ssd-test \
  --set storage.master.storageClass=yb-ssd-test --set storage.tserver.storageClass=yb-ssd-test \
  --set resource.master.requests.memory=1Gi --set resource.tserver.requests.memory=1Gi \
  --set resource.master.requests.cpu=1 --set resource.master.limits.cpu=1 \
  --set resource.tserver.requests.cpu=1 --set resource.tserver.limits.cpu=1
```

  2. Can you share all the INFO logs (.tar.gz and the regular text file) you find in the pod when you first encounter this issue on a clean PVC? If it is easier, feel free to drop by our Slack channel; it may be quicker to debug this over a Slack conversation.

Hello,

All 3 of my masters are up too. The problem is when I try to run:

/home/yugabyte/bin/ysqlsh

I get the error mentioned above. Are you able to run this command successfully without the same error from a tserver?
The UI and the masters all report as healthy.

The storage class is defined as per https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/ssd-pd

Yes, I was able to run ysqlsh. Here is a log of the commands

The storage class definition looks identical to what I used.

FWIW, we test this scenario nightly in our CI/CD jobs (with the default storage class).

Hello,

I’ll move to the yugabyte-db slack channel with an aim to share my full logs there.

Thanks

The problem was resolved over Slack.

It looks like yugabyte wants to write to

/home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes

However, when I check the permissions on that path, I see:

bash-4.2$ ls -la /home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes
-rw------- 1 root root 226844 Aug 18 22:17 /home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes

I believe this might be a restriction imposed by our security team (we don’t normally use third-party images); I’m looking through the YB code path to see whether this dependency on writing to /home can currently be avoided with flags.
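As a stopgap sketch (not the official fix), something like the following could be run from an initContainer with root privileges before the server starts; the function name is my own:

```shell
# Relax the mode on the sys catalog snapshot file so the non-root server
# process can access it (a+rw is deliberately broad, for illustration only).
relax_snapshot_perms() {
  chmod a+rw "$1"
}

# Example target path, taken from the ls output above:
# relax_snapshot_perms /home/yugabyte/share/initial_sys_catalog_snapshot/exported_tablet_metadata_changes
```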

After this commit https://github.com/yugabyte/yugabyte-db/commit/db2287c48596c6fde0dd46a5b1785de65b362e36, the Docker image was republished and the issue is fixed.