I’m running version 2.7.1.1 on a 4-node cluster. I’ve been pushing data into it (in batches of 500K records) and it has gotten into an unexpected state. All connections are failing with this message:
psql -U dse -h nyzks702i,nyzks701i -p 31223 ybtest
Password for user dse:
psql: FATAL: Query error: GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663771018554699776, retrier: { task_id: -1 state: kRunning deadline: 119835.163s } passed its deadline 119835.163s (passed: 5.107s): Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15)
At some point, my application started to get:
2021-07-21 14:38:48.660325919 [13581] INFO - sp.psql.asyncconn: 0x7f3ee4027410 Failed to establish connection to dbname=ybtest host=nyzks701i,nyzks702i,nyzks703i,nyzks704i port=31223 connect_timeout=3 user=dse ; status=1, FATAL: Query error: GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663752594264670208, retrier: { task_id: -1 state: kRunning deadline: 115337.045s } passed its deadline 115337.045s (passed: 5.446s): Leader does not have a valid lease. (yb/consensus/consensus.cc:164): This leader has not yet acquired a lease. (tablet server error 15);
When I look at the logs, I see:
I0721 20:09:34.485960 41 cluster_balance_util.h:405] Master leader not received heartbeat from ts 586cf5abf7fb4995b1d97249ed664954. Only performing leader balancing for tables with replicas in this TS.
This message is logged thousands of times per second. Then it switched to:
I0721 20:09:34.540902 41 cluster_balance.cc:420] Skipping Add replicas. Only leader balancing table 00004000000030008000000000004000
I0721 20:09:34.540908 41 cluster_balance.cc:442] Skipping remove replicas. Only leader balancing table 00004000000030008000000000004000
And then further:
I0721 20:11:47.807847 28 meta_cache.cc:809] 0x0000000009e4dd50 -> LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2): Failed: Timed out (yb/rpc/rpc.cc:211): LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2) passed its deadline 203761.486s (passed: 1.004s)
W0721 20:11:47.807888 28 meta_cache.cc:872] 0x0000000009e4dd50 -> LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2): Timed out (yb/rpc/rpc.cc:211): timed out after deadline expired: LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2) passed its deadline 203761.486s (passed: 1.004s)
W0721 20:11:47.827116 28 tablet_rpc.cc:436] Timed out (yb/rpc/rpc.cc:211): Failed GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663775447567736832, retrier: { task_id: -1 state: kRunning deadline: 203761.486s } to tablet 1d1d10164a824eebbffa189fc25f707b (no tablet server available) after 17 attempt(s): GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663775447567736832, retrier: { task_id: -1 state: kRunning deadline: 203761.486s } passed its deadline 203761.486s (passed: 5.198s): Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15)
W0721 20:11:47.827150 28 transaction_status_resolver.cc:149] T 00000000000000000000000000000000 P 3a76513f26774d26a50108a0791e9a97: Failed to request transaction statuses: Timed out (yb/rpc/rpc.cc:211): GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663775447567736832, retrier: { task_id: -1 state: kRunning deadline: 203761.486s } passed its deadline 203761.486s (passed: 5.198s): Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15)
I0721 20:11:48.846987 27 tablet_rpc.cc:140] Unable to pick leader for 1d1d10164a824eebbffa189fc25f707b, replicas: [0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.006s }}, {0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.012s }}, 
{0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.017s }}]
I0721 20:11:49.879317 28 tablet_rpc.cc:140] Unable to pick leader for 1d1d10164a824eebbffa189fc25f707b, replicas: [0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.008s }}, {0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.016s }}, 
{0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.027s }}]
I0721 20:11:50.936357 28 tablet_rpc.cc:140] Unable to pick leader for 1d1d10164a824eebbffa189fc25f707b, replicas: [0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.016s }}, {0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.028s }}, 
{0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.040s }}]
I0721 20:11:51.350374 24 transaction_status_resolver.cc:66] T 00000000000000000000000000000000 P 3a76513f26774d26a50108a0791e9a97: Start, queues: 1
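The "Unable to pick leader" entries above are long, so here is a minimal sketch for condensing one into a per-replica summary. It assumes the message layout shown in these logs (a `followers: [` section in which each entry has a `uuid:` followed by `{ status: ... time: Ns }`); the function name and regex are my own and are not part of any YugabyteDB tooling.

```python
import re

# Assumed layout, taken from the log lines above: inside "followers: [ ... ]",
# each entry contains "uuid: <32 hex chars>" and later "{ status: <text> time: <n>s }".
FOLLOWER_RE = re.compile(
    r"uuid: (?P<uuid>[0-9a-f]{32}).*?"
    r"\{ status: (?P<status>.*?) time: (?P<secs>[0-9.]+)s \}",
    re.DOTALL,  # one entry may be wrapped across physical lines
)


def summarize_followers(message: str) -> list[tuple[str, str, float]]:
    """Return (uuid, status, seconds) for each follower entry in the message."""
    # Only look at the text after "followers: [" so that the plain replica
    # list earlier in the message is not matched.
    _, _, followers = message.partition("followers: [")
    return [
        (m["uuid"], m["status"], float(m["secs"]))
        for m in FOLLOWER_RE.finditer(followers)
    ]
```

Passing the full text of one entry (with its wrapped continuation line joined back on) gives one tuple per replica, which makes it easier to see that all three replicas of tablet 1d1d10164a824eebbffa189fc25f707b answer "Not the leader".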
What is the recommended next step here? Will this recover on its own? Should I restart the pods, or is there something I can do in the admin UI?