Connection failure: Not the leader

I’m running version 2.7.1.1 on a 4-node cluster. I’ve been pushing data into it (in batches of 500K records) and it has ended up in an unexpected state. All connections are failing with this message:

psql -U dse -h nyzks702i,nyzks701i -p 31223  ybtest
Password for user dse: 
psql: FATAL:  Query error: GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663771018554699776, retrier: { task_id: -1 state: kRunning deadline: 119835.163s } passed its deadline 119835.163s (passed: 5.107s): Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15)

In my application at some point, I started to get:

2021-07-21 14:38:48.660325919 [13581] INFO  - sp.psql.asyncconn: 0x7f3ee4027410 Failed to establish connection to  dbname=ybtest host=nyzks701i,nyzks702i,nyzks703i,nyzks704i port=31223 connect_timeout=3 user=dse ; status=1, FATAL:  Query error: GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663752594264670208, retrier: { task_id: -1 state: kRunning deadline: 115337.045s } passed its deadline 115337.045s (passed: 5.446s): Leader does not have a valid lease. (yb/consensus/consensus.cc:164): This leader has not yet acquired a lease. (tablet server error 15);

When I look at the logs, I see:

I0721 20:09:34.485960    41 cluster_balance_util.h:405] Master leader not received heartbeat from ts 586cf5abf7fb4995b1d97249ed664954. Only performing leader balancing for tables with replicas in this TS.

This message is logged thousands of times per second. Then it switched to:

I0721 20:09:34.540902    41 cluster_balance.cc:420] Skipping Add replicas. Only leader balancing table 00004000000030008000000000004000
I0721 20:09:34.540908    41 cluster_balance.cc:442] Skipping remove replicas. Only leader balancing table 00004000000030008000000000004000

And then further:

I0721 20:11:47.807847    28 meta_cache.cc:809] 0x0000000009e4dd50 -> LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2): Failed: Timed out (yb/rpc/rpc.cc:211): LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2) passed its deadline 203761.486s (passed: 1.004s)
W0721 20:11:47.807888    28 meta_cache.cc:872] 0x0000000009e4dd50 -> LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2): Timed out (yb/rpc/rpc.cc:211): timed out after deadline expired: LookupByIdRpc(tablet: 1d1d10164a824eebbffa189fc25f707b, num_attempts: 2) passed its deadline 203761.486s (passed: 1.004s)
W0721 20:11:47.827116    28 tablet_rpc.cc:436] Timed out (yb/rpc/rpc.cc:211): Failed GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663775447567736832, retrier: { task_id: -1 state: kRunning deadline: 203761.486s } to tablet 1d1d10164a824eebbffa189fc25f707b (no tablet server available) after 17 attempt(s): GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663775447567736832, retrier: { task_id: -1 state: kRunning deadline: 203761.486s } passed its deadline 203761.486s (passed: 5.198s): Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15)
W0721 20:11:47.827150    28 transaction_status_resolver.cc:149] T 00000000000000000000000000000000 P 3a76513f26774d26a50108a0791e9a97: Failed to request transaction statuses: Timed out (yb/rpc/rpc.cc:211): GetTransactionStatus: tablet_id: "1d1d10164a824eebbffa189fc25f707b" transaction_id: "\254*\024*$\273J=\2454\211KWY\251\321" propagated_hybrid_time: 6663775447567736832, retrier: { task_id: -1 state: kRunning deadline: 203761.486s } passed its deadline 203761.486s (passed: 5.198s): Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15)
I0721 20:11:48.846987    27 tablet_rpc.cc:140] Unable to pick leader for 1d1d10164a824eebbffa189fc25f707b, replicas: [0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.006s }}, {0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.012s }}, 
{0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.017s }}]
I0721 20:11:49.879317    28 tablet_rpc.cc:140] Unable to pick leader for 1d1d10164a824eebbffa189fc25f707b, replicas: [0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.008s }}, {0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.016s }}, 
{0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.027s }}]
I0721 20:11:50.936357    28 tablet_rpc.cc:140] Unable to pick leader for 1d1d10164a824eebbffa189fc25f707b, replicas: [0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x0000000003bb8b40 -> { uuid: 7aa0072cb9c04048ad288d57b79e0a4f private: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-2.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.016s }}, {0x0000000003bb91d0 -> { uuid: 586cf5abf7fb4995b1d97249ed664954 private: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-1.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.028s }}, 
{0x0000000003bb8690 -> { uuid: 77c48bc0811f434bad97ab5ca372294b private: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] public: [host: "yb-tserver-0.yb-tservers.dse-default.svc.cluster.local" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:152): Not the leader (tablet server error 15) time: 0.040s }}]
I0721 20:11:51.350374    24 transaction_status_resolver.cc:66] T 00000000000000000000000000000000 P 3a76513f26774d26a50108a0791e9a97: Start, queues: 1

What is the recommended next step here? Will this recover on its own? Should I restart the pods, or is there something I can do in the admin UI?

@Pierre_Belzile How many indexes are on that table? Can you try much smaller batches, starting with 500?

Two indexes: one on a single field, the other on two fields. Each row has 5 or 6 fields. Yes, I can try smaller batches. The cluster seems to have recovered: at least I can connect to it and run simple queries.
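
For anyone wanting to try the smaller batches suggested above, here is a minimal sketch of splitting one large load into batches of 500. It is an illustration only, not code from the application in this thread: `chunked`, `load_in_batches`, and the `insert_batch` callback are hypothetical names, and the callback is assumed to wrap whatever insert-and-commit call the loader already makes.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly shorter, batch.
        yield batch


def load_in_batches(
    records: Iterable[T],
    insert_batch: Callable[[List[T]], None],
    batch_size: int = 500,
) -> int:
    """Pass records to insert_batch in small batches; return the batch count.

    insert_batch is a hypothetical callback: it is assumed to insert one
    batch and commit it, so each transaction stays small.
    """
    batches = 0
    for batch in chunked(records, batch_size):
        insert_batch(batch)
        batches += 1
    return batches
```

With this shape, a 500K-record load becomes 1000 separate calls of 500 rows each, and the batch size can be raised gradually once the smaller size is known to work.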