Why DELETE query with yb_hash_code() degrades performance?

For a given below schema:

CREATE TABLE  IF NOT EXISTS public.item_data
(
    item_id uuid NOT NULL,
    id2 integer NOT NULL,
    create_date timestamp without time zone NOT NULL,
    modified_date timestamp without time zone NOT NULL,
        CONSTRAINT item_data_pkey PRIMARY KEY (item_id, id2)
);

We have 48 tablets in yugabyte setup. So, first hash range is [0, 1395)

Below are the execution times of DELETE queries:

Query 1 (using yb_hash_code()):

EXPLAIN ANALYZE DELETE FROM item_data x WHERE yb_hash_code(x.item_id)>=0 and yb_hash_code(x.item_id)<1395 and x.item_id = any in the arrayOfItemIds - taking 2 seconds of execution time

Query 2:

EXPLAIN ANALYZE DELETE FROM item_data x WHERE x.item_id = any in the listOfItemIds - taking 2 milli seconds of execution time


DELETE is a write operation, so, the query plan includes:
Step a) finding shard for the given WHERE clause.
Step b) executing query on the shard leader
Step c) replicate the changes on shard followers
Step d) respond to client

yb_hash_code() in WHERE clause should avoid step a, Is that correct?

Why Query 2 performs faster than Query 1? despite Query 1 uses yb_hash_code()

Hi Sham,
Here, because item_idis the primary key, YugabyteDB determines the right tablets do you don’t need to filter with hash code.
But you are right, adding the criteria should not add more work. The execution plan shows. My guess is that you see Rows Removed by Index Recheck with approximately 1/48 of the rows there. This means that when having the predicate on hash_code, the other predicate was not pushed down.
I’ll check, I think we have a git issue about this, or I’ll open one.
Franck

1 Like

@FranckPachot
Please share the GitHub issue link

Here it is, coming from your example:

1 Like

@FranckPachot

Another question…

For the given below schema:

CREATE TABLE IF NOT EXISTS public.table1

(

customer_id uuid NOT NULL ,

item_id uuid NOT NULL ,

kind character varying(100) NOT NULL ,

details character varying(100) NOT NULL ,
created_date timestamp without time zone NOT NULL,
modified_date timestamp without time zone NOT NULL,

CONSTRAINT table1_pkey PRIMARY KEY (customer_id, kind, item_id)

);

CREATE UNIQUE INDEX IF NOT EXISTS unique_item_id ON table1(item_id);

CREATE UNIQUE INDEX IF NOT EXISTS unique_item ON table1(customer_id, kind) WHERE kind='NEW' OR kind='BACKUP';

We see that yb_hash_code() performs better with SELECT query:

EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code (item_id) < 1395 and modified_date < date '2022-04-08';
Planning Time: 7.967 ms
Execution Time: 82.929 ms

EXPLAIN ANALYZE select item_id from table1 WHERE modified_date < date '2022-04-08';
Planning Time: 0.054 ms
Execution Time: 4618.350 ms

EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code(item_id) <=65535 and modified_date < date '2022-04-08';
Planning Time: 0.073 ms
Execution Time: 4565.615 ms

EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code(item_id) < 1490 and modified_date < date '2022-04-08';
Planning Time: 0.148 ms
Execution Time: 84.737 ms

But,
do you suggest start using yb_hash_code() with SELECT query after above github issue is fixed?

Hi,
The two queries that takes 4 seconds are doing a Seq Scan I guess. yb_hash_code() is pushed down only for IndexScan. See [YSQL] yb_hash_code() not pushed down to SeqScan · Issue #12096 · yugabyte/yugabyte-db · GitHub

It think the 3rd one you can force and Index Scan with hint, yb_hash_code() will be pushed down

But, maybe you are trying to do too much with yb_hash_code(). It was introduce for specific case, like doing some parallelization when reading all rows. The goal of a distributed database is to use it like one database logically without caring about the distribution. When specific placement considerations are needed, we have partitioning and tablespaces.