CREATE TABLE IF NOT EXISTS public.item_data
(
item_id uuid NOT NULL,
id2 integer NOT NULL,
create_date timestamp without time zone NOT NULL,
modified_date timestamp without time zone NOT NULL,
CONSTRAINT item_data_pkey PRIMARY KEY (item_id, id2)
);
We have 48 tablets in yugabyte setup. So, first hash range is [0, 1395)
Below are the execution times of DELETE queries:
Query 1 (using yb_hash_code()):
EXPLAIN ANALYZE DELETE FROM item_data x WHERE yb_hash_code(x.item_id)>=0 and yb_hash_code(x.item_id)<1395 and x.item_id = any in the arrayOfItemIds - taking 2 seconds of execution time
Query 2:
EXPLAIN ANALYZE DELETE FROM item_data x WHERE x.item_id = any in the listOfItemIds - taking 2 milli seconds of execution time
DELETE is a write operation, so, the query plan includes:
Step a) finding shard for the given WHERE clause.
Step b) executing query on the shard leader
Step c) replicate the changes on shard followers
Step d) respond to client
yb_hash_code() in WHERE clause should avoid step a, Is that correct?
Hi Sham,
Here, because item_idis the primary key, YugabyteDB determines the right tablets do you don’t need to filter with hash code.
But you are right, adding the criteria should not add more work. The execution plan shows. My guess is that you see Rows Removed by Index Recheck with approximately 1/48 of the rows there. This means that when having the predicate on hash_code, the other predicate was not pushed down.
I’ll check, I think we have a git issue about this, or I’ll open one.
Franck
CREATE TABLE IF NOT EXISTS public.table1
(
customer_id uuid NOT NULL ,
item_id uuid NOT NULL ,
kind character varying(100) NOT NULL ,
details character varying(100) NOT NULL ,
created_date timestamp without time zone NOT NULL,
modified_date timestamp without time zone NOT NULL,
CONSTRAINT table1_pkey PRIMARY KEY (customer_id, kind, item_id)
);
CREATE UNIQUE INDEX IF NOT EXISTS unique_item_id ON table1(item_id);
CREATE UNIQUE INDEX IF NOT EXISTS unique_item ON table1(customer_id, kind) WHERE kind='NEW' OR kind='BACKUP';
We see that yb_hash_code() performs better with SELECT query:
EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code (item_id) < 1395 and modified_date < date '2022-04-08';
Planning Time: 7.967 ms
Execution Time: 82.929 ms
EXPLAIN ANALYZE select item_id from table1 WHERE modified_date < date '2022-04-08';
Planning Time: 0.054 ms
Execution Time: 4618.350 ms
EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code(item_id) <=65535 and modified_date < date '2022-04-08';
Planning Time: 0.073 ms
Execution Time: 4565.615 ms
EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code(item_id) < 1490 and modified_date < date '2022-04-08';
Planning Time: 0.148 ms
Execution Time: 84.737 ms
But,
do you suggest start using yb_hash_code() with SELECT query after above github issue is fixed?
It think the 3rd one you can force and Index Scan with hint, yb_hash_code() will be pushed down
But, maybe you are trying to do too much with yb_hash_code(). It was introduce for specific case, like doing some parallelization when reading all rows. The goal of a distributed database is to use it like one database logically without caring about the distribution. When specific placement considerations are needed, we have partitioning and tablespaces.