Why DELETE query with yb_hash_code() degrades performance?

sham_yuga · April 10, 2022, 1:15am

Given the schema below:

CREATE TABLE IF NOT EXISTS public.item_data
(
    item_id uuid NOT NULL,
    id2 integer NOT NULL,
    create_date timestamp without time zone NOT NULL,
    modified_date timestamp without time zone NOT NULL,
    CONSTRAINT item_data_pkey PRIMARY KEY (item_id, id2)
);

We have 48 tablets in our YugabyteDB setup, so the first hash range is [0, 1395).
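The bound of the first range can be checked with a little arithmetic. The sketch below assumes the yb_hash_code() space (0 to 65535) is split into equal-width ranges, one per tablet. The helper name first_hash_range is made up for illustration, and the real tablet boundaries should be confirmed on the cluster. Under that assumption, 48 tablets give a first range of [0, 1365), which is the bound used in the GitHub issue later in this thread, rather than 1395.

```python
HASH_SPACE = 65536  # yb_hash_code() values run from 0 to 65535


def first_hash_range(num_tablets: int) -> tuple[int, int]:
    """Return (lo, hi) for the first range [lo, hi).

    Assumes an even split of the hash space across tablets,
    which is an assumption of this sketch, not a checked fact.
    """
    width = HASH_SPACE // num_tablets
    return 0, width


lo, hi = first_hash_range(48)
print(f"first range: [{lo}, {hi})")
```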

Below are the execution times of DELETE queries:

Query 1 (using yb_hash_code()):

EXPLAIN ANALYZE DELETE FROM item_data x WHERE yb_hash_code(x.item_id) >= 0 AND yb_hash_code(x.item_id) < 1395 AND x.item_id = ANY (arrayOfItemIds);

This takes about 2 seconds to execute (arrayOfItemIds stands for the array of item IDs).

Query 2:

EXPLAIN ANALYZE DELETE FROM item_data x WHERE x.item_id = ANY (listOfItemIds);

This takes about 2 milliseconds to execute (listOfItemIds stands for the same array of item IDs).

DELETE is a write operation, so the query plan includes:
Step a) finding the shard for the given WHERE clause
Step b) executing the query on the shard leader
Step c) replicating the changes to the shard followers
Step d) responding to the client

Using yb_hash_code() in the WHERE clause should avoid step a. Is that correct?

Why does Query 2 perform faster than Query 1, even though Query 1 uses yb_hash_code()?

FranckPachot · April 10, 2022, 9:07am

Hi Sham,
Here, because item_id is the first column of the primary key, YugabyteDB determines the right tablets, so you don’t need to filter with the hash code.
But you are right, adding the criterion should not add more work. The execution plan should show what happens. My guess is that you see Rows Removed by Index Recheck with approximately 1/48 of the rows there. This would mean that, with the predicate on yb_hash_code(), the other predicate was not pushed down.
I’ll check. I think we have a GitHub issue about this; if not, I’ll open one.
Franck

sham_yuga · April 10, 2022, 2:45pm

@FranckPachot
Please share the GitHub issue link

FranckPachot · April 10, 2022, 8:16pm

Here it is, coming from your example:

github.com/yugabyte/yugabyte-db

[YSQL] adding yb_hash_code() may degrade performance when having other predicates

opened 08:13PM - 10 Apr 22 UTC

closed 11:45PM - 23 Feb 23 UTC

FranckPachot

kind/bug area/ysql priority/medium

Jira Link: [DB-882](https://yugabyte.atlassian.net/browse/DB-882)

### Description

This is an example from https://forum.yugabyte.com/t/why-delete-query-with-yb-hash-code-degrades-performance/1572/2

Here is a test case on 11.2-YB-2.11.2.0-b0

```
CREATE TABLE IF NOT EXISTS public.item_data
(
    item_id uuid NOT NULL,
    id2 integer NOT NULL,
    create_date timestamp without time zone NOT NULL,
    modified_date timestamp without time zone NOT NULL,
    CONSTRAINT item_data_pkey PRIMARY KEY (item_id, id2)
) split into 48 tablets;

create extension if not exists pgcrypto;

insert into public.item_data select gen_random_uuid(),generate_series(1,10000),now(),now();

EXPLAIN ANALYZE select * FROM item_data x WHERE yb_hash_code(x.item_id)>=0 and yb_hash_code(x.item_id)<1365 and x.item_id = any ( array['ecff8203-d8a8-4e45-9f04-316aeea018f1'::uuid,'0a670402-590d-41c6-b9d5-1cd742f8946e'::uuid]) ;
```

This shows some "Rows Removed by Index Recheck"

```
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using item_data_pkey on item_data x  (cost=0.00..7.83 rows=31 width=36) (actual time=3.584..3.584 rows=0 loops=1)
   Index Cond: ((yb_hash_code(item_id) >= 0) AND (yb_hash_code(item_id) < 1365) AND (item_id = ANY ('{ecff8203-d8a8-4e45-9f04-316aeea018f1,0a670402-590d-41c6-b9d5-1cd742f8946e}'::uuid[])))
   Rows Removed by Index Recheck: 417
 Planning Time: 0.061 ms
 Execution Time: 3.625 ms
(5 rows)
```

It is faster without the yb_hash_code() restriction:

```
EXPLAIN ANALYZE select * FROM item_data x WHERE x.item_id = any ( array['ecff8203-d8a8-4e45-9f04-316aeea018f1'::uuid,'0a670402-590d-41c6-b9d5-1cd742f8946e'::uuid]) ;

                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using item_data_pkey on item_data x  (cost=0.00..15.25 rows=100 width=36) (actual time=0.617..0.617 rows=0 loops=1)
   Index Cond: (item_id = ANY ('{ecff8203-d8a8-4e45-9f04-316aeea018f1,0a670402-590d-41c6-b9d5-1cd742f8946e}'::uuid[]))
 Planning Time: 0.055 ms
 Execution Time: 0.645 ms
(4 rows)
```

In this example there are 48 tablets with 10000/48=208 rows per tablet. The predicate on yb_hash_code() filters to one tablet (1365=65536/48). The predicate on the two UUID goes to two other tablets:

```
select yb_hash_code(unnest) from unnest(array['ecff8203-d8a8-4e45-9f04-316aeea018f1'::uuid,'0a670402-590d-41c6-b9d5-1cd742f8946e'::uuid]);
```

It looks like the all rows from two tablets have been read (417=2*10000/48). When looking at rocksdb_number_db_seek the same tablet (the one for yb_hash_code<1365) has been read two times. So maybe the predicate on item_id was not pushed down, yb_hash_code() was used to select the tablet, but read for each value in the array
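The row counts quoted in the issue can be reproduced with a quick calculation. This sketch assumes the 10000 rows are spread evenly over the 48 tablets, which random UUIDs make only approximately true, so the figures are estimates rather than exact counts.

```python
TOTAL_ROWS = 10_000
NUM_TABLETS = 48

# Expected rows in a single tablet under an even spread (about 208.3).
rows_per_tablet = TOTAL_ROWS / NUM_TABLETS

# Reading one tablet's worth of rows twice, once per array value (about 416.7).
rows_two_reads = 2 * rows_per_tablet

print(int(rows_per_tablet))    # 208
print(round(rows_two_reads))   # 417, matching "Rows Removed by Index Recheck"
```

The match with 417 is consistent with the hypothesis in the issue (one tablet read once per array element), but it does not prove it on its own.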

sham_yuga · April 10, 2022, 9:05pm

@FranckPachot

Another question…

Given the schema below:

CREATE TABLE IF NOT EXISTS public.table1
(
    customer_id uuid NOT NULL,
    item_id uuid NOT NULL,
    kind character varying(100) NOT NULL,
    details character varying(100) NOT NULL,
    created_date timestamp without time zone NOT NULL,
    modified_date timestamp without time zone NOT NULL,
    CONSTRAINT table1_pkey PRIMARY KEY (customer_id, kind, item_id)
);

CREATE UNIQUE INDEX IF NOT EXISTS unique_item_id ON table1(item_id);

CREATE UNIQUE INDEX IF NOT EXISTS unique_item ON table1(customer_id, kind) WHERE kind='NEW' OR kind='BACKUP';

We see that yb_hash_code() performs better with a SELECT query when the hash range is narrow:

EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code (item_id) < 1395 and modified_date < date '2022-04-08';
Planning Time: 7.967 ms
Execution Time: 82.929 ms

EXPLAIN ANALYZE select item_id from table1 WHERE modified_date < date '2022-04-08';
Planning Time: 0.054 ms
Execution Time: 4618.350 ms

EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code(item_id) <=65535 and modified_date < date '2022-04-08';
Planning Time: 0.073 ms
Execution Time: 4565.615 ms

EXPLAIN ANALYZE select item_id from table1 WHERE yb_hash_code(item_id) >= 0 and yb_hash_code(item_id) < 1490 and modified_date < date '2022-04-08';
Planning Time: 0.148 ms
Execution Time: 84.737 ms

But do you suggest we start using yb_hash_code() with SELECT queries once the GitHub issue above is fixed?

FranckPachot · April 11, 2022, 9:16am

Hi,
The two queries that take 4 seconds are doing a Seq Scan, I guess. yb_hash_code() is pushed down only for an Index Scan. See GitHub issue #12096 in yugabyte/yugabyte-db, "[YSQL] yb_hash_code() not pushed down to SeqScan".

I think that for the third one you can force an Index Scan with a hint, and yb_hash_code() will then be pushed down.

But maybe you are trying to do too much with yb_hash_code(). It was introduced for specific cases, such as parallelizing a read of all rows. The goal of a distributed database is to use it logically like a single database, without caring about the distribution. When specific placement considerations are needed, we have partitioning and tablespaces.
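As an illustration of the parallel-read use case mentioned above, here is a minimal sketch that splits the hash space into contiguous slices, one per worker, and builds a yb_hash_code() predicate for each. The function name hash_range_predicates and the choice of 4 workers are hypothetical, and whether each slice is actually pushed down still depends on the plan (an Index Scan on an index hashed on that column, per the comments above).

```python
HASH_SPACE = 65536  # yb_hash_code() values run from 0 to 65535


def hash_range_predicates(column: str, num_workers: int) -> list[str]:
    """Build one WHERE-clause fragment per worker.

    The slices are contiguous, non-overlapping, and together cover
    the whole hash space [0, 65536).
    """
    bounds = [i * HASH_SPACE // num_workers for i in range(num_workers + 1)]
    return [
        f"yb_hash_code({column}) >= {lo} AND yb_hash_code({column}) < {hi}"
        for lo, hi in zip(bounds, bounds[1:])
    ]


for predicate in hash_range_predicates("item_id", 4):
    # Each worker would run its own query over one slice.
    print(f"SELECT item_id FROM table1 WHERE {predicate};")
```

Each generated statement covers a disjoint slice, so the union of the results is the full table read once.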
