Function partition_hash not working as expected

I have a table with a single column “data_id” (type uuid) as the partition key and tried to delete the entire partition by:

delete from my_table where data_id = 799728e9-82e7-4c8f-a097-88afe5ec1abc;

I was under the impression that this would be very fast, but actually I am running into the internal YugabyteDB 60-second timeout for a data set of about 1.5 GB. For about 500 MB the query completes in 15 seconds.

So I thought to split the deletion into parts via the partition_hash function. But alas:

select count(*) from my_table where data_id = 799728e9-82e7-4c8f-a097-88afe5ec1abc and partition_hash(data_id) >= 0 and partition_hash(data_id) < 57109;

This returns 0 rows. If I increase the upper hash bound by one, i.e. to 57110, then I get the timeout again.

The data is quite similar, but every row has a different timestamp, so I would expect the hashes to vary. So my two questions are:

  1. Is it expected that deleting a 1.5 GB partition takes disproportionately longer than deleting 500 MB?

  2. Am I using the partition_hash method correctly? If yes, then how can I delete 1.5 GB of data in reasonable time?

A delete in YugabyteDB, as in most other databases, means every distinct row has to be found and then explicitly marked as ‘deleted’. In all of these databases that is a very intensive operation.

Based on your description, I think your question is: it’s all in the same partition, so essentially the whole partition should simply be removed at once, so why does it take so long?

@Frits_Hoogland

Note that Cassandra has a special case for this: “delete the whole partition” is implemented by inserting a single partition-level tombstone.

@ulim

I’m assuming this is what you’re expecting, correct?

Yes, I thought deleting the entire partition would be fast. So I designed my table accordingly, so that I never have to delete less than the entire partition.

FWIW, here’s the table definition (column names anonymized, but otherwise complete):

CREATE TABLE my_table (
    data_id uuid,
    time bigint,
    name text,
    v1 double,
    v2 double,
    v3 double,
    v4 double,
    v5 double,
    v6 double,
    v7 double,
    v8 tinyint,
    v9 list<float>,
    v10 double,
    v11 double,
    v12 smallint,
    v13 smallint,
    v14 smallint,
    v15 smallint,
    x float,
    y float,
    z float,
    PRIMARY KEY (data_id, time)
) WITH CLUSTERING ORDER BY (time DESC)
    AND default_time_to_live = 0
    AND transactions = {'enabled': 'false'};

Good point, my advice is quite strictly about YSQL, the PostgreSQL-compatible API.

This is YCQL, the Cassandra-compatible API?

Yes, I’m sorry I forgot to mention that. I thought the partition_hash function would only exist in YCQL.

So the data_id that you mention maps to exactly one partition_hash() value, which in this case is 57109 (0xDF15). Your first count returns nothing because your WHERE clause (>= 0 and < 57109) excludes exactly that hash value. Many different values can hash to the same result, and each tablet’s hash range covers roughly (65536 / number of tablets) hash values, so partition_hash cannot subdivide a single data_id: every row with that data_id carries the same hash. In your case, since you want to delete all of the rows with that data_id, you should use the range portion of the key (your time value) to restrict the number of records each statement touches. That should allow you to DELETE in more manageable chunks.
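For illustration, a chunked delete along those lines might look like this (the epoch-millisecond bounds are made up; adjust the slice width to whatever completes well under the timeout):

delete from my_table where data_id = 799728e9-82e7-4c8f-a097-88afe5ec1abc and time >= 1672531200000 and time < 1675209600000;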

First of all, thanks for your explanation. It seems I was wrong in thinking that the hash is computed over the entire row, and that every row would therefore have a different hash value.

As to your suggestion: unfortunately I do not know the type or name of the time column, as this is just one of many different tables that are all created dynamically and have completely different columns. The only column that is always present is the data_id (uuid) partition key. So it seems that using the partition_hash function is not an option in this case.

This leaves the question why deleting the entire partition takes so long, and especially why it takes disproportionately longer for a larger partition, if there is just one tombstone to insert?

So we do not currently implement that type of DELETE; each key must get its own tombstone. A DocDB key is composed of the partition (hash) key, the clustering (range) key, and the column_id, so there are quite a few tombstones that must be written. My advice here would be to add another column that represents a “shard_id” as part of the clustering key, to help you remove records more efficiently. This could be a simple integer field computed/generated in a given range such as (0-15). It would allow you to break the larger partitions into smaller chunks that can be DELETEd efficiently.
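A rough sketch of that idea, with an illustrative table name and most of the value columns omitted; the client would compute shard_id on every write, e.g. as a hash of the timestamp modulo 16:

CREATE TABLE my_table_sharded (
    data_id  uuid,
    shard_id tinyint,
    time     bigint,
    v1       double,
    PRIMARY KEY (data_id, shard_id, time)
) WITH CLUSTERING ORDER BY (shard_id ASC, time DESC)
    AND transactions = {'enabled': 'false'};

Deleting the partition then becomes 16 smaller statements of the form:

delete from my_table_sharded where data_id = 799728e9-82e7-4c8f-a097-88afe5ec1abc and shard_id = 0;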

That seems like a clever idea, I’ll try that.

I am planning to implement the data layer of my application in YSQL as well and then compare it to the YCQL layer. Could I expect the same situation with tombstones and partitions (since it’s all DocDB), or would there be a difference in this regard? I am thinking that YSQL is compatible with PostgreSQL, which has no such concepts.

Some other options to consider:

  1. Using YCQL’s table-level or row-level TTL to automatically retire older records (see the sketch after this list).
  2. Doing range deletes to remove the rows in smaller batches. For example, if the time range in the partition spans multiple years, you can delete from that partition one month at a time by specifying the time range in the query.
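For the TTL option, a minimal sketch, assuming TTLs in seconds (the 90-day value and the reduced column list are purely illustrative):

ALTER TABLE my_table WITH default_time_to_live = 7776000;

INSERT INTO my_table (data_id, time, v1) VALUES (799728e9-82e7-4c8f-a097-88afe5ec1abc, 1672531200000, 1.0) USING TTL 7776000;

The range-delete option looks like the chunked DELETE sketched earlier in this thread.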

Regarding YCQL vs. YSQL in terms of deletes: they will be roughly the same in that, for every record, a delete tombstone will be written (in the log-structured storage layer). While YSQL is PostgreSQL-compatible from the language/query layer perspective, the underlying storage/transaction engine is very different (it is log-structured). But note that there is some parallel in vanilla PostgreSQL too, because of multi-version concurrency control: delete markers are written and older versions are kept around until VACUUMing happens.

regards
Kannan

@Alan_Caldera
I have now tried your first suggestion, using the time column with the partition_hash function, like so:

select * from my_table where partition_hash(time) < 512;

But this did not work; I got an error message:

SyntaxException: Invalid CQL Statement. Invalid partition_hash call, found reference to unexpected column

The table definition is as shown above. What’s wrong here?

What’s the full partition key on the table? Can you paste your table DDL?

Your previous table on this thread has data_id as partition key:

PRIMARY KEY (data_id, time)

Yes, that’s the table definition used: data_id is the partition key and time is the clustering key. Maybe I don’t understand the partition concept well enough, but I thought its purpose is to localise data on the same node if it belongs together. So adding time to the partition key would distribute the data, which I am not sure is what I want in terms of read performance…

So it seems there is no way to delete large partitions at all? Or is there a way to increase the timeout beyond 60 seconds for the Java driver?

Yes it will. But you didn’t add time to the partition key. It needs to be like this:

PRIMARY KEY ((data_id, time))
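For example, a cut-down sketch (most value columns omitted; the doubled parentheses put both columns into the hash partition key):

CREATE TABLE my_table_v2 (
    data_id uuid,
    time    bigint,
    v1      double,
    PRIMARY KEY ((data_id, time))
) WITH transactions = {'enabled': 'false'};

partition_hash() then takes both partition key columns, so you can scan or delete in hash slices:

select count(*) from my_table_v2 where partition_hash(data_id, time) >= 0 and partition_hash(data_id, time) < 512;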

Depends on your queries. The rows for a single data_id will be spread over multiple tablets (probably over all tablets).

Is this time-series data where you’re deleting old data? If yes, you can do partitioning over multiple tables and then just “drop” or “truncate” old tables, like I explained in Database schema in YugabyteDB for storing statistics collected periodically.
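A minimal sketch of that approach, with one (illustrative) table per month:

CREATE TABLE my_table_2023_01 (
    data_id uuid,
    time    bigint,
    v1      double,
    PRIMARY KEY (data_id, time)
) WITH transactions = {'enabled': 'false'};

DROP TABLE my_table_2023_01;

Retiring a whole window is then a cheap metadata operation instead of a row-by-row delete.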

Getting a timeout on a query in a tablet means that there are just too many tombstones that haven’t been compacted. A compact table command should fix this; see yb-ts-cli (command line tool for advanced yb-tserver operations) in the YugabyteDB Docs.

There should be a way, I’ll ask internally. This is for the DELETE statement, correct?

Yes, correct. Thanks a lot for helping out with the many tips. I’m pretty new to Yugabyte, and to DocDB in general, so I may not have some of the design basics down yet.

Oh, and btw: yes, it is time series, but we are not just deleting old data, we are deleting all data for a given data_id. I would not expect this to happen frequently, but it must be doable, and it’s ok if it takes a few minutes.

I could use a separate table for each data_id and then drop the table instead of deleting data. But then I would lose the ability to query across several or even all data_ids. Those would become multi-table queries, and I’d need an additional join column in each such table.

You would suffer from the “too many tablets” problem. You don’t want more than ~2000 tablets per server (depending on the server size).

You might do it in YSQL with colocated tables, though. A truncate there works like a Cassandra partition delete.

You still might have too much metadata overhead for the schema, depending on the number of tables. With partitioning it might be lower.

Well, then I am pretty much lost on how to design my data model on YugabyteDB :-)

I have a potentially large number of data sets, certainly more than 2000 over time. Some of them can be over a gigabyte in size; most will be less. But it’s generally not at a level I would worry about when using PostgreSQL: I know it would become slower over time, but it would shoulder the load. My hope for YugabyteDB was that I could actually do more in terms of size and volume than on a single-node PostgreSQL server.

Based on the Q&A here, you may want to consider enrolling in our newly released YCQL Development self-paced course at Yugabyte University.

The course is free and includes a browser-based Gitpod environment running YugabyteDB in a 3 node cluster with VSCode + Jupyter (ipython) notebooks. It’s been especially designed for developers like yourself to help demystify how YugabyteDB functions as a distributed SQL database.

The course offers a deep dive into partition keys, clustering keys, DocDB, and secondary indexes. The notebook exercises facilitate experimentation and tinkering. Using the notebooks, you get to view DocKeys and SubDocKeys using SST dumps of table and index tablets, which will help demystify how DDL affects data on disk. The videos are clear and concise and offer easy-to-understand examples of how sharding works. The course also covers a YCQL methodology for query-driven data modeling, which may prove beneficial to you.

Although the lab exercises are not about time-series data, concepts such as TTL, upserts, and TimeUUID are also covered in this course. I encourage you to enroll. It would be great to know if you find the course helpful in terms of both learning more about YugabyteDB and YCQL. Check out https://university.yugabyte.com/courses/yugabytedb-ycql-development.