Would a range query on the second clustering key time out when the data grows very large?

Hello,
I have a use case like below:
table schema:
CREATE TABLE test (
    group text,
    shardid int,
    identifier text,
    timestamp timestamp,   // second-precision
    fields jsonb,
    PRIMARY KEY ((group, shardid), identifier, timestamp)
);
The table will be very large.
I’d like to run a query like the one below:

select * from test where group = 'group' and shardid in (1,2,3,…,100)
and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00'

I’m not sure whether this query would time out once the data size grows very large.
If it would time out, do you have any suggestions?
(I guess it might help to add another partition column, such as an hour-precision timestamp, but that may affect other kinds of queries, e.g. group=? and shardid=? and identifier=? and timestamp>? and timestamp<?.)

Thank you.

@Fulton_Fu yes, the query will eventually time out if the data is big enough.
You have exchange in PRIMARY KEY ((exchange, shardid)) but it’s not defined as a column. Can you fix the schema?

@dorian_yugabyte, sorry, fixed.

@Fulton_Fu

Is it possible to put timestamp before identifier in the PRIMARY KEY? (i.e. PRIMARY KEY ((group, shardid), timestamp, identifier))

^ This will depend on whether you need to make efficient queries on identifier.

Also, how many shardid values are you filtering on? (100?)

If it would time out, do you have any suggestions?

Suppose you reverse the positions of timestamp & identifier in the primary key.

You can then split the work into one query per (group, shardid) partition:

select * from test where group = 'group' and shardid = 1 and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00';

We don’t know how many rows may fall in this range, so it may time out. Instead, we pick a default LIMIT that we know won’t time out, say 10000:

select * from test where group = 'group' and shardid = 1 and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00' LIMIT 10000;

On the client we check:

rows = execute(query)   # pseudocode: the rows returned by the query above
num_rows = len(rows)
if num_rows < 10000:
    # fewer rows than the LIMIT, so we already have the whole range
    done = True
else:
    # keep track of the timestamp of the last row
    last_row_timestamp = rows[-1]["timestamp"]

Then we use the last_row_timestamp to make the following queries:

select * from test where group = 'group' and shardid = 1 and timestamp > last_row_timestamp and timestamp < '2020-01-15 19:30:00' LIMIT 10000;

^^ Note, we don’t use OFFSET because it’s inefficient for large offsets. Always keep track of the last-row timestamp and do a range query.
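
To make that loop concrete, here is a minimal client-side sketch of the keyset pagination, assuming the Python Cassandra driver (which speaks YCQL); the host, keyspace, and the 10000 page size are placeholders:

from cassandra.cluster import Cluster  # assumption: Python Cassandra driver, YCQL-compatible

PAGE_SIZE = 10000  # a LIMIT we know won't time out

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholder host/keyspace

def fetch_range(group, shardid, start_ts, end_ts):
    """Fetch all rows in (start_ts, end_ts) for one partition, one page at a time."""
    rows_out = []
    lower = start_ts
    while True:
        page = list(session.execute(
            "SELECT * FROM test WHERE group = %s AND shardid = %s "
            "AND timestamp > %s AND timestamp < %s "
            "LIMIT " + str(PAGE_SIZE),
            (group, shardid, lower, end_ts)))
        rows_out.extend(page)
        if len(page) < PAGE_SIZE:
            return rows_out            # last page: fewer rows than the LIMIT
        lower = page[-1].timestamp     # keyset pagination: continue after the last row

This mirrors the pseudocode above. Like the pseudocode, it continues strictly after the last timestamp seen, so rows that share that exact second with the last row of a page would be skipped; if that matters, an extra tie-break (e.g. on identifier) would be needed.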

^^ If you can change the schema, it’s usually better to store timestamp as DESC (i.e. WITH CLUSTERING ORDER BY (timestamp DESC)) for insertion/reading performance if it’s a timeseries.

With your current schema, you do the same logic, EXCEPT you also filter on identifier, because we’ll use it to do an efficient scan of the data:

select * from test where group = 'group' and shardid = 1 AND identifier > last_row_identifier and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00' LIMIT 10000;

We use identifier > last_row_identifier because the sort on identifier is ASC, so each page continues after the last identifier already read.
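
The same sketch adapted to your current schema (reusing session and PAGE_SIZE from the sketch above; names remain placeholders): page on identifier, the first clustering column, while keeping the fixed timestamp range in every query:

def fetch_range_current_schema(group, shardid, start_ts, end_ts):
    """Keyset pagination on identifier for PRIMARY KEY ((group, shardid), identifier, timestamp)."""
    rows_out = []
    last_identifier = ""   # sorts before any real identifier, so the first page starts at the beginning
    while True:
        page = list(session.execute(
            "SELECT * FROM test WHERE group = %s AND shardid = %s "
            "AND identifier > %s AND timestamp > %s AND timestamp < %s "
            "LIMIT " + str(PAGE_SIZE),
            (group, shardid, last_identifier, start_ts, end_ts)))
        rows_out.extend(page)
        if len(page) < PAGE_SIZE:
            return rows_out
        last_identifier = page[-1].identifier   # continue after the last identifier read

One caveat with this version: if a page boundary can fall in the middle of a single identifier's rows, an additional tie-break on timestamp would be needed so the rest of that identifier's range isn't skipped.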

Hi @dorian_yugabyte,
We definitely need efficient queries on identifier with a time range.
For the shard value count, the final value is not confirmed (maybe 16); it depends on the performance.
Both shardid and group can be calculated or parsed from identifier:
shardid = GetShardFrom(identifier), group = GetGroupFrom(identifier).
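
(Just for illustration, not the final implementation: one simple way to derive the shard is a stable hash of the identifier modulo the shard count; how group is parsed depends on the identifier format, which isn't shown here.)

import zlib

NUM_SHARDS = 16  # assumption: the "maybe 16" shard count mentioned above

def get_shard_from(identifier: str) -> int:
    # stable hash, so the same identifier always maps to the same shardid
    return zlib.crc32(identifier.encode("utf-8")) % NUM_SHARDS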

I suppose we could add an hour-precision value (timestamp_partition) as a clustering key, like below:

CREATE TABLE test (
    group text,
    shardid int,
    timestamp_partition timestamp,
    identifier text,
    timestamp timestamp,   // second-precision
    fields jsonb,
    PRIMARY KEY ((group, shardid), timestamp_partition, identifier, timestamp)
);

1: For the group-timerange query:

select * from test where group=? and shardid=? and timestamp_partition>? and timestamp_partition<? and timestamp >? and timestamp<?

2: For the identifier-timerange query:

select * from test where group=? and shardid=? and timestamp_partition>? and timestamp_partition<? and identifier=? and timestamp>? and timestamp<?

I’m not sure if this would work, and whether it would affect the performance of the 2nd query.
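
(For context, timestamp_partition would simply be the row's second-precision timestamp truncated to the hour, computed on the client before writing; a small illustrative helper:)

from datetime import datetime

def hour_partition(ts: datetime) -> datetime:
    # truncate a second-precision timestamp to its hour bucket
    return ts.replace(minute=0, second=0, microsecond=0)

# e.g. hour_partition(datetime(2020, 1, 15, 19, 7, 42)) -> datetime(2020, 1, 15, 19, 0)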

Thank you.

What other values besides group and shardid does identifier include?

How many rows do you expect for a partition (group, shardid)?

Do you have a guess where an identifier might be inside a timestamp range?

I don’t see a good reason for timestamp_partition.

^ I’m trying to find a way without a secondary index. Worst case we’ll just create an index.

What other values besides group and shardid does identifier include?

Itself, nothing else.

How many rows do you expect for a partition (group, shardid)?

Since it’s timeseries data, it will grow every day. I checked one sample: about 200k rows per day for one partition (group, shardid), and the number could get larger.
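
(If that rate stays roughly constant, a single (group, shardid) partition would accumulate on the order of 200k × 30 ≈ 6M rows per month, or roughly 73M rows per year.)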

Do you have a guess where an identifier might be inside a timestamp range ?

Not sure what you mean.

I don’t see a good reason for timestamp_partition.

I don’t know how YugabyteDB executes the query; I was just guessing it might speed up the query.

thanks.

After a chat with @Fulton_Fu we decided:

The 1st case is to query the latest few minutes of data. For this we create another table with PRIMARY KEY ((group, shardid), timestamp, identifier) and TTL = 180 seconds, and write to both tables. The low TTL keeps the overhead of the additional table small.
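
A minimal sketch of the dual write, assuming the Python Cassandra driver (YCQL-compatible) and a hypothetical second table named test_recent with the primary key above; host, keyspace, and names are placeholders, and the table-level TTL = 180 is expressed here as USING TTL 180 on each INSERT, which likewise expires rows after 180 seconds:

from cassandra.cluster import Cluster  # assumption: Python Cassandra driver, YCQL-compatible

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholder host/keyspace

def write_row(group, shardid, identifier, ts, fields_json):
    # main table: full history, PRIMARY KEY ((group, shardid), identifier, timestamp)
    session.execute(
        "INSERT INTO test (group, shardid, identifier, timestamp, fields) "
        "VALUES (%s, %s, %s, %s, %s)",
        (group, shardid, identifier, ts, fields_json))
    # recent-data table: PRIMARY KEY ((group, shardid), timestamp, identifier);
    # rows expire after 180 seconds, so the extra table stays small
    # (fields_json is a JSON string; jsonb is a YCQL extension type)
    session.execute(
        "INSERT INTO test_recent (group, shardid, timestamp, identifier, fields) "
        "VALUES (%s, %s, %s, %s, %s) USING TTL 180",
        (group, shardid, ts, identifier, fields_json))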

The 2nd case is to query an arbitrary 10-minute time range. This query runs occasionally on the backend with relaxed latency requirements. We decided to implement it by querying the list of unique identifiers and doing multiple batches of reads, merging the results on the client side.
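
A rough sketch of that second case (placeholder names again, reusing the session from the sketch above): for each batch of identifiers, issue one read per identifier for its slice of the 10-minute range, then merge the results on the client. The batch size and the async fan-out are assumptions, not something decided in the thread.

def query_timerange(group, shardid, identifiers, start_ts, end_ts, batch_size=20):
    """Read the time range for every identifier, one batch of identifiers at a time, and merge client-side."""
    merged = []
    for i in range(0, len(identifiers), batch_size):
        # issue one async read per identifier in this batch, then wait for all of them
        futures = [
            session.execute_async(
                "SELECT * FROM test WHERE group = %s AND shardid = %s "
                "AND identifier = %s AND timestamp > %s AND timestamp < %s",
                (group, shardid, ident, start_ts, end_ts))
            for ident in identifiers[i:i + batch_size]
        ]
        for f in futures:
            merged.extend(f.result())
    # client-side merge: sort the combined rows by timestamp
    merged.sort(key=lambda r: r.timestamp)
    return merged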