If it would time out, do you have any suggestions?
Suppose you reverse the position of timestamp and identifier in the primary key, and suppose you can split the read into a per-partition query:

select * from test where group = 'group' and shardid = 1 and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00';
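For illustration, here is a rough sketch of that per-partition fan-out from a Python client, assuming a Cassandra/Scylla-style cluster and the DataStax cassandra-driver; the contact point, keyspace name, shard count, and time window are all assumptions, not part of the original question:

from datetime import datetime
from cassandra.cluster import Cluster    # DataStax driver; assumed client library

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # assumed keyspace

NUM_SHARDS = 16                           # assumed shard count
start_ts = datetime(2020, 1, 15, 19, 0, 0)
end_ts = datetime(2020, 1, 15, 19, 30, 0)

# One time-bounded query per partition (group, shardid); no LIMIT yet,
# which is exactly the problem addressed next.
query = session.prepare(
    "SELECT * FROM test "
    "WHERE group = ? AND shardid = ? AND timestamp > ? AND timestamp < ?"
)
rows_by_shard = {
    shardid: list(session.execute(query, ("group", shardid, start_ts, end_ts)))
    for shardid in range(NUM_SHARDS)
}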
We don't know how many rows may fall in this range, so the query may still time out. So we pick a default LIMIT that we know won't time out, say 10,000:

select * from test where group = 'group' and shardid = 1 and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00' LIMIT 10000;
On the client we check:

rows = [...]                      # the rows returned by the query above
num_rows = len(rows)
if num_rows < 10000:
    # fewer rows than the LIMIT: we got everything in the range
    pass
else:
    # keep track of the timestamp of the last row
    last_row_timestamp = rows[-1]["timestamp"]
Then we use last_row_timestamp to make the following queries:

select * from test where group = 'group' and shardid = 1 and timestamp > last_row_timestamp and timestamp < '2020-01-15 19:30:00' LIMIT 10000;
^^ Note: we don't use OFFSET, because it's inefficient for large offsets. Always keep track of the last-row timestamp and do a range query instead.
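Put together, the per-shard pagination loop might look roughly like this, continuing the driver sketch above; the 10,000 page size follows the figure used earlier, and the attribute-style column access is specific to the assumed driver:

PAGE_SIZE = 10000

def fetch_window(session, group, shardid, start_ts, end_ts):
    """Fetch every row in (start_ts, end_ts) for one shard, one bounded page at a time."""
    all_rows = []
    lower = start_ts
    while True:
        page = list(session.execute(
            "SELECT * FROM test "
            "WHERE group = %s AND shardid = %s "
            "AND timestamp > %s AND timestamp < %s "
            f"LIMIT {PAGE_SIZE}",
            (group, shardid, lower, end_ts),
        ))
        all_rows.extend(page)
        if len(page) < PAGE_SIZE:
            break                        # short page: nothing left in the range
        # Resume strictly after the last row seen instead of using OFFSET.
        # Assumes timestamps don't repeat across a page boundary; if they can,
        # a tie-breaking column such as identifier is needed as well.
        lower = page[-1].timestamp
    return all_rows

# e.g. rows = fetch_window(session, "group", 1, start_ts, end_ts)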
^^ If you can change the schema, it's usually better to cluster on timestamp in DESC order, for insertion/reading performance, if it's a time series.
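For completeness, such a time-series-oriented layout could be declared roughly as follows; the column types, and the use of identifier as a tie-breaking clustering column, are guesses based only on the queries above:

# Hypothetical DDL for the reversed, DESC-ordered layout; adjust names and
# types to match the real schema, which isn't shown in the question.
session.execute("""
    CREATE TABLE IF NOT EXISTS test (
        group      text,
        shardid    int,
        timestamp  timestamp,
        identifier text,
        PRIMARY KEY ((group, shardid), timestamp, identifier)
    ) WITH CLUSTERING ORDER BY (timestamp DESC, identifier ASC)
""")

With DESC ordering, rows come back newest-first, so the paging comparison in the loop flips to timestamp < last_row_timestamp.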
With your current schema:
You do the same logic, except you also filter on identifier, because we'll use it to do an efficient scan of the data:
select * from test where group = 'group' and shardid = 1 and identifier < last_row_identifier and timestamp > '2020-01-15 19:00:00' and timestamp < '2020-01-15 19:30:00' LIMIT 10000;
We do identifier < last_row_identifier because the sort is ASC.