How's TTL performance

Fulton_Fu · October 23, 2019, 9:46am

Hello,

How is row-level TTL performance in YugabyteDB?
How does it affect reads and writes that run at the same time?
I’d like to know if there’s a benchmark, especially for huge data sets, or whether I need to run one myself.
Our case: huge amounts of time-series data keep coming in, and we want stale data to be moved out automatically (deleted or moved to another place).
TTL is one option. The other is to split the table by period (hour, day, week).
But with the latter, a single request may need multiple queries.
Is there a good way to handle this? I look forward to your suggestions.

Thanks.

dorian_yugabyte · October 23, 2019, 12:42pm

@Fulton_Fu see replies inline

Inserts are as usual.

Writes are as usual. Reads will still read the row from disk, and return it only if the row hasn’t expired.
Expired rows are deleted from disk on compaction. If too many rows have expired, that will also trigger a compaction so there is no wasted space.

This will depend on many factors like TTL, amount of data, type of reads, write volume, etc.

Time series (with some type of regularity) are best implemented as separate tables.
This makes it possible to just drop old tables instead of expiring rows.

Can you explain your queries?

Explain your case (TTL, expected rows per day, date range of queries) and we can see which is best.
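
The separate-tables idea above could be sketched roughly as follows. This is only an illustration, assuming one table per day; the `events` prefix, the `events_YYYYMMDD` naming scheme, and the helper names are hypothetical and not something YugabyteDB prescribes.

```python
from datetime import date, timedelta
from typing import List


def table_name(day: date, prefix: str = "events") -> str:
    """Name of the (hypothetical) per-day table, e.g. events_20191023."""
    return f"{prefix}_{day:%Y%m%d}"


def tables_for_range(start: date, end: date, prefix: str = "events") -> List[str]:
    """Tables that cover the half-open day range [start, end)."""
    days = (end - start).days
    return [table_name(start + timedelta(days=i), prefix) for i in range(days)]


def tables_to_drop(existing: List[str], today: date, keep_days: int,
                   prefix: str = "events") -> List[str]:
    """Tables older than the retention window; each one can simply be dropped."""
    cutoff = table_name(today - timedelta(days=keep_days), prefix)
    # Fixed-width YYYYMMDD suffixes sort lexicographically in date order.
    return sorted(t for t in existing if t.startswith(prefix + "_") and t < cutoff)
```

A request that spans two days then maps to two table names, and the application issues one query per table.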

dorian_yugabyte · October 23, 2019, 12:50pm

How TTL works underneath: Persistence in YugabyteDB | YugabyteDB Docs

table level: CREATE TABLE statement [YCQL] | YugabyteDB Docs

row level: INSERT statement [YCQL] | YugabyteDB Docs
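
For reference, the two TTL levels linked above might be used like this. The snippet only builds YCQL statement strings; the column layout (`id`, `ts`, `attributes`) is an assumption made for illustration, so check the linked docs for the exact options your version supports.

```python
def create_table_with_ttl(table: str, ttl_seconds: int) -> str:
    """Table-level TTL: rows expire ttl_seconds after they are written."""
    return (
        f"CREATE TABLE {table} ("
        "id text, ts timestamp, attributes text, "
        "PRIMARY KEY ((id), ts)) "
        f"WITH default_time_to_live = {ttl_seconds};"
    )


def insert_with_ttl(table: str, ttl_seconds: int) -> str:
    """Row-level TTL: only this row gets the given TTL."""
    return (
        f"INSERT INTO {table} (id, ts, attributes) "
        f"VALUES (?, ?, ?) USING TTL {ttl_seconds};"
    )
```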

Fulton_Fu · October 24, 2019, 1:49am

Thank you @dorian_yugabyte for the quick reply. The row count would be in the billions.
The TTL varies: it could be several days when the daily data volume is very large, or tens to hundreds of days when the daily volume is smaller.

If we separate the tables, like table_day1, table_day2:
The query I mentioned is the case where the data needed spans two days and therefore comes from two tables.
Is it possible to do this in one call, like Lua in Redis or a stored procedure in a relational database, or do we have to issue two queries?

One more question (not sure if I should create a new topic):
Assume we have the table ((id: partition key), timestamp: sort key), attributes.
For the query: select attributes from table where timestamp>=start and timestamp<end
If the amount of data in the date range [start, end) is huge,
does it help to split this kind of query into:
select attributes from table
where partition_hash(id)>=0 and partition_hash(id)<=65535 (split this condition)
and timestamp>=start and timestamp<end

Thank you.
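
To make the "split this condition" idea concrete, here is a minimal sketch that cuts the 0..65535 hash range from the question into contiguous sub-ranges and builds one query string per sub-range. The function names and the use of `?` bind markers for start/end are assumptions for illustration.

```python
from typing import List, Tuple


def split_hash_range(parts: int, lo: int = 0, hi: int = 65535) -> List[Tuple[int, int]]:
    """Split the inclusive range [lo, hi] into `parts` contiguous inclusive sub-ranges."""
    total = hi - lo + 1
    if not 1 <= parts <= total:
        raise ValueError("parts must be between 1 and the size of the range")
    bounds = [lo + total * i // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(parts)]


def build_queries(table: str, parts: int) -> List[str]:
    """One query per hash sub-range; start and end are left as bind markers."""
    return [
        f"SELECT attributes FROM {table} "
        f"WHERE partition_hash(id) >= {a} AND partition_hash(id) <= {b} "
        "AND timestamp >= ? AND timestamp < ?;"
        for a, b in split_hash_range(parts)
    ]
```

With `parts=4` the sub-ranges are 0..16383, 16384..32767, 32768..49151 and 49152..65535, so together they cover every hash value exactly once.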

dorian_yugabyte · October 24, 2019, 7:17am

So it’s per-row TTL? How is it set, and what’s the business logic?

You can also separate per-week/month and also do filtering inside them.

It’s actually better in distributed systems to split some queries into several parallel ones when you’re filtering by partition.

In the backend, it still has to do a table/tablet scan. Splitting will be more efficient, though, since you can run the queries in parallel and each one will hit the leader node, which lowers inter-node networking.

But to have selectivity, you have to filter by the partition column. You could also create an index (or double the writes).
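
Running the split queries in parallel on the client side could look something like the sketch below. `run_query` is a hypothetical callable that you would supply (for example, a thin wrapper around your driver session); nothing here is a YugabyteDB API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def run_parallel(
    run_query: Callable[[int, int], List[dict]],
    ranges: Sequence[Tuple[int, int]],
    max_workers: int = 8,
) -> List[dict]:
    """Run one query per (lo, hi) hash sub-range concurrently and concatenate the rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `ranges`.
        chunks = pool.map(lambda r: run_query(r[0], r[1]), ranges)
        return [row for chunk in chunks for row in chunk]
```

Threads are enough here because each worker mostly waits on the network; the right `max_workers` depends on cluster size and is something to measure rather than assume.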

Fulton_Fu · October 24, 2019, 8:03am

Yes, row level; we set it when new data comes in. From this info, I suppose we should choose to separate the tables.
The partition_hash condition also results in a table scan, so we need to think about an index or double writes.

Thank you for the suggestion.

kisg · October 24, 2019, 8:41am

Hi,

without knowing much about your use case, this is just a shot in the dark, but the large amounts of continuous data could make it a good candidate for a streaming data architecture, e.g. using Kafka.

In this architecture you could use Kafka Streams to process and aggregate the raw data, with YugabyteDB acting as a high performance, distributed state storage for Kafka Streams. This way YugabyteDB can provide a query interface for your aggregate / processed data.

Best Regards,
Gergely

Fulton_Fu · October 25, 2019, 1:09am

Thank you,
The use case is simple.
The schema could be ((groupid, id), timestamp), attributes.
We want stale data to be removed after a period.
For a query like select from table where timestamp>=start and timestamp<end, we want to get statistics over the time range, like max, min, sum…
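
If the range ends up being answered by several queries (per table or per hash sub-range), statistics such as max, min and sum can be computed per piece and then combined. A minimal sketch, with the `Stats` class and `stats_of` helper being hypothetical names introduced only for this example:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Stats:
    count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def merge(self, other: "Stats") -> "Stats":
        """Combine two partial results; empty partials (None min/max) are ignored."""
        mins = [m for m in (self.minimum, other.minimum) if m is not None]
        maxs = [m for m in (self.maximum, other.maximum) if m is not None]
        return Stats(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(mins) if mins else None,
            maximum=max(maxs) if maxs else None,
        )


def stats_of(values: Iterable[float]) -> Stats:
    """Partial statistics for the rows returned by one query."""
    vals = list(values)
    if not vals:
        return Stats()
    return Stats(len(vals), float(sum(vals)), min(vals), max(vals))
```

Because min, max and sum are associative, the order in which the partial results are merged does not matter.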

dorian_yugabyte · October 25, 2019, 12:04pm

Are the timestamp ranges fixed (say daily, weekly, etc.) or are they dynamic?

Fulton_Fu · October 27, 2019, 1:00am

Hi
It’s fixed, every minute/hour/day
thanks
