How's TTL performance

Hello,

How’s TTL performance(Row level) in yugabyte?
How it affect when write/read at same time?
I’d like to know if there’s a a benchmark test especially for huge data?Or I need do it myself.
The case is that there’s always huge timeseries data came in we want the stale data auto move out (delete/move to other place).
TTL is one option. the other one is to split table by period(hour,day,week)
But with the latter, it may need run multiple query for a request.
Is there a good way to handle this? Look forward to your suggesstions.

Thanks.

@Fulton_Fu see replies inline

Inserts are as usual.

Writes are as usual. While reads will also read from disk, and return data if the row hasn’t expired.
Expired rows will be deleted from disk on compaction. If there are too many rows expired it will also trigger a compaction so there is no wasted space.

This will depend on many factors like TTL, amount of data, type of reads, write volume, etc.

Time series (with some type of regularity) are best implemented as separate tables.
This will make it possible to just drop old tables and not expire.

Can you explain your queries ?

Explain your case (ttl, expected rows per day, date-range of queries) and we can see which is best.

How TTL works underneath: https://docs.yugabyte.com/latest/architecture/docdb/persistence/#ycql-time-to-live-ttl-example

table level: https://docs.yugabyte.com/latest/api/ycql/ddl_create_table/#table-properties-1

row level: https://docs.yugabyte.com/latest/api/ycql/dml_insert/#insert-a-row-with-expiration-time-using-the-using-ttl-clause

1 Like

Thank you @dorian_yugabyte for quick reply, the rows count would be billions.
The ttl depends, it could be several days for data amount too large in one day, also could be hundreds/tens of days for the data amount not too large in one day.

If we seperate the table like table_day1,table_day2
The query I mentioned: If need the data come from the two days which come from two tables.
Is it possible to do this thing like lua in redis or procedure in relation database, or have to request two query?

and one more question, not sure if I should create new topic,
Assume we have table ((id:partitionkey), timestamp:sortkey), attributes
For query: select attributes from table where timestamp>=start and timestamp<end
If the data amount is huge among the date range[start,end),
Does it help if we split this kind of query to be:
select attributes from table
where partition_hash(id)>=0 and partition_hash(id)<=65535 (split this condition)
timestamp>=start and timestamp<end

Thank you.

So it’s per-row ttl ? How is it set, what’s the business logic ?

You can also separate per-week/month and also do filtering inside them.

It’s actually better in distributed systems to split some queries into several parallel ones when you’re filtering by partition.

In the backend, it still has to do table/tablet-scan. This will be more efficient, since you can do the queries in parallel and because it will hit the leader node lower inter-node-networking.

But to have selectivity, you have to filter by the partition column. You can also create an index ? (or double the writes ?)

Yes, row level, set it when new data came in. From these info, I suppose we should choose to seperate the tables.
partition_hash condition also made a table scan, we need think about index or double write.

Thank you for suggesstion.

Hi,

without knowing much about your use case, this is just a shot in the dark, but the large amounts of continuous data could make it a good candidate for a streaming data architecture, e.g. using Kafka.

In this architecture you could use Kafka Streams to process and aggregate the raw data, with YugabyteDB acting as a high performance, distributed state storage for Kafka Streams. This way YugabyteDB can provide a query interface for your aggregate / processed data.

Best Regards,
Gergely

Thank you,
The use case is simple
schema could be ((groupid, id),timestamp) attributes.
we want the stale data could be removed after a period.
For the query like select from table where timetamp>=start and timestamp<end, we want to get a statistic from the time range like max,min,sum…

Are the timestamp-ranges fixed (say daily,weekly etc) or is it dynamic ?

Hi
It’s fixed, every minute/hour/day
thanks