Data Rollup planned?

Are there any plans to implement data rollup for YCQL? By that I mean automatic aggregation of older data (e.g. something like https://docs.oracle.com/cd/E19062-01/sun.mgmt.ctr30/816-2761/prm-intro-7/index.html). The idea is to specify a maximum size for the database, say 5 TB, and then keep pumping current sensor data into it without having to fear that the storage fills up. Data rollup would make sure that there is never more than 5 TB of data, but instead of simply deleting older data, it would reduce its granularity.

The idea is that I can retain all data ever ingested into the system, but the older it is, the less precise it becomes. So for weather data I might still be able to say what the average temperature was for a certain day in the past, but I will not have the individual data points anymore. For more recent days, I will still have one measurement every 5 seconds, and over time this is reduced to one measurement every minute, then every hour, and so on.
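Just to make the idea concrete, the kind of aggregation I have in mind would look something like this in SQL (the table and column names are made up, of course):

```sql
-- Hypothetical raw table: one temperature reading every 5 seconds per sensor.
CREATE TABLE sensor_readings (
    sensor_id   int,
    ts          timestamptz,
    temperature double precision,
    PRIMARY KEY (sensor_id, ts)
);

-- Coarser table that older data gets rolled up into.
CREATE TABLE sensor_readings_1min (
    sensor_id       int,
    minute          timestamptz,
    avg_temperature double precision,
    samples         bigint,
    PRIMARY KEY (sensor_id, minute)
);

-- Rollup step: replace raw 5-second points older than 30 days
-- with one averaged row per sensor and minute.
INSERT INTO sensor_readings_1min
SELECT sensor_id,
       date_trunc('minute', ts),
       avg(temperature),
       count(*)
FROM   sensor_readings
WHERE  ts < now() - interval '30 days'
GROUP  BY sensor_id, date_trunc('minute', ts);
```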

Hi @ulim

Something like this isn’t in the short- or long-term roadmap (see the roadmap in the GitHub repo: https://github.com/yugabyte/yugabyte-db).

You can use the YSQL layer and with some simple queries be able to do manual rollups. A good way would be to use partitioning, do rollups how you’d prefer, and then drop the old partitions with the full data.
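For example, purely a sketch reusing the hypothetical table names from your question, with monthly range partitions you can roll one month up and then drop it:

```sql
-- Sketch only: raw readings partitioned by month.
CREATE TABLE sensor_readings (
    sensor_id   int,
    ts          timestamptz,
    temperature double precision
) PARTITION BY RANGE (ts);

CREATE TABLE sensor_readings_2024_01 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Aggregate table that the rolled-up data goes into.
CREATE TABLE sensor_readings_hourly (
    sensor_id       int,
    hour            timestamptz,
    avg_temperature double precision,
    samples         bigint
);

-- 1. Roll the old partition up into the aggregate table.
INSERT INTO sensor_readings_hourly
SELECT sensor_id, date_trunc('hour', ts), avg(temperature), count(*)
FROM   sensor_readings_2024_01
GROUP  BY sensor_id, date_trunc('hour', ts);

-- 2. Drop the partition that holds the full-granularity rows.
DROP TABLE sensor_readings_2024_01;
```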

You can also create a feature request on GitHub: https://github.com/yugabyte/yugabyte-db/issues

Thanks a lot for the information. I believe this would be more of a feature for the YCQL layer, because I am under the impression that YCQL is the right choice for time-series IoT data. The consistency requirements aren’t as strong, but parallelism is king, because there is so much data.

But perhaps the intended use case for YCQL is not big data processing, so storage would not usually be a concern?

Consistency is the same in YCQL/YSQL.

There is more development effort going into YSQL. It’s better to make YSQL faster to catch up with YCQL than to add features to YCQL.

Ok, gotcha. I thought that since YCQL seems to be modelled after Cassandra, it would be an eventually consistent system and therefore always have a performance advantage over distributed strongly consistent systems.

The general idea is that it’s not nice from a developer’s point of view to work with async (eventually consistent) systems: you end up putting the synchronization logic in your application code.

The idea was to have the same API as Cassandra but with a nicer developer experience (because of consistent replication) and faster performance: synchronous replication is generally faster (faster quorum reads, no tombstones, etc.), the architecture is better, and C++ is faster than Java.

Hi @ulim,
Inspired by your question, I’ve written an example of doing this kind of rollup in SQL:

A single query scheduled daily can do the job.
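Roughly, the idea (simplified here and reusing the hypothetical tables from your question; the post has the full version) is a single statement along these lines:

```sql
-- Simplified sketch of the approach: delete the old raw rows and, in the same
-- statement, turn the returned rows into aggregated ones. The whole rollup is
-- one atomic SQL statement that can be scheduled daily.
WITH deleted AS (
    DELETE FROM sensor_readings
    WHERE  ts < now() - interval '30 days'
    RETURNING sensor_id, ts, temperature
)
INSERT INTO sensor_readings_1min (sensor_id, minute, avg_temperature, samples)
SELECT sensor_id,
       date_trunc('minute', ts),
       avg(temperature),
       count(*)
FROM   deleted
GROUP  BY sensor_id, date_trunc('minute', ts);
```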

From the blog post:

DELETE those rows from the distributed storage and return them to the SQL layer

If you have a lot of rows, it will be better to use partitioning and just drop the partition, because that is much more efficient than doing deletes (a file delete vs. a tombstone insert for each row plus compaction).

Sure. But then you have to manage the visibility (you cannot insert + drop a partition in the same transaction), the granularity (you don’t want per-day partitions for years), and indexing (no global indexes on declarative partitioning). Frequent transactional delete+insert is easier, but drop partition + maintaining another aggregate table is cheaper.

@dorian_yugabyte
Your example looks interesting. I believe if you talk to your marketing guys, they will say you should expand it into a feature for the next version, because that is easy to sell :slight_smile:

Doing that as a dedicated feature would require many options to cope with specific structures or workloads. I think the great value of SQL is being able to do this kind of thing customized to each case. But maybe a function that generates the queries could make sense.

No one takes the customized approach away from you if you need it. But I think it’s an 80/20 thing: in 80% of the cases you will have a timestamp column and roll up over that, and you will have a target size for the database. So these are the two configuration options needed. The rest can be automated in an algorithm that simply keeps thinning out until the target size is reached. The interesting part is the selection of the rows - you don’t just select by day, week or month, because that would introduce more configuration options. Instead you look at the distribution of the timestamps to generate sensible windows. That way you can support IoT data from just one day as well as weather data spanning decades.


@ulim that totally makes sense. Thanks for the feedback :+1:t2: