In my model, I want to store a variable-sized circular buffer for each key. The buffer size may go up to 2k elements. I was wondering how that would be possible.
This will depend on exactly when the circular buffer will wrap around. Can you please explain that part a bit?
For example, in a fixed-size buffer this is relatively easy to implement, since you know when the wrap-around happens (at a certain element count being added).
In my use case, another table would hold the length of the buffer. We can assume both keys are the same or easily derivable from each other. The length of the buffer changes dynamically based on user behaviour.
Got it… so the circular buffer for one key needs to hold `n` entries at any time (the value of `n` can be dynamic). Let's say you're implementing the buffer for key `k1`.
NOTE: I assume the app is distributed, but if exactly one instance of the app is running, things become simpler. Both solutions are shown below.
Distributed app technique
- The table `circular_buffer_info` will store the following for the key `k1`: `buffer_length: n` and the `max_entry_count: ...`. The value of `max_entry_count` is incremented each time an entry is inserted.
- The table `circular_buffer` will store the actual values for the key. The keys for this table will be `(k1, <entry_count>)`. For every insert into the buffer, we increment `entry_count` for this key and update it in the `circular_buffer_info` table (`max_entry_count` column).
- Each `select` query does a `LIMIT n` along with it, so it implicitly returns only the entries in this buffer.
- Periodically, issue a delete such as `delete from circular_buffer where entry_count <= max_entry_count - n` to flush the older values (keeping the newest `n` entries). Depending on the read volume, you can do this with every read as well.
- If you know the values will get refilled in a certain time period, you can simply set a TTL for auto-expiry.
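A minimal in-memory sketch of this two-table scheme, with plain Python dicts standing in for the database tables (the `CircularBufferStore` class and its method names are illustrative, not from any library):

```python
# Illustrative in-memory model of the two-table scheme described above.
# In a real deployment, `info` and `entries` would be database tables.

class CircularBufferStore:
    def __init__(self):
        # circular_buffer_info: key -> {buffer_length, max_entry_count}
        self.info = {}
        # circular_buffer: (key, entry_count) -> value
        self.entries = {}

    def set_length(self, key, n):
        meta = self.info.setdefault(
            key, {"buffer_length": 0, "max_entry_count": 0})
        meta["buffer_length"] = n

    def insert(self, key, value):
        meta = self.info.setdefault(
            key, {"buffer_length": 0, "max_entry_count": 0})
        meta["max_entry_count"] += 1  # monotonically increasing counter
        self.entries[(key, meta["max_entry_count"])] = value

    def read(self, key):
        # "select ... LIMIT n": only the newest buffer_length entries.
        meta = self.info.get(key)
        if meta is None:
            return []
        rows = sorted(((c, v) for (k, c), v in self.entries.items()
                       if k == key), reverse=True)
        return [v for _, v in rows[: meta["buffer_length"]]]

    def prune(self, key):
        # Delete everything older than the newest buffer_length entries.
        meta = self.info.get(key)
        if meta is None:
            return
        cutoff = meta["max_entry_count"] - meta["buffer_length"]
        stale = [k for k in self.entries if k[0] == key and k[1] <= cutoff]
        for k in stale:
            del self.entries[k]

store = CircularBufferStore()
store.set_length("k1", 3)
for v in range(5):
    store.insert("k1", v)
# read("k1") now returns only the newest three values
```

The same structure maps directly onto two tables: `prune` is the periodic delete, and `read` is the `LIMIT n` select.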
Single app instance technique
- The table `circular_buffer_lengths` will store the key-value pair `(k1, n)`.
- The table `circular_buffer` will store the actual values for the key. The keys for this table will be `k1-0`, `k1-1`, `k1-2`, … `k1-(n-1)`; the values for these keys are stored in the value column. In the app, you store (or read) the next index to write, and once it reaches `n`, you roll the next entry over to `k1-0`.
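The single-instance roll-over can be sketched like this (an in-memory illustration; the `slots` dict stands in for the `circular_buffer` table, and the write index lives in the app since only one instance runs):

```python
# Single-instance sketch: keys k1-0 .. k1-(n-1), with a write index
# that wraps back to 0 once it reaches n.

class SingleInstanceBuffer:
    def __init__(self, key, n):
        self.key = key
        self.n = n            # circular_buffer_lengths: (k1, n)
        self.next_index = 0   # kept in the app (single instance)
        self.slots = {}       # circular_buffer: "k1-<i>" -> value

    def insert(self, value):
        self.slots[f"{self.key}-{self.next_index}"] = value
        # Roll over: after writing slot n-1, the next write hits k1-0.
        self.next_index = (self.next_index + 1) % self.n

buf = SingleInstanceBuffer("k1", 3)
for v in ["a", "b", "c", "d"]:
    buf.insert(v)
# "d" has overwritten the oldest slot, k1-0
```

This only works safely with one writer, which is why the distributed variant avoids a shared write index entirely.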
My use case is distributed in nature (multiple app instances are running).
So if I understand this correctly, my `entry_count` is the index of the buffer? I'll have to use transactions to keep the two tables in sync?
Can you be a bit more clear on the fields in each table? Isn't `buffer_length` the same as `n`? How do we roll over or reset the `entry_count` in the `circular_buffer` table?
I have simplified things a bit here; we can deal with `max_entry_count` if we need it later on.
The `circular_buffer_info` table has the following columns:
- `key`, which represents the queue we are dealing with. This could just be the user id.
- `buffer_length`, which is the current length of the circular buffer.
The `circular_buffer` table contains the following fields:
- `key`, which represents the partitioning key (user id).
- `entry_id`, which is a monotonically increasing number for each newly added entry. In practice, this can simply be the `timestamp` of insertion of the entry (millisecond resolution) or a `timeuuid` (if you expect multiple inserts in the same millisecond).
- `value`, which is the value stored in the circular buffer.
Note that the primary key in the `circular_buffer` table is `((key), entry_id)`, where `key` is the partitioning column and `entry_id` is the clustering column with `DESC` sort order.
The transactional guarantees you need, and how often the circular buffer length changes, will shape the queries you perform. Typically, in such applications, you need consistency of data but may not need transactional guarantees while the buffer length is changing; this depends on your use case. Here is a simple version:
Inserts
Simply insert data into the circular_buffer
table. Each insert is an append into the table.
Reads
- First read the size of the buffer. This is stored in the `buffer_length` column of the `circular_buffer_info` table. Let us say this value is `N`.
- Select the last `N` entries from the `circular_buffer` table. This should give you all the entries in the circular buffer. You can select fewer than that to access fewer items.
Pruning
You would have to clean up entries in the `circular_buffer` table, since it grows forever. A simple way to accomplish this is to delete all entries that are more than `N` entries behind the newest one. Depending on the read or write volume, you can perform this along with reads/writes, or you could run the query periodically. If you know that the data is regenerated periodically (say every week or so) and old data is not useful, you can set a TTL to expire the older data.
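Putting the three operations together, here is a minimal in-memory sketch (dicts stand in for the tables, and an integer counter stands in for the timestamp/timeuuid `entry_id`; the `"user42"` key is just an example):

```python
import itertools

# In-memory stand-in for the two tables. entry_id is a monotonically
# increasing integer here; the real schema would use a timestamp or timeuuid.
_entry_id = itertools.count(1)
circular_buffer_info = {"user42": {"buffer_length": 3}}  # key -> length
circular_buffer = {}                                     # (key, entry_id) -> value

def insert(key, value):
    # Inserts: each insert is a plain append into the table.
    circular_buffer[(key, next(_entry_id))] = value

def read(key):
    # Reads: look up N, then take the newest N entries
    # (the ORDER BY entry_id DESC LIMIT N query).
    n = circular_buffer_info[key]["buffer_length"]
    rows = sorted((e for e in circular_buffer if e[0] == key),
                  key=lambda e: e[1], reverse=True)
    return [circular_buffer[e] for e in rows[:n]]

def prune(key):
    # Pruning: delete everything more than N entries behind the newest.
    n = circular_buffer_info[key]["buffer_length"]
    rows = sorted((e for e in circular_buffer if e[0] == key),
                  key=lambda e: e[1], reverse=True)
    for e in rows[n:]:
        del circular_buffer[e]

for v in range(5):
    insert("user42", v)
# read("user42") returns the newest three values; prune drops the rest
```

Note that `read` and `prune` stay correct even if `buffer_length` shrinks between calls, which is what makes the dynamically sized buffer work without transactions on the hot path.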
I get it now. Makes sense. Thanks.
Might be worth clarifying in your post above that the primary key in the `circular_buffer` table is `((key), entry_id)`, where `key` is the partitioning column and `entry_id` the clustering column with `DESC` order (so that queries like `... ORDER BY entry_id DESC LIMIT n` are efficient).
Great suggestion, done!