I wouldn’t say it’s a performance issue; I’m just looking for performance improvements.
On the local machine, I’m able to store around 2.5k records per second, where each record contains around 15 objects with a temperature field in JSON.
The problem with COPY is that it replaces the whole record. So, for example, with a partial update (updating part of the JSON), an upsert will replace just that part of the JSON and not touch the rest, whereas COPY will naturally replace the whole record.
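To illustrate the difference (just a sketch, with hypothetical table and column names, assuming a JSONB column):

-- Partial update: only the "temperature" key inside the JSON document changes,
-- the rest of the document is preserved.
UPDATE sensor_docs
SET    data = jsonb_set(data, '{temperature}', '21.5'::jsonb)
WHERE  sensor_id = 101;

-- Replacing the column (what COPY effectively does) overwrites the whole document.
UPDATE sensor_docs
SET    data = '{"temperature": 21.5}'::jsonb
WHERE  sensor_id = 101;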
Hi, two things may not scale well if you need higher throughput:
“the upsert will just replace this part of json”: this still has to write the whole JSON again, in YugabyteDB as in PostgreSQL, so it is less efficient than having one row per readout
if you receive multiple rows to update, it is better to do them all in one statement (using a WITH clause, for example, as in the sketch below) than row by row. Row-by-row alternates reads and writes, and YugabyteDB must flush to the Raft consensus (involving network latency) between them.
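A minimal sketch of such a batched update (hypothetical table and column names):

-- One statement applies all new values; the WITH clause carries the whole batch.
WITH new_values (sensor_id, readout_id, readout_value) AS (
    VALUES (101, 1, 21.50),
           (101, 2, 19.75),
           (102, 1, 23.10)
)
UPDATE readouts r
SET    readout_value = v.readout_value
FROM   new_values v
WHERE  r.sensor_id  = v.sensor_id
AND    r.readout_id = v.readout_id;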
Thanks for these points.
In terms of having one row per readout: this was my previous model, a denormalized table where one row is one readout.
CREATE TABLE yuga_readouts
(
    country_id        INT    NOT NULL,
    province_id       INT    NOT NULL,
    sensor_id         BIGINT NOT NULL,
    readout_group     INT    NOT NULL,
    readout_id        INT    NOT NULL,
    readout_value     NUMERIC(19, 2),
    date_of_capturing DATE,
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC,
                               readout_group ASC, readout_id ASC, date_of_capturing ASC)
);
However, this DB has to store around 500 million readouts daily.
Each sensor produces from 2 to 100 different readouts daily. In this case, a big portion of the data would be duplicated for each readout (country_id, province_id, sensor_id, date_of_capturing). And the user usually takes all the readouts for a given sensor.
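The typical read looks something like this (parameter values are just for illustration):

-- Fetch all readouts for a given sensor.
SELECT readout_group, readout_id, readout_value, date_of_capturing
FROM   yuga_readouts
WHERE  country_id  = 1
AND    province_id = 10
AND    sensor_id   = 101;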
So I decided to experiment with a document-style model. The pros are: a smaller index, less duplication of data, and less storage used. And I still shouldn’t hit the 32 kB limit.
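A sketch of what I mean by the document-style model (table name and JSON layout are only illustrative):

-- One row per sensor per day; all readouts packed into a single JSONB document.
CREATE TABLE yuga_readout_docs
(
    country_id        INT    NOT NULL,
    province_id       INT    NOT NULL,
    sensor_id         BIGINT NOT NULL,
    date_of_capturing DATE   NOT NULL,
    readouts          JSONB, -- e.g. {"1": 21.50, "2": 19.75, ...} keyed by readout_id
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, date_of_capturing ASC)
);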
What is your opinion about such an approach?
“big portion of data would be duplicated”: I think we have good key prefix compression
“the user usually takes all the readouts for the given sensor”: they will still be stored together
We don’t store rows in heap tables like PostgreSQL does, but in the primary key index; that’s why they are stored together and can benefit from key prefix compression, as long as readout_id is at the end of the key. When new readouts are inserted into an existing readout group, they will physically be in another place at first (the LSM tree flushes to SST files), but at some point compaction will happen and put them together, so all readouts for one sensor will be read together.
It would be great to test both, but I think one row per readout can be OK, and better if data ingest must be improved. The document model may be better for queries, but only slightly.
Agree with @FranckPachot’s assessment: if you aren’t writing the whole JSONB in one shot and are instead doing a lot of read-modify-write to append into an existing JSONB, the approach of exploding the data across more rows (that are still clustered near each other on disk) is the better way to go.
Also, one more observation: instead of keeping readout_group and readout_id before date_of_capturing in the primary key, I think I would put date_of_capturing right after sensor_id.
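A sketch of the reordered key (same columns as yuga_readouts above; the table name here is only for illustration):

-- date_of_capturing moves up in the key, so rows for one sensor are clustered
-- by date first, with all readout_group / readout_id values for that date together.
CREATE TABLE yuga_readouts_by_date
(
    country_id        INT    NOT NULL,
    province_id       INT    NOT NULL,
    sensor_id         BIGINT NOT NULL,
    readout_group     INT    NOT NULL,
    readout_id        INT    NOT NULL,
    readout_value     NUMERIC(19, 2),
    date_of_capturing DATE   NOT NULL,
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC,
                               date_of_capturing ASC, readout_group ASC, readout_id ASC)
);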
So that all readouts for a particular sensor_id are stored sorted by date first (and then, within each date, you have the info for the various readout_ids), because usually, for a given sensor, you are interested in all the readout_ids for a particular date or range of dates. On the other hand, if you are commonly interested in data for only one specific readout_id at a time across a date range, then the first way is fine.