What is the most performant way to upsert?

Hi,
I have the following data model:

CREATE TABLE yuga_readouts
(
    country_id         INT    NOT NULL,
    province_id        INT    NOT NULL,
    sensor_id          BIGINT NOT NULL,
    readouts           JSONB  NOT NULL,
    date_of_capturing  DATE   NOT NULL,
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, date_of_capturing ASC)
);

I have a set of data that is a mix of roughly 80% new data and 20% updates, so I decided to use an upsert to handle it.

For now, my upsert looks as follows:

INSERT INTO yuga_readouts (country_id, province_id, sensor_id, date_of_capturing, readouts)
VALUES (4, 23, 1, '2025-01-15', '{"2_1": {"temperature": 25.5}}')
ON CONFLICT (country_id, province_id, sensor_id, date_of_capturing)
    DO UPDATE SET readouts = yuga_readouts.readouts || '{"2_1": {"temperature": 25.5}}';

But I’m curious about your experience: what could be a better way to store it? A stored procedure? Maybe some other idea?

Hi @yugi

Do you have a performance issue?

In your exact case, it’s probably COPY with REPLACE (see COPY statement [YSQL] | YugabyteDB Docs).

I wouldn’t say there is a performance issue; I’m just looking for performance improvements.
On my local machine, I’m able to store around 2.5k records per second, where each record contains around 15 objects with a temperature field in the JSON.

The problem with COPY is that it replaces the whole record. For example, with a partial update (updating part of the JSON), the upsert merges just that part of the JSON and doesn’t touch the rest, whereas REPLACE, naturally, replaces the whole record.
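
To be precise about that merge behavior (this is standard PostgreSQL jsonb concatenation, which YSQL follows): || only overwrites the matching top-level keys of the left operand and keeps the rest, for example:

-- the "2_1" readout is replaced, "2_2" is kept as-is
SELECT '{"2_1": {"temperature": 25.5}, "2_2": {"temperature": 19.0}}'::jsonb
    || '{"2_1": {"temperature": 26.0}}'::jsonb;
-- result: {"2_1": {"temperature": 26.0}, "2_2": {"temperature": 19.0}}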

Hi, two things may not scale if you need higher throughput:

  • “the upsert will just replace this part of json” still has to rewrite the whole JSON document, in YugabyteDB as in PostgreSQL, so it is less efficient than having one row per readout
  • if you receive multiple rows to update, it is better to do it all in one statement (using a WITH clause, for example; see the sketch below) than row by row, because row-by-row processing alternates reads and writes, and YugabyteDB must flush to the Raft consensus (involving network latency) between them
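
A minimal sketch of the batched form, assuming the jsonb table from the first post (the VALUES rows and the EXCLUDED-based merge are just illustrative, not a drop-in for your loader):

WITH new_readouts (country_id, province_id, sensor_id, date_of_capturing, readouts) AS (
    VALUES
        (4, 23, 1, DATE '2025-01-15', '{"2_1": {"temperature": 25.5}}'::jsonb),
        (4, 23, 2, DATE '2025-01-15', '{"2_1": {"temperature": 21.0}}'::jsonb),
        (4, 24, 7, DATE '2025-01-15', '{"3_4": {"temperature": 18.2}}'::jsonb)
)
INSERT INTO yuga_readouts (country_id, province_id, sensor_id, date_of_capturing, readouts)
SELECT country_id, province_id, sensor_id, date_of_capturing, readouts
FROM new_readouts
ON CONFLICT (country_id, province_id, sensor_id, date_of_capturing)
    DO UPDATE SET readouts = yuga_readouts.readouts || EXCLUDED.readouts;
-- one statement for the whole batch, so reads and writes are not interleaved
-- row by row and YugabyteDB does not flush to the Raft consensus between every row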

Thanks for these points.
Regarding one row per readout: this was my previous model, a denormalized table where one row is one readout.

CREATE TABLE yuga_readouts
(
    country_id         INT    NOT NULL,
    province_id        INT    NOT NULL,
    sensor_id          BIGINT NOT NULL,
    readout_group      INT    NOT NULL,
    readout_id         INT    NOT NULL,
    readout_value      NUMERIC(19, 2),
    date_of_capturing  DATE,
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, readout_group ASC, readout_id ASC, date_of_capturing ASC)
);

However, this DB has to store 500 million readouts daily.
Each sensor produces from 2 to 100 different readouts daily, so a big portion of the data (country_id, province_id, sensor_id, date_of_capturing) would be duplicated for each readout. And the user usually fetches all the readouts for a given sensor.
So I decided to experiment with a document-style model. The pros are: a smaller index, less duplication of data, less storage used. And I still shouldn’t hit the 32 KB limit.
What is your opinion of such an approach?

Yes. Makes sense. However:

  • “big portion of data would be duplicated”: I think we have good key prefix compression
  • “the user usually takes all the readouts for the given sensor”: they will still be stored together

We don’t store rows in heap tables like PostgreSQL; rows are stored in the primary key index, which is why they are kept together and can benefit from key prefix compression, as long as readout_id is at the end of the key. When new readouts are inserted into an existing readout group, they will initially be in another physical place (the LSM tree flushes to SST files), but at some point compaction will happen and put them back together, so all readouts for one sensor will end up stored together for reads.

It would be great to test both, but I think one row per readout can be OK, and better if data ingest must be improved. The document model may be better for queries, but only slightly.


Agree with @FranckPachot’s assessment: if you aren’t writing the whole JSONB in one shot and are instead doing a lot of read-modify-write to append into an existing JSONB, then exploding the data across more rows (that are still clustered near each other on disk) is the better way to go.

Also, one more observation. I think instead of this:

PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, readout_group ASC, readout_id ASC, date_of_capturing ASC)

you want:

PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, date_of_capturing ASC, readout_group ASC, readout_id ASC)

So that all readouts for a particular sensor_id are stored in date order first (and then, within each date, you have the info for the various readout_ids), because usually, for a given sensor, you are interested in all the readout_ids for a particular date or range of dates. On the other hand, if you are commonly interested in data for only one specific readout_id at a time across a date range, then the first ordering is fine.
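
For example, with the second key order, a typical read of one sensor over a date range becomes a single contiguous range scan of the primary key (the filter values are made up for illustration):

SELECT date_of_capturing, readout_group, readout_id, readout_value
FROM yuga_readouts
WHERE country_id = 4
  AND province_id = 23
  AND sensor_id = 1
  AND date_of_capturing BETWEEN '2025-01-01' AND '2025-01-31'
ORDER BY date_of_capturing, readout_group, readout_id;
-- with date_of_capturing before readout_group/readout_id in the key, this date
-- range is one ordered slice of the index; with the original order the same
-- rows are spread across every readout_group/readout_id prefix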
