What is the most performant way to upsert?

Hi,
I have the following data model:

CREATE TABLE yuga_readouts
(
    country_id         INT    NOT NULL,
    province_id        INT    NOT NULL,
    sensor_id          BIGINT NOT NULL,
    readouts           JSONB  NOT NULL,
    date_of_capturing  DATE   NOT NULL,
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, date_of_capturing ASC)
);

I have a set of data that is a mix of roughly 80% new data and 20% updates, so I decided to use an upsert to handle it.

For now, my upsert looks as follows:

INSERT INTO yuga_readouts (country_id, province_id, sensor_id, date_of_capturing, readouts)
VALUES (4, 23, 1, '2025-01-15', '{"2_1": {"temperature": 25.5}}')
ON CONFLICT (country_id, province_id, sensor_id, date_of_capturing)
    DO UPDATE SET readouts = yuga_readouts.readouts || '{"2_1": {"temperature": 25.5}}';

But I’m curious about your experience: what could be a better way to store it? A stored procedure? Maybe some other idea?

Hi @yugi

Do you have a performance issue?

In your exact case, it’s probably COPY with REPLACE (see COPY statement [YSQL] | YugabyteDB Docs).

I wouldn’t say there is a performance issue; I’m just looking for performance improvements.
On my local machine, I’m able to store around 2.5k records per second, where each record contains around 15 objects with a temperature field in the JSON.

The problem with COPY is that it replaces the whole record. For example, with a partial update (updating part of the JSON), the upsert merges just that part of the JSON and doesn’t touch the rest, whereas REPLACE, naturally, replaces the whole record.
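
To be precise about that merge behavior (this is standard PostgreSQL jsonb concatenation, which YSQL follows): || only overwrites the matching top-level keys of the left operand and keeps the rest, for example:

-- the "2_1" readout is replaced, "2_2" is kept as-is
SELECT '{"2_1": {"temperature": 25.5}, "2_2": {"temperature": 19.0}}'::jsonb
    || '{"2_1": {"temperature": 26.0}}'::jsonb;
-- result: {"2_1": {"temperature": 26.0}, "2_2": {"temperature": 19.0}}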

Hi, two things may not scale if you need higher throughput:

  • “the upsert will just replace this part of json” still has to rewrite the whole JSON document, in YugabyteDB as in PostgreSQL, so it is less efficient than having one row per readout
  • if you receive multiple rows to update, it is better to do it all in one statement (using a WITH clause, for example; see the sketch below) than row by row, because row-by-row processing alternates reads and writes, and YugabyteDB must flush to the Raft consensus (involving network latency) between them
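
A minimal sketch of the batched form, assuming the jsonb table from the first post (the VALUES rows and the EXCLUDED-based merge are just illustrative, not a drop-in for your loader):

WITH new_readouts (country_id, province_id, sensor_id, date_of_capturing, readouts) AS (
    VALUES
        (4, 23, 1, DATE '2025-01-15', '{"2_1": {"temperature": 25.5}}'::jsonb),
        (4, 23, 2, DATE '2025-01-15', '{"2_1": {"temperature": 21.0}}'::jsonb),
        (4, 24, 7, DATE '2025-01-15', '{"3_4": {"temperature": 18.2}}'::jsonb)
)
INSERT INTO yuga_readouts (country_id, province_id, sensor_id, date_of_capturing, readouts)
SELECT country_id, province_id, sensor_id, date_of_capturing, readouts
FROM new_readouts
ON CONFLICT (country_id, province_id, sensor_id, date_of_capturing)
    DO UPDATE SET readouts = yuga_readouts.readouts || EXCLUDED.readouts;
-- one statement for the whole batch, so reads and writes are not interleaved
-- row by row and YugabyteDB does not flush to the Raft consensus between every row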

Thanks for these points.
Regarding one row per readout: this was my previous model, a denormalized table where one row is one readout.

CREATE TABLE yuga_readouts
(
    country_id         INT    NOT NULL,
    province_id        INT    NOT NULL,
    sensor_id          BIGINT NOT NULL,
    readout_group      INT    NOT NULL,
    readout_id         INT    NOT NULL,
    readout_value      NUMERIC(19, 2),
    date_of_capturing  DATE,
    CONSTRAINT pk PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, readout_group ASC, readout_id ASC, date_of_capturing ASC)
);

However, this DB has to store 500 million readouts daily.
Each sensor produces from 2 to 100 different readouts daily, so a big portion of the data (country_id, province_id, sensor_id, date_of_capturing) would be duplicated for each readout. And the user usually fetches all the readouts for a given sensor.
So I decided to experiment with a document-style model. The pros are: a smaller index, less duplication of data, less storage used. And I still shouldn’t hit the 32 KB limit.
What is your opinion of such an approach?

Yes. Makes sense. However:

  • “big portion of data would be duplicated”: I think we have good key prefix compression
  • “the user usually takes all the readouts for the given sensor”: they will still be stored together

We don’t store rows in heap tables like PostgreSQL; rows are stored in the primary key index, which is why they are kept together and can benefit from key prefix compression, as long as readout_id is at the end of the key. When new readouts are inserted into an existing readout group, they will initially be in another physical place (the LSM tree flushes to SST files), but at some point compaction will happen and put them back together, so all readouts for one sensor will end up stored together for reads.

It would be great to test both, but I think one row per readout can be OK, and better if data ingest must be improved. The document model may be better for queries, but only slightly.


Agree with @FranckPachot’s assessment: if you aren’t writing the whole JSONB in one shot and are instead doing a lot of read-modify-write to append into an existing JSONB, then exploding the data across more rows (that are still clustered near each other on disk) is the better way to go.

Also, one more observation. I think instead of this:

PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, readout_group ASC, readout_id ASC, date_of_capturing ASC)

you want:

PRIMARY KEY (country_id ASC, province_id ASC, sensor_id ASC, date_of_capturing ASC, readout_group ASC, readout_id ASC)

So that all readouts for a particular sensor_id are stored in date order first (and then, within each date, you have the info for the various readout_ids), because usually, for a given sensor, you are interested in all the readout_ids for a particular date or range of dates. On the other hand, if you are commonly interested in data for only one specific readout_id at a time across a date range, then the first ordering is fine.
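
For example, with the second key order, a typical read of one sensor over a date range becomes a single contiguous range scan of the primary key (the filter values are made up for illustration):

SELECT date_of_capturing, readout_group, readout_id, readout_value
FROM yuga_readouts
WHERE country_id = 4
  AND province_id = 23
  AND sensor_id = 1
  AND date_of_capturing BETWEEN '2025-01-01' AND '2025-01-31'
ORDER BY date_of_capturing, readout_group, readout_id;
-- with date_of_capturing before readout_group/readout_id in the key, this date
-- range is one ordered slice of the index; with the original order the same
-- rows are spread across every readout_group/readout_id prefix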
