Facing strange behaviour while checking ingestion rate in a YSQL table

Hi,
We are exploring the option of replacing Hive with YugabyteDB. We need 50K insertions per second. To begin with, I have 3 master servers (c6a.xlarge: 4 vCPU, 8 GB RAM) and initially 3 T-servers (c6a.4xlarge: 16 vCPU, 32 GB RAM).

Record size: ~5 KB
Data pumps: 2, running from 2 different servers
Driver: YugabyteDB smart JDBC driver (see the example URL after this list)
load-balance=true
RF=3
Tablet creation: default
Primary key: composite index
Execution mode: batch
Batch size: 500 (or less)
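
For reference, a sketch of the kind of connection URL this setup implies (host names are hypothetical; load-balance is the smart-driver property mentioned above):

    jdbc:yugabytedb://tserver1:5433,tserver2:5433,tserver3:5433/yugabyte?load-balance=true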
Results:
We are getting variable ingestion rates (avg 1,500 rows/sec; min 500; max 9,000), but observed two strange things:

  1. Though the load-balance option is set to true, only one of the three T-servers is doing most of the work, with CPU at 70% to 80%, while the others average around 20%.
  2. When executing with two pumps, the ingestion logs show that while YugabyteDB is serving one pump it is not serving the other. For example, within a one-minute window, if seconds 0 to 5 are logged by data pump 1, the other pump logs from roughly second 6 or 7 to second 13, and then data pump 1 logs again from, say, second 17.

Our target is to get a baseline number with these minimum 3 instances and then record the improvement from scaling the T-servers up and out.

Now my questions are:
Is this really achievable with YugabyteDB? (In production we will have around 20 bare-metal servers.)
Am I making a mistake in how the pump is written?
Which parameters can we use to tune an ingestion workload?
Why is only one T-server doing most of the work?
How can I check how the tablet leaders are distributed across the three servers?
Please shed some light on this.

Thanks in advance,
Dipanjan

Hi @dipanjang, thanks for the details.

How can I check how the tablet leaders are distributed across the three servers?
You can see this from the web console on the master server, port 7000.

http://yb-master:7000/tablet-servers shows all the tablet servers, and the interesting information is:

  • User Tablet-Peers / Leaders, to see the distribution of replicas and leaders
  • Read ops/sec and Write ops/sec, to see if reads and writes are well distributed
    Can you paste a screenshot here?

http://yb-master:7000/tables then click on the table to get the tablet information:

  • the columns, and which ones form the partition key
  • the tablets, how they are split, and which servers host the leaders and followers
    Again a screenshot may help

I think the most important information is the table schema (\d tablename in psql). You mention a composite primary key. By default, the table is hashed on the first column of the key and clustered on the other columns. So if you insert batches that all have the same value for the first column, that may explain the bad load balancing.
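
To illustrate, a minimal sketch with hypothetical table and column names, showing what the default sharding of a composite primary key looks like in YSQL:

    -- The first PK column is the hash (sharding) column by default;
    -- the remaining PK columns are range (clustering) columns.
    CREATE TABLE txn (
        id      varchar(35),
        type    varchar(20),
        tdate   date,
        payload text,
        PRIMARY KEY (id HASH, type ASC, tdate ASC)
    );
    -- If every row in a batch carries the same value in the hash column,
    -- all writes land on the tablet owning that hash bucket, so a single
    -- T-server ends up doing most of the work.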

Note that if you don’t require ACID properties during this ingestion (batches not being atomic), you can set yb_disable_transactional_writes. But that’s about speedup; let’s first understand the scale-up problem.
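
For reference, a sketch of setting that session parameter before the batch load (verify availability and semantics for your version in the docs):

    -- Per session, before running the batched inserts:
    SET yb_disable_transactional_writes = true;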

If you are available for a quick Zoom call during GMT+2 office hours where we can look at it, send me two or three available time slots in the next days: fpachot@yugabyte.com

Hi Franck,
Thank you so much for your quick reply. I later found that I was making one big mistake: the table structure was the same as the one used in Hive, so the first column (the partitioning field) was not the ID column but the date column, which was also part of the primary key. After revisiting the table structure we understood the issue and made the id column the first column of the key. As you said, my entire payload had a single value for one day, so we were facing the skewed-distribution issue. We have got rid of this.

We tested ingestion both with a single table holding different days and with range-partitioned tables. The ingestion rate on the single table matches the partitioned setup only if I write directly to the child table; when the insert goes through the parent partitioned table, the rate is a bit slower.
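
To make the comparison concrete, a sketch of the two insert paths, reusing the hypothetical txn table from the sketch above and assuming it is range-partitioned by tdate with daily children:

    -- Through the parent: tuple routing picks the child partition per row.
    INSERT INTO txn (id, type, tdate) VALUES ('a-001', 'P2P', DATE '2023-06-01');
    -- Directly into a named child partition, skipping the routing step:
    INSERT INTO txn_2023_06_01 (id, type, tdate) VALUES ('a-002', 'P2P', DATE '2023-06-01');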

I think this is happening because the current YugabyteDB (2.18 STS) is based on PostgreSQL 11. Is there any way in YugabyteDB to speed up inserts on a partitioned table? This PG version doesn’t support re-planning. What is the roadmap for 2.21 (with PG 15)?

For us, eventual consistency will be fine as long as there is no chance of data loss in normal operation (not considering catastrophic failures). We will test with the yb_disable_transactional_writes parameter and let you know.

We got a moderate figure with 3 T-servers and are now scaling up to 5 T-servers. We will need more help from you when we check the query workload, as the current application makes massive use of window functions.
I am not sending any time slots right now, but it is obvious that we will need this in the near future.

Thanks again.
Dipanjan


Hi @dipanjang, I would not expect different performance when inserting into the table vs. the partitioned table. But parsing may take more time, so it is better to use prepared statements. What do you mean by re-planning? The roadmap for PG15 does not have a date yet; I would like to be sure which new feature you are talking about.
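
For illustration, a sketch of preparing the insert once per session so each execution skips parsing (statement and table names are hypothetical, reusing the txn sketch above):

    PREPARE ins (varchar, varchar, date) AS
        INSERT INTO txn (id, type, tdate) VALUES ($1, $2, $3);
    EXECUTE ins ('a-001', 'P2P', DATE '2023-06-01');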

Hi Franck,
We have seen that we sometimes get better execution times if we set the plan_cache_mode parameter to force_custom_plan, since in version 11 partition pruning is done later, in the execution phase. We have also seen that queries get slower once the number of child partitions grows beyond 20 or so.
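
For reference, a sketch of the setting in question (a standard PostgreSQL 11 parameter, usable per session):

    -- Plan each execution with the actual parameter values,
    -- so partition pruning can kick in:
    SET plan_cache_mode = force_custom_plan;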

I am facing some issues with queries that I will raise in a different topic.

Thanks & Regards,
Dipanjan

  1. With Hive, do you mean Apache Hive https://hive.apache.org/ ?

  2. Can you please show the exact schema \d+ table_name?

  3. Why & how are you using partitioning? (curious)

  4. How are writes coming? Also, how many tablets does each partition child table have?

  5. How are you batching inserts?

Can you create an issue about this on our GitHub: Issues · yugabyte/yugabyte-db · GitHub?

  1. With Hive, do you mean Apache Hive https://hive.apache.org/ ?
    Apache Hive on HDFS 3.0.1

  2. Can you please show the exact schema \d+ table_name?
    Apologies, I am unable to provide that.
    To give an idea: the tables are flat and denormalized, with 110 fields on average.
    The average size of a row is close to 5 KB.
    In YugabyteDB we are using a composite primary key of three fields: (id varchar(35), type varchar(20), tdate date).
    There are other secondary indexes as well.

  3. Why & how are you using partitioning? (curious)
    The data lake should store half a year to one year of data, where a single day’s raw data is as large as 6 to 8 TB.
    The initial thought was that each partition would accommodate more tablets, and purging would be easy.

  4. How are writes coming? Also, how many tablets does each partition child table have?
    Though we are currently using a piece of code for pumping data, in the real scenario data will be consumed from one or more middleware sources and inserted into the database in batches. The batch size will be set according to benchmark results, considering all the related tunables in the middleware, the DB, etc.
    We haven’t controlled tablet creation so far.

  5. How are you batching inserts?
    Maximum pool size is 100.
    Maximum batch size used is 10,000 records per batch.

Thanks,
Dipanjan

Please provide the full primary key and all indexes & unique constraints for that table (by full I mean how they are partitioned and what the sort order is, if they are sorted).

So do \d+ table_name and only paste the indexes + primary key (along with the data-types of the indexed columns).

Hi Dorian,
Please find the structure:

        Column         |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
-----------------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 txnid                 | character varying(50)  |           | not null |         | extended |              |
 txntype               | character varying(30)  |           | not null |         | extended |              |
 txndate               | date                   |           | not null |         | plain    |              |
 mid                   | character varying(30)  |           |          |         | extended |              |
 rmtrorgid             | character varying(12)  |           |          |         | extended |              |
 rmtrmobmidno          | character varying(100) |           |          |         | extended |              |
 rrn                   | character varying(20)  |           |          |         | extended |              |
 remitterinstrumentid  | character varying(150) |           |          |         | extended |              |
 …
 … (107 columns)
Partition key: RANGE (txndate)
Indexes:
    "dw_for_custpmt_pkey" PRIMARY KEY, lsm (txnid HASH, txntype ASC, txndate ASC)
    "dw_for_custpmt_exp_mid_idx" lsm (upper(mid) HASH)
    "dw_for_custpmt_rmtorgid_idx" lsm (rmtrorgid HASH)
    "dw_for_custpmt_rmtrmidmob_idx" lsm (rmtrmobmidno HASH)
    "dw_for_custpmt_rrn_idx" lsm (rrn HASH)
Partitions: dw_for_custpmt_2023_06_01 FOR VALUES FROM ('2023-06-01') TO ('2023-06-02'),
            dw_for_custpmt_2023_06_02 FOR VALUES FROM ('2023-06-02') TO ('2023-06-03'),
            dw_for_custpmt_2023_06_03 FOR VALUES FROM ('2023-06-03') TO ('2023-06-04'),
            dw_for_custpmt_2023_06_04 FOR VALUES FROM ('2023-06-04') TO ('2023-06-05'),
            dw_for_custpmt_2023_06_05 FOR VALUES FROM ('2023-06-05') TO ('2023-06-06'),
            dw_for_custpmt_2023_06_06 FOR VALUES FROM ('2023-06-06') TO ('2023-06-07'),
            dw_for_custpmt_2023_06_07 FOR VALUES FROM ('2023-06-07') TO ('2023-06-08'),
            dw_for_custpmt_2023_06_08 FOR VALUES FROM ('2023-06-08') TO ('2023-06-09'),
            dw_for_custpmt_2023_06_09 FOR VALUES FROM ('2023-06-09') TO ('2023-06-10'),
            dw_for_custpmt_2023_06_10 FOR VALUES FROM ('2023-06-10') TO ('2023-06-11'),
            dw_for_custpmt_2023_06_11 FOR VALUES FROM ('2023-06-11') TO ('2023-06-12'),
            dw_for_custpmt_2023_06_12 FOR VALUES FROM ('2023-06-12') TO ('2023-06-13'),
            dw_for_custpmt_2023_06_13 FOR VALUES FROM ('2023-06-13') TO ('2023-06-14'),
            dw_for_custpmt_2023_06_14 FOR VALUES FROM ('2023-06-14') TO ('2023-06-15'),
            dw_for_custpmt_2023_06_15 FOR VALUES FROM ('2023-06-15') TO ('2023-06-16'),
            dw_for_custpmt_2023_06_16 FOR VALUES FROM ('2023-06-16') TO ('2023-06-17'),
            dw_for_custpmt_2023_06_17 FOR VALUES FROM ('2023-06-17') TO ('2023-06-18'),
            dw_for_custpmt_2023_06_18 FOR VALUES FROM ('2023-06-18') TO ('2023-06-19'),
            dw_for_custpmt_2023_06_19 FOR VALUES FROM ('2023-06-19') TO ('2023-06-20'),
            dw_for_custpmt_2023_06_20 FOR VALUES FROM ('2023-06-20') TO ('2023-06-21'),
            dw_for_custpmt_2023_06_21 FOR VALUES FROM ('2023-06-21') TO ('2023-06-22'),
            dw_for_custpmt_2023_06_22 FOR VALUES FROM ('2023-06-22') TO ('2023-06-23'),
            dw_for_custpmt_2023_06_23 FOR VALUES FROM ('2023-06-23') TO ('2023-06-24'),
            dw_for_custpmt_2023_06_24 FOR VALUES FROM ('2023-06-24') TO ('2023-06-25'),
            dw_for_custpmt_2023_06_25 FOR VALUES FROM ('2023-06-25') TO ('2023-06-26'),
            dw_for_custpmt_2023_06_26 FOR VALUES FROM ('2023-06-26') TO ('2023-06-27'),
            dw_for_custpmt_2023_06_27 FOR VALUES FROM ('2023-06-27') TO ('2023-06-28'),
            dw_for_custpmt_2023_06_28 FOR VALUES FROM ('2023-06-28') TO ('2023-06-29'),
            dw_for_custpmt_2023_06_29 FOR VALUES FROM ('2023-06-29') TO ('2023-06-30'),
            dw_for_custpmt_2023_06_30 FOR VALUES FROM ('2023-06-30') TO ('2023-07-01')

Daily partitions will introduce a lot of tablet overhead (and probably also system-catalog overhead).

You’ll want at minimum 1 tablet-leader on each server for each table/index.

For 1 year with RF=3 that would be 5 (the table plus 4 indexes) * 365 * 3 = ~5,475 tablet peers in the cluster.

I would use monthly partitioning here unless you have more info that would suggest otherwise.
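
For illustration, a sketch of what a monthly child partition would look like (the partition name is hypothetical; the parent table is from the schema above):

    CREATE TABLE dw_for_custpmt_2023_06
        PARTITION OF dw_for_custpmt
        FOR VALUES FROM ('2023-06-01') TO ('2023-07-01');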

After you load data for at least 1 day, using at least 3 tablets for each table and index, can you show how much data each tablet holds at the end of the day? (for the table and the indexes)

I need the answer to the previous question to know the data distribution of these indexes.

You can try a lower batch size while increasing parallelism.
In another place you said 500, so I’d use that.

Do you need all 3 columns for the primary key? Is the txnid unique?

Are you using packed rows? (see Persistence in YugabyteDB | YugabyteDB Docs)
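
If not, a sketch of how packed rows are typically enabled (a yb-tserver gflag; the name and default vary by version, so verify against the linked docs):

    --ysql_enable_packed_row=true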

Can you disable partitioning for now? Just to remove as many things as possible from the test.

With the changes above, is the CPU usage still like this? If yes, how many open YSQL connections are on each yb-tserver?

Yes, that is why I am a bit confused. Use-case wise, no updates will be there. But we will have 7 such tables, of which one holds 70% of the data, 3 make up 20%, and the rest hold 10%. The concern is the biggest table.
But yes, with month-wise partitioning the number of tablets can be controlled; the only concern is that we have to allocate more space for an extra 30 days.
As I understand from your statements, you are suggesting specifying the number of tablets manually. Then, when the number of shards is low, will queries filtered only by date take longer to execute? We will obviously check and share.

Setting ysql_num_shards_per_tserver to 3? This environment of ours is based on a single region, a single zone, and 5 T-servers. Just to understand: if we use 3 zones and 6 T-servers with 2 T-servers per zone, will there be 3 tablet leaders per node, or a total of 6 tablet leaders distributed over the 2 T-servers in each zone? In my understanding, with RF=3 a full copy of one day’s data will reside in each zone. Please let me know whether this understanding is correct.
We will check this and share the size.

The final data-loading and ingestion-rate test was performed with 500 records per batch only.

Yes, we do require all 3 columns. txnid is unique per transaction, but for a few percent (5 to 10%) of records there will be multiple rows per id, depending on the other two columns. This could be handled by allowing updates, but that is not an option, as the application would need to be modified. There are always some common columns (the lion’s share) that are actually duplicated across those rows; we could split them into another table and let a few columns be duplicated in one of the two tables. But then, for such huge data, the cost of joining the two tables would also be high. It is true that many things that were not possible in Hive can be done here in YugabyteDB. But in that case too, the main table of concern will have a composite key, and, as in PostgreSQL, I think the partition key of a partitioned table must also be part of the primary key.

We have just configured the packed-row format. We tested ingestion with this and got an awesome result: nearly double the initial findings. We will test the read workload and record the results.

Yes, we will test this. During ingestion we observed that the table without partitions only differs (gets a better result) when the insertion is done through the partitioned (parent) table; inserting into named child partitions gives the same results as the table without partitions. We will test fetches and record the result.

Actually, those were findings from a very initial stage, when we were doing things grossly wrong in some areas. We have since reviewed and corrected this.

Thanks
Dipanjan

Then use this table for load testing.

Why more space? And how much more? Do you intend to drop daily?

To get maximum write performance, you probably need more than 1 tablet leader on each server. YugabyteDB adds more tablets than needed to allow for future growth, but you already handle growth by partitioning. So it would probably be better to manually specify something like 2 or 4 in the yb-tserver configuration reference | YugabyteDB Docs.
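
As an alternative to the tserver-wide flag, a sketch of pre-splitting at creation time (hypothetical table; the SPLIT INTO clause is YSQL syntax for hash-sharded tables):

    CREATE TABLE t (
        id varchar(35),
        PRIMARY KEY (id HASH)
    ) SPLIT INTO 4 TABLETS;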

I meant 3 tablets total, so ysql_num_shards_per_tserver=1.

It’s ok.

It should be (6 tservers * 3 ysql_num_shards_per_tserver = 18 tablet-leaders). Then they get distributed.

Of the table & indexes. Looking to see the distribution of data. Make sure data has been flushed to SSTables after loading. (use [docs] Add `yb-admin flush_table` command by ddorian · Pull Request #17211 · yugabyte/yugabyte-db · GitHub)
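
For reference, a sketch of triggering that flush from the command line (the master address and database name are hypothetical; check the linked PR/docs for the exact syntax in your version):

    yb-admin -master_addresses yb-master:7100 flush_table ysql.yugabyte dw_for_custpmt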