We are exploring the option of replacing Hive with YugabyteDB. We need 50K insertions per second. To begin with, I have 3 master servers (c6a.xlarge: 4 vCPU, 8 GB RAM) and initially 3 T-Servers (c6a.4xlarge: 16 vCPU, 32 GB RAM).
Record size is ~5 KB.
Data pumps: 2, from 2 different servers.
Driver: YugabyteDB smart JDBC driver.
Tablet creation: default.
Primary key index: composite index.
Execution mode: batch.
Batch size: 500 (or less).
We are getting variable ingestion rates (avg 1500, min 500, max 9000), but we observed two strange things:
Though the load-balance option is set to true, only one of the three T-Servers takes most of the work, with CPU going to 70% to 80%, while the others average around 20%.
When executing with two pumps, the ingestion reports show that while YugabyteDB is serving one pump it is not serving the other. Within a one-minute window, if seconds 0 to 5 are logged by data pump 1, the other pump logs from roughly second 6 or 7 to second 13, and then data pump 1 logs again from about second 17.
Our target is to get a baseline number with these 3 instances and then record the improvement from scaling the T-Servers up and out.
Now my questions are:
Is this really achievable with YugabyteDB? (In production we will have about 20 bare-metal servers.)
Am I making a mistake in how I wrote the pump?
Which parameters can we use to tune the ingestion workload?
Why is only one T-Server doing most of the work?
How can I check how the tablet leaders are distributed across the three servers?
Please shed some light.
Show the tablets, how they are split, and on which servers the leaders and followers are.
Again, a screenshot may help.
I think the most important information is the table schema (\d tablename in psql). You mention a composite primary key. By default, it is hashed on its first column and range-sorted on the other columns. So if you insert batches that all have the same value for the first column, that would explain the bad load balancing.
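To illustrate the point above, here is a sketch of a table with a composite primary key whose first column is a date (table and column names are hypothetical, not the actual schema):

```sql
-- Hypothetical table: composite primary key, hashed on the first column,
-- range-sorted on the rest (the YugabyteDB default for a composite key).
CREATE TABLE txn (
    txndate date        NOT NULL,
    txnid   varchar(35) NOT NULL,
    txntype varchar(20) NOT NULL,
    payload text,
    PRIMARY KEY (txndate HASH, txnid ASC, txntype ASC)
);
-- If every row in a batch carries the same txndate, all rows hash to the
-- same tablet, so one tablet leader (one tserver) does all the work.
-- Hashing on the unique id instead spreads writes across tablets:
--   PRIMARY KEY (txnid HASH, txntype ASC, txndate ASC)
```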
Note that if you don't require ACID properties during this ingestion (batches not being atomic), you can set yb_disable_transactional_writes. But that's about speedup; let's first understand the scale-up problem.
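A minimal sketch of that session setting (the trade-off being that a failed batch may be partially applied):

```sql
-- Skip distributed-transaction overhead for this session's writes;
-- batches are no longer atomic, so only use it for restartable loads.
SET yb_disable_transactional_writes = on;
-- ... run the bulk INSERT batches ...
SET yb_disable_transactional_writes = off;
```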
If you are available for a quick Zoom call during GMT+2 office hours where we can look at it, send me two or three available timeslots in the next few days: firstname.lastname@example.org
Thank you so much for your quick reply. I later found that I was making one big mistake: the table structure was the same one used for Hive. So the first column (the partitioned field) was not the ID column but the date column, which was also part of the primary key. After revisiting the table structure we understood the issue and made the id column the first column. As you said, my entire payload had a single value for one day, so we were facing the skewed-distribution issue. We got rid of this.
We tested ingestion with a single table and with range-partitioned tables for different days. The ingestion rate on the single table matches the partitioned table only if I write directly to the child table; when inserting through the parent partitioned table, the rate is a bit slower.
I think this happens because the current YugabyteDB (2.18 STS) is based on PostgreSQL 11. Is there any way in YugabyteDB to speed up insertion into a partitioned table? This PG version doesn't support re-planning. What is the roadmap for 2.21 (with PG 15)?
For us, eventual consistency will be fine if there is no chance of data loss in normal operation (not considering catastrophic failure). We will test the yb_disable_transactional_writes parameter and let you know.
We got a moderate figure with 3 T-Servers and by scaling up to 5 T-Servers. We will need more help from you when we check the query workload, as the current application makes massive use of window functions.
I am not proposing any time slots right now, but it is obvious that we will need this in the near future.
Hi @dipanjang, I would not expect different performance when inserting into the table vs. the partitioned table. But parsing may take more time, so better to use prepared statements. What do you mean by re-planning? The roadmap for PG 15 does not have a date yet; I would like to be sure which new feature you are talking about.
We have seen that we sometimes get better execution times if we set the plan_cache_mode parameter to force_custom_plan, since in version 11 pruning is otherwise done later, in the execution phase. We have also seen that if the number of child partitions grows beyond 20 or so, queries get slower.
I am facing some issues with queries that I will raise in a different topic.
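The setting mentioned above is a per-session parameter; forcing a custom plan makes each execution re-plan with the actual parameter values, so partition pruning can happen at plan time (query shown is illustrative):

```sql
-- Re-plan with actual parameter values so partition pruning happens
-- during planning rather than late in the execution phase (PG 11).
SET plan_cache_mode = force_custom_plan;
PREPARE q(date) AS
    SELECT count(*) FROM dw_for_custpmt WHERE txndate = $1;
EXECUTE q('2023-06-15');  -- planned fresh, prunes to one child partition
```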
Can you please show the exact schema \d+ table_name?
Apologies for my inability to provide that.
To give an idea, I can say that the tables are flat and denormalized and have 110 fields on average.
The average size of a row is close to 5 KB.
In YugabyteDB we are using a composite primary key of three fields: id varchar(35), type varchar(20), and tdate date.
There are other secondary indexes as well.
Why and how are you using partitioning? (curious)
The data lake should store half a year to one year of data, where the raw data alone is as large as 6 to 8 TB.
The initial thought was that each partition would accommodate more tablets, and purging would be easy.
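On the purging point: dropping a whole child partition is indeed far cheaper than row-level deletes. A sketch, assuming daily partitions named as in the schema dump later in the thread:

```sql
-- Dropping a day's partition removes its tablets wholesale; there are no
-- tombstones or compaction debt from millions of individual row deletes.
DROP TABLE dw_for_custpmt_2023_06_01;

-- Or detach first to keep the data around for archiving:
-- ALTER TABLE dw_for_custpmt DETACH PARTITION dw_for_custpmt_2023_06_01;
```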
How are the writes coming in? Also, how many tablets does each child partition have?
Though we are currently using a piece of code to pump data, in the real scenario data will be consumed from one or more middleware sources and inserted into the database in batches. The batch size will be set according to the benchmark results, considering all related tunables in the middleware, the DB, etc.
We have not controlled tablet creation so far.
How are you batching inserts?
Maximum pool size is 100.
Maximum batch size used is 10,000 records per batch.
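For reference, the most efficient shape for batched inserts is a multi-row INSERT: one round trip carries many rows. With the PostgreSQL-compatible JDBC driver, setting the connection property reWriteBatchedInserts=true rewrites executeBatch() into statements of this form (columns abbreviated, values illustrative):

```sql
-- A single statement inserting many rows; with a hash-sharded key on
-- txnid, these rows fan out across tablet leaders in parallel.
INSERT INTO dw_for_custpmt (txnid, txntype, txndate)
VALUES ('t1', 'P2P', '2023-06-01'),
       ('t2', 'P2M', '2023-06-01'),
       ('t3', 'P2P', '2023-06-02');
```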
txnid | character varying(50) | | not null | | extended | |
txntype | character varying(30) | | not null | | extended | |
txndate | date | | not null | | plain | |
mid | character varying(30) | | | | extended | |
rmtrorgid | character varying(12) | | | | extended | |
rmtrmobmidno | character varying(100) | | | | extended | |
rrn | character varying(20) | | | | extended | |
remitterinstrumentid | character varying(150) | | | | extended | |
… (107 columns)
Partition key: RANGE (txndate)
“dw_for_custpmt_pkey” PRIMARY KEY, lsm (txnid HASH, txntype ASC, txndate ASC)
“dw_for_custpmt_exp_mid_idx” lsm (upper(mid) HASH)
“dw_for_custpmt_rmtorgid_idx” lsm (rmtrorgid HASH)
“dw_for_custpmt_rmtrmidmob_idx” lsm (rmtrmobmidno HASH)
“dw_for_custpmt_rrn_idx” lsm (rrn HASH)
Partitions: dw_for_custpmt_2023_06_01 FOR VALUES FROM (‘2023-06-01’) TO (‘2023-06-02’),
dw_for_custpmt_2023_06_02 FOR VALUES FROM (‘2023-06-02’) TO (‘2023-06-03’),
dw_for_custpmt_2023_06_03 FOR VALUES FROM (‘2023-06-03’) TO (‘2023-06-04’),
dw_for_custpmt_2023_06_04 FOR VALUES FROM (‘2023-06-04’) TO (‘2023-06-05’),
dw_for_custpmt_2023_06_05 FOR VALUES FROM (‘2023-06-05’) TO (‘2023-06-06’),
dw_for_custpmt_2023_06_06 FOR VALUES FROM (‘2023-06-06’) TO (‘2023-06-07’),
dw_for_custpmt_2023_06_07 FOR VALUES FROM (‘2023-06-07’) TO (‘2023-06-08’),
dw_for_custpmt_2023_06_08 FOR VALUES FROM (‘2023-06-08’) TO (‘2023-06-09’),
dw_for_custpmt_2023_06_09 FOR VALUES FROM (‘2023-06-09’) TO (‘2023-06-10’),
dw_for_custpmt_2023_06_10 FOR VALUES FROM (‘2023-06-10’) TO (‘2023-06-11’),
dw_for_custpmt_2023_06_11 FOR VALUES FROM (‘2023-06-11’) TO (‘2023-06-12’),
dw_for_custpmt_2023_06_12 FOR VALUES FROM (‘2023-06-12’) TO (‘2023-06-13’),
dw_for_custpmt_2023_06_13 FOR VALUES FROM (‘2023-06-13’) TO (‘2023-06-14’),
dw_for_custpmt_2023_06_14 FOR VALUES FROM (‘2023-06-14’) TO (‘2023-06-15’),
dw_for_custpmt_2023_06_15 FOR VALUES FROM (‘2023-06-15’) TO (‘2023-06-16’),
dw_for_custpmt_2023_06_16 FOR VALUES FROM (‘2023-06-16’) TO (‘2023-06-17’),
dw_for_custpmt_2023_06_17 FOR VALUES FROM (‘2023-06-17’) TO (‘2023-06-18’),
dw_for_custpmt_2023_06_18 FOR VALUES FROM (‘2023-06-18’) TO (‘2023-06-19’),
dw_for_custpmt_2023_06_19 FOR VALUES FROM (‘2023-06-19’) TO (‘2023-06-20’),
dw_for_custpmt_2023_06_20 FOR VALUES FROM (‘2023-06-20’) TO (‘2023-06-21’),
dw_for_custpmt_2023_06_21 FOR VALUES FROM (‘2023-06-21’) TO (‘2023-06-22’),
dw_for_custpmt_2023_06_22 FOR VALUES FROM (‘2023-06-22’) TO (‘2023-06-23’),
dw_for_custpmt_2023_06_23 FOR VALUES FROM (‘2023-06-23’) TO (‘2023-06-24’),
dw_for_custpmt_2023_06_24 FOR VALUES FROM (‘2023-06-24’) TO (‘2023-06-25’),
dw_for_custpmt_2023_06_25 FOR VALUES FROM (‘2023-06-25’) TO (‘2023-06-26’),
dw_for_custpmt_2023_06_26 FOR VALUES FROM (‘2023-06-26’) TO (‘2023-06-27’),
dw_for_custpmt_2023_06_27 FOR VALUES FROM (‘2023-06-27’) TO (‘2023-06-28’),
dw_for_custpmt_2023_06_28 FOR VALUES FROM (‘2023-06-28’) TO (‘2023-06-29’),
dw_for_custpmt_2023_06_29 FOR VALUES FROM (‘2023-06-29’) TO (‘2023-06-30’),
dw_for_custpmt_2023_06_30 FOR VALUES FROM (‘2023-06-30’) TO (‘2023-07-01’)
Yes, that is why I am a bit confused. Use-case-wise, there will be no updates. But we will have 7 such tables, of which one holds 70% of the data, 3 hold 20%, and the rest hold 10%. The concern is the biggest table.
But yes, with month-wise partitions the number of tablets can be controlled. The only concern is that we would have to allow space for an extra 30 days.
As I understand your statements, you are suggesting specifying the number of tablets manually. Then, when the number of shards is low, will date-filtered queries take longer to execute? We will obviously check and share.
Setting ysql_num_shards_per_tserver to 3? This environment of ours is based on a single region, a single zone, and 5 T-Servers. Just to understand: if we use 3 zones and 6 T-Servers with 2 T-Servers per zone, will there be 3 tablet leaders per node, or a total of 6 tablet leaders distributed across the 2 T-Servers in each zone? In my understanding, with RF=3 the total one-day data will reside in each zone. Please let me know whether this understanding is correct.
We will check this and share the size.
The final data-loading and ingestion-rate test was performed with only 500 records per batch.
Yes, we do require 3 columns. txnid is unique per transaction, but for a few (5 to 10) percent of records there will be multiple rows per id, depending on the other two columns. This could be handled by allowing updates, but that is not an option, as the application would need to be modified. There are always some common columns (the lion's share) that actually get duplicated across those rows; we could split them into another table and let a few columns be duplicated in one of the two tables (if we go that way). But then, for such huge data, the cost of joining the two tables would also be high. It is true that many things that were not possible in Hive are possible here in YugabyteDB. But even then, the main table of concern will have a composite key, and, as in PostgreSQL, I think the partition key of a partitioned table here must also be part of the primary key.
We have just configured the packed row format. We tested ingestion with it and got an awesome result: nearly double the initial figures. We will test the read workload and record the results.
Yes, we will test this. While doing ingestion, we saw that the results on a table without partitions differ (are better) only when insertion goes through the partitioned parent table; inserting into named child partitions gives the same results as a table without partitions. We will test fetches and record the results.
Actually, those were findings from a very early stage, when we were doing things grossly wrong in some areas. We later reviewed and corrected this.
Why more space? And how much more? Do you intend to drop a partition daily?
To get maximum write performance, you probably need more than 1 tablet leader on each server. YugabyteDB creates more tablets than needed to allow for future growth, but you already provide for growth by partitioning. So it would probably be better to manually specify something like 2 or 4 tablets; see the yb-tserver configuration reference in the YugabyteDB docs.
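A sketch of specifying the tablet count explicitly when creating a child partition (the numbers are illustrative; if your version does not accept SPLIT INTO on a partition, the cluster-wide yb-tserver flag --ysql_num_shards_per_tserver controls the default instead):

```sql
-- Create a daily child partition with a fixed, small tablet count
-- instead of relying on the automatic per-tserver default.
CREATE TABLE dw_for_custpmt_2023_07_01
    PARTITION OF dw_for_custpmt
    FOR VALUES FROM ('2023-07-01') TO ('2023-07-02')
    SPLIT INTO 4 TABLETS;
```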
I meant 3 tablets total, so ysql_num_shards_per_tserver=1.
It should be 6 tservers * 3 ysql_num_shards_per_tserver = 18 tablet leaders. They then get distributed.