Hi.
I am running a series of YCSB tests to explore the capabilities of YugabyteDB, and I have a few questions based on the results of the tests.
The first test was performed on a cluster of four bare-metal servers. Three of the servers ran both a master and a tserver at the same time; the fourth ran only a tserver.
Bare-metal spec: 2x Intel Xeon Silver 4314 (16 cores @ 2.40 GHz each); HT on; 768 GB RAM; Mellanox CX-4 NIC; 6x NVMe disks.
I loaded the data into YugabyteDB:
recordcount = 2 billion
I used one to five clients running in virtual machines inside a Podman container, with Workload A. I expected the bottleneck to be the network, but in reality, with a total threadcount of ~4000 across all clients, I saw high CPU utilization on the bare-metal servers of the cluster. Is this expected behavior? How can I reduce CPU utilization?
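For context, here is a minimal sketch of how such a client run can be launched. The YCSB install path, the binding name, the host list, and the operation counts are placeholders, not my exact setup (upstream YCSB uses the cassandra-cql binding for YCQL; Yugabyte's YCSB fork ships its own bindings, so adjust accordingly):

```python
# Hedged sketch: launch one YCSB Workload A client against the cluster.
# Install path, binding name, hosts, and counts are placeholders.
import subprocess

YCSB_HOME = "/opt/ycsb"                  # assumed install location
HOSTS = "10.0.0.1,10.0.0.2,10.0.0.3"     # placeholder tserver addresses
THREADS = 800                            # per client (~4000 total across 5 clients)

cmd = [
    f"{YCSB_HOME}/bin/ycsb", "run", "cassandra-cql",  # or the binding from Yugabyte's fork
    "-P", f"{YCSB_HOME}/workloads/workloada",
    "-p", f"hosts={HOSTS}",
    "-p", "recordcount=2000000000",
    "-p", "operationcount=100000000",    # placeholder
    "-threads", str(THREADS),
]

# Save the YCSB summary so throughput/latency can be compared across runs.
with open(f"ycsb_run_t{THREADS}.log", "w") as log:
    subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```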
Also, I found a guide that promised a nice performance boost: running the tservers with --ycql_enable_packed_row=true. I only got about a 5-6% performance gain. Is this expected behavior? Why is this setting not enabled by default, and what is the cost of enabling ycql_enable_packed_row?
Note that troubleshooting a benchmark correctly is very hard. Many things can go wrong at every layer, including your client code/machines, and unless you explicitly monitor & graph all layers it will be hard to guess what's going wrong, or whether anything is going wrong at all.
It depends on the scenario: the number of columns & column size, and full inserts vs. updates. Read how it works to understand more: Packed rows in DocDB | YugabyteDB Docs
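A rough way to see the inserts-vs-updates point (this is a back-of-envelope reading of the packed rows docs, not an exact description of the storage format): without packing, a full-row insert writes roughly one DocDB key-value entry per column, while a packed insert writes a single entry for the whole row; single-column updates write one entry either way and only get packed later during compaction. Since Workload A is 50% single-field updates, that limits the gain there.

```python
# Rough arithmetic sketch (assumption-based, see above) of how many DocDB
# key-value entries a YCSB row produces with and without packed rows.
FIELDS = 10          # YCSB default fieldcount
FIELD_SIZE = 100     # YCSB default fieldlength (bytes), for context only

insert_entries_unpacked = FIELDS   # ~one KV entry per non-key column
insert_entries_packed = 1          # whole row packed into a single KV entry
update_entries = 1                 # Workload A updates a single field

print(f"full-row insert, unpacked: ~{insert_entries_unpacked} KV entries")
print(f"full-row insert, packed:    {insert_entries_packed} KV entry")
print(f"single-field update:        {update_entries} KV entry in both modes")
```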
Just like many other features that aren't enabled by default: it takes time to test, etc.
What are the benchmark numbers? Maybe they're already OK?
We'd need to know the data set; this is probably all in-memory.
Maybe you're overwriting the same rows, and reads then have to check many row versions, resulting in high overhead (this happens with "queue"-like patterns or heavy deletes).
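As a toy illustration of this (assuming the workload keeps YCSB's default zipfian request distribution, so this is an assumption about your runs, not a measurement): the hottest keys receive a disproportionate number of updates, and each update is another version a read of that row may have to look past until compaction.

```python
# Toy sketch: how a zipfian(0.99) key chooser (YCSB's default
# requestdistribution) concentrates updates on a handful of hot rows.
import random
from itertools import accumulate

KEYS = 100_000    # toy key space (the real test used 2 billion records)
OPS = 1_000_000   # toy number of update operations
THETA = 0.99      # YCSB's default zipfian constant

weights = [1.0 / (rank ** THETA) for rank in range(1, KEYS + 1)]
cum = list(accumulate(weights))

counts = {}
for key in random.choices(range(KEYS), cum_weights=cum, k=OPS):
    counts[key] = counts.get(key, 0) + 1

hottest = sorted(counts.values(), reverse=True)[:10]
print("updates landing on the 10 hottest keys:", hottest)
# Each of these updates leaves another version that a read of the same key
# may need to skip until compaction cleans it up.
```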
I was able to improve my test results by running the client on a bare-metal server and finding the optimal threadcount value.
This is the best ratio of operations per second to latency I could get.
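For anyone repeating this, a sketch of one way to compare such runs. It assumes the standard YCSB summary lines ("[OVERALL], Throughput(ops/sec), ..." and "[READ]/[UPDATE], AverageLatency(us), ...") and the hypothetical log naming from the sketch above, not my exact tooling:

```python
# Sketch: pick the threadcount with the best throughput-to-latency ratio
# from saved YCSB run logs (file names are placeholders).
import glob
import re

def parse_summary(path):
    """Extract overall throughput and average READ/UPDATE latency from one YCSB log."""
    throughput, latencies = None, []
    with open(path) as f:
        for line in f:
            if line.startswith("[OVERALL], Throughput(ops/sec),"):
                throughput = float(line.rsplit(",", 1)[1])
            m = re.match(r"\[(READ|UPDATE)\], AverageLatency\(us\), ([\d.]+)", line)
            if m:
                latencies.append(float(m.group(2)))
    avg_latency = sum(latencies) / len(latencies) if latencies else None
    return throughput, avg_latency

results = {}
for path in glob.glob("ycsb_run_t*.log"):
    threads = int(re.search(r"_t(\d+)\.log$", path).group(1))
    tput, lat = parse_summary(path)
    if tput and lat:
        results[threads] = (tput, lat, tput / lat)

for threads, (tput, lat, ratio) in sorted(results.items()):
    print(f"threads={threads:5d}  ops/s={tput:10.0f}  avg_lat_us={lat:8.1f}  ratio={ratio:.3f}")

if results:
    best = max(results, key=lambda t: results[t][2])
    print("best throughput/latency ratio at threadcount", best)
```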
Thanks for the link, I had missed that page in the documentation. Was the bottleneck in your tests also the CPU?
I'm confused by the 15-20% of CPU going to system calls that I see when running perf top.
Most of that kernel time is attributed to native_queued_spin_lock_slowpath. I built a flame graph, but it didn't help me: native_queued_spin_lock_slowpath appears in many branches, yet its count relative to the total number of samples is not high.
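For reference, this is the kind of quick check that gives that relative number. It is a sketch assuming folded stacks produced by perf script | stackcollapse-perf.pl (one "frame;frame;... count" line per stack) written to an assumed out.folded file:

```python
# Sketch: what share of samples contain native_queued_spin_lock_slowpath,
# based on a folded-stacks file (output of stackcollapse-perf.pl).
import sys

SYMBOL = "native_queued_spin_lock_slowpath"

total = matched = 0
with open(sys.argv[1] if len(sys.argv) > 1 else "out.folded") as f:
    for line in f:
        stack, _, count = line.rstrip("\n").rpartition(" ")
        if not count.isdigit():
            continue  # skip malformed lines
        samples = int(count)
        total += samples
        if SYMBOL in stack:
            matched += samples

if total:
    print(f"{SYMBOL}: {matched} of {total} samples ({100.0 * matched / total:.1f}%)")
```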
I'll try building the flame graph again and post the results here.