CQL performance

Hi.
I am running a series of YCSB tests to explore the capabilities of YugabyteDB. I have a few questions based on the results of the tests.

The first test was performed on a cluster of four bare-metal servers. Three of the servers ran a master and a tserver at the same time; the fourth server ran only a tserver.
Bare-metal hardware: 2x Intel Xeon Silver 4314 (16 cores @ 2.40 GHz); HT on; 768 GB RAM; Mellanox CX-4 NIC; 6x NVMe disks

I loaded the data into YugabyteDB with recordcount = 2 billion.

I used one to five clients running in virtual machines inside podman containers, with Workload A. I expected the bottleneck to be the network, but in reality, with a total threadcount of ~4000 across all clients, I saw high CPU utilization on the bare-metal servers in the cluster. Is this expected behavior? How can I reduce CPU utilization?
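
For context, a Workload A run of this shape with the Yugabyte YCSB fork looks roughly like the following (the yugabyteCQL binding name and the db.properties layout follow the fork’s docs; hosts, counts, and thread values are illustrative, not my exact configuration):

  # Load phase (illustrative values)
  ./bin/ycsb load yugabyteCQL -s \
    -P db.properties -P workloads/workloada \
    -p recordcount=2000000000 -threads 256

  # Run phase, repeated from each client VM
  ./bin/ycsb run yugabyteCQL -s \
    -P db.properties -P workloads/workloada \
    -p recordcount=2000000000 -p operationcount=100000000 -threads 800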

Also, I found instructions that promised a nice performance boost: running the tservers with --ycql_enable_packed_row=true. However, I only got about a 5-6% performance gain. Is this expected behavior? Why is this setting not enabled by default? What is the cost of enabling ycql_enable_packed_row?
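
For context, the flag is just passed on the yb-tserver command line; an illustrative invocation (data directories and master addresses below are placeholders, not my exact setup) looks like:

  ./bin/yb-tserver \
    --fs_data_dirs=/mnt/nvme0,/mnt/nvme1 \
    --tserver_master_addrs=node1:7100,node2:7100,node3:7100 \
    --ycql_enable_packed_row=true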

Hi @godofevil

Note that correct benchmarking is very hard to troubleshoot. Many things can go wrong at every layer, including your client code and client machines, and unless you explicitly monitor and graph all layers, it will be hard to guess what’s going wrong, or whether anything is wrong at all.

The gain depends on the scenario: the number of columns, the column sizes, and full inserts vs. updates. Read how it works to understand more: Packed rows in DocDB | YugabyteDB Docs

For the same reason most new features aren’t enabled by default: it takes time to test, etc.

I don’t think there’s any cost.

Some notes:

  • The monitoring needs to be per-node. Maybe one or a few nodes are the bottleneck?
  • Maybe also a screenshot of yb-master:7000/tablet-servers to see how data & operations are spread.
  • You need monitoring for the client nodes too. Maybe the bottleneck is there?
  • Are you using our fork of YCSB? Benchmark YCQL performance in YugabyteDB with YCSB | YugabyteDB Docs
  • What are the benchmark numbers? Maybe they’re OK?
  • We need to know the data-set size. This is probably all in-memory.

Maybe you’re overwriting the same rows, and reads then have to check many row versions, resulting in high overhead (this happens with queue-like patterns or heavy deletes).

Hi @godofevil, the ycql_enable_packed_row flag is still in preview, as noted in the documentation: yb-tserver configuration reference | YugabyteDB Docs. There are a few issues around TTL columns and their management that need to be hardened.

What exactly is the nature of your Workload A? We have some performance numbers from running the CassandraKeyValue sample application on our docs page: Benchmark YCQL performance with large datasets | YugabyteDB Docs.
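
If you want to compare against those numbers, a minimal sketch of running the CassandraKeyValue sample app (node address, thread counts, and value size below are only examples) is:

  java -jar yb-sample-apps.jar \
    --workload CassandraKeyValue \
    --nodes <tserver-ip>:9042 \
    --num_threads_write 64 --num_threads_read 64 \
    --value_size 256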

It’s documented here: YCSB/workloads/workloada at master · yugabyte/YCSB · GitHub. They should still specify the full command/config, though.

All three master/tserver nodes are in the same situation, i.e. their monitoring graphs look similar.

[monitoring graph screenshot, 2024-09-13 11:36:33]

There is no problem on the clients. I run multiple clients to spread the load.

Yes

Workload A: 50% read and 50% update.
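
For reference, the stock workloada definition in the YCSB repo (record/operation counts are overridden on the command line) boils down to:

  workload=site.ycsb.workloads.CoreWorkload
  readallfields=true
  readproportion=0.5
  updateproportion=0.5
  scanproportion=0
  insertproportion=0
  requestdistribution=zipfian

Note that requestdistribution=zipfian concentrates operations on a small set of hot keys, which is relevant to the earlier point about repeatedly overwriting the same rows.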

My result

[OVERALL], RunTime(ms), 940396
[OVERALL], Throughput(ops/sec), 106338.18093654163
[TOTAL_GCS_G1_Young_Generation], Count, 1441
[TOTAL_GC_TIME_G1_Young_Generation], Time(ms), 17023
[TOTAL_GC_TIME_%G1_Young_Generation], Time(%), 1.810194854082748
[TOTAL_GCS_G1_Old_Generation], Count, 0
[TOTAL_GC_TIME_G1_Old_Generation], Time(ms), 0
[TOTAL_GC_TIME_%G1_Old_Generation], Time(%), 0.0
[TOTAL_GCs], Count, 1441
[TOTAL_GC_TIME], Time(ms), 17023
[TOTAL_GC_TIME_%], Time(%), 1.810194854082748
[READ], Operations, 49996266
[READ], AverageLatency(us), 12432.989645466723
[READ], MinLatency(us), 1169
[READ], MaxLatency(us), 2490367
[READ], 95thPercentileLatency(us), 29023
[READ], 99thPercentileLatency(us), 143231
[READ], Return=OK, 49996266
[CLEANUP], Operations, 4096
[CLEANUP], AverageLatency(us), 540.586181640625
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 2213887
[CLEANUP], 95thPercentileLatency(us), 3
[CLEANUP], 99thPercentileLatency(us), 5
[UPDATE], Operations, 50003734
[UPDATE], AverageLatency(us), 63211.334732642164
[UPDATE], MinLatency(us), 1325
[UPDATE], MaxLatency(us), 3432447
[UPDATE], 95thPercentileLatency(us), 262911
[UPDATE], 99thPercentileLatency(us), 1070079
[UPDATE], Return=OK, 50003734

============================================

I was able to improve my test scores by running the client on a bare-metal server and finding the optimal threadcount value.
This is the best ratio of operations per second to latency I could get.
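
Finding the threadcount was essentially a sweep over values; a sketch of such a sweep (binding name, properties, and thread values are illustrative):

  for t in 64 128 256 512 1024 2048; do
    ./bin/ycsb run yugabyteCQL -s \
      -P db.properties -P workloads/workloada \
      -p operationcount=100000000 -threads $t \
      > run_threads_${t}.log
  done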

My result

[OVERALL], RunTime(ms), 562719
[OVERALL], Throughput(ops/sec), 177708.58990011
[TOTAL_GCS_G1_Young_Generation], Count, 604
[TOTAL_GC_TIME_G1_Young_Generation], Time(ms), 5179
[TOTAL_GC_TIME_%G1_Young_Generation], Time(%), 0.9203527870926697
[TOTAL_GCS_G1_Old_Generation], Count, 0
[TOTAL_GC_TIME_G1_Old_Generation], Time(ms), 0
[TOTAL_GC_TIME_%G1_Old_Generation], Time(%), 0.0
[TOTAL_GCs], Count, 604
[TOTAL_GC_TIME], Time(ms), 5179
[TOTAL_GC_TIME_%], Time(%), 0.9203527870926697
[READ], Operations, 50011498
[READ], AverageLatency(us), 1517.6875384336618
[READ], MinLatency(us), 322
[READ], MaxLatency(us), 272639
[READ], 95thPercentileLatency(us), 2115
[READ], 99thPercentileLatency(us), 2583
[READ], Return=OK, 50011498
[CLEANUP], Operations, 370
[CLEANUP], AverageLatency(us), 6020.983783783784
[CLEANUP], MinLatency(us), 0
[CLEANUP], MaxLatency(us), 2228223
[CLEANUP], 95thPercentileLatency(us), 4
[CLEANUP], 99thPercentileLatency(us), 12
[UPDATE], Operations, 49988502
[UPDATE], AverageLatency(us), 2607.616915105798
[UPDATE], MinLatency(us), 480
[UPDATE], MaxLatency(us), 948223
[UPDATE], 95thPercentileLatency(us), 4147
[UPDATE], 99thPercentileLatency(us), 9615
[UPDATE], Return=OK, 49988502

============================================

Thanks for the link, I missed that page in the documentation. Was the bottleneck in your tests also the CPU?

I’m confused by the 15-20% of CPU time spent in system calls. I’m running perf top, and most of that system time is in native_queued_spin_lock_slowpath. I built a flame graph, but it didn’t help me: native_queued_spin_lock_slowpath shows up in many branches, but its share relative to the total is not high.
I’ll try building a flame graph again and post the results here.
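
For reference, the usual perf record + FlameGraph workflow looks roughly like this (stackcollapse-perf.pl and flamegraph.pl come from Brendan Gregg’s FlameGraph repo; the sampling frequency and duration are arbitrary):

  # Sample all CPUs with call stacks for 60 seconds
  sudo perf record -F 99 -a -g -- sleep 60
  sudo perf script > out.perf
  # Fold the stacks and render the SVG
  ./stackcollapse-perf.pl out.perf > out.folded
  ./flamegraph.pl out.folded > tserver_flame.svg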

Generally it’s the CPU, see Deployment checklist for YugabyteDB clusters | YugabyteDB Docs.

For example, your CPU count seems a bit small for that amount of memory, but whether you need more depends on the workload.