Uncached/Large Data Set Random Read Workload

One of our users was interested in looking at YugaByte DB’s behavior for a random read workload where the data set does not fit in RAM (i.e., an uncached random read workload). The intent was to verify whether YugaByte DB handles this case well, with the optimal number of IOs to the disk subsystem.

Summary

We loaded about 360GB of raw data per node and configured the block cache on each node to be only about 7.5GB. On four 8-CPU machines, we were able to get about 77K ops/second with read latencies under 0.88 milliseconds.

YugaByte’s ability to cache the index and bloom filter blocks effectively, combined with its highly optimized storage engine (DocDB), allows us to keep the number of disk accesses at the optimal level. Details below.

Setup

4-node cluster on Google Cloud Platform (GCP)

Each node is a:

  • n1-standard-8
  • 8 vcpu; Intel(R) Xeon(R) CPU @ 2.60GHz
  • RAM: 30GB
  • SSD: 2 x 375GB

Replication Factor = 3
Default for db_block_size = 32KB

Load Phase & Data Set Size:

  • Number of KVs: 1.6 Billion
  • KV Size: ~300 bytes
    ** Value size: 256 bytes (chosen to be not very compressible)
    ** Key size: 50 bytes
  • Logical Data Size: 1.6B * 300 == 480GB
  • Raw Data Including Replication: 480GB * 3 == 1.4TB
  • Raw Data per Node: 1.4TB / 4 nodes = 360GB
  • Block Cache Size: 7.5GB per node (see the sanity-check calculation below)
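As a quick sanity check of the sizing above, the arithmetic can be written out as follows (a small Python sketch; the inputs are simply the numbers listed in this section):

# Back-of-the-envelope sizing for this setup.
num_kvs  = 1.6e9    # 1.6 billion key-value pairs
kv_bytes = 300      # ~50-byte key + 256-byte value
rf       = 3        # replication factor
nodes    = 4

logical_gb    = num_kvs * kv_bytes / 1e9    # ~480 GB of logical data
replicated_gb = logical_gb * rf             # ~1.44 TB including replication
per_node_gb   = replicated_gb / nodes       # ~360 GB per node

block_cache_gb = 7.5
print("logical: ~%.0f GB, replicated: ~%.2f TB, per node: ~%.0f GB"
      % (logical_gb, replicated_gb / 1e3, per_node_gb))
print("block cache covers ~%.1f%% of the per-node data"
      % (100.0 * block_cache_gb / per_node_gb))

The block cache covers only about 2% of the data on each node, which is why this is very much an uncached workload: nearly every data-block read has to go to disk.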

Load the data using the provided sample applications jar:

% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar --workload CassandraKeyValue --nodes 10.128.0.2:9042,10.128.0.5:9042,10.128.0.4:9042,10.128.0.3:9042 --num_threads_write 200 --num_threads_read 0 --value_size 256 --num_unique_keys 1629145600 --num_writes 1629145601 --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/load.txt &

Workload: Random Read

We use 150 concurrent readers; the reads use a random distribution across the 1.6B keys loaded into the system.

% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar --workload CassandraKeyValue --nodes 10.128.0.2:9042,10.128.0.5:9042,10.128.0.4:9042,10.128.0.3:9042 --num_threads_write 0 --num_threads_read 150 --value_size 256 --read_only --max_written_key 1629145600 --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/read.txt &
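For readers who want to get a feel for the access pattern without the sample-apps jar, each reader thread is conceptually doing something like the loop below. This is only a sketch using the Python cassandra-driver against the YCQL (Cassandra-compatible) port 9042; the keyspace, table, and key format shown here are hypothetical placeholders, not the actual schema that yb-sample-apps creates.

import random
from cassandra.cluster import Cluster   # pip install cassandra-driver

# Connect to the same YCQL endpoints used in the commands above.
cluster = Cluster(['10.128.0.2', '10.128.0.5', '10.128.0.4', '10.128.0.3'])
session = cluster.connect()

MAX_WRITTEN_KEY = 1629145600   # matches --max_written_key above

# Hypothetical keyspace/table/key format -- yb-sample-apps defines its own schema.
stmt = session.prepare("SELECT v FROM ybdemo.kv WHERE k = ?")

# Each of the 150 reader threads picks keys at random across the loaded key space.
for _ in range(100000):
    key = "key:%d" % random.randrange(MAX_WRITTEN_KEY)
    session.execute(stmt, [key])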

Results

Disk Utilization

Sample disk IO on one of the nodes during the “random read” workload is shown below. The stats show that the load is evenly distributed across the 2 available data SSDs on the system. Each of the four nodes handles about 16.4K disk read ops/sec (for the 77K user read ops/sec cluster-wide).

The index blocks and bloom filters are cached effectively, so essentially all of the misses, about 8.2K per disk as shown below, are for data blocks. The IO volume is about 230MB/s on each disk for those 8.2K disk read ops/sec.

The average IO size (230MB/s divided by 8.2K reads/sec) is about 29KB. This corresponds to our db_block_size of 32KB; it is slightly smaller because, while the keys in this setup are somewhat compressible, the bulk of the data volume is in the value portion, which was deliberately chosen to be not very compressible.

% iostat -mx 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
      60.07    0.00   18.05   11.42    0.00   10.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00 8207.00    0.00   232.65     0.00    58.06     2.97    0.36    0.36    0.00   0.10  82.80
sda               0.00     0.00    0.00    0.20     0.00     0.00     8.00     0.00    1.00    0.00    1.00   1.00   0.02
sdb               0.00     0.00 8193.60    0.00   231.70     0.00    57.91     2.85    0.35    0.35    0.00   0.10  82.18

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
      57.62    0.00   18.72   10.13    0.00   13.54

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00 8212.60    0.00   232.45     0.00    57.97     2.56    0.31    0.31    0.00   0.10  79.98
sda               0.00     0.00    0.20    0.00     0.00     0.00     8.00     0.00   21.00   21.00    0.00  21.00   0.42
sdb               0.00     0.00 8196.40    0.00   231.83     0.00    57.93     2.50    0.31    0.31    0.00   0.10  79.24
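Putting the per-disk numbers from the iostat samples together (a quick sketch; the figures are the rounded values quoted earlier in this post):

# Per-disk figures from the iostat samples above (rounded).
reads_per_sec_per_disk = 8200      # r/s on each of sdb and sdc
mb_per_sec_per_disk    = 230       # rMB/s on each of sdb and sdc
disks_per_node         = 2
nodes                  = 4
cluster_user_reads     = 77000     # user-facing read ops/sec across the cluster

avg_io_kb           = mb_per_sec_per_disk * 1024.0 / reads_per_sec_per_disk
disk_reads_per_node = reads_per_sec_per_disk * disks_per_node
disk_reads_cluster  = disk_reads_per_node * nodes

print("average disk IO size: ~%.0f KB (db_block_size is 32 KB)" % avg_io_kb)
print("disk reads: ~%.1fK/sec per node, ~%.1fK/sec cluster-wide for ~%dK user reads/sec"
      % (disk_reads_per_node / 1e3, disk_reads_cluster / 1e3, cluster_user_reads / 1e3))

That works out to roughly (in fact slightly under) one data-block read from disk per user read, which is consistent with the index and bloom filter blocks being served from cache and only the data blocks missing.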

Hi, thanks for posting this! As a comparison, can you repeat the test with Cassandra, and maybe Scylla?

Conrad

Hi @hengestone:

Our primary intent was to share insights about YugaByte and its design point/current state.

Time permitting, we will consider doing some comparisons with other products. Those are generally much more time-consuming exercises, especially getting the various defaults tuned to the optimal recommended level for each product (e.g., in the case of Apache Cassandra, which JVM version, GC options, and heap sizes to use).

regards,
Kannan