Uncached/Large Data Set Random Read Workload

One of our users was interested in looking at YugaByte DB’s behavior for a random read workload where the data set does not fit in RAM (i.e., an uncached random read workload). The intent was to verify whether YugaByte DB handles this case well, with the optimal number of IOs to the disk subsystem.

Summary

We loaded about 360GB of raw data per node and configured the block cache on each node to be only about 7.5GB. On four 8-CPU machines, we were able to get about 77K ops/second with read latencies under 0.88 milliseconds.

YugaByte’s ability to cache the index and bloom filter blocks effectively, combined with its highly optimized storage engine (DocDB), allows us to keep the number of disk accesses at the optimal level. Details below.

Setup

4-node cluster on Google Cloud Platform (GCP)

Each node is a:

  • n1-standard-8
  • 8 vcpu; Intel(R) Xeon(R) CPU @ 2.60GHz
  • RAM: 30GB
  • SSD: 2 x 375GB

Replication Factor = 3
Default for db_block_size = 32KB

Load Phase & Data Set Size:

  • Number of KVs: 1.6 Billion
  • KV Size: ~300 bytes
    ** Value size: 256 bytes (chosen to be not very compressible)
    ** Key size: 50 bytes
  • Logical Data Size: 1.6B * 300 == 480GB
  • Raw Data Including Replication: 480GB * 3 == 1.4TB
  • Raw Data per Node: 1.4TB / 4 nodes = 360GB
  • Block Cache Size: 7.5GB per node (see the sanity-check calculation below)
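As a quick sanity check of the sizing above, the arithmetic can be written out as follows (a small Python sketch; the inputs are simply the numbers listed in this section):

# Back-of-the-envelope sizing for this setup.
num_kvs  = 1.6e9    # 1.6 billion key-value pairs
kv_bytes = 300      # ~50-byte key + 256-byte value
rf       = 3        # replication factor
nodes    = 4

logical_gb    = num_kvs * kv_bytes / 1e9    # ~480 GB of logical data
replicated_gb = logical_gb * rf             # ~1.44 TB including replication
per_node_gb   = replicated_gb / nodes       # ~360 GB per node

block_cache_gb = 7.5
print("logical: ~%.0f GB, replicated: ~%.2f TB, per node: ~%.0f GB"
      % (logical_gb, replicated_gb / 1e3, per_node_gb))
print("block cache covers ~%.1f%% of the per-node data"
      % (100.0 * block_cache_gb / per_node_gb))

The block cache covers only about 2% of the data on each node, which is why this is very much an uncached workload: nearly every data-block read has to go to disk.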

Load the data using the provided sample applications jar:

% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar --workload CassandraKeyValue --nodes 10.128.0.2:9042,10.128.0.5:9042,10.128.0.4:9042,10.128.0.3:9042 --num_threads_write 200 --num_threads_read 0 --value_size 256 --num_unique_keys 1629145600 --num_writes 1629145601 --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/load.txt &

Workload: Random Read

We use 150 concurrent readers; the reads use a random distribution across the 1.6B keys loaded into the system.

% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar --workload CassandraKeyValue --nodes 10.128.0.2:9042,10.128.0.5:9042,10.128.0.4:9042,10.128.0.3:9042 --num_threads_write 0 --num_threads_read 150 --value_size 256 --read_only --max_written_key 1629145600 --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/read.txt &
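For readers who want to get a feel for the access pattern without the sample-apps jar, each reader thread is conceptually doing something like the loop below. This is only a sketch using the Python cassandra-driver against the YCQL (Cassandra-compatible) port 9042; the keyspace, table, and key format shown here are hypothetical placeholders, not the actual schema that yb-sample-apps creates.

import random
from cassandra.cluster import Cluster   # pip install cassandra-driver

# Connect to the same YCQL endpoints used in the commands above.
cluster = Cluster(['10.128.0.2', '10.128.0.5', '10.128.0.4', '10.128.0.3'])
session = cluster.connect()

MAX_WRITTEN_KEY = 1629145600   # matches --max_written_key above

# Hypothetical keyspace/table/key format -- yb-sample-apps defines its own schema.
stmt = session.prepare("SELECT v FROM ybdemo.kv WHERE k = ?")

# Each of the 150 reader threads picks keys at random across the loaded key space.
for _ in range(100000):
    key = "key:%d" % random.randrange(MAX_WRITTEN_KEY)
    session.execute(stmt, [key])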

Results

Disk Utilization

Sample disk IO on one of the nodes during the “random read” workload is shown below. The stats show that the load is evenly distributed across the 2 available data SSDs on the system. Each of the four nodes handles about 16.4K disk read ops/sec (for the 77K user read ops/sec cluster-wide).

The index blocks and bloom filters are cached effectively, so essentially all of the misses, about 8.2K per disk as shown below, are for data blocks. The IO volume is about 230MB/s on each disk for those 8.2K disk read ops/sec.

The average IO size (230MB/s divided by 8.2K reads/sec) is about 29KB. This corresponds to our db_block_size of 32KB; it is slightly smaller because, while the keys in this setup are somewhat compressible, the bulk of the data volume is in the value portion, which was deliberately chosen to be not very compressible.

% iostat -mx 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
      60.07    0.00   18.05   11.42    0.00   10.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00 8207.00    0.00   232.65     0.00    58.06     2.97    0.36    0.36    0.00   0.10  82.80
sda               0.00     0.00    0.00    0.20     0.00     0.00     8.00     0.00    1.00    0.00    1.00   1.00   0.02
sdb               0.00     0.00 8193.60    0.00   231.70     0.00    57.91     2.85    0.35    0.35    0.00   0.10  82.18

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
      57.62    0.00   18.72   10.13    0.00   13.54

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00 8212.60    0.00   232.45     0.00    57.97     2.56    0.31    0.31    0.00   0.10  79.98
sda               0.00     0.00    0.20    0.00     0.00     0.00     8.00     0.00   21.00   21.00    0.00  21.00   0.42
sdb               0.00     0.00 8196.40    0.00   231.83     0.00    57.93     2.50    0.31    0.31    0.00   0.10  79.24
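Putting the per-disk numbers from the iostat samples together (a quick sketch; the figures are the rounded values quoted earlier in this post):

# Per-disk figures from the iostat samples above (rounded).
reads_per_sec_per_disk = 8200      # r/s on each of sdb and sdc
mb_per_sec_per_disk    = 230       # rMB/s on each of sdb and sdc
disks_per_node         = 2
nodes                  = 4
cluster_user_reads     = 77000     # user-facing read ops/sec across the cluster

avg_io_kb           = mb_per_sec_per_disk * 1024.0 / reads_per_sec_per_disk
disk_reads_per_node = reads_per_sec_per_disk * disks_per_node
disk_reads_cluster  = disk_reads_per_node * nodes

print("average disk IO size: ~%.0f KB (db_block_size is 32 KB)" % avg_io_kb)
print("disk reads: ~%.1fK/sec per node, ~%.1fK/sec cluster-wide for ~%dK user reads/sec"
      % (disk_reads_per_node / 1e3, disk_reads_cluster / 1e3, cluster_user_reads / 1e3))

That works out to roughly (in fact slightly under) one data-block read from disk per user read, which is consistent with the index and bloom filter blocks being served from cache and only the data blocks missing.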

Hi, thanks for posting this! As a comparison, can you repeat the test with Cassandra, and maybe Scylla?

Conrad

Hi @hengestone:

Our primary intent was to share insights about YugaByte and its design point/current state.

Time permitting, we will consider doing some comparisons with other products. Those are generally much more time-consuming exercises, especially getting the various defaults tuned to the optimal recommended level for each product (e.g., in the case of Apache Cassandra, which JVM version, GC options, and heap sizes to use).

regards,
Kannan