[Question posted by a user on YugabyteDB Community Slack ]
What’s the best way to count the number of rows in a YCQL table with some degree of parallelism?
One option for this kind of parallel processing (reporting or analytic queries) on a YCQL table is to use Apache Spark.
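As a rough sketch of that route, a PySpark job can read the table through a Cassandra-compatible Spark connector and count it; this assumes the YugabyteDB fork of the Spark Cassandra connector (or a compatible connector jar) is supplied to Spark, and reuses the keyspace/table names from the examples below:

from pyspark.sql import SparkSession

# Point the connector at one or more YCQL contact points.
spark = (SparkSession.builder
         .appName("ycql-row-count")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Load the YCQL table as a DataFrame and let Spark parallelize the count.
row_count = (spark.read
             .format("org.apache.spark.sql.cassandra")
             .options(keyspace="ybdemo_keyspace", table="test")
             .load()
             .count())

print("rows:", row_count)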
For the simpler SELECT COUNT(*), if you don’t want to fire up a Spark job just to get table counts and need something lighter weight, you can try this Python script.
You’ll need to update these params (at least the first two) in the script:
cluster = Cluster(['127.0.0.1']) --> change to a comma-separated list of a few node IPs in the cluster
keyspace_name="ybdemo_keyspace" --> change to the keyspace you want to count
num_tasks_per_table=4096 --> can be left as is
num_parallel_tasks=8 --> can be left as is
The script will find all tables in the YCQL keyspace; for each table it will discover the partition key columns, issue SELECT COUNT(*) queries of the form below for each “sub-task”/partition slice with the maximum parallelism you specified above, and aggregate the sums.
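A minimal sketch of the discovery step, assuming the cassandra-driver package and the system_schema catalog tables that YCQL exposes (same names as in Apache Cassandra):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])   # from the params above
session = cluster.connect()
keyspace_name = "ybdemo_keyspace"  # from the params above

# List all tables in the keyspace.
tables = session.execute(
    "SELECT table_name FROM system_schema.tables WHERE keyspace_name = %s",
    (keyspace_name,))

for t in tables:
    cols = session.execute(
        "SELECT column_name, kind FROM system_schema.columns "
        "WHERE keyspace_name = %s AND table_name = %s",
        (keyspace_name, t.table_name))
    # Only partition key columns can be used inside partition_hash().
    partition_cols = [c.column_name for c in cols if c.kind == 'partition_key']
    print(t.table_name, partition_cols)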
The queries issued will be of the form:
SELECT count(*) as rows FROM k2.test
WHERE partition_hash(partition_key_col) >= ? AND partition_hash(partition_key_col) <= ?
for successive partition slices such as (0, 15), (16, 31), and so on (with num_tasks_per_table=4096, the 65536-value hash space splits into 4096 slices of 16 values each).
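Putting it together, here is a minimal sketch of the slice-and-sum logic for a single table, assuming the cassandra-driver package, the k2.test table and partition_key_col column from the example above, and that partition_hash() ranges over the 16-bit hash space [0, 65535]; a ThreadPoolExecutor caps the parallelism:

from concurrent.futures import ThreadPoolExecutor

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

num_tasks_per_table = 4096  # slices per table, as in the params above
num_parallel_tasks = 8      # max concurrent COUNT(*) queries

# partition_hash() values span [0, 65535]; split that range into
# num_tasks_per_table contiguous, non-overlapping slices.
HASH_SPACE = 64 * 1024
step = HASH_SPACE // num_tasks_per_table
slices = [(lo, lo + step - 1) for lo in range(0, HASH_SPACE, step)]

def count_slice(bounds):
    lo, hi = bounds
    rs = session.execute(
        "SELECT COUNT(*) AS rows FROM k2.test "
        "WHERE partition_hash(partition_key_col) >= %s "
        "AND partition_hash(partition_key_col) <= %s",
        (lo, hi))
    return rs.one().rows

# Run up to num_parallel_tasks slice counts at a time and sum the results.
with ThreadPoolExecutor(max_workers=num_parallel_tasks) as pool:
    total = sum(pool.map(count_slice, slices))

print("total rows in k2.test:", total)

Because the slices partition the hash space exactly, each row is counted once, so summing the per-slice counts yields the table’s total row count.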