[Question posted by a user on YugabyteDB Community Slack ]
One option to do this kind of parallel processing (reporting or analytic queries) on a YCQL table would be to use Spark.
For the simpler SELECT Count(*), if you don’t want to fire up a spark job to get the table counts, and need something lighter weight, you can try this Python script.
You’ll need to update these params (at least the first two) in the script:
cluster = Cluster(['127.0.0.1']) --> change to a comma separated list of a few IPs in the cluster keyspace_name="ybdemo_keyspace" --> change to keyspace you want num_tasks_per_table=4096 --> can leave this setting as is num_parallel_tasks=8 --> can leave this setting as is
The script will find the all tables in a YCQL keyspace, and then for each table, find out the partition columns, and then do SELECT COUNT(*) queries of the form below for each “sub-task/partition” with a max parallelism you specify above and aggregate the sums.
The queries issued will be of the form:
SELECT count(*) as rows FROM k2.test WHERE partition_hash(partition_key_col) >= ? AND partition_hash(partition_key_col) <= ?
for various partition slices like (0, 15) (16, 31) and so on.