Invalid Arguments. All hash columns must be set if order by clause is present

Hi there! I have a question about an error I get when I try to use an order clause in a select on a table. Simple select runs fine, select with order by gives an error. The table only has 6 rows, so this is not an issue with a large cluster/data. I have not been able to find anything similar online - can you please tell me what I am doing wrong?
Please find the table definition and select output below.

cqlsh:test> desc test ;

CREATE TABLE test.test (
    id text,
    retailer_moniker text,
    interval_start timestamp,
    interval_end timestamp,
    key_data text,
    created_dtm timestamp,
    PRIMARY KEY (id, retailer_moniker, interval_start)
) WITH CLUSTERING ORDER BY (retailer_moniker ASC, interval_start DESC)
    AND default_time_to_live = 63072000
    AND transactions = {'enabled': 'false'};
cqlsh:test> select * from test where retailer_moniker = 'homedepot';

 id                                   | retailer_moniker | interval_start                  | interval_end                    | key_data             | created_dtm
--------------------------------------+------------------+---------------------------------+---------------------------------+----------------------+---------------------------------
 fa92eb7e-cbaa-11ea-87d0-0242ac130003 |        homedepot | 2020-07-01 00:00:00.000000+0000 | 2020-08-01 00:00:00.000000+0000 | secret key text July | 2020-07-22 18:56:19.833000+0000
 02536780-cbab-11ea-87d0-0242ac130003 |        homedepot | 2020-06-01 00:00:00.000000+0000 | 2020-07-01 00:00:00.000000+0000 | secret key text June | 2020-07-22 18:56:19.380000+0000
 079a6298-cbab-11ea-87d0-0242ac130003 |        homedepot | 2020-05-01 00:00:00.000000+0000 | 2020-06-01 00:00:00.000000+0000 |  secret key text May | 2020-07-22 18:56:19.358000+0000

(3 rows)
cqlsh:test> select * from test order by retailer_moniker, interval_start;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid Arguments. All hash columns must be set if order by clause is present.
select * from test order by retailer_moniker, interval_start;
                                          ^^^^^^^^^^^^^^^^
 (error -304)"

Hi @Vlad_Koretsky

Not specifying the partitioning-columns (id in this case) will result in cluster-wide full table scan and is an anti pattern. See docs https://docs.yugabyte.com/latest/api/ycql/dml_select/#order-by-clause.

If this is needed, can you open an issue on github ?

hi @Vlad_Koretsky

In the link @dorian_yugabyte shared about the ORDER BY clause, see that

<< Currently, only two overall orderings are allowed, the clustering order from the CREATE TABLE statement (forward scan) or its opposite (reverse scan). >>

In your CLUSTERING clause, the columns are ASC+DESC, but ORDER BY (for retailer_moniker+interval_start) is ASC/ASC.

YCQL, unlike YSQL, doesn’t support ORDER BY that requires a dynamic/run-time sort.

Dorian, Kannan, thank you for your replies!

A follow up question on both your answers - In my scenario, I have 2 use-cases:

  1. Read: I need an efficient lookup on all rows that belong to a specific retailer (known value). ID value is randomly generated during the save operation (not in YB), so it is not known at the time of read.
  2. Delete: I need to be able to delete a record for a retailer with a given ID for a given retailer (without doing a full table scan.

If I partition my records by retailer and cluster on interval_start, I believe that all retailer records would be stored on the same node and sorted by interval_start field (efficient lookup). This result could then be filtered by non-indexed field ID. Is my understanding correct?

CREATE TABLE test.test (
    id text,
    retailer_moniker text,
    interval_start timestamp,
    interval_end timestamp,
    key_data text,
    created_dtm timestamp,
    PRIMARY KEY (retailer_moniker, interval_start)
) WITH CLUSTERING ORDER BY (interval_start ASC)
    AND default_time_to_live = 63072000
    AND transactions = {'enabled': 'false'};```

@Vlad_Koretsky

That is correct.

All rows of a given retailer_moniker value fall on same partition/tablet/node and will be sorted by interval_start field. This will handle your case (1). For case (2), you’ll have to page through these rows, to filter the non-indexed ID field, and then delete it.

An unrelated comment…

If the number of rows per “retailer_moniker” are highly varying, and if you have say a specific 'retailer_moniker’s that tends to be very high volume, or has a burst of activity during some times, then the nodes hosting the replicas of that shard can become hot spots. If this can be an issue, you might want to spread each 'retailer_moniker’s rows across a bunch of “application” level shards (say 8 or 16) in addition… that is, make the “retailer_moniker+app_shard_id” the partition key. But in this case, the lookups have to be done to all those 8 or 16 shards. That’s a design tradeoff you would need to make.