Why is CQL allowing inequality operators with partition key? - cassandra

The documentation is clear that the only operators allowed in a SELECT on a partition key column are equals (=) and IN (value1, value2[, ...]); however, with ALLOW FILTERING, it seems inequality operators are allowed. Here's a simple example:
CREATE TABLE dept_emp (
    emp_no INT,
    dept_no VARCHAR,
    from_date DATE,
    to_date DATE,
    PRIMARY KEY (emp_no, dept_no)
);
insert into dept_emp (emp_no, dept_no, from_date, to_date) values
(1, '9', '1901-01-01', '1920-02-01');
insert into dept_emp (emp_no, dept_no, from_date, to_date) values
(2, '9', '1920-01-01', '1930-01-01');
insert into dept_emp (emp_no, dept_no, from_date, to_date) values
(3, '9', '1920-01-01', '1930-01-01');
SELECT * FROM dept_emp WHERE emp_no > 1 ALLOW FILTERING;
 emp_no | dept_no | from_date  | to_date
--------+---------+------------+------------
      2 |       9 | 1920-01-01 | 1930-01-01
      3 |       9 | 1920-01-01 | 1930-01-01
(2 rows)
I took the documentation as describing what the CQL parser would recognize, and so was expecting an error like the one I get if I try a != operator. If this is just an ALLOW FILTERING thing, is it documented elsewhere which operators are allowed in that case?

The partition key is stored in token order, so a comparison like > requires reading the entire data set from all replicas and filtering out anything that doesn't match. This is horribly inefficient and expensive (which is why ALLOW FILTERING is required). The same would be true of !=. Generally, C* will outright refuse to do any operation that requires reading everything, as that is simply something the database is not designed for. ALLOW FILTERING permits some of these cases for things like Spark jobs, but they should be avoided in everything except one-off operational debugging tasks or well-thought-out OLAP jobs.
Equality on the partition key is required for the coordinator to know where to send the request and for the query to have any semblance of efficiency. I would highly recommend using only equality and changing your data model so that you can satisfy your queries that way.
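For completeness, the one range the coordinator can serve natively on a partition key is a token() range (roughly how tools like Spark split up full scans). A minimal sketch against the example table above:
-- Token-order paging: no ALLOW FILTERING needed, because token order
-- is the order in which partitions are actually distributed and stored.
SELECT * FROM dept_emp WHERE token(emp_no) > token(1);

-- The recommended OLTP pattern: equality on the partition key.
SELECT * FROM dept_emp WHERE emp_no = 2;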

Related

Reduce results to first match for each pattern with Spark SQL

I have a Spark SQL query where I have to search for multiple identifiers:
SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
Now I get hundreds of results for each of these matches, but I am only interested in the first match for each identifier, i.e. one row with identifier == 'abc', one where identifier == 'cde', and so on.
What is the best way to reduce my result to only the first row for each match?
The best approach certainly depends a bit on your data and also on what you mean by first. Is that any random row that happens to be returned first? Or first by some particular sort order?
A general flexible approach is using window functions. row_number() allows you to easily filter for the first row by window.
SELECT * FROM (
    SELECT *, row_number() OVER (PARTITION BY identifier ORDER BY ???) AS row_num
    FROM my_table
    WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
) tmp
WHERE row_num = 1
Aggregations like first or max_by are often more efficient, though; they just get inconvenient quickly when dealing with lots of columns.
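For example, assuming a hypothetical ordering column such as created_at, the max_by variant (available in Spark SQL 3.0+) could look like this:
-- max_by(col, created_at) keeps the value of col from the row with the
-- largest created_at per identifier; created_at is a hypothetical column.
SELECT identifier,
       max_by(col1, created_at) AS col1,
       max_by(col2, created_at) AS col2
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier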
You can use the first() aggregation function (after grouping by identifier) to only get the first row in each group.
But I don't think you'll be able to select * with this approach. Instead, you can list every individual column you want to get:
SELECT identifier, first(col1), first(col2), first(col3), ...
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier
Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
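A sketch of that approach in Spark SQL, wrapping each LIMIT 1 in a subquery so the limit applies per identifier:
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'abc' LIMIT 1) t1
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'cde' LIMIT 1) t2
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'efg' LIMIT 1) t3
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'ghi' LIMIT 1) t4
Without an ORDER BY inside each subquery, which row counts as "first" is still arbitrary.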
With the DataFrame API, you can run your original query and then call .dropDuplicates(["identifier"]) on the result to keep only a single row for each identifier value.

COUNT(*) vs. COUNT(1) vs COUNT(column) with WHERE condition performance in Cassandra

I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
The table is:
create table tableA
(
    pk1 uuid,
    pk2 int,
    pk3 text,
    pk4 int,
    fc1 int,
    fc2 bigint,
    ....
    fcn blob,
    primary key (pk1, pk2, pk3, pk4)
);
The query is executed often and takes up to 2s to execute.
I am wondering if there will be any performance gain from refactoring it to:
select count(1) from tableA where pk1 = xyz
Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
COUNT(1) counts the constant expression 1, which is never null, so every matching row is counted.
COUNT(column_name), on the other hand, counts only the non-null values of that column.
Since in your case the WHERE condition is on pk1 (a primary key column, which can never be null anyway), the non-null distinction is a non-factor, so I don't think there will be any performance difference using one over the other. This answer tried to confirm the same with some performance tests.
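If you want to verify this on your own cluster, one low-effort check is to run both forms with cqlsh tracing enabled and compare the reported latencies (a sketch, reusing the xyz placeholder from the question):
-- In cqlsh: enable tracing, run both forms, and compare the traces.
TRACING ON;
SELECT count(pk1) FROM tableA WHERE pk1 = xyz;
SELECT count(1) FROM tableA WHERE pk1 = xyz;
TRACING OFF;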
In general, COUNT is not recommended in Cassandra at all, as it may have to scan multiple nodes to get your answer back, and I'm not sure the count you get is really consistent.

Spanner SQL Filtering by Less Than on Tuples

Is there any way to achieve <, >, etc. comparisons in the WHERE clause of a Spanner SQL query where the values compared are not scalars but tuples/structs?
For example, say we have a table users (intentionally unrealistic schema)
CREATE TABLE users (
    is_special BOOL NOT NULL,
    registered_on TIMESTAMP NOT NULL,
) PRIMARY KEY (is_special DESC, registered_on DESC)
The natural sort order of the PK index is then is_special DESC, registered_on DESC.
I want to select a range of rows starting with a specific row in that PK index (i.e. from a cursor):
SELECT * FROM users
WHERE (is_special, registered_on) < (#cursor.is_special, #cursor.registered_on)
LIMIT 100
That's not allowed by Spanner SQL because the tuple is treated as a STRUCT type and STRUCT types do not allow the < comparison. Is there any other way to achieve this?
With the Read API, I can query a range by using a KeyRange and providing the PK of the row I want to start the query from, but I'd like to achieve the same in SQL.
Here is how to write the query using individual fields. This relies on the fact that column is_special is not nullable.
SELECT * FROM users
WHERE (is_special < #cursor.is_special)
   OR (is_special = #cursor.is_special AND registered_on < #cursor.registered_on)
LIMIT 100
Just for completeness: if column is_special is nullable, then it gets a little uglier.
SELECT * FROM users
WHERE (is_special < #cursor.is_special)
   OR ((is_special = #cursor.is_special OR (is_special IS NULL AND #cursor.is_special IS NULL))
       AND registered_on < #cursor.registered_on)
LIMIT 100
One additional comment: the query has a LIMIT clause but no ORDER BY clause. This is legal but unusual, and it looks like a bug given that the query is paging results.
I think the query should have the following clause:
ORDER BY is_special, registered_on
The reason is as follows:
If a SQL query does not have an ORDER BY clause, it does not provide any row-ordering guarantee. In practice you will observe ordering in Spanner results even without an ORDER BY clause, but no order is guaranteed and you should not rely on it. However, if a query has an ORDER BY and Spanner uses an index that provides the required order, Spanner will not explicitly sort the data, so you need not worry about the performance or memory impact of including ORDER BY.
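Putting the two suggestions together (and keeping the #cursor parameter notation from the question), the non-nullable version of the paginated query might read:
SELECT * FROM users
WHERE (is_special < #cursor.is_special)
   OR (is_special = #cursor.is_special AND registered_on < #cursor.registered_on)
ORDER BY is_special, registered_on
LIMIT 100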

Cassandra queries performance, ranges

I'm quite new to Cassandra, and I was wondering whether there would be any performance impact if a query is written with "date = '2015-01-01'" versus "date >= '2015-01-01' AND date <= '2015-01-01'".
The only reason I want to use ranges like that is that I need to make multiple queries and I want to have them prepared (as in prepared statements). This way the number of prepared statements is cut in half.
The keys used are ((key1, key2), date) and (key1, date, key2) in the two tables where I want to use this. The query for the first table is similar to:
SELECT * FROM table1
WHERE key1 = val1
AND key2 = val2
AND date >= date1 AND date <= date2
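As a side note on the prepared-statement idea: with ? bind markers this becomes a single prepared statement, and the exact-date case is covered by binding the same value to both date markers. A minimal sketch:
SELECT * FROM table1
WHERE key1 = ? AND key2 = ?
AND date >= ? AND date <= ?;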
For a PRIMARY KEY (key1, date, key2) that type of query just isn't possible. If you do, you'll see an error like this:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column
"key2" cannot be restricted (preceding column "date" is either not
restricted or by a non-EQ relation)"
Cassandra won't allow you to filter by a PRIMARY KEY component if the preceding column(s) are filtered by anything other than the equals operator.
On the other hand, your queries for PRIMARY KEY ((key1, key2), date) will work and perform well. The reason is that Cassandra uses the clustering key(s) (date in this case) to specify the on-disk sort order of data within a partition. Since you are specifying the partition keys (key1 and key2), your result set will be sorted by date, allowing Cassandra to satisfy your query with a continuous read from disk.
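For reference, the two layouts from the question might look something like this (column types and the value column are assumptions for illustration; table2 is just a placeholder name):
-- Range on date works: (key1, key2) is the partition key, date the clustering column.
CREATE TABLE table1 (
    key1 TEXT,
    key2 TEXT,
    date TIMESTAMP,
    value TEXT,
    PRIMARY KEY ((key1, key2), date)
);

-- Rejects the query above: key2 follows date, which is restricted by a range.
CREATE TABLE table2 (
    key1 TEXT,
    date TIMESTAMP,
    key2 TEXT,
    value TEXT,
    PRIMARY KEY (key1, date, key2)
);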
Just to test that out, I'll even run two queries on a table with a similar key, and turn tracing on:
SELECT * FROM log_date2 WHERE userid=1001
AND time > 32671010-f588-11e4-ade7-21b264d4c94d
AND time < a3e1f750-f588-11e4-ade7-21b264d4c94d;
Returns 1 row and completes in 4068 microseconds.
SELECT * FROM log_date2 WHERE userid=1001
AND time=74ad4f70-f588-11e4-ade7-21b264d4c94d;
Returns 1 row and completes in 4001 microseconds.

Ranges (intervals) request in Cassandra DB - CQL

Excuse me if this is a duplicate; I've found a few questions about time ranges here, but my case seems a little bit different and hasn't been discussed yet.
I would like to store quite big chunks (bins) of data (blobs of 2-4 MB; this is "black-box" data, I can't change its layout) and access them with interval keys:
...
primary key ( bin_id int, from_item_id int, to_item_id int )
...
with the ability to select by item ranges, as in this pseudo-code that selects all chunks containing the interval of items [110, 200]:
select chunk from tb1 where bin_id = 100500 and from_item_id >= 110 and to_item_id <= 200;
An attempt to run such a query directly ended with an error:
code=2200 [Invalid query] message="PRIMARY KEY column "to_item_id" cannot be restricted (preceding column "from_item_id" is restricted by a non-EQ relation)"
Currently the only solution I've found is to implement an additional table (tb_map) with a reverse mapping from item_id to bin_id and use it to make the query look something like this:
...
-- in tb_map
primary key (dummy_id, item_id)
...
select bin_id from tb_map where dummy_id = SOME_MAGIK and item_id >= 110 and item_id <= 200;
And then use the bin_id values to retrieve chunks from tb1 with the EQ or IN operator, like here:
select * from tb1 where bin_id in (...);
But I can't use this model due to insert performance issues (the application should avoid many inserts into an additional table and should avoid maintaining additional data structures; it should be "as simple as a nail").
Is there any simple solution that stays within one table (or several simple tables)? I'm stuck with no ideas on how to model such behaviour in C* (maybe slices should be used?); could local C* gurus provide any hints?
I'm using CQL 3.1
From the CQL3 reference:
Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows.
In your case the query doesn't select a contiguous set of rows, so Cassandra refuses to process it.
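To see the difference concretely, a query that restricts only the first clustering column does select a contiguous slice and is accepted. A minimal sketch (the to_item_id check would then have to happen client-side, or via ALLOW FILTERING with all its usual caveats):
-- Contiguous slice within one partition: only the first clustering column is ranged.
SELECT chunk, from_item_id, to_item_id FROM tb1
WHERE bin_id = 100500
AND from_item_id >= 110 AND from_item_id <= 200;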
