Given this sample table schema:
CREATE TABLE foo (
pk1 text,
pk2 text,
pk3 text,
pk4 text,
data map<text, text>,
PRIMARY KEY ((pk1, pk2, pk3), pk4)
);
I wonder if it is possible to have a query that selects different combinations of pk2, pk3, pk4 with a fixed pk1. Something like:
SELECT * FROM foo WHERE pk1 = 'abc' AND
(pk2 = 'x' AND pk3 = 'y' AND pk4 = 'z') OR (pk2 = 'd' AND pk3 = 'e' AND pk4 = 'f');
I can't get this to work. I have a set of (pk2, pk3, pk4) tuples and a fixed pk1, and want to select all matching rows using a single query if possible (Cassandra 2.2.x).
I wonder if it is possible to have a query that selects different combinations of pk2, pk3, pk4 with a fixed pk1.
No, it is not.
As Doan pointed out, OR is not supported (yet). This leaves you with AND, which makes nesting WHERE conditions superfluous.
Adding parentheses around your AND parts is not supported either. The query works without the parentheses, but not with them:
aploetz#cqlsh:stackoverflow> SELECT * FROM row_historical_game_outcome_data
WHERE customer_id=123123 AND (game_id=673 AND game_time = '2015-07-01 05:01:42+0000');
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:89 missing ')' at 'AND' (...=123123 AND (game_id=673 [AND] game_time...)">
That being said, the IN predicate will work on either the last partition key column or the last clustering column. Using it on a partition key column is considered an anti-pattern, but in your case this would work (syntactically):
SELECT * FROM foo WHERE pk1 = 'abc' AND pk2 = 'x' AND pk3 = 'y' AND pk4 IN ('z','f');
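Note that the IN example above only covers tuples that share the same pk2 and pk3. When the combinations differ in all three columns, as in the original question, one option is simply to issue one SELECT per combination (a sketch using the question's values), executed separately or in parallel by the client:
SELECT * FROM foo WHERE pk1 = 'abc' AND pk2 = 'x' AND pk3 = 'y' AND pk4 = 'z';
SELECT * FROM foo WHERE pk1 = 'abc' AND pk2 = 'd' AND pk3 = 'e' AND pk4 = 'f';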
CQL supports the AND predicate on columns that belong to the PRIMARY KEY, or on columns that are indexed by the secondary index system. The OR predicate is not supported; it will be supported as a future enhancement of the new SASI secondary index.
I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
The table is:
create table tableA
(
pk1 uuid,
pk2 int,
pk3 text,
pk4 int,
fc1 int,
fc2 bigint,
....
fcn blob,
primary key (pk1, pk2, pk3, pk4)
);
The query is run often and takes up to 2s to execute.
I am wondering if there will be any performance gain from refactoring it to:
select count(1) from tableA where pk1 = xyz
Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
COUNT(1) counts the constant expression 1, which is never null, so every row is counted.
Also, COUNT(column_name) only counts non-null values of that column.
Since in your case the WHERE condition is on pk1, which can never be null, the "non-null" distinction is a non-factor, and I don't think there will be any difference in performance in using one over the other. This answer tried to confirm the same using some performance tests.
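For example, under that reasoning both of these should return the same count for a given partition (the uuid below is just a placeholder):
select count(pk1) from tableA where pk1 = 123e4567-e89b-12d3-a456-426614174000;
select count(1) from tableA where pk1 = 123e4567-e89b-12d3-a456-426614174000;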
In general, COUNT is not recommended in Cassandra at all, as it is going to scan through multiple nodes to get your answer back, and I'm not sure the count you get is really consistent.
I want to create a table with these columns: id1, id2, type, time, data, version.
The frequent queries are:
select * from table_name where id1 = ... and id2 =... and type = ...
select * from table_name where id1= ... and type = ... and time > ... and time < ...
I don't know how to choose the primary key so that these queries are fast.
As you have two different queries, you will likely need to have two different tables for them to perform well. This is not unusual for Cassandra data models. Keep in mind that for both of these, the PRIMARY KEY definition in Cassandra is largely dependent on the cardinalities and anticipated query patterns. As you have only provided the latter, you may need to make adjustments based on the cardinalities of id1, id2, and type.
select * from table_name where id1 = X and id2 = Y and type = Z;
So here I'm going to make an educated guess that id1 and id2 are nigh unique (high cardinality), as IDs usually are. I don't know how many types are available in your application, but as long as there aren't more than 10,000 this should work:
CREATE TABLE table_name_by_ids (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1,id2),type));
This will key your partitions on a joint hash of id1 and id2, sorting the rows inside by type (default ascending).
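The first query then runs against this table unchanged apart from the table name (values are placeholders):
SELECT * FROM table_name_by_ids WHERE id1 = 'X' AND id2 = 'Y' AND type = 'Z';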
select * from table_name where id1= X and type = Z and time > A and time < B;
Likewise, the table to support this query will look like this:
CREATE TABLE table_name_by_id1_time (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Again, this should work as long as you don't have more than several thousand type/time combinations.
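Against that table (before the bucketing adjustment below), the second query would look like this (values are placeholders):
SELECT * FROM table_name_by_id1_time
WHERE id1 = 'X' AND type = 'Z'
AND time > '2019-10-07 00:00:00' AND time < '2019-10-07 16:22:12';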
One final adjustment I would make, though, concerns how many type/time combinations you expect to have over the life of the application. If this data grows over time, the above will cause the partitions to grow to an unmaintainable point. To keep that from happening, I'd also recommend adding a time "bucket."
version TEXT,
month_bucket TEXT,
PRIMARY KEY ((id1,month_bucket),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Likewise for this, the query will need to be adjusted as well:
select * from table_name_by_id1_time
where id1= 'X' and type = 'Z'
and month_bucket='201910'
and time > '2019-10-07 00:00:00' and time < '2019-10-07 16:22:12';
Hope this helps.
how do I guarantee the atomicity of these two insertions?
Simply put, you can run the two INSERTs together in an atomic batch.
BEGIN BATCH
INSERT INTO table_name_by_ids (
id1, id2, type, time, data, version
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0'
) ;
INSERT INTO table_name_by_id1_time (
id1, id2, type, time, data, version, month_bucket
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0','201910'
);
APPLY BATCH;
For more info, check out the DataStax docs on atomic batches: https://docs.datastax.com/en/dse/6.7/cql/cql/cql_using/useBatchGoodExample.html
I have table in cassandra:
CREATE TABLE pica_pictures (
p int,
g text,
id text,
a int,
PRIMARY KEY ((p), g, id)
);
Then I try select data with query:
cqlsh> select * from picapica_realty.pica_pictures where p = 1 and g in ('1', '2');
Bad Request: Clustering column "g" cannot be restricted by an IN relation
I can't find the cause of this behavior.
This may be a restriction due to your version of Cassandra. As Cedric noted, it works for him in 2.2 (or rather, didn't error-out).
However, as I read your question I recalled a slide from a presentation that I gave at Cassandra Day Chicago 2015. From CQL: This is not the SQL you are looking for, slide #15:
IN
Can only operate on the last partition key and/or the last clustering key.
At the time (April 2015) the most-current version of Cassandra was either 2.1.4 or 2.1.5.
As it stands (with Cassandra 2.1) you'll either need to adjust your primary key definition to PRIMARY KEY ((p), g), or adjust your WHERE clause to something like where p = 1 and g = '1' and id in ('id1', 'id2');
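A sketch of both alternatives, assuming the same columns (note that dropping id from the primary key changes row uniqueness):
-- Alternative 1: make g the last clustering column
CREATE TABLE pica_pictures (
p int,
g text,
id text,
a int,
PRIMARY KEY ((p), g)
);
-- Alternative 2: keep the original key, restrict g by equality and use IN on id
SELECT * FROM pica_pictures WHERE p = 1 AND g = '1' AND id IN ('id1', 'id2');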
This does work with Cassandra 2.2.
cqlsh:ks> CREATE TABLE pica_pictures (
... p int,
... g text,
... id text,
... a int,
... PRIMARY KEY ((p), g, id)
... );
cqlsh:ks> select * from pica_pictures where p = 1 and g in ('1', '2');
p | g | id | a
---+---+----+---
(0 rows)
As your link describes, this works because the preceding columns are defined for equality and none of the queried columns are of a collection type.
I have a column family defined like this:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), start_time, callerph)
);
I want to do queries like:
a) select * from dummy where sr_number='+919xxxx8383'
and start_time >='2014-12-02 08:23:18' limit 10;
b) select * from dummy where sr_number='+919xxxxxx83'
and start_time >='2014-12-02 08:23:18'
and callerph='+9120xxxxxxxx0' limit 10;
The first query works fine, but the second query gives an error like:
Bad Request: PRIMARY KEY column "callerph" cannot be restricted
(preceding column "start_time" is either not restricted or by a non-EQ
relation)
If the first query returns results, in the second query I am just adding one more clustering key to filter them further, so there should be fewer rows.
Just like you cannot skip PRIMARY KEY components, you may only use a non-equals operator on the last component that you query (which is why your 1st query works).
If you do need to serve both of the queries you have listed above, then you will need to have separate query tables for each. To serve the second query, a query table (with the same columns) will work if you define it with a PRIMARY KEY like this:
PRIMARY KEY((sr_number), callerph, start_time)
That way you are still specifying the parts of your PRIMARY KEY in order, and your non-equals condition is on the last PRIMARY KEY component.
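For example, against such a query table (the table name below is hypothetical), the second query becomes valid:
SELECT * FROM sr_number_callrecord_by_caller
WHERE sr_number = '+919xxxxxx83'
AND callerph = '+9120xxxxxxxx0'
AND start_time >= '2014-12-02 08:23:18'
LIMIT 10;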
There are certain restrictions on how the primary key columns can be used in the WHERE clause: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html
One solution that will work in your situation is to change the order of the clustering columns in the primary key:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), callerph, start_time)
);
Now you can use a range query on the last clustering column:
select * from sr_number_callrecord where sr_number = '1234' and callerph = '+91123' and start_time >= '1234';
I have a cassandra table defined like this:
CREATE TABLE test.test(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (id, time, tag)
);
And I want to select rows by filtering on a single column.
I tried:
select * from test where lonumb = 4231;
It gives:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
Also I cannot do
select * from test where mstatus = true;
Doesn't Cassandra support WHERE as part of CQL? How can I correct this?
You can only use WHERE on indexed or primary key columns. To correct your issue, you will need to create an index.
CREATE INDEX iname
ON keyspacename.tablename (columnname);
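For instance, for the table in the question (the index name is arbitrary):
CREATE INDEX lonumb_idx ON test.test (lonumb);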
You can see more info here.
But you have to keep in mind that this query will have to run against all nodes in the cluster.
Alternatively, you might rethink your table structure if lonumb is something you'll query on most often.
Jny is correct in that WHERE is only valid on columns in the PRIMARY KEY, or those for which a secondary index has been created. One way to solve this issue is to create a specific query table for lonumb queries.
CREATE TABLE test.testbylonumb(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (lonumb, time, id)
);
Now, this query will work:
select * from testbylonumb where lonumb = 4231;
It will return all CQL rows where lonumb = 4231, sorted by time. I put id on the PRIMARY KEY to ensure uniqueness.
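Populating this query table means writing each row to it as well as to the base table; a hypothetical insert (values invented) might look like:
INSERT INTO test.testbylonumb (lonumb, time, id, tag, mstatus)
VALUES (4231, 1449865200000, 'row-001', 'some-tag', true);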
select * from test where mstatus = true;
This one is trickier. Indexes and keys on low-cardinality columns (like booleans) are generally considered a bad idea. See if there's another way you could model that. Otherwise, you could experiment with a secondary index on mstatus, but only use it when you specify a partition key (lonumb in this case), like this:
select * from testbylonumb where lonumb = 4231 AND mstatus = true;
Maybe that wouldn't perform too badly, as you are restricting it to a specific partition. But I definitely wouldn't ever do a SELECT * on mstatus.
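A sketch of that experiment, with an arbitrary index name (only query it together with the lonumb restriction shown above):
CREATE INDEX mstatus_idx ON test.testbylonumb (mstatus);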