Clustering key restriction for "IN" condition in Cassandra

I have a table in Cassandra:
CREATE TABLE pica_pictures (
p int,
g text,
id text,
a int,
PRIMARY KEY ((p), g, id)
)
Then I try to select data with this query:
cqlsh> select * from picapica_realty.pica_pictures where p = 1 and g in ('1', '2');
Bad Request: Clustering column "g" cannot be restricted by an IN relation
I can't find the cause of this behavior.

This may be a restriction due to your version of Cassandra. As Cedric noted, it works for him in 2.2 (or rather, didn't error out).
However, as I read your question I recalled a slide from a presentation that I gave at Cassandra Day Chicago 2015. From CQL: This is not the SQL you are looking for, slide #15:
IN
Can only operate on the last partition key and/or the last clustering key.
At the time (April 2015) the most-current version of Cassandra was either 2.1.4 or 2.1.5.
As it stands (with Cassandra 2.1) you'll either need to adjust your primary key definition to PRIMARY KEY ((p), g), or adjust your WHERE clause to something like where p = 1 and g = '1' and id in ('id1', 'id2'); (note that g is text, so its value must be quoted).
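To make that concrete, here is a minimal sketch of both options (the alternate table name is hypothetical; everything else is taken from the question):

-- Option 1: make g the last clustering column, so IN is allowed on it
CREATE TABLE pica_pictures_by_group (
p int,
g text,
a int,
PRIMARY KEY ((p), g)
);
SELECT * FROM pica_pictures_by_group WHERE p = 1 AND g IN ('1', '2');

-- Option 2: keep the original table, pin g with an equality,
-- and move the IN to the last clustering column (id)
SELECT * FROM pica_pictures WHERE p = 1 AND g = '1' AND id IN ('id1', 'id2');

Note that option 1 drops id from the key, so it only makes sense if you no longer need one row per id within a group.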

This does work with Cassandra 2.2.
cqlsh:ks> CREATE TABLE pica_pictures (
... p int,
... g text,
... id text,
... a int,
... PRIMARY KEY ((p), g, id)
... );
cqlsh:ks> select * from pica_pictures where p = 1 and g in ('1', '2');
p | g | id | a
---+---+----+---
(0 rows)
As your link describes, this works because the preceding columns are defined for equality and none of the queried columns are of a collection type.

Related

Set primary key for range query in Cassandra

I want to create a table with these columns: id1, id2, type, time, data, version.
The frequent queries are:
select * from table_name where id1 = ... and id2 =... and type = ...
select * from table_name where id1 = ... and type = ... and time > ... and time < ...
How should I set up the primary key so that these queries are fast?
As you have two different queries, you will likely need to have two different tables for them to perform well. This is not unusual for Cassandra data models. Keep in mind that for both of these, the PRIMARY KEY definition in Cassandra is largely dependent on the cardinalities and anticipated query patterns. As you have only provided the latter, you may need to make adjustments based on the cardinalities of id1, id2, and type.
select * from table_name where id1 = X and id2 = Y and type = Z;
So here I'm going to make an educated guess that id1 and id2 are nigh unique (high cardinality), as IDs usually are. I don't know how many types are available in your application, but as long as there aren't more than 10,000 this should work:
CREATE TABLE table_name_by_ids (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1,id2),type));
This will key your partitions on a joint hash of id1 and id2, sorting the rows inside by type (default ascending).
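For instance, a query with hypothetical values would hit exactly one partition and read the matching row by its clustering key:

SELECT * FROM table_name_by_ids
WHERE id1 = 'id1-value' AND id2 = 'id2-value' AND type = 'Z';

Both partition key components (id1 and id2) must be specified; type can then be given as an exact match, or omitted to read the whole partition.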
select * from table_name where id1 = X and type = Z and time > A and time < B;
Likewise, the table to support this query will look like this:
CREATE TABLE table_name_by_id1_time (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Again, this should work as long as you don't have more than several thousand type/time combinations.
One final adjustment I would make, though, concerns judging just how many type/time combinations you expect to have over the life of the application. If this data will grow over time, then the above will cause the partitions to grow to an unmaintainable point. To keep that from happening, I'd also recommend adding a time "bucket."
version TEXT,
month_bucket TEXT,
PRIMARY KEY ((id1,month_bucket),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Likewise for this, the query will need to be adjusted as well:
select * from table_name_by_id1_time
where id1= 'X' and type = 'Z'
and month_bucket='201910'
and time > '2019-10-07 00:00:00' and time < '2019-10-07 16:22:12';
Hope this helps.
How do I guarantee the atomicity of these two insertions?
Simply put, you can run the two INSERTs together in an atomic batch.
BEGIN BATCH
INSERT INTO table_name_by_ids (
id1, id2, type, time, data, version
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0'
);
INSERT INTO table_name_by_id1_time (
id1, id2, type, time, data, version, month_bucket
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0','201910'
);
APPLY BATCH;
For more info, check out the DataStax docs on atomic batches: https://docs.datastax.com/en/dse/6.7/cql/cql/cql_using/useBatchGoodExample.html

Cassandra schema - select by frequently updated column

Given the table:
CREATE TABLE T (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a)
);
I'm frequently updating records. With each update last_modification_time is set to now() and also other fields are set.
What is the right Cassandra approach to be able to query by a last_modification_time range? I need to query like this:
select * from .. where a=Z and last_modification_time < X and last_modification_time > Y;
One way would be to create a materialized view with PRIMARY KEY (a, last_modification_time), but I want to avoid this since materialized views are buggy in Cassandra 3.x versions.
What would be an alternative way of querying by a last_modification_time range, given that last_modification_time is frequently updated?
How about having two tables? One could hold the current snapshot (where you keep updating the last_modification_time field), and another could hold the changes over time (something like a history table). You could write to both of them using BATCH statements.
CREATE TABLE t_modifications (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a, last_modification_time)
) WITH CLUSTERING ORDER BY (last_modification_time DESC);
BEGIN BATCH
UPDATE T SET last_modification_time = 123, b = 4 WHERE a = 2;
INSERT INTO t_modifications (a, last_modification_time, b) values (2, 123, 4);
APPLY BATCH;
If you're interested in the latest snapshot within a given modification range, you can select from the t_modifications table with a LIMIT:
SELECT * FROM t_modifications WHERE a = 2 AND last_modification_time < 136 LIMIT 1;
In general, to do range queries like this, the field you want to range on has to be part of the composite key, has to be the right-most element of the composite key, and all other elements in the composite key have to be specified. In your case, you would modify your PRIMARY KEY to (a, last_modification_time). You can then
SELECT * from t_modifications
WHERE a = aval
AND last_modification_time > beg
AND last_modification_time < end;
This will get you all records for aval between beg and end.

Cassandra Timestamp Default Now

Is it possible to have a timestamp column in a table with the default value of now, such that only the other fields are introduced?
No, there is no way to set a default value in Cassandra.
You must provide the value yourself, or use a function such as toTimestamp(now()) when inserting.
Example:
CREATE TABLE sample_times (
a int,
b timestamp,
c timeuuid,
d bigint,
PRIMARY KEY (a,b,c,d)
);
Example insert:
INSERT INTO sample_times (a,b,c,d) VALUES (1, toTimestamp(now()), 50554d6e-29bb-11e5-b345-feff819cdc9f, 10);
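If you also want the timeuuid and bigint columns filled server-side, a variant of the same insert can use now() for the timeuuid column and toUnixTimestamp(now()) for the bigint column; toTimestamp(now()) yields a timestamp, while toUnixTimestamp(now()) yields epoch milliseconds as a bigint:

INSERT INTO sample_times (a,b,c,d)
VALUES (1, toTimestamp(now()), now(), toUnixTimestamp(now()));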

Performance consideration of Cassandra query using composite partition key vs clustering column

I have the following DDL:
CREATE TABLE mykeyspace.mytable (
a text,
b text,
c text,
d text,
e text,
starttime timestamp,
endtime timestamp,
PRIMARY KEY ((a, b, c), d, e, starttime, endtime)
) WITH CLUSTERING ORDER BY (d ASC, e ASC, starttime ASC, endtime ASC)
and I only have the following SELECT/DELETE query:
SELECT */DELETE FROM mytable WHERE a = ? AND b = ? AND c = ? AND d = ?;
I just wonder if the column d can be included as part of the composite partition key, so that a row lookup is enough instead of a row lookup plus a clustering column lookup. Would this improve performance as well?
Including column d in the composite partition key will absolutely improve performance:
Your data will distribute well among the cluster.
Your SELECT query will be faster, as no clustering-level filtering is required.
Your DELETE query will mark the whole partition as markedForDeleteAt, instead of inserting a range tombstone.
I feel that the more columns I have in the PARTITION KEY the better.
So my suggestion is to include as many columns as possible in the PARTITION KEY. It will improve SELECT query performance in general, and will avoid some tombstone problems as well (because you will delete at the partition level, unless you recreate the partitions, of course).
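As a sketch of what that change would look like (the _v2 table name is hypothetical; the rest mirrors the DDL from the question):

CREATE TABLE mykeyspace.mytable_v2 (
a text,
b text,
c text,
d text,
e text,
starttime timestamp,
endtime timestamp,
PRIMARY KEY ((a, b, c, d), e, starttime, endtime)
) WITH CLUSTERING ORDER BY (e ASC, starttime ASC, endtime ASC);

-- Both statements now address exactly one partition:
SELECT * FROM mykeyspace.mytable_v2 WHERE a = ? AND b = ? AND c = ? AND d = ?;
DELETE FROM mykeyspace.mytable_v2 WHERE a = ? AND b = ? AND c = ? AND d = ?;

The trade-off is that every query must now supply all four columns; you can no longer query by (a, b, c) alone.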

Choosing the right schema for cassandra "table" in CQL3

We are trying to store lots of attributes for a particular profile_id inside a table (using CQL3) and cannot wrap our heads around which approach is the best:
a. create table mytable (profile_id, a1 int, a2 int, a3 int, a4 int ... a3000 int) primary key (profile_id);
OR
b. create MANY tables, eg.
create table mytable_a1(profile_id, value int) primary key (profile_id);
create table mytable_a2(profile_id, value int) primary key (profile_id);
...
create table mytable_a3000(profile_id, value int) primary key (profile_id);
OR
c. create table mytable (profile_id, a_all text) primary key (profile_id);
and just store 3000 "columns" inside a_all, like:
insert into mytable (profile_id, a_all) values (1, 'a1:1,a2:5,a3:55, .... a3000:5');
OR
d. none of the above
The type of query we would be running on this table:
select * from mytable where profile_id in (1,2,3,4,5423,44)
We tried the first approach, and the queries kept timing out and sometimes even killed Cassandra nodes.
The answer would be to use a clustering column. A clustering column allows you to create dynamic columns that you can use to hold the attribute name (column name) and its value (column value).
The table would be
create table mytable (
profile_id text,
attr_name text,
attr_value int,
PRIMARY KEY(profile_id, attr_name)
);
This allows you to add inserts like
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a1', 3);
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a2', 1031);
.....
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'an', 2);
This would be the optimal solution.
Because you then want to do the following
'The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)'
This would require 6 queries under the hood, but Cassandra should be able to handle this in no time, especially if you have a multi-node cluster.
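One detail worth noting: profile_id is text in this table, so the values in the IN list need to be quoted. Against the schema above, the query would look like:

select * from mytable where profile_id in ('1', '2', '3', '4', '5423', '44');

Each value in the list maps to one partition read, which is why the coordinator fans this out into 6 queries.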
Also, if you use the DataStax Java Driver, you can run these requests asynchronously and concurrently on your cluster.
For more on data modelling and the DataStax Java Driver, check out DataStax's free online training. It's worth a look:
http://www.datastax.com/what-we-offer/products-services/training/virtual-training
Hope it helps.
