Cassandra schema - select by frequently updated column

Given table:
CREATE TABLE T (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a)
);
I'm frequently updating records. With each update last_modification_time is set to now() and also other fields are set.
What is the right cassandra approach to be able to query by last_modification_time range? I need to query like this:
select * from .. where a=Z and last_modification_time < X and last_modification_time > Y;
One way would be to create a materialized view with PRIMARY KEY (a, last_modification_time), but I want to avoid this since materialized views are buggy in Cassandra 3.x versions.
What would be an alternative way of querying by last_modification_time range, given that last_modification_time is frequently updated?

How about having two tables? One holds the current snapshot, where you keep updating the last_modification_time field, and the other holds the changes over time (something like a history table). You could write to both of them using BATCH statements.
CREATE TABLE t_modifications (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a, last_modification_time)
) WITH CLUSTERING ORDER BY (last_modification_time DESC);
BEGIN BATCH
UPDATE T SET last_modification_time = 123, b = 4 WHERE a = 2;
INSERT INTO t_modifications (a, last_modification_time, b) values (2, 123, 4);
APPLY BATCH;
If you're interested in the latest snapshot within a given modification range, you can select from the t_modifications table and limit the result:
SELECT * FROM t_modifications WHERE a = 2 AND last_modification_time < 136 LIMIT 1;
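The dual-write idea above can be modelled in plain Python to see why the history table is what answers the range query. This is only an in-memory sketch, not driver code: the `snapshot` dict stands in for T, the `history` dict of sorted lists stands in for t_modifications, and the function names `write` and `modified_between` are made up for illustration.

```python
import bisect
from collections import defaultdict

# Stand-in for table T: one current row per partition key `a`.
snapshot = {}
# Stand-in for t_modifications: per `a`, rows kept sorted by modification time.
history = defaultdict(list)


def write(a, modified_at, b):
    """Mimic the BATCH: overwrite the snapshot row and append a history row."""
    snapshot[a] = (modified_at, b)
    bisect.insort(history[a], (modified_at, b))


def modified_between(a, after, before):
    """History rows for `a` with after < modified_at < before (exclusive bounds)."""
    rows = history[a]
    lo = bisect.bisect_right(rows, (after, float("inf")))
    hi = bisect.bisect_left(rows, (before, float("-inf")))
    return rows[lo:hi]


write(2, 100, 1)
write(2, 123, 4)
write(2, 150, 9)
print(modified_between(2, 110, 136))  # history rows inside the range
print(snapshot[2])                    # only the latest state survives here
```

The snapshot keeps a single row per key, so a range over past modification times can only be served from the history side.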

In general, to do range queries like this, the field you want to range on has to be part of the composite key, has to be the right-most element of the composite key that you restrict, and all other elements in the composite key have to be specified. In your case, you would modify your PRIMARY KEY to (a, last_modification_time). You can then run:
SELECT * from t_modifications
WHERE a = aval
AND last_modification_time > beg
AND last_modification_time < end;
This will get you all records for aval between beg and end.

Related

Storing time specific data in cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
Logic of retrieving current value is like following.
Find all rows with start_time<=current_time.
Then find the value with maximum start_time from the rows obtained in the first step.
PS:- Edited the question to make it more clear
The exact requirements are not possible, but we can get close with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering column of your table.
Then, you need a different partition key. You could choose a fixed value, but that causes problems once the partition holds too many rows. It is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The catch is that when you query the table, you need to know the value of the partition key:
Find all rows with start_time <= current_time:
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
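Select the value with the maximum start_time among those rows (the clustering order is DESC, so the first row returned is the newest):
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time LIMIT 1;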
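One consequence of bucketing by year is that the newest row with start_time <= current_time may live in an earlier year's partition, so the client may have to try more than one bucket. The following is a minimal in-memory sketch of that lookup, not driver code; the `partitions` dict, the `insert` and `current_value` functions, and the `oldest_year` parameter are assumptions made up for this illustration.

```python
import bisect
from datetime import datetime

# Stand-in for time_specific_table: year -> list of (start_time, value), ascending.
partitions = {}


def insert(start_time, value):
    """Store the row in the bucket for its year, keeping the bucket sorted."""
    bisect.insort(partitions.setdefault(start_time.year, []), (start_time, value))


def current_value(now, oldest_year):
    """Value with the largest start_time <= now, trying year buckets newest first."""
    for year in range(now.year, oldest_year - 1, -1):
        # Walk the bucket from newest to oldest, like CLUSTERING ORDER BY DESC.
        for start_time, value in reversed(partitions.get(year, [])):
            if start_time <= now:
                return value
    return None  # nothing has started yet


insert(datetime(2015, 11, 1), "a")
insert(datetime(2016, 3, 1), "b")
insert(datetime(2016, 9, 1), "c")
print(current_value(datetime(2016, 5, 1), 2015))   # found in the 2016 bucket
print(current_value(datetime(2016, 1, 15), 2015))  # falls back to the 2015 bucket
```

With a real cluster, each pass of the outer loop would correspond to one query against a single partition.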
Alternatively, create two separate tables like the ones below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both tables. When inserting into the second table, use a static value such as 1 for the partition column:
INSERT INTO current_value(partition, value) VALUES(1, 10);
Because every write to the current_value table targets the same key, each insert acts as an upsert, and a select always returns the latest value.

Cassandra slow SELECT MAX(x) query

I have a dev machine with Cassandra 3.9 and 2 tables; one has ~400,000 records, the other about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In Cassandra, running aggregation functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifying a partition key is bad practice, because the query has to scan every partition. Instead, you can keep another table that stores the max value of the x field for each of the two tables.
However, you have to add some client-side logic to maintain this max value in the other table when you run INSERT or UPDATE statements.
Table structures:
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
With this structure you can read the max value for a table with a single-partition query:
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
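The client-side logic mentioned above could look roughly like the following. This is an in-memory sketch only: the dicts stand in for t1, t2 and agg_table, and the `upsert` function is a hypothetical helper, not part of any driver. It also assumes a single writer, because the read-then-write on the max is not atomic, and it does not handle deletes or updates that lower the current maximum.

```python
# Stand-ins for the data tables (pk -> x) and for agg_table (table_name -> max_value).
tables = {"t1": {}, "t2": {}}
agg_table = {}


def upsert(table_name, pk, x):
    """Write the row, then raise the stored max if the new x exceeds it."""
    tables[table_name][pk] = x
    current_max = agg_table.get(table_name)
    if current_max is None or x > current_max:
        agg_table[table_name] = x


upsert("t1", "row-a", 5)
upsert("t1", "row-b", 12)
upsert("t1", "row-c", 7)
upsert("t2", "row-a", 3)
print(agg_table["t1"])  # largest x written to t1 so far
print(agg_table["t2"])  # largest x written to t2 so far
```

Reading the max then becomes a lookup by key rather than a scan over every row.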
Hope this can help you.

Clustering key restriction for "IN" condition in cassandra

I have table in cassandra:
CREATE TABLE pica_pictures (
p int,
g text,
id text,
a int,
PRIMARY KEY ((p), g, id)
)
Then I try to select data with this query:
cqlsh> select * from picapica_realty.pica_pictures where p = 1 and g in ('1', '2');
Bad Request: Clustering column "g" cannot be restricted by an IN relation
I can't find the cause of this behavior.
This may be a restriction due to your version of Cassandra. As Cedric noted, it works for him in 2.2 (or rather, didn't error-out).
However, as I read your question I recalled a slide from a presentation that I gave at Cassandra Day Chicago 2015. From CQL: This is not the SQL you are looking for, slide #15:
IN
Can only operate on the last partition key and/or the last clustering key.
At the time (April 2015) the most-current version of Cassandra was either 2.1.4 or 2.1.5.
As it stands (with Cassandra 2.1) you'll either need to adjust your primary key definition to PRIMARY KEY ((p), g), or adjust your WHERE clause to something like where p = 1 and g = '1' and id in ('id1', 'id2');
This does work with Cassandra 2.2.
cqlsh:ks> CREATE TABLE pica_pictures (
... p int,
... g text,
... id text,
... a int,
... PRIMARY KEY ((p), g, id)
... );
cqlsh:ks> select * from pica_pictures where p = 1 and g in ('1', '2');
p | g | id | a
---+---+----+---
(0 rows)
As your link describes, this works because the preceding columns are restricted by equality and none of the queried columns are of a collection type.

how to do the query in cassandra If i have two cluster key in column family

I have a column family and syntax like this:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), start_time, callerph)
);
I want to do the query like :
a) select * from dummy where sr_number='+919xxxx8383'
and start_time >='2014-12-02 08:23:18' limit 10;
b) select * from dummy where sr_number='+919xxxxxx83'
and start_time >='2014-12-02 08:23:18'
and callerph='+9120xxxxxxxx0' limit 10;
First query works fine but second query is giving error like
Bad Request: PRIMARY KEY column "callerph" cannot be restricted
(preceding column "start_time" is either not restricted or by a non-EQ
relation)
If the first query returns results, the second query only adds one more clustering key as a filter, so I would expect it to return fewer rows.
Just like you cannot skip PRIMARY KEY components, you may only use a non-equals operator on the last component that you query (which is why your 1st query works).
If you do need to serve both of the queries you have listed above, then you will need to have separate query tables for each. To serve the second query, a query table (with the same columns) will work if you define it with a PRIMARY KEY like this:
PRIMARY KEY((sr_number), callerph, start_time)
That way you are still specifying the parts of your PRIMARY KEY in order, and your non-equals condition is on the last PRIMARY KEY component.
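The rule described above can be written down as a small checker. This is a simplified model of the restriction as stated in this answer, not Cassandra's actual validation logic: it assumes the partition key is already restricted by equality, and it ignores IN, ALLOW FILTERING, secondary indexes and any relaxations in newer versions. The function name `check_where` and the "eq"/"range" labels are made up for illustration.

```python
def check_where(clustering_columns, restrictions):
    """Return None if the clustering restrictions are accepted, else an error string.

    clustering_columns: clustering column names in PRIMARY KEY order.
    restrictions: dict mapping a column name to "eq" or "range".
    """
    previous = None        # the clustering column examined just before this one
    previous_kind = "eq"   # how it was restricted; None means not restricted
    for column in clustering_columns:
        kind = restrictions.get(column)
        if kind is not None and previous_kind != "eq":
            return (f'"{column}" cannot be restricted (preceding column '
                    f'"{previous}" is either not restricted or by a non-EQ relation)')
        previous, previous_kind = column, kind
    return None


original_key = ["start_time", "callerph"]
reordered_key = ["callerph", "start_time"]

# Query a) only ranges on start_time.
print(check_where(original_key, {"start_time": "range"}))
# Query b) restricts callerph after a range on start_time.
print(check_where(original_key, {"start_time": "range", "callerph": "eq"}))
# Query b) against the reordered key.
print(check_where(reordered_key, {"callerph": "eq", "start_time": "range"}))
```

Under this model, query b) is rejected against the original key and accepted once callerph comes before start_time.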
There are certain restrictions on how the primary key columns may be used in the WHERE clause: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html
One solution that will work in your situation is to change the order of the clustering columns in the primary key:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), callerph, start_time)
);
Now you can use a range query on the last column:
select * from sr_number_callrecord where sr_number = '1234' and callerph = '+91123' and start_time >= '1234';

cassandra error when using select and where in cql

I have a cassandra table defined like this:
CREATE TABLE test.test(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (id, time, tag)
)
And I want to select one column using select.
I tried:
select * from test where lonumb = 4231;
It gives:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
Also I cannot do
select * from test where mstatus = true;
Doesn't cassandra support where as a part of CQL? How to correct this?
You can only use WHERE on the indexed or primary key columns. To correct your issue you will need to create an index.
CREATE INDEX iname
ON keyspacename.tablename(columname)
You can see more info here.
But you have to keep in mind that this query will have to run against all nodes in the cluster.
Alternatively you might rethink your table structure if the lonumb is something you'll do the most queries on.
Jny is correct in that WHERE is only valid on columns in the PRIMARY KEY, or on columns for which a secondary index has been created. One way to solve this issue is to create a specific query table for lonumb queries.
CREATE TABLE test.testbylonumb(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (lonumb, time, id)
)
Now, this query will work:
select * from testbylonumb where lonumb = 4231;
It will return all CQL rows where lonumb = 4231, sorted by time. I put id on the PRIMARY KEY to ensure uniqueness.
select * from test where mstatus = true;
This one is trickier. Indexes and keys on low-cardinality columns (like booleans) are generally considered a bad idea. See if there's another way you could model that. Otherwise, you could experiment with a secondary index on mstatus, but only use it when you specify a partition key (lonumb in this case), like this:
select * from testbylonumb where lonumb = 4231 AND mstatus = true;
Maybe that wouldn't perform too badly, as you are restricting it to a specific partition. But I definitely wouldn't ever do a SELECT * on mstatus.
