Range query - Data modeling for time series in CQL Cassandra

I have a table like this:
CREATE TABLE test ( partitionkey text, rowkey text, date timestamp,
policyid text, policyname text, primary key (partitionkey, rowkey));
with some data:
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
p2 | r3 | pl3 | plicy3 | 2008-01-03 00:00:00+0000
I want to be able to find:
1/ data from a particular partition key
2/ data from a particular partition key & rowkey
3/ Range query on date given a partitionkey
1/ and 2/ are trivial:
select * from test where partitionkey='p1';
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
but what about 3/?
Even with an index it doesn't work:
create index i1 on test (date);
select * from test where partitionkey='p1' and date = '2007-01-02';
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
but
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
Any idea?
thanks,
Matt

CREATE TABLE test ( partitionkey text, rowkey text, date timestamp,
policyid text, policyname text, primary key (partitionkey, rowkey));
First of all, you really should use more descriptive column names instead of partitionkey and rowkey (and even date, for that matter). By looking at those column names, I really can't tell what kind of data this table is supposed to be indexed by.
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
As for this issue, try making your "date" column a part of your primary key.
primary key (partitionkey, rowkey, date)
Once you do that, I think your date range queries will function appropriately.
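One detail worth checking when trying this: a range restriction on a clustering column normally requires every clustering column before it to be restricted by equality, so the position of date in the key matters. The following is a minimal sketch, not the answerer's exact schema; the table name test_by_date is hypothetical, and date is placed before rowkey so that it can be range-queried with only the partition key given.

```cql
-- Hypothetical table name; same columns as the question's table.
-- date is the first clustering column, rowkey keeps rows unique.
CREATE TABLE test_by_date (
    partitionkey text,
    date timestamp,
    rowkey text,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, date, rowkey)
);

-- Range query on date within a single partition; no secondary index needed.
SELECT * FROM test_by_date
WHERE partitionkey = 'p1' AND date > '2007-01-02';
```

With this layout, query 2/ from the question (partition key plus rowkey) would need date as well, so the ordering is a trade-off between the two access patterns.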
For more information on this, check out DataStax Academy's (free) course called Java Development With Apache Cassandra. Session 5, Module 104 discusses how to model time series data and that should help you out.

Related

How do I model a CQL table such that it can be queried by zip_code, or by zip_code and hash?

Hi all, I have a Cassandra table with Hash as the primary key and another column containing a List. I want to add another column named Zipcode so that I can query Cassandra by either zipcode alone or by zipcode and hash:
Hash | List | zipcode
select * from table where zip_code = '12345';
select * from table where zip_code = '12345' AND hash='abcd';
Is there any way that I could do this?
The recommendation in Cassandra is to design your tables around your access patterns. In your case you want to get results by zipcode, and by zipcode and hash, so ideally you would have two tables like this:
CREATE TABLE keyspace.table1 (
zipcode text,
field1 text,
field2 text,
PRIMARY KEY (zipcode));
and
CREATE TABLE keyspace.table2 (
hashcode text,
zipcode text,
field1 text,
field2 text,
PRIMARY KEY ((hashcode,zipcode)));
You may then need to redesign your tables based on your data. I recommend understanding data model design in Cassandra before proceeding further.
The ALLOW FILTERING construct can be used, but whether it is appropriate depends on how large your data is. If you have a very large data set, avoid it: it can require a complete scan of the data, which is expensive in terms of resources and time.
It is possible to design a single table that will satisfy both app queries.
In this example schema, the table is partitioned by zip code with hash as the clustering key:
CREATE TABLE table_by_zipcode (
zipcode int,
hash text,
...
PRIMARY KEY(zipcode, hash)
)
With this design, each zip code can have one or more rows of hash. Here's the table with some test data in it:
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
123 | def | 2 | bob
123 | ghi | 3 | charli
456 | tuv | 5 | banana
456 | xyz | 4 | apple
The table contains two partitions: zipcode = 123 and zipcode = 456. The first zip code has three rows (abc, def, ghi) and the second has two rows (tuv, xyz).
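To reproduce this test data, something like the following should work. It is only a sketch: the schema above elides its non-key columns, so the names and types of intcol and textcol are assumptions inferred from the sample output.

```cql
-- intcol and textcol are assumed from the sample output; the original
-- schema does not spell them out.
CREATE TABLE table_by_zipcode (
    zipcode int,
    hash text,
    intcol int,
    textcol text,
    PRIMARY KEY (zipcode, hash)
);

-- Three rows in partition 123, two rows in partition 456.
INSERT INTO table_by_zipcode (zipcode, hash, intcol, textcol) VALUES (123, 'abc', 1, 'alice');
INSERT INTO table_by_zipcode (zipcode, hash, intcol, textcol) VALUES (123, 'def', 2, 'bob');
INSERT INTO table_by_zipcode (zipcode, hash, intcol, textcol) VALUES (123, 'ghi', 3, 'charli');
INSERT INTO table_by_zipcode (zipcode, hash, intcol, textcol) VALUES (456, 'tuv', 5, 'banana');
INSERT INTO table_by_zipcode (zipcode, hash, intcol, textcol) VALUES (456, 'xyz', 4, 'apple');
```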
You can query the table using just the partition key (zipcode), for example:
cqlsh> SELECT * FROM table_by_zipcode WHERE zipcode = 123;
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
123 | def | 2 | bob
123 | ghi | 3 | charli
It is also possible to query the table with the partition key zipcode and clustering key hash, for example:
cqlsh> SELECT * FROM table_by_zipcode WHERE zipcode = 123 AND hash = 'abc';
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
Cheers!

Last record each group in cassandra

I have a table with this schema:
create table last_message_by_group
(
date date,
created_at timestamp,
message text,
group_id bigint,
primary key (date, created_at, message_id)
)
with clustering order by (created_at desc)
and data should be:
| date | created_at | message | group_id |
| 2021-05-11 | 7:23:54 | ddd | 1 |
| 2021-05-11 | 6:21:43 | ccc | 1 |
| 2021-05-11 | 5:35:16 | bbb | 2 |
| 2021-05-11 | 4:38:23 | aaa | 2 |
It shows messages ordered by created_at descending, partitioned by date.
The problem is that it cannot return only the last message of each group, like this:
| date | created_at | message | group_id |
| 2021-05-11 | 7:23:54 | ddd | 1 |
| 2021-05-11 | 5:35:16 | bbb | 2 |
created_at is a clustering key, so it can't be updated. Instead I delete the old row and insert a new one for every new message in a group_id, which performs poorly.
Is there a better way to do this?
I was able to get this to work by making one change to your primary key definition. I added group_id as the first clustering key:
PRIMARY KEY (date, group_id, created_at, message_id)
After inserting the same data, this works:
> SELECT date, group_id, max(created_at), message
FROM last_message_by_group
WHERE date='2021-05-11'
GROUP BY date,group_id;
date | group_id | system.max(created_at) | message
------------+----------+---------------------------------+---------
2021-05-11 | 1 | 2021-05-11 12:23:54.000000+0000 | ddd
2021-05-11 | 2 | 2021-05-11 10:35:16.000000+0000 | bbb
(2 rows)
There's more detail on using CQL's GROUP BY clause in the official docs.
There is one problem: because you changed the clustering key, messages will be ordered by group_id first. Any idea how to still order by created_at and get one message per group?
From the document linked above:
the GROUP BY option only accept as arguments primary key column names in the primary key order.
Unfortunately, if we were to adjust the primary key definition to put created_at before group_id, we would also have to group by created_at. That would create a "group" for each unique created_at, which negates the idea behind group_id.
In this case, you may have to decide between having the grouped results in a particular order vs. having them grouped at all. It might also be possible to group the results, but then re-order them appropriately on the application side.
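If re-ordering on the application side is acceptable, another option that is not part of the answer above is a separate table holding one row per group, so that each new message simply overwrites the previous one. This is a sketch under assumptions: the table name latest_message_by_group is hypothetical, and it relies on Cassandra's last-write-wins behaviour, so a message written late could replace a newer one.

```cql
-- Hypothetical companion table: exactly one row per (date, group_id).
CREATE TABLE latest_message_by_group (
    date date,
    group_id bigint,
    created_at timestamp,
    message text,
    PRIMARY KEY (date, group_id)
);

-- Writing a newer message for the same group overwrites the previous row,
-- so no explicit delete is needed.
INSERT INTO latest_message_by_group (date, group_id, created_at, message)
VALUES ('2021-05-11', 1, '2021-05-11 07:23:54+0000', 'ddd');

-- Returns one row per group for the day, ordered by group_id;
-- sorting by created_at is left to the application.
SELECT date, group_id, created_at, message
FROM latest_message_by_group
WHERE date = '2021-05-11';
```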

Understanding Cassandra static field

I am learning Cassandra through its documentation. Now I'm learning about batches and static fields.
In their example at the end of the page, they somehow managed to make balance have two different values (-200, -208) even though it's a static field.
Could someone explain to me how this is possible? I've read the whole page but I did not catch on.
In Cassandra, a static field is static within a partition: all rows that share a partition key share its value.
Example: let's define a table
CREATE TABLE static_test (
pk int,
ck int,
d int,
s int static,
PRIMARY KEY (pk, ck)
);
Here pk is the partition key and ck is the clustering key.
Let's insert some data :
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 1, 10, 100, 1000);
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 2, 20, 200, 2000);
If we select the data
pk | ck | s | d
----+----+------+-----
1 | 10 | 1000 | 100
2 | 20 | 2000 | 200
Here, for partition key pk = 1 the static field s is 1000, and for partition key pk = 2 it is 2000.
If we insert or update the static field s for partition key pk = 1:
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 1, 11, 101, 1001);
then the value of s changes for all rows of partition pk = 1:
pk | ck | s | d
----+----+------+-----
1 | 10 | 1001 | 100
1 | 11 | 1001 | 101
2 | 20 | 2000 | 200
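Because s belongs to the partition rather than to any single row, it can also be written with only the partition key in the WHERE clause. A minimal sketch against the static_test table above (the value 1002 is just an example):

```cql
-- Update only the static column; no clustering key is needed.
UPDATE static_test SET s = 1002 WHERE pk = 1;

-- Both rows of partition pk = 1 (ck = 10 and ck = 11) now report s = 1002;
-- partition pk = 2 keeps s = 2000.
SELECT pk, ck, s, d FROM static_test;
```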
In a table that uses clustering columns, non-clustering columns can be declared static in the table definition. Static columns are only static within a given partition.
Example:
CREATE TABLE test (
partition_column text,
static_column text STATIC,
clustering_column int,
PRIMARY KEY (partition_column , clustering_column)
);
INSERT INTO test (partition_column, static_column, clustering_column) VALUES ('key1', 'A', 0);
INSERT INTO test (partition_column, clustering_column) VALUES ('key1', 1);
SELECT * FROM test;
Results:
partition_column | clustering_column | static_column
----------------+-------------------+--------------
key1 | 0 | A
key1 | 1 | A
Observation:
The second row was inserted without a static_column value, yet it shows the value already stored for its partition key.
Now, let's insert another record
INSERT INTO test (partition_column, static_column, clustering_column) VALUES ('key1', 'C', 2);
SELECT * FROM test;
Results:
partition_column | clustering_column | static_column
----------------+-------------------+--------------
key1 | 0 | C
key1 | 1 | C
key1 | 2 | C
Observation:
If you update the static column, or insert another record with a new static column value, the value is reflected across all the rows of that partition ==> static column values are static (constant) within a given partition
Restrictions (from the DataStax reference documentation below):
A table that does not define any clustering columns cannot have a static column. The table having no clustering columns has a one-row partition in which every column is inherently static.
A table defined with the COMPACT STORAGE directive cannot have a static column.
A column designated to be the partition key cannot be static.
Reference : DataStax Reference
In the example on the page you've linked they don't have different values at the same point in time.
They first have the static balance field set to -208 for the whole user1 partition:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | -208 | 8 | burrito | False
user1 | 2 | -208 | 200 | hotel room | False
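For reference, a table shaped like the one in that example could be declared roughly as follows. The column types are assumptions inferred from the output above, not copied from the documentation page.

```cql
-- Types are inferred from the sample output and may differ from the docs.
CREATE TABLE purchases (
    user text,
    balance int STATIC,   -- one balance per user partition
    expense_id int,
    amount int,
    description text,
    paid boolean,
    PRIMARY KEY (user, expense_id)
);
```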
Then they apply a batch update statement that sets the balance value to -200:
BEGIN BATCH
UPDATE purchases SET balance=-200 WHERE user='user1' IF balance=-208;
UPDATE purchases SET paid=true WHERE user='user1' AND expense_id=1 IF paid=false;
APPLY BATCH;
This updates the balance field for the whole user1 partition to -200:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | -200 | 8 | burrito | True
user1 | 2 | -200 | 200 | hotel room | False
The point of a static field is that you can change its value for the whole partition at once. So if I executed the following statement:
UPDATE purchases SET balance=42 WHERE user='user1';
I would get the following result:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | 42 | 8 | burrito | True
user1 | 2 | 42 | 200 | hotel room | False

Cassandra query table based on row range

I am new to Cassandra. I am using Cassandra 3.0 and the DataStax Java driver for development. I would like to know whether Cassandra provides any option to fetch data based on a rowkey range,
something like
select * from <table-name> where rowkey > ? and rowkey < ?;
If not, is there any other option in Cassandra (Java/CQL) to fetch data based on row ranges?
Unfortunately, there really isn't a mechanism in Cassandra that works in the way that you are asking. The only way to run a range query on your partition keys (rowkey) is with the token function. This is because Cassandra orders its rows in the cluster by the hashed token value of the partition key. That value would not really have any meaning for you, but it would allow you to "page" through a large table without encountering timeouts.
SELECT * FROM <table-name>
WHERE token(rowkey) > -9223372036854775807
AND token(rowkey) < -5534023222112865485;
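Paging with token() could look roughly like the sketch below. The table name my_table and the page size of 1000 are placeholders, and it assumes rowkey is the entire primary key, so a page never ends in the middle of a partition.

```cql
-- First page: return each key together with its token.
SELECT token(rowkey), rowkey FROM my_table LIMIT 1000;

-- Next page: start after the last token returned by the previous page
-- (the literal below is only an example value).
SELECT token(rowkey), rowkey FROM my_table
WHERE token(rowkey) > -5534023222112865485
LIMIT 1000;
```

The client repeats the second query, substituting the last token it saw, until a page comes back with fewer rows than the limit.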
The way to go about range querying on meaningful values is to find a value to partition your rows by, and then cluster by a numeric or time value. For example, I can query a table of events by date range if I partition my data by month (PRIMARY KEY (monthbucket, eventdate)):
aploetz@cqlsh:stackoverflow> SELECT * FROM events
WHERE monthbucket='201509'
AND eventdate > '2015-09-19' AND eventdate < '2015-09-26';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-25 06:00:00+0000 | B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-25 05:59:59+0000 | E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-22 06:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-20 05:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 06:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(5 rows)
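A table supporting that query might be defined along these lines. This is a sketch: the column types and the descending clustering order are assumptions inferred from the output above, and the key follows the (monthbucket, eventdate) layout described in the text.

```cql
-- Column types and clustering order are inferred from the sample output.
CREATE TABLE events (
    monthbucket text,       -- partition key, e.g. '201509'
    eventdate timestamp,    -- clustering key, range-queryable within a month
    beginend text,
    eventid uuid,
    eventname text,
    PRIMARY KEY (monthbucket, eventdate)
) WITH CLUSTERING ORDER BY (eventdate DESC);
```

A date range that spans two months needs one query per monthbucket, with the results merged on the client.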

Cassandra: Searching for NULL values

I have a table MACRecord in Cassandra as follows :
CREATE TABLE has.macrecord (
macadd text PRIMARY KEY,
position int,
record int,
rssi1 float,
rssi2 float,
rssi3 float,
rssi4 float,
rssi5 float,
timestamp timestamp
)
I have 5 different nodes, each updating a row based on its number, i.e. node 1 only updates rssi1, node 2 only updates rssi2, and so on. This evidently creates null values for the other columns.
I cannot find a query that will give me only those rows which are not null. Specifically, I have referred to this post.
I want to be able to query, for example, SELECT * FROM MACRecord WHERE RSSI1 != NULL, as in MySQL. However, it seems that neither null values nor comparison operators such as != are supported in CQL.
Is there an alternative to putting NULL values or a special flag? I am inserting floats, so unlike strings I cannot insert something like ''. What is a possible workaround for this problem?
Edit :
My data model in MYSQL was like this :
+-----------+--------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+-------------------+-----------------------------+
| MACAdd | varchar(17) | YES | UNI | NULL | |
| Timestamp | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| Record | smallint(6) | YES | | NULL | |
| RSSI1 | decimal(5,2) | YES | | NULL | |
| RSSI2 | decimal(5,2) | YES | | NULL | |
| RSSI3 | decimal(5,2) | YES | | NULL | |
| RSSI4 | decimal(5,2) | YES | | NULL | |
| RSSI5 | decimal(5,2) | YES | | NULL | |
| Position | smallint(6) | YES | | NULL | |
+-----------+--------------+------+-----+-------------------+-----------------------------+
Each node (1-5) queried MySQL based on its number, for example node 1 ran "SELECT * FROM MACRecord WHERE RSSI1 IS NOT NULL".
I updated my data model in Cassandra as follows, so that rssi1-rssi5 are now text (VARCHAR) columns:
CREATE TABLE has.macrecord (
macadd text PRIMARY KEY,
position int,
record int,
rssi1 text,
rssi2 text,
rssi3 text,
rssi4 text,
rssi5 text,
timestamp timestamp
)
My thinking was that each node would initially insert the string 'NULL' for a record and, when actual RSSI data arrives, replace the 'NULL' string with it. That would avoid tombstones, and it would be fairly clear to the user that values flagged 'NULL' are not valid data.
However, I am still puzzled as to how to retrieve results the way I did in MySQL. There is no != operator in Cassandra. How can I write a query that gives me a result set like "SELECT * FROM HAS.MACRecord WHERE RSSI1 != 'NULL'"?
You can only select rows in CQL based on the PRIMARY KEY fields, which by definition cannot be null. This also applies to secondary indexes. So I don't think Cassandra will be able to do the filtering you want on the data fields. You could select on some other criteria and then write your client to ignore rows that had null values.
Or you could create a different table for each rssiX value, so that none of them would be null.
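A per-node table could be sketched as follows. The table name rssi1_by_mac and the sample MAC address and reading are made up for illustration; the same pattern would be repeated for rssi2 through rssi5.

```cql
-- Hypothetical table for node 1: a row exists only once node 1 has
-- reported a reading for that MAC address, so there are no null readings.
CREATE TABLE has.rssi1_by_mac (
    macadd text PRIMARY KEY,
    rssi float,
    timestamp timestamp
);

-- Made-up sample values.
INSERT INTO has.rssi1_by_mac (macadd, rssi, timestamp)
VALUES ('00:11:22:33:44:55', -61.5, toTimestamp(now()));

-- Every row returned has a reading from node 1.
SELECT macadd, rssi, timestamp FROM has.rssi1_by_mac;
```

Note that the final SELECT still reads the whole table, so on a large data set it would need to be paged.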
If you are only interested in some kind of aggregation, then the null values are treated as zero. So you could do something like this:
SELECT sum(rssi1) FROM has.macrecord WHERE macadd='someadd';
The sum() function is available as of Cassandra 2.2.
You might also be able to do some kind of trick with a user defined function/aggregate, but I think it would be simpler to have multiple tables.
