Cassandra data modeling for range queries using timestamp - cassandra

I need to create a table with 4 columns:
timestamp BIGINT
name VARCHAR
value VARCHAR
value2 VARCHAR
I have 3 required queries:
SELECT *
FROM table
WHERE timestamp > xxx
AND timestamp < xxx;
SELECT *
FROM table
WHERE name = 'xxx';
SELECT *
FROM table
WHERE name = 'xxx'
AND timestamp > xxx
AND timestamp < xxx;
The result needs to be sorted by timestamp.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (timestamp)
);
the result is never sorted.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (name, timestamp)
);
the result is sorted by name first and only then by timestamp, which is not what I want.
name | timestamp
------------------------
a | 20170804142825729
a | 20170804142655569
a | 20170804142650546
a | 20170804142645516
a | 20170804142640515
a | 20170804142620454
b | 20170804143446311
b | 20170804143431287
b | 20170804143421277
b | 20170804142920802
b | 20170804142910787
How do I do this using Cassandra?

Cassandra groups rows by partition key and orders them by clustering key within each partition.
Your first table has only a partition key (timestamp) and no clustering key, so the data will not be sorted.
In the second table the partition key is name and the clustering key is timestamp, so the data is first grouped by name, and then each group is sorted separately by timestamp.
To get rows sorted by timestamp regardless of name, you need a different partition key, like below:
CREATE TABLE table (
year BIGINT,
month BIGINT,
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY ((year, month), timestamp)
);
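Here (year, month) is the composite partition key. You have to derive the year and month from the timestamp when you insert a row. Your data will then be sorted by timestamp within each year and month. Every query has to name the year and month it reads, so a range that spans several months has to be read one partition at a time.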
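Deriving the partition values from the timestamp could look like the sketch below. It assumes the BIGINT timestamp is encoded as yyyyMMddHHmmssSSS, which is what the sample rows in the question look like; the function names bucket and buckets_between are hypothetical helpers, not part of any Cassandra driver.

```python
from typing import List, Tuple


def bucket(ts: int) -> Tuple[int, int]:
    """Return the (year, month) partition values for a timestamp.

    Assumption: ts is a 17-digit number laid out as yyyyMMddHHmmssSSS.
    """
    year = ts // 10**13          # drop MMddHHmmssSSS (13 digits)
    month = ts // 10**11 % 100   # drop ddHHmmssSSS (11 digits), keep MM
    return year, month


def buckets_between(start_ts: int, end_ts: int) -> List[Tuple[int, int]]:
    """List every (year, month) partition that a timestamp range touches."""
    year, month = bucket(start_ts)
    last = bucket(end_ts)
    result = []
    while (year, month) <= last:
        result.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


if __name__ == "__main__":
    print(bucket(20170804142825729))
    print(buckets_between(20171104000000000, 20180204000000000))
```

A range query would then be issued once per bucket returned by buckets_between, with the timestamp bounds repeated in each query.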


Cassandra clustering key uniqueness

In the book Cassandra: The Definitive Guide it is said that the combination of partition key and clustering key guarantees a unique record in the database. My understanding is that the partition key is what guarantees uniqueness of a record (and determines the node where the record is stored), and that the clustering key is only for sorting the records. Can someone help me understand this?
Thanks, and sorry for the question.
A single partition key (without a clustering key) is the whole primary key, so it has to be unique.
With a partition key plus a clustering key, the combination has to be unique, but that doesn't mean that either the partition key or the clustering key has to be unique on its own.
You can insert
(a,b) (first record)
(a,c) (same partition key as the first record)
(d,b) (same clustering key as the first record)
When you insert (a,b) again, it updates the non-primary-key values of the existing row with that primary key.
In the following example userid is the partition key and date is the clustering key.
cqlsh:play> CREATE TABLE example (userid int, date int, name text, PRIMARY KEY (userid, date));
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200530, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200531, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'a');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | a
(3 rows)
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'b');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | b
(3 rows)
cqlsh:play>
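The same behaviour can be modelled as a dictionary keyed by the full primary key. This is only an illustrative Python sketch of the upsert semantics shown in the cqlsh session above, not how Cassandra stores data; the names table and insert are hypothetical.

```python
# Rows keyed by the full primary key: (partition key, clustering key).
table = {}


def insert(userid, date, name):
    """Mimic a CQL INSERT: a new key adds a row, an existing key is overwritten."""
    table[(userid, date)] = {"name": name}


insert(1, 20200530, "a")
insert(1, 20200531, "a")   # same partition key, different clustering key: new row
insert(2, 20200531, "a")   # same clustering key, different partition key: new row
rows_before = len(table)

insert(2, 20200531, "b")   # same full primary key: the existing row is updated
rows_after = len(table)

if __name__ == "__main__":
    for (userid, date), columns in sorted(table.items()):
        print(userid, date, columns["name"])
```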

Date selection query is not working as expected in Cassandra

I have a domain class and below is the declaration:
class Emp{
static mapWith = "cassandra"
String name
Date doj
}
data:
id name doj
1 X 01-01-2010
2 Y 01-20-2012
Cassandra query:
select * from emp_schema.emp where doj='01-01-2010';
Error:
code=2200 [Invalid query] message="Unable to coerce '01-01-2010' to a formatted date (long)"
The format for querying dates in Cassandra is yyyy-mm-dd:
select * from emp_schema.emp where doj='2010-01-01';
Carlos is correct in that Cassandra requires dates to be formatted like yyyy-mm-dd.
But this query will only work if doj is your partition key. If your PRIMARY KEY is not set up with doj as the partition key, your query is not possible.
I would specifically design your table to suit your query. This definition partitions on doj and clusters on id for uniqueness, as multiple emp[loyee]s can probably have the same doj:
create table emp_by_doj (
doj date,
id int,
name text,
primary key (doj,id));
Then you can query by a specific date, and have multiple rows returned for it:
> SELECT * FROM emp_by_doj WHERE doj='2017-06-01';
doj | id | name
------------+------+-------
2017-06-01 | 7721 | Sarah
2017-06-01 | 8122 | Sam
(2 rows)
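If the application holds dates in the month-day-year form shown in the question's data (01-20-2012), they can be converted to the yyyy-mm-dd form before being put into a query. A minimal Python sketch, assuming that input format; to_cql_date is a hypothetical helper name:

```python
from datetime import datetime


def to_cql_date(us_date: str) -> str:
    """Convert 'MM-dd-yyyy' (assumed input format) to 'yyyy-MM-dd'."""
    return datetime.strptime(us_date, "%m-%d-%Y").date().isoformat()


if __name__ == "__main__":
    print(to_cql_date("01-20-2012"))
```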

Cassandra event storage

Is there a best way to store data in a Cassandra database if I will want to search the data in these 2 ways:
1) The last 20 "error" event_types for user_id "123"
2) All "login" event_types in the past day
Would this work:
CREATE TABLE events (
user_id text,
event_type text,
data text,
timestamp timestamp,
PRIMARY KEY (event_type, timestamp, userid) );
You will need to create two tables for this (at least in version 2.x).
From version 3.5 onward you can use SASI.
1) The last 20 "error" event_types for user_id "123"
CREATE TABLE events (
user_id text,
event_type text,
data text,
timestamp timestamp,
PRIMARY KEY ((user_id, event_type), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Rows in each partition are stored newest first, so you can get the data with the following query.
SELECT * FROM events WHERE user_id = '123' AND event_type = 'error' LIMIT 20;
2) All "login" event_types in the past day
CREATE TABLE events_by_type (
user_id text,
event_type text,
data text,
timestamp timestamp,
PRIMARY KEY (event_type, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Now you can get the data with the following query, where the bound is the timestamp from one day ago.
SELECT * FROM events_by_type WHERE event_type = 'login' AND timestamp > 'yyyy-mm-dd HH:mm:ss+0000';
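The lower bound for "the past day" has to be computed by the client. One way to do that is sketched below in Python; it relies on CQL timestamp columns accepting an integer number of milliseconds since the epoch, and the function name past_day_lower_bound_ms is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def past_day_lower_bound_ms(now: datetime) -> int:
    """Milliseconds since the epoch for the instant 24 hours before `now`.

    `now` must be timezone-aware.
    """
    return (now - timedelta(days=1) - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    bound = past_day_lower_bound_ms(datetime.now(timezone.utc))
    # The value would normally be passed as a bound parameter, not formatted in.
    print("... WHERE event_type = 'login' AND timestamp > %d" % bound)
```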

How to model cassandra columnfamily

Are the select queries below possible for the column family I have defined? I am getting a bad request error. How should I model my column family to get the correct results?
CREATE TABLE recordhistory (
userid bigint,
objectid bigint,
operation text,
record_link_id bigint,
time timestamp,
username text,
value map<bigint, text>,
PRIMARY KEY ((userid, objectid), operation, record_link_id, time)
) WITH CLUSTERING ORDER BY (operation ASC, record_link_id ASC, time DESC)
Select Query:
SELECT * FROM recordhistory WHERE userid=439035 AND objectid=20011009 AND operation='update' AND time>=1389205800000 AND time<=1402338600000 ALLOW FILTERING;
Bad Request: PRIMARY KEY column "time" cannot be restricted (preceding column "record_link_id" is either not restricted or by a non-EQ relation)
SELECT * FROM recordhistory WHERE userid=439035 AND objectid=20011009 AND record_link_id=20011063 ALLOW FILTERING;
Bad Request: PRIMARY KEY column "record_link_id" cannot be restricted (preceding column "operation" is either not restricted or by a non-EQ relation)
create table recordhistory (
userid bigint,
objectid bigint,
operation text,
record_link_id bigint,
time timestamp,
username text,
value map<bigint, text>,
PRIMARY KEY ((userid, objectid), time, operation, record_link_id)) WITH CLUSTERING ORDER BY (time DESC, operation ASC, record_link_id ASC);
select * from recordhistory where userid=12346 AND objectid=45646 and time >=1389205800000 and time <1402338700000 ALLOW FILTERING;
userid | objectid | time | operation | record_link_id | username | value
--------+----------+--------------------------+-----------+----------------+----------+-------
12346 | 45646 | 2014-06-09 11:30:00-0700 | myop4 | 78946 | name3 | null
12346 | 45646 | 2014-01-08 10:30:00-0800 | myop99999 | 999999 | name3 | null
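Both Bad Request messages state the same rule: a clustering column can only be restricted if every clustering column before it in the PRIMARY KEY is restricted by equality. The Python sketch below models just that rule as quoted in the error messages; it is not Cassandra's actual validation code, and the function name and the 'eq'/'range' labels are hypothetical.

```python
from typing import Dict, List, Optional


def first_invalid_restriction(clustering_columns: List[str],
                              restrictions: Dict[str, str]) -> Optional[str]:
    """Return the first clustering column whose restriction breaks the rule.

    restrictions maps a column name to 'eq' or 'range'; unrestricted columns
    are simply absent. Returns None if the restrictions are acceptable.
    """
    prefix_all_eq = True
    for column in clustering_columns:
        kind = restrictions.get(column)
        if kind is not None and not prefix_all_eq:
            return column
        if kind != "eq":
            prefix_all_eq = False
    return None


if __name__ == "__main__":
    original = ["operation", "record_link_id", "time"]
    remodeled = ["time", "operation", "record_link_id"]
    print(first_invalid_restriction(original, {"operation": "eq", "time": "range"}))
    print(first_invalid_restriction(original, {"record_link_id": "eq"}))
    print(first_invalid_restriction(remodeled, {"time": "range"}))
```

With time moved to the front of the clustering columns, a range on time alone no longer skips over an unrestricted column, which is why the remodeled table accepts the query.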

Extra column created by CQL inserts (comparing to cli)

I see an extra column being created in my column family when I use CQL, compared to the CLI.
Create table using CQL and insert row:
cqlsh:cassandraSample> CREATE TABLE bedbugs(
... id varchar,
... name varchar,
... description varchar,
... primary key(id, name)
... ) ;
cqlsh:cassandraSample> insert into bedbugs (id, name, description)
values ('Cimex','Cimex lectularius','http://en.wikipedia.org/wiki/Bed_bug');
Now insert column using cli:
[default@cassandraSample] set bedbugs['BatBedBug']['C. pipistrelli:description']='google.com';
Value inserted.
Elapsed time: 1.82 msec(s).
[default@cassandraSample] list bedbugs
... ;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: Cimex
=> (column=Cimex lectularius:, value=, timestamp=1369682957658000)
=> (column=Cimex lectularius:description, value=http://en.wikipedia.org/wiki/Bed_bug, timestamp=1369682957658000)
-------------------
RowKey: BatBedBug
=> (column=C. pipistrelli:description, value=google.com, timestamp=1369688651442000)
2 Rows Returned.
cqlsh:cassandraSample> select * from bedbugs;
id | name | description
-----------+-------------------+--------------------------------------
Cimex | Cimex lectularius | http://en.wikipedia.org/wiki/Bed_bug
BatBedBug | C. pipistrelli | google.com
So CQL creates one extra, empty column for each row in addition to the non-primary-key columns. Isn't that a waste of space?
When you create a column family in cqlsh with primary key(id, name), id becomes the partition key and name becomes the clustering column; no secondary indexes are created. The extra empty column (Cimex lectularius:) is a marker that CQL writes for each row, so that the row still exists even when all of its non-primary-key columns are null. cassandra-cli writes the raw column directly and does not add this marker, which is why the row inserted through the CLI has no such column.
For compatibility with cassandra-cli, and to prevent this extra column from being created, change your CREATE TABLE statement to include "WITH COMPACT STORAGE".
So
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
);
becomes
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
) WITH COMPACT STORAGE;
WITH COMPACT STORAGE is also how you would go about supporting wide rows in cql.
