Cassandra DB: Why less than query failed? - cassandra

I have created a keyspace and a table with a uuid column as the primary key and a timestamp column with a secondary index. All of this succeeded, as shown below:
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:10:30', '111' );
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:20:30', '222' );
cassandra#cqlsh:my_keyspace> select * from my_test;
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
69579f6f-bf88-493b-a1d6-2f89fac25650 | 2015-03-12 09:10:30+0000 | 111
(2 rows)
Now a query with an equality condition:
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time = '2015-03-12 09:20:30';
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
(1 rows)
And now a query with less than:
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time < '2015-03-12 09:20:30';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_time < <value>'"
The first query succeeds, so why does the second one fail? How can I make the second query work, since that is exactly what I need?
You can test all this on your own machine. Thanks
CREATE TABLE my_test (
id uuid PRIMARY KEY,
insert_time timestamp,
value text
) ;
CREATE INDEX my_test_insert_time_idx ON my_keyspace.my_test (insert_time);

Cassandra range queries are quite limited. It comes down to performance and data storage mechanics. A range query must do the following:
Hit one partition key (or a few, with IN), and include exact matches on all consecutive clustering keys except the last one in the query, on which you can do a range query.
Say your PK is (a, b, c, d), then the following are allowed:
where a=a1 and b < b1
where a=a1 and b=b1 and c < c1
The following is not:
where a=a1 and c < c1
[I won't go into Allow Filtering here...avoid it.]
Secondary indexes only support exact matches. You can't run range queries on them.
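To make the rule above concrete, here is a minimal Python sketch that checks whether a set of WHERE restrictions follows it. It is only a model of the rule as stated in this answer, not Cassandra's actual validation logic. The function name and the dict format are made up for illustration, and it ignores IN, ALLOW FILTERING, secondary indexes, and two bounds on the same column.

```python
RANGE_OPS = {"<", "<=", ">", ">="}


def is_valid_range_query(partition_keys, clustering_keys, restrictions):
    """Return True if `restrictions` (column -> operator) follows the rule.

    Hypothetical helper: a simplified model of the rule described above.
    """
    # Every partition key column needs an exact match.
    if any(restrictions.get(col) != "=" for col in partition_keys):
        return False

    # This sketch only accepts restrictions on primary key columns.
    known = set(partition_keys) | set(clustering_keys)
    if any(col not in known for col in restrictions):
        return False

    gap_seen = False    # an unrestricted clustering column was skipped
    range_seen = False  # a range operator was already used
    for col in clustering_keys:
        op = restrictions.get(col)
        if op is None:
            gap_seen = True
            continue
        # Nothing may be restricted after a gap or after a range.
        if gap_seen or range_seen:
            return False
        if op in RANGE_OPS:
            range_seen = True
        elif op != "=":
            return False
    return True


# PRIMARY KEY (a, b, c, d): a is the partition key, b, c, d are clustering keys.
pk, ck = ["a"], ["b", "c", "d"]
print(is_valid_range_query(pk, ck, {"a": "=", "b": "<"}))            # allowed
print(is_valid_range_query(pk, ck, {"a": "=", "b": "=", "c": "<"}))  # allowed
print(is_valid_range_query(pk, ck, {"a": "=", "c": "<"}))            # not allowed
```

The third call is rejected because b is skipped before c is restricted, which matches the "not allowed" example above.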

Related

Cassandra CLUSTERING ORDER BY is not working and showing in correct results

Hi, I have created a table for storing data like this:
CREATE TABLE keyspace.test (
name text,
date text,
time double,
entry text,
details text,
PRIMARY KEY ((name, date), time)
) WITH CLUSTERING ORDER BY (time DESC);
I inserted data into the table, but a query like this gives an unordered result:
SELECT * FROM keyspace.test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Is there a problem with my table design?
I think you are misunderstanding Cassandra's clustering key order. Cassandra sorts data by clustering key within a single partition.
In your case, Cassandra sorts data by the clustering key time within a single (name, date) pair.
Example : Let's insert some data
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 1, 'a');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 2, 'b');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 3, 'c');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 0, 'nil');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 4, 'd');
If we select data with your query :
SELECT * FROM test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Output :
name | date | time | details | entry
-------+------------+------+---------+-------
anand | 2017-04-01 | 3 | null | c
anand | 2017-04-01 | 2 | null | b
anand | 2017-04-01 | 1 | null | a
anand | 2017-04-02 | 4 | null | d
anand | 2017-04-02 | 0 | null | nil
You can see that times 3, 2, 1 are within the single partition anand:2017-04-01 and are sorted in descending order, and times 4, 0 are within the single partition anand:2017-04-02 and are also sorted in descending order. Cassandra does not sort across different partitions.
Here is the doc :
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is the position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Source : http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
By the way, why is your date field a text type and your time field a double type?
You could use the date type for the date field and the timestamp type for time.
The query you are using is OK, but it probably doesn't behave the way you expect because the coordinator will not sort the results across partitions. I have also run into this problem a couple of times.
The solution is simple: it is far better to execute the 4 separate queries you need on the client and then merge the results there. In short, the IN operator puts a lot of pressure on the coordinator node in the cluster. There's a nice read on this subject:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
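As a rough sketch of the client-side merge suggested above: assume each per-partition query has already returned its rows, sorted by time in descending order as the clustering order guarantees. The row dicts below are hypothetical stand-ins for driver results, and "newest date first, then newest time" is my assumption about the desired overall order.

```python
import heapq

# Hypothetical results of one query per (name, date) partition.
# Each list is already sorted by time DESC, as the clustering order guarantees.
rows_04_01 = [
    {"date": "2017-04-01", "time": 3.0, "entry": "c"},
    {"date": "2017-04-01", "time": 2.0, "entry": "b"},
    {"date": "2017-04-01", "time": 1.0, "entry": "a"},
]
rows_04_02 = [
    {"date": "2017-04-02", "time": 4.0, "entry": "d"},
    {"date": "2017-04-02", "time": 0.0, "entry": "nil"},
]


def merge_newest_first(*partitions):
    """Merge per-partition results into one list ordered by (date, time) DESC."""
    # heapq.merge streams the inputs; with reverse=True each input must
    # already be sorted in descending order by the same key.
    return list(
        heapq.merge(
            *partitions,
            key=lambda row: (row["date"], row["time"]),
            reverse=True,
        )
    )


merged = merge_newest_first(rows_04_01, rows_04_02)
for row in merged:
    print(row["date"], row["time"], row["entry"])
```

Because every input is already sorted, the merge never has to re-sort the full result set, which keeps the client-side cost low even for many partitions.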

CAS with CQL in Cassandra

I'm trying to model some time series data in Cassandra which I had been able to do with the older thrift client but CQL seems to be throwing me off.
I want to add a NEW column to my row IF a specific column value matches.
My table definition is:
CREATE TABLE TestTable (
key int,
base uuid,
ts int, // Timestamp (column name)
val text, // Timestamp value (column value)
PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
What I'm guessing it'd look like is:
Row | UUID | TS | TS | TS
--- | ---- | --- | ---| ---
1 | id1 | 1 | 2 | 3
--- | --- | --- | ---| ---
2 | id2 | 1 | 5 | 6
So essentially, I can have a bunch of Timestamps for a given row and a SINGLE UUID for a row.
The UUID needs to be updated for each new insert of a TS column.
So inserts in a row work just fine:
insert into TestTable(key, base, ts, val) values (1, dfb63886-91a4-11e6-ae22-56b6b6499611, 50, 'one')
But I'm failing to figure out a way, using CQL, to INSERT a new column in a row using Cassandra transactions (CAS).
This one fails:
insert into TestTable(key, base, ts, val) values (1, dfb63886-91a4-11e6-ae22-56b6b6499611, 70, 'four') if base = dfb63886-91a4-11e6-ae22-56b6b6499611;
with the error:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:106 mismatched input 'base' expecting K_NOT (..., 70, 'four') if [base] =...)">
And the query:
update TestTable set val = 'four', ts=70 where key = 1 if base = dfb63886-91a4-11e6-ae22-56b6b6499611;
fails with the error:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part ts found in SET part"
I'm trying to figure out how to model the data properly so that I only have one UUID per row and can have multiple columns without having to explicitly define them during table creation, since it can vary quite a bit.
IIRC, it was easy doing this with the thrift client but using that isn't an option =/
There is a nice tutorial regarding data series here
In a nutshell, your composite key will consist of your unique identifier (like the UUID you proposed) and a timestamp, so you can add as many events/values as you need for a given UUID:
CREATE TABLE IF NOT EXISTS TestTable (
base uuid,
ts timestamp, // Timestamp (column name)
value text, // Timestamp value (column value)
PRIMARY KEY (base, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
Inserted values share the same UUID but have different timestamps:
INSERT INTO TestTable (base, ts, value)
VALUES (467286c5-7d13-40c2-92d0-73434ee8970c, dateof(now()), 'abc');
INSERT INTO TestTable (base, ts, value)
VALUES (467286c5-7d13-40c2-92d0-73434ee8970c, dateof(now()), 'def');
cqlsh:test> SELECT * FROM TestTable WHERE base = 467286c5-7d13-40c2-92d0-73434ee8970c;
base | ts | value
--------------------------------------+---------------------------------+-------
467286c5-7d13-40c2-92d0-73434ee8970c | 2016-10-14 04:13:42.779000+0000 | def
467286c5-7d13-40c2-92d0-73434ee8970c | 2016-10-14 04:12:50.551000+0000 | abc
(2 rows)
Any column can be updated except those used as keys. The errors in your statements were caused by the IF clause and by the attempt to update ts, which is part of the composite key.
INSERT INTO TestTable (base, ts, value)
VALUES (ffb0bb8e-3d67-4203-8c53-046a21992e52, dateof(now()), 'bananas');
SELECT * FROM TestTable WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts < dateof(now());
base | ts | value
--------------------------------------+---------------------------------+---------
ffb0bb8e-3d67-4203-8c53-046a21992e52 | 2016-10-14 04:17:26.421000+0000 | bananas
(1 rows)
UPDATE TestTable SET value = 'apples' WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts = '2016-10-14 04:17:26.421+0000';
SELECT * FROM TestTable WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts < dateof(now());
base | ts | value
--------------------------------------+---------------------------------+---------
ffb0bb8e-3d67-4203-8c53-046a21992e52 | 2016-10-14 04:17:26.421000+0000 | apples
(1 rows)

Clustering column non-EQ relation

I have a table:
CREATE TABLE room (
uuidhotel text,
startreservetime double,
endreservetime double,
uuid text,
uuidguest text,
uuidroom text,
PRIMARY KEY (uuidhotel, startreservetime, endreservetime)
);
A query like this works:
select * from room WHERE uuidhotel = 'testUUIDHotel' and startreservetime > 1;
but when I try to use:
cqlsh:hotelier> select * from room WHERE uuidhotel = 'testUUIDHotel' and startreservetime > 1 and endreservetime < 3;
I get an error:
InvalidRequest: code=2200 [Invalid query] message="Clustering column "endreservetime" cannot be restricted (preceding column "startreservetime" is restricted by a non-EQ relation)
How can I execute a query with three parameters?
Are there any alternatives?
Unfortunately, if you want to use a greater/less-than operator on a PRIMARY KEY component in Cassandra, all of the preceding PRIMARY KEY components must be restricted with an equals operator.
So how can you query for a date range? Well, you can specify the same PRIMARY KEY component twice. Currently that doesn't help you. But with a small modeling change (coupled with storing each row twice...once for the start and again for the end) it does:
aploetz#cqlsh:stackoverflow> SELECT * FROM room WHERE uuidhotel = 'testUUIDHotel'
AND reservetime > 1 AND reservetime < 3;
uuidhotel | reservetime | startend | uuid | uuidguest | uuidroom
---------------+-------------+----------+--------------------------------------+--------------------------------------+--------------------------------------
testUUIDHotel | 2.1 | S | 49c441cd-a6cd-4638-85b3-fdc3405779f4 | cd3ad747-42a3-4d31-b02a-8190dd8559d8 | daae89d5-abd3-4cac-b4cc-aec9d6b7fb1f
testUUIDHotel | 2.2 | E | 49c441cd-a6cd-4638-85b3-fdc3405779f4 | cd3ad747-42a3-4d31-b02a-8190dd8559d8 | daae89d5-abd3-4cac-b4cc-aec9d6b7fb1f
(2 rows)
Basically, if you store an entry for both the start and the end (and use startend in your key for uniqueness), you'll be able to query effectively with greater/less-than on the time. Just make sure you're querying with a wide enough gap that you don't query in between the range of reservetimes that you care about.
EDIT: Apparently tuple comparisons don't work the way I thought they did... The below answer doesn't work, it compares the entire tuple and NOT the individual elements, i.e. (1,2) < (3,1) returns true. Leaving it in case it inspires a better method...
Another approach: you can query with non-EQ conditions on multiple clustering columns using tuples.
However, it would require you to have a single non-EQ operator, i.e. your query would have to look like:
SELECT * FROM table WHERE (c1, c2) > (1, 3)
So you have to transform your values such that you can use a single operator. You can do this by negating both sides!
Remember from algebra:
5 < 10
If you negate both sides, you have to switch the operator:
-5 > -10
So create a new column negative_c2 with the negated values from c2 and do the query:
SELECT * FROM table WHERE (c1, negative_c2) > (1, -3)
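The caveat from the EDIT note can be illustrated in Python, whose tuples also compare as a whole (lexicographically) rather than element by element. This is only an analogy for the comparison semantics described in the edit, not Cassandra itself, and the (c1, c2) rows are made-up sample values.

```python
# Hypothetical (c1, c2) rows.
rows = [(1, 2), (2, 1), (3, 1), (3, 5)]

# Whole-tuple comparison: the first element decides unless it ties.
whole_tuple = [r for r in rows if r > (1, 3)]

# What the query author actually wanted: each element compared on its own.
per_element = [r for r in rows if r[0] > 1 and r[1] > 3]

print((1, 2) < (3, 1))  # whole-tuple comparison, as in the edit note
print(whole_tuple)
print(per_element)
```

The two result lists differ: rows such as (2, 1) pass the whole-tuple test because 2 > 1 settles the comparison before the second element is ever looked at, which is why the negation trick does not give per-column filtering.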

Is this type of counter table definition valid?

I want to create a table with wide partitions (or, put another way, a table which has no value columns (non primary key columns)) that enables the number of rows in any of its partitions to be efficiently procured. Here is a simple definition of such a table
CREATE TABLE IF NOT EXISTS test_table
(
partitionKeyCol timestamp,
clusteringCol timeuuid,
partitionRowCountCol counter static,
PRIMARY KEY (partitionKeyCol, clusteringCol)
);
The problem with this definition, and others structured like it, is that their validity cannot be clearly deduced from the information contained in the docs.
What the docs do state (with regards to counters):
A counter column can neither be specified as part of a table's PRIMARY KEY, nor used to create an INDEX
A counter column can only be defined in a dedicated counter table (which I take to be a table which solely has counter columns defined as its value columns)
What the docs do not state (with regards to counters):
The ability of a table to have a static counter column defined for it (given the unique write path of counters, I feel that this is worth mentioning)
The ability of a table, which has zero value columns defined for it (making it a dedicated counter table, given my understanding of the term), to also have a static counter column defined for it
Given the information on this subject that is present in (and absent from) the docs, such a definition appears to be valid. However, I'm not sure how that is possible, given that the updates to partitionRowCountCol would require use of a write path different from that used to insert (partitionKeyCol, clusteringCol) tuples.
Is this type of counter table definition valid? If so, how are writes to the table carried out?
It looks like a table with this structure can be defined, but I'm struggling to find a good use case for it. It seems there is no way to actually write to that clustering column.
CREATE TABLE test.test_table (
a timestamp,
b timeuuid,
c counter static,
PRIMARY KEY (a, b)
);
cassandra#cqlsh:test> insert into test_table (a,b,c) VALUES (unixtimestampof(now()), now(), 3);
InvalidRequest: code=2200 [Invalid query] message="INSERT statements are not allowed on counter tables, use UPDATE instead"
cassandra#cqlsh:test> update test_table set c = c + 1 where a=unixtimestampof(now());
cassandra#cqlsh:test> update test_table set c = c + 1 where a=unixtimestampof(now());
cassandra#cqlsh:test> select * from test_table;
a | b | c
--------------------------+------+---
2016-03-24 15:04:31+0000 | null | 1
2016-03-24 15:04:37+0000 | null | 1
(2 rows)
cassandra#cqlsh:test> update test_table set c = c + 1 where a=unixtimestampof(now()) and b=now();
InvalidRequest: code=2200 [Invalid query] message="Invalid restrictions on clustering columns since the UPDATE statement modifies only static columns"
cassandra#cqlsh:test> insert into test_table (a,b) VALUES (unixtimestampof(now()), now());
InvalidRequest: code=2200 [Invalid query] message="INSERT statements are not allowed on counter tables, use UPDATE instead"
cassandra#cqlsh:test> update test_table set b = now(), c = c + 1 where a=unixtimestampof(now());
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part b found in SET part"
What is it you're trying to model?

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in a table. I want to retrieve the posts from all the users that a given user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
) WITH CLUSTERING ORDER BY (time DESC);
I keep the data about whom each user follows in another table:
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am running a query like this:
select * from posts where userid in(1,2,3,4....n);
Two questions:
Why do I still get data in random order, even though CLUSTERING ORDER BY is specified on posts?
Is the model correct to satisfy the query optimally (a user can have n followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because ORDER BY only works for rows within a particular partitioning key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That would return your results ordered by time, as all of the rows within the userid=1 partitioning key would be clustered by it.
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
)WITH CLUSTERING ORDER BY (time DESC)
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
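One simple way to spread users evenly over the eight groups, sketched in Python. The modulo rule and the group_for name are my own assumptions for illustration; the answer above does not prescribe how users are assigned.

```python
from collections import Counter

GROUPS = "ABCDEFGH"  # the eight groups from the example above


def group_for(userid: int) -> str:
    """Map a numeric userid to one of the eight groups (hypothetical rule)."""
    # Modulo keeps the assignment stable and roughly even for sequential ids.
    return GROUPS[userid % len(GROUPS)]


# Check the distribution for sixteen sequential user ids.
counts = Counter(group_for(userid) for userid in range(1, 17))
print(sorted(counts.items()))
```

The group value would be computed on the client at write time and stored with each post, so that reads can target a single group partition.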
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case, is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, and inserted some sample data, and can see that this query works:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
If you query by userid alone, you can see the posts from all of your followers. But in that case, the posts will only be ordered within each follows value, like this:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema:
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it, and follows represents the userid of one of their followers. Say user x follows 10 other people: I am making 10+1 inserts. There is definitely a lot of data duplication. However, it is now easier to get the timeline for a user with the following query:
select * from posts where follows=?
