Date selection query is not working as expected in Cassandra - groovy

I have a domain class and below is the declaration:
class Emp{
static mapWith = "cassandra"
String name
Date doj
}
data:
id name doj
1 X 01-01-2010
2 Y 01-20-2012
Cassandra query:
select * from emp_schema.emp where doj='01-01-2010';
Error:
code=2200 [Invalid query] message="Unable to coerce '01-01-2010' to a formatted date (long)"

the format to query dates in cassandra is yyyy-mm-dd

select * from emp_schema.emp where doj='01-01-2010';
Carlos is correct in that Cassandra requires dates to be formatted like yyy-mm-dd.
But this query will only work if doj is your partition key. If your PRIMARY KEY is not setup to indicate doj as the partition key, your query is not possible.
I would specifically design your table to suit your query. This definition partitions on doj and clusters on id for uniqueness, as multiple emp[loyee]s can probably have the same doj:
create table emp_by_doj (
doj date,
id int,
name text,
primary key (doj,id));
Then you can query by a specific date, and have multiple rows returned for it:
> SELECT * FROM emp_by_doj WHERE doj='2017-06-01';
doj | id | name
------------+------+-------
2017-06-01 | 7721 | Sarah
2017-06-01 | 8122 | Sam
(2 rows)

Related

ScyllaDB - [Invalid query] message="marshaling error: Milliseconds length exceeds expected (6)"

I have a table with a column of type timestamp and when I insert a value, Scylla saves it with 3 zeros more.
According to Scylla Documentation (https://docs.scylladb.com/getting-started/types/#timestamps):
Timestamps can be input in CQL either using their value as an integer, or using a string that represents an ISO 8601 date. For instance, all of the values below are valid timestamp values for Mar 2, 2011, at 04:05:00 AM, GMT:
1299038700000
'2011-02-03 04:05+0000'
'2011-02-03 04:05:00+0000'
'2011-02-03 04:05:00.000+0000'
'2011-02-03T04:05+0000'
'2011-02-03T04:05:00+0000'
'2011-02-03T04:05:00.000+0000'
So when I create a table for example:
CREATE TABLE callers (phone text, timestamp timestamp, callID text, PRIMARY KEY(phone, timestamp));
And insert values into it:
INSERT INTO callers (phone, timestamp, callID) VALUES ('6978311845', 1299038700000, '123');
INSERT INTO callers (phone, timestamp, callID) VALUES ('6978311846', '2011-02-03 04:05+0000', '456');
INSERT INTO callers (phone, timestamp, callID) VALUES ('6978311847', '2011-02-03 04:05:00.000+0000', '789');
The SELECT * FROM callers; will show all timestamps with 3 zeros more after the dot:
phone | timestamp | callid
------------+---------------------------------+--------
6978311847 | 2011-02-03 04:05:00.000000+0000 | 789
6978311845 | 2011-03-02 04:05:00.000000+0000 | 123
6978311846 | 2011-02-03 04:05:00.000000+0000 | 456
As a result when I try for example to delete a row:
DELETE FROM callers WHERE phone = '6978311845' AND timestamp = '2011-03-02 04:05:00.000000+0000';
An error occurs:
InvalidRequest: Error from server: code=2200 [Invalid query] message="marshaling error: unable to parse date '2011-03-02 04:05:00.000000+0000': marshaling error: Milliseconds length exceeds expected (6)"
How can I store timestamp without getting this error?
You have hr:min:sec.millisec -> Millisec can be up to 999/1000 so essentially that's what the error is saying.
The 3 INSERT statement you did are correct in terms of syntax.
The DELETE statement should be in the same format as the INSERT:
AND timestamp = '2011-03-02 04:05:00.00+0000'
AND timestamp = '2011-03-02 04:05:00.000+0000'
AND timestamp = '2011-03-02 04:05:00+0000'
AND timestamp = '2011-03-02 04:05:00'
The additional 000 that appear in the table are just a display issue.

Cassandra data modeling for range queries using timestamp

I need to create a table with 4 columns:
timestamp BIGINT
name VARCHAR
value VARCHAR
value2 VARCHAR
I have 3 required queries:
SELECT *
FROM table
WHERE timestamp > xxx
AND timestamp < xxx;
SELECT *
FROM table
WHERE name = 'xxx';
SELECT *
FROM table
WHERE name = 'xxx'
AND timestamp > xxx
AND timestamp < xxx;
The result needs to be sorted by timestamp.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (timestamp)
);
the result is never sorted.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (name, timestamp)
);
the result is sorted by name > timestamp which is wrong.
name | timestamp
------------------------
a | 20170804142825729
a | 20170804142655569
a | 20170804142650546
a | 20170804142645516
a | 20170804142640515
a | 20170804142620454
b | 20170804143446311
b | 20170804143431287
b | 20170804143421277
b | 20170804142920802
b | 20170804142910787
How do I do this using Cassandra?
Cassandra order data by clustering key group by partition key
In your case first table have only partition key timestamp, no clustering key. So data will not be sorted.
And For the second table partition key is name and clustering key is timestamp. So your data will sorted by timestamp group by name. Means data will be first group by it's name then each group will be sorted separately by timestamp.
Edited
So you need to add a partition key like below :
CREATE TABLE table (
year BIGINT,
month BIGINT,
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY ((year, month), timestamp)
);
here (year, month) is the composite partition key. You have to insert the year and month from the timestamp. So your data will be sorted by timestamp within a year and month

Cassandra CLUSTERING ORDER BY is not working and showing in correct results

Hi I have created a table for storing data of like this
CREATE TABLE keyspace.test (
name text,
date text,
time double,
entry text,
details text,
PRIMARY KEY ((name, date), time)
) WITH CLUSTERING ORDER BY (time DESC);
And inserted data into the table.But a query like this gives an unordered result.
SELECT * FROM keyspace.test where device_id name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Is there any problem with my table design.
I think you are misunderstanding cassandra clustering key order. Cassandra Sort data with cluster key within a single partition.
That is for your case cassandra sort data with clustering key time within a single name and date.
Example : Let's insert some data
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 1, 'a');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 2, 'b');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 3, 'c');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 0, 'nil');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 4, 'd');
If we select data with your query :
SELECT * FROM test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Output :
name | date | time | details | entry
-------+------------+------+---------+-------
anand | 2017-04-01 | 3 | null | c
anand | 2017-04-01 | 2 | null | b
anand | 2017-04-01 | 1 | null | a
anand | 2017-04-02 | 4 | null | d
anand | 2017-04-02 | 0 | null | nil
You can see that time 3,2,1 are within a single partition anand:2017-04-01 are sorted in desc And time 4,0 are within single partition anand:2017-04-02 are sorted in desc. Cassandra will not take care of sorting between different partition.
Here is the doc :
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is the position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Source : http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
By the way why is your data field is text type and time field is double type ?
You can use date field as date type and time as timestamp type.
The query that you are using is o.k. but it probably doesn't behave as you are expecting it to because coordinator will not sort the results based on partitions. I also run into this problem couple of times.
The solution to it is very simple, basically It's far better to execute the 4 separate queries that you need on the client and then merge the results there. In short IN operator puts a lot of pressure to the coordinator node in the cluster, there's a nice read on this subject:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Can an index be created on a UUID Column?

Is it possible to create an index on a UUID/TIMEUUID column in Cassandra? I'm testing out a model design which would have an index on a UUID column, but queries on that column always return 0 rows found.
I have a table like this:
create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
I create an index with this command:
create index idx on some_data (run_id) ;
No errors are thrown by CQL when I create this index.
I have a small bit of test data in the table:
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-----------------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
However, when I run the query:
select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
CQLSH just returns: (0 rows)
If I use an int for the run_id then the index behaves as expected.
Yes, you can create a secondary index on a UUID. The real question is "should you?"
In any case, I followed your steps, and got it to work.
Connected to Test Cluster at 192.168.23.129:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
aploetz#cqlsh> use stackoverflow ;
aploetz#cqlsh:stackoverflow> create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
aploetz#cqlsh:stackoverflow> create index idx on some_data (run_id) ;
aploetz#cqlsh:stackoverflow> INSERT INTO some_data (site_id, user_id, run_id, value) VALUES (1,1,9e118af0-ac92-11e4-81ae-8d1bc921f26d,3);
aploetz#cqlsh:stackoverflow> select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
code=2200 [Invalid query] message="unconfigured columnfamily usr_rec3"
aploetz#cqlsh:stackoverflow> select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
(1 rows)
Notice though, that when I ran this command, it failed:
select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
Are you sure that you didn't mean to select from some_data instead?
Also, creating secondary indexes on high-cardinality columns (like a UUID) is generally not a good idea. If you need to query by run_id, then you should revisit your data model and come up with an appropriate query table to serve that.
Clarification:
Using secondary indexes in general is not considered good practice. In the new book Cassandra High Availability, Robbie Strickland identifies their use as an anti-pattern, due to poor performance.
Just because a column is of the UUID data type doesn't necessarily make it high-cardinality. That's more of a data model question for you. But knowing the nature of UUIDs and their underlying purpose toward being unique, is setting off red flags.
Put these two points together, and there isn't anything about creating an index on a UUID that sounds appealing to me. If it were my cluster, and (more importantly) I had to support it later, I wouldn't do it.

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in table. I want to retrieve post from all users the user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
)WITH CLUSTERING ORDER BY (time DESC)
I have the data about who all user follows in another table
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am making query like
select * from posts where userid in(1,2,3,4....n);
2 questions:
why I still get data in random order, though CLUSTERING ORDER BY is specified in posts. ?
Is model correct to satisfy the query optimally (user can have n number of followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because ORDER BY only works for rows within a particular partitioning key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That return your results ordered by time, as all of the rows within the userid=1 partitioning key would be clustered by it.
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
)WITH CLUSTERING ORDER BY (time DESC)
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case, is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, and inserted some sample data, and can see that this query works:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
Although, if you want to query by just userid you can see posts from all of your followers. But in that case, the posts will only be ordered within each followerid, like this:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema,
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it and follows represents userid for his one of the follower. Say user x follows 10 other people , i am making 10+1 inserts. Definitely there is too much data duplication. However now its easier to get timeline for one of the user with following query
select * from posts where follows=?

Resources