Group by in CQL for Cassandra DB not working - cassandra

I have this table in a Cassandra keyspace:
create table hashtags(
id uuid,
text text,
frequence int,
primary key ((text), frequence, id))
with clustering order by (frequence desc, id asc);
So text is the partition key, and frequence and id are the clustering columns.
According to Cassandra documentation regarding the support for GROUP BY operations, I should be able to run this sort of query:
select text, sum(frequence) from hashtags
group by text;
But I keep getting this error:
com.datastax.driver.core.exceptions.SyntaxError: line 2:0 no viable alternative at input 'group' (...text, sum(frequence) from [hashtags] group...)
Is there something I misunderstood from the guide? How can I run this query correctly?
Thanks for helping.

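It worked for me on Apache Cassandra 3.10, tested from cqlsh. GROUP BY was added in Cassandra 3.10, so a syntax error at 'group' suggests that your server is running an older version.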
cqlsh:test> select * from hashtags ;
text | frequence | id
-------+-----------+--------------------------------------
hello | 5 | 07ef8ee4-6492-4112-babb-fc3ac2893701
hello | 4 | 3f6f3b1d-4a33-4a07-ad60-2274a9dc5577
hello | 1 | 4adf7e2a-f3b9-41eb-85cf-f4c4bdc5d322
hi | 7 | 71718f46-455e-4012-a306-f31f1cb2454a
(4 rows)
cqlsh:test> select text, sum(frequence) from hashtags group by text;
text | system.sum(frequence)
-------+-----------------------
hello | 10
hi | 7
(2 rows)
Warnings :
Aggregation query used without partition key
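Where GROUP BY is not available on the server, the same totals can be computed on the client after a plain SELECT text, frequence FROM hashtags. The following is only a minimal Python sketch: it assumes the rows have already been fetched into a list of tuples (the values are copied from the cqlsh output above), and sum_by_text is a hypothetical helper name, not a driver function.

```python
from collections import defaultdict

# (text, frequence) pairs copied from the cqlsh output above.
# Assumption: these were fetched with a plain SELECT, no GROUP BY.
rows = [("hello", 5), ("hello", 4), ("hello", 1), ("hi", 7)]


def sum_by_text(rows):
    """Hypothetical helper: sum frequence per text value on the client."""
    totals = defaultdict(int)
    for text, frequence in rows:
        totals[text] += frequence
    return dict(totals)


totals = sum_by_text(rows)
print(totals)
```

This reads every row over the network, so it only suits small tables or a single partition.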

Related

Cassandra query max of a particular column for a particular ID

I am trying to write a Cassandra query and my use case is as follows
Let's say the table is
ID | Version
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
Now what I want is to get the latest version for all the IDs.
So the query should give me 2 rows. The first with Id:1 Version 2 and second with ID:2 Version:3
I tried a query like Select * from table where ID=1 and Version= MAX(Version) but it's not a valid syntax.
Can anybody help with this?
SELECT * FROM table WHERE ID = 1 LIMIT 1 returns the highest version for that ID, provided Version is a clustering column ordered descending:
CREATE TABLE table (
id int,
version int,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
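The LIMIT 1 query returns the latest version for one ID, so covering all IDs means running it once per ID. The Python sketch below only models that behaviour in memory, assuming the five rows from the question; latest_versions is a hypothetical name and no Cassandra driver is involved.

```python
from itertools import groupby

# (id, version) rows from the question's example table.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]


def latest_versions(rows):
    """Model 'LIMIT 1 per partition' with version stored in DESC order."""
    # Sort by id, then by version descending, mimicking
    # CLUSTERING ORDER BY (version DESC) inside each partition.
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    # The first row of each id group is what LIMIT 1 would return.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]


print(latest_versions(rows))
```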

Cassandra CLUSTERING ORDER BY is not working and showing incorrect results

Hi, I have created a table for storing data like this:
CREATE TABLE keyspace.test (
name text,
date text,
time double,
entry text,
details text,
PRIMARY KEY ((name, date), time)
) WITH CLUSTERING ORDER BY (time DESC);
I inserted data into the table, but a query like this gives an unordered result:
SELECT * FROM keyspace.test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Is there a problem with my table design?
I think you are misunderstanding Cassandra's clustering key order. Cassandra sorts data by clustering key only within a single partition.
In your case, Cassandra sorts data by the clustering key time within a single (name, date) pair.
Example : Let's insert some data
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 1, 'a');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 2, 'b');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 3, 'c');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 0, 'nil');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 4, 'd');
If we select data with your query :
SELECT * FROM test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Output :
name | date | time | details | entry
-------+------------+------+---------+-------
anand | 2017-04-01 | 3 | null | c
anand | 2017-04-01 | 2 | null | b
anand | 2017-04-01 | 1 | null | a
anand | 2017-04-02 | 4 | null | d
anand | 2017-04-02 | 0 | null | nil
You can see that times 3, 2, 1 fall within the single partition anand:2017-04-01 and are sorted in descending order, and times 4, 0 fall within the single partition anand:2017-04-02 and are also sorted in descending order. Cassandra does not sort across different partitions.
Here is the doc :
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is the position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Source : http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
By the way, why is your date field of type text and your time field of type double?
You could use the date type for the date field and the timestamp type for time.
The query you are using is fine, but it probably doesn't behave as you expect, because the coordinator does not sort the results across partitions. I have also run into this problem a couple of times.
The solution is simple: it is far better to execute the 4 separate queries on the client and then merge the results there. In short, the IN operator puts a lot of pressure on the coordinator node in the cluster. There's a nice read on this subject:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
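Merging the per-partition results on the client can be sketched as follows. This is an illustration under stated assumptions, not driver code: the per-date lists stand in for what one query per (name, date) partition would return (values taken from the example output above), the desired global order is assumed to be time descending, and merge_by_time_desc is a hypothetical helper.

```python
import heapq

# Assumed results of one query per partition, each already sorted by
# time DESC, as the clustering order guarantees within a partition.
results_by_date = {
    "2017-04-01": [(3.0, "c"), (2.0, "b"), (1.0, "a")],
    "2017-04-02": [(4.0, "d"), (0.0, "nil")],
}


def merge_by_time_desc(partitions):
    """Merge lists that are each sorted by time DESC into one sorted list."""
    # With reverse=True, heapq.merge expects every input to be sorted
    # from largest to smallest, which matches the clustering order.
    return list(heapq.merge(*partitions, key=lambda row: row[0], reverse=True))


merged = merge_by_time_desc(results_by_date.values())
print(merged)
```

Because each input is already sorted, the merge does not need to re-sort the full result set.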

Cassandra table with one primary key, one clustering column, and one regular column does not append data. It overwrites it

I imagine I must be missing something fundamental.
The table definition I have is:
CREATE TABLE IF NOT EXISTS bundle_components (
bundle_id uuid,
component_type text,
component_id uuid,
PRIMARY KEY (bundle_id, component_type)
);
CREATE INDEX ON bundle_components(component_type);
CREATE INDEX ON bundle_components(component_id);
However, I seem to only end up with a single component_id per unique bundle_id and component_type combination. I was under the impression that the table would have wide rows, so that I would have multiple component_ids if they have the same bundle_id and component_type combination.
Here's the problem in action. Two INSERT statements with different values for component_id, results in a single entry (the previous entry is overwritten):
cqlsh:voltron> INSERT INTO bundle_components(bundle_id, component_type, component_id) VALUES(8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf, 'script', 6558981e-1d89-43c3-a1fc-b2cf45119bcc);
cqlsh:voltron> select * from bundle_components ;
bundle_id | component_type | component_id
--------------------------------------+----------------+--------------------------------------
8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf | channel | a02069df-be81-4960-b64e-9ed8ee09550f
8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf | script | 6558981e-1d89-43c3-a1fc-b2cf45119bcc
(2 rows)
cqlsh:voltron> INSERT INTO bundle_components(bundle_id, component_type, component_id) VALUES(8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf, 'script', 7fcf4402-c8b3-41ed-a524-b1b546511635);
cqlsh:voltron> select * from bundle_components ;
bundle_id | component_type | component_id
--------------------------------------+----------------+--------------------------------------
8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf | channel | a02069df-be81-4960-b64e-9ed8ee09550f
8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf | script | 7fcf4402-c8b3-41ed-a524-b1b546511635
Can someone tell me what I'm doing wrong here?
Your primary key is (bundle_id, component_type), so you can only have 1 unique combination of (bundle_id, component_type).
What you probably want to do is add component_id to your primary key ((bundle_id, component_type, component_id)), which will let you have multiple components with the same bundle_id and component_type. With that change I get the following output:
cqlsh:test> select * from bundle_components ;
bundle_id | component_id | component_type
--------------------------------------+--------------------------------------+----------------
8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf | 6558981e-1d89-43c3-a1fc-b2cf45119bcc | script
8d8e8b6e-19dc-4af6-9bb7-500cd8e2dbaf | 7fcf4402-c8b3-41ed-a524-b1b546511635 | script
Depending on how you want to query your data, you may want to change the ordering of component_id and component_type, but I'm guessing you would like to query your data by bundle_id, component_type rather than by bundle_id, component_id.
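The overwrite behaviour can be modelled with a dictionary keyed by the primary key columns: an INSERT with an existing key replaces the row. The Python sketch below is only an analogy, with UUIDs abbreviated to their first group for readability and insert as a hypothetical helper.

```python
def insert(table, key_columns, row):
    """Model a CQL INSERT: a row with the same primary key is replaced."""
    key = tuple(row[column] for column in key_columns)
    table[key] = row


# The two 'script' inserts from the question (UUIDs abbreviated).
rows = [
    {"bundle_id": "8d8e8b6e", "component_type": "script", "component_id": "6558981e"},
    {"bundle_id": "8d8e8b6e", "component_type": "script", "component_id": "7fcf4402"},
]

two_column_key = {}    # PRIMARY KEY (bundle_id, component_type)
three_column_key = {}  # component_id added to the primary key

for row in rows:
    insert(two_column_key, ("bundle_id", "component_type"), row)
    insert(three_column_key, ("bundle_id", "component_type", "component_id"), row)

print(len(two_column_key), len(three_column_key))
```

With the two-column key only the second component_id survives; with component_id in the key both rows are kept.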

Can an index be created on a UUID Column?

Is it possible to create an index on a UUID/TIMEUUID column in Cassandra? I'm testing out a model design which would have an index on a UUID column, but queries on that column always return 0 rows found.
I have a table like this:
create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
I create an index with this command:
create index idx on some_data (run_id) ;
No errors are thrown by CQL when I create this index.
I have a small bit of test data in the table:
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-----------------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
However, when I run the query:
select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
CQLSH just returns: (0 rows)
If I use an int for the run_id then the index behaves as expected.
Yes, you can create a secondary index on a UUID. The real question is "should you?"
In any case, I followed your steps, and got it to work.
Connected to Test Cluster at 192.168.23.129:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
aploetz#cqlsh> use stackoverflow ;
aploetz#cqlsh:stackoverflow> create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
aploetz#cqlsh:stackoverflow> create index idx on some_data (run_id) ;
aploetz#cqlsh:stackoverflow> INSERT INTO some_data (site_id, user_id, run_id, value) VALUES (1,1,9e118af0-ac92-11e4-81ae-8d1bc921f26d,3);
aploetz#cqlsh:stackoverflow> select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
code=2200 [Invalid query] message="unconfigured columnfamily usr_rec3"
aploetz#cqlsh:stackoverflow> select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
(1 rows)
Notice though, that when I ran this command, it failed:
select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
Are you sure that you didn't mean to select from some_data instead?
Also, creating secondary indexes on high-cardinality columns (like a UUID) is generally not a good idea. If you need to query by run_id, then you should revisit your data model and come up with an appropriate query table to serve that.
Clarification:
Using secondary indexes in general is not considered good practice. In the new book Cassandra High Availability, Robbie Strickland identifies their use as an anti-pattern, due to poor performance.
Just because a column is of the UUID data type doesn't necessarily make it high-cardinality. That's more of a data model question for you. But knowing the nature of UUIDs and their underlying purpose toward being unique, is setting off red flags.
Put these two points together, and there isn't anything about creating an index on a UUID that sounds appealing to me. If it were my cluster, and (more importantly) I had to support it later, I wouldn't do it.
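A query table in this sense is a second table whose partition key is the column you want to look up by, written alongside the original. The Python sketch below models the idea with two dictionaries; some_data_by_run and record_run are hypothetical names, and a real implementation would issue two INSERT statements instead.

```python
# Model of the original table, keyed by ((site_id, user_id), run_id).
some_data = {}
# Hypothetical query table, keyed by run_id alone.
some_data_by_run = {}


def record_run(site_id, user_id, run_id, value):
    """Write the same row to both tables so each lookup path has its own key."""
    row = {"site_id": site_id, "user_id": user_id, "run_id": run_id, "value": value}
    some_data[((site_id, user_id), run_id)] = row
    some_data_by_run[run_id] = row


record_run(1, 1, "9e118af0-ac92-11e4-81ae-8d1bc921f26d", 3)

# Lookup by run_id now hits a key directly, with no secondary index.
print(some_data_by_run["9e118af0-ac92-11e4-81ae-8d1bc921f26d"])
```

The cost is one extra write per row, which is the usual trade in Cassandra data modelling.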

Column oriented database related

Folks,
I have recently started reading about NoSQL databases, as I am working on a data warehousing application.
I have the following questions. I have already read the basics.
Question 1) How is an entire row retrieved in a column-oriented database, given that data from the same column is stored together?
Let's say we store data in the following format, so that internally a column-oriented DB stores it like this:
test|test1 together and 5|10 together.
key 1 : { name : test, value : 5 }
key 2 : { name : test1 , value : 10 }
So if we have to retrieve the data for key1, how does that happen? (A and B are my guesses.)
A) If it has to pick data from each column's storage separately, it will be very costly.
B) Is there an indexing mechanism to fetch the data for all columns for a given row key?
Question 2 )
I was reading through some of the docs and found that a column-oriented database is better suited to running aggregation functions on a single column, as less I/O is needed.
I did not find proper support for aggregation functions like SUM, AVG, etc. in NoSQL column-oriented stores like Cassandra and HBase. (There could be some tweaking, hacking, or extra code involved, as in the questions below.)
How does Apache Cassandra do aggregate operations?
realtime querying/aggregating millions of records - hadoop? hbase? cassandra?
How to use hbase coprocessor to implement groupby?
Question 3) How do joins happen internally in a column-oriented database, and is it advisable to use them?
Nice question.
1) In Cassandra, if you are using cqlsh, the data looks much as it would if you stored it in MySQL or another RDBMS.
Connected to Test Cluster at localhost:9160.
[cqlsh 3.1.7 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> create keyspace test with replication={'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test ;
cqlsh:test> create table entry(key text PRIMARY KEY, name text, value int );
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key1', 'test',5);
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key2', 'test1',10);
cqlsh:test> select * from entry;
key | name | value
------+-------+-------
key1 | test | 5
key2 | test1 | 10
cqlsh:test>
Note:- you can select rows using key or using some criteria on other column by using secondary indexes.
But in hbase the structure will look like following
rowkey | column family | column | value
key1 | entry | name | test
key1 | entry | value | 5
key2 | entry | name | test1
key2 | entry | value | 10
Note:- you can select each row using the key or any column value; it's very easy.
2) Yes, NoSQL stores also support batch operations, but only for DML.
3) Joins are not supported in any of these NoSQL datastores. They are not meant for joins.
Hope it will help you.
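To make question 1 concrete, the HBase-style layout shown above can be treated as a list of cells, with a row rebuilt by collecting every cell that shares a row key. This Python sketch is only a model of that layout; get_row is a hypothetical helper, and the values are kept as strings as they appear in the table.

```python
# (rowkey, column family, column, value) cells from the layout above.
cells = [
    ("key1", "entry", "name", "test"),
    ("key1", "entry", "value", "5"),
    ("key2", "entry", "name", "test1"),
    ("key2", "entry", "value", "10"),
]


def get_row(cells, rowkey):
    """Rebuild one logical row by gathering all cells with the given row key."""
    return {column: value for key, _family, column, value in cells if key == rowkey}


print(get_row(cells, "key1"))
```

The sketch scans every cell for simplicity. The sample layout lists cells grouped by row key, and a store that keeps them in that order can read one row as a single contiguous range instead of visiting each column separately.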
