How to count columns from multiple rows - cassandra

How can I count number of columns in different rows of a column family?
I am a Cassandra newbie and do not know where to start. The only option I see is to make the application fetch data one row at a time, which does not sound right to me. I am using Hector to connect to Cassandra.

This is how you get the total column count for a particular row key:
// create the slice query first (String key/name/value serializers assumed)
SliceQuery<String, String, String> sliceQuery =
    HFactory.createSliceQuery(keyspace, StringSerializer.get(),
        StringSerializer.get(), StringSerializer.get());
sliceQuery.setColumnFamily("**your column family**");
sliceQuery.setKey("**your row key**");
sliceQuery.setRange(null, null, false, Integer.MAX_VALUE);
QueryResult<ColumnSlice<String, String>> result = sliceQuery.execute();
ColumnSlice<String, String> cs = result.get();
long noOfColumnsInRowKey = cs.getColumns().size();
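To cover several rows in one round trip instead of fetching each row separately, Hector also offers a multiget slice query. An uncompiled sketch along the same lines (the row keys and String serializers are assumptions):
MultigetSliceQuery<String, String, String> multigetQuery =
    HFactory.createMultigetSliceQuery(keyspace, StringSerializer.get(),
        StringSerializer.get(), StringSerializer.get());
multigetQuery.setColumnFamily("**your column family**");
multigetQuery.setKeys("rowkey1", "rowkey2", "rowkey3");
multigetQuery.setRange(null, null, false, Integer.MAX_VALUE);
QueryResult<Rows<String, String, String>> rowsResult = multigetQuery.execute();
// count the columns of each row in the result
for (Row<String, String, String> row : rowsResult.get()) {
    System.out.println(row.getKey() + ": " + row.getColumnSlice().getColumns().size());
}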

Assume that you have a wide row (let's create it using the CLI):
create column family cf3
with column_type = 'Standard' and
comparator = 'TimeUUIDType' and
key_validation_class = 'UTF8Type' and
default_validation_class = 'UTF8Type';
This is what I see in CQL3:
cqlsh:ks> desc table cf3;
CREATE TABLE cf3 (
key text,
column1 timeuuid,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
I inserted some values from CQL3, which makes it feel like good ol' MySQL:
cqlsh:ks> insert into cf3 (key, column1, value) values ('user1', now(), 'time5');
cqlsh:ks> select * from cf3;
key | column1 | value
-------+--------------------------------------+-------
user1 | f0c687b0-d114-11e2-8002-2f4261da0d90 | time1
user1 | fb9fa130-d114-11e2-8002-2f4261da0d90 | time2
user1 | 09512f10-d115-11e2-8002-2f4261da0d90 | time3
user1 | 0f5c93e0-d115-11e2-8002-2f4261da0d90 | time4
user1 | 21155220-d115-11e2-8002-2f4261da0d90 | time5
But it's still your wide row (as seen from the CLI):
[default@ks] list cf3;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: user1
=> (column=f0c687b0-d114-11e2-8002-2f4261da0d90, value=time1, timestamp=1370789864363000)
=> (column=fb9fa130-d114-11e2-8002-2f4261da0d90, value=time2, timestamp=1370789882563000)
=> (column=09512f10-d115-11e2-8002-2f4261da0d90, value=time3, timestamp=1370789905537000)
=> (column=0f5c93e0-d115-11e2-8002-2f4261da0d90, value=time4, timestamp=1370789915678000)
=> (column=21155220-d115-11e2-8002-2f4261da0d90, value=time5, timestamp=1370789945410000)
1 Row Returned.
Elapsed time: 105 msec(s).
Now, you wanted to count the number of columns from a given time onwards, right? Here is the CQL3 for that:
cqlsh:ks> select count(*) from cf3 where key = 'user1' and column1 >= 09512f10-d115-11e2-8002-2f4261da0d90 ;
count
-------
3
Now, I am somewhat doubtful about what goes on beneath. But my intuition says that all the columns actually get fetched at the coordinator node and counted in memory, which is probably similar to what you were planning to do manually on the client machine.
Also, I am not aware of cassandra-cli providing such functionality, but you mentioned you are using Hector. So you can leverage get_count or CountQuery as mentioned here, except with null as the range finish and a large count value. Like this:
CountQuery<String, UUID> cq = HFactory.createCountQuery(keyspace, StringSerializer.get(), TimeUUIDSerializer.get());
cq.setColumnFamily(cf).setKey("user1");
cq.setRange(startUuid, null, Integer.MAX_VALUE); // startUuid: the TimeUUID to count from
QueryResult<Integer> r = cq.execute();
(uncompiled code above)
HTH
Old answer:
See Hector documentation:
CQL:
CqlQuery<String,String,Long> cqlQuery = new CqlQuery<String,String,Long>(keyspace, se, se, le);
cqlQuery.setQuery("SELECT COUNT(*) FROM StandardLong1 WHERE KEY = 'cqlQueryTest_key1'");
QueryResult<CqlRows<String,String,Long>> result = cqlQuery.execute();
assertEquals(2, result.get().getAsCount());
(Here se and le are String and Long serializer instances, matching the generics.)
You can simply omit the WHERE condition and use LIMIT to achieve your purpose.
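For example, a sketch against the same StandardLong1 table; note that COUNT(*) in CQL is capped by the default LIMIT of 10000, so raise the LIMIT explicitly when rows can exceed that:
cqlQuery.setQuery("SELECT COUNT(*) FROM StandardLong1 LIMIT 100000");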

Related

Cassandra CLUSTERING ORDER BY is not working and showing incorrect results

Hi, I have created a table for storing data like this:
CREATE TABLE keyspace.test (
name text,
date text,
time double,
entry text,
details text,
PRIMARY KEY ((name, date), time)
) WITH CLUSTERING ORDER BY (time DESC);
And inserted data into the table. But a query like this gives an unordered result:
SELECT * FROM keyspace.test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Is there any problem with my table design?
I think you are misunderstanding Cassandra's clustering key order. Cassandra sorts data by the clustering key within a single partition.
That is, in your case Cassandra sorts data by the clustering key time within a single (name, date) partition.
Example: Let's insert some data
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 1, 'a');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 2, 'b');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-01', 3, 'c');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 0, 'nil');
INSERT INTO test (name , date , time , entry ) VALUES ('anand', '2017-04-02', 4, 'd');
If we select data with your query:
SELECT * FROM test where name ='anand' and date in ('2017-04-01','2017-04-02','2017-04-03','2017-04-05') ;
Output :
name | date | time | details | entry
-------+------------+------+---------+-------
anand | 2017-04-01 | 3 | null | c
anand | 2017-04-01 | 2 | null | b
anand | 2017-04-01 | 1 | null | a
anand | 2017-04-02 | 4 | null | d
anand | 2017-04-02 | 0 | null | nil
You can see that times 3, 2, 1 within the single partition anand:2017-04-01 are sorted descending, and times 4, 0 within the single partition anand:2017-04-02 are sorted descending. Cassandra does not take care of sorting across different partitions.
Here is the doc:
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is the position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Source : http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
By the way, why is your date field of type text and your time field of type double?
You could use the date type for the date field and timestamp for the time field.
The query that you are using is OK, but it probably doesn't behave as you are expecting because the coordinator will not sort the results across partitions. I have also run into this problem a couple of times.
The solution is very simple: it's far better to execute the four separate queries you need on the client and then merge the results there. In short, the IN operator puts a lot of pressure on the coordinator node in the cluster. There's a nice read on this subject:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
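A minimal sketch of that client-side approach with the DataStax Java driver (the contact point, the keyspace name ks, and the printed columns are assumptions):
import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PerPartitionQueries {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks"); // assumed keyspace name
        String[] dates = {"2017-04-01", "2017-04-02", "2017-04-03", "2017-04-05"};
        List<ResultSetFuture> futures = new ArrayList<>();
        // one query per partition, fired concurrently; each can be routed
        // straight to a replica instead of funnelling through one coordinator
        for (String date : dates) {
            futures.add(session.executeAsync(
                "SELECT * FROM test WHERE name = ? AND date = ?", "anand", date));
        }
        // each partition arrives already sorted by time DESC;
        // merge (or re-sort) on the client if a global order is needed
        for (ResultSetFuture future : futures) {
            for (Row row : future.getUninterruptibly()) {
                System.out.println(row.getString("date") + " " + row.getDouble("time"));
            }
        }
        cluster.close();
    }
}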

Cassandra CQL where clause with multiple collection values?

My data model:
tid | codes | raw | type
-------------------------------------+--------------+--------------+------
a64fdd60-1bc4-11e5-9b30-3dca08b6a366 | {12, 34, 53} | {sdafb=safd} | cmd
CREATE TABLE MyTable (
tid TIMEUUID,
type TEXT,
codes SET<INT>,
raw TEXT,
PRIMARY KEY (tid)
);
CREATE INDEX ON myTable (codes);
How do I query the table to return rows based on multiple set values?
This works:
select * from logData where codes contains 34;
But I want to get rows based on multiple set values, and none of these works:
select * from logData where codes contains 34, 12; or
select * from logData where codes contains 34 and 12; or
select * from logData where codes contains {34, 12};
Kindly assist.
If I create your table structure and insert a similar row to yours above, I can check for multiple values in the codes collection like this:
aploetz#cqlsh:stackoverflow2> SELECT * FROM mytable
WHERE codes CONTAINS 34
AND codes CONTAINS 12
ALLOW FILTERING;
tid | codes | raw | type
--------------------------------------+--------------+--------------+------
2569f270-1c06-11e5-92f0-21b264d4c94d | {12, 34, 53} | {sdafb=safd} | cmd
(1 rows)
Now as others have mentioned, let me also tell you why this is a terrible idea...
With a secondary index on the collection (and with the cardinality appearing to be fairly high) every node will have to be checked for each query. The idea with Cassandra, is to query by partition key as often as possible, that way you only have to hit one node per query. Apple's Richard Low wrote a great article called The sweet spot for Cassandra secondary indexes. It should make you re-think the way you use secondary indexes.
Secondly, the only way I could get Cassandra to accept this query was to use ALLOW FILTERING. What this means is that the only way Cassandra can apply all of your filtering criteria (WHERE clause) is to pull back every row and individually filter out the rows that do not meet your criteria. Horribly inefficient. To be clear, the ALLOW FILTERING directive is something that you should never use.
In any case, if codes are something that you will need to query by, then you should design an additional query table with codes as a part of the PRIMARY KEY.
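A hedged sketch of what such a query table could look like (the table name mytable_by_code is made up; the idea is one row per code per event, with the client reusing the same tid across the denormalized inserts):
CREATE TABLE mytable_by_code (
    code int,
    tid timeuuid,
    raw text,
    type text,
    PRIMARY KEY (code, tid)
);
-- a lookup by one code is now a single-partition read:
SELECT * FROM mytable_by_code WHERE code = 34;
For several codes, read each code's partition and intersect the results by tid on the client.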
The data model you are using is highly inefficient. Sets are meant to be used to get a set of data for a given primary key, not the other way round. If that is what is needed, you will have to rethink the model itself.
I would suggest creating different columns for each value you are using in a set and then using those columns as a composite primary key.
Are you really looking to get ALL log entries based on just codes? That could be quite a large dataset. Realistically, wouldn't you be looking at specific dates / date ranges? I'd key on that, and then use codes for filtering, or even filter on codes entirely on the client side.
If you have many codes, and you index on the sets, it might result in very high cardinality of the index, which would cause you issues. Whether you have your own lookup table, or use an index, remember that you essentially have a "table" where the pk is the value, and there are rows for that value for every "row" that matches the value. If that looks unacceptably large, then that's exactly what it is.
I'd recommend revisiting the requirement - again...do you really need all log entries EVER that match a certain code combination?
If you really do need to analyse the whole lot, then I'd recommend using Spark to run the job. You could then run a Spark job, and each node would deal with data on the same node; this will significantly reduce the impact compared to doing full table processing entirely in the application.
I know it's late. IMO the model, with a few minor changes, would be sufficient to achieve what is expected. What one can do is have as many rows as there are members of the power set of the set being queried.
CREATE TABLE data_points_ks.mytable (
codes frozen<set<int>>,
tid timeuuid,
raw text,
type text,
PRIMARY KEY (codes, tid)
) WITH CLUSTERING ORDER BY (tid ASC);
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34, 53}, '{sdafb=safd}', 'cmd');
A full scan then shows one row per subset:
tid | codes | raw | type
--------------------------------------+--------------+--------------+------
8ae81763-1142-11e8-846c-cd9226c29754 | {34, 53} | {sdafb=safd} | cmd
8746adb3-1142-11e8-846c-cd9226c29754 | {12, 53} | {sdafb=safd} | cmd
fea77062-1142-11e8-846c-cd9226c29754 | {34} | {sdafb=safd} | cmd
70ebb790-1142-11e8-846c-cd9226c29754 | {12, 34} | {sdafb=safd} | cmd
6c39c843-1142-11e8-846c-cd9226c29754 | {12} | {sdafb=safd} | cmd
65a954f3-1142-11e8-846c-cd9226c29754 | null | {sdafb=safd} | cmd
03c60433-1143-11e8-846c-cd9226c29754 | {53} | {sdafb=safd} | cmd
82f68d70-1142-11e8-846c-cd9226c29754 | {12, 34, 53} | {sdafb=safd} | cmd
Then the following queries are sufficient and do not need any filtering.
SELECT * FROM mytable
WHERE codes = {12, 34};
OR
SELECT * FROM mytable
WHERE codes = {34};
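Generating those subsets on the client is straightforward; here is a small self-contained Java sketch (a hypothetical helper, not part of any driver):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PowerSet {
    // returns all 2^n subsets of the given codes; the application would
    // then issue one INSERT per subset, as in the example above
    static List<Set<Integer>> powerSet(List<Integer> codes) {
        List<Set<Integer>> subsets = new ArrayList<>();
        for (int mask = 0; mask < (1 << codes.size()); mask++) {
            Set<Integer> subset = new HashSet<>();
            for (int i = 0; i < codes.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    subset.add(codes.get(i));
                }
            }
            subsets.add(subset);
        }
        return subsets;
    }

    public static void main(String[] args) {
        // {12, 34, 53} expands to the 8 subsets inserted above
        System.out.println(powerSet(Arrays.asList(12, 34, 53)));
    }
}
Keep the write amplification in mind: a set of n codes costs 2^n inserts, so this only pays off for small sets.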

How to delete a record in Cassandra?

I have a table like this:
CREATE TABLE mytable (
user_id int,
device_id ascii,
record_time timestamp,
timestamp timeuuid,
info_1 text,
info_2 int,
PRIMARY KEY (user_id, device_id, record_time, timestamp)
);
When I ask Cassandra to delete a record (an entry in the columnfamily) like this:
DELETE from my_table where user_id = X and device_id = Y and record_time = Z and timestamp = XX;
it returns without an error, but when I query again the record is still there. Now if I try to delete a whole row like this:
DELETE from my_table where user_id = X
It works and removes the whole row, and querying again immediately doesn't return any more data from that row.
What am I doing wrong? How can you remove a record in Cassandra?
Thanks
OK, here is my theory as to what is going on. You have to be careful with timestamps, because they store data down to the millisecond but only display it to the second. Take this sample table for example:
aploetz#cqlsh:stackoverflow> SELECT id, datetime FROM data;
id | datetime
--------+--------------------------
B25881 | 2015-02-16 12:00:03-0600
B26354 | 2015-02-16 12:00:03-0600
(2 rows)
The datetimes (of type timestamp) are equal, right? Nope:
aploetz#cqlsh:stackoverflow> SELECT id, blobAsBigint(timestampAsBlob(datetime)),
datetime FROM data;
id | blobAsBigint(timestampAsBlob(datetime)) | datetime
--------+-----------------------------------------+--------------------------
B25881 | 1424109603000 | 2015-02-16 12:00:03-0600
B26354 | 1424109603234 | 2015-02-16 12:00:03-0600
(2 rows)
As you are finding out, this becomes problematic when you use timestamps as part of your PRIMARY KEY. It is possible that your timestamp is storing more precision than it is showing you. And thus, you will need to provide that hidden precision if you will be successful in deleting that single row.
Anyway, you have a couple of options here. One, find a way to ensure that you are not entering more precision than necessary into your record_time. Or, you could define record_time as a timeuuid.
Again, it's a theory. I could be totally wrong, but I have seen people do this a few times. Usually it happens when they insert timestamp data using dateof(now()) like this:
INSERT INTO table (key, time, data) VALUES (1,dateof(now()),'blah blah');
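A sketch of the timeuuid alternative suggested above (an assumed remodel of the questioner's table, folding record_time and timestamp into one timeuuid column):
CREATE TABLE mytable (
    user_id int,
    device_id ascii,
    record_time timeuuid,
    info_1 text,
    info_2 int,
    PRIMARY KEY (user_id, device_id, record_time)
);
-- now() yields a unique timeuuid; SELECT the exact value back and pass it
-- to DELETE, so no precision is ever hidden from you.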
Here is another demonstration of the same precision trap:
CREATE TABLE worker_login_table (
worker_id text,
logged_in_time timestamp,
PRIMARY KEY (worker_id, logged_in_time)
);
INSERT INTO worker_login_table (worker_id, logged_in_time)
VALUES ("worker_1",toTimestamp(now()));
After 1 hour, execute the above insert statement once again, then:
select * from worker_login_table;
worker_id| logged_in_time
----------+--------------------------
worker_1 | 2019-10-23 12:00:03+0000
worker_1 | 2019-10-23 13:00:03+0000
(2 rows)
Query the table to get the absolute timestamp:
select worker_id, blobAsBigint(timestampAsBlob(logged_in_time )), logged_in_time from worker_login_table;
worker_id | blobAsBigint(timestampAsBlob(logged_in_time)) | logged_in_time
-----------+------------------------------------------------+--------------------------
  worker_1 |                                  1524109603000 | 2019-10-23 12:00:03+0000
  worker_1 |                                  1524209403234 | 2019-10-23 13:00:03+0000
(2 rows)
The below command will not delete the entry from Cassandra, as the precise value of the timestamp is required to delete the entry:
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='2019-10-23 12:00:03+0000';
By using the timestamp from the blob we can delete the entry from Cassandra:
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time = 1524209403234;

Cassandra DB: Why less than query failed?

I have created a KEYSPACE and a TABLE with a uuid column as primary key and an indexed timestamp column. All of this succeeded, as the following session shows:
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:10:30', '111' );
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:20:30', '222' );
cassandra#cqlsh:my_keyspace> select * from my_test;
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
69579f6f-bf88-493b-a1d6-2f89fac25650 | 2015-03-12 09:10:30+0000 | 111
(2 rows)
and now query
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time = '2015-03-12 09:20:30';
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
(1 rows)
and now query with less than:
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time < '2015-03-12 09:20:30';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_time < <value>'"
While the first query is successful, why did this happen? How should I make the second query work, since that's just what I want?
You can test all this on your own machine. Thanks. The schema:
CREATE TABLE my_test (
id uuid PRIMARY KEY,
insert_time timestamp,
value text
) ;
CREATE INDEX my_test_insert_time_idx ON my_keyspace.my_test (insert_time);
Cassandra range queries are quite limited. It comes down to performance and data storage mechanics. A range query must:
hit one partition key (or a few with IN), and include exact matches on all consecutive clustering keys except the last one in the query, which you can do a range query on.
Say your PK is (a, b, c, d), then the following are allowed:
where a=a1 and b < b1
where a=a1 and b=b1 and c < c1
The following is not:
where a=a1 and c < c1
[I won't go into Allow Filtering here...avoid it.]
Secondary index lookups must be exact matches; you can't run range queries on them.
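If range scans on insert_time are the main access pattern, the usual remodel is to make it a clustering column under some partition key; a sketch (the day bucket column and table name are assumptions, and id is kept in the key for uniqueness):
CREATE TABLE my_test_by_day (
    day text,
    insert_time timestamp,
    id uuid,
    value text,
    PRIMARY KEY (day, insert_time, id)
);
SELECT * FROM my_test_by_day
WHERE day = '2015-03-12' AND insert_time < '2015-03-12 09:20:30';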

Select a specific record in Cassandra using cql

This is the schema I use:
CREATE TABLE playerInfo (
key text,
column1 bigint,
column2 bigint,
column3 bigint,
column4 bigint,
column5 text,
value bigint,
PRIMARY KEY (key, column1, column2, column3, column4, column5)
)
WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Note I use a composite key. And there is a record like this:
key | column1 | column2 | column3 | column4 | column5 | value
----------+------------+---------+----------+---------+--------------------------------------------------+-------
Kitty | 1411 | 3 | 713 | 4 | American | 1
In cqlsh, how to select it? I try to use:
cqlsh:game> SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column5 = 'American';
but the output is:
Bad Request: PRIMARY KEY part column5 cannot be restricted (preceding part column4 is either not restricted or by a non-EQ relation)
Then how can I select such a cell?
You have chosen the primary key PRIMARY KEY (key, column1, column2, column3, column4, column5), so if you put a where clause on column5, you also need to specify where clauses for key, column1, column2, column3, and column4. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3 AND column3 = 713 AND column4 = 4 AND column5 = 'American';
If you put a where clause on column2, you also need where clauses for key and column1. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3;
If you want a where clause on a particular column of the primary key, then where clauses for all the preceding columns also need to be given. So you need to model your Cassandra data carefully to get good read and write performance and to satisfy your business needs too. If the model only satisfies your business logic, Cassandra's performance may not satisfy you; if it only chases performance, your business logic may not be satisfied. That is the beauty of Cassandra; it surely has room to improve.
There is a way to select rows based on columns that are not a part of the primary key by creating secondary index.
Let me explain this with an example.
In this schema:
CREATE TABLE playerInfo (
player_id int,
name varchar,
country varchar,
age int,
performance int,
PRIMARY KEY ((player_id, name), country)
);
the first part of the primary key, i.e. player_id and name, is the partition key. Its hash value determines which node in the Cassandra cluster this row will be written to.
Hence we need to specify both of these values in the where clause to fetch a record. For example:
SELECT * FROM playerinfo WHERE player_id = 1000 and name = 'Mark B';
player_id | name | country | age | performance
-----------+--------+---------+-----+-------------
1000 | Mark B | USA | 26 | 8
If the second part of your primary key contains more than two columns, you would have to specify values for all the columns on the left-hand side of the key, including that column.
In this example
PRIMARY KEY ((key, column1), column2, column3, column4, column5)
For filtering based on column3 you would have to specify values for key, column1, column2, and column3.
For filtering based on column5 you need to specify values for key, column1, column2, column3, column4, and column5.
But if your application demands using filtering on a particular columns which are not a part of the partition key you could create secondary indices on those columns.
To create an index on a column use the following command
CREATE INDEX player_age on playerinfo (age) ;
Now you can filter columns based on age.
SELECT * FROM playerinfo where age = 26;
player_id | name | country | age | performance
-----------+---------+---------+-----+-------------
2000 | Sarah L | UK | 26 | 24
1000 | Mark B | USA | 26 | 8
Be very careful about using indexes in Cassandra. Use them only if a table has few records or, more precisely, few distinct values in those columns.
Also, you can drop an index using
DROP INDEX player_age ;
Refer to http://wiki.apache.org/cassandra/SecondaryIndexes and http://www.datastax.com/docs/1.1/ddl/indexes for more details.
