Updating a Column in Cassandra based on Where Clause - cassandra

I have a very simple table
cqlsh:hell> describe columnfamily info ;
CREATE TABLE info (
nos int,
value map<text, text>,
PRIMARY KEY (nos)
)
The following is the query where I am trying to update the value.
update info set value = {'count' : '0' , 'onget' : 'function onget(value,count) { count++ ; return {"value": value, "count":count} ; }' } where nos <= 1000 ;
Bad Request: Invalid operator LTE for PRIMARY KEY part nos
Whatever operator I use for specifying the constraint, it complains that the operator is invalid. I am not sure what I am doing wrong here; according to the Cassandra CQL 3.0 doc, there are similar update queries.
The following is my version
[cqlsh 4.1.0 | Cassandra 2.0.3 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I have no idea what's going wrong.

The answer is really in my comment, but it needs a bit of elaboration. To restate from the comment...
The first predicate of the where clause has to uniquely identify the partition key. In your case, since the primary key is only one column, the partition key == the primary key.
Cassandra can't do range scans over partitions. In the language of CQL, a partition is a potentially wide storage row that is uniquely identified by a key: in this case, a value in your nos column. The values of the partition keys are hashed into tokens, which explicitly identify where that data lives in the cluster. Since that hash has no order to it, Cassandra cannot use any operator other than equality to route a statement to the correct destination. This isn't a primary key index that could potentially be updated; it is the fundamental partitioning mechanism in Cassandra. So you can't use inequality operators in the first clause of a predicate. You can use them in subsequent clauses, because by then the partition has been identified and you're dealing with an ordered set of columns.
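For example, restating the update with an equality predicate is accepted, one nos at a time (a minimal sketch reusing the question's table; the shortened map value is illustrative):
-- allowed: equality on the partition key identifies a single partition
update info set value = {'count' : '0'} where nos = 1000 ;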

You can't use non-equal condition on the partition key (nos is your partition key).
http://cassandra.apache.org/doc/cql3/CQL.html#selectWhere

Cassandra currently does not support user-defined functions inside a query such as the following.
update info set value = {'count' : '0' , 'onget' : 'function onget(value,count) { count++ ; return {"value": value, "count":count} ; }' } where nos <= 1000 ;
First, can you push this onget function into the application layer? You could first query all the rows where nos <= 1000, then update those rows via some batch query.
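A rough CQL sketch of that approach (a guess at the mechanics: Cassandra 2.0 can't range-filter a partition key without token(), so the rows are fetched and filtered client-side, and the nos values in the batch are illustrative):
-- fetch all rows; keep those with nos <= 1000 in the application
select nos, value from info ;
-- then rewrite the kept rows; individual map entries can be updated
begin batch
update info set value['count'] = '1' where nos = 1 ;
update info set value['count'] = '1' where nos = 2 ;
apply batch ;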
Otherwise, you can use a counter column for nos instead of an int data type. Notice, though, that you cannot mix the map data type with counter column families unless the non-counter columns are part of a composite key.
Also, you probably do not want nos, a column whose value changes, as the primary key.
CREATE TABLE info (
id UUID,
value map<text, text>,
PRIMARY KEY (id)
)
CREATE TABLE nos_counter (
info_id UUID,
nos COUNTER,
PRIMARY KEY (info_id)
)
Now you can update the nos counter like this.
update nos_counter set nos = nos + 1 where info_id = SOME_UUID ;

Related

Delete records in Cassandra table based on time range

I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called WriteTime that can be retrieved using the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and returns null for UDTs!). But because we don't have nested queries in CQL, you will need to write a program to fetch the data, filter out entries by WriteTime, and delete the entries whose WriteTime is older than your threshold. (Note that the writetime value is in microseconds, not milliseconds as in CQL's timestamp type.)
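For example, against the table from the question (clientId is a regular non-collection column, so writetime() applies to it):
SELECT documentId, sequenceNo, writetime(clientId) FROM TestTable ;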
The easiest way is to use Spark Cassandra Connector's RDD API, something like this:
val timestamp = someDate.toInstant.getEpochSecond * 1000000L // writetime values are microseconds since epoch
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
.select("prk1", "prk2", "reg_col".writeTime as "writetime")
.filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all the components of the primary key (documentId and sequenceNo in your case), and reg_col is any "regular" column of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns in select and in deleteFromCassandra is the same.

Cassandra does not support DELETE on indexed columns

Say I have a cassandra table xyz with the following schema :
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid));
I create index on columns fileid , sid:
CREATE INDEX file_index ON xyz (fileid);
CREATE INDEX sid_index ON xyz (sid);
I insert data :
INSERT INTO xyz (xyzid, name , fileid , sid ) VALUES ( now(), 'p120' , 1, 100);
INSERT INTO xyz (xyzid, name , fileid , sid ) VALUES ( now(), 'p120' , 1, 101);
INSERT INTO xyz (xyzid, name , fileid , sid ) VALUES ( now(), 'p122' , 2, 101);
I want to delete data using the indexed columns :
DELETE from xyz WHERE fileid=1 and sid=101;
Why do I get this error ?
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY fileid found in where clause"
Is it mandatory to specify the primary key in the where clause for delete queries?
Does Cassandra support deletes using secondary indexes?
What has to be done to delete data using secondary indexes?
Any suggestions would help.
I am using DataStax Community Cassandra 2.1.8, but I also want to know whether delete using indexed columns is supported by DataStax Community Cassandra 3.2.1.
Thanks
Let me try and answer your questions in order:
1) Yes, if you are going to use a where clause in a CQL statement, then the PARTITION KEY must be restricted by an equality operator in the where clause. Other than that, you are only allowed to filter on the clustering columns specified in your primary key. (Unless you have a secondary index.)
2) No it does not. See this post for some more information as it is essentially the same problem.
Why can cassandra "select" on secondary key, but not update using secondary key? (1.2.8+)
3) Why not add sid as a clustering column in your primary key? This would allow you to do the delete or query using both, as you have shown:
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid, sid));
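With that schema, the delete supplies the partition key plus the clustering value (the xyzid literal here is illustrative):
DELETE FROM xyz WHERE xyzid = 5132b130-ae79-11e5-a837-0800200c9a66 AND sid = 101;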
4) In general, using secondary indexes is considered an anti-pattern (a bit less so with SASI indexes in C* 3.4), so my question is: can you add these fields as clustering columns to your primary key? How are you querying these secondary indexes?
I suppose you can perform the delete in two steps:
1) Select data by the secondary index and get the primary key column values (xyzid) from the query result.
2) Perform the delete by those primary key values.
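A sketch of those two steps (ALLOW FILTERING is needed when combining the two indexed columns; the returned xyzid is a placeholder):
SELECT xyzid FROM xyz WHERE fileid = 1 AND sid = 101 ALLOW FILTERING;
-- then, for each xyzid the select returns:
DELETE FROM xyz WHERE xyzid = <returned xyzid>;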

Create a super column using CQL3

I am upgrading my thrift api to cql3. My data contains SuperColumns as follows:
- User //column family
  - Division/name //my row key
    - DivHead //SuperColumn
      - name //Columns
      - address //Columns
I understand that all the column families become tables, the primary key becomes the row key, and the rest are the columns. But my data has supercolumns. How do I create supercolumns using CQL3?
CREATE TABLE user (
rowkey varchar,
division text,
head_name text,
address text,
PRIMARY KEY (rowkey, division)
)
OR
CREATE TABLE user (
rowkey varchar,
division text,
head_name text,
head_address text,
PRIMARY KEY ((rowkey, division))
)
Under the covers, the first example will store every row with the same rowkey in the same partition. Each rowkey will have a set of logical rows, one for each division. Those rows will contain two columns: head_name and address. You can query based on the rowkey and get all divisions (sorted!). Or you can query a rowkey with a range of divisions, or a single division, and get a subset of the divisions with their division head and address.
The second example will have one partition for each rowkey and division combination. Each such partition will be one logical row as well. The single row for each composite key will have two columns: head_name and head_address. To make a query, you must provide BOTH the rowkey and the division.
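Example queries against the two layouts (the key values are illustrative):
-- first layout: one partition per rowkey, division is a clustering column
SELECT * FROM user WHERE rowkey = 'acme' ;                         -- all divisions, sorted
SELECT * FROM user WHERE rowkey = 'acme' AND division = 'sales' ;  -- one division
-- second layout: (rowkey, division) is the composite partition key, so both are required
SELECT * FROM user WHERE rowkey = 'acme' AND division = 'sales' ;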
EDIT: Cleared up some bad grammar.

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, so that SELECT date FROM myCF ; would return the most recently inserted date first.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any ways of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id, so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table, created to handle those queries, which orders only on date and not on user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
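Spelled out in full, that lookup table might look like the following (a sketch: the column types are assumed, and user_id is kept as a trailing clustering column so rows stay unique per user):
CREATE TABLE LookupByDate (
website_id int,
item_id int,
user_id int,
date timestamp,
PRIMARY KEY ((website_id, item_id), date, user_id)
) WITH CLUSTERING ORDER BY (date DESC, user_id ASC);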
In addition to your primary query, if all you are trying to get is "return the most recent inserted date", you may not need an additional table. You can use a "static column" to store the last update time per partition. See CASSANDRA-6561.
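A sketch of that static-column idea (requires Cassandra 2.0.6+, where CASSANDRA-6561 landed; the column name last_insert_date is hypothetical):
ALTER TABLE myCF ADD last_insert_date timestamp static;
-- write it alongside each insert; a static column is shared per partition,
-- so only the partition key appears in the where clause
UPDATE myCF SET last_insert_date = '2014-03-05 12:00:00'
WHERE website_id = 30 AND item_id = 10;
-- read the most recent date for a partition:
SELECT last_insert_date FROM myCF WHERE website_id = 30 AND item_id = 10 LIMIT 1;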
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key, because this creates an inefficient query that hits multiple nodes, putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.

How does a CQL3 composite index with 3 fields map in the thrift column family world?

After reading this blog at planetcassandra, I'm wondering how a CQL3 composite index with 3 fields maps in the thrift column family world. For example:
CREATE TABLE comments (
article_id uuid,
posted_at timestamp,
author text,
karma int,
content text,
PRIMARY KEY (article_id, posted_at)
)
Here the column article_id will be mapped to the internal row key and posted_at will be mapped to (the first part of) the cell name.
What if the table design will be
CREATE TABLE comments (
author_id varchar,
posted_at timestamp,
article_id uuid,
author text,
karma int,
content text,
PRIMARY KEY (author_id, posted_at, article_id)
)
Will the internal row key be mapped to the first 2 fields of the composite key, with article_id mapped to the cell name, essentially slicing for as many articles as there are (up to 2 billion entries), so that any query on an author_id and posted_at combination is one seek on the disk?
Is the behavior the same for any number of fields in a composite key?
Your answers are much appreciated.
The above observation is incorrect; the correct mapping is below.
I've personally verified:
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id = partition key, posted_at:article_id = cluster key
The first part of the composite key (author_id) is called the "partition key"; the rest (posted_at, article_id) are the remaining (clustering) keys.
Cassandra stores columns differently when composite keys are used. The partition key becomes the row key. The remaining keys are concatenated with each column name (":" as the separator) to form the column names. Column values remain unchanged.
The remaining keys (other than the partition key) are ordered, and it's not allowed to search on any random column: you have to start with the first one, and then you can move to the second one, and so on. This is evident from the "Bad Request" error.
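To make that concrete, a conceptual picture of the thrift-level layout for the second table (cell values elided; the ":"-joined names are illustrative):
-- row key: author_id
-- column names, sorted: <posted_at>:<article_id>:author,
--                       <posted_at>:<article_id>:karma,
--                       <posted_at>:<article_id>:content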
There's an excellent explanation by Aaron Morton at his site, thelastpickle.
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id + posted_at = partition key, article_id = cluster key
Hence, be mindful of the disk seeks if you go with the second method, and check that the row is not getting too wide and actually gives a real benefit compared to the first case.
If you aren't crossing the 2 billion column limit and are well within it, don't overdo it by adopting the 2nd method, as the dispersion of records happens on the combo key.
