Insert row with auto increment - cassandra

I create this table
CREATE TABLE table_test(
id uuid PRIMARY KEY,
title varchar,
description varchar
);
And I was expecting that, since id is a uuid, it would be auto-incremented automatically:
INSERT INTO table_test (title,description) VALUES ('a','B');
It throws this error:
com.datastax.driver.core.exceptions.InvalidQueryException: Some partition key parts are missing: id
Any idea what I'm doing wrong?

Use the now() function.
In the coordinator node, generates a new unique timeuuid in milliseconds when the statement is executed. The timestamp portion of the timeuuid conforms to the UTC (Universal Time) standard. This method is useful for inserting values. The value returned by now() is guaranteed to be unique.
There is no auto increment option in Cassandra.
The now() function returns a timeuuid; it is not an auto increment. It has two bigint parts, the MSB and the LSB. The MSB is the current timestamp and the LSB is a combination of the clock sequence and the node IP of the host. And it is universally unique.
Use the query below to insert:
INSERT INTO table_test (id,title,description) VALUES (now(),'a','B');
Source: http://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
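If you do not need the time component at all, Cassandra also provides a uuid() function (available from 2.0.7 onwards, as far as I know) that generates a random version-4 UUID server-side; a minimal sketch:
-- random UUID instead of a timeuuid; requires the uuid() function
INSERT INTO table_test (id, title, description) VALUES (uuid(), 'a', 'B');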

Related

Cassandra: Execution order of Insert statements in a Batch request

I have a table with the following schema:
CREATE TABLE IF NOT EXISTS data (
key TEXT,
created_at TIMEUUID,
value TEXT,
PRIMARY KEY (key, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
Data is append-only, and to get the value of a key we retrieve the record with the latest created_at using the query:
SELECT * FROM data WHERE key = ? ORDER BY created_at DESC LIMIT 1
In our application, we can insert multiple records for the same key using a batch statement as follows:
BEGIN BATCH
INSERT INTO data (key, created_at, value) VALUES ('MyKey', now(), 'value1');
INSERT INTO data (key, created_at, value) VALUES ('MyKey', now(), 'value2');
....
INSERT INTO data (key, created_at, value) VALUES ('MyKey', now(), 'value10');
APPLY BATCH;
What will be the latest record of this MyKey? Is the result deterministic?
I tested with random data and found that the value returned by the SELECT query is always the value from the last statement of the batch.
My assumption is that when the batch query is sent to the coordinator node, the statements are parsed by the QueryProcessor in the order they appear in the batch. For each statement, the native function now() is evaluated and a unique timeuuid is generated in increasing order.
now()
In the coordinator node, generates a new unique timeuuid in milliseconds when the statement is executed. The timestamp portion of the timeuuid conforms to the UTC (Universal Time) standard. This method is useful for inserting values. The value returned by now() is guaranteed to be unique.
Summary of findings regarding order of statements in a batch can be found in the comments.
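If the ordering must be deterministic regardless of how the coordinator evaluates now(), one option is to generate the timeuuids in the application and pass them explicitly; a minimal sketch (the timeuuid literals below are hypothetical, client-generated values):
BEGIN BATCH
-- each created_at is a version-1 (time-based) UUID generated by the client,
-- so the clustering order of the rows is known before the batch is sent
INSERT INTO data (key, created_at, value) VALUES ('MyKey', 5c0e6f80-e7e8-11e7-80c1-9a214cf093ae, 'value1');
INSERT INTO data (key, created_at, value) VALUES ('MyKey', 5c0e9d44-e7e8-11e7-80c1-9a214cf093ae, 'value2');
APPLY BATCH;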

How to store only most recent entry in Cassandra?

I have a Cassandra table like this:
create table test(imei text,dt_time timestamp, primary key(imei, dt_time)) WITH CLUSTERING ORDER BY (dt_time DESC);
Partition Key is: imei
Clustering Key is: dt_time
Now I want to store only the most recent entry in this table (on a time basis) for each partition key.
Let's say I am inserting entries into a table where there will be a single entry for each imei.
Now let's say for imei 98838377272 the stored dt_time is 2017-12-23 16:20:12. If for the same imei a dt_time like 2017-12-23 15:20:00 comes in,
then this entry should not be inserted into the Cassandra table.
But if a time like 2017-12-23 17:20:00 comes in, it should be inserted and the previous row should be replaced with this dt_time.
You can use the USING TIMESTAMP clause in your insert statement to mark data as the most recent:
Marks inserted data (write time) with TIMESTAMP. Enter the time since epoch (January 1, 1970) in microseconds. By default, Cassandra uses the actual time of write.
Remove dt_time from the primary key to store only one entry per imei, and then:
1. Insert data and specify the timestamp as 2017-12-23 16:20:12
2. Insert data and specify the timestamp as 2017-12-23 15:20:00
In this case, selecting by imei will return the record with the most recent timestamp (from point 1).
Please note, this approach will work if your dt_time (which will be specified as the write timestamp) is less than the current time. In other words, the select query will return records with the most recent timestamp, but before the current time. If you insert data with a timestamp greater than the current time, you will not see this data until that timestamp comes.
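A minimal sketch of that approach (assuming dt_time has been removed from the primary key, with the write timestamps given as microseconds since the epoch, UTC):
-- 16:20:12 is written with the higher timestamp, so it wins even if the
-- 15:20:00 insert arrives later
INSERT INTO test (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12') USING TIMESTAMP 1514046012000000;
INSERT INTO test (imei, dt_time) VALUES ('98838377272', '2017-12-23 15:20:00') USING TIMESTAMP 1514042400000000;
SELECT * FROM test WHERE imei = '98838377272'; -- returns the 16:20:12 row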
First, to store only the last entry in the table, you need to remove dt_time from the primary key - otherwise you'll get entries inserted into the DB for every timestamp.
Cassandra supports so-called lightweight transactions that allow you to check the data before inserting it.
So if you want to update the entry only if dt_time is less than the new time, then you can use something like this:
First insert data:
> insert into test(imei, dt_time) values('98838377272', '2017-12-23 15:20:12');
Then try to update the data with the same time (or a smaller one):
> update test SET dt_time = '2017-12-23 15:20:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 15:20:12';
[applied] | dt_time
-----------+---------------------------------
False | 2017-12-23 15:20:12.000000+0000
This will fail, as can be seen from [applied] being False. If I update it with a greater timestamp, it will be applied:
> update test SET dt_time = '2017-12-23 16:21:12' WHERE imei = '98838377272'
IF dt_time < '2017-12-23 16:21:12';
[applied]
-----------
True
There are several problems with this:
It will not work if the entry doesn't exist yet - in this case you may try to use INSERT ... IF NOT EXISTS before trying to update (see the sketch below), or pre-populate the database with imei numbers.
Lightweight transactions impose overhead on the cluster, as the data has to be read before writing; this can put significant load on the servers and decrease throughput.
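A minimal sketch of the INSERT ... IF NOT EXISTS fallback mentioned in the first point (same single-entry test table as above):
-- creates the row only when no row exists yet for this imei; if one is already
-- there, [applied] comes back False and the conditional UPDATE above applies
insert into test(imei, dt_time) values('98838377272', '2017-12-23 15:20:12') IF NOT EXISTS;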
Actually you cannot "update" a clustering key since it's part of the primary key, so you should remove the clustering key on dt_time.
Then you can update the row using a lightweight transaction which checks if the new value is after the existing value.
cqlsh:test> CREATE TABLE test1(imei text, dt_time timestamp, PRIMARY KEY (imei));
cqlsh:test> INSERT INTO test1 (imei, dt_time) VALUES ('98838377272', '2017-12-23 16:20:12');
cqlsh:test> SELECT * FROM test1;
imei | dt_time
-------------+---------------------------------
98838377272 | 2017-12-23 08:20:12.000000+0000
(1 rows)
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 15:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 15:20:00';
[applied] | dt_time
-----------+---------------------------------
False | 2017-12-23 08:20:12.000000+0000
cqlsh:test> UPDATE test1 SET dt_time='2017-12-23 17:20:00' WHERE imei='98838377272' IF dt_time < '2017-12-23 17:20:00';
[applied]
-----------
True
The update for '15:20:00' will return 'false' and tell you the current value.
The update for '17:20:00' will return 'true'.
Reference: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertLWT.html

Do I have a wide row?

I created a table with this statement:
CREATE TABLE history (
salt int,
tagName varchar,
day timestamp,
room int static,
component varchar static,
instance varchar static,
property varchar static,
offset int,
value float,
PRIMARY KEY ((salt,tagName,day), offset)
);
The goal is to have for each rowkey (salt, tagName, day)
One column for component, instance and property.
One column for each offset with value as column value.
Day is just the current day (e.g. '2016-06-08'), not the current timestamp.
Salt will be very small. It is there to avoid exceeding row size if data is sampled very fast
I wanted to check my schema with the thrift client but it is no longer installed with the 3.5 version I have.
Is my schema correct for my goal? Is there a way to see the actual 'physical' rows with cqlsh?
Thanks!
The cassandra-cli equivalent of your CQL table will be:
RowKey (salt:tagName:day)
column(offsetvalue:, value=, timestamp=sometimestamp)
column(offsetvalue:room, value=roomValue, timestamp=sometimestamp)
column(offsetvalue:component, value=componentValue, timestamp=sometimestamp)
column(offsetvalue:instance, value=instanceValue, timestamp=sometimestamp)
column(offsetvalue:property, value=propertyValue, timestamp=sometimestamp)
column(offsetvalue:value, value=valueValue, timestamp=sometimestamp)
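To make the mapping concrete, a small sketch (with hypothetical values): the static columns are written once per (salt, tagName, day) partition, while each offset becomes its own clustered cell under the same row key:
-- statics: stored once for the whole partition
INSERT INTO history (salt, tagName, day, room, component, instance, property)
VALUES (0, 'tag1', '2016-06-08', 12, 'comp', 'inst', 'prop');
-- per-offset values: one clustered entry each, all under row key 0:tag1:2016-06-08
INSERT INTO history (salt, tagName, day, offset, value) VALUES (0, 'tag1', '2016-06-08', 0, 1.5);
INSERT INTO history (salt, tagName, day, offset, value) VALUES (0, 'tag1', '2016-06-08', 1, 1.7);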

Updating a Column in Cassandra based on Where Clause

I have a very simple table
cqlsh:hell> describe columnfamily info ;
CREATE TABLE info (
nos int,
value map<text, text>,
PRIMARY KEY (nos)
)
The following is the query where I am trying to update the value.
update info set value = {'count' : '0' , 'onget' : 'function onget(value,count) { count++ ; return {"value": value, "count":count} ; }' } where nos <= 1000 ;
Bad Request: Invalid operator LTE for PRIMARY KEY part nos
Whichever operator I use for specifying the constraint, it complains about an invalid operator. I am not sure what I am doing wrong here; according to the Cassandra 3.0 CQL docs, there are similar update queries.
The following is my version
[cqlsh 4.1.0 | Cassandra 2.0.3 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I have no idea what's going wrong.
The answer is really in my comment, but it needs a bit of elaboration. To restate from the comment...
The first predicate of the where clause has to uniquely identify the partition key. In your case, since the primary key is only one column, the partition key == the primary key.
Cassandra can't do range scans over partitions. In the language of CQL, a partition is a potentially wide storage row that is uniquely identified by a key - in this case, the values in your nos column. The values of the partition keys are hashed into tokens which explicitly identify where that data lives in the cluster. Since that hash has no order to it, Cassandra cannot use any operator other than equality to route a statement to the correct destination. This isn't a primary key index that could potentially be updated; it is the fundamental partitioning mechanism in Cassandra. So you can't use inequality operators as the first clause of a predicate. You can use them in subsequent clauses, because the partition has been identified and now you're dealing with an ordered set of columns.
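To illustrate, a minimal sketch of statements the coordinator can route (same info table as the question; the map entries are just examples): the partition key appears only with equality, one statement per partition.
-- equality lets the coordinator hash nos to a token and route the write;
-- each partition has to be addressed individually
update info set value = value + {'count' : '0'} where nos = 1;
update info set value = value + {'count' : '0'} where nos = 2;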
You can't use non-equal condition on the partition key (nos is your partition key).
http://cassandra.apache.org/doc/cql3/CQL.html#selectWhere
Cassandra currently does not support user defined functions inside a query such as the following.
update info set value = {'count' : '0' , 'onget' : 'function onget(value,count) { count++ ; return {"value": value, "count":count} ; }' } where nos <= 1000 ;
First, can you push this onget function into the application layer? You could first query all the rows where nos < 1000, then increment those rows via some batch of queries.
Otherwise, you can use a counter column for nos, not an int data type. Notice though, you cannot mix the map data type with counter column families unless the non-counter columns are part of a composite key.
Also, you probably do not want to have nos, a column that changes value, as the primary key.
CREATE TABLE info (
id UUID,
value map<text, text>,
PRIMARY KEY (id)
)
CREATE TABLE nos_counter (
info_id UUID,
nos COUNTER,
PRIMARY KEY (info_id)
)
Now you can update the nos counter like this.
update nos_counter set nos = nos + 1 where info_id = SOME_UUID;
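For completeness, reading the counter back is a plain select on the new table (SOME_UUID is a placeholder, as above):
select nos from nos_counter where info_id = SOME_UUID;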

Cassandra range slicing on composite key

I have a column family with a composite key like this:
CREATE TABLE sometable(
keya varchar,
keyb varchar,
keyc varchar,
keyd varchar,
value int,
date timestamp,
PRIMARY KEY (keya,keyb,keyc,keyd,date)
);
What I need to do is to
SELECT * FROM sometable
WHERE
keya = 'abc' AND
keyb = 'def' AND
date < '2014-01-01'
And that is giving me this error
Bad Request: PRIMARY KEY part date cannot be restricted (preceding part keyd is either not restricted or by a non-EQ relation)
What's the best way to solve this? Do I need to alter my columnfamily?
I also need to query that table with all of keya, keyb, keyc, and date.
You cannot do it in Cassandra. Moreover, such a range slice is costly. You are trying to slice through a set of equalities that have lower priority according to your schema.
I also need to query those table with all keya, keyb, keyc, and date.
If you are considering how to solve this problem, consider having this schema. What I would suggest is to keep the keys in a separate table (the table names below are placeholders):
create table key_lookup (
id timeuuid,
keyType text,
primary key (id, keyType)
);
Use the timeuuid to store the values and do a range scan based on that.
create table key_values (
prevTableId timeuuid,
value int,
date timestamp,
primary key (prevTableId, date)
);
I guess that in this way your table is normalized for better scalability in your use case, and it may save a lot of disk space if the keys are repetitive too.
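As a side note (standard Cassandra query-table modeling, not part of the answer above): another common option is a second, query-specific table where date immediately follows the two columns restricted by equality, so the original range slice works directly:
CREATE TABLE sometable_by_date (
keya varchar,
keyb varchar,
keyc varchar,
keyd varchar,
value int,
date timestamp,
PRIMARY KEY ((keya, keyb), date, keyc, keyd)
);
-- SELECT * FROM sometable_by_date WHERE keya = 'abc' AND keyb = 'def' AND date < '2014-01-01';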
