Cassandra: Execution order of Insert statements in a Batch request

I have a table with the following schema:
CREATE TABLE IF NOT EXISTS data (
key TEXT,
created_at TIMEUUID,
value TEXT,
PRIMARY KEY (key, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
Data is append-only. To get the value of a key, we retrieve the record with the latest created_at using the query:
SELECT * FROM data WHERE key = ? ORDER BY created_at DESC LIMIT 1
In our application, we can insert multiple records for the same key using a batch statement as follows:
BEGIN BATCH
INSERT INTO data (key, created_at, value) VALUES ('MyKey', now(), 'value1');
INSERT INTO data (key, created_at, value) VALUES ('MyKey', now(), 'value2');
....
INSERT INTO data (key, created_at, value) VALUES ('MyKey', now(), 'value10');
APPLY BATCH;
What will be the latest record of this MyKey? Is the result deterministic?
I tested with random values and found that the SELECT query always returns the value from the last statement of the batch.
My assumption is that when the batch query is sent to the coordinator node, the statements are parsed by QueryProcessor in the order they appear in the batch. For each statement, the native function now() is then evaluated and a unique timeuuid is generated, in increasing order.
now()
In the coordinator node, generates a new unique timeuuid in milliseconds when the statement is executed. The timestamp portion of the timeuuid conforms to the UTC (Universal Time) standard. This method is useful for inserting values. The value returned by now() is guaranteed to be unique.
Summary of findings regarding order of statements in a batch can be found in the comments.
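Which record counts as "latest" comes down to how timeuuid values compare: by their embedded 60-bit timestamp first. The following is only a rough sketch in Python, not Cassandra's implementation. It builds version-1 UUIDs from explicit timestamps and picks the newest one, the way ORDER BY created_at DESC LIMIT 1 would surface it. The helper names make_timeuuid and latest, the base tick value, the clock sequence and the node value are all made up for the illustration. It models the comparison only; it does not show that now() is evaluated in statement order inside a batch.

```python
import uuid

def make_timeuuid(ticks, clock_seq=0, node=0x123456789ABC):
    """Build a version-1 UUID from a 60-bit count of 100 ns ticks since 1582-10-15."""
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000      # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))

def latest(rows):
    """Return the (timeuuid, value) pair with the highest embedded timestamp."""
    return max(rows, key=lambda row: row[0].time)

# Hypothetical batch: ten records for the same key, each one tick later than the previous.
base = 137_000_000_000_000_000
rows = [(make_timeuuid(base + i), f"value{i + 1}") for i in range(10)]
print(latest(rows)[1])
```

Note that the sketch compares the extracted timestamp (the .time attribute), not the UUID's string or byte form, because the timestamp bits are not stored in most-significant-first order inside a version-1 UUID.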

Related

Delete records in Cassandra table based on time range

I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called WriteTime that can be retrieved using the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and returns null for UDTs!). But because CQL doesn't support nested queries, you will need to write a program that fetches the data, filters entries by WriteTime, and deletes the entries whose WriteTime is older than your threshold. (Note that the value of writetime is in microseconds, not milliseconds as in CQL's timestamp type.)
The easiest way is to use Spark Cassandra Connector's RDD API, something like this:
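import com.datastax.spark.connector._

// writetime is in microseconds, so convert the cutoff from epoch seconds accordingly
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
  .select("prk1", "prk2", "reg_col".writeTime as "writetime")
  .filter(row => row.getLong("writetime") < timestamp)
// delete the matching rows by their full primary key
oldData.deleteFromCassandra(srcKeyspace, srcTable,
  keyColumns = SomeColumns("prk1", "prk2"))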
where prk1, prk2, ... are all components of the primary key (documentId and sequenceNo in your case), and reg_col is any of the "regular" columns of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns is the same in select and deleteFromCassandra.
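Since the question asks about a time range rather than a single cutoff, here is a small sketch of just the filtering step in plain Python, without any driver. It assumes the rows have already been fetched with writetime(clientId) selected under the alias wt; the sample rows, the alias and the helper names are hypothetical. The main point is the unit conversion to microseconds.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_writetime_micros(dt):
    """Convert an aware datetime to microseconds since the Unix epoch, the unit writetime() uses."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds

def keys_written_between(rows, start, end):
    """Return the primary keys of rows whose write time falls in [start, end)."""
    lo, hi = to_writetime_micros(start), to_writetime_micros(end)
    return [(r["documentId"], r["sequenceNo"]) for r in rows if lo <= r["wt"] < hi]

# Hypothetical rows, as if fetched with:
#   SELECT documentId, sequenceNo, writetime(clientId) AS wt FROM TestTable
rows = [
    {"documentId": "doc1", "sequenceNo": 1, "wt": 1_382_572_800_000_000},
    {"documentId": "doc1", "sequenceNo": 2, "wt": 1_382_655_211_694_000},
    {"documentId": "doc2", "sequenceNo": 1, "wt": 1_382_745_600_000_000},
]
start = datetime(2013, 10, 24, 12, 0, tzinfo=timezone.utc)
end = datetime(2013, 10, 26, 0, 0, tzinfo=timezone.utc)
print(keys_written_between(rows, start, end))
```

The returned keys would then be passed to DELETE statements that specify the full primary key. The range is half-open, so a row written exactly at the end instant is kept.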

How it is ordered if I query by secondary index?

I created this table on cassandra.
CREATE TABLE user_event(
userId bigint,
type varchar,
createdAt timestamp,
PRIMARY KEY ((userId), createdAt)
) WITH CLUSTERING ORDER BY (createdAt DESC);
CREATE INDEX user_event_type ON user_event(type);
If I query by userId, the query result will be ordered by the createdAt column:
SELECT * FROM user_event WHERE userId = 1;
But how is it ordered if I query by type? Can I get the last SIGN_IN event?
SELECT * FROM user_event WHERE userId = 1 AND type = 'SIGN_IN' LIMIT 1;
Is there any guarantee that the result is ordered by createdAt?
The key to understanding this scenario is to remember that result set order can only be enforced within a partition. As you are still querying by partition key (userId), all data within each partition will still be ordered by createdAt (descending).
"Guarantee" is a strong word, and one that I am hesitant to use. Results queried this way should maintain their on-disk sort order. I would definitely test it out. But as long as you provide userId as part of the query, the results should be returned sorted by createdAt.

Insert row with auto increment

I created this table:
CREATE TABLE table_test(
id uuid PRIMARY KEY,
title varchar,
description varchar
);
I was expecting that, since id is a uuid, it would be auto-incremented:
INSERT INTO table_test (title,description) VALUES ('a','B');
It throws an error:
com.datastax.driver.core.exceptions.InvalidQueryException: Some partition key parts are missing: id
Any idea what I'm doing wrong?
Use the now() function:
In the coordinator node, generates a new unique timeuuid in milliseconds when the statement is executed. The timestamp portion of the timeuuid conforms to the UTC (Universal Time) standard. This method is useful for inserting values. The value returned by now() is guaranteed to be unique.
There is no auto-increment option in Cassandra.
The now() function returns a timeuuid, which is not an auto-increment value. It consists of two 64-bit parts, the MSB and the LSB. The MSB holds the current timestamp (along with the UUID version bits), and the LSB is a combination of the clock sequence and the node IP of the host. It is universally unique.
Use the query below to insert:
INSERT INTO table_test (id,title,description) VALUES (now(),'a','B');
Source : http://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
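To make the MSB/LSB description a bit more concrete, here is a minimal Python sketch that splits a version-1 UUID into its two 64-bit halves and reads the embedded timestamp back out. It uses a locally generated uuid1() as a stand-in for a value returned by now(), so the node part will not match what Cassandra would produce. The helper names are made up for the example.

```python
import uuid

# Number of 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
GREGORIAN_TO_UNIX_TICKS = 0x01B21DD213814000

def split_timeuuid(u):
    """Return (msb, lsb), the two unsigned 64-bit halves of a UUID."""
    return u.int >> 64, u.int & 0xFFFFFFFFFFFFFFFF

def unix_millis(u):
    """Extract the embedded timestamp of a version-1 UUID as Unix milliseconds."""
    return (u.time - GREGORIAN_TO_UNIX_TICKS) // 10_000

u = uuid.uuid1()  # generated locally; stands in for a value returned by now()
msb, lsb = split_timeuuid(u)
print(hex(msb), hex(lsb), unix_millis(u))
```

The version number sits in bits 12 to 15 of the MSB, so (msb >> 12) & 0xF is 1 for any timeuuid. Two values generated at different times differ in the MSB, but nothing about them is sequential in the auto-increment sense.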

Order in Limited query with composite keys on cassandra

In the following scenario:
CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id,date),event_time)
)WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 07:01:00','72F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 08:01:00','74F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 07:01:00','73F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 08:01:00','76F');
If I do the following query:
SELECT *
FROM temperature_by_day
WHERE weatherstation_id='1234ABCD'
AND date in ('2013-04-04', '2013-04-03') limit 2;
I noticed that Cassandra orders the result by the same sequence as the partition keys in the IN clause. In this case, I'd like to know whether the expected result is ALWAYS the two records of the day '2013-04-04'. That is, does Cassandra respect the order of the IN clause when ordering the result, even in a scenario with multiple nodes?

How to get current timestamp with CQL while using Command Line?

I am trying to insert into my CQL table from the command line, and I am able to insert everything. But if I have a timestamp column, how can I insert into it from the command line? Basically, I want to insert the current timestamp whenever I insert into my CQL table.
Currently, I am hard-coding the timestamp whenever I insert into the CQL table below:
CREATE TABLE TEST (ID TEXT, NAME TEXT, VALUE TEXT, LAST_MODIFIED_DATE TIMESTAMP, PRIMARY KEY (ID));
INSERT INTO TEST (ID, NAME, VALUE, LAST_MODIFIED_DATE) VALUES ('1', 'elephant', 'SOME_VALUE', 1382655211694);
Is there any way to get the current timestamp using some predefined functions in CQL so that while inserting into above table, I can use that method to get the current timestamp and then insert into above table?
You can use the timeuuid functions now() and dateof() (or in later versions of Cassandra, toTimestamp()), e.g.,
INSERT INTO TEST (ID, NAME, VALUE, LAST_MODIFIED_DATE)
VALUES ('2', 'elephant', 'SOME_VALUE', dateof(now()));
The now function takes no arguments and generates a new unique timeuuid (at the time where the statement using it is executed). The dateOf function takes a timeuuid argument and extracts the embedded timestamp. (Taken from the CQL documentation on timeuuid functions).
Cassandra >= 2.2.0-rc2
dateof() was deprecated in Cassandra 2.2.0-rc2. For later versions you should replace its use with toTimestamp(), as follows:
INSERT INTO TEST (ID, NAME, VALUE, LAST_MODIFIED_DATE)
VALUES ('2', 'elephant', 'SOME_VALUE', toTimestamp(now()));
In newer versions of Cassandra you can use toTimestamp(now()); note that the dateof function is deprecated.
For example:
insert into dummy(id, name, size, create_date) values (1, 'Eric', 12, toTimestamp(now()));
There are actually 2 different ways for different purposes to insert the current timestamp. From the docs:
Inserting the current timestamp
Use functions to insert the current date into date or timestamp fields as follows:
Current date and time into timestamp field: toTimestamp(now()) sets the timestamp to the current time of the coordinator.
Current date (midnight) into timestamp field: toTimestamp(toDate(now())) sets the timestamp to the current date beginning of day (midnight).
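The difference between the two forms is easier to see on a concrete value. The sketch below is plain Python, not CQL. It takes the hard-coded epoch-millisecond value from the question's INSERT and shows the full timestamp next to the same value truncated to the start of the day, assuming midnight is taken in UTC. The function names are hypothetical.

```python
from datetime import datetime, timezone

MILLIS_PER_DAY = 86_400_000

def start_of_day_millis(epoch_millis):
    """Truncate epoch milliseconds to midnight UTC of the same day."""
    return epoch_millis - epoch_millis % MILLIS_PER_DAY

def describe(epoch_millis):
    """Render epoch milliseconds as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc).isoformat()

sample = 1382655211694  # the hard-coded value from the question's INSERT
print(describe(sample))                       # analogous to toTimestamp(now())
print(describe(start_of_day_millis(sample)))  # analogous to toTimestamp(toDate(now()))
```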
