How to manipulate timestamp columns in Apache Cassandra

I have a table with a timestamp column, and I'd like to manipulate the values of that column. For instance, I need to do something along the lines of:
UPDATE mytable SET datetimecolumn = datetimecolumn + 10mins
How is it done in Apache Cassandra?
UPDATE: The answer seems to be "you can't". But the selected answer is the closest we can get apparently.

You can run a query like this one only if the column's data type is counter.
Using a counter:
A counter is a special column used to store a number that is changed in increments. For example, you might use a counter column to count the number of times a page is viewed.
Define a counter in a dedicated table only and use the counter data type. You cannot index, delete, or re-add a counter column. All non-counter columns in the table must be defined as part of the primary key.
Example :
CREATE TABLE mytable (
pk1 int PRIMARY KEY,
datetimecolumn counter
);
Here you have to store the datetimecolumn value in milliseconds.
The first time, run an update query with the time in milliseconds, for example 1487686182403:
UPDATE mytable SET datetimecolumn = datetimecolumn + 1487686182403 where pk1 = 1
Now the row in mytable with pk1 = 1 has datetimecolumn = 1487686182403.
To increment datetimecolumn by 10 minutes (600000 milliseconds):
UPDATE mytable SET datetimecolumn = datetimecolumn + 600000 where pk1 = 1
Source : https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
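The millisecond arithmetic used with the counter can be checked on the client before sending the update. The following is a minimal Python sketch using only the standard library; the helper names `to_epoch_millis` and `from_epoch_millis` are hypothetical and not part of any Cassandra driver.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)

def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)

# Delta to pass to: UPDATE mytable SET datetimecolumn = datetimecolumn + ? ...
ten_minutes_ms = timedelta(minutes=10) // timedelta(milliseconds=1)

stored = 1487686182403             # value from the example above
shifted = stored + ten_minutes_ms  # what the counter holds after the increment
print(ten_minutes_ms, shifted, from_epoch_millis(shifted).isoformat())
```

Keeping the conversion in one place makes it harder to mix up seconds and milliseconds when computing the increment.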

Read-before-write is an anti-pattern in Cassandra. You have to manipulate the value on the client side and update as usual.
In other words: you have to select the value, change it (increment it by 10 minutes), and write the new value back to Cassandra.
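The client-side step of that select, change, update cycle might look like the Python sketch below. It covers only the arithmetic; the function name `shift_timestamp` and the sample value are hypothetical, and the actual SELECT and UPDATE calls are deliberately left out because they depend on the driver in use.

```python
from datetime import datetime, timedelta, timezone

def shift_timestamp(current: datetime, minutes: int) -> datetime:
    """Return the timestamp moved forward by the given number of minutes."""
    return current + timedelta(minutes=minutes)

# Value as it might come back from a SELECT (hypothetical sample).
old_value = datetime(2017, 2, 21, 14, 9, 42, tzinfo=timezone.utc)

# New value to bind in a regular UPDATE ... SET datetimecolumn = ? statement.
new_value = shift_timestamp(old_value, 10)
print(new_value.isoformat())
```

Note that nothing makes the read and the write atomic here: if another client changes the row between the two statements, that change is overwritten.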

Related

Storing time specific data in cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows:
Find all rows with start_time<=current_time.
Then find the value with the maximum start_time among the rows obtained in the first step.
PS: Edited the question to make it clearer.
The exact requirements cannot be met as stated, but we can get close with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering key of your table.
Then, you need a different partition key. You could choose a fixed value, but that causes problems once the partition holds too many rows. It is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key:
Find all rows with start_time<=current_time:
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
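Select the value with the maximum start_time. Because rows are clustered by start_time in descending order, the first row that satisfies the restriction is the latest one:
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time LIMIT 1;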
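The lookup the question asks for (all rows with start_time <= current_time, then the one with the greatest start_time) can be illustrated in memory. The Python sketch below is not driver code; the function name `current_value` and the sample rows are hypothetical, and the rows are assumed to be sorted by start_time in ascending order.

```python
from bisect import bisect_right
from datetime import datetime

def current_value(rows, now):
    """Return the value whose start_time is the latest one not after `now`.

    `rows` is a list of (start_time, value) tuples sorted by start_time
    in ascending order. Returns None when every row starts after `now`.
    """
    start_times = [start for start, _ in rows]
    # Index of the first row whose start_time is strictly greater than `now`.
    idx = bisect_right(start_times, now)
    if idx == 0:
        return None
    return rows[idx - 1][1]

# Hypothetical sample data.
rows = [
    (datetime(2017, 1, 1), "a"),
    (datetime(2017, 3, 1), "b"),
    (datetime(2017, 6, 1), "c"),
]
print(current_value(rows, datetime(2017, 4, 15)))
```

With a year-based partition key, a client would likely also have to fall back to the previous year when the current year's partition has no row at or before the requested time.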
Alternatively, create two separate tables like the ones below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now insert data into both tables. For the second table, use a static partition value such as 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
In the current_value table each insert is an upsert, so a select always returns the latest value.

Performance difference between SELECT sum(column_name) FROM and SELECT column_name in CQL

I would like to know the performance difference between the following two queries for a table cycling.cyclist_points containing thousands of rows:
SELECT sum(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
select *
from cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
If sum(race_points) causes the query to be expensive, I will have to look for other solutions.
Performance difference between your queries:
Both queries need to scan the same number of rows (the number of rows in that partition).
The first query selects only a single column, so it is a little faster.
Instead of calculating the sum at run time, try to precompute it.
If race_points is an int or bigint, use a counter table like the one below:
CREATE TABLE race_points_counter (
id uuid PRIMARY KEY,
sum counter
);
Whenever new data is inserted into cyclist_points, also increment the sum by the new points:
UPDATE race_points_counter SET sum = sum + ? WHERE id = ?
Now you can just select the sum for that id:
SELECT sum FROM race_points_counter WHERE id = ?
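The trade-off between summing at read time and maintaining a running total at write time can be shown with a small in-memory Python sketch. The class `PointsStore` is hypothetical and only stands in for the two tables; it does not talk to Cassandra.

```python
from collections import defaultdict

class PointsStore:
    """In-memory stand-in for cyclist_points plus race_points_counter."""

    def __init__(self):
        self.rows = defaultdict(list)    # plays the role of cyclist_points
        self.totals = defaultdict(int)   # plays the role of race_points_counter

    def insert(self, cyclist_id, race_points):
        # Every write also bumps the running total, like the counter UPDATE.
        self.rows[cyclist_id].append(race_points)
        self.totals[cyclist_id] += race_points

    def total_at_read_time(self, cyclist_id):
        # Equivalent of SELECT sum(race_points): touches every row.
        return sum(self.rows[cyclist_id])

    def total_precomputed(self, cyclist_id):
        # Equivalent of reading the counter: a single lookup.
        return self.totals[cyclist_id]

store = PointsStore()
for points in (120, 75, 43):
    store.insert("cyclist-1", points)
print(store.total_at_read_time("cyclist-1"), store.total_precomputed("cyclist-1"))
```

Both methods return the same number; the difference is that the precomputed total costs one extra write per insert instead of a scan per read.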

How do I select everything where two columns contain equal values in CQL?

I'm trying to select everything where two columns contain equal values. Here is my CQL query:
select count(someColumn) from somekeySpace.sometable where columnX = columnY
This doesn't work. How can I do this?
You can't query like that; Cassandra doesn't support it.
You can do this in a different way.
First, create a separate counter table:
CREATE TABLE match_counter(
partition int PRIMARY KEY,
count counter
);
When inserting into your main table, increment the value here if columnX = columnY. Because you have only a single count, you can use a static partition value:
UPDATE match_counter SET count = count + 1 WHERE partition = 1;
Now you can get the count of matching rows:
SELECT * FROM match_counter WHERE partition = 1;

Cassandra OperationTimedOut

select count (*) from my_table gives me OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried to change the values in request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample. (Both are in C:\Programs\DataStax-DDC\apache-cassandra\conf) But without success.
How can I increase the timeout?
select count (*) is not doing what you think. It is actually expensive, as it counts the rows one by one. You can track the number of records using a separate column family with a counter, which you need to increment for every insert into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
mykey text,
count counter,
PRIMARY KEY (mykey)
);
Then, for every insert into your table, do a counter update:
INSERT into my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?
Note that counters are not idempotent, so in the rare event of a failure your data might be under- or over-counted. Also, the code above assumes that you only insert with a new key.
If you need precise counting, Cassandra may not be a good fit. Also, if you are not inserting with unique keys, consider using a lightweight transaction for the insert (IF NOT EXISTS) and updating the counter only if the transaction was applied.
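The "update the counter only if the transaction was applied" rule can be illustrated with an in-memory Python sketch. The functions `insert_if_not_exists` and `insert_and_count` are hypothetical stand-ins for the conditional INSERT and the counter UPDATE; no Cassandra driver is involved.

```python
def insert_if_not_exists(table: dict, key: str, data: str) -> bool:
    """Mimic INSERT ... IF NOT EXISTS: report whether the row was applied."""
    if key in table:
        return False
    table[key] = data
    return True

def insert_and_count(table: dict, counter: dict, key: str, data: str) -> None:
    # Bump the counter only when the conditional insert was applied.
    if insert_if_not_exists(table, key, data):
        counter["count"] = counter.get("count", 0) + 1

my_table, my_table_counter = {}, {}
for key, data in [("k1", "x"), ("k2", "y"), ("k1", "z")]:  # "k1" repeats
    insert_and_count(my_table, my_table_counter, key, data)
print(my_table_counter["count"], len(my_table))
```

The repeated key is rejected, so the counter stays in step with the number of distinct rows. In a real cluster the two statements are still separate operations, so a failure between them can leave the counter behind.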

How to re-insert record with counter column after delete it in Cassandra?

I created a table with a counter column in Cassandra, like this:
create table test_count(
pk int,
count counter,
primary key (pk)
);
After that I update a record like this:
update test_count set count = count + 1 where pk = 1;
Then I want to reset the count to 0, but there's no reset command in cql, so I delete the record:
delete from test_count where pk = 1;
And then I re-execute the update CQL, but when I select * from test_count, there's no record with pk = 1, so is it a bug of Cassandra? When you delete the record with counter column, it disappears forever? How can I reset the count column to 0? How can I re-insert a record with counter column?
You can first query the counter for its current value and then issue an update to subtract that amount.
Alternatively, you can delete the counter and then start a new counter using a different row key, never using the original counter again. With this approach you might want to partition your counters by day or by week, so that when a new day starts, you start with a fresh set of zeroed-out counters.
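Both suggestions come down to a small amount of client-side logic, sketched below in Python. The helper names `reset_delta` and `daily_counter_key` are hypothetical, and the key format is only an assumption about how a per-day row key could be built (the table in the question uses an int key, so it would need a different key type or an extra key column).

```python
from datetime import date

def reset_delta(current_count: int) -> int:
    """Amount to add to a counter so that it returns to zero."""
    return -current_count

def daily_counter_key(name: str, day: date) -> str:
    """Build a per-day row key so each day starts with a fresh counter."""
    return f"{name}:{day.isoformat()}"

# Option 1: read the current value, then apply the opposite amount,
# as in UPDATE test_count SET count = count + ? WHERE pk = ?
count = 7                      # hypothetical value read from the table
count += reset_delta(count)
print(count)

# Option 2: switch to a new row key instead of reusing the old counter.
print(daily_counter_key("page_views", date(2017, 2, 21)))
```

The first option is not atomic: increments that arrive between the read and the subtracting update remain in the counter, so the result may end up slightly above zero.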

Resources