Storing time specific data in cassandra - cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
Logic of retrieving current value is like following.
Find all rows with start_time<=current_time.
Then find the value with maximum start_time from the rows obtained in the first step.
PS:- Edited the question to make it more clear

The exact requirements are not possible. But we can get close to it with one more column.
First, to be able to use <= operator, your start_time column need to be the clustering key of your table.
Then, you need a different partition key. You could choose a fixed value but it could bring problems when the partition will have too many rows. Then you should better use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you will query the table, you will need to know the value of the partition key :
Find all rows with start_time<=current_time
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
select the value with maximum start_time
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;

Create two separate table like below :
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both table, to insert data into second table use a static value like 1
INSERT INTO current_value(partition, value) VALUES(1, 10);
Now In current value table your data will be upsert and You will get latest value whenever you select.

Related

How to sum up cassandra counter grouping by only one column in the primary key set?

I am trying to keep track of the amount of events of each type that occured in one-hour buckets of time, and then sum the counts per category in arbitrary time ranges. So, I create a table like this:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY ((sensor_id), datetime_hour_bucket, activity_type)
)
WITH CLUSTERING ORDER BY(datetime_hour_bucket DESC, activity_type ASC);
I would like to be able to achieve this kind of query:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type
Cassandra complains about because grouping must be done in the order of the primary key columns. And, if I change the order I won't be able to query by a range over any activity_type.
Some notes:
I am grouping by hours because some users could ask me to show the data in different timezones and I want to be able to perform a decent conversion.
The activity_type has low cardinality, however I can not be sure I'll always be able to predict it's possible values.
Right now my solution was to query the whole data in the range and perform the aggregation myself in code. Have you have faced similar situation and what was your solution? Would you suggest a different way of querying or arranging the data?
I hope you've found the solution of your problem, however I have a way to you try.
First, you can chage the create table to change the order of fields:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY (activity_type, sensor_id, datetime_hour_bucket, activity_count)
)
WITH CLUSTERING ORDER BY(activity_type ASC, datetime_hour_bucket DESC);
Then, the query you can add the field "datetime_hour_bucket" in the Group By clause:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type, datetime_hour_bucket;

Cassandra export/forward data only once

I have the requirement to forward data at certain intervals from my system to an external system. To do this, I already stored all rows in a table. Already forwarded data should not be exported again.
The idea is to memorize the last export time on client side and export the following records the next time. Old rows are deleted after a successful export.
CREATE TABLE export(
id int,
import_date_time timestamp,
data text,
PRIMARY KEY (id, import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC)
insert into export(id, import_date_time, data) values (1, toUnixTimestamp(now()), 'content')
select * from export where id = 1 and import_date_time > '2017-03-30 16:22:37'
delete from export where id = 1 and import_date_time <= '2017-03-30 16:22:37'
Has anyone already implemented similar or do you have a different
solution?
If possible, I do not need an id for the request because I want to
export all data
If you used fixed partition key value (id = 1), then all the insert, select and delete will happen on a same node (If RF=1) over and over. And also for every delete cassandra create a tombstone entry, when you execute select query cassandra needs to merge each entry. So your select query performance will degrade.
So instead of having fixed value, use dynamic value like the below one :
CREATE TABLE export(
hour int,
day int,
month int,
year int,
import_date_time timestamp,
data text,
PRIMARY KEY ((hour, day, month, year), import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
Here you can insert the value of hour, day, month, year extracted from import_date_time
You need to take care of two case When selecting data :
Previous export time and current export time both at same hour.
Both time are not inside same hour.
For case one you need only one query and for case two you have to execute two query.
Example Query :
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 16:22:37';

Get last row in table of time series?

I am already able to get the last row of time-series table as:
SELECT * from myapp.locations WHERE organization_id=1 and user_id=15 and date='2017-2-22' ORDER BY unix_time DESC LIMIT 1;
That works fine, however, I am wondering about performance and overhead of executing ORDER BY as rows are already sorted, I just use it to get the last row, is it an overhead in my case?
If I don't use ORDER BY, I will always get the first row in the table, so, I though I might be able to use INSERT in another way, ex: insert always in the beginning instead of end of table?
Any advice? shall I use ORDER BY without worries about performance?
Just define your clustering key order to DESC
Like the below schema :
CREATE TABLE locations (
organization_id int,
user_id int,
date text,
unix_time bigint,
lat double,
long double,
PRIMARY KEY ((organization_id, user_id, date), unix_time)
) WITH CLUSTERING ORDER BY (unix_time DESC);
So by default your data will sorted by unix_time desc, you don't need to specify in query
Now you can just use the below query to get the last row :
SELECT * from myapp.locations WHERE organization_id = 1 and user_id = 15 and date = '2017-2-22' LIMIT 1;
If your query pattern for that table is always ORDER BY unix_time DESC then you are in a reverse order time-series scenario, and I can say that your model is inaccurate (not wrong).
There's no reason not to sort the records in reverse order by adding a WITH CLUSTERING ORDER BY unix_time DESC in the table definition, and in my opinion the ORDER BY unix_time DESC will perform at most on par with something explicitly meant for these use cases (well, I think it will perform worse).

how to do the query in cassandra If i have two cluster key in column family

I have a column family and syntax like this:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), start_time, callerph)
);
I want to do the query like :
a) select * from dummy where sr_number='+919xxxx8383'
and start_time >='2014-12-02 08:23:18' limit 10;
b) select * from dummy where sr_number='+919xxxxxx83'
and start_time >='2014-12-02 08:23:18'
and callerph='+9120xxxxxxxx0' limit 10;
First query works fine but second query is giving error like
Bad Request: PRIMARY KEY column "callerph" cannot be restricted
(preceding column "start_time" is either not restricted or by a non-EQ
relation)
If I get the result in first query, In second query I am just adding one
more cluster key to get filter result and the row will be less
Just like you cannot skip PRIMARY KEY components, you may only use a non-equals operator on the last component that you query (which is why your 1st query works).
If you do need to serve both of the queries you have listed above, then you will need to have separate query tables for each. To serve the second query, a query table (with the same columns) will work if you define it with a PRIMARY KEY like this:
PRIMARY KEY((sr_number), callerph, start_time)
That way you are still specifying the parts of your PRIMARY KEY in order, and your non-equals condition is on the last PRIMARY KEY component.
There are certain restrictions in the way the primary key columns are to be used in the where clause http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html
One solution that will work in your situation is to change the order of clustering columns in the primary key
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), callerph, start_time,)
);
Now you can use range query on the last column as
select * from sr_number_callrecord where sr_number = '1234' and callerph = '+91123' and start_time >= '1234';

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted is user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worst, maybe this is completely stupid and I don't get why.
Is there any ways of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table which is created to handle those queries, it would only order on Date and not on user id
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you try to get is "return the most recent inserted date", you may not need an additional table. You can use "static column" to store the last update time per partition. CASSANDRA-6561
It probably won't help your particular case (since I imagine your list of all users is unmanagably large), but if the condition on the first clustering column is matching one of a relatively small set of values then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.

Resources