Order in a LIMIT query with composite keys in Cassandra

In the following scenario:
CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id,date),event_time)
)WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 07:01:00','72F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 08:01:00','74F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 07:01:00','73F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 08:01:00','76F');
If I do the following query:
SELECT *
FROM temperature_by_day
WHERE weatherstation_id='1234ABCD'
AND date in ('2013-04-04', '2013-04-03') limit 2;
I noticed that Cassandra returns the results in the same order as the partition keys listed in the IN clause. In this case, is the expected result ALWAYS the two records for the day '2013-04-04'? That is, does Cassandra respect the order of the IN clause when ordering the result, even in a scenario with multiple nodes?

Related

How to sum up cassandra counter grouping by only one column in the primary key set?

I am trying to keep track of the number of events of each type that occurred in one-hour buckets of time, and then sum the counts per category over arbitrary time ranges. So, I created a table like this:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY ((sensor_id), datetime_hour_bucket, activity_type)
)
WITH CLUSTERING ORDER BY(datetime_hour_bucket DESC, activity_type ASC);
I would like to be able to achieve this kind of query:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type
Cassandra complains because grouping must be done in the order of the primary key columns. And if I change the order, I won't be able to run a range query across any activity_type.
Some notes:
I am grouping by hours because some users could ask me to show the data in different timezones and I want to be able to perform a decent conversion.
The activity_type has low cardinality; however, I cannot be sure I'll always be able to predict its possible values.
Right now my solution is to query all the data in the range and perform the aggregation myself in code. Have you faced a similar situation, and what was your solution? Would you suggest a different way of querying or arranging the data?
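For reference, the client-side aggregation described above could look roughly like the sketch below. It assumes the rows from the range query have already been fetched into dictionaries; the row values and the function name sum_by_activity_type are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

def sum_by_activity_type(rows):
    """Sum activity_count per activity_type over rows that were already fetched."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["activity_type"]] += row["activity_count"]
    return dict(totals)

# Hypothetical rows, shaped like the result of the range query for one sensor_id.
rows = [
    {"datetime_hour_bucket": "2017-07-31 10:00", "activity_type": "open", "activity_count": 3},
    {"datetime_hour_bucket": "2017-07-31 10:00", "activity_type": "close", "activity_count": 1},
    {"datetime_hour_bucket": "2017-07-31 09:00", "activity_type": "open", "activity_count": 2},
]
totals = sum_by_activity_type(rows)
print(totals)
```

Since activity_type has low cardinality, the resulting dictionary stays small even when the time range covers many hour buckets.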
I hope you've already found a solution to your problem; in any case, here is something you can try.
First, you can change the CREATE TABLE statement to reorder the key columns:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY ((sensor_id), activity_type, datetime_hour_bucket)
)
WITH CLUSTERING ORDER BY(activity_type ASC, datetime_hour_bucket DESC);
Then you can add the datetime_hour_bucket column to the GROUP BY clause of the query:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type, datetime_hour_bucket;

Order Column Family with different id by date

I use the following CQL queries to create a table and write data. The problem is that the data in my table is not sorted by date.
I would like to have the rows sorted by date without having to use the same id for all of them.
To create the table:
CREATE TABLE IF NOT EXISTS sk1_000.data(
id varchar,
date_serveur timestamp,
nom_objet varchar,
temperature double,
etat boolean,
PRIMARY KEY (id, date_serveur)
) WITH CLUSTERING ORDER BY (date_serveur DESC);
To insert:
INSERT INTO sk1_000.data(id, date_serveur,nom_objet, temperature, etat) VALUES ('"+ uuid.v4() +"', '1501488930499','Raspberry_pi', 22.5, true) if not exists ;
Here is the output:
In Cassandra, the clustering key guarantees sort order within a given partition key, not across different partition keys.
To get what you are looking for, "sort by date across all keys", you will have to redesign the table to have date_serveur as the partition key and id as a clustering column. But note that with this table design you can no longer query directly by id.
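To illustrate the point about per-partition ordering: each partition comes back sorted by its clustering column, but nothing orders rows across partitions. If the number of ids is small, one possible workaround is to merge the per-partition results on the client. The sketch below assumes each list is already sorted by date_serveur in descending order, as the table's clustering order would return it; the rows and the function name merge_newest_first are hypothetical.

```python
import heapq

def merge_newest_first(partitions):
    """Merge per-partition row lists, each already sorted by date_serveur DESC."""
    return list(
        heapq.merge(*partitions, key=lambda row: row["date_serveur"], reverse=True)
    )

# Hypothetical rows: one list per partition key (id), newest first in each list.
partition_a = [
    {"id": "a", "date_serveur": 1501488930499},
    {"id": "a", "date_serveur": 1501488900000},
]
partition_b = [
    {"id": "b", "date_serveur": 1501488920000},
]
merged = merge_newest_first([partition_a, partition_b])
print([row["date_serveur"] for row in merged])
```

This only helps when all the relevant partitions can be read in one go; it is not a replacement for a table designed around the query.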

Cassandra grouping with filter

I have a table of events that are recorded every minute. I want to be able to filter these events by time period and also aggregate the data per hour, day, etc.
My data model:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
)
CREATE MATERIALIZED VIEW hour_dev_data AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
My query is:
select hour, sum(value)
from hour_dev_data
where device = 'tst' and event_time < 149000000 group by device, hour;
It fails with the error:
code=2200 [Invalid query] message="PRIMARY KEY column "event_time" cannot be restricted as preceding column "hour" is not restricted"
The only way to make it work is to add ALLOW FILTERING, which is unpredictable.
How can I change my data model to address my query and avoid ALLOW FILTERING mode?
You have to proactively produce these results:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table hour_dev_data (
device TEXT,
hour BIGINT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table day_dev_data (
device TEXT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Each table will satisfy ONE granularity only.
Every hour you query the minute data for the latest hour data for each device with something like:
SELECT * FROM min_dev_data WHERE device = X AND event_time < YYYY
Sum that at application level and store this value into the hour table:
INSERT INTO hour_dev_data (device, hour, day, event_time, value) VALUES (....);
And every day you query the hour table to produce the further aggregate data:
SELECT * FROM hour_dev_data WHERE device = X AND event_time < YYYY
Sum that at application level and store the value in the day table.
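The application-level summing step could be sketched as below. This is only an illustration: it assumes event_time is stored as epoch seconds (the question does not state the unit) and that the minute rows have already been read into (device, event_time, value) tuples. The name rollup_by_hour is hypothetical.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600  # assumption: event_time is in epoch seconds

def rollup_by_hour(minute_rows):
    """Sum minute-level values into one total per (device, start of hour)."""
    totals = defaultdict(float)
    for device, event_time, value in minute_rows:
        hour = event_time - event_time % SECONDS_PER_HOUR  # truncate to the hour
        totals[(device, hour)] += value
    return dict(totals)

# Hypothetical minute rows: (device, event_time, value)
minute_rows = [
    ("tst", 7200, 1.5),
    ("tst", 7260, 2.5),
    ("tst", 10800, 4.0),
]
hourly = rollup_by_hour(minute_rows)
print(hourly)
```

Each entry of the result would then become one INSERT into the hour table; the same function with a day-sized divisor would serve the daily rollup.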
Please consider adding some form of bucketing because, at one-minute intervals, your minute table will have wide partitions within two months. This should not be a problem if you keep the table in reverse order (like I did) and query only the last couple of hours. But if you want to query back in time as well, then you should definitely use bucketing in your tables.
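As a rough idea of what bucketing could look like: derive a coarse bucket value from the timestamp and make it part of the partition key, for example ((device, bucket)) instead of ((device)). The sketch below assumes event_time is in epoch seconds and uses a monthly bucket; both the unit and the bucket size are assumptions, and month_bucket is a hypothetical helper.

```python
from datetime import datetime, timezone

def month_bucket(event_time):
    """Return a 'YYYY-MM' bucket for an epoch-seconds timestamp, in UTC."""
    return datetime.fromtimestamp(event_time, tz=timezone.utc).strftime("%Y-%m")

bucket = month_bucket(1500000000)
print(bucket)
```

The writer computes the bucket on insert, and a reader that knows the time range it needs can compute which buckets to query.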
I think you had things pretty much right already, but you need to change your filter on event_time to be a filter on hour.
select hour, sum(value)
from hour_dev_data
where device = 'tst' and hour < 1500000000
group by device, hour;
When you were filtering on event_time, you were implicitly requiring a full scan of the row, as event_time is clustered after hour. To filter by event_time, every cell would need to be examined to check its event_time. When you filter by hour, it is first in the clustering key, so it can be efficiently scanned and filtered. See this post on ALLOW FILTERING for more on this.
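The difference can be pictured with a plain sorted list standing in for one partition. This is only an analogy, not how Cassandra stores data internally: rows sorted by (hour, event_time) allow a range on hour to be answered with a binary search and a contiguous slice, while a restriction on event_time alone forces a look at every row. The data and function names are hypothetical.

```python
from bisect import bisect_left

# Rows of one partition, kept sorted by (hour, event_time) like the view's clustering key.
rows = [(1, 10), (1, 50), (2, 70), (2, 110), (3, 130)]

def rows_with_hour_below(rows, hour_limit):
    """Range on the first clustering column: one binary search, then a contiguous slice."""
    end = bisect_left(rows, (hour_limit,))
    return rows[:end]

def rows_with_event_time_below(rows, time_limit):
    """Restriction on the second column only: every row has to be inspected."""
    return [row for row in rows if row[1] < time_limit]

by_hour = rows_with_hour_below(rows, 3)
by_event_time = rows_with_event_time_below(rows, 100)
print(by_hour)
print(by_event_time)
```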
I agree with xmas79 that you probably want to be bucketing at some level, perhaps by month or year depending on your frequency of events. If you're always going to be looking for the most recent values, then setting the clustering key ordering to desc is probably going to be helpful too:
CREATE MATERIALIZED VIEW hour_dev_data3 AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
WITH CLUSTERING ORDER BY (hour DESC);
Scheduling aggregations like xmas79 suggests is going to be more efficient as the sum is done once, rather than summing every time reads are done, however it does add more maintenance burden, where the materialised view handles it for you.

How to query in Cassandra if I have two clustering keys in a column family

I have a column family and syntax like this:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), start_time, callerph)
);
I want to run queries like:
a) select * from dummy where sr_number='+919xxxx8383'
and start_time >='2014-12-02 08:23:18' limit 10;
b) select * from dummy where sr_number='+919xxxxxx83'
and start_time >='2014-12-02 08:23:18'
and callerph='+9120xxxxxxxx0' limit 10;
The first query works fine, but the second query gives an error like:
Bad Request: PRIMARY KEY column "callerph" cannot be restricted
(preceding column "start_time" is either not restricted or by a non-EQ
relation)
If I get a result from the first query, then in the second query I am just adding one
more clustering key to filter that result, so there should be fewer rows.
Just like you cannot skip PRIMARY KEY components, you may only use a non-equals operator on the last component that you query (which is why your 1st query works).
If you do need to serve both of the queries you have listed above, then you will need to have separate query tables for each. To serve the second query, a query table (with the same columns) will work if you define it with a PRIMARY KEY like this:
PRIMARY KEY((sr_number), callerph, start_time)
That way you are still specifying the parts of your PRIMARY KEY in order, and your non-equals condition is on the last PRIMARY KEY component.
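The rule can be written down as a small checker. This is a deliberately simplified sketch of the restriction described here: it ignores IN, multi-column relations, and ALLOW FILTERING, and it is not Cassandra's actual validation code. The function name first_invalid_restriction is hypothetical.

```python
def first_invalid_restriction(clustering_columns, restrictions):
    """Return the first clustering column whose restriction breaks the rule, or None.

    restrictions maps a column name to "eq" or "range"; unrestricted columns are absent.
    A column may only be restricted if every preceding clustering column has an
    equality restriction.
    """
    blocker = None  # first column that was skipped or restricted by a range
    for column in clustering_columns:
        kind = restrictions.get(column)
        if kind is not None and blocker is not None:
            return column
        if kind != "eq":
            blocker = blocker or column
    return None

query_b = {"start_time": "range", "callerph": "eq"}

# Original key order: start_time, callerph
original = first_invalid_restriction(["start_time", "callerph"], query_b)
# Reordered key: callerph, start_time
reordered = first_invalid_restriction(["callerph", "start_time"], query_b)
print(original, reordered)
```

With the original key order the checker flags callerph, matching the error in the question; with the reordered key the same restrictions pass.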
There are certain restrictions on how the primary key columns can be used in the WHERE clause: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html
One solution that will work in your situation is to change the order of the clustering columns in the primary key:
CREATE TABLE sr_number_callrecord (
id int,
callerph text,
sr_number text,
callid text,
start_time text,
plan_id int,
PRIMARY KEY((sr_number), callerph, start_time)
);
Now you can use a range query on the last column:
select * from sr_number_callrecord where sr_number = '1234' and callerph = '+91123' and start_time >= '1234';

How to query to get an ordered result from a Cassandra table

I need to run a query like:
select * from items where group='a' order by update_time desc;
However, the "update_time" column of each row is not fixed; it will change as needed.
So, how can I design the Cassandra tables to achieve the goal of querying for an ordered result?
Sample Table
CREATE TABLE items (
ItemA text,
update_time timeuuid,
ItemB int,
PRIMARY KEY ( ItemA, update_time)
) WITH CLUSTERING ORDER BY (update_time DESC);
The ordering field should be part of the clustering key.
See the table above, where the rows are ordered by update_time in descending order.
