CQL: Search a table in Cassandra using '<' on an indexed column

My cassandra data model:
CREATE TABLE last_activity_tracker ( id uuid, recent_activity_time timestamp, PRIMARY KEY(id));
CREATE INDEX activity_idx ON last_activity_tracker (recent_activity_time) ;
The idea is to keep track of ids and the time of their most recent activity for an event.
I need to find the ids whose last activity was a year or more ago.
So, I tried:
SELECT * from last_activity_tracker WHERE recent_activity_time < '2013-12-31' allow filtering;
I understand that I cannot use any operator other than '=' on secondary-indexed columns.
However, I cannot add 'recent_activity_time' to the key as I need to update this column with the most recent activity time of an event if any.
Any ideas in solving my problem are highly appreciated.

I can see an issue with your query. You're not hitting a partition. As such, the performance of your query will be quite bad. It'll need to query across your whole cluster (assuming you took measures to make this work).
If you're looking to query the last activity time for an id, think about storing it in a more query friendly format. You might try this:
create table tracker (dummy int, day timestamp, id uuid, primary key(dummy, day, id));
You can then insert rows with day set to the timestamp of midnight on that date (ignoring the time of day) and dummy = 0.
That should enable you to do:
select * from tracker where dummy=0 and day > '2013-12-31';
You can set a ttl on insert so that old entries expire (maybe after a year in this case). The idea is that you're storing information in a way that suits your query.
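For what it's worth, here is a rough Python sketch of how the day value for such an insert could be computed on the client. The helper names day_bucket and tracker_row are made up for illustration, and using UTC day boundaries is an assumption; the answer does not say which time zone applies.

```python
from datetime import datetime, timezone


def day_bucket(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to midnight UTC of the same day."""
    ts_utc = ts.astimezone(timezone.utc)
    return ts_utc.replace(hour=0, minute=0, second=0, microsecond=0)


def tracker_row(activity_id, ts: datetime) -> tuple:
    """Build the (dummy, day, id) values for an insert into the tracker table."""
    # dummy is always 0, so every row lands in the same partition
    return (0, day_bucket(ts), activity_id)
```

The tuple can then be bound to an INSERT statement with whatever driver you use. Keep in mind that a single dummy partition holds every row, so this only suits data sets that stay reasonably small.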

Related

Cassandra Data modelling : Timestamp as partition keys

I need to be able to return all users that performed an action during a specified interval. The table definition in Cassandra is just below:
create table t ( from timestamp, to timestamp, user text, PRIMARY KEY((from,to), user))
I'm trying to implement the following query in Cassandra:
select * from t WHERE from > :startInterval and to < :toInterval
However, this query will obviously not work because it represents a range query on the partition key, forcing Cassandra to search all nodes in the cluster, defeating its purpose as an efficient database.
Is there an efficient way to model this query in Cassandra?
My solution would be to split both timestamps into their corresponding years and months and use those as the partition key. The table would look like this:
create table t_updated ( yearFrom int, monthFrom int, yearTo int, monthTo int, from timestamp, to timestamp, user text, PRIMARY KEY((yearFrom,monthFrom,yearTo,monthTo), user) )
If I wanted the users that performed the action between Jan 2017 and July 2017, the query would look like the following:
select user from t_updated where yearFrom IN (2017) and monthFrom IN (1,2,3,4,5,6,7) and yearTo IN (2017) and monthTo IN (1,2,3,4,5,6,7)
Would there be a better way to model this query in Cassandra? How would you approach this issue?
First, the partition key has to be restricted with the equals operator. It is better to use PRIMARY KEY (BUCKET, TIME_STAMP) here, where the bucket can be a combination of year and month (or include days, hours, etc., depending on how big your data set is).
It is better to execute multiple queries and combine the results on the client side.
The answer depends on the expected number of entries. A rule of thumb is that a partition should not exceed 100 MB. So if you expect a moderate number of entries, it would be enough to use the year as the partition key.
We use Week-First-Date as a partition key in an IoT scenario, where values get written at most once a minute.
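To illustrate the "multiple queries, combine on the client" idea, here is a small Python sketch that lists the buckets a time interval touches. It assumes the bucket is a (year, month) pair; month_buckets is a hypothetical helper name, not part of any driver.

```python
from datetime import date


def month_buckets(start: date, end: date) -> list:
    """Return every (year, month) pair from start to end, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append((year, month))
        month += 1
        if month > 12:
            # roll over into January of the next year
            year, month = year + 1, 1
    return buckets
```

You would then run one query per bucket, with an equality restriction on the bucket and a range restriction on the clustering timestamp, and concatenate the results.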

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is the part I don't understand):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with:
SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz';
Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100 MB, otherwise you will run into trouble);
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if you don't have too many documents within a day (so your partition doesn't grow too much). It allows you to run queries such as:
SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24;
If you need a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
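As a rough illustration of the "multiple queries across months" point, this Python sketch splits a date range into one set of bind values per (year, month) partition of documents_by_date. The function name monthly_day_ranges is an assumption made for this example.

```python
import calendar
from datetime import date


def monthly_day_ranges(start: date, end: date) -> list:
    """Split [start, end] into (year, month, first_day, last_day), one per month."""
    ranges = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # the first month starts at start.day, every other month at day 1
        first_day = start.day if (year, month) == (start.year, start.month) else 1
        if (year, month) == (end.year, end.month):
            last_day = end.day
        else:
            # monthrange returns (weekday of day 1, number of days in month)
            last_day = calendar.monthrange(year, month)[1]
        ranges.append((year, month, first_day, last_day))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return ranges
```

Each tuple maps onto one query of the form year=? and month=? and day>=? and day<=?, and the results are merged in the application.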
If latency is not a huge concern, an alternative would be to use the Stratio Lucene plugin for Cassandra and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it's a document, I am assuming that one user will be creating one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data order by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we create a separate table with the date as primary key, it becomes tricky when many documents are inserted in the very same second. Clustering order works best for requirements where the documents for a given user need to be sorted by time.

Is creating a new table from scratch to support a new query a common practice in Cassandra?

Currently, we have the following table, which enables us to perform queries based on day.
CREATE TABLE events_by_day(
...
traffic_type text,
device_class text,
country text,
...
yyyymmdd text,
event_type text,
the_datetime timeuuid,
PRIMARY KEY((yyyymmdd, event_type), the_datetime));
create index index_country on events_by_day (country);
create index index_traffic_type on events_by_day (traffic_type);
create index index_device_class on events_by_day (device_class);
The following queries are being supported.
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('lead', 'view', 'sales');
select * from events_by_day where yyyymmdd = '20160303' and event_type = 'lead' and country = 'my' and device_class = 'smart' and traffic_type = 'WEB' ALLOW FILTERING;
When we need data for more than one day, we run the query multiple times. Say I need "view" data from the 1st of March 2016 till the 3rd of March 2016; I will query 3 times.
select * from events_by_day where yyyymmdd = '20160301' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160302' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('view');
Currently, all these fit well into our requirement.
However, in the future, let's say we have a new requirement, we need "view" data from 2013 till 2016.
Instead of querying it 1460 times (365 days * 4 years) , is it a common practice for us to create a whole new empty table like
CREATE TABLE events_by_year(
...
traffic_type text,
device_class text,
country text,
...
yyyy text,
event_type text,
the_datetime timeuuid,
PRIMARY KEY((yyyy, event_type), the_datetime));
and then fill it with the large amount of data from events_by_day (which might take several days to finish, as the events_by_day table already has many rows)?
The short answer is yes. It is common to roll up weekly, monthly, or yearly data into new tables so that it can be queried more efficiently.
It also would be better to, for example, keep a rolling aggregation that runs daily (could be another suitable time period depending on your data and requirements) and calculates these values, rather than waiting until you need them and then running a process that takes a few days.
is it a common practice for us to create a whole new empty table?
Yes it is. This is called "Query Based Modeling," and it is quite common in Cassandra. While Cassandra scales and performs well, it does not offer much in the way of query flexibility. So to get around that, instead of using ill-performing methods (secondary indexes, ALLOW FILTERING) to query an existing table, the table is commonly duplicated with a different PRIMARY KEY. Basically, you are trading disk space for performance.
Not to self-promote or anything, but I gave a talk on this subject at the last Cassandra Summit. You may find the slides helpful: Escaping Disco Era Data Modeling
Speaking of performance, using the IN keyword on a partition key has been proven to be just as bad as using a secondary index. You'll get much better performance with 3 parallel queries, as opposed to this: event_type in ('lead', 'view', 'sales').
Additionally, your last query is using ALLOW FILTERING which is something you should never do on a production system, because it will result in a scan of your entire table, and several of your nodes.
For ideal performance, it is best to ensure that your queries target a specific data partition. This way, you will only hit a single node, and not introduce extraneous network traffic into the equation.
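Here is a minimal Python sketch of the "parallel single-partition queries instead of IN" approach. It assumes a callable run_query(day, event_type) that executes one query against one partition and returns a list of rows; that callable, and the names day_keys and fetch_events, are hypothetical stand-ins for whatever your driver provides.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def day_keys(start: date, end: date) -> list:
    """Return the yyyymmdd partition key for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]


def fetch_events(run_query, start, end, event_types, max_workers=8):
    """Run one query per (day, event_type) partition in parallel and merge the rows."""
    partitions = [(day, et) for day in day_keys(start, end) for et in event_types]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves the order of partitions in its results
        results = pool.map(lambda p: run_query(*p), partitions)
        return [row for rows in results for row in rows]
```

Each call to run_query targets exactly one partition, which is the property the answer is after. A driver with native asynchronous execution would likely be a better fit than threads; this only shows the shape of the idea.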

Using Cassandra for time series data

I'm researching how to store logs in Cassandra.
The schema for logs would be something like this.
EDIT: I've changed the schema in order to make some clarification.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time) -- #1
PRIMARY KEY ((userid), time, reason, item, price, count) -- #2
);
A new table will be created every day.
So a table contains logs for only one day.
My querying condition is as follows.
Query all logs from a specific user on a specific day(date not time).
So the reason, item, price, count will not be used as hints or conditions for queries at all.
My question is which PRIMARY KEY design is the better fit.
EDIT: And the key here is that I want to store the logs in a schematic way.
If I choose #1 so many columns would be created per log. And the possibility of having more values per log is very high. The schema above is just an example. The log can contain values like subreason, friendid and so on.
If I choose #2 one (very) composite column will be created per log, and so far I couldn't find any valuable information about the overhead of the composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series, and creating a table per day doesn't seem optimal either.
Instead, I'd recommend creating a single table, partitioning it by userid and day, and using a timeuuid as the clustering column for the event. An example of this would look like:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
)
This will allow you to have all of a user's events for a day in a single row and to run your query per day per user.
Declaring time as a clustering column gives you a wide row into which you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value')
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2')
The two inserts above will be in the same row and therefore you will be able to read in a single query.
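A quick Python sketch of how the values for such an insert might be produced on the client. It assumes the date column holds the UTC day taken from the timeuuid itself, so the partition and the clustering value always agree; log_row and timeuuid_to_datetime are hypothetical helper names.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond steps from this date
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Recover the UTC timestamp embedded in a version 1 (time-based) UUID."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


def log_row(userid: int, value: str) -> tuple:
    """Build (userid, date, time, value) for an insert into log_per_day."""
    event_id = uuid.uuid1()  # time-based UUID, usable as a CQL timeuuid
    day = timeuuid_to_datetime(event_id).strftime("%Y-%m-%d")
    return (userid, day, event_id, value)
```

Reading a user's day back is then a single query with equality on both userid and date.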
Also if you want more information about time series I highly recommend you to check Getting Started with Time Series Data Modeling
Hope it helps,
José Luis

Cassandra time based query

I have the following Cassandra table which records the user access to a web page.
create table user_access (
id timeuuid primary key,
user text,
access_time timestamp
);
and would like to do a query like this:
get the list of users who accessed the page more than 10 times in the last hour.
Is it possible to do it in Cassandra? (I'm kind of stuck with the limited CQL query functionalities)
If not, how do I remodel the table to do this?
Can you do it? yes.
Can you do it efficiently? I'm not convinced.
It's not clear what the timeuuid you are using represents.
You could reorganize this to
CREATE TABLE user_access (
user_id text,
access_time timestamp,
PRIMARY KEY (user_id, access_time)
);
SELECT COUNT(*)
FROM user_access
WHERE user_id = '101'
AND access_time > :one_hour_ago  -- bind to the current time minus 3600 seconds
AND access_time < :now;          -- bind to the current time
Then filter the results on your own in your language of choice. I wouldn't hold your breath waiting for subquery support.
That's going to be horribly inefficient if you have lots of users though.
There may be a better solution using CQL's counter columns and binning accesses to the start of the hour. That could get you per-hour accesses, but that's not the same as within the last hour.
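For the "filter the results on your own" step, something along these lines could work in Python. It assumes you have already fetched (user_id, access_time) pairs by some means; frequent_users is a made-up name, and the window bounds (exclusive at the start, inclusive at the end) are a choice made for this sketch.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def frequent_users(accesses, now, threshold=10, window=timedelta(hours=1)):
    """Return users with more than `threshold` accesses in the window ending at `now`.

    accesses is an iterable of (user_id, access_time) pairs.
    """
    cutoff = now - window
    # count only accesses that fall inside (cutoff, now]
    counts = Counter(user for user, ts in accesses if cutoff < ts <= now)
    return sorted(user for user, n in counts.items() if n > threshold)
```

This does the counting in application memory, so it inherits the efficiency concern above when there are many users.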

Resources