Cassandra: How to select data updated in the last 30 days

We have a requirement to load the last 30 days of updated data from a table.
Neither of the potential solutions below allows us to do so.
select * from XYZ_TABLE where WRITETIME(lastupdated_timestamp) > (TOUNIXTIMESTAMP(now()) - 42300000);
select * from XYZ_TABLE where lastupdated_timestamp > (TOUNIXTIMESTAMP(now()) - 42300000);
The table has columns as
lastupdated_timestamp (with an index on this field)
lastupdated_userid (with an index on this field)
Any pointers ...

Unless your table was built with this query in mind, the query will search every partition of the database, which becomes very costly once your dataset grows large and will probably result in a timeout.
To complete this query efficiently, XYZ_TABLE should have a primary key something like this:
PRIMARY KEY ((update_month, update_day), lastupdated_timestamp)
This way Cassandra knows exactly where to find the data: it has month and day buckets it can locate quickly, and you can then run queries like this to find updates on a certain day.
SELECT * FROM XYZ_TABLE WHERE update_month = '2018-07' AND update_day = 6;
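A sketch of what the full table definition could look like (the column types here are assumptions, since the question only names two columns):
CREATE TABLE xyz_table (
    update_month text,                 -- e.g. '2018-07'
    update_day int,                    -- e.g. 6
    lastupdated_timestamp timestamp,
    lastupdated_userid text,
    -- ... other payload columns ...
    PRIMARY KEY ((update_month, update_day), lastupdated_timestamp)
);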

Related

Cassandra: what is an efficient way to run a subquery

I have a huge table of employees (around 20 to 30 million), and I have around 50,000 employee ids to select from this table.
What is the fastest way to query? Is it a query like this:
select * from employee_table where employeeid in (1,400,325 ....50000)
The ids are not necessarily in sequential order; they are in a random order.
When the IN clause is used in a query, the load on the coordinator node increases, because for every value (in your case, the employee id) it needs to hit the required nodes (again, based on the CL of your query) and collate the results before returning them to the client. Hence if your IN clause has only a few values, using IN is fine.
But in your case, since you need to fetch ~50K employee IDs, I would suggest you fire select * from employee_table where employeeid = <your_employee_id> in parallel for those 50K IDs.
I would also suggest that when you do this you monitor your Cassandra cluster to ensure these parallel queries are not putting a high load on it. (This last statement is based on my personal experience :))

Cassandra for storing click logs

I work in ad tech and our current infrastructure uses MySQL for storing clicks and conversion logs. So far, MySQL has been useful to us for running ad hoc queries against click data.
We are considering switching to Cassandra as we receive huge traffic spikes during peak times. Not only that, we are growing at a very fast rate, and we get about 500-1000 clicks per second every now and then (for an extended duration, sometimes 20-30 minutes).
I have been weighing the options available, and so far my research has led me to believe that nothing beats Cassandra in terms of write performance.
I'm currently in the process of creating a data model to store clicks.
The major components of any click are as follows:
Campaign id
Pub id
Timestamp
Creative id
Event code (whether it is a valid click or an invalid click. This is an int value; for example, event_code=0 is a valid click)
Now, I need to support the following queries:
1. SELECT * FROM clicks WHERE campaign_id=?
2. SELECT * FROM clicks WHERE campaign_id=? AND date_time>=? AND date_time <=?
3. SELECT * FROM clicks WHERE campaign_id=? AND pub_id=? AND date_time>=? AND date_time <=? AND event_code=?
etc
This is simple enough to do with MySQL, after which I just get all the data from these queries in a CSV file.
However, if I were to model my tables based on the first query, I would need to create a table in Cassandra like the following:
CREATE TABLE clicks_by_campaign (
    camp_id int,
    pub_id int,
    date_time timestamp,
    creative_id int,
    event_code int,
    // other fields like ip, user agent, device, etc.
    PRIMARY KEY (camp_id, pub_id, date_time, event_code, creative_id)
);
But there are campaigns that can have millions of rows. For example, we have campaigns with a particular id, say id=3, that have more than 7 million clicks.
Wouldn't this create a wide-row problem? From what I understand, all of this campaign's data would be stored as one partition on one physical machine. Is my thinking here correct, or am I missing something? Please note that other queries have to be supported as well. For example, I might have to share the click logs for a particular publisher (irrespective of the campaign id), in which case the query would look like:
SELECT * FROM clicks_by_publisher WHERE pub_id=?
This would obviously mean creating another table named 'clicks_by_publisher', and so on.
I would also like to point out that I will be using Apache Flink to analyze, aggregate and group click info on a 1-minute time window. These results will then be stored in MySQL to provide as much support for ad-hoc queries as possible.
Can someone point me in the right direction?
Is there any other strategy I can use? Am I missing something?
You have a few options; I'll describe three. The first is specifying the columns as follows:
campaign_id = partition key
event_code = clustering key
date_time = clustering key
Running greater-than-or-equal queries on clustering keys is possible, so your queries will run.
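In CQL, that first option might be sketched like this (the table name is illustrative; the columns are taken from the question):
CREATE TABLE clicks_by_campaign_v1 (
    campaign_id int,
    event_code int,
    date_time timestamp,
    pub_id int,
    creative_id int,
    -- other fields like ip, user agent, device, etc.
    PRIMARY KEY (campaign_id, event_code, date_time)
);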
You're right in saying this would create a single partition for each campaign id. To avoid all of a campaign's rows sitting on one physical machine, you could create a different table that links campaign ids to row ids in your clicks table. That would reduce the overall data stored on a single machine.
Another solution would be to prefix each campaign id with a machine id. That splits the rows evenly between machines. It means issuing one query per machine id for each logical query, but it allows for growth.
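A hypothetical sketch of that bucketing idea, with a bucket column standing in for the machine id (the column name and range are my choice, not from the answer):
CREATE TABLE clicks_by_campaign_bucketed (
    camp_id int,
    bucket int,           -- e.g. 0..9, assigned round-robin or by hash on write
    date_time timestamp,
    event_code int,
    creative_id int,
    PRIMARY KEY ((camp_id, bucket), date_time)
);
Reads for one campaign then fan out as one query per bucket value.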
This leads on to Spark. Spark will handle running your query on multiple machines and concatenating the results for you automatically, essentially doing what I described above without the development overhead.
Working with Cassandra myself, I opted for a combination of the first and second solutions because it fit the data structure I was working with. Remember that Cassandra is very efficient at writes, so don't be too conservative about creating tables to help filter queries and store your data more sparsely.
Perhaps storing clicks by a hash of campaign id, prefixed by the date, will work for you.
Edit: Unless disabled, Cassandra will automatically hash your partition keys using the Murmur3 algorithm.
To model your requirement for fast reads and well-distributed writes, use the table definition below:
CREATE TABLE clicks_by_campaign (
    camp_id int,
    createdon bigint,
    pub_id int,
    creative_id int,
    event_code int,
    // other fields like ip, user agent, device, etc.
    PRIMARY KEY ((camp_id, createdon), event_code)
);
This will help distribute the data evenly across the partitions. It also solves your second and third queries:
2. SELECT * FROM clicks WHERE campaign_id=? AND date_time>=? AND date_time <=?
The query will be:
SELECT * FROM clicks_by_campaign WHERE token(camp_id, createdon) > token(100, 1111111111111) AND token(camp_id, createdon) <= token(100, 2222222222222)
3. SELECT * FROM clicks WHERE campaign_id=? AND pub_id=? AND date_time>=? AND date_time <=? AND event_code=?
The query will be:
SELECT * FROM clicks_by_campaign WHERE token(camp_id, createdon) > token(100, 1111111111111) AND token(camp_id, createdon) <= token(100, 2222222222222) AND event_code=10
The first query:
1. SELECT * FROM clicks WHERE campaign_id=?
is really an anti-pattern in Cassandra. What I would do is process the campaign data batch-wise: hourly, daily, weekly, or yearly. Think about the campaign id again; do we really have to process all the data at once? The same goes for 'clicks_by_publisher'.
Edit 1
Could you elaborate on what you mean by 'token'?
Cassandra partitions rows using the partition key. In the table definition above we have combined the camp_id and createdon values (think of camp_id and createdon like a composite primary key in an RDBMS) to form the partition key. The Cassandra partitioner calculates a hash value from the combined camp_id and createdon, and decides which partition the row goes to. To retrieve the same row, the partitioner needs to recalculate the hash value. The token() function does that.
The timestamp represents the time at which the click event happened; this value is in milliseconds. Using createdon (type bigint) will help distribute the rows evenly across the partitions.
For example, for these insert statements:
1. INSERT INTO clicks_by_campaign (camp_id, createdon, ...) VALUES (100, 1111111111111, ...) -- the calculated hash, let's say 111 (combining the values 100 and 1111111111111) -- this row will go in partition 1
2. INSERT INTO clicks_by_campaign (camp_id, createdon, ...) VALUES (100, 2222222222222, ...) -- the calculated hash, let's say 222 (combining the values 100 and 2222222222222) -- this row will go in partition 2
Java has an API to convert a date into milliseconds, and a date represented in milliseconds can be converted to any format, in any time zone.
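If you'd rather compute the millisecond value on the Cassandra side, CQL in Cassandra 2.2+ also offers a conversion function, for example:
-- current time as epoch milliseconds, computed server-side
SELECT toUnixTimestamp(now()) FROM system.local;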
In fact, your use case is a good candidate for a time-series data model.

Time-series data modelling in Cassandra

I am trying to store and retrieve data in Cassandra in the following way:
Storing Data:
I created the table in the following way:
CREATE TABLE mydata (
myKey TEXT,
datetime TIMESTAMP,
value TEXT,
PRIMARY KEY (myKey,datetime)
);
where I would store a value for every minute for the last 5 years, so it stores 1440 * 365 * 5 = 2,628,000 records/columns per row (myKey as the row key).
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:01:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:02:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:03:00','72F');
.................
I am able to store data and all is fine. However, I would like to know whether this is an efficient way of storing data horizontally (2,628,000 values for each key, for 1 million such keys altogether)?
Retrieving Data:
After storing the data in the above format, I am able to select data for a period using a simple select query.
Ex:
SELECT *
FROM mydata
WHERE myKey='1234ABCD' AND datetime > '2013-04-03 07:01:00' AND datetime < '2013-04-03 07:04:00';
The query works fine and I get the results as expected.
However my question is:
How can I select only the values at certain intervals? For example, if I query data for a day, I get 1440 values (one for every minute). I would like to get values at a 10-minute interval (the value at every 10th minute), limiting the number of values to 144.
Is there a way to query the table if we use the above storage strategy?
If not, what are the possible options to meet my requirement of querying data at a specific interval like 1 min, 10 min, 1 hour, 1 day, etc.?
Appreciate any other suggestions.
No, it is not right; in the future you will face problems, because we can only store 2 billion records (columns) per row key. Beyond that point it will not give an error, but it will keep storing data into an ever-larger partition.
For your problem, split the timestamp column into year, month, day and time, like 2016, 04, 04 and 15:03:00. Also put year, month and day into the partition key.
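A rough CQL sketch of that split (the column names and types here are my assumptions):
CREATE TABLE mydata_split (
    myKey text,
    year int,          -- e.g. 2016
    month int,         -- e.g. 4
    day int,           -- e.g. 4
    time timestamp,    -- e.g. '2016-04-04 15:03:00'
    value text,
    PRIMARY KEY ((myKey, year, month, day), time)
);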
You definitely need to bound your partition with a modular version of the timestamp. But the granularity really depends on your reads.
If you are mainly going to read per day, then use something like this: PK((myKey, yyyymmdd), time)
If mainly by week, PK((myKey, yyyyww), time), or by month...
The problem then is that if you want to read values for a whole year, you'd better have a partition per week or per month (even per year would do, I think, if you don't do any deletes); your partition size needs to stay smaller than 100MB.
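For example, the per-day variant could look like this (treating yyyymmdd as text is my assumption):
CREATE TABLE mydata_by_day (
    myKey text,
    yyyymmdd text,        -- e.g. '20130403'
    datetime timestamp,
    value text,
    PRIMARY KEY ((myKey, yyyymmdd), datetime)
);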

Is creating a new table from scratch to support a new query a common practice in Cassandra

Currently, we have the following table, which enables us to perform queries based on day.
CREATE TABLE events_by_day (
    ...
    traffic_type text,
    device_class text,
    country text,
    ...
    yyyymmdd text,
    event_type text,
    the_datetime timeuuid,
    PRIMARY KEY ((yyyymmdd, event_type), the_datetime)
);
create index index_country on events_by_day (country);
create index index_traffic_type on events_by_day (traffic_type);
create index index_device_class on events_by_day (device_class);
The following queries are being supported.
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('lead', 'view', 'sales');
select * from events_by_day where yyyymmdd = '20160303' and event_type = 'lead' and country = 'my' and device_class = 'smart' and traffic_type = 'WEB' ALLOW FILTERING;
When we need data for more than a day, we perform the query multiple times. Say I need "view" data from the 1st of March 2016 till the 3rd of March 2016; I will query 3 times.
select * from events_by_day where yyyymmdd = '20160301' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160302' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('view');
Currently, all these fit well into our requirement.
However, in the future, let's say we have a new requirement: we need "view" data from 2013 till 2016.
Instead of querying it 1460 times (365 days * 4 years), is it a common practice for us to create a whole new empty table like
CREATE TABLE events_by_year (
    ...
    traffic_type text,
    device_class text,
    country text,
    ...
    yyyy text,
    event_type text,
    the_datetime timeuuid,
    PRIMARY KEY ((yyyy, event_type), the_datetime)
);
and then fill it with data from events_by_day (which might take several days to finish, as the events_by_day table already has many rows)?
The short answer is yes. It is common to roll up weekly, monthly, or yearly data into new tables so that it can be queried more efficiently.
It would also be better to, for example, keep a rolling aggregation that runs daily (or at another suitable period, depending on your data and requirements) and calculates these values, rather than waiting until you need them and then running a process that takes a few days.
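As an illustrative sketch of what such a rollup could maintain (this table and job are hypothetical, not from the question), a counter table that the daily job increments:
CREATE TABLE event_counts_by_day (
    yyyymmdd text,
    event_type text,
    event_count counter,
    PRIMARY KEY (yyyymmdd, event_type)
);

-- run by the daily job for each event processed
UPDATE event_counts_by_day SET event_count = event_count + 1
WHERE yyyymmdd = '20160303' AND event_type = 'view';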
is it a common practice for us to create a whole new empty table?
Yes it is. This is called "Query Based Modeling," and it is quite common in Cassandra. While Cassandra scales and performs well, it does not offer much in the way of query flexibility. So to get around that, instead of using ill-performing methods (secondary indexes, ALLOW FILTERING) to query an existing table, the table is commonly duplicated with a different PRIMARY KEY. Basically, you are trading disk space for performance.
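For example, to serve a by-country query without secondary indexes, the same event rows could be duplicated under a different key; a hypothetical variant of the question's table:
-- same data as events_by_day, keyed to answer
-- "events for a country on a day" from one partition
CREATE TABLE events_by_country (
    country text,
    yyyymmdd text,
    event_type text,
    the_datetime timeuuid,
    traffic_type text,
    device_class text,
    PRIMARY KEY ((country, yyyymmdd), the_datetime)
);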
Not to self-promote or anything, but I gave a talk on this subject at the last Cassandra Summit. You may find the slides helpful: Escaping Disco Era Data Modeling
Speaking of performance, using the IN keyword on a partition key has been proven to be just as bad as using a secondary index. You'll get much better performance with 3 parallel queries, as opposed to this: event_type in ('lead', 'view', 'sales').
Additionally, your last query is using ALLOW FILTERING which is something you should never do on a production system, because it will result in a scan of your entire table, and several of your nodes.
For ideal performance, it is best to ensure that your queries target a specific data partition. This way, you will only hit a single node, and not introduce extraneous network traffic into the equation.

CQL: Search a table in Cassandra using '<' on an indexed column

My Cassandra data model:
CREATE TABLE last_activity_tracker ( id uuid, recent_activity_time timestamp, PRIMARY KEY(id));
CREATE INDEX activity_idx ON last_activity_tracker (recent_activity_time) ;
The idea is to keep track of ids and the most recent time an event occurred for each.
I need to find the ids whose last activity was at least a year ago.
So, I tried:
SELECT * from last_activity_tracker WHERE recent_activity_time < '2013-12-31' allow filtering;
I understand that I cannot use anything other than '=' on secondary-indexed columns.
However, I cannot add 'recent_activity_time' to the primary key, as I need to update this column with the most recent activity time whenever an event occurs.
Any ideas in solving my problem are highly appreciated.
I can see an issue with your query. You're not hitting a partition. As such, the performance of your query will be quite bad. It'll need to query across your whole cluster (assuming you took measures to make this work).
If you're looking to query the last activity time for an id, think about storing it in a more query friendly format. You might try this:
create table tracker (dummy int, day timestamp, id uuid, primary key(dummy, day, id));
You can then insert with day set to the epoch timestamp for the date (ignoring the time of day), and dummy = 0.
That should enable you to do:
select * from tracker where dummy=0 and day > '2013-12-31';
You can set a TTL on insert so that old entries expire (maybe after a year, in this case). The idea is that you're storing information in a way that suits your query.
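A hypothetical insert following that scheme, with a one-year TTL (31,536,000 seconds) and an illustrative uuid:
-- record activity for an id on a given day; the row expires after a year
INSERT INTO tracker (dummy, day, id)
VALUES (0, '2014-06-01', 550e8400-e29b-41d4-a716-446655440000)
USING TTL 31536000;
Each (day, id) pair becomes its own row, and rows older than a year drop out on their own.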
