Is creating a new table from scratch to support a new query a common practice in Cassandra?

Currently, we have the following table, which lets us query by day.
CREATE TABLE events_by_day(
...
traffic_type text,
device_class text,
country text,
...
yyyymmdd text,
event_type text,
the_datetime timeuuid,
PRIMARY KEY((yyyymmdd, event_type), the_datetime));
create index index_country on events_by_day (country);
create index index_traffic_type on events_by_day (traffic_type);
create index index_device_class on events_by_day (device_class);
The following queries are being supported.
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('lead', 'view', 'sales');
select * from events_by_day where yyyymmdd = '20160303' and event_type = 'lead' and country = 'my' and device_class = 'smart' and traffic_type = 'WEB' ALLOW FILTERING;
When we need more than one day of data, we run the query multiple times. Say I need "view" data from the 1st of March 2016 to the 3rd of March 2016; I will query 3 times.
select * from events_by_day where yyyymmdd = '20160301' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160302' and event_type in ('view');
select * from events_by_day where yyyymmdd = '20160303' and event_type in ('view');
Currently, all these fit well into our requirement.
However, let's say that in the future we have a new requirement: we need "view" data from 2013 to 2016.
Instead of querying 1460 times (365 days * 4 years), is it a common practice to create a whole new empty table like
CREATE TABLE events_by_year(
...
traffic_type text,
device_class text,
country text,
...
yyyy text,
event_type text,
the_datetime timeuuid,
PRIMARY KEY((yyyy, event_type), the_datetime));
and then fill it with the data from events_by_day (which might take several days to finish, as the events_by_day table already has many rows)?

The short answer is yes. It is common to roll up weekly, monthly, or yearly data into new tables so that it can be queried more efficiently.
It also would be better to, for example, keep a rolling aggregation that runs daily (could be another suitable time period depending on your data and requirements) and calculates these values, rather than waiting until you need them and then running a process that takes a few days.
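A rolling aggregation of this kind could look roughly like the sketch below. It is only an illustration under assumed names: `year_bucket` and `rollup_counts` are hypothetical helpers, and the input is an in-memory list of `(yyyymmdd, event_type)` pairs standing in for one day's rows read from events_by_day. It shows how day keys map onto the coarser `(yyyy, event_type)` key of a yearly table.

```python
from collections import Counter


def year_bucket(yyyymmdd: str) -> str:
    """Map a day key such as '20160303' to its year key '2016'."""
    if len(yyyymmdd) != 8 or not yyyymmdd.isdigit():
        raise ValueError(f"expected yyyymmdd, got {yyyymmdd!r}")
    return yyyymmdd[:4]


def rollup_counts(rows):
    """Count events per (yyyy, event_type) from (yyyymmdd, event_type) pairs."""
    counts = Counter()
    for yyyymmdd, event_type in rows:
        counts[(year_bucket(yyyymmdd), event_type)] += 1
    return counts
```

Running such a job once per day keeps the yearly table current, so no multi-day backfill is needed when the yearly query is finally required.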

is it a common practice for us to create a whole new empty table?
Yes it is. This is called "Query Based Modeling," and it is quite common in Cassandra. While Cassandra scales and performs well, it does not offer much in the way of query flexibility. So to get around that, instead of using ill-performing methods (secondary indexes, ALLOW FILTERING) to query an existing table, the table is commonly duplicated with a different PRIMARY KEY. Basically, you are trading disk space for performance.
Not to self-promote or anything, but I gave a talk on this subject at the last Cassandra Summit. You may find the slides helpful: Escaping Disco Era Data Modeling
Speaking of performance, using the IN keyword on a partition key has been proven to be just as bad as using a secondary index. You'll get much better performance with 3 parallel queries, as opposed to this: event_type in ('lead', 'view', 'sales').
Additionally, your last query is using ALLOW FILTERING which is something you should never do on a production system, because it will result in a scan of your entire table, and several of your nodes.
For ideal performance, it is best to ensure that your queries target a specific data partition. This way, you will only hit a single node, and not introduce extraneous network traffic into the equation.
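The "3 parallel queries" suggestion can be sketched as follows. This is a minimal illustration, not driver code: `run_query` is a hypothetical caller-supplied callable that takes a CQL string and a parameter tuple and returns rows, and the table and column names are taken from the CREATE TABLE statement in the question.

```python
from concurrent.futures import ThreadPoolExecutor


def query_event_types(run_query, yyyymmdd, event_types):
    """Run one single-partition query per event type concurrently.

    run_query is a caller-supplied callable (hypothetical here) that takes
    a CQL string plus a parameter tuple and returns a list of rows.
    """
    cql = "SELECT * FROM events_by_day WHERE yyyymmdd = %s AND event_type = %s"
    with ThreadPoolExecutor(max_workers=max(1, len(event_types))) as pool:
        futures = [pool.submit(run_query, cql, (yyyymmdd, et)) for et in event_types]
        # Collect in submission order so results line up with event_types.
        return {et: f.result() for et, f in zip(event_types, futures)}
```

Each query names a complete partition key, so each one can be served by the replicas that own that partition instead of being fanned out by a single coordinator.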

Related

Cassandra time series table design for timestamp range queries

Our problem is a bit different from a usual time series problem, as we do not have a natural partition key in our data. Our system receives no more than 5k messages/s, so following many publications (like this one) we came up with the following schema (it's more complex, but the part below matters most):
CREATE TABLE IF NOT EXISTS test.messages (
date TEXT,
hour INT,
createdAt TIMESTAMP,
uuid UUID,
data TEXT,
PRIMARY KEY ((date, hour), createdAt, uuid)
)
We mostly want to query the system based on the creation (event) time; other filtering will likely be done on different engines like Spark. The problem is that we may have a query that spans e.g. two months, so ideally we should put 60+ dates and 24 hours in the WHERE-IN-part of query, which is cumbersome to say the least. Of course, we can execute queries like below:
SELECT * FROM messages WHERE createdat >= '2017-03-01 00:00:00' LIMIT 10 ALLOW FILTERING;
My understanding is that, while the above works, it will do a full scan, which will be expensive on a larger cluster. Or am I mistaken, and C* knows which partitions to scan?
I was thinking of adding an index, but as I understand it, this problem likely falls into the high-cardinality anti-pattern.
EDIT: the question is not so much about the data model, though suggestions are welcome, but more about the feasibility of querying with a createdat range instead of listing all required date and hour values in the WHERE-IN part of the query, while still avoiding full scans.
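For reference, listing the date and hour values does not have to be done by hand. The sketch below is only an assumption about how the application side might enumerate them: `hour_buckets` is a hypothetical helper, and it assumes the `date` column stores text in `YYYY-MM-DD` form, which the question does not state.

```python
from datetime import datetime, timedelta


def hour_buckets(start: datetime, end: datetime):
    """Return every (date, hour) partition key touched by [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    # Round down to the start of the first hour bucket.
    current = start.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while current <= end:
        buckets.append((current.strftime("%Y-%m-%d"), current.hour))
        current += timedelta(hours=1)
    return buckets
```

A two-month range yields roughly 1,460 buckets, each of which can then be queried as a single partition with a createdAt range on the clustering column.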

Cassandra for storing click logs

I work in ad tech and our current infrastructure uses MySQL for storing clicks and conversion logs. So far, MySQL has been useful to us for running ad hoc queries against click data.
We are considering switching to Cassandra, as we receive huge traffic spikes during peak times. Not only that, we are growing at a very fast rate, and we get about 500-1000 clicks per second every now and then (for an extended duration, sometimes 20-30 minutes).
I have been researching the options available, and so far my research has led me to believe that nothing beats Cassandra in terms of write performance.
I'm currently in the process of creating a data model to store clicks.
The major component of any clicks are as follows:
Campaign id
Pub id
Timestamp
Creative id
Event code (whether it is a valid click or an invalid click. This is an int value. For example, event_code=0 is a valid click)
Now, I need to support the following queries:
1. SELECT * FROM clicks WHERE campaign_id=?
2. SELECT * FROM clicks WHERE campaign_id=? AND date_time>=? AND date_time <=?
3. SELECT * FROM clicks WHERE campaign_id=? AND pub_id=? AND date_time>=? AND date_time <=? AND event_code=?
etc
This is simple enough to do with MySQL, after which I just get all the data from these queries in a CSV file.
However, if I were to model my tables based on the first query, I would need to create a table in Cassandra like the following:
CREATE TABLE clicks_by_campaign(
camp_id int,
pub_id int,
date_time timestamp,
creative_id int,
event_code int,
//other fields like ip, user agent ,device etc,
PRIMARY KEY(camp_id,pub_id,date_time,event_code,creative_id))
But there are campaigns that can have millions of rows. For example, we have campaigns with a particular id, say id=3, that have more than 7 million clicks.
Wouldn't this create a wide rows problem? From what I understand, all of this campaign data would be stored as one partition on one physical machine. Is my thinking here correct or am I missing something? Please note that other queries have to be supported as well. For example, I might have to share the click logs for a particular publisher(irrespective of the campaign id). In which case, the query would look like:
SELECT * FROM clicks_by_publisher WHERE pub_id=?
This obviously would mean that I would have to create another table by the name 'clicks_by_publisher' etc.
I would also like to point out that I would be using Apache Flink that would analyze, aggregate and group clicks info on a time window of 1 minute. These results will further be stored into MySQL to provide as much support for ad-hoc queries as possible.
Can someone point me in the right direction?
Is there any other strategy that I can use? Am I missing something?
You have a few options; there are three that I feel I can describe. The first is to specify the columns as follows:
campaign_id = partition key
event_code = clustering key
date_time = clustering key
Running greater than or equal queries on cluster keys is possible. Your queries will run.
You're right in saying this would create a single partition for each campaign id. To avoid all of a campaign's rows being stored on one physical machine, you could create a different table that links campaign ids to row ids in your clicks table. This would reduce the overall data stored on a single machine.
Another solution would be to prefix each campaign id with a machine id. That splits the rows equally between the machines. It would mean issuing one query per machine id prefix for each logical query, but it allows for growth.
This leads on to Spark. Spark will handle running your query on multiple machines and concatenating the results for you automatically, essentially doing what I described above without the development overhead.
Working with Cassandra myself, I opted for a combination of the first and second solutions because it fit the data structure I was working with. Remember that Cassandra is very efficient at writes, so don't be too conservative about creating tables to help filter queries and store your data more sparsely.
Perhaps storing clicks by a hash of campaign id's prefixed by the date will work for you.
Edit: Unless a different partitioner is configured, Cassandra will automatically hash your partition keys using the Murmur3 algorithm.
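The idea of prefixing the campaign id so that one campaign spans several partitions can be sketched as below. Note the substitutions: the sketch uses a numbered bucket rather than a machine id, and it derives the bucket from a click identifier, `click_id`, which is hypothetical (the question's schema has no such column). `NUM_BUCKETS` is likewise an assumed value.

```python
import zlib

NUM_BUCKETS = 16  # assumed bucket count; tune to the largest campaign


def bucket_for(click_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a stable bucket for a click so one campaign spans several partitions."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(click_id.encode("utf-8")) % num_buckets


def partition_keys(camp_id: int, num_buckets: int = NUM_BUCKETS):
    """All (camp_id, bucket) partition keys a reader must query for a campaign."""
    return [(camp_id, b) for b in range(num_buckets)]
```

Writes go to `(camp_id, bucket_for(click_id))`, while a read for a whole campaign issues one query per key returned by `partition_keys` and merges the results.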
To model your requirement for fast reads and distributed writes, use the table definition below:
CREATE TABLE clicks_by_campaign(
camp_id int,
createdon bigint,
pub_id int,
creative_id int,
event_code int,
//other fields like ip, user agent ,device etc,
PRIMARY KEY((camp_id,createdon),event_code))
This will help to distribute data evenly across the partitions. It will also solve your second and third queries:
2. SELECT * FROM clicks WHERE campaign_id=? AND date_time>=? AND date_time <=?
Query will be -
SELECT * FROM clicks_by_campaign WHERE token(camp_id, createdon) > token(100, 1111111111111) AND token(camp_id, createdon) <= token(100, 2222222222222)
3. SELECT * FROM clicks WHERE campaign_id=? AND pub_id=? AND date_time>=? AND date_time <=? AND event_code=?
The query will be -
SELECT * FROM clicks_by_campaign WHERE token(camp_id, createdon) > token(100, 1111111111111) AND token(camp_id, createdon) <= token(100, 2222222222222) AND event_code=10
First query -
1. SELECT * FROM clicks WHERE campaign_id=?
This is really an anti-pattern in Cassandra. What I would do is process campaign data in batches: hourly, daily, weekly, yearly. Think about the campaign id again: do we have to process all the data at once? The same goes for 'clicks_by_publisher'.
Edit 1
Could you elaborate on what you mean by 'token' ?
Cassandra partitions rows by partition key. In the table definition above, camp_id and createdon together form the partition key (think of them like a composite primary key in an RDBMS). The Cassandra partitioner computes a hash from the combined camp_id and createdon values and uses it to decide which partition the row goes to. To retrieve the same row, the partitioner needs to recompute that hash value. The token() function does that.
The timestamp represents the time at which the click event happened, in milliseconds. Using createdon (a bigint) helps distribute the rows evenly across the partitions.
For example, for these insert statements:
1. INSERT INTO clicks_by_campaign (camp_id,createdon ,....) values (100,1111111111111,......) the calculated hash is, let's say, 111 (combining the values 100,1111111111111) -- this will go in partition 1
2. INSERT INTO clicks_by_campaign (camp_id,createdon ,....) values (100,2222222222222,......) the calculated hash is, let's say, 222 (combining the values 100,2222222222222) -- this will go in partition 2
Java has an API to convert a date into milliseconds. A date represented in milliseconds can be converted to any format in any time zone.
In fact, your use case is a good candidate for a time series data model.

Cassandra data modeling

So I'm designing this data model for product price tracking.
A product can be followed by many users and a user can follow many products, so it's a many-to-many relation.
The products are under constant tracking, but a new price is inserted only if it has varied from the previous one.
The users have set an upper price limit for their followed products, so every time a price varies, the preferences are checked and the users will be notified if the price has dropped below their threshold.
So initially I thought of the following product model:
However "subscriberEmails" is a list collection that will handle up to 65536 elements. But being a big data solution, it's a boundary that we don't want to have. So we end up writing a separate table for that:
So now "usersByProduct" can have up to 2 billion columns, fair enough. And the user preferences are stored in a "Map" which is again limited but we think it's a good maximum number of products to follow by user.
Now the problem we're facing is the following:
Every time we want to update a product's price we would have to make a query like this:
INSERT INTO products("Id", date, price) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, dateof(now()), 24.87); // Example only
But INSERT operations don't admit other conditional clauses than (IF NOT EXISTS) and that isn't what we want. We need to update the price only if it's different from the previous one, so this forces us to make two queries (one for reading current value and another to update it if necessary).
P.S. UPDATE operations do have IF conditions, but that doesn't fit our case because we need an INSERT.
UPDATE products SET date = dateof(now()) WHERE "Id" = 7dacedd2-c09b-46c5-8686-00c2a03c71dd IF price != 20.3; // example only
Don't try to apply a normalized relational model to a Cassandra database. It may work, but you'll end up with terrible performance and scalability.
The recommended approach to Cassandra data modeling is to first figure out your read queries against the database and structure your data so that these reads are cheap. You'll probably need to duplicate writes somewhat but it's OK because writes are pretty cheap in Cassandra.
For your specific use case, the key query seems to be able to get all users interested in a price change in a product, so you create a table for this, for example:
create table productSubscriptions (
productId uuid,
priceLimit float,
createdAt timestamp,
email text,
primary key (productId,priceLimit,createdAt)
);
but since you also need to know all product subscriptions for a user, you also need a user-keyed table of the same data:
create table userProductSubscriptions (
email text,
productId uuid,
priceLimit float,
primary key (email, productId)
)
With these 2 tables, I guess you can see that all your main queries can be done with a single-row select and your insert/delete are straightforward but will require you to modify both tables in sync.
Obviously, you'll need to flesh out the schema a bit more for your complete needs, but this should give you an example of how to think about your Cassandra schema.
Conditional update issue
For your conditional insert issue, the easiest answer is: do it with an UPDATE if you really need it (update and insert are nearly identical in CQL) but it's a very expensive operation so avoid it if you can.
For your use case, I would split your product table in three:
create table products (
category uuid,
productId uuid,
url text,
price float,
primary key (category, productId)
)
create table productPricingAudit (
productId uuid,
date timestamp,
price float,
primary key (productId, date)
)
create table priceScheduler (
day text,
checktime timestamp,
productId uuid,
url text,
primary key (day, checktime)
)
The products table can hold the full catalog, optionally split into categories (so that listing all products in a single category is a single-row select).
productPricingAudit would get an insert with the latest price retrieved, whatever it is, since this will let you debug any pricing issue you may have.
priceScheduler holds all the checks to be made for a given day, ordered by check time. Your scheduler simply has to make a column range query on a single row whenever it runs.
With such a schema, you don't care about conditional updates; you simply issue 3 inserts whenever you update a product price, even if it doesn't change.
Okay, I will try to answer my own question: conditional inserts other than "IF NOT EXISTS" are not supported in Cassandra as of this writing, period.
The closest thing is a conditional update, but that doesn't work in our scenario. So there's one simple option left: application side logic. This means that you have to read the previous entry and perform the decision on your application. The obvious downside is that 2 queries are performed (one SELECT and one INSERT) which obviously adds latency.
However this suits our application because every time we perform a query to enqueue all items that should be checked, we can select the items urls and their current prices too. So the workers that check the latest price can then make the decision of inserting or not because they have the current price to compare with.
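The worker-side decision described above might look like this minimal sketch. Both function names are hypothetical, and prices are modeled as `Decimal` values, which is an assumption made here to avoid float comparison surprises rather than something the original schema specifies.

```python
from decimal import Decimal
from typing import Optional


def price_changed(previous: Optional[Decimal], latest: Decimal) -> bool:
    """True when the latest price should be inserted as a new row."""
    # No stored price yet: always record the first observation.
    if previous is None:
        return True
    return latest != previous


def below_limit(latest: Decimal, price_limit: Decimal) -> bool:
    """True when a subscriber's upper price limit has been undercut."""
    return latest < price_limit
```

The worker would call `price_changed` with the price it selected when the item was enqueued, issue the INSERT only on `True`, and then use `below_limit` to decide which subscribers to notify.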
So... A query similar to this would be performed every X minutes:
SELECT id, url, price FROM products WHERE "nextCheckTime" < now();
// example only, wouldn't even work if nextCheckTime is not part of the PK or index
This is a very costly operation to perform on a Cassandra cluster because it has to go through all rows that are stored randomly in different nodes by default. Another downside is that we need some advanced and specific statistics regarding products and users.
So we've decided that a relational database will serve us better than Cassandra in this particular case.
We sadly leave all of Cassandra's advantages (fast inserts, easy scaling, built in sharding...) and look towards a MySQL Cluster or master-slave implementation.

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) the Cassandra NoSQL DBMS to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important, and it must match the kind of queries that you are trying to perform. The reason Cassandra is trying to keep you from performing queries on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately, Cassandra isn't able to determine automatically the best way to partition data. The end user must determine a schema which fits into the C* data model and allows the queries they want at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT date FROM active_accounts where customer_name = ? and account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job, since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
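The application-side code for spanning several months could start from something like the sketch below. `month_partitions` is a hypothetical helper; it assumes the `start_month` partition key is the first day of each month, as described above.

```python
from datetime import date


def month_partitions(start: date, end: date):
    """Return the first day of every month overlapping [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    year, month = start.year, start.month
    months = []
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months
```

Each returned value is one partition to query, with the requested date range applied to the `date` clustering column inside that partition.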
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
The ALLOW FILTERING keyword was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for use by those who run analytics jobs or those C* users who fully understand the implications. In Cassandra, the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily require contacting every node in the cluster. This will become less and less performant as the cluster grows, and it is definitely not something you want for a frequent query.

CQL: Search a table in Cassandra using '<' on an indexed column

My cassandra data model:
CREATE TABLE last_activity_tracker ( id uuid, recent_activity_time timestamp, PRIMARY KEY(id));
CREATE INDEX activity_idx ON last_activity_tracker (recent_activity_time) ;
The idea is to keep track of 'id's and their most recent activity of an event.
I need to find the 'id's whose last activity was a year ago.
So, I tried:
SELECT * from last_activity_tracker WHERE recent_activity_time < '2013-12-31' allow filtering;
I understand that I cannot use other than '=' for secondary indexed columns.
However, I cannot add 'recent_activity_time' to the key as I need to update this column with the most recent activity time of an event if any.
Any ideas in solving my problem are highly appreciated.
I can see an issue with your query. You're not hitting a partition. As such, the performance of your query will be quite bad. It'll need to query across your whole cluster (assuming you took measures to make this work).
If you're looking to query the last activity time for an id, think about storing it in a more query friendly format. You might try this:
create table tracker (dummy int, day timestamp, id uuid, primary key(dummy, day, id));
You can then insert with the day to be the epoch for the date (ignoring the time), and dummy = 0.
That should enable you to do:
select * from tracker where dummy=0 and day > '2013-12-31';
You can set a ttl on insert so that old entries expire (maybe after a year in this case). The idea is that you're storing information in a way that suits your query.
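Computing "the epoch for the date (ignoring the time)" could be done as in the sketch below. `day_epoch_ms` is a hypothetical helper that expects a timezone-aware datetime and truncates in UTC, and the one-year TTL constant is an assumed retention period based on the "maybe after a year" remark.

```python
from datetime import datetime, timezone

ONE_YEAR_TTL_SECONDS = 365 * 24 * 60 * 60  # assumed retention for USING TTL


def day_epoch_ms(moment: datetime) -> int:
    """Milliseconds since the Unix epoch for midnight UTC of moment's day."""
    # Expects a timezone-aware datetime; convert to UTC before truncating.
    utc = moment.astimezone(timezone.utc)
    midnight = utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp()) * 1000
```

The returned value would be bound to the `day` column on insert, with the TTL constant passed to `USING TTL`.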
