Cassandra data modeling

So I'm designing this data model for product price tracking.
A product can be followed by many users, and a user can follow many products, so it's a many-to-many relationship.
The products are tracked constantly, but a new price is inserted only if it differs from the previous one.
Each user sets an upper price limit for the products they follow, so every time a price changes, the preferences are checked and users are notified if the price has dropped below their threshold.
So initially I thought of the following product model:
However "subscriberEmails" is a list collection that will handle up to 65536 elements. But being a big data solution, it's a boundary that we don't want to have. So we end up writing a separate table for that:
So now "usersByProduct" can have up to 2 billion columns, fair enough. And the user preferences are stored in a "Map" which is again limited but we think it's a good maximum number of products to follow by user.
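For reference, a minimal sketch of what those two tables might look like (the names and columns below are illustrative assumptions, not the original schema):
CREATE TABLE usersByProduct (
productId uuid,
email text,
PRIMARY KEY (productId, email)
);
CREATE TABLE userPreferences (
email text,
priceLimits map<uuid, float>, // hypothetical: productId -> upper price limit
PRIMARY KEY (email)
);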
Now the problem we're facing is the following:
Every time we want to update a product's price we would have to make a query like this:
INSERT INTO products("Id", date, price) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, dateof(now()), 24.87); // Example only
But INSERT operations don't support any conditional clause other than IF NOT EXISTS, and that isn't what we want. We need to insert the price only if it differs from the previous one, so this forces us to make two queries (one to read the current value and another to insert the new one if necessary).
PS: UPDATE operations do support IF conditions, but that doesn't help here because we need an INSERT.
UPDATE products SET date = dateof(now()) WHERE "Id" = 7dacedd2-c09b-46c5-8686-00c2a03c71dd IF price != 20.3; // example only

Don't try to apply a normalized relational model to a Cassandra database. It may work, but you'll end up with terrible performance and scalability.
The recommended approach to Cassandra data modeling is to first figure out your read queries against the database and structure your data so that these reads are cheap. You'll probably need to duplicate writes somewhat but it's OK because writes are pretty cheap in Cassandra.
For your specific use case, the key query seems to be getting all users interested in a price change for a product, so you create a table for this, for example:
create table productSubscriptions (
productId uuid,
priceLimit float,
createdAt timestamp,
email text,
primary key (productId,priceLimit,createdAt)
);
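With this layout, finding everyone to notify after a price drop is a single-partition slice on the priceLimit clustering column. A sketch (the UUID and price are just example values):
SELECT email FROM productSubscriptions WHERE productId = 7dacedd2-c09b-46c5-8686-00c2a03c71dd AND priceLimit > 24.87; // everyone whose limit is above the new price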
but since you also need to know all product subscriptions for a user, you also need a user-keyed table of the same data:
create table userProductSubscriptions (
email text,
productId uuid,
priceLimit float,
primary key (email, productId)
)
With these 2 tables, I guess you can see that all your main queries can be done with a single-row select, and your inserts/deletes are straightforward but will require you to modify both tables in sync.
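For example, a new subscription could be written to both tables together with a logged batch (a sketch only; the values are illustrative):
BEGIN BATCH
INSERT INTO productSubscriptions (productId, priceLimit, createdAt, email) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, 20.0, dateof(now()), 'user@example.com');
INSERT INTO userProductSubscriptions (email, productId, priceLimit) VALUES ('user@example.com', 7dacedd2-c09b-46c5-8686-00c2a03c71dd, 20.0);
APPLY BATCH;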
Obviously, you'll need to flesh out the schema a bit more for your complete needs, but this should give you an example of how to think about your Cassandra schema.
Conditional update issue
For your conditional insert issue, the easiest answer is: do it with an UPDATE if you really need it (update and insert are nearly identical in CQL) but it's a very expensive operation so avoid it if you can.
For your use case, I would split your product table in three:
create table products (
category uuid,
productId uuid,
url text,
price float,
primary key (category, productId)
)
create table productPricingAudit (
productId uuid,
date timestamp,
price float,
primary key (productId, date)
)
create table priceScheduler (
day text,
checktime timestamp,
productId uuid,
url text,
primary key (day, checktime)
)
The products table holds the full catalog, optionally split into categories (so that listing all products in a single category is a single-row select).
productPricingAudit gets an insert with the latest price retrieved, whatever it is, since this will let you debug any pricing issues you may have.
priceScheduler holds all the checks to be made for a given day, ordered by check time. Your scheduler simply has to make a column range query on a single row whenever it runs.
With such a schema, you don't care about conditional updates: you simply issue 3 inserts whenever you update a product price, even if it doesn't change.
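A sketch of what one check cycle could look like with this schema (the day string, timestamps and bind markers below are illustrative):
// scheduler run: fetch all checks due so far today (one partition, ordered by checktime)
SELECT checktime, productId, url FROM priceScheduler WHERE day = '2015-06-08' AND checktime <= '2015-06-08 12:00:00';
// worker: record the observed price unconditionally, with the three inserts mentioned above
INSERT INTO products (category, productId, url, price) VALUES (?, ?, ?, ?);
INSERT INTO productPricingAudit (productId, date, price) VALUES (?, dateof(now()), ?);
INSERT INTO priceScheduler (day, checktime, productId, url) VALUES ('2015-06-09', ?, ?, ?); // schedule the next check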

Okay, I will try to answer my own question: conditional inserts other than IF NOT EXISTS are not supported in Cassandra to date, period.
The closest thing is a conditional update, but that doesn't work in our scenario. So there's one simple option left: application-side logic. This means reading the previous entry and making the decision in the application. The obvious downside is that 2 queries are performed (one SELECT and one INSERT), which adds latency.
However, this suits our application, because every time we query for the items that should be checked, we can select the item URLs and their current prices too. The workers that fetch the latest price can then decide whether to insert or not, since they already have the current price to compare against.
So... A query similar to this would be performed every X minutes:
SELECT id, url, price FROM products WHERE "nextCheckTime" < now();
// example only, wouldn't even work if nextCheckTime is not part of the PK or index
This is a very costly operation on a Cassandra cluster because it has to go through all the rows, which by default are distributed across different nodes. Another downside is that we need some advanced and specific statistics regarding products and users.
So we've decided that a relational database will serve us better than Cassandra in this particular case.
We sadly give up all of Cassandra's advantages (fast inserts, easy scaling, built-in sharding...) and look towards a MySQL Cluster or master-slave implementation.

Related

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is what I don't understand):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'. Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
Second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max ~100 MB, otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if, within a day, you don't have too many documents (so your partitions don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date, and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
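In that case the lookup becomes two steps, sketched below with the same example values as above: a slice on documents_by_date to get the uids, then a point lookup in documents for each uid.
SELECT uid FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24;
SELECT data FROM documents WHERE uid='xxxx-yyyy-zzzzz';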
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it's a document, I am assuming that one user will create one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data ordered by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used judiciously, as it scans all partitions. If we create a separate table with date as the primary key, it becomes tricky when many documents are inserted at the very same second. Clustering order works best for requirements where documents for a given user need to be sorted by time.

Cassandra for storing click logs

I work in ad tech and our current infrastructure uses MySQL for storing clicks and conversion logs. So far, MySQL has been useful to us for running ad hoc queries against click data.
We are considering switching to Cassandra as we receive huge traffic spikes during peak times. Not only that, we are growing at a very fast rate and we get about 500-1000 clicks per second every now and then (for an extended duration, sometimes for 20-30 minutes).
I have been looking at the options available, and so far my research has led me to believe that nothing beats Cassandra in terms of write performance.
I'm currently in the process of creating a data model to store clicks.
The major components of any click are as follows:
Campaign id
Pub id
Timestamp
Creative id
Event code (whether it is a valid click or an invalid click. This is an int value. For example, event_code=0 is a valid click)
Now, I need to support the following queries:
1. SELECT * FROM clicks WHERE campaign_id=?
2. SELECT * FROM clicks WHERE campaign_id=? AND date_time>=? AND date_time <=?
3. SELECT * FROM clicks WHERE campaign_id=? AND pub_id=? AND date_time>=? AND date_time <=? AND event_code=?
etc
This is simple enough to do with MySQL, after which I just get all the data from these queries in a CSV file.
However, if I were to model my tables based on the first query, this would mean that I would require to create a table in Cassandra like the following:
CREATE TABLE clicks_by_campaign(
camp_id int,
pub_id int,
date_time timestamp,
creative_id int,
event_code int,
//other fields like ip, user agent ,device etc,
PRIMARY KEY(camp_id,pub_id,date_time,event_code,creative_id))
But there are campaigns that can have millions of rows. For example, we have campaigns with a particular id, say id=3, that have more than 7 million clicks.
Wouldn't this create a wide-rows problem? From what I understand, all of this campaign's data would be stored as one partition on one physical machine. Is my thinking here correct, or am I missing something? Please note that other queries have to be supported as well. For example, I might have to share the click logs for a particular publisher (irrespective of the campaign id), in which case the query would look like:
SELECT * FROM clicks_by_publisher WHERE pub_id=?
This obviously would mean that I would have to create another table by the name 'clicks_by_publisher' etc.
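For example, such a table might look something like the sketch below (the column set simply mirrors clicks_by_campaign; the names are illustrative, and note the same wide-partition concern applies to busy publishers, so a date bucket may be needed here too):
CREATE TABLE clicks_by_publisher(
pub_id int,
date_time timestamp,
camp_id int,
creative_id int,
event_code int,
//other fields like ip, user agent, device etc,
PRIMARY KEY(pub_id, date_time, camp_id, creative_id))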
I would also like to point out that I would be using Apache Flink, which would analyze, aggregate and group click info over a 1-minute time window. These results will then be stored in MySQL to provide as much support for ad-hoc queries as possible.
Can someone point me in the right direction?
Is there any other strategy that I can use? Am I missing something?
You have a few options, three of which I feel I can describe. The first is specifying the columns as follows:
campaign_id = PRIMARY_KEY
event_code = CLUSTER_KEY
date_time = CLUSTER_KEY
Running greater than or equal queries on cluster keys is possible. Your queries will run.
You're right in saying this would create a single partition for each campaign id. To solve your rows being stored on one physical machine you could create a different table that links campaign ids to row ids in your clicks table. This would reduce the overall data stored on a single machine.
Another solution would be to prefix each campaign id with a machine id. That splits the number of rows between each machine equally. It would mean creating a query prefixed with each machine id for each query but allows for growth.
This leads on to Spark. Spark will handle running your query on multiple machines and concatenating the results for you automatically, essentially doing what I described above without the development overhead.
Working with Cassandra myself, I opted for a combination of the first and second solutions because it fit the data structure I was working with. Remember that Cassandra is very efficient at writes, so don't be too conservative about creating tables to help filter queries and store your data more sparsely.
Perhaps storing clicks by a hash of campaign ids prefixed by the date will work for you.
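One way to express that idea in the schema is to add a date bucket to the partition key, so a single campaign's clicks are spread over many partitions. A sketch, assuming a text bucket such as 'yyyy-MM-dd' (the table name and values are illustrative):
CREATE TABLE clicks_by_campaign_daily(
camp_id int,
day text, // e.g. '2016-07-15', part of the partition key
date_time timestamp,
pub_id int,
creative_id int,
event_code int,
PRIMARY KEY((camp_id, day), date_time, event_code, creative_id))
// one partition per campaign per day; time-range queries stay within a day
SELECT * FROM clicks_by_campaign_daily WHERE camp_id = 3 AND day = '2016-07-15' AND date_time >= '2016-07-15 00:00:00' AND date_time <= '2016-07-15 12:00:00';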
Edit: Unless disabled, Cassandra will automatically hash your partition keys using the Murmur3 algorithm.
To model your requirement for fast reads and proper data distribution, use the table definition below:
CREATE TABLE clicks_by_campaign(
camp_id int,
createdon bigint,
pub_id int,
creative_id int,
event_code int,
//other fields like ip, user agent ,device etc,
PRIMARY KEY((camp_id,createdon),event_code))
This will help distribute data evenly across the partitions. This will also solve your second and third queries:
2. SELECT * FROM clicks WHERE campaign_id=? AND date_time>=? AND date_time <=?
Query will be -
SELECT * FROM clicks_by_campaign WHERE token(camp_id, createdon) > token(100, 1111111111111) AND token(camp_id, createdon) <= token(100, 22222222222222)
3. SELECT * FROM clicks WHERE campaign_id=? AND pub_id=? AND date_time>=? AND date_time <=? AND event_code=?
The query will be -
SELECT * FROM clicks_by_campaign WHERE token(camp_id, createdon) > token(100, 1111111111111) AND token(camp_id, createdon) <= token(100, 22222222222222) AND event_code=10
First query -
1. SELECT * FROM clicks WHERE campaign_id=?
This is really an anti-pattern in Cassandra. What I would do is process campaign data batch-wise: hourly, daily, weekly, yearly. Think about campaign id again: do we have to process all the data at once? The same goes for 'clicks_by_publisher'.
Edit 1
Could you elaborate on what you mean by 'token'?
Cassandra partitions rows using the partition key. In the above table definition we have combined the camp_id and createdon values (think of camp_id and createdon like a composite primary key in an RDBMS) to form the partition key. The Cassandra partitioner calculates a hash value combining camp_id and createdon, and decides which partition the row goes to. To retrieve the same row, the partitioner needs to recalculate the hash value. The token() function does that.
The timestamp represents the time at which the click event happened; this value is in milliseconds. Using createdon (a long) will help to distribute the rows evenly across the partitions.
For example, for the insert statements:
1. INSERT INTO clicks_by_campaign (camp_id, createdon, ....) values (100, 1111111111111, ......) the calculated hash, let's say 111 (combining values 100, 1111111111111) -- this will go in partition 1
2. INSERT INTO clicks_by_campaign (camp_id, createdon, ....) values (100, 2222222222222, ......) the calculated hash, let's say 222 (combining values 100, 2222222222222) -- this will go in partition 2
Java has an API to convert a date into milliseconds. A date represented in milliseconds can be converted to any format using any time zone.
In fact, your use case is a good candidate for a time series data model.

How to do Cassandra data modeling for aggregate counts?

Let's say I have customer order data coming into my service and I would like to do some reporting on this data. All customer orders are saved in a Cassandra table so that I can get all orders for a given customer:
CREATE TABLE customer_orders (
store_id uuid,
customer_id text,
order_id text,
order_amount int,
order_date timestamp,
PRIMARY KEY (store_id, customer_id)
);
But I would also like to find all the customers with a given number of orders. Ideally I would like to have this in a ready to query table in Cassandra. For example "get all customers who have 1 order".
Therefore I have a table like this:
CREATE TABLE order_count_to_customer (
store_id uuid,
order_count int,
customer_id text,
PRIMARY KEY ((store_id, order_count), customer_id)
);
So the idea is that when an order arrives, both of these tables are updated.
So I create a third table:
CREATE TABLE customer_to_orders_count (
store_id uuid,
customer_id text,
orders_count counter,
PRIMARY KEY (store_id, customer_id)
);
When an order arrives:
I save it in the first table
Then update the counter in the third table by incrementing it by 1.
Then I read the counter in the third table and insert a new record in the second table.
When I need to find all the customers with a given number of orders I just query the second table.
The problem with this is that counters are not atomic and consistent. If I update the counter say to 3 there is no guarantee that when I read it next in order to update the second table it would be 3. It could be 2. Even if I read the counter before I do the update of the counter it could be some value from several steps back. So no guarantee either.
Please note that I am aware of the limitations of the counters in Cassandra and I am not asking how to solve the issue with the counters.
I am rather giving this example in order to ask for some general advice on how to model the data so that aggregate counting is possible. I could of course use Spark to run aggregate queries directly on the first table in my example, but it seems to me there could be a cleverer way to do this, and Spark would also involve bringing the whole table's data into memory.
Have you thought about using the CQL BATCH command? https://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
You can use this across your steps to keep them in one logical atomic unit, where they will either all succeed or all fail. However, this functionality does have a performance penalty.
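As a sketch of that idea for the two non-counter tables (note that counter updates cannot be mixed into a logged batch with regular writes, so the counter table still has to be updated on its own; the bind markers stand for the order's values):
BEGIN BATCH
INSERT INTO customer_orders (store_id, customer_id, order_id, order_amount, order_date) VALUES (?, ?, ?, ?, ?);
INSERT INTO order_count_to_customer (store_id, order_count, customer_id) VALUES (?, ?, ?);
APPLY BATCH;
// the counter increment must be issued as a separate (counter) statement or batch
UPDATE customer_to_orders_count SET orders_count = orders_count + 1 WHERE store_id = ? AND customer_id = ?;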

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say 'ID', 'birthday'), I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was done not by accident, but because such a feature (e.g. update users set status=1 where id>10) would allow a user to update all the data in a table at once, which can be very, very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside an IN clause may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and both of them are not really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound partition+clustering primary key, and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get some kind of even partition distribution in the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic so as to avoid updating/deleting data at all:
instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
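Reading a user's current status from this append-only table is then a single-partition query (a sketch; the bind marker stands for the user's id):
SELECT status FROM user_statuses WHERE id = ? ORDER BY updated_at DESC LIMIT 1;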
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure tables and have read a lot of articles, but would like some reassurance on the above given that it's fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I get stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the RowKey is that every query will be a dataset based on CustomerId and TripId.
Just for context: the CustomerId is clearly unique, the TripId represents one journey in a vehicle, and within that journey the JourneyStep represents a unit of the trip, of which there may be 10 or 1000.
The intention is to aggregate the data into further tables, with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge so need to think about query performance from the outset.
Updated:
As requested, the solution is for vehicle telematics, so think of yourself in your own car. A black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a Customer table and a Trip table with a foreign key back to the Customer table.
The TripId is auto-generated by the black box. TripId does not need to be stored by date-time from a query point of view; however, it may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for a customer, so filter by customer and then trip, then iterate over each row (JourneyStep) onto a map.
Per customer, I will score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have a SQL Database to enrich the data with client records etc., but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query which would score each trip, then produce a score for all trips that month and store both answers so potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the PartitionKey is that it represents a logical grouping; you cannot insert data spanning multiple partition keys in a single batch transaction, for example. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will need to have, separately, a table that lists all the Trip IDs that belong to a given customer, which sort of makes sense as you probably want to store some data, such as "trip name" etc., in such a table anyway.
Your design has to be driven by your queries. You can filter your data based on two columns, PartitionKey and RowKey. PartitionKey is your most important column, since your queries will hit that column first.
In your case CustomerId should be your PartitionKey since most of the time you will try to reach your data based on the customer. (you may also need to keep another table for your client list)
Now, the RowKey can be your TripId or the time. If I were you, I would probably use a RowKey in the yyyyMMddHHmm|tripId format, which will let you query based on 'starts with' and 'ends with' style range filters.
Adding to @Frans' answer:
One thing you could do is create a separate table for each customer, so you could have a table named after the customer. That way each customer's data is nicely segregated into different tables. Then you could use TripId as the PartitionKey and JourneyStep as the RowKey, as suggested by @Frans. For storing some metadata about the trip, instead of going into a separate table, I would still use the same table, but I would keep the RowKey empty and put the other information about the trip there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of the DateTime.MaxValue.Ticks - Time.Ticks formatted to a large 0-padded string with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("00000000000000000"), JourneyStep.ToString())
Such a combination will allow you to do the following queries "quickly":
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get the last X journey steps, searching by either CustomerId or CustomerId/TripId, by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of the Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey, as it is somewhat inconvenient.
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
The design of your table storage keys comes down to optimizing two major capabilities of Azure Tables:
Scalability
Search performance
As user @Frans already pointed out, Azure Tables uses the PartitionKey to decide how to scale your data out onto multiple storage server nodes. Because of this, I would advise against having unique partition keys, since in theory you would end up with Azure spreading your data across storage nodes that each serve one customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify patterns in your partition keys and group them accordingly (for example, if your ids are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage will be unpredictable and at the mercy of obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both PartitionKey and RowKey in your search queries. Unlike Amazon DynamoDB, Azure Tables does not support secondary column indexes. If your search queries filter on attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the row key is a unique customer ID. This meant our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, by location, or by email address SMTP domain?
