Cassandra data modeling for calculation

I have a customer_info table in Cassandra. It will have the following columns.
UUID is the primary key.
customer_id
amount
other fields ...
There is a $100 transaction limit for each customer over a 365-day period.
I have the following 2 options:
1. Select all records for a particular customer_id from the customer_info table and do the calculation in memory in application code; if the transaction limit would not cross $100, do the insert or update in the customer_info table.
2. Maintain a new table customer_limit consisting of customer_id and limit fields. Before any CRUD operation on customer_info, query the customer_limit table for the limit and, based on it, perform the CRUD operation on customer_info.
In terms of maintenance and faster read/write, which option is best suited?

I would use 2 tables for this purpose.
table-2 would be a counter table with the limit as the counter value. You should always query table-2 before inserting into the customer_info table.
Read up on Cassandra counters: they make concurrent increments easy and avoid a read-before-write in your application code.
Also, please read about partition and clustering key concepts. Your choice of keys for customer_info is not very good.
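For illustration, a minimal sketch of that counter table (table and column names are made up; counters are integers, so track cents or whole dollars):

-- Hypothetical counter table: one running total per customer.
CREATE TABLE customer_limit (
    customer_id text,
    amount_spent counter,
    PRIMARY KEY (customer_id)
);

-- The increment is concurrency-safe and needs no read-before-write:
UPDATE customer_limit SET amount_spent = amount_spent + 25
WHERE customer_id = 'cust-42';

-- Check the running total before allowing the next transaction:
SELECT amount_spent FROM customer_limit WHERE customer_id = 'cust-42';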

I think you must keep the details of each transaction, because you need a "moving" window of fixed width (365 days) that "advances" with each transaction.
You could create a transactions table with the following primary key pair:
(customer_id, transaction_date)
By clustering this table in DESC order (by date, of course), you can always query the last 365 days efficiently, every day.
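Something like this, for example (names are illustrative):

CREATE TABLE transactions (
    customer_id text,
    transaction_date timestamp,
    amount int,
    PRIMARY KEY (customer_id, transaction_date)
) WITH CLUSTERING ORDER BY (transaction_date DESC);

-- All of one customer's transactions in the trailing 365 days; the lower
-- bound is computed by the application (now minus 365 days):
SELECT amount FROM transactions
WHERE customer_id = 'cust-42' AND transaction_date > '2015-01-01';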


Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried to create a secondary index, but with no success.
This is the query I've created:
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all of it and find the highest value, and that requires scanning data on multiple nodes, which will be very slow.
A materialized view also won't help much, as ordering only exists inside an individual partition, so you would need to put all your data into a single partition, which could be huge and would leave the data imbalanced.
I can only think of the following workaround:
Have an additional table that has all the columns of the original table, but with a fake partition key and no clustering columns.
Do inserts into that table in parallel with the normal inserts, but use a fixed value for that fake partition key, and explicitly set the record's write timestamp equal to start_time (don't forget to multiply by 1000, as the write timestamp uses microseconds). The row is then guaranteed to hold the value with the highest timestamp, since Cassandra won't overwrite it with data carrying a lower timestamp.
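A sketch of that workaround (table name and the fixed key value are made up):

-- One-row table: the fake partition key is always 0, no clustering columns.
CREATE TABLE latest_event (
    fake_key int,
    event_id VARCHAR,
    number VARCHAR,
    start_time TIMESTAMP,
    PRIMARY KEY (fake_key)
);

-- Written alongside every normal insert. The explicit write timestamp
-- (start_time in microseconds) means an older event can never overwrite
-- a newer one:
INSERT INTO latest_event (fake_key, event_id, number, start_time)
VALUES (0, 'evt-1', '42', '2020-01-01 00:00:00+0000')
USING TIMESTAMP 1577836800000000;

-- Reading the current maximum is then a single-row lookup:
SELECT * FROM latest_event WHERE fake_key = 0;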
But this doesn't solve the problem of data skew, and all that traffic will be handled by a fixed number of nodes equal to the RF.
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided that (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!
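For illustration, here is a layout that does satisfy both conditions, though only within one partition (table name is hypothetical):

CREATE TABLE events_by_time (
    event_id VARCHAR,
    start_time TIMESTAMP,
    number VARCHAR,
    PRIMARY KEY (event_id, start_time, number)
) WITH CLUSTERING ORDER BY (start_time DESC, number ASC);

-- Valid: the partition key is restricted with EQ and rows come back
-- newest-first, so LIMIT 1 yields the highest start_time for that event.
SELECT * FROM events_by_time WHERE event_id = 'evt-1' LIMIT 1;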

Cassandra use aggregate function and then order by that aggregate

I have a Cassandra database with a table that has the following columns:
itemid
userid
rating
itemid and userid make up the primary key. My query looks like this:
SELECT itemid, avg(rating) as avgRating from mytable GROUP BY itemid order by avgRating asc;
I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
How can I fix this?
I need to order by the average ratings after so I can get the top 10 movies based on their average rating.
Cassandra can only order results by clustering column(s). It cannot order results by an aggregate function.
There are a couple of options you could look at in order to accomplish this.
Make the query and then re-order the results in your application.
This option may work if you only expect a limited number of rows to be returned from each query.
Note that it is recommended you only use aggregate functions (like avg()) when you know they will apply to a limited number of rows. Ideally you should only use them when operating on a single partition (use a WHERE clause to restrict the query to a single partition). Without any limit you may see very slow queries, or query timeouts if Cassandra needs to read a large number of rows to calculate the aggregate.
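For example, something like this stays cheap because it only touches one partition (assuming itemid is the partition key and avg() is available, i.e. Cassandra 2.2+):

-- Reads only the 'movie-1' partition, not the whole table:
SELECT avg(rating) AS avg_rating FROM mytable WHERE itemid = 'movie-1';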
Store a pre-calculated average in the table, or cache it in your application.
This is the best option if you need calculated averages over a larger data set.
If you make average_rating a clustering column Cassandra will store the averages for each partition in sorted order. This is very efficient from Cassandra's perspective.
The downside is that you'll need to calculate the average in your application each time you insert or update a row, because it will be a primary key column in your Cassandra table.
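A sketch of what that could look like (bucket is a made-up synthetic partition key so all items sort together; note that a single bucket concentrates all the data in one partition):

CREATE TABLE items_by_avg_rating (
    bucket int,
    avg_rating decimal,
    itemid text,
    PRIMARY KEY (bucket, avg_rating, itemid)
) WITH CLUSTERING ORDER BY (avg_rating DESC, itemid ASC);

-- Top 10 items by average rating, read back already sorted:
SELECT itemid, avg_rating FROM items_by_avg_rating WHERE bucket = 0 LIMIT 10;

-- On each new rating the application recomputes the average, deletes the
-- old row (the old average is part of the key), and inserts the new one.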
One thing you could look into is using a Cassandra trigger to calculate the average for you. This may make life easier if you have multiple applications writing to this table; however, I am not actually sure whether it is possible to modify a primary key column via a custom trigger. I would recommend doing some research and testing if you decide to look at this option. You can read about triggers in the Cassandra documentation.

How to do Cassandra data modeling for aggregate counts?

Let's say I have customer orders data coming into my service and I would like to do some reporting on this data. All customer orders are saved in a Cassandra table so that I can get all orders for a given customer:
CREATE TABLE customer_orders (
    store_id uuid,
    customer_id text,
    order_id text,
    order_amount int,
    order_date timestamp,
    PRIMARY KEY (store_id, customer_id)
);
But I would also like to find all the customers with a given number of orders. Ideally I would like to have this in a ready-to-query table in Cassandra. For example, "get all customers who have 1 order".
Therefore I have a table like this:
CREATE TABLE order_count_to_customer (
    store_id uuid,
    order_count int,
    customer_id text,
    PRIMARY KEY ((store_id, order_count), customer_id)
);
So the idea is that when an order arrives, both of these tables are updated.
So I create a third table:
CREATE TABLE customer_to_orders_count (
    store_id uuid,
    customer_id text,
    orders_count counter,
    PRIMARY KEY (store_id, customer_id)
);
When an order arrives:
I save it in the first table
Then I update the counter in the third table, incrementing it by 1.
Then I read the counter from the third table and insert a new record into the second table.
When I need to find all the customers with a given number of orders I just query the second table.
The problem with this is that counters are not atomic and consistent. If I update the counter to, say, 3, there is no guarantee that when I next read it to update the second table it will be 3. It could be 2. Even if I read the counter before doing the update, it could be a value from several steps back. So no guarantee either way.
Please note that I am aware of the limitations of the counters in Cassandra and I am not asking how to solve the issue with the counters.
I am rather giving this example to ask for general advice on how to model the data so that aggregate counting is possible. I can of course use Spark to run aggregate queries directly on the first table in my example. But it seems to me that there could be a more clever way to do this, and Spark would also involve bringing the whole table's data into memory.
Have you thought about using the CQL BATCH command? https://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
You can use it to group your steps into one logical atomic unit, where they will either all succeed or all fail. However, this functionality does carry a performance penalty.
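A sketch with the tables above (values are made up; note that counter updates cannot be mixed with regular writes, so the counter increment needs its own counter batch):

BEGIN BATCH
    INSERT INTO customer_orders (store_id, customer_id, order_id, order_amount, order_date)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'cust-1', 'ord-9', 100, '2020-01-01');
    INSERT INTO order_count_to_customer (store_id, order_count, customer_id)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 3, 'cust-1');
APPLY BATCH;

BEGIN COUNTER BATCH
    UPDATE customer_to_orders_count SET orders_count = orders_count + 1
    WHERE store_id = 123e4567-e89b-12d3-a456-426614174000 AND customer_id = 'cust-1';
APPLY BATCH;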

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) the Cassandra NoSQL DBMS to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically upserts on any existing primary key match, so the following inserts produce only two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make is: how many distinct customer accounts were active within the last month, say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and must match the kind of queries that you are trying to perform. The reason that Cassandra is trying to keep you from performing queries on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response times. Unfortunately, Cassandra isn't able to determine automatically the best way to partition data. The end user must design a schema which fits the C* data model and allows the desired queries at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT timestamp FROM active_accounts where customer_name = ? and account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job, since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partition key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition, on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition, making range slices between particular dates very fast. Unfortunately, you would have to add code on the application side to pull down the multiple months that a query might span. This schema takes a fair amount of dev work, so if these queries are very infrequent you should use a distributed analytics platform instead.
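Queries against that schema would then look something like this (the application issues one per month in the requested range):

-- Partition pinned with EQ, range on the first clustering column:
SELECT customer_name, account_name, date FROM active_accounts
WHERE start_month = '2014-12-01 00:00:00+0000'
  AND date >= '2014-12-10 00:00:00+0000'
  AND date <= '2014-12-14 00:00:00+0000';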
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for those running analytics jobs, or C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily contact every node in the cluster. This becomes less and less performant as the cluster grows and is definitely not something you want for a frequent query.

allow filtering, data modeling in cql

I'm currently using and researching data modeling practices in Cassandra. So far, I get that you need to model your data based on the queries to be executed. However, multiple select requirements can make data modeling on a single table harder or even impossible. So, when you can't handle these requirements in one table, you need to write to 2-3 tables; in other words, you need to make multiple inserts for one logical operation.
Currently, I'm dealing with a data model of a campaign structure. I have a campaign table on cassandra with the following cql;
CREATE TABLE campaign_users
(
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid)
);

CREATE INDEX ON campaign_users (created_at);
In this model, I need to be able to make incremental exports given only a timestamp. In Cassandra, the ALLOW FILTERING mode enables select queries against secondary indexes. So, my CQL statement for incremental export is the following:
select campaign_id, uid
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;
However, if ALLOW FILTERING is used, there is a warning saying that the statement has unpredictable performance. So, is relying on ALLOW FILTERING good practice? What are the alternatives?
The ALLOW FILTERING warning is because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. You could be scanning through all your data to return zero rows, in the worst case. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the data read through scales linearly with the amount of data returned.
This is OK if you're returning most of the data, so the data skipped over doesn't cost very much. But if you're skipping over most of your data, a lot of work is wasted.
The alternative is to include time in the first component of your primary key, in buckets. E.g. you could have day buckets and duplicate your queries for each day that contains data you need, as sketched below. This method guarantees that most of the data Cassandra reads over is data that you want. The problem is that all data for a bucket (e.g. a day) needs to fit in one partition. You can fix this by sharding the partition somehow, e.g. by including some aspect of the uid within it.
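A day-bucketed sketch of the campaign_users example (table name and bucket granularity are illustrative):

CREATE TABLE campaign_users_by_day (
    day text,  -- e.g. '2013-08-14'; fold part of uid in here to shard hot days
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    PRIMARY KEY (day, created_at)
);

-- Incremental export for one day, with no index and no ALLOW FILTERING;
-- repeat the query for each day in the export window:
SELECT campaign_id, uid FROM campaign_users_by_day
WHERE day = '2013-08-14'
  AND created_at > minTimeuuid('2013-08-14 12:26:06+0000');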
