I am currently exploring Cassandra and have a special use case: designing a support view for an application.
My access patterns:
To fetch a specific transaction:
select * from purchase_by_user where user_id='Tom' and transaction_date='1/20/22'
select * from purchase_by_user where user_id='Jerry' and transaction_date <= '1/21/22' and transaction_date >= '1/16/22'
select * from purchase_by_user where user_id='Tom' and amount=100
select * from purchase_by_user where user_id='Jerry' and amount>=50
Create table purchase_by_user (
order_id uuid,
amount decimal,
transaction_ts timestamp,
user_id text,
Primary key((user_id), order_id)
)
Let's say Tom is making millions of orders. With the partition key above, the data will not be spread evenly across the cluster, and searches will also be expensive.
Can anyone suggest a better partition key here?
I'd go with a PRIMARY KEY definition like this:
PRIMARY KEY((user_id, transaction_year), transaction_date, order_id)
) WITH CLUSTERING ORDER BY (transaction_date DESC, order_id ASC)
This makes use of the "bucketing" concept that Manish mentioned. In this case, if Tom is creating an order every single day, there will only be 365 rows in each partition.
"Let's say Tom is making millions of orders"
In fact, even if Tom placed two orders per day, that would still only be 730. So while thinking about throughput extremes is a good exercise, a single user placing even one million orders is probably not realistic.
Also, some of the queries above are using transaction_date in a range query. I've added transaction_date as the first clustering key to support those queries. And if transaction_date is in DESCending order, the most-recent transactions will be at the "top" of the partition (they'll be read first), which is usually how most date/time-driven applications tend to function.
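As a minimal sketch of how an application might derive the partition key for this table: every read has to supply both user_id and transaction_year, and a date range that crosses a year boundary touches one partition per year. The helper names partition_key and years_for_range are hypothetical and not part of any driver.

```python
from datetime import date
from typing import List, Tuple


def partition_key(user_id: str, transaction_date: date) -> Tuple[str, int]:
    """Return the (user_id, transaction_year) pair that identifies one partition."""
    return (user_id, transaction_date.year)


def years_for_range(start: date, end: date) -> List[int]:
    """List every transaction_year bucket a date range touches, oldest first."""
    return list(range(start.year, end.year + 1))


# Jerry's range query from the question stays inside a single partition.
print(partition_key("Jerry", date(2022, 1, 16)))
print(years_for_range(date(2022, 1, 16), date(2022, 1, 21)))
```

A range that spans New Year would need one query per year returned by years_for_range, with the results combined in the application.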
You can use the concept of bucketing to reduce the number of rows in a single partition. For example, you can create a partition key like (user_id int, bucket_number int). Choose the maximum value of bucket_number based on your expected data size: if you expect this user to place millions of orders, you can let bucket values run up to 1000. The main idea is to make sure you don't end up creating partitions with a large number of rows.
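One way to pick the bucket_number, sketched under the assumption that each order has a UUID and that 1000 buckets are used as in the example above. The function names are made up for illustration. Deriving the bucket from the order id keeps writes deterministic, but reading everything for a user means visiting every bucket.

```python
import uuid
from typing import List, Tuple

NUM_BUCKETS = 1000  # assumed upper bound, taken from the example above


def bucket_for(order_id: uuid.UUID, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an order id to a stable bucket number in [0, num_buckets)."""
    return order_id.int % num_buckets


def partitions_for_user(user_id: int, num_buckets: int = NUM_BUCKETS) -> List[Tuple[int, int]]:
    """Every (user_id, bucket_number) partition that may hold this user's orders."""
    return [(user_id, bucket) for bucket in range(num_buckets)]


order_id = uuid.uuid4()
print((42, bucket_for(order_id)))    # partition that receives this order
print(len(partitions_for_user(42)))  # partitions to visit for a full read
```

With one million orders spread over 1000 buckets, each partition would hold roughly 1000 rows on average.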
Cassandra Data Modeling Query
Hello,
The data model I am working on is shown below, with different tables for the same data set to satisfy different kinds of queries. The data mainly stores event data for campaigns sent out on multiple channels such as email, web, mobile app, SMS, etc. Events can include page visits, email opens, link clicks, etc. for different subscribers.
Table 1:
(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)
PRIMARY KEY ((enterprise_id,campaign_id),domain_id, event_category, event_action, datetime, subscription_id))
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 1:
I have enterprise_id + campaign_id as the partition key. Each enterprise can have several campaigns. The datastore may have data for a few hundred campaigns. Each campaign can have up to 2-3 million records. Hence there may be 3000 partitions across 100 enterprises, with each partition holding 2-3 million records.
Cassandra Queries: Always query with the partition key + primary key columns, including the datetime field. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with similar values for the rest of the primary key columns. enterprise_id + campaign_id is always available as a filter in the queries.
Table 2:
(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id))
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 2: I have enterprise_id alone as the partition key. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra Queries: Always query with the partition key + primary key columns up to datetime. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with similar values for the rest of the primary key columns. In this case, data has to be queried across campaigns and we may not have campaign_id as a filter in the queries.
Table 3:
(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime))
CLUSTERING ORDER BY (subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC)
Keys and Data size for Table 3: I have enterprise_id as the partition key. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra Queries: Query always with partition key + primary key as subscription_id only. Should be able to query directly on enterprise_id + subscription_id.
My Queries:
Size of data on each partition: With Table 2 and Table 3 I may end up with more than 800-900 million rows per partition. From what I have read, it is not OK to have so many entries per partition. How can I achieve my use case in this scenario? Even if I create multiple partitions based on something like a week_number (1-52 in a year), the query will need to go across all partitions and end up using an IN clause with all week numbers, which is as good as scanning all the data.
Is it OK to have multiple tables with the same partition key and different primary keys, with only the clustering order changed? For example, in Table 2 and Table 3 the hash will be on enterprise_id and will lead to the same node. However, only the clustering key order has changed, which will allow me to query directly on the required key. Will the data be in different physical partitions for Table 2 and Table 3 in such a scenario? Or, if it maps to the same partition number, how will Cassandra internally distinguish between the two tables?
Is it OK to use ALLOW FILTERING if I specify the partition key? For example, I can avoid the need for creating Table 3 and use Table 2 to query on subscription_id directly if I use ALLOW FILTERING on Table 2. What would the impact be?
First of all, please only ask one question per question. Given the length and detail required for your answers, this post is unlikely to provide long term value for future users.
As per my reading it is not ok to have so many entries per partition. How can I achieve my use case in this scenario?
Unfortunately, if partitioning on a time component will not work, then you'll have to find some other column to partition the data by. I've seen rows-per-partition work ok in the range of 20k to 50k. Most of those use cases on the higher end had small partitions. It looks like your model has many columns, so I'd be curious as to the average partition size. Essentially, find a column to partition on which keeps your partition sizes in the 1MB to 10MB range.
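A rough back-of-the-envelope sketch of that sizing check. The 200-byte average row size is a made-up figure for illustration only; the real value depends on your columns and should be measured on your own data.

```python
BYTES_PER_MB = 1024 * 1024


def estimated_partition_mb(rows: int, avg_row_bytes: int) -> float:
    """Approximate partition size as row count times average row size."""
    return rows * avg_row_bytes / BYTES_PER_MB


def max_rows_for_target(target_mb: int, avg_row_bytes: int) -> int:
    """How many rows of the given average size fit under a target partition size."""
    return (target_mb * BYTES_PER_MB) // avg_row_bytes


AVG_ROW_BYTES = 200  # hypothetical average row size

# A single-enterprise partition from Table 2 or Table 3 (about 900 million rows).
print(round(estimated_partition_mb(900_000_000, AVG_ROW_BYTES)), "MB")
# Rows that would fit under a 10MB target at the same row size.
print(max_rows_for_target(10, AVG_ROW_BYTES), "rows")
```

Even with small rows, a partition holding hundreds of millions of entries lands several orders of magnitude above the target range, which is why an extra partitioning column is needed.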
Is it ok to have multiple tables with same partition key and different primary keys with Clustering order change?
Yes, this is perfectly fine.
Will the data be in different physical partitions for Table2 and Table3 in such a scenario? Or if it maps to same partition number how will cassandra internally distinguish between the two tables?
The partition key is hashed into a token in the range -2^63 to 2^63-1. That token is then compared to the token ranges mapped to all nodes, and the query is sent to the node responsible for that range. So all the partition key does is determine which node is responsible for the data.
The tables have their data files written to different directories, based on table name. So Cassandra distinguishes between the tables by the table name provided in the query. Nothing you need to worry about.
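To illustrate the idea, here is a toy sketch of token-based routing. It is not Cassandra's implementation: the hash below is a SHA-256 stand-in rather than the real partitioner, and the three-node ring with its tokens is invented. It only shows that the same partition key value yields the same token, and therefore the same node, regardless of which table is queried.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(partition_key: str) -> int:
    """Stand-in hash: map a key to a signed 64-bit token (not the real partitioner)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Hypothetical ring: each node owns the range ending at its token (inclusive).
RING: List[Tuple[int, str]] = [
    (-6_000_000_000_000_000_000, "node-a"),
    (0, "node-b"),
    (6_000_000_000_000_000_000, "node-c"),
]


def owner(tok: int, ring: List[Tuple[int, str]] = RING) -> str:
    """Find the first node whose token is >= tok, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, tok)
    return ring[index % len(ring)][1]


enterprise_id = "42"
# Table 2 and Table 3 share the partition key, so both route to the same node.
print(owner(token(enterprise_id)))
```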
Is it ok to use ALLOW FILTERING if I specify the partition key.
I would still recommend against it if you're concerned about performance. But the good thing about using the ALLOW FILTERING directive while specifying a full partition key is that it will indeed prevent Cassandra from reading multiple nodes to build the result set. So that should be ok. The only drawback here is that Cassandra stores/reads data from disk by the defined CLUSTERING ORDER, and using ALLOW FILTERING obviously complicates that process (forcing random reads vs. sequential reads).
I need to be able to return all users that performed an action during a specified interval. The table definition in Cassandra is just below:
create table t ("from" timestamp, "to" timestamp, user text, PRIMARY KEY (("from", "to"), user))
I'm trying to implement the following query in Cassandra:
select * from t WHERE "from" > :startInterval and "to" < :toInterval
However, this query will obviously not work because it represents a range query on the partition key, forcing Cassandra to search all nodes in the cluster, defeating its purpose as an efficient database.
Is there an efficient way to model this query in Cassandra?
My solution would be to split both timestamps into their corresponding years and months and use those as the partition key. The table would look like this:
create table t_updated (yearFrom int, monthFrom int, yearTo int, monthTo int, "from" timestamp, "to" timestamp, user text, PRIMARY KEY ((yearFrom, monthFrom, yearTo, monthTo), user))
If I wanted the users that performed the action between Jan 2017 and July 2017, the query would look like the following:
select user from t_updated where yearFrom IN (2017) and monthFrom IN (1,2,3,4,5,6,7) and yearTo IN (2017) and monthTo IN (1,2,3,4,5,6,7)
Would there be a better way to model this query in Cassandra? How would you approach this issue?
First, the partition key has to be restricted with the equals operator. It is better to use PRIMARY KEY (BUCKET, TIME_STAMP) here, where the bucket can be a combination of year and month (or include days, hours, etc., depending on how big your data set is).
It is better to execute multiple queries and combine the result in client side.
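A minimal sketch of that approach, assuming a monthly bucket encoded as an integer like 201701 and a hypothetical table t_by_bucket with columns bucket, ts and user. None of these names come from the question. The code only plans the per-bucket queries and merges results; executing them is left to whatever driver you use.

```python
from datetime import date
from typing import Iterable, List, Set, Tuple

# Hypothetical table and column names, for illustration only.
QUERY = "SELECT user FROM t_by_bucket WHERE bucket = ? AND ts >= ? AND ts < ?"


def month_buckets(start: date, end: date) -> List[int]:
    """Return every yyyymm bucket from start's month through end's month."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def plan_queries(start: date, end: date) -> List[Tuple[str, Tuple[int, date, date]]]:
    """One equality-on-partition-key query per bucket in the interval."""
    return [(QUERY, (bucket, start, end)) for bucket in month_buckets(start, end)]


def merge_users(result_sets: Iterable[Iterable[str]]) -> Set[str]:
    """Combine per-bucket results on the client, dropping duplicates."""
    merged: Set[str] = set()
    for users in result_sets:
        merged.update(users)
    return merged


for statement, params in plan_queries(date(2017, 1, 1), date(2017, 7, 31)):
    print(statement, params)
```

Each planned query hits exactly one partition, so the queries can be issued independently and their results merged with merge_users.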
The answer depends on the expected number of entries. A rule of thumb is that a partition should not exceed 100 MB. So if you expect a moderate number of entries, it would be enough to go with the year as the partition key.
We use Week-First-Date as a partition key in an IoT scenario, where values get written at most once a minute.
I have just begun studying Cassandra.
I started with this table and query.
CREATE TABLE finance.tickdata(
id_symbol int,
ts timestamp,
bid double,
ask double,
PRIMARY KEY(id_symbol,ts)
);
And the query is successful:
select ts,ask,bid
from finance.tickdata
where id_symbol=3
order by ts desc;
Next, it was decided to move id_symbol into the table name. The new table script:
CREATE TABLE IF NOT EXISTS mts_src.ticks_3(
ts timestamp PRIMARY KEY,
bid double,
ask double
);
And now the query fails:
select * from mts_src.ticks_3 order by ts desc
I read in the docs that I need to filter (WHERE) by the primary key (partition key),
but technically both of my examples are the same. Why is Cassandra so restrictive in this respect?
And one more question: is it a good idea in general to move id_symbol into the table name?
There can potentially be 1000 unique id_symbol values and a lot of data for each. Separating this data into individual tables looks like a good idea, but I lose the ability to ORDER BY, which I need in order to fetch fresh data for each id_symbol.
Thanks.
You can't sort on the partition key; you can sort only on clustering columns inside a single partition. So you need to model your data accordingly. But you need to be very careful not to create very large partitions (when using ticker_id as the partition key, for example). In this case you may need to create a composite partition key, like ticker_id + year, or month, depending on how often you're inserting the data.
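With a composite key such as (id_symbol, month), "give me the freshest ticks" becomes a walk over buckets from newest to oldest. The sketch below assumes a yyyymm integer bucket and a caller-supplied fetch function that returns one bucket's rows newest first. Both are hypothetical, and the in-memory store merely stands in for the table.

```python
from typing import Callable, Dict, List, Tuple

Tick = Tuple[str, float, float]  # (ts, bid, ask); ts kept as a string for brevity


def previous_month(yyyymm: int) -> int:
    """Step a yyyymm bucket back by one month."""
    year, month = divmod(yyyymm, 100)
    return (year - 1) * 100 + 12 if month == 1 else yyyymm - 1


def latest_ticks(fetch: Callable[[int, int, int], List[Tick]],
                 id_symbol: int, newest_month: int, limit: int,
                 max_buckets: int = 12) -> List[Tick]:
    """Collect up to `limit` ticks, reading buckets newest first and stopping early."""
    ticks: List[Tick] = []
    bucket = newest_month
    for _ in range(max_buckets):
        ticks.extend(fetch(id_symbol, bucket, limit - len(ticks)))
        if len(ticks) >= limit:
            break
        bucket = previous_month(bucket)
    return ticks


# In-memory stand-in: each (id_symbol, bucket) partition is sorted by ts descending.
STORE: Dict[Tuple[int, int], List[Tick]] = {
    (3, 202202): [("2022-02-02", 1.0, 1.1)],
    (3, 202201): [("2022-01-31", 0.9, 1.0), ("2022-01-30", 0.8, 0.9)],
}


def fake_fetch(id_symbol: int, bucket: int, limit: int) -> List[Tick]:
    return STORE.get((id_symbol, bucket), [])[:limit]


print(latest_ticks(fake_fetch, 3, 202202, 2))
```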
Regarding a table per ticker, that's not a very good idea: every table has overhead, so it will lead to increased resource consumption. 200 tables is already a high number, and 500 is almost a "hard limit".
Creating the following employee column family in Cassandra
Case 1:
CREATE TABLE employee (
name text,
designation text,
gender text,
created_by text,
created_date timestamp,
modified_by text,
modified_date timestamp,
PRIMARY KEY (name)
);
From the UI, if I wanted to get all employees, it is not possible to retrieve them. Is that true?
select * from employee; //not possible as it is partitioned by name
Case 2:
I was told to do this way to retrieve all employees.
We need to design this with a static key, to retrieve all the employees.
CREATE TABLE employee (
static_name text,
name text,
designation text,
gender text,
created_by text,
created_date timestamp,
modified_by text,
modified_date timestamp,
PRIMARY KEY (static_name,name)
);
static_name, i.e. "EMPLOYEE", will be the partition key and name will be the clustering key. The primary key is the combination of static_name and name.
static_name -> every time you add an employee, insert it with the static value, i.e. EMPLOYEE
Now you will be able to run the "select all employees" query:
//this will return you all the employees
select * from employee where static_name='EMPLOYEE';
Is this true? Can't we use case 1 to return all the employees?
Both approaches are o.k. with some catches
Approach 1:
When you say UI, I guess you mean using a simple select * ... It's correct that this won't really work out of the box if you want to get every single one of them out, especially if the data set is big. You could use pagination in the driver (I'm not 100% sure, since I haven't had a case for it in a while), but when I needed to jump over all the partitions I would use the token function, i.e.:
select token(name), name from employee limit 1;
system.token(name) | name
----------------------+------
-8839064797231613815 | a
Now you take the token from the result and put it into the next query. This has to be done by your program. Each query then fetches the elements whose token is greater than the last one seen; you would also need to cover any tokens lower than -8839064797231613815.
select token(name), name from employee where token(name) > -8839064797231613815 limit 1;
system.token(name) | name
----------------------+------
-8198557465434950441 | c
and then I would wrap this into a loop until I had fetched all the elements. (I think this is also how the Spark Cassandra connector does it when retrieving wide rows from a cluster.)
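The loop described above could look roughly like this. It is a sketch against an in-memory stand-in rather than a real cluster: fake_fetch_page plays the role of the "where token(name) > ? limit ?" query, and the token for "b" is invented to make the example longer.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (token(name), name)


def scan_all(fetch_page: Callable[[Optional[int], int], List[Row]],
             page_size: int = 100) -> Iterator[str]:
    """Yield every name by repeatedly asking for rows past the last token seen."""
    last_token: Optional[int] = None
    while True:
        page = fetch_page(last_token, page_size)
        if not page:
            return
        for _token, name in page:
            yield name
        last_token = page[-1][0]


# Stand-in for the employee table, already in token order.
FAKE_TABLE: List[Row] = [
    (-8839064797231613815, "a"),
    (-8198557465434950441, "c"),
    (1234567890123456789, "b"),  # made-up token for illustration
]


def fake_fetch_page(after_token: Optional[int], limit: int) -> List[Row]:
    """Mimic: SELECT token(name), name FROM employee WHERE token(name) > ? LIMIT ?"""
    rows = [row for row in FAKE_TABLE if after_token is None or row[0] > after_token]
    return rows[:limit]


print(list(scan_all(fake_fetch_page, page_size=1)))
```

Note that the names come back in token order, not alphabetical order.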
The disadvantage of this model is that it performs really badly, because it has to go all over the cluster, and it is more or less meant for analytical workloads. Since you mentioned UI, it would take the user too long to get the result, so I would advise against using approach 1 in UI-related stuff.
Approach 2.
The disadvantage of the second one is that it creates what is called a hot row: every update goes to a single partition, and this is a bad model most of the time.
The advantage is that you could easily paginate over the one partition and get your data out with the pagination functions built into the driver.
This would, however, behave just fine if you have a moderate load (tens or low hundreds of updates per second) and a relatively low number of users; for, let's say, 100 000 it would work just fine. If your numbers are greater, you have to somehow split the users into multiple partitions so that the "load" gets distributed more evenly.
One possibility is to append a letter of the alphabet to "EMPLOYEE", so you would have "EMPLOYEE_A", "EMPLOYEE_B", etc. This would work relatively well, but it is not ideal because of the lexicographical distribution: some partitions may get relatively larger amounts of data.
Another approach would be to create some artificial buckets. Let's say by design there are 10 buckets, and when you insert, you add a random bucket number to the static prefix ("EMPLOYEE_1" and so on). When retrieving, you go over each partition in turn until you exhaust the results.
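A small sketch of the random-bucket idea, assuming 10 buckets and the "EMPLOYEE_" prefix from the example. The function names are hypothetical. Writes pick one bucket at random; a "select all employees" read has to visit every bucket.

```python
import random
from typing import List

NUM_BUCKETS = 10     # assumed bucket count from the example above
PREFIX = "EMPLOYEE"  # the static partition key value


def write_partition(rng: random.Random) -> str:
    """Pick the static_name value for a new employee row at random."""
    return f"{PREFIX}_{rng.randrange(NUM_BUCKETS)}"


def read_partitions() -> List[str]:
    """Every static_name value a 'select all employees' read must visit."""
    return [f"{PREFIX}_{bucket}" for bucket in range(NUM_BUCKETS)]


rng = random.Random()
print(write_partition(rng))
print(read_partitions())
```

A trade-off of the random choice is that looking up a single employee by name no longer tells you which bucket to read; deriving the bucket from a hash of the name instead would keep single-row lookups possible.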
Why might one want to use a clustered index in a cassandra table?
For example; in a table like this:
CREATE TABLE blah (
key text,
a text,
b timestamp,
c double,
PRIMARY KEY ((key), a, b, c)
)
The clustered part is the a, b, c part of the PRIMARY KEY.
What are the benefits? What considerations are there?
Clustering keys do three main things.
1) They affect the available query pattern of your table.
2) They determine the on-disk sort order of your table.
3) They determine the uniqueness of your primary key.
Let's say that I run an ordering system and want to store product data on my website. Additionally I have several distribution centers, as well as customer contracted pricing. So when a certain customer is on my site, they can only access products that are:
Available in a distribution center (DC) in their geographic area.
Defined in their contract (so they may not necessarily have access to all products in a DC).
To keep track of those products, I'll create a table that looks like this:
CREATE TABLE customerDCProducts (
customerid text,
dcid text,
productid text,
productname text,
productPrice int,
PRIMARY KEY (customerid, dcid, productid));
For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138' AND productid='123';
Maybe I want to see products available in DC 1138 for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138';
And maybe I just want to see all products in all DCs for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354';
As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.
The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid.
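The rule behind those examples can be sketched as a small checker. This is a deliberate simplification that only considers equality restrictions and ignores ALLOW FILTERING, secondary indexes and range predicates; the function name is made up. A query is accepted only if it restricts the whole partition key and then a leading prefix of the clustering keys.

```python
from typing import Iterable, Sequence


def is_supported(partition_keys: Sequence[str],
                 clustering_keys: Sequence[str],
                 equality_columns: Iterable[str]) -> bool:
    """True if the restricted columns are: full partition key + clustering prefix."""
    restricted = set(equality_columns)
    if not set(partition_keys) <= restricted:
        return False  # the partition key must be fully specified
    remaining = restricted - set(partition_keys)
    prefix_length = 0
    for column in clustering_keys:
        if column not in remaining:
            break
        prefix_length += 1
    # Anything restricted beyond the prefix (or outside the key) is rejected.
    return prefix_length == len(remaining)


PK, CK = ["customerid"], ["dcid", "productid"]
print(is_supported(PK, CK, ["customerid", "dcid", "productid"]))  # True
print(is_supported(PK, CK, ["customerid", "dcid"]))               # True
print(is_supported(PK, CK, ["dcid"]))                             # False
print(is_supported(PK, CK, ["customerid", "productid"]))          # False
```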
What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.
CREATE TABLE latestTemperatures (
weatherstationid text,
eventtime timestamp,
temperature text,
PRIMARY KEY (weatherstationid,eventtime)
) WITH CLUSTERING ORDER BY (eventtime DESC);
By clustering on eventtime, and specifying a DESCending ORDER BY, I can query the recorded temperatures for a particular station like this:
SELECT * FROM latestTemperatures
WHERE weatherstationid='1234ABCD';
When those values are returned, they will be in DESCending order by eventtime.
Of course, the one question that everyone (with a RDBMS background...so yes, everyone) wants to know, is how to query all results ordered by eventtime? And again, you cannot. Of course, you can query for all rows by omitting the WHERE clause, but that won't return your data sorted in any meaningful order. It's important to remember that Cassandra can only enforce clustering order within a partition key. If you don't specify one, your data will not be ordered (at least, not in the way that you want it to be).
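If a globally ordered view across a handful of stations is really needed, one client-side workaround (not something Cassandra does for you, and only sensible for a small number of partitions) is to query each station separately and merge the already-sorted results. The sketch below uses in-memory lists in place of query results; the row layout and function name are assumptions.

```python
import heapq
from datetime import datetime
from typing import Iterable, List, Tuple

Reading = Tuple[datetime, str, str]  # (eventtime, weatherstationid, temperature)


def merge_newest_first(per_station: Iterable[List[Reading]]) -> List[Reading]:
    """Merge per-partition results that are each already sorted by eventtime DESC."""
    return list(heapq.merge(*per_station, key=lambda row: row[0], reverse=True))


station_a = [
    (datetime(2022, 1, 3), "1234ABCD", "20"),
    (datetime(2022, 1, 1), "1234ABCD", "18"),
]
station_b = [
    (datetime(2022, 1, 2), "5678EFGH", "25"),
]

for reading in merge_newest_first([station_a, station_b]):
    print(reading)
```

The merge relies on each input already being in DESCending eventtime order, which is exactly what the clustering order guarantees within a partition.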
Let me know if you have any additional questions, and I'll be happy to explain.