How to decide for clustering columns in Cassandra primary key? - cassandra

I am following the book 'Definitive Cassandra', which uses a hotel application as an example. There is a table, available_rooms_by_hotel_date, that supports the use case where a user wants to know about room availability in a given hotel from a given date. The data model is defined as:
hotel_id
date
room_number
is_available
hotel_id is partition key, while date and room_number are clustering columns.
Looking at the table, one can say that it supports use case when user wants to know availability of room from a given date for a given hotel.
I understand that the order of clustering columns is also critical in Cassandra. If I swap the order of date and room_number, what is the impact? Functionally, I think the use case is still supported. But does it affect performance or other aspects such as storage or node allocation?

In Cassandra you can query a table by:
the full primary key (hotel_id, date, room_number) - in this case you fetch only one row
the partition key (hotel_id) - in that case you get all rows inside given partition - it's a minimal requirement for SELECT query;
a partial primary key - the partition key plus some of the clustering columns, from left to right (every clustering column that precedes the one you restrict must also be specified). In the given example, you can specify only hotel_id and date, and the query returns all rows for the given date (or dates, if you use date IN (...)). Another useful feature of clustering columns is that you can run a range query on one (but only on the last clustering column you specify!). For example, to find all the rooms in a given date range, I can do ... WHERE hotel_id = ... AND date >= '2020-04-05' AND date <= '2020-04-10'.
If you swap the order of room_number and date, you can only ask for the availability of specific room(s), either on a given date or overall. You can no longer ask for all rooms on specific dates, because every preceding clustering column must be specified, and that query supplies only hotel_id and date, not room_number.
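The effect of the column order can be sketched in Python, assuming a partition behaves like a list of rows kept sorted by the clustering columns. The room data and function names are made up for illustration; this models the idea and is not how Cassandra is implemented internally.

```python
from bisect import bisect_left, bisect_right

# One partition (one hotel_id), modelled as rows sorted by the clustering columns.
# Hypothetical data: 3 dates x 2 rooms.
dates = ["2020-04-05", "2020-04-06", "2020-04-07"]
rooms = [101, 102]

# PRIMARY KEY (hotel_id, date, room_number)
by_date_room = sorted((d, r) for d in dates for r in rooms)
# PRIMARY KEY (hotel_id, room_number, date)
by_room_date = sorted((r, d) for d in dates for r in rooms)


def date_range_slice(rows, start, end):
    """Rows with start <= date <= end, read as ONE contiguous slice of (date, room) rows."""
    lo = bisect_left(rows, (start,))
    hi = bisect_right(rows, (end, float("inf")))
    return rows[lo:hi]


def positions_for_dates(rows, start, end):
    """Indexes of matching rows in the (room, date) layout; they are scattered."""
    return [i for i, (_, d) in enumerate(rows) if start <= d <= end]


print(date_range_slice(by_date_room, "2020-04-05", "2020-04-06"))
print(positions_for_dates(by_room_date, "2020-04-05", "2020-04-06"))
```

With date first, the date range is a single contiguous run of rows. With room_number first, the same dates sit at separate positions under each room, which is why the date restriction alone is not accepted.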

Related

Why does querying based on the first clustering key require an ALLOW FILTERING?

Say I have this Cassandra table:
CREATE TABLE orders (
customerId int,
datetime date,
amount int,
PRIMARY KEY (customerId, datetime)
);
Then why would the following query require an ALLOW FILTERING:
SELECT * FROM orders WHERE datetime >= '2020-01-01'
Cassandra could just go to all the individual partitions (i.e. customers) and filter on the clustering key datetime. Since datetime is sorted, there is no need to retrieve all the rows in orders and filter out the ones that match my where clause (as far as I understand it).
I hope someone can enlighten me.
Thanks
This happens because, in normal operation, Cassandra needs the partition key: it is used to find which machine(s) store the data. If you don't supply the partition key, as in your example, Cassandra needs to scan all the data to find the rows that match your query, and this requires ALLOW FILTERING.
P.S. Data is sorted only inside the individual partitions, not globally.
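A toy Python model of the point above, assuming each partition keeps only its own rows sorted. The customer ids and dates are invented for illustration.

```python
# Toy model: each partition (customer) keeps its own rows sorted by the clustering column.
# Customer ids and dates are made up for illustration.
partitions = {
    1: ["2019-12-30", "2020-01-15"],
    2: ["2019-11-02"],
    3: ["2020-02-01", "2020-03-10"],
}


def query_with_partition_key(customer_id, since):
    """Touches exactly one partition."""
    rows = [d for d in partitions[customer_id] if d >= since]
    return rows, 1  # (matching rows, partitions visited)


def query_without_partition_key(since):
    """Has to visit every partition, even those with no matching rows."""
    visited = 0
    rows = []
    for customer_id, dates in partitions.items():
        visited += 1
        rows.extend((customer_id, d) for d in dates if d >= since)
    return rows, visited


print(query_with_partition_key(1, "2020-01-01"))
print(query_without_partition_key("2020-01-01"))
```

The sort order inside each partition helps once a partition has been located, but without the partition key every partition still has to be visited.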

How to get Last 6 Month data comparing with timestamp column using cassandra query?

I need to get all account statements from the last 3/6 months by comparing updatedTime (a timestamp column) with the current time.
For example, in SQL we use the DateAdd() function for this. I don't know how to do this in Cassandra.
If anyone knows, please reply. Thanks in advance.
Cassandra 2.2 and later allows users to define functions (UDFs) that can be applied to data stored in a table as part of a query result.
If you use Cassandra 2.2 or later, you can create your own function as a UDF:
CREATE FUNCTION monthadd(date timestamp, month int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$java.util.Calendar c = java.util.Calendar.getInstance();c.setTime(date);c.add(java.util.Calendar.MONTH, month);return c.getTime();$$
This function takes two parameters:
date timestamp: the date to which you want to add, or from which you want to subtract, a number of months
month int: the number of months to add (+) or subtract (-)
It returns the resulting timestamp.
Here is how you can use it:
SELECT * FROM ttest WHERE id = 1 AND updated_time >= monthAdd(dateof(now()), -6) ;
Here monthAdd subtracts 6 months from the current timestamp, so this query returns the data from the last 6 months.
Note: user-defined functions are disabled by default. To enable them, set enable_user_defined_functions to true in cassandra.yaml, if you are aware of the security risks.
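If UDFs are not enabled, one alternative (not part of the answer above) is to compute the cutoff in the application and bind it as a query parameter. A minimal Python sketch, assuming the same month arithmetic as the UDF, where the day is clamped to the length of the target month; the name month_add is chosen here only to mirror the UDF.

```python
import calendar
from datetime import datetime


def month_add(date: datetime, months: int) -> datetime:
    """Shift `date` by `months` (negative to go back), clamping the day to the target month's length."""
    index = date.year * 12 + (date.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(date.day, calendar.monthrange(year, month)[1])
    return date.replace(year=year, month=month, day=day)


# Six months before 31 August 2017; February has no 31st, so the day is clamped.
cutoff = month_add(datetime(2017, 8, 31, 12, 0), -6)
print(cutoff)  # 2017-02-28 12:00:00
```

The resulting value would then be passed as the bound for updated_time >= ? instead of calling a function inside the query.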
In Cassandra you have to design the queries up front.
Also be aware that you will probably have to bucket the data, depending on the number of accounts that you have within some period of time.
If your whole database doesn't contain more than, let's say, 100k entries, you are fine with defining a single generic partition, say with the name 'all'. But usually people have a lot of data, which goes into buckets named after a month, week, or hour. This depends on the number of inserts you get.
The reason for creating buckets is that every node can find a partition by its partition key. This is the first part of the primary key definition. Then, on every node, the data is sorted by the second part that you pass in to the primary key. Having the data sorted enables you to "scan" over it, i.e. you will be able to retrieve rows by giving a timestamp parameter.
Let's say you want to retrieve accounts from the last 6 months and that you are saving all the accounts from one month in the same bucket.
The schema might be something along the lines of:
CREATE TABLE accounts (
month text,
created_time timestamp,
account text,
PRIMARY KEY (month, created_time)
);
Usually you will do this at the application level. Merging queries is an anti-pattern, but it is OK for a small number of queries:
select account
from accounts
where month = '201701';
Then repeat the query for the following buckets ('201702', '201703', and so on) and merge the results in the application.
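The application-level step can be sketched in Python as follows. This is a minimal sketch that assumes buckets are named 'YYYYMM' as in the schema above; run_query is a hypothetical callable standing in for whatever driver call executes the per-bucket SELECT.

```python
from datetime import date


def month_buckets(today: date, months: int) -> list[str]:
    """Return the last `months` bucket names ('YYYYMM'), newest first, including the current month."""
    index = today.year * 12 + (today.month - 1)
    buckets = []
    for offset in range(months):
        year, month0 = divmod(index - offset, 12)
        buckets.append(f"{year:04d}{month0 + 1:02d}")
    return buckets


def merge_results(run_query, buckets):
    """Run one query per bucket and concatenate the rows.

    `run_query` is a hypothetical callable: bucket name -> list of rows.
    """
    rows = []
    for bucket in buckets:
        rows.extend(run_query(bucket))
    return rows


print(month_buckets(date(2017, 3, 15), 6))
```

Six buckets mean six small single-partition queries, which is the "smaller amount of queries" case mentioned above.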
If you have something really simple, with, say, an expected 100,000 entries, you could adapt the above schema to a single bucket and do something like:
CREATE TABLE accounts (
bucket text,
created_time timestamp,
account text,
PRIMARY KEY (bucket, created_time)
);
SELECT account
FROM accounts
WHERE bucket = 'some_predefined_name'
AND created_time > '2016-10-04 00:00:00';
Once more, as a wrap-up: with Cassandra you always have to prepare the structures for the access pattern you are going to use.

Why use a compound clustered key in Cassandra tables?

Why might one want to use a clustered index in a cassandra table?
For example; in a table like this:
CREATE TABLE blah (
key text,
a text,
b timestamp,
c double,
PRIMARY KEY ((key), a, b, c)
)
The clustered part is the a, b, c part of the PRIMARY KEY.
What are the benefits? What considerations are there?
Clustering keys do three main things.
1) They affect the available query pattern of your table.
2) They determine the on-disk sort order of your table.
3) They determine the uniqueness of your primary key.
Let's say that I run an ordering system and want to store product data on my website. Additionally I have several distribution centers, as well as customer contracted pricing. So when a certain customer is on my site, they can only access products that are:
Available in a distribution center (DC) in their geographic area.
Defined in their contract (so they may not necessarily have access to all products in a DC).
To keep track of those products, I'll create a table that looks like this:
CREATE TABLE customerDCProducts (
customerid text,
dcid text,
productid text,
productname text,
productPrice int,
PRIMARY KEY (customerid, dcid, productid));
For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138' AND productid='123';
Maybe I want to see products available in DC 1138 for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354' AND dcid='1138';
And maybe I just want to see all products in all DCs for customer B-26354:
SELECT * FROM customerDCProducts
WHERE customerid='B-26354';
As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.
The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid.
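The rule behind these examples can be sketched as a small Python check. It is deliberately simplified: it only covers equality restrictions and ignores range queries, IN, secondary indexes and ALLOW FILTERING. The function name is made up for illustration.

```python
def can_query(partition_key, clustering_keys, equality_columns):
    """Return True if equality restrictions on `equality_columns` form a valid lookup.

    Simplified rule: every partition key column must be restricted, and the
    restricted clustering columns must be a left-to-right prefix of the declared order.
    """
    restricted = set(equality_columns)
    if not set(partition_key) <= restricted:
        return False
    remaining = restricted - set(partition_key)
    prefix_len = 0
    for column in clustering_keys:
        if column in remaining:
            prefix_len += 1
        else:
            break
    return prefix_len == len(remaining)


pk, ck = ["customerid"], ["dcid", "productid"]
print(can_query(pk, ck, ["customerid", "dcid", "productid"]))  # True
print(can_query(pk, ck, ["customerid", "dcid"]))               # True
print(can_query(pk, ck, ["customerid"]))                       # True
print(can_query(pk, ck, ["dcid"]))                             # False
print(can_query(pk, ck, ["customerid", "productid"]))          # False
```

The last two calls correspond to the drawbacks described above: a DC without a customer, and a product without its DC.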
What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.
CREATE TABLE latestTemperatures (
weatherstationid text,
eventtime timestamp,
temperature text,
PRIMARY KEY (weatherstationid,eventtime)
) WITH CLUSTERING ORDER BY (eventtime DESC);
By clustering on eventtime, and specifying a DESCending ORDER BY, I can query the recorded temperatures for a particular station like this:
SELECT * FROM latestTemperatures
WHERE weatherstationid='1234ABCD';
When those values are returned, they will be in DESCending order by eventtime.
Of course, the one question that everyone (with a RDBMS background...so yes, everyone) wants to know, is how to query all results ordered by eventtime? And again, you cannot. Of course, you can query for all rows by omitting the WHERE clause, but that won't return your data sorted in any meaningful order. It's important to remember that Cassandra can only enforce clustering order within a partition key. If you don't specify one, your data will not be ordered (at least, not in the way that you want it to be).
Let me know if you have any additional questions, and I'll be happy to explain.

Using Cassandra for time series data

I'm on my research for storing logs to Cassandra.
The schema for logs would be something like this.
EDIT: I've changed the schema in order to make some clarification.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time) -- option #1
-- PRIMARY KEY ((userid), time, reason, item, price, count) -- option #2
);
A new table will be created every day, so each table contains logs for only one day.
My querying condition is as follows.
Query all logs from a specific user on a specific day(date not time).
So the reason, item, price, count will not be used as hints or conditions for queries at all.
My Question is which PRIMARY KEY design suits better.
EDIT: And the key here is that I want to store the logs in a schematic way.
If I choose #1 so many columns would be created per log. And the possibility of having more values per log is very high. The schema above is just an example. The log can contain values like subreason, friendid and so on.
If I choose #2 one (very) composite column will be created per log, and so far I couldn't find any valuable information about the overhead of the composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series, and creating a table per day doesn't seem optimal either.
Instead, I'd recommend creating a single table, partitioning by userid and day, and using a timeuuid as the clustering column for the event. An example would look like:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
)
This will allow you to have all of a user's events for a day in a single row (partition) and to run your query per day per user.
Declaring time as a clustering column gives you a wide row into which you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.:
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value')
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2')
The two inserts above will be in the same row and therefore you will be able to read in a single query.
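How events map to the composite (userid, date) key can be sketched in Python. This assumes the date text is the UTC day of the event, which is a choice made here for illustration; the events themselves are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(userid: int, event_time: datetime) -> tuple[int, str]:
    """Build the (userid, date) partition key; the date is the UTC day as 'YYYY-MM-DD' text."""
    return userid, event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")


# Hypothetical events: (userid, event time, value)
events = [
    (1000, datetime(2015, 5, 6, 9, 30, tzinfo=timezone.utc), "my value"),
    (1000, datetime(2015, 5, 6, 17, 45, tzinfo=timezone.utc), "my value2"),
    (1000, datetime(2015, 5, 7, 0, 5, tzinfo=timezone.utc), "my value3"),
]

# Group events the way the table would: one entry per (userid, date) key.
partitions = defaultdict(list)
for userid, when, value in events:
    partitions[partition_key(userid, when)].append((when, value))

for key, rows in partitions.items():
    print(key, len(rows))
```

The first two events share a key and land together, while the third starts a new entry for the next day.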
Also if you want more information about time series I highly recommend you to check Getting Started with Time Series Data Modeling
Hope it helps,
José Luis

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important, and it must match the kind of queries that you are trying to perform. The reason Cassandra tries to keep you from querying on the date column is that any query on that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately, Cassandra isn't able to determine automatically the best way to partition data. The end user must design a schema that fits the C* data model and allows the queries they want at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT date FROM active_accounts WHERE customer_name = ? AND account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job, since it requires hitting multiple nodes and filtering over all the data in the table.
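A rough Python sketch of hash-based placement, to show why both key parts are needed. The three-node list and the MD5-modulo scheme are stand-ins chosen for illustration; Cassandra's default partitioner uses Murmur3 tokens and token ranges, not a modulo.

```python
import hashlib

NODES = ["node 1", "node 2", "node 3"]  # hypothetical cluster


def node_for(customer_name: str, account_name: str) -> str:
    """Pick a node from a hash of the full composite partition key.

    MD5-modulo is a stand-in for illustration; Cassandra's default partitioner
    uses Murmur3 tokens and token ranges rather than a modulo.
    """
    key = f"{customer_name}:{account_name}".encode("utf-8")
    token = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return NODES[token % len(NODES)]


print(node_for("Bob", "1"))
print(node_for("Tom", "32"))
```

The hash is computed over the whole key, so knowing only customer_name gives nothing to hash to a single node, and the lookup degrades to checking all of them.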
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
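The application-side code mentioned above might look like this Python sketch. It assumes the requested range is half-open, [start, end), and that each returned triple feeds one query of the form start_month = ? AND date >= ? AND date < ?; the function name is made up for illustration.

```python
from datetime import datetime


def month_slices(start: datetime, end: datetime):
    """Split the half-open range [start, end) into one (start_month, lower, upper) triple per month."""
    slices = []
    month_start = start.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    while month_start < end:
        if month_start.month == 12:
            next_month = month_start.replace(year=month_start.year + 1, month=1)
        else:
            next_month = month_start.replace(month=month_start.month + 1)
        lower = max(start, month_start)   # clip to the requested start
        upper = min(end, next_month)      # clip to the requested end
        slices.append((month_start, lower, upper))
        month_start = next_month
    return slices


for start_month, lower, upper in month_slices(datetime(1999, 5, 20), datetime(1999, 7, 3)):
    print(start_month.date(), lower.date(), upper.date())
```

Each triple targets exactly one partition, so every query stays a cheap range slice, and the application concatenates the results.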
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for use by those who run analytics jobs or by C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily require contacting every node in the cluster. This becomes less and less performant as the cluster grows and is definitely not something you want for a frequent query.

Resources