Cassandra CLUSTERING ORDER with updates [performance] - cassandra

With Cassandra it is possible to specify the cluster ordering on a table with a particular column.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
Note: In this example, there is one message per user_id (intended)
Given this table my understanding is that the query's performance will be better in cases where recent data is queried.
However, if one where to make updates to the "modified" column does it add extra overhead on the server to "re-order" and is that overhead vs query performance significant?
In other words given this table would it perform better if the "CLUSTERING ORDER BY (modified DESC)" was dropped?
UPDATE: Updated the invalid CQL by adding modified to primary key, however, the original questions still stand.

In order to make modified a clustering column, it needs to be defined in the primary key.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
This way, your data will be sorted primarily by the hashed value of the user_id, and within each user_id by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)"

Moving the comment as an answer, as reply of the updated question:
if one where to make updates to the "modified" column does it add
extra overhead on the server to "re-order" and is that overhead vs
query performance significant?
If modified is defined as part of the clustering key, you won't be able to update that record, but you will be able to add as many records as needed, each time with a different modified date.
Cassandra is an append-only database engine: this means that any update to the records will add a new record with a different timestamp, a select will consider the records with the latest timestamp. This means that there is no "re-order" operation.
Dropping or creating the clustering order should be defined in base of the query of how the information will be retrieved, if you are going to use only the latest records of that user_id, it makes sense to have the clustering order as you defined it.

in your data model user_id is a rowkey/shardkey/partition key (userid) that is important for data locality and the clustering column (modified) specifies the order that the data is arranged inside the partition. combination of these two keys makes the primary key.
Even in RDBS world, updating PK is avoidble for sake of data integrity.
however in cassandra there is no constraints/relation between column families/tables.
Assigning exact same values to Pk fields(userid,modified) will result in update the existing record else it will add set of fields.
refence:
https://www.datastax.com/dev/blog/we-shall-have-order

Related

Apache Cassandra stock data model design

I got a lot of data regarding stock prices and I want to try Apache Cassandra out for this purpose. But I'm not quite familiar with the primary/ partition/ clustering keys.
My database columns would be:
Stock_Symbol
Price
Timestamp
My users will always filter for the Stock_Symbol (where stock_symbol=XX) and then they might filter for a certain time range (Greater/ Less than (equals)). There will be around 30.000 stock symbols.
Also, what is the big difference when using another "filter", e.g. exchange_id (only two stock exchanges are available).
Exchange_ID
Stock_Symbol
Price
Timestamp
So my users would first filter for the stock exchange (which is more or less a foreign key), then for the stock symbol (which is also more or less a foreign key). The data would be inserted/ written in this order as well.
How do I have to choose the keys?
The Quick Answer
Based on your use-case and predicted query pattern, I would recommend one of the following for your table:
PRIMARY KEY (Stock_Symbol, Timestamp)
The partition key is made of Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those two fields. If either are to be filtered on, filtering on Stock_Symbol will be required in the query and must come as the first condition to WHERE.
Or, for the second case you listed:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
The partition key is composed of Exchange_ID and Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those three fields. If any of those three are to be filtered on, filtering on both Exchange_ID and Stock_Symbol will be required in the query and must come in that order as the first two conditions to WHERE.
See the last section of this answer for a few other variations that could also be applied based on your needs.
Long Answer & Explanation
Primary Keys, Partition Keys, and Clustering Columns
Primary keys in Cassandra, similar to their role in relational databases, serve to identify records and index them in order to access them quickly. However, due to the distributed nature of records in Cassandra, they serve a secondary purpose of also determining which node that a given record should be stored on.
The primary key in a Cassandra table is further broken down into two parts - the Partition Key, which is mandatory and by default is the first column in the primary key, and optional clustering column(s), which are all fields that are in the primary key that are not a part of the partition key.
Here are some examples:
PRIMARY KEY (Exchange_ID)
Exchange_ID is the sole field in the primary key and is also the partition key. There are no additional clustering columns.
PRIMARY KEY (Exchange_ID, Timestamp, Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is Exchange_ID and Timestamp and Stock_Symbol are both clustering columns.
PRIMARY KEY ((Exchange_ID, Timestamp), Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. The extra parenthesis grouping Exchange_ID and Timestamp group them into a single composite partition key, and Stock_Symbol is a clustering column.
PRIMARY KEY ((Exchange_ID, Timestamp))
Exchange_ID and Timestamp together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. There are no clustering columns.
But What Do They Do?
Internally, the partitioning key is used to calculate a token, which determines on which node a record is stored. The clustering columns are not used in determining which node to store the record on, but they are used in determining order of how records are laid out within the node - this is important when querying a range of records. Records whose clustering columns are similar in value will be stored close to each other on the same node; they "cluster" together.
Filtering in Cassandra
Due to the distributed nature of Cassandra, fields can only be filtered on if they are indexed. This can be accomplished in a few ways, usually by being a part of the primary key or by having a secondary index on the field. Secondary indexes can cause performance issues according to DataStax Documentation, so it is typically recommended to capture your use-cases using the primary key if possible.
Any field in the primary key can have a WHERE clause applied to it (unlike unindexed fields which cannot be filtered on in the general case), but there are some stipulations:
Order Matters - The primary key fields in the WHERE clause must be in the order that they are defined; if you have a primary key of (field1, field2, field3), you cannot do WHERE field2 = 'value', but rather you must include the preceding fields as well: WHERE field1 = 'value' AND field2 = 'value'.
The Entire Partition Key Must Be Present - If applying a WHERE clause to the primary key, the entire partition key must be given so that the cluster can determine what node in the cluster the requested data is located in; if you have a primary key of ((field1, field2), field3), you cannot do WHERE field1 = 'value', but rather you must include the full partition key: WHERE field1 = 'value' AND field2 = 'value'.
Applied to Your Use-Case
With the above info in mind, you can take the analysis of how users will query the database, as you've done, and use that information to design your data model, or more specifically in this case, the primary key of your table.
You mentioned that you will have about 30k unique values for Stock_Symbol and further that it will always be included in WHERE cluases. That sounds initially like a resonable candidate for a partition key, as long as queries will include only a single value that they are searching for in Stock_Symbol (e.g. WHERE Stock_Symbol = 'value' as opposed to WHERE Stock_Symbol < 'value'). If a query is intended to return multiple records with multiple values in Stock_Symbol, there is a danger that the cluster will need to retrieve data from multiple nodes, which may result in performance penalties.
Further, if your users wish to filter on Timestamp, it should also be a part of the primary key, though wanting to filter on a range indicates to me that it probably shouldn't be a part of the partition key, so it would be a good candidate for a clustering column.
This brings me to my recommendation:
PRIMARY KEY (Stock_Symbol, Timestamp)
If it were important to distribute data based on both the Stock_Symbol and the Timestamp, you could introduce a pre-calculated time-bucketed field that is based on the time but with less cardinality, such as Day_Of_Week or Month or something like that:
PRIMARY KEY ((Stock_Symbol, Day_Of_Week), Timestamp)
If you wanted to introduce another field to filtering, such as Exchange_ID, it could be a part of the partition key, which would mandate it being included in filters, or it could be a part of the clustering column, which would mean that it wouldn't be required unless subsequent fields in the primary key needed to be filtered on. As you mentioned that users will always filter by Exchange_ID and then by Stock_Symbol, it might make sense to do:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
Or to make it non-mandatory:
PRIMARY KEY (Stock_Symbol, Exchange_ID, Timestamp)

Allow filtering function in Cassandra (which choice is correct?)

I am currently trying to model some time series data in base of Cassandra.
For example i have a table bigint_table, which was created by following query
**
CREATE TABLE bigint_table (name_id int,tuuid timeuuid, timestamp
timestamp, value text, PRIMARY KEY ((name_id),tuuid, timestamp)) WITH
CLUSTERING ORDER BY (tuuid asc, timestamp asc)
**
tuuid column was added because without it I had problems and I lost some data while inserting them in DB. name_id represents the channel's ID data comes from.tuuid column was added because without it I had problems and I lost some data while inserting them in DB. In one table there are lots of data with the same ID, but they are unique by timestamp and tuuid (values also can be the same sometimes).
I consistently execute 2 different queries to get values and timestamps
select value from bigint_table where name_id=6 and timestamp>'
2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30
8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
2.
select timestamp from bigint_table where name_id=6 and timestamp>'
2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30
8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
In this post author says one need to resist the urge to just add ALLOW FILTERING to itand one should think about data, model and what one is trying to do.
I thought a lot about using ALLOW FILTERING function or not, and I figured out that I have no choice in my case and I need to use it. But those words in post I mentioned above are keeping me in doubt. I would like to know your advise and what do you thnik about my problem. Is there another way to model my data tables, queries of which do not require ALLOW FILTERING? I would be very very thank you for advice.
The reason you need allow filtering is because you have the clustering column (tuuid, timestamp)in the wrong order. In this case the data stored first by tuuid and then by timestamp.But you're actually choosing data by timestamp and then ordering by tuuid so Cassandra can't use the indexes that you have specified. The order when you define the primary key matters.

Cassandra sort not by primary key

I'm trying to model a table in Cassandra, I'm quite new and stumbled upon one problem. I've got the following:
CREATE TABLE content_registry (
service text,
file text,
type_id tinyint,
container text,
status_id tinyint,
source_location text,
expiry_date timestamp,
modify_date timestamp,
create_date timestamp,
to_overwrite boolean,
PRIMARY KEY ((service), file, type_id)
);
So as I understand:
service is my partition key and based on this value hashes will be generated and values will be split in cluster
file is clustering key
type_id is clustering key
These three bodies combine a composite (compound) primary key
What I've figured out is that whenever I'll insert new data, Cassandra will upsert (either insert or update if the value with that compound primary key exists)
Now what I'm struggling is, that I want my data to come back sorted by create_date in descending order, however create_date is not part of primary key.
If I add create_date to my primary key, I won't be able to upsert data, because create_date means timestamp when record was inserted, so if I add it to primary key every time there's an insert, I'll end up with multiple records.
What are the other options? Order in application? That doesn't seem very efficient.
What I've figured out is that whenever I'll insert new data, Cassandra
will upsert (either insert or update if the value with that compound
primary key exists)
Totally right.
Now what I'm struggling is, that I want my data to come back sorted by
create_date in descending order, however create_date is not part of
primary key.
If I add create_date to my primary key, I won't be able to upsert
data, because create_date means timestamp when record was inserted, so
if I add it to primary key every time there's an insert, I'll end up
with multiple records.
With these sentences you are actually contradicting.
If create_date isn't part of your key but a property and the data is upserted, it means that the records are always the same. Therefore when querying by the key and fetching create_date you always have the latest. If you actually want to have the date when the record got created you should just not override the data anymore after the first time you inserted that record.
If it's the case you want to represent a series of data, you indeed need to avoid upserting, this is could be done by using create_date as additional partition key. I'd rather prefeer using time_uuid which comes with quite handy functions.
Last but not least, the most interesting question is, what actually the usecase is that you want to reflect. When modelling data in cassandra you always should know your queries you need to run in advance.
The key concept in Cassandra is that you have to decide what's your PRIMARY KEY, that is what in your rows can be unique and known at query times. This is a very basic requirement, since failing at recognizing this will lead to a bad model.
From what I can see, you identified service as your PARTITION KEY, so I'm thinking that this field is what "rules" your data. This is something you must really know to perform even a single query (ignoring the inefficient table scan SELECT * FROM content_registry;). Within each service, you currently have your rows ordered by file and then by type_id. I don't know the exact meaning of the latter field, but you can currently have two rows identified by ('service1', 'a.jpg', 1) and ('service1', 'a.jpg', 2). So if type_id is somehow related to the file, the model is a bit incorrect.
Now, assuming you want to fetch the same records for each service in another order, what you really need to do is create another table that will include the create_date as the first clustering column, eg (service, create_date, file, type_id). This will allow you to fetch records ordered by creation date, and when two records are created in the same date, they will be further ordered by file, and then by type_id.
A second approach is to attach a secondary index to the create_date field of your original table. This will allow to query by creation date.
A third approach, probably better than the second, is the use of a Materialized View. It will hide a lot of burdens for you and will probably scale better than secondary indexes.
Please note that having secondary indexes or materialized views usually don't scale well. Check if these approaches are enough for your use case.
If I add create_date to my primary key, I won't be able to upsert data.
Why not? Suppose your key was PRIMAY KEY (service, create_date, file, type_id)? That will let you sort by create_date for each service but not globally.
If you want to do it globally (that is, you want all services and all files sorted by create date) then things are probably more complex if you still want to be able to shard your data. One option would be to make the primary key PRIMARY KEY (create_date, service, file, type_id) and use one of the order preserving partitioners.
Also, a bit more information here: http://www.datastax.com/dev/blog/we-shall-have-order

Primary Key related CQL3 Queries cases & errors when sorting

I have two issues while querying Cassandra:
Query 1
> select * from a where author='Amresh' order by tweet_id DESC;
Order by with 2ndary indexes is not supported
What I learned: secondary indexes are made to be used only with a WHERE clause and not ORDER BY? If so, then how can I sort?
Query 2
> select * from a where user_id='xamry' ORDER BY tweet_device DESC;
Order by currently only supports the ordering of columns following their
declared order in the PRIMARY KEY.
What I learned: The ORDER BY column should be in the 2nd place in the primary key, maybe? If so, then what if I need to sort by multiple columns?
Table:
CREATE TABLE a(
user_id varchar,
tweet_id varchar,
tweet_device varchar,
author varchar,
body varchar,
PRIMARY KEY(user_id,tweet_id,tweet_device)
);
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');
Create index user_index on a(author);
To answer your questions, let's focus on your choice of primary key for this table:
PRIMARY KEY(user_id,tweet_id,tweet_device)
As written, the user_id will be used as the partition key, which distributes your data around the cluster but also keeps all of the data for the same user ID on the same node. Within a single partition, unique rows are identified by the pair (tweet_id, tweet_device) and those rows will be automatically ordered by tweet_id because it is the second column listed in the primary key. (Or put another way, the first column in the PK that is not a part of the partition key determines the sort order of the partition.)
Query 1
The WHERE clause is author='Amresh'. Note that this clause does not involve any of the columns listed in the primary key; instead, it is filtering using a secondary index on author. Since the WHERE clause does not specify an exact value for the partition key column (user_id) using the index involves scanning all cluster nodes for possible matches. Results cannot be sorted when they come from more than one replica (node) because that would require holding the entire result set on the coordinator node before it could return any results to the client. The coordinator can't know what is the real "first" result row until it has confirmed that it has received and sorted every possible matching row.
If you need the information for a specific author name, separate from user ID, and sorted by tweet ID, then consider storing the data again in a different table. The data design philosophy with Cassandra is to store the data in the format you need when reading it and to actually denormalize (store redundant information) as necessary. This is because in Cassandra, writes are cheap (though it places the burden of managing multiple copies of the same logical data on the application developer).
Query 2
Here, the WHERE clause is user_id = 'xamry' which happens to be the partition key for this table. The good news is that this will go directly to the replica(s) holding this partition and not bother asking the other nodes. However, you cannot ORDER BY tweet_device because of what I explained at the top of this answer. Cassandra stores rows (within a single partition) sorted by a single column, generally the second column in the primary key. In your case, you can access data for user_id = 'xamry' ORDER BY tweet_id but not ordered by tweet_device. The answer, if you really need the data sorted by device, is the same as for Query 1: store it in a table where that is the second column in the primary key.
If, when looking up the tweets by user_id you only ever need them sorted by device, simply flip the order of the last two columns in your primary key. If you need to be able to sort either way, store the data twice in two different tables.
The Cassandra storage engine does not offer multi-column sorting other than the order of columns listed in your primary key.

Cassandra CQL SELECT/DELETE issue due to primary key constraints

I need to store latest updates that needs to be pushed to users' newsfeed page in Cassandra table for later retrieval and my table's schema is as follow:
CREATE TABLE newsfeed (user_name text,
post_id bigint,
post_type text,
favorited boolean,
shared boolean,
own boolean,
date timestamp,
PRIMARY KEY (user_name,date,post_id,post_type) );
The first three column (username, postid, and posttype) in combination will build the actual primary-key of the table, however since I wanted to ORDER the SELECT queries on this table based on "date"s of rows I placed the date-column into the primary key fields as the "second" entry (did I have to do this?).
When I want to delete a row by giving only "user_name, post_id, and post_type" as follow:
DELETE FROM newsfeed WHERE user_name='pooria' and post_id=36 and post_type='p';
I will get the following error:
Bad Request: Missing PRIMARY KEY part date since post_id is set
I need the date-column to be part of the primary key since I want to use it in my ORDER BY clauses and on the other hand I have to delete some rows without knowing their "date" values!
So how such problems are tackled in Cassandra? should I be fixing my Data Model and have different schema for job?
DataStax's Chief Evangelist Patrick McFadden posted an article demonstrating a few time series modeling patterns. Definitely makes for a good read, and should be of some help to you: Getting Started with Time Series Data Modeling.
I think your table is just fine. Although, with the way that composite primary keys work in Cassandra, if you cannot skip primary key components in a query. So if you do end up needing to query data by user_name, post_id, and/or post_type differently (without date), you should create a table specifically for that query (which does not include date in the primary key).
I will however say that in-general, creating a table which will process regular delete operations is not a good idea. In fact, I'm pretty sure that has been classified as a Cassandra "anti-pattern." Data really isn't deleted from Cassandra; it is tombstoned. Tombstones are reconciled at compaction time (assuming that the tombstone threshold time has been met), and having too many of them has been known to cause performance issues.
If you read the article I linked above, go down to the section named "Time Series Pattern 3." You will notice that the INSERT statements are run with the USING TTL clause. This gives the data a time-to-live in seconds, after which it will "quietly disappear." For instance, if you wanted to keep your data around for 24 hours (86400 seconds) you could do something like this:
INSERT INTO newsfeed (...) VALUES (...) USING TTL 86400
Using the TTL feature is a preferable alternative to regular cleansing by DELETE.

Resources