Should every table in Cassandra have a partition key?

I am trying to create a Cassandra table where I store the logs for a shop by timestamp. I also want to create a query which returns the data in descending order with respect to the timestamp. If I make my timestamp the primary key, it will automatically be the partition key, as I don't have any other columns in a composite primary key.
And in Cassandra we can't do ORDER BY on partition keys. Is there any way to make my timestamp the primary key but not the partition key (a Cassandra table without a partition key)?
Thanks in advance.
Table creation, if required:
CREATE TABLE myCass.logs(timestamp timestamp, logs text, PRIMARY KEY (timestamp));

Since you have the timestamp, you know the year, month, and day. You could use those as your partition key and keep the timestamp as a clustering column. That way you satisfy the need for a partition key, you still have a primary key for the data, you can ORDER BY the timestamp within a partition, and the data is spread across the cluster by day rather than landing in a single partition.
This way of splitting data is called bucketing. Here is some good reading on this subject - Cassandra Time Series Data Modeling For Massive Scale
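As a rough illustration of the bucketing idea, here is a minimal Python sketch that derives a day bucket from a log timestamp. The table name logs_by_day, the column names day and ts, and the choice of UTC are assumptions made for the example; they are not part of the original question.

```python
from datetime import datetime, timezone

# Hypothetical bucketed table (names are illustrative only):
# the day is the partition key, the full timestamp is the clustering column.
BUCKETED_LOGS_DDL = """
CREATE TABLE myCass.logs_by_day (
    day date,
    ts timestamp,
    logs text,
    PRIMARY KEY ((day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""

def day_bucket(ts: datetime) -> str:
    """Return the partition key value (UTC calendar day) for a log timestamp."""
    return ts.astimezone(timezone.utc).date().isoformat()

# Every log written on the same UTC day lands in the same partition.
example_ts = datetime(2018, 6, 22, 1, 46, 49, tzinfo=timezone.utc)
print(day_bucket(example_ts))
```

The application would compute day_bucket for each write and for each day it wants to read. Whether a day is the right bucket size depends on the write volume.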

Related

Apache Cassandra stock data model design

I have a lot of data regarding stock prices and I want to try Apache Cassandra out for this purpose. But I'm not quite familiar with the primary/partition/clustering keys.
My database columns would be:
Stock_Symbol
Price
Timestamp
My users will always filter by Stock_Symbol (WHERE stock_symbol = XX), and then they might filter by a time range (greater than, less than, or equal). There will be around 30,000 stock symbols.
Also, what is the big difference when using another "filter", e.g. exchange_id (only two stock exchanges are available).
Exchange_ID
Stock_Symbol
Price
Timestamp
So my users would first filter for the stock exchange (which is more or less a foreign key), then for the stock symbol (which is also more or less a foreign key). The data would be inserted/ written in this order as well.
How do I have to choose the keys?
The Quick Answer
Based on your use-case and predicted query pattern, I would recommend one of the following for your table:
PRIMARY KEY (Stock_Symbol, Timestamp)
The partition key is Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those two fields. Any query that filters on either field must restrict Stock_Symbol, because the partition key is always required; a condition on Timestamp can then be added. The order in which the conditions are written inside WHERE does not matter.
Or, for the second case you listed:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
The partition key is composed of Exchange_ID and Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those three fields. Any query that filters on one of them must restrict both Exchange_ID and Stock_Symbol, because the full partition key is always required; a condition on Timestamp can then be added. The order in which the conditions are written inside WHERE does not matter.
See the last section of this answer for a few other variations that could also be applied based on your needs.
Long Answer & Explanation
Primary Keys, Partition Keys, and Clustering Columns
Primary keys in Cassandra, similar to their role in relational databases, serve to identify records and index them so they can be accessed quickly. However, due to the distributed nature of records in Cassandra, they serve a secondary purpose: determining which node a given record should be stored on.
The primary key in a Cassandra table is further broken down into two parts - the Partition Key, which is mandatory and by default is the first column in the primary key, and optional clustering column(s), which are all fields that are in the primary key that are not a part of the partition key.
Here are some examples:
PRIMARY KEY (Exchange_ID)
Exchange_ID is the sole field in the primary key and is also the partition key. There are no additional clustering columns.
PRIMARY KEY (Exchange_ID, Timestamp, Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is Exchange_ID; Timestamp and Stock_Symbol are both clustering columns.
PRIMARY KEY ((Exchange_ID, Timestamp), Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. The extra parentheses around Exchange_ID and Timestamp group them into a single composite partition key, and Stock_Symbol is a clustering column.
PRIMARY KEY ((Exchange_ID, Timestamp))
Exchange_ID and Timestamp together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. There are no clustering columns.
But What Do They Do?
Internally, the partition key is used to calculate a token, which determines on which node a record is stored. The clustering columns are not used in determining which node stores the record, but they do determine the order in which records are laid out within a partition, which is important when querying a range of records. Records in the same partition whose clustering columns are similar in value will be stored close to each other on the same node; they "cluster" together.
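To make the token idea concrete, here is a toy Python sketch of key-to-node placement. It is only an illustration under loud assumptions: the three node names are made up, MD5 stands in for Cassandra's real partitioner, and a simple modulo replaces real token ranges and replication.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def token(partition_key: tuple) -> int:
    """Toy token: a stable hash computed from the partition key values only."""
    raw = "|".join(str(value) for value in partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")

def node_for(partition_key: tuple) -> str:
    """Toy placement: modulo over the node list (real clusters use token ranges)."""
    return NODES[token(partition_key) % len(NODES)]

# Rows for PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp).
rows = [("NYSE", "IBM", 3), ("NYSE", "IBM", 1), ("NYSE", "IBM", 2)]

# The clustering column (the third value) is never passed to node_for,
# so all rows sharing a partition key resolve to the same node.
placements = {node_for((exchange, symbol)) for exchange, symbol, _ in rows}
print(placements)
```

The point of the sketch is only that the clustering value plays no part in placement; which node is chosen here says nothing about a real cluster.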
Filtering in Cassandra
Due to the distributed nature of Cassandra, fields can only be filtered on if they are indexed. This can be accomplished in a few ways, usually by being a part of the primary key or by having a secondary index on the field. Secondary indexes can cause performance issues according to DataStax Documentation, so it is typically recommended to capture your use-cases using the primary key if possible.
Any field in the primary key can have a WHERE clause applied to it (unlike unindexed fields which cannot be filtered on in the general case), but there are some stipulations:
Order Matters - Primary key fields can only be restricted following the order in which they are defined; if you have a primary key of (field1, field2, field3), you cannot do WHERE field2 = 'value', but rather you must include the preceding fields as well: WHERE field1 = 'value' AND field2 = 'value'. It is the definition order that counts, not the order in which the conditions are written inside the WHERE clause.
The Entire Partition Key Must Be Present - If applying a WHERE clause to the primary key, the entire partition key must be given so that the cluster can determine what node in the cluster the requested data is located in; if you have a primary key of ((field1, field2), field3), you cannot do WHERE field1 = 'value', but rather you must include the full partition key: WHERE field1 = 'value' AND field2 = 'value'.
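The two stipulations above can be sketched as a small Python check. This is a simplified model, not Cassandra's actual validation logic: it only considers equality conditions and ignores secondary indexes, ALLOW FILTERING, IN, and range conditions. The function name can_query is made up for the example.

```python
def can_query(partition_key, clustering, restricted):
    """Return True if equality conditions on `restricted` columns form a
    valid primary key lookup under the two rules described above."""
    restricted = set(restricted)
    # Rule: the entire partition key must be present.
    if not set(partition_key) <= restricted:
        return False
    # Rule: restricted clustering columns must form a prefix of the
    # clustering columns in definition order (no gaps).
    remaining = restricted - set(partition_key)
    prefix = set()
    for column in clustering:
        if column not in remaining:
            break
        prefix.add(column)
    return prefix == remaining

# PRIMARY KEY (field1, field2, field3)
print(can_query(["field1"], ["field2", "field3"], ["field2"]))
print(can_query(["field1"], ["field2", "field3"], ["field2", "field1"]))
# PRIMARY KEY ((field1, field2), field3)
print(can_query(["field1", "field2"], ["field3"], ["field1"]))
print(can_query(["field1", "field2"], ["field3"], ["field1", "field2"]))
```

Because the check works on a set of column names, it also reflects that only the definition order of the key matters, not the order in which conditions are written.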
Applied to Your Use-Case
With the above info in mind, you can take the analysis of how users will query the database, as you've done, and use that information to design your data model, or more specifically in this case, the primary key of your table.
You mentioned that you will have about 30k unique values for Stock_Symbol and further that it will always be included in WHERE clauses. That sounds initially like a reasonable candidate for a partition key, as long as queries will include only a single value that they are searching for in Stock_Symbol (e.g. WHERE Stock_Symbol = 'value' as opposed to WHERE Stock_Symbol < 'value'). If a query is intended to return multiple records with multiple values in Stock_Symbol, there is a danger that the cluster will need to retrieve data from multiple nodes, which may result in performance penalties.
Further, if your users wish to filter on Timestamp, it should also be a part of the primary key, though wanting to filter on a range indicates to me that it probably shouldn't be a part of the partition key, so it would be a good candidate for a clustering column.
This brings me to my recommendation:
PRIMARY KEY (Stock_Symbol, Timestamp)
If it were important to distribute data based on both the Stock_Symbol and the Timestamp, you could introduce a pre-calculated time bucket field that is derived from the time but has lower cardinality, such as Day_Of_Week or Month:
PRIMARY KEY ((Stock_Symbol, Day_Of_Week), Timestamp)
If you wanted to introduce another field for filtering, such as Exchange_ID, it could be a part of the partition key, which would mandate it being included in filters, or it could be a clustering column, which would mean that it wouldn't be required unless subsequent fields in the primary key needed to be filtered on. As you mentioned that users will always filter by Exchange_ID and then by Stock_Symbol, it might make sense to do:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
Or to make it non-mandatory:
PRIMARY KEY (Stock_Symbol, Exchange_ID, Timestamp)

Cassandra data model for time series data

To monitor some distributed software, I insert its monitoring data into a Cassandra table. The columns are metric_type, metric_value, host_name, component_type and time_stamp. I collect all the metrics for all the nodes every second. The time is uniform across all nodes and their metrics. The keys (the columns that differentiate rows) are host_name, component_type, metric_type and time_stamp. I designed my table as follows:
CREATE TABLE metrics (
    component_type text,
    host_name text,
    metric_type text,
    time_stamp bigint,
    metric_value text,
    PRIMARY KEY ((component_type, host_name, metric_type), time_stamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC);
where component_type, host_name and metric_type form the partition key and time_stamp is the clustering key.
The metrics table is suitable for queries that get data by timestamp for a host_name or a metric_type or a component_type: using the partition key, Cassandra finds the partition where the data is stored, and using the clustering key it fetches the data from that partition, which is the optimal case for Cassandra queries.
Besides that, I need a query that fetches all data just using time_stamp. For example :
SELECT * from metrics WHERE time_stamp >= 1529632009872 and time_stamp < 1539632009872 ;
I know the metrics table is not optimal for the above query, because it has to search every partition to fetch the data. I guess that in this situation we should design another table with time_stamp as the partition key, so that data is fetched from one or a limited number of partitions. But I am not certain about some aspects:
Is it optimal to set time_stamp as the partition key? I insert data into the database every second, so the number of partition keys will be very large!
I need my queries to cover an interval on time_stamp, and I know range conditions are not allowed on partition keys, only on clustering keys!
So what is the best Cassandra data model for such time series data and query?
Using time_stamp as the partition key is not optimal in my opinion, as it would create a lot of partitions.
I would propose two solutions:
1) Use a "week_first_day" as the partition key. You would have to compute the correct week_first_day keys on the application side and then issue multiple SELECT queries.
2) Use Elasticsearch on top of Cassandra. Cassandra remains the primary data source, but you gain the freedom to do complex selects. If you are interested, I would recommend taking a look at Elassandra.
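For the first option, the application-side computation could look roughly like the Python sketch below. It assumes that time_stamp holds epoch milliseconds (as in the example query), that weeks start on Monday in UTC, and that a hypothetical table metrics_by_week uses week_first_day as its partition key with time_stamp as a clustering column. The %s placeholder style is also just an assumption.

```python
from datetime import date, datetime, timedelta, timezone

def week_first_day(ts_ms: int) -> date:
    """Monday (UTC) of the week containing an epoch-millisecond timestamp."""
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
    return day - timedelta(days=day.weekday())

def week_buckets(start_ms: int, end_ms: int) -> list:
    """All week_first_day values touched by the half-open interval [start_ms, end_ms)."""
    current = week_first_day(start_ms)
    last = week_first_day(end_ms - 1)
    buckets = []
    while current <= last:
        buckets.append(current)
        current += timedelta(days=7)
    return buckets

def bucket_queries(start_ms: int, end_ms: int) -> list:
    """One (statement, parameters) pair per week partition; table name is hypothetical."""
    cql = ("SELECT * FROM metrics_by_week "
           "WHERE week_first_day = %s AND time_stamp >= %s AND time_stamp < %s")
    return [(cql, (bucket.isoformat(), start_ms, end_ms))
            for bucket in week_buckets(start_ms, end_ms)]

# The interval from the question spans several weekly partitions.
queries = bucket_queries(1529632009872, 1539632009872)
print(len(queries), queries[0][1][0], queries[-1][1][0])
```

Each query hits exactly one partition, and the range condition stays on the clustering column. The application then merges the per-week results.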

Cassandra - Internal data storage when no clustering key is specified

I'm trying to understand the scenario when no clustering key is specified in a table definition.
If a table has only a partition key and no clustering key, in what order are the rows under the same partition stored? Is it even allowed to have multiple rows under the same partition when no clustering key exists? I tried searching for it online but couldn't get a clear explanation.
I got the below explanation from Cassandra user group so posting it here in case someone else is looking for the same info:
"Note that a table always has a partition key, and that if the table has
no clustering columns, then every partition of that table is only
comprised of a single row (since the primary key uniquely identifies
rows and the primary key is equal to the partition key if there is no
clustering columns)."
http://cassandra.apache.org/doc/latest/cql/ddl.html#the-partition-key
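The consequence of that quote can be mimicked with a toy Python model in which a table is a dictionary keyed by the full primary key, and a write to an existing key overwrites it (Cassandra writes behave as upserts). The helper name write and the sample values are made up for the illustration.

```python
def write(table: dict, primary_key: tuple, values: dict) -> None:
    """Toy upsert: a write to an existing primary key overwrites its columns."""
    table.setdefault(primary_key, {}).update(values)

# PRIMARY KEY (timestamp): partition key only, so primary key == partition key.
logs = {}
write(logs, ("2018-06-22T01:46:49Z",), {"logs": "first"})
write(logs, ("2018-06-22T01:46:49Z",), {"logs": "second"})

# PRIMARY KEY ((day), ts): the clustering column ts tells rows apart
# inside the same partition.
logs_by_day = {}
write(logs_by_day, ("2018-06-22", "01:46:49"), {"logs": "first"})
write(logs_by_day, ("2018-06-22", "01:46:50"), {"logs": "second"})

print(len(logs), len(logs_by_day))
```

In the first table the second write replaces the first, so the partition never holds more than one row and the question of ordering within it does not arise.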

How to add multiple columns to the primary key in Cassandra?

I have an existing table with millions of records. Initially we had two columns, one as the partition key and one as the clustering key, and now I want to add two more columns to the table as part of the partition key.
How?
If you make a change to the partition key you will need to create a new table and import the existing data. This is due to, in part, the fact that a partition key is not equal to a primary key in a relational database. The partition key is hashed by Cassandra and that hash is used to find partitions on disk. If you change the partition key you change the hash value and can no longer look up the partition!
CREATE TABLE KEYSPACE_NAME.AMAR_EXAMPLE (
COLUMN_1 TYPE,
COLUMN_2 TYPE,
COLUMN_3 TYPE,
...
COLUMN_N TYPE,
// Here we declare the partition key columns and clustering columns
PRIMARY KEY ((COLUMN_1, COLUMN_2, COLUMN_3, COLUMN_4), CLUSTERING_COLUMN)
)
// If you need to change the default clustering order, declare that here
WITH CLUSTERING ORDER BY (CLUSTERING_COLUMN DESC);
You could export the data to CSV using COPY and then import the data to the new table via COPY or use the SSTABLELOADER. There is plenty of documentation and walkthroughs on how to use those tools. For example, this Datastax blog post talks about the changes made to the updated SSTABLELOADER. If you create a new table and import the existing data you will create new partitions and new hashes. Cassandra will not let you simply add additional columns to the partition key after the table has been created.
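As a sketch of the COPY route, the Python helper below only builds the two cqlsh statements for an export and a re-import. The keyspace, table names, column list, and file path are placeholders invented for the example, and HEADER = TRUE is one optional setting among several. For very large tables sstableloader may be the better fit.

```python
def copy_statements(keyspace, old_table, new_table, columns, csv_path):
    """Build the cqlsh COPY TO / COPY FROM statements for a table migration."""
    column_list = ", ".join(columns)
    export = (f"COPY {keyspace}.{old_table} ({column_list}) "
              f"TO '{csv_path}' WITH HEADER = TRUE;")
    load = (f"COPY {keyspace}.{new_table} ({column_list}) "
            f"FROM '{csv_path}' WITH HEADER = TRUE;")
    return export, load

# All names below are hypothetical.
export_cql, load_cql = copy_statements(
    "my_keyspace", "events_old", "events_new",
    ["column_1", "column_2", "column_3"], "events.csv",
)
print(export_cql)
print(load_cql)
```

The statements would be run in cqlsh after the new table, with its wider partition key, has been created.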
Understanding your data and the Cassandra data modeling techniques will help mitigate the amount of work you may find yourself doing changing partition keys. Check out the self-paced courses provided by Datastax. DS220: Data Modeling could really help.

row key in cassandra table

I am new to Cassandra, and I am confused about the difference between a row key and a partition key.
I am creating a table like:
Create table events( day text, hour text, dip text, sip text, count counter,
primary key((day,hour), dip, sip));
As per my understanding, in the above table day and hour columns form a partition key and dip,sip columns form a clustering key.
My understanding is that row key is nothing but partition key i.e. day, hour columns form a row key.
Is my understanding correct? Can any one clarify this?
Yes, your understanding is correct. The row key is the "old school" way of referring to a partition key. The partition key (as you probably understand) is the part of the CQL PRIMARY KEY that determines where the data is stored in the cluster. In your case, data within each partition will be sorted by dip and then sip (your clustering keys).
You should give John Berryman's article Understanding How CQL3 Maps To Cassandra’s Internal Data Structure a read. It does a great job of explaining how your table structures map "under the hood."
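A toy Python model of the events table may help picture this: rows are grouped by the partition key (day, hour) and, inside each group, ordered by the clustering columns dip and then sip. The sample addresses and counts are invented, and plain string comparison stands in for how text columns are ordered.

```python
from collections import defaultdict

# (day, hour, dip, sip, count) rows for PRIMARY KEY ((day, hour), dip, sip)
rows = [
    ("2018-06-22", "01", "10.0.0.2", "10.0.0.9", 4),
    ("2018-06-22", "01", "10.0.0.1", "10.0.0.7", 1),
    ("2018-06-22", "02", "10.0.0.1", "10.0.0.5", 2),
    ("2018-06-22", "01", "10.0.0.1", "10.0.0.3", 7),
]

# Group rows by partition key, whatever order they were written in.
partitions = defaultdict(list)
for day, hour, dip, sip, count in rows:
    partitions[(day, hour)].append((dip, sip, count))

# Inside each partition, order rows by the clustering columns (dip, then sip).
for key in partitions:
    partitions[key].sort(key=lambda row: (row[0], row[1]))

for key, clustered_rows in partitions.items():
    print(key, clustered_rows)
```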

Resources