Cassandra: How to query a primary key of type text using comparison operators like > and <?

I have a table in Cassandra whose primary key is of type text (string) and stores only numbers.
Now I want to run a CQL query on this column using comparison operators like > and <.
Any idea how to accomplish this?
select * from ynapanalyticsteam.df_tran_order_info where order_header_key>'2018';

Why do you store it as text when your values are numbers?
Cassandra has some important concepts: partition key and clustering key.
Let's say you have a table like this:
TABLE A (
...,
PRIMARY KEY ((pk1, pk2), ck1, ck2, ck3, ck4, ck5)
)
pk1 and pk2 are the partition key columns, and your query must restrict them using =. The partition key is used to determine the nodes to which the data belongs.
The clustering columns (ck1, ck2, ..., ck5) control how data is sorted within a partition and allow further filtering using the =, <, and > operators.
You need to change your data model so that order_header_key is a clustering column and another column is your partition key.
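A small Python sketch of why storing numbers as text matters for > and <. Cassandra compares text values character by character, and for plain ASCII digits Python's string comparison behaves the same way. The sample keys are made up for illustration.

```python
# Hypothetical order_header_key values, stored as text.
keys = ["2018", "999", "10000", "2019"]

# Ordering as text: compared character by character, left to right.
as_text = sorted(keys)

# Ordering as numbers: what the question actually wants.
as_numbers = sorted(keys, key=int)

# "Greater than 2018" under each interpretation.
text_matches = [k for k in keys if k > "2018"]
numeric_matches = [k for k in keys if int(k) > 2018]

print(as_text)          # ['10000', '2018', '2019', '999']
print(as_numbers)       # ['999', '2018', '2019', '10000']
print(text_matches)     # ['999', '2019']
print(numeric_matches)  # ['10000', '2019']
```

So even after order_header_key becomes a clustering column, a text column would return '999' and miss '10000' for a "> 2018" query. A numeric column type, or zero-padded strings of fixed width, avoids this.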

Related

Apache Cassandra stock data model design

I have a lot of data regarding stock prices and I want to try Apache Cassandra out for this purpose. But I'm not quite familiar with the primary/partition/clustering keys.
My database columns would be:
Stock_Symbol
Price
Timestamp
My users will always filter for the Stock_Symbol (where stock_symbol=XX) and then they might filter for a certain time range (greater than, less than, or equal to). There will be around 30,000 stock symbols.
Also, what is the big difference when using another "filter", e.g. exchange_id (only two stock exchanges are available).
Exchange_ID
Stock_Symbol
Price
Timestamp
So my users would first filter for the stock exchange (which is more or less a foreign key), then for the stock symbol (which is also more or less a foreign key). The data would be inserted/ written in this order as well.
How do I have to choose the keys?
The Quick Answer
Based on your use-case and predicted query pattern, I would recommend one of the following for your table:
PRIMARY KEY (Stock_Symbol, Timestamp)
The partition key is Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those two fields. If either is to be filtered on, the query must include an equality condition on Stock_Symbol.
Or, for the second case you listed:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
The partition key is composed of Exchange_ID and Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those three fields. If any of those three is to be filtered on, the query must include equality conditions on both Exchange_ID and Stock_Symbol.
See the last section of this answer for a few other variations that could also be applied based on your needs.
Long Answer & Explanation
Primary Keys, Partition Keys, and Clustering Columns
Primary keys in Cassandra, similar to their role in relational databases, serve to identify records and index them in order to access them quickly. However, due to the distributed nature of records in Cassandra, they serve a secondary purpose of also determining which node a given record should be stored on.
The primary key in a Cassandra table is further broken down into two parts - the Partition Key, which is mandatory and by default is the first column in the primary key, and optional clustering column(s), which are all fields that are in the primary key that are not a part of the partition key.
Here are some examples:
PRIMARY KEY (Exchange_ID)
Exchange_ID is the sole field in the primary key and is also the partition key. There are no additional clustering columns.
PRIMARY KEY (Exchange_ID, Timestamp, Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is Exchange_ID; Timestamp and Stock_Symbol are both clustering columns.
PRIMARY KEY ((Exchange_ID, Timestamp), Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The extra parentheses around Exchange_ID and Timestamp group them into a single composite partition key, and Stock_Symbol is a clustering column.
PRIMARY KEY ((Exchange_ID, Timestamp))
Exchange_ID and Timestamp together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. There are no clustering columns.
But What Do They Do?
Internally, the partitioning key is used to calculate a token, which determines on which node a record is stored. The clustering columns are not used in determining which node to store the record on, but they are used in determining order of how records are laid out within the node - this is important when querying a range of records. Records whose clustering columns are similar in value will be stored close to each other on the same node; they "cluster" together.
Filtering in Cassandra
Due to the distributed nature of Cassandra, fields can only be filtered on if they are indexed. This can be accomplished in a few ways, usually by being a part of the primary key or by having a secondary index on the field. Secondary indexes can cause performance issues according to DataStax Documentation, so it is typically recommended to capture your use-cases using the primary key if possible.
Any field in the primary key can have a WHERE clause applied to it (unlike unindexed fields which cannot be filtered on in the general case), but there are some stipulations:
Order Matters - A primary key field can only be restricted in the WHERE clause if the fields defined before it in the primary key are restricted too; if you have a primary key of (field1, field2, field3), you cannot do WHERE field2 = 'value', but rather you must include the preceding fields as well: WHERE field1 = 'value' AND field2 = 'value'. What matters is the order of the key definition, not the order in which the conditions are written.
The Entire Partition Key Must Be Present - If applying a WHERE clause to the primary key, the entire partition key must be given so that the cluster can determine what node in the cluster the requested data is located in; if you have a primary key of ((field1, field2), field3), you cannot do WHERE field1 = 'value', but rather you must include the full partition key: WHERE field1 = 'value' AND field2 = 'value'.
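The two stipulations above can be sketched as a small Python check. This is a deliberately simplified model, not how Cassandra validates queries: it ignores IN, secondary indexes, and ALLOW FILTERING, and the function name can_query is made up for illustration.

```python
def can_query(partition_key, clustering, restricted):
    """Return True if a WHERE clause is acceptable under the simplified rules.

    partition_key: list of partition key column names
    clustering:    list of clustering column names, in definition order
    restricted:    dict mapping column name -> operator ('=', '<', '>', ...)
    """
    # Rule 1: the entire partition key must be given, each column with '='.
    for col in partition_key:
        if restricted.get(col) != "=":
            return False

    # Rule 2: restricted clustering columns must form a prefix of the
    # definition order, and nothing may follow a range restriction.
    closed = False  # True once a column was skipped or a range was used
    for col in clustering:
        op = restricted.get(col)
        if op is None:
            closed = True
        elif closed:
            return False
        elif op != "=":
            closed = True

    # Rule 3 (simplification): columns outside the primary key cannot be filtered.
    key_columns = set(partition_key) | set(clustering)
    return all(col in key_columns for col in restricted)


# PRIMARY KEY (field1, field2, field3): field2 alone is rejected.
print(can_query(["field1"], ["field2", "field3"], {"field2": "="}))                 # False
print(can_query(["field1"], ["field2", "field3"], {"field1": "=", "field2": "="}))  # True

# PRIMARY KEY ((field1, field2), field3): half a partition key is rejected.
print(can_query(["field1", "field2"], ["field3"], {"field1": "="}))                 # False
print(can_query(["field1", "field2"], ["field3"], {"field1": "=", "field2": "="}))  # True
```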
Applied to Your Use-Case
With the above info in mind, you can take the analysis of how users will query the database, as you've done, and use that information to design your data model, or more specifically in this case, the primary key of your table.
You mentioned that you will have about 30k unique values for Stock_Symbol and further that it will always be included in WHERE clauses. That sounds initially like a reasonable candidate for a partition key, as long as queries will include only a single value that they are searching for in Stock_Symbol (e.g. WHERE Stock_Symbol = 'value' as opposed to WHERE Stock_Symbol < 'value'). If a query is intended to return multiple records with multiple values in Stock_Symbol, there is a danger that the cluster will need to retrieve data from multiple nodes, which may result in performance penalties.
Further, if your users wish to filter on Timestamp, it should also be a part of the primary key, though wanting to filter on a range indicates to me that it probably shouldn't be a part of the partition key, so it would be a good candidate for a clustering column.
This brings me to my recommendation:
PRIMARY KEY (Stock_Symbol, Timestamp)
If it were important to distribute data based on both the Stock_Symbol and the Timestamp, you could introduce a pre-calculated time-bucketed field that is based on the time but with less cardinality, such as Day_Of_Week or Month or something like that:
PRIMARY KEY ((Stock_Symbol, Day_Of_Week), Timestamp)
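A minimal Python sketch of how such a pre-calculated bucket could be derived on the client before writing a row. The function names and the sample symbol 'XX' are illustrative assumptions, and the month bucket is just one of the options mentioned above.

```python
from datetime import datetime, timezone


def month_bucket(ts: datetime) -> str:
    """Low-cardinality bucket: one value per calendar month, e.g. '2018-01'."""
    return ts.strftime("%Y-%m")


def day_of_week_bucket(ts: datetime) -> int:
    """Low-cardinality bucket: 1 (Monday) through 7 (Sunday)."""
    return ts.isoweekday()


ts = datetime(2018, 1, 1, 9, 30, tzinfo=timezone.utc)

# The bucket is stored next to the symbol and forms the partition key with it;
# the full timestamp remains the clustering column.
partition_key = ("XX", month_bucket(ts))
print(partition_key)           # ('XX', '2018-01')
print(day_of_week_bucket(ts))  # 1
```

Because the bucket becomes part of the partition key, every query then has to supply it as well, so a time range spanning several buckets means one query per bucket (or an IN over the buckets).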
If you wanted to introduce another field for filtering, such as Exchange_ID, it could be a part of the partition key, which would mandate it being included in filters, or it could be a clustering column, which would mean that it wouldn't be required unless subsequent fields in the primary key needed to be filtered on. As you mentioned that users will always filter by Exchange_ID and then by Stock_Symbol, it might make sense to do:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
Or to make it non-mandatory:
PRIMARY KEY (Stock_Symbol, Exchange_ID, Timestamp)

Cassandra: Is partition key also used in clustering?

Let's say I have a primary key like this: primary key (PK, CK).
Based on what I read (see refs), I think I can loosely describe the way Cassandra uses PK and CK as follows - PK will be used to decide which node(s) the data should go to and CK will be used for clustering (aka ordering) of data within that node.
Then, it seems PK is not used in clustering data within the node and that sounds wrong. What if I have a simple primary key with just PK? Will Cassandra only distribute data across nodes and not order data within each node since there is no clustering column?
refs:
https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
Difference between partition key, composite key and clustering key in Cassandra?
Then, it seems PK is not used in clustering data within the node and
that sounds wrong. What if I have a simple primary key with just PK?
Will Cassandra only distribute data across nodes and not order data
within each node since there is no clustering column?
Good question. Let's try this out. I'll create a simple table and INSERT some data:
aploetz@cqlsh:stackoverflow> CREATE TABLE programs
(name text PRIMARY KEY, data text);
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Tron');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Yori');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Quorra');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Clu');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Flynn');
aploetz@cqlsh:stackoverflow> INSERT INTO programs (name) VALUES ('Zuze');
Now, let's run a query that should answer your question:
aploetz@cqlsh:stackoverflow> SELECT name, token(name) FROM programs;
 name   | system.token(name)
--------+----------------------
  Flynn | -1059892732813900311
   Zuze |  1815531347795840810
   Yori |  2854211700591734382
 Quorra |  3079126743186967718
   Tron |  6359222509420865788
    Clu |  8304850648940574176
(6 rows)
As you can see, they are definitely not in order by name, which is the partition key and lone PRIMARY KEY. But, my query runs the token() function on name, which shows the hashed value of the partition key (name in this case). The results are ordered by that.
So to answer your question, Cassandra orders its partitions by the hashed value of the partition key. Note that this order is maintained throughout the cluster, not just on a single node. Therefore, results for an unbound query (not recommended to be run in a multi-node configuration) will be ordered by the hashed value of the partition key, regardless of the number of nodes in the cluster.
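To make the ordering concrete, here is a short Python sketch that reuses the token values from the cqlsh output above. It only sorts the numbers already shown; it does not compute tokens itself.

```python
# Token values copied from the SELECT name, token(name) output.
tokens = {
    "Tron": 6359222509420865788,
    "Yori": 2854211700591734382,
    "Quorra": 3079126743186967718,
    "Clu": 8304850648940574176,
    "Flynn": -1059892732813900311,
    "Zuze": 1815531347795840810,
}

# Order in which an unbound SELECT returns the partitions: by token.
by_token = sorted(tokens, key=tokens.get)

# Alphabetical order of the partition key, which Cassandra does not maintain.
by_name = sorted(tokens)

print(by_token)  # ['Flynn', 'Zuze', 'Yori', 'Quorra', 'Tron', 'Clu']
print(by_name)   # ['Clu', 'Flynn', 'Quorra', 'Tron', 'Yori', 'Zuze']
```

Since neighbouring names end up far apart in token order, a range such as name > 'Q' does not correspond to a contiguous range of partitions.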
All data for a table is written to SSTables ordered by the partition key, so yes, the partitions are sorted.
I think what you're asking is why you can't use a partition key the same way you use a clustering key. For example, you can't do less than (<) or greater than (>) on a partition key. Since one node doesn't have all the partition keys, this type of query would have to check with all nodes in your cluster to see if they have any partition key that matches your query.

Cassandra how to filter hex values in blob field

Consider the following table:
CREATE TABLE associations (
someHash blob,
someValue int,
someOtherField text,
PRIMARY KEY (someHash, someValue)
) WITH CLUSTERING ORDER BY (someValue ASC);
The inserts to this table have someHash as a hex value, like 0xA0000000000000000000000000000001, 0xA0000000000000000000000000000002, etc.
If a query needs to find all rows with 0xA0000000000, what's the recommended Cassandra way to do it?
The main problem with your query is that it does not take into account limitations of Cassandra, namely:
someHash is a partition key column
The partition key columns [in WHERE clause] support only two operators: = and IN (i.e. exact match)
In other words, your schema is designed in such a way that the query effectively has to say: "let's retrieve all possible keys [from all nodes], let's filter them (type not important here) and then retrieve values for keys that match the predicate". This is a full scan of sorts and is not what Cassandra is best at. You can try using UDFs to do some data transformation (trimming someHash), but I would expect it to work well only with trivial amounts of data.
The golden rule of Cassandra is "query first": if you have such a use case, the schema should be designed accordingly. The sub-key you want to query by should be the actual partition key (the full someHash value can be part of the clustering key).
BTW, the same limitation applies to most maps in programming: you can't do a lookup by part of a key (because of hashing).
Following your 0xA0000000000 example directly:
You could split someHash into a 48-bit (6-byte) part and an 80-bit (10-byte) part:
PRIMARY KEY ((someHash_head, someHash_tail), someValue)
The IN restriction on someHash_head will then have 16 values, from 0xA00000000000 to 0xA0000000000F.
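A rough Python sketch of that split and of where the 16 values come from. The helper name split_hash is made up for illustration; the byte lengths follow the 6-byte/10-byte split suggested above.

```python
def split_hash(some_hash: bytes, head_len: int = 6):
    """Split a 16-byte hash into a 6-byte head and a 10-byte tail."""
    return some_hash[:head_len], some_hash[head_len:]


# 0xA0000000000000000000000000000001 from the question (32 hex digits).
full = bytes.fromhex("A" + "0" * 30 + "1")
head, tail = split_hash(full)

# The searched prefix 0xA0000000000 has 11 hex digits, one short of a
# 6-byte head, so every possible last digit has to be enumerated.
prefix = "A" + "0" * 10
candidates = [bytes.fromhex(prefix + format(digit, "X")) for digit in range(16)]

print(head.hex(), len(tail))   # a00000000000 10
print(len(candidates))         # 16
print(candidates[0].hex())     # a00000000000
print(candidates[-1].hex())    # a0000000000f
print(head in candidates)      # True
```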

Order by with Cassandra No Sql Db

I'm starting to use Cassandra but I'm having some problems with "ordering" or "selecting".
CREATE TABLE functions (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_function, sort, id_subfunction)
);
This is my table.
If I execute this query
SELECT * FROM functions WHERE id_subfunction = 0 ORDER BY sort;
this is what I get.
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
What am I doing wrong?
Thanks
PRIMARY KEY (id_function, sort, id_subfunction)
In Cassandra CQL the columns in a compound PRIMARY KEY are either partitioning keys or clustering keys. In your case, id_function (the first key listed) is the partitioning key. This is the key value that is hashed so that your data for that key can be evenly distributed on your cluster.
The remaining columns (sort and id_subfunction) are known as clustering columns, which determine the sort order of your data within a partition. This essentially means that your data will only be sorted by your clustering key(s) when a partitioning key is first designated in your WHERE clause.
You have two options:
1) Query this table by id_function instead:
SELECT * FROM functions WHERE id_function = 0 ORDER BY sort;
This will technically work, although I'm guessing that it won't give you the results that you are looking for.
2) The better option is to create a "query table." This is a table designed specifically to handle your query by id_subfunction. It only differs from the original functions table in that the PRIMARY KEY is defined with id_subfunction as the partitioning key:
CREATE TABLE functionsbysubfunction (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_subfunction, sort, id_function)
);
This query table will allow this query to function as expected:
SELECT * FROM functionsbysubfunction WHERE id_subfunction = 0;
And you shouldn't need to indicate ORDER BY, unless you want to specify either ASCending or DESCending order.
Remember with Cassandra, it is important to design your data model according to how you want to query your data. And that may not necessarily be the way that it originally makes sense to store it.

Why does Cassandra/CQL restrict using a WHERE clause on a column that is not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
)
When I try to search by: "select * from emp where first_name='John';"
cql shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue and everywhere it says to add a secondary index for the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
Only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because that would involve scanning all the rows.
The data are internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value and the key of the inner map is a kind of concatenation of all clustering columns values and a name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
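The HashMap[X, SortedMap[Y, Z]] comparison can be sketched in Python as a dict of sorted lists. This is only a toy model of the idea, not Cassandra's storage engine; the class name Table and the sample rows are made up for illustration.

```python
import bisect


class Table:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self.partitions = {}  # pk -> list of (ck, row), sorted by ck

    def insert(self, pk, ck, row):
        rows = self.partitions.setdefault(pk, [])
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, ck)
        if i < len(rows) and rows[i][0] == ck:
            rows[i] = (ck, row)       # same primary key: overwrite
        else:
            rows.insert(i, (ck, row))

    def select(self, pk, ck_min=None, ck_max=None):
        """Cheap path: one hash lookup, then a slice of a sorted list."""
        rows = self.partitions.get(pk, [])
        keys = [k for k, _ in rows]
        lo = 0 if ck_min is None else bisect.bisect_left(keys, ck_min)
        hi = len(rows) if ck_max is None else bisect.bisect_right(keys, ck_max)
        return [row for _, row in rows[lo:hi]]

    def scan(self, predicate):
        """Expensive path: visits every row of every partition."""
        return [row for rows in self.partitions.values()
                for _, row in rows if predicate(row)]


# emp table from the question: empid is the partition key, deptid clusters.
emp = Table()
emp.insert(1, 20, {"first_name": "John", "last_name": "Doe"})
emp.insert(1, 10, {"first_name": "John", "last_name": "Doe"})
emp.insert(2, 10, {"first_name": "Jane", "last_name": "Roe"})

by_key = emp.select(1, ck_min=10, ck_max=15)
by_name = emp.scan(lambda row: row["first_name"] == "John")
print(len(by_key), len(by_name))  # 1 2
```

The select path mirrors a query with the partition key and a clustering range. The scan path is what a filter on first_name would require without an index.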
You may read about what is allowed and what is not here
Alternatively you can create an index in Cassandra, but that will hamper your write performance.
