Cassandra how to filter hex values in blob field - cassandra

Consider the following table:
CREATE TABLE associations (
someHash blob,
someValue int,
someOtherField text
PRIMARY KEY (someHash, someValue)
) WITH CLUSTERING ORDER BY (someValue ASC);
The inserts to this table have someHash as a hex value, like 0xA0000000000000000000000000000001, 0xA0000000000000000000000000000002, etc.
If a query needs to find all rows with 0xA0000000000, what's the recommended Cassandra way to do it?

The main problem with your query is that it does not take into account limitations of Cassandra, namely:
someHash is a partition key column
The partition key columns [in WHERE clause] support only two operators: = and IN (i.e. exact match)
In other words, your schema is designed in such a way, that effectively query should say: "let's retrieve all possible keys [from all nodes], let's filter them (type not important here) and then retrieve values for keys that match predicate". This is a full-scan of some sort and is not what Cassandra is best at. You can try using UDFs to do some data transformation (trimming someHash), but I would expect it to work well only with trivial amounts of data.
Golden rule of Cassandra is "query first": if you have such a use-case, schema should be designed accordingly - sub-key you want to query by should be actual partition key (full someHash value can be part of clustering key).
BTW, same limitation applies to most maps in programming: you can't do lookup by part of key (because of hashing).

Following your 0xA0000000000 example directly:
You could split up someHash into 48 bits (6 bytes) and 80 bits (10 bytes) parts.
PRIMARY KEY ((someHash_head, someHash_tail), someValue)
The IN will then have 16 values, from 0xA00000000000 to 0xA0000000000F.

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select an event with the highest start_time. It is possible? I've tried to create a secondary index, but no success.
This is a query I've created
select * from active_call order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all data and find the highest value. And this will require scanning of data on multiple nodes, and will be very long.
Materialized view also won't help much as order for data only exists inside an individual partition, so you will need to put all your data into a single partition that could be huge and data would be imbalanced.
I can only think of following workaround:
Have an additional table that will have all columns of the original table, but with a fake partition key and no clustering columns
You do inserts into that table in parallel to normal inserts, but use a fixed value for that fake partition key, and explicitly setting a timestamp for a record equal to start_time (don't forget to multiple by 1000 as timestamp uses microseconds). In this case it will guaranteed to be the value with the highest timestamp as Cassandra won't override it with other data with lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Cassandra: How to query primary key using mathematical operators like >,< which is of type text?

I have a table in Cassandra, whose primary key of type text(string) and stores only numbers.
Now I want to perform CQL query on this column using mathematical operators like ><.
Any idea on how to accomplish this?
select * from ynapanalyticsteam.df_tran_order_info where order_header_key>'2018';
why do you store it as text since your values are numbers?
Cassandra has some important concepts: partition key and clustering key.
Let's say you have a table like this:
TABLE A (
...,
PRIMARY KEY ((pk1, pk2), ck1, ck2, ck3, ck4, ck5)
)
pk1 and pk2 are the partition keys and your query must include them using =. The partition key it's used to determine the nodes to which the data belong.
The clustering columns (ck1, ck2, ..., ck5) are used for ordering the data and adding some other filtering using =, <, > operators. The clustering columns are used to control how data it's sorted in the partition.
You need to change your data model so order_header_key to be a clustering key and have another column to be your partition key.

Cassandra Schema for standard SELECT/FROM/WHERE/IN query

Pretty new to Cassandra - I have data that looks like this:
<geohash text, category int, payload text>
The only query I want to run is:
SELECT category, payload FROM table WHERE geohash IN (list of 9 geohashes)
What would be the best schema in this case?
I know I could simply make my geohash the primary key and be done with it, but is there a better approach?
What are the benefits for defining PRIMARY KEY (geohash, category, payload)?
It depends on the size of your data for each row (geohash text, category int, payload text). If your payload size does not reach to tens of Mb, then you may want to put more geohash values into the same partition by using an artificial bucketId int, so your query can be performed on a server. Schema would look like this
geohash text, bucketId int, category int, payload text where the partition key is goehash and bucketId.
The recommendation is to have a sizeable partition <= 100 Mb, so you don't have to look up too many partitions. More is available here.
If you have a primary key on (geohash, category, payload), then you can have your data sorted on category and payload in the ascending order.
So based on the query, it sounds like you're considering a CQL schema that looks like this:
CREATE TABLE geohash_data (
geohash text,
category int,
data text,
PRIMARY KEY (geohash)
);
In Cassandra, the first (and in this case only) column in your PRIMARY KEY is the Partition Key. The Partition Key is what's used to distribute data around the cluster. So when you do your SELECT ... IN () query, you're basically querying for the data in 9 different partitions which, depending on how large your cluster is, the replication factor, and the consistency level you use to do the query, could end up querying at least 9 servers (and maybe more). Why does that matter?
Latency: The more partitions (and thus replicas/servers) involved in our query, the more potential for a slow server being able to negatively impact how quickly the data is returned.
Availability: The more partitions (and thus replicas/servers) involved in our query, the more potential that a single server going down could make it impossible for the query to be satisfied at all.
Both of those are bad scenarios so (as Toan rightly points out in his answer and the link he provided), we try to data model in Cassandra so that our queries will hit as few partitions (and thus replicas/servers) as possible. What does that mean for your scenario? Without knowing all the details, it's hard to say for sure, but let me make a couple guesses about your scenario and give you an example of how I'd try to solve it.
It sounds like maybe you already know the list of possible geohash values ahead of time (maybe they're at some regularly spaced interval of a predefined grid). It also sounds like maybe you're querying for 9 geohash values because you're doing some kind of "proximity" search where you're trying to get the data for the 9 geohashes in each direction around a given point.
If that's the case, the trick could be to denormalize the data at write time into a data model optimized for reading. For example, a schema like this:
CREATE TABLE geohash_data (
geohash text,
data_geohash text,
category int,
data text,
PRIMARY KEY (geohash, data_geohash)
);
When you INSERT a data point, you'd calculate the geohashes for the surrounding areas where you expect that data should show up in the results. You'd then INSERT the data multiple times for each geohash you calculated. So the value for geohash is the calculated value where you expect it to show up in the query results and the value for data_geohash is the actual value from the data you're inserting. Thus you'd have multiple (up to 9?) rows in your partition for a given geohash which represent the data of the surrounding geohashes.
This means your SELECT query now doesn't have to do an IN and hit multiple partitions. You just query WHERE geohash = ? for the point you want to search around.

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL database and have just started using apache Cassandra. I created a simple table "emp" with primary key on "empno" column. This is a simple table as we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued query Select * from emp order by empno but I was surprised that CQL did not allow Order by on empno column (which is PK). Also when I used Where condition, it did not allow any inequality operations on empno column (it said only EQ or IN conditions are allowed). It also did not allowed Where and Order by on any other column, as they were not used in PK, and did not have an index.
Can someone please help me what should I do if I want to keep empno unique in the table and want a query results in Sorted order of empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no
way of getting a result set which is only order by a column that has
been defined as Unique.
(2) When we define a PK
(partition-key+clustering-key), then the results will always be order
by Clustering columns within any fixed partition key (we must restrict
to one partition-key value), that means there is no need of ORDER BY
clause, since it cannot ever change the order of rows (the order in
which rows are actually stored), i.e. Order By is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept). – Aaron 1 hour ago
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on "empno" column of "emp" table, it is still not
allowing ORDER BY empno. So, what Indexes are for? are they only for
searching records for specific value of index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
IT means if we have a billion rows in a table (say Sales table), and
we need sum of sales (1) Product-wise, (2) Region-wise, then we will
duplicate all those billion rows in 2 tables with one in clustering
order of Product, the other in clustering order of Region,. and even
if we need to sum sales per Salesman_id, then we build a 3rd table,
again putting all those billion rows? is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But i found even that is not correct. Cassandra only allows ORDER by in the same direction as we define in the "CLUSTERING ORDER BY" caluse of CREATE TABLE. If in that clause we define ASC, it allows only order by ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little odd if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same (as the CLUSTERING ORDER BY definition) or both different. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either to match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
Adding a redux answer as the accepted one is quite long.
Order by is currently only supported on the clustered columns of the PRIMARY KEY
and when the partition key is restricted by an Equality or an IN operator in where clause.
That is if you have your primary key defined like this :
PRIMARY KEY ((a,b),c,d)
Then you will be able to use the ORDER BY when & only when your query has :
a where clause with all the primary key restricted either by an equality operator (=) or an IN operator such as :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two query are the only valid ones.
Also this query would not work :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because order by currently only support the ordering of columns following their declared order in the PRIMARY KEY that is in primary key definition c has been declared before d and the query violates the ordering by placing d first.

Is cassandra a row column database?

Im trying to learn cassandra but im confused with the terminology.
Many instances it says the row stores key/value pairs.
but, when I define a table its more like declaring a SQL table ie; you create a table and specify the column names and data types.
Can someone clarify this?
Cassandra is a column based NoSQL database. While yes at its lowest level it does store simple key-value pairs it stores these key-value pairs in collections. This grouping of keys and collections is analogous to rows and columns in a traditional relational model. Cassandra tables contain a schema and can be referenced (with restrictions) using a SQL-like language called CQL.
In your comment you ask about Apples being stored in a different table from oranges. The answer to that specific question is No it will be in the same table. However Cassandra tables have an additional concept call the Partition Key that doesn't really have an analgous concept in the relational world. Take for example the following table definition
CREATE TABLE fruit_types {
fruit text,
location text,
cost float,
PRIMARY KEY ((fruit), location)
}
In this table definition you will notice that we are defining the schema for the table. You will also notice that we are defining a PRIMARY KEY. This primary key is similar but not exactly like a relational concept. In Cassandra the PRIMAY KEY is made up of two parts the PARTITION KEY and CLUSTERING COLUMNS. The PARTITION KEY is the first fields specified in the PRIMARY KEY and can contain one or more fields delimitated by parenthesis. The purpose of the PARTITION KEY is to be hashed and used to define the node that owns the data and is also used to physically divide the information on the disk into files. The CLUSTERING COLUMNS make up the other columns listed in the PRIMARY KEY and amongst other things are used for defining how the data is physically stored on the disk inside the different files as specified by the PARTITION KEY. I suggest you do some additional reading on the PRIMARY KEY here if your interested in more detail:
https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html
Basically cassandra storage is like sparse matrix, earlier version has a command line tool called cqlsh which can show the exact storage foot print of your columnfamily(aka table in latest version). Later community decided to keep RDBMS kind of syntax for better understanding coz the query language(CQL) syntax is similar to sql.
main storage is key(partition) (which is hash function result of chosen partition column in your table and rest of the columns will be tagged to it like sparse matrix.

Resources