Use IN in any column in a Cassandra Table - cassandra

I want to be able to use IN in any column in any order in Cassandra
So I have the next table:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
and this data:
INSERT INTO test (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO test (a, b, c) VALUES ('a2', 'b2', 'c2');
This query works:
SELECT * FROM test WHERE c IN ('c1', 'c2') AND b IN ('b1') ALLOW FILTERING;
But if you remove the b IN it gives this error:
SELECT * FROM test WHERE c IN ('c1', 'c2') ALLOW FILTERING;
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN
restrictions are not supported on indexed columns"
Seems like if I want to use IN in a column I should have used IN in some previous columns?
Is there a way to avoid this?
Modifying the Schema is valid but I need to use Cassandra and allow filtering through any columns (if there's no need to filter thought a columns then there would be no IN clause for that column).
Thanks for reading.
P.S: I know you are not supposed to use ALLOW FILTERING please assume there's no other way.
Edit: Seems like they may have fixed this?: https://issues.apache.org/jira/browse/CASSANDRA-14344

There is a lot of confusion cassandra's primary keys.
In order to respond to your question, i think you need to understand how cassandra primary keys are working internally.
When you are creating a Primary key with multiple fields like in your case:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
"a" will be the partition key, you can imagine it as a hash that will chose the partition on which data will be distributed.
b and c will be the clustering keys, these keys will be like a sorted list of your data and c will be nested in each b value, that means that you have to provide b in order to do constraints on c.
The cassandra documentation states that you can only use In clause on last column of the partition key and the last of the clustering key, but attention you'll have to provide all the other clustering keys.
So basically there is no way to do that in one table.
You should think of a tradeOff of your query flexibility vs data duplication.
One solution will be to denormalize your data in 2 tables like this:
CREATE TABLE test1 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b));
CREATE TABLE test2 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (c, a, b));
By doing so, you will be able to query each table depending on your use case.
The following queries will work:
SELECT * FROM test2 WHERE c IN ('c1', 'c2');
SELECT * FROM test1 WHERE a IN ('a1', 'a2');
SELECT * FROM test1 WHERE b IN ('b1', 'b2') ALLOW FILTERING;
And so on, i think you got the point.
But really try to do the best tradeoff, in order to minimize the allow filtering usage. and remember that the queries on partition keys directly will be the fastest.

Related

How do I select all rows for a clustering column in cassandra?

I have a Partion key: A
Clustering columns: B, C
I do understand I can query like this
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
On certain cases, I want the B value to be any value in that column.
Is there a way I can query like the following?
Select * from table where A = ? and B = 'any value' and C = ?
Option 1:
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your fourth query (queries by A and C, but not necessarily knowing B value), is to create a new table to handle that specific query. This table will be pretty much the same, except the CLUSTERING COLUMNS will be in slightly different order:
PRIMARY KEY (A, C, B)
Now this query will work:
Select * from table where A = ? and C = ?
Option 2:
Alternatively you can create a materialized view, with a different clustering order. Now Cassandra will keep the MV in sync with your table data.
create materialized view mv_acbd as
select A, B, C, D
from TABLE1
where A is not null and B is not null and C is not null
primary key (A, C, B);
Now the query against this MV will work like a charm
Select * from mv_acbd where A = ? and C = ?
Option 3:
Not the best, but you could use the following query with your table as it is
Select * from table where A = ? and C = ? ALLOW FILTERING
Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster. For this particular case, the scan is within the same partition and performance may vary depending on ratio of how many clustering columns per partition your use case has.

Performance consideration of Cassandra query using composite partition key vs clustering column

I have the following DDL:
CREATE TABLE mykeyspace.mytable (
a text,
b text,
c text,
d text,
e text,
starttime timestamp,
endtime timestamp,
PRIMARY KEY ((a, b, c), d, e, starttime, endtime)
) WITH CLUSTERING ORDER BY (d ASC, e ASC, starttime ASC, endtime ASC)
and I only have the following SELECT/DELETE query:
SELECT */DELETE FROM mytable WHERE a = ? AND b = ? AND C = ? AND d = ?;
I just wonder if the column d can be included as part of the composite partition key so a row lookup is enough instead of a row lookup + clustering column lookup? In this case it will improve performance as well?
The column d include in the composite partition key will absolutely improve performance
Your data will distribute well among the cluster.
Your SELECT query will be faster, no clustering level filtering is required
Your DELETE query will mark that partition as markedForDeleteAt, instead of inserting range tombstone
I feel that the more columns I have in the PARTITION KEY the better.
So my suggestion is to incluse as much columns as possible in the PARTITION KEY. It will improve SELECT query performances in general, and will avoid some tombstones problems as well (because you will delete at partition level, unless you recreate the partitions of course).

Ranges (intervals) request in Cassandra DB - CQL

Excuse, if it is a duplicate, I've found a few questions about times ranges here, but my case seems a little bit different and not yet discussed.
I would like to store quite big chunks (bins) of data (blobs - 2-4Mb, this is the “black-box data”, I can't change its layout) to access with interval keys:
...
primary key ( bin_id int, from_item_id int, to_item_id int )
...
with ability to select with items ranges, like in this pseudo-code to select all chunks that contains interval of items [110, 200]:
select chunk from tb1 where chunk_id = 100500 and from_item_id >= 110 and to_item_id <= 200;
Attempt to run such a query directly ended with error:
code=2200 [Invalid query] message="PRIMARY KEY column "to_item_id" cannot be restricted (preceding column "from_item_id" is restricted by a non-EQ relation)"
Currently only solution I've found is to implement additional table (tb_map) with reverse mapping from item_id to bin_id and use select to make a query looks something like this:
...
– in tb_map
primary key (dummy_id, item_id)
...
select bin_id from tb_map where dummy_id = SOME_MAGIK and item_id >= 110 and item_id <= 200;
And then use bin_id to retrieve chunks from tb1 with EQ or IN operator like here:
select * from tb1 where bin_id in (...);
But I can't use this model due insert performance issues (application should avoid many inserts to additional table and should avoid maintaining additional data structures, but should be "as simple as nail").
Is it any simple solution to stay within one table (or several simple tables)? I'm stuck with no ideas how to model such behaviour in C* (may be slices should be used?), could local C* gurus provide any hints?
I'm using CQL 3.1
From CQL3 reference:
Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows.
In your case the query doesn't select a contiguous set of rows, so Cassandra refuses to process it.

Choosing the right schema for cassandra "table" in CQL3

We are trying to store lots of attributes for a particular profile_id inside a table (using CQL3) and cannot wrap our heads around which approach is the best:
a. create table mytable (profile_id, a1 int, a2 int, a3 int, a4 int ... a3000 int) primary key (profile_id);
OR
b. create MANY tables, eg.
create table mytable_a1(profile_id, value int) primary key (profile_id);
create table mytable_a2(profile_id, value int) primary key (profile_id);
...
create table mytable_a3000(profile_id, value int) primary key (profile_id);
OR
c. create table mytable (profile_id, a_all text) primary key (profile_id);
and just store 3000 "columns" inside a_all, like:
insert into mytable (profile_id, a_all) values (1, "a1:1,a2:5,a3:55, .... a3000:5");
OR
d. none of the above
The type of query we would be running on this table:
select * from mytable where profile_id in (1,2,3,4,5423,44)
We tried the first approach and the queries keep timing out and sometimes even kill cassandra nodes.
The answer would be to use a clustering column. A clustering column allows you to create dynamic columns that you could use to hold the attribute name (col name) and it's value (col value).
The table would be
create table mytable (
profile_id text,
attr_name text,
attr_value int,
PRIMARY KEY(profile_id, attr_name)
)
This allows you to add inserts like
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a1', 3);
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a2', 1031);
.....
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'an', 2);
This would be the optimal solution.
Because you then want to do the following
'The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)'
This would require 6 queries under the hood but cassandra should be able to do this in no time especially if you have a multi node cluster.
Also if you use the DataStax Java Driver you can run this requests asynchronously and concurrently on your cluster.
For more on data modelling and the DataStax Java Driver check out DataStax's free online training. Its worth a look
http://www.datastax.com/what-we-offer/products-services/training/virtual-training
Hope it helps.

Filtering on primary key with nullable column in cassandra versions 1.2.5 and greater

Im trying to understand how to do this, or if I cant, why cant I do this. ie
cqlsh:ps> create table s1 (a text, b text, stuff text, primary key(a,b));
cqlsh:ps> select * from s1 where a='a' and b=null;
Bad Request: Invalid null clustering key part b
It appears that there is a work around, i.e. don't store nulls in column "b", but something like "##null##", but that just seems silly.
Any help would be much appreciated.
You can't insert a null into b because it's part of the clustering key. If you try an insert query along the lines of the below: INSERT INTO ps.s1 (a, b, c) VALUES ('a', null, 'c_val');
You should get an exception:Bad Request: Invalid null value for clustering key part password
If you want to retrieve all data where a=a then you would simply do:
select * from s1 where a='a';
You can however insert nulls into fields which aren't part of the key:
INSERT INTO ps.s1 (a, b, c) VALUES ('a', 'b', null);

Resources