How do I select all rows from two clustering columns in a Cassandra database?

I have a partition key: A
Clustering columns: B, C
I understand I can query like this:
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
Now I have a scenario where I need to fetch results using only B and C. Is there a way to do this without using ALLOW FILTERING?

You cannot query on 'B' and 'C' (the clustering columns) alone, without the partition key, unless you use ALLOW FILTERING. You can, however, use Spark with the spark-cassandra-connector to filter the results on 'B' and 'C'. Behind the scenes it also uses ALLOW FILTERING, but it has an efficient mechanism for scanning the table.

Related

How to avoid key column name duplication in join?

I'm trying to join two tables in Spark SQL. Each table has 50+ columns. Both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has a duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since both tables have so many columns, I do not want to type all the other column names in the query above. (Other than the id column, there are no other duplicated column names.)
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.
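The difference between ON and USING can be checked with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption made only so the example runs without a cluster), and the columns name and city are made up for illustration. SQLite follows the same rule that a column named in USING appears only once under select *.

```python
import sqlite3

# SQLite stands in for Spark SQL here (assumption, for a runnable demo).
# The columns "name" and "city" are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (id INTEGER, name TEXT);
    CREATE TABLE tbl2 (id INTEGER, city TEXT);
    INSERT INTO tbl1 VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO tbl2 VALUES (1, 'oslo'), (2, 'rome');
""")


def column_names(query):
    """Return the result-set column names of a query, in order."""
    cursor = conn.execute(query)
    return [description[0] for description in cursor.description]


# Joining with ON keeps the id column from both tables.
on_columns = column_names("SELECT * FROM tbl1 JOIN tbl2 ON tbl1.id = tbl2.id")

# Joining with USING keeps a single id column.
using_columns = column_names("SELECT * FROM tbl1 JOIN tbl2 USING (id)")

print(on_columns)
print(using_columns)
```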
Assuming you want to preserve the "duplicates", you can try using the internal row id or an equivalent. This helped me in the past when I had to delete exactly one of two identical rows.
select *,ctid from table;
In PostgreSQL this also outputs the internal row identifier (ctid), so rows that were previously identical become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional
  )
joined should have only one id column. Tested with Spark 2.4.8.

Use IN in any column in a Cassandra Table

I want to be able to use IN on any column, in any order, in Cassandra.
I have the following table:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
and this data:
INSERT INTO test (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO test (a, b, c) VALUES ('a2', 'b2', 'c2');
This query works:
SELECT * FROM test WHERE c IN ('c1', 'c2') AND b IN ('b1') ALLOW FILTERING;
But if you remove the b IN it gives this error:
SELECT * FROM test WHERE c IN ('c1', 'c2') ALLOW FILTERING;
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN restrictions are not supported on indexed columns"
It seems that if I want to use IN on a column, I must also use IN on some preceding columns?
Is there a way to avoid this?
Modifying the schema is acceptable, but I need to use Cassandra and allow filtering on any column (if there is no need to filter on a column, then there would be no IN clause for that column).
Thanks for reading.
P.S.: I know you are not supposed to use ALLOW FILTERING; please assume there's no other way.
Edit: Seems like they may have fixed this?: https://issues.apache.org/jira/browse/CASSANDRA-14344
There is a lot of confusion about Cassandra's primary keys.
To answer your question, I think you need to understand how Cassandra primary keys work internally.
When you create a primary key with multiple fields, as in your case:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
"a" will be the partition key; you can think of it as a hash that chooses the partition on which the data will be stored.
b and c will be the clustering keys. They act like a sorted list of your data, with c nested inside each b value, which means you have to provide b in order to put constraints on c.
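The nesting described above can be illustrated with a toy model. This is only a sketch in Python, not Cassandra's actual storage engine: one partition is represented as a list of (b, c) pairs kept in sorted order, and the sample values are made up.

```python
from bisect import bisect_left, bisect_right

# Toy model of a single partition (a = 'a1'): rows are kept sorted by the
# clustering columns (b, c). This is an illustration, not Cassandra internals.
partition_a1 = sorted([
    ("b1", "c1"), ("b1", "c2"), ("b2", "c1"), ("b3", "c2"),
])


def rows_for_b(rows, b):
    """Rows sharing a b value are contiguous, so a binary search finds them."""
    b_values = [row[0] for row in rows]
    start = bisect_left(b_values, b)
    end = bisect_right(b_values, b)
    return rows[start:end]


def rows_for_c(rows, c):
    """c is only sorted within each b, so matches are scattered across the
    partition and every row has to be inspected (a filtering scan)."""
    return [row for row in rows if row[1] == c]


print(rows_for_b(partition_a1, "b1"))
print(rows_for_c(partition_a1, "c1"))
```

In this model, restricting b narrows the search to one contiguous slice, while restricting only c forces a pass over the whole partition, which is the kind of work ALLOW FILTERING makes explicit.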
The Cassandra documentation states that you can use the IN clause only on the last column of the partition key and on the last clustering column, and note that you must also provide all the preceding clustering columns.
So basically there is no way to do that in one table.
You should think about the trade-off between query flexibility and data duplication.
One solution is to denormalize your data into two tables like this:
CREATE TABLE test1 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b));
CREATE TABLE test2 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (c, a, b));
By doing so, you will be able to query each table depending on your use case.
The following queries will work:
SELECT * FROM test2 WHERE c IN ('c1', 'c2');
SELECT * FROM test1 WHERE a IN ('a1', 'a2');
SELECT * FROM test1 WHERE b IN ('b1', 'b2') ALLOW FILTERING;
And so on; I think you get the point.
But do try to find the best trade-off in order to minimize the use of ALLOW FILTERING, and remember that queries directly on partition keys will be the fastest.
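The denormalization idea can be sketched as a dual write: every insert goes to both tables, and each query then reads the table whose partition key it knows. The Python below is a toy in-memory model under that assumption (the dictionaries test1 and test2 mirror the table names above; the helper functions are hypothetical), and it ignores updates that change c for an existing (a, b).

```python
from collections import defaultdict

# Toy stand-ins for the two tables (illustration only).
test1 = defaultdict(dict)  # partition key a -> {b: c}
test2 = defaultdict(set)   # partition key c -> {(a, b), ...}


def insert(a, b, c):
    """Write the same logical row to both tables."""
    test1[a][b] = c
    test2[c].add((a, b))


def rows_by_a(a_values):
    """Like: SELECT * FROM test1 WHERE a IN (...), a direct partition lookup."""
    return sorted((a, b, c) for a in a_values for b, c in test1[a].items())


def rows_by_c(c_values):
    """Like: SELECT * FROM test2 WHERE c IN (...), a direct partition lookup."""
    return sorted((a, b, c) for c in c_values for a, b in test2[c])


insert("a1", "b1", "c1")
insert("a2", "b2", "c2")
print(rows_by_c(["c1", "c2"]))
print(rows_by_a(["a1"]))
```

The cost of this design is that the application (or a batch) has to keep both copies consistent on every write.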

How do I select all rows for a clustering column in cassandra?

I have a partition key: A
Clustering columns: B, C
I understand I can query like this:
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
In certain cases, I want B to match any value in that column.
Is there a way I can query like the following?
Select * from table where A = ? and B = 'any value' and C = ?
Option 1:
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your fourth query (querying by A and C without necessarily knowing the B value) is to create a new table to handle that specific query. This table will be much the same, except that the clustering columns will be in a slightly different order:
PRIMARY KEY (A, C, B)
Now this query will work:
Select * from table where A = ? and C = ?
Option 2:
Alternatively you can create a materialized view, with a different clustering order. Now Cassandra will keep the MV in sync with your table data.
create materialized view mv_acbd as
select A, B, C, D
from TABLE1
where A is not null and B is not null and C is not null
primary key (A, C, B);
Now the query against this MV will work like a charm:
Select * from mv_acbd where A = ? and C = ?
Option 3:
Not the best, but you could use the following query with your table as it is
Select * from table where A = ? and C = ? ALLOW FILTERING
Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster. In this particular case, the scan stays within a single partition, and performance may vary depending on how many rows each partition holds in your use case.

How to make the query to work?

I have Cassandra version 2.0 and I am totally new to it, hence the question...
I have a table T1 with columns named 1, 2, 3, ..., 14 (for simplicity);
The partition key is columns 1, 2;
The clustering key is columns 3, 1, 5;
I need to perform following query:
SELECT 1,2,7 FROM T1 where 2='A';
Column 2 is a flag, so values are repeating.
I get the following error:
Unable to execute CQL query: Partitioning column 2 cannot be restricted because the preceding column 1 is either not restricted or is restricted by a non-EQ relation
So what is the right way to do it? I really need to get data that is already filtered. Thanks.
So, to make sure I understand your schema, you have defined a table T1:
CREATE TABLE T1 (
1 INT,
2 INT,
3 INT,
...
14 INT,
PRIMARY KEY ((1, 2), 3, 1, 5)
);
Correct?
If this is the case, then Cassandra cannot find the data to answer your CQL query:
SELECT 1,2,7 FROM T1 where 2 = 'A';
because your query has not provided a value for column "1". Without it, Cassandra cannot compute the partition key (which, per your composite PRIMARY KEY definition, requires both columns "1" and "2"), and without the partition key it cannot determine which nodes in the ring to look on. By including "2" in your partition key, you are telling Cassandra that this data is required to determine where to store (and thus where to read) that data.
For example, given your schema, this query should work:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
since you are providing both values of your partition key.
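Why both columns are needed can be shown with a toy model of partition placement. The sketch below is in Python and rests on stated assumptions: a hypothetical three-node ring, and an MD5-based hash used purely for illustration (it is not Cassandra's partitioner). The point is only that the token is derived from the whole composite key.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical three-node ring


def node_for(partition_key):
    """Map the full composite partition key to a node.

    Toy hash for illustration; Cassandra's real partitioner differs.
    """
    raw = "|".join(str(part) for part in partition_key).encode()
    token = int.from_bytes(hashlib.md5(raw).digest()[:8], "big")
    return NODES[token % len(NODES)]


def candidate_nodes(col1=None, col2=None):
    """Nodes that could hold the requested rows in this model."""
    if col1 is None or col2 is None:
        # Part of the key is missing, so no token can be computed and any
        # node might hold matching rows. Cassandra rejects such a query
        # instead of silently scanning the whole ring.
        return list(NODES)
    return [node_for((col1, col2))]


print(candidate_nodes(col1="X", col2="A"))  # exactly one node
print(candidate_nodes(col2="A"))            # every node is a candidate
```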
@Caleb Rockcliffe has good advice, though, regarding the need for other, secondary/supplemental lookup mechanisms if the above table definition is a big part of your workload. You may need to find some way to first look up the values for "1" and "2", then issue your query. E.g.:
CREATE TABLE T1_index (
1 INT,
2 INT,
PRIMARY KEY (1, 2)
);
Given a value for "1", the above will provide all of the possible "2" values, through which you can then iterate:
SELECT 2 FROM T1_index WHERE 1 = 'X';
And then, for each "1" and "2" combination, you can then issue your query against table T1:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
Hope this helps!
Your WHERE clause needs to include the first element of the partition key.

Cassandra CQL: selecting rows that have different values in two columns

I created a table:
CREATE TABLE T (
I int PRIMARY KEY,
A text,
B text
);
Then I added two columns, X and Y, using:
ALTER TABLE T ADD X int;
CREATE INDEX ON T (X);
ALTER TABLE T ADD Y int;
CREATE INDEX ON T (Y);
I inserted some data, and now I would like to count the rows that have different values in X and Y (even X < Y would be fine). I tried something like this:
select COUNT(*) from T where X < Y ;
But I get the error: no viable alternative at input ';'
This also fails without COUNT, with a plain *.
Do you have any suggestions for how to overcome this error?
I tried using counters instead of integers, but they forced me to put all non-counter data into the primary key, which wasn't a good idea in my case...
I'm using Cassandra 1.2.6 and CQL 3.
PS: Can I perform an UPDATE on all rows, without a WHERE clause or with some dummy one?
Because Cassandra prefers simple reads, the Cassandra way to do this is to write a boolean flag column on every insert/update. With a secondary index on that column, you may be able to query it faster as well.
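The flag-on-write idea can be sketched as follows. This is a toy in-memory model in Python, not driver code: the dictionary stands in for table T, and the flag name x_differs_from_y is hypothetical. The comparison is done once at write time, so the read side only has to test a single stored value.

```python
# Toy stand-in for table T, keyed by the primary key I (illustration only).
table = {}


def upsert(i, x, y):
    """Store X and Y together with a precomputed flag (hypothetical column
    x_differs_from_y) so reads never compare two columns."""
    table[i] = {"X": x, "Y": y, "x_differs_from_y": x != y}


def count_different():
    """Count rows by the stored flag; in Cassandra this would be a simple
    equality restriction on the flag column rather than X < Y."""
    return sum(1 for row in table.values() if row["x_differs_from_y"])


upsert(1, 1, 2)
upsert(2, 3, 3)
upsert(3, 5, 4)
print(count_different())
```

The flag has to be recomputed on every write that touches X or Y, which is the price paid for keeping the read simple.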
