How do I select all rows for a clustering column in cassandra? - cassandra

I have a Partion key: A
Clustering columns: B, C
I do understand I can query like this
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
On certain cases, I want the B value to be any value in that column.
Is there a way I can query like the following?
Select * from table where A = ? and B = 'any value' and C = ?

Option 1:
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your fourth query (queries by A and C, but not necessarily knowing B value), is to create a new table to handle that specific query. This table will be pretty much the same, except the CLUSTERING COLUMNS will be in slightly different order:
PRIMARY KEY (A, C, B)
Now this query will work:
Select * from table where A = ? and C = ?
Option 2:
Alternatively you can create a materialized view, with a different clustering order. Now Cassandra will keep the MV in sync with your table data.
create materialized view mv_acbd as
select A, B, C, D
from TABLE1
where A is not null and B is not null and C is not null
primary key (A, C, B);
Now the query against this MV will work like a charm
Select * from mv_acbd where A = ? and C = ?
Option 3:
Not the best, but you could use the following query with your table as it is
Select * from table where A = ? and C = ? ALLOW FILTERING
Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster. For this particular case, the scan is within the same partition and performance may vary depending on ratio of how many clustering columns per partition your use case has.

Related

Error Snowflake - Unsupported subquery type cannot be evaluated

I am facing an error in snowflake saying "Unsupported subquery type cannot be evaluated" after for example executing the below statement. How should write this statement to avoid this error?
select A
from (
select b
, c
FROM test_table
) ;
The outer query column list needs to be within the column list of the subquery. example: select b from (select b,c from test_table);
ignoring "columns" the query you have shown will never trigger this error.
You would get it from this form though:
select A.*
from tableA as A
where a.x = (select b.y FROM test_table as b where b.z = a.z)
this form assuming there is only 1 b.y per b.z can be turned into a inner join like
select A.*
from tableA as A
join test_table as b
on b.z = a.z and a.x = b.y
other forms of this pattern do the likes of max(b.y) and those can be made into a sub-select like:
select A.*
from tableA as A
join (
select c.z, max(c.y) from test_table as c group by 1
) as b
on b.z = a.z and a.x = b.y
but the general pattern is, in other databases there is no "cost" to do row-by-row queries, where-as Snowflake is more optimal with pre-building tables of similar data, and then equi-joining those results together. So both the "how-to-write" example pivot from a for-each-row thinking to a build the set of all possible answers, and then join that. This allows for the most parallel processing of the data possible. And while it means you the develop need to understand your data to get he best performance out of it, in general if you are doing large scale data processing, you should be understanding your data. So this costs, is rather acceptable imho.
If you are trying to Match Two Attributes on the Subquery.
Use like below:
If both need to matched:
select * from Table WHERE a IN ( select b FROM test_table ) AND a IN ( select c FROM test_table )
If any one need to matched:
select * from Table WHERE a IN ( select b FROM test_table ) OR a IN ( select c FROM test_table )

How do I select all rows from two clustering columns in cassandra database

I have a Partion key: A
Clustering columns: B, C
I do understand I can query like this
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
Now I have a scenario where I need to fetch results from only B and C. Is there a way to this with out using Allow Filtering.
You cannot fetch on basis of 'B' and 'C' (the clustering columns) without partition key without using Allow Filtering. Though you can use spark and spark-cassandra-connector for filtering out the results on basis of 'B' and 'C'. Behind the scene it also used allow filtering but it has efficient mechanism to scan the table the right way.

How to do compare/subtract records

Table A having 20 records and table B showing 19 records. How to find that one record is which is missing in table B. How to do compare/subtract records of these two tables; to find that one record. Running query in Apache Superset.
The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, then you would just add logic to the exists clause, e.g. for three columns:
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.col1 = a.col1 AND
b.col2 = a.col2 AND
b.col3 = a.col3)

Use IN in any column in a Cassandra Table

I want to be able to use IN in any column in any order in Cassandra
So I have the next table:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
and this data:
INSERT INTO test (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO test (a, b, c) VALUES ('a2', 'b2', 'c2');
This query works:
SELECT * FROM test WHERE c IN ('c1', 'c2') AND b IN ('b1') ALLOW FILTERING;
But if you remove the b IN it gives this error:
SELECT * FROM test WHERE c IN ('c1', 'c2') ALLOW FILTERING;
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN
restrictions are not supported on indexed columns"
Seems like if I want to use IN in a column I should have used IN in some previous columns?
Is there a way to avoid this?
Modifying the Schema is valid but I need to use Cassandra and allow filtering through any columns (if there's no need to filter thought a columns then there would be no IN clause for that column).
Thanks for reading.
P.S: I know you are not supposed to use ALLOW FILTERING please assume there's no other way.
Edit: Seems like they may have fixed this?: https://issues.apache.org/jira/browse/CASSANDRA-14344
There is a lot of confusion cassandra's primary keys.
In order to respond to your question, i think you need to understand how cassandra primary keys are working internally.
When you are creating a Primary key with multiple fields like in your case:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
"a" will be the partition key, you can imagine it as a hash that will chose the partition on which data will be distributed.
b and c will be the clustering keys, these keys will be like a sorted list of your data and c will be nested in each b value, that means that you have to provide b in order to do constraints on c.
The cassandra documentation states that you can only use In clause on last column of the partition key and the last of the clustering key, but attention you'll have to provide all the other clustering keys.
So basically there is no way to do that in one table.
You should think of a tradeOff of your query flexibility vs data duplication.
One solution will be to denormalize your data in 2 tables like this:
CREATE TABLE test1 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b));
CREATE TABLE test2 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (c, a, b));
By doing so, you will be able to query each table depending on your use case.
The following queries will work:
SELECT * FROM test2 WHERE c IN ('c1', 'c2');
SELECT * FROM test1 WHERE a IN ('a1', 'a2');
SELECT * FROM test1 WHERE b IN ('b1', 'b2') ALLOW FILTERING;
And so on, i think you got the point.
But really try to do the best tradeoff, in order to minimize the allow filtering usage. and remember that the queries on partition keys directly will be the fastest.

Cassandra schema - select by frequently updated column

Given table:
CREATE TABLE T (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a)
);
I'm frequently updating records. With each update last_modification_time is set to now() and also other fields are set.
What is the right cassandra approach to be able to query by last_modification_time range? I need to query like this:
select * from .. where a=Z and last_modification_time < X and last_modification_time > Y;
One way would be to create materialized view with PRIMARY KEY (a, last_modification_time) but I want to avoid this since materialized views are buggy in 3.X cassandra versions.
What would be alternative way of querying by last_modification_time range given last_modification_time is frequently updated?
How about having two tables? One could hold the current snapshot where you're updating the last_modification_time field and another one which holds the changes over time (something like a history table)? You could write to both of them using BATCH statements.
CREATE TABLE t_modifications (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a, last_modification_time)
) WITH CLUSTERING ORDER BY (last_modificaton_time DESC);
BEGIN BATCH
UPDATE T SET last_modification_time = 123, b = 4 WHERE a = 2;
INSERT INTO t_modifications (a, last_modification_time, b) values (2, 123, 4);
APPLY BATCH;
If you're interested on the latest snapshot by a given modification range, you can select and limit the t_modifications table:
SELECT * FROM t_modifications WHERE a = 2 AND last_modification_time < 136 LIMIT 1;
In general, to do range queries like this, the field you want to range on has to be part of the composite key, has to be the right-most element of the composite key, and all other elements in the composite key have to be specified. In your case, you would modify your PRIMARY KEY to (a, last_modification_time). You can then
SELECT * from t_modifications
WHERE a = aval
AND last_modification_time > beg
AND last_modification_time < end;
This will get you all records for aval between beg and end.

Resources