How can a CQL collection query match alternative values? - cassandra

I have a question about querying a Cassandra collection.
I want to write a query that searches within a collection.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
This is the table structure,
and
this is the table contents.
In this situation, I want to write a CQL query that matches alternative values on the set column.
If it were SQL and testcol3 weren't a collection, I would write
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
But this is CQL and a collection, so I tried
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
but these two queries didn't work.
Please help.

This won't work for you for multiple reasons:
there is no OR operator in CQL
you can only do a full match on the value of the partition key (testcol3)
although you can create secondary indexes on columns of a collection type, it's impossible to create an index on the values of a partition key
You need to change the data model, and to do that you need to know in advance which queries you will run. From a brief look at your data model, I would suggest unrolling the set field into multiple rows, with each individual value stored in its own partition.
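As a rough illustration of that suggestion, here is a small sketch in plain Python rather than against a real cluster. It assumes some made-up sample rows, stores one copy of each row under every member of its set, and emulates "contains '4' OR contains '5'" with one lookup per value merged on the client side. The names unroll and lookup_any are hypothetical helpers, not part of any Cassandra driver.

```python
from collections import defaultdict

# Hypothetical sample data: (testcol3 set, testcol1, testcol2).
original_rows = [
    ({"1", "4"}, "a", 10),
    ({"5"}, "b", 20),
    ({"2", "3"}, "c", 30),
]

def unroll(rows):
    """Store one copy of each row under every individual set member."""
    by_value = defaultdict(list)
    for values, testcol1, testcol2 in rows:
        for value in values:
            by_value[value].append((frozenset(values), testcol1, testcol2))
    return by_value

def lookup_any(by_value, wanted):
    """Emulate an OR over values: one lookup per value, de-duplicated client-side."""
    seen = set()
    result = []
    for value in wanted:
        for row in by_value.get(value, []):
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result

table = unroll(original_rows)
matches = lookup_any(table, ["4", "5"])
```

In a real schema each dictionary key would correspond to a partition key value, so every lookup is a plain partition read and the application merges the results.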
I would also suggest taking the DS201 & DS220 courses on the DataStax Academy site to better understand how Cassandra works and how to model data for it.

Related

Cassandra : (3.11.11) find a string in the cassandra table column

I am a newbie to Cassandra.
I have a table (table1) with data like
ch1,ch2,ch3,ch4
LD,9813970,1484914,'T03103','T04014'
LD,1008203,1486104,'T03103','T04024'
I want to find a string in this Cassandra table, table1. Is there any option to search for a given string in this table's column ch4 using only the IN operator (not the LIKE operator)? A sample query would be
select * from table1 where 'T04014' IN (ch4)
If required, the ch4 column may be included in the partition or clustering keys.
You didn't post the table schema so I'm going to assume that ch4 is not part of the primary key.
You cannot include a column in the filter unless it is part of the primary key or you have a secondary index defined on it. Be aware that secondary indexes are not always a good fit. Have a look at when to use an index for details.
The general recommendation is to denormalise and create a table specifically designed for each app query so you get the best performance out of your cluster. Cheers!

Cassandra query with multiple OPTIONAL condition

Is it possible to achieve this kind of query in cassandra efficiently?
Say I have a table something
CREATE TABLE something(
a INT,
b INT,
c INT,
d INT,
e INT,
PRIMARY KEY(a,b,c,d,e)
);
And I want to query this table in the following ways:
SELECT * FROM something WHERE a=? AND b=? AND e=?
or
SELECT * FROM something WHERE a=? AND c=? AND d=?
or
SELECT * FROM something WHERE a=? AND b=? AND d=?
and so on.
None of the above queries will work, because Cassandra requires a query to restrict clustering columns in order.
I know this kind of scenario would normally call for materialized views or for denormalizing the data into several tables. However, in this case I would need 4*3*2*1 = 24 tables, which is not a viable solution.
A secondary index requires ALLOW FILTERING to be turned on for a multi-index query to work, which seems like a bad idea. Besides, there may be some high-cardinality columns in the something table.
I would like to know if there is any workaround that allows such a complicated query to work.
How are you ending up with 24 tables? I did not get this.
If your query has equality conditions on 3 columns, then isn't it 10 different queries, i.e. 5C3 (5 choose 3)?
Maybe I only partially understood your requirement and you really do need n = 24 queries. But here are my suggestions:
Figure out which columns have low cardinality and create a secondary index to satisfy at least 1 query.
Things to avoid:
Don't go with 1 base table and 23 materialized views. Keep the ratio down to 1 base table : 5 to 8 materialized views. It pays to denormalize from the application side.
You may use a uuid as the primary key in your base table so you can use it in the materialized views.
Overall, even if you have 24 queries, try to get down to 4 or 5 base tables and then create 5 or 6 materialized views on each of them to reach your intended number of 24 or whatever.
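To make the counting in this exchange concrete, the following short sketch works out the numbers being compared. The case where a is always restricted is an assumption read off the question's sample queries; it is not stated by either poster.

```python
from math import comb, factorial

TOTAL_COLUMNS = 5      # a, b, c, d, e
COLUMNS_PER_QUERY = 3  # each query puts equality conditions on three columns

# Any three of the five columns: the "5C3" count from the answer.
any_three = comb(TOTAL_COLUMNS, COLUMNS_PER_QUERY)

# Assumption: `a` is always restricted (as in the question's sample queries),
# so only two of the remaining four columns vary.
a_plus_two = comb(TOTAL_COLUMNS - 1, COLUMNS_PER_QUERY - 1)

# The question's 4*3*2*1 counts orderings of four columns, not selections.
orderings_of_four = factorial(4)
```

Here any_three is 10, a_plus_two is 6 and orderings_of_four is 24, which is why the two posters arrive at different table counts.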
You can use SOLR along with Cassandra to get such queries to work. If you are using DSE, it is much easier. In a SOLR query you can directly write:
SELECT * FROM keyspace.something WHERE solr_query='a:? b:? e:?'
Refer to the link below, which shows all the possible combinations you can use with SOLR
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/queriesCql.html?hl=solr%2Cwhere
Writes are very efficient in C*. Reads by partition key are also performant.
Create 2 tables, index and content:
CREATE TABLE somethingIndex(
a_index text PRIMARY KEY,
a INT
);
CREATE TABLE something(
a INT PRIMARY KEY,
b INT,
c INT,
d INT,
e INT
);
During a write, INSERT every combination of (a,b,c,d,e) by concatenating their values.
With 5 elements taken 3 at a time, the maximum will be 11 inserts: 10 INSERTs into somethingIndex + 1 INSERT into something.
This will be much more efficient than using solr or another solution like materialized views.
Check solr if you need full-text search. For exact search, the above solution is efficient.
To read data, first select the "a" value from somethingIndex and then read from the something table.
SELECT a FROM somethingIndex where a_index = ?; // (a+b+e) or (a+c+d) or (a+b+d);
SELECT * FROM something where a = ?;
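A minimal sketch of how the a_index values for one row might be generated on the application side, assuming a made-up row and an assumed key encoding. The answer only says the values are concatenated, so the 'name=value' parts and the '|' separator are illustrative choices, and index_keys is a hypothetical helper.

```python
from itertools import combinations

COLUMNS = ("a", "b", "c", "d", "e")

def index_keys(row, size=3):
    """Build one concatenated key per combination of `size` columns.

    The 'name=value' parts and the '|' separator are an assumed encoding;
    the source only says the values are concatenated.
    """
    keys = []
    for cols in combinations(COLUMNS, size):
        keys.append("|".join(f"{c}={row[c]}" for c in cols))
    return keys

# Hypothetical row to be written.
row = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
keys = index_keys(row)
total_inserts = len(keys) + 1  # index rows plus the one row in `something`
```

Including the column names in the key keeps (a, b, e) and (a, c, d) from colliding when their values happen to be equal. The reader has to build the lookup key with the same encoding.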

Cassandra Range Query : Secondary Index vs Unindexed Column

I have seen that the best way to do a range query in Cassandra is by using a CLUSTERING KEY. But I need to run range queries on columns other than the CLUSTERING KEY columns.
I read that we can do this on any column using ALLOW FILTERING. But is there any performance advantage if I create a secondary index on that column?
Have a look at this link:
https://www.datastax.com/dev/blog/allow-filtering-explained-2
The ALLOW FILTERING option lets you tell Cassandra that it is OK to perform in-memory filtering of the data once it loads rows from disk. So we can use it to search by a clustering column without specifying the preceding clustering columns. But we can't use it on non-clustering columns.
See the example below from the blog. ALLOW FILTERING doesn't let us filter by the author column until we create an index on it, and then the query no longer needs the ALLOW FILTERING option.
cqlsh:test> SELECT * FROM blogs WHERE author = 'john' ALLOW FILTERING;
Bad Request: No indexed columns present in by-columns clause with Equal operator
cqlsh:test>
cqlsh:test> CREATE INDEX authors ON blogs (author);
cqlsh:test> SELECT * FROM blogs WHERE author = 'john';
(0 rows)
cqlsh:test> SELECT * FROM blogs WHERE author = 'john' ALLOW FILTERING;
(0 rows)

Cassandra Allow filtering

I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that cassandra will filter in this order
1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result.
So ALLOW FILTERING would not be a bad choice for this query.
When there are multiple filtering parameters in the WHERE clause and the non-indexed column is the last one, how will the filter work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model; however, the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data, this is a hard question to answer.
I am expecting that cassandra will filter in this order...
Yes, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in a query usually signals a poor table design, that is, you're not following some guidelines on Cassandra modeling (specifically "one query <--> one table").
As a solution, I would suggest including the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
This has only the minor issue that if a record "switches" action values, you cannot update the action field alone (because it's now part of the clustering key), so you need to delete the row with the old action value and insert it again with the new one. But if you have Cassandra 3.0+, all of this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
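To see why putting action ahead of start in the clustering key removes the need for filtering, here is a toy model in plain Python. It assumes a handful of made-up rows for one partition (day=1), kept sorted by (action, start, id) the way clustering columns order rows inside a partition. The function range_slice is a hypothetical helper, not Cassandra's actual read path.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows of one partition (day=1), stored as (action, start, id)
# tuples and kept sorted, mimicking clustering order (action, start, id).
partition = sorted([
    ("accept", 1475485000, "a1"),
    ("accept", 1475485500, "a2"),
    ("accept", 1480000000, "a3"),
    ("accept", 1485785700, "a4"),
    ("reject", 1475485600, "r1"),
    ("reject", 1480000001, "r2"),
])

def range_slice(rows, action, start_gt, start_lt):
    """Return rows with the given action and start_gt < start < start_lt."""
    # Sort keys are rebuilt here only to keep the sketch short; the point is
    # that the matching rows form one contiguous run in the sorted order.
    keys = [(r[0], r[1]) for r in rows]
    lo = bisect_right(keys, (action, start_gt))  # first start > start_gt
    hi = bisect_left(keys, (action, start_lt))   # first start >= start_lt
    return rows[lo:hi]

matches = range_slice(partition, "accept", 1475485412, 1485785654)
```

With the original key order (start, id), the rows for action='accept' would be scattered through the range of start values, so each row in the range would have to be inspected individually.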
In general, ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which Cassandra has to use ALLOW FILTERING) and the size of the data it is being fetched from.
In your case Cassandra does not need filtering up to:
By the range column(start) on the 1's result
as you mentioned. But after that, it will rely on filtering to find the data, which you are allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the time1 column.
So check this first. If it works in your favour, use ALLOW FILTERING.
Otherwise, it would be wise to add a secondary index on 'action'.
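The two cases above come down to simple arithmetic, sketched here under the simplifying assumption that a filtered query reads every candidate row and discards the ones that do not match. The function filtering_cost is a hypothetical helper used only for this illustration.

```python
def filtering_cost(total_rows, matching_rows):
    """Return (rows_discarded, fraction_useful) for a scan that reads every row.

    Assumption: all `total_rows` candidates are loaded before filtering.
    """
    discarded = total_rows - matching_rows
    useful = matching_rows / total_rows
    return discarded, useful

# 95% of 1 million rows match: most of the work is useful.
high = filtering_cost(1_000_000, 950_000)

# Only 2 of 1 million rows match: almost everything read is thrown away.
low = filtering_cost(1_000_000, 2)
```

In the first case 50,000 rows are discarded; in the second, 999,998 are.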

Cassandra asks for ALLOW FILTERING even though column is clustering key

Very new to Cassandra so apologies if the question is simple.
I created a table:
create table ApiLog (
LogId uuid,
DateCreated timestamp,
ClientIpAddress varchar,
primary key (LogId, DateCreated));
This works fine:
select * from apilog
If I try to add a where clause on DateCreated like this:
select * from apilog where datecreated <= '2016-07-14'
I get this:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
From other questions here on SO and from the tutorials on DataStax, it is my understanding that since the datecreated column is a clustering key, it can be used to filter data.
I also tried to create an index, but I get the same message back. I also tried removing DateCreated from the primary key and having it only as an index, and I still get the same message back:
create index ApiLog_DateCreated on dotnetdemo.apilog (datecreated);
The partition key LogId determines on which node each partition will be stored. So if you don't specify the partition key, then Cassandra has to filter all the partitions of this table on all the nodes to find matching data. That's why you have to say ALLOW FILTERING, since that operation is very inefficient and is discouraged.
If you specify a specific LogId, then Cassandra can find the partition on a single node and efficiently do a range query by the clustering key.
So you need to plan your schema such that you can do your range queries within a single partition and not have to do a full table scan like you're trying to do.
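The difference between the two access patterns can be sketched with a toy in-memory model. It assumes made-up LogId values and timestamps, with each partition's DateCreated values kept sorted. The functions range_in_partition and range_everywhere are hypothetical helpers, not driver calls.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical stand-in: partition key (LogId) -> sorted DateCreated values.
partitions = {
    "log-1": [datetime(2016, 7, 1), datetime(2016, 7, 20)],
    "log-2": [datetime(2016, 7, 10)],
    "log-3": [datetime(2016, 8, 1)],
}

def range_in_partition(log_id, upper):
    """With the partition key known, slice one sorted partition."""
    rows = partitions.get(log_id, [])
    return rows[:bisect_right(rows, upper)]

def range_everywhere(upper):
    """Without the partition key, every partition must be visited."""
    visited = 0
    found = []
    for log_id, rows in partitions.items():
        visited += 1
        found.extend((log_id, ts) for ts in rows[:bisect_right(rows, upper)])
    return visited, found

cutoff = datetime(2016, 7, 14)
single = range_in_partition("log-1", cutoff)
visited, scanned = range_everywhere(cutoff)
```

The first call touches one partition; the second touches all three even though one of them contributes nothing, and that cost grows with the number of partitions in the table.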
When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model and what you are trying to do. You always have multiple options.
You can change your data model, add an index, use another table or use ALLOW FILTERING.
You have to make the right choice for your specific use case.
Anyway, if you just want to make it work:
select * from dev."3" where "column" = '' limit 1000 ALLOW FILTERING;
