Cassandra asks for ALLOW FILTERING even though column is clustering key - cassandra

Very new to Cassandra so apologies if the question is simple.
I created a table:
create table ApiLog (
LogId uuid,
DateCreated timestamp,
ClientIpAddress varchar,
primary key (LogId, DateCreated));
This work fine:
select * from apilog
If I try to add a where clause with the DateCreated like this:
select * from apilog where datecreated <= '2016-07-14'
I get this:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
From other questions here on SO and from the tutorials on datastax it is my understanding that since the datecreated column is a clustering key it can be used to filter data.
I also tried to create an index but I get the same message back. And I tried to remove the DateCreated from the primary key and have it only as an index and I still get the same back:
create index ApiLog_DateCreated on dotnetdemo.apilog (datecreated);

The partition key LogId determines on which node each partition will be stored. So if you don't specify the partition key, then Cassandra has to filter all the partitions of this table on all the nodes to find matching data. That's why you have to say ALLOW FILTERING, since that operation is very inefficient and is discouraged.
If you specify a specific LogId, then Cassandra can find the partition on a single node and efficiently do a range query by the clustering key.
So you need to plan your schema such that you can do your range queries within a single partition and not have to do a full table scan like you're trying to do.

When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model and what you are trying to do. You always have multiple options.
You can change your data model, add an index, use another table or use ALLOW FILTERING.
You have to make the right choice for your specific use case.

Anyway you want to make it work.
select * from dev."3" where "column" = '' limit 1000 ALLOW FILTERING;

Related

Filter on the partition and the clustering key with an additional criteria

I want to filter on a table that has a partition and a clustering key with another criteria on a regular column. I got the following warning.
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable(
name text,
id uuid,
deleted boolean
PRIMARY KEY((name),id)
)
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra you can't filter data with non-primary key column unless you create index in it.
Cassandra 3.0 or up it is allowed to filter data with non primary key but in unpredictable performance
Cassandra 3.0 or up, If you provide all the primary key (as your given query) then you can use the query with ALLOW FILTERING, ignoring the warning
Otherwise filter from the client side or remove the field deleted and create another table :
Instead of updating the field to deleted true move your data to another table let's say mytable_deleted
CREATE TABLE mytable_deleted (
name text,
id uuid
PRIMARY KEY (name, id)
);
Now if you only have the non deleted data on mytable and deleted data on mytable_deleted table
or
Create index on it :
The column deleted is a low cardinality column. So remember
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read More : When not to use an index

Performance impact of Allow filtering on same partition query in cassandra

I have table like this.
CREATE TABLE posts (
topic text
country text,
bookmarked text,
id uuid,
PRIMARY KEY (topic,id)
);
First query on single partition with allow filtering.
select * from posts where topic='cassandra' allow filtering;
Second query on single partition without allow filtering.
select * from posts where topic='cassandra';
My question is what is performance difference between first query and second query? Will first query(with allow filtering) get result from all partition before filtering though we have requested from single partition.
Thanks.
Allow filtering will allow you to run queries without specifying partition key. But if you using one, it will use only specific partition.
In this specific example you should see no difference.
Ran both queries on my test table with tracing on, got single partition in both execution plans:
Executing single-partition query on table_name
You don't need to use ALLOW FILTERING when you are querying with a partition key. So for the two queries you mentioned there will be no performance difference.
For Cassandra version 3.0 and up, ALLOW FILTERING can be used to query with any fields other than partition key. For example, you can run a query like this:
SELECT * FROM posts where country='Bangladesh';
And for Cassandra version below 3.0, ALLOW FILTERING can be used on only primary key.
Although it is not wise to query using ALLOW FILTERING.
Because, the only way Cassandra can execute this query is by retrieving all the rows from the table posts and then by filtering out the ones which do not have the requested value for the country column.
So you should useALLOW FILTERING at you own risk.

Cassandra Allow filtering

I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that cassandra will filter in this order
1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result.
So the allow filtering will not be a bad choice on this query.
In case of the multiple filtering parameters on the where clause and the non indexed column is the last one, how will the filter work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model, however the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data this is a hard to answer question.
I am expecting that cassandra will filter in this order...
Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is you're not following some guidelines on Cassandra modeling (specifically the "one query <--> one table").
As a solution, I could hint you to include the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
having only the minor issue that if one record "switches" action values you cannot perform an update on the single action field (because it's now part of the clustering key), so you need to perform a delete with the old action value and an insert it with the correct new value. But if you have Cassandra 3.0+ all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
In general ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which cassandra have to use ALLOW FILTERING) and the size of data its being fetched from.
In your case cassandra do not need filtering upto :
By the range column(start) on the 1's result
As you mentioned. But after that, it will rely on filtering to search data, which you are allowing in query itself.
Now, keep following in mind
If your table contains for example a 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.
So ensure this first. If it works in you favour, use FILTERING.
Otherwise, it would be wise to add secondary index on 'action'.
PS : There is some minor edit.

Cassandra select with clustering key range that might not have primary key

Sorry the title might/might not give exact description of what i intended.
Here is the problem. I need to select data based on date ranges and most of our queries have 'id' field that is used in our queries.
So, i have created data model with the id as primary key, and date as clustering key.
Essentially like below(i am just using fake/sample statements as i cannot give actual details).
create table tab1(
id text,
col1 text,
... coln text,
rec_date date,
rec_time timestamp,
PRIMARY KEY((id),rec_date,rec_time)
) WITH CLUSTERING ORDER BY rec_date DESC, rec_time DESC;
It works for most of the queries and worked fine.
However, i was trying to optimize below scenario.
-> All the records that are greater than the date abcd-xy-kl
Which one of the below approaches would be good for me.? Or any thing better than these two.?
1) very basic or simple approach. Use the query:
select * from tab1 where id > '0' AND rec_date > 'abcd-xy-kl'
Every record will be essentially greater than '0'. It might still do full table scan.
2) Create secondary index on rec_date and simply use the query:
select * from tab1 where rec_date > 'abcd-xy-kl'
Also, one key thing is i am using spark and using cassandraSqlContext.sql to get the dataframe.
So, considering all the above details, which approach would be better.?
I don't see the point of filtering with id as in your first example. The following should work and would be better approach from my perspective:
select * from tab1 where rec_date > 'abcd-xy-kl' ALLOW FILTERING;
Note that it won't work without ALLOW FILTERING at the end.
You cannot use > 0 for the partition key. It is not supported by Cassandra. Check the documentation for more information on the limitations on the WHERE part of the queries.
In order to query by your clustering keys efficiently you really need to use a secondary index. Refrain from using the ALLOW FILTERING unless you know what you're doing, because it could trigger a "distributed" scan and perform very poorly. Check the documentation for more information.

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL database and have just started using apache Cassandra. I created a simple table "emp" with primary key on "empno" column. This is a simple table as we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued query Select * from emp order by empno but I was surprised that CQL did not allow Order by on empno column (which is PK). Also when I used Where condition, it did not allow any inequality operations on empno column (it said only EQ or IN conditions are allowed). It also did not allowed Where and Order by on any other column, as they were not used in PK, and did not have an index.
Can someone please help me what should I do if I want to keep empno unique in the table and want a query results in Sorted order of empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no
way of getting a result set which is only order by a column that has
been defined as Unique.
(2) When we define a PK
(partition-key+clustering-key), then the results will always be order
by Clustering columns within any fixed partition key (we must restrict
to one partition-key value), that means there is no need of ORDER BY
clause, since it cannot ever change the order of rows (the order in
which rows are actually stored), i.e. Order By is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept). – Aaron 1 hour ago
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on "empno" column of "emp" table, it is still not
allowing ORDER BY empno. So, what Indexes are for? are they only for
searching records for specific value of index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
IT means if we have a billion rows in a table (say Sales table), and
we need sum of sales (1) Product-wise, (2) Region-wise, then we will
duplicate all those billion rows in 2 tables with one in clustering
order of Product, the other in clustering order of Region,. and even
if we need to sum sales per Salesman_id, then we build a 3rd table,
again putting all those billion rows? is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But i found even that is not correct. Cassandra only allows ORDER by in the same direction as we define in the "CLUSTERING ORDER BY" caluse of CREATE TABLE. If in that clause we define ASC, it allows only order by ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little odd if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same (as the CLUSTERING ORDER BY definition) or both different. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either to match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
Adding a redux answer as the accepted one is quite long.
Order by is currently only supported on the clustered columns of the PRIMARY KEY
and when the partition key is restricted by an Equality or an IN operator in where clause.
That is if you have your primary key defined like this :
PRIMARY KEY ((a,b),c,d)
Then you will be able to use the ORDER BY when & only when your query has :
a where clause with all the primary key restricted either by an equality operator (=) or an IN operator such as :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two query are the only valid ones.
Also this query would not work :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because order by currently only support the ordering of columns following their declared order in the PRIMARY KEY that is in primary key definition c has been declared before d and the query violates the ordering by placing d first.

Resources