Before you downvote, I'd like to state that I looked at all of the similar questions, but I am still getting the dreaded "PRIMARY KEY column cannot be restricted" error.
Here's my table structure:
CREATE TABLE IF NOT EXISTS events (
id text,
name text,
start_time timestamp,
end_time timestamp,
parameters blob,
PRIMARY KEY (id, name, start_time, end_time)
);
And here's the query I am trying to execute:
SELECT * FROM events WHERE name = ? AND start_time >= ? AND end_time <= ?;
I am really stuck on this. Can anyone tell me what I am doing wrong?
Thanks,
Deniz
This is a query you need to remodel your data for, or run on a distributed analytics platform (like Spark). `id` is your partition key: it determines how your data is distributed across the cluster. Since it is not restricted in this query, a full table scan would be required to find the matching rows. The Cassandra designers decided they would rather you not run a query at all than run one which will not scale.
Basically, whenever you see "COLUMN cannot be restricted", it means the query you tried to perform cannot be executed efficiently on the table you created.
To run the query anyway, add the ALLOW FILTERING clause:
SELECT * FROM analytics.events WHERE name = ? AND start_time >= ? AND end_time <= ? ALLOW FILTERING;
The general rule for queries is: you must restrict all partition key columns with equality, then you can restrict each clustering column in the order they are defined. So for your query to work without filtering, you would need to add `id = x` to the WHERE clause.
What the error message is really saying is that once you apply a range restriction such as `start_time > 34`, that's as far "down the chain" of clustering columns you're allowed to go; anything further would require the "potentially too costly" ALLOW FILTERING flag. In other words: equality restrictions only, down to at most one range (`<`/`>`) restriction on a single column, all in the name of speed. This works (though it doesn't give a range query over start_time):
SELECT * FROM events WHERE name = 'a' AND start_time = 33 and end_time <= 34 and id = '35';
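The "equality down the chain, then at most one range" rule can be sketched as a small checker. This is purely illustrative (it is not Cassandra's actual planner, and `restrictions_allowed` is a made-up helper name), but it encodes the same rule:

```python
def restrictions_allowed(clustering_cols, predicates):
    """Sketch of Cassandra's clustering-column restriction rule.

    clustering_cols: clustering columns in declared order.
    predicates: column name -> 'eq' or 'range'.
    Allowed: equality on a prefix of the clustering columns, then at most
    one range-restricted column, with nothing restricted after it.
    """
    predicates = dict(predicates)  # don't mutate the caller's dict
    seen_range = False
    for col in clustering_cols:
        op = predicates.pop(col, None)
        if op is None:
            break          # unrestricted column: later restrictions are illegal
        if seen_range:
            return False   # restriction after a range restriction
        if op == 'range':
            seen_range = True
    return not predicates  # leftover predicates skipped a column in the chain

# With PRIMARY KEY (id, name, start_time, end_time), the clustering
# columns are name, start_time, end_time:
cols = ['name', 'start_time', 'end_time']
print(restrictions_allowed(cols, {'name': 'eq', 'start_time': 'eq', 'end_time': 'range'}))    # True
print(restrictions_allowed(cols, {'name': 'eq', 'start_time': 'range', 'end_time': 'range'}))  # False
```

The original query fails the check both ways: it skips the partition key entirely and range-restricts two columns.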
If you're looking for events "happening at minute y", a different data model might be possible, such as writing a row for each minute the event is ongoing, or bucketing based on hour. See also https://stackoverflow.com/a/48755855/32453
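The hour-bucketing idea above boils down to truncating each event's timestamp to the start of its hour and using that as (part of) the partition key. A minimal sketch, with `hour_bucket` as a hypothetical helper:

```python
from datetime import datetime, timezone

def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour, to use as a bucket key."""
    return ts.replace(minute=0, second=0, microsecond=0)

event_time = datetime(2017, 3, 5, 14, 42, 7, tzinfo=timezone.utc)
print(hour_bucket(event_time))  # 2017-03-05 14:00:00+00:00
```

With such a bucket as the partition key, "what was happening during hour x" becomes a single-partition query, which Cassandra handles well.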
Related
I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that cassandra will filter in this order
1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result.
So the allow filtering will not be a bad choice on this query.
When there are multiple filtering parameters in the WHERE clause and the non-indexed column is the last one, how will the filtering work?
Please explain.
Is this ALLOW FILTERING efficient?
By "this" you mean in the context of your query and your model, but the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data, this is a hard question to answer.
I am expecting that cassandra will filter in this order...
Yeah, this is what will happen. However, needing an ALLOW FILTERING clause in a query usually indicates a poor table design, that is, you're not following some Cassandra modeling guidelines (specifically the "one query <--> one table" rule).
As a solution, I'd suggest including the action field in the clustering key, just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
The only minor issue is that if one record "switches" action values, you cannot update the single action field (because it's now part of the clustering key). Instead, you need to perform a delete with the old action value and an insert with the new value. If you have Cassandra 3.0+, all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
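The delete-then-insert dance can be sketched by modelling the table as a map from primary key to row. This is pure illustration (real code would go through a driver), but it shows why an in-place update is impossible: once `action` is part of the key, changing it means the row lives at a different key.

```python
# Model rows keyed by the full primary key (day, action, start, id).
table = {}

def upsert(day, action, start, row_id):
    table[(day, action, start, row_id)] = {
        'day': day, 'action': action, 'start': start, 'id': row_id,
    }

def change_action(day, old_action, start, row_id, new_action):
    # `action` is part of the key, so the row moves to a new key:
    del table[(day, old_action, start, row_id)]  # DELETE with the old value
    upsert(day, new_action, start, row_id)       # INSERT with the new value

upsert(1, 'pending', 1475485500, 'r1')
change_action(1, 'pending', 1475485500, 'r1', 'accept')
print((1, 'accept', 1475485500, 'r1') in table)  # True
```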
In general, ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which Cassandra has to use ALLOW FILTERING) and the size of the data it is being fetched from.
In your case, Cassandra does not need filtering up to:
By the range column (start) on the 1's result
as you mentioned. But after that, it will rely on filtering to find the data, which you are allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient: Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on that column.
So check this first. If the numbers work in your favour, use ALLOW FILTERING.
Otherwise, it would be wise to add a secondary index on 'action'.
Sorry, the title might not give an exact description of what I intended.
Here is the problem: I need to select data based on date ranges, and most of our queries use an 'id' field.
So I have created the data model with id as the partition key and date as a clustering key.
Essentially like below (I am using fake/sample statements as I cannot give actual details):
create table tab1(
id text,
col1 text,
... coln text,
rec_date date,
rec_time timestamp,
PRIMARY KEY((id),rec_date,rec_time)
) WITH CLUSTERING ORDER BY (rec_date DESC, rec_time DESC);
It works for most of our queries and has worked fine.
However, I was trying to optimize the scenario below:
-> all the records that are greater than the date abcd-xy-kl
Which of the two approaches below would be good for me? Or is there anything better than these two?
1) A very basic/simple approach. Use the query:
select * from tab1 where id > '0' AND rec_date > 'abcd-xy-kl'
Every record will be essentially greater than '0'. It might still do full table scan.
2) Create secondary index on rec_date and simply use the query:
select * from tab1 where rec_date > 'abcd-xy-kl'
Also, one key thing: I am using Spark and using cassandraSqlContext.sql to get the dataframe.
So, considering all the above details, which approach would be better?
I don't see the point of filtering with id as in your first example. The following should work and would be a better approach from my perspective:
select * from tab1 where rec_date > 'abcd-xy-kl' ALLOW FILTERING;
Note that it won't work without ALLOW FILTERING at the end.
You cannot use a range like `id > '0'` on the partition key; it is not supported by Cassandra. Check the documentation for more information on the limitations of the WHERE part of queries.
In order to query by your clustering keys efficiently, you really need to use a secondary index. Refrain from using ALLOW FILTERING unless you know what you're doing, because it can trigger a "distributed" scan and perform very poorly. Check the documentation for more information.
I'm trying to get data from a date range on Cassandra, the table is like this:
CREATE TABLE test6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((time), id)
)
But when I select a data range I get nothing:
SELECT * FROM test6 WHERE time IN ( minTimeuuid('2013-01-01 00:05+0000'), now() );
(0 rows)
How can I get a date range from a Cassandra Query?
The IN condition is used to specify multiple keys in a SELECT query. To run a date range query on your table you're close, but you'll want to use greater-than and less-than instead.
Of course, you can't run a greater-than/less-than query on a partition key, so you'll need to flip your keys for this to work. This also means that you'll need to specify your id in the WHERE clause as well:
CREATE TABLE teste6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((id), time)
)
INSERT INTO teste6 (time,id,checked,email,name)
VALUES (now(),'B26354',true,'rdeckard@lapd.gov','Rick Deckard');
SELECT * FROM teste6
WHERE id='B26354'
AND time >= minTimeuuid('2013-01-01 00:05+0000')
AND time <= now();
id | time | checked | email | name
--------+--------------------------------------+---------+-------------------+--------------
B26354 | bf0711f0-b87a-11e4-9dbe-21b264d4c94d | True | rdeckard@lapd.gov | Rick Deckard
(1 rows)
Now, while this will technically work, partitioning your data by id might not work for your application. So you may need to put some more thought into your data model and come up with a better partition key.
Edit:
Remember with Cassandra, the idea is to get a handle on what kind of queries you need to be able to fulfill. Then build your data model around that. Your original table structure might work well for a relational database, but in Cassandra that type of model actually makes it difficult to query your data in the way that you're asking.
Take a look at the modifications that I have made to your table (basically, I just reversed your partition and clustering keys). If you still need help, Patrick McFadin (DataStax's Chief Evangelist) wrote a really good article called Getting Started with Time Series Data Modeling. He has three examples that are similar to yours. In fact his first one is very similar to what I have suggested for you here.
My cassandra data model:
CREATE TABLE last_activity_tracker ( id uuid, recent_activity_time timestamp, PRIMARY KEY(id));
CREATE INDEX activity_idx ON last_activity_tracker (recent_activity_time) ;
The idea is to keep track of 'id's and their most recent activity of an event.
I need to find the 'id's whose last activity was a year ago.
So, I tried:
SELECT * from last_activity_tracker WHERE recent_activity_time < '2013-12-31' allow filtering;
I understand that I cannot use anything other than '=' on secondary indexed columns.
However, I cannot add 'recent_activity_time' to the key as I need to update this column with the most recent activity time of an event if any.
Any ideas in solving my problem are highly appreciated.
I can see an issue with your query. You're not hitting a partition. As such, the performance of your query will be quite bad. It'll need to query across your whole cluster (assuming you took measures to make this work).
If you're looking to query the last activity time for an id, think about storing it in a more query friendly format. You might try this:
create table tracker (dummy int, day timestamp, id uuid, primary key(dummy, day, id));
You can then insert with day set to the epoch for the date (ignoring the time), and dummy = 0.
That should enable you to do:
select * from tracker where dummy=0 and day > '2013-12-31';
You can set a ttl on insert so that old entries expire (maybe after a year in this case). The idea is that you're storing information in a way that suits your query.
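The "epoch for the date, ignoring the time" value can be computed as below. A minimal sketch, with `day_epoch` as a hypothetical helper name; it assumes you bucket on UTC midnight:

```python
from datetime import datetime, timezone

def day_epoch(ts: datetime) -> int:
    """Seconds since the epoch for midnight (UTC) of ts's date."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp())

ts = datetime(2013, 12, 31, 17, 30, tzinfo=timezone.utc)
print(day_epoch(ts))  # 1388448000, i.e. 2013-12-31 00:00:00 UTC
```

Every event that happens on the same (UTC) day then lands in the same `day` value, so the `day > '2013-12-31'` range scans exactly one clustering range within the single `dummy=0` partition.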
I have the following Cassandra table which records the user access to a web page.
create table user_access (
id timeuuid primary key,
user text,
access_time timestamp
);
and would like to do a query like this:
get the list of users who access the page for more than 10 times in the last hour.
Is it possible to do this in Cassandra? (I'm kind of stuck with the limited CQL query functionality.)
If not, how do I remodel the table to do this?
Can you do it? yes.
Can you do it efficiently? I'm not convinced.
It's not clear what the timeuuid you are using represents.
You could reorganize this to
CREATE TABLE user_access (
user_id text,
access_time timestamp,
PRIMARY KEY (user_id, access_time)
);
SELECT COUNT(*)
FROM user_access
WHERE user_id = '101'
AND access_time > 'current unix timestamp - 3600'
AND access_time < 'current unix timestamp';
Then filter the results on your own in your language of choice. I wouldn't hold your breath waiting for subquery support.
That's going to be horribly inefficient if you have lots of users though.
There may be a better solution using CQL's counter columns and binning accesses to the start of the hour. That could get you per-hour access counts, but that's not quite the same as "within the last hour".
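The "filter the results on your own" step looks roughly like this: fetch the last hour's rows (per user, or across users if your model allows it) and count client-side. A sketch, with `frequent_users` as a made-up helper and in-memory tuples standing in for driver result rows:

```python
import time
from collections import Counter

def frequent_users(rows, threshold=10):
    """Given (user_id, access_time) rows from the last hour,
    return the users with more than `threshold` accesses."""
    counts = Counter(user for user, _ in rows)
    return {user for user, n in counts.items() if n > threshold}

# Simulated result set: user '101' accessed 12 times, '102' once.
now = int(time.time())
rows = [('101', now - i) for i in range(12)] + [('102', now - 5)]
print(frequent_users(rows))  # {'101'}
```

This is exactly the part that gets expensive with many users, since each user's partition has to be queried (or the whole last-hour window scanned) before the counts exist anywhere.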