Why does querying based on the first clustering key require an ALLOW FILTERING? - cassandra

Say I have this Cassandra table:
CREATE TABLE orders (
customerId int,
datetime date,
amount int,
PRIMARY KEY (customerId, datetime)
);
Then why would the following query require an ALLOW FILTERING:
SELECT * FROM orders WHERE date >= '2020-01-01'
Cassandra could just go to all the individual partitions (i.e. customers) and filter on the clustering key date. Since date is sorted there is no need to retrieve all the rows in orders and filter out the ones that match my where clause (as far as I understand it).
I hope someone can enlighten me.
Thanks

This happens because for normal work, Cassandra needs the partition key - it's used to find what machine(s) are storing the data for it. If you don't have partition key, like, in your example, Cassandra need to scan all data to find those that are matching your query. And this requires the use of the ALLOW FILTERING.
P.S. Data is sorted only inside the individual partitions, not globally.

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select an event with the highest start_time. It is possible? I've tried to create a secondary index, but no success.
This is a query I've created
select * from active_call order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all data and find the highest value. And this will require scanning of data on multiple nodes, and will be very long.
Materialized view also won't help much as order for data only exists inside an individual partition, so you will need to put all your data into a single partition that could be huge and data would be imbalanced.
I can only think of following workaround:
Have an additional table that will have all columns of the original table, but with a fake partition key and no clustering columns
You do inserts into that table in parallel to normal inserts, but use a fixed value for that fake partition key, and explicitly setting a timestamp for a record equal to start_time (don't forget to multiple by 1000 as timestamp uses microseconds). In this case it will guaranteed to be the value with the highest timestamp as Cassandra won't override it with other data with lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Allow filtering function in Cassandra (which choice is correct?)

I am currently trying to model some time series data in base of Cassandra.
For example i have a table bigint_table, which was created by following query
**
CREATE TABLE bigint_table (name_id int,tuuid timeuuid, timestamp
timestamp, value text, PRIMARY KEY ((name_id),tuuid, timestamp)) WITH
CLUSTERING ORDER BY (tuuid asc, timestamp asc)
**
tuuid column was added because without it I had problems and I lost some data while inserting them in DB. name_id represents the channel's ID data comes from.tuuid column was added because without it I had problems and I lost some data while inserting them in DB. In one table there are lots of data with the same ID, but they are unique by timestamp and tuuid (values also can be the same sometimes).
I consistently execute 2 different queries to get values and timestamps
select value from bigint_table where name_id=6 and timestamp>'
2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30
8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
2.
select timestamp from bigint_table where name_id=6 and timestamp>'
2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30
8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
In this post author says one need to resist the urge to just add ALLOW FILTERING to itand one should think about data, model and what one is trying to do.
I thought a lot about using ALLOW FILTERING function or not, and I figured out that I have no choice in my case and I need to use it. But those words in post I mentioned above are keeping me in doubt. I would like to know your advise and what do you thnik about my problem. Is there another way to model my data tables, queries of which do not require ALLOW FILTERING? I would be very very thank you for advice.
The reason you need allow filtering is because you have the clustering column (tuuid, timestamp)in the wrong order. In this case the data stored first by tuuid and then by timestamp.But you're actually choosing data by timestamp and then ordering by tuuid so Cassandra can't use the indexes that you have specified. The order when you define the primary key matters.

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using a version after 3.6, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto complete in cqlsh until 3.10 with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work essentially any amount of data: It decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

Cassandra asks for ALLOW FILTERING even though column is clustering key

Very new to Cassandra so apologies if the question is simple.
I created a table:
create table ApiLog (
LogId uuid,
DateCreated timestamp,
ClientIpAddress varchar,
primary key (LogId, DateCreated));
This work fine:
select * from apilog
If I try to add a where clause with the DateCreated like this:
select * from apilog where datecreated <= '2016-07-14'
I get this:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
From other questions here on SO and from the tutorials on datastax it is my understanding that since the datecreated column is a clustering key it can be used to filter data.
I also tried to create an index but I get the same message back. And I tried to remove the DateCreated from the primary key and have it only as an index and I still get the same back:
create index ApiLog_DateCreated on dotnetdemo.apilog (datecreated);
The partition key LogId determines on which node each partition will be stored. So if you don't specify the partition key, then Cassandra has to filter all the partitions of this table on all the nodes to find matching data. That's why you have to say ALLOW FILTERING, since that operation is very inefficient and is discouraged.
If you specify a specific LogId, then Cassandra can find the partition on a single node and efficiently do a range query by the clustering key.
So you need to plan your schema such that you can do your range queries within a single partition and not have to do a full table scan like you're trying to do.
When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model and what you are trying to do. You always have multiple options.
You can change your data model, add an index, use another table or use ALLOW FILTERING.
You have to make the right choice for your specific use case.
Anyway you want to make it work.
select * from dev."3" where "column" = '' limit 1000 ALLOW FILTERING;

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to preform. The reason that Cassandra is trying to keep you from performing queries on the date column, is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most( 99%)
of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately Cassandra isn't able to determine automatically what the best way to partition data. The end user must determine a schema which fits into the C* datamodel and allows the queries they want at a high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT timestamp FROM active_accounts where customer_name = ? and account_name = ?
This is because on the the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of CustomerName and AccountName. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999,Tom,32)]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
Allow filtering was added to the cql syntax to prevent users from accidentally designing queries that will not scale. The secondary indexes are really only for use by those do analytics jobs or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index necessarily will require contacting every node in the cluster. This will become less and less performant as the cluster grows and is definitely not something you want for a frequent query.

Resources