SparkSQL restrict queries by Cassandra partition key ranges - apache-spark

Imagine that my primary key is a timestamp.
I would like to restrict the query by timestamp ranges.
I don't seem to manage to make it work, even if I used token(). Also I can't create a secondary index on the partition key.
How should this be done?

Cassandra doesn't allow for range queries on partition key.
One way of dealing with this problem is changing your schema so that your timestamp value would be a clustering column. For this to work, you need to introduce a sentinel column as partition key. See this question for more detailed answers: Range Queries in Cassandra (CQL 3.0)
Another way is just to let Spark do the filtering. Range queries on primary key should work in Spark SQL. They would simply not be pushed down to Cassandra and Spark would fetch all data and filter them on the Spark side.

Related

How is data sorted in the Cassandra memtable in the absence of a clustering key?

I am new to cassandra and was checking in how cassandra internals work. I checked this article and in this its stated that memtable is stored in sorted order.
But if there's no clustering key or multilple culstering keys, how cassandra store the data in that case in memtable? I want to know what is the criteria of sorting?
There are different ways where data is sorted when it comes to Cassandra.
The term "SSTable" stands for "sorted string table" meaning that the contents of a Cassandra data file are sorted. Data in memtables are sorted by the partition key so that they are already ordered when they are flushed to disk.
Additionally, it also makes it easy for Cassandra to determine whether a partition exists in an SSTable or not since it keeps metadata about the first and last partition key contained in the SSTable.
If the table has clustering columns, the rows are sorted based on the clustering order defined in the table schema. This is the only time where clustering keys are relevant for sorting. Cheers!
👉 Please support the Apache Cassandra community by hovering over the cassandra tag then click on the Watch tag button. 🙏 Thanks!

Cassandra IN clause for non primary key column

I want to use the IN clause for the non-primary key column in Cassandra. Is it possible? if it is not is there any alternate or suggestion?
Three possible solutions
Create a secondary index. This is not recommended due to performance problems.
See if you can designate that column in the existing table as part of the primary key
Create another denormalised table that table is optimised for your query. i.e data model by query pattern
Update:
And also even after you move that to primary key, operations with IN clause can be further optimised. I found this cassandra lookup by list of primary keys in java very useful

Performance impact of Allow filtering on same partition query in cassandra

I have table like this.
CREATE TABLE posts (
topic text
country text,
bookmarked text,
id uuid,
PRIMARY KEY (topic,id)
);
First query on single partition with allow filtering.
select * from posts where topic='cassandra' allow filtering;
Second query on single partition without allow filtering.
select * from posts where topic='cassandra';
My question is what is performance difference between first query and second query? Will first query(with allow filtering) get result from all partition before filtering though we have requested from single partition.
Thanks.
Allow filtering will allow you to run queries without specifying partition key. But if you using one, it will use only specific partition.
In this specific example you should see no difference.
Ran both queries on my test table with tracing on, got single partition in both execution plans:
Executing single-partition query on table_name
You don't need to use ALLOW FILTERING when you are querying with a partition key. So for the two queries you mentioned there will be no performance difference.
For Cassandra version 3.0 and up, ALLOW FILTERING can be used to query with any fields other than partition key. For example, you can run a query like this:
SELECT * FROM posts where country='Bangladesh';
And for Cassandra version below 3.0, ALLOW FILTERING can be used on only primary key.
Although it is not wise to query using ALLOW FILTERING.
Because, the only way Cassandra can execute this query is by retrieving all the rows from the table posts and then by filtering out the ones which do not have the requested value for the country column.
So you should useALLOW FILTERING at you own risk.

Token range selects with secondary index filtering in cassandra

Is it possible to do token range table scans in cassandra while also applying secondary index filtering?
It is possible. I was confused because the cassandra documentation said you may only use = on the partition key when filtering with secondary indices. Not sure why they say that.

Query in Cassandra that will sort the whole table by a specific field

I have a table like this
CREATE TABLE my_table(
category text,
name text,
PRIMARY KEY((category), name)
) WITH CLUSTERING ORDER BY (name ASC);
I want to write a query that will sort by name through the entire table, not just each partition.
Is that possible? What would be the "Cassandra way" of writing that query?
I've read other answers in the StackOverflow site and some examples created single partition with one id (bucket) which was the primary key but I don't want that because I want to have my data spread across the nodes by category
Cassandra doesn't support sorting across partitions; it only supports sorting within partitions.
So what you could do is query each category separately and it would return the sorted names for each partition. Then you could do a merge of those sorted results in your client (which is much faster than a full sort).
Another way would be to use Spark to read the table into an RDD and sort it inside Spark.
Always model cassandra tables through the access patterns (relational db / cassandra fill different needs).
Up to Cassandra 2.X, one had to model new column families (tables) for each access pattern. So if your access pattern needs a specific column to be sorted then model a table with that column in the partition/clustering key. So the code will have to insert into both the master table and into the projection table. Note depending on your business logic this may be difficult to synchronise if there's concurrent update, especially if there's update to perform after a read on the projections.
With Cassandra 3.x, there is now materialized views, that will allow you to have a similar feature, but that will be handled internally by Cassandra. Not sure it may fit your problem as I didn't play too much with 3.X but that may be worth investigation.
More on materialized view on their blog.

Resources