Cassandra Range Query: Secondary Index vs Unindexed Column - cassandra

I have seen that the best way to do range queries in Cassandra is by using the clustering key. But I need to do range queries on columns other than the clustering key columns.
I read that we can do this on any column using ALLOW FILTERING. But is there any performance advantage if I create a secondary index on that column?

Have a look at this link:
https://www.datastax.com/dev/blog/allow-filtering-explained-2
The ALLOW FILTERING option allows you to tell Cassandra that it is OK to perform in-memory filtering of the data once it loads rows from disk. So we can use it to search by a clustering column without specifying the previous clustering columns. But we can't use it on non-clustering columns.
See the example from the blog below. Even with ALLOW FILTERING we can't filter by the author column until we create an index on it; once the index exists, the query no longer needs ALLOW FILTERING.
cqlsh:test> SELECT * FROM blogs WHERE author = 'john' ALLOW FILTERING;
Bad Request: No indexed columns present in by-columns clause with Equal operator
cqlsh:test>
cqlsh:test> CREATE INDEX authors ON blogs (author);
cqlsh:test> SELECT * FROM blogs WHERE author = 'john';
(0 rows)
cqlsh:test> SELECT * FROM blogs WHERE author = 'john' ALLOW FILTERING;
(0 rows)
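For completeness, here is a rough sketch of the clustering-column case described above. It assumes the schema from the blog post (partition key blog_id, clustering columns time1 and time2); the column names and values here are illustrative, not taken from this question:
-- Restricting the second clustering column without the first is rejected by default,
-- but accepted once ALLOW FILTERING is added:
SELECT * FROM blogs WHERE blog_id = 1 AND time2 = 1418306451235;                  -- rejected
SELECT * FROM blogs WHERE blog_id = 1 AND time2 = 1418306451235 ALLOW FILTERING;  -- works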

Related

Cassandra select CQL: Cannot add column after wildcard

I need to output the write timestamp as part of a table export for lots of tables, but I cannot quite figure out a way that does not force me to explicitly select all columns in the statement.
Instead of being able to do just this:
SELECT *, writetime(data) AS timestamp FROM dls.licenses;
I have to do that:
SELECT column1, column2, ... , writetime(data) AS timestamp FROM dls.licenses;
This is pretty inconvenient, since it means I'd have to change the export tool every time the schema of any of the tables changes.
Is there a better way?
Edit: To clarify, the actual error I get is the following. The way the syntax is presented in the error, one could think that the query should be OK:
SELECT *, writetime(id) AS timestamp FROM dls.licenses;
SyntaxException: line 1:8 mismatched input ',' expecting K_FROM (SELECT *[,]...)
Edit 2: Here is the keyspace and create statement used for this table:
CREATE KEYSPACE IF NOT EXISTS dls WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' };
CREATE TABLE IF NOT EXISTS dls.licenses (subscription_id text, id text, key text, data text, PRIMARY KEY (key));
CREATE INDEX IF NOT EXISTS ON dls.licenses (id);
BTW: I'm using the freshly released Cassandra 4.0.0 (GA).
If you are exporting to CSV or JSON files, you may consider using DataStax's dsbulk.
https://github.com/datastax/dsbulk
The latest version of dsbulk, 1.8.0, added support for exporting writetime and ttl.
https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/schemaOptions.html#schemaOptions__schemaOptionsPreserveTimestamp
dsbulk unload -url myData.csv -k ks1 -t table1 --timestamp
The WHERE clause specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY and/or have a secondary index defined on them.
The column specification of the relation must be one of the following:
One or more members of the partition key of the table
A clustering column, only if the relation is preceded by other relations that specify all columns in the partition key
A column that is indexed using CREATE INDEX.
In Cassandra 3.6 and later, add ALLOW FILTERING to filter only on a non-indexed clustering column.
You may be able to solve your query problem by creating a secondary index on the column you want the writetime for. Keep in mind that secondary indexes create overhead, which may result in unintended consequences.
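One way to read that suggestion, sketched against the dls.licenses table from the question (the index name and the filter value below are illustrative):
-- Index the column you want to restrict on, then select the writetime of a regular column:
CREATE INDEX IF NOT EXISTS licenses_subscription_idx ON dls.licenses (subscription_id);
SELECT key, data, writetime(data) AS timestamp FROM dls.licenses WHERE subscription_id = 'example-subscription';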
The star (*) in SELECT * is the CQL syntax for "all columns", so by definition it is not possible to include another column, since all of them are already selected; this applies even when native CQL functions are involved. For this reason, you need to enumerate all column names plus the functions on columns.
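Applied to the licenses table from the question, that means spelling the columns out explicitly:
SELECT subscription_id, id, key, data, writetime(data) AS timestamp FROM dls.licenses;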
+1 to Yuki's answer. I wanted to add that DSBulk adds a WRITETIME() column for every column in the table because it isn't possible to know in advance the write-time of each column in the partition until the full partition has been read.
Allow me to explain it using a couple of examples.
Schema
Consider this table:
CREATE TABLE users_by_email (
    email text,
    name text,
    address text,
    mobile text,
    PRIMARY KEY (email)
);
Example 1
If we add a new record with a value specified for all columns:
INSERT INTO users_by_email (email, name, address, mobile)
VALUES ('alice@staysafe.com', 'Alice', '221B Baker St', '098-765-432-109');
then for this partition, all columns will have the same write-time.
Example 2
Consider a situation where a record is fragmented across multiple inserts over a period of time such as:
INSERT INTO users_by_email (email, name) VALUES ('dude@getvaccinated.now', 'Bob');
INSERT INTO users_by_email (email, address) VALUES ('dude@getvaccinated.now', '350 Fifth Ave');
INSERT INTO users_by_email (email, mobile) VALUES ('dude@getvaccinated.now', '012-555-123-456');
Each of the columns name, address and mobile would have a different write-time.
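A quick way to see this for yourself, assuming the inserts above, is to select the write-time of each regular column for that partition (writetime() cannot be applied to the primary key column email):
SELECT name, writetime(name) AS name_wt,
       address, writetime(address) AS address_wt,
       mobile, writetime(mobile) AS mobile_wt
FROM users_by_email
WHERE email = 'dude@getvaccinated.now';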
From these 2 examples, you should see that there isn't always a single write-time that applies to all columns in the partition.
For your specific use case, you need to figure out from the DSBulk output which write-time to use for situations where the partition fragments are inserted/updated at different times. Cheers!

How can a CQL query match alternative values in a collection column?

I have a question about querying a Cassandra collection.
I want to make a query that works with a collection search.
CREATE TABLE rd_db.test1 (
    testcol3 frozen<set<text>> PRIMARY KEY,
    testcol1 text,
    testcol2 int
);
The table structure is this, and this is the table contents.
In this situation, I want to write a CQL query that matches alternative values in the set column.
If this were SQL and testcol3 weren't a collection, I would write:
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
but since it is CQL and the column is a collection, I tried:
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
but these two queries didn't work...
please help...
This won't work for you for multiple reasons:
there is no OR operation in CQL
you can only do a full match on the value of the partition key (testcol3)
although you may create secondary indexes on columns with collection types, it's impossible to create an index on the values of the partition key
You need to change the data model, but you need to know the queries that you're executing in advance. From a brief look at your data model, I would suggest rolling the set field out into multiple rows, with individual values corresponding to individual partitions, as sketched below.
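A minimal sketch of that idea (the table name and key layout below are illustrative; you may need additional key columns to keep rows unique):
CREATE TABLE rd_db.test1_by_value (
    value text,        -- a single element of the original set
    testcol1 text,
    testcol2 int,
    PRIMARY KEY (value, testcol1)
);
-- Each set element becomes its own partition, so the "alternative values" query
-- turns into a plain partition-key lookup:
SELECT * FROM rd_db.test1_by_value WHERE value IN ('4', '5');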
I also want to suggest taking the DS201 & DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.

Performance impact of ALLOW FILTERING on a same-partition query in Cassandra

I have table like this.
CREATE TABLE posts (
    topic text,
    country text,
    bookmarked text,
    id uuid,
    PRIMARY KEY (topic, id)
);
First query, on a single partition, with ALLOW FILTERING:
select * from posts where topic='cassandra' allow filtering;
Second query, on a single partition, without ALLOW FILTERING:
select * from posts where topic='cassandra';
My question is: what is the performance difference between the first and the second query? Will the first query (with ALLOW FILTERING) fetch results from all partitions before filtering, even though we have requested a single partition?
Thanks.
ALLOW FILTERING lets you run queries without specifying the partition key. But if you do specify one, only that specific partition will be read.
In this specific example you should see no difference.
Ran both queries on my test table with tracing on, got single partition in both execution plans:
Executing single-partition query on table_name
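You can check this yourself in cqlsh: enable tracing and run both statements; the trace for each should show a single-partition read, something along these lines:
TRACING ON;
SELECT * FROM posts WHERE topic='cassandra';
SELECT * FROM posts WHERE topic='cassandra' ALLOW FILTERING;
-- both traces contain a line similar to:
--   Executing single-partition query on posts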
You don't need to use ALLOW FILTERING when you are querying with a partition key. So for the two queries you mentioned there will be no performance difference.
For Cassandra versions 3.0 and up, ALLOW FILTERING can be used to query on any field other than the partition key. For example, you can run a query like this:
SELECT * FROM posts where country='Bangladesh';
And for Cassandra versions below 3.0, ALLOW FILTERING can only be used on primary key columns.
However, it is not wise to query using ALLOW FILTERING, because the only way Cassandra can execute this query is by retrieving all the rows from the table posts and then filtering out the ones which do not have the requested value for the country column.
So you should use ALLOW FILTERING at your own risk.
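If you regularly need to filter by country, a common alternative is a denormalized table keyed by country, so the query becomes a partition-key lookup instead of a scan. A sketch (the table name is illustrative):
CREATE TABLE posts_by_country (
    country text,
    id uuid,
    topic text,
    bookmarked text,
    PRIMARY KEY (country, id)
);
SELECT * FROM posts_by_country WHERE country='Bangladesh';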

Why does Cassandra/CQL restrict the use of a WHERE clause on a column that is not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
    empid int,
    deptid int,
    first_name text,
    last_name text,
    PRIMARY KEY (empid, deptid)
);
When I try to search with "select * from emp where first_name='John';",
the CQL shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue, and everywhere it says to add a secondary index for the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
The only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because doing so would require scanning all the rows of the table.
The data is internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value, and the key of the inner map is a kind of concatenation of all clustering column values and the name of some regular column.
Unless you have an index on a column, you need to provide the full (preferred) or a partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain the primary key value and some range on the clustering columns.
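For the emp table above, that means queries of this shape, restricting the partition key and optionally a range on the clustering column:
SELECT * FROM emp WHERE empid = 1;
SELECT * FROM emp WHERE empid = 1 AND deptid > 10;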
You may read about what is allowed and what is not here
Alternatively you can create an index in Cassandra, but that will hamper your write performance.
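A sketch of that alternative for the schema above (the index name is illustrative):
CREATE INDEX emp_first_name_idx ON emp (first_name);
SELECT * FROM emp WHERE first_name = 'John';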

Bad Request: No indexed columns present in by-columns clause with Equal operator : CQL error?

I have the below table in CQL:
create table test (
    employee_id text,
    employee_name text,
    value text,
    last_modified_date timeuuid,
    primary key (employee_id)
);
I inserted a couple of records into the above table like this, which is what I will be inserting in our actual use case scenario:
insert into test (employee_id, employee_name, value, last_modified_date) values ('1', 'e27', 'some_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('2', 'e27', 'some_new_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('3', 'e27', 'some_again_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('4', 'e28', 'some_values', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('5', 'e28', 'some_new_values', now());
Now I was writing a select query for: give me all the employee_id values for employee_name 'e27'.
select employee_id from test where employee_name = 'e27';
And this is the error I am getting -
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
Is there anything I am doing wrong here?
My use cases in general are:
Give me everything for a given employee_name.
Give me everything that has changed in the last 5 minutes.
Give me the latest employee_id and value for a given employee_name.
Give me all the employee_id values for a given employee_name.
I am running Cassandra 1.2.11
The general rule is simple: "you can only query by columns that are part of the key". The explanation is that all other queries would require a complete scan of the table, which might mean sifting through a lot of data.
There are things that can modify this rule:
use secondary indexes for columns with low cardinality (more details here; see the sketch after this list)
define multi-column keys (e.g. PRIMARY KEY (col1, col2), which would allow queries like col1 = value1, or col1 = value1 AND col2 COND)
use ALLOW FILTERING in queries. This will result in a warning, as Cassandra will have to sift through a lot of data and there will be no performance guarantees. For more details see the description of ALLOW FILTERING in CQL and this SO thread
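For example, the first option applied to the test table from the question (the index name is illustrative):
CREATE INDEX test_employee_name_idx ON test (employee_name);
SELECT employee_id FROM test WHERE employee_name = 'e27';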
Cassandra takes a little getting used to :) Some of us have been spoiled by the extra stuff an RDBMS does for you that you do not get for free from NoSQL.
If you think back on a regular RDBMS table, if you SELECT on a column that has no index, the DB must do a full-table scan to find all the matches you seek. This is a no-no in Cassandra, and it will complain if you try to do this. Imagine if you found 10^32 matches to this query? It is not a reasonable ask.
In your table, you have coded PRIMARY KEY (employee_id); this is the row's primary and unique identifying key. You can now run SELECT * FROM test WHERE employee_id='123'; this is perfectly reasonable, and Cassandra will happily return the result.
However, your SELECT FROM test WHERE employee_name = 'e27'; tells Cassandra to go and read EVERY record until it finds a match on 'e27'. With no index to rely on, it politely asks you to 'forget it'.
If you want to filter on a column, make sure you have an index on that column so that Cassandra can perform the filtering you need.
