Are all values in a Primary Key Indexed? - cassandra

Following a Tutorial on Cassandra, it was mentioned that if I do the following:
PRIMARY KEY(id, name) that id is the partition key and hence it is indexed. The name is the clustering column and hence it is also indexed. This means I can do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc'; //this works!
I can also do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc' AND name = 'name_123'; // this works!
However, I cannot do the following query:
SELECT * FROM my_table WHERE name = 'name_123'; // this does not work
Why does the last statement not work if the clustering column is indexed? Why do the first two queries work but not the third?
The error I get for the last query is the following:
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Thanks in advance!

Just because it is called a primary key does not mean there is an index on it in Cassandra. id is your partition key: it determines which node in the cluster is responsible for that id. The clustering column name defines the sort order of the rows inside the partition.
Therefore SELECT * FROM my_table WHERE name = 'name_123'; would require all partitions to be scanned, which Cassandra refuses to do by default.
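One common way to serve a lookup by name without scanning every partition is a second table keyed by name. The following is a minimal sketch, not something from the tutorial: it assumes both columns are text, and the table name my_table_by_name is hypothetical.

```cql
-- Original layout: id selects the partition, name orders rows inside it.
CREATE TABLE my_table (
    id text,
    name text,
    PRIMARY KEY (id, name)
);

-- Hypothetical lookup table: name is now the partition key.
CREATE TABLE my_table_by_name (
    name text,
    id text,
    PRIMARY KEY (name, id)
);

-- Served from a single partition, so ALLOW FILTERING is not needed.
SELECT * FROM my_table_by_name WHERE name = 'name_123';
```

The application has to write every row to both tables, which is the usual trade-off of query-based modeling in Cassandra.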

Related

Cassandra query failed to exec - Want to know the reason

I'm working on creating a scheduler service which requires a Cassandra table structure as below.
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler (
id timeuuid,
router_id text,
account_id text,
mode text,
triggered_by text,
retry_count smallint,
PRIMARY KEY ((triggered_by,retry_count),id)
)WITH CLUSTERING ORDER BY (id ASC);
When I query using the primary key I get the error below. May I know the reason why?
select count(*) from spc_cmd_scheduler where triggered_by = 'ROUTER_ONBOARD' and retry_count < 3;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
I understand "ALLOW FILTERING" will solve my problem here but wanted to know what is wrong with the table structure.
What is the optimal way to design this table to suit my requirement?
Just to give some background on my requirement: I need to run a scheduler that scans this table, issues a command, and deletes the entry once it's successful. If the command fails, I need to retry up to 3 times.
So this table requires SELECT, UPDATE and DELETE operations.
In your case, the problem is that the retry_count column is part of the partition key, and only equality operators (= or IN) can be used on partition key columns. Inequality operators (<, >, etc.) are supported only on clustering columns, and all preceding clustering columns need to be specified.
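One possible redesign, following from the rule above, is to move retry_count out of the partition key and make it the first clustering column. This is a sketch under that assumption, not a recommendation from the original answer:

```cql
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler (
    id timeuuid,
    router_id text,
    account_id text,
    mode text,
    triggered_by text,
    retry_count smallint,
    -- triggered_by alone is the partition key; retry_count is a clustering column.
    PRIMARY KEY ((triggered_by), retry_count, id)
) WITH CLUSTERING ORDER BY (retry_count ASC, id ASC);

-- The range restriction now applies to a clustering column, so it is accepted.
SELECT count(*) FROM spc_cmd_scheduler
WHERE triggered_by = 'ROUTER_ONBOARD' AND retry_count < 3;
```

Two caveats apply to this layout. Primary key columns cannot be changed with UPDATE, so incrementing retry_count means deleting the row and inserting it again with the new value. Also, all rows for one triggered_by value share a single partition, which may grow large.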

IN query on secondary index in cassandra when partition key is specified

I'm working with a system that uses a secondary index in cassandra, along with a composite primary key, e.g.
CREATE TABLE table (
a bigint,
b bigint,
c bigint,
PRIMARY KEY (a, b, c)
) WITH CLUSTERING ORDER BY (b ASC, c ASC);
CREATE INDEX secondary_index ON table (c);
One of the operations in the application using the table is to fetch a number of rows (typically tens) specifying the partition key and the secondary index key. Currently, it performs one query for each (partition key, secondary key) pair, in parallel, which works fine, e.g.:
select * from table where a = ? and c = ?;
However, I've noticed that the system's workload is such that most of the time, there is significant overlap in the partition keys across the requested rows, sometimes more than half of them have the same partition key. So, I thought that it might be more efficient to perform one query per partition key, with an IN clause on the secondary key, reducing the number of overall queries to single digits in most cases, and reducing read query overhead on the cluster.
However, at least executed from cqlsh, this does not seem to be allowed:
select * from table where a = ? and c in (...);
InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "c" cannot be restricted as preceding column "b" is not restricted"
Is this just not allowed, and I'll have to continue making individual queries? Is there some reason it wouldn't actually be more efficient? Or is this just a limitation of CQL, and IN queries cannot use the secondary index? Perhaps there is an issue because the secondary index key is also in the primary key, and Cassandra attempts to use that instead of the secondary index?
You are not allowed to execute
select * from table where a = ? and c = ?;
Because Cassandra would then have to scan the whole partition 'a' to find all the rows where c = 'your defined value'. Cassandra has no information about the value of b, so it cannot seek directly to the row.
This page has good explanations of most query patterns:
https://www.datastax.com/blog/deep-look-cql-where-clause
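The rule in the error message can be illustrated as follows. This is a sketch only: the table is renamed to the hypothetical t (table is a reserved word in CQL), the secondary index is left out, and the literal values are made up.

```cql
CREATE TABLE t (
    a bigint,
    b bigint,
    c bigint,
    PRIMARY KEY (a, b, c)
) WITH CLUSTERING ORDER BY (b ASC, c ASC);

-- Rejected: c is restricted while the preceding clustering column b is not.
-- SELECT * FROM t WHERE a = 1 AND c IN (3, 4);

-- Accepted: every clustering column before c is restricted.
SELECT * FROM t WHERE a = 1 AND b = 2 AND c IN (3, 4);
```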

Cassandra Apache query

I have a problem with a table in Cassandra. Below is what I did:
CREATE KEYSPACE tfm WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 };
I'm working on one machine.
CREATE TABLE tfm.foehis(hocpny text, hocol text,honumr text,holinh text,hodtto text,hotour text,hoclic text, hooe text,hotpac text,hodtac text,hohrac text,hodesf text,hocdan text,hocdrs text,hocdsl text, hoobs text,hotdsc text,honrac text,holinr text,housca text,hodtea text,hohrea text,housea text,hodtcl text,hohrcl text,houscl text,hodtrc text,hohrrc text,housrc text,hodtra text,hohrra text,housra text,hodtcm text,hohrcm text,houscm text,hodtua text,hohrua text,houser text, PRIMARY KEY((hooe,hodtac,hohrac),hoclic));
Until this point everything is OK. But when I try to do some select queries, I get warnings and errors:
cqlsh> select count(*) from tfm.foehis;
count
-------
56980
(1 rows)
Warnings :
Aggregation query used without partition key
Read 100 live rows and 1055 tombstone cells for query SELECT * FROM tfm.foehis LIMIT 100 (see tombstone_warn_threshold)
Read 100 live rows and 1066 tombstone cells for query SELECT * FROM tfm.foehis WHERE token(hooe, hodtac, hohrac) >= token(1045161613, 20180502, 2304) LIMIT 100 (see tombstone_warn_threshold)
And
cqlsh> select count(*) from tfm.foehis where hoclic=1011;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid INTEGER constant (1011) for "hoclic" of type text"
cqlsh> select count(*) from tfm.foehis where hotpac=ANOE;
SyntaxException: line 1:49 no viable alternative at input ';' (...from tfm.foehis where hotpac=[ANOE];)
I suppose that the problem is in the definition of the table, but I don't know where it is.
Actually, your issue is in the queries. Since all your columns are text, you need to put single quotes around the values.
Also, according to your table definition, the partition key is formed by the hooe, hodtac and hohrac columns, which means that all your queries must include these columns with exact values (=). hoclic is the clustering column, and on this one you will be able to use other operators and ordering.
Also, keep in mind that running queries without the partition key (like your select) is not recommended in Cassandra, since this triggers a full cluster scan and you can run into all sorts of problems (for instance, garbage collection issues).
I would recommend some basic reading: https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key and https://docs.datastax.com/en/cql/3.3/index.html
When executing the query select count(*) from tfm.foehis;, Cassandra will try to look up all the partitions to calculate the count. Cassandra works best when queries target specific partitions, hence the warning.
You have defined the column hoclic as text and are trying to search using an integer value.
First of all, avoid select queries that require a full table scan. Performance will suffer because Cassandra needs to scan all the partitions.
1) select count(*) from tfm.foehis where hoclic=1011; here the value provided is wrong, as hoclic is of type text. Below is the correction:
select count(*) from tfm.foehis where hoclic='1011';
2) select count(*) from tfm.foehis where hotpac=ANOE; here hotpac is not part of the primary key, and the text value ANOE is not quoted. Cassandra requires the partition key to be provided in the WHERE clause.
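Putting the points from these answers together, a query against this table might look like the sketch below. The partition key values are taken from the token() warning in the question; the hoclic range bounds are made up for illustration.

```cql
-- All three partition key columns are restricted with =, and every literal
-- is quoted because the columns are of type text.
SELECT count(*) FROM tfm.foehis
WHERE hooe = '1045161613' AND hodtac = '20180502' AND hohrac = '2304';

-- The clustering column hoclic can additionally take a range restriction.
SELECT count(*) FROM tfm.foehis
WHERE hooe = '1045161613' AND hodtac = '20180502' AND hohrac = '2304'
  AND hoclic >= '1000' AND hoclic < '2000';
```

Because hoclic is text, the range is compared lexicographically, not numerically.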

At Cassandra, I do not know how to do ORDER BY

I prepared the following table "keyspaceB.memobox"
DROP TABLE IF EXISTS keyspaceB.memobox;
CREATE TABLE IF NOT EXISTS keyspaceB.memobox (
pkey1 text,
pkey2 text,
id timeuuid,
name text,
memo text,
date timestamp,
PRIMARY KEY ((pkey1, pkey2),id,name)
) WITH CLUSTERING ORDER BY (id DESC,name DESC);
And I registered the following data.
INSERT INTO memobox (pkey1,pkey2,id,name,memo,date) VALUES ('a','b',now(),'tanaka','greet message1','2016-12-13');
INSERT INTO memobox (pkey1,pkey2,id,name,memo,date) VALUES ('a','b',now(),'yamamoto','greet message2','2016-12-13');
The following will succeed
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY id;
However, the following fails. Could someone explain what is wrong?
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY name;
Error:
cqlsh:keyspaceb> SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY name;
InvalidRequest: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
cqlsh:keyspaceb>
There are two different types of keys in Cassandra: the partition key and the clustering key.
The partition key determines which node the data is stored on, while the clustering key determines the order in which the data is stored within that partition.
In your case the partition key is pkey1 and pkey2, and the clustering key is id and name.
So the data in a partition is ordered by id and then by name.
e.g if we have the following data
id |name
1 | abc
1 | xyz
2 | aaa
In this case the rows with id 1 are stored first, and if two rows have the same id, their order is decided by the name column.
So when you query the data like this
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY id;
Cassandra finds the partition using pkey1 and pkey2 (the partition key) and then just returns the data in the order it is stored on disk.
However in the second case
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY name;
the data is not ordered by name alone (it is ordered first by id and then by name), so Cassandra cannot simply return the rows as stored; it would have to do a lot more work to sort the results correctly. For performance reasons this is not allowed.
That is why the ORDER BY clause has to list the clustering columns in the order in which you declared them when creating the table (id and then name).
This is from another answer by @aaron:
Where and Order By Clauses in Cassandra CQL
Cassandra achieves performance by using the clustering keys to sort
your data on-disk, thereby only returning ordered rows in a single
read (no random reads). This is why you must take a query-based
modeling approach (often duplicating your data into multiple query
tables) with Cassandra. Know your queries ahead of time, and build
your tables to serve them.
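Applied to this question, the query-based approach could mean a second table whose first clustering column is name. This is a sketch only: the table name memobox_by_name is hypothetical, and the application would need to write each memo to both tables with the same id.

```cql
CREATE TABLE IF NOT EXISTS keyspaceB.memobox_by_name (
    pkey1 text,
    pkey2 text,
    name text,
    id timeuuid,
    memo text,
    date timestamp,
    -- name comes first among the clustering columns, so rows are stored sorted by name.
    PRIMARY KEY ((pkey1, pkey2), name, id)
) WITH CLUSTERING ORDER BY (name ASC, id DESC);

-- ORDER BY now follows the declared clustering order, so it is accepted.
SELECT * FROM keyspaceB.memobox_by_name
WHERE pkey1 = 'a' AND pkey2 = 'b' ORDER BY name ASC;
```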

Cassandra Contains query error

I am new to Cassandra and trying to figure out how to get a simple contains query working with Cassandra.
My table looks like this
CREATE TABLE events (
timekey text,
id timeuuid,
event_types list<text>,
PRIMARY KEY ((timekey), id)
)
My query:
cqlsh> select count(1) from events where event_types contains 'foo';
Bad Request: line 1:46 no viable alternative at input 'contains'
Any thoughts about the error?
Also, is it possible to query for multiple event_types in a single query? I could not see any way to do this with CONTAINS. Something equivalent to this in regular SQL:
Relational SQL example:
select count(1) from events where event_types in ('foo', 'bar')
A couple of things. First of all, when I create your schema and insert a row, I get a different error message than you do:
aploetz@cqlsh:stackoverflow2> CREATE TABLE events (
... timekey text,
... id timeuuid,
... event_types list<text>,
... PRIMARY KEY ((timekey), id)
... );
aploetz@cqlsh:stackoverflow2> INSERT INTO events (timekey, id, event_types)
VALUES ('1', now(),['foo','bar']);
aploetz@cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted
columns support the provided operators: "
To get this to work, you will need to create a secondary index on your event_types collection. Of course secondary indexes on collections are a new feature as of Cassandra 2.1. By virtue of the fact that your error message is different, I'm going to guess that you would need to upgrade to 2.1.
I'm using 2.1.5 in my sandbox right now, so when I create an index on event_types this works:
aploetz@cqlsh:stackoverflow2> CREATE INDEX eventTypeIdx ON events(event_types);
aploetz@cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
count
-------
1
(1 rows)
Even though this may work, secondary indexes on large tables or in large clusters are known not to perform well. I would expect that secondary indexes on collections would perform even worse, so just take that as a warning.
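One way to limit the cost of the index lookup is to restrict the partition key in the same query, so only one partition is consulted. A minimal sketch, assuming the index above exists and reusing the timekey value '1' from the INSERT:

```cql
-- Index on the collection, as created above.
CREATE INDEX eventTypeIdx ON events(event_types);

-- Restricting timekey confines the CONTAINS lookup to a single partition.
SELECT count(1) FROM events
WHERE timekey = '1' AND event_types CONTAINS 'foo';
```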
Also Is it possible to query for multiple event_types in one single query?
There are ways to accomplish this, but I recommend against it for the aforementioned performance issues. I answered a similar question here, if you are interested: Cassandra CQL where clause with multiple collection values?
