Cassandra Apache query - cassandra

I have a problem with a table in Cassandra. Below is what I did:
CREATE KEYSPACE tfm WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 };
I'm working on a single machine.
CREATE TABLE tfm.foehis(hocpny text, hocol text,honumr text,holinh text,hodtto text,hotour text,hoclic text, hooe text,hotpac text,hodtac text,hohrac text,hodesf text,hocdan text,hocdrs text,hocdsl text, hoobs text,hotdsc text,honrac text,holinr text,housca text,hodtea text,hohrea text,housea text,hodtcl text,hohrcl text,houscl text,hodtrc text,hohrrc text,housrc text,hodtra text,hohrra text,housra text,hodtcm text,hohrcm text,houscm text,hodtua text,hohrua text,houser text, PRIMARY KEY((hooe,hodtac,hohrac),hoclic));
Until this point everything is OK. But when I try to do some select queries, I get warnings and errors:
cqlsh> select count(*) from tfm.foehis;
count
-------
56980
(1 rows)
Warnings :
Aggregation query used without partition key
Read 100 live rows and 1055 tombstone cells for query SELECT * FROM tfm.foehis LIMIT 100 (see tombstone_warn_threshold)
Read 100 live rows and 1066 tombstone cells for query SELECT * FROM tfm.foehis WHERE token(hooe, hodtac, hohrac) >= token(1045161613, 20180502, 2304) LIMIT 100 (see tombstone_warn_threshold)
And
cqlsh> select count(*) from tfm.foehis where hoclic=1011;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid INTEGER constant (1011) for "hoclic" of type text"
cqlsh> select count(*) from tfm.foehis where hotpac=ANOE;
SyntaxException: line 1:49 no viable alternative at input ';' (...from tfm.foehis where hotpac=[ANOE];)
I suppose the problem is in the table definition, but I don't know where exactly.

Actually, your issue is in the queries. Since all your columns are text, you need to put single quotes around the values.
Also, according to your table definition, the partition key is formed by the hooe, hodtac and hohrac columns, which means that all your queries must include these columns with exact values (=). hoclic is the clustering column, and on this one you will be able to use other operators and ordering.
Also, keep in mind that running queries without the partition key (like your select) is not recommended in Cassandra, since this triggers a full cluster scan and you can run into all sorts of problems (for instance, garbage collection issues).
I would recommend some basic reading: https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key and https://docs.datastax.com/en/cql/3.3/index.html
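As a minimal sketch of the two points above: the statements below quote every text value and restrict all three partition key columns with =. The key values are taken from the token(...) warning in the question and are only assumed to exist in the table; the hoclic bounds are made up for illustration.

```cql
-- Text values need single quotes, and every partition key column
-- (hooe, hodtac, hohrac) is restricted with =.
SELECT count(*) FROM tfm.foehis
WHERE hooe = '1045161613' AND hodtac = '20180502' AND hohrac = '2304';

-- Once the partition is fixed, the clustering column hoclic accepts
-- range operators. The bounds here are hypothetical.
SELECT * FROM tfm.foehis
WHERE hooe = '1045161613' AND hodtac = '20180502' AND hohrac = '2304'
  AND hoclic >= '1000' AND hoclic < '2000';
```

Because hoclic is text, the range is compared as strings, not as numbers.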

When executing the query select count(*) from tfm.foehis;, Cassandra has to read every partition to calculate the count. Cassandra works best when queries target specific partitions, hence the warning.
You have defined the column hoclic as text and are trying to search using an integer value.

First of all, avoid select queries that require a full table scan. Performance suffers because Cassandra needs to scan all the partitions.
1) select count(*) from tfm.foehis where hoclic=1011; Here the value provided is wrong, as hoclic is of type text. Below is the correction:
select count(*) from tfm.foehis where hoclic='1011';
2) select count(*) from tfm.foehis where hotpac=ANOE; The value is missing its single quotes, and hotpac is not part of the primary key. Cassandra requires the partition key in the WHERE clause when you search by column values.
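For the second query, two possible ways to make it run are sketched below. Neither comes from the original question: the index name foehis_hotpac_idx is hypothetical, and filtering on a non-indexed regular column is assumed to be supported by the Cassandra version in use (older releases reject it). Both still touch many partitions, so they suit small tables or occasional checks rather than hot paths.

```cql
-- Option A: quote the value and accept a scan explicitly.
SELECT count(*) FROM tfm.foehis WHERE hotpac = 'ANOE' ALLOW FILTERING;

-- Option B: add a secondary index (hypothetical name), then query without
-- ALLOW FILTERING.
CREATE INDEX foehis_hotpac_idx ON tfm.foehis (hotpac);
SELECT count(*) FROM tfm.foehis WHERE hotpac = 'ANOE';
```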

Related

Cassandra query failed to exec - Want to know the reason

I'm working on creating a scheduler service which requires the Cassandra table structure below.
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler (
id timeuuid,
router_id text,
account_id text,
mode text,
triggered_by text,
retry_count smallint,
PRIMARY KEY ((triggered_by,retry_count),id)
)WITH CLUSTERING ORDER BY (id ASC);
When I query with the primary key, I get the error below. May I know the reason why?
select count(*) from spc_cmd_scheduler where triggered_by = 'ROUTER_ONBOARD' and retry_count < 3;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
I understand that "ALLOW FILTERING" will solve my problem here, but I want to know what is wrong with the table structure.
What is the optimal way to design this table so that it suits my requirement?
To give some background on my requirement: I need to run a scheduler that scans this table, issues a command, and deletes the entry once the command succeeds. If the command fails, I need to retry up to 3 times.
So this table requires SELECT, UPDATE and DELETE operations.
In your case, the problem is that the retry_count column is part of the partition key, and only equality operators (= or IN) can be used on partition key columns. Inequality operators (<, >, etc.) are supported only on clustering columns, and all preceding clustering columns need to be specified.
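One possible redesign, sketched under assumptions: retry_count moves out of the partition key and becomes the first clustering column, so the range restriction is allowed. The table name spc_cmd_scheduler_v2 is hypothetical, and this is not the only workable layout.

```cql
-- Hypothetical variant: triggered_by alone is the partition key,
-- retry_count and id are clustering columns.
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler_v2 (
  triggered_by text,
  retry_count smallint,
  id timeuuid,
  router_id text,
  account_id text,
  mode text,
  PRIMARY KEY ((triggered_by), retry_count, id)
) WITH CLUSTERING ORDER BY (retry_count ASC, id ASC);

-- The original query now works without ALLOW FILTERING.
SELECT count(*) FROM spc_cmd_scheduler_v2
WHERE triggered_by = 'ROUTER_ONBOARD' AND retry_count < 3;
```

A trade-off to keep in mind: primary key columns cannot be changed with UPDATE, so bumping retry_count in this layout means deleting the row and inserting it again with the new value. All rows for one triggered_by value also share a single partition.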

Filtering on Primary Key in Cassandra

I have the following table:
CREATE TABLE tab1 (id int PRIMARY KEY, price int, name text);
The following queries return errors:
SELECT name FROM tab1 WHERE id > 5;
SELECT name FROM tab1 WHERE id > 5 ALLOW FILTERING;
How can I fix it?
SELECT name FROM tab1 WHERE id > 5 ALLOW FILTERING; will not give an error, since you are using ALLOW FILTERING. If your queries need ALLOW FILTERING, then you should redesign your tables according to the queries. ALLOW FILTERING is not an efficient way of querying your tables, especially in production. Please check here.
SELECT name FROM tab1 WHERE id > 5; will give you an error:
[Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
The reason is that Cassandra doesn't work the way a relational database does. The table structure doesn't allow you to run any query you want, so you model your tables according to your queries.
Please check here for details on the WHERE clause. As stated in the documentation, the partition key columns support only two operators: = and IN. In your case you are using greater-than, which causes the error.
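To illustrate the operator rule, the first two statements use only = and IN on the partition key of tab1. The last part is a hypothetical remodel (the table tab1_by_bucket and its bucket column are not in the question): it adds an artificial partition key so that id becomes a clustering column and can take a range.

```cql
-- Allowed on the partition key: = and IN.
SELECT name FROM tab1 WHERE id = 6;
SELECT name FROM tab1 WHERE id IN (6, 7, 8);

-- Hypothetical remodel: bucket is the partition key, id is a clustering column.
CREATE TABLE tab1_by_bucket (
  bucket int,
  id int,
  price int,
  name text,
  PRIMARY KEY ((bucket), id)
);

-- A range on id is valid once the partition (bucket) is fixed.
SELECT name FROM tab1_by_bucket WHERE bucket = 0 AND id > 5;
```

How rows are assigned to buckets is an application decision; the sketch only shows the shape of the key.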

Filtering by `list<double>` column's element value range

I'd like to filter rows of following table in cassandra.
CREATE TABLE mids_test_db.defect_data (
wafer_id text,
defect_id text,
document_id text,
fields list<double>,
PRIMARY KEY (wafer_id, defect_id)
)
...
CREATE INDEX defect_data_fields_idx ON mids_test_db.defect_data (values(fields));
I first tried something like fields[0] > 0.5, but it failed.
cqlsh:mids_test_db> select fields from defect_data where wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' and fields[0] > 0.5;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Indexes on list entries (fields[index] = value) are not currently supported."
After searching Google for a while, I get the feeling that this kind of job cannot be done easily in Cassandra. The data model is something like a collection of field values. Mostly I want to query defect data by its field values, as above, which is quite important for my business.
What approach should I consider? Application-side filtering? Any hint or advice will be appreciated.
It's not possible to do directly with Cassandra, but you have following alternatives:
if your Cassandra is DataStax Enterprise, then you can use DSE Search;
you can add an additional table to perform lookup:
CREATE TABLE mids_test_db.defect_data_lookup (
wafer_id text,
defect_id text,
field double,
PRIMARY KEY (wafer_id, field, defect_id)
);
After that you should be able to do a range scan inside the partition to fetch at least the defect_id, and then fetch all field values with a second query.
Depending on your Cassandra version, you may be able to use materialized view to maintain that lookup table for you.
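The two-step lookup described above could look like the sketch below. The wafer_id value is the one from the question; the defect_id 'D1' is a hypothetical value standing in for whatever the first query returns. Note that the lookup table as defined does not record the position of a value inside the list, so the range matches any element, not only fields[0].

```cql
-- Step 1: range scan inside one wafer partition of the lookup table.
SELECT defect_id, field FROM mids_test_db.defect_data_lookup
WHERE wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' AND field > 0.5;

-- Step 2: fetch the full list for each defect_id returned by step 1.
-- 'D1' is a hypothetical defect_id.
SELECT fields FROM mids_test_db.defect_data
WHERE wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' AND defect_id = 'D1';
```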

Are all values in a Primary Key Indexed?

Following a Tutorial on Cassandra, it was mentioned that if I do the following:
PRIMARY KEY(id, name) that id is the partition key and hence it is indexed. The name is the clustering column and hence it is also indexed. This means I can do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc'; //this works!
I can also do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc' AND name = 'name_123'; // this works!
However, I cannot do the following query:
SELECT * FROM my_table WHERE name = 'name_123'; // this does not work
Why does the last statement not work if the clustering column is indexed? Why does the first query work and not the second?
The error I get for the last query is the following:
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Thanks in advance!
Just because it is named primary key does not mean there is an index on it in Cassandra. id is your partition key: it defines which node in Cassandra is responsible for your id. The clustering column name defines the order inside the partition.
Therefore SELECT * FROM my_table WHERE name = 'name_123'; would require all partitions to be scanned, which Cassandra refuses by default.
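If lookups by name alone are needed, a common approach is a second table keyed by name. The sketch below is an assumption rather than part of the answer above: the table name my_table_by_name is hypothetical, both columns are assumed to be text, and the application would have to write to both tables.

```cql
-- Hypothetical query table: name is the partition key, id the clustering column.
CREATE TABLE my_table_by_name (
  name text,
  id text,
  PRIMARY KEY ((name), id)
);

-- The lookup by name now targets a single partition.
SELECT * FROM my_table_by_name WHERE name = 'name_123';
```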

Cassandra Contains query error

I am new to Cassandra and trying to figure out how to get a simple contains query working with Cassandra.
My table looks like this
CREATE TABLE events (
timekey text,
id timeuuid,
event_types list<text>,
PRIMARY KEY ((timekey), id)
)
My query:
cqlsh> select count(1) from events where event_types contains 'foo';
Bad Request: line 1:46 no viable alternative at input 'contains'
Any thoughts about the error?
Also, is it possible to query for multiple event_types in one single query? I could not see any way to do this with CONTAINS. I am looking for something equivalent to this in regular SQL:
Relational SQL example:
select count(1) from events where event_types in ('foo', 'bar')
A couple of things. First of all, when I create your schema and insert a row, I get a different error message than you do:
aploetz#cqlsh:stackoverflow2> CREATE TABLE events (
... timekey text,
... id timeuuid,
... event_types list<text>,
... PRIMARY KEY ((timekey), id)
... );
aploetz#cqlsh:stackoverflow2> INSERT INTO events (timekey, id, event_types)
VALUES ('1', now(),['foo','bar']);
aploetz#cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
To get this to work, you will need to create a secondary index on your event_types collection. Of course secondary indexes on collections are a new feature as of Cassandra 2.1. By virtue of the fact that your error message is different, I'm going to guess that you would need to upgrade to 2.1.
I'm using 2.1.5 in my sandbox right now, so when I create an index on event_types this works:
aploetz#cqlsh:stackoverflow2> CREATE INDEX eventTypeIdx ON events(event_types);
aploetz#cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
count
-------
1
(1 rows)
Even though this may work, secondary indexes on large tables or in large clusters are known not to perform well. I would expect that secondary indexes on collections would perform even worse, so just take that as a warning.
Also Is it possible to query for multiple event_types in one single query?
There are ways to accomplish this, but I recommend against it for the aforementioned performance issues. I answered a similar question here, if you are interested: Cassandra CQL where clause with multiple collection values?
