I am trying out Cassandra for the first time and running it locally as a simple session-management database. [Cassandra 2.0.4, CQL3, DataStax driver 2.0.0-rc2]
The following count query works fine when there is no data in the table:
select count(*) from session_data where app_name=? and account=? and last_access > ?
But after even a single row is inserted into the table, the query fails with the following error:
java.lang.AssertionError
at org.apache.cassandra.db.filter.ExtendedFilter$WithClauses.getExtraFilter(ExtendedFilter.java:258)
at org.apache.cassandra.db.ColumnFamilyStore.filter(ColumnFamilyStore.java:1719)
at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1674)
at org.apache.cassandra.db.PagedRangeCommand.executeLocally(PagedRangeCommand.java:111)
at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1418)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1931)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Here is the schema I am using:
CREATE KEYSPACE session WITH replication= {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE session_data (
username text,
session_id text,
app_name text,
account text,
last_access timestamp,
created_on timestamp,
PRIMARY KEY (username, session_id, app_name, account)
);
create index sessionIndex ON session_data (session_id);
create index sessionAppName ON session_data (app_name);
create index lastAccessIndex ON session_data (last_access);
I am wondering if there is something wrong in the table definition/indexes or the query itself. Any help/insight would be greatly appreciated.
It looks like you're tripping over a bug in Cassandra. Here is the assertion and related comments in the Cassandra sources:
/*
* This method assumes the IndexExpression names are valid column names, which is not the
* case with composites. This is ok for now however since:
* 1) CompositeSearcher doesn't use it.
* 2) We don't yet allow non-indexed range slice with filters in CQL3 (i.e. this will never be
* called by CFS.filter() for composites).
*/
assert !(cfs.getComparator() instanceof CompositeType);
This code was modified between cassandra-2.0.4 and trunk as part of ticket CASSANDRA-5417, but it's not clear to me that the author was aware of this issue. The assertion was removed, but the comment was not. I would recommend submitting a bug report to the Cassandra project.
I have a table in ScyllaDB:
CREATE TABLE taxiservice.operatoragentsauditlog (
hourofyear int,
operationtime bigint,
action text,
actiontype text,
appname text,
entityid text,
entitytype text,
operatorid text,
operatoripaddress text,
operatorname text,
payload text,
PRIMARY KEY (hourofyear, operationtime)
) WITH CLUSTERING ORDER BY (operationtime DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'LeveledCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX auditactiontype ON taxiservice.operatoragentsauditlog (actiontype);
CREATE INDEX auditid ON taxiservice.operatoragentsauditlog (entityid);
CREATE INDEX agentid ON taxiservice.operatoragentsauditlog (operatorid);
CREATE INDEX auditaction ON taxiservice.operatoragentsauditlog (action);
I have written the following query:
select * from taxiService.operatoragentsauditlog
where hourOfYear =3655
and actionType ='XYZ'
and operatorId in ('100','200') limit 500;
And Scylla throws the following error:
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance
unpredictability, use ALLOW FILTERING"
All the columns I used in the conditions are indexed in the table, yet the query still throws the error above.
How can I fetch the rows without adding ALLOW FILTERING to the query?
I rewrote all the Scylla queries with ALLOW FILTERING and deployed the changes to production. The server then started throwing an internal service error (NoHostAvailableException) and could no longer fetch data from Scylla.
How can I resolve the NoHostAvailableException in Scylla?
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:83) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.mapping.MethodMapper.invoke(MethodMapper.java:184) ~[cassandra-driver-mapping-3.10.2.jar:?]
at com.datastax.driver.mapping.AccessorInvocationHandler.invoke(AccessorInvocationHandler.java:67) ~[cassandra-driver-mapping-3.10.2.jar:?]
at com.sun.proxy.$Proxy161.getRideAuditLog(Unknown Source) ~[?:?]
at com.mycomany.myproduct.auditLog.AuditLogService.getRideAuditLog(AuditLogService.java:21) ~[taxiopconsoleservice-1.1.0.jar:?]
With distributed databases like Cassandra and Scylla, the idea is to build your tables to suit your queries. To that end, you could build another table and duplicate the data into it. In this new table, the primary key definition should look like this:
PRIMARY KEY (hourOfYear, actionType, operatorId)
That will support this query without the dreaded ALLOW FILTERING directive.
select * from taxiService.operatoragentsauditlog_by_hourofyear_and_actiontype
where hourOfYear =3655
and actionType ='XYZ'
and operatorId in ('100','200');
But as the original table is partitioned on hourOfYear, the query is restricted to a single partition. So even with ALLOW FILTERING it might not be that bad.
Your query
select * from taxiService.operatoragentsauditlog
where hourOfYear =3655
and actionType ='XYZ'
and operatorId in ('100','200') limit 500;
will look at either the actionType or the operatorId index to find all the rows matching the restriction on that column, and then needs to go over all the candidate rows to check whether they also match the two other restrictions. That's why you need ALLOW FILTERING. Theoretically, actionType = 'XYZ' may match a million rows, so this query would need to go over a million rows just to return the handful that match all the restrictions.
Some search engines have an efficient way to intersect two index lookups - maybe actionType = 'XYZ' has a million matches, and operatorId in ('100', '200') has a million matches, but their intersection is just 10 rows. Search engines use a skip list mechanism to allow the intersection to be calculated efficiently. But Scylla doesn't have this feature. It will pick just one of the two indexes (you don't know which), and go over its matches one by one. By the way, please note that even if Scylla did support efficient index intersection, your hourOfYear = 3655 restriction isn't indexed, so it would need row-by-row filtering anyway.
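To make the difference concrete, here is a small Python sketch (not Scylla code) that contrasts the two strategies on in-memory data. The row layout, the sample values, and both helper functions are hypothetical; the sketch only illustrates why walking one index's matches can touch far more rows than intersecting two sorted match lists.

```python
from bisect import bisect_left


def intersect_sorted(a, b):
    """Intersect two sorted lists of row ids, skipping ahead with binary search."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump to the first entry of a that could equal b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            j = bisect_left(b, a[i], j + 1)
    return out


def filter_one_index(candidates, rows, predicate):
    """Walk the matches of a single index and check every candidate row."""
    matches, examined = [], 0
    for rid in candidates:
        examined += 1
        if predicate(rows[rid]):
            matches.append(rid)
    return matches, examined


# Hypothetical data: every second row has actiontype 'XYZ',
# every hundredth row has operatorid '100'.
rows = {
    rid: {
        "actiontype": "XYZ" if rid % 2 == 0 else "ABC",
        "operatorid": "100" if rid % 100 == 0 else "999",
    }
    for rid in range(1000)
}
by_action = [rid for rid in rows if rows[rid]["actiontype"] == "XYZ"]
by_operator = [rid for rid in rows if rows[rid]["operatorid"] in ("100", "200")]

both = intersect_sorted(by_action, by_operator)
matches, examined = filter_one_index(
    by_action, rows, lambda r: r["operatorid"] in ("100", "200")
)
print(len(both), len(matches), examined)  # prints: 10 10 500
```

Both approaches return the same 10 rows, but the single-index walk has to examine all 500 rows that matched actiontype.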
As Aaron noted in his answer, the solution can be to use ALLOW FILTERING if the one index match results in a small-enough number of matches (it would be better to have just one index, not two, so you'll know exactly which index is being used), or - change your table schema to something which better matches your queries. The latter is always good advice in many situations.
Regarding the NoHostAvailableException - it means the Scylla replicas failed to perform this query for some reason. It might indicate a bug, or a timeout (which would also be a bug, because a scan like your query should do paging - not time out). Please look at the Scylla log if there's an error message that appears at the time of this request, and report the problem in the Scylla bug tracker at https://github.com/scylladb/scylla/issues.
I have many tables per keyspace, therefore I would like to filter the tables based on restriction criteria. I tried this query but it is not really giving the intended result that I want:
SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'test'
and table_name >= 'test_001_%';
The output shown is:
'table_name'
---------------------
'test_001_metadata'
'test_001_time1'
'test_001_time2'
'test_001_time3'
'test_001_time4'
'test_002_metadata'
'test_002_time1'
'test_002_time2'
'test_002_time3'
What I really want is:
'table_name'
---------------------
'test_001_metadata'
'test_001_time1'
'test_001_time2'
'test_001_time3'
'test_001_time4'
The other option is to use the LIKE keyword after creating a secondary index on table_name. But I am a bit wary that this might cause problems, since it is a system table. Another concern: do clustering columns actually support secondary indexes?
Remove the previous index, create a SASI index with mode CONTAINS on the table_name column, and then try the query as
SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'test'
and table_name LIKE '%test_001_%';
The command to create a SASI index with mode contains is as follows:
CREATE CUSTOM INDEX ON system_schema.tables(table_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'case_sensitive': 'false', 'tokenization_normalize_uppercase': 'true', 'mode': 'CONTAINS'}
As for your second question: a secondary index can be created on a clustering column. What Cassandra does not allow is a secondary index on a column that is the entire partition key.
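As a possible alternative that needs no index at all: table_name is a clustering column, so a prefix match can be written as a bounded range. In the original query, the % in 'test_001_%' is just a literal character in a >= comparison, and the range has no upper bound, which is why the test_002 tables came back too. The Python sketch below computes the upper bound for a prefix; prefix_range and build_query are hypothetical helper names, and the sketch assumes plain ASCII table names.

```python
def prefix_range(prefix):
    """Return (lower, upper) so that lower <= s < upper holds exactly for
    strings that start with prefix. Assumes ASCII names, so the last
    character can simply be incremented."""
    if not prefix:
        raise ValueError("prefix must be non-empty")
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


def build_query(keyspace, prefix):
    """Build the CQL text with bind markers, plus the values to bind."""
    lower, upper = prefix_range(prefix)
    cql = (
        "SELECT table_name FROM system_schema.tables "
        "WHERE keyspace_name = ? AND table_name >= ? AND table_name < ?"
    )
    return cql, (keyspace, lower, upper)


# The table names from the question, checked locally against the same bounds.
names = [
    "test_001_metadata", "test_001_time1", "test_001_time2",
    "test_001_time3", "test_001_time4",
    "test_002_metadata", "test_002_time1", "test_002_time2", "test_002_time3",
]
lower, upper = prefix_range("test_001_")
selected = [name for name in names if lower <= name < upper]
print(upper)
print(selected)
```

For the prefix 'test_001_' the upper bound is 'test_001`', because the backtick is the character that follows the underscore in ASCII.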
I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content of 'mytable' into a partitioned Hive table using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) // ( 1) : Note here : I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate //(2) : Note here : I have a field named 'TestDate' ; Both (1) & (2) have the same name
from
mytable
When I execute this query, I get the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
It looks like I am getting this error because the names match apart from case, i.e. testdate (the partition in Hive) and TestDate (the field in the temp table 'mytable').
If the partition name is different from the field name (i.e. TestDate), however, the query executes successfully. Example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) //Note here the partition name is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that this is a bug in Spark, but I would like a second opinion. Am I missing something here?
@DuduMarkovitz @dhee: apologies for the late response. I was finally able to resolve the issue. Earlier I was creating the table using camelCase (in the CREATE statement), which seems to have been the reason for the exception. I have now created the table using DDL in which the field names are in lower case, and this has resolved my issue.
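A minimal Python sketch of that fix, assuming (as the error message suggests, though nothing above confirms it) that the partition spec is built as a case-sensitive map. lowercase_columns is a hypothetical helper, not part of Spark or Hive.

```python
def lowercase_columns(columns):
    """Lower-case column names, refusing names that differ only by case."""
    lowered = [name.lower() for name in columns]
    clashes = sorted({name for name in lowered if lowered.count(name) > 1})
    if clashes:
        raise ValueError(f"columns differ only by case: {clashes}")
    return lowered


# A case-sensitive mapping keeps both spellings as separate keys, which is
# the shape of the partition spec shown in the error message.
spec = {"testdate": "", "TestDate": "2013-01-01"}

normalized = {}
for key, value in spec.items():
    # Keep the non-empty value when two spellings collapse to one key.
    normalized[key.lower()] = value or normalized.get(key.lower(), "")

print(lowercase_columns(["Field1", "Field2", "TestDate"]))
print(normalized)
```

With PySpark, the lower-cased names could be applied before registering the view, for example with df.toDF(*lowercase_columns(df.columns)).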
I am dealing with a puzzling behaviour when doing SELECTs on Cassandra 2.2.3. I have 4 nodes in the ring, and I create the following keyspace, table and index.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE my_keyspace.my_table (
id text,
some_text text,
code text,
some_set set<int>,
a_float float,
name text,
type int,
a_double double,
another_set set<int>,
another_float float,
yet_another_set set<text>,
PRIMARY KEY (id, some_text, code)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
CREATE INDEX idx_my_table_code ON my_keyspace.my_table (code);
Then I insert some rows on the table. Some of them have empty sets. I perform this query through the default CQL client and get the row I am expecting:
SELECT * FROM my_table WHERE code = 'test';
Then I run some tests which are outside my control. I don't know what they do but I expect they read and possibly insert/update/delete some rows. I'm sure they don't delete or change any of the settings in the index, table or keyspace.
After the tests, I log in again through the default CQL client and run the following queries.
SELECT * FROM my_table WHERE code = 'test';
SELECT * FROM my_table;
SELECT * FROM my_table WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
The first one doesn't return anything.
The second one returns all the rows, including the one with code = 'test'.
The third one returns the expected row that the first query couldn't retrieve.
The only difference that I can see between this row and others is that it is one of the rows which contains some empty sets, as explained earlier. If I query for another of the rows that also contain some empty sets, I get the same behavior.
I would say the problem is related to the secondary index. Somehow, the operations performed during the tests leave the index in a state where it cannot see certain rows.
I'm obviously missing something. Do you have any ideas about what could cause this behavior?
Thanks in advance.
UPDATE:
I worked around the issue, but now I found the same problem somewhere else. Since the issue first happened, I found out more about the operations performed before the error: updates on specific columns that set a TTL for said columns. After some investigation I found some Jira issues which could be related to this problem:
https://issues.apache.org/jira/browse/CASSANDRA-6782
https://issues.apache.org/jira/browse/CASSANDRA-8206
However, those issues seem to have been solved on 2.0 and 2.1, and I'm using 2.2. I think these changes are included in 2.2, but I could be mistaken.
The main problem is the type of query you are running on Cassandra.
The Cassandra data model is query driven: tables are designed up front to serve specific queries.
Tables are created with a well-defined primary key (partition key and clustering key). Cassandra is not good at full-table-scan queries.
Now coming to your queries.
SELECT * FROM my_table WHERE code = 'test';
Here the column used is a clustering column. Since it is the column you search on by equality, it should be part of the partition key. The same clustering key value can be present in different partitions, so if the read consistency level is ONE the query may return an empty result.
SELECT * FROM my_table;
Cassandra is not good at this kind of table-scan query. Here it will scan the whole table and return all the rows (a poor way to query).
SELECT * FROM my_table
WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
Here you mentioned everything so the correct results were returned.
I opened a Jira issue and the problem was fixed on 2.1.18 and 2.2.10:
https://issues.apache.org/jira/browse/CASSANDRA-13412
I speak just from what I read in the Jira issue. I didn't test the above scenario again after the fix was implemented because by then I had moved to the 3.0 version.
In the end though I ended up removing almost every use of secondary indices in my application, as I learned that they led to bad performance.
The reason is that in most cases they will result in fan-out queries that will contact every node of the cluster, with the corresponding costs.
There are still some cases where they can perform well, e.g. when you query by partition key at the same time, as no other nodes will be involved.
But for anything else, my advice is: consider if you can remove your secondary indices and do lookups in auxiliary tables instead. You'll have the burden of maintaining the tables in sync, but performance should be better.
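The auxiliary-table idea can be sketched with a toy in-memory model in Python. This is not driver code: CodeLookupStore and its methods are hypothetical, and the two dictionaries stand in for the base table and a lookup table keyed by code (column names borrowed from the schema in the question). In a real cluster these would be two tables that the application writes to together.

```python
class CodeLookupStore:
    """Toy model: a base table plus a lookup table keyed by `code`."""

    def __init__(self):
        self.rows = {}     # (id, some_text, code) -> other columns
        self.by_code = {}  # code -> set of (id, some_text)

    def insert(self, id_, some_text, code, **columns):
        # Every write goes to both "tables" so they stay in sync.
        self.rows[(id_, some_text, code)] = columns
        self.by_code.setdefault(code, set()).add((id_, some_text))

    def delete(self, id_, some_text, code):
        self.rows.pop((id_, some_text, code), None)
        keys = self.by_code.get(code)
        if keys is not None:
            keys.discard((id_, some_text))
            if not keys:
                del self.by_code[code]

    def find_by_code(self, code):
        # One lookup by key in the auxiliary table, then direct reads
        # of the base rows; no scan over unrelated rows.
        return [
            {"id": id_, "some_text": some_text, "code": code,
             **self.rows[(id_, some_text, code)]}
            for id_, some_text in sorted(self.by_code.get(code, ()))
        ]


store = CodeLookupStore()
store.insert("my_id", "whatever", "test", name="first")
store.insert("other_id", "something", "prod", name="second")
print(store.find_by_code("test"))
store.delete("my_id", "whatever", "test")
print(store.find_by_code("test"))
```

The cost mentioned above is visible in insert and delete: each of them has to touch both structures, which is the price paid for a lookup that goes straight to one key.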
I have an issue with my CQL: Cassandra is giving me a "no viable alternative at input '(' (...WHERE id = ? if [(]...)" error message. I think there is a problem with my statement.
UPDATE <TABLE> USING TTL 300
SET <attribute1> = 13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE <attribute2> = dfa2efb0-7247-11e5-a9e5-0242ac110003
IF (<attribute1> = null OR <attribute1> = 13381990-735b-11e5-9bed-2ae6d3dfc201) AND <attribute3> = 0;
Any idea where the problem in the statement is?
It would help to have your complete table structure, so to test your statement I made a couple of educated guesses.
With this table:
CREATE TABLE lwtTest (attribute1 timeuuid, attribute2 timeuuid PRIMARY KEY, attribute3 int);
This statement works, as long as I don't add the lightweight transaction on the end:
UPDATE lwttest USING TTL 300 SET attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE attribute2=dfa2efb0-7247-11e5-a9e5-0242ac110003;
Your lightweight transaction...
IF (attribute1=null OR attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201) AND attribute3 = 0;
...has a few issues.
"null" in Cassandra is not at all similar to its RDBMS counterpart. Not every row needs to have a value for every column. CQL rows without a value for a given column will show "null" for it. But you cannot query by "null", since it isn't really there.
The OR keyword does not exist in CQL.
You cannot use extra parentheses to separate conditions in your WHERE clause or your lightweight transaction.
Bearing those points in mind, the following UPDATE and lightweight transaction run without error:
UPDATE lwttest USING TTL 300 SET attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE attribute2=dfa2efb0-7247-11e5-a9e5-0242ac110003
IF attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201 AND attribute3=0;
[applied]
-----------
False
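The semantics of the IF clause can be modelled in a few lines of Python. This is a simplified in-memory sketch, not Cassandra code: conditional_update is a hypothetical function, the table is a plain dict, and a missing column is represented by None. It shows that all conditions are combined with AND, and gives one plausible reading of the [applied] False result above: the row had no value for attribute3, so the condition attribute3=0 did not hold.

```python
def conditional_update(table, key, conditions, assignments):
    """Apply assignments only if every condition matches the current row.

    Conditions are combined with AND only; there is no OR and no grouping.
    """
    row = table.get(key, {})
    applied = all(
        row.get(column) == expected for column, expected in conditions.items()
    )
    if applied:
        table.setdefault(key, {}).update(assignments)
    return applied


KEY = "dfa2efb0-7247-11e5-a9e5-0242ac110003"
VALUE = "13381990-735b-11e5-9bed-2ae6d3dfc201"

# Assumed starting state: attribute1 was set earlier, attribute3 never was.
table = {KEY: {"attribute1": VALUE}}
conditions = {"attribute1": VALUE, "attribute3": 0}

first_try = conditional_update(table, KEY, conditions, {"attribute1": VALUE})

# Once attribute3 holds 0, the same conditional update is applied.
table[KEY]["attribute3"] = 0
second_try = conditional_update(table, KEY, conditions, {"attribute1": VALUE})

print(first_try, second_try)  # prints: False True
```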