Cassandra SELECT on secondary index doesn't return row - cassandra

I am dealing with a puzzling behaviour when doing SELECTs on Cassandra 2.2.3. I have 4 nodes in the ring, and I create the following keyspace, table and index.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE my_keyspace.my_table (
id text,
some_text text,
code text,
some_set set<int>,
a_float float,
name text,
type int,
a_double double,
another_set set<int>,
another_float float,
yet_another_set set<text>,
PRIMARY KEY (id, some_text, code)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
CREATE INDEX idx_my_table_code ON my_keyspace.my_table (code);
Then I insert some rows on the table. Some of them have empty sets. I perform this query through the default CQL client and get the row I am expecting:
SELECT * FROM my_table WHERE code = 'test';
Then I run some tests which are outside my control. I don't know what they do but I expect they read and possibly insert/update/delete some rows. I'm sure they don't delete or change any of the settings in the index, table or keyspace.
After the tests, I log in again through the default CQL client and run the following queries.
SELECT * FROM my_table WHERE code = 'test';
SELECT * FROM my_table;
SELECT * FROM my_table WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
The first one doesn't return anything.
The second one returns all the rows, including the one with code = 'test'.
The third one returns the expected row that the first query couldn't retrieve.
The only difference that I can see between this row and others is that it is one of the rows which contains some empty sets, as explained earlier. If I query for another of the rows that also contain some empty sets, I get the same behavior.
I would say the problem is related to the secondary index. Somehow, the operations performed during the tests leave the index in an state where it cannot see certain rows.
I'm obviously missing something. Do you have any ideas about what could cause this behavior?
Thanks in advance.
UPDATE:
I worked around the issue, but now I found the same problem somewhere else. Since the issue first happened, I found out more about the operations performed before the error: updates on specific columns that set a TTL for said columns. After some investigation I found some Jira issues which could be related to this problem:
https://issues.apache.org/jira/browse/CASSANDRA-6782
https://issues.apache.org/jira/browse/CASSANDRA-8206
However, those issues seem to have been solved on 2.0 and 2.1, and I'm using 2.2. I think these changes are included in 2.2, but I could be mistaken.

The main problem is the the type of query you are running on Cassandra.
The Cassadra data model is query driven, tables are recomputed to serve the query.
Tables are created by using well defined Primary Key (Partition Key & clustring key). Cassandra is not good for full table scan type of queries.
Now coming to your queries.
SELECT * FROM my_table WHERE code = 'test';
Here the column used is clustring column and it the equality search column it should be part of Partition Key. Clustring key will be present in different partitions so if Read consistency level is one it may give empty result.
SELECT * FROM my_table;
Cassandra is not good for this kind of table scan query. Here it will search all the table and get all the rows (poor querying).
SELECT * FROM my_table
WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
Here you mentioned everything so the correct results were returned.

I opened a Jira issue and the problem was fixed on 2.1.18 and 2.2.10:
https://issues.apache.org/jira/browse/CASSANDRA-13412
I speak just from what I read in the Jira issue. I didn't test the above scenario again after the fix was implemented because by then I had moved to the 3.0 version.
In the end though I ended up removing almost every use of secondary indices in my application, as I learned that they led to bad performance.
The reason is that in most cases they will result in fan-out queries that will contact every node of the cluster, with the corresponding costs.
There are still some cases where they can perform well, e.g. when you query by partition key at the same time, as no other nodes will be involved.
But for anything else, my advice is: consider if you can remove your secondary indices and do lookups in auxiliary tables instead. You'll have the burden of maintaining the tables in sync, but performance should be better.

Related

In Scylla DB, How can I query the records for multiple condition and without mentioning ALLOW FILTERING?

I have a table in ScyllaDB:
CREATE TABLE taxiservice.operatoragentsauditlog (
hourofyear int,
operationtime bigint,
action text,
actiontype text,
appname text,
entityid text,
entitytype text,
operatorid text,
operatoripaddress text,
operatorname text,
payload text,
PRIMARY KEY (hourofyear, operationtime)
) WITH CLUSTERING ORDER BY (operationtime DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'LeveledCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX auditactiontype ON taxiservice.operatoragentsauditlog (actiontype);
CREATE INDEX auditid ON taxiservice.operatoragentsauditlog (entityid);
CREATE INDEX agentid ON taxiservice.operatoragentsauditlog (operatorid);
CREATE INDEX auditaction ON taxiservice.operatoragentsauditlog (action);
I have return the query:
select * from taxiService.operatoragentsauditlog
where hourOfYear =3655
and actionType ='XYZ'
and operatorId in ('100','200') limit 500;
And Scylla throwing the issue like :
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance
unpredictability, use ALLOW FILTERING"
Here whatever I included column names in conditions are index's in the table, then also its throwing the above mentioned error.
How I can fetch the details without adding allow filtering in query.
All the Scylla Query Written with Allow Filters and I deployed changes in Production, then Server started throwing Service internal error(NoHostAvailableException) and its caused to fetch the data from scylla db.
How I can resolve the NoHostAvailableException In Scylla?
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:83) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58) ~[cassandra-driver-core-3.10.2.jar:?]
at com.datastax.driver.mapping.MethodMapper.invoke(MethodMapper.java:184) ~[cassandra-driver-mapping-3.10.2.jar:?]
at com.datastax.driver.mapping.AccessorInvocationHandler.invoke(AccessorInvocationHandler.java:67) ~[cassandra-driver-mapping-3.10.2.jar:?]
at com.sun.proxy.$Proxy161.getRideAuditLog(Unknown Source) ~[?:?]
at com.mycomany.myproduct.auditLog.AuditLogService.getRideAuditLog(AuditLogService.java:21) ~[taxiopconsoleservice-1.1.0.jar:?]
With distributed databases like Cassandra and Scylla, the idea is to build your tables to suit your queries. To that end, you could build another table and duplicate the data into it. In this new table, the primary key definition should look like this:
PRIMARY KEY (hourOfYear, actionType, operatorId)
That will support this query without the dreaded ALLOW FILTERING directive.
select * from taxiService.operatoragentsauditlog_by_hourofyear_and_actiontype
where hourOfYear =3655
and actionType ='XYZ'
and operatorId in ('100','200');
But as the original table is partitioned on hourOfYear, the query is restricted to a single partition. So even with ALLOW FILTERING it might not be that bad.
Your query
select * from taxiService.operatoragentsauditlog
where hourOfYear =3655
and actionType ='XYZ'
and operatorId in ('100','200') limit 500;
Will look at either the actionType or operatorId index to find all the rows matching the restriction on that column, and than need to go over all the candidate rows to check if they also match the two other restrictions. That's why you need ALLOW FILTERING. Theoretically, actionType = 'XYZ' may match a million rows, so this query will need to go over a million rows just to return a handful that match all the candiate rows.
Some search engines have an efficient way to intersect two index lookups - maybe actionType = 'XYZ' has a million matches, and operatorId in ('100', '200') has a million matches, but their intersection is just 10 rows. Search engines use a skip list mechanism to allow the intersection to be calculated efficiently. But Scylla doesn't have this feature. It will pick just one of the two indexes (you don't know which), and go over its matches one by one. By the way, please note that even if Scylla did support efficient index intersection, your hourOfYear = 3655 restriction isn't indexed, so would need row-by-row filtering anyway.
As Aaron noted in his answer, the solution can be to use ALLOW FILTERING if the one index match results in a small-enough number of matches (it would be better to have just one index, not two, so you'll know exactly which index is being used), or - change your table schema to something which better matches your queries. The latter is always good advice in many situations.
Regarding the NoHostAvailableException - it means the Scylla replicas failed to perform this query for some reason. It might indicate a bug, or a timeout (which would also be a bug, because a scan like your query should do paging - not time out). Please look at the Scylla log if there's an error message that appears at the time of this request, and report the problem in the Scylla bug tracker at https://github.com/scylladb/scylla/issues.

Cassandra read from large dataset

I need to get a count from a very large dataset in Cassandra, 100 million plus. I am worried about the memory hit cassandra would take if I just ran the following query.
select count(*) from conv_org where org_id = 'TEST_ORG'
I was told I could use cassandra Automatic Paging to do this? Does this seem like a good option?
Would the syntax look something like this?
Statement stmt = new SimpleStatement("select count(*) from conv_org where org_id = 'TEST_ORG'");
stmt.setFetchSize(1000);
ResultSet rs = session.execute(stmt);
I am unsure the above code will work as I do not need a result set back I just need a count.
Here is the data model.
CREATE TABLE ts.conv_org (
org_id text,
create_time timestamp,
test_id text,
org_type int,
PRIMARY KEY (org_id, create_time, conv_id)
)
If org_id isn't your primary key counting in cassandra in general is not a fast operation and can easily lead to a full scan of all sstables in your cluster and therefore be painfully slow.
In Java for example you can do something like this:
ResultSet rs = session.execute(...);
Iterator<Row> iter = rs.iterator();
while (iter.hasNext()) {
if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
rs.fetchMoreResults();
Row row = iter.next()
... process the row ...
}
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html
You could select a small colum and count your self. There is int getAvailableWithoutFetching() and isFullyFetched() that could help you.
In general if you really need a count - maintain it yourself.
On the other hand, if you have really many rows in one partition you can have also some other performance problems.
But that's hard to say without knowing the data model.
Maybe you want to use "counter table" in addition to your dataset.
Pros: get counter fast.
Cons: need to maintained that table.
Reference:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCountersConcept.html

Cassandra timeout during read query at consistency ONE

I have a problem with the cassandra db and hope somebody can help me. I have a table “log”. In the log table, I have inserted about 10000 rows. Everything works fine. I can do a
select * from
select count(*) from
As soon I insert 100000 rows with TTL 50, I receive a error with
select count(*) from
Version: cassandra 2.1.8, 2 nodes
Cassandra timeout during read query at consistency ONE (1 responses
were required but only 0 replica responded)
Has someone a idea what I am doing wrong?
CREATE TABLE test.log (
day text,
date timestamp,
ip text,
iid int,
request text,
src text,
tid int,
txt text,
PRIMARY KEY (day, date, ip)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
That error message indicates a problem with the READ operation. Most likely it is a READ timeout. You may need to update your Cassandra.yaml with a larger read timeout time as described in this SO answer.
Example for 200 seconds:
read_request_timeout_in_ms: 200000
If updating that does not work you may need to tweak the JVM settings for Cassandra. See DataStax's "Tuning Java Ops" for more information
count() is a very costly operation, imagine Cassandra need to scan all the row from all the node just to give you the count. In small amount of rows if works, but on bigger data, you should use another approaches to avoid timeout.
First of all, we have to retrieve row by row to count amount and forgot about count(*)
We should make a several(dozens, hundreds?) queries with filtering by partition and clustering key and summ amount of rows retrieved by each query.
Here is good explanation what is clustering and partition keys In your case day - is partition key, composite key consists from two columns: date and ip.
It most likely impossible to do it with cqlsh commandline client, so you should write a script by yourself. Official drivers for popular programming languages: http://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html
Example of one of such queries:
select day, date, ip, iid, request, src, tid, txt from test.log where day='Saturday' and date='2017-08-12 00:00:00' and ip='127.0 0.1'
Remarks:
If you need just to calculate count and nothing more, probably has a sense to google for tool like https://github.com/brianmhess/cassandra-count
If Cassandra refuses to run your query without ALLOW FILTERING that mean query is not efficient https://stackoverflow.com/a/38350839/2900229

CQL no viable alternative at input '(' error

I have a issue with my CQL and cassandra is giving me no viable alternative at input '(' (...WHERE id = ? if [(]...) error message. I think there is a problem with my statement.
UPDATE <TABLE> USING TTL 300
SET <attribute1> = 13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE <attribute2> = dfa2efb0-7247-11e5-a9e5-0242ac110003
IF (<attribute1> = null OR <attribute1> = 13381990-735b-11e5-9bed-2ae6d3dfc201) AND <attribute3> = 0;
Any idea were the problem is in the statement about?
It would help to have your complete table structure, so to test your statement I made a couple of educated guesses.
With this table:
CREATE TABLE lwtTest (attribute1 timeuuid, attribute2 timeuuid PRIMARY KEY, attribute3 int);
This statement works, as long as I don't add the lightweight transaction on the end:
UPDATE lwttest USING TTL 300 SET attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE attribute2=dfa2efb0-7247-11e5-a9e5-0242ac110003;
Your lightweight transaction...
IF (attribute1=null OR attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201) AND attribute3 = 0;
...has a few issues.
"null" in Cassandra is not similar (at all) to its RDBMS counterpart. Not every row needs to have a value for every column. Those CQL rows without values for certain column values in a table will show "null." But you cannot query by "null" since it isn't really there.
The OR keyword does not exist in CQL.
You cannot use extra parenthesis to separate conditions in your WHERE clause or your lightweight transaction.
Bearing those points in mind, the following UPDATE and lightweight transaction runs without error:
UPDATE lwttest USING TTL 300 SET attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE attribute2=dfa2efb0-7247-11e5-a9e5-0242ac110003
IF attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201 AND attribute3=0;
[applied]
-----------
False

Composite columns and "IN" relation in Cassandra

I have the following column family in Cassandra for storing time series data in a small number of very "wide" rows:
CREATE TABLE data_bucket (
day_of_year int,
minute_of_day int,
event_id int,
data ascii,
PRIMARY KEY (data_of_year, minute_of_day, event_id)
)
On the CQL shell, I am able to run a query such as this:
select * from data_bucket where day_of_year = 266 and minute_of_day = 244
and event_id in (4, 7, 11, 1990, 3433)
Essentially, I fix the value of the first component of the composite column name (minute_of_day) and want to select a non-contiguous set of columns based on the distinct values of the second component (event_id). Since the "IN" relation is interpreted as an equality relation, this works fine.
Now my question is, how would I accomplish the same type of composite column slicing programmatically and without CQL. So far I have tried the Python client pycassa and the Java client Astyanax, but without any success.
Any thoughts would be welcome.
EDIT:
I'm adding the describe output of the column family as seen through cassandra-cli. Since I am looking for a Thrift-based solution, maybe this will help.
ColumnFamily: data_bucket
Key Validation Class: org.apache.cassandra.db.marshal.Int32Type
Default column value validator: org.apache.cassandra.db.marshal.AsciiType
Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.Int32Type,org.apache.cassandra.db.marshal.Int32Type)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
There is no "IN"-type query in the Thrift API. You could perform a series of get queries for each composite column value (day_of_year, minute_of_day, event_id).
If your event_ids were sequential (and your question says they are not) you could perform a single get_slice query, passing in the range (e.g., day_of_year, minute_of_day, and range of event_ids). You could grab bunches of them in this way and filter the response programatically yourself (e.g., grab all data on the date with event ids between 4-3433). More data transfer, more processing on the client side so not a great option unless you really are looking for a range.
So, if you want to use "IN" with Cassandra you will need to switch to a CQL-based solution. If you are considering using CQL in python another option is cassandra-dbapi2. This worked for me:
import cql
# Replace settings as appropriate
host = 'localhost'
port = 9160
keyspace = 'keyspace_name'
# Connect
connection = cql.connect(host, port, keyspace, cql_version='3.0.1')
cursor = connection.cursor()
print "connected!"
# Execute CQL
cursor.execute("select * from data_bucket where day_of_year = 266 and minute_of_day = 244 and event_id in (4, 7, 11, 1990, 3433)")
for row in cursor:
print str(row) # Do something with your data
# Shut the connection
cursor.close()
connection.close()
(Tested with Cassandra 2.0.1.)

Resources