Got the following info log:
Query SELECT column1, column2 FROM table_name WHERE productId IN ('column1_value') is not prepared on /<example-ip-address>:9042, preparing before retrying executing. Seeing this message a few times is fine, but seeing it a lot may be source of performance problems
Any suggestions?
This happens when you send a bound statement whose prepared query ID is unknown to the given node. A node may not "know" the query for several reasons:
The node crashed and lost its cache of prepared statements
The prepared statement was evicted from the cache (you should see a corresponding message server-side). This may happen if somebody prepares too many queries
Options were set to not prepare queries on all nodes (see the documentation)
You need to check these hypotheses before continuing. I suspect the 2nd item: this can happen when people prepare "literal" queries that don't have bind markers, so every distinct value produces a new cache entry (see the sketch below).
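A minimal sketch (DataStax Java driver 3.x and an existing Session assumed) contrasting the anti-pattern with a bind-marker version; the table and column names are taken from the log message above:

```java
// Anti-pattern: preparing a "literal" query per value creates a new server-side
// cache entry for every distinct value and can evict other prepared statements.
// session.prepare("SELECT column1, column2 FROM table_name WHERE productId IN ('column1_value')");

// Preferred: prepare once with a bind marker, then reuse the statement for every value.
PreparedStatement ps = session.prepare(
        "SELECT column1, column2 FROM table_name WHERE productId = ?");
ResultSet rs = session.execute(ps.bind("column1_value"));
```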
Related
I'm using the Cassandra Java driver with a fetch size set to 1k. I need to query all records in a table and perform some time-consuming action for every row.
What will happen if I keep the ResultSet open (not fully iterated) for one day?
What I don't care about:
consistency. If some new record is written in the meantime, I'm OK with fetching it; however, I'm also fine if I don't get it
fault tolerance. If some node fails during that process, I'm fine with the query failing too. However, I would like to detect that from the client's perspective.
What I care about:
Cassandra resource utilization - I don't want to cause a cluster outage due to blocked resources
latency - I don't want to block (or significantly slow down) the cluster for other consumers of that table
I would like to get all records that existed when I started the query (assuming no deletions); however, they don't have to be up to date
The paging state is the information about the last data read (literally the serialized partition key, clustering columns, and the remaining count). When it is sent to the coordinator, the coordinator looks for everything greater than that. So no server-side resources are held for this, and there is no performance impact versus a normal read.
Cassandra does not have any feature that provides isolation, even within a single query. If the data has changed between one page request and the next, you will get the up-to-date information.
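A minimal sketch (DataStax Java driver 3.x and an existing Session assumed) of paging through the table with a fetch size of 1000; process(row) is a hypothetical placeholder for the time-consuming per-row work:

```java
// The driver fetches one page (1000 rows) at a time; the coordinator keeps no
// state between pages, so a slow consumer does not pin server resources.
Statement stmt = new SimpleStatement("SELECT * FROM table_name").setFetchSize(1000);
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    process(row); // may take a long time; the next page is requested only when needed
}

// To stop and resume later, capture the paging state and resend it with the statement:
// PagingState state = rs.getExecutionInfo().getPagingState();
// stmt.setPagingState(state);
```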
I'm running Cassandra 2.2.1 on a 3-node cluster at RF=3. If I perform simple deletes at QUORUM on a bunch of entries, verifying the results via a select at QUORUM reveals that some entries that should have been deleted persist in the table. The delete queries, which were issued through the Java driver, completed successfully without exception. I also use a retry policy to handle failed deletes/writes, but the policy for these failures is never invoked because they 'succeed'. I can reproduce the problem 100% of the time; it usually starts happening after I've issued around 100 deletes into the table. I understand how tombstones and the gc grace period work, and this is not a situation of resurrected deletes. I read somewhere that it could be an NTP issue, but all 3 nodes sync to the same clock and there's no drift as far as I can tell. I can share logs or anything else required to root-cause. Thanks!
Update:
I resolved the problem, and it seems to be a weird race condition that appears to be either time related or sequence related. If there is some time drift between nodes, it could be possible for the delete to be ignored if it was issued ahead of the insert from a tagged-timestamp perspective.
E.g.:
- insert is issued by node 1 at T1 (timestamp of node 1)
- delete comes into the system via node 3 but is tagged with timestamp T0
- the system concludes that the insert occurred later, so it ignores the delete
This gives the illusion that the delete executed ahead of the insert, depending on the timestamps sent by the respective nodes.
Allowing sufficient time between insert and delete resolved my issue, although I'm not quite sure what the real root cause was.
Another option is to enable client-side timestamps (instead of server-side timestamps, which is what you currently have).
If the same client issues the insert/update/delete, this ensures that the timestamps will be in line with the order of operation invocation.
Using client-side timestamps removes the need to allow "sufficient time" between insert/update and delete.
Please note that a correct timestamp is also needed for cases in which two consecutive writes update the same "key" (and these bugs are harder to detect :( ). Client-side timestamps resolve such issues as well (given that the same client issues the requests).
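A minimal sketch (DataStax Java driver 3.x assumed, where this generator class exists) of enabling client-side timestamps when building the Cluster; the contact point is a placeholder:

```java
import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
import com.datastax.driver.core.Cluster;

// The driver stamps every request itself, in invocation order, instead of
// letting each coordinator assign its own (possibly drifting) server-side time.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1") // replace with your contact point
        .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
        .build();
```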
How much time do you have between the delete and the select? As Cassandra has an "eventually consistent" behaviour, adding a delay between the delete and the select may solve the issue.
I am attempting to add a field to a user-defined type in Cassandra 2.1.2, using the nodejs driver from DataStax. I added the field using ALTER TYPE in cqlsh. When I attempt to add a row containing the UDT with a value for the new field, it gets inserted with a null value instead of the value I supplied. I strongly suspect this has to do with the way the cluster is caching the prepared statement. Because I recall reading that prepared statements are indexed by a hash of the query, I tried changing some whitespace in the query to see if it helped. This actually seemed to work, but only once. Subsequent inserts result in this error:
message: 'Operation timed out - received only 0 responses.',
info: 'Represents an error message from the server',
code: 4352,
consistencies: 10,
received: 0,
blockFor: 1,
writeType: 'SIMPLE',
coordinator: '127.0.0.1:9042',
and it would seem the new rows are not added... until I restart Cassandra, at which point not only do the inserts that I thought had failed show up, but subsequent ones work fine. This is very disconcerting, but fortunately I have only done this in test instances. I do need to make this change in production, however, and restarting the cluster to add a single field is not really an option. Is there a better way to get the cluster to evict the cached prepared statement?
I strongly suspect this has to do with the way the cluster is caching the prepared statement.
Put the Cassandra log in DEBUG mode to be sure the prepared statement cache is the root cause. If it is, create a JIRA so the dev team can fix it...
Optionally you can also enable tracing to see what is going on server-side
To enable tracing in cqlsh, just type TRACING ON
To enable tracing with the Java driver, just call enableTracing() on the statement object
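A minimal sketch (DataStax Java driver 3.x and an existing Session assumed) of enabling tracing on a statement and reading the trace back afterwards; my_table is a placeholder:

```java
// Tracing is enabled per statement; the trace is retrieved from the execution info.
Statement stmt = new SimpleStatement("SELECT * FROM my_table").enableTracing();
ResultSet rs = session.execute(stmt);
QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
System.out.println("Request traced in " + trace.getDurationMicros() + " µs");
```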
I'm using Cassandra for my project and I was facing a timeout issue during writes, the same one described in the post Cassandra cluster with bad insert performance and insert stability (at the moment I'm testing with only one node, the Java driver, and the latest release of Cassandra). The application has to insert a huge quantity of data per user once per day (during the night). I have a REST controller that accepts files and then processes them in parallel as they arrive, inserting the values into Cassandra. I have to insert 1 million entries per user, where an entry has up to 8 values (time is not so important; it can also take 10 minutes). Following the answer provided in Cassandra cluster with bad insert performance and insert stability, I decided to add executeAsync(), a Semaphore, and PreparedStatement to my application, whereas previously I was using none of them.
The problem now is that, using variable keyspaces (one per user) and having the necessity to update lists in the database, I can't initialize my PreparedStatements in the initialization phase; I have to do it at least once per file processed (one file contains 10k+ entries), and a user may upload up to 100 files per day. For this reason, I'm getting this warning:
Re-preparing already prepared query INSERT INTO c2bdd9f7073dce28ed973238ac85b6e5d6162fce.sensorMonitoringLog (timestamp, sensorId, isLogging) VALUES (?, ?, ?). Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
My question is: is it good practice to use PreparedStatement like this, or is it better to use a normal insert with executeAsync()?
Thank you
If you are facing a timeout issue during writes, it is a good idea to use PreparedStatement but not asynchronous inserts. Timeouts are a way for Cassandra to protect itself from work overload; with asynchronous execution you are giving it more work at the same time, and the risk of OOM grows.
To do things properly with PreparedStatement, you have to create one and only one Session object per keyspace. Then each session must prepare its own statement once, as sketched below.
Moreover, be aware that there is a thread-safety risk with PreparedStatement and asynchronous execution: preparing a statement must be synchronized. But once again, I advise you not to use executeAsync() in such a case.
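A minimal sketch (DataStax Java driver 3.x assumed) of preparing each INSERT exactly once per keyspace; the table and column names are taken from the warning above, and StatementCache is a hypothetical helper, not a driver class:

```java
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.concurrent.ConcurrentHashMap;

public class StatementCache {
    private final Session session;
    private final ConcurrentHashMap<String, PreparedStatement> cache = new ConcurrentHashMap<>();

    public StatementCache(Session session) {
        this.session = session;
    }

    // computeIfAbsent prepares the statement only once per keyspace, even when
    // many files are processed concurrently, so the re-prepare warning goes away.
    public PreparedStatement insertFor(String keyspace) {
        return cache.computeIfAbsent(keyspace, ks -> session.prepare(
                "INSERT INTO " + ks + ".sensorMonitoringLog (timestamp, sensorId, isLogging)"
                        + " VALUES (?, ?, ?)"));
    }
}
```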
I noticed that if I have a Java method in which I have a PreparedStatement using the JDBC driver that comes with Cassandra, it is always slow. But if I put the same query twice in the method, the second time it is 20x faster. Why is that? I would think the second, third, and fourth time I call the Java method it would be faster than the first. I am using Cassandra 1.2.5. I have also cached 100MB of rows in the row cache and set the table to caching = "all". In cassandra-cli I verified the settings, and in cassandra-cli I also verified that the second, third, and fourth time I get the rows from the same table I make the JDBC calls against, I get faster response times.
Any ideas?
Thanks,
-Tony
From the all-knowing CQL3 documentation (always a great starting point, btw):
Prepared statement is an optimization that allows to parse a query only once but execute it multiple times with different concrete values.
The statement gets cached. This is the difference-maker you are experiencing. Prepared statements also get pre-compiled, typically meaning an execution plan is prepared before the query is run against the DB. Knowing in advance what it has to do makes the process faster.
On the first run your prepared statement is cached in case you run the same query again, which you do; and since it's cached, the query executes much faster.
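A minimal sketch of the prepare-once/execute-many pattern the documentation describes, shown with the DataStax Java driver (3.x and an existing Session assumed) rather than the JDBC wrapper from the question; my_table and its integer id column are hypothetical:

```java
// The parse/pre-compile cost is paid on the prepare() call only.
PreparedStatement ps = session.prepare("SELECT * FROM my_table WHERE id = ?");

// Every subsequent execution just binds new values; no re-parsing happens here,
// which is why repeated calls are much faster than the first one.
for (int id = 0; id < 1000; id++) {
    ResultSet rs = session.execute(ps.bind(id));
}
```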