Cassandra - do not upsert

We have a requirement where we would like our application (which might be deployed on multiple hosts) to create a row in Cassandra, and only the host that succeeds in creating the row executes the work. Would it be enough to write an insert statement like the one below, so that if two servers try to insert the row, only one succeeds and the other one gets an exception/does not succeed?
INSERT INTO keyspace1.claim (claim_id, status) VALUES (1, false) IF NOT EXISTS;
I would like to understand whether using IF NOT EXISTS will avoid the upsert.
Thanks,
Shilpa

Yes. IF NOT EXISTS will include a Paxos round and a read-before-write, though, so it is much, much slower. Check the ResultSet of the insert with wasApplied() to tell whether it took or not.
https://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
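For illustration, a minimal sketch with the DataStax Java driver (the contact point is an assumption, not part of the original question):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class ClaimRow {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Lightweight transaction: only one concurrent writer "wins" the row.
            ResultSet rs = session.execute(
                "INSERT INTO keyspace1.claim (claim_id, status) VALUES (1, false) IF NOT EXISTS");
            if (rs.wasApplied()) {
                // This host created the row, so it should execute the work.
            } else {
                // Another host claimed the row first; skip the work.
            }
        }
    }
}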

Related

DSE Cassandra 3.x delete operation

I have a table with a PRIMARY KEY of ( (A,B), C)
Partition key (A,B)
Clustering key C
My question is related to deleting from this table.
Is it efficient to use the IN clause when deleting, or to issue multiple delete statements using the equality operator?
delete from table where A=xx and B IN ('a','b','c');
-OR-
delete from table where A=xx and B='a';
delete from table where A=xx and B='b';
delete from table where A=xx and B='c';
Is there any harm in using the IN operator as in the 1st delete statement?
There may be up to around 20 deletes in total (or 20 items in the IN clause).
Thanks in advance for all your help!
With a few small exceptions, it's almost always better to use the 2nd option: multiple deletes issued asynchronously. The coordinator of the IN-clause query will be put under a lot of load, while the latter spreads the load evenly. Also, with a TokenAware load balancing policy, the requests will go directly to the correct replicas and can complete pretty quickly. If you are doing hundreds or more of the deletes, you might want to use a Semaphore or something similar to limit the number of in-flight deletes, just to prevent overloading the cluster.
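For illustration, a minimal sketch of that pattern with the DataStax Java driver and Guava (the keyspace, table, and permit count are made-up names; the question's A and B are assumed to be text columns):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Semaphore;

public class AsyncDeletes {
    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            PreparedStatement delete =
                session.prepare("DELETE FROM my_table WHERE a = ? AND b = ?");
            // Cap the number of in-flight deletes so we don't overload the cluster.
            Semaphore inFlight = new Semaphore(128);
            List<String> bValues = Arrays.asList("a", "b", "c");
            for (String b : bValues) {
                inFlight.acquire(); // blocks once 128 deletes are outstanding
                ResultSetFuture f = session.executeAsync(delete.bind("xx", b));
                Futures.addCallback(f, new FutureCallback<ResultSet>() {
                    public void onSuccess(ResultSet rs) { inFlight.release(); }
                    public void onFailure(Throwable t) { inFlight.release(); }
                }, MoreExecutors.directExecutor());
            }
            inFlight.acquire(128); // wait for all deletes to finish before closing
        }
    }
}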
It depends on the needs of your application. If the delete operations are expected to be fast, then you'll probably want to run each one explicitly (second option).
On the other hand, if the delete runs as part of a batch or cleanup job, and nobody really cares how long it takes, then you could probably get away with using IN. The trick there would be keeping it from timing out (and, as Chris indicated, putting undue load on the node). It might make sense to break down your groups of values for column B, to keep those small. While 20 list items with IN isn't the most I've heard of someone trying, it's definitely more than I would ever use personally (I'd try to keep it smaller than 10).
Essentially, using the IN operator with a DELETE is going to be susceptible to performance issues just like it would be on a SELECT, as described in this answer (included here for reference):
Is the IN relation in Cassandra bad for queries?

Cassandra Prepared Statement and adding new columns

We are using cached PreparedStatement for queries to DataStax Cassandra. But if we need to add new columns to a table, we need to restart our application server to recache the prepared statement.
I came across this bug for the DataStax Java driver, which explains a workaround:
https://datastax-oss.atlassian.net/browse/JAVA-420
It basically gives a workaround: don't use "SELECT * FROM table" in the query; use "SELECT column_names FROM table" instead.
But now we came across the same issue with Delete statements. After adding a new column to a table, the Delete prepared statement does not delete a record.
I don't think we can use the same workaround as mentioned in the ticket for the Select statement, as * or column names don't make sense when deleting a row.
Any help would be appreciated. We basically want to avoid having to restart our application server for any additions to database tables.
An easy solution that requires a little bit of coding: use JMX.
Let me explain.
In your application code, keep a cache (you can use the Guava cache implementation, for example) of all prepared statements. The key to access the cache can be, for example, the query string.
Now, expose a JMX method to clear the cache and force the application to re-prepare the queries.
Every time you update the schema, just call the appropriate method(s) to clear the cache; you don't need to restart your application.
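A minimal sketch of that idea, assuming Guava and the DataStax Java driver (the class names and ObjectName are made up):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// StatementCacheMBean.java: JMX standard MBeans need a public interface.
public interface StatementCacheMBean {
    void clearCache();
}

// StatementCache.java
public class StatementCache implements StatementCacheMBean {
    private final LoadingCache<String, PreparedStatement> cache;

    public StatementCache(final Session session) {
        // Keyed by query string; each statement is prepared on first access.
        this.cache = CacheBuilder.newBuilder().build(
            new CacheLoader<String, PreparedStatement>() {
                @Override
                public PreparedStatement load(String query) {
                    return session.prepare(query);
                }
            });
    }

    public PreparedStatement get(String query) {
        return cache.getUnchecked(query);
    }

    @Override
    public void clearCache() {
        cache.invalidateAll(); // next get() re-prepares against the new schema
    }

    public void registerMBean() throws Exception {
        ManagementFactory.getPlatformMBeanServer().registerMBean(
            this, new ObjectName("com.example:type=StatementCache"));
    }
}

After a schema change, invoke clearCache() from any JMX client (jconsole, for example), and the next lookup re-prepares the statement with the new metadata.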

Cassandra PreparedStatement vs normal insert

I'm using Cassandra for my project and I was facing a timeout issue during writes, the same one described in this post: Cassandra cluster with bad insert performance and insert stability (at the moment I'm testing with only one node, the Java Driver, and the latest release of Cassandra). The application has to insert a huge quantity of data per user once per day (during the night). I have a REST controller that accepts files and then processes them in parallel as they arrive, to insert values into Cassandra. I have to insert 1 million entries per user, where an entry has up to 8 values (time is not so important; it can even take 10 minutes). Following the answer provided in Cassandra cluster with bad insert performance and insert stability, I decided to add executeAsync(), a Semaphore, and PreparedStatement to my application, whereas previously I was using none of them.
The problem now is that, using variable keyspaces (one per user) and needing to update lists in the database, I can't initialize my PreparedStatements in the initialization phase; I have to do it at least once per file processed (one file contains 10k+ entries), and a user can upload up to 100 files per day. For this reason, I'm getting this warning:
Re-preparing already prepared query INSERT INTO c2bdd9f7073dce28ed973238ac85b6e5d6162fce.sensorMonitoringLog (timestamp, sensorId, isLogging) VALUES (?, ?, ?). Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
My question is: is it good practice to use PreparedStatement like this, or is it better to use a normal insert with executeAsync()?
Thank you
If you are facing a timeout issue during writes, it is a good idea to use PreparedStatement, but not to use asynchronous inserts. Timeouts are a way to prevent Cassandra from being overloaded with work; with asynchronous inserts you are giving it more work at the same time, and the risk of an OOM grows.
To do things properly with PreparedStatement, you have to create one and only one Session object per keyspace. Then each session must prepare its own statements once.
Moreover, be aware that there is a thread-safety risk with PreparedStatement and asynchronous execution: preparing a statement must be synchronized. But once again, I advise you not to use executeAsync() in such a case.
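A minimal sketch of the one-Session-per-keyspace idea with the DataStax Java driver (the table comes from the warning in the question; the class name is made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerKeyspaceSessions {
    private final Cluster cluster;
    private final Map<String, Session> sessions = new ConcurrentHashMap<>();
    private final Map<String, PreparedStatement> inserts = new ConcurrentHashMap<>();

    public PerKeyspaceSessions(Cluster cluster) {
        this.cluster = cluster;
    }

    // computeIfAbsent synchronizes the prepare() call per keyspace, so each
    // statement is prepared exactly once even when files are processed in parallel.
    public PreparedStatement insertFor(String keyspace) {
        Session session = sessions.computeIfAbsent(keyspace, cluster::connect);
        return inserts.computeIfAbsent(keyspace, ks -> session.prepare(
            "INSERT INTO sensorMonitoringLog (timestamp, sensorId, isLogging) VALUES (?, ?, ?)"));
    }
}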

MemSQL - why can't I do a cross-database insert into .. select

I'm trying to do a simple insert with a field list from a table in one database to a table in another.
insert into db_a.target_table (field1,field2,field3) select field1,field2,field3 from db_b.source_table;
The error message seems straightforward:
MemSQL does not support this type of query: Cross-database INSERT ... SELECT
Oddly enough, this example does work:
insert into db_a.target_table select * from db_b.source_table;
But this seems like such a common scenario. Has anyone run into a similar issue, and were you able to work around it?
Unfortunately, this isn't allowed because it is difficult to keep such queries transactional; multi-statement transactions are used internally to guarantee the transactionality of the single insert-select (if one partition fails (dup key or something), we want to roll back everything!). Since we don't have cross-db multi-statement transactions (yet!), we don't have cross-db insert-select (yet!).
Stay tuned for nicer solutions.
However, if you REAAALY want to do this, here is what you do. But:
PROCEED AT YOUR OWN RISK. THIS IS NOT A SUPPORTED PROCEDURE.
It should work, though.
1) On db_b, create a table with the same columns as source_table, but make the shard key SHARD().
2) On db_a, run SHOW PARTITIONS.
3) For each of those partitions, create a connection to db_a_<ordinal> on the host and port listed in SHOW PARTITIONS. Run SHOW DATABASES on that connection and you'll see some databases called db_b_<another>. Pick one; it doesn't matter which. Run INSERT INTO db_b_<another>.source_table SELECT * FROM db_a_<ordinal>.source_table.
3.5) At this point, you haven't yet written to a table you care about, but now we will. Look at db_b.source_table. Is everything correct? Is all the data there? Run SHOW CREATE TABLE and double check the shard key is SHARD KEY () (it should be in comments). Everything look good? Ok, we can proceed.
4) After you're done doing this for EVERY partition, you can do INSERT INTO db_b.target_table (cols) SELECT cols from db_b.source_table, or whatever you want.
Good luck!
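If you want to script that per-partition loop, here is a rough JDBC sketch (MemSQL speaks the MySQL wire protocol); the host, credentials, and the SHOW PARTITIONS column names are assumptions to verify against your own cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CrossDbCopySketch {
    public static void main(String[] args) throws SQLException {
        // Step 2: ask the aggregator for db_a's partitions.
        try (Connection agg = DriverManager.getConnection(
                "jdbc:mysql://aggregator-host:3306/db_a", "root", "");
             Statement stmt = agg.createStatement();
             ResultSet parts = stmt.executeQuery("SHOW PARTITIONS")) {
            while (parts.next()) {
                int ordinal = parts.getInt("Ordinal");   // assumed column names
                String host = parts.getString("Host");
                int port = parts.getInt("Port");
                // Step 3: connect straight to the leaf holding db_a_<ordinal>.
                try (Connection leaf = DriverManager.getConnection(
                        "jdbc:mysql://" + host + ":" + port + "/", "root", "");
                     Statement s = leaf.createStatement()) {
                    // Pick a local db_b_<another> via SHOW DATABASES, then run:
                    // s.executeUpdate("INSERT INTO db_b_<another>.source_table "
                    //         + "SELECT * FROM db_a_" + ordinal + ".source_table");
                }
            }
        }
    }
}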

First preparedStatement using Cassandra always slow

I noticed that if I have a Java method in which I have a PreparedStatement using the JDBC driver that comes with Cassandra, it is always slow. But if I put the same query twice in the method, the second time it is 20x faster. Why is that? I would think the second, third, and fourth time I call the Java method it would be faster than the first. I am using Cassandra 1.2.5. I have also cached 100MB of rows in the row cache and set the table to caching = "all". In cassandra-cli I verified the settings. And in cassandra-cli I verified that the second, third, and fourth time I get the rows from the same table I run the JDBC calls against, I get faster response times.
Any Ideas?
Thanks,
-Tony
From the all-knowing CQL3 documentation (always a great starting point, btw):
Prepared statement is an optimization that allows to parse a query only once but execute it multiple times with different concrete values.
The statement gets cached. This is the difference maker you are experiencing. Also prepared statements get pre-compiled, typically meaning an execution plan is prepared before the query is run against the db. Knowing what you are doing makes the process faster.
At the first run your prepared statement is cached in case you run the same query again, which you do, and since it's cached the query will execute much faster.
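As an illustration (using the separate DataStax Java driver rather than the JDBC wrapper from the question; the keyspace, table, and column are made up), the usual pattern is to pay the prepare cost once up front:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PrepareOnce {
    private final Session session;
    private final PreparedStatement select;

    public PrepareOnce(Session session) {
        this.session = session;
        // The parse/prepare round trip happens once, at startup...
        this.select = session.prepare("SELECT * FROM my_keyspace.my_table WHERE id = ?");
    }

    // ...so each call only ships the statement id plus the bound value.
    public Row findById(int id) {
        return session.execute(select.bind(id)).one();
    }
}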
