Cassandra DataStax driver prepared statement evaluates now() only once

I'm using Cassandra 1.2.4 and version 1.0.0 of the DataStax Java driver (through Clojure's Alia, but I don't think that matters here). If I prepare a statement with a timeuuid column and put now() in the values for the timeuuid, now() gets evaluated once, at the time the prepared statement is compiled, and then never again.
Imagine this prepared statement: "insert into some_table (id,time) values (?,now())"
After you prepare it, now() will always be the same every time you execute it, regardless of when you execute it.
You can sort of work around this using the minTimeuuid/maxTimeuuid functions, but then you lose the benefit of the uniqueness of timeuuid values. Effectively, I think this means that I can't insert a timeuuid using a prepared statement.
Am I missing something? Or is this a missing feature?

This sounds like a bug/shortcoming on the Cassandra side. Alternatively, you could pass a UUID instance as a value to the prepared statement:
insert into some_table (id,time) values (?,?)
and then bind either a UUID instance created from the UUIDs class in the Java driver (http://www.datastax.com/drivers/java/apidocs/com/datastax/driver/core/utils/UUIDs.html#timeBased()) or a com.eaio.uuid.UUID instance. You can create the latter from a Clojure wrapper, mpenet/tardis (it's in the dev/test dependencies of Alia already), using (unique-time-uuid (java.util.Date.)).
I might wrap UUIDs in Alia directly too; tardis is more flexible in some ways, but the former is the official thing.
https://github.com/mpenet/tardis
I have confirmed with the Cassandra developers that this is something that will/can be improved; there is an issue about this here if you want to track its progress: https://issues.apache.org/jira/browse/CASSANDRA-5616
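For what it's worth, here is a minimal sketch of that workaround using the DataStax Java driver's UUIDs helper directly (the contact point, keyspace, and the assumption that id is also a uuid column are mine, not from the question):

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class TimeuuidInsert {
    public static void main(String[] args) {
        // Assumed connection details; adjust to your cluster and keyspace.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Bind both columns so nothing gets "frozen" at prepare time.
        PreparedStatement ps = session.prepare(
                "insert into some_table (id, time) values (?, ?)");

        UUID id = UUIDs.random();      // assuming id is a uuid column; adapt to your schema
        UUID time = UUIDs.timeBased(); // a fresh timeuuid generated for every execution
        session.execute(ps.bind(id, time));

        cluster.close();
    }
}

Each execution binds a newly generated timeuuid, so you keep uniqueness without relying on now() being evaluated server side.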

Related

When can read-your-own-writes fail?

I use RL/WL=QUORUM (QUORUM for both reads and writes) and send two updates; is it possible that, in some circumstances, the next SELECT reads my first update?
CREATE TABLE aggr(
id int,
mysum int,
PRIMARY KEY(id)
)
INSERT INTO aggr(id, mysum) VALUES(1, 2)
INSERT INTO aggr(id, mysum) VALUES(1, 3)
SELECT mysum FROM aggr WHERE id=1 -- expect mysum=3 here, but is it a must?
As far as I can judge from here, it is possible to even lose part of the second update if the two updates arrive with the same timestamp.
If I work around the timestamp problem, can I be sure that I always read what I wrote last?
No, assuming you're using client-side monotonic timestamps (the current default; it wasn't in the past), that cannot happen. But it is possible with other settings. I am assuming here that it's a single client issuing those two writes; if the two inserts come from two different servers, it all depends on their timestamps.
Client-side timestamps are the default for the Java driver 3.x, but if you are using a version of Cassandra pre-CQL3 (2.0) you need to provide them yourself with USING TIMESTAMP in your query, since the protocol didn't support them. Otherwise the two writes can go to different coordinators, and if the coordinators have clock drift between them, the first insert may be considered "newer" than the second. With client-side timestamps (which should be the default on your driver if you are using a recent version) that's not the case.
If you do your updates synchronously with CL=QUORUM the second update will always overwrite the first one. A lower consistency level on any of the requests would not guarantee this.
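To make that concrete, here is a hedged sketch against the Java driver 3.x that sets the client-side monotonic timestamp generator and QUORUM consistency explicitly (the contact point and keyspace are assumptions; the generator shown is already the 3.x default):

import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class ReadYourOwnWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // assumed contact point
                // Client-side monotonic timestamps: two writes from this client
                // can never tie or go backwards, even across coordinators.
                .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        Session session = cluster.connect("my_keyspace"); // assumed keyspace

        // Two synchronous QUORUM writes from the same client: the second one
        // gets a strictly larger timestamp, so the QUORUM read below sees 3.
        session.execute("INSERT INTO aggr(id, mysum) VALUES (1, 2)");
        session.execute("INSERT INTO aggr(id, mysum) VALUES (1, 3)");

        int mysum = session.execute("SELECT mysum FROM aggr WHERE id = 1").one().getInt("mysum");
        System.out.println(mysum); // 3

        cluster.close();
    }
}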

Cassandra - how to update a decimal column by adding to the existing value

I have a Cassandra table that looks like the following:
create table position_snapshots_by_security(
securityCode text,
portfolioId int,
lastUpdated date,
units decimal,
primary key((securityCode), portfolioId)
)
And I would like to do something like this:
update position_snapshots_by_security
set units = units + 12.3,
lastUpdated = '2017-03-02'
where securityCode = 'SPY'
and portfolioId = '5dfxa2561db9'
But it doesn't work.
Is it possible to do this kind of operation in Cassandra? I am using version 3.10, the latest one.
Thank you!
J
This is not possible in Cassandra (any version) because it would require a read-before-write (anti-)pattern.
You can try counter columns if they suit your needs. You could also try caching/counting at the application level.
Otherwise you need to issue a read at the application level, which will hurt your cluster's performance.
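If counters do fit your use case, keep in mind that they only accept integer increments, so a decimal like 12.3 has to be scaled to a fixed smallest unit and kept in a dedicated counter table (counter tables cannot mix counters with regular columns such as lastUpdated). A rough Java driver sketch, where the table name, the milli-unit scaling, and the connection details are all assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CounterUnits {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); // assumed
        Session session = cluster.connect("my_keyspace");                         // assumed

        // Counters only hold integers, so store units scaled to thousandths.
        session.execute("CREATE TABLE IF NOT EXISTS position_units_counter ("
                + " securityCode text, portfolioId int, units_milli counter,"
                + " PRIMARY KEY ((securityCode), portfolioId))");

        // "+ 12.3 units" becomes "+ 12300 milli-units"; no read-before-write needed.
        session.execute("UPDATE position_units_counter"
                + " SET units_milli = units_milli + 12300"
                + " WHERE securityCode = 'SPY' AND portfolioId = 1");

        cluster.close();
    }
}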
Cassandra doesn't do a read before a write (except when using Lightweight Transactions) so it doesn't support operations like the one you're trying to do which rely on the existing value of a column. With that said, it's still possible to do this in your application code with Cassandra. If you'll have multiple writers possibly updating this value, you'll want to use the aforementioned LWT to make sure the value is accurate and multiple writers don't "step on" each other. Basically, the steps you'll want to follow to do that are:
Read the current value from Cassandra using a SELECT. Make sure you're doing the read with a consistency level of SERIAL or LOCAL_SERIAL if you're using LWTs.
Do the calculation to add to the current value in your application code.
Update the value in Cassandra with an UPDATE statement. If using a LWT you'll want to do UPDATE ... IF value = previous_value_you_read.
If using LWTs, the UPDATE will be rejected if the previous value that you read changed while you were doing the calculation. (And you can retry the whole series of steps again.) Keep in mind that LWTs are expensive operations, particularly if the keys you are reading/updating are heavily contended.
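A hedged sketch of those three steps with the Java driver 3.x, using the table from the question (the contact point, keyspace, hard-coded key values, and the assumption that the row already exists are mine):

import java.math.BigDecimal;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class AddToUnits {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); // assumed
        Session session = cluster.connect("my_keyspace");                         // assumed

        // 1. Read the current value with SERIAL consistency so any in-flight
        //    LWT writes are observed.
        Statement read = new SimpleStatement(
                "SELECT units FROM position_snapshots_by_security"
                + " WHERE securityCode = 'SPY' AND portfolioId = 1")
                .setConsistencyLevel(ConsistencyLevel.SERIAL);
        BigDecimal current = session.execute(read).one().getDecimal("units");

        // 2. Do the addition in application code.
        BigDecimal updated = current.add(new BigDecimal("12.3"));

        // 3. Write it back conditionally; the IF clause turns the update into
        //    an LWT that is rejected if someone changed units in the meantime.
        PreparedStatement ps = session.prepare(
                "UPDATE position_snapshots_by_security"
                + " SET units = ?, lastUpdated = '2017-03-02'"
                + " WHERE securityCode = 'SPY' AND portfolioId = 1"
                + " IF units = ?");
        Row result = session.execute(ps.bind(updated, current)).one();
        if (!result.getBool("[applied]")) {
            // Lost the race: re-read and retry the whole read-compute-update loop.
        }

        cluster.close();
    }
}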
Hope that helps!

Is it possible to specify the WRITETIME in a Cassandra INSERT command?

I am having a problem where a few INSERT commands are viewed as being sent simultaneously on the Cassandra side, when my code clearly does not send them simultaneously. (The problem happens when there is a little congestion on the network; otherwise everything works just fine.)
What I think would solve this problem is a way for me to specify the WRITETIME myself. From what I recall, that was possible in Thrift, but maybe not (we could certainly read it, at least).
So something like this (analogous to the TTL syntax):
INSERT INTO table_name (a, b, c) VALUES (1, 2, 3) USING WRITETIME = 123;
The problem I'm facing is that I overwrite the same data, and once in a while the update is ignored because it ends up with the same or even an older timestamp (probably because it is sent to a different node whose clock differs slightly, and since the C++ process uses threads, requests can be sent earlier or later without my control...)
The magic syntax you're looking for is:
INSERT INTO tbl (col1, col2) VALUES (1,2) USING TIMESTAMP 123456789000
Be very cautious using this approach - make sure you use the right units (microseconds, typically).
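To make the units point concrete, here is a small Java driver sketch (the table and column names come from the answer above; the connection details are assumptions) showing both the CQL form and the equivalent per-statement client-side timestamp:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ExplicitWriteTimestamp {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); // assumed
        Session session = cluster.connect("my_keyspace");                         // assumed

        // Cassandra write timestamps are microseconds since the epoch, so
        // convert from the JVM's milliseconds before using them.
        long nowMicros = System.currentTimeMillis() * 1000L;

        // Option 1: spell the timestamp out in the CQL text.
        session.execute("INSERT INTO tbl (col1, col2) VALUES (1, 2)"
                + " USING TIMESTAMP " + nowMicros);

        // Option 2: set it per statement on the client side (native protocol v3+),
        // which leaves the CQL text untouched.
        SimpleStatement stmt = new SimpleStatement("INSERT INTO tbl (col1, col2) VALUES (1, 2)");
        stmt.setDefaultTimestamp(nowMicros);
        session.execute(stmt);

        cluster.close();
    }
}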
You can override the meaning of timestamps in some cases - it's a sneaky trick we've used in the past to do clever things like first-write-wins, and we've even stored leaderboard values in the TIMESTAMP field so that the highest score would be persisted - but you should REALLY understand the concept before trying these tricks (deletes become nontrivial).

Cassandra Node.js DataStax driver doesn't return newly added columns via prepared statement execution

After adding a pair of columns to the schema, I want to select them via select *. Instead, select * returns the old set of columns and none of the new ones.
Per the documentation's recommendation, I use {prepare: true} to smooth over the difference between JavaScript floats and Cassandra ints/bigints (I don't really need a prepared statement here; it is just to resolve the ResponseError: Expected 4 or 0 byte int issue, and I also don't want to bother with query hints).
So on the first execution of select * I had 3 columns. After that, I added 2 columns to the schema. select * still returns 3 columns if used with {prepare: true}, and 5 columns if used without it.
I want a way to reliably refresh this cache, or to make the Cassandra driver re-prepare its statements on each app start.
I don't consider restarting the database cluster a reliable way.
This is actually an issue in Cassandra that was fixed in 2.1.3 (CASSANDRA-7910). The problem is that on a schema update, the prepared statements are not evicted from the cache on the Cassandra side. If you are running a version earlier than 2.1.3 (which is likely, since 2.1.3 was released last week), there really isn't a way to work around this unless you create another, separate prepared statement that is slightly different (extra spaces or something, to produce a distinct statement).
When running 2.1.3 and changing the table schema, Cassandra will properly evict the relevant prepared statements from the cache, and when the driver sends another query using that statement, Cassandra will respond with an 'UNPREPARED' message, which should prompt the Node.js driver to re-prepare the query and resend the request for you.
On the Node.js driver, you can programmatically clear the prepared statement metadata:
client.metadata.clearPrepared();

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation, it seems it's possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use those.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the token function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is saying that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (start, end) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the Cassandra driver directly, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
