If I use a Cassandra batch statement in CQL, each statement can have an individual timestamp. For example, something like:
BEGIN BATCH
INSERT INTO users (name, surname) VALUES ('Bob', 'Smith') USING TIMESTAMP 10000001;
DELETE FROM users USING TIMESTAMP 10000000 WHERE user='Bob';
APPLY BATCH;
If I try to do something similar using the C++ driver, I'd do something like this (sketched in code after the steps):
Create the batch with cass_batch_new
Create the statements with cass_future_get_prepared then cass_prepared_bind
Set the timestamp on each statement with cass_statement_set_timestamp
Add the statement to the batch using cass_batch_add_statement
Execute the batch using cass_session_execute_batch
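For example, a minimal sketch of those steps (error checking kept brief; session and the two prepared statements are assumed to already exist):

CassBatch* batch = cass_batch_new(CASS_BATCH_TYPE_LOGGED);

/* Insert with its own timestamp */
CassStatement* insert_stmt = cass_prepared_bind(prepared_insert);
cass_statement_bind_string(insert_stmt, 0, "Bob");
cass_statement_bind_string(insert_stmt, 1, "Smith");
cass_statement_set_timestamp(insert_stmt, 10000001);
cass_batch_add_statement(batch, insert_stmt);

/* Delete with an earlier timestamp */
CassStatement* delete_stmt = cass_prepared_bind(prepared_delete);
cass_statement_bind_string(delete_stmt, 0, "Bob");
cass_statement_set_timestamp(delete_stmt, 10000000);
cass_batch_add_statement(batch, delete_stmt);

/* Execute the whole batch */
CassFuture* future = cass_session_execute_batch(session, batch);
cass_future_wait(future);
cass_future_free(future);
cass_statement_free(insert_stmt);
cass_statement_free(delete_stmt);
cass_batch_free(batch);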
I'd then expect this to behave in the same way as the CQL batch statement, in that each statement in the batch is executed with its own separate timestamp. But, based on my testing, I've not been able to get this to work. It appears to execute the entire batch using a single timestamp.
Similarly, if I create a monotonic timestamp generator to generate the timestamps for me, it appears to use a single timestamp for the batch as a whole rather than one per statement.
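For reference, the generator is attached at the cluster level, roughly like this (a sketch; cluster is assumed to have been created with cass_cluster_new):

CassTimestampGen* ts_gen = cass_timestamp_gen_monotonic_new();
cass_cluster_set_timestamp_gen(cluster, ts_gen);
cass_timestamp_gen_free(ts_gen); /* the cluster keeps its own reference */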
I've taken a look at the source code for the C++ driver and it looks like when it encodes the statements in the batch for sending to the database (in ExecuteRequest::encode_batch), it doesn't attempt to encode a timestamp for each statement in the batch, just for the batch overall. When encoding individual statements not in a batch it does encode the timestamp for the statement (in ExecuteRequest::internal_encode).
As a workaround, instead of setting the timestamp on the statements using cass_statement_set_timestamp, I can put the "USING TIMESTAMP 10000001" directly into the CQL string, and that then works as intended. So, it appears that the database can correctly have separate timestamps on each statement in the batch, but the C++ driver can't send them.
But if I put the timestamp directly into the CQL with "USING TIMESTAMP 10000001", I can't reuse the statement by just binding new values to it; I'd need to prepare the statement again.
Has anyone else tried this and managed to get it to work? Or is it just a known limitation of the C++ driver?
I'm using Cassandra C++ driver version 2.2.2 and database version 2.2.5, which as far as I can tell uses native protocol version 4.
I also raised this on the Cassandra C++ driver mailing list Google group and Michael Penick replied to say it's not currently possible. The underlying protocol does not support a timestamp per statement in the batch, so the driver is not able to send one.
Native Protocol v4 spec
Related
In our Spark application, we run multiple batch processes every day. The sources for these batch processes differ: Oracle, MongoDB, files. For incremental processing we store a different value depending on the source, such as the latest timestamp for some Oracle tables, an ID for other Oracle tables, and a file list for some file systems, and we use those values for the next incremental run.
Currently the calculation of these offset values depends on the source type, and we need to customize the code to store this value every time we add a new source type.
Is there a generic way to resolve this, like checkpointing in streaming?
I always like to look in the destination for the last written partition, or get some max(primary_key), and then, based on that value, select the data from the source database to write during the current run.
There would be no need to store anything; you would just supply your batch processing algorithm with the table name, source type, and primary key/timestamp column, and the algorithm would then find the latest value you already have.
It really depends on your load philosophy and how your storage is divided, e.g. whether you have raw/source/prepared layers. It is a good idea to load data in a raw format that can easily be compared to the original source in order to do what I described above.
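A rough sketch of that idea, assuming a Spark SQL job where the destination is a raw parquet layer and the source is an Oracle table read over JDBC (all paths, table names, and columns below are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

val spark = SparkSession.builder().appName("incremental-load").getOrCreate()

// Latest value already loaded into the destination (raw layer)
val lastLoaded = spark.read.parquet("/warehouse/raw/orders")
  .agg(max(col("updated_ts")))
  .collect()(0)
  .getTimestamp(0)

// Read only the rows that are newer than what we already have
val increment = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder connection
  .option("dbtable", "ORDERS")
  .load()
  .filter(col("updated_ts") > lastLoaded)

increment.write.mode("append").parquet("/warehouse/raw/orders")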
Alternatives include:
Writing a file that contains the primary column and the latest value; your batch job would read this file to determine what to read next.
Updating the job execution configuration with an argument corresponding to the latest value, so on the next run the latest value is passed to your algorithm.
I would like to execute over a hundred user-defined-type statements. These statements are encapsulated in a .cql file.
When executing the .cql file each time for new cases, I find that many of the statements within it get skipped.
Therefore, I would like to know if there are any performance issues with executing hundreds of statements composed in a .cql file.
Note: I am executing the .cql files from a Python script via the os.system method.
The performance of executing hundreds of DDL statements via code (or a .cql file/cqlsh) is proportional to the number of nodes in the cluster. In a distributed system like Cassandra, all nodes have to agree on a schema change, and the more nodes there are, the longer schema agreement takes.
There is essentially a timeout value, maxSchemaAgreementWaitSeconds, which determines how long the coordinator node will wait before replying to the client. The typical case for schema deployment is one or two tables, and the default value for this parameter works just fine.
In the special case of many DDL statements executed at once via code/cqlsh, it's better to increase the value of maxSchemaAgreementWaitSeconds, say to 20 seconds. The schema deployment will take a little longer, but it will make sure the deployment succeeds.
Java reference
Python reference
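For example, with the DataStax Python driver the wait can be raised when building the cluster (a sketch; the contact point is a placeholder):

from cassandra.cluster import Cluster

# Wait up to 20 seconds for schema agreement instead of the default 10
cluster = Cluster(['127.0.0.1'], max_schema_agreement_wait=20)
session = cluster.connect()

The equivalent setting in the Java driver is Cluster.builder().withMaxSchemaAgreementWaitSeconds(20).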
After adding a pair of columns to the schema, I want to select them via select *. Instead, select * returns the old set of columns and none of the new ones.
Following the documentation's recommendation, I use {prepare: true} to smooth over the difference between JavaScript floats and Cassandra ints/bigints (I don't really need a prepared statement here; it's just to resolve the ResponseError: Expected 4 or 0 byte int issue, and I also don't want to bother with query hints).
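For reference, the query is executed roughly like this (a sketch; client is a connected Client from the driver and the table name is a placeholder):

// The driver prepares the statement once and caches its metadata
client.execute('SELECT * FROM my_table', [], { prepare: true }, function (err, result) {
  // result.rows reflects the columns known when the statement was prepared
});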
So on the first execution of select * I had 3 columns. After that, I added 2 columns to the schema. select * still returns 3 columns if used with {prepare: true} and 5 columns if used without it.
I want a way to reliably refresh this cache or make the Cassandra driver prepare statements on each app start.
I don't consider restarting the database cluster a reliable way.
This is actually an issue in Cassandra that was fixed in 2.1.3 (CASSANDRA-7910). The problem is that on schema update, the prepared statements are not evicted from the cache on the Cassandra side. If you are running a version less than 2.1.3 (which is likely since 2.1.3 was released last week), there really isn't a way to work around this unless you create another separate prepared statement that is slightly different (like extra spaces or something to cause a separate unique statement).
When running with 2.1.3 and changing the table schema, C* will properly evict the relevant prepared statements from the cache, and when the driver sends another query using that statement, Cassandra will respond with an 'UNPREPARED' message, which should provoke the nodejs driver to reprepare the query and resend the request for you.
On the Node.js driver, you can programmatically clear the prepared statement metadata:
client.metadata.clearPrepared();
I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, it seems it's possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
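The cqlsh query was presumably along these lines (a sketch using the token function on the partition key):

select * from datastore.data where token(timestamp) >= token('2013-01-01T00:00:00.000Z') and token(timestamp) < token('2013-12-31T00:00:00.000Z');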
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector is built from the 'b1.1' branch. The Cassandra driver is built from the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
val start = fmt.parse("2013-01-01T00:00:00.000Z")
val end = fmt.parse("2013-12-31T00:00:00.000Z")
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small to medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the Cassandra driver directly, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
I'm using Cassandra 1.2.4 and 1.0.0 of the Datastax java driver (through Clojure's Alia, but I don't think that matters here). If I prepare a statement with a timeuuid column, and put 'now()' in the values for the timeuuid, now() gets evaluated once at the time the prepared statement is compiled, and then never again.
imagine this prepared statement: "insert into some_table (id,time) values (?,now())"
after you prepare it, every time you execute it, now() will always be the same, regardless of when you execute it.
you can sort of work around this using the min/max timeuuid functions, but then you lose all the benefits of the uniqueness of timeuuid values. effectively, I think this means that I can't insert a timeuuid using a prepared statement.
Am I missing something? or is there a feature missing?
This sounds like a bug/shortcoming on the Cassandra side. Alternatively, you could pass a UUID instance as a value to the prepared statement:
insert into some_table (id,time) values (?,?)
and then use either a UUID instance you create from the UUIDs class in java-driver (http://www.datastax.com/drivers/java/apidocs/com/datastax/driver/core/utils/UUIDs.html#timeBased()) or a com.eaio.uuid.UUID instance; you can create the latter from a Clojure wrapper, mpenet/tardis (it's in the dev/test dependencies of alia already), using (unique-time-uuid (java.util.Date.)).
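With the java-driver UUIDs route, for instance, a minimal sketch looks like this (someId and the table are placeholders):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.utils.UUIDs;

// Generate the timeuuid client-side at execution time instead of using now() in the query
PreparedStatement ps = session.prepare("insert into some_table (id, time) values (?, ?)");
session.execute(ps.bind(someId, UUIDs.timeBased()));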
I might just wrap UUIDs in alia directly too; tardis is more flexible in some ways, but the former is the official thing.
https://github.com/mpenet/tardis
I have confirmed with c* people this is something that will/can be improved, there is an issue about this here if you want to track its progress: https://issues.apache.org/jira/browse/CASSANDRA-5616