Cassandra batch operation support with CrudRepository

I understand Cassandra does not support transaction management, but it does support batch operations, which can be used like transactions (either all statements succeed or none do).
Similarly, Spring Data for Cassandra provides CassandraTemplate, which exposes batchOps to support this batch feature.
I am wondering if something similar is available for use with CrudRepository (a higher-level API which internally uses CassandraTemplate).

Batches are really useful only for a very limited set of use cases - please check the documentation. It may be better to devise some other way to achieve the same goal, but this really depends on what you want to do - which operations to execute, the table schemas, etc.
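That said, if batching is the right fit, the template-level API is still reachable alongside a repository. Below is a minimal sketch, assuming Spring Data for Apache Cassandra 2.x (where batchOps() returns a CassandraBatchOperations) and a hypothetical Person entity - CrudRepository itself exposes no batch equivalent, so you inject the template next to the repository:

```java
import org.springframework.data.cassandra.core.CassandraTemplate;
import org.springframework.data.cassandra.core.mapping.PrimaryKey;
import org.springframework.data.cassandra.core.mapping.Table;

// Hypothetical mapped entity, shown only to make the sketch self-contained.
@Table
class Person {
    @PrimaryKey
    String id;
    String name;
}

public class PersonBatchWriter {

    private final CassandraTemplate template;

    public PersonBatchWriter(CassandraTemplate template) {
        this.template = template;
    }

    // Builds a single logged batch: either both inserts apply or neither does.
    public void saveBoth(Person a, Person b) {
        template.batchOps()
                .insert(a)
                .insert(b)
                .execute();
    }
}
```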

Related

Count distinct in infinite stream

I am looking for a way to create a streaming application that can withstand millions of events per second and output a distinct count of those events in real time. As this stream is unbounded by any time window, it obviously has to be backed by some storage. However, I cannot find the best way to do this while maintaining a good level of abstraction (meaning that I want a framework to handle storing and counting for me, otherwise I don't need a framework at all). My preferred storage options are Cassandra and Redis (ideally both).
The options I've considered are Flink, Spark and Kafka Streams. I do know the differences between them, but I still can't pick the best solution. Can someone advise? Thanks in advance.
Regardless of which solution you choose, if you can tolerate it not being 100% accurate (but being very, very close), you can have your operator use HyperLogLog (there are Java implementations available). This means you don't actually have to keep data about each individual item, drastically reducing your memory usage.
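As an illustration, here is a minimal sketch using stream-lib, one of the available Java implementations (the library choice and the event keys are assumptions, not part of the original answer):

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class DistinctCountSketch {
    public static void main(String[] args) {
        // 0.01 is the target relative standard deviation (~1% error).
        // The sketch has a fixed size; it does not grow with the input.
        HyperLogLog hll = new HyperLogLog(0.01);

        // Feed a million events drawn from 50,000 distinct keys.
        for (int i = 0; i < 1_000_000; i++) {
            hll.offer("event-" + (i % 50_000));
        }

        // cardinality() returns an estimate close to 50,000, not an exact count.
        System.out.println("estimated distinct events: " + hll.cardinality());
    }
}
```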
Assuming Flink, the necessary state is quite small (< 1 MB), so you can easily use the FsStateBackend, which is heap-based and checkpoints to the file system, allowing you to reduce serialization overhead.
Again assuming you go with Flink, using the ContinuousEventTimeTrigger, you can also get a view into how many unique items are currently being tracked.
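A minimal sketch of that setup, assuming Flink 1.x (the checkpoint URI and interval are placeholders):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap-based state backend that snapshots to a file system;
        // fine here because the HyperLogLog state stays well under 1 MB.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
        env.enableCheckpointing(60_000); // snapshot state every 60 seconds

        // ... define the HyperLogLog-based pipeline here, then:
        // env.execute("distinct-count");
    }
}
```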
I'd suggest reconsidering the choice of storage system. Using an external system is significantly slower than using local state. Flink applications maintain state locally on the JVM heap or in RocksDB (on disk) and can checkpoint it at regular intervals to persistent storage such as HDFS. This state can grow very large (tens of TBs) and still be maintained efficiently, because checkpoints can be done incrementally and asynchronously. This gives much better performance than sending a query to an external system for each record.
If you still prefer Redis or Cassandra, you can use Flink's AsyncIO operator to send asynchronous requests and improve the throughput of your application.

Pros and Cons of Cassandra User Defined Functions

I am using Apache Cassandra to store mostly time-series data, and I am grouping the data and aggregating/counting it based on some conditions. At the moment I am doing this in a Java 8 application, but with the release of Cassandra 3.0 and its User Defined Functions, I have been asking myself whether extracting the grouping and aggregation/counting logic to Cassandra is a good idea. To my understanding, this functionality is something like stored procedures in SQL.
My concern is whether this will impact the computation performance and the overall performance of the database. I am also not sure if there are other issues with it, and whether this new feature is something like secondary indexes in Cassandra - you can create them, but doing so is not recommended at all.
Have you used user defined functions in Cassandra? Do you have any observations on the performance? What are the good and bad sides of this new functionality? Is it applicable in my use case?
You can compare it to using count() or avg() style aggregations. They can save you a lot of network traffic and object creation/GC by having the coordinator send only the result, but it's easy to get carried away and make the coordinator do a lot of work. This extra work takes away from normal C* duties, and can just as likely increase GCs as reduce them.
If you're aggregating 100 rows in a partition, it's probably fine; if you're aggregating 10,000, it's probably not the end of the world if it's very rare. If you're calling it once a second, though, it's a problem. If you're aggregating over 1,000 rows, I would be very careful.
If you absolutely need to do it, and it involves a lot of data frequently, you may want to create dedicated proxy coordinators (-Djoin_ring=false) to bear the brunt of the load without impacting normal C* reads/writes. At that point it's just as easy to create a dedicated workload DC for it (with RF=0 for your keyspace, and the application set to be part of that DC via DCAwareRoundRobinPolicy). This is also the point where using Spark is probably the right thing to do.
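For reference, defining and calling a scalar UDF looks like this - a minimal sketch assuming the DataStax Java driver 3.x, with a hypothetical ts keyspace and readings table (UDFs must also be enabled via enable_user_defined_functions: true in cassandra.yaml):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UdfDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ts")) {

            // The UDF body is plain Java, executed on the coordinator node.
            session.execute(
                "CREATE OR REPLACE FUNCTION ts.clamp(val double, lo double, hi double) "
              + "RETURNS NULL ON NULL INPUT RETURNS double LANGUAGE java "
              + "AS 'return Math.max(lo, Math.min(hi, val));'");

            // Once created, it can be called like a built-in function.
            session.execute(
                "SELECT sensor_id, clamp(reading, 0.0, 100.0) FROM ts.readings LIMIT 10");
        }
    }
}
```

Every such call shifts work onto the coordinator, which is exactly the trade-off described above.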

Enabling Locking of Data in Cassandra

I just want to know whether there is any possible method to implement locking of data in Cassandra? I tried multithreading with Hector, but this didn't work out well. Can anyone suggest a method?
Cassandra does not support any type of locking. Of the four traditional database properties (ACID - Atomicity, Consistency, Isolation and Durability), it only fully supports D. The way it supports the other three is arguable, because it supports each of them only partially. You can read more here. IMHO the only way to accomplish your goal is to use some synchronization layer which intercepts all calls to Cassandra and performs all necessary locking inside itself before anything is sent to Cassandra. In this case, you would be using Astyanax as such a layer.
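A minimal sketch of what such a layer could look like inside a single JVM - the class and method names are hypothetical, and note this only serializes writers within one process; coordinating multiple clients would need an external coordinator such as ZooKeeper:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// One lock per row key; every Cassandra mutation must go through lockFor()
// before it is sent, so writes to the same row are serialized in this JVM.
public class RowLockRegistry {

    private final ConcurrentHashMap<String, ReentrantLock> locks =
            new ConcurrentHashMap<>();

    public ReentrantLock lockFor(String rowKey) {
        return locks.computeIfAbsent(rowKey, k -> new ReentrantLock());
    }
}
```

Usage would be along these lines, where writeToCassandra() is a stand-in for your Hector/Astyanax mutation:

```java
ReentrantLock lock = registry.lockFor(rowKey);
lock.lock();
try {
    writeToCassandra(rowKey, value);
} finally {
    lock.unlock();
}
```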

Cassandra bulk insert operation, internally

I am looking for Cassandra/CQL's cousin of the common SQL idiom INSERT INTO ... SELECT ... FROM ... and have been unable to find anything that performs such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but it is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
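For completeness, the client-side copy the question hopes to avoid is straightforward to write. A minimal sketch assuming the DataStax Java driver 3.x, with hypothetical src/dst tables and id/payload columns - the driver pages through the result set automatically, so memory use stays bounded even for a large table:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class TableCopy {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {

            PreparedStatement insert =
                    session.prepare("INSERT INTO dst (id, payload) VALUES (?, ?)");

            // Fetch size controls the page size, not the total number of rows.
            Statement select = new SimpleStatement("SELECT id, payload FROM src")
                    .setFetchSize(1000);

            // Iterating the ResultSet transparently fetches page after page.
            for (Row row : session.execute(select)) {
                session.execute(insert.bind(row.getUUID("id"), row.getString("payload")));
            }
        }
    }
}
```

Issuing the inserts with executeAsync (keeping a bounded number in flight) would speed this up considerably, at the cost of a little bookkeeping.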

For what reasons it is not preferable to use Thrift API for Cassandra access?

Please enumerate the reasons why it is not advisable to use the Thrift interface for Cassandra. What are the possible advantages and disadvantages?
If you use the raw Thrift APIs, the cons will be:
no connection pooling
no monitoring
no object oriented interface (not entirely true)
no failover support
To continue Schildmeijer's good start (a short Hector example follows the list below):
No batch interface.
No chunking of get_range_slices() or get_indexed_slices(), so you can easily swamp Cassandra.
Non-string types must be packed into binary strings yourself.
You'll probably mess up timestamp precision.
Exception messages are generally useless.
Thrift is broken by default in some languages - see the PHP C extension, for example.
Because the code is generated, it's not intuitive, especially regarding super columns, SlicePredicates and batch_mutate().
Schema modification commands don't wait for schema agreement among all the nodes in the cluster.
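For contrast, here is the same kind of write going through Hector, which layers connection pooling, failover and retries on top of raw Thrift - a minimal sketch, with the cluster, keyspace and column family names being hypothetical:

```java
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class HectorWrite {
    public static void main(String[] args) {
        // Hector manages the Thrift connections, pooling and failover for you.
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("demo", cluster);

        // Serializers handle the packing into binary that raw Thrift leaves to you.
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("row-1", "Users",
                HFactory.createStringColumn("name", "alice"));
    }
}
```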
