I just want to know whether there is any possible method to implement locking of data in Cassandra? I tried multithreading with Hector, but it didn't work out well. Can anyone suggest a method?
Cassandra does not support any type of locking. Of the four traditional database properties (ACID: Atomicity, Consistency, Isolation, and Durability), it only fully supports the D. Its support for the other three is arguable, because each is only partially supported. IMHO the only way to accomplish your goal is to use some synchronization layer that intercepts all calls to Cassandra and performs all necessary locking itself before anything is sent to Cassandra. A client library such as Astyanax can act as that layer.
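For illustration, here is a minimal sketch of such a layer, assuming a single-JVM application (the class name and structure are hypothetical; with multiple application nodes you would need a distributed lock instead):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical synchronization layer: serializes writes to the same row key
// inside the application, since Cassandra itself provides no locking primitive.
// Note: this only protects against concurrent writers within this one JVM.
public class LockingCassandraWriter {

    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public void writeWithLock(String rowKey, Runnable cassandraWrite) {
        ReentrantLock lock = locks.computeIfAbsent(rowKey, k -> new ReentrantLock());
        lock.lock();
        try {
            cassandraWrite.run(); // e.g. execute a Hector Mutator here
        } finally {
            lock.unlock();
        }
    }
}
```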
I went through lots of posts on SO and the official Flink documentation, but I couldn't find what I was looking for: the difference between RichSinkFunction, RichAsyncFunction, AsyncIO, and CassandraSink for writing records faster/multithreaded to Cassandra from Flink.
My understanding is as follows:
RichSinkFunction - If implemented properly, it will do the work for you, since it opens and closes the connection only once.
RichAsyncFunction - Implemented much like RichSinkFunction. It works in sync mode by default. I can use an ExecutorService for multithreading. Also, I read that if the capacity is chosen thoughtfully, it can give you higher throughput.
AsyncIO - Not multithreaded by default. According to one of the SO answers, we can use an ExecutorService, same as with RichAsyncFunction, to create separate threads, which is not mentioned in the documentation.
CassandraSink - Provided by Flink with various properties. Will using setMaxConcurrentRequests give me faster results?
Which of the mentioned classes would be best for this purpose in a Flink program?
I think you can use CassandraSink, which uses DataStax's java-driver (https://github.com/datastax/java-driver) to access Cassandra. Flink uses the driver's executeAsync function to achieve better speed. As you mentioned, setMaxConcurrentRequests sets the maximum number of requests that can be sent from the same session.
RichSinkFunction is a fundamental function in Flink. We can implement our own Cassandra Sink with RichSinkFunction, but it requires work like initialising the client, creating threads, etc.
RichAsyncFunction is mainly for the AsyncIO operator. We still have to initialise the Cassandra client; beyond that, we only have to implement the asyncInvoke function and failure handling.
CassandraSink implements RichSinkFunction using the Cassandra client's async API. It is the easiest option, giving async writes with the least code.
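For reference, a minimal sketch of wiring up CassandraSink (keyspace, table, and tuple shape are hypothetical; setMaxConcurrentRequests is available in recent versions of the connector):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;

public class CassandraSinkExample {

    // Attach the connector-provided sink to a stream of (id, value) tuples.
    // setMaxConcurrentRequests caps the number of in-flight executeAsync calls
    // issued from the session.
    static void attachSink(DataStream<Tuple2<String, Long>> results) throws Exception {
        CassandraSink.addSink(results)
            .setQuery("INSERT INTO my_ks.my_table (id, value) VALUES (?, ?);")
            .setHost("127.0.0.1")
            .setMaxConcurrentRequests(500)
            .build();
    }
}
```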
I understand Cassandra does not support transaction management, but it does support batch operations, which can be used like transactions (either all succeed or none do).
Similarly, Spring Data for Cassandra provides CassandraTemplate, which exposes batchOps to support this batch feature.
I am wondering if something similar is available for use with CrudRepository (a high-level API which internally uses CassandraTemplate).
Batches are really intended for a very limited set of use cases - please check the documentation. It may be better to devise some other way to achieve the same goal, but this really depends on what you want to do: which operations you execute, your table schemas, etc.
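For completeness, a minimal sketch of the batchOps API the question refers to (entity classes omitted; there is no batch variant on CrudRepository itself, so you drop down to CassandraTemplate):

```java
import org.springframework.data.cassandra.core.CassandraBatchOperations;
import org.springframework.data.cassandra.core.CassandraTemplate;

public class BatchExample {

    // All statements in the batch are applied together (as a logged batch)
    // or not at all; note this is atomicity, not transactional isolation.
    void saveTogether(CassandraTemplate template, Object entity1, Object entity2) {
        CassandraBatchOperations batch = template.batchOps();
        batch.insert(entity1); // any mapped @Table entity
        batch.insert(entity2);
        batch.execute();
    }
}
```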
When implementing the MapStore interface, the only method for initializing the map is loadAll, so you have to provide a set of keys to load into the map. How do you handle the situation when the primary key is a date/time? Intuitively one would define a key range, e.g. tst BETWEEN a AND b. But since we can only provide a Set, we have to pre-fetch all the possible date-time values (via SQL or whatever), and then the IMap will hammer the database, fetching every key one by one. Is this the best approach? Isn't there a more convenient way to do this?
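For reference, the pre-fetch pattern described above looks roughly like this (a sketch; the JDBC URL, table, column names, and date range are hypothetical, and recent Hazelcast 3.x versions let loadAllKeys() return any Iterable rather than a Set):

```java
import com.hazelcast.core.MapLoader;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the pattern in the question (Hazelcast 3.x package names):
// loadAllKeys() pre-fetches the key range with one SQL query, after which
// Hazelcast loads the values key by key -- the "hammering" described above.
public class EventMapLoader implements MapLoader<Timestamp, String> {

    private static final String URL = "jdbc:h2:mem:events"; // hypothetical

    @Override
    public Iterable<Timestamp> loadAllKeys() {
        List<Timestamp> keys = new ArrayList<>();
        String sql = "SELECT tst FROM events WHERE tst BETWEEN ? AND ?";
        try (Connection c = DriverManager.getConnection(URL);
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.valueOf("2020-01-01 00:00:00"));
            ps.setTimestamp(2, Timestamp.valueOf("2020-12-31 23:59:59"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    keys.add(rs.getTimestamp("tst"));
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        return keys;
    }

    @Override
    public String load(Timestamp key) {
        String sql = "SELECT payload FROM events WHERE tst = ?";
        try (Connection c = DriverManager.getConnection(URL);
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setTimestamp(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("payload") : null;
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public Map<Timestamp, String> loadAll(Collection<Timestamp> keys) {
        Map<Timestamp, String> result = new HashMap<>();
        for (Timestamp key : keys) { // one round trip per key
            String value = load(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }
}
```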
My advice would be to stop thinking about the maps as if they were tables in a relational database. Try to think in terms of the semantics of a Map (if you are indeed using a Map; there are other distributed collections in Hazelcast). For example, keep in mind that you can only query objects that are in memory: query semantics apply only when Hazelcast is used as a data grid, not as a cache. If you are using it as a cache, you should limit your access to lookups by key, as you would with a traditional Java map.
When used as a data grid, for example, access to the database will typically occur only in disaster-recovery scenarios. The initial load of data from disk into memory may hit the database hard, but since that only happens during recovery, it is not such a major handicap. When used as a cache, it becomes important to plan your persistence strategy carefully, since access to the database will be more frequent.
If you provide further information about your particular use case, especially regarding eviction policies, I may be able to help you more.
Hope this helps.
I am looking for Cassandra/CQL's cousin of the common SQL idiom INSERT INTO ... SELECT ... FROM ... and have been unable to find any way to do such an operation programmatically or in CQL. Is it just not supported?
My use case is a reasonably bulky copy from one table to another. I don't need any particular concurrency guarantees, but it's a lot of data, so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table and then issues batches of inserts into the other. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (An alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
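For what it's worth, a minimal sketch of the client-side copy using the DataStax java-driver mentioned elsewhere here (table and column names are hypothetical; the driver pages the SELECT transparently, so memory stays bounded, though in real code you would throttle the async writes):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TableCopy {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {
            PreparedStatement insert =
                session.prepare("INSERT INTO target (id, value) VALUES (?, ?)");
            // The driver fetches the SELECT in pages as iteration proceeds.
            ResultSet rows = session.execute("SELECT id, value FROM source");
            for (Row row : rows) {
                // Fire-and-forget async writes; throttle these in practice.
                session.executeAsync(insert.bind(row.getString("id"), row.getString("value")));
            }
        }
    }
}
```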
Please enumerate the reasons why it's not advisable to use the Thrift interface for Cassandra. What are the possible advantages and disadvantages?
If you use the raw Thrift APIs, the cons will be:
no connection pooling
no monitoring
no object-oriented interface (not entirely true)
no failover support
To continue Schildmeijer's good start:
No batch interface.
No chunking of get_range_slices() or get_indexed_slices(), so you can easily swamp Cassandra
Non-string types must be packed into binary strings yourself
You'll probably mess up timestamp precision (both of these are shown in the sketch after this list)
Exception messages are generally useless
Thrift is broken by default in some languages. See the PHP C extension, for example.
Because the code is generated, it's not intuitive, especially regarding super columns, SlicePredicates and batch_mutate().
Schema modification commands don't wait for schema agreement among all the nodes in the cluster
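To make the packing and timestamp points concrete, a sketch against the 1.x thrift-generated classes (column family, key, and values are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;

public class RawThriftInsert {

    static void insertAge(Cassandra.Client client) throws Exception {
        ByteBuffer rowKey = ByteBuffer.wrap("user42".getBytes(StandardCharsets.UTF_8));
        ColumnParent parent = new ColumnParent("Users");

        Column col = new Column(ByteBuffer.wrap("age".getBytes(StandardCharsets.UTF_8)));
        // Non-string value: you pack the long into bytes yourself.
        col.setValue(ByteBuffer.allocate(8).putLong(0, 42L));
        // Convention is microseconds; using milliseconds here silently
        // loses precision relative to other clients.
        col.setTimestamp(System.currentTimeMillis() * 1000);

        client.insert(rowKey, parent, col, ConsistencyLevel.QUORUM);
    }
}
```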