Why isn't TokenAwarePolicy the default for the DataStax Java driver - Cassandra

Are there drawbacks to using the TokenAwarePolicy over the current default of RoundRobinPolicy?
It seems to me that routing requests to the nodes identified as replicas by the routing key should always be preferable, so shouldn't a RoundRobinPolicy wrapped in TokenAwarePolicy be the default?

There is no real drawback to using TokenAwarePolicy, and in fact we've changed the default in recent releases (2.0.2), so the default is now token aware.
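For reference, here is a minimal sketch of how you would configure this explicitly with the DataStax Java driver (2.x API), wrapping RoundRobinPolicy in TokenAwarePolicy; the contact point, keyspace, table and query are placeholder values:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.RoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class TokenAwareExample {
        public static void main(String[] args) {
            // Wrap the child policy in TokenAwarePolicy so requests are routed to a
            // replica of the statement's routing key whenever the driver knows it.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // placeholder contact point
                    .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                    .build();

            Session session = cluster.connect("my_keyspace"); // placeholder keyspace

            // Token awareness works best with prepared statements, because the driver
            // can compute the routing key from the bound partition-key value.
            PreparedStatement ps = session.prepare("SELECT * FROM users WHERE id = ?");
            session.execute(ps.bind(42));

            cluster.close();
        }
    }

Note that the policy can only pick a replica when the routing key is known, which is the case for prepared/bound statements (or when you set it on the statement explicitly).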

Related

jCache providers features

So I'm going to use a JCache implementation for my J2EE application (Java 8), and I want to know the differences between the providers and their features:
Hazelcast
Ehcache
Infinispan
Can anyone help me choose one of them (in terms of cluster support, ease of use, performance, ...)?
JCache is a specification, so all implementations behave the same way with regard to the caching features it defines.
However, a key differentiator when evaluating the products is whether you want the cache to be distributed or not. The open source version of Hazelcast is distributed; this is not the case for Ehcache.
Disclaimer: I work for Hazelcast.
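Because JCache (JSR-107) is a common API, code written against javax.cache works unchanged no matter which of the three providers is on the classpath; only configuration beyond the spec (clustering, persistence, and so on) is vendor specific. A minimal sketch, with an arbitrary cache name, types and expiry policy:

    import javax.cache.Cache;
    import javax.cache.CacheManager;
    import javax.cache.Caching;
    import javax.cache.configuration.MutableConfiguration;
    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;
    import javax.cache.spi.CachingProvider;

    public class JCacheExample {
        public static void main(String[] args) {
            // Resolves whichever provider (Hazelcast, Ehcache, Infinispan, ...) is on the classpath.
            CachingProvider provider = Caching.getCachingProvider();
            CacheManager cacheManager = provider.getCacheManager();

            // Vendor-neutral configuration: entries expire ten minutes after creation.
            MutableConfiguration<String, String> config = new MutableConfiguration<String, String>()
                    .setTypes(String.class, String.class)
                    .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.TEN_MINUTES));

            Cache<String, String> cache = cacheManager.createCache("sessions", config); // placeholder name
            cache.put("user-42", "some-session-state");
            System.out.println(cache.get("user-42"));

            cacheManager.close();
        }
    }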

What is the best way to resolve CouchDB document conflicts across 2 DB instances?

I have an application running on Node.js and I am trying to make it a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success, then writes to CouchDB B. We read data through an ELB (which reads from the two DBs). It's working fine.
But I faced a problem recently: CouchDB B went down, and after it came back up there is now a document _rev mismatch between the two instances.
What would be the best approach to resolve the above scenario without any downtime?
If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(S) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB Replication to move data from A-->B and B-->A which would keep both instances perfectly in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition; it's helping your application avoid losing data from concurrent writes in a distributed system.
You can read more about document conflicts in this blog post series.
If both of your 1.6.x nodes are syncing databases using standard replication, turning off one node shouldn't be an issue. When the node comes back up it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there is no general way to resolve them automatically. However, in most cases you can find a strategy of marking affected doc subtrees in a way that lets you determine which version is the most recent (or more important).
To detect docs that have conflicts you may use standard views: a doc passed to a view function has a _conflicts property if conflicting revisions exist. Using an appropriate view you can detect conflicts and merge docs. Regardless of how you detect conflicts, you need external code to resolve them.
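As a rough illustration of what that external code can look like, here is a hedged sketch against CouchDB's HTTP API using Java's built-in HttpClient: fetch the document with ?conflicts=true, choose a winner with your own application logic, and delete the losing revisions. The database URL, document id and losing revision are placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ConflictResolver {
        private static final String DB = "http://localhost:5984/mydb"; // placeholder database URL

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // 1. Fetch the document together with its conflicting revisions.
            HttpResponse<String> doc = client.send(
                    HttpRequest.newBuilder(URI.create(DB + "/some-doc-id?conflicts=true")).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // The body contains "_rev" (the current winner) and a "_conflicts" array.
            System.out.println(doc.body());

            // 2. Decide which revision wins (application-specific; here we simply keep the
            //    revision CouchDB already treats as the winner) and delete each loser.
            String losingRev = "2-abc123"; // placeholder: iterate over the "_conflicts" array
            HttpResponse<String> deleted = client.send(
                    HttpRequest.newBuilder(URI.create(DB + "/some-doc-id?rev=" + losingRev)).DELETE().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(deleted.statusCode()); // 2xx once the losing revision is removed
        }
    }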
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like, you may also try CRDTs, but to get reasonable performance you need reducers written in Erlang.
As for 2.x: I do not recommend using 2.x for your case (actually, for any real case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, any cluster solution makes very little sense for just two nodes.
As for the above-mentioned CVE 12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.

Why is there no built-in HashMapStreamSerializer in Hazelcast?

The Hazelcast documentation provides examples of how we can write our own LinkedListStreamSerializer and HashMapStreamSerializer, and it says that support for these will be added in the future.
It looks as though the LinkedListStreamSerializer is in fact supported now, which is great, but not the HashMap one.
I'm wondering if there is any reason why not, and whether I should be concerned about continuing to use the example one from the documentation.
You should be fine with the HashMapStreamSerializer.
It's currently tricky to add a new serializer to Hazelcast due to backward compatibility: older clients wouldn't be able to deserialize blobs serialized with the new serializer.
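For reference, a sketch of such a serializer along the lines of the documentation example might look like the following; this is not an official Hazelcast class, and the type id is arbitrary (it just has to be positive and unique among your own custom serializers, since negative ids are reserved for internal use).

    import com.hazelcast.nio.ObjectDataInput;
    import com.hazelcast.nio.ObjectDataOutput;
    import com.hazelcast.nio.serialization.StreamSerializer;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Serializes a HashMap as: entry count, then each key and value via writeObject.
    public class HashMapStreamSerializer implements StreamSerializer<HashMap<Object, Object>> {

        @Override
        public void write(ObjectDataOutput out, HashMap<Object, Object> map) throws IOException {
            out.writeInt(map.size());
            for (Map.Entry<Object, Object> entry : map.entrySet()) {
                out.writeObject(entry.getKey());
                out.writeObject(entry.getValue());
            }
        }

        @Override
        public HashMap<Object, Object> read(ObjectDataInput in) throws IOException {
            int size = in.readInt();
            HashMap<Object, Object> map = new HashMap<>(size);
            for (int i = 0; i < size; i++) {
                map.put(in.readObject(), in.readObject()); // the key is read first, then the value
            }
            return map;
        }

        @Override
        public int getTypeId() {
            return 1000; // arbitrary positive id, unique among your custom serializers
        }

        @Override
        public void destroy() {
            // nothing to clean up
        }
    }

You would then register it with a SerializerConfig (setTypeClass(HashMap.class), setImplementation(new HashMapStreamSerializer())) on the serialization config of both members and clients, so everything that touches the serialized blobs uses the same format.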

Which is better to use: (CqlConnection and CqlCommand) or (Cluster and Session)?

Is there an advantage to using one or the other set of classes to execute statements in a .NET application? As a .NET developer, using CqlConnection and CqlCommand is very similar to what is done for other DBs (like SQL Server). I read on some websites that Cluster and Session is the better way to go.
The DataStax documentation does not describe the differences or offer any suggestions about which to use under what circumstances.
Thanks
Use the Cluster and Session objects in the DataStax driver.
DataStax drivers provide critical functionality for enterprise Cassandra apps, including configurable load balancing policies, automatic failover, retry policies, and tunability. These features are exposed via the Cluster and Session objects.
Notice that CqlConnection and CqlCommand are not even mentioned in the DataStax documentation. This is because they are used under the hood by the driver.
You can certainly use these to connect and read/write to Cassandra, but you will be missing out on the features I mentioned.
Pro tip: check the code comments here to see the functionality of the Cluster object. DataStax drivers are open source, so feel free to go code diving!

Cassandra as an embedded service and with custom consistency level

I am thinking of building an application that uses Cassandra as its data store but has low-latency requirements. I am aware of EmbeddedCassandraService from this blog post.
Is the following implementation possible and what are known pitfalls (defects, functional limitations)?
1) Run Cassandra as an embedded service, persisting data to disk (durable).
2) Java application interacts with the local embedded service via one of the following. What are the pros and cons of each?
TMemoryBuffer (or something more appropriate?)
StorageProxy (what are the pitfalls of using this API?)
Apache Avro? (see question #5 below)
3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).
4) A write must always succeed on the local embedded Cassandra service and on at least one of the remote (non-embedded) Cassandra nodes in order to be considered successful. Is this possible? Is it possible to define a custom / complex consistency level?
5) Side question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced with Avro, but it seems that's not the case just yet?
As you might guess, I am new to Cassandra, so any direction to specific documentation pages (not the wiki homepage) or sample projects would be appreciated.
Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing by this configuration. Cassandra will shard your data across the cluster, so (as mentioned in one of the comments) your writes will frequently be made to another node that owns the data. Presuming you write with a consistency level of at least one, your call will block until that other node acks the write. This negates any benefit of talking to the embedded instance since you have some network latency anyway.
