How to set WRITE consistency explicitly with Datastax java driver? - cassandra

I am using the DataStax Java driver to connect to Cassandra and want to set the WRITE consistency level explicitly, but it seems the consistency level can only be set for queries. Sample code is below. How do I specify the write consistency at the driver level?
Cluster cluster = Cluster
    .builder()
    .addContactPoint(host)
    .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE))
    .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
    .withCredentials(userName, password)
    .withLoadBalancingPolicy(
        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
    .build();

We have completely different requirements for reads and writes: reads have a really tight latency SLA, while it is not that important to us that writes finish fast.
We decided to split sessions. We created two Cluster objects and from those two sessions, one for reads and one for writes. When we write, we use writeSession with CL QUORUM; when we read, we use readSession, which is tuned for our latency requirements with CL ONE, speculative executions, and a tight socket read timeout.
Long story short, you can dedicate a session to all your writes and define its consistency level on the Cluster object. Just be aware that this means more connections from the driver to the Cassandra cluster.
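A rough sketch of that two-Cluster setup, assuming the 3.x DataStax Java driver. The class and method names, the 500 ms read timeout, and the 50 ms / 2-attempt speculative execution settings are made up for illustration; credentials and load balancing from the question are left out and can be chained onto the returned builders.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

// Hypothetical helper: one builder per workload, each with its own default consistency level.
final class ReadWriteClusters {

    // Writes: default every statement on this Cluster to QUORUM.
    static Cluster.Builder writeCluster(String host) {
        return Cluster.builder()
                .addContactPoint(host)
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM));
    }

    // Reads: CL ONE, a tight socket read timeout, and speculative executions.
    // The timeout and speculative execution values are placeholders, not recommendations.
    static Cluster.Builder readCluster(String host) {
        return Cluster.builder()
                .addContactPoint(host)
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE))
                .withSocketOptions(new SocketOptions().setReadTimeoutMillis(500))
                .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(50, 2));
    }
}
```

The sessions would then come from something like `ReadWriteClusters.writeCluster(host).build().connect()` and `ReadWriteClusters.readCluster(host).build().connect()`, each kept for the lifetime of the application.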

Consistency can be set at the Cluster level, in which case any query run with session.execute will use that consistency level by default. You can also set the consistency level on the individual statement you pass to session.execute, which overrides the Cluster-level default for that statement only.
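For the per-statement route, here is a minimal sketch assuming the 3.x driver, where every Statement has a setConsistencyLevel method. The helper class and its method names are hypothetical, and QUORUM / ONE are just example levels.

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Hypothetical helper: tags each statement with the consistency level it should run at.
final class PerStatementConsistency {

    // Writes ask for QUORUM regardless of the Cluster-level default.
    static Statement write(String cql, Object... values) {
        return new SimpleStatement(cql, values)
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
    }

    // Reads ask for ONE.
    static Statement read(String cql, Object... values) {
        return new SimpleStatement(cql, values)
                .setConsistencyLevel(ConsistencyLevel.ONE);
    }
}
```

Usage would look like `session.execute(PerStatementConsistency.write("INSERT INTO ks.users (id, name) VALUES (?, ?)", id, name))`, where the keyspace, table, and variables are placeholders. Bound statements created from a prepared statement can be given a consistency level the same way.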

Related

Insert Data using Spark in Cassandra

I am writing 1.2 billion rows of data (two columns) to Cassandra using Spark and the DataStax Spark connector. I have a two-DC setup and will be writing with LOCAL_QUORUM. The replication factor is 3 in each DC. Will the other DC introduce latency? What else should I keep in mind while inserting data? I have tested on a single DC and the results are satisfactory.
Writes will be sent to the other DC anyway, but because you're using LOCAL_QUORUM, Spark won't wait for confirmation from nodes in that DC, so it shouldn't affect latency. The one thing I would monitor: if the other DC is far away and/or has a slow link, the nodes coordinating the writes may start to collect hints. If that happens, it may slightly affect performance, because hints need to be written and then replayed once the remote node is reachable again.

Cassandra throughput decrease when moving from "single data node" to "two data node" cassandra cluster

I have a single data node running Cassandra 3.11.2 and the Cassandra C++ driver version 2.7. The single-node cluster holds 500,000 rows. I read asynchronously and push the data to a queue, from which a scheduler picks it up and writes it asynchronously using the Cassandra C++ driver. I have 10 application threads, 10 IO threads and 10 scheduler threads. I got a TPS of 38,000.
I then did the same thing with a "TWO DATA NODE" Cassandra cluster, with both nodes on the same rack, reading and writing with consistency level TWO. My TPS dropped to 12,000.
Why does performance degrade so much when all the configuration and the client binary are the same? The only change is READ CONSISTENCY TWO and WRITE CONSISTENCY TWO.
What else do I need to do to get a TPS of around 40,000? Do I need to add more data nodes?
Consistency level TWO means that when you read, you need to get data from two nodes, and this adds latency. The same goes for writes: when you write with TWO, two nodes must confirm that the data is written, which also adds latency.
I would recommend reading the relevant section of the DSE Architecture guide (or better, the whole guide) to get an understanding of consistency levels.

Improving SQL Query using Spark Multi Clusters

I was experimenting to see whether Spark with multiple workers can improve a slow SQL query. I created two workers for the master, running on a local Spark Standalone cluster. Yes, I did halve the memory and the number of cores to create the workers on my local machine. I specified partitions for sqlContext using partitionColumn, lowerBound, upperBound and numPartitions, so that tasks (or partitions) can be distributed over the workers. I described them as below (partitionColumn is unique):
df = sqlContext.read.format("jdbc").options(
    url = "jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable = query,
    partitionColumn = "RowId",
    lowerBound = 1,
    upperBound = 10000000,
    numPartitions = 4).load()
I ran my script against the master after specifying the options, but I couldn't get any performance improvement compared to running on Spark without a cluster. I know I should not have halved the memory, for the integrity of the experiment. But I would like to know whether that might be the cause, or what the reason is if not. Any thoughts are welcome. Many thanks.
There are multiple factors which play a role here, though the weights of each of these can differ on a case by case basis.
As nicely pointed out by mtoto, increasing the number of workers on a single machine is unlikely to bring any performance gains.
Multiple workers on a single machine have access to the same fixed pool of resources. Since a worker doesn't participate in the processing itself, you just use a higher fraction of this pool for management.
There are legitimate cases when we prefer a higher number of executor JVMs, but that is not the same as increasing the number of workers (the former is an application resource, the latter is a cluster resource).
It is not clear if you use the same number of cores for baseline and multi-worker configuration, nevertheless cores are not the only resource you have to consider working with Spark. Typical Spark jobs are IO (mostly network and disk) bound. Increasing number of threads on a single node, without making sure that there is sufficient disk and network configuration, will just make them wait for the data.
Increasing cores alone is useful only for jobs which are CPU bound (and these will typically scale better on a single machine).
Fiddling with Spark resources won't help you, if external resource cannot keep up with the requests. A high number of concurrent batch reads from a single non-replicated database will just throttle the server.
In this particular case you make it even worse by running a database server on the same node as Spark. It has some advantages (all traffic can go through loopback), but unless database and Spark use different sets of disks, they'll be competing over disk IO (and other resources as well).
Note:
It is not clear what the query is, but if it is slow when executed directly against the database, fetching it from Spark will make it even slower. You should probably take a closer look at the query and/or the database structure and configuration first.
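To make the point about concurrent reads concrete, here is a rough sketch of how partitionColumn, lowerBound, upperBound and numPartitions turn into range predicates. It is a simplified approximation of the stride logic in Spark's JDBC source, not Spark's actual code, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: approximates how a partitioned JDBC read is split into range queries.
final class JdbcPartitionSketch {

    // Returns one WHERE predicate per partition; assumes numPartitions >= 2 and lower < upper.
    static List<String> predicates(String column, long lower, long upper, int numPartitions) {
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> result = new ArrayList<>();
        long bound = lower;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower limit and the last has no upper limit,
            // so rows outside [lower, upper] are still read.
            String lo = (i == 0) ? null : column + " >= " + bound;
            bound += stride;
            String hi = (i == numPartitions - 1) ? null : column + " < " + bound;
            if (lo == null) {
                result.add(hi + " OR " + column + " IS NULL");
            } else if (hi == null) {
                result.add(lo);
            } else {
                result.add(lo + " AND " + hi);
            }
        }
        return result;
    }
}
```

With the options from the question, `JdbcPartitionSketch.predicates("RowId", 1, 10000000, 4)` yields four ranges of about 2.5 million RowIds each. Each range becomes its own query against the same SQL Server instance, so the database has to serve all of them, and they run at the same time when enough executor cores are free.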

Cassandra consistency Issue

We have our Cassandra cluster running on AWS EC2 with 4 nodes in the ring. We faced a data inconsistency issue. After we changed the consistency level to TWO in the "cqlsh" shell, the data inconsistency issue was solved.
But we don't know how to set the consistency level on the Cassandra cluster.
The consistency level can be set on a per-session or per-statement basis. You will need to check the consistency level of both writes and reads: to get strong consistency, R + W (read consistency + write consistency, counted in replicas) should be greater than your replication factor.
If you are using the Java driver, you can set the default consistency at the cluster level using the "Cluster.Builder.withQueryOptions()" method.
http://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/Cluster.Builder.html#withQueryOptions-com.datastax.driver.core.QueryOptions-
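The R + W > RF rule is plain arithmetic on replica counts. A small sketch, with a made-up class name and with consistency levels expressed as the number of replicas that must respond:

```java
// Hypothetical helper for reasoning about consistency level combinations.
final class ConsistencyMath {

    // Replicas needed for QUORUM at a given replication factor: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every read overlaps the latest acknowledged write on at least one replica.
    static boolean isStronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

For a replication factor of 3, ONE for both reads and writes gives 1 + 1 = 2, which is not greater than 3, so a read can miss the latest write. QUORUM for both gives 2 + 2 = 4, which satisfies the rule, as does writing with ONE and reading with ALL (1 + 3 = 4).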

Cassandra cluster with each node total replication

Hi I'm new to Cassandra. I have a 2 node Cassandra cluster. For reasons imposed by the front end I need...
Total replication of all data on each of the two nodes.
Eventually consistent writes, so that the node being written to responds with an acknowledgement to the front end straight away, without waiting for the replication to complete.
Can anyone tell me whether this is possible? Is it done in the YAML file? I know there are properties there for consistency, but I don't see that any of the partitioners suit my needs. Where can I set the replication factor?
Thanks
You set the replication factor when you create the keyspace. So if you use (and plan to keep using) a single data center setup, you create the keyspace using cqlsh like so:
CREATE KEYSPACE "Excalibur"
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 3};
Check out the documentation for CREATE KEYSPACE. How this is handled internally depends on the snitch defined for the cluster and the replication strategy defined per keyspace. In the case of the SimpleStrategy above, this simply assumes a ring topology for your cluster and places the replicas clockwise around that ring.
Regarding consistency, you can set different levels of consistency for write and read operations in your client/driver, on each operation:
Cassandra extends the concept of eventual consistency by offering tunable consistency―for any given read or write operation, the client application decides how consistent the requested data should be.
Read the doc
If you use Java in your clients with the DataStax Java driver, you can set the consistency level using
QueryOptions.setConsistencyLevel(ConsistencyLevel consistencyLevel)
ONE is the default setting (driver versions 3.0 and later default to LOCAL_ONE instead).
Hope that helps

Resources