Number of connections handled by singleton object of Cassandra

I have a three-node Cassandra cluster which currently serves 50 writes/sec. It will soon need to handle 100 writes/sec. Here are the details of my cluster:
Keyspace definition :
CREATE KEYSPACE keyspacename WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
Partitioner :
org.apache.cassandra.dht.RandomPartitioner
I have a C# client (DataStax C# driver), and I am using the singleton design pattern, i.e. creating only one session object for the Cassandra cluster, which is used for both writing and reading data from the ring. The reason for doing this was that TCP connections to the ring were not getting closed. So far my ring is working fine and is able to sustain the load of 50 writes/sec; now that load will increase to 100 writes/sec.
So my question is: will the same design pattern be able to handle the increased load, given the configuration of my ring?
C# code :
private static readonly object _lock = new object();
private static ISession _singleton;

public static ISession GetSingleton()
{
    if (_singleton == null)
    {
        // Double-checked locking: without it, concurrent callers could build several clusters
        lock (_lock)
        {
            if (_singleton == null)
            {
                Cluster cluster = Cluster.Builder()
                    .AddContactPoints(ConfigurationManager.AppSettings["cassandraCluster"].Split(','))
                    .Build();
                // A single ISession is thread-safe and maintains its own connection pool
                _singleton = cluster.Connect(ConfigurationManager.AppSettings["cassandraKeySpace"]);
            }
        }
    }
    return _singleton;
}

From the Cassandra side, 100 writes/sec is quite low; it will handle that easily.
From the client side, I see no problem with your design. In my opinion, using the singleton pattern here is a good idea. But I cannot give you an exact answer, since I do not know:
What the size of your written data is.
How performant your network is.
Whether you use synchronous or asynchronous execution.
Generally, we can reasonably assume about 10 ms per write. With synchronous execution, a single thread would therefore be able to write about 100 times/sec, but you could not scale this way indefinitely, because the driver would not create more connections.
On the other hand, you can use the ExecuteAsync method to execute writes asynchronously. The C# Cassandra driver will manage the connection pool for you.
Another tip I can give you is to use a PreparedStatement, as sketched below.
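For illustration, here is a minimal sketch of the prepared-statement plus asynchronous-execution pattern. It uses the DataStax Java driver (the C# driver exposes the analogous Prepare and ExecuteAsync calls); the table and its columns are made up for the example:
import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class AsyncWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("keyspacename");

        // Prepare once, bind many times: the query is parsed server-side only once
        PreparedStatement ps = session.prepare(
                "INSERT INTO events (id, payload) VALUES (?, ?)"); // hypothetical table

        for (int i = 0; i < 100; i++) {
            // executeAsync returns a future immediately; the driver's connection
            // pool pipelines requests, so one Session sustains far more than 100 writes/sec
            session.executeAsync(ps.bind(UUID.randomUUID(), "payload-" + i));
        }
        cluster.close();
    }
}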

Related

from single machine to parallel processing?

I am new to Spark Streaming and big data in general. I am trying to understand the structure of a project in Spark. I want to create a main class, let's say "driver", with M machines, where each machine keeps an array of counters and their values. On a single machine, without Spark, I would create a class for the machines and a class for the counters and do the computations I want. But I am wondering whether the same applies in Spark. Would the same project, but in Spark, have the structure I am quoting below?
import scala.collection.mutable.Queue

class Driver {
  var num: Int = 100
  var machines: Array[Machine] = new Array[Machine](num)
  // split the incoming DStream and fill the machines' queues
}

class Machine {
  var counters = new Queue[(Int, Int)]() // e.g. counter with id 1 and value 25
  def fillCounters: Unit = { ... } // function to fill the counters queue
}
In general, you can think of a Spark application as having a driver part, which runs all coordination tasks and constructs the graph of operations to perform over your data (you will find mentions of a directed acyclic graph, or DAG, in the theoretical parts of tutorials on Spark and distributed computation), and an executor part, which results in many copies of the code being sent to each node of the cluster to run over the data.
The main idea is that the driver extracts the part of your application's code that needs to run locally with the data on the nodes, serializes it, sends it over the network to each executor, launches it, and then manages it and collects the results.
The Spark framework hides these details for simplicity of use, so an application under development looks like a single-threaded application.
A developer can separate the contexts that run on the driver and on the executors, but this is not very common in tutorials (again, for simplicity).
So, to answer the actual question above:
you do not need to design your application in the way you demonstrated above, unless you really want to.
Just follow the official Spark tutorial to get a viable solution, and split it afterwards by execution context.
There is a good post summarizing a lot of Spark tutorials, videos and talks; you can find it here on SO.
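To make the driver/executor split concrete, here is a minimal self-contained sketch. It is in Java rather than the question's Scala, it assumes only spark-core on the classpath, and all names are illustrative:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CounterDriver {
    public static void main(String[] args) {
        // Everything in main() runs on the driver: it builds the DAG of
        // operations but does not touch the data elements itself
        SparkConf conf = new SparkConf()
                .setAppName("counter-driver")
                .setMaster("local[*]"); // local mode, just for the sketch
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> counterIds = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // This lambda is serialized and shipped to the executors; it runs on
        // the cluster nodes, once per element, not on the driver
        JavaRDD<String> counters = counterIds.map(id -> "counter-" + id + " = " + id * 25);

        // collect() pulls the executors' results back to the driver
        counters.collect().forEach(System.out::println);
        sc.stop();
    }
}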

Should I use Hazelcast to detect duplicate requests to a REST service

I have a simple use case: a system where duplicate requests to a REST service (with dozens of instances) are not allowed, yet are also difficult to prevent because of a complicated datastore configuration and downstream services.
So the only way I can prevent duplicate "transactions" is to have some centralized place where I write a unique hash of the request data. Each REST endpoint first checks whether the hash of a new request already exists and only proceeds if it does not.
For purposes of this question assume that it's not possible to do this with database constraints.
One solution is to create a table in the database where I store my request hashes and always write to this table before proceeding with the request. However, I want something lighter than that.
Another solution is to use something like Redis and write my unique hashes there before proceeding with the request. However, I don't want to spin up a Redis cluster and maintain it, etc.
I was thinking of embedding Hazelcast in each of my app instances and write my unique hashes there. In theory, all instances will see the hash in the memory grid and will be able to detect duplicate requests. This solves my problem of having a lighter solution than a database and the other requirement of not having to maintain a Redis cluster.
OK, now for my question. Is it a good idea to use Hazelcast for this use case?
Will Hazelcast be fast enough to detect duplicate requests that come in milliseconds or microseconds apart?
Say request 1 comes into instance 1 and request 2 comes into instance 2 microseconds apart; instance 1 writes a hash of the request to Hazelcast, and instance 2 checks Hazelcast for the existence of that hash only milliseconds later. Will the hash have been detected? Is Hazelcast going to propagate the data across the cluster in time? Does it even need to?
Thanks in advance, all ideas are welcome.
Hazelcast is definitely a good choice for this kind of use case, especially if you just use a Map<String, Boolean> and test with Map::containsKey instead of retrieving the element and checking for null. You should also put a TTL on each entry so you won't run out of memory. However, as with Redis, we recommend running Hazelcast as a standalone cluster for "bigger" datasets, as the lifecycle of cached elements normally interferes with the rest of the application and complicates GC optimization. Running Hazelcast embedded is a choice that should be made only after serious consideration and testing of your application at runtime.
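As a minimal sketch of that approach (the map name and helper method are made up), note that an atomic putIfAbsent with a TTL also closes the race between a separate containsKey check and the subsequent put:
import java.util.concurrent.TimeUnit;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class DuplicateDetector {
    private final IMap<String, Boolean> seen;

    public DuplicateDetector(HazelcastInstance hz) {
        this.seen = hz.getMap("request-hashes"); // hypothetical map name
    }

    // Returns true only for the first request carrying this hash;
    // the entry expires after 10 minutes so the map cannot grow unbounded
    public boolean isFirstRequest(String requestHash) {
        return seen.putIfAbsent(requestHash, Boolean.TRUE, 10, TimeUnit.MINUTES) == null;
    }
}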
Yes, you can use a Hazelcast distributed map to detect duplicate requests to a REST service, since whenever a put operation happens on a Hazelcast map, the data becomes available to all the other clustered instances.
From what I've read and seen in the tests, it doesn't actually replicate. It uses a data grid to distribute the primary data evenly across all the nodes rather than each node keeping a full copy of everything and replicating to sync the data. The great thing about this is that there is no data lag, which is inherent to any replication strategy.
There is a backup copy of each node's data stored on another node, and that obviously depends on replication, but the backup copy is only used when a node crashes.
See the code below, which creates two clustered Hazelcast instances and obtains the distributed map. One Hazelcast instance puts data into the distributed IMap and the other instance reads data from it.
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class TestHazelcastDataReplication {

    // Create 1st instance
    public static final HazelcastInstance instanceOne = Hazelcast
            .newHazelcastInstance(new Config("distributedFirstInstance"));

    // Create 2nd instance
    public static final HazelcastInstance instanceTwo = Hazelcast
            .newHazelcastInstance(new Config("distributedSecondInstance"));

    // Insert into distributedMap using instance one
    static IMap<Long, Long> distributedInsertMap = instanceOne.getMap("distributedMap");

    // Read from distributedMap using instance two
    static IMap<Long, Long> distributedGetMap = instanceTwo.getMap("distributedMap");

    public static void main(String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                for (long i = 0; i < 100000; i++) {
                    // Inserting data into distributedMap using 1st instance
                    distributedInsertMap.put(i, System.currentTimeMillis());
                    // Reading data from distributedMap using 2nd instance
                    System.out.println(i + " : " + distributedGetMap.get(i));
                }
            }
        }).start();
    }
}

What is a good way to discover all queries made by a Cassandra java app?

For a SQL db, I could just turn on logging on the driver to see what queries were made... or log them on the server side if I had access to the db server.
How is this accomplished in Cassandra?
You can do this with QueryLogger.
The QueryLogger provides clients with the ability to log queries executed by the driver and, in particular, to track slow queries, i.e. queries that take longer to complete than a configured threshold in milliseconds.
QueryLogger Example Code
try (Cluster cluster = Cluster.builder()
        .addContactPoints("127.0.0.1")
        .withCredentials("username", "password")
        .build();
     Session session = cluster.connect("test")) {

    QueryLogger queryLogger = QueryLogger.builder()
            .withConstantThreshold(500)
            .build();
    cluster.register(queryLogger);

    for (Row row : session.execute("select userid, firstname from users limit 10")) {
        System.out.println(row);
    }
}
Here withConstantThreshold(500) means that any query taking longer than 500 milliseconds will be treated as a slow query.
A QueryLogger that uses a constant threshold in milliseconds to track slow queries. This implementation is the default and should be preferred to QueryLogger.DynamicThresholdQueryLogger, which is still in beta state.
You also need to set your logger level to DEBUG to enable query logging; the QueryLogger logs through SLF4J under the com.datastax.driver.core.QueryLogger categories (SLOW, NORMAL and ERROR).
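For example, assuming Logback is the SLF4J backend (a common but not universal setup), the relevant categories can be raised to DEBUG programmatically:
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class EnableQueryLogging {
    public static void main(String[] args) {
        // The QueryLogger emits through these SLF4J categories;
        // raising them to DEBUG turns the logging on
        ((Logger) LoggerFactory.getLogger("com.datastax.driver.core.QueryLogger.SLOW"))
                .setLevel(Level.DEBUG);
        ((Logger) LoggerFactory.getLogger("com.datastax.driver.core.QueryLogger.NORMAL"))
                .setLevel(Level.DEBUG);
    }
}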

Spark Cassandra Connector proper usage

I'm looking to use Spark for some ETL, which will mostly consist of "update" statements (a column is a set that will be appended to, so a simple insert is likely not going to work). As such, it seems like issuing CQL queries to import the data is the best option. Using the Spark Cassandra Connector, I see I can do this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md#connecting-manually-to-cassandra
Now I don't want to open a session and close it for every row in the source (am I right in not wanting this? Usually, I have one session for the entire process, and keep using that in "normal" apps). However, it says that the connector is serializable, but the session is obviously not. So, wrapping the whole import inside a single "withSessionDo" seems like it'll cause problems. I was thinking of using something like this:
class CassandraStorage(conf: SparkConf) {
  val session = CassandraConnector(conf).openSession()

  def store(t: Thingy): Unit = {
    // session.execute cql goes here
  }
}
Is this a good approach? Do I need to worry about closing the session? Where / how best would I do that? Any pointers are appreciated.
You actually do want to use withSessionDo, because it won't open and close a session on every access. Under the hood, withSessionDo accesses a JVM-level session. This means you will have only one session object per cluster configuration per node.
This means code like
val connector = CassandraConnector(sc.getConf)
sc.parallelize(1L to 10000000L)
  .map(x => connector.withSessionDo(session => stuff))
will only ever create one Cluster and Session object on each executor JVM, regardless of how many cores each machine has.
For efficiency, I would still recommend using mapPartitions to minimize cache checks.
sc.parallelize(1L to 10000000L)
  .mapPartitions(it => connector.withSessionDo(session =>
    it.map(row => stuff))) // "stuff" = your per-row work against the session
In addition, the session object also uses a prepare cache, which lets you cache a prepared statement in your serialized code; it will only ever be prepared once per JVM (all other calls will return the cached reference).

Cassandra. Not enough replica available - Java driver behaviour different from CQL console

I have a very simple cluster with 2 nodes.
I have created a keyspace with SimpleStrategy replication and a replication factor of 2.
For reads and writes I always use the default data consistency level of ONE.
If I take down one of the two nodes, then using the DataStax Java driver I can still read data, but when I try to write I get "Not enough replica available for query at consistency ONE (1 required but only 0 alive)".
Strangely, if I execute the exact same insert statement from the CQL console, it works without any problem, even though the CQL console also uses consistency level ONE.
Am I missing something?
TIA
Update
I have done some more tests, and the problem appears only when I use a BatchStatement. If I execute the prepared statement directly, it works. Any idea?
Here is the code:
Cluster cluster = Cluster.builder()
        .addContactPoint("192.168.1.10")
        .addContactPoint("192.168.1.12")
        .build();
Session session = cluster.connect();
session.execute("use giotest");

BatchStatement batch = new BatchStatement();
PreparedStatement statement = session.prepare(
        "INSERT INTO hourly(series_id, timestamp, value) VALUES (?, ?, ?)");

for (int i = 0; i < 50; i++) {
    batch.add(statement.bind(new Long(i), new Date(), 2345.5));
}

session.execute(batch);
batch.clear();
session.close();
cluster.close();
Batches are atomic by default: if the coordinator fails mid-batch, Cassandra will make sure other nodes replay the remaining requests. It uses a distributed batch log for that (see this post for more details).
This batch log must be replicated to at least one replica other than the coordinator, otherwise that would defeat the above mechanism.
In your case, there is no other replica, only the coordinator. So Cassandra is telling you that it cannot provide the guarantees of an atomic batch. See also the discussion on CASSANDRA-7870.
If you haven't already, make sure you have specified both hosts at the driver level.
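Going beyond the original answer: if you do not need the atomicity guarantee for these inserts, a commonly used workaround is an unlogged batch, which skips the distributed batch log and therefore does not need a second live replica. A minimal sketch reusing the session and prepared statement from the question's code:
import java.util.Date;
import com.datastax.driver.core.BatchStatement;

// UNLOGGED batches skip the batch log, so they work with a single live
// replica, at the cost of losing the batch atomicity guarantee
BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
for (int i = 0; i < 50; i++) {
    batch.add(statement.bind((long) i, new Date(), 2345.5));
}
session.execute(batch);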
