Multithreading multiple concurrent Cassandra connections

I am running an ExecutorService with more than 50 threads concurrently. Each thread opens a connection to Cassandra and performs inserts. The problem is that when I open more than 50 connections at a time, I get the following error.
Caused by: Failed to create a selector.
at com.datastax.driver.core.Connection$Factory.<init>(
at com.datastax.driver.core.Cluster$Manager.<init>(
at com.datastax.driver.core.Cluster$Manager.<init>(
at com.datastax.driver.core.Cluster.<init>(
at com.datastax.driver.core.Cluster.<init>(
at com.datastax.driver.core.Cluster.buildFrom(
at com.datastax.driver.core.Cluster$
If I open 50 threads or fewer, it works fine. Is there a way to configure this so I can allow more? In my cassandra.yaml file, the comments for rpc_max_threads say "The default is unlimited".

My guess is you are overwhelming your OS by creating too many connections. You should only create 1 Cluster instance per Cassandra cluster. Clusters create Sessions, which manage their own connection pools. Both Cluster and Session are thread safe, so you can share them between threads.
Four simple rules for coding with the driver distills these concepts well:
When writing code that uses the driver, there are four simple rules that you should follow that will also make your code efficient:
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
A Cluster instance lets you configure important aspects of how connections and queries are handled. At this level you can configure everything from contact points (the addresses of the nodes to be contacted initially, before the driver performs node discovery) to the request routing policy, retry and reconnection policies, and so forth. Generally such settings are set once at the application level.
While the session instance is centered around query execution, the Session also manages the per-node connection pools. The session instance is a long-lived object, and it should not be used in a request-response, short-lived fashion. The code should share the same cluster and session instances across your application.
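The sharing pattern described in this answer can be sketched with plain JDK classes. This is only an illustration under stated assumptions: `SharedSession` is a hypothetical stand-in for the driver's thread-safe Session (it is not the DataStax API), and the thread and task counts are arbitrary. The point is that the expensive object is created once, before the workers start, and every task reuses it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedSessionDemo {

    // Hypothetical stand-in for the driver's thread-safe Session; not the DataStax API.
    static final class SharedSession {
        static final AtomicInteger OPENED = new AtomicInteger();
        private final AtomicInteger inserts = new AtomicInteger();

        SharedSession() {
            OPENED.incrementAndGet(); // counts how many sessions were set up
        }

        void insert(int rowId) {
            inserts.incrementAndGet(); // a real session would send the INSERT here
        }

        int insertCount() {
            return inserts.get();
        }
    }

    // Runs `tasks` insert jobs on `threads` worker threads, all sharing one session.
    // Returns { sessions opened during this run, inserts performed }.
    static int[] run(int threads, int tasks) throws Exception {
        int openedBefore = SharedSession.OPENED.get();
        SharedSession session = new SharedSession(); // created once, before any worker starts
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int rowId = i;
                pending.add(pool.submit(() -> session.insert(rowId)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any task failure
            }
        } finally {
            pool.shutdown();
        }
        return new int[] { SharedSession.OPENED.get() - openedBefore, session.insertCount() };
    }

    public static void main(String[] args) throws Exception {
        int[] result = run(100, 1000);
        // prints "sessions opened: 1, inserts: 1000"
        System.out.println("sessions opened: " + result[0] + ", inserts: " + result[1]);
    }
}
```

With the real driver, the shared object would be the Session returned by the Cluster, and the number of worker threads would no longer determine how many connections are opened.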


cassandra connections spikes load issue

I am running Cassandra with the following setup:
21 nodes, AWS EC2 i3.2xlarge, version 3.11.4.
The application opens about 5,000 connections per node (so about 100k connections across the cluster) using the DataStax Java driver.
The application autoscales and frequently opens and closes connections.
The app servers can open up to 500 connections per node at once, simultaneously on all nodes, so about 10k connections open at the same time across the cluster.
This causes load spikes on Cassandra and increases read and write latency.
I have noticed that each time connections open or close, there is a high number of reads from system_auth.roles and system_auth.role_permissions.
How can I prevent the load and resolve this issue?
You need to modify your application to work with as few connections as possible. Keep the following in mind:
Create the Cluster/Session object once at startup and keep it. Initializing a session is a very expensive operation; it adds load to Cassandra and to your application as well.
You can increase the number of simultaneous requests per connection instead of opening new connections. The protocol allows up to 32k requests per connection. However, if you have too many requests in flight, it is a sign that Cassandra can't keep up with the workload and can't answer fast enough. See the documentation on connection pooling.
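One way to keep many requests on a few connections while still noticing when too many are in flight is to cap outstanding asynchronous requests on the client side. The following is a minimal sketch using only JDK classes, not the driver's own pooling configuration: `InFlightLimiter`, `fakeRequest`, and the limit of 4 are all hypothetical names and values chosen for illustration, and the "server" is just a thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class InFlightLimiter {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller while maxInFlight requests are outstanding, then starts
    // the request and frees the permit once the request completes or fails.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> request)
            throws InterruptedException {
        permits.acquire();
        peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        CompletableFuture<T> future;
        try {
            future = request.get();
        } catch (RuntimeException e) {
            inFlight.decrementAndGet();
            permits.release();
            throw e;
        }
        return future.whenComplete((value, error) -> {
            inFlight.decrementAndGet();
            permits.release();
        });
    }

    int peakInFlight() {
        return peak.get();
    }

    // Hypothetical stand-in for an asynchronous query on a shared connection.
    static CompletableFuture<Integer> fakeRequest(int id, Executor server) {
        return CompletableFuture.supplyAsync(() -> id * 2, server);
    }

    // Sends `requests` fake queries and returns { peak in-flight count, completed count }.
    static int[] demo(int maxInFlight, int requests) throws InterruptedException {
        InFlightLimiter limiter = new InFlightLimiter(maxInFlight);
        ExecutorService fakeServer = Executors.newFixedThreadPool(8);
        try {
            List<CompletableFuture<Integer>> all = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final int id = i;
                all.add(limiter.submit(() -> fakeRequest(id, fakeServer)));
            }
            int completed = 0;
            for (CompletableFuture<Integer> f : all) {
                if (f.join() != null) {
                    completed++;
                }
            }
            return new int[] { limiter.peakInFlight(), completed };
        } finally {
            fakeServer.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = demo(4, 200);
        System.out.println("peak in flight: " + result[0] + ", completed: " + result[1]);
    }
}
```

If callers spend a lot of time blocked in `submit`, that is the client-side symptom of the situation described above: the cluster is not answering fast enough for the offered load, and opening more connections would not help.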

nodejs | worker_thread | keep alive tcp connection within workers?

Using worker_threads from Node 12, is it suitable to establish remote connections within the workers and keep those connections alive?
I don't mean sharing the socket between the master and the workers, as we could do with node cluster and fork.
The idea would be to have pools of secure connections already established within the workers, to use if needed.
Let's say I have a pool of 10 workers. When a worker is created, some pre-established "TLS" connections (streams) are created to servers X, Y and Z, and the worker is marked as "ready".
Each time I use a worker to process "heavy" tasks (mapReduce, etc.), if I need to post data to or get data from server X, Y or Z during the process,
I use the appropriate "TLS" connection already established from the pool.
Once the task is completed, the result is returned to the master and the worker just executes the next task.
1) Do you see any side effects or impact of doing so?
2) Would it be better to have the pool of "TLS" connections on the "main thread" (master)? If "remote" data is needed within the workers during the tasks, they would use the "postMessage" method to communicate with the "master" (and vice versa).
Worker Threads do not work for remote connections. However, you can build your own system that works similarly using TLS sockets. For such a system I would definitely recommend keeping these connections alive. Setting them up adds significant latency, and keeping them open in memory uses few resources.
Keep in mind that a system like this has some drawbacks:
You are working with different machines, and each of these machines can have its own set of failure conditions.
You are communicating over a network, connections with remote servers might suddenly drop, for any reason imaginable.
You are increasing the physical distance, which adds latency.
So keep this in the back of your mind.
Would I recommend building a system like this? That is really hard to determine; it depends on your use case, time and money. You mentioned the cluster nodes are processing "heavy tasks", which I take to mean CPU- or GPU-intensive tasks. So a system like this might be a good solution; however, a simple REST API in front of your processing servers might be good enough. Or maybe even database-synchronized servers that just check the database for tasks to execute.
There are many solutions to the same problem; you just have to consider what works best for your project(s).

Using Sessions in Cassandra

When using the Cassandra DataStax Java driver, when should I use multiple sessions under the same cluster? I am not able to find any good use case for having one cluster and multiple sessions.
My application has multiple components/modules that access Cassandra. Based on the answer, I will decide whether to have one session per component/module or just one session shared across all the components of my application.
Update: Everywhere on the internet the recommendation is to use one session. I get it, but my question is "in what scenario do you create multiple sessions for one cluster?". If there is no such scenario, why does the library allow creating multiple sessions? It could instead just have a method that returns a singleton session object.
Use just one Session across all your components.
In Cassandra, a Session is a heavy, thread-safe object. It maintains multiple connections, cached prepared statements, etc.
Here is the Javadoc:
A session holds connections to a Cassandra cluster, allowing it to be queried. Each session maintains multiple connections to the cluster nodes, provides policies to choose which node to use for each query (round-robin on all nodes of the cluster by default), and handles retries for failed query (when it makes sense), etc...
Session instances are thread-safe and usually a single instance is enough per application. As a given session can only be "logged" into one keyspace at a time (where the "logged" keyspace is the one used by query if the query doesn't explicitly use a fully qualified table name), it can make sense to create one session per keyspace used. This is however not necessary to query multiple keyspaces since it is always possible to use a single session with fully qualified table name in queries.
Source :
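To make the last point of the Javadoc concrete, here is a small sketch of what "fully qualified table names" looks like in query text, so that one session can serve several keyspaces. The class `QualifiedQueries` and the keyspace and table names are made up for illustration, and the helper assumes plain unquoted identifiers (a letter followed by letters, digits or underscores).

```java
public class QualifiedQueries {

    // Rejects anything that is not a plain unquoted identifier, so user input
    // cannot smuggle extra CQL into the statement text.
    private static String identifier(String name) {
        if (name == null || !name.matches("[A-Za-z][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("not a plain identifier: " + name);
        }
        return name;
    }

    // Builds a SELECT that names the keyspace explicitly, so it does not depend
    // on which keyspace (if any) the session is "logged" into.
    static String selectAll(String keyspace, String table) {
        return "SELECT * FROM " + identifier(keyspace) + "." + identifier(table);
    }

    public static void main(String[] args) {
        // Both statements could be executed through the same session.
        System.out.println(selectAll("sales", "orders"));  // prints "SELECT * FROM sales.orders"
        System.out.println(selectAll("audit", "events"));  // prints "SELECT * FROM audit.events"
    }
}
```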

Cassandra - how to manage sessions

I am new to Cassandra, and I would like to ask you something. I have some events, and on each event, the application responds with some code that is similar to this:
Cluster cluster = Cluster.builder().addContactPoint(CONTACT_POINT).build();
Session session = cluster.connect(KEYSPACE);
// set(...) and eq(...) are static imports from QueryBuilder
Statement statement = QueryBuilder.update(KEYSPACE, TABLE_NAME)
        .with(set(STATE_COLUMN, status.toString()))
        .and(set(PERCENT_DONE_COLUMN, percentDone))
        .where(eq(FILE_ID_COLUMN, id));
// or whatever query I might have
session.execute(statement);
My question is this:
Is it better to call cluster.connect() and cluster.close() each time, or just call cluster.connect() once at application start up?
Connections in Cassandra are designed to be persistent, so they should not be opened and closed for each CQL statement. Setting up a connection is somewhat expensive, since it creates thread pools and obtains a lot of metadata from the cluster.
You want to set up the connection once at application startup and close it when your application is shutting down. If you have multiple threads within your application, you generally want them to all share a single connection.
You should connect and close as rarely as possible.
While the session instance is centered around query execution, the
Session also manages the per-node connection pools. The session
instance is a long-lived object, and it should not be used in a
request-response, short-lived fashion. The code should share the same
cluster and session instances across your application.
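The "connect once at startup, close at shutdown" advice from both answers can be sketched as follows. This is an illustration rather than driver code: `FakeSession` is a hypothetical stand-in that only counts calls, and `EventHandler` is an invented name for whatever component reacts to the events in the question. The handler receives the long-lived session instead of connecting on every event.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SessionLifecycleDemo {

    // Hypothetical stand-in for a driver session; not the DataStax API.
    static final class FakeSession implements AutoCloseable {
        static final AtomicInteger CONNECTS = new AtomicInteger();
        static final AtomicInteger CLOSES = new AtomicInteger();
        private final AtomicInteger executed = new AtomicInteger();

        FakeSession() {
            CONNECTS.incrementAndGet(); // the expensive step: done once per application run
        }

        void execute(String description) {
            executed.incrementAndGet(); // a real session would run the statement here
        }

        int executedCount() {
            return executed.get();
        }

        @Override
        public void close() {
            CLOSES.incrementAndGet();
        }
    }

    // Receives the long-lived session instead of connecting on each event.
    static final class EventHandler {
        private final FakeSession session;

        EventHandler(FakeSession session) {
            this.session = session;
        }

        void onEvent(String fileId, int percentDone) {
            session.execute("update " + fileId + " to " + percentDone + "%");
        }
    }

    // Simulates one application run. Returns { connects, closes, statements executed }.
    static int[] runApp(int events) {
        int connectsBefore = FakeSession.CONNECTS.get();
        int closesBefore = FakeSession.CLOSES.get();
        int executed;
        try (FakeSession session = new FakeSession()) { // opened at startup
            EventHandler handler = new EventHandler(session);
            for (int i = 0; i < events; i++) {
                handler.onEvent("file-" + i, i);
            }
            executed = session.executedCount();
        } // closed exactly once, at shutdown
        return new int[] {
            FakeSession.CONNECTS.get() - connectsBefore,
            FakeSession.CLOSES.get() - closesBefore,
            executed
        };
    }

    public static void main(String[] args) {
        int[] r = runApp(50);
        // prints "connects: 1, closes: 1, executed: 50"
        System.out.println("connects: " + r[0] + ", closes: " + r[1] + ", executed: " + r[2]);
    }
}
```

In a long-running service the try-with-resources block would typically be replaced by whatever startup and shutdown hooks the application framework provides; the shape stays the same.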

How to measure effectiveness of using Token Aware connection pool?

My team is testing the token-aware connection pool of Astyanax. How can we measure the effectiveness of this connection pool type, i.e. how can we find out how the tokens are distributed in a ring and how client connections are distributed across them?
Our initial tests, which count the number of open connections on the network cards, show that only 3 out of 4 or more Cassandra instances in a ring are used, and the other nodes participate in request processing only to a very limited extent.
What other information would help in making a valid judgment or verification? Is there a Cassandra/Astyanax API or a command-line tool to help us out?
Use OpsCenter. It will show you how balanced your cluster is, i.e. whether each node has the same amount of data, and it can graph the incoming read/write requests per node and for your entire cluster. It is free and works with open source Cassandra as well as DSE.
