Minimizing Hazelcast chatter and overhead

Our app needs to run both on a large number of machines and on a single standalone machine. It has three distinct clusters, each performing a mostly isolated function. Cluster A is the main one, and clusters B & C are independent, but they both need access to a map in A to know where to route requests. Access needs to be super-fast.
Which setup should I choose?
1. Each cluster has its own Hazelcast instance. Clusters B & C are also lite members of the A instance.
2. Each cluster has its own Hazelcast instance. Clusters B & C use a Hazelcast client to talk to A.
3. One giant instance for all clusters.
I'm concerned about chatter and overhead as the clusters get larger, to potentially hundreds of machines. Which setup is most scalable?
Also, is there a writeup anywhere which details the messages that Hazelcast passes around? I'd like to know exactly what happens when a key gets added or removed, for example.

Try to avoid the lite-member setup (1), as clusters with lite members are harder to maintain.
If all these machines/nodes are on the same local network and the node count stays around 50, you can go with (3), all in one cluster. Otherwise I would go with (2): clients scale really well and are very lightweight.
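To make option (2) concrete, here is a minimal sketch of how a cluster B/C process could read cluster A's routing map through the Hazelcast Java client (assuming Hazelcast 3.x package names; the member addresses and the "routing" map name are placeholders):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class RoutingMapClient {
    public static void main(String[] args) {
        // Point the client at one or more members of cluster A; it discovers the rest.
        ClientConfig config = new ClientConfig();
        config.getNetworkConfig().addAddress("10.0.0.1:5701", "10.0.0.2:5701"); // placeholder addresses

        // Optional: a near cache keeps hot routing entries in the client's own memory,
        // which helps when reads need to be very fast.
        config.addNearCacheConfig(new NearCacheConfig("routing"));

        HazelcastInstance client = HazelcastClient.newHazelcastClient(config);

        // Read cluster A's routing map without joining cluster A as a member.
        IMap<String, String> routing = client.getMap("routing");
        System.out.println(routing.get("some-request-key"));

        client.shutdown();
    }
}

The client holds no partitions, so it adds no partition-migration or backup traffic to cluster A; it only adds its own connections.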

Related

Hazelcast Embedded Topology | Latency increases with number of nodes in cluster

We are running a 5 node cluster of Hazelcast in Embedded mode.
We are running a simple use case of locking using the Hazelcast IMap API.
However, the latency of the request flow increases linearly with the addition of nodes. Is this expected?
Thanks.
It depends on the data structure, but in general "yes".
For IMap the data is spread across the available nodes.
If you have a 3 node cluster, you have the primary copy of 1/3 of the data locally. If you are accessing randomly, then you'll find 66.66% of the calls need to go to other nodes, so they will see the impact of the network.
If you expand this to a 5 node cluster, then you have primary copy of 1/5 of the data locally. For the same random access, now it's 80% of the calls involve the network.
As the number of nodes goes up, the benefits of data locality in embedded mode reduce.
Note also that this is for random access; if you frequently access the same key you could be lucky and it's local, or unlucky and it's remote.
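As a rough illustration of where the remote hops come from, here is a sketch of the embedded locking use case (Hazelcast 3.x API assumed; the map and key names are made up). The partition owner lookup shows whether a given key happens to live on the local member:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.Partition;

public class LockLocalityCheck {
    public static void main(String[] args) {
        // Embedded member: this JVM joins the cluster and owns roughly 1/N of the partitions.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> map = hz.getMap("locks");

        String key = "order-42";

        // If the owner is not the local member, the lock/unlock below involve a network hop.
        Partition partition = hz.getPartitionService().getPartition(key);
        System.out.println("Key " + key + " is owned by " + partition.getOwner());

        map.lock(key);
        try {
            // critical section guarded cluster-wide by the per-key lock
        } finally {
            map.unlock(key);
        }

        hz.shutdown();
    }
}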

How to run two service with different node with voltdb

I have a three node cluster configured for VoltDB. Currently 2 applications are running and all the traffic is going to only a single node (only one server).
As we have a 3 node cluster and the data is replicated across all the nodes, can I run one service on one node and the other service on another node? Is that possible?
Yes, as long as both these services use the same database, they can both point to different nodes in the cluster, and VoltDB will reroute the data to the proper partition accordingly.
However, it is recommended to connect applications to all of the nodes in a cluster, so they can send requests to the cluster more evenly. Depending on which client is being used, there are optimizations that send each request to the optimal server based on which partition is involved. This is often called "client affinity". Clients can also simply send to each node in a round-robin style. Both client affinity and round-robin are much more efficient than simply sending all traffic to 1 node.
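For example, with the VoltDB Java client the application can open a connection to every node and let the client route each call (the hostnames and procedure name below are placeholders; this is a sketch assuming the standard org.voltdb.client API):

import org.voltdb.client.Client;
import org.voltdb.client.ClientConfig;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class MultiNodeConnection {
    public static void main(String[] args) throws Exception {
        ClientConfig config = new ClientConfig();
        // Route each request to the node owning the relevant partition when possible.
        config.setClientAffinity(true);

        Client client = ClientFactory.createClient(config);
        // Connect to every node so requests are spread across the cluster.
        for (String host : new String[] {"node1", "node2", "node3"}) { // placeholder hostnames
            client.createConnection(host);
        }

        ClientResponse response = client.callProcedure("MY_PROCEDURE", 42); // placeholder procedure
        System.out.println(response.getStatusString());

        client.close();
    }
}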
Also, be cautious of running applications on the same hosts as VoltDB nodes, because they could unpredictably starve the VoltDB process of resources that it needs. However, for applications that behave well and on servers where there are adequate resources, they can be co-located and many VoltDB customers do this.
Full Disclosure: I work at VoltDB.

Auto scaling resources for foxx/arangodb on mesos

Is it possible to autoscale Foxx and ArangoDB independently of each other, in lieu of trying to strike a balance and autoscale the right amount of RAM/storage/CPU? Even just an answer on whether it's a good idea to try to autoscale the deployment would be good enough.
You are not very specific about what you mean by saying "scaling ArangoDB".
In general, you can add more DB server nodes (primaries) independent of the number of coordinator nodes, if that is what you're asking. Foxx is executed on coordinators in a cluster. Your data is stored on DB servers. The cluster configuration is managed by agent nodes. An agency count of 3 is recommended (it must always be an odd number).
Yes, you can scale the different roles of an ArangoDB cluster (Agents, Coordinators, DBservers) independently to serve different use cases best.
If you are using Foxx a lot, then you can increase the number of coordinator instances where the Foxx services live. You should be able to do this for Mesos by accessing the ArangoDB WebUI -> Nodes and in the upper right corner you have a button for Coordinators and one for DBservers. Just click on the "+" sign and a new coordinator will get started.

Is it possible to isolate spark cluster nodes for each individual application

We have a Spark cluster comprising 16 nodes. Is it possible to limit nodes 1 & 2 for application 'A'; nodes 3, 4, 5 for application 'B'; nodes 10, 11, 12, 15 for application 'C'; and so on?
From the documentation, I understand that we can set some properties to control spark executor cores, number of executors to be launched, memories etc. But, I am curious to know if I can achieve the above use case.
One obvious way to do that is to configure 3 different clusters with the desired topology. Otherwise you're out of luck: Spark does not have any provision for this,
because it is usually a bad idea and generally against the design principles of Spark and clustering in general. Why? If you assign application A to specific hosts but it sits idle while application B is running at 100%, you have 2 idle hosts that could be working for B, so you would be wasting costly computing resources. Usually, what you want is to assign a certain number of resources per application and let the system decide how to allocate them (scheduling: plain Spark is pretty elementary, but running under YARN or Mesos you can be more sophisticated).
Another reason why it's a bad idea is that you don't want rules that specify a specific host or set of hosts. What if you assign nodes 1 & 2 to application A and they both go down? Besides not using your resources efficiently, tying your app to specific hosts also makes it difficult to make it resilient to failure by rescheduling it on other hosts.
You may have other ways to do something similar though: if you're running Spark under YARN or Mesos, you can define queues or quotas and limit the amount of resources that each application can use at a given time.
In general, it depends on why you want to statically allocate resources to applications. If it's for resource management, you should instead look at schedulers and queues. If it's for security, you should have multiple clusters, keeping in mind that you'd be losing performance.
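As a sketch of the queue/quota approach on YARN, an application can request a bounded share of the cluster instead of specific hosts ("teamA" is a hypothetical queue that the cluster admin would have to define; the resource numbers are arbitrary):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AppASubmit {
    public static void main(String[] args) {
        // Cap application A's share of the cluster instead of pinning it to hosts:
        // a fixed number of executors, each with bounded cores and memory,
        // submitted to a YARN queue.
        SparkConf conf = new SparkConf()
            .setAppName("application-A")
            .set("spark.yarn.queue", "teamA")          // hypothetical queue name
            .set("spark.executor.instances", "4")
            .set("spark.executor.cores", "2")
            .set("spark.executor.memory", "4g");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}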

Preventing Cassandra Node from Being Overwhelmed

When I create a Cassandra cluster builder in Java, I provide a list of multiple Cassandra nodes as shown below:
Cluster cluster = Cluster.builder().addContactPoints(host1, host2, host3, host4).build();
But from what I understand, the connector connects only to the first host in the list that is available, and that host becomes my connection point to the Cassandra cluster.
Now, my question is: if my Java application reads/writes huge amounts of data from/to Cassandra, doesn't it overwhelm the node that it is connected to?
Is there a way to configure my connection such that it uses multiple nodes of Cassandra for its reads/writes? What is the common practice?
It uses the contact point to find the rest of the nodes in the cluster, then creates a pool of connections to all the hosts and balances the requests among them. It doesn't only connect to the hosts you provide unless you use the whitelist load balancing policy or a custom one.
If you're worried about overwhelming nodes, use the round-robin load balancing policy (DC-aware if you have multiple DCs) and it will distribute requests amongst all of them evenly. If you have hot spots of data and use the token-aware policy the distribution may be uneven, but you shouldn't need to worry about it.
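As a sketch with the DataStax Java driver 3.x, a token-aware policy wrapped around a DC-aware round-robin policy spreads requests across all nodes (hostnames are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class BalancedCluster {
    public static void main(String[] args) {
        // The contact points are only used for discovery; the driver then pools
        // connections to every node and balances requests per the chosen policy.
        Cluster cluster = Cluster.builder()
            .addContactPoints("host1", "host2", "host3") // placeholder hostnames
            .withLoadBalancingPolicy(
                // token-aware routing on top of DC-aware round-robin
                new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
            .build();

        Session session = cluster.connect();
        System.out.println(session.execute("SELECT release_version FROM system.local").one());
        cluster.close();
    }
}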
