Reconnection of NFS share in a Proxmox cluster

In a Proxmox VE 7.x cluster, only one node has a problem with delays in reconnecting the NFS network shares of a NAS. I know that some users have had the same problem, but at the moment I still have this issue. Has anyone had the same problem and perhaps solved it later?

Related

Hazelcast nodes refusing to join each other

I have four VMs. I am running Hazelcast in "embedded" mode along with my application, attempting to use it for the Hibernate L2 cache.
I get mixed behavior when I attempt to start up different groups. I think I'm having issues due to how the machines are subnetted. The machines' /sbin/ifconfig output shows that there are three subnets of two machines across the four nodes (node1 and node4 show two network devices other than loopback).
Mancenter is running on a fifth node.
Machine   Address 1     Address 2
node1     10.10.40.1    10.10.27.1
node2     10.10.42.1    -
node3     10.10.40.2    -
node4     10.10.42.2    10.10.27.2
So node1 and node3 share a subnet, node1 and node4 share, node2 and node4 share.
Behavior is very inconsistent, though node1 and node2 starting together seems to reliably form a cluster, as does node1 and node3. Other combinations seem to enter a split-brain scenario, where it appears I have two or more clusters with the same name.
Querying our internal DNS, the host names resolve to the 10.10.40.x and 10.10.42.x IPs.
They have identical configurations. I have tried enabling interfaces for 10.10.40.* and 10.10.42.*, along with setting hazelcast.socket.bind.any to false. Due to our deployment framework, having identical configurations across a cluster is a high priority.
I have tried listing the nodes by both hostname and IP (the one resolved from nslookup of the hostname). Listing by hostname is going to be a requirement from operations.
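For reference, a minimal sketch of this setup using Hazelcast 3.x's programmatic Config API (the same settings can equally live in hazelcast.xml; the member hostnames are the node1-node4 names from the table above):

import com.hazelcast.config.Config;
import com.hazelcast.config.InterfacesConfig;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.config.NetworkConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ClusterBootstrap {
    public static void main(String[] args) {
        Config config = new Config();
        // Bind only to the listed interfaces instead of 0.0.0.0.
        config.setProperty("hazelcast.socket.bind.any", "false");

        NetworkConfig network = config.getNetworkConfig();
        InterfacesConfig interfaces = network.getInterfaces();
        interfaces.setEnabled(true)
                  .addInterface("10.10.40.*")
                  .addInterface("10.10.42.*");

        JoinConfig join = network.getJoin();
        join.getMulticastConfig().setEnabled(false); // multicast rarely crosses subnets
        join.getTcpIpConfig()
            .setEnabled(true)
            .addMember("node1")  // hostnames, as operations requires
            .addMember("node2")
            .addMember("node3")
            .addMember("node4");

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("Members: " + hz.getCluster().getMembers());
    }
}

Multicast generally does not cross subnet boundaries, so an explicit TCP-IP member list combined with restricted interfaces is usually the safer join strategy for a multi-subnet layout like this.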
In some situations I have managed to get them forming a cluster, though migration fails because it complains that it cannot reach one of the nodes.
Out of curiosity, I have noticed that mancenter will sometimes identify a node as another; for example, currently I have node3 and node4 running (with node1's and node2's applications shut down) and it is identifying one of them as node2. I am wondering if this has to do with the fact that the nodes are running on VMs (one instance per VM). I believe the host OS is Red Hat and the VMs are running CentOS.
Am I on the wrong track in thinking the subnetting is the issue? What else could be causing this?
I managed to fix the issue, though I'm not 100% sure what the root cause was.
The version of Hazelcast I was using was 3.8, but I carried over a configuration from 3.7. At a glance the only difference was the schema being changed from "hazelcast-config-3.7.xsd" to "hazelcast-config-3.8.xsd".
I also upgraded mancenter from the 3.7 version to the 3.8 version, although I'm not sure that would have had any effect.
Either way, my four nodes are now up and talking to each other. So if you come across this and are having a similar issue, I'd recommend making sure your configuration schema version matches your deployed version.

Cassandra: improve failover time

We are using a 3-node Cassandra cluster (each node on a different VM) and are currently investigating failover times during write and read operations in case one of the nodes dies.
Failover times are pretty good when shutting down one node gracefully; however, when killing a node (by shutting down the VM), the latency during the tests is about 12 seconds. I guess this has something to do with the TCP timeout?
Is there any way to tweak this?
Edit:
At the moment we are using Cassandra version 2.0.10.
We are using the Java client driver, version 2.1.9.
To describe the situation in more detail:
The write/read operations are performed using the QUORUM consistency level with a replication factor of 3. The cluster consists of 3 nodes (c1, c2, c3), each on a different host (VM). The client driver is connected to c1. During the tests I shut down the host for c2. From then on we observe that the client blocks for > 12 seconds, until the other nodes realize that c2 is gone. So I think this is not a client issue, since the client is connected to node c1, which is still running in this scenario.
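For context, the test operations are issued roughly like this with the 2.1.x Java driver (the keyspace and table names are illustrative placeholders, not the real schema):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumTest {
    public static void main(String[] args) {
        // c1 is the only contact point; the driver discovers c2 and c3 from it.
        Cluster cluster = Cluster.builder().addContactPoint("c1").build();
        Session session = cluster.connect("test_ks"); // hypothetical keyspace

        SimpleStatement write = new SimpleStatement(
                "INSERT INTO kv (id, value) VALUES (1, 'x')"); // hypothetical table
        write.setConsistencyLevel(ConsistencyLevel.QUORUM);    // 2 of 3 replicas must ack
        session.execute(write);

        SimpleStatement read = new SimpleStatement("SELECT value FROM kv WHERE id = 1");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        ResultSet rs = session.execute(read);
        System.out.println(rs.one());

        cluster.close();
    }
}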
Edit: I don't believe that running Cassandra inside a VM affects the network stack. In fact, killing the VM has the effect that the TCP connections are not terminated. In this case, a remote host can notice the failure only through some timeout mechanism (either a timeout in the application-level protocol or a TCP timeout).
If the process is killed at the OS level, the TCP stack of the OS will take care of terminating the TCP connection (IMHO with a TCP reset), enabling a remote host to be notified about the failure immediately.
However, it might be important that even in situations where a host crashes due to a hardware failure, or where a host is disconnected due to an unplugged network cable (in both cases the TCP connection will not be terminated immediately), the failover time stays low. I've tried sending SIGKILL to the Cassandra process inside the VM; in this case the failover time is about 600 ms, which is fine.
kind regards
Failover times are pretty good when shutting down one node gracefully; however, when killing a node (by shutting down the VM), the latency during the tests is about 12 seconds.
12 seconds is a pretty huge value. Some questions before investigating further:
What is your Cassandra version? Since version 2.0.2 there is a speculative retry mechanism that helps reduce the latency in such failover scenarios: http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2
What client driver are you using (Java? C#? Which version?)? Normally, with a properly configured load balancing policy, when a node is down the client will automatically retry the query by re-routing it to another replica. There is also speculative retry implemented on the driver side: http://datastax.github.io/java-driver/manual/speculative_execution/
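A sketch of those two driver-side knobs with the 2.1.x Java driver; the 500 ms delay and the limit of 2 extra attempts are illustrative values, not recommendations:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class FastFailoverCluster {
    public static Cluster build() {
        return Cluster.builder()
                .addContactPoint("c1")
                // Route each query to a replica of its partition, round-robin
                // across the local data center if that replica is unavailable.
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                // If no response after 500 ms, send the same query to the next
                // host in the query plan, at most 2 extra times.
                .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(500, 2))
                .build();
    }
}

Note that, depending on the driver version, speculative executions may only be applied to statements flagged as idempotent, so the manual page linked above is worth checking against 2.1.9 specifically.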
Edit: for a node to be marked down, the gossip protocol uses the phi accrual failure detector. Instead of having a binary state (UP/DOWN), the algorithm adjusts a suspicion level, and if the value rises above a threshold, the node is considered down.
This algorithm is necessary to avoid marking a node down because of a transient network issue.
Look in the cassandra.yaml file for this config:
# phi value that must be reached for a host to be marked down.
# most users should never need to adjust this.
# phi_convict_threshold: 8
Another question is: what load balancing strategy are you using in the driver? And did you use the speculative retry policy?

DataStax community AMI installation doesn't join other nodes

I kicked off a 6-node cluster as per the documentation at http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installAMILaunch.html. All worked OK. It's meant to be a 6-node cluster, and I can see the 6 nodes running on the EC2 dashboard. I can see OpsCenter working on node 0. But the nodes are not seeing each other. I don't have access to OpsCenter via the browser, but I can SSH to each node and verify Cassandra is working.
What do I need to do so that they join the cluster? Note they are all in the same VPC, in the same subnet and IP range, with the same cluster name. All were launched using the AMI specified in the document.
Any help will be much appreciated.
I hope your listen_address is configured. Add "auto_bootstrap": false to each node's configuration and restart each node. Check the logs too; that would be of great help.
In my situation, setting the broadcast_address to the public IP caused a similar issue. Make the broadcast_address your private IP, or just leave it untouched. If a public broadcast address is a must-have, have your architect modify the firewall rules.

SSH slow on a Cassandra cluster

We had a strange issue and we don't know where to look for an answer. We are using a Cassandra (2.0.10) cluster with 4 nodes. The OS is CentOS 6.4. We monitor those Dell machines with the rsh command every minute, dumping their ps and top output. Once in a while (every 2 weeks to more than a month), we found that the rsh command returned very slowly from one of the machines (more than 5 seconds compared to less than 1 second). At those moments, we noticed that it was also very slow to PuTTY into that specific machine, while it was fine with the other nodes in the same cluster. On the problematic machine, even after we stopped the Cassandra service, things didn't improve: SSH and ps still came back much more slowly than on the other machines. This didn't improve until we rebooted the machine. This happened to two of the 4 machines used in the cluster. We looked at the message logs, hardware logs, and Cassandra logs and couldn't find the source of the issue. Has anybody experienced this? We would love to hear any suggestions.

How to configure high availability with Hadoop 1.0 on AWS EC2 virtual machines

I have already configured this setup using Heartbeat and a virtual IP mechanism on a non-VM setup.
I am using Hadoop 1.0.3 with a shared directory for NameNode metadata sharing. The problem is that on the Amazon cloud there is nothing like a virtual IP to get high availability using Linux-HA.
Has anyone been able to achieve this? Kindly let me know the steps required.
For now I am using HBase WAL replication; HBase versions later than 0.92 support this.
For Hadoop clustering in the cloud, I will wait for the 2.0 release to become stable.
I used the following:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements
On the client side I added logic to have 2 master servers, used alternately to reconnect in case of network disruption.
This worked for a simple setup of 2 machines backing up each other; it is not recommended for a larger number of servers.
Hope it helps.
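The client-side alternation described in the answer above might look roughly like the sketch below; the Connector interface, host names, and retry bound are all placeholders rather than a real Hadoop or HBase API:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the client-side logic described above: two master
// hosts tried alternately until one accepts the connection. The Connector
// interface and host names are placeholders, not a real Hadoop/HBase API.
public class AlternatingMasterConnector<C> {

    /** Opens a connection of type C to the given master host. */
    public interface Connector<C> {
        C connect(String host) throws IOException;
    }

    private final List<String> masters;
    private int current = 0;

    public AlternatingMasterConnector(List<String> masters) {
        this.masters = masters;
    }

    /** Tries the current master, switching to the other one on failure. */
    public C connect(Connector<C> connector, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String host = masters.get(current);
            try {
                return connector.connect(host);
            } catch (IOException e) {
                last = e;                                  // remember the failure
                current = (current + 1) % masters.size();  // switch to the other master
            }
        }
        throw last != null ? last : new IOException("no connection attempts made");
    }

    public static void main(String[] args) throws IOException {
        AlternatingMasterConnector<String> c =
                new AlternatingMasterConnector<>(Arrays.asList("master1", "master2"));
        // Toy connector that fails for master1 to show the alternation.
        String session = c.connect(host -> {
            if (host.equals("master1")) throw new IOException("master1 unreachable");
            return "connected to " + host;
        }, 4);
        System.out.println(session); // prints "connected to master2"
    }
}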
Well, there are two parts of Hadoop to make highly available. The first and more important is, of course, the NameNode. There's a secondary/checkpoint NameNode that you can start up and configure. This will help keep HDFS up and running in the event that your primary NameNode goes down. Next is the JobTracker, which runs all the jobs. To the best of my (outdated by 10 months) knowledge, there is no backup to the JobTracker that you can configure, so it's up to you to monitor it and start up a new one with the correct configuration in the event that it goes down.
