Unable to add compute nodes to HPC Cluster - Azure

I am trying to set up an HPC cluster environment with Azure VMs as the head node and compute nodes.
The head node is working properly. However, when I try to add compute nodes from HPC Cluster Manager on the head node, the compute nodes don't show up. If I open HPC Cluster Manager on a compute node, it asks for the head node, and when I provide the head node's name, it fails with the error below.
"Failed to communicate with remote SDM store. Connection Failed. A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.195.15.194:9893"
All the VMs run Windows Server 2012 R2 and are in the same VNet/domain.
Any pointers to resolve this issue?

The following two templates should help you out; they basically provide a step-by-step wizard experience.
https://azure.microsoft.com/en-us/documentation/templates/create-hpc-cluster-custom-image/
https://azure.microsoft.com/en-us/documentation/templates/create-hpc-cluster/
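Before (or instead of) redeploying from those templates, it can help to confirm that the compute node can actually reach the head node's SDM store port quoted in the error message. Below is a minimal sketch of such a probe, assuming Node.js is available on the compute node purely for this ad-hoc test (any port-probing tool would do the same job):

    // Minimal TCP probe for the SDM store endpoint quoted in the error message.
    // The host and port below are taken from that error text.
    import * as net from "net";

    const HOST = "10.195.15.194"; // head node address from the error
    const PORT = 9893;            // HPC SDM store port from the error

    const socket = net.connect({ host: HOST, port: PORT, timeout: 5000 });

    socket.on("connect", () => {
      console.log(`TCP connection to ${HOST}:${PORT} succeeded`);
      socket.end();
    });

    socket.on("timeout", () => {
      console.error("Connection timed out - check host firewalls and network security groups");
      socket.destroy();
    });

    socket.on("error", (err: Error) => {
      console.error(`Connection failed: ${err.message}`);
    });

If the probe times out even though both VMs sit in the same VNet, a host firewall rule or network security group on the head node is the most likely culprit.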

Related

Random error on external Oracle database connection with Kubernetes

After a month of research, we are here, hoping someone has an insight into this issue:
On a GKE cluster, our pods (Node.js) are having trouble connecting to our external Oracle business database.
To be more precise, ~70% of our connection attempts end in an error:
ORA-12545: Connect failed because target host or object does not exist
The remaining 30% work well and don't reset or end prematurely. Once a connection is established, everything is fine from there.
Our stack:
Our flows are handled by containers based on a node:12.15.0-slim image, to which we add libaio1 and Oracle Instant Client (v12.2). We use oracledb v5.0.0 as the Node module.
We use a CronJob pod to run our Node container, with a ClusterIP service, on a GKE cluster (1.16.15-gke.4300).
Our external Oracle database is on a private network (to which our cluster has access), running Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit, behind a load balancer.
I can give more detail if needed.
What we have already tried:
We tried connecting directly to the database, bypassing the load balancer: no effect.
We had a CronJob pod ping the database server every minute for a day: no errors, even though the flow pods were still hitting the ORA-12545 error.
We rewrote our code, connecting to the database differently, and updated our oracledb Node module (v4 to v5): no effect.
We monitored the load on the Oracle database and spread our flows over the whole night instead of a one-hour window: no effect.
We had our own Kubernetes cluster before GKE, directly in our private network, and it produced exactly the same error.
We had an audit by Kubernetes experts; they did not find the issue or see anything critical in our cluster/k8s configuration.
What works:
All our other pods (some querying a MySQL database, microservices, web front ends) are working fine.
All our business tools (a dozen of them, including Talend and some custom software) use the Oracle database without issue.
Our own flow-handling Node containers work fine with the Oracle database as long as they run in a plain Docker environment rather than in Kubernetes.
To summarize: we have a mysterious issue when connecting to an Oracle database from a Kubernetes environment, where pods are randomly unable to reach the database.
We are looking for any hint we can get.
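For reference, here is a minimal sketch of the kind of connection attempt the pods make, with an explicit DNS lookup added in front of it, since ORA-12545 means the client could not find the target host. The hostname, service name, and user below are placeholders, not values from the actual setup:

    // Pod-side diagnostic sketch: resolve the database host explicitly, then
    // open an oracledb connection. All identifiers below are placeholders.
    import { promises as dns } from "dns";
    import oracledb from "oracledb";

    const DB_HOST = "oracle-db.internal.example";      // placeholder hostname
    const CONNECT_STRING = `${DB_HOST}:1521/ORCL`;     // placeholder EZCONNECT string

    async function main(): Promise<void> {
      // Log what the pod's resolver returns for the host in the connect string;
      // if this fails or flaps, it would explain the intermittent ORA-12545.
      const addresses = await dns.lookup(DB_HOST, { all: true });
      console.log("Resolved addresses:", addresses);

      const connection = await oracledb.getConnection({
        user: "app_user",                              // placeholder
        password: process.env.DB_PASSWORD ?? "",
        connectString: CONNECT_STRING,
      });

      const result = await connection.execute("SELECT 1 FROM dual");
      console.log("Query result:", result.rows);
      await connection.close();
    }

    main().catch((err) => {
      console.error("Connection attempt failed:", err.message);
      process.exit(1);
    });

Running something like this repeatedly from inside one of the affected pods would at least show whether the failures correlate with name resolution inside the cluster or with the connection step itself.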

Adding workstation nodes in HPC Pack 2016

I am using Microsoft HPC Pack 2016 Update 2 on a local network with an on-premises cluster. We have employed topology 5 (all nodes on the enterprise network). The head node is successfully set up and running. The problem is that after manually installing HPC Pack 2016 Update 2 on several Windows 10 workstations, all on the same local network, some of them cannot be found and added to the cluster using HPC Cluster Manager. I can’t see them in HPC Cluster Manager on the head node, neither under “Resource Management > Nodes” nor through the add-node wizard. The same installation and add-node steps work for some workstations but not for others. Is there any way to track down the cause?
In my case the problem was due to a broken trust relationship with the domain, which can be verified using the nltest /trusted_domains command. Resetting the trust relationship fixed the problem.
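As a quick sketch of that check, run from an elevated command prompt on an affected workstation (the second command is an assumption about how to verify the secure channel; the original answer only names the first):

    nltest /trusted_domains
    nltest /sc_verify:<your-domain>

The first command lists the domains the workstation trusts; the second verifies the secure channel to a domain controller and reports whether the machine's trust relationship is healthy.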

Icinga2 cluster node local checks not executing

I am using an Icinga2 2.3.2 cluster HA setup with three nodes in the same zone and the IDO database on a separate server. All machines run CentOS 6.5. Icinga Web 2 is installed on the active master.
I configured four local checks for each node, including the cluster health check, as described in the documentation. I installed Icinga Classic UI on all three nodes because I am not able to see the local checks configured for the nodes in Icinga Web 2.
Configs are syncing, checks are executing, and all three nodes are connected to one another. But the local checks specific to each node are not running properly, which I verified in the Classic UI.
a. All local checks are executed only once whenever:
- one of the nodes is disconnected or reconnected
- configuration changes are made on the master and icinga2 is reloaded
b. After that, only one check keeps running properly on one node; the remaining checks do not.
I have attached screenshots of the Classic UI on all nodes.
Please help me fix this. Thanks in advance.

DataStax OpsCenter does not see remote agents (Windows)

I am new to Cassandra.
I'm trying to deploy a test environment.
Win server 2012 (192.168.128.71) -> seed node
Win server 2008 (192.168.128.70) -> simple node
Win server 2008 (192.168.128.69) -> simple node
On all nodes, I installed the same Cassandra version (2.0.9 from DataStax).
The Windows firewall is disabled.
The cluster ring formed, but on each node I see:
Test Cluster (Cassandra 2.0.9) 1 of 3 agents connected
The node does not see the remote agents, even though the agent service is running on each machine.
In the datastax_opscenter_agent-stderr file, I see the following lines:
log4j:ERROR Could not read configuration file [log4j.properties].
log4j:ERROR Ignoring configuration file [log4j.properties].
Please tell me the possible cause and how I can diagnose this.
Thanks in advance!
The problem is that you have the OpsCenter server running on all machines in the cluster. Agents connect to their local OpsCenter server, so when you open the UI for one of them, you only see one agent connected.
To fix this, stop the server processes (DataStax_OpsCenter_Community) on all machines except one, add stomp_interface: <server-ip> to address.yaml for the agents on all machines, and then restart the agents.
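As a sketch, address.yaml on each agent machine would then contain a line like the following, where the placeholder is the IP of the single machine left running the OpsCenter server (the rest of the file stays as it was):

    stomp_interface: <server-ip>

After saving the file, restart the DataStax agent service on each node so the agents reconnect to that one server.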

Remote Desktop Not Working on Hadoop on Azure

I am able to allocate a Hadoop cluster on Windows Azure by entering my Windows Live ID, but after that I am unable to connect via Remote Desktop to the master node.
Before the cluster creation, it shows a message that says "Microsoft has got overwhelming positive feedback from Hadoop On Azure users, hence it's giving a free trial for 5 days with 2 slave nodes."
[P.S. This preview version of HoA was working before.]
Any suggestions for this problem?
Thanks in advance.
When you created your Hadoop cluster, you were asked to enter a DNS name for the cluster, which could be something like your_hadoop_cluster.cloudapp.net.
So first, ping your Hadoop cluster name to see if it returns an IP address; this will show whether you really have a cluster configured at all. If you don't get an IP back, then you don't have a Hadoop cluster on Azure and should try creating one.
If you are sure you do have a Hadoop cluster on Windows Azure, try posting your question to the following Hadoop on Azure CTP forum, where you will get the help you need:
http://tech.groups.yahoo.com/group/HadoopOnAzureCTP/
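For example, using the placeholder name from above rather than a real deployment:

    ping your_hadoop_cluster.cloudapp.net

As noted above, the point is whether the name resolves to an IP address at all; if it does not resolve, there is no cluster behind it.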
