Cannot submit a job to an executing node in Condor apart from the central manager - Linux

I have a Condor pool consisting of 4 dedicated machines. One is set up as the central manager, submitting, and executing node, while the other three are set up as executing nodes only. I used CentOS 5.4 as the OS for all the machines. My problem is that when I submit a job from the central manager, it runs only on the central manager; when I specify in the JDL file that the job should run on any machine apart from the central manager, the job stays on hold and does not run. When I type condor_status, all nodes appear. I kept the MASTER and STARTD daemons in the daemon list for the executing nodes. Has anyone come across this problem?

There's not enough information to answer your question, but the first thing to do is to run condor_q -analyze <jobid> and see what it tells you. See the Condor manual Section 2.6.5: Why is the job not running?
One possible cause is that you're not telling Condor to transfer your input/output files for you, and your nodes have different "filesystem domains", so Condor is unable to find a host which shares a common filesystem with your submit host.
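For reference, a minimal submit description that turns on file transfer and steers the job away from the central manager might look like the sketch below; the hostname and file names are placeholders:
# sketch of a submit description file; hostname and file names are placeholders
universe                = vanilla
executable              = my_job.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
requirements            = (Machine != "centralmanager.example.com")
output                  = job.out
error                   = job.err
log                     = job.log
queue
If the job goes on hold, condor_q -hold will show the hold reason; for idle jobs, condor_q -analyze reports whether the requirements expression matches any slots.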

Related

How to know if a machine in a Spark cluster 'participates' in a job

I want to know when it is safe to remove a machine (node) from a cluster.
My assumption is that it should be safe to remove a machine if it is not running any containers and does not store any useful data.
Using the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get information about each node, such as:
<node>
<rack>/default-rack</rack>
<state>RUNNING</state>
<id>host1.domain.com:54158</id>
<nodeHostName>host1.domain.com</nodeHostName>
<nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
<lastHealthUpdate>1476995346399</lastHealthUpdate>
<version>3.0.0-SNAPSHOT</version>
<healthReport></healthReport>
<numContainers>0</numContainers>
<usedMemoryMB>0</usedMemoryMB>
<availMemoryMB>8192</availMemoryMB>
<usedVirtualCores>0</usedVirtualCores>
<availableVirtualCores>8</availableVirtualCores>
<resourceUtilization>
<nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
<nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
<nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
<aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
<aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
<containersCPUUsage>0.0</containersCPUUsage>
</resourceUtilization>
</node>
If numContainers is 0, I assume it does not run containers. However, can it still store any data on disk that other downstream tasks can read?
I could not figure out whether Spark lets us know this. I assume that if a machine still stores some data useful for the running job, it may maintain a heartbeat with the Spark driver or some central controller. Can we check this by scanning TCP or UDP connections?
Is there any other way to check whether a machine in a Spark cluster participates in a job?
I am not sure whether you just want to know if a node is running any task (if that's what you mean by 'participate') or whether you want to know if it is safe to remove a node from the Spark cluster.
I will try to explain the latter point.
Spark has the ability to recover from failures, and this also applies to any node being removed from the cluster.
The node removed can be an executor or an application master.
If the application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured in:
yarn.resourcemanager.am.max-attempts
By default, this value is 2
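For example, in yarn-site.xml on the ResourceManager this could be raised like so (the value 4 is only an illustration):
<!-- yarn-site.xml; 4 is only an example value -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
</property>
For Spark on YARN there is also spark.yarn.maxAppAttempts, which, if set, should not exceed the YARN limit.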
If a node on which a task is running is removed, the resource manager (handled by YARN) will stop getting heartbeats from that node. The application master will know it is supposed to reschedule the failed task, as it will no longer receive progress updates from the previous node. It will then request resources from the resource manager and reschedule the task.
As far as data on these nodes is concerned, you need to understand how tasks and their output are handled. Every node has its own local storage for the output of the tasks running on it. After a task runs successfully, the OutputCommitter moves the output from local storage to the job's shared storage (HDFS), from where the data is picked up for the next stage of the job.
When a task fails (perhaps because the node running it failed or was removed), the task is rerun on another available node.
In fact, the application master will also rerun tasks that completed successfully on that node, because their output stored in the node's local storage will no longer be available.
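As a practical check before removing a node, you can query that specific node through the same ResourceManager REST API used in the question and look at numContainers; the RM address below is a placeholder, the node id is the one from the question:
# query a single node; a node with <numContainers>0</numContainers> runs no containers,
# but shuffle or map output for completed tasks may still sit on its local disks
curl -s "http://rm.example.com:8088/ws/v1/cluster/nodes/host1.domain.com:54158"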

How to set up Jenkins with HA?

Currently we are using Jenkins as our CI system, with one master server and slaves that are provisioned by SaltStack on OpenStack. If our Jenkins master server goes down, we need to create a new master and pull the files from the old master onto the new one, which takes at least 30 minutes.
Is there any way to set up Jenkins with high availability?
I already checked the Gearman Plugin; however, if the Gearman server goes down for some reason, we would need to set up HA for Gearman as well.
Are there any other ways to set up high availability for Jenkins?
Jenkins doesn't have a great HA story; the best you can do with the open source version is to put all of the files in $JENKINS_HOME on a shared file system, and then have a cold standby master machine that you can spin up if the active master goes down. That would reduce your failover time to however long it takes for the master to restart, which is usually just a few minutes.
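A minimal sketch of that cold-standby approach, assuming an NFS export for $JENKINS_HOME (paths and hostnames are placeholders):
# on both the active and the standby master: mount the shared JENKINS_HOME
mount -t nfs nfs.example.com:/exports/jenkins_home /var/lib/jenkins
# normal operation: Jenkins runs only on the active master
service jenkins start    # on the active master
# failover: make sure the old master is stopped, then start the standby
service jenkins stop     # on the old master, if it is still reachable
service jenkins start    # on the standby master
Only one master should ever run against the shared home at a time, otherwise the job and build data can be corrupted.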
You could also look at CloudBees' Jenkins Enterprise offering, which includes a High Availability Plugin.
I used the Clusters from Scratch doc to create a Jenkins WAN-HA active/passive cluster. See the attached architecture diagram for Jenkins HA using Pacemaker.
/etc/init.d/jenkins would need to be converted into an OCF resource agent script. Currently I manually start Jenkins via systemd on the pcmk-2 server when pcmk-1 is down.
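For reference, a rough sketch of the Pacemaker side using pcs, assuming the stock systemd unit for Jenkins and a floating IP (addresses and resource names are placeholders):
# floating IP that clients use to reach whichever node is active
pcs resource create jenkins_vip ocf:heartbeat:IPaddr2 ip=192.168.1.50 cidr_netmask=24 op monitor interval=30s
# manage Jenkins through its systemd unit instead of a hand-written OCF agent
pcs resource create jenkins systemd:jenkins op monitor interval=60s
# keep Jenkins on the same node as the floating IP, and start the IP first
pcs constraint colocation add jenkins with jenkins_vip INFINITY
pcs constraint order jenkins_vip then jenkins
Using the systemd: resource class avoids converting /etc/init.d/jenkins into an OCF agent, at the cost of cruder monitoring.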

Icinga2 cluster node local checks not executing

I am using an Icinga2 2.3.2 cluster HA setup with three nodes in the same zone and the IDO database on a separate server. All machines run CentOS 6.5. IcingaWeb2 is installed on the active master.
I configured four local checks for each node, including the cluster health check, as described in the documentation. I installed the Icinga Classic UI on all three nodes, because I am not able to see the local checks configured for the nodes in Icinga Web 2.
Configs are syncing, checks are executing, and all three nodes are connected to each other. But the local checks specific to each node are not running properly, which I verified in the Classic UI.
a. All local checks are executed only once whenever
- one of the nodes is disconnected or reconnected
- configuration changes are made on the master and icinga2 is reloaded
b. After that, only one check keeps running properly on one node; the remaining checks do not.
I have attached screenshots of the Classic UI on all nodes.
Please help me fix this. Thanks in advance.

Typical Hadoop setup for remote job submission

I am still a bit new to Hadoop and am currently in the process of setting up a small test cluster on Amazon AWS. My question relates to tips on structuring the cluster so that it is possible to submit jobs from remote machines.
Currently I have 5 machines. 4 form the Hadoop cluster proper, with the NameNode, YARN, etc. One machine is used as a manager machine (Cloudera Manager). I am going to describe my thinking on the setup, and if anyone can chime in on the points I am not clear about, that would be great.
I was thinking about the best setup for a small cluster. I decided to expose only the manager machine and probably submit all jobs through it. The other machines will see each other, etc., but not be accessible from the outside world. I have a conceptual idea of how to do this, but I am not sure how to properly go about it; if anyone could point me in the right direction, that would be great.
Another big point: I want to be able to submit jobs to the cluster through the exposed machine from a client machine (which might be Windows). I am not so clear on this setup either. Do I need to have Hadoop installed on the client machine in order to use the normal hadoop commands and to write/submit jobs, say, from Eclipse or something similar?
So to sum it up, my questions are:
Is this an OK setup for a small test cluster?
How can I go about using one exposed machine to submit/route jobs to the cluster, without having any of the Hadoop nodes on it?
How do I set up a client machine to submit jobs to a remote cluster, and is there an example of how to do it on Windows? Also, are there any reasons not to use Windows as a client machine in this setup?
Thanks, I would greatly appreciate any advice or help on this.
Since this has not been answered, I will attempt to answer it.
1. REST API to submit an application:
Resource 1(Cluster Applications API(Submit Application)): https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_APISubmit_Application
Resource 2: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ch_yarn_rest_apis.html
Resource 3: https://hadoop-forum.org/forum/general-hadoop-discussion/miscellaneous/2136-how-can-i-run-mapreduce-job-by-rest-api
Resource 4: Run a MapReduce job via rest api
2. Submitting a Hadoop job from a client machine
Resource 1: https://pravinchavan.wordpress.com/2013/06/18/submitting-hadoop-job-from-client-machine/
3. Sending a program to a remote Hadoop cluster
It is possible to send a program to a remote Hadoop cluster to run it. All you need to ensure is that you have set the resource manager address, fs.defaultFS, library files, and mapreduce.framework.name correctly before running the actual job.
Resource 1: (how to submit mapreduce job with yarn api in java)
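A hedged sketch of point 3 in Java, using the standard MapReduce client API; the hostnames, ports, and paths are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // point the client at the remote cluster (placeholder addresses)
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        conf.set("yarn.resourcemanager.address", "rm.example.com:8032");
        conf.set("mapreduce.framework.name", "yarn");
        // usually needed when submitting from a Windows client to a Linux cluster
        conf.set("mapreduce.app-submission.cross-platform", "true");

        Job job = Job.getInstance(conf, "remote-submit-example");
        job.setJarByClass(RemoteSubmitExample.class); // this jar is shipped to the cluster
        // identity mapper and default reducer keep the sketch minimal
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/user/me/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The same properties can instead be placed in the client's core-site.xml, yarn-site.xml, and mapred-site.xml, which is what a full client installation does for you.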

Prioritizing Gearman job servers?

I have 2 machines running the same workers. One machine should be the "primary" as it is very powerful, and the other machine should serve as a backup for when the primary machine goes down or crashes. When the primary machine is up and running, all jobs should default to it for as long as there are available workers.
From my tests, I've noticed that gearmand randomly picks a machine to send the job to. Is there any way at all to prioritize which machine jobs are sent to?
Example:
Primary machine running 8 instances of the same worker
Backup machine running 1 instance
Do:
Use the primary machine until it has no more available workers to fulfill the job queue, then continue on to the backup machine.
Any way of accomplishing this?
Thanks everyone!
I don't think this is possible with the current API. You could run 2 gearmand instances though, one on each of your worker servers, and configure both in the client, with the powerful machine's server first. This way, at least with the client API versions I'm aware of, the client will first use the first gearmand and its workers, and if that isn't available, it will switch to the second, which has the less powerful machine's workers...
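A rough sketch of that layout with the stock gearmand daemon; hostnames and ports are placeholders, and the key point is that clients list the primary's gearmand before the backup's:
# on primary.example.com (the powerful machine): local gearmand plus the 8 workers
gearmand -d -L 0.0.0.0 -p 4730
# on backup.example.com: a second gearmand plus the single backup worker
gearmand -d -L 0.0.0.0 -p 4730
# clients then register both servers in order, primary first, via their client
# library's add-server call, so the backup is only used when the primary is down
Failover behaviour does depend on the client library, so it is worth testing how yours reacts when the first server stops responding.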
