I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html.
When running scontrol show slurmd, I get:
Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets = 1
Actual cores = 1
Actual threads per core = 1
Actual real memory = 984 MB
Actual temp disk space = 492 MB
Boot time = 2019-03-27T17:53:56
Hostname = fedora2
Last slurmctld msg time = NONE
Slurmd PID = 1549
Slurmd Debug = 4
Slurmd Logfile = /var/log/slurmd.log
Version = 17.11.13-2
I don't know why slurmd on fedora2 can't communicate with the controller on fedora1. The slurmctld daemon is running fine on fedora1.
The slurm.conf is as follows:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=fedora1
#
ControlMachine=fedora1
ControlAddr=192.168.1.4
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=fedora
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=verbose
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=verbose
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=fedora1 NodeAddr=192.168.1.4 CPUs=1 State=UNKNOWN
NodeName=fedora2 NodeAddr=192.168.1.5 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=fedora[1-2] Default=YES MaxTime=INFINITE State=UP
The output of tail /var/log/slurmd.log on fedora2 shows this error repeated on multiple lines:
error: Unable to register: Unable to contact slurm controller (connect failure)
Make sure that (quick checks for each item are sketched after this list):
- no firewall prevents the slurmd daemon from talking to the controller
- munge is running on each server
- the dates are in sync
- the Slurm versions are identical
- the name fedora1 can be resolved to the correct IP
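For example, here is a quick way to run through that checklist from fedora2 (a sketch; it assumes ssh access to fedora1 and that nc is installed):
$ munge -n | ssh fedora1 unmunge         # munge is running and the keys match across nodes
$ date && ssh fedora1 date               # clocks are in sync
$ slurmd -V && ssh fedora1 slurmctld -V  # Slurm versions are identical
$ getent hosts fedora1                   # the name resolves to the expected IP (192.168.1.4)
$ nc -zv fedora1 6817                    # nothing blocks the slurmctld port (6817 by default)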
Hi, I am attempting to use a processing pipeline that is written to run on multiple computer clusters using Slurm, but I would prefer to run it on a single computer. I am on Ubuntu 18 and have installed slurm-wlm. However, I have not been able to get the pipeline to read my slurm.conf file, which I made with the online Slurm Version 18.08 Configuration Tool, with the goal of running everything as a single node so I don't have to rewrite the pipeline code.
Every time I attempt to run the pipeline's sh script, the log file gives this error:
sbatch: error: _parse_next_key: Parsing error at unrecognized key: SlurmctldHost
sbatch: error: Parse error in file /etc/slurm-llnl/slurm.conf line 2: "SlurmctldHost=charlie-Z370M-D3H"
sbatch: fatal: Unable to process configuration file
charlie-Z370M-D3H is the hostname.
Below is my slurm.conf text; I hope someone can see what I need to do to get this to work.
#
SlurmctldHost=charlie-Z370M-D3H
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=linux[1-32] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=linux[1-32] Default=YES MaxTime=INFINITE State=UP
I have had the same issue, and it turns out that the conf file generated on that web page is only valid for Slurm 18.08. If you look at the page where you created your slurm.conf, you may notice it says so.
Thus, please verify that your version of Slurm is at least 18.x, since the key "SlurmctldHost" was introduced to the conf file in 18.08.
You can check your installed version by simply typing "dpkg -l | grep slurm" and noting which version is installed. On Ubuntu 18.x the default package is Slurm version 17.11.9. (You might have to download the source code for your installed version from https://www.schedmd.com/archives.php. Unpack it and look in the "/doc/html/" directory, where you'll find the corresponding configurator HTML script for your version.) E.g. if your version is 17.11.9, the key corresponding to "SlurmctldHost" (as introduced in 18.08) is "ControlMachine". So use the configurator HTML script in your local Slurm doc dir to generate a slurm.conf that is valid for your installed version.
I did that and it works fine.
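If you would rather patch the generated file by hand than regenerate it, the key that breaks parsing here is SlurmctldHost; in 17.11 syntax it is ControlMachine (a sketch based on the hostname above):
# slurm.conf for 17.11.x: ControlMachine replaces the 18.08 key SlurmctldHost
#SlurmctldHost=charlie-Z370M-D3H
ControlMachine=charlie-Z370M-D3H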
I am trying to deploy a prediction web service to Azure in cluster mode, using the ML Workbench process in this tutorial (https://learn.microsoft.com/en-us/azure/machine-learning/preview/tutorial-classifying-iris-part-3#prepare-to-operationalize-locally).
The model, the scoring script, and the schema all get sent to the manifest, but service creation then fails:
Creating service..........................................................Error occurred: {'Error': {'Code': 'KubernetesDeploymentFailed', 'Details': [{'Message': 'Back-off 40s restarting failed container=...pod=...', 'Code': 'CrashLoopBackOff'}], 'StatusCode': 400, 'Message': 'Kubernetes Deployment failed'}, 'OperationType': 'Service', 'State': 'Failed', 'Id': '...', 'ResourceLocation': '/api/subscriptions/...', 'CreatedTime': '2017-10-26T20:30:49.77362Z', 'EndTime': '2017-10-26T20:36:40.186369Z'}
Here is the result of checking the ML service real-time logs:
C:\Users\userguy\Documents\azure_ml_workbench\projecto>az ml service logs realtime -i projecto
2017-10-26 20:47:16,118 CRIT Supervisor running as root (no user in config file)
2017-10-26 20:47:16,120 INFO supervisord started with pid 1
2017-10-26 20:47:17,123 INFO spawned: 'rsyslog' with pid 9
2017-10-26 20:47:17,124 INFO spawned: 'program_exit' with pid 10
2017-10-26 20:47:17,124 INFO spawned: 'nginx' with pid 11
2017-10-26 20:47:17,125 INFO spawned: 'gunicorn' with pid 12
2017-10-26 20:47:18,160 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:18,160 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:22,164 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-10-26T20:47:22.519159Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2017-10-26T20:47:22.520097Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (12)
2017-10-26T20:47:22.520375Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2017-10-26T20:47:22.521757Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2017-10-26T20:47:22.522646Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 22
2017-10-26 20:47:27,669 WARN received SIGTERM indicating exit request
2017-10-26 20:47:27,669 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26T20:47:27.669556Z, INFO, 00000000-0000-0000-0000-000000000000, , Handling signal: term
2017-10-26 20:47:30,673 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:33,675 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
Initializing logger
2017-10-26T20:47:36.564469Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2017-10-26T20:47:36.564991Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2017-10-26T20:47:36.565316Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2017-10-26T20:47:36.565642Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2017-10-26 20:47:36.715933: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36,716 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:36.716376: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716542: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716703: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716860: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
this is the init
2017-10-26T20:47:37.551940Z, INFO, 00000000-0000-0000-0000-000000000000, , Users's init has completed successfully
Using TensorFlow backend.
2017-10-26T20:47:37.553751Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 22)
2017-10-26T20:47:37.885303Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2017-10-26 20:47:37,885 WARN killing 'gunicorn' (12) with SIGKILL
2017-10-26 20:47:37,886 INFO stopped: gunicorn (terminated by SIGKILL)
2017-10-26 20:47:37,889 INFO stopped: nginx (exit status 0)
2017-10-26 20:47:37,890 INFO stopped: program_exit (terminated by SIGTERM)
2017-10-26 20:47:37,891 INFO stopped: rsyslog (exit status 0)
Received 41 lines of log
My best guess is there's something silent happening to cause "WARN received SIGTERM indicating exit request". The rest of the scoring.py script seems to kick off: you can see TensorFlow get initiated and the "this is the init" print statement.
http://127.0.0.1:63437 is accessible from my local machine, but the UI endpoint is blank.
Any ideas on how to get this up and running in an Azure cluster? I'm not very familiar with how Kubernetes works, so any basic debugging guidance would be appreciated.
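For what it's worth, the usual first steps for debugging a CrashLoopBackOff with kubectl look like this (a sketch, assuming kubectl is configured against the cluster; <pod-name> is a placeholder for the pod named in the error):
kubectl get pods                     # find the crashing pod and its restart count
kubectl logs <pod-name> --previous   # stdout/stderr from the last crashed container
kubectl describe pod <pod-name>      # events, exit codes, and back-off details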
We discovered a bug in our system that could have caused this. The fix was deployed last night. Can you please try again and let us know if you still encounter this issue?
I killed the process that was using port 7199, then I wanted to run Cassandra using
cassandra -f -R
But I got this message:
INFO 05:45:43 Initializing system.schema_functions
INFO 05:45:43 Initializing system.schema_aggregates
INFO 05:45:43 Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO 05:45:43 Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7199/jmxrmi
Exception (java.lang.RuntimeException) encountered during startup: java.util.concurrent.ExecutionException: FSWriteError in
java.lang.RuntimeException: java.util.concurrent.ExecutionException: FSWriteError in at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:403)
I want to run the process that uses port 7199. I killed the old one because I got a message that port 7199 was already in use. (A quick way to see what is bound to the port is sketched below.)
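For reference, you can check which process is bound to port 7199 before killing anything (either command works, depending on which tool is installed):
$ sudo lsof -i :7199
$ sudo netstat -tlnp | grep 7199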
Try to kill the process completely. If it is standalone, use this:
$ ps auwx | grep cassandra
$ sudo kill pid
or $ sudo service cassandra stop if you have a service-managed local setup
Follow these steps:
$ jps
You see some processes running. For example:
9107 Jps
1112 CassandraDaemon
Then kill the CassandraDaemon process using the process ID shown by jps. In my example here, the process ID for CassandraDaemon is 1112.
$ kill -9 1112
Then check the processes again after a while:
$ jps
You will see that CassandraDaemon is no longer listed:
9170 Jps
Then remove your saved_caches directory and start Cassandra again.
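A minimal sketch of that last step, assuming the default package-install paths (check saved_caches_directory in cassandra.yaml for your actual location):
$ sudo rm -rf /var/lib/cassandra/saved_caches
$ cassandra -f -R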
During an investigation into a different problem (Inconsistent systemd startup of freeswitch) I discovered that both the latest FreeSWITCH 1.6 and 1.7 paused for several minutes at a time (between 4 and 14) during boot-up on CentOS 7.1. While it was intermittent, it happened as often as one time in 3 or 4.
Running this from the command line:
/usr/bin/freeswitch -nonat -db /dev/shm -log /usr/local/freeswitch/log -conf /usr/local/freeswitch/conf -run /usr/local/freeswitch/run
caused the following output (note the time difference between the "Added task 2" line and the line after it):
2015-10-23 15:40:14.160101 [INFO] switch_event.c:685 Activate Eventing Engine.
2015-10-23 15:40:14.170805 [WARNING] switch_event.c:656 Create additional event dispatch thread 0
2015-10-23 15:40:14.272850 [INFO] switch_core_sqldb.c:3381 Opening DB
2015-10-23 15:40:14.282317 [INFO] switch_core_sqldb.c:1693 CORE Starting SQL thread.
2015-10-23 15:40:14.285266 [NOTICE] switch_scheduler.c:183 Starting task thread
2015-10-23 15:40:14.293743 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445611214
2015-10-23 15:40:14.293837 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445611214
2015-10-23 15:49:47.883158 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
When I ran 1.6 on CentOS 6.7 using the same command line as above, I got this; note the delay is a more reasonable 14 seconds:
2015-10-23 10:31:00.274533 [INFO] switch_event.c:685 Activate Eventing Engine.
2015-10-23 10:31:00.285807 [WARNING] switch_event.c:656 Create additional event dispatch thread 0
2015-10-23 10:31:00.434780 [INFO] switch_core_sqldb.c:3381 Opening DB
2015-10-23 10:31:00.465158 [INFO] switch_core_sqldb.c:1693 CORE Starting SQL thread.
2015-10-23 10:31:00.481306 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445610660
2015-10-23 10:31:00.481446 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445610660
2015-10-23 10:31:00.481723 [NOTICE] switch_scheduler.c:183 Starting task thread
2015-10-23 10:31:14.286702 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
It's the same on FS 1.7 as well.
This strongly suggests that CentOS 7.1 and FS have an issue together. Can anyone help me diagnose further or shed some more light on this, please?
This all came to light as I tried to understand why FS would not accept the CLI connection for several minutes after I thought it had booted up (using -nc from a systemd service).
Thanks to the FS user list and ultimately Anthony Minessale, the issue turned out to be RNG entropy.
This is a good explanation -
https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged
Here are some extracts:
There are two general random devices on Linux: /dev/random and /dev/urandom. The best randomness comes from /dev/random, since it's a blocking device, and will wait until sufficient entropy is available to continue providing output.
The key here is that it's a blocking device, so any program waiting for a random number from /dev/random will pause until sufficient entropy is available for a "safe" random number.
This is a headless server, so the usual sources of entropy, such as mouse/keyboard activity (and many others), do not apply. Hence the delays.
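You can see the problem directly by checking the kernel's entropy pool (a generic Linux check, nothing FreeSWITCH-specific):
cat /proc/sys/kernel/random/entropy_avail
Values near the pool maximum (4096 bits on these kernels) are healthy; a persistently low number, say double or low triple digits, means anything reading /dev/random will block.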
The fix is this:
Based on the HAVEGE principle, and previously based on its associated library, haveged allows generating randomness based on variations in code execution time on a processor......(google the rest!)
Install like this:
yum install haveged
and start it up like this:
haveged -w 1024
making sure it restarts on reboot:
chkconfig haveged on
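On CentOS 7, which is systemd-based, the equivalent way to start it and enable it at boot is:
systemctl start haveged
systemctl enable haveged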
Hope this helps someone.