slurm: frontend as compute node not responding

Similar to slurm: use a control node also for computing.
I would like to use the frontend as a compute node. I made the following entries in slurm.conf:
NodeName=gisc RealMemory=63000 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN Weight=2
NodeName=c[0-2] RealMemory=126000 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN Weight=1
PartitionName=normal Nodes=gisc,c[0-2] Default=YES MaxTime=INFINITE State=UP
And restarted both slurmd and slurmctld.
However, the frontend node never responds, as shown by the asterisk on its state in the sinfo output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 1 idle* gisc
normal* up infinite 2 alloc c[0-1]
normal* up infinite 1 idle c2
Also, I cannot start slurmd on the frontend node. The logs do not help.
Could it be that slurmd and slurmctld are conflicting on the frontend node?
My /etc/hosts looks as follows
192.168.1.1 gisc.localdomain gisc gisc-eth0.localdomain gisc-eth0
### ALL ENTRIES BELOW THIS LINE WILL BE OVERWRITTEN BY WAREWULF ###
#
# See provision.conf for configuration paramaters
# Node Entry for node: c0 (ID=22)
192.168.1.2 c0.localdomain c0 c0-eth0.localdomain c0-eth0
# Node Entry for node: c1 (ID=23)
192.168.1.3 c1.localdomain c1 c1-eth0.localdomain c1-eth0
# Node Entry for node: c2 (ID=24)
192.168.1.4 c2.localdomain c2 c2-eth0.localdomain c2-eth0

Facepalm. The slurm-client package was missing on the frontend; only the slurm-server package was installed...
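A quick way to confirm which Slurm packages are actually installed on the frontend (the exact package names vary by distro; these are the usual RPM and DEB queries):
rpm -qa | grep -i slurm     # RPM-based systems, e.g. a Warewulf/OpenHPC frontend
dpkg -l '*slurm*'           # Debian/Ubuntu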

Related

cassandra 'handshaking version with'

I have 2 nodes:
ip1 node1's IP
ip2 node2's IP
Each node starts, but they do not connect to each other. For example, nodetool status on each node shows only that node itself, not the other one.
In node1's log:
Handshaking version with /ip2
In node2's log there are no info or error messages related to node1, and neither node logs any errors. What causes this problem?
A node should not normally be in its own seed list; if it is, it will not try to join the existing cluster. Only the first node in a cluster should be in its own seed list.
Try putting only ip1 in both nodes' seed list and leave ip2 out of the seed list entirely. Also, set auto_bootstrap: true on node 2. Shut down the nodes, remove the /var/lib/cassandra directory from both nodes, and then start node 1. When node 1 finishes starting up (check for status UN using nodetool status), then start node 2. It should now talk to node 1 and join the cluster.
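In cassandra.yaml that would look roughly like this on both nodes (a sketch, with ip1 standing in for node1's actual address):
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "ip1"
and additionally on node 2:
auto_bootstrap: true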

cassandra service (3.11.5) stops automatically after it starts/restarts on AWS linux

I have a fresh installation of Cassandra 3.11.5 on a new AWS Linux instance (t3.xlarge). When I run
sudo service cassandra start
or
sudo service cassandra restart
the service stops automatically after 1 or 2 seconds. I looked into the logs and found the following.
I am not sure what is wrong: I haven't changed any snitch-related configs, and it has always been SimpleSnitch. I do not have multiple Cassandra instances, just a single one on one EC2 instance.
Logs
INFO [main] 2020-02-12 17:40:50,833 ColumnFamilyStore.java:426 - Initializing system.schema_aggregates
INFO [main] 2020-02-12 17:40:50,836 ViewManager.java:137 - Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO [main] 2020-02-12 17:40:51,094 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
Installation steps
sudo curl -OL https://www.apache.org/dist/cassandra/redhat/311x/cassandra-3.11.5-1.noarch.rpm
sudo rpm -i cassandra-3.11.5-1.noarch.rpm
sudo pip install cassandra-driver
export CQLSH_NO_BUNDLED=true
sudo chkconfig --levels 3 cassandra on
The issue is in your log file:
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
It seems that you started the cluster, stopped it, and renamed the datacenter from dc1 to datacenter1.
To fix it:
If no data is stored yet, delete the data directories (see the example below).
If data is stored, rename the datacenter back to dc1 in the config.
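For the first case, a minimal reset on a package install looks roughly like this (a sketch assuming the default /var/lib/cassandra layout):
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
sudo service cassandra start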
I had the same problem, where the cassandra service immediately stopped after it was started.
In the Cassandra configuration file located at /etc/cassandra/cassandra.yaml, change the cluster_name back to the previous one, like this:
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'dc1'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
...

slurmd unable to communicate with slurmctld

I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html.
When running scontrol show slurmd, I get:
Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets = 1
Actual cores = 1
Actual threads per core = 1
Actual real memory = 984 MB
Actual temp disk space = 492 MB
Boot time = 2019-03-27T17:53:56
Hostname = fedora2
Last slurmctld msg time = NONE
Slurmd PID = 1549
Slurmd Debug = 4
Slurmd Logfile = /var/log/slurmd.log
Version = 17.11.13-2
I don't know why slurmd on fedora2 can't communicate with the controller on fedora1. The slurmctld daemon is running fine on fedora1.
The slurm.conf is as follows:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=fedora1
#
ControlMachine=fedora1
ControlAddr=192.168.1.4
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=fedora
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=verbose
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=verbose
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=fedora1 NodeAddr=192.168.1.4 CPUs=1 State=UNKNOWN
NodeName=fedora2 NodeAddr=192.168.1.5 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=fedora[1-2] Default=YES MaxTime=INFINITE State=UP
The output of tail /var/log/slurmd.log on fedora2 shows the following error repeated on multiple lines:
error: Unable to register: Unable to contact slurm controller (connect failure)
Make sure that:
no firewall prevents the slurmd daemon from talking to the controller
munge is running on each server
the clocks are in sync (munge is sensitive to clock skew)
the Slurm versions are identical
the name fedora1 can be resolved to the correct IP
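A few quick commands cover most of those checks from fedora2 (assuming systemd and firewalld, the Fedora defaults; adjust to your setup):
systemctl status munge slurmd
munge -n | ssh fedora1 unmunge    # verifies the munge keys and clocks match across nodes
scontrol --version                # compare the output on fedora1 and fedora2
getent hosts fedora1              # confirms the controller name resolves
sudo firewall-cmd --list-ports    # the default Slurm ports 6817/tcp and 6818/tcp must be reachable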

Emulating SLURM on Ubuntu 16.04

I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management, I just want to test some simple examples. I cannot install SLURM in the usual way, and I am wondering if there are other options. Other things I have tried:
A Docker image. Unfortunately, docker pull agaveapi/slurm; docker run agaveapi/slurm gives me errors:
/usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2017-10-29 15:27:45,436 CRIT Supervisor running as root (no user in config file)
2017-10-29 15:27:45,437 INFO supervisord started with pid 1
2017-10-29 15:27:46,439 INFO spawned: 'slurmd' with pid 9
2017-10-29 15:27:46,441 INFO spawned: 'sshd' with pid 10
2017-10-29 15:27:46,443 INFO spawned: 'munge' with pid 11
2017-10-29 15:27:46,443 INFO spawned: 'slurmctld' with pid 12
2017-10-29 15:27:46,452 INFO exited: munge (exit status 0; not expected)
2017-10-29 15:27:46,452 CRIT reaped unknown pid 13)
2017-10-29 15:27:46,530 INFO gave up: munge entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,531 INFO exited: slurmd (exit status 1; not expected)
2017-10-29 15:27:46,535 INFO gave up: slurmd entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,536 INFO exited: slurmctld (exit status 0; not expected)
2017-10-29 15:27:47,537 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-29 15:27:47,537 INFO gave up: slurmctld entered FATAL state, too many start retries too quickly
This guide to start a SLURM VM via Vagrant. I tried, but copying over my munge key timed out.
sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
ssh: connect to host server port 22: Connection timed out
lost connection
So ... we have an existing cluster here, but it runs an older Ubuntu version which does not mesh well with my workstation running 17.04.
So on my workstation, I just made sure slurmctld (the backend) and slurmd were installed, and then set up a trivial slurm.conf with
ControlMachine=mybox
# ...
NodeName=DEFAULT CPUs=4 RealMemory=4000 TmpDisk=50000 State=UNKNOWN
NodeName=mybox CPUs=4 RealMemory=16000
after which I restarted slurmctld and then slurmd. Now all is fine:
root@mybox:/etc/slurm-llnl$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
demo up infinite 1 idle mybox
root@mybox:/etc/slurm-llnl$
This is a degenerate setup; our real one has a mix of dev and prod machines and appropriate partitions. But this should answer your "can the backend really also be a client" question. Also, my machine is not really called mybox, but that is not pertinent to the question either way.
Using Ubuntu 17.04, all stock, with munge to communicate (which is the default anyway).
Edit: To wit:
me@mybox:~$ COLUMNS=90 dpkg -l '*slurm*' | grep ^ii
ii slurm-client 16.05.9-1ubun amd64 SLURM client side commands
ii slurm-wlm-basic- 16.05.9-1ubun amd64 SLURM basic plugins
ii slurmctld 16.05.9-1ubun amd64 SLURM central management daemon
ii slurmd 16.05.9-1ubun amd64 SLURM compute node daemon
me@mybox:~$
I would still prefer to run SLURM natively, but I caved and spun up a Debian 9.2 VM. See here for my efforts to troubleshoot a native installation. The directions here worked smoothly, but I needed to make the following changes to slurm.conf. Below, Debian64 is the hostname, and wlandau is my user name.
ControlMachine=Debian64
SlurmUser=wlandau
NodeName=Debian64
Here is the complete slurm.conf. An analogous slurm.conf did not work on my native Ubuntu 16.04.
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=Debian64
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=wlandau
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=Debian64 CPUs=1 RealMemory=744 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=Debian64 Default=YES MaxTime=INFINITE State=UP
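Once slurmctld and slurmd are up against this configuration, a quick sanity check might look like the following (assuming the single-node Debian64 setup above):
sinfo
srun -N1 hostname
sbatch --wrap="hostname"
squeue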

Cassandra restart issues while restoring to a new cluster

I am restoring to a fresh new Cassandra 2.2.5 cluster consisting of 3 nodes.
Initial cluster health of the NEW cluster:
-- Address Load Tokens Owns Host ID Rack
UN 10.40.1.1 259.31 KB 256 ? d2b29b08-9eac-4733-9798-019275d66cfc uswest1adevc
UN 10.40.1.2 230.12 KB 256 ? 5484ab11-32b1-4d01-a5fe-c996a63108f1 uswest1adevc
UN 10.40.1.3 248.47 KB 256 ? bad95fe2-70c5-4a2f-b517-d7fd7a32bc45 uswest1cdevc
As part of the restore instructions in the Datastax docs, I do the following on the new cluster:
1) Stop cassandra on all three nodes, one by one.
2) Edit cassandra.yaml on all three nodes with the backed-up token ring information. [Step 2 from docs]
3) Remove the contents of /var/lib/cassandra/data/system/* [Step 4 from docs]
4) Start cassandra on nodes 10.40.1.1, 10.40.1.2, 10.40.1.3 respectively (the per-node commands are sketched below).
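Concretely, the per-node sequence is roughly the following (a sketch assuming a package install; the token values come from the old cluster's backup):
sudo service cassandra stop
sudo vi /etc/cassandra/cassandra.yaml          # set initial_token to this node's saved token list
sudo rm -rf /var/lib/cassandra/data/system/*
sudo service cassandra start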
Result:
10.40.1.1 restarts back successfully:
-- Address Load Tokens Owns Host ID Rack
UN 10.40.1.1 259.31 KB 256 ? 2d23add3-9eac-4733-9798-019275d125d3 uswest1adevc
But the second and the third nodes fail to restart stating:
java.lang.RuntimeException: A node with address 10.40.1.2 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:546) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:766) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.5.jar:2.2.5]
INFO [StorageServiceShutdownHook] 2016-08-09 18:13:21,980 Gossiper.java:1449 - Announcing shutdown
java.lang.RuntimeException: A node with address 10.40.1.3 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
...
Eventual cluster health:
-- Address Load Tokens Owns Host ID Rack
UN 10.40.1.1 259.31 KB 256 ? 2d23add3-9eac-4733-9798-019275d125d3 uswest1adevc
DN 10.40.1.2 230.12 KB 256 ? 6w2321ad-32b1-4d01-a5fe-c996a63108f1 uswest1adevc
DN 10.40.1.3 248.47 KB 256 ? 9et4944d-70c5-4a2f-b517-d7fd7a32bc45 uswest1cdevc
I understand that the Host ID of a node might change after the system directories are removed.
My question is:
Do I need to explicitly tell each node at startup to replace itself (e.g. via cassandra.replace_address)? Are the docs incomplete, or am I missing something in my steps?
It turns out there were stale commitlog and saved_caches directories which I had missed deleting earlier. The instructions work correctly once those directories are deleted as well.
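For reference, with the default package layout that means also clearing these directories on each node before restarting (paths may differ on your install):
sudo rm -rf /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*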
Usually in a situation like this, after I do a
$ systemctl stop cassandra
I will run
$ ps awxs | grep cassandra
and notice that Cassandra still has some processes running.
I then usually do a
$ kill -9 <cassandra pid>
and
$ rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
java.lang.RuntimeException: A node with address 10.40.1.3 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
If you are still facing the error above, it means the Cassandra process is still running on that node. Log in to the 10.40.1.3 node first, then follow these steps:
$ jps
You see some processes running. For example:
9107 Jps
1112 CassandraDaemon
Then kill the CassandraDaemon process using the process id you see after running jps. In my example, the process id for CassandraDaemon is 1112.
$ kill -9 1112
Then check the processes again after a while:
$ jps
You will see that CassandraDaemon is no longer listed:
9170 Jps
Then remove your saved_caches and commitlog directories and start cassandra again.
Do this on every node that is showing the error above.
