slurmd ignores slurm config on startup

I don't understand why my config appears to be ignored, even when I specify it directly with -f. Searching turned up nothing; is there any relevant documentation I can look at?
Hopefully I have just missed some critical piece of information.
After starting slurmctld on one machine, running sudo slurmd -f /usr/local/etc/slurm.conf -D -vvvvvvv (for testing) gives the following output (relevant excerpt; note RealMemory = 3907):
slurmd: debug3: Confile = `/usr/local/etc/slurm.conf'
slurmd: debug3: Debug = 3
slurmd: debug3: CPUs = 2 (CF: 2, HW: 2)
slurmd: debug3: Boards = 1 (CF: 1, HW: 1)
slurmd: debug3: Sockets = 2 (CF: 1, HW: 2)
slurmd: debug3: Cores = 1 (CF: 2, HW: 1)
slurmd: debug3: Threads = 1 (CF: 1, HW: 1)
slurmd: debug3: UpTime = 8838 = 02:27:18
slurmd: debug3: Block Map = 0,1
slurmd: debug3: Inverse Map = 0,1
slurmd: debug3: RealMemory = 3907
slurmd: debug3: TmpDisk = 19018
slurmd: debug3: Epilog = `(null)'
slurmd: debug3: Logfile = `/var/log/slurmd.log'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName = node1
slurmd: debug3: Port = 6818
slurmd: debug3: Prolog = `(null)'
slurmd: debug3: TmpFS = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd = `/usr/local/sbin/slurmstepd'
slurmd: debug3: Spool Dir = `/var/spool/slurmd'
slurmd: debug3: Syslog Debug = 10
slurmd: debug3: Pid File = `/var/run/slurm/slurmd.pid'
slurmd: debug3: Slurm UID = 64030
slurmd: debug3: TaskProlog = `(null)'
slurmd: debug3: TaskEpilog = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: UsePAM = 0
Meanwhile, slurmctld repeatedly logs:
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug: Node node1 has low real_memory size (3907 < 2000000)
Output of cat /usr/local/etc/slurm.conf | grep -v "#" (note RealMemory=2000000, among other configuration details that appear to be ignored):
ClusterName=scluster_0
SlurmctldHost=controller
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=node[1-2] CPUs=2 RealMemory=2000000 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=pdefault Nodes=ALL Default=YES MaxTime=INFINITE State=UP
The configuration on both systems (the one running slurmctld and the one running slurmd) is identical.
I also have cgroup_allowed_devices.conf and cgroup.conf, if those are relevant.

My guess is the following:
slurmd is reading the config file correctly. What happens is that Slurm cross-checks the configuration against the hardware it actually detects. According to the config, the node should have RealMemory=2000000, but Slurm finds only 3907 when it inspects the hardware. This mismatch is reported and the node is drained.
This behaviour makes sure you don't have a faulty DIMM in your server without noticing.

@Marcus Boden is correct.
The RealMemory = 3907 in the slurmd output is what Slurm discovers on the server, not what it reads from the configuration file.
It finds 3907 MB of RAM there, compares that to the 2000000 in the configuration file, and complains that
slurmctld: debug: Node node1 has low real_memory size (3907 < 2000000)
so, basically, it found about 4 GB of RAM while it expected about 2 TB based on the configuration.
You should check the exact amount of memory Linux finds on the server with the free command and make sure it matches the specification you believe it to have.
See more information here for instance.
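
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not Slurm's actual code: the function names are hypothetical, the sample MemTotal value is made up so that it yields 3907, and the assumption that RealMemory roughly equals MemTotal in MiB is mine. On a real node, slurmd -C prints the hardware Slurm detects in slurm.conf format, which is the more reliable source for the value.

```python
import re


def mem_total_mib(meminfo_text: str) -> int:
    """Return MemTotal from /proc/meminfo-style text, in MiB (rounded down)."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) // 1024


def real_memory_ok(configured_mb: int, detected_mb: int) -> bool:
    """Mimic the controller's check: detected memory must not be below the configured value."""
    return detected_mb >= configured_mb


# Hypothetical sample, chosen so the result matches the 3907 in the question.
SAMPLE_MEMINFO = "MemTotal:        4001132 kB\nMemFree:          123456 kB\n"

detected = mem_total_mib(SAMPLE_MEMINFO)
print(detected)                            # 3907
print(real_memory_ok(2000000, detected))   # False: the config claims far more than exists
print(real_memory_ok(3900, detected))      # True: a value at or below what is detected
```

To try this on an actual node, pass the contents of /proc/meminfo to mem_total_mib() and set RealMemory in slurm.conf to a value no larger than the result.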

Related

eof error when trying to initiate commands in python ftp_tls module

I have bought two servers from the same hosting company, both on the same package.
The first server, which I bought just for testing and experimenting, works perfectly.
The other server does not work, even though the support team tells me its SSL and FTP versions are the same;
it gives me an error in my Python script when I try to issue commands.
'''
from ftplib_custom import FTP_TLS
import ssl
import ftplib_custom
import socket


def launch():
    # Wait upon user input
    print("Press Enter To Initialise Server Connection: ")
    input()
    # Info
    print("Server Found")
    print("Admin Auto Login...")
    print("\n")
    # Connection Initiation
    # Working Test Server
    ftp = FTP_TLS('MyHostname', user='Username', passwd='Password')
    # Not Working Server
    ftp = FTP_TLS('MyHostname', user='Username', passwd='Password')
    ftp.ssl_version = ssl.PROTOCOL_TLS
    print(ftp.getwelcome())
    ftp.set_debuglevel(1)
    ftp.set_pasv(True)
    ftp.prot_p()
    ftp.ccc()
    print("Login Successful")

    def listLineCallback(line):
        msg = ("** %s*" % line)
        print(msg)

    # Commands
    ftp.pwd()
    ftp.cwd("/")
    ftp.retrlines('LIST', listLineCallback)
    # ftp.dir()


launch()
'''
This is what I get from the working test server...
'''
Press Enter To Initialise Server Connection:
Server Found
Admin Auto Login...
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 1 of 50 allowed.
220-Local time is now 10:53. Server port: 21.
220-This is a private system - No anonymous login
220-IPv6 connections are also welcome on this server.
220 You will be disconnected after 15 minutes of inactivity.
*cmd* 'PBSZ 0'
*resp* '200 PBSZ=0'
*cmd* 'PROT P'
*resp* '200 Data protection level set to "private"'
*cmd* 'CCC'
*resp* '200 Control connection unencrypted'
Login Successful
*cmd* 'PWD'
*resp* '257 "/" is your current location'
*cmd* 'CWD /'
*resp* '250 OK. Current directory is /'
*cmd* 'TYPE A'
*resp* '200 TYPE is now ASCII'
*cmd* 'PASV'
*resp* '227 Entering Passive Mode (91,103,219,222,232,11)'
*cmd* 'LIST'
*resp* '150 Accepted data connection'
** drwxr-xr-x 2 sensitive sensitive 4096 Dec 18 15:53 .*
** drwxr-xr-x 2 sensitive sensitive 4096 Dec 18 15:53 ..*
** -rw------- 1 sensitive sensitive 4 Oct 28 15:59 .ftpquota*
*resp* '226-Options: -a -l \n226 3 matches total'
'''
This is what I get if I try and connect to the main server...
'''
Server Found
Admin Auto Login...
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 1 of 50 allowed.
220-Local time is now 10:57. Server port: 21.
220-This is a private system - No anonymous login
220-IPv6 connections are also welcome on this server.
220 You will be disconnected after 15 minutes of inactivity.
*cmd* 'PBSZ 0'
*resp* '200 PBSZ=0'
*cmd* 'PROT P'
*resp* '200 Data protection level set to "private"'
*cmd* 'CCC'
*resp* '200 Control connection unencrypted'
Login Successful
*cmd* 'PWD'
*resp* '257 "/" is your current location'
*cmd* 'CWD /'
*resp* '250 OK. Current directory is /'
*cmd* 'TYPE A'
*resp* '200 TYPE is now ASCII'
*cmd* 'PASV'
*resp* '227 Entering Passive Mode (91,146,105,202,253,214)'
*cmd* 'LIST'
*resp* '150 Accepted data connection'
Traceback (most recent call last):
File "C:\Users\install\Desktop\WebsiteConnecting\Website_FTP_Testing.py", line 54, in <module>
launch()
File "C:\Users\install\Desktop\WebsiteConnecting\Website_FTP_Testing.py", line 51, in launch
ftp.retrlines('LIST', listLineCallback)
File "C:\Users\install\AppData\Local\Programs\Python\Python37\lib\ftplib_custom.py", line 488, in retrlines
return self.voidresp()
File "C:\Users\install\AppData\Local\Programs\Python\Python37\lib\ftplib_custom.py", line 251, in voidresp
resp = self.getresp()
File "C:\Users\install\AppData\Local\Programs\Python\Python37\lib\ftplib_custom.py", line 236, in getresp
resp = self.getmultiline()
File "C:\Users\install\AppData\Local\Programs\Python\Python37\lib\ftplib_custom.py", line 222, in getmultiline
line = self.getline()
File "C:\Users\install\AppData\Local\Programs\Python\Python37\lib\ftplib_custom.py", line 210, in getline
raise EOFError
EOFError
'''
Any help would be much appreciated, and I will edit or add whatever is needed on request.
'''
EDIT:
Status: Resolving address of ########
Status: Connecting to 91.146.105.202:21...
Status: Connection established, initializing TLS...
Error: GnuTLS error -15: An unexpected TLS packet was received.
Status: Connection attempt failed with "ECONNABORTED - Connection aborted".
Error: Could not connect to server
Status: Disconnected from server
Status: Selected port usually in use by a different protocol.
Status: Resolving address of ######
Status: Connecting to 91.146.105.202:21...
Status: Connection established, initializing TLS...
Error: GnuTLS error -15: An unexpected TLS packet was received.
Status: Connection attempt failed with "ECONNABORTED - Connection aborted".
Error: Could not connect to server
Status: Waiting to retry...
Status: Resolving address of #######
Status: Connecting to 91.146.105.202:21...
Status: Connection established, initializing TLS...
Error: GnuTLS error -15: An unexpected TLS packet was received.
Status: Connection attempt failed with "ECONNABORTED - Connection aborted".
Error: Could not connect to server
'''

Ceph-rgw Service stop automatically after installation

In my local cluster (4 Raspberry Pis) I am trying to configure an RGW gateway. Unfortunately, the service disappears automatically after about 2 minutes.
[ceph_deploy.rgw][INFO ] The Ceph Object Gateway (RGW) is now running on host OSD1 and default port 7480
cephuser@admin:~/mycluster $ ceph -s
cluster:
id: 745d44c2-86dd-4b2f-9c9c-ab50160ea353
health: HEALTH_WARN
too few PGs per OSD (24 < min 30)
services:
mon: 1 daemons, quorum admin
mgr: admin(active)
osd: 4 osds: 4 up, 4 in
rgw: 1 daemon active
data:
pools: 4 pools, 32 pgs
objects: 80 objects, 1.09KiB
usage: 4.01GiB used, 93.6GiB / 97.6GiB avail
pgs: 32 active+clean
io:
client: 5.83KiB/s rd, 0B/s wr, 7op/s rd, 1op/s wr
After one minute the service (rgw: 1 daemon active) is no longer visible:
cephuser@admin:~/mycluster $ ceph -s
cluster:
id: 745d44c2-86dd-4b2f-9c9c-ab50160ea353
health: HEALTH_WARN
too few PGs per OSD (24 < min 30)
services:
mon: 1 daemons, quorum admin
mgr: admin(active)
osd: 4 osds: 4 up, 4 in
data:
pools: 4 pools, 32 pgs
objects: 80 objects, 1.09KiB
usage: 4.01GiB used, 93.6GiB / 97.6GiB avail
pgs: 32 active+clean
Many thanks for the help

Solution:
On the gateway node, open the Ceph configuration file in the /etc/ceph/ directory.
Find an RGW client section similar to the example:
[client.rgw.gateway-node1]
host = gateway-node1
keyring = /var/lib/ceph/radosgw/ceph-rgw.gateway-node1/keyring
log file = /var/log/ceph/ceph-rgw-gateway-node1.log
rgw frontends = civetweb port=192.168.178.50:8080 num_threads=100
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/object_gateway_guide_for_red_hat_enterprise_linux/index
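
To check which RGW client sections a ceph.conf actually contains, and what each one sets for rgw frontends, something like the following may help. It is a rough sketch that assumes the file is simple enough for Python's configparser to read (real ceph.conf files can contain constructs it does not handle); the function name rgw_frontends is hypothetical, and the sample text is just the section from the example above.

```python
import configparser

# The example section from the solution above, used as sample input.
SAMPLE_CEPH_CONF = """\
[client.rgw.gateway-node1]
host = gateway-node1
keyring = /var/lib/ceph/radosgw/ceph-rgw.gateway-node1/keyring
log file = /var/log/ceph/ceph-rgw-gateway-node1.log
rgw frontends = civetweb port=192.168.178.50:8080 num_threads=100
"""


def rgw_frontends(conf_text: str) -> dict:
    """Map each [client.rgw.*] section name to its 'rgw frontends' value (or None)."""
    # interpolation=None keeps any '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return {
        section: parser.get(section, "rgw frontends", fallback=None)
        for section in parser.sections()
        if section.startswith("client.rgw.")
    }


for name, frontends in rgw_frontends(SAMPLE_CEPH_CONF).items():
    print(name, "->", frontends)
```

An empty result, or a section whose value is None, would suggest that the gateway is running with default frontend settings rather than the ones you intended.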

Docker inside LXC unprivileged container

I am trying to run Docker containers inside an unprivileged LXC container. Can anyone suggest what I am missing?
If I remove AppArmor from the LXC container it works fine. It seems I need to do some AppArmor magic to make it work without disabling AppArmor.
This is my current LXC container config:
lxc.include = /usr/share/lxc/config/nesting.conf
# Distribution configuration
lxc.include = /usr/share/lxc/config/common.conf
# For Ubuntu 14.04
lxc.mount.entry = /sys/kernel/debug sys/kernel/debug none bind,optional 0 0
lxc.mount.entry = /sys/kernel/security sys/kernel/security none bind,optional 0 0
lxc.mount.entry = /sys/fs/pstore sys/fs/pstore none bind,optional 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir,optional 0 0
lxc.include = /usr/share/lxc/config/userns.conf
# For Ubuntu 14.04
lxc.mount.entry = /sys/firmware/efi/efivars sys/firmware/efi/efivars none bind,optional 0 0
lxc.mount.entry = /proc/sys/fs/binfmt_misc proc/sys/fs/binfmt_misc none bind,optional 0 0
lxc.arch = linux64
# Container specific configuration
lxc.idmap = u 0 1258512 65536
lxc.idmap = g 0 1258512 65536
lxc.rootfs.path = dir:/var/lib/lxc/ubuntu/rootfs
lxc.uts.name = ubuntu
# Network configuration
lxc.net.0.type = veth
lxc.net.0.link = br0
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.net.0.hwaddr = 00:16:3e:3e:3f:77
lxc.net.0.ipv4.address = 10.0.3.242/24
lxc.net.0.ipv4.gateway = auto
lxc.cgroup.memory.limit_in_bytes = 512M
lxc.cgroup.cpuset.cpus = 0-31
lxc.start.auto = 1

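Does adding the following to the config help?
lxc.aa_profile = unconfined
(On LXC 2.1 and later, which the lxc.net.0.* keys above suggest you are running, this key is named lxc.apparmor.profile.)
It weakens your security profile, but it may get you started in the right direction.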

snmpwalk failed with authorizationError

I tried to execute:
snmpwalk -v 3 -u snmpv3username -A <passphrase> -a MD5 -l authNoPriv localhost .1.3.6.1.4.1.334.72.1.1.6.2.1.0
However, I got the following error:
Error in packet.
Reason: authorizationError (access denied to that object)
I have already defined the following in /etc/snmp/snmpd.conf:
createUser snmpv3username MD5 <passphrase> AES <passphrase>
My questions are:
1. What does this error mean? I thought I had defined the user in the config file.
2. How can I solve this issue?
If I execute:
snmpwalk -v 1 -c public -O e 127.0.0.1
I get this result:
SNMPv2-MIB::sysDescr.0 = STRING: Linux ip-10-251-138-141 2.6.32-358.14.1.el6.x86_64 #1 SMP Mon Jun 17 15:54:20 EDT 2013 x86_64
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (615023) 1:42:30.23
SNMPv2-MIB::sysContact.0 = STRING: Root <root@localhost>
SNMPv2-MIB::sysName.0 = STRING: ip-10-251-138-141
SNMPv2-MIB::sysLocation.0 = STRING: aws-west
SNMPv2-MIB::sysORLastChange.0 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORID.1 = OID: SNMP-MPD-MIB::snmpMPDMIBObjects.3.1.1
SNMPv2-MIB::sysORID.2 = OID: SNMP-USER-BASED-SM-MIB::usmMIBCompliance
SNMPv2-MIB::sysORID.3 = OID: SNMP-FRAMEWORK-MIB::snmpFrameworkMIBCompliance
SNMPv2-MIB::sysORID.4 = OID: SNMPv2-MIB::snmpMIB
SNMPv2-MIB::sysORID.5 = OID: TCP-MIB::tcpMIB
SNMPv2-MIB::sysORID.6 = OID: IP-MIB::ip
SNMPv2-MIB::sysORID.7 = OID: UDP-MIB::udpMIB
SNMPv2-MIB::sysORID.8 = OID: SNMP-VIEW-BASED-ACM-MIB::vacmBasicGroup
SNMPv2-MIB::sysORDescr.1 = STRING: The MIB for Message Processing and Dispatching.
SNMPv2-MIB::sysORDescr.2 = STRING: The MIB for Message Processing and Dispatching.
SNMPv2-MIB::sysORDescr.3 = STRING: The SNMP Management Architecture MIB.
SNMPv2-MIB::sysORDescr.4 = STRING: The MIB module for SNMPv2 entities
SNMPv2-MIB::sysORDescr.5 = STRING: The MIB module for managing TCP implementations
SNMPv2-MIB::sysORDescr.6 = STRING: The MIB module for managing IP and ICMP implementations
SNMPv2-MIB::sysORDescr.7 = STRING: The MIB module for managing UDP implementations
SNMPv2-MIB::sysORDescr.8 = STRING: View-based Access Control Model for SNMP.
SNMPv2-MIB::sysORUpTime.1 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.2 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.3 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.4 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.5 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.6 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.7 = Timeticks: (2) 0:00:00.02
SNMPv2-MIB::sysORUpTime.8 = Timeticks: (2) 0:00:00.02
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (562693901) 65 days, 3:02:19.01
End of MIB
Thanks in advance

You run snmpwalk with security level authNoPriv, but your user is configured with security level authPriv (the createUser line includes an AES privacy passphrase).
Try:
snmpwalk -v 3 -u snmpv3username -A <passphrase> -a MD5 -x AES -X <passphrase> -l authPriv localhost .1.3.6.1.4.1.334.72.1.1.6.2.1.0
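
To make the relationship between the security level and the flags explicit, here is a small Python sketch that assembles the snmpwalk argument list for each SNMPv3 level. The helper name snmpwalk_args and its parameters are hypothetical; only the snmpwalk flags themselves (-v, -u, -l, -a, -A, -x, -X) come from Net-SNMP. It builds the command without running it.

```python
LEVELS = ("noAuthNoPriv", "authNoPriv", "authPriv")


def snmpwalk_args(user, level, host, oid,
                  auth_proto=None, auth_pass=None,
                  priv_proto=None, priv_pass=None):
    """Build an SNMPv3 snmpwalk command line matching the given security level."""
    if level not in LEVELS:
        raise ValueError(f"unknown security level: {level}")
    args = ["snmpwalk", "-v", "3", "-u", user, "-l", level]
    if level in ("authNoPriv", "authPriv"):
        # Authentication protocol and passphrase are needed for both auth levels.
        args += ["-a", auth_proto, "-A", auth_pass]
    if level == "authPriv":
        # Privacy (encryption) protocol and passphrase only apply to authPriv.
        args += ["-x", priv_proto, "-X", priv_pass]
    return args + [host, oid]


cmd = snmpwalk_args(
    "snmpv3username", "authPriv", "localhost", ".1.3.6.1.4.1.334.72.1.1.6.2.1.0",
    auth_proto="MD5", auth_pass="<passphrase>",
    priv_proto="AES", priv_pass="<passphrase>",
)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run() once the placeholder passphrases are replaced with real values.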

Besides creating the user, you must also "authorize" it to see data. Users can exist without any permission to see data (this is part of the SNMPv3 specifications).
For Net-SNMP, you can do this easily by granting the user read-only access with this line in your snmpd.conf file:
rouser snmpv3username
or, for write access to everything:
rwuser snmpv3username
Edit: Additionally, you should put the createUser line in /var/net-snmp/snmpd.conf instead, so that it gets replaced by a private, localized key that can't be stolen and used on other devices.

Understanding the Scapy "Mac address to reach destination not found. Using broadcast." warning

If I generate an Ethernet frame without any upper-layer payload and send it at layer two with sendp(), I receive the "Mac address to reach destination not found. Using broadcast." warning, and the frame put on the wire indeed uses ff:ff:ff:ff:ff:ff as the destination MAC address. Why is this so? Shouldn't Scapy send exactly the frame I constructed?
My crafted packet can be seen below:
>>> ls(x)
dst : DestMACField = '01:00:0c:cc:cc:cc' (None)
src : SourceMACField = '00:11:22:33:44:55' (None)
type : XShortEnumField = 0 (0)
>>> sendp(x, iface="eth0")
WARNING: Mac address to reach destination not found. Using broadcast.
.
Sent 1 packets.
>>>

Most people encountering this issue are incorrectly using send() (or sr(), sr1(), srloop()) instead of sendp() (or srp(), srp1(), srploop()). For the record, the "without-p" functions like send() are for sending layer 3 packets (send(IP())) while the "with-p" variants are for sending layer 2 packets (sendp(Ether() / IP())).
If you define x like I do below and use sendp() (and not send()) and you still have this issue, you should probably try with the latest version from the project's git repository (see https://github.com/secdev/scapy).
I've tried:
>>> x = Ether(src='01:00:0c:cc:cc:cc', dst='00:11:22:33:44:55')
>>> ls(x)
dst : DestMACField = '00:11:22:33:44:55' (None)
src : SourceMACField = '01:00:0c:cc:cc:cc' (None)
type : XShortEnumField = 0 (0)
>>> sendp(x, iface='eth0')
.
Sent 1 packets.
At the same time I was running tcpdump:
# tcpdump -eni eth0 ether host 00:11:22:33:44:55
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:33:47.774570 01:00:0c:cc:cc:cc > 00:11:22:33:44:55, 802.3, length 14: [|llc]
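
For reference, the 14 bytes that tcpdump reports can be reproduced without Scapy. The following standard-library sketch (the function names are hypothetical) packs the same header by hand: 6 bytes of destination MAC, 6 bytes of source MAC, and a 2-byte type field. Because the type field is 0, which is below 1500, tcpdump interprets it as an 802.3 length rather than an EtherType, hence the "802.3" in its output.

```python
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def ethernet_header(dst: str, src: str, ethertype: int = 0) -> bytes:
    """Pack a bare Ethernet header: destination, source, 16-bit type (network byte order)."""
    return mac_to_bytes(dst) + mac_to_bytes(src) + struct.pack("!H", ethertype)


# Same addresses as in the answer's Ether(src=..., dst=...) call.
frame = ethernet_header(dst="00:11:22:33:44:55", src="01:00:0c:cc:cc:cc")
print(len(frame))    # 14
print(frame.hex())   # 00112233445501000ccccccc0000
```

Note that the destination address comes first on the wire, which is why ls(x) lists dst before src.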
