Torque cannot communicate with host - linux

I have been attempting to setup the torque scheduler for a small cluster. I followed the steps to setup the scheduler from http://docs.adaptivecomputing.com/torque/archive/3-0-2/1.2configuring_torque_on_server.php
However when i attempt
qterm -t quick
I get the following error
$ sudo qterm -t quick
Unable to communicate with Terra(192.168.1.25)
Cannot connect to specified server host 'Terra'.
qterm: could not connect to server '' (111) Connection refused
but the server starts just fine. However when I attempt to run a command that runs on multiple nodes such as
qsub -l nodes=2:ppn=4 /home/user/scripts/someScript
it prints out somethign like
7.Terra
where Terra is the name of the head node, but is also a node in the cluster. This isn't the problem. The problem is that it does not run. nor does it have any output anywhere :/
The torque server log: https://ptpb.pw/EaKo
The terra node log: https://ptpb.pw/9w5M
and the Marte log: https://ptpb.pw/o4PT
I can get it to run with a pbs script but only with one node....
#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
#PBS -m abe
cd Documents/
wc -l largeTest.csv
Here is the ouput of qstat after submitting a job
Job ID Name User Time Use S
Queue
------------------------- ---------------- --------------- -------- - -----
16.Terra testPerformance justin 0 R batch
the output of pbsnodes -a
Terra
state = free
power_state = Running
np = 4
properties = Tower
ntype = cluster
status = opsys=linux,uname=Linux Terra 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64,sessions=11525 22029,nsessions=2,nusers=1,idletime=57964,totmem=8111556kb,availmem=7539284kb,physmem=8111556kb,ncpus=4,loadave=0.00,gres=,netload=30570521372,state=free,varattr= ,cpuclock=Fixed,macaddr=e0:3f:49:44:72:20,version=6.1.1.1,rectime=1534937388,jobs=
mom_service_port = 15002
mom_manager_port = 15003
gpus = 1
Marte
state = free
power_state = Running
np = 4
properties = NFSServer
ntype = cluster
status = opsys=linux,uname=Linux Marte 4.18.1-arch1-1-ARCH #1 SMP PREEMPT Wed Aug 15 21:11:55 UTC 2018 x86_64,sessions=366 556 563,nsessions=3,nusers=2,idletime=58140,totmem=7043404kb,availmem=6703808kb,physmem=7043404kb,ncpus=4,loadave=0.02,gres=,netload=36500663511,state=free,varattr= ,cpuclock=Fixed,macaddr=c8:5b:76:4a:65:91,version=6.1.1.1,rectime=1534937359,jobs=
mom_service_port = 15002
mom_manager_port = 15003
and the /var/spool/torque/server_priv/nodes
Terra np=4 gpus=1 Tower
Marte np=4 NFSServer
Edit: Here are the most recent logs as well
Mom Log for Node: https://ptpb.pw/DhKi
Mom Log for head node: https://ptpb.pw/MTlD
and the server log: https://ptpb.pw/HPkE

Related

How to see python script errors being run from rsyslog action

This is my action in rsyslog.conf:
module(load="omprog")
if( $msg contains "UPDOWN") then {
action(type="omprog" binary="/etc/rsyslog.d/netmiko.py" template="RSYSLOG_TraditionalFileFormat")
}
This is the python script I am working on:
pattern = re.compile('GigabitEthernet0\/\d{1,2}')
def process_line(line):
state = ''
if 'to up' in line:
state = f'UP\n'
elif 'to down' in line:
state = f'DOWN\n'
file = open("/home/blinky/python.log","a")
result = re.findall(pattern, line)
if len(result) > 0:
file.write(f'{result} - {state}')
file.close()
try:
msg = sys.stdin.readline()
file = open("/home/blinky/python.log","a")
file.write(line)
file.close()
process_line(msg)
except Exception as e:
file = open("/etc/rsyslog.d/python_error.log","a")
file.write(e)
file.close()
So the issue I have is trying to debug the python script, I can not see any of the errors it produces, as you can see I am trying to output the exception to a file but I get nothing there either. I have looked in the log file and this is the response I get from doing a shut no shut on the switch port:
Nov 20 21:50:39 10.0.0.254 1281: Nov 20 21:50:38.013: %LINK-5-CHANGED: Interface GigabitEthernet0/14, changed state to administratively down
Nov 20 21:50:39 10.0.0.254 1282: Nov 20 21:50:39.013: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/14, changed state to down
Nov 20 21:50:39 repperio rsyslogd: omprog: program '/etc/rsyslog.d/netmiko.py' (pid 2006160) terminated; will be restarted [v8.2112.0 try https://www.rsyslog.com/e/2119 ]
Nov 20 21:50:39 repperio rsyslogd: action 'action-1-omprog' suspended (module 'omprog'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2112.0 try https://www.rsyslog.com/e/2007 ]
Nov 20 21:50:40 repperio rsyslogd: action 'action-1-omprog' resumed (module 'omprog') [v8.2112.0 try https://www.rsyslog.com/e/2359 ]
Nov 20 21:50:43 10.0.0.254 1283: Nov 20 21:50:42.756: %LINK-3-UPDOWN: Interface GigabitEthernet0/14, changed state to up
Nov 20 21:50:43 10.0.0.254 1284: Nov 20 21:50:43.756: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/14, changed state to up
Nov 20 21:50:43 repperio rsyslogd: child process (pid 2006317) exited with status 1 [v8.2112.0]
Nov 20 21:50:43 repperio rsyslogd: omprog: program '/etc/rsyslog.d/netmiko.py' (pid 2006317) terminated; will be restarted [v8.2112.0 try https://www.rsyslog.com/e/2119 ]
Nov 20 21:50:43 repperio rsyslogd: action 'action-1-omprog' suspended (module 'omprog'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2112.0 try https://www.rsyslog.com/e/2007 ]
Nov 20 21:50:44 repperio rsyslogd: action 'action-1-omprog' resumed (module 'omprog') [v8.2112.0 try https://www.rsyslog.com/e/2359 ]
The script monitors the Cisco switch for interfaces going up and down and triggers the python script, this in turn will alter the configuration of the switch port using Netmiko. Without the ability to debug the python script I am scuppered, any ideas?

ParallelSSHClient - Python - Handle authentication errors

I have a problem which I didn't found the solution here.
I am working with SSHClient for connecting to multiple servers.
But, if there is 1 server in the list that I cannot access with my username and password (SSH), it's throwing me an exception.
I've tried to work with try and except but it didn't work.
Here is my original code:
from pssh.clients import ParallelSSHClient
import configparser
res = []
servers = ['test1', 'test2', 'test3', 'test4']
client = ParallelSSHClient(servers, user='test', password='test')
output = client.run_command('service ntpd status')
client.join(output)
for host_out in output:
for line in host_out.stdout:
if 'running' or 'Running' in line:
continue
else:
res.append(host_out.host + ' is not running')
if res:
return res
else:
return "All servers are running"
server test2 isn't accessible with my username so the script is throwing me an exception and failing the script:
pssh.exceptions.AuthenticationError: No authentication methods succeeded
How can I continue the script without the server test2 (if it is not accessible)
try something like this:
from pssh.clients import ParallelSSHClient
from pssh.config import HostConfig
host_config = [
HostConfig(user='user',
password='pwd'),
HostConfig(user='user',
password='pwd'),
HostConfig(user='user',
password='pwd')
]
servers = ['10.97.180.90', '10.97.180.99', "10.97.180.88"]
client = ParallelSSHClient(servers, host_config=host_config, num_retries=1, timeout=3)
output = client.run_command('uname -a', stop_on_errors=False, timeout=3)
for host_output in output:
try:
for line in host_output.stdout:
print(line)
exit_code = host_output.exit_code
except TypeError:
print("timeOut...")
And output looks like this:
Linux hostName-001 3.10.0-957.21.3.el7.x86_64 #1 SMP Fri Jun 14 02:54:29 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
Linux hostName-003 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
timeOut...
More info can be found here on stackoverflow:
Python parallel-ssh run_command does not timeout when using pssh.clients

running background tasks through dramatic does not work

I'm trying to run background task processing, redis and rabbitMQ work in separate docker containers
#dramatiq.actor(store_results=True)
def count_words(url):
try:
response = requests.get(url)
count = len(response.text.split(" "))
print(f"There are {count} words at {url!r}.")
except requests.exceptions.MissingSchema:
print(f"Message dropped due to invalid url: {url!r}")
result_backend = RedisBackend(host="172.17.0.2", port=6379)
result_broker = RabbitmqBroker(host="172.17.0.5", port=5672)
result_broker.add_middleware(Results(backend=result_backend))
dramatiq.set_broker(result_broker)
message = count_words.send('https://github.com/Bogdanp/dramatiq')
print(message.get_result(block=True))
RabbitMQ:
{"queue_name":"default","actor_name":"count_words","args":["https://github.com/Bogdanp/dramatiq"],"kwargs":{},"options":{},"message_id":"8e10b6ef-dfef-47dc-9f28-c6e07493efe4","message_timestamp":1608877514655}
Redis
1:C 22 Dec 2020 13:38:15.415 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 22 Dec 2020 13:38:15.417 * Running mode=standalone, port=6379.
1:M 22 Dec 2020 13:38:15.417 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 22 Dec 2020 13:38:15.417 # Server initialized
1:M 22 Dec 2020 13:38:15.417 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 22 Dec 2020 13:38:15.417 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 25 Dec 2020 10:08:12.274 * Background saving terminated with success
1:M 26 Dec 2020 19:23:59.445 * 1 changes in 3600 seconds. Saving...
1:M 26 Dec 2020 19:23:59.660 * Background saving started by pid 24
24:C 26 Dec 2020 19:23:59.890 * DB saved on disk
24:C 26 Dec 2020 19:23:59.905 * RDB: 4 MB of memory used by copy-on-write
1:M 26 Dec 2020 19:23:59.961 * Background saving terminated with success
Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/dramatiq/message.py", line 147, in get_result
return backend.get_result(self, block=block, timeout=timeout)
File "/usr/local/lib/python3.6/dist-packages/dramatiq/results/backends/redis.py", line 81, in get_result
raise ResultTimeout(message)
dramatiq.results.errors.ResultTimeout: count_words('https://github.com/Bogdanp/dramatiq')

Unable to use preloaded uWSGI cheaper algorithm

I'm unable to use uWSGI's busyness cheaper algorithm although it appears to be pre-loaded in my installation. Does it still merit an explicit install?
Is so, where can I download the standalone plugin package from?
Any help is appreciated, thank you.
uWSGI Information
uwsgi --version
2.0.19.1
uwsgi --cheaper-algos-list
*** uWSGI loaded cheaper algorithms ***
busyness
spare
backlog
manual
--- end of cheaper algorithms list ---
uWSGI Configuration File
[uwsgi]
module = myapp:app
socket = /path/to/myapp.sock
stats = /path/to/mystats.sock
chmod-socket = 766
socket-timeout = 60 ; Set internal sockets timeout
logto = /path/to/logs/%n.log
log-maxsize = 5000000 ; Max size before rotating file
disable-logging = true ; Disable built-in logging
log-4xx = true ; But log 4xx
log-5xx = true ; And 5xx
strict = true ; Enable strict mode (placeholder cannot be used)
master = true ; Enable master process
enable-threads = true ; Enable threads
vacuum = true ; Delete sockets during shutdown
single-interpreter = true ; Do not use multiple interpreters (single web app per uWSGI process)
die-on-term = true ; Shutdown when receiving SIGTERM (default is respawn)
need-app = true ; Exit if no app can be loaded
harakiri = 300 ; Forcefully kill hung workers after desired time in seconds
max-requests = 1000 ; Restart workers after this many requests
max-worker-lifetime = 3600 ; Restart workers after this many seconds
reload-on-rss = 1024 ; Restart workers after this much resident memory (this is per worker)
worker-reload-mercy = 60 ; How long to wait for workers to reload before forcefully killing them
cheaper-algo = busyness ; Specify the cheaper algorithm here
processes = 16 ; Maximum number of workers allowed
threads = 4 ; Number of threads per worker allowed
thunder-lock = true ; Specify thunderlock activation
cheaper = 8 ; Number of workers to keep idle
cheaper-initial = 8 ; Workers created at startup
cheaper-step = 4 ; Number of workers to spawn at once
cheaper-overload = 30 ; Check the busyness of the workers at this interval (in seconds)
uWSGI Log
*** Operational MODE: preforking+threaded ***
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x15e8f10 pid: 14479 (default app)
spawned uWSGI master process (pid: 14479)
spawned uWSGI worker 1 (pid: 14511, cores: 4)
spawned uWSGI worker 2 (pid: 14512, cores: 4)
spawned uWSGI worker 3 (pid: 14516, cores: 4)
spawned uWSGI worker 4 (pid: 14520, cores: 4)
spawned uWSGI worker 5 (pid: 14524, cores: 4)
spawned uWSGI worker 6 (pid: 14528, cores: 4)
spawned uWSGI worker 7 (pid: 14529, cores: 4)
spawned uWSGI worker 8 (pid: 14533, cores: 4)
THIS LINE --> unable to find requested cheaper algorithm, falling back to spare <-- THIS LINE
OS Information
Red Hat Enterprise Linux Server release 7.7 (Maipo)
Other Details
uWSGI was installed using pip
The comments were messing things up for me - the below format fixed the issue
########################################################
# #
# Cheaper Algo and Worker Count #
# #
########################################################
cheaper-algo = busyness
processes = 16
threads = 4
thunder-lock = true
cheaper = 8
cheaper-initial = 8
cheaper-step = 4
cheaper-overload = 30

Groovy File() not reporting correct size / length

I have a Jenkins post build Groovy script running out of the "Post build task plugin". From the same plugin, immediately before running the Groovy script, I check for the existence of the file and its size. The log shows:
09:14:53 -rw-r--r-- 1 aaa users 978243 Nov 4 08:53 /jk/workspace/xxxx/output/delta.txt
09:14:53 cppcheck.groovy: Checking build result: SUCCESS
09:14:53 cppcheck.groovy: workspace = /jk/workspace/xxxx
09:14:53 cppcheck.groovy: delta = /jk/workspace/xxxx/output/delta.txt
09:14:53 cppcheck.groovy: delta.txt length = 0
The groovy script is as follows:
import hudson.model.*
def build = Thread.currentThread().executable
def result = build.getResult()
println("cppcheck.groovy: Checking build result: " + result.toString())
if (result.isBetterOrEqualTo(hudson.model.Result.SUCCESS)) {
def workspace = build.getEnvVars()["WORKSPACE"]
def delta = workspace + "/output/delta.txt"
println("cppcheck.groovy: workspace = " + workspace)
println("cppcheck.groovy: delta = " + delta)
def f = new File(delta)
println("cppcheck.groovy: delta.txt length = " + f.length())
if (f.length() > 0) {
build.setResult(hudson.model.Result.UNSTABLE)
}
}
What am I doing wrong here?
Update: There seems to be some scepticism that the file exists and that there is some sort of race condition. To put your minds at rest, let's rule that out. I have modified the build to execute the same ls -l command after it runs the groovy script, to prove the file does exist and that this problem is ultimately Groovy not being able to open the file. I also added the file exists() check to the above Groovy script, which as I suspected it would, reports the file doesn't exist. I don't dispute that Groovy thinks the file doesn't exist. What I am trying to work out is why?
10:31:39 [xxxx] $ /bin/sh -xe /tmp/hudson8964729240493636268.sh
10:31:39 + ls -l /jk/workspace/xxxx/output/delta.txt
10:31:39 -rw-r--r-- 1 aaa users 978243 Nov 4 08:53 /jk/workspace/xxxx/output/delta.txt
10:31:40 cppcheck.groovy: Checking build result: SUCCESS
10:31:40 cppcheck.groovy: workspace = /jk/workspace/xxxx
10:31:40 cppcheck.groovy: delta = /jk/workspace/xxxx/output/delta.txt
10:31:40 cppcheck.groovy: delta.txt length = 0
10:31:40 cppcheck.groovy: delta.txt exists = false
10:31:40 [xxxx] $ /bin/sh -xe /tmp/hudson8007562636561507409.sh
10:31:40 + ls -l /jk/workspace/xxxx/output/delta.txt
10:31:40 -rw-r--r-- 1 aaa users 978243 Nov 4 08:53 /jk/workspace/xxxx/output/delta.txt
Also, notice the timestamp on said file, is still 08:53 when it was created.
I suspected that the Groovy script was running on the build master as opposed to the build node that this particular build was running on. I added some debug to print the hostname for which the Groovy script was running and sure enough it wasn't the same host that the shell variant of the script was running.

Resources