All nodes down after Torque install

All nodes down after Torque install - linux

All nodes registering as down after new torque install. I'm not sure why
[root#rbx-1 6.0.1]# pbsnodes -a
rbx-1
state = down
power_state = Running
np = 1
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
rbx-2
state = down
power_state = Running
np = 1
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
Here is qmgr says
[root#rbx-1 6.0.1]# qmgr -c 'p s'
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = rbx-1
set server managers = root#rbx-1
set server operators = root#rbx-1
set server default_queue = batch
set server log_events = 2047
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server down_on_error = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 0
set server moab_array_compatible = True
set server nppcu = 1
set server timeout_for_job_delete = 120
set server timeout_for_job_requeue = 120
Please help- I don't know what's causing this or what to try next. Any ideas on tutorials or other would be helpful

Try running momctl -d0 -h rbx-1 to see if the MOMs are communicating with the server. Make sure the host names in the server_name file match up with /etc/hosts on the server and the compute nodes. I'd guess you don't have the short names in /etc/hosts on the nodes.

Related

Django + uwsgi = Sqlite: cannot operate on a closed database

I have a Django application that works fine when running it with python manage.py runserver. After I added uwsgi, I frequently started to encounter the error Cannot operate on a closed database. The very same endpoint that raises this error works fine if I call it manually from a browser. The errors occur usually after a few hundreds / thousands call (coming really fast) which are made by another service.
Here's my uwsgi settings:
[uwsgi]
chdir = ./src
http = :8000
enable-threads = false
master = true
module = config.wsgi:application
workers = 5
thunder-lock = true
vacuum = true
workdir = ./src
add-header = Connection: Keep-Alive
http-keepalive = 65000
max-requests = 50000
max-requests-delta = 10000
max-worker-lifetime = 360000000000 ; Restart workers after this many seconds
reload-on-rss = 2048 ; Restart workers after this much resident memory
worker-reload-mercy = 60 ; How long to wait before forcefully killing workers
lazy-apps = true
ignore-sigpipe = true
ignore-write-errors = true
http-auto-chunked = true
disable-write-exception = true
Note: this is a private project and it will never reach production. My goal is to have a fast way for django to handle multiple requests using sqlite. Even a dirty solution would be acceptable.

ambari_agent restart cause ansible crash

we have big-data Hadoop cluster based on horton-works HDP version 2.6.4 and ambari 2.6.1 version
all machines are with RHEL 7.2 version
in our cluster we have more then 540 machines and on all machines we have ambari-agent that communicate with ambari server , ( Ambari server is installed only on one machine ) while ambari-agent installed on all machines
until using ansible everything was good , when we do ambari-agent upgrade and ambari-agent restart
but recently we start to use ansible ( ansible-playbook ) in order to automate the installation
and ansible is running on all machines
so when task do the ambari-agent restart , then imminently we notice that ansible execution stooped and killed
after some investigation we saw that ambari agent is using the following ports
url_port = 8440
secured_url_port = 8441
ping_port = 8670
but I not see that any ansible process used above ports , so we not think its related
but the basic issue is clear
when ansible task doing on remote machine - ambari-agent restart , then its caused ansible interrupt and ansible killed
ambari-agent configuration looks like this
[server]
hostname = datanode02.gtfactory.com
url_port = 8440
secured_url_port = 8441
connect_retry_delay = 10
max_reconnect_retry_delay = 30
[agent]
logdir = /var/log/ambari-agent
piddir = /var/run/ambari-agent
prefix = /var/lib/ambari-agent/data
loglevel = INFO
data_cleanup_interval = 86400
data_cleanup_max_age = 2592000
data_cleanup_max_size_mb = 100
ping_port = 8670
cache_dir = /var/lib/ambari-agent/cache
tolerate_download_failures = true
run_as_user = root
parallel_execution = 0
alert_grace_period = 5
status_command_timeout = 5
alert_kinit_timeout = 14400000
system_resource_overrides = /etc/resource_overrides
[security]
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell
[network]
use_system_proxy_settings = true
[services]
pidlookuppath = /var/run/
[heartbeat]
state_interval_seconds = 60
dirs = /etc/hadoop,/etc/hadoop/conf,/etc/hbase,/etc/hcatalog,/etc/hive,/etc/oozie,
/etc/sqoop,
/var/run/hadoop,/var/run/zookeeper,/var/run/hbase,/var/run/templeton,/var/run/oozie,
/var/log/hadoop,/var/log/zookeeper,/var/log/hbase,/var/run/templeton,/var/log/hive
log_lines_count = 300
idle_interval_min = 1
idle_interval_max = 10
[logging]
syslog_enabled = 0
for now we are thinking about the following:
maybe ansible crash because TLSv1 is restricted ( Transport Layer Security ) , the default is that ambari-agent connects to TLSv1
so we think to set force_https_protocol=PROTOCOL_TLSv1_2 in ambari agent configuration , but this is only assumption
our suggestion and the new conf that maybe can help?
[security]
force_https_protocol=PROTOCOL_TLSv1_2 <------ the new update
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell

InfluxDB write failure node-influx library

I am writing data to influxdb using the node-influx library.
https://github.com/node-influx/node-influx
It writes about 500,000 records and then I start seeing this error following which there are no more writes. It looks like a dns issue but I am running it inside a docker container on ubuntu 18.04 host.
Error: getaddrinfo EAGAIN influxdb influxdb:8086 |at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:56:26)
I have the logging level set to debug but I am not seeing any other errors. Any idea what might be causing this?
Update
tried with a different influx version
increased ulimit of host
used ip address of docker-container instead of service name, no error is thrown, writes stop after sometime silently
tried to call the write API with curl
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'snmp,hostipv4=172.16.102.82,oidname=cpu_idle,site=gotham value=1000 1574751020815819489'
This works and a record is inserted in the DB.
Update2
It seems to be a dns issue on the docker network. I am not able to ping the influxdb container from the worker container. The writes are not reaching influx.
Update3
As a workaround for now, I am forcing a process.exit(1) on catching the error on my worker and using docker-compose restart: on-failure to restart the service. This resumes the writes.
retention policy is set for 2 days on the DB
influxdb.conf
reporting-disabled = true
[meta]
dir = "/var/lib/influxdb/meta"
retention-autocreate = false
logging-enabled = true
[logging]
format = "auto"
level = "debug"
[data]
engine = "tsm1"
dir = "/var/lib/influxdb/data"
wal-dir = "/var/lib/influxdb/wal"
wal-fsync-delay = "200ms"
index-version = "inmem"
wal-logging-enabled = true
query-log-enabled = true
cache-max-memory-size = "2g"
cache-snapshot-memory-size = "256m"
cache-snapshot-write-cold-duration = "20m"
compact-full-write-cold-duration = "24h"
max-concurrent-compactions = 0
compact-throughput = "48m"
max-points-per-block = 0
max-series-per-database = 1000000
trace-logging-enabled = false
[coordinator]
write-timeout = "10s"
max-concurrent-queries = 0
query-timeout = "0s"
log-queries-after = "0s"
max-select-point = 0
max-select-series = 0
max-select-buckets = 0
[retention]
enabled = true
check-interval = "30m0s"
[shard-precreation]
enabled = true
check-interval = "10m0s"
advance-period = "30m0s"
[monitor]
store-enabled = true
store-database = "_internal"
store-interval = "10s"
[http]
enabled = true
bind-address = ":8086"
auth-enabled = false
log-enabled = true
max-concurrent-write-limit = 0
max-enqueued-write-limit = 0
enqueued-write-timeout = 0
[continuous_queries]
enabled = false
log-enabled = true
run-interval = "10s"

DEBUG: Routing failed, re-queued. Kannel

Hi I am trying to setup Kannel 1.4.3 for sending and receiving SMS. But I'm getting Routing Failed error
ERROR: AT2[Huawei-E220-00]: Couldn't connect (retrying in 10 seconds).
Message in the browser:
3: Queued for later delivery
Details:
Modem - Huawei e220
Sim - AT&T
OS - Ubuntu 14.04 LTS
Please let me know if there is anything wrong in the following smskannel.conf:
#---------------------------------------------
# CORE
#
group = core
admin-port = 13000
smsbox-port = 13001
admin-password = bar
#status-password = foo
box-deny-ip = "*.*.*.*"
box-allow-ip = "127.0.0.1"
#unified-prefix = "+358,00358,0;+,00"
#---------------------------------------------
# SMSC CONNECTIONS
#
group = smsc
smsc = at
smsc-id = Huawei-E220-00
port = 10000
modemtype = huawei_e220_00
device = /dev/ttyUSB0
sms-center = +13123149810
my-number = +1xxxxxxxxxx
connect-allow-ip = 127.0.0.1
sim-buffering = true
keepalive = 5
#---------------------------------------------
# SMSBOX SETUP
#
group = smsbox
bearerbox-host = 127.0.0.1
sendsms-port = 13013
global-sender = 13013
#---------------------------------------------
# SEND-SMS USERS
#
group = sendsms-user
username = tester
password = foobar
#---------------------------------------------
# SERVICES
group = sms-service
keyword = nop
text = "You asked nothing and I did it!"
group = sms-service
keyword = default
text = "No service specified"
group = sms-service
keyword = complex
catch-all = yes
accept-x-kannel-headers = true
max-messages = 3
concatenation = true
get-url = "http://127.0.0.1:13013/cgi-bin/sendsms?username=tester&password=foobar&to=+16782304782&text=Hello World"
#---------------------------------------------
# MODEMS
#
group = modems
id = huawei_e220_00
name = "Huawei E220"
detect-string = "huawei"
init-string = "ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0"
message-storage = "SM"
need-sleep = true
speed = 460800`

How to set max,count,active and inactive configuration in phusion passanger

I want to set below configuration in phusion passenger.
max = 6
count = 2
active = 2
inactive = 0
Waiting on global queue = 0
Can you please tell me where to set this.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

All nodes down after Torque install - linux

Try running momctl -d0 -h rbx-1 to see if the MOMs are communicating with the server. Make sure the host names in the server_name file match up with /etc/hosts on the server and the compute nodes. I'd guess you don't have the short names in /etc/hosts on the nodes.

Related

Django + uwsgi = Sqlite: cannot operate on a closed database

ambari_agent restart cause ansible crash

InfluxDB write failure node-influx library

DEBUG: Routing failed, re-queued. Kannel

How to set max,count,active and inactive configuration in phusion passanger

Categories

Resources