We have a big-data Hadoop cluster based on Hortonworks HDP 2.6.4 with Ambari 2.6.1.
All machines run RHEL 7.2.
The cluster has more than 540 machines; an ambari-agent runs on every machine and communicates with the Ambari server (the Ambari server is installed on only one machine, while the ambari-agent is installed on all of them).
Until we started using Ansible, everything was fine when we did an ambari-agent upgrade and ambari-agent restart.
Recently we began using Ansible (ansible-playbook) to automate the installation, and Ansible runs against all machines.
Now, when a task performs the ambari-agent restart, we immediately notice that the Ansible execution is stopped and killed.
After some investigation we saw that the ambari-agent uses the following ports:
url_port = 8440
secured_url_port = 8441
ping_port = 8670
However, we do not see any Ansible process using the ports above, so we do not think it is related.
Still, the basic issue is clear: when an Ansible task runs ambari-agent restart on a remote machine, the Ansible run is interrupted and killed.
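The task that triggers this is essentially just a service restart; simplified, the playbook step looks something like the sketch below (the host group and task name are placeholders, not our real playbook):

- hosts: all
  become: yes
  tasks:
    - name: Restart ambari-agent   # the step after which the Ansible run is killed
      command: ambari-agent restart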
The ambari-agent configuration looks like this:
[server]
hostname = datanode02.gtfactory.com
url_port = 8440
secured_url_port = 8441
connect_retry_delay = 10
max_reconnect_retry_delay = 30
[agent]
logdir = /var/log/ambari-agent
piddir = /var/run/ambari-agent
prefix = /var/lib/ambari-agent/data
loglevel = INFO
data_cleanup_interval = 86400
data_cleanup_max_age = 2592000
data_cleanup_max_size_mb = 100
ping_port = 8670
cache_dir = /var/lib/ambari-agent/cache
tolerate_download_failures = true
run_as_user = root
parallel_execution = 0
alert_grace_period = 5
status_command_timeout = 5
alert_kinit_timeout = 14400000
system_resource_overrides = /etc/resource_overrides
[security]
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell
[network]
use_system_proxy_settings = true
[services]
pidlookuppath = /var/run/
[heartbeat]
state_interval_seconds = 60
dirs = /etc/hadoop,/etc/hadoop/conf,/etc/hbase,/etc/hcatalog,/etc/hive,/etc/oozie,
  /etc/sqoop,
  /var/run/hadoop,/var/run/zookeeper,/var/run/hbase,/var/run/templeton,/var/run/oozie,
  /var/log/hadoop,/var/log/zookeeper,/var/log/hbase,/var/run/templeton,/var/log/hive
log_lines_count = 300
idle_interval_min = 1
idle_interval_max = 10
[logging]
syslog_enabled = 0
For now we are thinking about the following:
Maybe Ansible crashes because TLSv1 (Transport Layer Security) is restricted; by default the ambari-agent connects with TLSv1.
So we are thinking of setting force_https_protocol=PROTOCOL_TLSv1_2 in the ambari-agent configuration, but this is only an assumption.
Here is our suggestion, the new configuration that might help:
[security]
force_https_protocol=PROTOCOL_TLSv1_2 <------ the new update
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell
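If we go this route, one quick way to check which TLS versions the Ambari server handshake port actually accepts would be an openssl probe against the hostname and url_port from the configuration above (this is only a diagnostic sketch, not something we have run yet):

openssl s_client -connect datanode02.gtfactory.com:8440 -tls1_2 < /dev/null   # should succeed if TLSv1.2 is offered
openssl s_client -connect datanode02.gtfactory.com:8440 -tls1   < /dev/null   # shows whether plain TLSv1 is still accepted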
Related
I have a Django application that works fine when I run it with python manage.py runserver. After I added uWSGI, I frequently started to encounter the error Cannot operate on a closed database. The very same endpoint that raises this error works fine if I call it manually from a browser. The errors usually occur after a few hundred or a few thousand calls (coming in really fast) made by another service.
Here are my uWSGI settings:
[uwsgi]
chdir = ./src
http = :8000
enable-threads = false
master = true
module = config.wsgi:application
workers = 5
thunder-lock = true
vacuum = true
workdir = ./src
add-header = Connection: Keep-Alive
http-keepalive = 65000
max-requests = 50000
max-requests-delta = 10000
max-worker-lifetime = 360000000000 ; Restart workers after this many seconds
reload-on-rss = 2048 ; Restart workers after this much resident memory
worker-reload-mercy = 60 ; How long to wait before forcefully killing workers
lazy-apps = true
ignore-sigpipe = true
ignore-write-errors = true
http-auto-chunked = true
disable-write-exception = true
Note: this is a private project and it will never reach production. My goal is to have a fast way for Django to handle multiple requests using SQLite. Even a dirty solution would be acceptable.
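For context, my database settings are essentially the stock SQLite configuration; below is a sketch of the kind of settings.py change I would be willing to try (the timeout option is only a guess at a dirty mitigation, not something I have verified):

import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        # BASE_DIR is defined earlier in the standard settings.py
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        # guess: give SQLite longer to acquire its lock when several uWSGI workers hit it at once
        'OPTIONS': {'timeout': 20},
    }
}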
I am writing data to InfluxDB using the node-influx library:
https://github.com/node-influx/node-influx
It writes about 500,000 records, and then I start seeing the error below, after which there are no more writes. It looks like a DNS issue, but I am running it inside a Docker container on an Ubuntu 18.04 host.
Error: getaddrinfo EAGAIN influxdb influxdb:8086
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:56:26)
I have the logging level set to debug but I am not seeing any other errors. Any idea what might be causing this?
Update
Tried a different InfluxDB version
Increased the ulimit on the host
Used the IP address of the Docker container instead of the service name; no error is thrown, but the writes still stop silently after some time
Tried calling the write API with curl:
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'snmp,hostipv4=172.16.102.82,oidname=cpu_idle,site=gotham value=1000 1574751020815819489'
This works and a record is inserted in the DB.
Update2
It seems to be a DNS issue on the Docker network. I am not able to ping the influxdb container from the worker container, and the writes are not reaching InfluxDB.
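To narrow this down I am checking whether both containers are actually attached to the same Docker network; these are just the checks I am running, where app_net is a placeholder for my compose network name:

docker network ls                            # list the networks compose created
docker network inspect app_net               # confirm both containers appear here
docker exec worker ping -c 3 influxdb        # resolve the influxdb service name from the worker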
Update3
As a workaround for now, I am forcing a process.exit(1) when the error is caught in my worker and using docker-compose's restart: on-failure policy to restart the service; this resumes the writes.
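The relevant part of the compose file for that workaround looks roughly like this (the service name and compose version are mine, shown only for context):

version: '3'
services:
  worker:
    restart: on-failure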
The retention policy on the DB is set to 2 days.
influxdb.conf
reporting-disabled = true
[meta]
dir = "/var/lib/influxdb/meta"
retention-autocreate = false
logging-enabled = true
[logging]
format = "auto"
level = "debug"
[data]
engine = "tsm1"
dir = "/var/lib/influxdb/data"
wal-dir = "/var/lib/influxdb/wal"
wal-fsync-delay = "200ms"
index-version = "inmem"
wal-logging-enabled = true
query-log-enabled = true
cache-max-memory-size = "2g"
cache-snapshot-memory-size = "256m"
cache-snapshot-write-cold-duration = "20m"
compact-full-write-cold-duration = "24h"
max-concurrent-compactions = 0
compact-throughput = "48m"
max-points-per-block = 0
max-series-per-database = 1000000
trace-logging-enabled = false
[coordinator]
write-timeout = "10s"
max-concurrent-queries = 0
query-timeout = "0s"
log-queries-after = "0s"
max-select-point = 0
max-select-series = 0
max-select-buckets = 0
[retention]
enabled = true
check-interval = "30m0s"
[shard-precreation]
enabled = true
check-interval = "10m0s"
advance-period = "30m0s"
[monitor]
store-enabled = true
store-database = "_internal"
store-interval = "10s"
[http]
enabled = true
bind-address = ":8086"
auth-enabled = false
log-enabled = true
max-concurrent-write-limit = 0
max-enqueued-write-limit = 0
enqueued-write-timeout = 0
[continuous_queries]
enabled = false
log-enabled = true
run-interval = "10s"
All nodes are registering as down after a new Torque install, and I'm not sure why.
[root@rbx-1 6.0.1]# pbsnodes -a
rbx-1
state = down
power_state = Running
np = 1
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
rbx-2
state = down
power_state = Running
np = 1
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
Here is what qmgr says:
[root@rbx-1 6.0.1]# qmgr -c 'p s'
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = rbx-1
set server managers = root@rbx-1
set server operators = root@rbx-1
set server default_queue = batch
set server log_events = 2047
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server down_on_error = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 0
set server moab_array_compatible = True
set server nppcu = 1
set server timeout_for_job_delete = 120
set server timeout_for_job_requeue = 120
Please help. I don't know what's causing this or what to try next. Any pointers to tutorials or other resources would be helpful.
Try running momctl -d0 -h rbx-1 to see if the MOMs are communicating with the server. Make sure the host names in the server_name file match up with /etc/hosts on the server and the compute nodes. I'd guess you don't have the short names in /etc/hosts on the nodes.
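For example, a minimal consistent setup would look something like this (the addresses are placeholders, and the server_name path may differ depending on how Torque was installed):

# /var/spool/torque/server_name (same on the server and every compute node)
rbx-1

# /etc/hosts (on the server and every compute node)
192.168.0.1   rbx-1
192.168.0.2   rbx-2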
I am having a strange error when setting up a unixODBC connection to an Oracle 11g R1 database. After everything was set up, I wanted to test the connection using isql. It keeps returning the error
[08004][unixODBC][Oracle][ODBC][Ora]ORA-12154: TNS:could not resolve the connect identifier specified
What is confusing to me is that I can connect via sqlplus just fine, using the same environment and TNS notation:
sqlplus dbuser/password@DBOPBAC9
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Release 11.2.0.3.0 - 64bit Production
SQL>
I have been working on this problem for two days now and can't find a solution. ORA-12154 is a common error for which I have found a lot of possible solutions, but none of them has worked for me. It is frustrating.
Here is what I have tried.
The environment variables mentioned below are all set before starting isql:
ORACLE_SID=DBOPBAC9
ORACLE_BASE=/CSGPBAC9/DBA/oracle
ORACLE_INSTANT_CLIENT_64=/CSGPBAC9/opt/myuser/tools/instantclient_11_2_x64
ORACLE_HOME=/CSGPBAC9/DBA/oracle/product/11.2.0
TNS_ADMIN=/CSGPBAC9/DBA/oracle/product/11.2.0/network/admin
This is the tnsnames.ora found in the $TNS_ADMIN directory:
DBOPBAC9 =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = host IP)(PORT = 1480))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = DBOPBAC9)
)
)
This is the sqlnet.ora:
TRACE_LEVEL_CLIENT = OFF
SQLNET.EXPIRE_TIME = 10
NAMES.DIRECTORY_PATH = (TNSNAMES)
DIAG_ADR_ENABLED=off
This is my unixODBC setup. I have installed unixODBC into the directory /opt/unixODBC and set the environment variables accordingly. The odbc.ini is in the directory /opt/myuser/tools/unixODBC and the corresponding variables are also set.
odbc.ini
[OracleODBC-11g]
Application Attributes = T
Attributes = W
BatchAutocommitMode = IfAllSuccessful
BindAsFLOAT = F
CloseCursor = F
DisableDPM = F
DisableMTS = T
Driver = Oracle 11g ODBC driver
DSN = OracleODBC-11g
EXECSchemaOpt =
EXECSyntax = T
Failover = T
FailoverDelay = 10
FailoverRetryCount = 10
FetchBufferSize = 64000
ForceWCHAR = F
Lobs = T
Longs = T
MaxLargeData = 0
MetadataIdDefault = F
QueryTimeout = T
ResultSets = T
ServerName = //host.ip/DBOPBAC9
SQLGetData extensions = F
Translation DLL =
Translation Option = 0
DisableRULEHint = T
UserID =
StatementCache=F
CacheBufferSize=20
UseOCIDescribeAny=F
odbcinst.ini
[Oracle 11g ODBC driver]
Description = Oracle ODBC driver for Oracle 11g
Driver =
Driver64 = /CSGPBAC9/opt/myuser/tools/instantclient_11_2_x64/libsqora.so.11.1
Setup =
FileUsage =
CPTimeout =
CPReuse =
I have created an strace output to check for errors, but unfortunately I can't find anything. To me it looks like it is able to find the tnsnames.ora file and read it.
You need to edit odbc.ini:
ServerName = TNS_ALIAS
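If I am reading your tnsnames.ora correctly, the alias defined there is DBOPBAC9, so the change would be roughly:

[OracleODBC-11g]
ServerName = DBOPBAC9

and then re-test the DSN with isql (user and password as in your sqlplus example):

isql -v OracleODBC-11g dbuser password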
My config file:
external_url "http://192.168.3.23" # note the use of a dotted ip
gitlab_rails['gitlab_email_enabled'] = true
gitlab_rails['gitlab_email_from'] = 'gitlab@myhome.com'
gitlab_rails['gitlab_email_display_name'] = 'gitlab'
#gitlab_rails['gitlab_email_reply_to'] = 'gitlab@myhome.com'
gitlab_rails['smtp_enable'] = true
gitlab_rails['smtp_address'] = "mail.home"
gitlab_rails['smtp_port'] = 25
gitlab_rails['smtp_domain'] = "myhome.com"
mattermost_external_url 'http://192.168.3.23'
mattermost['gitlab_enable'] = true
mattermost['gitlab_secret'] = "4d1e<***>bdbfe"
mattermost['gitlab_id'] = "1c441<***>092df"
mattermost['gitlab_scope'] = ""
mattermost['gitlab_auth_endpoint'] = "http://192.168.3.23/oauth/authorize"
mattermost['gitlab_token_endpoint'] = "http://192.168.3.23/oauth/token"
mattermost['gitlab_user_api_endpoint'] = "http://192.168.3.23/api/v3/user"
# Shut down GitLab services on the Mattermost server
#gitlab_rails['enable'] = false
But now only GitLab loads at the address 192.168.3.23:
GitLab Community Edition 8.4.4 9c31cc6!
How do I start using GitLab and Mattermost together?
You need to use different URLs for GitLab and Mattermost:
external_url "http://192.168.3.23"
...
mattermost_external_url "http://192.168.3.23:8065"
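Assuming this is the Omnibus package, run a reconfigure after editing the configuration so the change takes effect:

sudo gitlab-ctl reconfigure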
Solved here.