I have a Django application that works fine when I run it with python manage.py runserver. After I added uwsgi, I frequently started to encounter the error Cannot operate on a closed database. The very same endpoint that raises this error works fine if I call it manually from a browser. The errors usually occur after a few hundred or thousand calls (coming in very quickly) made by another service.
Here are my uwsgi settings:
[uwsgi]
chdir = ./src
http = :8000
enable-threads = false
master = true
module = config.wsgi:application
workers = 5
thunder-lock = true
vacuum = true
workdir = ./src
add-header = Connection: Keep-Alive
http-keepalive = 65000
max-requests = 50000
max-requests-delta = 10000
max-worker-lifetime = 360000000000 ; Restart workers after this many seconds
reload-on-rss = 2048 ; Restart workers after this much resident memory
worker-reload-mercy = 60 ; How long to wait before forcefully killing workers
lazy-apps = true
ignore-sigpipe = true
ignore-write-errors = true
http-auto-chunked = true
disable-write-exception = true
Note: this is a private project and it will never reach production. My goal is to have a fast way for Django to handle multiple requests using SQLite. Even a dirty solution would be acceptable.
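For context, this is the kind of dirty solution I had in mind, sketched against the default SQLite DATABASES entry in settings.py (the 30-second value is just an illustration, not something I currently use):

# settings.py (sketch): give SQLite a longer lock timeout so concurrent
# uwsgi workers wait on the file lock instead of erroring out immediately.
# BASE_DIR is the Path defined at the top of the default settings.py.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "db.sqlite3",
        "OPTIONS": {
            "timeout": 30,  # seconds to wait on a locked database
        },
    }
}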
We have a big-data Hadoop cluster based on Hortonworks HDP 2.6.4 and Ambari 2.6.1.
All machines run RHEL 7.2.
The cluster has more than 540 machines. Every machine runs an ambari-agent that communicates with the Ambari server (the Ambari server is installed on only one machine).
Until we started using Ansible, everything was fine when we did an ambari-agent upgrade and an ambari-agent restart.
Recently, however, we started using Ansible (ansible-playbook) to automate the installation, and Ansible runs on all machines.
Now, when a task does the ambari-agent restart, we immediately notice that the Ansible execution stops and is killed.
After some investigation we saw that ambari-agent uses the following ports:
url_port = 8440
secured_url_port = 8441
ping_port = 8670
We did not see any Ansible process using these ports, so we do not think they are related.
The basic issue is clear: when an Ansible task runs ambari-agent restart on a remote machine, Ansible is interrupted and killed.
The ambari-agent configuration looks like this:
[server]
hostname = datanode02.gtfactory.com
url_port = 8440
secured_url_port = 8441
connect_retry_delay = 10
max_reconnect_retry_delay = 30
[agent]
logdir = /var/log/ambari-agent
piddir = /var/run/ambari-agent
prefix = /var/lib/ambari-agent/data
loglevel = INFO
data_cleanup_interval = 86400
data_cleanup_max_age = 2592000
data_cleanup_max_size_mb = 100
ping_port = 8670
cache_dir = /var/lib/ambari-agent/cache
tolerate_download_failures = true
run_as_user = root
parallel_execution = 0
alert_grace_period = 5
status_command_timeout = 5
alert_kinit_timeout = 14400000
system_resource_overrides = /etc/resource_overrides
[security]
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell
[network]
use_system_proxy_settings = true
[services]
pidlookuppath = /var/run/
[heartbeat]
state_interval_seconds = 60
dirs = /etc/hadoop,/etc/hadoop/conf,/etc/hbase,/etc/hcatalog,/etc/hive,/etc/oozie,
/etc/sqoop,
/var/run/hadoop,/var/run/zookeeper,/var/run/hbase,/var/run/templeton,/var/run/oozie,
/var/log/hadoop,/var/log/zookeeper,/var/log/hbase,/var/run/templeton,/var/log/hive
log_lines_count = 300
idle_interval_min = 1
idle_interval_max = 10
[logging]
syslog_enabled = 0
For now we are thinking about the following:
Maybe Ansible crashes because TLSv1 is restricted (Transport Layer Security); by default ambari-agent connects with TLSv1.
So we are considering setting force_https_protocol=PROTOCOL_TLSv1_2 in the ambari-agent configuration, but this is only an assumption.
Could our suggestion and the new configuration below help?
[security]
force_https_protocol=PROTOCOL_TLSv1_2 <------ the new update
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell
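If we go with that assumption, this is a rough, untested sketch of how we might push the setting to all agents with an Ansible ini_file task (the config path and the ambari_agents group name are assumptions on our side), followed by the usual ambari-agent restart on each host:

# Playbook sketch (untested): add force_https_protocol to the [security]
# section of every agent's config; "ambari_agents" is a hypothetical group.
- hosts: ambari_agents
  become: yes
  tasks:
    - name: Force TLSv1.2 for ambari-agent
      ini_file:
        path: /etc/ambari-agent/conf/ambari-agent.ini   # assumed default path
        section: security
        option: force_https_protocol
        value: PROTOCOL_TLSv1_2
        backup: yes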
I am writing data to InfluxDB using the node-influx library.
https://github.com/node-influx/node-influx
It writes about 500,000 records and then I start seeing the error below, after which there are no more writes. It looks like a DNS issue, but I am running it inside a Docker container on an Ubuntu 18.04 host.
Error: getaddrinfo EAGAIN influxdb influxdb:8086 |at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:56:26)
I have the logging level set to debug but I am not seeing any other errors. Any idea what might be causing this?
Update
Tried with a different InfluxDB version.
Increased the ulimit of the host.
Used the IP address of the InfluxDB container instead of the service name; no error is thrown, but writes still stop silently after some time.
Tried to call the write API with curl:
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'snmp,hostipv4=172.16.102.82,oidname=cpu_idle,site=gotham value=1000 1574751020815819489'
This works and a record is inserted in the DB.
Update 2
It seems to be a DNS issue on the Docker network. I am not able to ping the influxdb container from the worker container. The writes are not reaching InfluxDB.
Update 3
As a workaround for now, I am forcing a process.exit(1) when the error is caught in my worker and using docker-compose restart: on-failure to restart the service. This resumes the writes.
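For reference, this is roughly what I mean in docker-compose (a sketch; the service names, build path, and image tag are placeholders for my setup):

# docker-compose.yml sketch: restart the worker whenever it exits
# non-zero (e.g. after process.exit(1) on the write error).
version: "3"
services:
  influxdb:
    image: influxdb:1.7
    ports:
      - "8086:8086"
  worker:
    build: ./worker
    restart: on-failure
    depends_on:
      - influxdb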
The retention policy on the DB is set to 2 days.
influxdb.conf
reporting-disabled = true
[meta]
dir = "/var/lib/influxdb/meta"
retention-autocreate = false
logging-enabled = true
[logging]
format = "auto"
level = "debug"
[data]
engine = "tsm1"
dir = "/var/lib/influxdb/data"
wal-dir = "/var/lib/influxdb/wal"
wal-fsync-delay = "200ms"
index-version = "inmem"
wal-logging-enabled = true
query-log-enabled = true
cache-max-memory-size = "2g"
cache-snapshot-memory-size = "256m"
cache-snapshot-write-cold-duration = "20m"
compact-full-write-cold-duration = "24h"
max-concurrent-compactions = 0
compact-throughput = "48m"
max-points-per-block = 0
max-series-per-database = 1000000
trace-logging-enabled = false
[coordinator]
write-timeout = "10s"
max-concurrent-queries = 0
query-timeout = "0s"
log-queries-after = "0s"
max-select-point = 0
max-select-series = 0
max-select-buckets = 0
[retention]
enabled = true
check-interval = "30m0s"
[shard-precreation]
enabled = true
check-interval = "10m0s"
advance-period = "30m0s"
[monitor]
store-enabled = true
store-database = "_internal"
store-interval = "10s"
[http]
enabled = true
bind-address = ":8086"
auth-enabled = false
log-enabled = true
max-concurrent-write-limit = 0
max-enqueued-write-limit = 0
enqueued-write-timeout = 0
[continuous_queries]
enabled = false
log-enabled = true
run-interval = "10s"
In a Gunicorn app, I need to allow only a certain number of connections and reject the rest with an error. I have this test config:
timeout = 60
graceful_timeout = 60
workers = 1
worker_connections = 1
backlog = 1
worker_class = "gevent"
max_requests = 1000
max_requests_jitter = 42
preload_app = True
bind = "0.0.0.0:8080"
loglevel = "debug"
accesslog = "-" # Send access log to stdout.
I expected this to result in accepting only one connection at a time and rejecting the rest. But when I send multiple requests at once, they are queued and processed one by one. For testing purposes, each request takes 10 seconds to process, to make sure there is only one active connection at a time.
I'm using Gunicorn version 19.9.0.
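For reference, this is roughly how I send the concurrent requests (a sketch; the URL and the slow endpoint name are placeholders for my test app, nothing standard):

# test_client.py: fire several requests at once and print when each one
# completes, to see whether extra connections are rejected or just queued.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/slow"  # hypothetical 10-second endpoint

def call(i):
    start = time.time()
    try:
        r = requests.get(URL, timeout=120)
        return i, r.status_code, round(time.time() - start, 1)
    except requests.RequestException as exc:
        return i, str(exc), round(time.time() - start, 1)

with ThreadPoolExecutor(max_workers=5) as pool:
    for result in pool.map(call, range(5)):
        print(result)  # with backlog = 1 I expected errors here, not queuing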
All nodes are registering as down after a new Torque install, and I'm not sure why.
[root@rbx-1 6.0.1]# pbsnodes -a
rbx-1
state = down
power_state = Running
np = 1
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
rbx-2
state = down
power_state = Running
np = 1
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
Here is what qmgr says:
[root@rbx-1 6.0.1]# qmgr -c 'p s'
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = rbx-1
set server managers = root#rbx-1
set server operators = root#rbx-1
set server default_queue = batch
set server log_events = 2047
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server down_on_error = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 0
set server moab_array_compatible = True
set server nppcu = 1
set server timeout_for_job_delete = 120
set server timeout_for_job_requeue = 120
Please help: I don't know what's causing this or what to try next. Any pointers to tutorials or other resources would be helpful.
Try running momctl -d0 -h rbx-1 to see if the MOMs are communicating with the server. Make sure the host names in the server_name file match up with /etc/hosts on the server and the compute nodes. I'd guess you don't have the short names in /etc/hosts on the nodes.
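For example, something along these lines on the server and on every compute node (a sketch with made-up addresses and domain; adjust to your subnet):

# /etc/hosts sketch: both the FQDN and the short name resolve to each
# node's address, so the names in server_name and the MOM configs match.
192.168.1.1   rbx-1.example.com   rbx-1
192.168.1.2   rbx-2.example.com   rbx-2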
When configuring watchers, what would be the purpose of including both of these settings under a watcher:
singleton = True
numprocess = 1
The documentation states that setting singleton has the following effect:
singleton:
If set to True, this watcher will have at the most one process. Defaults to False.
I read that as negating the need to specify numprocesses; however, in the GitHub repository they provide an example:
https://github.com/circus-tent/circus/blob/master/examples/example6.ini
Included here as well, where they specify both:
[circus]
check_delay = 5
endpoint = tcp://127.0.0.1:5555
pubsub_endpoint = tcp://127.0.0.1:5556
stats_endpoint = tcp://127.0.0.1:5557
httpd = True
debug = True
httpd_port = 8080
[watcher:swiss]
cmd = ../bin/python
args = -u flask_app.py
warmup_delay = 0
numprocesses = 1
singleton = True
stdout_stream.class = StdoutStream
stderr_stream.class = StdoutStream
So I would assume they do something different and in some way work together?
numprocesses is the initial number of processes for a given watcher. In the example you provided it is set to 1, but a user can typically add more processes as needed.
singleton only allows a maximum of 1 process running for a given watcher, so it forbids you from incrementing the number of processes dynamically.
The code below, from the circus test suite, describes it well:
@tornado.testing.gen_test
def test_singleton(self):
    # yield self._stop_runners()
    yield self.start_arbiter(singleton=True, loop=get_ioloop())
    cli = AsyncCircusClient(endpoint=self.arbiter.endpoint)
    # adding more than one process should fail
    yield cli.send_message('incr', name='test')
    res = yield cli.send_message('list', name='test')
    self.assertEqual(len(res.get('pids')), 1)
    yield self.stop_arbiter()
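In practical terms, with the example watcher above (assuming circusctl is pointed at that arbiter), an increment request should leave the watcher with a single process:

circusctl incr swiss   # with singleton = True the process count stays at 1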