Varnish processes not closing and taking huge memory - Linux

We are using Varnish Cache 4.1 on a CentOS server. When we start the Varnish server, lots of varnish processes start and never close; because of this we are facing what looks like a memory leak. Please let us know how we can resolve it.
My configuration in /etc/sysconfig/varnish is:
#DAEMON_OPTS="-a :80 \
# -T localhost:6082 \
# -f /etc/varnish/default.vcl \
# -S /etc/varnish/secret \
# -p thread_pools=8 \
# -p thread_pool_max=4000 \
# -p thread_pool_add_delay=1 \
# -p send_timeout=30 \
# -p listen_depth=4096 \
# -s malloc,2G"
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = {
        .url = "/";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}
1678 varnish 20 0 345M 89208 83360 S 0.0 4.3 0:00.00 /usr/sbin/varnishd -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 120 -p thread_pool_min=50 -p ...
1679 varnish 20 0 345M 89208 83360 S 0.0 4.3 0:00.03 /usr/sbin/varnishd -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 120 -p thread_pool_min=50 -p ...

You are not limiting the space for transient objects. By default an unlimited malloc storage is used for them (see the official docs: https://www.varnish-cache.org/docs/4.0/users-guide/storage-backends.html#transient-storage ).
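You can cap it by declaring the Transient storage explicitly on the varnishd command line; a sketch (the 100M size is only an example, size it to your traffic):

varnishd ... -s malloc,2G -s Transient=malloc,100M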
From what I see in your message, you are not using the DAEMON_OPTS parameter at all: every line of it is commented out.
What are the contents of your varnishd.service file and /etc/varnish/varnish.params?
EDIT
Nothing's wrong with your init.d script. It should use the settings found in /etc/sysconfig/varnish.
How much RAM is consumed by Varnish?
All the Varnish threads share the same storage (malloc 2G + Transient malloc 100M), so storage should take up to 2.1G. You need to add an average overhead of about 1KB per object stored in cache to get the total memory used; for example, one million cached objects would add roughly 1GB on top of the 2.1G of storage.
I don't think you are suffering from a memory leak; the processes are normal. You told Varnish to spawn 50 thread pools (with the thread_pools parameter), so they are expected.
I'd recommend decreasing the number of thread_pools: you are setting it to 50, but you should be able to lower it to something between 2 and 8. At the same time, it will help to increase thread_pool_max to 5000 and set thread_pool_min to 1000.
We are running a very large server with 2 pools * 1000-5000 threads and have no issues.
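For reference, a sketch of how /etc/sysconfig/varnish could look with those values plugged in (uncommented; the explicit Transient cap is an assumption carried over from the first point, and all sizes should be tuned to your hardware):

DAEMON_OPTS="-a :80 \
             -T localhost:6082 \
             -f /etc/varnish/default.vcl \
             -S /etc/varnish/secret \
             -p thread_pools=2 \
             -p thread_pool_min=1000 \
             -p thread_pool_max=5000 \
             -p send_timeout=30 \
             -p listen_depth=4096 \
             -s malloc,2G \
             -s Transient=malloc,100M"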

Related

NRPE not pulling data: NRPE: Unable to read output

I'm trying to get a memory metric from a client machine. I installed NRPE on the client machine and it works well for default checks like load, users and so on.
Manual output from the client machine:
root@Nginx:~# /usr/lib/nagios/plugins/check_mem -w 50 -c 40
OK - 7199 MB (96%) Free Memory
But when I try from the server, other metrics work but the memory metric does not:
[ec2-user@ip-10-0-2-179 ~]$ /usr/lib64/nagios/plugins/check_nrpe -H 107.XX.XX.XX -c check_mem
NRPE: Unable to read output
Other metrics work well:
[ec2-user@ip-10-0-2-179 ~]$ /usr/lib64/nagios/plugins/check_nrpe -H 107.XX.XX.XX -c check_load
OK - load average: 0.00, 0.01, 0.05|load1=0.000;15.000;30.000;0; load5=0.010;10.000;25.000;0; load15=0.050;5.000;20.000;0;
I ensured that the check_mem command has execute permission for all:
root@Nginx:~# ll /usr/lib/nagios/plugins/check_mem
-rwxr-xr-x 1 root root 2394 Sep 6 00:00 /usr/lib/nagios/plugins/check_mem*
Also, here are my client-side NRPE command definitions:
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/sda1
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w 200 -c 250
command[check_http]=/usr/lib/nagios/plugins/check_http -I 127.0.0.1
command[check_swap]=/usr/lib/nagios/plugins/check_swap -w 30 -c 20
command[check_mem]=/usr/lib/nagios/plugins/check_mem -w 30 -c 20
Can anyone help me to fix the issue?
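One thing worth checking (an assumption, since distributions differ: the NRPE daemon usually runs as the nagios user): "NRPE: Unable to read output" generally means the command produced nothing on stdout, so try running the plugin as that user on the client:

sudo -u nagios /usr/lib/nagios/plugins/check_mem -w 30 -c 20

If that fails or prints only to stderr, fix the plugin or its permissions; if it works, double-check that the command path in nrpe.cfg matches the plugin's actual location and restart the nrpe service after any config change.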

Jetty Websocket Scaling

What is the maximum number of WebSockets anyone has opened using the Jetty WebSocket server? I recently load tested this and was able to open 200k concurrent connections on an 8-core Linux VM as the server, using 16 clients with 4 cores each. Each client was able to make 12,500 concurrent connections, after which they started to get socket timeout exceptions. I had also tweaked the number of open files as well as the TCP connection settings of both client and server as follows.
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
sudo sysctl -w net.core.somaxconn=8192
sudo sysctl -w net.core.netdev_max_backlog=16384
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sudo sysctl -w net.ipv4.tcp_syncookies=1
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w net.ipv4.tcp_tw_recycle=1
sudo sysctl -w net.ipv4.tcp_congestion_control=cubic
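Note that sysctl -w changes are lost on reboot. To make them persistent, a sketch assuming a distribution that reads /etc/sysctl.d/ (the file name is arbitrary):

# /etc/sysctl.d/90-websocket-loadtest.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.core.somaxconn = 8192
net.core.netdev_max_backlog = 16384
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_congestion_control = cubic

Apply it without rebooting via sudo sysctl --system.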
By contrast, a single 2-core machine running Node.js was able to scale up to 90k connections.
My questions are as follows:
Can we increase the throughput of the Jetty VM any further?
What is the reason for Node.js's higher performance over Jetty?

How to set nginx max open files?

Although I have applied the following settings, and even restarted the server:
# head /etc/security/limits.conf -n2
www-data soft nofile -1
www-data hard nofile -1
# /sbin/sysctl fs.file-max
fs.file-max = 201558
The open-files limit of the specific process is still 1024/4096:
# ps aux | grep nginx
root 983 0.0 0.0 85872 1348 ? Ss 15:42 0:00 nginx: master process /usr/sbin/nginx
www-data 984 0.0 0.2 89780 6000 ? S 15:42 0:00 nginx: worker process
www-data 985 0.0 0.2 89780 5472 ? S 15:42 0:00 nginx: worker process
root 1247 0.0 0.0 11744 916 pts/0 S+ 15:47 0:00 grep --color=auto nginx
# cat /proc/984/limits
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            8388608     unlimited   bytes
Max core file size        0           unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             15845       15845       processes
Max open files            1024        4096        files
Max locked memory         65536       65536       bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       15845       15845       signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us
I've tried every solution I could find by googling, but in vain. What setting did I miss?
On CentOS (tested on 7.x):
Create file /etc/systemd/system/nginx.service.d/override.conf with the following contents:
[Service]
LimitNOFILE=65536
Reload systemd daemon with:
systemctl daemon-reload
Add this to the Nginx config file (it has to be smaller than or equal to the LimitNOFILE set above):
worker_rlimit_nofile 16384;
And finally restart Nginx:
systemctl restart nginx
You can verify that it works with cat /proc/<nginx-pid>/limits.
I found the answer a few minutes after posting this question...
# cat /etc/default/nginx
# Note: You may want to look at the following page before setting the ULIMIT.
# http://wiki.nginx.org/CoreModule#worker_rlimit_nofile
# Set the ulimit variable if you need defaults to change.
# Example: ULIMIT="-n 4096"
ULIMIT="-n 15000"
/etc/security/limits.conf is used by PAM, so it should have nothing to do with www-data (it's a nologin user).
For nginx, simply editing nginx.conf and setting worker_rlimit_nofile should change the limit.
I initially thought it was a self-imposed limit of nginx, but it does increase the limit per worker process:
worker_rlimit_nofile 4096;
You can test it by getting an nginx process ID (from top -u nginx) and then running:
cat /proc/{PID}/limits
to see the current limits.
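For instance, a one-liner against the master process (assuming the default pid file at /var/run/nginx.pid):

grep 'open files' /proc/$(cat /var/run/nginx.pid)/limits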
Another way on CentOS 7 is to run systemctl edit SERVICE_NAME and add the variables there:
[Service]
LimitNOFILE=65536
Save that file and reload the service.
For those looking for an answer on pre-systemd Debian machines: the nginx init script sources /etc/default/nginx, so adding the line
ulimit -n 9999
will change the limit for the nginx daemon without messing around with the init script.
Adding ULIMIT="-n 15000" as in a previous answer didn't work with my nginx version.

Why are varnish ban requests not removing/invalidating cached data?

I am trying to debug inconsistent behavior with varnish.
I have an application in which, whenever a piece of content is updated, a ban request is issued to Varnish in order to remove that content from the cache and invalidate it. The problem is that this works fine only a few times, not in the majority of cases, although I can see the bans in the varnish log. To rephrase: when I save a piece of content, a ban of the following form is issued to Varnish:
1374003254.031996 75 req.http.host ~ www.example.com && req.url ~ ^(.*)(?<!\d{1})539250(?!\d{1})
where 539250 is the unique content id present in the URL.
I logged into the varnish host and checked the varnish processes. Executing ps -ef | grep varn gives:
root 8889 1 0 15:19 ? 00:00:00 /usr/sbin/varnishd -P /var/run/varnish.pid -a :80 -T :8100 -f /etc/varnish/qa.vcl -u varnish -g varnish -h critbit -p http_max_hdr 256 -p thread_pool_min 200 -p thread_pool_max 4000 -p thread_pools 2 -p thread_pool_stack 262144 -p thread_pool_add_delay 2 -p session_linger 100 -p sess_timeout 60 -p listen_depth 4096 -p lru_interval 20 -p ban_lurker_sleep 0.2 -s malloc,1G
varnish 8897 8889 0 15:19 ? 00:00:00 /usr/sbin/varnishd -P /var/run/varnish.pid -a :80 -T :8100 -f /etc/varnish/qa.vcl -u varnish -g varnish -h critbit -p http_max_hdr 256 -p thread_pool_min 200 -p thread_pool_max 4000 -p thread_pools 2 -p thread_pool_stack 262144 -p thread_pool_add_delay 2 -p session_linger 100 -p sess_timeout 60 -p listen_depth 4096 -p lru_interval 20 -p ban_lurker_sleep 0.2 -s malloc,1G
Is it normal to have 2 processes?
Then I ran ban.list in the varnish CLI:
1374003254.031996 75 req.http.host ~ example.com && req.url ~ ^(.*)(?<!\d{1})539250(?!\d{1})
1374003202.365076 224G req.http.host ~ example.com && req.url ~ ^(.*)(?<!\d{1})539250(?!\d{1})
1374003116.772315 83G req.http.host ~ example.com && req.url ~ ^(.*)(?<!\d{1})539250(?!\d{1})
1374002967.450431 267G req.http.host ~ example.com && req.url ~ ^(.*)(?<!\d{1})539250(?!\d{1})
1374002756.701640 187G req.http.host ~ example.com && req.url ~ ^(.*)(?<!\d{1})539250(?!\d{1})
All I want to know is whether there is something wrong causing the bans not to remove cached data.
Your varnish process lines look fine.
Varnish has a management process which starts (and watches over) a child process where all the request handling is done. These are the two processes you are seeing.
If you do a lot of bans, you should consider reading the "Smart bans" chapter in the Varnish book; it will help you keep the list of bans shorter (a sketch follows the link below).
https://www.varnish-software.com/static/book/Cache_invalidation.html#smart-bans
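A minimal sketch of the smart-ban pattern described there (VCL 4.0 syntax; the x-url/x-host header names are just a convention): each cached object records its own URL and host, so bans can be expressed against obj.* instead of req.*, which allows the background ban lurker to test and retire them instead of letting the list grow.

sub vcl_backend_response {
    # Remember the request URL and host on the cached object.
    set beresp.http.x-url = bereq.url;
    set beresp.http.x-host = bereq.http.host;
}

sub vcl_deliver {
    # Strip the bookkeeping headers before responding to clients.
    unset resp.http.x-url;
    unset resp.http.x-host;
}

Bans are then issued in the form:

ban obj.http.x-host ~ example.com && obj.http.x-url ~ 539250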

Why does Ubuntu terminal shut down while running load tests?

I'm facing a peculiar problem when doing load testing on my laptop with 2000 concurrent users using CometD, following all the steps in http://cometd.org/documentation/2.x/howtos/loadtesting.
These tests run fine for about 1000 concurrent clients, but when I increase the load to about 2000 CCUs, the terminal just shuts down.
Any idea what's happening here?
BTW, I have applied all the OS-level settings as per the site, i.e.:
# ulimit -n 65536
# ifconfig eth0 txqueuelen 8192 # replace eth0 with the ethernet interface you are using
# /sbin/sysctl -w net.core.somaxconn=4096
# /sbin/sysctl -w net.core.netdev_max_backlog=16384
# /sbin/sysctl -w net.core.rmem_max=16777216
# /sbin/sysctl -w net.core.wmem_max=16777216
# /sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=8192
# /sbin/sysctl -w net.ipv4.tcp_syncookies=1
Also, I have noticed this happens even when I run load tests for other platforms. I know this has to be something related to the OS, but I cannot figure out what it could be.
Has the ulimit command been executed correctly? I read something about it in the Ubuntu forum archive and an Ubuntu Apache problem thread.
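As a quick sanity check (ulimit is per-process and only affects the shell it was set in and that shell's children), verify the limits in the very terminal that launches the load test:

ulimit -n     # soft limit on open files for this shell
ulimit -Hn    # hard limit; the soft limit cannot be raised above this without root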
