Troubleshooting varnish sess_closed_err

We deliver IPTV services to our customers through Varnish Cache 6.0, and I have been somewhat worried that there might be a problem with our Varnish Cache servers. This suspicion is based on the number of customer incident reports we receive when the IPTV stream flows through our Varnish Cache rather than coming directly from the backend server.
That is why I would like to eliminate all errors from varnishstat to narrow down the possible causes of the incidents; at the moment I don't have a better angle from which to troubleshoot the problem.
I should say up front that I am far from being a Varnish expert.
So let's dig into the "problem":
varnishstat -1 output:
MAIN.sess_closed 38788 0.01 Session Closed
MAIN.sess_closed_err 15260404 3.47 Session Closed with error
Basically, almost all connections to the Varnish Cache servers close with an error. I set up a virtualized demo server on our network with an identical Varnish configuration, and there the only sess_closed_err events were generated when I changed channels in my VLC media player. Note that I was only able to run a few VLC instances against the server at the same time, and that our customers use set-top boxes (STBs) to access the service.
So my actual question is: how can I troubleshoot what causes the sessions to close with an error?

There are some other counters that show more specifically what happens to the sessions. The next step in your troubleshooting is therefore to look at these counters:
varnishstat -1 | grep ^MAIN.sc_
I'll elaborate a bit with a typical example:
$ sudo varnishstat -1 | egrep "(sess_closed|sc_)"
MAIN.sess_closed 8918046 1.45 Session Closed
MAIN.sess_closed_err 96244948 15.69 Session Closed with error
MAIN.sc_rem_close 86307498 14.07 Session OK REM_CLOSE
MAIN.sc_req_close 8402217 1.37 Session OK REQ_CLOSE
MAIN.sc_req_http10 45930 0.01 Session Err REQ_HTTP10
MAIN.sc_rx_bad 0 0.00 Session Err RX_BAD
MAIN.sc_rx_body 0 0.00 Session Err RX_BODY
MAIN.sc_rx_junk 132 0.00 Session Err RX_JUNK
MAIN.sc_rx_overflow 2 0.00 Session Err RX_OVERFLOW
MAIN.sc_rx_timeout 96193210 15.68 Session Err RX_TIMEOUT
MAIN.sc_tx_pipe 0 0.00 Session OK TX_PIPE
MAIN.sc_tx_error 0 0.00 Session Err TX_ERROR
MAIN.sc_tx_eof 3 0.00 Session OK TX_EOF
MAIN.sc_resp_close 0 0.00 Session OK RESP_CLOSE
MAIN.sc_overload 0 0.00 Session Err OVERLOAD
MAIN.sc_pipe_overflow 0 0.00 Session Err PIPE_OVERFLOW
MAIN.sc_range_short 0 0.00 Session Err RANGE_SHORT
MAIN.sc_req_http20 0 0.00 Session Err REQ_HTTP20
MAIN.sc_vcl_failure 0 0.00 Session Err VCL_FAILURE
The output from this specific environment shows that the majority of sessions that close with an error do so because of a receive timeout (MAIN.sc_rx_timeout). This timeout controls how long Varnish keeps idle client connections open and is set with the timeout_idle parameter to varnishd. Its default value is 5 seconds. Use varnishadm to see the current value and the parameter's description:
$ sudo varnishadm param.show timeout_idle
timeout_idle
Value is: 10.000 [seconds]
Default is: 5.000
Minimum is: 0.000
Idle timeout for client connections.
A connection is considered idle until we have received the full
request headers.
This parameter is particularly relevant for HTTP1 keepalive
connections which are closed unless the next request is
received before this timeout is reached.
Increasing timeout_idle will likely reduce the number of sessions that are closed due to an idle timeout. This can be done by passing the value as a parameter when starting Varnish. Example:
varnishd [...] -p timeout_idle=15
Note that there are pros and cons related to increasing this timeout.
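You can also change the parameter on a running instance through the management interface, which takes effect without a restart (a minimal sketch; the value 15 is just an example):
$ sudo varnishadm param.set timeout_idle 15
Keep in mind that a value set through varnishadm does not survive a restart of varnishd, so persist it in the startup options as shown above.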

Related

MDNS causes local HTTP requests to block for 2 seconds

It takes 0.02 seconds to send a message via the Python code requests.post(http://172.16.90.18:8080, files=files), but it takes 2 seconds via requests.post(http://sdss-server.local:8080, files=files).
The following chart shows the packets I captured with Wireshark; from frames 62 to 107 (the first column) you can see that it took 2 seconds for mDNS to resolve the domain name.
My system is Ubuntu 18.04. Following this link: Mac OS X slow connections - mdns 4-5 seconds - bonjour slow,
I edited the /etc/hosts file and changed this line to
127.0.0.1 localhost sdss-server.local
After that modification, it still takes 2 seconds to send the message via requests.post(http://sdss-server.local:8080, files=files).
Normally it should take 0.02 to 0.03 seconds. What should I do to fix this and bring the time down from 2 seconds to 0.02 seconds?
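As a first check, assuming nss-mdns/Avahi handles the .local lookups on Ubuntu, you can inspect the resolver order and time a lookup through the system resolver, which is the same path Python's requests library ends up using:
grep ^hosts: /etc/nsswitch.conf
time getent hosts sdss-server.local
If getent is slow even though the host is listed in /etc/hosts, the hosts line in nsswitch.conf (e.g. the position of files relative to the mdns4 entries) is a good place to look for the delay.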

Nuxt.js / Node.js Nginx requests per second

I'm trying to prepare a CentOS server to run a Nuxt.js (Node.js) application behind an Nginx reverse proxy.
First, I fire up a simple test server that returns an HTTP 200 response with the text "ok". It easily handles ~10,000 requests/second with ~10 ms mean latency.
Then I switch to the hello-world Nuxt example app (npx create-nuxt-app) and use the weighttp HTTP benchmarking tool to run the following command:
weighttp -n 10000 -t 4 -c 100 localhost:3000
The results are as follows:
starting benchmark...
spawning thread #1: 25 concurrent requests, 2500 total requests
spawning thread #2: 25 concurrent requests, 2500 total requests
spawning thread #3: 25 concurrent requests, 2500 total requests
spawning thread #4: 25 concurrent requests, 2500 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done
finished in 9 sec, 416 millisec and 115 microsec, 1062 req/s, 6424 kbyte/s
requests: 10000 total, 10000 started, 10000 done, 10000 succeeded, 0 failed,
0 errored
status codes: 10000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 61950000 bytes total, 2000000 bytes http, 59950000 bytes data
As you can see, it won't climb over 1062 req/s. Sometimes I can reach ~1700 req/s if I ramp up the concurrency parameter, but no more than that.
I would expect a simple hello-world example app to handle at least ~10,000 req/s on this machine without high latency.
I've tried checking file limits, open-connection limits, Nginx workers, etc., but couldn't find the root cause, so I'm really looking forward to any ideas on where to even start searching for it.
I can provide any logs or any other additional info if needed.
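One way to narrow this down, assuming the default create-nuxt-app npm scripts, is to make sure the app is built and served in production mode and then benchmark the Node process directly, bypassing Nginx:
npm run build
NODE_ENV=production npm run start
weighttp -n 10000 -t 4 -c 100 localhost:3000
If the direct numbers are just as low, the bottleneck is the single Node process doing server-side rendering for every request rather than Nginx; running several instances behind the proxy (e.g. via pm2 or Node's cluster module) is the usual way to raise req/s in that case.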

NodeJS CPU spikes to 100% one CPU at a time

I have a SOCKS5 Proxy server that I wrote in NodeJS.
I am utilizing the native net and dgram libraries to open TCP and UDP sockets.
It works fine for around 2 days, with all CPUs at around 30% max. After 2 days with no restarts, one CPU spikes to 100%. After that, the CPUs take turns staying at 100%, one at a time.
Here is a 7-day chart of the CPU spikes:
I am using Cluster to fork one worker instance per CPU:
const Cluster = require('cluster');
const Os = require('os');
for (let i = 0; i < Os.cpus().length; i++) {
  Cluster.fork();
}
This is the output of strace while the CPU is at 100%:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.76    0.294432          79      3733           epoll_pwait
  0.10    0.000299           0      3724        24 futex
  0.08    0.000250           0      3459        15 rt_sigreturn
  0.03    0.000087           0      8699           write
  0.01    0.000023           0       190       190 connect
  0.01    0.000017           0      3212        38 read
  0.00    0.000014           0       420           close
  0.00    0.000008           0       612       180 recvmsg
  0.00    0.000000           0        34           mmap
  0.00    0.000000           0        16           ioctl
  0.00    0.000000           0       190           socket
  0.00    0.000000           0       111           sendmsg
  0.00    0.000000           0       190           bind
  0.00    0.000000           0       482           getsockname
  0.00    0.000000           0       218           getpeername
  0.00    0.000000           0       238           setsockopt
  0.00    0.000000           0       432           getsockopt
  0.00    0.000000           0      3259       104 epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.295130                 29219       551 total
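For reference, a per-syscall summary like the one above can be collected with strace's counting mode (the PID is a placeholder):
sudo strace -c -f -p <nodejs pid>
Pressing Ctrl-C after a few seconds of sampling prints the summary table.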
And the Node profiler result (bottom-up / heavy view):
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
   ticks  parent  name
 1722861   81.0%  syscall
   28897    1.4%  UNKNOWN
Since I only use the native libraries, most of my code actually runs in C++ rather than JS, so any debugging I have to do ends up at the V8 level. Here is the profiler's per-language summary:
[Summary]:
   ticks   total   nonlib  name
   92087    4.3%     4.5%  JavaScript
 1937348   91.1%    94.1%  C++
   15594    0.7%     0.8%  GC
   68976    3.2%           Shared libraries
   28897    1.4%           Unaccounted
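For reference, a profile like this can be captured with Node's built-in sampling profiler (app.js is a placeholder for the entry point):
node --prof app.js
node --prof-process isolate-*.log
The first command writes an isolate-*.log file while the app runs; the second processes it into the tick summaries shown above.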
I suspected it might be the garbage collector, but I have increased Node's heap size, and memory usage seems to stay within range. I don't really know how to debug this further, since each iteration takes around 2 days.
Anyone had a similar issue and had success debugging it? I can use any help I can get.
Your question doesn't contain enough information to reproduce the case. Things like the OS, the Node.js version, and your code implementation can all be the reason for such behavior.
Here is a list of best practices that can resolve or help avoid such issues:
Use pm2 as a supervisor for your Node.js application.
Debug your Node.js application in production (see the consolidated sketch after this list). For that:
check your SSH connection to the production server
bind your debug port to localhost with ssh -N -L 9229:127.0.0.1:9229 root@your-remote-host
enable the debugger by sending kill -SIGUSR1 <nodejs pid> on the server
open chrome://inspect in Chrome, or use any other debugger for Node.js
Before going to production, perform:
stress testing
longevity testing
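Putting the production-debugging steps together, a minimal sketch (the host and PID are placeholders):
ssh -N -L 9229:127.0.0.1:9229 root@your-remote-host
ssh root@your-remote-host 'kill -SIGUSR1 <nodejs pid>'
SIGUSR1 makes Node start its inspector on 127.0.0.1:9229, which the SSH tunnel forwards to your machine; you can then attach via chrome://inspect.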
A few months ago we realized that another service running on the same box, one that was keeping track of open sockets, was causing the issue. It was an older version, and after a while it would spike the CPU while tracking the sockets. Upgrading that service to the latest version solved the CPU issues.
Lesson learned: sometimes it's not you, it's them.

What do the mod_pagespeed statistics mean?

Here's a dump of the stats provided by mod_pagespeed on one of my sites.
resource_url_domain_rejections: 6105
rewrite_cached_output_missed_deadline: 4801
rewrite_cached_output_hits: 116004
rewrite_cached_output_misses: 934
resource_404_count: 0
slurp_404_count: 0
total_page_load_ms: 0
page_load_count: 0
resource_fetches_cached: 0
resource_fetch_construct_successes: 45
resource_fetch_construct_failures: 0
num_flushes: 947
total_fetch_count: 0
total_rewrite_count: 0
cache_time_us: 572878
cache_hits: 872
cache_misses: 1345
cache_expirations: 242
cache_inserts: 1795
cache_extensions: 50799
not_cacheable: 0
css_file_count_reduction: 0
css_elements: 0
domain_rewrites: 0
google_analytics_page_load_count: 0
google_analytics_rewritten_count: 0
image_inline: 7567
image_rewrite_saved_bytes: 208854
image_rewrites: 34128
image_ongoing_rewrites: 0
image_webp_rewrites: 0
image_rewrites_dropped_due_to_load: 0
image_file_count_reduction: 0
javascript_blocks_minified: 12438
javascript_bytes_saved: 1173778
javascript_minification_failures: 0
javascript_total_blocks: 12439
js_file_count_reduction: 0
converted_meta_tags: 902
url_trims: 54765
url_trim_saved_bytes: 1651244
css_filter_files_minified: 0
css_filter_minified_bytes_saved: 0
css_filter_parse_failures: 2
css_image_rewrites: 0
css_image_cache_extends: 0
css_image_no_rewrite: 0
css_imports_to_links: 0
serf_fetch_request_count: 1412
serf_fetch_bytes_count: 12809245
serf_fetch_time_duration_ms: 28706
serf_fetch_cancel_count: 0
serf_fetch_active_count: 0
serf_fetch_timeout_count: 0
serf_fetch_failure_count: 0
Can someone please explain what all of the stats mean?
There are a lot of stats here. I'm going to describe just a few of them, because this will get long. We should probably add detailed documentation. I can follow up with more answers later if these are useful.
resource_url_domain_rejections: 6105: this means that, since your server restarted, mod_pagespeed has found 6105 resources that it is not going to rewrite because their domains are not authorized for rewriting with the ModPagespeedDomain directive. This is common, and occurs any time someone refreshes a page with a Twitter, Facebook, or Google+ widget.
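For reference, authorizing a domain is done with a line like the following in the Apache configuration (the hostname is a placeholder):
ModPagespeedDomain www.example.com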
rewrite_cached_output_missed_deadline: 4801: when a resource (e.g. a JPEG image) is optimized, the work happens in a background thread, and the result is cached so that subsequent page views referencing the same resource are fast. To avoid slowing down the time-to-first-byte of the first view, however, we use a 10 millisecond deadline. This stat counts how many times that deadline was exceeded; in those cases the resource is left unchanged for that view, but the optimization continues in the background, and the cache is still written.
rewrite_cached_output_hits: 116004: counts the number of times we served an optimized resource from the cache, thus avoiding the need to re-optimize it.
rewrite_cached_output_misses: 934: counts the number of times we looked up a resource in our cache and it wasn't there, forcing us to rewrite it. Note that we also rewrite a resource that is in the cache but whose origin cache expiration time has passed. E.g., if your images are served with cache-control: max-age=600, we re-fetch them every 10 minutes to see whether they have changed. If they have changed, we must re-optimize them.
num_flushes: 947: this is the number of times the Apache resource generator for the HTML (e.g. mod_php or WordPress) called the Apache function ap_flush(), which causes partial HTML to be flushed all the way through to the user's browser. This is interesting for mod_pagespeed because it can limit the amount of optimization we can do (e.g. we can't combine CSS files whose elements are separated by a flush).
cache_time_us: 572878 - the total amount of time, in microseconds, spent waiting for mod_pagespeed's HTTP Cache (file + memory) to respond to a lookup request, since the server was started.
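As a rough sense of scale, and assuming each cache hit and miss corresponds to one lookup, that works out to 572878 us / (872 + 1345) = roughly 258 microseconds per cache lookup on average.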
I think that's enough for now. Are there specific other statistics you'd like to learn more about?
Most of these were created for us to monitor the health of mod_pagespeed as it runs, and to help diagnose users' issues. I have to admit we haven't used them much for that purpose yet, but we do use them during development.

How do I stress test a web form file upload?

I need to test a web form that takes a file upload.
The file size of each upload will be about 10 MB.
I want to test whether the server can handle over 100 simultaneous uploads while remaining responsive for the rest of the site.
Repeated form submissions from our office will be limited by our local DSL line.
The server is offsite and has higher bandwidth.
Answers based on experience would be great, but any suggestions are welcome.
Use the ab (ApacheBench) command-line tool that is bundled with Apache (I have just discovered this great little tool). Unlike cURL or wget, ApacheBench was designed for performing stress tests on web servers (any type of web server!). It generates plenty of statistics too. The following command will send an HTTP POST request with the file test.jpg to http://localhost/ 100 times, with up to 4 concurrent requests.
ab -n 100 -c 4 -p test.jpg http://localhost/
It produces output like this:
Server Software:
Server Hostname: localhost
Server Port: 80
Document Path: /
Document Length: 0 bytes
Concurrency Level: 4
Time taken for tests: 0.78125 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Non-2xx responses: 100
Total transferred: 2600 bytes
HTML transferred: 0 bytes
Requests per second: 1280.00 [#/sec] (mean)
Time per request: 3.125 [ms] (mean)
Time per request: 0.781 [ms] (mean, across all concurrent requests)
Transfer rate: 25.60 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   2.6      0      15
Processing:     0    2   5.5      0      15
Waiting:        0    1   4.8      0      15
Total:          0    2   6.0      0      15
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 15
95% 15
98% 15
99% 15
100% 15 (longest request)
Automate Selenium RC using your favorite language. Start 100 threads of Selenium, each typing the path of the file into the input and clicking submit.
You could generate 100 sequentially named files to make looping over them easy, or just use the same file over and over again.
I would perhaps guide you towards using cURL and submitting just random data (like reading 10 MB out of /dev/urandom and encoding it into base32) through a POST request, manually fabricating the body to be a file upload (it's not rocket science).
Fork that script 100 times, perhaps across a few servers. Just make sure the sysadmins don't think you are doing a DDoS or something :)
Unfortunately this answer remains a bit vague, but hopefully it helps by nudging you onto the right track.
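As a minimal sketch of that cURL approach (the URL and form-field name are placeholders):
# generate a 10 MB random payload once
head -c $((10*1024*1024)) /dev/urandom > payload.bin
# fire 100 concurrent multipart uploads
for i in $(seq 1 100); do
  curl -s -o /dev/null -F "upload=@payload.bin" http://example.com/upload &
done
wait
Note that curl's -F option builds the multipart/form-data body for you, so you don't actually have to fabricate it by hand.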
Continuing as per Liam's comment:
If the server receiving the uploads is not on the same LAN as the clients connecting to it, it would be better to use nodes as remote as possible for the stress test, if only to simulate real behavior as authentically as possible. But if you don't have access to computers outside the local LAN, the local LAN is still better than nothing.
Stress testing from the same hardware would not be a good idea, as you would put double the load on the server: generating the random data, packing it, sending it through the TCP/IP stack (although probably not over Ethernet), and only then can the server do its magic. If the sending part is outsourced, you get double (take that with an arbitrarily sized grain of salt) the performance at the receiving end.
