What does sess_herd mean? - varnish

Using varnishstat, I see the metric 'sess_herd' increasing a lot during traffic, and it seems that I may have reached some limit (300 sess_herd/s).
I don't think I have a backend issue (busy, unhealthy, retry and failed counters are all at 0).
Backend_req/Client_req is around 150 req/s.
Right now, our Varnish isn't caching at all; it is just "proxying" to our backend server, so the "pass" rate is about 150 req/s.
What could explain such a high sess_herd?
Regards
Olivier

sess_herd is a counter that is incremented when an ongoing session (TCP connection) is handed off from its worker thread to a waiter, which holds the connection while the client 'thinks'.
By default a connection gets to keep its worker thread for 50 ms (the timeout_linger parameter in Varnish 4.1) before this happens.
Since networks and clients are slow, a worker thread can in that way serve a whole lot of clients. This reduces the number of running threads needed.
In practice this happens after a response has been sent and while waiting for another request on the reused connection.
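To see whether herding simply tracks your traffic, you can watch the counter's per-second rate directly. Below is a minimal sketch (Node/TypeScript, not part of Varnish) that shells out to varnishstat's JSON output once per second and prints the delta of MAIN.sess_herd; it assumes the flat JSON layout of Varnish 4.x, while newer releases nest the counters under a "counters" key, so adjust the lookup if needed. You can also inspect the linger window with varnishadm param.show timeout_linger.

    // Sketch: print how much MAIN.sess_herd grows per second.
    // Assumes Varnish 4.x flat JSON from `varnishstat -1 -j`
    // (newer versions put counters under a "counters" object).
    import { execFile } from 'node:child_process';

    function readSessHerd(): Promise<number> {
      return new Promise((resolve, reject) => {
        execFile('varnishstat', ['-1', '-j'], (err, stdout) => {
          if (err) return reject(err);
          const stats = JSON.parse(stdout);
          resolve(stats['MAIN.sess_herd'].value);
        });
      });
    }

    let previous: number | undefined;
    setInterval(async () => {
      const current = await readSessHerd();
      if (previous !== undefined) console.log(`sess_herd: ${current - previous}/s`);
      previous = current;
    }, 1000);

If the rate roughly matches your request rate on a pass-only setup, it is just the normal keep-alive hand-off described above, not a sign of a problem.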

Related

Is 60 seconds long enough to wait for a TCP/IP message to get processed

When a client/server application needs to request data from the client, I send out the request message over the established socket and wait for the response, with a 60-second timeout to 'guarantee' that the server app waits long enough, but not 'forever', for a response. Occasionally these timeouts get hit, and the server app fails. These failures tend to come in bursts.
Is there any way to know when these occur whether they're simply caused by heavy network traffic - and will eventually succeed - or whether they're caused by a harder kind of outage, and will never get a response within a reasonable time? I.e., is 60 seconds long enough to wait for such a data request over an existing socket - and if not, what would a better timeout value be? Would the TCP/IP stack (Amazon linux, in this case) end up retrying the transmission shortly after I've given up on it...?
Would the TCP/IP stack (Amazon linux, in this case) end up retrying the transmission shortly after I've given up on it...?
Giving up by closing the socket will also make the underlying TCP stack stop retransmitting unacknowledged data. That does not mean the peer did not process the message, though, since you cannot tell whether the peer failed to receive your message or you failed to receive its response.
Is there any way to know when these occur whether they're simply caused by heavy network traffic - and will eventually succeed - or whether they're caused by a harder kind of outage, and will never get a response within a reasonable time?
No. It is up to the application protocol to handle this in a robust way, like detecting retransmission of the same message inside a newly established connection.
I.e., is 60 seconds long enough to wait for such a data request over an existing socket - and if not, what would a better timeout value be?
To detect network connectivity problems it is better to rely on TCP keep-alive instead of waiting a specific time for a response to arrive. If the response might come late because the peer application is not responding fast enough, the acceptable timeout depends on the specific use case.
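To make the keep-alive suggestion concrete, here is a minimal sketch using Node's net module; the host, port, message and timeout values are placeholders, not anything from the question. Keep-alive lets the OS probe an idle connection and detect a dead peer at the TCP level, while the inactivity timeout bounds how long the application waits before giving up.

    // Sketch: TCP keep-alive for connectivity checks, plus an application-level
    // inactivity timeout for the response itself.
    import net from 'node:net';

    const socket = net.connect({ host: 'peer.internal', port: 9000 });  // placeholder endpoint

    // Start sending keep-alive probes after 10 s of idleness; the OS handles
    // probe intervals and declares the connection dead if the peer is gone.
    socket.setKeepAlive(true, 10_000);

    // Inactivity timeout: if nothing is sent or received for 60 s, give up.
    // Whether 60 s is right depends entirely on the use case.
    socket.setTimeout(60_000, () => {
      console.error('no activity for 60 s, giving up');
      socket.destroy();
    });

    socket.write('REQUEST\n');  // placeholder request message
    socket.on('data', (chunk) => console.log('response:', chunk.toString()));
    socket.on('error', (err) => console.error('socket error:', err.message));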

Can clients with slow connection break down blocking-socket-based server?

By definition of blocking sockets, all calls to send() or recv() block until the whole networking operation is finished. This can take some time, especially when using TCP and talking to clients with slow connections. This is of course solved by introducing threads and thread pools. But what happens if all threads are blocked by some slow client? For example, your server wants to serve 10,000+ clients with 100 threads, sending data to all users every second. That means each thread would have to call send() 100 times every second. What happens if at some point 100 clients are connected with connections so slow that one call to send()/recv() takes 5 seconds to complete for them (or possibly an attacker who does it on purpose)? In that case all 100 threads are blocking and everyone else waits. How is this generally solved? Adding more threads to the thread pool is probably not a solution, since there can always be more slow clients, and going for some really high number of threads would introduce even more problems with context switching, resource consumption, etc.
Can clients with slow connection break down blocking-socket-based server?
Yes, they can. And it does consume resources on the server side. And if too much of this happens, you can end up with a form of "denial of service".
Note that this is worst if you use blocking I/O on the server side because you are tying down a thread while the response is being sent. But it is still a problem with non-blocking I/O. In the latter case, you consume server side sockets, port numbers, and memory to buffer the responses waiting to be sent.
If you want to guard your server against the effects of slow clients, it needs to implement a timeout on sending responses. If the server finds that it is taking too long to write a response ... for whatever reason ... it should simply close the socket.
Typical web servers do this by default.
Finally, as David notes, using non-blocking I/O will make your server more scalable. You can handle more simultaneous requests with fewer resources. But there are still limits to how much a single server can scale. Beyond a certain point you need a way to spread the request load over multiple servers.
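As a sketch of the "time out slow clients" advice, here is what it can look like with Node's net module (a non-blocking server, which also illustrates the scalability point); the 30-second budget and the port are arbitrary placeholders.

    // Sketch: drop any connection that shows no traffic for 30 s, so a slow or
    // deliberately stalling client cannot tie up server resources indefinitely.
    import net from 'node:net';

    const server = net.createServer((socket) => {
      // Fires after 30 s with no reads or writes on this socket, which also
      // covers a peer that stops draining the response we are writing.
      socket.setTimeout(30_000, () => {
        console.warn('client too slow, closing', socket.remoteAddress);
        socket.destroy();
      });

      socket.on('data', () => {
        socket.write('x'.repeat(64 * 1024));  // stand-in for the periodic payload
      });
      socket.on('error', () => socket.destroy());
    });

    server.listen(9000);  // placeholder port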

Is sharing TBB's thread pool with a HTTP server a good idea?

I know this is a weird question, but hear me out. I'm working on a high-throughput, compute-heavy HTTP backend server in C++. It is quite straightforward:
Spin up an HTTP server
Receive a request
Do a lot of math
This step is parallelized using TBB
Send the result back (takes about 20 ms)
There's no hard deadline on how soon the response has to go out, but the lower the worst case, the better.
Now my bottleneck is that the server part uses a different thread pool than TBB. So while TBB is busy doing math, the server may suddenly get tens of new requests; the server-side threads then get scheduled and cause a lot of cache misses and branch-prediction failures.
A solution I came up with is to share TBB's thread pool with the server. Then no requests will be registered while TBB is busy, and they are processed immediately after TBB is free.
Is this a good idea? Or could it have potential problems?
This is difficult to answer without knowing what that other thread pool is doing. If it handles file or network I/O then combining it with a CPU-intensive pool can be a pessimization since I/O does not consume CPU.
Normally there should be a small pool or maybe even a single thread handling the accept loop and async I/O, handing new requests off to the worker pool for processing and sending the results back to the network.
Try to avoid mixing CPU-intensive work with I/O work, as it makes resource utilization difficult to manage. Having said that, sometimes it's just easier and it's never good to run at 100% CPU anyway. So yes, you should try having just one pool. But measure the performance before/after the change.
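The layout described above (a single thread doing accept and async I/O, handing CPU-heavy work to a dedicated worker pool) can be sketched in a runtime-neutral way. Below is a minimal single-file illustration using Node's worker_threads rather than TBB, with one worker standing in for the pool; it assumes a CommonJS build so __filename resolves to the running file, and the port and the fake math are placeholders.

    // Sketch of the recommended split: main thread = accept loop + async I/O,
    // worker thread(s) = CPU-bound math. Node worker_threads, CommonJS build.
    import { Worker, isMainThread, parentPort } from 'node:worker_threads';
    import http from 'node:http';

    if (isMainThread) {
      const worker = new Worker(__filename);  // a real server would keep a pool of these
      let nextId = 0;
      const pending = new Map<number, (result: number) => void>();

      worker.on('message', ({ id, result }) => {
        pending.get(id)?.(result);
        pending.delete(id);
      });

      http.createServer((_req, res) => {
        const id = nextId++;
        pending.set(id, (result) => res.end(`${result}\n`));  // reply when the math is done
        worker.postMessage({ id, n: 10_000_000 });            // hand off the CPU-heavy part
      }).listen(8080);                                        // placeholder port
    } else {
      parentPort!.on('message', ({ id, n }) => {
        let sum = 0;
        for (let i = 0; i < n; i++) sum += Math.sqrt(i);      // stand-in for "a lot of math"
        parentPort!.postMessage({ id, result: sum });
      });
    }

The point of the split is that request intake and I/O stay responsive even while the math is running, which is what the small accept/I-O pool buys you.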

Limit of HTTPS request per seconds

I am doing a project where I need to send device parameters to the server. I will be using a Raspberry Pi for that and the Flask framework.
1. I want to know whether there is any limit on HTTPS POST requests per second. Also, I will be using PythonAnywhere for the server side and their SQL database.
Initially, my objective was to send data over the HTTPS channel when the device is in sleep mode. But when the device (e.g. a car) wakes up, I wanted to upgrade from HTTPS to WebSocket and transmit data in real time. Later I came to know that PythonAnywhere doesn't support WebSocket.
Apart from answering the first question, can anyone shed some light on the second part? I could just increase the number of HTTPS requests when the device is awake (e.g. 1 per 60 min in sleep mode and 6 per 60 s when awake), but that means unnecessary data consumption during the wake period just for the per-request overhead, whereas a WebSocket would be one persistent channel during the wake period.
PythonAnywhere developer here: from the server side, if you're running on our platform, there's no hard limit on the number of requests you can handle beyond the amount of time your Flask server takes to process each request. In a free account you would have one worker process handling all of the requests, each one in turn, so if it takes (say) 0.2 seconds to handle a request, your theoretical maximum throughput would be five requests a second. A paid "Hacker" plan would have two worker processes, and they would both be handling requests, so that would get you up to ten a second. And you could customize a paid plan and get more worker processes to increase that.
I don't know whether there would be any limits on the RPi side; perhaps someone else will be able to help with that.

Node.js Server Timeout Problems (EC2 + Express + PM2)

I'm relatively new to running production node.js apps and I've recently been having problems with my server timing out.
Basically, after a certain amount of usage and time my node.js app stops responding to requests. I don't even see routes being fired on my console anymore - it's like the whole thing just comes to a halt and the HTTP calls from my client (iPhone running AFNetworking) don't reach the server anymore. But if I restart my node.js app server, everything starts working again, until things inevitably stop again. The app never crashes, it just stops responding to requests.
I'm not getting any errors, and I've made sure to handle and log all DB connection errors so I'm not sure where to start. I thought it might have something to do with memory leaks so I installed node-memwatch and set up a listener for memory leaks but that doesn't get called before my server stops responding to requests.
Any clue as to what might be happening and how I can solve this problem?
Here's my stack:
Node.js on AWS EC2 Micro Instance (using Express 4.0 + PM2)
Database on AWS RDS volume running MySQL (using node-mysql)
Sessions stored w/ Redis on same EC2 instance as the node.js app
Clients are iPhones accessing the server via AFNetworking
Once again no errors are firing with any of the modules mentioned above.
First of all, you need to be a bit more specific about timeouts.
TCP timeouts: TCP divides a message into packets which are sent one by one. The receiver needs to acknowledge having received each packet. If the receiver does not acknowledge receipt within a certain period of time, a TCP retransmission occurs, i.e. the same packet is sent again. If this happens a couple more times, the sender gives up and kills the connection.
HTTP timeout: An HTTP client like a browser, or your server while acting as a client (e.g. sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it will disconnect and call it a timeout.
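For the second kind, here is a minimal sketch of such a client-side timeout in Node, since your server may also be calling out to other services; the URL and the 5-second budget are placeholders.

    // Sketch: an arbitrary client-side HTTP timeout, as described above.
    import http from 'node:http';

    const req = http.request('http://upstream.internal/data', { timeout: 5000 }, (res) => {
      res.on('data', () => { /* consume the body */ });
      res.on('end', () => console.log('done:', res.statusCode));
    });

    // 'timeout' fires when the socket has been idle for 5 s. It does not abort
    // the request by itself, so destroy it and surface the failure as an error.
    req.on('timeout', () => req.destroy(new Error('HTTP timeout')));
    req.on('error', (err) => console.error('request failed:', err.message));
    req.end();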
Now, there are many, many possible causes for this... from more trivial to less trivial:
Wrong Content-Length calculation: If you send a request with a Content-Length: 20 header, that means "I am going to send you 20 bytes". If you send 19, the other end will wait for the remaining 1. If that takes too long... timeout. A sketch of getting this right follows this list.
Not enough infrastructure: Maybe you should assign more machines to your application. If (total load / # of CPU cores) is over 1, or your memory usage is high, your system may be over capacity. However keep reading...
Silent exception: An error was thrown but not logged anywhere. The request never finished processing, leading to the next item.
Resource leaks: Every request needs to be handled to completion. If you don't do this, the connection will remain open. In addition, the IncomingMessage object (usually called req in Express code) will remain referenced by other objects (e.g. Express itself). Each one of those objects can use a lot of memory.
Node event loop starvation: I will get to that at the end.
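Picking up the Content-Length item from the list above: the usual way to avoid announcing more (or fewer) bytes than you write is to compute the header from the exact byte length of the body. A minimal sketch with plain node:http; the payload and port are placeholders.

    // Sketch: announce exactly the number of bytes that will be written, so the
    // client never waits for bytes that are not coming.
    import http from 'node:http';

    http.createServer((_req, res) => {
      const body = JSON.stringify({ ok: true, note: 'placeholder payload' });
      res.writeHead(200, {
        'Content-Type': 'application/json',
        // Buffer.byteLength, not body.length: with multi-byte UTF-8 characters
        // the character count and the byte count differ, which is exactly the
        // kind of mismatch described above.
        'Content-Length': Buffer.byteLength(body),
      });
      res.end(body);
    }).listen(3000);  // placeholder port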
For memory leaks, the symptoms would be:
the node process would be using an increasing amount of memory.
To make things worse, if available memory is low and your server is misconfigured to use swapping, Linux will start moving memory to disk (swapping), which is very I/O and CPU intensive. Servers should not have swapping enabled.
cat /proc/sys/vm/swappiness
will return the level of swappiness configured in your system (it goes from 0 to 100). You can modify it persistently via /etc/sysctl.conf (applied at boot; run sysctl -p to reload it without rebooting) or temporarily using: sysctl vm.swappiness=10
Once you've established you have a memory leak, you need to get a core dump and download it for analysis. A way to do that can be found in this other Stackoverflow response: Tools to analyze core dump from Node.js
For connection leaks (you leaked a connection by not handling a request to completion), you would see an increasing number of established connections to your server. You can count them with: netstat -a -p tcp | grep ESTABLISHED | wc -l
Now, the event loop starvation is the worst problem. If you have short-lived code, node works very well. But if you do CPU-intensive stuff and have a function that keeps the CPU busy for an excessive amount of time... like 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code taking 50 ms), operations being handled by the event loop, such as processing HTTP requests, start falling behind and eventually time out.
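To see what starvation looks like, the sketch below (plain node:http, placeholder route and timings) blocks the event loop for 5 seconds in one handler; while that loop runs, even the trivial route cannot be served, which from the outside looks exactly like "the server stopped responding". The usual fixes are chunking the work across the event loop (setImmediate) or moving it off the loop entirely (worker_threads or a separate process).

    // Sketch: 5 s of solid, synchronous CPU time in one handler stalls every
    // other request on the same event loop.
    import http from 'node:http';

    http.createServer((req, res) => {
      if (req.url === '/block') {
        const until = Date.now() + 5000;
        while (Date.now() < until) { /* blocking, synchronous CPU work */ }
        return res.end('done blocking\n');
      }
      // While /block is running, even this trivial route is not served.
      res.end('pong\n');
    }).listen(3000);  // placeholder port

Hitting /block in one terminal and the root path in another makes the stall easy to reproduce.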
The way to find a CPU bottleneck is using a performance profiler. nodegrind/qcachegrind are my preferred profiling tools, but others prefer flamegraphs and such. However, it can be hard to run a profiler in production, so instead take a development server and slam it with requests, i.e. a load test. There are many tools for this.
Finally, another way to debug the problem is:
env NODE_DEBUG=tls,net node <...arguments for your app>
node has optional debug statements that are enabled through the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net will make node emit debugging information for the tls and net modules... so basically everything being sent or received. If there's a timeout you will see where it's coming from.
Source: Experience of maintaining large deployments of node services for years.

Resources