How to set timeout in IIS 6 when ColdFusion is unresponsive - iis

This may be related to platforms other than ColdFusion.
The IIS 6 log reports a "time-taken" much longer (30 minutes) than the 120 seconds set in Connection Timeout for several requests to a ColdFusion page.
I assume that ColdFusion was unresponsive at the moment. I would like IIS to stop the request rather than wait this long.
Is there an IIS setting that would force this?

Not really, because IIS is no longer handling the request once it has been passed to CF. You could try adjusting the application pool timeout and see if you can get that to throw an error.

This scenario can also be considered as the slow HTTP DoS attack when caused by the client. IIS doesn't provide much protection against it (at least for slow POST body) because Microsoft considers it a protocol bug, not an IIS weakness. Although I think in this case it is your server doing it to itself.
Things to check:
You didn't mention whether it is the request that is slow or the server's response. You could try tweaking your MinFileBytesPerSec parameter if it's the response that is slow. By default it will drop the connection if the client is downloading at less than 240 bytes per second.
Remember, that 120 second IIS timeout is an idle timeout. As long as the client sends or receives a few bytes inside 120 seconds, that timer will keep getting reset.
You didn't mention whether this long wait happens on all pages or only on a few specific ones. It is possible that your CF script is making another external connection, e.g. CFQUERY, which is not subject to CF timeouts, but to the timeouts of the server it is connecting to. Using the timeout attribute inside CFQUERY may prevent this.
You also didn't mention what your ColdFusion settings are. Maybe the IIS timeout setting is being ignored by the ColdFusion JRUN Connector ISAPI filter, so you should check the settings in ColdFusion Administrator. In particular, check whether Timeout Requests after has been changed. If it's still at the default of 60 seconds, check your code to see if it has been overridden there, e.g.
<cfsetting requestTimeOut = "3600">
Finally, there is the matter of the peculiar behavior of CF's requestTimeout, which you might have to work around by replacing some cfscript tags with CFML.
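The point about the 120-second setting being an idle timeout can be illustrated with a small model. This is only a sketch in Python, not a description of how IIS is implemented: connection_lifetime and its parameters are hypothetical names, and the only figure taken from the thread is the 120-second timeout.

```python
def connection_lifetime(activity_times, idle_timeout=120.0):
    """Return when an idle-timeout-only policy would close the connection.

    activity_times: sorted timestamps (seconds since the request started)
    at which at least one byte was sent or received.
    """
    last_activity = 0.0
    for t in activity_times:
        if t - last_activity > idle_timeout:
            break  # the idle timer expired before this byte arrived
        last_activity = t  # any traffic resets the idle timer
    return last_activity + idle_timeout


# A silent connection is closed after 120 seconds.
print(connection_lifetime([]))  # 120.0

# One byte every 100 seconds keeps the connection open for over 30 minutes.
trickle = list(range(100, 1801, 100))
print(connection_lifetime(trickle))  # 1920.0
```

Under this model there is no upper bound on total request time, which would be consistent with a 30-minute "time-taken" despite the 120-second setting.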

Related

Trace a request going through the clearnet / Cloudflare / Apache to precisely find out performance issues

I am hosting a RESTful API and my problem is that every first inbound request after a certain time will take about three seconds, compared to the normal ~100ms.
What I find most interesting is that it always takes between 3100 and roughly 3250 milliseconds, no more and no less. So it seems pretty intentional to me.
I've already debugged the API and everything runs pretty much instantly except for one thing and that is this three second delay before my API even starts to receive the request.
My best guess is that something went wrong either in Apache or the DNS resolution but I don't know what exactly causes it (that's why I'm asking this question).
I am using the Apache ProxyPass like this:
ProxyRequests off
Timeout 54
ProxyTimeout 5400
ProxyPass /jokeapi http://localhost:8079
ProxyPassReverse /jokeapi http://localhost:8079
I'm using the Cloudflare/APNIC DNS gateway servers 1.1.1.1 and 0.0.0.0
Additionally, all my requests get routed through a Cloudflare SSL proxy before even reaching my network.
I've even partially rewritten the API so it responds with ReadStreams instead of loading the files into RAM and serving it at once but that didn't fix the problem.
My question is how I can fully debug the route a request takes and see precisely where this 3 second delay comes from.
Thanks!
PS: the server runs on NodeJS
I think the key is not network activity but your note that, after a period of idleness, the first request to the API takes slightly over 3 seconds. I am assuming that follow-up requests are back in the 100ms window.
As you are using localhost, this is not a routing issue. If you want, you can just as easily use loopback, 127.0.0.1, to avoid a name resolution hit, but such a hit on a reserved hostname would be microseconds.
I suspect that the compiled version of your RESTful function has aged out of the cache for your system. The first hit after a period of non-use then requires a recompile, and as long as the compiled instructions are exercised regularly they will remain in cache and continue to respond in the 100ms range. We observe this condition quite often in multiuser performance testing after cold boots of systems (setting initial conditions). Ramp-ups of the test users take the hit for the recompiles of common code before hitting the time under full load.
Another point against the network side of the house: DNS timeouts and bind cache entries tend to be quite long, usually significant portions of a day or even longer. Even so, a DNS lookup for an item that has aged out of the bind cache is unlikely to add three seconds to your initial connection time.
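To see which phase the three seconds actually fall into, one option is to time name resolution, TCP connect, and time to first byte separately. The following is a minimal sketch in Python using only the standard library; time_request_phases is a hypothetical helper, and it speaks plain HTTP only, so it suits the backend on localhost:8079 or the Apache listener rather than the Cloudflare HTTPS edge.

```python
import socket
import time


def time_request_phases(host, port, path="/", timeout=10.0):
    """Return seconds spent resolving, connecting, and waiting for the first byte."""
    t0 = time.perf_counter()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()  # name resolution finished
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(addr)
        t2 = time.perf_counter()  # TCP handshake finished
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        first = sock.recv(1)  # blocks until the server starts answering
        t3 = time.perf_counter()
    return {
        "dns": t1 - t0,
        "connect": t2 - t1,
        "first_byte": t3 - t2,
        "got_data": bool(first),
    }
```

Calling it once after an idle period and again immediately afterwards, first against the backend directly and then against the proxy, should show whether the delay sits in the application (a large first_byte on the direct call) or in a hop in front of it.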

Does IIS Request Content Filtering Load the full request before filter

I'm looking into IIS Request Filtering by content length. I've set the max allowed content length:
appcmd set config /section:requestfiltering /requestlimits.maxallowedcontentlength:30000000
My question is about when the filter will occur.
Will IIS first read ALL the request into memory and then throw an error, or will it raise an issue as soon as it reaches the threshold?
The IIS Request Filtering module is processed very early in the request pipeline. Unwanted requests are quickly discarded before proceeding to application code which is slower and has a much larger attack surface. For this reason, some have reported performance increases after implementing Request Filtering settings.
Limitations
Request Filtering Limitations include the following:
Stateless - Request Filtering has no knowledge of application or session state. Each request is processed individually regardless of whether a session has or has not been established.
Request Header Only - Request Filtering can only inspect the request header. It has no visibility into the request body or any part of the response.
Basic Logic - Regular expressions and wildcard matches are not available. Most settings consist of establishing size constraints while others perform simple string matching.
maxAllowedContentLength
Request Filtering checks the value of the Content-Length request header. If the value exceeds that which is set for maxAllowedContentLength the client will receive an HTTP 404.13.
The IIS 8.5 STIG recommends a value of 30000000 or less.
IISRFBaseline
The above information is based on my PowerShell module IISRFBaseline. It helps establish an IIS Request Filtering baseline by leveraging Microsoft Logparser to scan a website's content directory and IIS logs.
Many of the settings have a dedicated markdown file providing more information about the setting. The one for maxAllowedContentLength can be found at the following:
https://github.com/phbits/IISRFBaseline/blob/master/IISRFBaseline-maxAllowedContentLength.md
Update - @johnny-5 comment
The filtering happens immediately which makes sense because Request Filtering only has visibility into the request header. This was confirmed via the following methods:
Failed Request Tracing - the Request Filtering module responded to the request with an HTTP 413 Request entity too large.
http.sys event tracing - the request is accepted and handed off to the IIS website. Shortly thereafter is an entry showing the HTTP 413 response. The time between was not nearly long enough for the upload to complete.
Packet capture - Using Microsoft Network Monitor, the HTTP conversation shows IIS immediately responded with an HTTP 413 Request entity too large.
The part you're rightfully concerned with is that IIS still accepts the upload regardless of file size. I found the limiting factor to be connectionTimeout which has a default setting of 120 seconds. If the file is "completed" before the timeout then an HTTP 413 error message is displayed. When a timeout occurs, the browser shows a connection reset since the TCP connection is destroyed by IIS after sending a TCP ACK/RST.
To test this further the timeout was increased and set to connectionTimeout=6000. Then a large upload was submitted and the following IIS components were stopped one at a time. After each stop, the upload was checked via Network Monitor and confirmed to be still running.
Website
Application Pool (Stop-WebAppPool -Name AppPoolName)
World Wide Web Publishing Service (Stop-Service -Name W3SVC)
With all three stopped I verified there was no IIS process still running and yet bytes were still being uploaded. This leads me to conclude that the connection is maintained by http.sys. The fact that connectionTimeout is closely tied to http.sys seems to support this. I do not know if the uploaded bytes go to a buffer or are simply discarded. The event tracing messages didn't provide anything helpful in this context.
Leaving out the Content-Length request header will result in an RFC protocol error (i.e. HTTP 400 Bad request) generated by http.sys since the size of the HTTP payload isn't being declared.
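The header-only nature of the check can be sketched as follows. This is an illustration in Python, not IIS code: exceeds_content_length is a hypothetical function, and the 30000000 default simply mirrors the value from the question. The decision uses only the header block, so it can be made before any of the body has arrived.

```python
def exceeds_content_length(raw_header_block: bytes, max_allowed: int = 30_000_000) -> bool:
    """Decide from the header block alone whether the declared body is too large."""
    head = raw_header_block.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            return int(value.strip()) > max_allowed
    return False  # no declared length: this sketch leaves that case to other checks


# Only headers are supplied here; no body bytes have been read.
request_head = (
    b"POST /upload HTTP/1.1\r\n"
    b"Host: example.test\r\n"
    b"Content-Length: 50000000\r\n"
    b"\r\n"
)
print(exceeds_content_length(request_head))  # True
```

What happens to the body bytes that the client keeps sending after the rejection is a separate matter, which is what the connectionTimeout observations above address.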

Does Azure's HTTP request timeout of 4 minutes apply to multipart-formdata file uploads?

Azure apparently has a 4 minute timeout for http requests before they kill the connection. This is non configurable in app services:
https://social.msdn.microsoft.com/Forums/en-US/32b76114-67a4-4e6b-ac45-61b0f0a0829f/changing-the-4-minute-request-time-out-for-app-services?forum=AzureAPIApps
I have seen this first hand in my application - I have a process that allows users to view files that exist on a network drive, select a subset of those files and upload those files to a third party service. This happens via a post request which sends the list of file names using content-type json. This operation can take a while and I receive a timeout error at almost exactly 4 minutes.
I also have another process which allows users to drag and drop files into the web application directly, these files are posted to the server using content-type multipart/form-data, and forwarded to the third party service. This request never times out no matter how long the upload takes.
Is there something about using multipart/form-data that overrides Azure's 4 minute timeout?
It probably does not matter but I am using Node.
The timeout is actually 3m 50s (230 seconds) and not 4 minutes.
But note that it is an idle connection timeout, meaning that it only kicks in if there is no data flowing in the request/response. So it is strange that you would hit this if you are actively uploading files. I would suggest monitoring network traffic to see if anything is being sent. If it really goes 230s with no uploaded data, then there is probably some other issue, and the timeout is just a side effect.

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" limit? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending keep-alives to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
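The same check can be scripted. Below is a minimal Python sketch for Linux, assuming the usual layout of /proc/<PID>/limits, where the "Max open files" row lists the soft limit followed by the hard limit. The function names and the 0.9 threshold are hypothetical choices, not anything Tomcat or the kernel defines.

```python
import os


def parse_open_files_limit(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<PID>/limits content."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # columns: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (descriptors in use, soft limit) for a process; Linux only."""
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_open_files_limit(f.read())
    in_use = len(os.listdir(f"/proc/{pid}/fd"))
    return in_use, limit


def near_limit(in_use, limit, threshold=0.9):
    """True when usage is at or above the given fraction of the limit."""
    return limit is not None and in_use >= threshold * limit
```

Running fd_usage against the Tomcat PID periodically (as the Tomcat user or root) and logging the result would show whether descriptor usage climbs toward the limit before an outage.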
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release them, it is quite possible that your server hits the max connection limit even if there is no communication.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited to:
Unit tests, esp. logic blocks using transactions, threading and synchronization.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode.
More specific logging. Trace client behaviors and analyze what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the <Connector> element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
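The BIO/NIO difference described in the long answer can be put as rough arithmetic. This Python sketch is a simplified model, not Tomcat's actual accounting: connection_ceiling is a hypothetical name, and the accept queue of 100 is assumed from the "queue of 100" mentioned in the question.

```python
def connection_ceiling(connector, max_threads=200, max_connections=10000, accept_count=100):
    """Rough upper bound on sockets held before new ones are refused or time out."""
    if connector == "BIO":
        # One thread per connection: threads run out long before maxConnections.
        serviced = min(max_threads, max_connections)
    elif connector == "NIO":
        # One thread can service many connections, so maxConnections is the bound.
        serviced = max_connections
    else:
        raise ValueError(f"unknown connector: {connector}")
    return serviced + accept_count  # plus connections waiting in the accept queue


print(connection_ceiling("BIO"))  # 300
print(connection_ceiling("NIO"))  # 10100
```

The BIO figure is close to the roughly 297 connections seen in netstat, which is why raising maxConnections alone had no effect under BIO.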
I think you need to load-test the application with Apache JMeter to check the number of connections, and use JConsole or Zabbix to look at heap space or take a thread dump of the tomcat server.
The NIO connector of Apache Tomcat can have a maximum of 10000 connections, but I don't think it's a good idea to give that many connections to one instance of tomcat. A better way is to run multiple instances of tomcat.
In my view the best setup for a production server is to run Apache HTTP Server in front and point it at your tomcat instance using the AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers (50 x 4 = 200 connections), you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting a flood of requests. Log files can usually help you with this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, an intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes (maxThreads, HTTPS, etc.) before jumping to conclusions about maxConnections.
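For the connection-counting suggestions above, a quick alternative to wireshark is to summarise 'netstat -an' output per client. The Python sketch below assumes the usual Linux column layout (protocol, queues, local address, remote address, state); connections_per_client is a hypothetical helper and the addresses in the sample are made up.

```python
from collections import Counter


def connections_per_client(netstat_output, local_port):
    """Count TCP connections to local_port, grouped by (remote IP, state)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # headers, UDP and unix-socket lines
        local, remote, state = fields[3], fields[4], fields[5]
        if state == "LISTEN" or local.rsplit(":", 1)[-1] != str(local_port):
            continue
        counts[(remote.rsplit(":", 1)[0], state)] += 1
    return counts


sample = """\
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             203.0.113.7:51234       ESTABLISHED
tcp        0      0 10.0.0.5:80             203.0.113.7:51235       ESTABLISHED
tcp        0      0 10.0.0.5:80             198.51.100.9:40000      TIME_WAIT
tcp        0      0 10.0.0.5:443            203.0.113.7:51300       ESTABLISHED
"""
for key, n in connections_per_client(sample, 80).most_common():
    print(key, n)
```

A single remote address holding hundreds of connections would point toward a proxy or an app-to-app client rather than ordinary browsers.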

HTTP Error 503.2 - Service Unavailable. The serverRuntime#appConcurrentRequestLimit setting is being exceeded

I have an intranet Sitecore website set up on IIS 7 which randomly throws the following error message
HTTP Error 503.2 - Service Unavailable
The serverRuntime#appConcurrentRequestLimit setting is being exceeded.
To fix this issue, I have made following changes
Increased the Queue Length of application pool myrjetAppPool from 1000 to 65535.
Modified Machine.Config to increase requestQueueLimit property of ProcessModel element to 100000
Increased appConcurrentRequestLimit to 10000 by running
C:\Windows\System32\inetsrv\appcmd.exe set config /section:serverRuntime /appConcurrentRequestLimit:100000
But I'm still getting the same error. Any help is greatly appreciated.
You might check to see where all your threads are going. We had occurrences where threads for Media Library assets were hanging and blocking up the queue.
In IIS Manager, select the server node from the tree, then the "Worker Processes" feature icon, then right-click the application pool of interest and select "View current requests". You might find something is getting stuck. I sometimes hit F5 on this screen a few dozen times in very quick succession to see the rate the requests are going through (of course Performance Monitor is better for viewing metrics but it won't tell you what URLs are being processed).
Investigate references in the linked URL to 'MaxConcurrentRequestsPerCPU', which you may need to set by creating a new registry key, depending on your OS and framework.
https://learn.microsoft.com/en-us/archive/blogs/tmarq/asp-net-thread-usage-on-iis-7-5-iis-7-0-and-iis-6-0
As already commented - check the actual concurrent request count using performance counters to determine which limit you're hitting i.e. it could be a limit of 5000 or maybe 12 (per cpu).
Edit: I realise this may look like I'm talking about a different setting entirely, but I believe there is overlap here.
We got this problem after installing an IIS plugin. After a long investigation we saw that the config file C:\Windows\System32\inetsrv\config\applicationHost.config had an extra location tag for the site with the problem. After removing the extra entry and running an iisreset, the site/server worked normally again. So something must have gone wrong during the installation.
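A check for repeated location tags can be scripted against a copy of the file. The Python sketch below is only an illustration: duplicate_location_paths is a hypothetical helper, the site name in the sample is made up, and a repeated path is not necessarily an error, so the result is a list of candidates to review rather than a verdict.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_location_paths(config_xml):
    """Return the path values that appear on more than one <location> element."""
    root = ET.fromstring(config_xml)
    counts = Counter(loc.get("path", "") for loc in root.iter("location"))
    return sorted(path for path, n in counts.items() if n > 1)


sample_config = """\
<configuration>
  <location path="Intranet Site">
    <system.webServer />
  </location>
  <location path="Other Site">
    <system.webServer />
  </location>
  <location path="Intranet Site">
    <system.webServer />
  </location>
</configuration>
"""
print(duplicate_location_paths(sample_config))  # ['Intranet Site']
```

To use it on a real server, read a copy of applicationHost.config into a string and pass it in, rather than working on the live file.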
