I am doing performance testing of my Azure Web API that receives file attachments from the client and then uploads them to the Data Lake Store. My performance test is currently running for 6 minutes with a load of 250 users making 40 requests/sec.
The file uploads succeed for roughly the first 4 minutes, while the request count stays under 4,000; once it exceeds 4,000, the uploads start failing with a port exhaustion error.
After some research I found that there are around 4K ephemeral ports available for outbound connections, and once the client sends the FIN packet, each port sits in TIME_WAIT for the TcpTimedWaitDelay period, which is 4 minutes (240 seconds) by default.
The solutions I found after initial research include:
1- Minimizing the TIME_WAIT of the ports by changing the registry.
My scenario: I'm using a Web API and I do not have access to the VM.
2- Increasing the number of available ports to 65K by changing the registry.
My scenario: I'm using a Web API and I do not have access to the VM.
3- Disposing of the HttpClient that is being used to make the requests.
My scenario: I do not have access to the client directly as I am using Azure .NET SDK's DataLakeStoreFileSystemManagementClient to upload the files.
I get the error after around 4K+ requests have been made. For file upload I use
DataLakeStoreFileSystemManagementClient.FileSystem.Create(_dlAccountName, filePath, filestream, true)
Can someone please help fix this port exhaustion issue?
Something that jumps to mind is the session timeout on your file upload session. Once you hit the 4,000 mark a few minutes into the run, you essentially have no ports available until the earliest sessions start timing out and their transient client-port resources on the server are released.
In a standard HTTP session environment you would have plenty of flexibility to tune the session timeout, and so recover the ports, in the configuration of your web server, HTTP-based application server, HTTP ESB, and so on. The timeout on your target seems to be set to 240 seconds. Do you have a configuration option available to reduce this value on the target service?
Actually, there is a way to update the default timeout of 5 minutes:
DataLakeStoreFileSystemClient.HttpClient.Timeout = TimeSpan.FromMinutes(1);
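In context, a minimal sketch that sets this timeout on a management client created once and reused for every upload, which also sidesteps the per-request client disposal mentioned in the question; the class name and the credential plumbing are illustrative assumptions, not the SDK's prescribed pattern:

    using System;
    using System.IO;
    using Microsoft.Azure.Management.DataLake.Store;
    using Microsoft.Rest;

    public class DataLakeUploader
    {
        // One client instance shared by every request, so the underlying HttpClient
        // connection pool is reused instead of exhausting ephemeral ports.
        private readonly DataLakeStoreFileSystemManagementClient _client;
        private readonly string _dlAccountName;

        public DataLakeUploader(ServiceClientCredentials credentials, string dlAccountName)
        {
            _client = new DataLakeStoreFileSystemManagementClient(credentials);
            _client.HttpClient.Timeout = TimeSpan.FromMinutes(1); // shorter per-request timeout
            _dlAccountName = dlAccountName;
        }

        public void Upload(string filePath, Stream fileStream)
        {
            // Same call as in the question: create (or overwrite) the file from the stream.
            _client.FileSystem.Create(_dlAccountName, filePath, fileStream, true);
        }
    }

Registering something like this as a singleton in the Web API keeps every request on the same connection pool.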
Also, please take note that we recently released a new Data Lake Store SDK just for filesystem operations in order to improve performance. Check it out!
Nuget: https://www.nuget.org/packages/Microsoft.Azure.DataLake.Store/
Github: https://github.com/Azure/azure-data-lake-store-net
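For reference, a hedged sketch of an upload with that newer SDK; the service-principal sign-in and the AdlsClient/CreateFile calls are written from the package's documented surface, so check them against the version you install:

    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.Azure.DataLake.Store;          // Microsoft.Azure.DataLake.Store package
    using Microsoft.Rest.Azure.Authentication;     // ApplicationTokenProvider

    class AdlsUploadSample
    {
        static async Task Main()
        {
            // Service-principal login; tenant/client/secret values are placeholders.
            // Note: data-plane calls may need the token audience https://datalake.azure.net/
            // (there is an ActiveDirectoryServiceSettings overload for that).
            var creds = await ApplicationTokenProvider.LoginSilentAsync(
                "<tenant-id>", "<client-id>", "<client-secret>");

            // One AdlsClient per account, reused for all filesystem operations.
            AdlsClient client = AdlsClient.CreateClient(
                "<account-name>.azuredatalakestore.net", creds);

            // Create (or overwrite) a file and write the payload through the returned stream.
            using (var stream = client.CreateFile("/test/file.txt", IfExists.Overwrite))
            {
                byte[] payload = Encoding.UTF8.GetBytes("hello from the new SDK");
                stream.Write(payload, 0, payload.Length);
            }
        }
    }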
Related
We have a setup with several RESTful APIs on the same VM in Azure.
The websites run in Kestrel on IIS.
They are protected by the Azure Application Gateway with the firewall enabled.
We now have requests that would run for at least 20 minutes.
The requests run their full length uninterrupted in Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even when the request has finished in Kestrel. The request keeps running in Kestrel even if the connection was interrupted for the sender.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites being the problem (a minimal sketch of such an endpoint follows this list).
Ran the request inside the VM (to localhost): no problems, the response was received.
Ran the request within Azure from one VM to another: the request ran forever.
Ran the request from outside of Azure: the request terminates after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50 m, IIS: 4000 s, ApplicationGateway HttpSettings: 3600.
Requests were tested with Postman.
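A minimal sketch of that delay endpoint, assuming ASP.NET Core; the route and names are illustrative only:

    using System;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Mvc;

    [ApiController]
    [Route("api/delay")]
    public class DelayController : ControllerBase
    {
        // GET api/delay/1500 waits the given number of seconds and then returns,
        // which makes it easy to see which hop cuts the connection, and when.
        [HttpGet("{seconds:int}")]
        public async Task<IActionResult> Get(int seconds)
        {
            await Task.Delay(TimeSpan.FromSeconds(seconds));
            return Ok(new { waitedSeconds = seconds });
        }
    }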
Is there another request or connection timeout hidden somewhere in Azure?
We now have requests that would run for at least 20 minutes.
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
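A minimal sketch of that pattern, assuming an ASP.NET Core controller and a hypothetical in-memory job store; a real implementation would hand the work to a durable background worker instead of Task.Run:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Mvc;

    [ApiController]
    [Route("api/jobs")]
    public class JobsController : ControllerBase
    {
        // Hypothetical job store; in practice this would be a queue plus durable storage.
        private static readonly ConcurrentDictionary<Guid, string> Jobs =
            new ConcurrentDictionary<Guid, string>();

        [HttpPost]
        public IActionResult Start()
        {
            var id = Guid.NewGuid();
            Jobs[id] = "Running";

            // Kick off the long-running work in the background (simplified).
            _ = Task.Run(async () =>
            {
                await Task.Delay(TimeSpan.FromMinutes(20)); // stands in for the 20-minute job
                Jobs[id] = "Completed";
            });

            // 202 Accepted with a Location header the client polls for the result.
            return AcceptedAtAction(nameof(Status), new { id }, new { id });
        }

        [HttpGet("{id:guid}")]
        public IActionResult Status(Guid id) =>
            Jobs.TryGetValue(id, out var state) ? Ok(new { id, state }) : (IActionResult)NotFound();
    }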
You're most probably hitting the Azure SNAT layer's idle timeout.
You can change it under the Configuration blade of the Public IP.
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but the solution (instead of increasing the timeout) was to put PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
Not sure what your backend connection looks like, but something similar (a backend DB proxy) could work to give you more ability to tune connection/reconnection behaviour on your side.
For us it was AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't a full answer, I know there are also two SKUs of public IP address; the 'Basic' one doesn't have the same configurability, so it could be something related to the difference between Basic and Standard public IPs / load balancers.
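If a proxy is not an option, keeping the database connection from ever sitting idle long enough for the SNAT timeout to fire can achieve the same thing. A hedged sketch, assuming the backend talks to Postgres through Npgsql, whose Keepalive connection-string setting makes the driver send periodic keepalives; the host and credential values are placeholders:

    using Npgsql;

    class KeepaliveExample
    {
        static void Main()
        {
            // Keepalive=30 makes Npgsql send a keepalive after 30 seconds of inactivity,
            // so the connection never looks idle to Azure's SNAT/idle timeout.
            var connectionString =
                "Host=<db-host>;Username=<user>;Password=<password>;Database=<db>;Keepalive=30";

            using (var conn = new NpgsqlConnection(connectionString))
            {
                conn.Open();
                using (var cmd = new NpgsqlCommand("SELECT 1", conn))
                {
                    cmd.ExecuteScalar();
                }
            }
        }
    }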
Azure apparently has a 4-minute timeout for HTTP requests before it kills the connection. This is not configurable in App Services:
https://social.msdn.microsoft.com/Forums/en-US/32b76114-67a4-4e6b-ac45-61b0f0a0829f/changing-the-4-minute-request-time-out-for-app-services?forum=AzureAPIApps
I have seen this first-hand in my application. I have a process that allows users to view files that exist on a network drive, select a subset of those files, and upload them to a third-party service. This happens via a POST request that sends the list of file names with content-type JSON. The operation can take a while, and I receive a timeout error at almost exactly 4 minutes.
I also have another process that allows users to drag and drop files into the web application directly; these files are posted to the server with content-type multipart/form-data and forwarded to the third-party service. This request never times out, no matter how long the upload takes.
Is there something about using multipart/form-data that overrides Azure's 4-minute timeout?
It probably does not matter but I am using Node.
The timeout is actually 3m 50s (230 seconds) and not 4 minutes.
But note that it is an idle connection timeout, meaning that it only kicks in if there is no data flowing in the request/response. So it is strange that you would hit this if you are actively uploading files. I would suggest monitoring network traffic to see if anything is being sent. If it really goes 230s with no uploaded data, then there is probably some other issue, and the timeout is just a side effect.
I am running a load test using JMeter on my Azure web services.
I scale my service to the S2 tier with 4 instances and run 4 JMeter instances with 500 threads each.
It starts perfectly fine, but after a while calls start failing with a timeout error (HTTP status 500).
I have checked the HTTP request queue on Azure and found that it is very high on the 2nd instance and very low on two of the others.
Please help me make my load test succeed.
I assume you are using Azure App Service. If you check the settings of your app, you will notice that ARR's Instance Affinity is enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, upon subsequent requests, which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as its session is active.
This is an important feature for session-sensitive applications, but if that's not your case then you can safely disable it to improve the load balancing between your instances and avoid situations like the one you've described.
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
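Affinity can also be switched off from the application itself by returning the Arr-Disable-Session-Affinity header on responses, so ARR never issues the affinity cookie. A minimal sketch, assuming ASP.NET Core middleware; the rest of the pipeline is elided:

    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Http;

    public class Startup
    {
        public void Configure(IApplicationBuilder app)
        {
            // Tell ARR not to issue its affinity cookie, so requests are balanced
            // across instances instead of sticking to the first one that answered.
            app.Use(async (context, next) =>
            {
                context.Response.Headers["Arr-Disable-Session-Affinity"] = "true";
                await next();
            });

            app.Run(context => context.Response.WriteAsync("OK"));
        }
    }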
It might be due to caching of network name resolution at the JVM or OS level, so all your requests end up hitting only one server. If that is the case, add a DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for a more detailed explanation and configuration instructions.
I can't seem to find any documentation on whether the Azure load balancer supports connection draining.
If connection draining is not available how is one supposed to do zero-downtime deployments?
Rick Rainey answered essentially the same question on Server Fault. He states:
The recommended way to do this is to have a custom health probe in your load balanced set. For example, you could have a simple healthcheck.html page on each of your VMs (in wwwroot, for example) and direct the probe from your load balanced set to this page. As long as the probe can retrieve that page (HTTP 200), the Azure load balancer will keep sending user requests to the VM.
When you need to update a VM, you can simply rename the healthcheck.html to a different name such as _healthcheck.html. This will cause the probe to start receiving HTTP 404 errors and will take that machine out of the load balanced rotation because it is not getting HTTP 200. Existing connections will continue to be serviced, but the Azure LB will stop sending new requests to the VM.
After your updates on the VM have been completed, rename _healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and as a result start sending requests to this VM again.
Repeat this for each VM in the load balanced set.
Note, however, that Kevin Williamson from Microsoft states in his MSDN blog post Heartbeats, Recovery, and the Load Balancer, "Make sure your probe path is not a simple HTML page, but actually includes logic to determine your service health (eg. Try to connect to your SQL database)." So you may actually want an aspx page that can check several factors, including a custom "drain" flag you put somewhere.
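A minimal sketch of such a probe endpoint, assuming ASP.NET Core; the drain-flag file path stands in for whatever signal and health checks your service actually uses:

    using Microsoft.AspNetCore.Mvc;

    [ApiController]
    [Route("healthcheck")]
    public class HealthCheckController : ControllerBase
    {
        // Point the load balancer probe at GET /healthcheck.
        [HttpGet]
        public IActionResult Get()
        {
            // Drop this file on the VM to drain it: the probe starts failing,
            // the LB stops sending new requests, and existing connections finish.
            if (System.IO.File.Exists(@"C:\deploy\drain.flag"))
                return StatusCode(503, "Draining for deployment");

            // Real health logic would go here (e.g. try to connect to SQL).
            return Ok("Healthy");
        }
    }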
Your clients need to simply retry.
The load balancer only forwards a request to an instance that is alive (as determined by the health probes); it doesn't keep track of the connections. So if you have long-standing connections, it is your responsibility to clean them up on restart events, or to leave it to the OS to clean them up on restart (which is obviously not graceful in most cases).
Zero downtime means that you'll always be able to reach an instance that is alive, nothing more; it gives you no guarantees about long-running requests.
Note that when a probe is down, only new connections will go to other VMs; existing connections are not impacted.
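A minimal client-side retry sketch, assuming a shared HttpClient and a hypothetical endpoint; production code would usually add jitter and distinguish retryable from non-retryable failures:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class RetryingCaller
    {
        private static readonly HttpClient Http = new HttpClient();

        // Retries transient failures a few times with a short backoff, which is
        // usually enough to ride out an instance restart behind the load balancer.
        public static async Task<string> GetWithRetryAsync(string url, int maxAttempts = 3)
        {
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    var response = await Http.GetAsync(url);
                    response.EnsureSuccessStatusCode();
                    return await response.Content.ReadAsStringAsync();
                }
                catch (HttpRequestException) when (attempt < maxAttempts)
                {
                    await Task.Delay(TimeSpan.FromSeconds(2 * attempt)); // simple backoff
                }
            }
        }
    }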
When we migrated our apps from Rackspace to Azure, we saw almost 50% of HTTP requests getting read timeouts.
We tried placing the client both inside and outside Azure, with the same results. The client in this case is also a server, by the way, so there are no geographic/browser issues either.
We even tried increasing the size of the VM to ensure Azure wasn't throttling us. But even using D-series boxes for a single request, the result was the same.
Once we moved our apps out of Azure, they started functioning properly again.
Each query was made directly against an instance using a public IP, so there are no load balancer issues either.
Almost 50% of queries ran into this issue. The timeout was set to 15 minutes.
The region was East US 2.
Having 50% of HTTP requests time out is not normal behavior, which is why you need to analyze what is causing those timeouts by validating that the requests are actually reaching your VM. For this, I would recommend running a packet capture on your server and analyzing response times, as well as looking for a high number of retransmissions; it is even better if you can take a simultaneous network trace on your client machines so you can do TCP sequence-number analysis and compare packets sent vs. received.
If you are seeing high latencies or a high number of retransmissions in the packet capture, it requires detailed analysis. I strongly suggest you open a support incident so Microsoft support can help you investigate the issue further.