ServicePointManager SetTcpKeepAlive - azure

We have a web site which calls Azure Storage thousands of times a second. All of the storage endpoints are HTTPS. Does anyone know whether enabling TCP keep-alive via ServicePointManager.SetTcpKeepAlive will help with performance? It is disabled by default.

Not sure if enabling TCP keep-alive will help your performance issue (it should be easy enough for you to benchmark), but... if you're calling storage endpoints from your Azure-hosted web site, and storage is in the same region (same data center), you shouldn't need HTTPS, since traffic never leaves the data center.
EDIT: since you're working with ServicePointManager, also consider setting ServicePointManager.UseNagleAlgorithm = false. Otherwise, small TCP packets get buffered for up to half a second. If your storage communication uses small (less than ~1400 byte) payloads, this setting should help (especially when dealing with things like Azure Queues, which tend to have very small messages).
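For reference, a minimal sketch of both settings applied once at application startup; the class name and the keep-alive timings are illustrative, not prescribed values:

    using System.Net;

    public static class StorageClientTuning
    {
        // Call once at startup (e.g. Global.asax Application_Start).
        public static void Apply()
        {
            // Enable keep-alive probes: 30,000 ms idle before the first probe,
            // then probe every 5,000 ms (illustrative values).
            ServicePointManager.SetTcpKeepAlive(true, 30000, 5000);

            // Disable Nagle so small payloads (e.g. queue messages) are sent
            // immediately instead of being buffered with other small writes.
            ServicePointManager.UseNagleAlgorithm = false;
        }
    }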

Related

How to control the usage of APIs by consumers during a given period (throttle) in Azure function app Http trigger without using Azure API Gateway

How can I control the usage of APIs by consumers during a given period in an Azure Function app HTTP trigger? Simply put: how do I throttle requests once the request limit is exceeded? Please suggest a solution that does not use Azure API Gateway.
The only control you have over host creation in Azure Functions is an obscure application setting: WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT. This implies that you can control the number of hosts that are generated, though Microsoft claims that “it’s not completely foolproof” and “is not fully supported”.
From my own experience it only throttles host creation effectively if you set the value to something pretty low, i.e. less than 50; at larger values its impact is pretty limited. It has been implied that this feature will be worked on in the future, but the corresponding GitHub issue has had no update since July 2017.
For more details, you could refer to this article.
You can use the initialVisibilityDelay parameter of the CloudQueue.AddMessage function, as outlined in this blog post.
This delays when each message becomes visible to consumers and, if implemented correctly (using the leaky bucket algorithm or equivalent), prevents the 429 errors.
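For illustration, a hedged sketch of that approach with the classic WindowsAzure.Storage SDK; the queue name, connection string handling, and the 30-second delay are placeholders, not values from the answer:

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    public static class DelayedEnqueue
    {
        public static void Enqueue(string storageConnectionString, string payload)
        {
            CloudQueue queue = CloudStorageAccount.Parse(storageConnectionString)
                .CreateCloudQueueClient()
                .GetQueueReference("work-items");
            queue.CreateIfNotExists();

            // initialVisibilityDelay keeps the message invisible to consumers for the
            // given period, spreading downstream processing out instead of spiking it.
            queue.AddMessage(new CloudQueueMessage(payload),
                initialVisibilityDelay: TimeSpan.FromSeconds(30));
        }
    }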

AWS Cloudfront latency: origin fetch vs internet

Is the first fetch of any given file from an origin via Cloudfront faster on average than fetching directly from the origin over the internet? I'm wondering if the AWS backbone somehow outperforms the speed of the public internet.
E.g. if a user in Sydney wants a file from my S3 bucket in Europe, and CloudFront doesn't yet have it cached, is it quicker to get it directly over the internet, or for CloudFront to fetch it from the European origin to the Sydney edge cache and then over the internet for the last few hops? But that's just an example. Users will be worldwide, and many will be in Europe, close to the origin.
I do understand that AFTER that request to origin the CDN will cache the file and subsequent requests from Sydney for that same file within the file's TTL will be much faster, but subsequent requests will not happen often in my use case...
I have a large collection of small files (<1MB) on S3 which seldom change, and each of them individually is seldom downloaded and will have a TTL of about 1 week.
I'm curious if putting Cloudfront in front of S3, in this case, will be worth it even though I won't get much value from the edge caching service that the CDN provides.
So should I expect to see any latency decrease on average for those first fetch scenarios?
EDIT: I subsequently found this article which mentions "Persistent Connections... reduces overall latency...", but I suspect it just means better performance of the Cloudfront-to-origin subsystem, and not necessarily better end-to-end perf for the user.
I'm wondering if the AWS backbone somehow outperforms the speed of the public internet.
The idea is that it should.
You should see an overall improvement, because CloudFront does several useful things, even when not caching:
brings the traffic onto the AWS managed network as close to the viewer as practical, with the traffic traversing most of its distance on the AWS network rather than on the public Internet.
sectionalizes the TCP interactions between the browser and the origin by creating two TCP connections¹, one from browser to CloudFront, and one from CloudFront to origin. The back-and-forth messaging that occurs for connection setup, then TLS negotiation, then HTTP request/response, is optimized.
(optional) provides HTTP/2 to HTTP/1.1 gateway/translation, allowing the browser to make concurrent requests over a single HTTP/2 connection while converting these to multiple HTTP/1.1 requests on separate connections to the origin.
There are some minor arbitrage opportunities in the discrepancies between costs for traffic leaving a region bound for the Internet and traffic leaving a CloudFront edge bound for the Internet. (Traffic outbound from EC2/S3 to CloudFront is not billable.) In many cases, these work in your favor, such as a viewer in a low-cost area accessing a bucket in a high-cost area, but they are almost always asymmetric. A London viewer and a Sydney bucket is $0.14/GB accessing the bucket directly, but $0.085/GB accessing the same bucket through CloudFront. On the flip side, a Sydney viewer accessing a London bucket is $0.09/GB direct to the bucket, $0.14/GB through CloudFront. London viewer/London bucket is $0.085/GB through CloudFront or $0.09/GB direct to the bucket. My long-term assumption is that these discrepancies represent the cost of Internet access compared to the cost of AWS's private transport. You can also configure CloudFront, via the price class feature, to use only the lower-cost edges; this is not guaranteed to actually use only the lower-cost edges for traffic, but rather guaranteed not to charge you a higher price if a lower-cost edge is not used.
Note also that there are two (known) services that use CloudFront with caching always disabled:
Enabling S3 Transfer Acceleration on a bucket is fundamentally a zero-config-required CloudFront distribution without the cache enabled. Transfer Acceleration has only three notable differences compared to a self-provisioned CloudFront + S3 arrangement: first, it can pass through signed URLs that S3 understands and accepts (with S3 plus your own CloudFront, you have to use CloudFront signed URLs, which use a different algorithm); second, the CloudFront network is bypassed for users who are geographically close to the bucket region, which also eliminates the Transfer Acceleration surcharge for that request; third, it almost always costs more than your own CloudFront + S3.
AWS apparently believes the value added here is significant enough to justify the feature costing more than using S3 + CloudFront yourself. On occasion, I have used it to squeeze a bit more optimization out of a direct-to-bucket arrangement, because it is an easy change to make.
Find the Transfer Acceleration speed test on this page and observe what it does. It tests upload rather than download, but it is the same idea -- it gives you a reasonable depiction of the differences between the public Internet and the AWS "Edge Network" (the CloudFront infrastructure).
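If you use Transfer Acceleration from code, here is a hedged sketch with the AWS SDK for .NET (chosen to match the other examples in this document); the bucket, key, local path, and region are placeholders, and the bucket must already have Transfer Acceleration enabled:

    using System.IO;
    using System.Threading.Tasks;
    using Amazon;
    using Amazon.S3;

    public static class AcceleratedDownload
    {
        public static async Task FetchAsync(string bucket, string key, string localPath)
        {
            var config = new AmazonS3Config
            {
                RegionEndpoint = RegionEndpoint.EUWest1,  // region of the bucket (example)
                UseAccelerateEndpoint = true              // route the request via the edge network
            };

            using (var s3 = new AmazonS3Client(config))
            using (var response = await s3.GetObjectAsync(bucket, key))
            using (var file = File.Create(localPath))
            {
                // Copy the object body to a local file.
                await response.ResponseStream.CopyToAsync(file);
            }
        }
    }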
API Gateway edge-optimized APIs also route through CloudFront for performance reasons. While API Gateway does offer optional caching, it uses a caching instance, not the CloudFront cache. API Gateway subsequently introduced a second type of API endpoint that doesn't use CloudFront, because when you are making requests within the same actual AWS region, it doesn't make sense to send the request through extra hardware. This also makes deploying API Gateway behind your own CloudFront a bit more sensible, avoiding an unnecessary second pass through the same infrastructure.
¹two TCP connections may actually be three, which should tend to further improve performance because the boundary between each connection provides a content buffer that allows for smoother and faster transport and changes the bandwidth-delay product in favorable ways. Since some time in 2016, CloudFront has two tiers of edge locations, the outer "global" edges (closest to the viewer) and the inner "regional" edges (within the actual AWS regions). This is documented but the documentation is very high-level and doesn't explain the underpinnings thoroughly. Anecdotal observations suggest that each global edge has an assigned "home" regional edge that is the regional edge in its nearest AWS region. The connection goes from viewer, to outer edge, to the inner edge, and then to the origin. The documentation suggests that there are cases where the inner (regional) edge is bypassed, but observations suggest that these are the exception.

Port exhaustion on Azure Data Lake Store

I am doing performance testing of my Azure Web API, which receives file attachments from the client and then uploads them to the Data Lake Store. My performance test runs for 6 minutes with a load of 250 users making 40 requests/sec.
The file uploads succeed for roughly the first 4 minutes, while the number of requests stays under 4,000; once it exceeds 4,000, uploads start failing with a port exhaustion error.
After some research I found that there are around 4K ephemeral ports available for communication, and once the client sends the FIN packet, those ports sit in TIME_WAIT for TcpTimedWaitDelay, which by default is 4 minutes (240 seconds).
The solutions I found after initial research include:
1- Minimizing the TIME_WAIT period of the ports by changing the registry.
My scenario: I'm using a Web API and I do not have access to the VM.
2- Increasing the number of ports to 65K by changing the registry.
My scenario: I'm using a Web API and I do not have access to the VM.
3- Disposing the HTTP client that is being used to make the requests.
My scenario: I do not have access to the client directly, as I am using the Azure .NET SDK's DataLakeStoreFileSystemManagementClient to upload the files.
I get the error after around 4K+ requests have been made. For file upload I use
DataLakeStoreFileSystemManagementClient.FileSystem.Create(_dlAccountName, filePath, filestream, true)
Can someone please help fix this port exhaustion issue?
Something which jumps to mind is the session timeout on your file upload session. Once you hit the 4,000 mark a few minutes in, you essentially have no ports available until the earliest sessions start timing out and the transient client-port connection resources on the server are released.
In a standard HTTP session environment you would have enormous flexibility to tune the session timeout to recover the ports, in the configuration file for your web server / HTTP-based application server / HTTP ESB / etc. The timeout on your target seems to be set to 240 seconds. Do you have a configuration option available to reduce this value in the configuration of your target service?
Actually, there is a way to update the default timeout of 5 minutes:
DataLakeStoreFileSystemClient.HttpClient.Timeout = TimeSpan.FromMinutes(1);
Also, please take note that we recently released a new Data Lake Store SDK just for filesystem operations in order to improve performance. Check it out!
Nuget: https://www.nuget.org/packages/Microsoft.Azure.DataLake.Store/
Github: https://github.com/Azure/azure-data-lake-store-net
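For illustration, a hedged sketch of an upload with that newer SDK (Microsoft.Azure.DataLake.Store); the account name, target path, and credential object are placeholders, not values from this thread:

    using System.IO;
    using Microsoft.Azure.DataLake.Store;
    using Microsoft.Rest;

    public static class AdlsUpload
    {
        public static void Upload(ServiceClientCredentials creds, Stream source)
        {
            // The account FQDN is a placeholder; obtain the credentials via your
            // usual Azure AD flow.
            AdlsClient client = AdlsClient.CreateClient("youraccount.azuredatalakestore.net", creds);

            // CreateFile returns a writable stream for the new file.
            using (Stream adlsStream = client.CreateFile("/perf-test/upload.dat", IfExists.Overwrite))
            {
                source.CopyTo(adlsStream);
            }
        }
    }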

Expected performance with getstream.io

The getstream.io documentation says that one should expect to retrieve a feed in approximately 60ms. When I retrieve my feeds they contain a field named 'duration', which I take to be the server-side processing time. This value is steadily around 10-40ms, with an average around 15ms.
The problem is, I seldom get my feeds in less than 150ms; the average is around 200-250ms and sometimes up to 300-400ms. This is the time for getting the feed alone, no enrichment etc., and I have verified with tcpdump that the network round trip is low (around 25ms) and that the time is actually spent waiting for the server to respond.
I've tried to move around my application (eu-west and eu-central) but that doesn't seem to affect things much (again, network roundtrip is steadily around 25ms).
My question is - should I really expect 60ms and continue investigating, or is 200-400ms normal? The getstream.io site explains that developer accounts receive "Low Priority Processing" - what does this mean in practice? How much difference could I expect with another plan?
I'm using the node js low level API.
Stream APIs use SSL to encrypt traffic. Unfortunately SSL introduces additional network I/O. Usually you need to pay the increased latency only once, because Stream HTTP APIs support HTTP persistent connections (aka keep-alive).
In a Wireshark capture of the TCP traffic for 2 sequential API requests with keep-alive disabled client-side, four lines in red highlighted that the TCP connection was getting closed each time. Another interesting thing is that the handshaking took almost 100ms and was done twice (the first bunch of lines).
After some investigation, it turns out that the library used to make API requests to Stream's APIs (request) does not have keep-alive enabled by default. This change will be part of the library soon and is already available on a development branch.
With keep-alive enabled (using the code from that branch), the same two requests show no connection reset anymore, and the second HTTP request does not repeat the SSL handshake.
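The question uses the Node.js client, but the same principle can be sketched in .NET (matching the other examples in this document): reuse one client so the TCP and TLS handshakes are paid only once rather than on every request. The URL below is a placeholder:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class KeepAliveDemo
    {
        // A single shared HttpClient keeps the underlying TCP+TLS connection open
        // between requests (HTTP keep-alive), so only the first request pays the
        // handshake cost.
        private static readonly HttpClient Client = new HttpClient();

        public static async Task RunAsync()
        {
            for (int i = 0; i < 2; i++)
            {
                var started = DateTime.UtcNow;
                using (var response = await Client.GetAsync("https://api.example.com/feed"))
                {
                    Console.WriteLine($"{(int)response.StatusCode} in {(DateTime.UtcNow - started).TotalMilliseconds:F0} ms");
                }
            }
        }
    }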

High amount of http read timeouts on azure

When we migrated our apps to Azure from Rackspace, we saw almost 50% of HTTP requests getting read timeouts.
We tried placing the client both inside and outside Azure with the same results. The client in this case is also a server, btw, so no geographic/browser issues either.
We even tried increasing the size of the box to ensure Azure wasn't throttling, but even using D-series boxes for a single request, the result was the same.
Once we moved our apps out of Azure, they started functioning properly again.
Each query was done directly on an instance using a public IP, so no load balancer issues either.
Almost 50% of queries ran into this issue. The timeout was set to 15 minutes.
Region was US East 2
Having 50% of HTTP requests timing out is not normal behavior, so you need to analyze what is causing those timeouts, starting by validating that the requests are actually hitting your VM. For this, I would recommend running a packet capture on your server and analyzing response times, as well as looking for a high number of retransmissions; it is even better if you can take a simultaneous network trace on your client machines so you can do TCP sequence number analysis and compare packets sent vs. received.
If you see high latencies in the packet capture or a high number of retransmissions, it requires detailed analysis. I strongly suggest opening a support incident so Microsoft support can help you investigate the issue further.
