We have a Web API in Azure that sends requests to a VM cluster that is load balanced via an Azure Cloud Service. We see occasional timeouts: requests work normally, then one times out for no apparent reason, and reissuing the request immediately succeeds.
In Fiddler I see:
[Fiddler] The connection to '[myApp].cloudapp.net' failed. Error: TimedOut (0x274c). System.Net.Sockets.SocketException A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 40.122.42.33:9200
I can't find any telemetry in the portal that shows any kind of error, and all is fine when the request is issued from my api. Also, I don't see anything in the Event Logs on my VMs.
I am thinking it might have something to do with TCP port closure, but I am unfamiliar with this. My requests specify 'Connection: keep-alive', so I assume that subsequent requests to the same protocol/domain attempt to reuse the same connection. That reuse usually works, though.
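For illustration, one way to bound how long a pooled keep-alive connection may live (so that a connection the load balancer has silently dropped gets recycled rather than reused) looks roughly like this; the hostname and timeout values are placeholders, not a confirmed fix:

// Sketch only: recycle pooled keep-alive connections periodically so that a
// connection silently dropped by the load balancer is re-opened rather than
// reused. Hostname and values below are placeholders.
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class KeepAliveClient
{
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        var endpoint = new Uri("http://myapp.cloudapp.net:9200/"); // placeholder

        // Keep connection lifetimes well under the load balancer's idle timeout.
        var servicePoint = ServicePointManager.FindServicePoint(endpoint);
        servicePoint.ConnectionLeaseTimeout = 60 * 1000; // recycle after 60 s
        servicePoint.MaxIdleTime = 50 * 1000;            // drop idle after 50 s

        using (var request = new HttpRequestMessage(HttpMethod.Get, endpoint))
        {
            request.Headers.Connection.Add("keep-alive");
            var response = await Client.SendAsync(request);
            Console.WriteLine(response.StatusCode);
        }
    }
}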
Is there any kind of throttling on the number of active connections that can come into my Cloud Service? It is possible that these timeouts happen during peak load (though we don't have enough consistent traffic to verify this).
thanks!
So...about 5pm 2 nights ago, all 14 of my listeners on my Azure Service Bus dropped. So I logged in to my on-prem SQL Server to check on my Hybrid Connections and both of them showed a status of "Status Unknown". I can't find anything on the internet about this specific status.
Nothing changed on my SQL Server other than the fact that I've pegged the RAM....it's at 100% usage.
If I go to the Azure Portal, navigate to either of my Hybrid Connection Overview pages and click on the "Hybrid Connection Url", I get the following message in the browser:
"error": {
"code":"TokenMissingOrInvalid",
"message":"MissingToken: Relay security token is required. TrackingId:*SOME GUID*, SystemTracker:*SERVICE BUS NAME*:*HYBRID CONNECTION NAME*, Timestamp:2021-08-04T04:19:16"}
}
Now....I didn't change anything on my Hybrid Connection configurations. I haven't changed anything about tokens. I have no idea what's going on other than my Azure App Services have been down for 2 days, now.
Any help would be greatly appreciated....
This looks like an authentication error where a token might not be generated when you make a call to the underlying on-prem server.
You can refer to the SO thread on ServiceBusAuthorization, and if you are still facing the issue, kindly raise a ticket with MS Q&A.
Microsoft support led me to this article where I found the following information:
Make Sure that the Date and Time are Correct
The Hybrid Connection Manager connects to Azure Relay using Secure Sockets Layer (SSL) on port 443. If there's a problem with your SSL handshake or connection, it will break your Hybrid Connection. If you find that your Hybrid Connection works initially, and then it stops working after about 10 minutes, that's a sign that you need to check the date and time on the machine running the Hybrid Connection Manager. Make sure they are correct because if they're not, your SSL connection may not work.
Well...the time on my server was off by about 16 minutes because of a group policy that I had never bothered to fix (I don't know anything about group policies). So I looked up how to fix the server's clock and, once that was done, the issue was resolved.
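In hindsight, an easy way to spot this kind of skew is to compare the machine's clock against the Date header of any HTTPS response. A rough sketch (the URL is just an example):

// Rough sketch: compare the local clock against the Date header returned by an
// HTTPS endpoint. A skew of more than a few minutes is enough to break
// TLS/token handshakes like the Hybrid Connection Manager's.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ClockSkewCheck
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var response = await client.GetAsync("https://www.microsoft.com/"); // example endpoint
            var serverTime = response.Headers.Date; // HTTP Date header, in UTC

            if (serverTime.HasValue)
            {
                var skew = DateTimeOffset.UtcNow - serverTime.Value;
                Console.WriteLine($"Local UTC : {DateTimeOffset.UtcNow:O}");
                Console.WriteLine($"Server UTC: {serverTime.Value:O}");
                Console.WriteLine($"Skew      : {skew.TotalMinutes:F1} minutes");
            }
        }
    }
}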
I have a service running behind an Azure API Management instance running in the consumption tier. When no traffic has been sent to the API Management instance in a while (15 minutes isn't enough to trigger it, but an hour is), the first request sent takes about 3 minutes 50 seconds and returns a HTTP 500 with this body content:
<html><head><title>500 - The request timed out.</title></head><body> <font color ="#aa0000"> <h2>500 - The request timed out.</h2></font> The web server failed to respond within the specified time.</body></html>
Following requests work fine. Based on application logs and testing with an API Management instance pointing to my local machine via ngrok, it doesn't look like API management is even trying to connect to the backend for these requests. For the local test, I ran my app under the debugger, put a breakpoint in my service method (there's no auth that could get in the way) and watched the "output" window in Visual Studio. It never hit my breakpoint, and never showed anything in the output window for that "500 request timed out" request. When I made another request to API Management, it forwarded along to my service as expected, giving me output and hitting my breakpoint.
Is this some known issue with the API Management consumption tier that I need to find some way to work around (i.e. a service regularly pinging it)? Or a possible configuration issue with the way I've set up my API Management instance?
My API management instance is deployed via an ARM template using the consumption tier in North Central US and has some REST and some SOAP endpoints (this request I've been using for testing is one of the SOAP ones and uses the envelope header to specify the SOAP action).
Additional information:
The request in question is about 2KB, and a response from the server (which doesn't play into this scenario, as the call never makes it to my server) is about 1KB, so it's not an issue with request/response sizes.
When I turn on request tracing (by sending the Ocp-Apim-Subscription-Key + Ocp-Apim-Trace headers), this 500 response I'm getting doesn't have the Ocp-Apim-Trace-Location header with the trace info that other requests do.
I get this behavior when I send 2 requests (to get the 4-minute 500 response and then a normal 5s 200 response), wait an hour, and make another request (which gets the 4-minute delay and 500 response), so I don't believe this could be related to the instance serving too much traffic (at least too much of my traffic).
Further testing shows that this happens about once every 60 to 90 minutes, even if I send one request every minute trying to keep the APIM instance "alive".
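For completeness, the once-a-minute "keep alive" request mentioned above was just a trivial pinger along these lines (endpoint, subscription key and interval are illustrative), and it did not prevent the periodic cold-start 500s:

// Illustrative keep-warm loop: call a cheap APIM operation once a minute so the
// consumption instance is never idle for long. Endpoint, subscription key and
// interval are placeholders; in practice this did not stop the cold-start 500s.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ApimKeepWarm
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "<key>"); // placeholder

            while (true)
            {
                try
                {
                    var response = await client.GetAsync("https://myapim.azure-api.net/ping"); // placeholder
                    Console.WriteLine($"{DateTime.UtcNow:O} -> {(int)response.StatusCode}");
                }
                catch (HttpRequestException ex)
                {
                    Console.WriteLine($"{DateTime.UtcNow:O} -> {ex.Message}");
                }

                await Task.Delay(TimeSpan.FromMinutes(1));
            }
        }
    }
}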
An HTTP 500 (Internal Server Error) status code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request (possibly due to a large payload). In that case there is no issue at the APIM level; analyze the APIM inspector trace and you should see the HTTP 500 status code under the 'forward-request' response attribute.
You need to understand who is throwing these HTTP 500 responses: APIM or the backend SOAP API. The best way to get that answer is to collect an APIM inspector trace to inspect the request and response (see Debug your APIs using request tracing).
The Consumption tier exposes serverless properties. It runs on shared infrastructure, can scale down to zero in times of no traffic, and is billed per execution. Connections are pooled and reused unless explicitly closed by the back end (see API Management service limits).
1. This pattern of symptoms is also often known to occur due to source network address translation (SNAT) port limits with your APIM service.
Whenever a client calls one of your APIM APIs, the Azure API Management service opens a SNAT port to access your backend API. Azure uses SNAT and a load balancer (not exposed to customers) to communicate with endpoints outside Azure in the public IP address space, as well as endpoints internal to Azure that aren't using Virtual Network service endpoints. (This situation is only applicable to backend APIs exposed on public IPs.)
Each instance of API Management service is initially given a pre-allocated number of SNAT ports. That limit affects opening connections to the same host and port combination. SNAT ports are used up when you have repeated calls to the same address and port combination. Once a SNAT port has been released, the port is available for reuse as needed. The Azure Network load balancer reclaims SNAT ports from closed connections only after waiting four minutes.
A rapid succession of client requests to your APIs may exhaust the pre-allocated quota of SNAT ports if these ports are not closed and recycled fast enough, preventing your APIM service from processing client requests in a timely manner.
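APIM manages its own outbound connections, but the same principle applies to any Azure-hosted code you control that calls out over public IPs: opening a new connection per request consumes SNAT ports, while reusing a pooled client does not. A minimal sketch of the reuse pattern (type name and backend URL are illustrative):

// Illustrative only: share one HttpClient (and its pooled connections) instead
// of creating one per request, so repeated calls to the same host:port do not
// each consume a fresh SNAT port. The backend URL is a placeholder.
using System.Net.Http;
using System.Threading.Tasks;

static class BackendClient
{
    // One client for the lifetime of the process; connections are pooled and reused.
    private static readonly HttpClient Client = new HttpClient();

    public static Task<string> GetAsync(string path)
    {
        // Avoided anti-pattern: "using (var c = new HttpClient())" per call,
        // which leaves sockets behind and exhausts SNAT ports under load.
        return Client.GetStringAsync("https://backend.example.com" + path); // placeholder
    }
}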
The following strategies can be considered:
Use multiple IPs for your backend URLs
Place your APIM and backend service in the same VNet
Place your APIM in a virtual network and route outbound calls to Azure Firewall
Consider response caching and other backend performance tuning (configuring certain APIs with response caching to reduce latency between client applications calling your API and your APIM backend, and to reduce load on the backend).
Consider implementing access restriction policies (a rate-limiting policy can be used to prevent API usage spikes on a per-key basis by limiting the call rate per specified time period).
2. The forward-request policy forwards the incoming request to the backend service specified in the request context. The backend service URL is specified in the API settings and can be changed using the set-backend-service policy.
Policy statement:
<forward-request timeout="time in seconds" follow-redirects="false | true" buffer-request-body="false | true" buffer-response="true | false" fail-on-error-status-code="false | true"/>
Example:
The following API level policy forwards all API requests to the backend service with a timeout interval of 60 seconds.
<!-- api level -->
<policies>
<inbound>
<base/>
</inbound>
<backend>
<forward-request timeout="60"/>
</backend>
<outbound>
<base/>
</outbound>
</policies>
Attribute: timeout="integer"
Description: The amount of time in seconds to wait for the HTTP response headers to be returned by the backend service before a timeout error is raised. Minimum value is 0 seconds. Values greater than 240 seconds may not be honored, as the underlying network infrastructure can drop idle connections after this time.
Required: No
Default: None
This policy can be used in the following policy sections and scopes.
Policy sections: backend
Policy scopes: all scopes
Check out similar feedback for your reference. Also, refer to the detailed troubleshooting of 500 errors for APIM.
We created an Azure MVC website; from the service (controller) code we connect to an on-premises SQL Server using an Azure Hybrid Connection. Intermittently we are facing the issue below.
"A transport-level error has occurred when receiving results from the
server. (provider: TCP Provider, error: 0 - The specified network name
is no longer available.)"
Please provide suggestions to resolve this issue.
You can try the following solutions:
Try increasing the connection time-out.
Check if remote connections are enabled.
Try adding a firewall exception.
First of all, the error means either the network has some extra latency, the database is down, or you have too many concurrent connections open to the database.
(Make sure you are closing all open DataReaders.)
It may also be due to the following:
These are transient faults and are to be expected in the cloud. Implementing defensive programming is usually a must in the cloud. Try using some retry logic. Microsoft's transient fault handling library is an excellent start. Though meant primarily for SQL Azure and Azure Service Bus, you can use the library for SQL on IaaS.
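A minimal sketch of such retry logic, without pulling in the full library (the transient check, attempt count and back-off values are illustrative only):

// Illustrative retry wrapper for transient SQL errors. The transient check and
// back-off values are examples; a real implementation should use a vetted list
// of error numbers (or the transient fault handling library mentioned above).
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;

static class SqlRetry
{
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> operation, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (SqlException ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                // Exponential back-off before the next attempt.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }

    private static bool IsTransient(SqlException ex)
    {
        // Error 0 with a TCP provider message ("network name is no longer
        // available") is the intermittent one here; -2 is a timeout.
        return ex.Number == 0 || ex.Number == -2;
    }
}

Each database call is then wrapped in SqlRetry.ExecuteAsync(() => ...), alongside closing DataReaders and connections promptly.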
In my opinion (98% sure, because I recently had the same experience), it is a network issue on the server provider's side.
For instance, if you rent the server from Ionos, by default all remote connections are blocked. Even if you disable the firewall on the server, you still won't be able to connect remotely. You can, however, do your work on the server itself without any problem.
To connect remotely, you have to contact the server provider. They will explain how to enable firewall ports from your control panel.
I contacted my server provider as I was getting frustrated. Here was their response:
[Screenshot of the provider's response omitted.]
After this, every permitted client can connect remotely to the server.
I wish you success.
I'm currently working on a Windows Azure application using WebAPI and SignalR for communication. Both services are hosted via OWIN on a Worker role with multiple instances.
Current solution
Currently we start one OWIN host with WebAPI on port 443 and one SignalR OWIN host on the instance input endpoint port (e.g. 10106-1010x) on every machine.
Everything works fine, but some of our customers are sitting behind a firewall where all ports except 80/443 are blocked -> so no WebSocket communication there (WebAPI works fine).
New solution
We are starting one OWIN host with WebAPI and SignalR on every instance, so both HTTP and WebSocket traffic will be routed through the load balancer over port 443 -> no more instance input endpoints (and no more firewall problems).
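For reference, the combined host's startup looks roughly like this (a sketch assuming the Microsoft.AspNet.WebApi.Owin and Microsoft.AspNet.SignalR packages):

// Sketch of a single OWIN startup hosting both Web API and SignalR on the same
// endpoint (port 443 behind the load balancer). Assumes the
// Microsoft.AspNet.WebApi.Owin and Microsoft.AspNet.SignalR packages.
using Owin;
using System.Web.Http;

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        var config = new HttpConfiguration();
        config.MapHttpAttributeRoutes();

        app.MapSignalR();       // SignalR hubs under /signalr
        app.UseWebApi(config);  // Web API on the same host and port
    }
}

The host is then started once per instance, e.g. via WebApp.Start<Startup>(...) from Microsoft.Owin.Hosting.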
The problem
The problem now is that sometimes the WebSocket connection can be established and sometimes not (browser independent). If the connection can't be established the following error appears in the console:
Error during WebSocket handshake: Unexpected response code: 400
No transport could be initialized successfully. Try specifying a different transport or none at all for auto initialization.
I've already added the role instance id to the websocket response messages from the server, but couldn't find any (ir)regularities (e.g. a single instance doesn't respond, ...). All SignalR servers seem to be up and running, but sometimes the connection can't be established.
You can test it yourself by going to the following link. If you don't get an error dialog ("Connection to server lost") it is working, otherwise try to refresh the page several times.
-
I'm not looking for a scaleout feature for SignalR (as described here or here). The client just connects to one (random) server (worker role instance) and communicates with that server until a close message is sent. If it connects again it can be routed to any other server. Also, there is no communication between the servers.
Update/Solution
halter73 was right: each instance generates its own anti-CSRF token. To avoid this I implemented my own IDataProtector/IDataProtectionProvider, similar to these two SO questions (see here and here).
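For anyone hitting the same thing, the implementation is roughly the following (a sketch based on those answers: it simply delegates to MachineKey, so it only helps if all role instances share the same machineKey configuration):

// Sketch of a machine-key-backed IDataProtectionProvider so every instance can
// unprotect tokens created by any other instance (requires a shared machineKey
// in the web/role configuration). Based on the SO answers linked above.
using Microsoft.Owin.Security.DataProtection;
using System.Web.Security;

public class MachineKeyProtectionProvider : IDataProtectionProvider
{
    public IDataProtector Create(params string[] purposes)
    {
        return new MachineKeyDataProtector(purposes);
    }
}

public class MachineKeyDataProtector : IDataProtector
{
    private readonly string[] _purposes;

    public MachineKeyDataProtector(string[] purposes)
    {
        _purposes = purposes;
    }

    public byte[] Protect(byte[] userData)
    {
        return MachineKey.Protect(userData, _purposes);
    }

    public byte[] Unprotect(byte[] protectedData)
    {
        return MachineKey.Unprotect(protectedData, _purposes);
    }
}

It is registered in the OWIN startup before MapSignalR, e.g. with app.SetDataProtectionProvider(new MachineKeyProtectionProvider()) from Microsoft.Owin.Security.DataProtection.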
If you can look at content of the 400 response (this may be difficult since it is an SSL encrypted response to a WebSocket request), you will probably see a message similar to "The ConnectionId is in the incorrect format."
SignalR uses the server's machine key to create an anti-CSRF token, but this requires that all the servers in your farm share a machine key for the token to be properly decrypted when SignalR requests hop servers. The /negotiate request is the one that retrieves the anti-CSRF token. When the SignalR client then uses the anti-CSRF token to make a /connect request, it sometimes fails when the /connect request is processed by a different server, which didn't create the token and is therefore unable to decrypt it.
Here is an issue filed on GitHub by someone who experienced a similar issue: https://github.com/SignalR/SignalR/issues/2292.
We are running some long-running test apps with Azure Service Bus relay over HTTP, hosted in a Windows service, and most of the time these run fine for 2-3 days. However, every so often an internal network glitch may occur (e.g. a firewall reboot) that kills the internet connection.
At this point, the relay is dropped in Azure and our web app can no longer communicate with the on-premise service.
I would have thought that the Azure relay client was fault-tolerant - in that if it realises that it has lost the connection with Azure, it will re-establish the connection and, if it can't, keep trying until it can... but it appears that this is not the case. This seems pretty fundamental...?
Only once have I ever seen a "System.ServiceModel.CommunicationException" where the service can't communicate on the internet, and that was when the client was starting up and trying to establish the connection in the first place.
Is there any advice or feedback on handling transient disconnections through the relay service? (As it's a cloud --> on-prem direction, the client can't, AFAIK, ping the server.)
If you are still experiencing issues, you may want to contact Azure support to understand why it is disconnecting. The Relay client should reconnect if something happens to the existing connection.
You may want to add ConnectionStatusBehavior to your ChannelFactory to have it output when the status for the connection changes. It will contain the error that caused it to change status.
// Surface relay connectivity changes on the client side.
var connectionStatusBehavior = new ConnectionStatusBehavior();
connectionStatusBehavior.Online += ConnectionStatusOnlineMethod;   // connection restored
connectionStatusBehavior.Offline += ConnectionStatusOfflineMethod; // connection lost
channelFactory.Endpoint.Behaviors.Add(connectionStatusBehavior);
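The handlers are plain EventHandler callbacks; a minimal sketch, assuming the behavior instance is kept in a field and exposes the most recent error via a LastError-style property (per the IConnectionStatus pattern):

// Minimal handler sketch; connectionStatusBehavior is assumed to be stored in a
// field so the Offline handler can inspect why the relay connection dropped.
void ConnectionStatusOnlineMethod(object sender, EventArgs e)
{
    Console.WriteLine("Relay connection is back online.");
}

void ConnectionStatusOfflineMethod(object sender, EventArgs e)
{
    // LastError holds the exception behind the most recent status change
    // (e.g. the network glitch that killed the underlying connection).
    Console.WriteLine("Relay connection went offline: " + connectionStatusBehavior.LastError);
}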
This issue was solved by Microsoft in version 2.6.5 of the Microsoft Azure Service Bus DLL. After a month of testing it seems to work.