Troubleshooting the websocket limit in Azure: active connections

I'm in the process of troubleshooting an App Service that is using websockets.
It's running on a Basic service plan, which allows for 350 websockets.
This is the only app on that plan that uses websockets.
The problem is that after about 20 hours I get 503 responses saying I have reached my websocket limit.
The setup right now has 3 clients connecting to the service.
In the process of investigating websocket leakage in my app I would like to track the number of websockets in use.
Is there anywhere, from my app or in Azure portal, where I can see the number of active websocket connections?
Follow up:
I've logged the websocket connections as Amor suggested.
The HTTP part of my app is still working; I can get dynamic results from the app, which now reports which websocket connections are active and how many have been created since start.
I restarted the app service and configured one client to reconnect indefinitely.
It worked fine until the "total websocket connections" reached 350. At this time I shut down the client.
The limit should be 350 concurrent connections but it looks like it is 350 in total since start.
Most (at least 340) of these connections were initiated by a single client which disposed each connection before starting a new one; it was shut down once the limit was reached.
It has been suggested that I upgrade from Basic to Standard, since Standard doesn't have this artificial limitation. The only way I can see that working is if there is a bug in the websocket limitation for the Basic plan.
Update 2
In parallel I've been in contact with Microsoft Developer Support, and they noticed that the sockets appear to be stuck in IIS but not in Kestrel. The cause of this is still being investigated.
Support was able to show me graphs of the connection usage over time, which clearly showed how the limit was reached.
I'll keep this question updated in case there was some error in my code.

I suggest you define a counter for the connections: increment it when a websocket connection is opened and decrement it when one is closed. The code below is for your reference.
Count the connections for ASP.NET SignalR.
public class MyHub : Hub
{
    // Hub instances are created per invocation, so the counter must be static
    // and updated atomically.
    private static int _connectionCount;

    public override Task OnConnected()
    {
        Interlocked.Increment(ref _connectionCount);
        return base.OnConnected();
    }

    public override Task OnReconnected()
    {
        // A reconnect resumes an existing connection, so don't count it again.
        return base.OnReconnected();
    }

    public override Task OnDisconnected(bool stopCalled)
    {
        Interlocked.Decrement(ref _connectionCount);
        return base.OnDisconnected(stopCalled);
    }
}
Count the connections in traditional ASP.NET.
public class WSChatController : ApiController
{
    // Controllers are created per request, so the counter must be static
    // and updated atomically.
    private static int _connectionCount;

    public HttpResponseMessage Get()
    {
        if (HttpContext.Current.IsWebSocketRequest)
        {
            HttpContext.Current.AcceptWebSocketRequest(ProcessWSChat);
        }
        return new HttpResponseMessage(HttpStatusCode.SwitchingProtocols);
    }

    private async Task ProcessWSChat(AspNetWebSocketContext context)
    {
        WebSocket socket = context.WebSocket;
        Interlocked.Increment(ref _connectionCount);
        try
        {
            while (socket.State == WebSocketState.Open)
            {
                ArraySegment<byte> buffer = new ArraySegment<byte>(new byte[1024]);
                WebSocketReceiveResult result = await socket.ReceiveAsync(
                    buffer, CancellationToken.None);
                // Process the request
            }
        }
        finally
        {
            Interlocked.Decrement(ref _connectionCount);
        }
    }
}
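Since the follow-up mentions Kestrel, the app may be ASP.NET Core; in that case the same idea can be implemented as middleware. The sketch below is only a minimal example under that assumption (the middleware name and how you expose the count are up to you); the downstream endpoint is still responsible for calling AcceptWebSocketAsync.
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class WebSocketCounterMiddleware
{
    // Number of websocket requests currently being handled.
    public static int ActiveCount;

    private readonly RequestDelegate _next;

    public WebSocketCounterMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        if (!context.WebSockets.IsWebSocketRequest)
        {
            await _next(context);
            return;
        }

        Interlocked.Increment(ref ActiveCount);
        try
        {
            // The downstream handler accepts the socket and runs until it closes.
            await _next(context);
        }
        finally
        {
            Interlocked.Decrement(ref ActiveCount);
        }
    }
}

// Registered in Startup.Configure, before the websocket endpoint:
//   app.UseWebSockets();
//   app.UseMiddleware<WebSocketCounterMiddleware>();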

Related

How to stop outbound HTTP connections from timing out

Background:
I'm currently hosting an ASP.NET application in Azure with the following specs:
ASP .Net Core 2.2
Using Flurl for HTTP requests
Kestrel Webserver
Docker (Linux - mcr.microsoft.com/dotnet/core/aspnet:2.2 runtime)
Azure App Service on P2V2 tier app service plan
I have a couple of background jobs that run on the service and make a lot of outbound HTTP calls to a 3rd party service.
Issue:
Under a small load (approximately 1 call per 10 seconds), all requests complete in under a second with no issue. The issue I'm having is that under a heavy load, when the service can make up to 3 or 4 calls in a 10-second span, some of the requests will randomly time out and throw an exception. When I was using RestSharp the exception would read "The operation has timed out". Now that I'm using Flurl, the exception reads "The call timed out".
Here's the kicker: if I run the same job from my laptop running Windows 10 / Visual Studio 2017, this problem does NOT occur. This leads me to believe I'm hitting some limit or running out of some resource in my hosted environment. It's unclear whether that is connection/socket or thread related.
Things I've tried:
Ensure all code paths to the request are using async/await to prevent lockouts
Ensure Kestrel Defaults allow unlimited connections (it does by default)
Ensure Docker's default connection limits are sufficient (2000 by default, more than enough)
Configuring ServicePointManager settings for connection limits
Here is the code in my startup.cs that I'm currently using to try and prevent this issue:
public class Startup
{
    public Startup(IHostingEnvironment hostingEnvironment)
    {
        ...

        // ServicePointManager setup
        ServicePointManager.UseNagleAlgorithm = false;
        ServicePointManager.Expect100Continue = false;
        ServicePointManager.DefaultConnectionLimit = int.MaxValue;
        ServicePointManager.EnableDnsRoundRobin = true;
        ServicePointManager.ReusePort = true;

        // Set service point timeouts
        var sp = ServicePointManager.FindServicePoint(new Uri("https://placeholder.thirdparty.com"));
        sp.ConnectionLeaseTimeout = 15 * 1000; // 15 seconds
        FlurlHttp.ConfigureClient("https://placeholder.thirdparty.com",
            cli => cli.Settings.ConnectionLeaseTimeout = new TimeSpan(0, 0, 15));
    }
}
Has anyone else run into a similar issue to this? I'm open to any suggestions on how to best debug this situation, or possible methods to correct the issue. I'm at a complete loss after researching this for several days.
Thank you in advance.
I had similar issues. Take a look at Asp.net Core HttpClient has many TIME_WAIT or CLOSE_WAIT connections. Debugging via netstat helped identify the problem for me. As one possible solution, I suggest you use IHttpClientFactory. You can get more info from https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.2 It should be fairly easy to use, as described in Flurl client lifetime in ASP.Net Core 2.1 and IHttpClientFactory.
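For reference, here is a minimal sketch of what the IHttpClientFactory registration could look like in Startup.ConfigureServices; the client name "thirdparty" and the base address are placeholders, and the Flurl-specific wiring is covered in the linked question.
using System;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    // Named client; the factory pools and recycles the underlying handlers,
    // which avoids leaking sockets from ad-hoc HttpClient instances.
    services.AddHttpClient("thirdparty", client =>
    {
        client.BaseAddress = new Uri("https://placeholder.thirdparty.com");
        client.Timeout = TimeSpan.FromSeconds(30);
    });
}

// In a consumer, inject IHttpClientFactory and call:
//   var client = _httpClientFactory.CreateClient("thirdparty");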

Random 21/42 seconds timeout in outgoing traffic on Azure Web Sites

I have an ASP.NET MVC 5 application running in the Azure German cloud as an Azure Web App (single instance, Standard S3 size).
I'm calling a non-Azure-hosted REST/SOAP service on a particular host, and the web requests either succeed promptly or time out after 21/42 seconds.
I've load tested the requests, and the percentage of requests timing out is between 20 and 80.
One particularly remarkable property of the timeouts is that they occur after exactly 21 or 42 seconds (this is serious, no reference to the Hitchhiker's Guide to the Galaxy intended).
Calling a different service from the web app works just fine, temporarily at least.
We've already checked the firewall of the non-Azure service; when the timeout occurs, not a single packet reaches the host.
This issue occurred once in the past, one year ago, and support was unable to tell what the cause was until the issue suddenly went away roughly two weeks after first occurring, so the ticket was closed as having resolved itself; but now it's back.
The code is using https://github.com/canton7/RestEase (uses HttpClient underneath) and looks like
[Header("Content-Type", "application/json")]
public interface IApi
{
    [Post("/Login")]
    Task<LoginToken> Login([Body] LoginRequest request);
}

private static Dictionary<string, IApi> ApiClientsByHost = new Dictionary<string, IApi>();

private IApi GetApiForHost(string host)
{
    if (!ApiClientsByHost.TryGetValue(host, out var client))
    {
        lock (ApiClientsByHost)
        {
            if (!ApiClientsByHost.TryGetValue(host, out client))
            {
                ApiClientsByHost[host] = client = RestClient.For<IApi>(host);
            }
        }
    }
    return client;
}

var client = GetApiForHost("https://production/");
var loginToken = await client.Login(new LoginRequest { Username = username, Password = password });
By a different service, I mean using "https://testserver/" instead of "https://production/" (testserver is located in a different data center, with a different IP and all).
The API authentication passes a token via the query string, but the call times out before it can even get a token.
The code caches the IApi to avoid the TCP starvation problems of disposing HttpClients (but I've never run into port exhaustion).
Restarting the app does not resolve the issue, and the issue currently only occurs against production (but a year ago, when this issue occurred on production, we switched to testserver, which worked initially but after some time ran into the same problem).
EDIT: Found some explanation in the last answer as to where those magical 21 seconds are coming from.
EDIT: One workaround I've found is to set up an Azure VM with a proxy on it and configure defaultProxy to route traffic through that VM.
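For anyone wanting to try the same workaround: the defaultProxy setting lives in web.config under system.net. A minimal sketch, with a placeholder proxy address for the VM:
<configuration>
  <system.net>
    <defaultProxy enabled="true">
      <!-- Address of the proxy running on the Azure VM (placeholder) -->
      <proxy proxyaddress="http://10.0.0.4:3128" bypassonlocal="true" />
    </defaultProxy>
  </system.net>
</configuration>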
That's TCP retransmission timing out. On Windows, an unanswered connection attempt (SYN) is retransmitted twice with an exponential backoff starting at roughly 3 seconds, which adds up to about 21 seconds before the attempt is abandoned. It's odd that you are getting different values though.

Understanding asynchronous web processing

I've just finished reading up about asynchronous WebServlet processing. [This article] is a good read.
However, fundamentally I'm confused about why this method is the "next generation of web processing" and why it is in fact used at all. It seems we are avoiding better configuring our Web Application Servers (WAS) - nginx, apache, tomcat, IIS - and instead putting the problem onto the Web Developer.
Before I dive into my reasoning, I want to briefly explain how Requests are accepted and then handled by a WAS.
NETWORK <-> OS -> QUEUE <- WEB APPLICATION SERVER (WAS) <-> WEB APPLICATION (APP)
A Web Application Server (WAS) tells the Operating System (OS) that it wants to receive Requests on a specific Port, e.g. Port 80 for HTTP.
The OS opens a Listener on the Port (if it's free) and waits for Clients to connect.
When the OS receives a Connection, it adds it to a Queue assigned to the WAS (if there is space, otherwise the Client's Connection is rejected) - the size of the Queue is defined by the WAS when it requests the Port.
The WAS monitors the Queue for Connections and when a Connection is available, accepts the Connection for processing - removing it from the Queue.
The WAS passes the Connection on to the Web Application for processing - it could also handle the process itself if programmed to.
The WAS can handle multiple Connections at the same time by using multiple Processors (normally one per CPU core), each with multiple Threads.
So this now brings me to my query. If the amount of Requests the WAS can handle depends on the speed at which it can process the Queue, which is down to the number of Processors/Threads assigned to the WAS, why do we create an async method inside our APP to offload the Request from the WAS to another Thread not belonging to the WAS instead of just increasing the number of Threads available to the WAS?
If you consider the (not so) new Web Sockets that are popping up, when a Web Socket makes a connection to a WAS, a Thread is assigned to that Connection which is held open so Client and WAS can have continual communication. This Thread is ultimately a Thread on the WAS - meaning it is taking up Server resources - whether belonging to the WAS or independent of it (depending on APP design).
However, instead of creating an independent Thread not belonging to the WAS, why not just increase the number of Threads available to the WAS? Ultimately, the number of Threads you can have is down to the resources - MEMORY, CPU - available on the Server. Or is it the case that by offloading the Connection to a new Thread, you simply don't need to think about how many Threads to assign to the WAS (which seems dangerous, because now you can use up Server resources without proper monitoring)? It just seems as if a problem is being passed down to the APP - and thus the Developer - instead of being managed at the WAS.
Or am I simply misunderstanding how a Web Application Server works?
Putting it into a simple Web Application Server example: the following offloads the incoming Connection straight to a Thread. I am not limiting the number of Threads that can be created, but I am limited by the number of open connections allowed on my MacBook. I have also noticed that if the backlog (the second argument to the ServerSocket constructor, currently 50) is set too small, I start receiving Broken Pipes and Connection Resets on the Client side.
import java.io.IOException;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Date;

public class Server {

    public static void main(String[] args) throws IOException {
        try (ServerSocket listener = new ServerSocket(9090, 50)) {
            while (true) {
                new Run(listener.accept()).start();
            }
        }
    }

    static class Run extends Thread {
        private Socket socket;

        Run(Socket socket) {
            this.socket = socket;
        }

        @Override
        public void run() {
            try {
                System.out.println("Processing Thread " + getName());
                PrintWriter out = new PrintWriter(this.socket.getOutputStream(), true);
                out.println(new Date().toString());
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    this.socket.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
And now, using asynchronous processing, you are just passing the Thread on to another Thread. You are still limited by System Resources - the allowed number of open files, connections, memory, CPU, etc.
import java.io.IOException;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Date;

public class Server {

    public static void main(String[] args) throws IOException {
        try (ServerSocket listener = new ServerSocket(9090, 100)) {
            while (true) {
                new Synchronous(listener.accept()).start();
            }
        }
    }

    // assumed Synchronous but really it's a Thread from the WAS
    // so is already asynchronous when it enters this Class
    static class Synchronous extends Thread {
        private Socket socket;

        Synchronous(Socket socket) {
            this.socket = socket;
        }

        @Override
        public void run() {
            System.out.println("Passing Socket to Asynchronous " + getName());
            new Asynchronous(this.socket).start();
        }
    }

    static class Asynchronous extends Thread {
        private Socket socket;

        Asynchronous(Socket socket) {
            this.socket = socket;
        }

        @Override
        public void run() {
            try {
                System.out.println("Processing Thread " + getName());
                PrintWriter out = new PrintWriter(this.socket.getOutputStream(), true);
                out.println(new Date().toString());
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    this.socket.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Looking at this Blog about Netflix 'tuning-tomcat-for-a-high-throughput', it looks like Tomcat does the same as my first code example. So Asynchronous processing in the Application shouldn't be necessary.
Tomcat by default has two properties that affect load: acceptCount, which defines the maximum Queue size (default: 100), and maxThreads, which defines the maximum number of simultaneous request processing threads (default: 200). There is also maxConnections, but I'm not sure of its purpose when maxThreads is defined. You can read about them in the Tomcat Config documentation.
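For reference, these attributes are set on the Connector element in Tomcat's conf/server.xml. This is only an illustrative sketch: acceptCount and maxThreads are shown at the defaults mentioned above, and the maxConnections value is just an example.
<!-- conf/server.xml: the attributes discussed above, with illustrative values -->
<Connector port="8080" protocol="HTTP/1.1"
           acceptCount="100"
           maxThreads="200"
           maxConnections="10000"
           connectionTimeout="20000" />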
Late, but maybe better than never. :)
I don't have a great answer to "why async servlets?", but I think there is another bit of information that may be helpful to you.
What you are describing for the WAS is what Tomcat used to do in its BIO connector. It was basically a thread-per-connection model. This limits the number of requests you can serve, not just because of the maxThreads setting, but also because the worker thread would potentially continue to be tied up waiting for additional requests on the connection if a Connection: Close wasn't sent. (See https://www.javaworld.com/article/2077995/java-concurrency/java-concurrency-asynchronous-processing-support-in-servlet-3-0.html and What is the difference between Tomcat's BIO Connector and NIO Connector?)
Switching to the NIO connector allows Tomcat to maintain thousands of connections while keeping only a small pool of worker threads.

ServiceStack RedisMqServer not always handling messages published from separate application

Context
I have a RedisMqServer configured to handle a single message type on my ServiceStack web service. The messages on that MQ originate from another application and show up in the .inq with all the correct properties. Everything is on 4.0.38.
My configuration in MyAppHost.cs:
public override void Configure(Container container)
{
    var redisFactory = new PooledRedisClientManager(0, "etc:etc");
    redisFactory.ConnectTimeout = 5;
    redisFactory.IdleTimeOutSecs = 30;
    redisFactory.PoolTimeout = 3;
    container.Register<IRedisClientsManager>(redisFactory);

    //Plugins, Filters, other Registrations omitted

    var mqHost = new RedisMqServer(redisFactory, retryCount: 2);
    mqHost.DisablePublishingResponses = true;
    mqHost.RegisterHandler<CreateVisitor>(ServiceController.ExecuteMessage);
    mqHost.Start();
}
And then in Global.asax.cs:
void Application_Start(object sender, EventArgs e)
{
    new MyAppHost().Init();
}
Problem
The messages are not consistently handled when I deploy this elsewhere. They wait in the .inq for an arbitrary length of time. Nothing is lost, just delayed for an indeterminate duration.
As of this moment, the only things that come to mind are:
I'm using IIS Express locally, and the server is using IIS.
Application_Start needs to happen before it can handle messages.
I've tried initializing the service by making other API calls over HTTP, before and after queuing messages, with more failure than success. Sometimes the service starts to handle them, but I am unable to identify and thus influence when this happens.
Note
I do have several other console applications and windows services that listen on other MQs and handle messages placed by other applications, and those have always worked flawlessly. This is the first time I've tried this from within an existing web service, however.
It's hard to know what the issue is from this description (are messages getting lost or just delayed?), but this sounds like it's due to ASP.NET AppDomain recycling, in which case you can disable AppDomain recycling or set up a continuous ping route to hit your ASP.NET Web Application and keep the AppDomain alive.
If the ASP.NET Service is available on the Internet, you can use services like https://uptimerobot.com or https://www.pingdom.com and configure them to ping your Service at regular intervals (e.g. every 5-10 minutes); otherwise, if this is an internal Service, you can use a Scheduled Task.
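For the internal case, a minimal sketch of what the Scheduled Task could run; the /ping URL is a placeholder for any route on your service.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class KeepAlivePing
{
    // Run this every few minutes from a Scheduled Task to keep the AppDomain alive.
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var response = await client.GetAsync("http://my-internal-service/ping");
            Console.WriteLine("{0:u} -> {1}", DateTime.UtcNow, (int)response.StatusCode);
        }
    }
}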

Cannot get simple SignalR Azure worker role to work

I am trying to get a simple WebSocket server going using SignalR, OWIN and Azure Worker Roles.
WorkerRole.cs:
public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        string url = "http://" + RoleEnvironment.CurrentRoleInstance.InstanceEndpoints["MyEndpoint"].IPEndpoint;
        using (WebApp.Start<Startup>(url))
        {
            Trace.WriteLine(String.Format("Server running on {0}", url));
        }
        while (true)
        {
        }
    }

    /* ... */
}
Startup.cs:
public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        app.MapSignalR();
    }
}
MyHub.cs:
public class MyHub : Hub
{
    public void Send(string name, string message)
    {
        Clients.All.addMessage(name, message);
    }
}
The Endpoint "MyEndpoint" is defined in the Service as http, public and private port 5001.
After starting the service, it shows up under Azure Compute Emulator as running on 5001. However, if I try to connect to ws://127.0.0.1:5001/signalr (or just ws://127.0.0.1:5001) there is no response. I am using two different web socket clients for this purpose (both are Chrome plugins and they both worked fine using other WebSocket servers).
Questions:
1) Is there anything obviously wrong with my setup?
2) Do I need to use the SignalR JS client libraries to connect to the SignalR server, or should any vanilla client implementing the WebSocket protocol be able to connect?
I know this is a bit of an old post but just in case someone needs it...
1) There are two problems you need to address.
First of all, the Start method in:
using (WebApp.Start<Startup>(url))
{
    Trace.WriteLine(String.Format("Server running on {0}", url));
}
returns an IDisposable (hence the using(...){} block), which means the server is disposed almost immediately after creation, since execution continues right past Trace.WriteLine(...) and exits the using block without pausing.
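Based on the question's own Run method, a minimal sketch of the fix is to keep the blocking loop inside the using block (or hold on to the returned IDisposable for the lifetime of the role) so the host isn't torn down:
public override void Run()
{
    string url = "http://" + RoleEnvironment.CurrentRoleInstance.InstanceEndpoints["MyEndpoint"].IPEndpoint;
    using (WebApp.Start<Startup>(url))
    {
        Trace.WriteLine(String.Format("Server running on {0}", url));

        // Block here so the using scope (and the OWIN host) stays alive.
        while (true)
        {
            Thread.Sleep(TimeSpan.FromSeconds(30)); // requires System.Threading
        }
    }
}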
It's also a bit tricky running these things under the Azure Compute Emulator for a few reasons, mainly because it remaps ports to avoid collisions. If you open up a command prompt and run
netstat -a
you'll find that you have open ports (listening) looking something like this (in my case I'm using port 81):
TCP 127.0.0.1:82 MyComputer:0 LISTENING
TCP 127.0.0.3:81 MyComputer:0 LISTENING
In the general console output of Visual Studio, you'll also most likely see something like
"Windows Azure Tools: Warning: Remapping private port 81 to 82 in role 'MyRoleThingy' to avoid conflict during emulation."
This all means that in order to connect to the server you're hosting using your worker role, you'll have to connect to port 82 instead of 81 (probably 5002 in your case).
2) If you implement the protocol, anything should work I think. Managing an initial connection on the port should always work.
