I have an ASP.NET Core server app with SignalR. There is also a client app, written with .NET Core and SignalR. The client connects to the server and they work together.
While working in a local development environment everything works as expected, but when I deploy my server app to Azure App Service I notice weird behaviour.
When the client disconnects from the server, the client-disconnected event arrives with a huge lag of ~30 seconds. Other calls between the client and the server work normally; we observe the lag only with the client disconnection event.
While investigating this issue I came up with a solution: reduce the KeepAliveInterval in the server settings:
.AddSignalR(configure =>
{
    configure.EnableDetailedErrors = true;
    configure.MaximumReceiveMessageSize = null;
    configure.KeepAliveInterval = TimeSpan.FromSeconds(1);
});
With this change, the client disconnection event fires immediately.
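As a side note, reducing the keep-alive should be mirrored on the client: the client treats the connection as lost if nothing arrives within its server-timeout, which should be at least double the keep-alive. The client in this question is .NET, but as an illustration, the JavaScript client exposes the matching knobs like this (a minimal sketch; the hub path is a placeholder):

import * as signalR from "@microsoft/signalr";

const connection = new signalR.HubConnectionBuilder()
    .withUrl("/myhub") // placeholder hub path
    .build();

// If nothing is received from the server within this window, the client
// considers the connection dead. Keep it at least double the server's
// KeepAliveInterval; with a 1 s keep-alive, a few seconds is plenty.
connection.serverTimeoutInMilliseconds = 5000;

// How often the client itself pings the server.
connection.keepAliveIntervalInMilliseconds = 1000;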
Can someone explain why this issue occurs with the Release build on Azure?
The issue did not happen in the local Development environment.
The issue did not happen in the local Release environment.
In a SignalR hub, the OnReconnected event handler can execute directly after OnConnected, but not after OnDisconnected, for a given client. There are several reasons why a reconnection can happen without a preceding disconnection:
Physical connections might be slow or there might be interruptions in connectivity. Depending on factors such as the length of the interruption, the transport connection might be dropped. SignalR then tries to re-establish the transport connection. Sometimes the transport connection API detects the interruption and drops the transport connection, and SignalR finds out immediately that the connection is lost.
In other scenarios, neither the transport connection API nor SignalR becomes aware immediately that connectivity has been lost. For all transports except long polling, the SignalR client uses a function called keepalive to check for loss of connectivity that the transport API is unable to detect. For information about long polling connections, see Timeout and keepalive settings.
Refer to the ASP.NET SignalR documentation on connection lifetime events for more info.
In my project, there is a process that can run for a very long time (> 20 min.). The progress is transmitted to interested clients as a percentage value using SignalR. Now I noticed that the server is unconditionally terminated after 20 minutes (the IIS default Idle Time-out), although a client is connected and actively receiving data via SignalR.
Could it be that communication via WebSockets is not monitored by the IIS routine that resets the timeout? Is there any way to work around the problem? Or have I implemented something wrong?
We have enabled SignalR on our ASP.NET Core 5.0 web project running on an Azure Web App (Windows App Service Plan). Our SignalR client is an Angular client using the @microsoft/signalr NPM package (version 5.0.11).
We have a hub located at /api/hub/notification.
Everything works as expected for most of our clients, the web socket connection is established and we can call methods from client to server and vice versa.
For a few of our clients, we see a massive number of requests to POST /api/hub/notification/negotiate and POST /api/hub/notification within a short period of time (multiple requests per minute per client). It seems like those clients switch to long polling instead of using web sockets, since we see the POST /api/hub/notification requests.
We suspect that the affected clients could be sitting behind a proxy or a firewall which blocks web sockets, and that the connection therefore switches to long polling in the first place.
The following screenshot shows requests to the hub endpoints for one single user within a short period of time. The list is very long since this pattern repeats as long as the user has opened our website. We see two strange things:
The client repeatedly calls /negotiate twice every 15 seconds.
The call to POST /notification?id=<connectionId> takes exactly 15 seconds and the following call with the same connection ID returns a 404 response. Then the pattern repeats and /negotiate is called again.
For testing purposes, we enabled only long polling in our client. This works for us as expected too. Unfortunately, we currently don't have access to the browsers or the network of the users where this behavior occurs, so it is hard for us to reproduce the issue.
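For reference, restricting the @microsoft/signalr client to long polling looks roughly like this (a sketch, using the hub path from above):

import * as signalR from "@microsoft/signalr";

// Skip the transport negotiation fallback and go straight to long polling.
const connection = new signalR.HubConnectionBuilder()
    .withUrl("/api/hub/notification", {
        transport: signalR.HttpTransportType.LongPolling,
    })
    .build();

connection.start().catch((err) => console.error(err));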
Some more notes:
We currently have just one single instance of the Web App running.
We use the Redis backplane for a future scale-out scenario.
The ARR affinity cookie is enabled and Web Sockets in the Azure Web App are enabled too.
The Web App instance doesn't suffer from high CPU usage or high memory usage.
We didn't change any SignalR options except for adding the Redis backplane. We just use services.AddSignalR().AddStackExchangeRedis(...) and endpoints.MapHub<NotificationHub>("/api/hub/notification").
The website runs on HTTPS.
What could cause these repeated calls to /negotiate and the 404 returns from the hub endpoint?
How can we further debug the issue without having access to the clients where this issue occurs?
Update
We now implemented a custom logger for the @microsoft/signalr package, which we pass to the configureLogging() overload. This logger logs into our Application Insights, which allows us to track the client-side logs of those clients where our issue occurs.
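The logger itself is not shown in the question; a minimal sketch of such a logger might look like the following (the Application Insights setup and connection string are assumptions, not taken from the question):

import * as signalR from "@microsoft/signalr";
import { ApplicationInsights, SeverityLevel } from "@microsoft/applicationinsights-web";

const appInsights = new ApplicationInsights({
    config: { connectionString: "<your-connection-string>" }, // placeholder
});
appInsights.loadAppInsights();

// Forwards every SignalR client log entry to Application Insights as a trace.
class AppInsightsLogger implements signalR.ILogger {
    log(logLevel: signalR.LogLevel, message: string): void {
        appInsights.trackTrace({
            message: `[SignalR] ${message}`,
            severityLevel:
                logLevel >= signalR.LogLevel.Error
                    ? SeverityLevel.Error
                    : SeverityLevel.Information,
        });
    }
}

const connection = new signalR.HubConnectionBuilder()
    .withUrl("/api/hub/notification")
    .configureLogging(new AppInsightsLogger())
    .build();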
The following screenshot shows a short snippet of the log entries for one single client.
We see that the WebSocket connection fails (Failed to start the transport "WebSockets" ...) and the fallback transport ServerSentEvents is used. We see the log The HttpConnection connected successfully, but almost exactly 15 seconds after selecting the ServerSentEvents transport, a handshake request is sent, which fails with the server message Server returned handshake error: Handshake was canceled. After that, some more follow-on errors occur and the connection gets closed. Then the connection gets established again and everything starts over: a new handshake error occurs after those 15 seconds, and so on.
Why does it take so long for the client to send the handshake request? It seems like those 15 seconds are the problem: this is too long for the server, and the server cancels the connection due to a timeout (notably, ASP.NET Core SignalR's server-side HandshakeTimeout defaults to 15 seconds).
We still think that this may have something to do with the clients' network (proxy, firewall, etc.).
Fiddler
We used Fiddler to block WebSockets for testing. As expected, the fallback mechanism kicks in and ServerSentEvents is used as the transport. In contrast to the logs from our issue, the handshake request is sent immediately, not after 15 seconds. Then everything works as expected.
You should check which pricing tier you use in your project, Free or Standard.
If you are still on the Free tier, there are some restrictions; you should switch to a connection string from a Standard tier instance.
Official doc: Azure SignalR Service limits
Currently I am using the STOMP protocol to send messages to ActiveMQ and to listen for messages. This is done in Node.js using the stompit library.
When the application experiences high CPU or memory usage, it stops sending heartbeats to the broker. The broker then redelivers the message that is currently being processed, leading to repeated processing of the same message.
With the heartbeat disabled, the application seems to work fine, but I am unsure what further issues disabling the heartbeat might cause. Even when the broker is stopped while sending messages, the behaviour seems to be the same with or without the heartbeat.
I have read that it is an optional parameter, but I am unable to find out its exact use cases.
Can anyone mention scenarios where having no heartbeat can cause issues for the application?
Regarding the purpose of heart-beating, the STOMP 1.2 specification just says:
Heart-beating can optionally be used to test the healthiness of the underlying TCP connection and to make sure that the remote end is alive and kicking.
Heart-beats potentially flow both from the client to the server and from the server to the client so the "remote end" referenced in the spec here could be the client or the server.
For the server, heart-beating is useful to ensure that server-side resources are cleaned up in a timely manner to avoid excessive strain. Server-side resources are maintained for all client connections and it helps the broker to be able to detect quickly when those connections fail (i.e. heart-beats aren't received) so it can clean up those resources. If heart-beating is disabled then it's possible that a dead connection would not be detected and the server would have to maintain its resources for that dead connection in vain.
For a client, heart-beating is useful to avoid message loss when performing asynchronous sends. Messages are often sent asynchronously by clients (i.e. fire and forget). If there was no mechanism to detect connection loss the client could continue sending messages async on a dead connection. Those messages would be lost since they would never reach the broker. Heart-beating mitigates this situation.
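For context, with stompit the heart-beat is negotiated through the CONNECT frame's heart-beat header. A minimal sketch (the broker address and credentials are placeholders):

// stompit ships without type definitions, so a plain require keeps this simple.
const stompit = require("stompit");

const connectOptions = {
    host: "broker.example.com", // placeholder
    port: 61613,
    connectHeaders: {
        host: "/",
        login: "user",          // placeholder credentials
        passcode: "password",
        // "cx,cy": we send a beat every cx ms and expect one every cy ms.
        // "0,0" disables heart-beating entirely.
        "heart-beat": "5000,5000",
    },
};

stompit.connect(connectOptions, (error, client) => {
    if (error) {
        // Covers failed connects as well as connections dropped after missed beats.
        console.error("connect error: " + error.message);
        return;
    }
    // ... subscribe and send as usual
});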
We use SignalR with an Azure web app in an ASE for our real-time web application.
We noticed that SignalR sometimes loses the connection to the hub in no particular pattern.
This happens both during high-traffic periods and low-traffic ones, but I am more interested in why this is happening during low-traffic periods.
Note: We have a so-called "1-minute auto refresh" which is triggered by the JavaScript on the page. That seems to be working.
Anyone experienced similar issues using SignalR, and if so, how did you resolve this?
Thank you
(a tester, don't be too harsh! lol)
I have definitely experienced this, and it drove me nuts.
By default, a SignalR client will try to reconnect for 20 seconds after losing connection to its Hub. After 20 seconds without a successful reconnect, the disconnected event is raised on JavaScript clients. After disconnected is raised, the client will give up trying to reconnect and the connection is dead. This page describes SignalR lifecycle events and offers some code on trying to reconnect after the disconnected event is raised.
Now as to why this happens. I've noticed that an App Pool recycle can take longer than 20 seconds in some apps, which can lead to a disconnected event. Intermittent drops in network connectivity between your JavaScript clients and Hub that lasts more than 20 seconds can cause this also. The bottom line is that things can go wrong that are beyond your control and you cannot code around them. Therefore, put in place the logic to attempt to reconnect after your JavaScript client receives the disconnected event.
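For the classic (jQuery-based) JavaScript client this answer refers to, that reconnect logic boils down to something like the following sketch:

declare const $: any; // supplied by the jquery.signalR script include

// Once 'disconnected' fires, SignalR has given up; start a fresh connection.
$.connection.hub.disconnected(function () {
    setTimeout(function () {
        $.connection.hub.start();
    }, 5000); // back off for 5 s so a struggling server isn't hammered
});

(In ASP.NET Core SignalR, the @microsoft/signalr client offers withAutomaticReconnect() on the HubConnectionBuilder for the same purpose.)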
On the server side for WebSockets, there is already a ping/pong implementation where the server sends a ping and the client replies with a pong, letting the server know whether a client is connected or not. But there isn't something implemented in reverse to let the client know if the server is still connected.
There are two ways to go about this that I have read:
1. Every client sends a message to the server every x seconds; whenever an error is thrown while sending, that means the server is down, so reconnect.
2. The server sends a message to every client every x seconds; the client receives this message and updates a variable, and on the client side a thread constantly checks every x seconds whether this variable has changed. If it hasn't in a while, no message has been received from the server, so you can assume the server is down and re-establish the connection (sketched below).
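A minimal sketch of that second approach, using the browser WebSocket API (the endpoint URL and timeout values are placeholders):

let lastSeen = Date.now();

function connect(): WebSocket {
    const socket = new WebSocket("wss://example.com/ws"); // placeholder endpoint
    // Any message from the server counts as proof of life.
    socket.onmessage = () => { lastSeen = Date.now(); };
    return socket;
}

let socket = connect();

const SERVER_TIMEOUT_MS = 10_000; // should comfortably exceed the server's send interval

setInterval(() => {
    if (Date.now() - lastSeen > SERVER_TIMEOUT_MS) {
        socket.close();        // give up on the stale connection
        lastSeen = Date.now(); // restart the clock for the new attempt
        socket = connect();
    }
}, 1_000);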
Either method lets the client figure out whether the server is still online. With the first, you send traffic to the server; with the second, traffic flows out of the server. Both seem easy enough to implement, but I'm not sure which is better in terms of efficiency/cost-effectiveness.
Server upload speeds are higher than client upload speeds, but server CPUs are an expensive resource while client CPUs are relatively cheap. Offloading logic onto the client is the more cost-effective approach...
Having said that, servers must implement this specific logic (actually, all ping/timeout logic), otherwise they might be left with "half-open" sockets that drain resources but aren't connected to any client.
Remember that sockets (file descriptors) are a limited resource. Not only do they use memory even when no traffic is present, but they prevent new clients from connecting when the resource is maxed out.
Hence, servers must clear out dead sockets, either using timeouts or by implementing ping.
P.S.
I'm not a node.js expert, but this type of logic should be implemented using the WebSocket protocol's ping rather than by your application. You should probably look into your node.js server / WebSocket framework and check how to enable pinging.
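With the popular ws package, for instance, protocol-level pings can be enabled along these lines (a sketch; the port and interval are arbitrary, and the interval is deliberately kept well under proxy idle timeouts like the ~55 seconds mentioned below):

import { WebSocketServer, WebSocket } from "ws";

// Track liveness per socket; a missing pong by the next sweep means dead.
interface LiveSocket extends WebSocket {
    isAlive?: boolean;
}

const wss = new WebSocketServer({ port: 8080 }); // arbitrary port

wss.on("connection", (ws: LiveSocket) => {
    ws.isAlive = true;
    // Compliant clients answer protocol pings with pongs automatically.
    ws.on("pong", () => { ws.isAlive = true; });
});

// Every 30 s, terminate sockets that never answered the previous ping.
const interval = setInterval(() => {
    for (const ws of wss.clients as Set<LiveSocket>) {
        if (ws.isAlive === false) { ws.terminate(); continue; }
        ws.isAlive = false;
        ws.ping();
    }
}, 30_000);

wss.on("close", () => clearInterval(interval));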
You should set pings to accommodate your specific environment. I.e., if you host on Heroku, then Heroku will enforce a timeout of ~55 seconds, and your pings should be sent before this timeout occurs.