GCP Cloud Run Socket.IO Timeout - node.js

I'm currently deploying my Socket.IO server with Node.js/Express on Google Cloud Platform using Cloud Build + Run, and it works pretty well.
The issue I'm having is that GCP automatically times out all Socket.IO connections after 1 hour, and it's really annoying. The application I'm running forces it to be run in the background for hours on end, with multiple people in each socket room and interacting with it a bit every 30 mins to 1 hour.
That's why I have 2 questions:
How can I gracefully handle these timeouts? I have a reconnection process setup on my client, checking if the socket is connected every 5 seconds, but for some reason it can't detect when these timeouts happen and I'm not sure why.
Is there a better platform I can deploy my Socket.IO server on? I don't like the timeouts that GCP sets - would a platform like Digital Ocean or Azure be better?

Cloud Run has a maximum request timeout of 3600s, whatever the protocol (HTTP, HTTP/2, streaming or not). If you need to keep a connection open for longer than that, Cloud Run isn't the right platform.
I'd recommend having a look at App Engine Flex or GKE Autopilot. Both offer longer timeouts and the ability to run jobs in the background, and both accept containers.
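On the first question (handling the timeout gracefully), one option is to react to the client's disconnect event rather than polling the connection state every 5 seconds. The following is a minimal sketch, assuming the Socket.IO v3/v4 client; `attachReconnectHandler`, `needsManualReconnect` and the `rejoinRooms` callback are hypothetical names for illustration, not part of Socket.IO.

```javascript
// Disconnect reasons after which the Socket.IO client (v3/v4) does NOT
// reconnect by itself, so socket.connect() has to be called manually.
// ("io client disconnect" is left out on purpose: it means the app itself
// called socket.disconnect().)
const MANUAL_RECONNECT_REASONS = new Set(["io server disconnect"]);

function needsManualReconnect(reason) {
  return MANUAL_RECONNECT_REASONS.has(reason);
}

// `socket` is expected to look like a socket.io-client socket, i.e. it has
// on(event, handler) and connect(). `rejoinRooms` is a hypothetical callback
// that re-sends whatever "join room" message the application uses.
function attachReconnectHandler(socket, rejoinRooms) {
  socket.on("disconnect", (reason) => {
    if (needsManualReconnect(reason)) {
      socket.connect();
    }
    // For other reasons (e.g. "transport close", "ping timeout") the
    // client retries on its own.
  });

  // "connect" fires on the first connection and after every reconnection,
  // so room membership can be restored here.
  socket.on("connect", () => rejoinRooms(socket));
}
```

With a real client this would be wired up as `attachReconnectHandler(io(serverUrl), rejoin)`. Rooms live on the server side, so a reconnected socket has to join them again; that is why the sketch restores state in the `connect` handler.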

Related

IIS Idle Time-out triggers even though a SignalR connection is still present

In my project, there is a process that can run for a very long time (> 20 min.). The progress is transmitted to interested clients as a percentage value using SignalR. Now I noticed that the server is rigorously terminated after 20 minutes (IIS default Idle Time-out), although a client is connected and actively receiving data via SignalR.
Could it be that communication via WebSockets is not monitored by the IIS routine that resets the timeout? Is there any way to work around the problem? Or have I implemented something wrong?

SignalR long polling repeatedly calls /negotiate and /hub POST and returns 404 occasionally on Azure Web App

We have enabled SignalR on our ASP.NET Core 5.0 web project running on an Azure Web App (Windows App Service Plan). Our SignalR client is an Angular client using the @microsoft/signalr NPM package (version 5.0.11).
We have a hub located at /api/hub/notification.
Everything works as expected for most of our clients, the web socket connection is established and we can call methods from client to server and vice versa.
For a few of our clients, we see a massive number of requests to POST /api/hub/notification/negotiate and POST /api/hub/notification within a short period of time (multiple requests per minute per client). It seems that those clients switch to long polling instead of using web sockets, since we see the POST /api/hub/notification requests.
We have the suspicion that the affected clients could maybe sit behind a proxy or a firewall which forbids the web sockets and therefore the connection switches to long polling in the first place.
The following screenshot shows requests to the hub endpoints for one single user within a short period of time. The list is very long since this pattern repeats as long as the user has opened our website. We see two strange things:
The client repeatedly calls /negotiate twice every 15 seconds.
The call to POST /notification?id=<connectionId> takes exactly 15 seconds and the following call with the same connection ID returns a 404 response. Then the pattern repeats and /negotiate is called again.
For testing purposes, we enabled only long polling in our client. This works for us as expected too. Unfortunately, we currently don't have access to the browsers or the network of the users where this behavior occurs, so it is hard for us to reproduce the issue.
Some more notes:
We currently have just one single instance of the Web App running.
We use the Redis backplane for a scale-out scenario in future.
The ARR affinity cookie is enabled and Web Sockets in the Azure Web App are enabled too.
The Web App instance doesn't suffer from high CPU usage or high memory usage.
We didn't change any SignalR options except for adding the Redis backplane. We just use services.AddSignalR().AddStackExchangeRedis(...) and endpoints.MapHub<NotificationHub>("/api/hub/notification").
The website runs on HTTPS.
What could cause these repeated calls to /negotiate and the 404 returns from the hub endpoint?
How can we further debug the issue without having access to the clients where this issue occurs?
Update
We have now implemented a custom logger for the @microsoft/signalr package, which we pass to the configureLogging() overload. This logger writes to our Application Insights, which allows us to track the client-side logs of those clients where our issue occurs.
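A custom logger of this kind can be sketched roughly as follows. This is not the poster's actual code: `createForwardingLogger` and the `sink` callback are hypothetical, and the only assumption about @microsoft/signalr is that a custom logger is an object with a `log(logLevel, message)` method, where levels run from 0 (Trace) to 5 (Critical).

```javascript
// Numeric values mirror the LogLevel enum of @microsoft/signalr.
const LOG_LEVEL = { Trace: 0, Debug: 1, Information: 2, Warning: 3, Error: 4, Critical: 5 };

// Returns an object with log(logLevel, message), the shape the SignalR
// client expects from a custom logger.
// `sink` is a hypothetical function that ships an entry to a telemetry
// backend (for example a thin wrapper around an Application Insights client).
function createForwardingLogger(sink, minLevel = LOG_LEVEL.Information) {
  return {
    log(logLevel, message) {
      if (logLevel < minLevel) {
        return; // drop entries below the configured level
      }
      sink({
        level: logLevel,
        message: message,
        timestamp: new Date().toISOString(),
      });
    },
  };
}
```

The returned object would be passed to the connection builder, for example `new signalR.HubConnectionBuilder().withUrl("/api/hub/notification").configureLogging(createForwardingLogger(sendToTelemetry)).build()`, where `sendToTelemetry` is again a placeholder.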
The following screenshot shows a short snippet of the log entries for one single client.
We see that the WebSocket connection fails (Failed to start the transport "WebSockets" ...) and the fallback transport ServerSentEvents is used. We see the log The HttpConnection connected successfully, but almost exactly 15 seconds after the ServerSentEvents transport is selected, a handshake request is sent, which fails with the server message Server returned handshake error: Handshake was canceled. Some follow-on errors then occur and the connection gets closed. After that, the connection is established again and everything starts over: a new handshake error occurs after those 15 seconds, and so on.
Why does it take so long for the client to send the handshake request? It seems like those 15 seconds are the problem, since this is too long for the server and the server cancels the connection due to a timeout.
We still think that this may have something to do with the client's network (proxy, firewall, etc.).
Fiddler
We used Fiddler to block the WebSockets for testing. As expected, the fallback mechanism starts and ServerSentEvents is used as transport. In contrast to the logs we see from our issue, the handshake request is sent immediately and not after 15 seconds. Then everything works as expected.
You should check which pricing tier your project uses, Free or Standard.
If you are still on the Free tier, there are some restrictions; you should switch to a connection string that belongs to the Standard tier.
Official doc: Azure SignalR Service limits

Setup secondary redundant failover Node.JS server

Introduction
So I made a discord bot and hosted it on Heroku server. The bot is basic, it just listens to new messages and responds to them.
I am using a free tier of heroku and discovered that there are so called "free dyno hours". When it is end of the month, all the free hours are consumed and my bot will go offline until next month.
I own a Raspberry pi and so I thought it would be a good idea to host it there as well in case Heroku goes offline.
What I want to achieve
Host discord bot on hosting platform(Heroku).
use my Raspberry pi as a failover server.
Ensure that only one instance of the bot is running at a time.
If both servers are online Heroku(chosen as the main server) has priority.
If the application on Heroku somehow crashes or goes offline the bot on my Raspberry gets activated.
And vice versa.
How would that work if I had 3 and more servers?
How would I approach this if I had the bot connected to database?
This is just an example; I would appreciate a more general solution.
NginX(load balancer)
When searching for help, I came across load balancers and NginX.
If I am right, it basically functions like a gateway, all requests go to the server NginX is hosted on and then they get redirected to one of many servers(depending on how loaded they are etc.).
I have a few problems with this:
my bot doesn't receive any requests, it just listens to the discord api.
if the server with Nginx fails it all falls apart?
My solution 1
My first approach was to let the bot running on both Heroku and Raspberry at the same time.
It guarantees that the bot runs on at least one server if the other one fails.
But it is not ideal as the bot responds 2 times when both servers are up.
My solution 2
Heroku:
I added a simple Rest api endpoint with express.js to the bot app.
This way I can check if the bot on Heroku is running.
Raspberry pi:
I wrapped the bot app with this Node program that basically functions like a switch.
Link to a git repo: https://github.com/Matyanson/secondary-node-server.git
Every 3 minutes the program calls the endpoint and then by looking at the status code it checks whether the app is down.
Then it either starts the app or shuts down the app on raspberry(using pm2).
This kinda works but I am sure there is a better solution!
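The switch logic described above can be sketched as a small pure function, kept separate from the HTTP call and from pm2. This is an illustration, not the code from the linked repository: `createFailoverSwitch` is a hypothetical name, and the `failureThreshold` parameter (waiting for several failed checks in a row before taking over) is an addition that the original description does not mention.

```javascript
// Returns a function that is called after every health check and decides
// what the standby machine (the Raspberry Pi) should do next.
//   statusCode:   HTTP status returned by the primary's endpoint,
//                 or null if the request failed or timed out
//   localRunning: whether the local copy of the bot is currently running
// Result: "start", "stop" or "none".
function createFailoverSwitch(failureThreshold = 2) {
  let consecutiveFailures = 0;

  return function onHealthCheck(statusCode, localRunning) {
    const healthy = statusCode !== null && statusCode >= 200 && statusCode < 300;
    consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;

    if (healthy) {
      // The primary has priority: step down as soon as it is back.
      return localRunning ? "stop" : "none";
    }
    if (consecutiveFailures >= failureThreshold && !localRunning) {
      return "start";
    }
    return "none";
  };
}
```

The caller would run the check on a timer (every 3 minutes in the setup above) and map "start" and "stop" to the corresponding pm2 commands. Requiring more than one failed check avoids starting a second instance because of a single dropped request, at the cost of a slower takeover.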
There is also a problem with Heroku(you can skip this):
Heroku uses Dynos(containers) to run the apps. There are different configurations of dynos including 'Web' and 'Worker'.
Worker is used for background jobs and is never put to sleep.
Web dyno is the only dyno to receive HTTP traffic. It is put to sleep after 30min of receiving no web traffic.
I can switch on/off either of these.
both: there are 2 instances of my bot.
Worker: Can't connect to the endpoint.
Web: have to 'ping' the api at least every 30min or the bot sleeps.
Why am I asking?
Not because I need to solve this exact problem but because I want to learn a good way of doing this to use it in the future projects.
I also think this would be a good way to not 100% rely on external Hosting providers.

Google cloud app engine 1 hour latency without noticeable response times

I've got a GAE nodejs flex socket.io + express web and websocket server...
Everything works just fine and response times are really good, but when I go to the metrics tab I can see this for latency
an hour latency??? I'm guessing it's something related to socket.io long polling? or websockets themselves?
Anybody cares to explain?
This is related to the lifetime of your sockets, which seems to be 60 minutes. I guess you are using sessions for your app. In this case, since you are using socket.io, it falls back to HTTP long polling, just as you mentioned. To get better performance, there is a new session affinity setting in app.yaml. You can take a look at it.
If your app is working just fine, just keep monitoring your memory and CPU usage. Always try to manage your resources well.

Nodejitsu and Twitter Streaming API - multiple disconnects

Hi, I've been struggling with this issue for a few days now. I have a simple node.js app that connects to Twitter's streaming API and tracks a few terms. When a term is found, the client side gets a websocket notification. I've made sure that my OAuth credentials are only used by this app and that the connection to the streaming API occurs only on app startup. What keeps happening is that I get a 200 OK response but the stream then disconnects. I have it set to reconnect in 30 seconds, but it's becoming ridiculous. It seems to be fine for a few minutes after restarting the app and then goes back to repeatedly disconnecting. The error is {"disconnect":{"code":7,"stream_name":"XXXXX-statuses158325","reason":"admin logout"}}. I have run the same app locally with multiple client connections and not had a problem. I looked into other hosting services but I can't find one that supports websockets without having to revert to a slow long polling option on socket.io (which won't work for my app's purposes).
Any ideas for why this keeps happening?
That error means that you're connecting again with the same credentials (https://dev.twitter.com/discussions/11251).
One cause might be running more than 1 drone.
If this doesn't help, join us on http://webchat.jit.su and we'll do our best to help you :D
-yawnt

Resources