Socket.io huge server response time when using xhr-polling - node.js

I am trying to scale a messaging app. I'm using Node.js with Socket.io and its Redis store on the backend. Clients can be the iPhone native browser, Android browsers, etc.
I am using SSL for the node connection and Nginx to load balance the socket connections. I am not clustering my socket.io app; instead I am load balancing over 10 node servers (we have a massive number of users). Everything looks fine when the transport is WebSocket, but when it falls back to xhr-polling (in the case of old Android phones) I see huge response times in New Relic, with up to 3000 rpm, and I have to restart my node servers every hour or so, otherwise the server crashes.
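For context, the nginx side looks roughly like this (IPs, ports, and certificate details are placeholders sketched from the description above):

    # Placeholder sketch of the load-balancing setup described above.
    upstream socketio_nodes {
        ip_hash;                 # keep a client pinned to one node server
        server 10.0.0.1:3000;    # ...one entry per node server, 10 in total
        server 10.0.0.2:3000;
    }

    server {
        listen 443 ssl;
        # ssl_certificate / ssl_certificate_key omitted

        location / {
            proxy_pass http://socketio_nodes;
            proxy_http_version 1.1;                    # required for websockets
            proxy_set_header Upgrade $http_upgrade;    # pass the upgrade through
            proxy_set_header Connection "upgrade";
        }
    }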
I was wondering if I am doing anything wrong, and whether there are any measures I can take to scale socket.io when using the xhr-polling transport, such as increasing or decreasing the poll duration?

You are not doing anything wrong. xhr-polling is also called long polling; the name comes from the fact that the connection is kept open longer, usually until some answer can be sent down the wire. After the connection closes, a new connection is opened to wait for the next piece of information.
You can read more on this here http://en.wikipedia.org/wiki/Push_technology#Long_polling
New Relic shows you the response time of the polling request. Socket.IO has a default "polling duration" of 20 seconds.
You will get a higher RPM for a smaller polling duration and a smaller RPM for a higher polling duration. I would consider increasing the polling duration or just keeping the 20-second default.
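For reference, a minimal sketch of how that knob is set, assuming the 0.9-era io.set configuration API (the versions that still name the transport xhr-polling):

    var sio = require('socket.io');
    var io = sio.listen(server); // server: your existing HTTP(S) server

    io.configure(function () {
      // Seconds a polling request is held open before the server responds.
      // 20 is the default; a larger value means fewer requests per client.
      io.set('polling duration', 30);
    });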
Also, to keep New Relic from displaying irrelevant data for the long polling, you can add ignore rules in the newrelic.js that you require in your app. This is also detailed in the newrelic npm module documentation here: https://www.npmjs.org/package/newrelic#rules-for-naming-and-ignoring-requests
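For example, a minimal newrelic.js along those lines; the ignore pattern is the one given in the module's documentation for socket.io's polling endpoint, while the app name and license key are placeholders:

    // newrelic.js -- picked up when your app does require('newrelic')
    exports.config = {
      app_name: ['my-messaging-app'],   // placeholder
      license_key: 'YOUR_LICENSE_KEY',  // placeholder
      rules: {
        // Don't let long-held polling requests skew response-time metrics.
        ignore: ['^/socket\\.io/.*/xhr-polling']
      }
    };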

Related

Google cloud app engine 1 hour latency without noticeable response times

I've got a GAE Node.js flex socket.io + express web and websocket server...
Everything works just fine and response times are really good, but when I go to the metrics tab I can see this for latency:
An hour of latency??? I'm guessing it's something related to socket.io long polling? Or websockets themselves?
Anybody care to explain?
This is related to how long your sockets are kept alive, which seems to be 60 minutes. I guess you are using sessions in your app. In this case, since you are using socket.io, it falls back on HTTP long polling, just as you mentioned. To get better performance there is a new session affinity setting in app.yaml. You can take a look into it.
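A minimal sketch of that setting in the flexible environment's app.yaml:

    network:
      session_affinity: true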
If your app is working just fine, keep monitoring your memory and CPU usage, and always try to manage your resources well.

Is there a hard limit on socket.io connections?

Background
We have a server running socket.io 2.0.4. This server receives requests from a stress script that simulates clients using socket.io-client 2.0.4.
The script simulates the creation of clients (each client with its own socket) that send a request and immediately die afterwards, using socket.disconnect();
Problem
During the first few seconds all goes well. But every test reaches a point at which the script starts spitting out the following error:
connect_error: Error: websocket error
This means that the clients my script is creating cannot establish a connection to the server.
This script creates 7 clients per second (spaced evenly throughout the second); each client makes 1 request and then dies.
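For illustration, a sketch of that loop using socket.io-client 2.x (the URL and event name are placeholders):

    // Spawn ~7 short-lived clients per second, evenly spaced.
    const io = require('socket.io-client');

    setInterval(() => {
      const socket = io('http://localhost:3000', { forceNew: true });
      socket.on('connect', () => {
        socket.emit('petition', { ts: Date.now() }); // one request...
        socket.disconnect();                         // ...then the client dies
      });
      socket.on('connect_error', (err) => {
        console.error('connect_error:', err.message);
      });
    }, 1000 / 7);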
Research
At first I thought there was an issue with file descriptors and limits imposed by UNIX, since the server is on a Debian machine:
https://github.com/socketio/socket.io/issues/1393
After following these suggestions, however, the issue remained.
Then I thought maybe my test script was not connecting correctly, so I changed the connection options as in this discussion:
https://github.com/socketio/socket.io-client/issues/1097
Still, to no avail.
What could be wrong?
I can see the machine's CPUs are constantly at 100%, so I guess I am pounding the server with requests.
But if I am not mistaken, the server should simply accept more requests and process them when possible.
Questions
Is there a limit to the number of connections a socket.io server can handle?
When making such stress tests, one needs to be aware of protections and gatekeepers.
In our case, our stack was deployed on AWS. So first, the AWS load balancers started blocking us because they thought the system was being DDoSed.
Then, the Debian system itself was getting flooded and started refusing connections once its SYN flood protection kicked in.
But after fixing that we were still seeing the error. It turned out we had to increase the TCP connection buffers and change how TCP connections were being handled by the kernel.
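The answer doesn't list the exact kernel settings, but the usual suspects for this failure mode are the accept and SYN backlog queues, e.g. in /etc/sysctl.conf (values are illustrative):

    # /etc/sysctl.conf -- illustrative values, tune for your workload
    net.core.somaxconn = 4096            # accept-queue length
    net.ipv4.tcp_max_syn_backlog = 4096  # queue for half-open connections
    net.ipv4.tcp_syncookies = 1          # survive SYN bursts without refusing

    # apply without reboot:
    #   sysctl -p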
Now it accepts all connections, but I wish no one the suffering we went through to figure this out...

socket.io disconnects clients when idle

I have a production app that uses socket.io (Node.js back-end) to distribute messages to all logged-in clients. Many of my users are experiencing disconnections from the socket.io server. The normal use case is for a client to keep the web app open for the entire working day. Most of that time the app sits idle but open, until the socket.io connection is lost and the app kicks them out.
Is there any way I can make the connection more reliable so my users are not constantly losing their connection to the socket.io server?
It appears that all we can do here is give you some debugging advice so that you might learn more about what is causing the problem. So, here's a list of things to look into.
Make sure that socket.io is configured for automatic reconnect. In the latest versions of socket.io, auto-reconnect defaults to on, but you may need to verify that no piece of code is turning it off (a sketch of the relevant options follows this list).
Make sure the client is not going to sleep in such a way that all network connections become inactive or get disconnected.
In a working client (before it has disconnected), use the Chrome debugger's Network tab, WebSockets sub-tab, to verify that you can see regular ping messages going between client and server. You will have to open the debug window, get to the Network tab and then refresh your web page with that debug window open to start to see the network activity. You should see a funky-looking URL that has ?EIO=3&transport=websocket&sid=xxxxxxxxxxxx in it. Click on that, then click on the "Frames" sub-tab. At that point, you can watch individual websocket packets being sent. You should see tiny packets of length 1 every once in a while (these are the ping and pong keep-alive packets). There's a sample screenshot below that shows what you're looking for. If you aren't seeing these keep-alive packets, then you need to resolve why they aren't there (likely some socket.io configuration or version issue).
Since you mentioned that you can reproduce the situation, one thing you want to know is how the socket is getting closed (client-end initiated or server-end initiated). One way to gather info on this is to install a network analyzer on your client so you can literally watch every packet that goes over the network to/from your client. There are many different analyzers and many are free. I personally have used Fiddler, but I regularly hear people talking about Wireshark. What you want to see is exactly what happens on the network when the client loses its connection. Does the client decide to send a close-socket packet? Does the client receive a close-socket packet from someone? What happens on the network at the time the connection is lost?
[Screenshot: webSocket network view in Chrome Debugger]
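On the first point, here is a sketch of the client-side options involved; these values are socket.io-client's documented defaults, spelled out explicitly so nothing can silently override them (the URL is a placeholder):

    const socket = io('https://example.com', {
      reconnection: true,             // auto-reconnect on (the default)
      reconnectionAttempts: Infinity, // never give up (the default)
      reconnectionDelay: 1000,        // first retry after 1s
      reconnectionDelayMax: 5000      // back off to at most 5s between tries
    });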
The most likely cause is one end closing the WebSocket due to inactivity. This is commonly done by load balancers, but there may be other culprits. The fix is simply to send a message to every client every so often (I use 30 seconds, but depending on the issue you may be able to go higher). This prevents the connection from appearing inactive and thus getting closed.
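A minimal sketch of that fix on the server, assuming io is your socket.io server instance and 'heartbeat' is a placeholder event name:

    // Emit a heartbeat to every client so intermediaries never see the
    // connection as idle. 30s is an assumption; raise it if you can.
    setInterval(() => {
      io.emit('heartbeat', Date.now());
    }, 30 * 1000);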

socket.io, node cluster, express, session store with redis

I'm not exactly sure how to describe this, but I'm running a node app with a cluster on 4 cores, on port 80, using RedisStore as the socket.io store, with express.js, and socket.io listens on it.
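Roughly, that setup looks like this, assuming the 0.9-era store API that RedisStore implies (details simplified):

    var cluster = require('cluster');
    var http = require('http');
    var express = require('express');
    var sio = require('socket.io');
    var redis = require('redis');
    var RedisStore = require('socket.io/lib/stores/redis');

    if (cluster.isMaster) {
      for (var i = 0; i < 4; i++) cluster.fork(); // one worker per core
    } else {
      var app = express();
      var server = http.createServer(app).listen(80);
      var io = sio.listen(server);
      io.set('store', new RedisStore({
        redisPub: redis.createClient(),
        redisSub: redis.createClient(),
        redisClient: redis.createClient()
      }));
    }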
Some interesting behavior: for around 40% of clients that connect to socket.io using Chrome (and Firefox, but we stopped bothering with different browsers because it seems to happen across the board), the connection works fine for the first 25-30 seconds; then there are 60 seconds of dead time where requests are sent from the client but none are received or acknowledged by the server; and at 1.5-1.6 minutes the client starts a new websocket connection. Sometimes that new websocket connection shows the same behavior; at other times the connection "catches" and is persistent for the next few hours and is absolutely fine.
What's funny is that this behavior doesn't happen on our test server, which runs on a different port (and also uses a cluster); moreover, it doesn't happen on any of our local development servers (some of which use clusters, others of which don't).
Any ideas?

Non-Websocket Socket.io clients for benchmarking

There are a bunch of Socket.io client implementations out there, e.g. in Java (see Java socket.io client), that seem to exclusively support the WebSocket protocol.
For benchmarking the server performance of the other transports, is there any Socket.io client available that would allow this? I'm particularly interested in htmlfile, as it will be used by IE browsers < 10 unless I enable Flash (which I'm not sure I'll do, as the socket.io 'flashsocket' transport takes 5 seconds to start on IE 8).
I don't care too much what OS or programming language it is.
There's
https://github.com/Gottox/socket.io-java-client
Besides WebSocket, it only supports XHR, and that feature is currently considered beta:
Status: Connecting with Websocket is production ready. XHR is in beta.
Light testing of xhr-polling seems to support the claim that it is not yet production ready. I had a bunch of disconnects without subsequent reconnects when testing a few hundred client instances simultaneously in one JVM. As there were no errors in the server log, I guess the fault is on the client side.
One more guess: as the connection drops are so frequent, and the load on the server is so much higher compared to WebSocket, I wonder whether this client's xhr-polling does not use HTTP keep-alive; that would explain a lot... Will check as soon as time permits.
Using WebSocket, I could do 1000 instances per JVM (probably more) and 5000 instances (5 JVMs x 1000 instances each) per machine (probably more, too) without issues.
And apparently
it's easy to write your own transport
Will check this out.
It should also be easy to create your own by:
- sniffing a browser <-> socket.io session (on Windows, that's Fiddler)
- copying the code from one of the official socket.io tests
