We have a weird networking issue.
We have a Hyperledger Fabric client application written in Node.js running in Kubernetes which communicates with an external Hyperledger Fabric Network.
We randomly get timeout errors on this communication. After the pod is restarted, everything works fine for a while, then the timeout errors start again; sometimes the problem fixes itself and later goes bad again.
This is on Azure AKS; we also set up a quick Kubernetes cluster on AWS with Rancher, deployed the app there, and the same timeout errors happened there too.
We ran scripts in the same container all night long that hit the external Hyperledger endpoint every minute, both with cURL and with a small Node.js script, and we didn't get a single error.
We ran the application in another VM as plain Docker containers and there was no issue there.
We inspected the network traffic inside the container: when this issue happens, netstat shows an established connection, but tcpdump shows no traffic; no packets are even attempted to be sent.
Checking the Hyperledger Fabric SDK code, it uses gRPC with protocol buffers behind the scenes.
So any clues maybe?
This turned out to be not a Kubernetes problem but a dropped-connection issue.
gRPC keeps the connection open, and after some period of inactivity an intermediary component drops it. In the Azure AKS case this is the load balancer, since every outbound connection goes through one. It has a non-configurable idle timeout of 4 minutes, after which the load balancer drops the connection.
The fix is configuring gRPC to send keep-alive messages (see the sketch below).
Scripts in the container worked without a problem, as they open a new connection every time they run.
The application running as plain Docker containers didn't have this issue because we were hitting the endpoints every minute, so we never reached the idle timeout threshold. When we hit the endpoints every 10 minutes instead, the timeout issue started there too.
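For reference, a minimal sketch of client-side gRPC keep-alive in Node.js, assuming @grpc/grpc-js is used directly. The option names are standard gRPC channel arguments; the interval values are only illustrative and just need to stay below the 4-minute idle timeout. With the Fabric Node SDK, the same keys can usually be supplied per peer/orderer through the connection profile's grpcOptions instead of building a client by hand.

// Minimal sketch, assuming @grpc/grpc-js; endpoint and values are illustrative.
const grpc = require('@grpc/grpc-js');

const keepaliveOptions = {
  'grpc.keepalive_time_ms': 120000,          // ping every 2 minutes, well below the 4-minute idle timeout
  'grpc.keepalive_timeout_ms': 20000,        // wait up to 20 seconds for the ping ack
  'grpc.keepalive_permit_without_calls': 1,  // keep pinging even when no calls are in flight
  'grpc.http2.max_pings_without_data': 0     // allow pings on an otherwise idle connection
};

// Hypothetical peer address; the Fabric SDK normally builds these clients from the
// connection profile, where the same keys go under each peer's/orderer's grpcOptions.
const client = new grpc.Client('peer0.example.com:7051',
  grpc.credentials.createSsl(), keepaliveOptions);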
Related
I have a DO Load Balancer with 4 servers behind it. I've been using socket.io with Sticky Sessions enabled in the Load Balancer settings, and it had been working just fine for a while.
Recently, clients have not been able to connect at all, getting a 400 error immediately on connection. I haven't changed anything in the way I connect to the sockets. If I require the transport to be 'websocket' only from the client (see the snippet below), it does connect successfully, but then I lose out on the polling fallback (one of the main benefits of socket.io).
Also, connecting directly to one of the droplets works as expected, so the issue definitely lies with the Load Balancer.
Does anyone have any idea as to any kind of set up that should be in place for this to work with the DO Load Balancers? Anything that might have changed recently?
I'm running socket.io on a NodeJS server with Express if that helps at all.
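For reference, forcing the websocket transport from the client (the workaround mentioned above) looks roughly like this; the URL is a placeholder:

// Websocket-only transport skips the initial HTTP long-polling handshake,
// which is the part that appears to fail behind the load balancer.
// In the browser, io comes from the socket.io client script instead of require.
const io = require('socket.io-client');
const socket = io('https://example.com', { transports: ['websocket'] });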
Edit #1: Added a screenshot of the LB Settings
I'm hosting my AspNetCore app in Azure (Windows hosting plan, P3v2). It works perfectly fine under normal load (5-10 requests/sec) but under high load (100-200 requests/sec) it starts to hang and requests return the following response:
The specified CGI application encountered an error and the server terminated the process.
And from the event logs I can get even more details:
An attempt was made to access a socket in a way forbidden by its access permissions aaa.bbb.ccc.ddd
I have to scale the instance count to 30 instances; with each instance getting just 3-5 requests per second, it works just fine. I believe that 30 hosts is too many for this load, that the resource is underutilized, and I'm trying to find the real bottleneck. If I set the instance count to 10, everything crashes and every request starts returning the error above. Resource utilization metrics for the high-load case with 30 instances enabled:
The service plan CPU usage is low, about 10-15% for each host
The service plan memory usage is around 30-40%
Dependencies respond quickly, 50-200 ms
Azure SQL DTU usage is about 5%
I discovered this useful article on current tier limits, and after running the Azure TCP connections diagnostics I identified a few possible issues:
High outbound TCP connection
High TCP Socket handle count - High TCP Socket handle count was detected on the instance .... During this period, the process dotnet.exe of site ... with ProcessId 8144 had the maximum open handle count of 17004.
So I dug deeper and found the following information:
Per my service plan tier, my TCP connection limit should be 8064, which is far from the numbers displayed above. Next I checked the socket state:
Even though I can see that the number of active TCP connections is below the limit, I'm wondering whether the open socket handle count could be the issue here. What can cause this socket handle leak (if there is one)? How can I troubleshoot and debug it?
I see that you have tried to isolate the possible cause of the error; just highlighting some of the reasons to revalidate/remediate:
1. On Azure App Service, connection attempts to local addresses (e.g. localhost, 127.0.0.1) and to the machine's own IP will fail, unless another process in the same sandbox has created a listening socket on the destination port. Rejected connection attempts normally return the socket-forbidden error shown above.
For a peered VNet/on-premises network, ensure that the IP address used is within the ranges listed for routing to the VNet; otherwise the traffic is routed incorrectly.
2. On Azure App Service, the outbound TCP connections on the VM instance can be exhausted; limits are enforced on the maximum number of outbound connections that can be made from each VM instance.
Other causes, as highlighted in this blog:
Using client libraries which are not implemented to re-use TCP connections.
Application code or the client library is leaking TCP socket handles.
Burst load of requests opening too many TCP socket connections at once.
In the case of a higher-level protocol like HTTP, this is encountered if the Keep-Alive option is not leveraged (see the sketch below).
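To illustrate the connection re-use point above in a client-agnostic way, here is a minimal Node.js sketch of a shared keep-alive agent (the URL is a placeholder); the .NET analogue is sharing a single HttpClient instance, for example via IHttpClientFactory, rather than creating one per request.

// Minimal sketch: one shared keep-alive agent so outbound requests reuse sockets
// instead of opening a new TCP connection (and socket handle) per call.
const https = require('https');

const agent = new https.Agent({ keepAlive: true, maxSockets: 50 });

https.get('https://example.com/api/health', { agent }, (res) => {
  res.resume(); // drain the response so the socket is returned to the pool
});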
I'm unsure if you have already tried App Service Diagnostics to fetch more details; kindly give that a shot:
In the Azure portal, open the app in App Services.
Navigate to the Diagnose and solve problems blade.
Select "TCP Connections".
Consider optimizing the application to implement connection pooling for your .NET code and observe the behavior locally. If feasible, restart the WebApp and then check whether that helps.
If the issue still persists, kindly file a support ticket for a detailed/deeper investigation of the backend logs.
Things were working fine with CloudAMQP until, all of a sudden, wascally/rabbot stopped being able to connect to my endpoint. I have installed RabbitMQ locally and my system works fine. I have since tried to set up a RabbitMQ instance on Heroku via Bigwig, to no avail. The endpoints I'm using should be fine, and I also installed amqp.node and node-amqp to test whether it was a problem with rabbot, but none of these can connect either.
Any idea what the problem can be?
The most common cause is a connection timeout. With all my wascally code hosted on CloudAMQP (with Heroku, DigitalOcean or otherwise), I have to set a connection timeout much higher than the default for it to work.
This can be done with the connection_timeout parameter on the connection string URL (https://www.rabbitmq.com/uri-query-parameters.html).
for example:
var conn = "amqp://myuser:mypassword@server.cloudamqp.com/my-vhost?connection_timeout=30";
This will set a connection timeout of 30 seconds.
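In wascally/rabbot terms, that URI (including connection_timeout) can be passed when configuring the connection. A rough sketch, with placeholder credentials, host and vhost (rabbot also accepts the broker settings as separate fields instead of a single uri):

// Rough sketch: hand the CloudAMQP URI (with connection_timeout) to rabbot.
// The credentials, host and vhost below are placeholders.
const rabbit = require('rabbot');

rabbit.configure({
  connection: {
    uri: 'amqp://myuser:mypassword@server.cloudamqp.com/my-vhost?connection_timeout=30'
  }
}).then(() => {
  console.log('connected to CloudAMQP');
}).catch((err) => {
  console.error('connection failed', err);
});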
My Node.js application is running on a production server via the forever daemon:
forever start -w --watchDirectory=/path/to/app \
--watchIgnore=/path/to/app/node_modules/** /path/to/app/server.js
When I change file contents in the /path/to/app/ directory, the process is restarted by forever. The restart takes around 2-3 seconds, during which the app is unavailable, so downtime occurs every time I deploy a change. How can I prevent this downtime, given that I have full access to the server?
You can do that manually by using an HTTP load balancer: you create two or more backends that are accessible only through the load balancer (the load balancer is the only component reachable at a public address). The next step is to update one server only, while the load balancer directs traffic to the other, still-available backend. After a successful update, you bring the updated backend back up and point the load balancer at it, then repeat the procedure for the other backend; both end up updated without service downtime.
I have CouchDB running on an Ubuntu 14.04 Linux VM and a .NET web application running under Azure Web Apps. In our ELMAH logging for the web application I keep getting intermittent errors:
System.Net.Sockets.SocketException
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [ipaddress]:5984
I've checked the CouchDB logs and there isn't a record of those requests, so I don't believe they are reaching the CouchDB server; I can confirm this by looking at the web server logs on Azure and seeing the Error 500 responses. I've also tried tcpdump, with little success (a separate issue: writing tcpdump output to another disk keeps failing with access denied).
We've previously run CouchDB on a Windows VM with no issues, so I wonder if the issue relates to the OS TCP connection settings and timeouts.
Anyone have any suggestions as to where to look or what immediately jumps to mind?