Random connection errors to MS SQL from nodeJS app - node.js

We have an AWS server running some nodeJS services. The services connecting to MS SQL are randomly crashing with the message "Failed to connect to databaseserver:1433 - Could not connect (sequence)".
We are running on:
App server:
Linux Ubuntu 14.4
AWS m5
NodeJS: 8.11.2
Services are using package mssql latest version (4.3.0). This includes tedious 2.7.1.
DB server:
Windows server 2012.
sql server 2012
throughput: about 300 rpm, error also happens when throughput is lower (about 20 rpm).
App is running in a cluster through PM2 (runs 4 times). We see the error happening on all 4 at the same time, but sometimes also on 1 or 2 instances.
What we tried:
Upgrading to alpha version of mssql with tedious 3.0.1. Did not make a difference
Upgrading from Amazon M4 machine to M5 machine with enhanced networking
Changing the pool settings in the app. We tried setting min connections to 0 or low/high value. Max also to low/high value but no avail.
Duplicate server to new machine.
Setting idleTimeoutMillis to 1 second
Pinging DB server to see if there is a connection problem, but we see no weird pings when the error happens.
Connection on app startup:
App.sqlConnection = new App.SQL.ConnectionPool(config, function(err) {
    if (err) {
        Log.error(err);
        process.exit(1);
    }
    App.sqlConnection.on('error', err => {
        Log.error(`There was a connection err : ${err}`);
        process.exit(1);
    });
});
Request:
var request = new App.SQL.Request(App.sqlConnection);
request.query(sQuery, function(err, results) {
});
Errors are caught by the "on error" handler.
The error happens randomly across services. Some have more instances of the error than others.
We are running out of options. Any idea if we can see more detailed errors?

I have a couple suggestions.
First, how sure are you that these errors are actually a problem? If your code simply retries, instead of exiting, are the connections stable afterwards, or can a connection drop in the middle of a query?
(Connections dropping in the middle of queries are obviously not good, but random failures on connection, that can be fixed by retries, are the best kind of problem to have IMHO.)
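To make the "retry instead of exiting" idea concrete, here is a minimal sketch of a retry helper with exponential backoff. The names retryWithBackoff, retries and baseDelayMs are made up for illustration and are not part of mssql or tedious; the operation passed in would be whatever call opens the pool or runs the query in your app.

```javascript
// Illustrative helper (not part of mssql/tedious): retry an async
// operation with exponential backoff instead of exiting on first failure.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function retryWithBackoff(operation, options = {}) {
  const retries = options.retries === undefined ? 5 : options.retries;
  const baseDelayMs = options.baseDelayMs === undefined ? 200 : options.baseDelayMs;
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      // Resolve with the first successful result.
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
        await sleep(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

If the connection is stable after one or two retries, logging each failed attempt (rather than calling process.exit) would also tell you how often the problem really occurs.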
Ignoring the potential in-code fix, I'm wondering when you say you "duplicated server to new machine" - did you launch a new AMI using latest Windows Server 2012, or did you image and clone? If your database server is a couple years old, you might actually be running outdated network drivers in your instance, which could give you some hiccups.
If you want to explore that, you could rebuild the entire database server from scratch on a newly launched AMI. Alternatively, you can upgrade the PV driver, network adapter, and EC2Config on your existing instance. The instructions are at the following links:
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/Upgrading_PV_drivers.html#aws-pv-upgrade
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/sriov-networking.html#enable-enhanced-networking
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/UsingConfig_Install.html
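Since ICMP pings looked normal when the errors occurred, it may also be worth timing plain TCP connections to port 1433 from the app server, because a ping does not exercise the same path as a TCP handshake. The following is only a sketch using Node's built-in net module; probeTcp and its result fields are hypothetical names, and it checks reachability only, not SQL Server login.

```javascript
const net = require('net');

// Try to open a TCP connection and report how long it took or why it failed.
function probeTcp(host, port, timeoutMs) {
  return new Promise(resolve => {
    const started = Date.now();
    const socket = net.connect({ host: host, port: port });
    let settled = false;
    const finish = (ok, error) => {
      if (settled) return; // report only the first outcome
      settled = true;
      socket.destroy();
      resolve({ ok: ok, elapsedMs: Date.now() - started, error: error });
    };
    socket.setTimeout(timeoutMs);
    socket.on('connect', () => finish(true, null));
    socket.on('timeout', () => finish(false, 'timeout'));
    socket.on('error', err => finish(false, err.code || err.message));
  });
}

// Hypothetical usage: probe the database host every few seconds and log slow
// or failed attempts, then compare the timestamps with the application errors.
// probeTcp('databaseserver', 1433, 3000).then(console.log);
```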


Blazor server side app on IIS frequently disconnects WebSocket connection

I have a Blazor server side app published on IIS 10.
When browsing to an arbitrary page and just letting it idle after a minute or so (sometimes only 45 sec, sometimes something between 1 and two minutes) the modal
Attempting to reconnect to server ...
appears for a couple of seconds.
In the browser console the logging shows either
Error: Connection disconnected with error 'Error: Server timeout
elapsed without receiving a message from the server.'.
or
Information: Connection disconnected.
Since this seems to be a timeout problem I added the following options to ConfigureServices in my startup.cs
services.AddServerSideBlazor()
    .AddHubOptions(options =>
    {
        options.ClientTimeoutInterval = TimeSpan.FromMinutes(10);
        options.KeepAliveInterval = TimeSpan.FromSeconds(3);
        options.HandshakeTimeout = TimeSpan.FromMinutes(10);
    });
This does not solve the problem though.
I also went to the advanced settings of my site in IIS and increased the connection timeout from the default 120 sec to 600 sec. This did not help either.
Those frequent disconnections only happen on the live site hosted on IIS 10.
If I start the app locally with Visual Studio the connection is stable.
Any hints of what I'm missing would be appreciated!
Update:
As suggested by @agua from mars in a comment below, I changed the transport type like this:
app.UseEndpoints(endpoints =>
{
    endpoints.MapControllers();
    endpoints.MapBlazorHub(options => { options.Transports = HttpTransportType.LongPolling; });
    endpoints.MapFallbackToPage("/_Host");
});
With this change the connection is still closed. The console log shows
Information: (LongPolling transport) Poll terminated by server.
I also tried HttpTransportType.ServerSentEvents which does not work at all but gives this error
Error: Failed to start the connection: Error: Unable to connect to the
server with any of the available transports. ServerSentEvents failed:
Error: 'ServerSentEvents' does not support Binary.
Update 2:
The IIS is configured to use HTTP 1.1
I tried changing to HTTP/2 but this did not change anything regarding the disconnections.
This is related to application pool recycling in IIS, as stated by @Programmer. You can reproduce it by going into the application pools, right-clicking the pool and choosing Recycle to force it. Your Blazor app will show the "reconnect" modal.
I did not want to disable pool recycling, so I added this JavaScript to the _Host.cshtml file
<script>Blazor.defaultReconnectionHandler._reconnectCallback = function (d) {document.location.reload();}</script>
to automatically reconnect when the server comes back up.
Try this out..
app.UseEndpoints(endpoints =>
{
//other settings
.
.
endpoints.MapBlazorHub(options => options.WebSockets.CloseTimeout = new TimeSpan(1, 1, 1));
//other settings
.
.
});
This could be related to IIS application pool recycling. Try disabling the recycling to see if that's causing the disconnection.
I suffer from the same problem on my Blazor server too: Myspector.com
I am sure this comes from the network of the data provider. When I use Othello in Germany over 4G, I see a disconnection within 5 seconds. When I am on Wi-Fi with T-Online against the same target server, there is no disconnection at all.
I think some operators are incompatible with Blazor Server/WebSocket.
In my recent experience, especially on a shared server, increasing the application pool memory helps. Connectivity issues went away when we bumped it from 256MB up to 1GB for a small user base.

Azure SQL serverless is not waking up on connection attempt

I'm testing Azure SQL Serverless and from SSMS it seems to work fine, but from my ASP.NET Core application it never wakes up.
Using SSMS I can open a connection to a sleeping Serverless SQL database and after a delay the connection will go through.
Using my ASP.NET Core application I tried the same. From the login page I tried to log in, which opens a connection to the database. The attempt fails after 10 or 11 seconds (I looked up the default timeout and it's supposed to be 15 seconds, but in this case it always seems to be about 10.5 seconds +/- 0.5 s). According to the docs, the first connection attempt may fail but subsequent ones should succeed; however, I can send multiple queries to the database and it always fails with the following error:
Microsoft.Data.SqlClient.SqlException (0x80131904): Database 'myDb' on server
'MyDbSvr.database.windows.net' is not currently available. Please retry the connection later. If the
problem persists, contact customer support, and provide them the session tracing ID of
'{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}'.
If I wake the database up using SSMS then the login web page can connect to the database and succeeds.
I have added Connect Timeout=120; to the connection string.
The connection does happen during an HTTP request that is marked async on the Controller, though I don't know if that makes any difference.
Am I doing something wrong or is there something additional I need to do to get the DB to wake?
[Update]
As an extra test, I wrote the following:
void Main()
{
    SqlConnection con = new SqlConnection("Server=mydbsvr.database.windows.net;Database=mydb;User Id=abc;Password=xyz;Connect Timeout=120;");
    Console.WriteLine(con.ConnectionTimeout);
    con.Open();
    var cmd = con.CreateCommand();
    cmd.CommandText = "select getdate();";
    Console.WriteLine(cmd.ExecuteScalar());
}
and got the same error.
I figured it out, and it's the dumbest thing.
This Azure SQL Server instance was migrated from another subscription, and the group that migrated it gave it a new name but did something that still allowed the old name to be used. I'm researching how that was done and will update this answer when I find out.
As it turns out, connecting through the old name to a serverless database won't wake it up. I don't know why. If you switch to the new (real) server name, it works. You do have to add a retry to the connection, as it may fail the first few times.
[Update]
The new server allows logins under the old name by using an Azure SQL Database DNS alias: https://learn.microsoft.com/en-us/azure/sql-database/dns-alias-overview

How to stop outbound HTTP connections from timing out

Background:
I'm currently hosting an ASP.NET application in Azure with the following specs:
ASP .Net Core 2.2
Using Flurl for HTTP requests
Kestrel Webserver
Docker (Linux - mcr.microsoft.com/dotnet/core/aspnet:2.2 runtime)
Azure App Service on P2V2 tier app service plan
I have a couple of background jobs that run on the service and make a lot of outbound HTTP calls to a 3rd party service.
Issue:
Under a small load (approximately 1 call per 10 seconds), all requests complete in under a second with no issue. Under a heavier load, when the service can make up to 3 or 4 calls in a 10 second span, some of the requests randomly time out and throw an exception. When I was using RestSharp the exception read "The operation has timed out". Now that I'm using Flurl, the exception reads "The call timed out".
Here's the kicker: if I run the same job from my laptop running Windows 10 / Visual Studio 2017, this problem does NOT occur. This leads me to believe I'm hitting some limit or running out of some resource in my hosted environment. It is unclear whether that is connection/socket or thread related.
Things I've tried:
Ensure all code paths to the request are using async/await to prevent lockouts
Ensure Kestrel Defaults allow unlimited connections (it does by default)
Ensure Docker's default connection limits are sufficient (2000 by default, more than enough)
Configuring ServicePointManager settings for connection limits
Here is the code in my startup.cs that I'm currently using to try and prevent this issue:
public class Startup
{
    public Startup(IHostingEnvironment hostingEnvironment)
    {
        ...
        // ServicePointManager setup
        ServicePointManager.UseNagleAlgorithm = false;
        ServicePointManager.Expect100Continue = false;
        ServicePointManager.DefaultConnectionLimit = int.MaxValue;
        ServicePointManager.EnableDnsRoundRobin = true;
        ServicePointManager.ReusePort = true;
        // Set Service point timeouts
        var sp = ServicePointManager.FindServicePoint(new Uri("https://placeholder.thirdparty.com"));
        sp.ConnectionLeaseTimeout = 15 * 1000; // 15 seconds
        FlurlHttp.ConfigureClient("https://placeholder.thirdparty.com", cli => cli.Settings.ConnectionLeaseTimeout = new TimeSpan(0, 0, 15));
    }
}
Has anyone else run into a similar issue to this? I'm open to any suggestions on how to best debug this situation, or possible methods to correct the issue. I'm at a complete loss after researching this for several days.
Thank you in advance.
I had similar issues. Take a look at "Asp.net Core HttpClient has many TIME_WAIT or CLOSE_WAIT connections". Debugging via netstat helped me identify the problem. As one possible solution, I suggest you use IHttpClientFactory. You can get more info from https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.2 It should be fairly easy to use, as described in "Flurl client lifetime in ASP.Net Core 2.1 and IHttpClientFactory".
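To show what the netstat check looks for, here is a small sketch that tallies TCP connections per state from `netstat -ant` style output on Linux. It is written in JavaScript purely for illustration; countTcpStates is a hypothetical helper and the sample lines are made up, not taken from the asker's container. A large and growing TIME_WAIT or CLOSE_WAIT count towards the third-party host would point at connections not being reused.

```javascript
// Count TCP connections per state from `netstat -ant` style output (Linux layout).
function countTcpStates(netstatOutput) {
  const counts = {};
  for (const line of netstatOutput.split('\n')) {
    const columns = line.trim().split(/\s+/);
    // Expected columns: Proto Recv-Q Send-Q Local Foreign State
    if (columns.length < 6 || !columns[0].startsWith('tcp')) continue;
    const state = columns[5];
    counts[state] = (counts[state] || 0) + 1;
  }
  return counts;
}

// Made-up sample input for illustration only.
const sample = [
  'Proto Recv-Q Send-Q Local Address           Foreign Address         State',
  'tcp        0      0 172.17.0.2:41234        203.0.113.10:443        TIME_WAIT',
  'tcp        0      0 172.17.0.2:41236        203.0.113.10:443        TIME_WAIT',
  'tcp        0      0 172.17.0.2:41240        203.0.113.10:443        ESTABLISHED',
  'tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN'
].join('\n');

console.log(countTcpStates(sample));
```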

"Server x timed out" during MongoDB aggregation

I have a script that periodically runs aggregation on a mongodb collection. As the dataset has grown, the amount of time it takes to aggregate has also grown. My aggregation script has recently stopped working consistently, and the error logs show:
error: { [MongoError: server <x> timed out]
name: 'MongoError',
message: 'server <x> timed out' }
I've tried debugging this, and the only pattern I can find is that this timeout seems to only occur when the aggregation takes longer than 2 minutes (it times out right around 2m). Does anyone have additional debugging tips for this? The 2-minute thing is giving me the impression that I just need to configure some timeout somewhere but I can't figure out where or if i'm just falling into a red-herring trap.
About the system configuration: This aggregation script is a node.js (v5.9.1) application running in an alpine-based docker (v1.9.1) container. It uses the mongodb node driver (v2.1.19). Single mongodb server (though this is also happening in a separate environment with a replSet) running mongod (v3.2.6)
I had the same problem with long-running aggregations. I think I have the solution for you.
I found that the option socketTimeoutMS is responsible for that.
Check the default socketTimeoutMS value in your driver's mongo_client.js. For me it was 2 minutes (mongodb module version 2.1.18).
So just add this option to your URL:
mongodb://localhost:27017/test?maxPoolSize=2&socketTimeoutMS=600000
This sets the timeout to 10 minutes (600000 ms). That did the trick for me.
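Because socketTimeoutMS is expressed in milliseconds, it is easy to be off by a factor of ten. A minimal sketch that computes the value and appends it to the connection string is shown below; withSocketTimeout is a hypothetical helper, not part of the mongodb driver, and the host and database are the ones from the example URL.

```javascript
const querystring = require('querystring');

// 10 minutes expressed in milliseconds: 10 * 60 * 1000 = 600000
const TEN_MINUTES_MS = 10 * 60 * 1000;

// Hypothetical helper: append connection options, including the socket
// timeout, to a base MongoDB URL that has no query string yet.
function withSocketTimeout(baseUrl, timeoutMs, extraOptions) {
  const options = Object.assign({}, extraOptions, { socketTimeoutMS: timeoutMs });
  return baseUrl + '?' + querystring.stringify(options);
}

const url = withSocketTimeout('mongodb://localhost:27017/test', TEN_MINUTES_MS, { maxPoolSize: 2 });
console.log(url);
```

Whatever value you choose should be comfortably longer than the slowest aggregation you expect to run.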

Using memcached failover servers in nodejs app

I'm trying to set up a robust memcached configuration for a nodejs app with the node-memcached driver, but it does not seem to use the specified failover servers when one server dies.
My local experiment goes as follows:
shell
memcached -p 11212
node
MC = require('memcached')
c = new MC('localhost:11211', //this process does not exist
{failOverServers: ['localhost:11212']})
c.get('foo', console.log) //this will eventually time out
c.get('foo', console.log) //repeat 5 or 6 times to exceed the retries number
//wait until all the connection errors appear in the console
//at this point, the failover server should be in use
c.get('foo', console.log) //this still times out :(
Any ideas about what we might be doing wrong?
It seems that the failover feature is somewhat buggy in node-memcached.
To enable failover you must set the remove option:
c = new MC('localhost:11211', //this process does not exist
{failOverServers: ['localhost:11212'],
remove : true})
Unfortunately, this is not going to work because of the following error:
[depricated] HashRing#replaceServer is removed.
[depricated] the API has no replacement
That is, when trying to replace a dead server with a replacement from the failover list, node-memcached outputs a deprecation error from the HashRing library (which, in turn, is maintained by the same author of node-memcached). IMHO, feel free to open a bug :-)
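Until that is fixed, one possible workaround is to handle the fallback in application code rather than relying on failOverServers. The sketch below is an assumption on my part, not a node-memcached feature: getWithFallback is a hypothetical helper that takes any objects exposing get(key, callback) and tries them in order.

```javascript
// Application-level fallback: try each client in order until one answers.
// `clients` are objects exposing get(key, callback), such as separate
// node-memcached instances pointed at different servers.
function getWithFallback(clients, key, callback) {
  let index = 0;
  let lastError = null;
  function tryNext() {
    if (index >= clients.length) {
      return callback(lastError || new Error('no memcached clients configured'));
    }
    const client = clients[index++];
    client.get(key, (err, value) => {
      if (err) {
        // Remember the failure and move on to the next server.
        lastError = err;
        return tryNext();
      }
      callback(null, value);
    });
  }
  tryNext();
}

// Hypothetical usage with the servers from the question:
// const MC = require('memcached');
// getWithFallback([new MC('localhost:11211'), new MC('localhost:11212')], 'foo', console.log);
```

Note that the two servers do not share data, so writes would need the same treatment, and each failed lookup still costs the primary's timeout before the fallback is tried.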
This happens when your Node.js server is not getting any session ID from memcached.
Please check that memcached is configured properly in your php.ini file:
session.save = 'memcache'
session.path = 'tcp://localhost:11212'
