I'm having a strange error with a recently deployed Azure website.
Everything seems to work most of the time, but on a regular basis (at least daily) there is a period during which I receive the following error:
A network-related or instance-specific error occurred while establishing a connection to SQL Server.
The server was not found or was not accessible.
Is this a stability issue with Azure, or is it possible that something's wrong in my code (but then why does it work most of the time)?
Is the code using the Transient Fault Handling Application Block - http://msdn.microsoft.com/en-us/library/hh680934(v=PandP.50).aspx? This block understands how to handle the transient errors that can, and will, happen with SQL Database.
Related
I have a service fabric cluster which hosts numerous applications. One of the applications has a service type where the service is created, runs for a bit, and then is deleted. Everything works great, but the cluster virtually always has its state set to error because there will be a few of these in the "Unhealthy evaluations" section.
Error event: SourceId='System.Hosting', Property='CodePackageActivation:Code:EntryPoint'.
There was an error during CodePackage activation.The service host terminated with exit code:7148
I've wrapped both the program's main and RunAsync in exception handlers, but never see anything in analytics. Is there any way to look up what exit code 7148 means? Thanks.
7148 is a general error code that indicates that something failed in SF in the process of setting up or activating your service's host process. So that's the reason that you're not seeing any errors or exceptions - your code is never getting a chance to run.
Examples of things I've seen that led to 7148:
The exe was not actually a Windows exe due to corruption
The service's manifest had a reference to a cert or some other pre-req like an endpoint that was incorrectly configured (like a port that was already in use or the wrong thumbprint for a cert)
Something blew up inside Windows that caused the process creation to fail, like a failure to correctly configure host networking for a container
Most of the time when I see this I have to look at the Windows error logs to see what's really happening. The SF folks are also trying to capture more common causes of failures and report them as better health errors rather than relying on 7148.
I have two Azure WebJobs. The first takes an incoming message that tells it to grab a PDF and break it into individual page images and then queue another message for the second WebJob to process the individual pages. It worked fine on our QC instance, but when we tried to move to production I started getting strange errors on the second job, but not consistently. The first job runs and breaks the file into page images. That is working fine. I have confirmed that every page image gets created and every page message gets queued. However, for the second job, only some of the messages are getting processed correctly. The remaining show this error in the WebJob diagnostics:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.ProcessBatchPage ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 52 - Unable to locate a Local Database Runtime installation. Verify that SQL Server Express is properly installed and that the Local Database Runtime feature is enabled.) ---> System.ComponentModel.Win32Exception: The system cannot find the file specified
What's weird is that this error mentions the Local Database Runtime and SQL Server Express, and I am not referencing either anywhere in my code. The system points at an Azure SQL DB. The job uses ADO.NET, and I have hardcoded the connection string to try to eliminate any issues with configuration-based connection strings. It is also odd that it only happens to a certain portion of the messages. The others process perfectly.
Lastly, I ran the job in debug locally (still pointing at the real queue and DB on Azure) and got the same problem. The job writes a console line with the job ID as the first line of the code. For those jobs that process successfully, I see this line. For those that fail, I never see anything. It's almost like the job is not really starting up correctly. (The failed jobs also have a really short run time, 50-100 ms.)
I had the same issue with some jobs and came across these articles while looking for a solution:
Transient Fault Handling (Building Real-World Cloud Apps with Azure)
Connection Resiliency / Retry Logic (EF6 onwards)
From these articles:
Causes of transient failures:
In the cloud environment you’ll find that failed and dropped database connections happen periodically. That’s partly because you’re going through more load balancers compared to the on-premises environment where your web server and database server have a direct physical connection. Also, sometimes when you’re dependent on a multi-tenant service you’ll see calls to the service get slower or time out because someone else who uses the service is hitting it heavily. In other cases you might be the user who is hitting the service too frequently, and the service deliberately throttles you – denies connections – in order to prevent you from adversely affecting other tenants of the service.
Use smart retry/back-off logic to mitigate the effect of transient failures:
The Microsoft Patterns & Practices group has a Transient Fault Handling Application Block that does everything for you if you’re using ADO.NET for SQL Database access (not through Entity Framework). You just set a policy for retries – how many times to retry a query or command and how long to wait between tries – and wrap your SQL code in a using block:
    public void HandleTransients()
    {
        var connStr = "some database";
        // Retry up to 3 times, waiting 5 seconds between attempts.
        var _policy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
            retryCount: 3,
            retryInterval: TimeSpan.FromSeconds(5));
        // ReliableSqlConnection applies the retry policy to the connection.
        using (var conn = new ReliableSqlConnection(connStr, _policy))
        {
            // Do SQL stuff here.
        }
    }
When you use the Entity Framework you typically aren’t working directly with SQL connections, so you can’t use this Patterns and Practices package, but Entity Framework 6 builds this kind of retry logic right into the framework. In a similar way you specify the retry strategy, and then EF uses that strategy whenever it accesses the database.
To use this feature in the Fix It app, all we have to do is add a class that derives from DbConfiguration and turn on the retry logic.
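    using System.Data.Entity;
    using System.Data.Entity.SqlServer;

    // EF follows a Code based Configuration model and will look for a class that
    // derives from DbConfiguration for executing any Connection Resiliency strategies
    public class EFConfiguration : DbConfiguration
    {
        public EFConfiguration()
        {
            // The released EF6 API registers the strategy per provider with
            // SetExecutionStrategy (pre-release builds used AddExecutionStrategy).
            SetExecutionStrategy("System.Data.SqlClient", () => new SqlAzureExecutionStrategy());
        }
    }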
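Separately from retries, the "error: 52" text in the exception quoted in the question is what the SQL client reports when it is asked to connect to a LocalDB instance, so it may be worth checking which server each failing call is actually given. The following is only a diagnostic sketch, not something taken from the articles: the class name ConnectionStringCheck and both method names are made up for illustration, and it assumes the server name appears under one of the usual SqlClient keywords.

```csharp
using System;
using System.Data.Common;

public static class ConnectionStringCheck
{
    // Keywords that SqlClient accepts for the server name.
    private static readonly string[] ServerKeys =
        { "Data Source", "Server", "Address", "Addr", "Network Address" };

    // Returns the server portion of a connection string, or null if none is present.
    public static string GetServer(string connectionString)
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
        foreach (var key in ServerKeys)
        {
            object value;
            if (builder.TryGetValue(key, out value))
            {
                return Convert.ToString(value);
            }
        }
        return null;
    }

    // True when the connection string points at a LocalDB instance.
    public static bool TargetsLocalDb(string connectionString)
    {
        var server = GetServer(connectionString);
        return server != null &&
               server.TrimStart().StartsWith("(localdb)", StringComparison.OrdinalIgnoreCase);
    }
}
```

Logging GetServer(connection.ConnectionString) just before each Open call would show whether the failing messages go through a code path that picks up a different connection string than the hardcoded one.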
We are using the Azure SQL Database (Web Edition) for an MVC3 ASP.NET/EF5 application.
Is there a limit to the number of sessions that this SQL Database setup supports? I am just wondering whether any delays that we are getting are due to some form of queuing or pooling. Currently we have about 5 concurrent users.
Thanks.
The SQL Azure Web edition database should support a high number of concurrent users - we've had applications running that issue thousands of queries per minute against Web databases.
Throttling
SQL Azure does implement database throttling to maintain performance for all users of the platform. If throttling has been applied to the current operation you'll receive error 40501. The link I've provided also shows you how to determine why throttling is being applied. If you receive this error you can treat it as a transient error and wait before retrying.
It doesn't sound like your connections are being throttled, because you mention only 5 concurrent users and talk about delays, whereas the throttling error would occur pretty quickly.
Transient error handling
If you're getting connection timeouts and similar failures, you need to handle them as transient errors. Transient errors are timeouts or dropped connections, as well as error codes 10054, 10053, 40501 (throttling as described above) and 40197 (usually because an upgrade or failover operation is in progress).
You should ensure you implement retry logic to handle transient errors.
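As a rough illustration of such retry logic, here is a minimal hand-rolled sketch with exponential back-off. It is not the Transient Fault Handling Application Block and not anything built into SQL Azure: the class TransientRetry, its method names, and the back-off schedule are all assumptions made for this example. Only the four error numbers come from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class TransientRetry
{
    // Error numbers listed above as transient.
    private static readonly HashSet<int> TransientErrorNumbers =
        new HashSet<int> { 10053, 10054, 40197, 40501 };

    public static bool IsTransientErrorNumber(int number)
    {
        return TransientErrorNumbers.Contains(number);
    }

    // Runs 'operation' and retries while 'isTransient' accepts the exception,
    // making at most 'maxAttempts' attempts in total. The last failure, or any
    // non-transient failure, propagates to the caller.
    public static T Execute<T>(Func<T> operation, Func<Exception, bool> isTransient,
                               int maxAttempts, TimeSpan baseDelay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Wait baseDelay, then 2x, 4x, ... before the next attempt.
                double factor = Math.Pow(2, attempt - 1);
                Thread.Sleep(TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * factor));
            }
        }
    }
}
```

With ADO.NET the predicate could look at SqlException.Number, for example ex => ex is SqlException sql && TransientRetry.IsTransientErrorNumber(sql.Number), and also accept TimeoutException. The connection should be opened inside the operation so that every attempt starts with a fresh connection.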
Query performance
If you're executing long running queries you can check which ones are slow by logging into the database management URL:
https://<database-id>.database.windows.net/#$database=<database-name>
Log in and click "Query Performance" - take a look at the longest running queries at the top.
I have just got the latest code from SVN and I got the above error when I logged into my application. The exception message was:
An error occurred while getting provider information from the
database. This can be caused by Entity Framework using an incorrect
connection string. Check the inner exceptions for details and ensure
that the connection string is correct.
The inner exception says:
The client was unable to establish a connection because of an error
during connection initialization process before login. Possible causes
include the following: the client tried to connect to an unsupported
version of SQL Server; the server was too busy to accept new
connections; or there was a resource limitation (insufficient memory
or maximum allowed connections) on the server. (provider: Shared
Memory Provider, error: 0 - The handle is invalid.)
The issue is, none of these suggestions seem like the cause. Any idea what might cause this?
You're going to love the solution to this. I restarted my machine and it works fine now. :o).
We are currently experiencing a rather troublesome problem in our development environment with the following message...
A connection was successfully established with the server,
but then an error occurred during the pre-login handshake.
(provider: SSL Provider, error: 0 - The certificate's CN
name does not match the passed value.)
...the commonly accepted wisdom for resolving this problem is to set the TrustServerCertificate portion of the connection string to True. However, this does not work reliably or consistently.
This particular error occurs in a number of situations, for instance when testing our WCF service in the Azure Emulator against a live / hosted SQL Azure instance, or even when using SQL Server Management Studio. The only common denominator we've found is that it occurs only when we connect directly to SQL Azure, as opposed to when the service is hosted and Azure is talking directly to SQL Azure (which does work).
I've tried a number of tactics to resolve the problem (such as the one detailed here), i.e. believing it was connection related and removing pooling and other modifications to the connection string. But alas, none are conclusive and more irritating is that the error is intermittent and will prevent access for a short period of time before magically resolving itself.
Other factors that I've eliminated.
We're using the Transient Fault Handling Application Block to attempt to recover from these errors, but it does not help.
Our office has no proxy server between us and the Azure hosted services.
Has anyone else experienced this problem or has any suggestions?
You need to scan for non-IFS Winsock BSPs or LSPs that are not compatible with the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS flag. The problem results primarily from non-IFS LSPs being installed.
Just run "netsh WinSock Show Catalog" from a command prompt and check for any "service flag" value that is not in the format 0x20xxx.
In my case I found "Speed Accelerator" with service flag 0x66; removing this software solved my problem.
More information can be found here: http://support.microsoft.com/kb/2568167
What does your connection string look like? Not sure if you've tried this yet, but I remember having a similar problem when using a remote SQL connection to SQL Azure and found that I had to set:
Trusted_Connection=False;Encrypt=True
and remove any Connect Timeout from the string entirely.
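To make that change in code rather than by hand, something like the following sketch could be used. The class and method names are hypothetical, and it uses the generic DbConnectionStringBuilder, which does not know about SqlClient synonyms, so a string that already contains Integrated Security would need that keyword handled separately.

```csharp
using System.Data.Common;

public static class SqlAzureConnectionString
{
    // Applies the settings suggested above: Trusted_Connection=False,
    // Encrypt=True, and no connect timeout keyword in the string.
    public static string Apply(string connectionString)
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
        builder["Trusted_Connection"] = "False";
        builder["Encrypt"] = "True";

        // Remove the timeout under each keyword SqlClient accepts for it.
        foreach (var key in new[] { "Connect Timeout", "Connection Timeout", "Timeout" })
        {
            builder.Remove(key);
        }
        return builder.ConnectionString;
    }
}
```

Keyword matching in DbConnectionStringBuilder is case-insensitive, so "connect timeout=30" is removed as well.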