Azure WebJob DB connection error only on some instances

I have two Azure WebJobs. The first takes an incoming message that tells it to grab a PDF and break it into individual page images and then queue another message for the second WebJob to process the individual pages. It worked fine on our QC instance, but when we tried to move to production I started getting strange errors on the second job, but not consistently. The first job runs and breaks the file into page images. That is working fine. I have confirmed that every page image gets created and every page message gets queued. However, for the second job, only some of the messages are getting processed correctly. The remaining show this error in the WebJob diagnostics:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.ProcessBatchPage ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 52 - Unable to locate a Local Database Runtime installation. Verify that SQL Server Express is properly installed and that the Local Database Runtime feature is enabled.) ---> System.ComponentModel.Win32Exception: The system cannot find the file specified
What's weird is that this error mentions the Local Database Runtime and SQL Server Express, and I am not referencing either anywhere in my code. The system points at an Azure SQL DB. The job uses ADO.NET, and I have hardcoded the connection string to try to eliminate any issues with configuration-based connection strings. It is also odd that the error only happens for a certain portion of the messages. The others process perfectly.
Lastly, I ran the job in debug locally (still pointing at the real queue and DB on Azure) and got the same problem. The job writes a console line with the job ID as the first line of the code. For those jobs that process successfully, I see this WriteLine output. For those that fail, I never see anything. It's almost like the job is not really starting up correctly. (The failed jobs also have a really short run time: 50-100 ms.)

I had the same issue with some jobs and came across these articles while looking for a solution:
Transient Fault Handling (Building Real-World Cloud Apps with Azure)
Connection Resiliency / Retry Logic (EF6 onwards)
From these articles:
Causes of transient failures:
In the cloud environment you’ll find that failed and dropped database connections happen periodically. That’s partly because you’re going through more load balancers compared to the on-premises environment where your web server and database server have a direct physical connection. Also, sometimes when you’re dependent on a multi-tenant service you’ll see calls to the service get slower or time out because someone else who uses the service is hitting it heavily. In other cases you might be the user who is hitting the service too frequently, and the service deliberately throttles you – denies connections – in order to prevent you from adversely affecting other tenants of the service.
Use smart retry/back-off logic to mitigate the effect of transient failures:
The Microsoft Patterns & Practices group has a Transient Fault Handling Application Block that does everything for you if you’re using ADO.NET for SQL Database access (not through Entity Framework). You just set a policy for retries – how many times to retry a query or command and how long to wait between tries – and wrap your SQL code in a using block :
public void HandleTransients()
{
    var connStr = "some database";
    // Retry up to 3 times, waiting 5 seconds between attempts.
    var _policy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
        retryCount: 3,
        retryInterval: TimeSpan.FromSeconds(5));

    using (var conn = new ReliableSqlConnection(connStr, _policy))
    {
        // Do SQL stuff here.
    }
}
When you use the Entity Framework you typically aren’t working directly with SQL connections, so you can’t use this Patterns and Practices package, but Entity Framework 6 builds this kind of retry logic right into the framework. In a similar way you specify the retry strategy, and then EF uses that strategy whenever it accesses the database.
To use this feature in the Fix It app, all we have to do is add a class that derives from DbConfiguration and turn on the retry logic.
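// EF follows a Code based Configuration model and will look for a class that
// derives from DbConfiguration for executing any Connection Resiliency strategies
public class EFConfiguration : DbConfiguration
{
    public EFConfiguration()
    {
        // In the released EF6 API this method is SetExecutionStrategy and takes
        // the provider invariant name as its first argument.
        SetExecutionStrategy("System.Data.SqlClient", () => new SqlAzureExecutionStrategy());
    }
}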
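If neither the application block nor EF6 is available, the retry/back-off idea described above can be sketched by hand. This is only an illustration, not the code from the quoted articles: the class name RetryHelper, the isTransient callback and the doubling delay are all assumptions made for the example.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of any Microsoft library): retries an
// operation when the caller-supplied predicate says the failure is transient.
public static class RetryHelper
{
    // Runs `operation`, retrying up to `maxRetries` times. The wait doubles
    // after each failed attempt: baseDelay, 2 * baseDelay, 4 * baseDelay, ...
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        int maxRetries,
        TimeSpan baseDelay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (attempt < maxRetries && isTransient(ex))
            {
                // Exponential back-off before the next attempt.
                var delay = TimeSpan.FromMilliseconds(
                    baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
                Thread.Sleep(delay);
            }
        }
    }
}
```

A caller would wrap the code that opens the connection and runs the command in the operation delegate, and pass a predicate that recognises the failures it considers transient. Non-transient exceptions, and the last failure once the retries are used up, propagate to the caller unchanged.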

Related

Why is Azure MySQL database unresponsive at first

I have recently set up an 'Azure Database for MySQL flexible server' using the burstable tier. The database is queried by a React frontend via a Node.js API, each of which runs on its own separate Azure App Service.
I've noticed that when I come to the app first thing in the morning, there is a delay before database queries complete. The React app is clearly running when I first come to it, which is serving the html front-end with no delays, but queries to the database do not return any data for maybe 15-30 seconds, like it is warming up. After this initial slow performance though, it then runs with no delays.
The database contains about 10 records at the moment, and 5 tables, so it's tiny.
This delay could conceivably be due to some delay with the node.js server, but as the React server is running on the same type of infrastructure (an app service), configured in the same way, and is immediately available when I go to its URL, I don't think this is the issue. I also have no such delays in my dev environment which runs on my local PC.
I therefore suspect there is some delay with the database server, but I'm not sure how to troubleshoot it. Before I dive down that rabbit hole, though, I was wondering whether a delay when you first start querying a database (after, say, 12 hours of inactivity) is simply a characteristic of the burstable tier on Azure?
There may be more factors affecting this (see comments from people on my original question), but my solution has been to set two global variables which cache data, improving initial load times. The following should be set to ON in the Azure config:
'innodb_buffer_pool_dump_at_shutdown'
'innodb_buffer_pool_load_at_startup'
This is explained further in the following best practices documentation: https://learn.microsoft.com/en-us/azure/mysql/single-server/concept-performance-best-practices in the section marked 'Use InnoDB buffer pool Warmup'

Convert azure web application to azure website

Within our company we've got a rather large service application running as an Azure cloud service. The service contains a webrole and a workerrole.
The webrole contains an MVC application and the workerrole runs in the background. The workerrole is used to handle several large processes and a bunch of smaller processes 24/7; this is checked every 5 minutes.
I've created an Azure website for this application and wrote a small wrapper class which checks whether configuration values need to be taken from the web.config file or from the cloud configuration files (.cscfg files). I've added the appropriate transformations to transform some extra settings and published the application to the Azure website.
So far everything works well, but what I already half expected did indeed happen: the workerrole isn't working anymore and is throwing errors. The first error I saw was:
Could not load file or assembly 'Microsoft.WindowsAzure.ServiceRuntime, Version=2.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified.
So of course I took the quick fix, went to 'Properties > Copy Local' and set it to true. After publishing this to the Azure website I'm getting the following error:
Could not load file or assembly 'msshrtmi, Version=2.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified.
I can find out where this error is coming from, but it feels like this is only the second of a whole bunch of errors to come. On several sites I've read that Azure websites just don't support workerroles (obviously).
This gives me a few options:
1. Find a solution so I can connect the Azure website to the workerrole still running in the cloud service. If this works I can drop the webrole and I'm able to connect multiple instances to one workerrole.
2. Find a solution to convert the workerrole to something (no idea what this could possibly be) supported by the Azure website.
3. Forget the whole idea and stick to the cloud service setup with the webrole and workerrole.
Fragment from workerrole.cs
The facade makes a database call to check any newly added processes.
public override void Run()
{
    // Only process if the web.config says we're allowed to do so.
    while (true)
    {
        var process = Convert.ToBoolean(WebConfigurationManager.GetSetting("Process"));
        try
        {
            if (process)
            {
                var username = WebConfigurationManager.GetSetting("UsernameWorkerRole");
                if (string.IsNullOrEmpty(username))
                {
                    var version = Assembly.Load("Ecare.Productie.WorkerRole").GetName().Version;
                    var versionString = String.Format("{0}.{1}.{2}.{3}", version.Major, version.Minor, version.Build.ToString("000"), version.Revision.ToString("00000"));
                    username = ApplicatieConstanten.WorkerRoleName + " " + versionString;
                }
                IServiceFacade serviceFacade = new ServiceFacade(username);
                serviceFacade.Start();
            }
        }
        catch (Exception ex)
        {
            AuditingLoggingHelper.GetLoggerInstance(ApplicatieConstanten.WorkerRoleName).Error("Exception while starting service", ex);
        }
        Thread.Sleep(10000);
    }
    // ReSharper disable once FunctionNeverReturns
}
The main reason we're doing this is that we have a VS solution with an MVC application (the webrole) and the workerrole. We're currently publishing this to a cloud service in Azure. Because of our development process we're running separate test, acceptance and production environments. Since it's a heavy process we're running quite expensive machines in Azure, but they are mostly only needed for the workerrole. The web part is lightweight. So it's mainly an attempt to reduce costs. The idea is to convert the webrole to an Azure website (this part is working already, with just a small modification to read information from the web.config instead of the cloud configuration). But the workerrole currently isn't working because we haven't changed anything for that yet. A colleague of mine basically said "write a wrapper for the config part, publish the Azure website to one or more test environments and point them to the same workerrole". But I have my doubts whether this is even possible.
Has anyone else ever run into this sort of situation and found a solution for it? Any help finding a solution is greatly appreciated!
Find a solution so I can connect the azure website to the workerrole
still running in the cloudservice. If this works I can drop the
webrole and I'm able to connect multiple instances to one workerrole.
I'm guessing that you're using some kind of queue mechanism (Azure Storage Queues or Service Bus Queues) to facilitate communication between Web and Worker Role. If that's the case, then you can continue to use the same. Your website will push messages in a queue and your worker role will poll this queue and fetch messages and work on those.
Find a solution to convert the workerrole to something (no idea what
this possibly could be) supported by the azure website.
Do take a look at Azure WebJobs. In the Web Apps world, they are the counterpart of Worker Roles.
UPDATE
Based on the comments, I think you should be able to port your code to run as WebJobs. There are two ways you can do it:
1. If you create a Continuous WebJob, you would have to keep the 10-second sleep logic in your code itself. The job will be running continuously but will only wake up every 10 seconds, similar to your current Worker Role implementation.
2. You could take the 10-second sleep logic out of your code by making your WebJob a Scheduled WebJob that you schedule to run every 10 seconds. I would recommend going down this route, as it decouples your scheduling logic (the 10-second sleep) from your application. So tomorrow, if you wanted to increase the sleep time, you would simply change the schedule in the portal without redeploying your code.
As Gaurav pointed out, the equivalent to worker roles in the App Service space is Azure WebJobs.
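As a rough illustration of the continuous option, the polling loop from the Worker Role's Run method could be moved into a plain console application, which is what a continuous WebJob hosts. The sketch below is an assumption-laden example rather than the poster's code: PollingLoop, its parameters and the use of a CancellationToken for shutdown are all hypothetical, and the work delegate stands in for the ServiceFacade call.

```csharp
using System;
using System.Threading;

// Hypothetical polling loop for a continuous job: runs `work`, waits
// `interval`, and repeats until the token is cancelled.
public static class PollingLoop
{
    // Returns the number of iterations that completed without an exception.
    public static int Run(Action work, TimeSpan interval, CancellationToken token)
    {
        int completedRuns = 0;
        while (!token.IsCancellationRequested)
        {
            try
            {
                work();
                completedRuns++;
            }
            catch (Exception ex)
            {
                // Log and keep looping, as the original Run method does.
                Console.Error.WriteLine("Exception while running job: " + ex);
            }

            // Sleep for the interval, but wake up early if cancellation is
            // requested. WaitOne returns true when the handle was signalled.
            if (token.WaitHandle.WaitOne(interval))
            {
                break;
            }
        }
        return completedRuns;
    }
}
```

The console application's Main method would call PollingLoop.Run with an interval of TimeSpan.FromSeconds(10). How the token gets cancelled when the host shuts the job down is left open here.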
Regarding this problem:
So far everything works well, but what I already half expected did indeed happen: the workerrole isn't working anymore and is throwing errors. The first error I saw was:
Could not load file or assembly 'Microsoft.WindowsAzure.ServiceRuntime, Version=2.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified.
So of course I took the quick fix, went to 'Properties > Copy Local' and set it to true. After publishing this to the Azure website I'm getting the following error:
Could not load file or assembly 'msshrtmi, Version=2.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified.
Microsoft.WindowsAzure.ServiceRuntime is specific to Cloud Services and will not work in Web Apps (that's why you get the msshrtmi error with the web app). If you are still running a worker role, that file is in the instance GAC, and it should be in your local machine's GAC as well. That said, Microsoft.WindowsAzure.ServiceRuntime can be referenced in the worker role project, but not in the web app project.
I'm guessing you are using ServiceRuntime to get some configuration setting value using:
var value = RoleEnvironment.GetConfigurationSettingValue(settingName);
You can change it to:
var value = CloudConfigurationManager.GetSetting(settingName);
as this method reads the configuration setting value from the appropriate configuration store (from MSDN).
The best solution here is to convert the Worker Role to a WebJob, as @Gaurav mentioned above.
If you want to connect the Web App to the Worker Role, you could use an Azure Queue or other intermediary storage where operations can be dropped by the Web App and picked up by the Worker Role.

Transactions in NServicebus using Azure Service Bus Transport

I have several message handlers in a particular endpoint that do their work against a SQL Azure database (at the moment still using a local SQL 2012 instance). I have a command handler that publishes 2 events, call them X and Y. In the same endpoint I have a subscriber to X and a subscriber to Y. Both of these subscribers are internally using the same data access component, call that Z. Dependency injection is configured on a per-call basis, not shared.
Component Z is using Entity Framework 6 under the curtains. The issue I am having is that just opening the database is throwing a SqlException and complaining about MSDTC escalations.
I have temporarily wrapped the handlers in a TransactionScope.Suppress and that has stopped the error but I believe I'm missing something more fundamental.
Is it a simple matter of configuring the endpoint to be non-transactional? I would have thought this would just work seeing as I've configured to use Azure Service Bus as the transport mechanism. If I do this will NServiceBus still retry if an exception is thrown within the message handler? (Up to the SLR limits -- not part of the question, I also understand the idempotency issues).
@Phil,
First, you shouldn't be using MSDTC with SQL Azure; it's not supported. The feature has been suggested, but it is only under review. DTC is not supported on Azure. Alternatively, you could look into the following suggestion to use the SqlTransaction approach.
Second, the transport you're using has nothing to do with your data access. Since you're using Azure Service Bus, it will not be part of your handler code. Making a handler transactional forces its changes to be atomic: either everything is applied or everything is rolled back. Whether or not your handler is transactional, the message will be retried. The challenge is that when the handler/endpoint is not transactional and, within the handler, the first write to the DB succeeds and the second fails, the first write won't be reverted. As for Azure Service Bus as a transport, it's not transactional in its nature (i.e. no DTC).
Which version of NServiceBus.Azure are you on? Do you have a stack trace of the exception? Where does it come from?
We push the sends and publishes out of the scope of the receive transaction scope explicitly to prevent promotion to the DTC, so that the transaction is local to SQL Server, so I doubt that is what is happening here.
From your description it looks like you are using a different data access instance for each handler (per-call container config) and you have multiple handlers on the same message. If both of these open a new connection to SQL Server, you would see promotion as well (even if it is the same server).
Could that be it? That it throws on the second open?

Is there a limit on the number of sessions for Azure Web SQL Database?

We are using the Azure SQL Database (Web Edition) for an MVC3 ASP.NET/EF5 application.
Is there a limit to the number of sessions that this SQL Database setup supports? I am just wondering whether any delays that we are getting are due to some form of queuing or pooling. Currently we have about 5 concurrent users.
Thanks.
The SQL Azure Web edition database should support a high number of concurrent users - we've had applications running that issue thousands of queries per minute against Web databases.
Throttling
SQL Azure does implement database throttling to maintain performance for all users of the platform. If throttling has been applied to the current operation you'll receive error 40501. The link I've provided also shows you how to determine why throttling is being applied. If you receive this error you can treat it as a transient error and wait before retrying.
It doesn't sound like your connections are being throttled, because you mention only 5 concurrent users and talk about delays, whereas the throttling error would occur pretty quickly.
Transient error handling
If you're getting connection timeouts and similar failures, you need to handle them as transient errors. Transient errors are timeouts or dropped connections, as well as error codes 10054, 10053, 40501 (throttling as described above) and 40197 (usually because an upgrade or failover operation is in progress).
You should ensure you implement retry logic to handle transient errors.
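As a minimal sketch of the classification step, the error codes listed above can be collected into a lookup that retry logic consults. The class and method names here are hypothetical, and the set contains only the four codes named in this answer, so it is not a complete list of transient SQL Database errors.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: decides whether a SQL error number is one of the
// transient codes mentioned in the answer above.
public static class TransientErrors
{
    private static readonly HashSet<int> TransientNumbers = new HashSet<int>
    {
        10053, // connection dropped
        10054, // connection dropped
        40197, // usually an upgrade or failover in progress
        40501  // throttling
    };

    public static bool IsTransient(int sqlErrorNumber)
    {
        return TransientNumbers.Contains(sqlErrorNumber);
    }
}
```

In a catch block for SqlException, the value to pass in would be the exception's Number property. Timeouts would need to be detected separately, since they are not covered by these four codes.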
Query performance
If you're executing long running queries you can check which ones are slow by logging into the database management URL:
https://<database-id>.database.windows.net/#$database=<database-name>
Log in and click "Query Performance" - take a look at the longest running queries at the top.

Connection to SQL Azure DB is unstable

I'm having a strange error with a recently deployed Azure website.
Everything seems to work most of the time, but on a regular basis (at least daily) there is a period during which I receive following error:
A network-related or instance-specific error occurred while establishing a connection to SQL Server.
The server was not found or was not accessible.
Is this a stability issue with Azure or is it possible that something's wrong in my code (but why does it work then most of the time)?
Is the code using the Transient Fault Handling Application Block - http://msdn.microsoft.com/en-us/library/hh680934(v=PandP.50).aspx? This block understands how to handle the transient errors that can, and will, happen with SQL Database.

Resources