Transactions in NServiceBus using Azure Service Bus Transport

I have several message handlers in a particular endpoint that do their work against a SQL Azure database (at the moment still using a local SQL 2012 instance). I have a command handler that publishes 2 events, call them X and Y. In the same endpoint I have a subscriber to X and a subscriber to Y. Both of these subscribers are internally using the same data access component, call that Z. Dependency injection is configured on a per-call basis, not shared.
Component Z is using Entity Framework 6 under the covers. The issue I am having is that just opening the database connection throws a SqlException complaining about MSDTC escalation.
I have temporarily wrapped the handlers in a TransactionScope.Suppress and that has stopped the error but I believe I'm missing something more fundamental.
Is it a simple matter of configuring the endpoint to be non-transactional? I would have thought this would just work, seeing as I've configured Azure Service Bus as the transport mechanism. If I do this, will NServiceBus still retry if an exception is thrown within the message handler? (Up to the SLR limits - not part of the question; I also understand the idempotency issues.)
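For reference, the temporary workaround currently looks roughly like this (a sketch only - the handler, event and context names are stand-ins for my real types, assuming the NServiceBus 5-style handler signature):

using System.Transactions;
using NServiceBus;

public class XSubscriber : IHandleMessages<XEvent>
{
    public void Handle(XEvent message)
    {
        // Suppress the ambient transaction so the EF connection inside component Z
        // is not enlisted and escalated to MSDTC.
        using (new TransactionScope(TransactionScopeOption.Suppress))
        using (var db = new MyDbContext()) // hypothetical EF6 context used by component Z
        {
            // ... work against the database ...
            db.SaveChanges();
        }
    }
}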

@Phil,
First, you shouldn't be using MSDTC with SQL Azure - it's not supported. The feature has been requested, but it is still only under review; DTC is simply not supported on Azure. Alternatively, you could look into the SqlTransaction approach suggested below.
Second, the transport you're using has nothing to do with your data access. Since you're using Azure Service Bus, it will not take part in your handler's database transaction. Making a handler transactional is about forcing its changes to commit or roll back atomically. Regardless, the handler will still be retried. The challenge is that when the handler/endpoint is not transactional and the first write to the DB within the handler succeeds while the second fails, the first write won't be reverted. As for Azure Service Bus as a transport, it's not transactional in nature (i.e. no DTC).
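A rough sketch of what that SqlTransaction approach could look like inside a handler (the connection string, table and parameters are placeholders, not taken from your code):

using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        using (var cmd = new SqlCommand(
            "UPDATE Orders SET Status = @status WHERE Id = @id", conn, tx))
        {
            cmd.Parameters.AddWithValue("@status", "Processed");
            cmd.Parameters.AddWithValue("@id", 42);
            cmd.ExecuteNonQuery();
        }

        // Everything executed on this single connection commits or rolls back together,
        // without any distributed (MSDTC) transaction.
        tx.Commit();
    }
}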

Which version of NServiceBus.Azure are you on? Do you have a stack trace of the exception? Where does it come from?
We push the sends and publishes out of the scope of the receive transaction scope explicitly to prevent promotion to the DTC, so that the transaction stays local to SQL, so I doubt that is what is happening here.
From your description it looks like you are using a different data access instance for each handler (per-call container config) and you have multiple handlers on the same message. If both of these open a new connection to SQL you would see promotion as well (even if it is the same server).
Could that be it? That it throws on the second open?
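To illustrate the promotion scenario, this is roughly the shape that triggers it (a sketch; the connection string is a placeholder):

using System.Data.SqlClient;
using System.Transactions;

using (var scope = new TransactionScope())
{
    // The first handler's per-call data access instance opens a connection...
    using (var conn1 = new SqlConnection(connectionString))
    {
        conn1.Open();
        // ... first handler's work ...
    }

    // ...and the second handler's own instance opens another one.
    // Against SQL 2012 / SQL Azure this second enlistment promotes the ambient
    // transaction to a distributed (MSDTC) transaction, which SQL Azure rejects,
    // so the SqlException surfaces on this Open().
    using (var conn2 = new SqlConnection(connectionString))
    {
        conn2.Open();
        // ... second handler's work ...
    }

    scope.Complete();
}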

Related

Azure Function with ServiceBusTrigger circuit breaker pattern

I have an Azure Function with a ServiceBusTrigger which posts the message content to a web service behind an Azure API Manager. In some cases the load on the (3rd-party) web service backend is too high and it collapses, returning error 500.
I'm looking for a proper way to implement circuit breaker here.
I've considered the following:
Disable the Azure Function, but it might result in data loss due to multiple messages in memory (serviceBus.prefetchCount)
Implement API Manager with a rate-limit policy, but this seems counterproductive as it runs fine in most cases
Re-architecting the 3rd party webservice is out of scope :)
Set the queue to ReceiveDisabled; this is the preferred solution, but it results in my InputBinding throwing a huge number of MessagingEntityDisabledExceptions which I'm (so far) unable to catch and handle myself. I've checked the docs for host.json, ServiceBusTrigger and the Run parameters but was unable to find a useful setting there. (Setting the status itself is sketched after this list.)
Keep some sort of response-code result set and increase the retry time; not ideal in a serverless scenario with multiple parallel functions.
Let API Manager map 500 errors to 429 and reschedule those later; this will probably work, but since we send a lot of messages it will hammer the service for some time. In addition it's hard to distinguish between a temporary 500 error and a persistent one.
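For reference, setting the status itself is done through the ManagementClient in the Microsoft.Azure.ServiceBus package (a rough sketch; the connection string and queue name are placeholders):

using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus.Management;

public static async Task DisableReceiveAsync(string connectionString, string queueName)
{
    var client = new ManagementClient(connectionString);

    var queue = await client.GetQueueAsync(queueName);
    queue.Status = EntityStatus.ReceiveDisabled; // senders can still enqueue, consumers can't receive
    await client.UpdateQueueAsync(queue);

    await client.CloseAsync();
}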
Note that this question is not about deciding whether or not to trigger the circuit breaker, merely about how to handle the appropriate action afterwards.
Additional info
Azure Functions v2, .NET Core 3.1, running in the Consumption plan
API Manager runs the Basic SKU
Service Bus runs in the Premium tier
Message count: 300,000

Azure - Send message to all other Roles and wait for response

A really common pattern that I need in multi-instance web applications is invalidating MemoryCaches across all instances - and waiting for a confirmation that this has been done. (Because otherwise a user might, after a refresh, suddenly see old data on another instance.)
We could build this with a combination of:
Azure Service Bus,
Sending message to a topic
other instances send message back with ReplyTo to the original instance
have a wait loop for waiting on the messages back,
be aware of how many other instances are there in the first place.
probably some timeout because what happens if an instance crashes in between?
I think working out all these little edge cases might be a lot of work - so before we reinvent the wheel - is there already a common pattern or library for this?
(of course one solution would be using a shared cache like Redis, but for some situations a MemoryCache is a lot faster)
Have a look at Azure Durable Functions, e.g. Fan-In/Fan-Out scenario. They use Azure Storage Queues underneath, but provide higher-level abstractions.
Note that Durable Functions are still in early preview (as of August 2017), so not suitable for production use yet.
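A rough fan-out/fan-in sketch (written against the current Durable Functions C# API, so it will look different from the 2017 preview bits; the function and activity names are made up):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class CacheInvalidationOrchestration
{
    [FunctionName("InvalidateAllCaches")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Fan out: one activity call per known instance.
        var instances = await context.CallActivityAsync<List<string>>("GetInstances", null);

        var confirmations = new List<Task>();
        foreach (var instance in instances)
        {
            confirmations.Add(context.CallActivityAsync("InvalidateCacheOnInstance", instance));
        }

        // Fan in: continue only once every instance has confirmed.
        await Task.WhenAll(confirmations);
    }
}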
I think working out all these little edge cases might be a lot of work - so before we reinvent the wheel - is there already a common pattern or library for this?
Indeed. This sounds like a candidate for a middleware framework such as NServiceBus or MassTransit.
Azure Service Bus
Both NServiceBus and MassTransit support Azure Service Bus as the transport.
Sending message to a topic
Both NServiceBus and MassTransit can Publish messages (events) to topics.
other instances send message back with ReplyTo to the original instance
Both NServiceBus and MassTransit can send messages to a specific destination. NServiceBus can also Reply to the originator of an incoming message using the request/reply pattern (see the sketch at the end of this answer).
have a wait loop for waiting on the messages back
Both NServiceBus and MassTransit support Sagas, also known as Process Coordinator pattern.
be aware of how many other instances are there in the first place.
Not sure about this requirement. When you scale out, you're running with competing consumers and shouldn't care about the number of instances of an endpoint.
probably some timeout because what happens if an instance crashes in between?
If you refer to retries and recovery, then both NServiceBus and MassTransit support retries.
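As a rough illustration of the reply part mentioned above (a sketch assuming an NServiceBus 6+ handler; the message types are made up):

using System.Threading.Tasks;
using NServiceBus;

// Runs on every instance subscribed to the event.
public class CacheInvalidationHandler : IHandleMessages<InvalidateCache>
{
    public async Task Handle(InvalidateCache message, IMessageHandlerContext context)
    {
        // ... invalidate the local MemoryCache here ...

        // Reply to the instance that originally published the event.
        await context.Reply(new CacheInvalidated());
    }
}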
You can use Azure Redis cache pub/sub model to do this.
1) On each instance, subscribe to the invalidation channel via the Redis multiplexer
// using StackExchange.Redis;
connectionMultiplexer.GetSubscriber().Subscribe(
    "SubscribeChannelName",
    (channel, message) =>
    {
        // invalidate the local cache here, then publish the confirmation
        connectionMultiplexer.GetSubscriber()
            .PublishAsync("PublishChannelName", "Cache invalidated for instance").Wait();
    });
2) Publish the cache invalidation and subscribe for confirmations from the instances
var connection = ConnectionMultiplexer.Connect("redis connection string");
var redisSubscriber = connection.GetSubscriber();
redisSubscriber.Subscribe(
    "PublishChannelName",
    (channel, message) =>
    {
        // verify here whether all instances have confirmed the cache invalidation
    });
redisSubscriber.PublishAsync("SubscribeChannelName", "invalidate cache").Wait();

Azure WebJob DB Connection Error Only on some instances

I have two Azure WebJobs. The first takes an incoming message that tells it to grab a PDF and break it into individual page images and then queue another message for the second WebJob to process the individual pages. It worked fine on our QC instance, but when we tried to move to production I started getting strange errors on the second job, but not consistently. The first job runs and breaks the file into page images. That is working fine. I have confirmed that every page image gets created and every page message gets queued. However, for the second job, only some of the messages are getting processed correctly. The remaining show this error in the WebJob diagnostics:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.ProcessBatchPage ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 52 - Unable to locate a Local Database Runtime installation. Verify that SQL Server Express is properly installed and that the Local Database Runtime feature is enabled.) ---> System.ComponentModel.Win32Exception: The system cannot find the file specified
But what's weird is that this error mentions the Local Database Runtime and SQL Server Express, and I am not referencing either anywhere in my code. The system points at an Azure SQL DB. The job uses ADO.NET and I have hardcoded the connection string to try to eliminate any issues with configuration-based connection strings. Also strange is that it only happens to a certain portion of the messages; the others process perfectly.
Lastly, I ran the job in debug locally (still pointing at the real queue and DB on Azure) and got the same problem. But the job outputs a console line with the job ID as the first line of the code. For those jobs that process successfully, I see this WriteLine. For those that fail, I never see anything. It's almost like the job is not really starting up correctly. (The failed jobs also have a really short run time, 50-100 ms.)
I had the same issue with some jobs and came across these articles to find a solution:
Transient Fault Handling (Building Real-World Cloud Apps with Azure)
Connection Resiliency / Retry Logic (EF6 onwards)
From these articles:
Causes of transient failures:
In the cloud environment you’ll find that failed and dropped database connections happen periodically. That’s partly because you’re going through more load balancers compared to the on-premises environment where your web server and database server have a direct physical connection. Also, sometimes when you’re dependent on a multi-tenant service you’ll see calls to the service get slower or time out because someone else who uses the service is hitting it heavily. In other cases you might be the user who is hitting the service too frequently, and the service deliberately throttles you – denies connections – in order to prevent you from adversely affecting other tenants of the service.
Use smart retry/back-off logic to mitigate the effect of transient failures:
The Microsoft Patterns & Practices group has a Transient Fault Handling Application Block that does everything for you if you're using ADO.NET for SQL Database access (not through Entity Framework). You just set a policy for retries – how many times to retry a query or command and how long to wait between tries – and wrap your SQL code in a using block:
public void HandleTransients()
{
    var connStr = "some database";
    var _policy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
        retryCount: 3,
        retryInterval: TimeSpan.FromSeconds(5));

    using (var conn = new ReliableSqlConnection(connStr, _policy))
    {
        // Do SQL stuff here.
    }
}
When you use the Entity Framework you typically aren’t working directly with SQL connections, so you can’t use this Patterns and Practices package, but Entity Framework 6 builds this kind of retry logic right into the framework. In a similar way you specify the retry strategy, and then EF uses that strategy whenever it accesses the database.
To use this feature in the Fix It app, all we have to do is add a class that derives from DbConfiguration and turn on the retry logic.
// EF follows a Code based Configuration model and will look for a class that
// derives from DbConfiguration for executing any Connection Resiliency strategies
public class EFConfiguration : DbConfiguration
{
    public EFConfiguration()
    {
        AddExecutionStrategy(() => new SqlAzureExecutionStrategy());
    }
}

Azure Service Bus: transient errors (exceptions) received through the message pump with built-in retry policy. Why?

I've been reading on the Event-Driven Message Programming Model introduced in April 2013, the OnMessageOptions.ExceptionReceived Event, the built-in RetryPolicy (May 2013, RetryPolicy.Default), The Transient Fault Handling Application Block (2011) which is outdated, and more (see bottom).
I've been monitoring the exceptions received through the message pump for transient errors and I get daily MessagingCommunicationExceptions. This article (Updated: September 16, 2014) recommends the following:
This exception signals a communication error that can manifest itself when a connection from the messaging client to the Service Bus infrastructure cannot be successfully established. In most cases, provided network connectivity exists, this error can be treated as transient. The client can attempt to retry the operation that has resulted in this type of exception. It is also recommended that you verify whether the domain name resolution service (DNS) is operational, as this error may indicate that the target host name cannot be resolved.
My understanding is that there is no extra code to write to handle transient errors on the Service Bus after version 2.1 (2013). Unless my premise is wrong, why am I receiving transient errors each and every day? Should exceptions received through the message pump be ignored? If ignored, I can only assume that unexpected exceptions will also be ignored, and I don't want that to happen, of course.
Version of Microsoft.ServiceBus is 2.4.0.0
Also of interest : upgrading Windows Azure Service Bus from 1.x to 2.0 - Retry Policy, Introducing the Event-Driven Message Programming Model for the Windows Azure Service Bus, What's New in the Azure SDK 2.0 Release (April 2013), What's New in the Service Bus 2.1 Release (May 2013), Transient Fault Handling.
Officially answered here. In short, exceptions are bubbled up for monitoring purposes after the retry attempts. In long:
The transient exceptions you get from the ExceptionHandler callback mean those exceptions are bubbled up after retry attempts. You should just log them for monitoring purposes. Take action if you need to. For example, if your client loses network connectivity you should expect a large number of communication exceptions bleeding through the handler. In such cases you may need to take proper actions to fix things. So the answer to the question "should I ignore them?" really depends on conditions. - Serkant Karaca, Microsoft
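In practice that means wiring up the ExceptionReceived callback just to log and classify (a rough sketch against the Microsoft.ServiceBus.Messaging message pump; ProcessMessage stands in for your existing message handler):

using System;
using Microsoft.ServiceBus.Messaging;

var options = new OnMessageOptions
{
    AutoComplete = true,
    MaxConcurrentCalls = 10
};

options.ExceptionReceived += (sender, args) =>
{
    if (args.Exception == null) return;

    var communicationException = args.Exception as MessagingCommunicationException;
    if (communicationException != null && communicationException.IsTransient)
    {
        // Already retried by the built-in RetryPolicy; log for monitoring only.
        Console.WriteLine("Transient communication error: " + communicationException.Message);
    }
    else
    {
        // Anything else is unexpected - log it and act on it.
        Console.WriteLine("Unexpected error: " + args.Exception);
    }
};

queueClient.OnMessage(ProcessMessage, options);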

upgrading Windows Azure Service Bus from 1.x to 2.0 - Retry Policy

I am going through an effort to upgrade code that used the old Windows Azure Service Bus (pre 2.0).
This code base used the Enterprise Library Transient Fault Handling block to provide a retry policy that is leveraged when calling into the Service Bus API to send and receive queue messages.
Typically the code would look like this (minus all the try/catch/finally, etc.):
retryPolicy.Execute(() => { queueClient.Send(msg); });
However, in Service Bus 2.0, retry policy is built into the messaging factory, so I can set:
_msgFactory.RetryPolicy = RetryExponential.Default;
var queueClient = _msgFactory.CreateQueueClient(path, mode);
Once that is done, I can remove the TFH retry policy's Execute() wrapper around the call to queueClient.Send(msg).
Is this really all that is needed to ensure the queue client is retrying on transient exceptions? Seems too simple. How can I prove that it is retrying?
That is indeed all you need to do ;) It's not easy to simulate errors that will be handled as transient. Maybe a very short network issue (although I'm not sure that client-side connection issues will be seen as transient).

Resources