Starting last night around 2:00 am - some 8 hours after anybody touched anything having to do with the website - our Azure website began throwing this error:
Error: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.). Additional Information : The client was trying to communicate with the server: net.tcp://payboardprod.cache.windows.net:22233. ( at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ErrStatus errStatus, Guid trackingId, Exception responseException, Byte[][] payload, EndpointID destination)
Basically, it looks as if our Azure cache server took a dive. But there's no indication of this anywhere on our Azure management console, which shows the caching server in question as up and running just fine. Nor is there any indication of a problem on the Azure service availability dashboard (http://azure.microsoft.com/en-us/support/service-dashboard/). The only indication of any sort of problem is that our Azure cache service started reporting zero requests around 1:00 am.
Our beta site, which uses a different caching server but is otherwise configured identically, stayed up through this whole episode.
We just have a BizSpark account, and hence no ability to open support tickets with MS.
We've restored service by disabling external caching, but that's obviously not optimal.
Any suggestions for troubleshooting this?
Wrap your calling code in appropriate protection (try/catch) and then cope with the failure at the app tier. The commodity platform offered in any cloud can (and does) have these sorts of issues from time to time. You need to bake in logging, and log somewhere like Azure Diagnostics (http://msdn.microsoft.com/en-us/library/gg433048.aspx) for later troubleshooting.
I still haven't figured out what the problem was, and just ended up following Simon W's advice about wrapping everything in try/catches up the wazoo. But because it's not 100% intuitive, and took me several tries to get the code for cache retrieval right, I thought I'd post it here for anybody else who's interested.
public TValue Get(string key, Func<TValue> missingFunc)
{
    // We need to ensure that two processes don't try to calculate the same value at the same time. That just wastes resources.
    // So we pull out a value from the _cacheLocks dictionary, and lock on that before trying to retrieve the object.
    // This does add a bit more locking, and hence the chance for one process to lock up everything else.
    // We may need to add some timeouts here at some point in time. It also doesn't prevent two processes on different
    // machines from trying the same bit o' nonsense. Oh well. It's probably still a worthwhile optimization.
    key = _keyPrefix + "." + key;
    var value = default(TValue);
    object cacheLock;
    lock (_cacheLocks)
    {
        if (!_cacheLocks.TryGetValue(key, out cacheLock))
        {
            cacheLock = new object();
            _cacheLocks[key] = cacheLock;
        }
    }
    lock (cacheLock)
    {
        // Try to get the value from the cache.
        try
        {
            value = _cache.Get(key) as TValue;
        }
        catch (SerializationException ex)
        {
            // This can happen when the app restarts, and we discover that the dynamic entity names have changed, and the desired type
            // is no longer around, e.g., "Organization_6BA9E1E1184D9B7BDCC50D94471D7A730423456A15BBAFB6A2C6AC0FF94C0D41"
            // If that's the error, we should probably warn about it, but no point in logging it as an error, since it's more-or-less expected.
            _logger.Warn("Error retrieving item '" + key + "' from Azure cache; falling back to missingFunc(). Error = " + ex);
        }
        catch (Exception ex)
        {
            _logger.Error("Error retrieving item '" + key + "' from Azure cache; falling back to missingFunc(). Error = " + ex);
        }
        // If we didn't get anything interesting, then call the function that should be able to retrieve it for us.
        if (value == default(TValue))
        {
            // If that function throws an exception, don't swallow it.
            value = missingFunc();
            // If we try to put it into the cache, and *that* throws an exception,
            // log it, and then swallow it.
            try
            {
                _cache.Put(key, value);
            }
            catch (Exception ex)
            {
                _logger.Error("Error putting item '" + key + "' into Azure cache. Error = " + ex);
            }
        }
    }
    return value;
}
You can use it like so:
var user = UserCache.Get(email, () =>
    _db.Users
        .FirstOrDefault(u => u.Email == email)
        .ShallowClone());
I currently have an Azure ContainerApp architected as a BackgroundService, deployed to Azure. The Worker listens for ServiceBusMessages and processes them.
.NET 6
Microsoft.ApplicationInsights.WorkerService 2.21.0
Application Insights is set up like this:
builder.Services.AddApplicationInsightsTelemetryWorkerService(opts =>
{
    opts.DependencyCollectionOptions.EnableLegacyCorrelationHeadersInjection = true;
});
builder.Services.ConfigureTelemetryModule<DependencyTrackingTelemetryModule>(
    (module, o) => { module.EnableSqlCommandTextInstrumentation = true; });
and I record the AI Operation in question by injecting a TelemetryClient and using it in the ServiceBusProcessor.ProcessMessageAsync handler like:
using (var op = _telemetryClient.StartOperation<RequestTelemetry>("ProcessAdapterMessage.ProcessSBMessage"))
{
    // Process Data
}
My issue is that this operation happens a LOT, which is fine because I use AI sampling, but it's also duplicated. It's recorded once under the proper operation name "ProcessAdapterMessage.ProcessSBMessage", and once under the operation name "<Empty>".
If I drill down into the "<Empty>" operation, it's actually just "ServiceBusProcessor.ProcessSBMessage" operations that wrap the same method as before. The reason I created the manual op is because looking at data named "Empty" isn't useful, so I'd rather keep my manual op, and make the "Empty" one go away. Any ideas on how to fix this?
This is what the "Empty" Operation details look like
This is what the "ProcessAdapterMessage.ProcessSBMessage" Operation details look like:
The Azure Service Bus supports a built-in retry mechanism which makes an abandoned message immediately visible for another read attempt. I'm trying to use this mechanism to handle some transient errors, but the message is made available immediately after being abandoned.
What I would like to do is make the message invisible for a period of time after it is abandoned, preferably based on an exponentially incrementing policy.
I've tried to set the ScheduledEnqueueTimeUtc property when abandoning the message, but it doesn't seem to have an effect:
var messagingFactory = MessagingFactory.CreateFromConnectionString(...);
var receiver = messagingFactory.CreateMessageReceiver("test-queue");
receiver.OnMessageAsync(async brokeredMessage =>
{
    await brokeredMessage.AbandonAsync(
        new Dictionary<string, object>
        {
            { "ScheduledEnqueueTimeUtc", DateTime.UtcNow.AddSeconds(30) }
        });
});
I've considered not abandoning the message at all and just letting the lock expire, but this would require having some way to influence how the MessageReceiver specifies the lock duration on a message, and I can't find anything in the API to let me change this value. In addition, it wouldn't be possible to read the delivery count of the message (and therefore make a decision about how long to wait for the next retry) until after the lock is already acquired.
Can the retry policy in the Service Bus be influenced in some way, or can a delay be artificially introduced in some other way?
Careful here, because I think you are confusing the retry feature with the automatic Complete/Abandon mechanism for the OnMessage event-driven message handling. The built-in retry mechanism comes into play when a call to the Service Bus fails. For example, if you call to mark a message as complete and that call fails, the retry mechanism kicks in. If you are processing a message and an exception occurs in your own code, that will NOT trigger a retry through the retry feature. Your question isn't explicit about whether the error comes from your own code or from attempting to contact the Service Bus.
If you are indeed after modifying the retry policy that applies when an error occurs while communicating with the Service Bus, you can modify the RetryPolicy that is set on the MessageReceiver itself. There is a RetryExponential policy which is used by default, as well as an abstract RetryPolicy you can derive your own from.
What I think you are after is more control over what happens when you get an exception doing your processing, and you want to push off working on that message. There are a few options:
When you create your message handler you can set up OnMessageOptions. One of the properties is "AutoComplete". By default this is set to true, which means that as soon as processing for the message completes, the Complete method is called automatically. If an exception occurs, then Abandon is automatically called, which is what you are seeing. By setting AutoComplete to false, you are required to call Complete on your own from within the message handler. Failing to do so will cause the message lock to eventually run out, which is one of the behaviors you are looking for.
So, you could write your handler so that if an exception occurs during your processing, you simply do not call Complete. The message would then remain on the queue until its lock runs out, and then it would become available again. The standard dead-lettering mechanism applies, and after x number of tries it will be put into the dead-letter queue automatically.
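A minimal sketch of that approach, using the classic Microsoft.ServiceBus.Messaging API from the question (the processing logic is a placeholder):
var options = new OnMessageOptions
{
    AutoComplete = false,       // we decide when (or whether) to call Complete
    MaxConcurrentCalls = 1
};
receiver.OnMessageAsync(async brokeredMessage =>
{
    try
    {
        // ... process the message ...
        await brokeredMessage.CompleteAsync();
    }
    catch (Exception)
    {
        // Deliberately do nothing: the message stays locked until the lock
        // expires, and only then becomes available for redelivery.
    }
}, options);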
A caution with this approach is that every type of exception will be treated the same way. You really need to think about which types of exceptions should push off processing and which shouldn't. For example, if you call a third-party system during your processing and it gives you an exception you know is transient, great. If, however, it gives you an error that you know will be a big problem, then you may decide to do something else in the system besides just bailing on the message.
You could also look at the "Defer" method. This method actually will then not allow that message to be processed off the queue unless it is specifically pulled by its sequence number. You're code would have to remember the sequence number value and pull it. This isn't quite what you described though.
Another option is to move away from the OnMessage, event-driven style of processing messages. While it is very helpful, you don't get a lot of control over things. Instead, hook up your own processing loop and handle the Abandon/Complete calls on your own. You'll also need to deal with some of the threading/concurrent-call management that the OnMessage pattern gives you. This can be more work, but you get the ultimate in flexibility.
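A rough sketch of what that hand-rolled loop might look like (synchronous for brevity; shutdown and error handling are elided):
while (keepRunning)
{
    BrokeredMessage message = receiver.Receive(TimeSpan.FromSeconds(30));
    if (message == null) continue; // receive timed out; poll again

    try
    {
        // ... process the message ...
        message.Complete();
    }
    catch (Exception)
    {
        // Full control here: Abandon now, Defer, dead-letter,
        // or do nothing at all and let the lock expire.
        message.Abandon();
    }
}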
Finally, I believe the reason the call you made to AbandonAsync, passing the properties you wanted to modify, didn't work is that those properties refer to metadata properties on the method, not standard properties on BrokeredMessage.
I actually asked this same question last year (implementation aside) with the three approaches I could think of looking at the API. #ClemensVasters, who works on the SB team, responded that using Defer with some kind of re-receive is really the only way to control this precisely.
You can read my comment on his answer for a specific approach: I suggested using a secondary queue to store messages that indicate which primary messages have been deferred and need to be re-received from the main queue. You can then control exactly how long you wait before retrying by setting ScheduledEnqueueTimeUtc on those secondary messages, as sketched below.
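A hedged sketch of that secondary-queue idea, using the classic BrokeredMessage API (queue names, the ticket payload, and the backoff formula are illustrative, not from the original answer):
// In the primary handler, on a transient failure, defer the message:
await brokeredMessage.DeferAsync();

// Send a "retry ticket" carrying the sequence number, with delayed visibility.
var ticket = new BrokeredMessage(brokeredMessage.SequenceNumber)
{
    ScheduledEnqueueTimeUtc = DateTime.UtcNow.AddSeconds(
        10 * Math.Pow(2, brokeredMessage.DeliveryCount)) // exponential delay
};
await retryQueueClient.SendAsync(ticket);

// In the retry-queue handler, pull the deferred message back by sequence number:
long sequenceNumber = ticketMessage.GetBody<long>();
BrokeredMessage original = await primaryReceiver.ReceiveAsync(sequenceNumber);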
I ran into a similar issue where our order picking system is legacy and goes into maintenance mode each night.
Using the ideas in this article (https://markheath.net/post/defer-processing-azure-service-bus-message), I created a custom property to track how many times a message has been resubmitted, and I manually dead-letter the message after 10 tries. If the message is under 10 retries, the code clones the message, increments the custom property, and schedules the clone for a later enqueue time.
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;
using Newtonsoft.Json;

// Fields assumed on the containing class:
// private readonly QueueClient queueClient;
// QUEUE_CONN_STRING and QUEUE_NAME come from configuration.

public PickQueue()
{
    queueClient = new QueueClient(QUEUE_CONN_STRING, QUEUE_NAME);
}

public async Task QueueMessageAsync(int OrderId)
{
    string body = JsonConvert.SerializeObject(OrderId);
    var message = new Message(Encoding.UTF8.GetBytes(body));
    await queueClient.SendAsync(message);
}

public async Task ReQueueMessageAsync(Message message, DateTime utcEnqueueTime)
{
    // The property is absent on the first delivery, so read it defensively;
    // indexing UserProperties directly would throw KeyNotFoundException.
    object raw;
    int resubmitCount =
        message.UserProperties.TryGetValue("ResubmitCount", out raw) ? (int)raw + 1 : 1;

    if (resubmitCount > 10)
    {
        await queueClient.DeadLetterAsync(message.SystemProperties.LockToken);
    }
    else
    {
        // Schedule the clone (not the original) with the updated count,
        // then complete the original so it isn't redelivered.
        Message clone = message.Clone();
        clone.UserProperties["ResubmitCount"] = resubmitCount;
        await queueClient.ScheduleMessageAsync(clone, utcEnqueueTime);
        await queueClient.CompleteAsync(message.SystemProperties.LockToken);
    }
}
This question asks how to implement exponential backoff in Azure Functions. If you do not want to use the built-in RetryPolicy (only available when autoComplete = false), here's the solution I've been using:
public static async Task ExceptionHandler(IMessageSession MessageSession, string LockToken, int DeliveryCount)
{
    if (DeliveryCount < Globals.MaxDeliveryCount)
    {
        var DelaySeconds = Math.Pow(Globals.ExponentialBackoff, DeliveryCount);
        await Task.Delay(TimeSpan.FromSeconds(DelaySeconds));
        await MessageSession.AbandonAsync(LockToken);
    }
    else
    {
        await MessageSession.DeadLetterAsync(LockToken);
    }
}
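A hypothetical call site inside the message handler might look like this (the message and messageSession variables are assumptions about the surrounding handler):
try
{
    // ... process the message ...
    await messageSession.CompleteAsync(message.SystemProperties.LockToken);
}
catch (Exception)
{
    await ExceptionHandler(
        messageSession,
        message.SystemProperties.LockToken,
        message.SystemProperties.DeliveryCount);
}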
I am in the process of upgrading my C# solution to use the new Azure SDK 2.0 libraries. I have made minor changes to account for the breaking changes in the 2.0 libraries, but other than that, it's the same code. I have tested against my local storage and everything seems to be working fine, but when I test against production Azure blob storage, it takes an excessive amount of time just to check whether a blob item exists. It literally takes a minute, if not more, simply to return a Boolean indicating whether the item exists.
In the code sample below, the line that is taking a very long time to complete is "if (!blob.Exists())".
public byte[] GetBlobContent(string blobName)
{
    if (blobName == "") return null;
    var blobClient = _storageAccount.CreateCloudBlobClient();
    var container = blobClient.GetContainerReference(_containerName);
    var blob = container.GetBlockBlobReference(blobName);
    if (!blob.Exists())
    {
        return null;
    }
    byte[] buffer;
    using (var ms = new MemoryStream())
    {
        blob.DownloadToStream(ms);
        buffer = ms.ToArray();
    }
    return buffer;
}
Are there additional changes that I need to make to my code to make it perform like it used to?
An alternate approach would be to skip the Exists() check and just let DownloadToStream fail if the blob does not exist. You will need a try/catch block around DownloadToStream to handle the "expected" failure. This approach saves you a round trip to storage for every blob read, since it only has to make the remote call once instead of twice.
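For what it's worth, here's a sketch of that single-round-trip version, assuming the SDK 2.0 StorageException surface (the 404 check covers the "expected" not-found case):
public byte[] GetBlobContent(string blobName)
{
    if (blobName == "") return null;
    var blobClient = _storageAccount.CreateCloudBlobClient();
    var container = blobClient.GetContainerReference(_containerName);
    var blob = container.GetBlockBlobReference(blobName);
    try
    {
        using (var ms = new MemoryStream())
        {
            blob.DownloadToStream(ms);
            return ms.ToArray();
        }
    }
    catch (StorageException ex)
    {
        // The "expected" failure: the blob does not exist.
        if (ex.RequestInformation != null &&
            ex.RequestInformation.HttpStatusCode == 404)
        {
            return null;
        }
        throw;
    }
}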
Fiddler doesn't reveal anything. I experienced a similar issue where the process hung indefinitely, until the timeout kicked in. Try replacing calls to CloudTable.CreateIfNotExists with CloudTable.CreateIfNotExistsAsync, and calls to CloudTable.Execute with CloudTable.ExecuteAsync.
Unfortunately I can't tell you why this works. At the time of writing, the WindowsAzure.Storage-PremiumTable NuGet package was in pre-release mode, and presumably buggy.
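In case it's useful, the swap described above looks roughly like this (table and key names are placeholders):
CloudTable table = tableClient.GetTableReference("mytable");
await table.CreateIfNotExistsAsync();

TableOperation retrieve = TableOperation.Retrieve<DynamicTableEntity>("pk", "rk");
TableResult result = await table.ExecuteAsync(retrieve);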
I have a web application working with an SQLite database.
My version of SQLite is the latest from the official Windows binary distribution - 3.7.13.
The problem is that under heavy load on the database, SQLite API functions (such as sqlite3_step) start returning SQLITE_BUSY.
I pass the following pragmas when initializing a connection:
journal_mode = WAL
page_size = 4096
synchronous = FULL
foreign_keys = on
The database is a single-file database, and I'm using Mono 2.10.8 and the Mono.Data.Sqlite assembly provided with it to access the database.
I'm testing with 50 parallel threads, each sending 50 consecutive HTTP requests to my application. On every request, some reading and writing is done against the database, and every set of IO operations is executed inside a transaction.
Everything goes well until around the 400th - 700th request. At that (random) moment, the API functions start returning SQLITE_BUSY permanently (to be more exact, until the retry limit is reached).
As far as I know, WAL mode transparently supports parallel reads and writes. I guessed it could be caused by an attempt to read the database while a checkpoint operation is executing, but even after turning autocheckpoint off the situation remains the same.
What could be wrong in this situation?
How to serve large amount of parallel database IO correctly?
P.S.
Only one connection per request is intended.
I use NHibernate configured with WebSessionContext.
I initialize my NHibernate session like this:
ISession session = null;
// factory variable is session factory
if (CurrentSessionContext.HasBind(factory))
{
    session = factory.GetCurrentSession();
    if (session == null)
        CurrentSessionContext.Unbind(factory);
}
if (session == null)
{
    session = factory.OpenSession();
    CurrentSessionContext.Bind(session);
}
return session;
And on HttpApplication.EndRequest I release it like this:
// factory variable is session factory
if (CurrentSessionContext.HasBind(factory))
{
    try
    {
        CurrentSessionContext.Unbind(factory)
            .Dispose();
    }
    catch (Exception ee)
    {
        Logr.Error("Error uninitializing session", ee);
    }
}
So as far as I know, there should be only one connection per request life cycle. While processing the request, code is executed sequentially (ASP.NET MVC 3), so it doesn't look like any concurrency is possible here. Can I conclude that no connections are shared in this case?
It's not clear to me if the request threads share the same connection or not. If they don't then you should not be having these issues.
Assuming that you are indeed sharing the connection object across multiple threads, you should use some locking mechanism, as the SqliteConnection isn't thread-safe (an old post, but the SQLite library maintained as part of Mono evolved from the System.Data.SQLite found at http://sqlite.phxsoftware.com).
So, assuming that you don't currently lock around the use of the SqliteConnection object, can you please try it? A simple way to accomplish this could look like this:
static readonly object _locker = new object();

public void ProcessRequest()
{
    lock (_locker)
    {
        // conn is the shared SqliteConnection instance used by all requests
        using (IDbCommand dbcmd = conn.CreateCommand())
        {
            string sql = "INSERT INTO foo VALUES ('bar')";
            dbcmd.CommandText = sql;
            dbcmd.ExecuteNonQuery();
        }
    }
}
You may however choose to open a distinct connection with each thread to ensure you don't have any more threading issues with the SQLite library.
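A sketch of that per-thread alternative, opening a short-lived connection per request (the connection string is illustrative):
// Assumes: using System.Data; using Mono.Data.Sqlite;
public void ProcessRequest()
{
    // Each request gets its own short-lived connection, so nothing is shared
    // across threads.
    using (var conn = new SqliteConnection("Data Source=app.db;Version=3"))
    {
        conn.Open();
        using (IDbCommand dbcmd = conn.CreateCommand())
        {
            dbcmd.CommandText = "INSERT INTO foo VALUES ('bar')";
            dbcmd.ExecuteNonQuery();
        }
    }
}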
EDIT
Following up on the code you posted: do you close the session after committing the transaction? If you don't use an ITransaction, do you flush and close the session? I'm asking since I don't see it in your code, and I see it mentioned in https://stackoverflow.com/a/43567/610650
I also see it mentioned on http://nhibernate.info/doc/nh/en/index.html#session-configuration:
Also note that you may call NHibernateHelper.GetCurrentSession(); as many times as you like, you will always get the current ISession of this HTTP request. You have to make sure the ISession is closed after your unit-of-work completes, either in Application_EndRequest event handler in your application class or in a HttpModule before the HTTP response is sent.
I am trying to register a WebDeleting event receiver within SharePoint. This works fine in my development environment, but not in several staging environments. The error I get back is "Value does not fall within the expected range.". Here is the code I use:
SPSecurity.RunWithElevatedPrivileges(delegate()
{
    using (SPSite elevatedSite = new SPSite(web.Site.ID))
    {
        using (SPWeb elevatedWeb = elevatedSite.OpenWeb(web.ID))
        {
            try
            {
                elevatedWeb.AllowUnsafeUpdates = true;
                SPEventReceiverDefinition eventReceiver = elevatedWeb.EventReceivers.Add(new Guid(MyEventReciverId));
                eventReceiver.Type = SPEventReceiverType.WebDeleting;
                Type eventReceiverType = typeof(MyEventHandler);
                eventReceiver.Assembly = eventReceiverType.Assembly.FullName;
                eventReceiver.Class = eventReceiverType.FullName;
                eventReceiver.Update();
                elevatedWeb.AllowUnsafeUpdates = false;
            }
            catch (Exception ex)
            {
                // Do stuff...
            }
        }
    }
});
I realize that I can do this through a feature element file (I am trying that approach now), but would prefer to use the above approach.
The errors I consistently get in the ULS logs are:
03/11/2010 17:16:57.34 w3wp.exe (0x09FC) 0x0A88 Windows SharePoint Services Database 6f8g Unexpected Unexpected query execution failure, error code 3621. Additional error information from SQL Server is included below. "The statement has been terminated." Query text (if available): "{?=call proc_InsertEventReceiver(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}"
03/11/2010 17:16:57.34 w3wp.exe (0x09FC) 0x0A88 Windows SharePoint Services General 8e2s Medium Unknown SPRequest error occurred. More information: 0x80070057
Any ideas?
UPDATE - Some interesting things I have learned...
I modified my code to do the following:
I called EventReceivers.Add() without the GUID, since most examples I see do not do that
I gave the event receiver a Name and Sequence number, since most examples I see do that
I deployed this change along with some extra trace statements that go to the ULS logs, and after doing enough iisresets and clearing the GAC of my assembly, I started to see my new trace statements in the ULS logs and I no longer got the error!
So, I started to go back towards my original code to see which change exactly had helped. I finally ended up with the original version from source control, and it still worked :-S.
So the answer is clearly that it is some caching issue. However, when I was originally trying to get it to work, I tried IISRESETs, restarting some SharePoint services such as OWSTimer (which, I believe, runs the event handler, but probably isn't involved in the event registration where I am getting the error), and even a reboot to make sure no assembly caching was going on - and that did not help before.
The only thing I have to go on is maybe following steps such as:
Clear the GAC of the assembly that contains the registration code and event hander class.
Do an IISRESET.
Uninstall the WSP.
Do an IISRESET.
Install the WSP.
Do an IISRESET.
To get it working I never did a reboot or restarted SharePoint services, but I had done those prior to getting it working (before changing my code).
I suppose I could dig more with Reflector to see what I can find, but I believe you hit a dead end (unmanaged code) pretty quickly. I wonder what could be holding on to the old DLL? I can't imagine that SQL Server would be holding on to it in some way. Even so, a reboot would have fixed that (the entire farm, including SQL Server, is on the same machine in this environment).
So, it appears that the whole problem was creating the event receiver by providing the GUID.
SPEventReceiverDefinition eventReceiver = elevatedWeb.EventReceivers.Add(new Guid(MyEventReciverId));
Now I am doing:
SPEventReceiverDefinition eventReceiver = elevatedWeb.EventReceivers.Add();
Unfortunately, this means that when I want to find out whether the event is already registered, I must do something like the code below instead of using a single one-liner.
// Had to use the below instead of: web.EventReceivers[new Guid(MyEventReceiverId)]
private SPEventReceiverDefinition GetWebDeletingEventReceiver(SPWeb web)
{
    Type eventReceiverType = typeof(MyEventHandler);
    string eventReceiverAssembly = eventReceiverType.Assembly.FullName;
    string eventReceiverClass = eventReceiverType.FullName;
    SPEventReceiverDefinition eventReceiver = null;
    foreach (SPEventReceiverDefinition eventReceiverIter in web.EventReceivers)
    {
        if (eventReceiverIter.Type == SPEventReceiverType.WebDeleting)
        {
            if (eventReceiverIter.Assembly == eventReceiverAssembly && eventReceiverIter.Class == eventReceiverClass)
            {
                eventReceiver = eventReceiverIter;
                break;
            }
        }
    }
    return eventReceiver;
}
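Hypothetical usage, registering the receiver only when it isn't already present:
if (GetWebDeletingEventReceiver(web) == null)
{
    SPEventReceiverDefinition eventReceiver = web.EventReceivers.Add();
    eventReceiver.Type = SPEventReceiverType.WebDeleting;
    Type eventReceiverType = typeof(MyEventHandler);
    eventReceiver.Assembly = eventReceiverType.Assembly.FullName;
    eventReceiver.Class = eventReceiverType.FullName;
    eventReceiver.Update();
}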
It's still not clear why things seemed to linger and require some cleanup (iisreset, reboots, etc.) so if anyone else has this problem keep that in mind.