We were just trying out the Azure Storage Analytics service when something very unusual caught our attention.
The transaction count for the diagnostics storage account (the account to which the Diagnostics service writes its data) was extremely high: roughly 600 transactions per hour, all of them GetBlob() operations, and all of them ending in error (ClientOtherError equals the total number of operations). Further investigation revealed that each running instance with Diagnostics turned on produces roughly 300 transactions per hour (we have two instances, hence the 600). Continuing the investigation, we looked at the $logs container that the Analytics service produces, and that revealed what was really going on:
The log is filled with calls to an XML file that does not exist. The log file itself is very cluttered, but it is clear that most of the calls are requesting
https://****.blob.core.windows.net/mam/MACommand.xml as well as /mam/MACommanda.xml and /mam/MACommandb.xml,
and all of those calls fail with a 404 error.
This issue is a real problem for us, and we have no idea what is causing it.
Has anyone encountered this issue?
(Edit: I forgot to mention that the Diagnostics service is not logging anything - scheduledTransferPeriod is zero for all the categories.)
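For reference, a configuration with every category's transfer period zeroed looks roughly like this (a sketch using the classic DiagnosticMonitor API; the connection string setting name is the conventional plugin one):

// Take the default configuration and disable scheduled transfers for every category.
DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();
config.Logs.ScheduledTransferPeriod = TimeSpan.Zero;                         // trace logs
config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.Zero; // diagnostics infrastructure logs
config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.Zero;          // performance counters
config.WindowsEventLog.ScheduledTransferPeriod = TimeSpan.Zero;              // Windows event logs
DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);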
Those transactions are expected behavior as of SDK 1.6.
See full explanation here:
http://social.msdn.microsoft.com/Forums/en-US/windowsazuretroubleshooting/thread/2e2f46dd-638a-4af1-b8ac-cfd7659a3171
After my system has been running for some time, I get connection errors, so I want to remove them from my Application Insights.
Is it possible to remove the exceptions and traces that come from the EventProcessorHost error? You can see my Application Insights log below.
The only way is to use the Application Insights Purge API to delete logs from the Exceptions table and the Traces table.
The limitation is that you cannot specify such detailed filters as, for example, messages that come from EventProcessorHost.
Also, the delete operation is completed in the background within 7 days; you should be aware of these limitations when using this API.
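For illustration, a purge call is a management-plane REST request shaped roughly like this (the subscription, resource group, and component names are placeholders, and the table name and timestamp filter are only examples; check the Purge API reference for exact column names):

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/microsoft.insights/components/{componentName}/purge?api-version=2015-05-01

{
  "table": "traces",
  "filters": [
    { "column": "timestamp", "operator": "<", "value": "2019-01-01T00:00:00" }
  ]
}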
If the question was "how do I not collect these in the future", I believe the information you are looking for is here:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-monitoring?tabs=cmd#configure-categories-and-log-levels
Summary:
Log configuration in host.json
The host.json file configures how much logging a function app sends to Application Insights. For each category, you indicate the minimum log level to send.
There are a lot of samples in the link above for turning things on and off at various levels and from various sources, plus sampling and batching; it is probably too much to paste here and keep up to date.
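As a quick illustration of the pattern from that page (the categories and levels here are only an example), a host.json that raises the minimum level for function and result logs looks like this:

{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "default": "Information",
      "Function": "Error",
      "Host.Results": "Error"
    }
  }
}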
I have an Azure Functions application which once in a while "freezes" and stops processing messages and timed events.
When this happens I do not see anything in the logs (AppInsight), neither exceptions nor any kind of unfamiliar traces.
The application has following functions:
One processing messages from a Service Bus topic subscription (belonging to another application)
One processing from an internal storage queue
One timer based function triggered every half hour
Four HTTP endpoints
Our production app runs fine. This is due to an internal dashboard (on a big screen in the office) which polls one of the HTTP endpoints every 5 minutes, thereby keeping the app alive.
Our test, stage, and preproduction apps stop processing messages and timer events after a while.
This question is more or less the same as my previous question, but without the error message that was the focus then. There are far fewer error messages now, as our deployment has been fixed.
A more detailed analysis can be found in the GitHub issue.
On a consumption plan, all triggers are registered in the host so that they can be handled, leading to my functions being called at the right time. This part of the host also handles scalability.
I had two bugs:
Wrong deployment. Do zip-based deployment as described in the Docs (see the command sketch after this list).
Malformed host.json. Comments are not valid JSON; although they happen to work in most circumstances in Azure Functions, they do not work in all (see the example after this list).
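To illustrate both points (the resource group, app name, and zip path are placeholders): zip deployment can be done with the Azure CLI, and the second snippet shows the kind of commented host.json that caused the trouble.

az functionapp deployment source config-zip --resource-group MyResourceGroup --name MyFunctionApp --src publish.zip

{
  // A comment like this is not valid JSON and can break host.json parsing.
  "version": "2.0"
}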
The sites now work as expected, both in terms of availability and scalability.
Thanks to the people in the Azure Functions team (Ling Toh, Fabio Cavalcante, David Ebbo) for helping me out with this.
I'm having odd errors with Azure Service Bus. In long-running apps that use the batch API so that I can read a batch of messages at once (and throttle back when I have no messages available, etc.), I will eventually start to get "40400: Endpoint not found" errors. These are only transient in that they don't stop everything, but once they occur they are intermittently persistent.
I also regularly get message lock lost exceptions, with 60-second timeouts, when completing batches of messages (max. 100 at a time). This really shouldn't be happening, as the code is running under "test" conditions where the messages are read, nothing happens to them, and then I complete them (i.e. there is no programming logic that takes any time at all and could cause the timeout).
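For context, the receive loop follows roughly this pattern (a sketch using the classic WindowsAzure.ServiceBus client; the connection string and queue name are placeholders):

using System;
using Microsoft.ServiceBus.Messaging;

string connectionString = "<Service Bus connection string>"; // placeholder
var client = QueueClient.CreateFromConnectionString(connectionString, "global-queue");
while (true)
{
    // Ask for up to 100 messages, waiting at most 5 seconds server-side.
    var batch = client.ReceiveBatch(100, TimeSpan.FromSeconds(5));
    foreach (BrokeredMessage message in batch)
    {
        // Nothing is done with the message under test conditions; it is completed immediately.
        // If the 60-second lock has already expired, this call throws a lock lost exception.
        message.Complete();
    }
}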
I really don't know how to work out why these occur and what I can do to prevent them.
Obviously I have all the retry logic in place so that it doesn't bring down my app, but eventually my app will process messages so slowly that in effect it's doing nothing at all.
My suspicion is that it's because my queue (a "global worldwide queue") resides in North Europe while my app resides in East US, so latency is causing the issue. If that is the case then I'm really stumped: first, it's one Azure data center communicating with another (so it should be fast), and second, how on earth should you architect global queues for distributed access if the performance is this bad? AFAIK Service Bus doesn't support single-endpoint globally distributed queues...
Today, at a customer, we analysed the logs of the previous weeks and found the following issue regarding Windows Azure Service Bus queues:
The request was terminated because the entity is being throttled.
Please wait 10 seconds and try again.
After verifying the code, I told them to use the Transient Fault Handling Application Block (TOPAZ) to implement a retry policy like this one:
// Retry up to 5 times: wait 1 second before the first retry, then lengthen the wait by 2 seconds on each subsequent retry.
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
// Pair the strategy with the detection logic that recognises transient Service Bus errors (including throttling).
var retryPolicy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(retryStrategy);
The customer answered:
"Ah that's great, so it will also handle the fact that it should wait
for 10 seconds when throttled."
Come to think about it, I never verified if this was the case or not. I always assumed this was the case. In the Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling assembly I looked for code that would wait for 10 seconds in case of throttling but didn't find anything.
Does this mean that TOPAZ isn't sufficient to create resilient applications? Should this be combined with some custom code to handle throttling (ie: wait 10 seconds in case of a specific exception)?
As far as throttling is concerned, TOPAZ provides a set of built-in retry strategies, including:
- Fixed interval
- Incremental intervals
- Random exponential back-off intervals
You can also write your own custom retry strategy and plug it into TOPAZ.
Also, as Brent indicated, the 10-second wait is not mandatory. In many cases, retrying immediately may succeed without the need to wait. By default, TOPAZ performs the first retry immediately, before applying the retry intervals defined by the strategy.
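For example, wiring the policy from the question into an actual call looks roughly like this (the queue client and message body are placeholders; the Retrying event is TOPAZ's hook for observing each back-off):

// Log each retry attempt: the count, the upcoming delay, and the transient exception that triggered it.
retryPolicy.Retrying += (sender, args) =>
    Trace.TraceWarning("Retry {0} after {1}: {2}", args.CurrentRetryCount, args.Delay, args.LastException.Message);

// Run the Service Bus operation under the retry policy.
retryPolicy.ExecuteAction(() => queueClient.Send(new BrokeredMessage("payload")));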
For more info, see Chapter 6 of the "Building Elastic and Resilient Cloud Apps" developer's guide, also available as epub/mobi/pdf from here.
If you have suggestions or feature requests for TOPAZ, please submit them via the UserVoice.
As I recall, the "10 second" wait isn't a requirement. Additionally, I believe TOPAZ also has back-off capabilities which would help you overcome this.
On a personal note, I'd argue that simply utilizing something like TOPAZ is not sufficient for creating a truly resilient solution. Resiliency goes beyond just handling throttling on a single connection point; you'll also need to be able to handle failover to a redundant endpoint, which TOPAZ won't do.
We have a worker role running in the cloud which periodically polls an Azure CloudQueue, retrieving messages that a web role has put there for us. The worker role and web role are currently housed in the same Cloud Service application, and we are only running one instance.
As we are testing, we have our logging switched on, so the contents of the messages and other useful information appear in our cloud storage, which we view using Cerebrata Azure Diagnostics Manager. (Great product, btw.)
// Start from the default diagnostics configuration and transfer all trace logs, down to Verbose.
DiagnosticMonitorConfiguration diagConfig = DiagnosticMonitor.GetDefaultInitialConfiguration();
diagConfig.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
It all appears to work remarkably well, actually; however, occasionally we see a Verbose message in the trace log which simply has "Fail" as the message. The code it appears to be generated from is wrapped in a try/catch, so it is odd that we aren't seeing the message through those means.
It would appear that something is happening outside our code's control; perhaps the worker role is being restarted, or the cloud operating system is detecting a major error that only it can deal with by restarting our worker role. It recovers and carries on, so what might be happening is somewhat of a mystery to us.
What we haven't ascertained yet is whether we are losing a message.
Any help would be gratefully appreciated.
Cheers
Kindo Malay
Without the stack trace it's hard to say too much, but with logging set to verbose it's quite likely that you're seeing some internal logging from one of the DLLs you're using.
For example, if you run an Azure Table query that causes certain kinds of errors, the error will be logged three times, because the storage client library catches the error, traces it out, and then retries.
If the error is not being caught by your try catch block, then it's likely nothing you need to worry about.
If deliverability of queue messages is important to you, you should ensure that you make use of the visibility timeout overload of CloudQueue.GetMessage and only delete the message when you've finished processing it. You may end up processing some messages twice, but at least you will process all of them.
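A minimal sketch of that pattern (the queue reference and ProcessMessage are hypothetical):

// Hide the message from other consumers for 5 minutes while we work on it.
CloudQueueMessage message = queue.GetMessage(TimeSpan.FromMinutes(5));
if (message != null)
{
    ProcessMessage(message); // hypothetical processing logic
    // Delete only after processing succeeds; if the role crashes first,
    // the message reappears after the visibility timeout and is processed again.
    queue.DeleteMessage(message);
}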
If your role instance is getting restarted after running for a while, it's often because your process exited due to an unhandled exception.