IotHubClientTransientException: Transient error occured, please retry - azure

I have an UWP app installed in an upboard that reads IotHub messages sended to that deviceID.
deviceClient = DeviceClient.CreateFromConnectionString(deviceConnectionString, TransportType.Mqtt);
Message receivedMessage = await deviceClient.ReceiveAsync();
The app works fine and reads the messages correctly, but sometimes I have these exceptions:
IotHubClientTransientException: Transient error occured, please retry.
I read that these errors may can be generated from wrong connection string, but it's not possible in my case.
Can someone help me?

The error is most likely caused by a network connectivity error. Just add a retry strategy. You could simply write your own or use a library like Polly.Net
In a distributed world connectivity issues should be expected, so I don't think there is any problem with your code other than is should be more resilient. I think it is really nice that the exceptions even tells you it should be retried, most of the times you have to figure that out yourself.
Some more guidance from the Azure team can be found here. In your case the Retry pattern is a good fit:
Retry
Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.

Related

Google Cloud Pub/Sup API: Almost 100% Errors on StreamingPull

I'm trying to use GCP Pub/Sub StreamingPull using the NodeJs client and I understand that the pub sub is designed for 100% error rate as mentioned in Docs.
So do I have to restart the listener if I face errors in the errorHandler and also please tell what error code should I be looking for to see if the streaming connection is closed. Here is the ref Error Codes
const errorHandler=(error)=>{
if(errorCodeCheckCondition){
subscription.on('message', messageHandler);
subscription.removeListener('message', messageHandler);
}
}
subscription.on('error', errorHandler);
I'm using GCP Pub/Sub StreamingPull for first time, so please guide.
You do need to re-establish the streaming pull connection after you get any error.
According to the rpc StreamingPull
The server will close the stream and return the status on any error. The server may close the stream with status UNAVAILABLE to reassign server-side resources, in which case, the client should re-establish the stream. Flow control can be achieved by configuring the underlying RPC channel.
Since You know about StreamingPull has a 100% error rate, I believe you must have also gone through the Diagnosing StreamingPull errors.
The Pub/Sub client library will re-establish the underlying streaming pull connection when it disconnects for a retriable reason, e.g., an UNAVAILABLE error. You can see in the StreamingPull config in the library the set of errors that are retried internally.
The errors you would typically get back at the application level would be ones where some additional intervention is likely necessary, e.g., a PERMISSION_DENIED error (where the subscriber does not have permission to receive messages on the subscription) or a NOT_FOUND error (where the subscription does not exist. Retrying on these types of errors is likely just to result in the error reoccurring until the underlying issue is resolved.
You could decide that retrying is what you want to do because you want the subscriber to start working again without having to manually restart it once other steps are taken to fix the problem, but you'll want to make sure you have some way to discover these types of issues, perhaps through some kind of Cloud Monitoring alerting on streaming pull errors or on a large number of unprocessed messages building up.

Spring integration DelayHandler, retryDelay and messageStore

A Delayer has message store and it provides the ability to not lose messages on the application shutdown. It's working fine for one delay.
But if I set retryDelay, the message will be removed from message store at first attempt and will be lost on the application shutdown.
Why it's happen? Why is message not stored in the message store for retry?
I think this is an omission and we really have to store message back on every retry.
Please, raise a GH issue as a bug and we will fix it soon.
As a workaround I would suggest to take a look into the ExpressionEvaluatingRequestHandlerAdvice and with its failureChannel and onFailureExpression re-route the message back into your delayer. Probably you might need to incorporate a RequestHandlerRetryAdvice instead of that retryDelay since an internal retry doesn't help us.
See docs for those advises: https://docs.spring.io/spring-integration/docs/5.3.2.RELEASE/reference/html/messaging-endpoints.html#message-handler-advice-chain
And delayer's delayedAdviceChain option.

Application insights: SQL Dependency result code "-2"

This is what I get in ~20 out of ~2M SQL dependency AI registrations.
Apparently this result code does not appear in sys.messages, since all message id's in there are positive.
Also I can't seem to find any stack trace information on this error. It seems to be a timeout or general transient error (handled by Polly from my side), but either way it's registered under dependencies and not exceptions.
Does anybody know what this error is and where I can find more information regarding all possible SQL dependency errors I might get?
I have submitted this issue to app insights team here, and just get feedback:
-2 means Timeout expired. The timeout period elapsed prior to completion of the
operation or the server is not responding. (Microsoft SQL Server, Error: -2).
The Sql Exception number is here.
And if you have any concerns, please comment at that thread. Hope it helps.

Automatic reboot whenever there's an uncaught exception in a continous WebJob

I'm currently creating a continous webjob that will do polling to an API, and then forward messages to an Azure Service Bus. I've managed to get this to work just fine, but I have one problem; what if my app crashes for whatever reason? What if there's an uncaught exception, or something goes wrong, and the app stops running. How do i get it to run again?
I created a test app, which will send a message every to the Service Bus, then on the 11th message it will crash due to an intentionally placed NullReferenceException. I did this in order to investigate behaviour whenever/if the app crashes.
What happens is that the app runs just fine for the first 10 seconds (as expected). Messages are being sent, and everything looks good. Then after the 10th second, when the exception occurs, nothing happens. No log in Azure saying there was an exception, no reboot - nothing. It just stands there as "running", but messages are no longer being sent.
How do I deal with this? It's essential that the application is able to reboot if it fails. Are there any standard ways to do this? Best practices?
Any help would be appreciated :)
It is always good to handle most of the failure scenarios in the system by ourselves rather than to let the hosting environment to react for the failures.
My suggestion would be to have a check in the code for exceptions like any try catch block in your executable script to catch different kind of failure scenarios and instead of throwing the exceptions, log it your self or take any retry operation if required.
Example, when you got a junk data to process and it failed. Then you can try to do the operation again for eg. 3 times and then finally push a log to deadletter account to manually take care of such junk inputs. And don't let the flow be stopped by throwing the exception but instead handle it your self by logging a message which needs manual intervention.
In any GUI or Web applications, if there is an exception then the flow is re initiated by user click and system will respond. But here as it a background processor, it is ideal to avoid all such control flow blockers.
Hope this would help.

Azure Worker Role generating writing unexpected error to Trace log storage

We have a worker role running in the cloud which polls an Azure CloudQueue periodically retrieving messages that a web role has put on there for us. Currently the worker role and web role are housed in the same Cloud Service application and currently we are only running one instance.
As we are testing we have our logging switched on and so the contents of the messages and other useful information appear in our cloud storage which we view using Cerebrata Azure Diagnostics Manager. (Great product btw)
DiagnosticMonitorConfiguration diagConfig = DiagnosticMonitor.GetDefaultInitialConfiguration();
diagConfig.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
It all appears to work remarkably well actually, however occasionally we see a Verbose message in the trace log which simply has "Fail"as the message. The code it appears to be generated from is wrapped in a try catch so it is odd that we aren't seeing the message through those means.
It would appear that something is happening that is out of our code's control, perhaps the worker role is being restarted, or the cloud op system is detecting a major error that only it can deal with by restarting our worker role. It recovers and carries on so it is somewhat of a mystery to us what might be happening.
What we haven't ascertained yet is whether we are losing a message.
Any help would be gratefully appreciated.
Cheers
Kindo Malay
Without the stack trace it's hard to say too much, but with the logging set to verbose it's quite likely that you're seeing some internal logging from one of the dlls you're using.
For example if you run a Azure Table query that causes certain kinds of errors, the error will be logged out 3 times because the storage client library is catching the error, tracing it out and then retrying.
If the error is not being caught by your try catch block, then it's likely nothing you need to worry about.
If deliverability of queue messages is important to you, you should ensure that you make use of the visibility timeout overload of CloudQueue.GetMessage and only delete the message when you've finished processing it. You may end up processing some messages twice, but at least you will process all of them.
If your role instance is getting restarted after running for a while, it's often because your process exited due to an unhandled exception.

Resources