Retry after unrecoverable error in Azure IoT C SDK

According to this documentation, the SDK does not attempt to reconnect if it detects an unrecoverable error. However, the associated documentation for the C SDK makes no mention of recoverable or unrecoverable errors (or at least I have not found it there).
How can I find out whether the SDK is currently in a state from which it might eventually recover, or one where I am required to reattempt the connection manually?
If the retry policy decides not to reattempt the connection, is it possible to reconnect manually without destroying the device handle and creating a new one (for example with IoTHubDeviceClient_CreateFromConnectionString)?

The C SDK currently doesn't distinguish between unrecoverable and recoverable errors from a retry perspective.
All types of failures are treated as recoverable and may be retried, so the documentation should be fixed to reflect the current functionality as of the 2018_11_20 release.
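As a side note, the application can still choose which of the SDK's built-in retry policies is used, or turn the built-in retry off entirely. A minimal sketch, assuming a device handle has already been created (the configure_retry helper and the 300-second limit are illustrative values, not anything prescribed by the SDK):

#include "iothub_device_client.h"

// Called once after the device handle has been created.
static void configure_retry(IOTHUB_DEVICE_CLIENT_HANDLE device_handle)
{
    // Keep retrying with exponential backoff and jitter for up to 300 seconds
    // before the SDK gives up and reports IOTHUB_CLIENT_CONNECTION_RETRY_EXPIRED
    // through the connection status callback.
    (void)IoTHubDeviceClient_SetRetryPolicy(device_handle, IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER, 300);

    // Alternatively, hand all retry decisions to the application:
    // (void)IoTHubDeviceClient_SetRetryPolicy(device_handle, IOTHUB_CLIENT_RETRY_NONE, 0);
}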
That said, the callback registered with SetConnectionStatusCallback will report appropriate reason codes, which can help determine whether the SDK should be allowed to keep retrying or whether the application needs to establish a new connection.
For instance, I would consider IOTHUB_CLIENT_CONNECTION_BAD_CREDENTIAL as an unrecoverable error for my application's device client.
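As an illustration only, a connection status callback along these lines could be used to classify the reason codes; the callback name, the g_needs_reconnect flag, and the choice of which reasons to treat as fatal are application-specific assumptions, not SDK requirements:

#include <stdbool.h>
#include "iothub_device_client.h"

static bool g_needs_reconnect = false;   // polled by the application's main loop

static void connection_status_callback(IOTHUB_CLIENT_CONNECTION_STATUS result,
                                       IOTHUB_CLIENT_CONNECTION_STATUS_REASON reason,
                                       void* user_context)
{
    (void)user_context;
    if (result == IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED)
    {
        switch (reason)
        {
        case IOTHUB_CLIENT_CONNECTION_BAD_CREDENTIAL:
        case IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED:
            // Treated as unrecoverable by this application: retrying with the
            // same credentials will not help, so ask for a manual rebuild.
            g_needs_reconnect = true;
            break;
        default:
            // Network or communication problems: leave them to the SDK's retry policy.
            break;
        }
    }
}

// Registered once after the handle is created:
// IoTHubDeviceClient_SetConnectionStatusCallback(device_handle, connection_status_callback, NULL);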
You are right - a new device handle needs to be created in order to re-attempt the connection manually. This is because all C SDK APIs key off this handle to query various device client states within iothub_device_client.
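If you do need to rebuild, a sketch along these lines shows the destroy-and-recreate cycle (re-using the connection_status_callback above; the reconnect_device helper name and the MQTT transport choice are assumptions, and error handling is omitted):

#include "iothub_device_client.h"
#include "iothubtransportmqtt.h"

static IOTHUB_DEVICE_CLIENT_HANDLE reconnect_device(IOTHUB_DEVICE_CLIENT_HANDLE old_handle,
                                                    const char* connection_string)
{
    if (old_handle != NULL)
    {
        IoTHubDeviceClient_Destroy(old_handle);   // the old handle cannot be reused
    }

    IOTHUB_DEVICE_CLIENT_HANDLE new_handle =
        IoTHubDeviceClient_CreateFromConnectionString(connection_string, MQTT_Protocol);
    if (new_handle != NULL)
    {
        // Callbacks and options do not carry over to the new handle; register them again.
        (void)IoTHubDeviceClient_SetConnectionStatusCallback(new_handle, connection_status_callback, NULL);
    }
    return new_handle;
}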

Related

None value for paho_mqtt::create_options::CreateOptionsBuilder persistence

The documentation for the CreateOptionsBuilder::persistence method indicates that setting this value to None will improve performance, but at the cost of a less reliable system.
Could someone please elaborate on this? Under which circumstances should I consider setting this to None?
The Eclipse Paho MQTT Rust Client Library is a "safe wrapper around the Paho C Library". The persistence options are mapped to values accepted by the C library with None becoming MQTTCLIENT_PERSISTENCE_NONE. The docs for the C client provide a more detailed explanation of the options:
persistence_type The type of persistence to be used by the client:
MQTTCLIENT_PERSISTENCE_NONE: Use in-memory persistence. If the device or system on which the client is running fails or is switched off, the current state of any in-flight messages is lost and some messages may not be delivered even at QoS1 and QoS2.
MQTTCLIENT_PERSISTENCE_DEFAULT: Use the default (file system-based) persistence mechanism. Status about in-flight messages is held in persistent storage and provides some protection against message loss in the case of unexpected failure.
MQTTCLIENT_PERSISTENCE_USER: Use an application-specific persistence implementation. Using this type of persistence gives control of the persistence mechanism to the application. The application has to implement the MQTTClient_persistence interface.
The upshot is that calling persistence(None) means that messages will be held in memory rather than being written to disk (assuming QOS1/2). This has the potential to improve performance (writing to disk can be expensive) but, because the info is only stored in memory, messages may be lost if your application shuts down without completing delivery.
A quick example might help (simplifying things a little): let's say you publish a message with QOS=1 and a network issue means that the broker does not receive it. When the connection is re-established (a failed delivery will generally mean the connection drops), the client will resend the message (because it has not processed an acknowledgment from the broker). With the default persistence (disk), the message will be retransmitted even if the failure was due to a power outage that affected the server your app was running on (obviously this only happens when power is restored and your app restarts); that message would be lost if you had called persistence(None).
The appropriate setting is going to depend upon your needs, and other options may have an impact (e.g. if Clean Start/CleanSession is true then there is unlikely to be any benefit to persisting to disk).
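To make the mapping concrete, here is a minimal sketch at the Paho C level, which is what persistence(None) on the Rust CreateOptionsBuilder ultimately maps to (the broker address and client id are placeholders):

#include <stdio.h>
#include "MQTTClient.h"

int main(void)
{
    MQTTClient client;

    // In-memory persistence only: in-flight QoS 1/2 state is lost if the
    // process dies, which is exactly the trade-off described above.
    int rc = MQTTClient_create(&client, "tcp://broker.example.com:1883", "example-client",
                               MQTTCLIENT_PERSISTENCE_NONE, NULL);
    if (rc != MQTTCLIENT_SUCCESS)
    {
        fprintf(stderr, "MQTTClient_create failed: %d\n", rc);
        return 1;
    }

    // ... connect, publish, disconnect ...

    MQTTClient_destroy(&client);
    return 0;
}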
Set it to None when you don't care whether all messages are received, e.g. when you are using only QOS 0 messages.

Google Cloud Pub/Sub API: Almost 100% Errors on StreamingPull

I'm trying to use GCP Pub/Sub StreamingPull with the NodeJs client, and I understand that StreamingPull is designed to terminate with a 100% error rate, as mentioned in the docs.
So do I have to restart the listener if I get errors in the errorHandler? Also, which error code should I be looking for to tell that the streaming connection has been closed? Here is the ref: Error Codes.
const errorHandler = (error) => {
  if (errorCodeCheckCondition) {
    // Detach the old message listener, then re-attach it to restart listening.
    subscription.removeListener('message', messageHandler);
    subscription.on('message', messageHandler);
  }
};
subscription.on('error', errorHandler);
I'm using GCP Pub/Sub StreamingPull for the first time, so please guide me.
You do need to re-establish the streaming pull connection after you get any error.
According to the StreamingPull RPC documentation:
The server will close the stream and return the status on any error. The server may close the stream with status UNAVAILABLE to reassign server-side resources, in which case, the client should re-establish the stream. Flow control can be achieved by configuring the underlying RPC channel.
Since you know that StreamingPull has a 100% error rate, I believe you must have also gone through Diagnosing StreamingPull errors.
The Pub/Sub client library will re-establish the underlying streaming pull connection when it disconnects for a retriable reason, e.g., an UNAVAILABLE error. You can see in the StreamingPull config in the library the set of errors that are retried internally.
The errors you would typically get back at the application level would be ones where some additional intervention is likely necessary, e.g., a PERMISSION_DENIED error (where the subscriber does not have permission to receive messages on the subscription) or a NOT_FOUND error (where the subscription does not exist). Retrying on these types of errors is likely just to result in the error reoccurring until the underlying issue is resolved.
You could decide that retrying is what you want to do because you want the subscriber to start working again without having to manually restart it once other steps are taken to fix the problem, but you'll want to make sure you have some way to discover these types of issues, perhaps through some kind of Cloud Monitoring alerting on streaming pull errors or on a large number of unprocessed messages building up.

IotHubClientTransientException: Transient error occured, please retry

I have a UWP app installed on an UP board that reads IoT Hub messages sent to that device ID.
deviceClient = DeviceClient.CreateFromConnectionString(deviceConnectionString, TransportType.Mqtt);
Message receivedMessage = await deviceClient.ReceiveAsync();
The app works fine and reads the messages correctly, but sometimes I have these exceptions:
IotHubClientTransientException: Transient error occured, please retry.
I read that these errors can be caused by a wrong connection string, but that's not possible in my case.
Can someone help me?
The error is most likely caused by a network connectivity issue. Just add a retry strategy; you could write your own or use a .NET library like Polly.
In a distributed world connectivity issues should be expected, so I don't think there is any problem with your code other than that it should be more resilient. I think it is really nice that the exception even tells you it should be retried; most of the time you have to figure that out yourself.
Some more guidance from the Azure team can be found here. In your case the Retry pattern is a good fit:
Retry
Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.

Azure Service Bus: transient errors (exceptions) received through the message pump with built-in retry policy. Why?

I've been reading about the Event-Driven Message Programming Model introduced in April 2013, the OnMessageOptions.ExceptionReceived event, the built-in RetryPolicy (May 2013, RetryPolicy.Default), the Transient Fault Handling Application Block (2011), which is outdated, and more (see bottom).
I've been monitoring the exceptions received through the message pump for transient errors, and I get daily MessagingCommunicationExceptions. This article (updated September 16, 2014) recommends the following:
This exception signals a communication error that can manifest itself when a connection from the messaging client to the Service Bus infrastructure cannot be successfully established. In most cases, provided network connectivity exists, this error can be treated as transient. The client can attempt to retry the operation that has resulted in this type of exception. It is also recommended that you verify whether the domain name resolution service (DNS) is operational as this error may indicate that the target host name cannot be resolved.
My understanding is that there is no extra code to write to handle transient errors on the Service Bus after version 2.1 (2013). Unless my premise is wrong, why am I receiving transient errors each and every day? Should exceptions received through the message pump be ignored? If ignored, I can only assume that unexpected exceptions will also be ignored, and I don't want that to happen of course.
Version of Microsoft.ServiceBus is 2.4.0.0
Also of interest : upgrading Windows Azure Service Bus from 1.x to 2.0 - Retry Policy, Introducing the Event-Driven Message Programming Model for the Windows Azure Service Bus, What's New in the Azure SDK 2.0 Release (April 2013), What's New in the Service Bus 2.1 Release (May 2013), Transient Fault Handling.
Officially answered here. In short, exceptions are bubbled up for monitoring purposes after retry attempts. In full:
The transient exceptions you get from the ExceptionHandler callback mean those exceptions are bubbled up after retry attempts. You should just log them for monitoring purposes. Take action if you need to. For example, if your client loses network connectivity you should expect a large number of communication exceptions bleeding through the handler. In such cases you may need to take proper actions to fix things. So the answer to the question "should I ignore them?" really depends on the conditions. - Serkant Karaca, Microsoft

Transactions in NServiceBus using Azure Service Bus Transport

I have several message handlers in a particular endpoint that do their work against a SQL Azure database (at the moment still using a local SQL 2012 instance). I have a command handler that publishes 2 events, call them X and Y. In the same endpoint I have a subscriber to X and a subscriber to Y. Both of these subscribers are internally using the same data access component, call that Z. Dependency injection is configured on a per-call basis, not shared.
Component Z is using Entity Framework 6 under the covers. The issue I am having is that just opening the database connection throws a SqlException complaining about MSDTC escalation.
I have temporarily wrapped the handlers in a TransactionScope.Suppress and that has stopped the error but I believe I'm missing something more fundamental.
Is it a simple matter of configuring the endpoint to be non-transactional? I would have thought this would just work seeing as I've configured to use Azure Service Bus as the transport mechanism. If I do this will NServiceBus still retry if an exception is thrown within the message handler? (Up to the SLR limits -- not part of the question, I also understand the idempotency issues).
@Phil,
First, you shouldn't be using MSDTC with SQL Azure - it's not supported. The feature has been suggested, but it is only under review; DTC is not supported on Azure. Alternatively, you could look into the suggestion to use the SqlTransaction approach.
Second, the transport you're using has nothing to do with your data access. Since you're using Azure Service Bus, it will not be part of your handler code. Making the handler transactional is what forces an atomic change or roll-back. Regardless, your handler will be retried. The challenge is that when the handler/endpoint is not transactional and, within the handler, the first write to the DB succeeds and the second fails, the first write won't be reverted. As for Azure Service Bus as a transport, it's not transactional in nature (i.e. no DTC).
Which version of NServiceBus.Azure are you on? Do you have a stack trace of the exception? Where does it come from?
We push the sends and publishes out of the scope of the receive transaction scope explicitly to prevent promotion to the DTC, so that the transaction is local to the SQL connection, so I doubt that is what is happening here.
From your description it looks like you are using a different data access instance for each handler (per-call container config) and you have multiple handlers on the same message. If both of these open a new connection to SQL you would see promotion as well (even if it is the same server).
Could that be it? That it throws on the second open?
