Google Cloud Pub/Sub API: Almost 100% Errors on StreamingPull - Node.js

I'm trying to use GCP Pub/Sub StreamingPull with the Node.js client, and I understand that StreamingPull is effectively designed for a 100% error rate, as mentioned in the docs.
So do I have to restart the listener if I get errors in the errorHandler? And which error code should I look for to tell whether the streaming connection has been closed? Here is the reference: Error Codes
const errorHandler = (error) => {
  if (errorCodeCheckCondition) {
    // Remove the old listener before re-attaching, otherwise the handler ends up registered twice
    subscription.removeListener('message', messageHandler);
    subscription.on('message', messageHandler);
  }
};
subscription.on('error', errorHandler);
I'm using GCP Pub/Sub StreamingPull for the first time, so please guide me.

You do need to re-establish the streaming pull connection after you get any error.
According to the StreamingPull RPC documentation:
The server will close the stream and return the status on any error. The server may close the stream with status UNAVAILABLE to reassign server-side resources, in which case, the client should re-establish the stream. Flow control can be achieved by configuring the underlying RPC channel.
Since you know that StreamingPull has a 100% error rate, I believe you have also gone through Diagnosing StreamingPull errors.
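If you do decide to restart at the application level, a minimal sketch with the Node.js client could look like the following (the subscription name and the decision to restart on every error are assumptions, not details from your setup):
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
let subscription;

function listen() {
  // Re-create the subscription object and attach fresh handlers
  subscription = pubsub.subscription('my-subscription'); // assumed name
  subscription.on('message', messageHandler);
  subscription.on('error', (error) => {
    // Tear the stream down and start again
    subscription.removeAllListeners();
    subscription.close().catch(() => {}).then(listen);
  });
}

function messageHandler(message) {
  message.ack();
}

listen();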

The Pub/Sub client library will re-establish the underlying streaming pull connection when it disconnects for a retriable reason, e.g., an UNAVAILABLE error. You can see in the StreamingPull config in the library the set of errors that are retried internally.
The errors you would typically get back at the application level would be ones where some additional intervention is likely necessary, e.g., a PERMISSION_DENIED error (where the subscriber does not have permission to receive messages on the subscription) or a NOT_FOUND error (where the subscription does not exist). Retrying on these types of errors is likely just to result in the error recurring until the underlying issue is resolved.
You could decide that retrying is what you want to do because you want the subscriber to start working again without having to manually restart it once other steps are taken to fix the problem, but you'll want to make sure you have some way to discover these types of issues, perhaps through some kind of Cloud Monitoring alerting on streaming pull errors or on a large number of unprocessed messages building up.
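As a sketch, an application-level error handler that treats those codes as terminal and restarts only otherwise might look like this (the status constants come from @grpc/grpc-js, and notifyOperators is a hypothetical alerting hook, not part of the library):
const {status} = require('@grpc/grpc-js');

const errorHandler = (error) => {
  // Pub/Sub surfaces the gRPC status code on error.code
  if (error.code === status.PERMISSION_DENIED || error.code === status.NOT_FOUND) {
    notifyOperators(error); // hypothetical alerting hook; these errors won't fix themselves
    return;
  }
  // For anything else, detach and re-attach the message listener to open a new stream
  subscription.removeListener('message', messageHandler);
  subscription.on('message', messageHandler);
};

subscription.on('error', errorHandler);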

Related

Retry after unrecoverable error in Azure IoT C SDK

According to this documentation, the SDK does not attempt to reconnect if it detects an unrecoverable error. However, the associated documentation for the C SDK makes no mention of recoverable or unrecoverable errors (or at least I have not found it there).
How can I find out whether the SDK is currently in a state from which it might eventually recover, or in a state where I have to re-establish the connection manually?
If the retry policy decides not to attempt another reconnection, is it possible to reconnect manually without destroying the device handle and creating a new one (for example with IoTHubDeviceClient_CreateFromConnectionString)?
The C SDK currently doesn't distinguish between unrecoverable and recoverable errors from a retry perspective.
All types of failures are treated as recoverable failures that could potentially be recovered from, so the documentation should be fixed to reflect the current functionality (as of the 2018_11_20 release).
That said, the SetConnectionStatusCallback will return appropriate reason codes, which can help determine if the SDK should be allowed to retry or if a new connection needs to be established by the application.
For instance, I would consider IOTHUB_CLIENT_CONNECTION_BAD_CREDENTIAL as an unrecoverable error for my application's device client.
You are right - a new device handle needs to be created in order to re-attempt the connection manually. This is because all C SDK APIs key off this handle to query various device client states within iothub_device_client.

IotHubClientTransientException: Transient error occured, please retry

I have a UWP app installed on an UP board that reads IoT Hub messages sent to that device ID.
// Create the client over MQTT and wait for the next cloud-to-device message
deviceClient = DeviceClient.CreateFromConnectionString(deviceConnectionString, TransportType.Mqtt);
Message receivedMessage = await deviceClient.ReceiveAsync();
The app works fine and reads the messages correctly, but sometimes I have these exceptions:
IotHubClientTransientException: Transient error occured, please retry.
I read that these errors can be caused by a wrong connection string, but that's not possible in my case.
Can someone help me?
The error is most likely caused by a network connectivity problem. Just add a retry strategy. You could simply write your own or use a library like Polly.
In a distributed world connectivity issues should be expected, so I don't think there is any problem with your code other than that it should be more resilient. I think it is really nice that the exception even tells you it should be retried; most of the time you have to figure that out yourself.
Some more guidance from the Azure team can be found here. In your case the Retry pattern is a good fit:
Retry
Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.
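The pattern itself is language-agnostic. As a rough sketch in Node.js-style JavaScript (the attempt count, backoff delays, and receiveNextMessage are arbitrary assumptions; in .NET a library like Polly gives you the equivalent), a retry wrapper could look like:
async function withRetry(operation, attempts = 3, delayMs = 1000) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === attempts) throw err; // out of retries, surface the error
      // Exponential backoff before trying again
      await new Promise((resolve) => setTimeout(resolve, delayMs * 2 ** (attempt - 1)));
    }
  }
}

// Usage: wrap the call that throws the transient exception
// const message = await withRetry(() => receiveNextMessage());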

Azure Service Bus queues close after 24 hours automatically

Problem
We are developing an Azure Service Bus based Cloud Service, but after 24 hours the queue clients seem to get closed automatically.
Can someone confirm this behavior or give advise how to fix it?
At the moment we close the clients after 24 hours manually and recreate them to avoid this effect, but this can't be the only solution.
Sessions dropping intermittently is a normal occurrence. The AMQP protocol and stack in the client are newer and generally more resilient against this. The only reason not to use AMQP is if you are using transactions. Also, unless you have a good reason to run your own receive loop, use OnMessage.
You get an OperationCanceledException when the link fails for any reason, and any in-flight requests will fail with this exception. However, this is transient, so you should be able to reuse the same QueueClient to issue receives, and those should (eventually) work as the client recovers. OnMessage will hide all of that from you.

Mqtt paho using spring integration stops processing messages on topic over certain load requests

I am using Spring Integration with mqtt-paho version 4.0.4 for receiving MQTT messages on a specified topic.
When the application is under heavy load I found that it sometimes drops the connection to IMA (MQTT); this happened three times over a span of 100,000 (1 lakh) records.
But it regains connectivity and starts consuming the messages received thereafter. There was no issue with IMA reconnection.
There is another issue that I faced during this testing.
When there is continuous load on the application, at some point it stops receiving messages, and we can see one message flashed on screen, i.e.
May 04, 2015 2:45:29 PM org.eclipse.paho.client.mqttv3.internal.ClientState checkForActivity
SEVERE: gvjIpONtSpP: Timed out as no activity, keepAlive=60,000 lastOutboundActivity=1,430,730,869,017 lastInboundActivity=1,430,730,929,151
After this we can see that no messages are received by the application, even though continuous load is still being pushed through the utility.
I observed this behavior three times:
At around 40K.
At around 90K.
At around 145K.
There is no consistent point or figure at which the application actually stops receiving messages.
Please let me know if anybody has faced and solved this before.
We had the same issue during MQTT Paho client performance/durability testing before moving to production. The issue was on the broker side; after adjusting the settings, the IMA broker was able to consume millions of messages with no rejections.
Look into the max buffer parameter in the IMA configuration web console, and into the over-limit behavior policy (what to do with messages published over the specified threshold): reject, rollover, etc.

Catching auth error on redis / heroku / node.js

I'm running a redis / node.js server and had a
[Error: Auth error: Error: ERR max number of clients reached]
My current setup is that I have a connection manager that adds connections until the maximum number of concurrent connections for my Heroku app (256, or 128 per dyno) is reached. Once that happens, it just hands out an already existing connection. It's ultra fast and it's working.
However, last night I got this error and I'm not able to reproduce it. It may be a rare error, but I'm not sleeping well knowing it's out there, because once the error is thrown, my app is no longer reachable.
So my questions would be:
is that kind of a connection manager a good idea?
would it be a better idea to use that manager to wait for 'idle' to be called and then close the connection, meaning that I would have to re-establish a connection every time a request comes in (this is what I wanted to avoid)?
how can I stop my app from going down? Should I just flush the connection pool whenever an error occurs?
What are your general strategies for handling multiple concurrent connections with a given maximum?
In case somebody is reading along:
The error was caused by a messed-up redis 0.8.x client that I deployed to production:
https://github.com/mranney/node_redis/issues/251
I was smart enough to remove the failed connections from the connection pool but forgot to call '.quit()' on them, so those connections were still out there in the wild, still counting as connections.
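A minimal sketch of that cleanup, with an illustrative pool (the createClient arguments vary by node_redis version, so they are omitted here):
const redis = require('redis');

const pool = [];

function createPooledClient() {
  const client = redis.createClient(); // connection options omitted; they differ per version
  client.on('error', () => {
    // Drop the broken client from the pool...
    const index = pool.indexOf(client);
    if (index !== -1) pool.splice(index, 1);
    // ...and quit it, otherwise the server-side connection stays open
    // and keeps counting towards the "max number of clients" limit.
    client.quit();
  });
  pool.push(client);
  return client;
}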
