Is Azure Notification Hub so unstable? - azure

I just did a little stress test on Azure Notification Hub.
I sent 200 identical messages to an iPhone.
62 returned:
"The Push Notification System returned an Internal Server Error"
and 138 returned:
"The Notification was successfully sent to the Push Notification System"
So the failure rate is 31%!
I turned on 'enableTestSend' mode and read each result from NotificationOutcome->RegistrationResult->Outcome.
Has anyone else run tests like this?
This is definitely not acceptable.

Depending on how your load test was designed, the failures may or may not have been caused by Notification Hubs. As @efimovandr noted, the requests could have been throttled by APNS, or, as @simon-w suggested, it could be some other PNS-specific issue. One way to verify this is to run exactly the same test that you ran through NH, but call the PNS directly. Chances are you'll get the same success rate. If so, you need to design the test differently so that it better reflects real-world service usage.
Microsoft offers an SLA on the Notification Hubs service, which means they invest in keeping failure rates manageable for customers.
If you are still experiencing the problem, contact customer support with your namespace name and approximate time (with time zone) when it happened and they will help you understand what was going on.
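To make the comparison between the two runs concrete, here is a minimal sketch that tallies outcome strings and computes a failure rate. The function name `summarize_outcomes` is hypothetical, and it assumes that any outcome not containing "successfully sent" counts as a failure. Only the Notification Hubs numbers from the question are used; the direct-to-PNS run would need to be tallied the same way with its own results.

```python
from collections import Counter


def summarize_outcomes(outcomes):
    """Count each outcome string and return (counts, failure_rate)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    # Assumption: anything that does not report a successful send is a failure.
    failures = sum(n for text, n in counts.items() if "successfully sent" not in text)
    rate = failures / total if total else 0.0
    return counts, rate


# Outcome strings and counts as reported in the question.
ERROR = "The Push Notification System returned an Internal Server Error"
OK = "The Notification was successfully sent to the Push Notification System"

nh_counts, nh_rate = summarize_outcomes([ERROR] * 62 + [OK] * 138)
print(f"failure rate via Notification Hubs: {nh_rate:.0%}")  # 31%
```

If the direct run shows a similar rate, the failures are more likely coming from the PNS or from the shape of the test than from NH itself.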

Related

Azure Bot Service using over 1GB of data transfer out per day. Why? How can I stop that?

I created a QnA bot using the Azure Bot service, and now I'm seeing data transfers out of my subscription of over 1 GB a day! I cannot figure out why, but since it's billable, I'd like to know why and how I can stop it.
The bot isn't being used yet, so no one is sending queries to it. I'm confused how this is happening.
Here's a screen shot of the graph for use in the last hour as well as a screen shot of the billing for the last few days showing the sudden jump in use.
Is this normal?
If you add AzureWebJobsDisableHomepage with a value of true to the App settings, the data out will stop.
The setting itself is documented here: https://github.com/Azure/azure-webjobs-sdk-script/wiki/Configuration-Settings (although it doesn't provide an explanation for how this setting affects a bot specifically)
The reasoning behind what is happening is a little complex. Azure Functions are not normally "in memory" and available all the time. There is a small spin-up time that is not ideal within a bot. So, apparently there is a job set up with consumption plan bots to ping the bot every 10 seconds (and by 'ping', I mean retrieve the root of the site). If you open the Log Stream, you'll see an HTTP GET request every 10 seconds. Adding AzureWebJobsDisableHomepage doesn't disable the request, but changes the status that is returned from "OK" to "NoContent".
This will be added to the Bot Service ARM template soon (so future consumption plan bots do not automatically accrue this data usage).
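As a rough sanity check on the numbers, the following back-of-the-envelope sketch shows how a request every 10 seconds can add up to about 1 GB a day. It assumes that all of the outbound transfer comes from the ping responses and that "over 1 GB" is roughly 1 GiB; the actual size of the homepage response is not stated above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PING_INTERVAL_SECONDS = 10           # interval seen in the Log Stream
DAILY_TRANSFER_BYTES = 1 * 1024**3   # assumption: "over 1 GB a day" taken as 1 GiB

pings_per_day = SECONDS_PER_DAY // PING_INTERVAL_SECONDS
bytes_per_ping = DAILY_TRANSFER_BYTES / pings_per_day

print(pings_per_day)                 # 8640
print(round(bytes_per_ping / 1024))  # 121 (KiB per response, under the assumptions above)
```

So a homepage response of around 120 KiB would be enough to explain the bill, which is consistent with the transfer stopping once the response becomes "NoContent".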

How can I see the exact times push notifications are processed in Notification Hubs?

I am trying to diagnose the reliability and performance of a system that delivers push notifications to client apps. Part of the system involves telling an Azure Notification Hub to push. I can see metrics in the classic portal, but no data on the exact times a notification was processed or handed off to GCM.
Is such telemetry available? Service Bus Explorer does not appear to be working: the default SAS connection string does not seem to give me any metrics to download.
You can get that information by using Per Message Telemetry which is available in Standard Tier. You'll get something like this for each push:
<NotificationDetails xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<NotificationId>{Your message id}</NotificationId>
<Location>sb://{Your namespace}.servicebus.windows.net/{your hub name}/messages/{your message id}?api-version=2015-04</Location>
<State>Completed</State>
<EnqueueTime>2015-11-02T21:19:43Z</EnqueueTime>
<StartTime>2015-11-02T21:19:43.9926996Z</StartTime>
<EndTime>2015-11-02T21:19:43.9926996Z</EndTime>
<NotificationBody>&lt;?xml version="1.0" encoding="utf-16"?&gt;&lt;toast&gt;&lt;visual&gt;&lt;binding template="ToastText01"&gt;&lt;text id="1"&gt;Hello from a .NET App!&lt;/text&gt;&lt;/binding&gt;&lt;/visual&gt;&lt;/toast&gt;</NotificationBody>
<TargetPlatforms>windows</TargetPlatforms>
<WnsOutcomeCounts>
<Outcome>
<Name>Success</Name>
<Count>3</Count>
</Outcome>
<Outcome>
<Name>WrongToken</Name>
<Count>1</Count>
</Outcome>
</WnsOutcomeCounts>
<PnsErrorDetailsUri>{Blob uri}</PnsErrorDetailsUri>
</NotificationDetails>
If you're using direct send, then the EndTime there should be what you're looking for once the job is in Completed state.
Generally, NH processes all pushes within 1 minute.
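Once you have that XML, pulling out the timestamps and outcome counts is straightforward. Here is a minimal sketch using Python's standard library; the function name `parse_notification_details` and the trimmed `SAMPLE` document are illustrative only, and how you fetch the telemetry document is left out. Timestamps are kept as strings rather than parsed.

```python
import xml.etree.ElementTree as ET

# Default namespace used by the NotificationDetails document shown above.
NS = {"nh": "http://schemas.microsoft.com/netservices/2010/10/servicebus/connect"}


def parse_notification_details(xml_text):
    """Return the state, timestamps and per-outcome counts from the telemetry XML."""
    root = ET.fromstring(xml_text)

    def text(tag):
        node = root.find(f"nh:{tag}", NS)
        return node.text if node is not None else None

    outcomes = {}
    # Collect every <Outcome>, whichever *OutcomeCounts element it sits under.
    for outcome in root.iter(f"{{{NS['nh']}}}Outcome"):
        name = outcome.find("nh:Name", NS).text
        count = int(outcome.find("nh:Count", NS).text)
        outcomes[name] = outcomes.get(name, 0) + count

    return {
        "state": text("State"),
        "enqueue_time": text("EnqueueTime"),
        "start_time": text("StartTime"),
        "end_time": text("EndTime"),
        "outcomes": outcomes,
    }


# Trimmed copy of the example above, for illustration.
SAMPLE = """<NotificationDetails xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect">
  <State>Completed</State>
  <EnqueueTime>2015-11-02T21:19:43Z</EnqueueTime>
  <StartTime>2015-11-02T21:19:43.9926996Z</StartTime>
  <EndTime>2015-11-02T21:19:43.9926996Z</EndTime>
  <WnsOutcomeCounts>
    <Outcome><Name>Success</Name><Count>3</Count></Outcome>
    <Outcome><Name>WrongToken</Name><Count>1</Count></Outcome>
  </WnsOutcomeCounts>
</NotificationDetails>"""

details = parse_notification_details(SAMPLE)
print(details["state"], details["end_time"], details["outcomes"])
```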

Azure service bus intermittent "40400: Endpoint not found"

I'm having odd errors with Azure Service Bus. For long-running apps that use the batch API so that I can read a batch of messages at once (and throttle back when I have no messages available, etc.), I eventually start to get "40400: Endpoint not found" errors. They are transient in that they don't stop everything, but once they start, they keep recurring intermittently.
I also regularly get Message Lock lost exceptions, with 60 second timeouts, for updating batches of messages (max. 100 at a time). This really shouldn't be happening as it's running under "test" conditions where the messages are read, nothing happens to them, and then I complete them (i.e. there is no "programming logic" that takes any time at all to cause the timeout).
I really don't know how to work out why these occur and what I can do to prevent them.
Obviously I have all the retry logic in place so that it doesn't bring down my app, but eventually my app processes messages so slowly that in effect it's doing nothing at all.
My suspicion is that it's because my queue (a "global worldwide queue") resides in North Europe but my app resides in East US, so the latency is causing an issue. If this is the case then I'm really stumped: first, it's one Azure data center communicating with another Azure data center (so it should be fast), and second, how on earth should you architect global queues for distributed access if the performance is so bad? AFAIK Service Bus doesn't support single-endpoint, globally distributed queues...

NServicebus: Programmatic reading of error queue

I’m currently building an application using NServicebus and Azure.
The regular processes are working, but now I’d like to do more about the management and monitoring aspect of the application.
The customer wants to see a dashboard where he can see the health of the application and also be able to correct issues.
What I’d like to do is:
Detect when things are sent to an error queue (to be able to send an alert to an admin)
Allow an admin to handle messages on the error queue from a management application, without resorting to the provided command line tool.
Is there a way to programmatically do error handling in NServicebus? I know which errors are transient and which errors might need manual intervention.
Is it possible to plug custom logic into the error handling of NServiceBus?
Is it possible to handle messages on the error queue programmatically?
Thanks,
Erwin
Regarding "dashboard where he can see the health of the application and also be able to correct issues":
Please take a look at ServicePulse (http://particular.net/ServicePulse) for production and online monitoring.
This provides both endpoint health indicators and Failed message indicators (including "Retry" capabilities).
For advanced debugging and visualization of your process you should also consider ServiceInsight (http://particular.net/ServiceInsight).
Behind the scenes of ServicePulse there's the ServiceControl server, which exposes a REST HTTP API with programmatic access to audited and error messages.
HTH,
Danny.

Systematic way to monitor FourSquare real-time API?

Is there a systematic/programmatic way to monitor for real-time push notification failures? I've found that the real time API notifications can fail intermittently and I would like to be notified when this happens.
Usually pushes "fail" because of errors on the application's end, essentially whenever your push URL doesn't return a 200. You will still receive the push from Foursquare though, so assiduous monitoring/logging of all pushes to your app should help you identify when they fail.
If you're doing a lot of processing after receiving a push from us, you may be returning a 200 too slowly. Foursquare will assume the request timed out and report it as a failed push. To mitigate this, you should always return a 200 immediately, then do any processing.
It rarely happens, but sometimes the error could be coming from Foursquare's end. For the latest on platform status, you should follow @foursquareAPI on Twitter, where we communicate downtime and other updates.
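The "return a 200 immediately, then do any processing" advice can be sketched as follows, using only Python's standard library. Everything here is an assumption for illustration: the names `PushHandler`, `process_push` and `start`, the in-memory queue, and treating the push body as opaque bytes. It is not Foursquare's recommended implementation, just one way to decouple the response from the work.

```python
import logging
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

pending = queue.Queue()   # pushes waiting to be processed
processed = []            # stand-in for wherever results end up


def process_push(payload):
    """Hypothetical slow processing step; replace with real work."""
    processed.append(payload)


def worker():
    """Drain the queue in the background so request handling stays fast."""
    while True:
        payload = pending.get()
        try:
            process_push(payload)
        except Exception:
            # Log failures here so they can be monitored without killing the worker.
            logging.exception("processing a push failed")
        finally:
            pending.task_done()


class PushHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        pending.put(body)             # hand off the work first
        self.send_response(200)       # then acknowledge right away
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging


def start(port=0):
    """Start the worker and the HTTP server in daemon threads; return the server."""
    threading.Thread(target=worker, daemon=True).start()
    server = HTTPServer(("127.0.0.1", port), PushHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start(8080)` would listen on a local port; logging every received body and every processing failure gives you the record needed to spot pushes that were reported as failed.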
