Monitor a specific IoT device in Azure? - azure

I have multiple devices in my IoT Hub and I want to set an alert when one specific device is not sending messages. I know you can set alerts when the IoT Hub in general is not getting any messages but I want to alert when a specific device isn't.
Example: Device1, Device2, Device3, Device4
Alert when Device1 is not sending messages.
I have tried searching all over and all that I found was a question from 2018 saying it was not possible (I am hoping this has changed).

For your specific case, I would use the device heartbeat pattern: https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-identity-registry#device-heartbeat
If your IoT solution needs to know if a device is connected, you can implement the heartbeat pattern. In the heartbeat pattern, the device sends device-to-cloud messages at least once every fixed amount of time (for example, at least once every hour). Therefore, even if a device does not have any data to send, it still sends an empty device-to-cloud message (usually with a property that identifies it as a heartbeat). On the service side, the solution maintains a map with the last heartbeat received for each device. If the solution does not receive a heartbeat message within the expected time from the device, it assumes that there is a problem with the device.
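The service-side map described above could be sketched roughly as follows. This is a minimal in-memory illustration, not an Azure API: the class name HeartbeatMonitor, its methods, and the one-hour limit are assumptions made for the example, and a real solution would call record() from whatever code consumes the device-to-cloud messages.

```python
from datetime import datetime, timedelta, timezone


class HeartbeatMonitor:
    """Hypothetical helper: keeps the last heartbeat time for each device."""

    def __init__(self, max_silence: timedelta):
        self.max_silence = max_silence
        self.last_seen = {}  # device id -> time of the last message received

    def record(self, device_id: str, received_at: datetime) -> None:
        # Call this for every device-to-cloud message (heartbeat or telemetry).
        self.last_seen[device_id] = received_at

    def silent_devices(self, now: datetime, watched=None) -> list:
        # Check only the watched devices if given, otherwise every known device.
        # A watched device that never reported counts as silent too.
        ids = watched if watched is not None else self.last_seen.keys()
        return sorted(
            d for d in ids
            if d not in self.last_seen or now - self.last_seen[d] > self.max_silence
        )


monitor = HeartbeatMonitor(max_silence=timedelta(hours=1))
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
monitor.record("Device1", start)
monitor.record("Device2", start + timedelta(minutes=50))

# Ninety minutes after the start, only Device1 has been silent for over an hour.
overdue = monitor.silent_devices(start + timedelta(minutes=90), watched=["Device1"])
```

Passing watched=["Device1"] limits the check to the one device you care about, which is the per-device alert the question asks for; the alerting itself (email, SMS, and so on) is left out.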

Related

IoT EDGE Device Connection state monitoring

We have a business requirement to maintain the connected state of IoT Edge devices in a Digital Twins instance. It should be near real time, but short delays of up to a few minutes are acceptable.
That is, in the Digital Twins instance we have a DT entity for each IoT Edge device, and each entity has an Online property (true/false).
In production we will have up to a few hundred devices in total.
We are looking for a good method of monitoring the connected state of the Edge devices.
Our initial attempt was to subscribe an AZ Function for Event Grid Device Connected/Disconnected notifications in IoT Hub events.
After initial testing we found that Event Grid apparently cannot be used as the single source. After more research we found the following information:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-event-grid#limitations-for-device-connected-and-device-disconnected-events
IoT Hub does not report each individual device connect and disconnect, but rather publishes the current connection state taken at a periodic 60 second snapshot. Receiving either the same connection state event with different sequence numbers or different connection state events both mean that there was a change in the device connection state during the 60 second window.
And another one:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-troubleshoot-connectivity#mqtt-device-disconnect-behavior-with-azure-iot-sdks
Azure IoT device SDKs disconnect from IoT Hub and then reconnect when they renew SAS tokens over the MQTT (and MQTT over WebSockets) protocol….
…
If you're monitoring device connections with Event Hub, make sure you build in a way of filtering out the periodic disconnects due to SAS token renewal. For example, do not trigger actions based on disconnects as long as the disconnect event is followed by a connect event within a certain time span.
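The filtering suggested in that passage might look something like the sketch below. It is a plain-Python illustration over a list of already collected events, not Event Grid or Azure Functions code; the function name, the tuple layout, the device ids, and the 90-second grace period are all assumptions chosen for the example.

```python
def confirmed_disconnects(events, now, grace_seconds=90.0):
    """Return (device_id, time) for disconnects with no reconnect inside the grace window.

    `events` holds (device_id, kind, time) tuples, where kind is "connected" or
    "disconnected" and time is in seconds. A disconnect whose grace window has
    not ended by `now` is still undecided and is not returned.
    """
    events = list(events)
    confirmed = []
    for device_id, kind, t in events:
        if kind != "disconnected":
            continue
        # A connect for the same device shortly afterwards cancels the disconnect.
        reconnected = any(
            d == device_id and k == "connected" and t <= t2 <= t + grace_seconds
            for d, k, t2 in events
        )
        if not reconnected and now >= t + grace_seconds:
            confirmed.append((device_id, t))
    return confirmed


events = [
    ("edge-1", "disconnected", 0.0),
    ("edge-1", "connected", 60.0),      # e.g. a SAS token renewal
    ("edge-2", "disconnected", 100.0),  # never comes back
    ("edge-3", "disconnected", 250.0),  # too recent to judge at now=300
]
result = confirmed_disconnects(events, now=300.0)
```

Only edge-2 is reported here: edge-1 reconnected within the window, and the grace period for edge-3 has not yet ended.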
Next, after more search on the topic, we found the following question:
Best way to Fetch connectionState from 1000's of devices - Azure IoTHub
The accepted answer suggests using the heartbeat pattern; however, the official documentation clearly states that it should not be used in a production environment:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-identity-registry#device-heartbeat
The article describing the heartbeat pattern also mentions a “short expiry time pattern”, but gives little detail about it.
For complete picture, we also found the following article:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-how-to-order-connection-state-events
But it is based on Event Grid subscription and therefore will not provide accurate data.
Finally, after reading all of this, we have the following plan to address the problem:
We will subscribe an Azure Function to the Event Grid Device Connected/Disconnected notifications.
If a DeviceConnected event is received, the function will check device connectivity immediately.
If a DeviceDisconnected event is received, the function will wait for 90 seconds, because we found that the DeviceConnected event usually arrives after ~60 seconds for a given device. After the delay, it will check the device connectivity.
Device connectivity will be checked by sending a cloud-to-device message with acknowledgment, as described here:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-csharp-csharp-c2d#receive-delivery-feedback
Concerns of the solution:
Complexity.
The Azure Function would need the IoT Hub service connection string.
The device disconnected event might be delayed by up to a few minutes.
Can anyone suggest better solution?
Thanks!
EDIT:
In our case, we do not use DeviceClient but ModuleClient on the Edge devices, and modules do not support C2D messages, as stated here:
https://learn.microsoft.com/en-us/azure/iot-edge/module-development?view=iotedge-2018-06&WT.mc_id=IoT-MVP-5004034#iot-hub-primitives
So we would need to use Direct Methods instead to test if the device is Online.

What are outgoing messages in Service Bus?

I need help interpreting this graph:
It looks like messages are coming in, then being processed, and going back to 0. The graph is not continually rising.
However, the outgoing messages metric shows "--". Does this mean 0?
Does an outgoing message represent a message being read by a service?
If the messages are not being read, then what is happening to them? The dead letter queue has 0 messages.
Yes, an outgoing message represents a message being read by a service.
In your case, I see you've disabled the dead letter queue (in your screenshot, the Dead lettering option is disabled), so there are no messages in the DLQ. If the DLQ is disabled, messages are deleted once they expire.
The docs describe them like this:
The number of events or messages received from Service Bus over a specified period.
So, the incoming messages are messages that are sent to the Service Bus. The outgoing messages are messages that are picked up by message processors (your application). So the answer to your question
Does an outgoing message represent a message being read by a service?
is: Yes!

Does the device get the error message from IoThub if the eventgrid event delivery fails?

I was wondering if IoThub informs the device if the event grid message fails?
Here is the architecture flow:
Device->IoTHub->EventGrid->Webhook
So when Event Grid gets an error (such as a 400/500 error for some reason), does IoT Hub inform the device of this failure so that the device can retry, or does it send a "message received" response, so that the device thinks the message was delivered successfully to the webhook? Is there a workflow/design where we could inform the device of this error?
So when Event Grid gets an error (such as a 400/500 error for some reason), does IoT Hub inform the device of this failure so that the device can retry, or does it send a "message received" response, so that the device thinks the message was delivered successfully to the webhook?
No, it is an asynchronous flow. There won't be any cloud-to-device message that tells the device that things have failed, at least not out of the box. Also, the retry logic is already present in Event Grid, so there is no need for the device to retry the message unless it fails to communicate with the IoT Hub.
Is there a workflow/ design where we could inform the device of this error?
Take this example:
1. Device sends telemetry
2. IoT Hub receives the message and passes it to Event Grid
3. Event Grid tries to deliver the event
4. Event delivery fails due to a transient error
5. Event Grid retries event delivery n times over a given period
6. Event is successfully delivered or dead-lettered
7. In case of successful delivery, the message is processed by the endpoint.
Now, the cloud-to-device message can only be sent at step 6 (or 7 if you want to include the response of the final endpoint). There might be milliseconds between step 1 and step 6/7, or there might be minutes, hours or even days (in worst-case scenarios, depending on the retry configuration and endpoint status).
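To make the timing concrete, here is a toy simulation of the retry steps above. It does not use any Event Grid API: deliver_with_retries and flaky_endpoint are hypothetical names, the endpoint is just a Python function, and the waiting between attempts is left out. The point is only that the final outcome, which is the earliest moment anything could be reported back to the device, is known only after the last attempt.

```python
def deliver_with_retries(event, endpoint, max_attempts):
    """Call `endpoint(event)` up to `max_attempts` times and report the final outcome."""
    for attempt in range(1, max_attempts + 1):
        try:
            endpoint(event)
            return {"status": "delivered", "attempts": attempt}
        except RuntimeError:
            continue  # a real service would wait (back off) before the next try
    return {"status": "dead-lettered", "attempts": max_attempts}


def flaky_endpoint(failures_before_success):
    """Hypothetical webhook that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def handle(event):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("HTTP 500")

    return handle


# Two transient failures, then success on the third attempt.
recovered = deliver_with_retries({"temp": 21}, flaky_endpoint(2), max_attempts=5)
# The endpoint stays down longer than the retry budget allows.
given_up = deliver_with_retries({"temp": 21}, flaky_endpoint(9), max_attempts=5)
```

In both cases the device has long since finished sending; any feedback would have to be a separate message created from the returned outcome.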
A logical thing is to have the endpoint publish its own event and get it delivered to the cloud device, but this will be an asynchronous flow.
Why do you want to burden the device with the outcome of the message delivery? I think it shouldn't have to worry about the flow unless it fails to communicate with the IoT Hub.

When are EventGrid IoT Hub DeviceConnected and DeviceDisconnected Events raised

IoT Hub publishes the events "DeviceConnected" and "DeviceDisconnected" via Event Grid according to the documentation.
My question is, which action from an actual IoT Device triggers these events?
For the "DeviceConnected" event:
Is it triggered when the OpenAsync Method is called on the Client SDK?
Is it triggered implicitly when the SendEvent Method is called?
Is this event also available via direct AMQP/MQTT connections?
For how long will it stay in this state?
For the "DeviceDisconnected" event:
Is the device going to "disconnected" as soon as "Close" on the DeviceClient is called?
What if connectivity is not good? Is there a constant ping along with a timeout mechanism which marks a device as offline after it was idle for a given time?
We have currently implemented the heartbeat pattern as described here, but we are wondering if there is an easier and possibly more cost-efficient way to achieve the same goal.
I found this passage in the documentation
The connection state is updated only for devices using MQTT or AMQP.
Also, it is based on protocol-level pings (MQTT pings, or AMQP pings),
and it can have a maximum delay of only 5 minutes. For these reasons,
there can be false positives, such as devices reported as connected
but that are disconnected.
This covers most of my questions.

MQTT what is the purpose or usage of Last Will Testament?

I'm surely missing something about how the whole MQTT protocol works, as I can't grasp the usage pattern of Last Will Testament messages: what's their purpose?
One example I often see is about informing that a device has gone offline. It doesn't make very much sense to me, since it's obvious that if a device isn't publishing any data it may be offline or there could be some network problems.
So, what are some practical usages of the LWT? What was it invented for?
LWT messages are not really concerned about detecting whether a client has gone offline or not (that task is handled by keepAlive messages).
LWT messages are about what happens after the client has gone offline.
The analogy is that of a real last will:
If a person dies, she can formulate a testament, in which she declares what actions should be taken after she has passed away. An executor will heed those wishes and execute them on her behalf.
The analogy in the MQTT world is that a client can formulate a testament, in which it declares what message should be sent on its behalf by the broker after it has gone offline.
A fictitious example:
I have a sensor, which sends crucial data, but very infrequently.
It has formulated a last will statement in the form of [topic: '/node/gone-offline', message: ':id'], with :id being a unique id for the sensor. I also have an emergency subscriber for the topic '/node/gone-offline', which will send an SMS to my phone every time a message is published on that topic.
During normal operation, the sensor will keep the connection to the MQTT-broker open by sending periodic keepAlive messages interspersed with the actual sensor readings. If the sensor goes offline, the connection to the broker will time out, due to the lack of keepAlives.
This is where LWT comes in: If no LWT is specified, the broker doesn't care and just closes the connection. In our case, however, the broker will execute the sensor's last will and publish the LWT message '/node/gone-offline: :id'. The message will then be consumed by my emergency subscriber, and I will be notified of the sensor's ID via SMS so that I can check up on what's going on.
In short:
Instead of just closing the connection after a client has gone offline, LWT messages can be leveraged to define a message to be published by the broker on behalf of the client, since the client is offline and cannot publish anymore.
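The behaviour can be illustrated with a toy, in-memory broker. This is not a real MQTT client or broker, and ToyBroker and its method names are invented for the example. It models only three things: the will is stored at connect time, a clean DISCONNECT discards it, and a keep-alive timeout (the MQTT specification uses one and a half times the keep-alive interval) makes the broker publish it.

```python
class ToyBroker:
    """In-memory stand-in for an MQTT broker; it models only the last will."""

    def __init__(self):
        self.sessions = {}   # client id -> session state
        self.published = []  # (topic, payload) pairs sent to subscribers

    def connect(self, client_id, keepalive, will=None, now=0.0):
        # The will travels with CONNECT and is stored by the broker.
        self.sessions[client_id] = {
            "keepalive": keepalive, "will": will, "last_packet": now,
        }

    def packet_from(self, client_id, now):
        # Any packet, including a keep-alive ping, proves the client is alive.
        self.sessions[client_id]["last_packet"] = now

    def disconnect(self, client_id):
        # A clean DISCONNECT discards the will without publishing it.
        self.sessions.pop(client_id, None)

    def tick(self, now):
        # Drop clients silent for more than 1.5x their keep-alive interval
        # and publish their will on their behalf.
        for client_id, s in list(self.sessions.items()):
            if now - s["last_packet"] > 1.5 * s["keepalive"]:
                del self.sessions[client_id]
                if s["will"] is not None:
                    self.published.append(s["will"])


broker = ToyBroker()
broker.connect("sensor-42", keepalive=60, will=("/node/gone-offline", "sensor-42"))
broker.connect("sensor-43", keepalive=60, will=("/node/gone-offline", "sensor-43"))
broker.packet_from("sensor-42", now=50)
broker.disconnect("sensor-43")  # graceful: its will is never published

broker.tick(now=100)            # sensor-42 silent for 50 s, still within 90 s
after_first = list(broker.published)
broker.tick(now=200)            # silent for 150 s: the broker publishes the will
```

After the second tick, broker.published contains only the will of sensor-42; sensor-43 left gracefully, so nothing was published for it.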
Just because a device is not publishing does not mean it is not online or there is a network problem.
Take, for example, a sensor that monitors a value that changes only very infrequently. Good design says that the sensor should publish only the changes, to help reduce bandwidth usage, as periodically publishing the same value is wasteful. If the value is published as a retained value, then any new subscriber will always get the current value without having to wait for the sensor value to change and be published again.
In this case the LWT is used to publish when the sensor fails (or there is a network problem), so we know of the problem as soon as the client keep alive times out.
An in-depth article about Last-Will-and-Testament messages is available in the MQTT Essentials Blog Post series: http://www.hivemq.com/mqtt-essentials-part-9-last-will-and-testament/.
To summarize the blog post:
The Last Will and Testament feature is used in MQTT to notify other clients about an ungracefully disconnected client.
MQTT is often used in scenarios where unreliable networks are very common. Therefore it is assumed that some clients will disconnect ungracefully from time to time, because they lost the connection, the battery is empty or any other imaginable case. It would be good to know if a connected client has disconnected gracefully (which means with an MQTT DISCONNECT message) or not, in order to take appropriate action.
