IoT Edge device connection state monitoring - Azure

We have a business requirement to maintain the connected state of IoT Edge devices in an Azure Digital Twins instance. It should be near real time, but short delays of up to a few minutes are acceptable.
That is, in the Digital Twins instance we have a DT entity for each IoT Edge device, and it has a property Online (true/false).
In production we will have up to a few hundred devices in total.
We are looking for a good method of monitoring the connected state of Edge devices.
Our initial attempt was to subscribe an Azure Function to the Event Grid Device Connected/Disconnected notifications in IoT Hub events.
After initial testing we found that Event Grid apparently cannot be used as the single source. After more research we found the following information:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-event-grid#limitations-for-device-connected-and-device-disconnected-events
IoT Hub does not report each individual device connect and disconnect, but rather publishes the current connection state taken at a periodic 60 second snapshot. Receiving either the same connection state event with different sequence numbers or different connection state events both mean that there was a change in the device connection state during the 60 second window.
And another one:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-troubleshoot-connectivity#mqtt-device-disconnect-behavior-with-azure-iot-sdks
Azure IoT device SDKs disconnect from IoT Hub and then reconnect when they renew SAS tokens over the MQTT (and MQTT over WebSockets) protocol….
…
If you're monitoring device connections with Event Hub, make sure you build in a way of filtering out the periodic disconnects due to SAS token renewal. For example, do not trigger actions based on disconnects as long as the disconnect event is followed by a connect event within a certain time span.
Next, after more searching on the topic, we found the following question:
Best way to Fetch connectionState from 1000's of devices - Azure IoTHub
The accepted answer suggests using the heartbeat pattern; however, the official documentation clearly states that it should not be used in a production environment:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-identity-registry#device-heartbeat
The article describing the heartbeat pattern also mentions a “short expiry time pattern”, but gives little detail about it.
For a complete picture, we also found the following article:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-how-to-order-connection-state-events
But it is based on an Event Grid subscription and therefore will not provide accurate data.
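For reference, the ordering idea in that article comes down to keeping only the event with the highest sequence number per device. Below is a minimal sketch, assuming the sequence number arrives as a hexadecimal string and that the event types are named Microsoft.Devices.DeviceConnected and Microsoft.Devices.DeviceDisconnected. The in-memory dictionary and the function names are hypothetical stand-ins for a real store. It only fixes out-of-order delivery; it does not remove the 60 second snapshot limitation quoted above.

```python
from typing import Dict, Tuple

CONNECTED = "Microsoft.Devices.DeviceConnected"  # assumed event type name

# Hypothetical in-memory store: device_id -> (latest sequence number, online?)
_latest: Dict[str, Tuple[int, bool]] = {}


def apply_connection_event(device_id: str, event_type: str, sequence_number: str) -> bool:
    """Apply an event only if it is newer than the stored one.

    Returns True when the stored state was updated, False when the event
    was stale (arrived out of order) and was ignored.
    """
    seq = int(sequence_number, 16)  # assumption: hexadecimal string
    current = _latest.get(device_id)
    if current is not None and seq <= current[0]:
        return False
    _latest[device_id] = (seq, event_type == CONNECTED)
    return True


def is_online(device_id: str) -> bool:
    """Last known state; devices never seen are treated as offline."""
    return _latest.get(device_id, (0, False))[1]
```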
Finally, after reading all of this, we have the following plan to address the problem:
We will have an Azure Function subscribed to Event Grid Device Connected/Disconnected notifications.
If a DeviceConnected event is received, the function will check device connectivity immediately.
If a DeviceDisconnected event is received, the function will wait 90 seconds, because we found that the DeviceConnected event usually arrives about 60 seconds later for a given device. After the delay it will check the device connectivity.
Device connectivity will be checked by sending a cloud-to-device message with acknowledgment, as described here:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-csharp-csharp-c2d#receive-delivery-feedback
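The plan above can be sketched roughly as follows. This is only an illustration under assumptions: `probe` is a hypothetical callable standing in for whatever connectivity check is used, time is passed in explicitly so the sketch runs without Azure, and cancelling a pending check when a connect event arrives is our own simplification.

```python
from typing import Callable, Dict, Optional


class ConnectionDebouncer:
    """Probe immediately on connect; probe 90 s after a disconnect."""

    def __init__(self, probe: Callable[[str], bool], delay_seconds: float = 90.0):
        self.probe = probe                  # hypothetical connectivity check
        self.delay_seconds = delay_seconds
        self._due: Dict[str, float] = {}    # device_id -> time the probe is due

    def on_event(self, device_id: str, connected: bool, now: float) -> Optional[bool]:
        """Handle one connect/disconnect notification.

        Returns the probe result for connect events, None for disconnect
        events (their probe is deferred until tick() is called later).
        """
        if connected:
            # The immediate probe makes any pending delayed probe redundant.
            self._due.pop(device_id, None)
            return self.probe(device_id)
        self._due[device_id] = now + self.delay_seconds
        return None

    def tick(self, now: float) -> Dict[str, bool]:
        """Run every deferred probe whose delay has elapsed."""
        ready = [d for d, due in self._due.items() if due <= now]
        results = {}
        for device_id in ready:
            del self._due[device_id]
            results[device_id] = self.probe(device_id)
        return results
```

A timer-driven caller would invoke `tick` every few seconds and write each returned value to the Online property of the matching twin.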
Concerns with this solution:
Complexity.
The Azure Function would need the IoT Hub service connection string.
The device disconnected event might be delayed by up to a few minutes.
Can anyone suggest a better solution?
Thanks!
EDIT:
In our case, we do not use DeviceClient but ModuleClient on the Edge devices, and modules do not support C2D messages, as stated here:
https://learn.microsoft.com/en-us/azure/iot-edge/module-development?view=iotedge-2018-06&WT.mc_id=IoT-MVP-5004034#iot-hub-primitives
So we would need to use direct methods instead to test whether the device is online.

Related

Monitor a specific IoT device in Azure?

I have multiple devices in my IoT Hub and I want to set an alert when one specific device is not sending messages. I know you can set alerts when the IoT Hub in general is not getting any messages but I want to alert when a specific device isn't.
Example: Device1, Device2, Device3, Device4
Alert when Device1 is not sending messages.
I have tried searching all over and all that I found was a question from 2018 saying it was not possible (I am hoping this has changed).
For your specific case I would use the device heartbeat pattern: https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-identity-registry#device-heartbeat
If your IoT solution needs to know if a device is connected, you can implement the heartbeat pattern. In the heartbeat pattern, the device sends device-to-cloud messages at least once every fixed amount of time (for example, at least once every hour). Therefore, even if a device does not have any data to send, it still sends an empty device-to-cloud message (usually with a property that identifies it as a heartbeat). On the service side, the solution maintains a map with the last heartbeat received for each device. If the solution does not receive a heartbeat message within the expected time from the device, it assumes that there is a problem with the device.
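The service-side map described in that passage could look something like the sketch below. The class and method names are hypothetical, and timestamps are passed in as plain seconds so the example is self-contained.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Keeps the last heartbeat time per device and reports silent ones."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}  # device_id -> last heartbeat time

    def record(self, device_id: str, now: float) -> None:
        """Call for every device-to-cloud message, heartbeat or not."""
        self.last_seen[device_id] = now

    def silent_devices(self, now: float) -> List[str]:
        """Devices whose last message is older than the timeout."""
        return sorted(
            device_id
            for device_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Note that a device which has never sent anything is not in the map, so in practice each expected device (Device1 in the question) would be recorded once at registration time so that it can also be reported as silent.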

When are EventGrid IoT Hub DeviceConnected and DeviceDisconnected Events raised

IoT Hub publishes the events "DeviceConnected" and "DeviceDisconnected" via Event Grid according to the documentation.
My question is, which action from an actual IoT Device triggers these events?
For the "DeviceConnected" event:
Is it triggered when the OpenAsync Method is called on the Client SDK?
Is it triggered implicitly when the SendEvent Method is called?
Is this event also available via direct AMQP/MQTT connections?
For how long will it stay in this state?
For the "DeviceDisconnected" event:
Does the device go to "disconnected" as soon as "Close" on the DeviceClient is called?
What if connectivity is not good? Is there a constant ping along with a timeout mechanism which marks a device as offline after it was idle for a given time?
We have currently implemented the heartbeat pattern as described here, but we are wondering if there is an easier and possibly more cost-efficient way to achieve the same goal.
I found this passage in the documentation
The connection state is updated only for devices using MQTT or AMQP. Also, it is based on protocol-level pings (MQTT pings, or AMQP pings), and it can have a maximum delay of only 5 minutes. For these reasons, there can be false positives, such as devices reported as connected but that are disconnected.
This covers most of my questions.

Why IotHub events are delayed when stored in Time Series Insights?

I have a Time Series Insights Environment with an IoT Hub data source configured.
What I noticed is that there is a consistent 20-30 second delay between sending an event to IoT Hub and seeing it stored in TSI.
After I found this, I attached a Function Trigger directly to the IoT Hub. The events were received immediately by the trigger, but TSI returned them 20-30 seconds later.
So, I have two questions:
Where does that delay come from?
Is there anything I can do about minimizing the delay?
Thanks!
There is an expected measurable delay of up to 1 minute before you will see it in TSI and you cannot dial that up/down. It's just how the service works.
Just in case you haven't already, also make sure you've configured your SKU and capacity to support your use cases.

Cloud Architecture On Azure for Internet of Things

I'm working on a server architecture for sending/receiving messages from remote embedded devices, which will be hosted on Windows Azure. The front-facing servers are going to be maintaining persistent TCP connections with these devices, and I need a way to communicate with them on the backend.
Problem facts:
Devices: ~10,000
Frequency of messages device is sending up to servers: 1/min
Frequency of messages originating server side (e.g. from user actions, scheduled triggers, etc.): 100/day
Average size of message payload: 64 bytes
Upward communication
The devices send up messages very frequently (sensor readings). The constraints for that data are not very strong, due to the fact that we can aggregate/insert those sensor readings in a batched manner, and that they don't require in-order guarantees. I think the best way of handling them is to put them in a Storage Queue, and have a worker process poll the queue at intervals and dump that data. Of course, I'll have to be careful about making sure the worker process does this frequently enough so that the queue doesn't infinitely back up. The max batch size of Azure Storage Queues is 32, but I'm thinking of potentially pulling in more than that: something like publishing to the data store every 1,000 readings or 30 seconds, whichever comes first.
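The "every 1,000 readings or 30 seconds, whichever comes first" rule can be sketched independently of the queue technology. In the sketch below, `flush` is a hypothetical callable that writes one batch to the data store, and the current time is passed in explicitly so the logic is easy to test.

```python
from typing import Any, Callable, List, Optional


class ReadingBatcher:
    """Buffers readings and flushes on size or age, whichever comes first."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 max_items: int = 1000, max_age_seconds: float = 30.0):
        self.flush = flush                    # hypothetical data store writer
        self.max_items = max_items
        self.max_age_seconds = max_age_seconds
        self._buffer: List[Any] = []
        self._oldest: Optional[float] = None  # arrival time of first buffered item

    def add(self, reading: Any, now: float) -> None:
        if not self._buffer:
            self._oldest = now
        self._buffer.append(reading)
        self._maybe_flush(now)

    def poll(self, now: float) -> None:
        """Call periodically so a half-full batch still goes out on time."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        if not self._buffer:
            return
        too_big = len(self._buffer) >= self.max_items
        too_old = now - self._oldest >= self.max_age_seconds
        if too_big or too_old:
            self.flush(self._buffer)
            self._buffer = []
```

The worker would call `add` for each dequeued message and `poll` on a short timer. Because the queue hands out at most 32 messages per request, the buffer simply fills across several dequeue calls.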
Downward communication
The server sends down updates and notifications much less frequently. This is a slightly harder problem, as I can see two viable paradigms here (with some blending in between). Could either:
Create a Service Bus Queue for each device (or one queue with thousands of subscriptions; the limit for the number of queues is 10,000)
Have a state table housed in a DB that contains the latest "state" of a specific message type that the devices will get sent to them
With option 1, the application server simply enqueues a message in a fire-and-forget manner. On the front-end servers, however, quite a few things have to happen. Concerns I can see include:
Monitoring 10k queues (or many subscriptions off of a queue - the Azure SDK apparently reuses connections for subscriptions to the same queue)
Connection Management
Should no longer monitor a queue if device disconnects.
Need to expire messages if device is disconnected for an extended period of time (so that queue isn't backed up)
Need to enable some type of "refresh" mechanism to update device's complete state when it goes back online
The good news is that service bus queues are durable, and with sessions can arrange messages to come in a FIFO manner.
With option 2, the DB would house a table that would maintain state for all of the devices. This table would be checked periodically by the front-facing servers (every few seconds or so) for state changes written to it by the application server. The front-facing servers would then dispatch to the devices. This removes the requirement for queueing of FIFO, the reasoning being that this message contains the latest state, and doesn't have to compete with other messages destined for the same device. The message is ephemeral: if it fails, then it will be resent when the device reconnects and requests to be refreshed, or at the next check interval of the front-facing server.
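To make option 2 more concrete, here is a rough in-memory sketch of such a state table. All names are hypothetical, and a real implementation would sit on a database with an indexed version or timestamp column. Each write overwrites the previous state for a (device, message type) pair and bumps a global version, so a front-facing server only has to ask for rows changed since the last version it saw.

```python
from typing import Dict, List, Tuple


class DeviceStateTable:
    """Latest-state-wins table keyed by (device_id, message_type)."""

    def __init__(self) -> None:
        self._version = 0
        self._rows: Dict[Tuple[str, str], Tuple[int, bytes]] = {}

    def write(self, device_id: str, message_type: str, payload: bytes) -> int:
        """Application server side: overwrite the state, return the new version."""
        self._version += 1
        self._rows[(device_id, message_type)] = (self._version, payload)
        return self._version

    def changes_since(self, version: int) -> List[Tuple[str, str, bytes, int]]:
        """Front-facing server side: rows changed after the given version."""
        changed = [
            (device_id, message_type, payload, row_version)
            for (device_id, message_type), (row_version, payload) in self._rows.items()
            if row_version > version
        ]
        return sorted(changed, key=lambda row: row[3])

    def state_for(self, device_id: str) -> Dict[str, bytes]:
        """Full refresh for a device that has just reconnected."""
        return {
            message_type: payload
            for (dev, message_type), (_, payload) in self._rows.items()
            if dev == device_id
        }
```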
In this scenario, the need for queues seems to be removed, but the DB becomes the bottleneck here, and I fear it's not as scalable.
These are both viable approaches, and I feel this question is already becoming too large (although I can provide more descriptions if necessary). Just wanted to get a feel for what's possible, what's usually done, if there's something fundamental I'm missing, and what things in the cloud can I take advantage of to not reinvent the wheel.
If you can identify the device (for example by device ID, IMEI, or MAC address) from the message it sends, then you can reduce the number of queues from 10,000 to 1 and avoid having 10,000 subscriptions as well. This could also help with the downward communication, as you will be able to identify the device and send the message to the appropriate socket.
Since, as you mentioned, the connections are long-lived, you could deliver a command directly to a device that is connected and decide separately what to do with commands for devices that are not connected.
Hope it helps

MQTT what is the purpose or usage of Last Will Testament?

I'm surely missing something about how the whole MQTT protocol works, as I can't grasp the usage pattern of Last Will Testament messages: what's their purpose?
One example I often see is about informing that a device has gone offline. It doesn't make very much sense to me, since it's obvious that if a device isn't publishing any data it may be offline or there could be some network problems.
So, what are some practical usages of the LWT? What was it invented for?
LWT messages are not really concerned about detecting whether a client has gone offline or not (that task is handled by keepAlive messages).
LWT messages are about what happens after the client has gone offline.
The analogy is that of a real last will:
If a person dies, she can formulate a testament, in which she declares what actions should be taken after she has passed away. An executor will heed those wishes and execute them on her behalf.
The analogy in the MQTT world is that a client can formulate a testament, in which it declares what message should be sent on its behalf by the broker, after it has gone offline.
A fictitious example:
I have a sensor, which sends crucial data, but very infrequently.
It has formulated a last will statement in the form of [topic: '/node/gone-offline', message: ':id'], with :id being a unique id for the sensor. I also have an emergency-subscriber for the topic '/node/gone-offline', which will send an SMS to my phone every time a message is published on that channel.
During normal operation, the sensor will keep the connection to the MQTT-broker open by sending periodic keepAlive messages interspersed with the actual sensor readings. If the sensor goes offline, the connection to the broker will time out, due to the lack of keepAlives.
This is where LWT comes in: If no LWT is specified, the broker doesn't care and just closes the connection. In our case however, the broker will execute the sensor's last will and publish the LWT-message '/node/gone-offline: :id'. The message will then be consumed by my emergency-subscriber and I will be notified of the sensor's ID via SMS so that I can check up on what's going on.
In short:
Instead of just closing the connection after a client has gone offline, LWT messages can be leveraged to define a message to be published by the broker on behalf of the client, since the client is offline and cannot publish anymore.
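The behaviour can be illustrated with a toy, in-memory stand-in for a broker. This is not a real MQTT implementation and all names are hypothetical. It only models the rules described above: a client registers a will when connecting, the broker drops a client that has been silent for more than 1.5 times its keep-alive interval, and the will is published only when the disconnect was not graceful.

```python
from typing import Dict, List, Optional, Tuple

Will = Tuple[str, str]  # (topic, message)


class ToyBroker:
    """Simulates keep-alive expiry and last-will publication."""

    def __init__(self) -> None:
        self.published: List[Will] = []
        self._clients: Dict[str, dict] = {}

    def connect(self, client_id: str, keep_alive: float, now: float,
                will: Optional[Will] = None) -> None:
        self._clients[client_id] = {
            "keep_alive": keep_alive, "last_seen": now, "will": will,
        }

    def ping(self, client_id: str, now: float) -> None:
        """Any traffic from the client (keepAlive or data) resets the timer."""
        self._clients[client_id]["last_seen"] = now

    def disconnect(self, client_id: str) -> None:
        """Graceful disconnect: the will is discarded, nothing is published."""
        self._clients.pop(client_id, None)

    def expire(self, now: float) -> None:
        """Drop silent clients and publish their wills on their behalf."""
        for client_id, client in list(self._clients.items()):
            if now - client["last_seen"] > 1.5 * client["keep_alive"]:
                del self._clients[client_id]
                if client["will"] is not None:
                    self.published.append(client["will"])
```

With a 60 second keep-alive, a sensor that was last heard at t=50 is still considered connected at t=100, and its will is published once `expire` runs after t=140.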
Just because a device is not publishing does not mean it is not online or there is a network problem.
Take, for example, a sensor that monitors a value that changes only very infrequently. Good design says the sensor should publish only the changes to reduce bandwidth usage, since periodically publishing the same value is wasteful. If the value is published as a retained value, then any new subscriber will always get the current value without having to wait for the sensor value to change and be published again.
In this case the LWT is used to publish a message when the sensor fails (or there is a network problem), so we know of the problem as soon as the client keep-alive times out.
An in-depth article about Last-Will-and-Testament messages is available in the MQTT Essentials Blog Post series: http://www.hivemq.com/mqtt-essentials-part-9-last-will-and-testament/.
To summarize the blog post:
The Last Will and Testament feature is used in MQTT to notify other clients about an ungracefully disconnected client.
MQTT is often used in scenarios where unreliable networks are very common. Therefore it is assumed that some clients will disconnect ungracefully from time to time, because they lost the connection, the battery is empty or any other imaginable case. It would be good to know if a connected client has disconnected gracefully (which means with an MQTT DISCONNECT message) or not, in order to take appropriate action.
