How to enforce the order of messages passed to an IoT device over MQTT via a cloud-based system (API design issue)

Suppose I have an IoT device which I want to control (let's say, switch on/off) and monitor (e.g. collect temperature readings). It seems MQTT could be the right fit. I could publish messages to the device to control it and the device could publish messages to a broker to report temperature readings. So far so good.
The problems start to occur when I try to design the API to control the device.
Let's say the device subscribes to two topics:
/device-id/control/on
/device-id/control/off
Then I publish messages to these topics in some order. But given that messaging is typically an asynchronous process, there are no guarantees on the order in which the device receives the messages.
So in case two messages are published in the following order:
/device-id/control/on
/device-id/control/off
they could be received in the reverse order, leaving the device turned on, which can have dramatic consequences depending on the context.
Of course, the API could be designed in some other way; for example, there could be just one topic
/device-id/control
and the payload of each message would carry its meaning (on/off). If messages are published to this topic in a given order, they can be expected to arrive at the device in that same order.
But what if the order of publishes to individual topics cannot be guaranteed? Suppose the following architecture of a system for IoT devices:
                         / control service \
application -> broker -> | control service | -> broker -> IoT device
                         \ control service /
The components of the system are:
an application which effectively controls the device by publishing messages to a broker
a typical message broker
a control service with some business logic
The important part is that, as in most modern distributed systems, the control service is a distributed, multi-instance entity capable of processing multiple control messages from the application at a time. Therefore the order of messages published by the application can end up totally mixed by the time they are delivered to the IoT device.
Now, given the fact that most MQTT brokers only implement QoS0 and QoS1 but not QoS2, it gets even more interesting, as such control messages could potentially be delivered multiple times (assuming QoS1 - see https://stackoverflow.com/a/30959058/1776942).
My point is that separate topics for control messages are a bad idea. The same goes for a single topic. In both cases there are no message delivery order guarantees.
The only solution to this particular issue that comes to my mind is message versioning, so that old (outdated) messages can simply be skipped when they are delivered after a message with a more recent version property.
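For illustration, a minimal device-side sketch of that versioning idea (the payload shape and names are my assumptions, not tied to any MQTT library): the device remembers the highest version it has applied and silently drops anything older.

    public class VersionedCommandHandler
    {
        private long _lastAppliedVersion = -1;

        // Returns true if the command was applied, false if it was stale.
        public bool TryApply(long version, string command)
        {
            if (version <= _lastAppliedVersion)
                return false;                // outdated or duplicate: skip it
            _lastAppliedVersion = version;
            Apply(command);                  // e.g. "on" / "off"
            return true;
        }

        private void Apply(string command) { /* switch the relay, etc. */ }
    }

Note that the same check also absorbs QoS1 duplicates, since a redelivered message carries a version that has already been applied.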
Am I missing something?
Is message versioning the only solution to this problem?

Am I missing something?
Most definitely. The example you brought up is a generic control system attached to a message-oriented scheme. There are a number of patterns that can be used in a message-based architecture. This article by Microsoft categorizes message patterns into two primary classes:
Commands and
Events
The most generic pattern of command behavior is to issue a command, then measure the state of the system to verify the command was carried out. If you forget to verify, your system has an open loop. Such open loops are (unfortunately) common in IT systems (because it's easy to forget), and often result in bugs and other bad behaviors such as the one described above. So, the proper way to handle a command is:
Issue the command
Inquire as to the state of the system
Evaluate next action
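As a hedged sketch of that closed loop (the IDeviceClient interface is hypothetical, not a real SDK):

    using System.Threading.Tasks;

    public interface IDeviceClient
    {
        Task SendCommandAsync(string command);
        Task<bool> IsOnAsync();
    }

    public static class Controller
    {
        public static async Task EnsureOnAsync(IDeviceClient device)
        {
            await device.SendCommandAsync("on");      // 1. issue the command
            bool isOn = await device.IsOnAsync();     // 2. inquire as to the state
            if (!isOn)                                // 3. evaluate next action:
                await device.SendCommandAsync("on");  //    e.g. retry
        }
    }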
Events, on the other hand, are simply fired off. As the publisher of an event, it is not my business to worry about who receives the event, in what order, etc. Now, it should also be pointed out that the use of any decent message broker (e.g. RabbitMQ) generally carries strong guarantees that messages will be delivered in the order in which they were originally published. Note that this does not mean they will be processed in order.
So, if you treat a command as an event, your system is guaranteed to act up sooner or later.
Is message versioning the only solution to this problem?
Message versioning typically refers to a property of the message class itself, rather than a particular instance of the class. It is often used when multiple versions of a message-based API exist and must be backwards-compatible with one another.
What you are instead referring to is unique message identifiers. GUIDs are particularly handy for making sure that each message gets its own unique id. However, I would argue that de-duplication in message-based architectures is an anti-pattern. One of the consequences of using messaging is that duplicates are possible, so you should try to design your system behaviors to be stateless and idempotent. If this is not possible, you should consider that messaging may not be the correct communication solution for the need.
Using the command-event dichotomy as an example, you could perform the following transaction:
The controller issues the command, assigning a unique identifier to the command.
The control system receives the command and turns on.
The control system publishes the "light on" event notification, containing the unique id of the command that was used to turn on the light.
The controller receives the notification and correlates it to the original command.
In the event that the controller doesn't receive notification after some timeout, the controller can retry the command. Note that "light on" is an idempotent command, in that multiple calls to it will have the same effect.
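Sketched in C# (the IPublisher/INotifications abstractions are hypothetical, for illustration only):

    using System;
    using System.Threading.Tasks;

    public interface IPublisher     { Task PublishAsync(string topic, string payload); }
    public interface INotifications { Task<bool> WaitForEventAsync(Guid commandId, TimeSpan timeout); }

    public static class LightController
    {
        public static async Task TurnOnAsync(IPublisher publisher, INotifications notifications)
        {
            var commandId = Guid.NewGuid();  // unique id used for correlation
            string payload = $"{{\"cmd\":\"on\",\"id\":\"{commandId}\"}}";

            await publisher.PublishAsync("device-id/control", payload);

            // Wait for the "light on" event carrying the same id; retry on
            // timeout. Retrying is safe because "light on" is idempotent.
            if (!await notifications.WaitForEventAsync(commandId, TimeSpan.FromSeconds(5)))
                await publisher.PublishAsync("device-id/control", payload);
        }
    }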

When the state changes, send the new state immediately, and after that periodically, every x seconds. With this solution your system gets into the desired state after some time, even when it temporarily disconnects from the network (low battery).
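A minimal sketch of that loop, assuming nothing beyond a publish delegate of your own (no particular MQTT client implied):

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class StateReporter
    {
        // Publishes the current state immediately, then again every interval.
        // Because each message carries the full state rather than a delta,
        // receivers converge even after a temporary disconnect.
        public static async Task RunAsync(Func<string> readState,
                                          Func<string, Task> publish,
                                          TimeSpan interval,
                                          CancellationToken ct)
        {
            while (!ct.IsCancellationRequested)
            {
                await publish(readState());
                await Task.Delay(interval, ct);
            }
        }
    }

An on-change publish would simply call publish(readState()) directly.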
BTW: You did not miss anything.

Apart from the comment that most brokers don't support QoS2 (I suspect you mean that a number of broker-as-a-service offerings, such as Amazon's AWS IoT service, don't support QoS2), you have covered most of the major points.
If message order really is that important, then you will have to include some form of ordering marker in the message payload, be it a counter or a timestamp.

Related

Replaying Messages in Order

I am implementing a consumer that processes messages from a queue where message order is important. I would like to implement a mechanism using NodeJS where:
the consumer function is consuming messages m1, m2, ..., mN from the queue
doing an IO-intensive operation to process each message: m -> m'
storing the result m' in a Redis cache
acknowledging the queue after each message is processed (step 2)
In a different function, I am listening for the messages from the cache:
sending the processed messages m' to an external system
if the external system was able to process the message, then deleting the processed message from the cache
if the external system rejects the processed message, then stopping the sending, discarding the unsent processed messages in the cache, and resetting the offset to the last accepted message in the queue. For example, if m12' was the last message accepted by the system and I have acknowledged m23 from the queue, then I have to discard m13' to m23' and reset the offset so that the consumer can read and start processing from m13 again.
A few assumptions:
The processing of m to m' is intensive, and I am processing messages optimistically, knowing that most of the time there won't be a failure
With the current assumptions and goals, is there any way I can achieve this with RabbitMQ or an Azure equivalent? My client would prefer not to use Kafka or any Azure equivalent of Kafka (Azure Event Hubs).
In scenarios where the messages will always be generated in sequence, a simple queue is probably all you need.
Azure Queues are pretty simple to get into, but the general mode of operation for queues is to remove the messages as they are processed successfully.
If you can avoid the scenario where you must "roll back" or re-process from an earlier time (that is, if you can avoid the orchestration aspect), this would be a much simpler option.
It's the "go back and replay" that you will struggle with. If you can implement two queues in a sequential pattern, where successfully processing a message from one queue pushes it into the next queue, then you never need to go back, because the secondary consumer can never process ahead of the primary.
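A sketch of that chaining with Azure Storage queues (queue names and the Process stand-in are assumptions): the message is only deleted from the first queue after it has been handed to the second, so the downstream consumer can never run ahead. If the process crashes between the two calls, the message is redelivered, which is the usual at-least-once trade-off.

    using Azure.Storage.Queues;
    using Azure.Storage.Queues.Models;

    string connectionString = "UseDevelopmentStorage=true"; // assumption: local emulator

    var stage1 = new QueueClient(connectionString, "stage1");
    var stage2 = new QueueClient(connectionString, "stage2");

    QueueMessage[] messages = await stage1.ReceiveMessagesAsync(maxMessages: 1);
    foreach (QueueMessage m in messages)
    {
        string processed = Process(m.MessageText);                  // m -> m'
        await stage2.SendMessageAsync(processed);                   // hand off first
        await stage1.DeleteMessageAsync(m.MessageId, m.PopReceipt); // then ack stage1
    }

    static string Process(string message) => message; // stand-in for the real work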
With Azure Event Hubs it is much easier to reset the offset for processing, because the messages stay in the bucket regardless of their read state (in fact, a given message does not have such a state) and the consumer maintains the offset pointer itself. It also has support for multiple consumer groups, each of which gets its own copy of every message.
You can upgrade your plan to retain the data for up to 7 days without blowing the budget.
There are two problems with large-scale telemetry ingestion services like Azure Event Hubs for your use case:
The order of receipt is less reliable for messages that arrive extremely close together. The Hub is designed to receive many messages from many sources concurrently, so its internal architecture cares a lot less about trying to preserve the precise order. It records the precise receipt timestamp on each message, but it does not guarantee that the overall sequence of records will match exactly what you would get by sorting on the receipt timestamp. (It's a subtle but important distinction.)
Event Hubs (and many client processing code examples) are designed to guarantee exactly-once delivery across multiple concurrent consuming threads. Again, the consumers are encouraged to be asynchronous, and the service will try to ensure that failed processing attempts are retried by the next available thread.
So you could use Event Hubs, but you would have to bypass or disable a lot of its features, which is generally a strong signal that it is not the correct fit for your purpose. If you want to explore it, though, you would want to limit the concurrency aspects (a sketch follows the list):
minimise the partition count
You probably want 1 partition for each message producer, or at least for each sequential set; maintaining sequence is simpler inside a single partition
make sure your message sender (producer) only sends to a specific partition
Each producer MUST use a unique partition key
create a consumer group for each of your consumers
process messages one at a time, not in batches
process with a single thread
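In C# with the Azure.Messaging.EventHubs producer (names and connection string are placeholders; the Node.js SDK exposes the same partition-key option), pinning a producer to one partition key looks roughly like this:

    using System;
    using Azure.Messaging.EventHubs;
    using Azure.Messaging.EventHubs.Producer;

    string connectionString = "<event-hubs-connection-string>";
    await using var producer = new EventHubProducerClient(connectionString, "my-hub");

    // Events sharing a partition key land in the same partition, which is
    // the only scope in which Event Hubs maintains ordering.
    var options = new CreateBatchOptions { PartitionKey = "producer-1" };
    using EventDataBatch batch = await producer.CreateBatchAsync(options);
    batch.TryAdd(new EventData(BinaryData.FromString("m1")));
    batch.TryAdd(new EventData(BinaryData.FromString("m2")));
    await producer.SendAsync(batch);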
I have a lot of experience in designing MS Azure based solutions for Industrial IoT (telemetry from PLCs) and Agricultural IoT (Raspberry Pi) device implementations. In almost all cases we think that the order of messaging is important, but unless you are maintaining real-time two-way command and control, you can usually get away with an optimistic approach where each message and any derivatives are, or were, correct at the time of transmission.
If there is the remote possibility that a device can be offline for any period of time, then dealing with the stale data flushing through the system when the device comes back online can really play havoc with sequential logic programming.
Take a step back to analyse your solution. Event Hubs does offer a convenient way to roll processing back to a previous offset, as long as that record is still in the bucket, but can you re-design your logic flow so that you do not have to re-process old data?
What is the requirement that drives this sequence? If it is so important to maintain the sequence, then you should probably process the data with a single consumer that does everything, or look at chaining the queues in a sequential manner.

Architecture issue - Azure servicebus and message order guarantee

OK, so I'm relatively new to the Service Bus. I'm working on a project where we use Azure Service Bus for queueing messages. Our architecture roughly looks like the following:
So the idea is that in our SourceSystem all kinds of stuff happens, which leads to messages being put on the Service Bus topics. Now our responsibility is syncing these events to the external client so they are aware of what we are doing.
Now the issue is that we currently don't use Service Bus sessions, so message order isn't guaranteed. Also consider the following scenario:
OrderCreated
OrderUpdate 1
OrderUpdate 2
OrderClosed
What happens now is that if the external client's API is down for, say, OrderUpdate 1 and OrderUpdate 2, we could potentially send the messages in the order: OrderCreated, OrderClosed, OrderUpdate 1, OrderUpdate 2.
Currently we just retry a message a few times, and then it moves into the dead-letter queue for manual reprocessing.
What steps should we take to better guarantee message order? I feel like within the scope of an order, message order needs to be guaranteed.
Should we force the source system to put all messages for an order in a Service Bus session? But how can we handle this with multiple topics? And what do we do if message 1 from a session ends up in the dead-letter queue?
There are a lot of considerations here. Should we use a single topic so it's easier to manage the sessions? But doesn't this open up other problems, with different message structures being in a single topic?
I'd love to hear your opinions on this.
Have a look at Durable Functions in Azure. You can use the 'Async HTTP API' or one of the other patterns to achieve the orchestration you need.
NServiceBus's Sagas might also be a good option; here is an article that does a very good comparison between NServiceBus and Durable Functions.
If the external client has to receive all those events and order matters, sending those messages to multiple topics, with a topic per message type, will make your mission extremely hard to accomplish. For ordered messaging, you first need to use a single entity (queue or topic) with sessions enabled. That way you can guarantee ordered message processing. In case you have multiple external clients, you'd need a session-enabled entity (topic) per external client.
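A hedged sketch with the current Azure.Messaging.ServiceBus SDK (the entity name is an assumption and must be session-enabled): using the order id as the SessionId keeps all of an order's messages in sequence for a single receiver.

    using Azure.Messaging.ServiceBus;

    string connectionString = "<service-bus-connection-string>";
    await using var client = new ServiceBusClient(connectionString);

    // Sender: every message that belongs to the same order shares a SessionId.
    ServiceBusSender sender = client.CreateSender("order-events");
    await sender.SendMessageAsync(
        new ServiceBusMessage("OrderCreated") { SessionId = "order-42" });

    // Receiver: locks the session and processes its messages strictly in order.
    ServiceBusSessionReceiver receiver = await client.AcceptNextSessionAsync("order-events");
    ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();
    await receiver.CompleteMessageAsync(message);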
Another option is to implement a pattern known as Process Manager. The process manager would be responsible for making decisions about the incoming messages and concluding whether the work for a given order is completed or not.
There are also libraries (MassTransit, NServiceBus, etc) that can help you. NServiceBus implements Process Manager via a feature called Saga (tutorial) and MassTransit has it as well (documentation).

How to handle publishing event when message broker is out?

I'm wondering how I can handle sending events when the message broker suddenly goes down. Please take a look at this code:
using (var uow = uowProvider.Create())
{
    ...
    ...

    var policy = offer.Buy(customer);
    uow.Policies.Add(policy);

    // DB changes are saved here! But what would happen if...
    await uow.CommitChanges();

    // ...eventPublisher threw an exception?
    await eventPublisher.PublishMessage(PolicyCreated(policy));

    return true;
}
IMHO if eventPublisher throws an exception, the PolicyCreated event won't be published. I don't know how to deal with this situation. The event must be published in the system. I suppose the only good solution would be creating some kind of retry mechanism, but I'm not sure...
I would like to elaborate a bit on the answers provided by both @Imran Arshad and @VoiceOfUnreason, which are, of course, correct.
There are basically 3 patterns when it comes to publishing messages:
exactly once delivery (requires distributed transactions)
at most once delivery (no distributed transaction but may miss messages - like the actor model)
at least once delivery (no distributed transaction but may have duplicate messages)
The following is all in terms of your example.
For exactly-once delivery, both the database and the queue would need to provide the ability to enlist in distributed transactions. Some queues do not provide this functionality out of the box (like RabbitMQ), and even though it may be possible to roll your own, it may not be the best option. Distributed transactions are typically quite slow.
For at-most-once delivery, we have to accept that we may miss messages, and I'm guessing that in most use cases this is quite troublesome. You would get around this by tracking progress, picking up the missed messages, and resending them if required.
For at-least-once delivery, we need to ensure that the messages are idempotent. When we get a duplicate message (usually quite an edge case), it should be ignored, or its outcome should be the same as that of the initial message.
Now, there are a couple of ways around your issue. You could start a database transaction and make your database changes, and just before you commit, perform the message sending. Should the sending fail, your transaction would be rolled back. That works fine for a single message, but in your case some subscribers may have already received a message. This complicates matters, as all your subscribers need to receive the message or none of them should.
You could have your subscriber check whether the state is indeed true and whether it should continue processing. This places a burden on the subscriber and introduces some coupling. It could either postpone the action should the state not allow processing, or ignore it.
Another option is that instead of publishing the event you send yourself a command that indicates completion of the step. The command handler would perform the publishing and retry until all subscriber queues receive the message. This would require the relevant subscribers to ignore those messages that they had already processed (idempotence).
The outbox is a store-and-forward approach and will eventually send the message to all subscribers. You could perhaps have your outbox included in the database transaction. In my Shuttle.Esb service bus, one of the folks who used it came across a weird side effect that I had not planned: he used a SQL-based queue as an outbox, and since the queue connection was to the same database, it was included in the database transaction and would roll back with all the other changes if not committed. Apologies for promoting my own product, but I'm sure other service bus offerings may have the same functionality.
There are therefore quite a few things to consider and various techniques to mitigate the risk of a queue outage. I would, however, move the queue interaction to before the database commit.
For a reliable system you need to save events locally. If your broker is down, you have to retry and publish the event.
There are many ways to achieve this, but the most common is the outbox pattern. Just like your mailbox, your event/message stays local, and you keep retrying until it's sent and you mark the message as published in your local DB.
You can read more about it here: Publish Events
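A minimal outbox sketch (table and column names are my assumptions): the event row commits atomically with the business change, and a separate relay keeps retrying the publish until it succeeds.

    using System.Text.Json;
    using Microsoft.Data.SqlClient;

    public static class PolicyService
    {
        public static void SaveWithEvent(string connStr, object policyCreated)
        {
            using var conn = new SqlConnection(connStr);
            conn.Open();
            using var tx = conn.BeginTransaction();

            // 1. Persist the business change here, on the same transaction (elided).

            // 2. Persist the event in the SAME transaction.
            using var cmd = new SqlCommand(
                "INSERT INTO Outbox (Payload, Published) VALUES (@p, 0)", conn, tx);
            cmd.Parameters.AddWithValue("@p", JsonSerializer.Serialize(policyCreated));
            cmd.ExecuteNonQuery();

            tx.Commit(); // either both are saved or neither is
        }
    }

A background relay then reads rows with Published = 0, publishes them to the broker, and flips the flag only on success, retrying otherwise.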
You'll want to review Udi Dahan's discussion of Reliable Messaging without Distributed Transactions.
But very roughly, the PolicyCreated event becomes part of the unit of work; either because it is saved in the Policy representation itself, or because it is saved in an EventRepository that participates in the same transaction as the Policies repository.
Once you've captured the information in your database, retrying the publish is relatively straightforward: read the events from the database, publish, and optionally mark the events in the database as successfully published so that they can be cleaned up.

Distributed pub/sub with single consumer per message type

I have no clue if it's better to ask this here, or over on Programmers.SE, so if I have this wrong, please migrate.
First, a bit about what I'm trying to implement. I have a node.js application that takes messages from one source (a socket.io client), and then does processing on the message, which might result in zero or more messages back out, either to the sender, or other clients within that group.
For the processing, I would like to essentially just shove the message into a queue, then it works its way through various message processors that might kick off their own items, and eventually, the bit running socket.io is informed "Hey, send this message back"
As a concrete example, say a user signs into the service. That sign-in message is then placed in the queue, where the authorization processor gets it, does its thing, then places a message back in the queue saying the client's been authorized. This goes back to the socket.io socket that is connected to the client, along with other clients that might be interested. It can also go to other subsystems that might want to do more processing on authorization (looking up user info, sending more info to the client based on their data, etc.).
If I wanted strong coupling, this would be easy, but I tried that before and it just led to a mess of spaghetti code that's very fragile, which I would like to avoid. Another wrench in the setup is that this should be cluster-able, which is where the real problem comes in. There might be more than one, say, authorization processor running, but the authorization message should be processed only once.
So, in short, I'm looking for a pattern/technique that will allow me to, essentially, have multiple "groups" of subscribers for a message, and the message will be processed only once per group.
I thought about maybe having each instance of a processor generate a unique name that would be used as a list in Redis. This name would then be registered with some sort of dispatch handler and placed into a set for that group of subscribers. Then, when a message arrives, the dispatcher pulls a random member out of that set and places the message into that member's list. While it seems like this would work, it seems somewhat over-complicated and fragile.
The core problem is I've never designed a system like this, so I'm not even sure the proper terms to use or look up. If anyone can point me in the right direction for this, I would be most appreciative.
I think what you're describing is similar to the https://www.getbridge.com/ service. I tried it but ended up writing my own based on ZeroMQ; it allows you to register services (request/reply) and channels, which are pub/sub workers.
As for the design, I used client -> broker -> services & channels, which are all plug-and-play using auto-discovery. The services register their schema with the brokers, which open a TCP connection so that brokers on other servers can communicate with that broker group's services. Internal services and clients then connect via Unix sockets or IPC channels, whichever is preferred.
I ended up wrapping the Redis publish/subscribe functions a bit to do this. Each type of message processor gets a "group name", and there can be multiple instances of the processor within that group (so multiple instances of the program can run for clustering).
When publishing a message, I generate an incremental ID, store the message in a string key with that ID, then publish the message ID.
On the receiving end, the first thing the subscriber does is attempt to add the message ID it just got from the publisher into a set of received messages for that group, using SADD. If SADD returns 0, the message has already been grabbed by another instance, and it just returns. If it returns 1, the full message is pulled out of the string key and sent to the listener.
Of course, this relies on Redis being single-threaded, which I imagine will continue to be the case.
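For reference, the same SADD-based claim can be sketched in C# with StackExchange.Redis (the original is node.js; the key layout and group name follow the description above, and Handle is a placeholder):

    using StackExchange.Redis;

    var mux = ConnectionMultiplexer.Connect("localhost");
    IDatabase db = mux.GetDatabase();
    string group = "authorization";

    mux.GetSubscriber().Subscribe("messages", (channel, idValue) =>
    {
        // SetAdd returns true only for the first instance in the group to claim the id.
        if (db.SetAdd($"received:{group}", idValue))
        {
            string payload = db.StringGet($"message:{idValue}");
            Handle(payload);              // placeholder for the real processor
        }
    });

    static void Handle(string payload) { /* process the message */ }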
What you might be looking for is an AMQP protocol implementation, where you can have queues with custom exchanges and implement a pub/sub model.
RabbitMQ - a popular AMQP protocol implementation with lots of libraries
It also has a node.js library.
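For the "processed only once per group" requirement specifically, RabbitMQ's competing-consumers pattern fits: bind one queue per group to a fanout exchange and let all instances of a group consume the same queue. A sketch with the RabbitMQ.Client library (all names are placeholders):

    using RabbitMQ.Client;
    using RabbitMQ.Client.Events;

    var factory = new ConnectionFactory { HostName = "localhost" };
    using IConnection connection = factory.CreateConnection();
    using IModel channel = connection.CreateModel();

    // Every group gets its own queue bound to the same fanout exchange,
    // so each group sees every message exactly once as a group.
    channel.ExchangeDeclare("events", ExchangeType.Fanout);
    channel.QueueDeclare("authorization", durable: true, exclusive: false, autoDelete: false);
    channel.QueueBind("authorization", "events", routingKey: "");

    // All instances of the authorization processor consume the SAME queue,
    // so any single message is delivered to exactly one of them.
    var consumer = new EventingBasicConsumer(channel);
    consumer.Received += (sender, ea) => { /* process ea.Body */ };
    channel.BasicConsume("authorization", autoAck: true, consumer);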

Detect and Delete Orphaned Queues, Topics, or Subscriptions on Azure Service Bus

If there are no longer any publishers or subscribers reading nor writing to a Queue, Topic, or Subscription, because of crashes or other abnormal terminations (instance restart, etc.), is that Queue/Topic/Subscription effectively orphaned?
I tested this by creating a few Queues, and then terminating the applications. Those Queues were still on the Service Bus a long time later. It seems that they will just stay there forever. That would be wonderful if we WANTED that behavior, but in this case, we do not.
How can we detect and delete these Queues, Topics, and Subscriptions? They will count towards Azure limits, etc, and we cannot have these orphaned processes every time an instance is restarted/patched/crashes.
If it helps make the question clearer, this is a unique situation in which the Queues/Topics/Subscriptions have special names, or special Filters, and a very limited set of publishers (1) and subscribers (1) for a limited time. This is not a case where we want survivability. These are instance-specific response channels. Whether we use Queues or Subscriptions is immaterial. If the instance is gone, so is the need for that Queue (or Subscription).
This is part of a solution where each web role has a dedicated response channel that it monitors. At any time, this web role may have dozens of requests pending via other messaging channels (Queues/Topics), and it is waiting for the answers on multiple threads. We need the response to come back to the thread that placed the message, so that the web role can respond to the caller. It is no good in this situation to simply have a Subscription based on the machine, because it will be receiving messages for other threads. We need each publishing thread to establish a dedicated response channel, so that the only thing on that channel is the response for that thread.
Even if we use Subscriptions (with some kind of instance-related filter) to do a long-polling receive operation on the Subscription, if the web role instance dies, that Subscription will be orphaned, correct?
This question can be boiled down like so:
If there are no more publishers or subscribers to a Queue/Topic/Subscription, then that service is effectively orphaned. How can those orphans be detected and cleaned up?
In this scenario you are looking for the Queues/Subscriptions to be "dynamic" in nature: created and removed based on use, as opposed to the current explicit provisioning model for these entities. Service Bus provides you with the APIs to perform create/delete operations, so you can hook these into role OnStart/OnStop events as appropriate. If those operations fail for some reason, the orphaned entities will remain; you can then run a cleanup operation on them based on some unique identifier in the entities' names. An example of this can be seen here: http://windowsazurecat.com/2011/08/how-to-simplify-scale-inter-role-communication-using-windows-azure-service-bus/
In the near future we will add more metadata and query capabilities to Queues/Topics/Subscriptions so you can see when they were last accessed and make cleanup decisions.
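That metadata has since shipped. A hedged sketch with today's Azure.Messaging.ServiceBus.Administration client (which post-dates this answer): delete any queue that has not been accessed within some idle threshold.

    using System;
    using Azure.Messaging.ServiceBus.Administration;

    string connectionString = "<service-bus-connection-string>";
    var admin = new ServiceBusAdministrationClient(connectionString);

    await foreach (QueueProperties queue in admin.GetQueuesAsync())
    {
        QueueRuntimeProperties runtime =
            (await admin.GetQueueRuntimePropertiesAsync(queue.Name)).Value;

        // The one-day threshold is arbitrary; tune it to your instance lifetime.
        if (DateTimeOffset.UtcNow - runtime.AccessedAt > TimeSpan.FromDays(1))
            await admin.DeleteQueueAsync(queue.Name);
    }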
Service Bus Queues are built on the "brokered messaging" infrastructure, designed to integrate applications or application components that may span multiple communication protocols, data contracts, trust domains, and/or network environments. This allows for a mechanism to communicate reliably with durable messaging.
If a client (publisher) sends a message to a Service Bus queue and then crashes, the message will be stored on the queue until a consumer reads it off. Also, if your consumer dies and restarts, it will just poll the queue and pick up any work that is waiting for it (you can scale out and have multiple consumers reading from the queue to increase throughput). Service Bus Queues allow you to decouple your applications via a durable cloud gateway, analogous to MSMQ on-premises (or other queuing technology).
What I'm really trying to say is that you won't get an orphaned queue; you might get poisoned messages that you will need to handle. This blog post gives some very detailed information on Service Bus Queues and their capacity and quotas, which might give you a better understanding: http://msdn.microsoft.com/en-us/library/windowsazure/hh767287.aspx
Re: queue management, you can do this via Visual Studio (1.7 SDK & Tools), or there is an excellent tool called Service Bus Explorer that will make your life easier for queue management: http://code.msdn.microsoft.com/windowsazure/Service-Bus-Explorer-f2abca5a
*Note: the default maximum number of queues is 10,000 (per service namespace; this can be increased via a support call).
As Abhishek Lai mentioned, there is no orphan-detection capability supported.
Orphan detection can be implemented externally in multiple ways.
For example, whenever you send/receive a message, update a timestamp in a SQL database to indicate that the queue/topic/subscription is still active. This timestamp can then be used to determine orphans.
If your process crashes, which is very much possible, there will be an issue with message delivery within the queue; however, the queue will still be available to process your requests. Handling application crashes and unreadable messages with Windows Azure Service Bus queues is described here:
The Service Bus provides functionality to help you gracefully recover from errors in your application or difficulties processing a message. If a receiver application is unable to process the message for some reason, then it can call the Abandon method on the received message (instead of the Complete method). This will cause the Service Bus to unlock the message within the queue and make it available to be received again, either by the same consuming application or by another consuming application.
In the event that the application crashes after processing the message but before the Complete request is issued, then the message will be redelivered to the application when it restarts. This is often called At Least Once Processing, that is, each message will be processed at least once but in certain situations the same message may be redelivered. If the scenario cannot tolerate duplicate processing, then application developers should add additional logic to their application to handle duplicate message delivery. This is often achieved using the MessageId property of the message, which will remain constant across delivery attempts.
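With the current Azure.Messaging.ServiceBus SDK, the Complete/Abandon flow plus the MessageId-based duplicate check described above could be sketched as follows (the in-memory seen-id set stands in for a durable store):

    using System.Collections.Generic;
    using Azure.Messaging.ServiceBus;

    await using var client = new ServiceBusClient("<service-bus-connection-string>");
    ServiceBusReceiver receiver = client.CreateReceiver("requests");
    var seenMessageIds = new HashSet<string>();   // stand-in for a durable store

    ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();
    try
    {
        // MessageId remains constant across delivery attempts, so it can be
        // used to detect an at-least-once redelivery of a processed message.
        if (seenMessageIds.Add(message.MessageId))
            Process(message);

        await receiver.CompleteMessageAsync(message);  // done: remove from queue
    }
    catch
    {
        // Unlock the message so it can be received and processed again.
        await receiver.AbandonMessageAsync(message);
    }

    static void Process(ServiceBusReceivedMessage m) { /* handle the message */ }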
If there are no longer any processes reading nor writing to a queue, because of crashes or other abnormal terminations (instance restart, etc.), is that queue effectively orphaned?
No, the queue is in place to allow communication to occur via brokered messages. If all your apps die for some reason, the queue still exists and will be there when they come alive again; it's the communication channel for loosely coupled applications. Regarding billing: "Messages are charged based on the number of messages sent to, or delivered by, the Service Bus during the billing month", so you won't be charged if a queue exists but nobody is using it.
I tested this by creating a few queues, and then terminating the applications. Those queues were still on the machine a long time later.
The whole point of the queue is to guarantee message delivery for loosely coupled applications. Think of the queue as an entity or application in its own right, with high availability (SLA), as it's hosted in Azure; your producers/consumers can die/restart and the queue will remain active in Azure. *Note: I got a bit confused by your wording re: "still on the machine a long time later"; the queue doesn't actually live on your machine, it sits up in Azure in a designated Service Bus namespace. You can view and manage the queues via the tools I pointed out in the previous answer.
How can we detect and delete these queues, as they will count towards Azure limits, etc.?
As stated above, the default maximum number of queues is 10,000 (per service namespace; this can be increased via a support call), and queue management can be done via the tools mentioned in the other answer. You should only be looking to delete queues when you no longer have producers/consumers looking to write to them (i.e. never again). You can of course create and delete queues from your producer/consumer applications via namespaceManager.QueueExists; more information here: How to Use Service Bus Queues
If it helps make the question clearer, this is a unique situation in which the queues have special names, and a very limited set of publishers (1) and subscribers (1) for a limited time.
It sounds like you need to use Topics & Subscriptions (How to Use Service Bus Topics/Subscriptions); this link also has a section on 'How to Delete Topics and Subscriptions'. If you have a very limited lifetime, then you could handle topic creation/deletion in your apps; otherwise you could have a separate Queue/Topic/Subscription setup/deletion script to handle this logic...
