Production-ready WebSocket IOT System - Software Design Advice

Production-ready WebSocket IOT System - Software Design Advice - node.js

I'm developing a system to control a range of IoT devices. Each set of devices is grouped into a "system" that monitors/controls a real-world process. For example system A may be managing process A and have:
3 cameras
1 accelerometer
1 magnetometer
5 thermocouples
The webserver maintains socket connections to each device. Users can connect (via a UI - again with WebSockets) to the webserver and receive updates about systems to which they are subscribed.
When a user wants to begin process A, they should press a 'start' button on the interface. This will start up the cameras, accelerometer, magnetometer, and thermocouples. These will begin sending data to the server. It also triggers the server to set the recording mode to true for each device, which means the server will write output to a database. My question:
Should I send a single 'start' request from javascript code in my UI to the server, and allow the server to start each device individually (how do I then handle an error, for example, if a single sensor isn't working - what about if two sensors don't work?). Or do I send individual requests from the UI to the server for each device, i.e. start camera 1, start camera 2, start accelerometer, start recording camera1, etc. and handle each success/error state individually?
My preference throughout the system so far has been the latter approach - one request, one response; with an HTTP error code. However, programming becomes more complex when there are many devices to control, for example - System B has 12 thermocouples.
Some components of the system are not vital - e.g. if 1 camera fails we can continue, however, if the accelerometer fails the whole system cannot run and so human monitoring is required. If the server started the devices individually from a single 'start' message, should I return an array of errors, or should the server know which components are vital and return a single error if a vital component fails? And in a failure state, should the server then handle stopping each sensor and returning to the original state - and what if that then fails? I foresee this code becoming quite complex with this approach.
I've been going back and forth over the best way to approach this for months, but I can't find much advice online around building complex, production-ready IoT systems for the real world. If anybody has any advice or could point me towards any papers/books/etc. I would really appreciate it.
Thanks in advance,
Tom

Related

FreeRadius in combination with a vulnerability scan / software status check

What i have:
I am running a freeradius server fully configured of how i need it to be. Everything works just fine right now.
What i need:
I need the radius to put the devices in a seperate vlan before authentication and to run a vulnerability scan (nessus / openvas etc) on the devices in this vlan to check for software status ( antivirus etc. )
if the device passes the test the authentication should be done normaly.
if it fails it should be put into a third ( fourth if you count the unauth-vid ) vlan.
can someone tell me if this is doable in freeradius ?
thanks in advance for your answers

Yes. But this is a very broad question and is dependent on the networking equipment being used. I'll give you an overview of how I'd design such a system.
In general, you'll have an easier time if you can use the same DHCP server/IP range for your NAC and full access VLAN. That means you don't have to signal the higher networking layers in the client that there's been a state change, you can swap out VLANs behind the scenes to change what they can access.
You'd set up a database with an entry for each client. This doesn't have to be pre-populated, it could be populated during the first auth attempt. Part of each client entry would be a status field detailing when they last completed NAC.
You'd also need an accounting database, to store information about where each client is connected to the network.
If the client had never completed NAC checks before, you'd assign the client to the NAC VLAN, and signal your NAC processes to start interrogating it.
FreeRADIUS can act as both a RADIUS and DHCPv4 server, so you'd probably do signal the NAC process from the DHCPv4 side because then you'd know what IP the client received.
Binding the RADIUS and DHCPv4 sides can be done in a couple of ways. The most obvious is MAC, another common way is NAS/Port ID using the accounting table.
Once the NAC checks had completed, you'd have the NAC process write out a receipt in detail file format, and have that read back in by a detail file listener (there are examples of this in sites-available/ in the 'decoupled-accounting' virtual server files). When reading those entries back in, you'd change the state in the database, and send a CoA packet to the switch using information from the accounting database to identify the client. This would flip the VLAN and allow them to the standard set of networking resources.
I know this is very high level, documenting it properly would probably exceed StackOverflow's character limit. If you need more help with this, I suggest you research what I've described above and then start asking the RADIUS related questions on the FreeRADIUS user's mailing list https://freeradius.org/support/.

How to implement concurrency or context-switching in NodeJS

So I have this API endpoint called www.example.com/endpoint on which many devices post(I work in an IOT firm). We have implemented our whole backed in NodeJS and are stuck while scaling from 1 device to 'n' number of devices. The devices post their packets at this API endpoint, from where I execute a complex bit of code(arnd 1000 lines) and save the state of the device in the database(mongoDB). Now the issue is. Whenever I receive a packet from device 1 and I am executing it and in the middle I get a packet from device 2, NodeJS leaves the device 1 execution as it is and starts serving the packet 2 from device 2, I saw this when I put extensive console.log() statements
Now in an ideal world. I would want Node to save the context of my current progress with packet 1. then leave. and go on to save the packet 2 in a queue to be processed later. Once I am done with packet 1 I shall take up packet 2 and process it.
I know libraries like RabbitMQ and kue for storing it in queue and processing it later, but how do I context switch from one execution to another?
This is my way of thinking. There could be other solutions as well. Would like to hear your thoughts on the matter.

Q: How to implement concurrency or context-switching in NodeJS.
A: Short answer: Not possible. Because Javascript is single threaded.
Q: Now the issue is. Whenever I receive a packet from device 1 and I am executing it and in the middle I get a packet from device 2, NodeJS leaves the device 1 execution as it is and starts serving the packet 2 from device 2, I saw this when I put extensive console.log() statements
A: As you might have already read in numerous places that NodeJS is based on an event-driven model that is non-blocking for I/O.
The reason why Node seems to have ditched device1 midway to serve device2 was because the code for device1 has already been processed up till a point where it is just waiting on an asynchronous function to callback. E.g. performing a database write. So meantime while it is available, it went on to service device2
Similar case for device2 - once it hits an async function where an event gets pushed into the event queue, pending for a return. Node might go back to device1 if a response has come back. Or it could be other devices, deviceN.
We say NodeJS is non-blocking because the node process does not lock the entire web application down for a sole response. Instead it move on and pick the next event (essentially a block of code) from the queue to run it. Hence it is constantly busy, unless there is really nothing available on the event queue.
Q: I know libraries like RabbitMQ and kue for storing it in queue and processing it later, but how do I context switch from one execution to another?
A:
As said earlier. as of 2016 - it is still not possible for Javascript to do threading. NodeJS is not designed for heavy computation work, it should only be focused on serving requests therefore the code should preferably be light and non-blocking. Basically you will want to leave those heavy I/O duties like writing to file or databases or making HTTP requests (network) to other processes by wrapping the calls with async functions.
NodeJS is not a silver bullet technology. If your application is expected to do a lot of computational work on the event thread then Node is probably not a good choice of technology but it is not the end of the world - as you can fork your own child process for the heavy computational jobs.
See:
https://nodejs.org/api/child_process.html
You might also want to consider alternative like Java which has NIO and Threading capabilities.

How to enforce the order of messages passed to an IoT device over MQTT via a cloud-based system (API design issue)

Suppose I have an IoT device which I'm about to control (lets say switch on/off) and monitor (e.g. collect temperature readings). It seems MQTT could be the right fit. I could publish messages to the device to control it and the device could publish messages to a broker to report temperature readings. So far so good.
The problems start to occur when I try to design the API to control the device.
Lets day the device subscribes to two topics:
/device-id/control/on
/device-id/control/off
Then I publish messages to these topics in some order. But given the fact that messaging is typically an asynchronous process there are no guarantees on the order of messages received by the device.
So in case two messages are published in the following order:
/device-id/control/on
/device-id/control/off
they could be received in the reversed order leaving the device turned on, which can have dramatic consequences, depending on the context.
Of course the API could be designed in some other way, for example there could be just one topic
/device-id/control
and the payload of individual messages would carry the meaning of an individual message (on/off). So in case messages are published to this topic in a given order they are expected to be received in the exact same order on the device.
But what if the order of publishes to individual topics cannot be guaranteed? Suppose the following architecture of a system for IoT devices:
/ control service \
application -> broker -> control service -> broker -> IoT device
\ control service /
The components of the system are:
an application which effectively controls the device by publishing messages to a broker
a typical message broker
a control service with some business logic
The important part is that as in most modern distributed systems the control service is a distributed, multi instance entity capable of processing multiple control messages from the application at a time. Therefore the order of messages published by the application can end up totally mixed when delivered to the IoT device.
Now given the fact that most MQTT brokers only implement QoS0 and QoS1 but no QoS2 it gets even more interesting as such control messages could potentially be delivered multiple times (assuming QoS1 - see https://stackoverflow.com/a/30959058/1776942).
My point is that separate topics for control messages is a bad idea. The same goes for a single topic. In both cases there are no message delivery order guarantees.
The only solution to this particular issue that comes to my mind is message versioning so that old (out-dated) messages could simply be skipped when delivered after another message with more recent version property.
Am I missing something?
Is message versioning the only solution to this problem?

Am I missing something?
Most definitely. The example you brought up is a generic control system, being attached to some message-oriented scheme. There are a number of patterns that can be used when referring to a message-based architecture. This article by Microsoft categorizes message patterns into two primary classes:
Commands and
Events
The most generic pattern of command behavior is to issue a command, then measure the state of the system to verify the command was carried out. If you forget to verify, your system has an open loop. Such open loops are (unfortunately) common in IT systems (because it's easy to forget), and often result in bugs and other bad behaviors such as the one described above. So, the proper way to handle a command is:
Issue the command
Inquire as to the state of the system
Evaluate next action
Events, on the other hand, are simply fired off. As the publisher of an event, it is not my business to worry about who receives the event, in what order, etc. Now, it should also be pointed out that the use of any decent message broker (e.g. RabbitMQ) generally carries strong guarantees that messages will be delivered in the order which they were originally published. Note that this does not mean they will be processed in order.
So, if you treat a command as an event, your system is guaranteed to act up sooner or later.
Is message versioning the only solution to this problem?
Message versioning typically refers to a property of the message class itself, rather than a particular instance of the class. It is often used when multiple versions of a message-based API exist and must be backwards-compatible with one another.
What you are instead referring to is unique message identifiers. Guids are particularly handy for making sure that each message gets its own unique id. However, I would argue that de-duplication in message-based architectures is an anti-pattern. One of the consequences of using messaging is that duplicates are possible, so you should try to design your system behaviors to be stateless and idempotent. If this is not possible, it should be considered that messaging may not be the correct communication solution for the need.
Using the command-event dichotomy as an example, you could perform the following transaction:
The controller issues the command, assigning a unique identifier to the command.
The control system receives the command and turns on.
The control system publishes the "light on" event notification, containing the unique id of the command that was used to turn on the light.
The controller receives the notification and correlates it to the original command.
In the event that the controller doesn't receive notification after some timeout, the controller can retry the command. Note that "light on" is an idempotent command, in that multiple calls to it will have the same effect.

When state changes, send the new state immediately and after that periodically every x seconds. With this solution your systems gets into desired state, after some time, even when it temporarily disconnects from the network (low battery).
BTW: You did not miss anything.

Apart from the comment that most brokers don't support QOS2 (I suspect you mean that a number of broker as a service offerings don't support QOS2, such as Amazon's AWS IoT service) you have covered most of the major points.
If message order really is that important then you will have to include some form of ordering marker in the message payload, be this a counter or timestamp.

SocketIO scaling architecture and large rooms requirements

We are using socketIO on a large chat application.
At some points we want to dispatch "presence" (user availability) to all other users.
io.in('room1').emit('availability:update', {userid='xxx', isAvailable: false});
room1 may contains a lot of users (500 max). We observe a significant raise in our NodeJS load when many availability updates are triggered.
The idea was to use something similar to redis store with Socket IO. Have web browser clients to connect to different NodeJS servers.
When we want to emit to a room we dispatch the "emit to room1" payload to all other NodeJS processes using Redis PubSub ZeroMQ or even RabbitMQ for persistence. Each process will itself call his own io.in('room1').emit to target his subset of connected users.
One of the concern with this setup is that the inter-process communication may become quite busy and I was wondering if it may become a problem in the future.
Here is the architecture I have in mind.

Could you batch changes and only distribute them every 5 seconds or so? In other words, on each node server, simply take a 'snapshot' every X seconds of the current state of all users (e.g. 'connected', 'idle', etc.) and then send that to the other relevant servers in your cluster.
Each server then does the same, every 5 seconds or so it sends the same message - of only the changes in user state - as one batch object array to all connected clients.
Right now, I'm rather surprised you are attempting to send information about each user as a packet. Batching seems like it would solve your problem quite well, as it would also make better use of standard packet sizes that are normally transmitted via routers and switches.

You are looking for this library:
https://github.com/automattic/socket.io-redis
Which can be used with this emitter:
https://github.com/Automattic/socket.io-emitter

About available users function, I think there are two alternatives,you can create a "queue Users" where will contents "public data" from connected users or you can use exchanges binding information for show users connected. If you use an "user's queue", this will be the same for each "room" and you could update it when an user go out, "popping" its state message from queue (Although you will have to "reorganize" all queue message for it).
Nevertheless, I think that RabbitMQ is designed for asynchronous communication and it is not very useful approximation have a register for presence or not from users. I think it's better for applications where you don't know when the user will receive the message and its "real availability" ("fire and forget architectures"). ZeroMQ require more work from zero but you could implement something more specific for your situation with a better performance.
An publish/subscribe example from RabbitMQ site could be a good point to begin a new design like yours where a message it's sent to several users at same time. At summary, I will create two queues for user (receive and send queue messages) and I'll use specific exchanges for each "room chat" controlling that users are in each room using exchange binding's information. Always you have two queues for user and you create exchanges to binding it to one or more "chat rooms".
I hope this answer could be useful for you ,sorry for my bad English.

This is the common approach for sharing data across several Socket.io processes. You have done well, so far, with a single process and a single thread. I could lamely assume that you could pick any of the mentioned technologies for communicating shared data without hitting any performance issues.
If all you need is IPC, you could perhaps have a look at Faye. If, however, you need to have some data persisted, you could start a Redis cluster with as many Redis masters as you have CPUs, though this will add minor networking noise for Pub/Sub.

Cloud Architecture On Azure for Internet of Things

I'm working on a server architecture for sending/receiving messages from remote embedded devices, which will be hosted on Windows Azure. The front-facing servers are going to be maintaining persistent TCP connections with these devices, and I need a way to communicate with them on the backend.
Problem facts:
Devices: ~10,000
Frequency of messages device is sending up to servers: 1/min
Frequency of messages originating server side (e.g. from user actions, scheduled triggers, etc.): 100/day
Average size of message payload: 64 bytes
Upward communication
The devices send up messages very frequently (sensor readings). The constraints for that data are not very strong, due to the fact that we can aggregate/insert those sensor readings in a batched manner, and that they don't require in-order guarantees. I think the best way of handling them is to put them in a Storage Queue, and have a worker process poll the queue at intervals and dump that data. Of course, I'll have to be careful about making sure the worker process does this frequently enough so that the queue doesn't infinitely back up. The max batch size of Azure Storage Queues is 32, but I'm thinking of potentially pulling in more than that: something like publishing to the data store every 1,000 readings or 30 seconds, whichever comes first.
Downward communication
The server sends down updates and notifications much less frequently. This is a slightly harder problem, as I can see two viable paradigms here (with some blending in between). Could either:
Create a Service Bus Queue for each device (or one queue with thousands of subscriptions - limit is for number of queues is 10,000)
Have a state table housed in a DB that contains the latest "state" of a specific message type that the devices will get sent to them
With option 1, the application server simply enqueues a message in a fire-and-forget manner. On the front-end servers, however, there's quite a bit of things that have to happen. Concerns I can see include:
Monitoring 10k queues (or many subscriptions off of a queue - the
Azure SDK apparently reuses connections for subscriptions to the same
queue)
Connection Management
Should no longer monitor a queue if device disconnects.
Need to expire messages if device is disconnected for an extended period of time (so that queue isn't backed up)
Need to enable some type of "refresh" mechanism to update device's complete state when it goes back online
The good news is that service bus queues are durable, and with sessions can arrange messages to come in a FIFO manner.
With option 2, the DB would house a table that would maintain state for all of the devices. This table would be checked periodically by the front-facing servers (every few seconds or so) for state changes written to it by the application server. The front-facing servers would then dispatch to the devices. This removes the requirement for queueing of FIFO, the reasoning being that this message contains the latest state, and doesn't have to compete with other messages destined for the same device. The message is ephemeral: if it fails, then it will be resent when the device reconnects and requests to be refreshed, or at the next check interval of the front-facing server.
In this scenario, the need for queues seems to be removed, but the DB becomes the bottleneck here, and I fear it's not as scalable.
These are both viable approaches, and I feel this question is already becoming too large (although I can provide more descriptions if necessary). Just wanted to get a feel for what's possible, what's usually done, if there's something fundamental I'm missing, and what things in the cloud can I take advantage of to not reinvent the wheel.

If you can identify the device (may be device id/IMEI/Mac address) by the the message it sends then you can reduce the number of queues from 10,000 to 1 queue and not have 10000 subscriptions too. This could also help you in the downward communication as you will be able to identify the device and send the message to the appropriate socket.
As you mentioned the connections last longer you could deliver the command to the device that is connected and decide what to do with the commands to the device that are not connected.
Hope it helps

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string