I'm new to distributed systems and am trying to find a fitting architecture and frameworks for my scenario.
I have a scenario where many heterogeneous computers shall be connected as a distributed system. Some computers mine data and store it locally, others want to access that data for visualization. The data must be transferred regularly and can be quite large (several megabytes) in order to create plots from it. I also want to be able to start services (e.g. data visualization) on specific computers.
Can I transfer the data efficiently with a MoM like ActiveMQ? For example, my data visualization services would subscribe to a topic carrying the needed data and the data miners would publish to it. Would that be fast enough for live data updating 10 times per second?
Would an RPC framework be more efficient?
Can I combine an overlaying MoM with an RPC framework or a plain socket connection for the data transfer? Or does this make sense at all?
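To make the pub/sub idea concrete, here is a minimal sketch assuming ActiveMQ's STOMP connector (port 61613) and the Node.js stompit client; the topic name and payload shape are invented for illustration:

```typescript
// Hypothetical sketch: one process publishes mined data to an ActiveMQ topic,
// another subscribes to it for visualization, via ActiveMQ's STOMP connector.
// (stompit ships no TypeScript types; a local declaration may be needed.)
import stompit = require('stompit');

stompit.connect({ host: 'broker-host', port: 61613 }, (error, client) => {
  if (error) { console.error('connect failed', error); return; }

  // Data miner side: publish one JSON payload to the topic.
  const frame = client.send({
    destination: '/topic/mined-data',
    'content-type': 'application/json',
  });
  frame.write(JSON.stringify({ series: 'cpu-load', values: [0.4, 0.7, 0.5] }));
  frame.end();

  // Visualization side: subscribe and read each message body as it arrives.
  client.subscribe({ destination: '/topic/mined-data', ack: 'auto' }, (err, message) => {
    if (err) { return; }
    message.readString('utf-8', (readErr, body) => {
      if (!readErr && body) {
        // parse JSON and update the plot
      }
    });
  });
});
```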
We want to develop a dashboard to analyze geospatial data.
This is a small example that is close to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to use (the frontend will be D3.js, DC.js, Leaflet.js, ...).
Between Django and Node.js, we think we will use Node.js, because we've read that it's faster than Django for this kind of task. But we are not sure, and we are open to ideas.
Regarding Mongo vs. Cassandra, though, we are quite confused. Our data is mostly structured, so storing it in tables the way Cassandra does would make it easy to manage, and Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
What suggestions can you give us to achieve our goal?
TL;DR summary:
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will also include images, GPS arrays, IoT sensor readings, and geographical data (vector polygons & rasters).
The databases will receive a high write load coming from sensors.
Dashboard performance is very important. It's more important to read data in real time than to keep it uncorrupted/secure.
Most calculations will be done in the client's browser; the server will try to avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
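To make the "known queries" point concrete, here is a minimal sketch of query-first modelling with the Node.js cassandra-driver; the keyspace, table, and columns are made up for illustration:

```typescript
// Hypothetical sketch: one Cassandra table per known dashboard query,
// e.g. "readings for a device on a given day, newest first".
import { Client } from 'cassandra-driver';

const client = new Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'dashboard',
});

// Assumed schema, created separately:
// CREATE TABLE readings_by_device_day (
//   device_id text, day date, ts timestamp, lat double, lon double, value double,
//   PRIMARY KEY ((device_id, day), ts)
// ) WITH CLUSTERING ORDER BY (ts DESC);

export async function latestReadings(deviceId: string, day: string, limit = 100) {
  const query =
    'SELECT ts, lat, lon, value FROM readings_by_device_day ' +
    'WHERE device_id = ? AND day = ? LIMIT ?';
  const result = await client.execute(query, [deviceId, day, limit], { prepare: true });
  return result.rows;
}
```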
JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, while other databases support only one measurement (pointm). Many complex queries, such as Voronoi polygons and convex hulls, are also supported. It is open source, distributed and sharded, supports multi-column indexes, etc.
Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions, it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?
I am using Hazelcast Jet 0.6.1 for real-time analysis. There are multiple streams (mostly from remote journals) coming from different sources.
I would like to know if a full join is supported between multiple streams.
If yes, could you please point me to some links/examples of a full join between multiple streams?
I think you need to elaborate a bit more on what you are trying to do. Streams are theoretically infinite, so the term "full join" has to mean something different than it does in a database.
There are several types of joins available in Jet. As Can said above, there is a merge operator, but you might be thinking more of a windowed join, where you time-bound the period of the join.
Merging streams is covered here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
Window Concepts are here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#unbounded-stream-processing
*This is in response to the comment on the first answer; it's too large for another comment, and I think the first answer is still relevant.
Is this the same data and data type, just from different nodes? Like app servers in a microservices architecture? It seems to me that you have a few options here that really come down to your preferred overall architecture, especially how you want to transport the events. A couple of thoughts:
You can simply merge streams from different data sources if that fits the use case:
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
If this is homogeneous data, just distributed across app servers, it might be a case where you use the Hazelcast client on each app server to put events into an IMap (which is shared by all the app servers) with an Event Journal on a Hazelcast cluster. Then Jet just receives all the events from the Event Journal.
See: https://docs.hazelcast.org/docs/latest/manual/html-single/#event-journal
If you have Kafka available, perhaps you create a topic for the events from the servers and Jet receives the events from Kafka. Either way they are already merged when Jet gets them, so they are processed as one stream.
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#kafka
I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I have overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture limits me from putting my server behind a load balancer, which is something I anticipate coming up in the future. For both these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?
It sounds like you want one process sitting somewhere which crunches the data and publishes it to a stream. Clients can then subscribe to the stream as and when they like. Redis handles streams nicely; you could process your data and push it into a Redis stream. You could then create a small Node service which subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
In this scenario you can then scale up either the publishing process (the one crunching the numbers) if your data load goes up, or your subscriber process (the one serving the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. That can be a hard problem if you use websockets, and it comes with its own pros and cons. But importantly, you'll have separated your data crunching from your result publishing, and that isolates your issue to just the load balancing.
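Here is a minimal sketch of that split, assuming ioredis and Socket.IO; the stream name, port, and payload shape are made up:

```typescript
// Hypothetical sketch: a crunching process appends results to a Redis stream,
// and a small subscriber service tails the stream and fans results out over
// websockets to browsers.
import Redis from 'ioredis';
import { Server } from 'socket.io';

// Publisher process: after crunching, append one result entry to the stream.
export async function publishResult(redis: Redis, payload: Record<string, unknown>) {
  // '*' lets Redis assign the entry id (timestamp-based).
  await redis.xadd('results', '*', 'data', JSON.stringify(payload));
}

// Subscriber process: block on the stream and push each new entry to clients.
export async function serveResults() {
  const redis = new Redis();      // dedicated connection for blocking reads
  const io = new Server(3000);    // browsers connect with socket.io-client
  let lastId = '$';               // only entries added after startup

  for (;;) {
    const res = (await redis.xread('BLOCK', 0, 'STREAMS', 'results', lastId)) as
      [string, [string, string[]][]][] | null;
    if (!res) continue;
    for (const [, entries] of res) {
      for (const [id, fields] of entries) {
        lastId = id;
        const data = fields[fields.indexOf('data') + 1];
        io.emit('update', JSON.parse(data));   // broadcast to all connected clients
      }
    }
  }
}
```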
I have done pretty much the same thing to check resources on some of our servers.
I have a C# service getting the information on each server that we manage and sending it to a queue (AMQ).
From there, I have a STOMP client fetching the data from AMQ and emitting it to a websocket.
My main microservice fetches the data and saves it into a DB.
My visualization web app is connected to the same websocket and displays the data as it arrives.
The AMQ step isn't mandatory at all; it's just something I had to work with (historical reasons).
I don't know what type of data you are working with, so I don't know if my solution applies to you.
Don't hesitate to ask if I'm not clear or you have any questions.
This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, reloading your whole dataset, which sounds like it is probably large, probably isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system; obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a standalone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
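As a starting point, a scheduled Lambda for the update job might look roughly like this; the helper functions are hypothetical stand-ins for the real mining/processing/storage steps, and the handler type comes from the @types/aws-lambda package:

```typescript
// Hypothetical sketch of the update job as a standalone, scheduled Lambda.
import type { ScheduledHandler } from 'aws-lambda';

// Stand-ins for the real steps; each would live in its own module in practice.
async function mineData(): Promise<unknown[]> { return []; }
function processData(raw: unknown[]): unknown[] { return raw; }
async function saveToDynamo(rows: unknown[]): Promise<void> { /* DynamoDB writes */ }

// Triggered by an EventBridge/CloudWatch schedule (e.g. every 5 minutes).
export const handler: ScheduledHandler = async () => {
  const raw = await mineData();
  const processed = processData(raw);
  await saveToDynamo(processed);
  // Finally notify the web tier (SNS, SQS, or an HTTP call to the app servers)
  // so it can push the fresh data to connected websocket clients.
};
```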
You might also consider looking into AWS Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than server processing, you can consider tuning that side of things up. Things to consider are:
Buying more throughput
Using batch operations, as these move load from your server to DynamoDB (see the sketch below)
Tuning keys, indexes and database access
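For the batch-operations point, a minimal sketch with the AWS SDK v3 document client could look like this; the table name and item shape are made up, and BatchWrite accepts at most 25 items per request, so a real job would chunk its rows:

```typescript
// Hypothetical sketch: replace many single PutItem calls with one batched write.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, BatchWriteCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function saveBatch(rows: Array<{ id: string; ts: number; value: number }>) {
  await ddb.send(new BatchWriteCommand({
    RequestItems: {
      // 'LiveData' is a placeholder table name; max 25 put requests per batch.
      LiveData: rows.slice(0, 25).map((row) => ({ PutRequest: { Item: row } })),
    },
  }));
}
```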
I'm working on an IoT project that involves a sensor transmitting its values to an IoT platform. One of the platforms I'm currently testing is Thingsboard; it is open source and I find it quite easy to manage.
My sensor is transmitting active energy indexes to Thingsboard. Using these values, I would like to calculate and show on a widget the values of the active power (= k * [ActiveEnergy(n) - ActiveEnergy(n-1)] / [Time(n) - Time(n-1)]). This basically means that I want to have access to historical data, use this data to generate new data, and inject it into my device.
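For reference, the calculation itself is just a difference quotient over two successive readings; a minimal sketch (the units and the example constant k = 3600 are assumptions):

```typescript
// Active power from two successive active-energy readings: k * dE / dt.
interface EnergyReading {
  ts: number;      // Unix time in seconds
  energy: number;  // active energy index (e.g. kWh)
}

export function activePower(prev: EnergyReading, curr: EnergyReading, k = 1): number {
  return (k * (curr.energy - prev.energy)) / (curr.ts - prev.ts);
}

// Example: 0.5 kWh consumed over 900 s with k = 3600 gives 2 kW:
// activePower({ ts: 0, energy: 100.0 }, { ts: 900, energy: 100.5 }, 3600) === 2
```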
Thingsboard uses Cassandra database to save history values.
One alternative could be to find a way to communicate with the database (via a Web API, for example), do the processing, and send the active power back to my device by MQTT or HTTP using its access token.
Is this possible?
Is there a better alternative to my problem?
There are several options for achieving this, depending on the layer or component of the system:
1) Visualization layer only. Probably the simplest one. There is an option to apply a post-processing function. The function has the following signature:
function(time, value, prevValue)
Please note that prevTime is missing, but we may add this in future releases.
[screenshot: post-processing function]
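For illustration, such a post-processing function could approximate the active power like this; since prevTime is not available yet, the reporting interval and the constant k are assumptions:

```typescript
// Hypothetical sketch of a widget post-processing function. The sample
// interval (900 s) and meter constant k are assumed, because only the
// previous value, not its timestamp, is exposed today.
function postProcess(time: number, value: number, prevValue: number): number {
  const k = 3600;               // assumed unit/meter constant
  const intervalSeconds = 900;  // assumed fixed reporting interval
  return (k * (value - prevValue)) / intervalSeconds;
}
```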
2) Data processing layer. Use advanced analytics frameworks like Apache Spark to post-process your data using a sliding time window, for example.
See our integration article about this.
I want to create a cluster of Node.js servers in order to support high concurrency for a chat room application. I need to be able to share information between all nodes. I am trying to find out what would be the best way to keep all the servers in sync. I want as much flexibility as possible in the shared object, as I plan to add more features in the future.
So far, I have 2 solutions in mind:
Subscribe to a NoSQL key (for example, Redis publish/subscribe)
Nodes update each other using sockets.
Which is better? Any other ideas?
Redis is nice because it's independent of your Node app and fairly easy to scale. You can also use it for a lot of stuff outside of pub/sub, such as sharing basic data structures (hashes, sorted sets, lists, strings) between your Node servers to help keep them in sync as well. Theoretically, you could save all chats in a given room as a sorted set, where each member is a JSON representation of a chat object (something like {'user':'some_user','msg':'some_msg'}) and its score is the timestamp, so it's very easy to pull conversations by time. Redis is extremely fast, and its data structures are highly optimized, so a single server can handle many, many users.
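A minimal sketch of both ideas with ioredis follows; the channel and key names are made up:

```typescript
// Hypothetical sketch: pub/sub to fan messages out to every Node server, and a
// sorted set (score = timestamp) as the chat history for a room.
import Redis from 'ioredis';

const pub = new Redis();
const sub = new Redis();   // a subscribing connection can't issue other commands

export async function sendMessage(room: string, user: string, msg: string) {
  const entry = JSON.stringify({ user, msg });
  await pub.zadd(`chat:${room}`, Date.now(), entry);   // persist, ordered by time
  await pub.publish(`room:${room}`, entry);            // notify all server nodes
}

export async function onMessage(room: string, handler: (m: string) => void) {
  await sub.subscribe(`room:${room}`);
  sub.on('message', (_channel, message) => handler(message));
}

// History: the 50 most recent messages, newest first.
export function history(room: string) {
  return pub.zrevrange(`chat:${room}`, 0, 49);
}
```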
We have a similar setup in production with one Redis server handling 1 million users (about 10k inserts and 20k reads from a sorted set per minute), and the CPU usage rarely gets above 5% on a non-CPU-heavy box.