I am using Spark Streaming to read from a TCP server and insert the data into Cassandra, which I then have to push to the UI; for that push I decided to go with Node.js. But I have not found a technology that can talk to both Cassandra and Node.js. Below is my architecture, and I am not able to find a technology that can replace the "?". I am also open to replacing Cassandra with MongoDB if it is possible to push data from Mongo to Node.js directly, but for now I am using Cassandra because it has native support for Hadoop.
Take a look at the DataStax Node.js Cassandra driver: https://github.com/datastax/nodejs-driver. It has Cassandra row streaming and piping functionality that you could use to push Cassandra data into Node, process it, and then export it via WebSockets per your desired architecture.
Leave your stream client open (it would need to run as a persistent Node server that handles errors), and something like this should pick up new Cassandra data:
// assumes the cassandra-driver package and a reachable Cassandra node/keyspace
// (the contact point and keyspace below are placeholders)
var cassandra = require('cassandra-driver');
var client = new cassandra.Client({ contactPoints: ['127.0.0.1'], keyspace: 'mykeyspace' });

var streamCassandra = function () {
  client.stream('SELECT time, val FROM temperature WHERE station_id = ?', ['abc'])
    .on('readable', function () {
      // 'readable' is emitted as soon as a row is received and parsed
      var row;
      while ((row = this.read())) {
        console.log('time %s and value %s', row.time, row.val);
      }
    })
    .on('error', function (err) {
      // handle err
    })
    .on('end', streamCassandra); // stream finished; call again to keep picking up new data
};
Wrap your stream client in a recursive function that calls itself again on 'end' (as above with .on('end', streamCassandra)). Alternatively, if you don't need that kind of concurrency, you could poll by calling the function every x seconds with setInterval. Either approach should work.
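As a rough sketch of the polling variant, pushing each row out over Socket.IO (the io object, event name, and five-second interval are illustrative assumptions, not part of the original answer):
// minimal polling sketch: re-run the query every 5 seconds and push rows to browsers
// assumes the same `client` as above and a Socket.IO server instance `io`
var pollCassandra = function () {
  client.stream('SELECT time, val FROM temperature WHERE station_id = ?', ['abc'])
    .on('readable', function () {
      var row;
      while ((row = this.read())) {
        io.emit('temperature', { time: row.time, val: row.val }); // broadcast to connected clients
      }
    })
    .on('error', function (err) {
      console.error(err);
    });
};
setInterval(pollCassandra, 5000);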
Have you checked NiFi?
https://nifi.apache.org/
In your case, you could write your Spark Streaming results to Kafka, HDFS, or even directly to NiFi, but I personally prefer to write to Kafka or some other message queue.
From NiFi, you can write to Kafka, and also send requests to your Node.js app if that's what you need. In my case, I'm using Meteor, so just pushing from Kafka to MongoDB automatically refreshes the UI.
I hope it helps.
I am wondering how I would properly use MySQL when I am scaling my Node.js app using the cluster module. Currently, I've only come up with two solutions:
Solution 1:
Create a database connection on every "worker".
Solution 2:
Have the database connection on the master process, and whenever one of the workers requests some data, the master process returns it. However, with this solution I do not know how I would get the worker to retrieve the data from the master process.
I (think) I made a "hacky" workaround: emitting with a unique number and then waiting for the master process to send the message back to the worker, with the event name being that unique number.
If you don't understand what I mean by this, here's some code:
// Worker process
return new Promise(function (resolve, reject) {
  process.send({
    // Other data here
    identifier: <unique number>
  })
  // having a custom event emitter on the worker
  worker.once(<unique number>, function (data) {
    // data being the data for the request with the unique number
    // resolving the promise with returned data
    resolve(data)
  })
})

//////////////////////////

// Master process
// Custom event emitter on the master process
master.on(<eventName>, function (data) {
  // logic
  // Sending data back to worker
  master.send(<other args>, data.identifier)
})
What would be the best approach to this problem?
Thank you for reading.
When you cluster in Node.js, you should assume each process is completely independent. You really shouldn't be relaying messages like this to and from the master process. If you need multiple threads to access the same data, I don't think Node.js is what you should be using. However, if you're just doing basic CRUD operations against your database, clustering (solution 1) is certainly the way to go.
For example, if you're trying to scale write ops to your database (assuming your database is properly scaled), each write op is independent from another. When you cluster, a single write request will be load balanced to one of your workers. Then in the worker, you delegate the write op to your database asynchronously. In this scenario, there is no need for a master process.
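To make solution 1 concrete, here is a minimal sketch of each worker owning its own connection pool (the mysql package, pool settings, credentials, and query are illustrative assumptions, not part of the original answer):
// each worker owns its own pool; the master only forks workers
// assumes the `mysql` npm package and placeholder credentials
var cluster = require('cluster');
var mysql = require('mysql');
var os = require('os');

if (cluster.isMaster) {
  os.cpus().forEach(function () {
    cluster.fork();
  });
} else {
  var pool = mysql.createPool({
    host: 'localhost',
    user: 'app',
    password: 'secret',
    database: 'mydb',
    connectionLimit: 10
  });

  // any request handled by this worker just uses its local pool
  pool.query('SELECT 1 + 1 AS two', function (err, rows) {
    if (err) return console.error(err);
    console.log('worker %d got %d', process.pid, rows[0].two);
  });
}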
If you haven't planned on a proper microservice architecture where each process actually has its own database (or perhaps just in-memory storage), your best bet IMO is to use a connection pool created by the main process and have each child request a connection out of that pool. That's probably the safest approach to avoid issues in the neighborhood of thread-safety errors.
I'm trying to create an efficient streaming Node.js app, where the server connects to a stream (capped collection) in MongoDB with Mongoose and then emits the stream directly to the client browsers.
What I'm worried about is the scalability of my design. Let me know if I'm wrong, but it seems that right now, for every new web browser that is opened, a new connection to MongoDB will also be opened (it won't reuse the previously utilized connection), and therefore there will be a lot of inefficiency if I have a lot of users connected at the same time. How can I improve that?
I'm thinking of a one server - multiple client type of design in socket.io but I don't know how to achieve that.
Code below:
server side (app.js):
io.on('connection', function (socket) {
  console.log("connected!");
  // opens a new tailable cursor on the capped collection for every socket that connects
  var stream = Json.find().lean().tailable({ "awaitdata": true, numberOfRetries: Number.MAX_VALUE }).stream();
  stream.on('data', function (doc) {
    socket.emit('rmc', doc);
  }).on('error', function (error) {
    console.log(error);
  }).on('close', function () {
    console.log('closed');
  });
});
client side (index.html):
socket.on('rmc', function (json) {
  doSomething(); // it just displays the data on the screen
});
Unfortunately, this will not depend only on Mongo performance. Unless you have a high level of concurrency (1000+ streams), you shouldn't worry about Mongo for the moment,
because with that kind of app you have bigger problems, for example: data types and compression, buffer overflows, bandwidth limits, socket.io limits, OS limits. These are the kinds of problems you will most likely face first.
Now to answer your question: as far as I know, no, you are not opening a connection to Mongo per user. The users are connected to the app, not the database; the app is connected to the database.
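To illustrate that one-server/many-clients shape, here is a small sketch that opens a single tailable stream once and broadcasts to every connected socket instead of one stream per connection (it assumes the same Json model and io object as the question's code; it is not part of the original answer):
// one shared tailable cursor for the whole app, created once at startup
var stream = Json.find().lean().tailable({ "awaitdata": true, numberOfRetries: Number.MAX_VALUE }).stream();

stream.on('data', function (doc) {
  io.emit('rmc', doc); // broadcast to all connected sockets
}).on('error', function (error) {
  console.log(error);
});

io.on('connection', function (socket) {
  console.log("connected!"); // sockets only subscribe; no per-socket MongoDB stream
});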
Lastly, these links will help you understand and tweak your queries for this kind of job (streaming):
https://github.com/Automattic/mongoose/issues/1248
https://codeandcodes.com/tag/mongoose-vs-mongodb-native/
http://drewww.github.io/socket.io-benchmarking/
Hope it helps!
My current scenario is like this:
I have a RabbitMQ instance which gives me the details of the orders placed.
On the other side I have my Titan DB (Cassandra storage, Elasticsearch index backend, and Gremlin Server).
I also have a Node.js application which can interact with Gremlin Server through its HTTP API using https://www.npmjs.com/package/gremlin . I am able to make hits to my graph database from there.
Now what I am trying to do is load data from RabbitMQ into Titan DB.
What I have been able to do so far is load data from a Node.js file using the gremlin node module:
var createClient = require('gremlin').createClient;
//import { createClient } from 'gremlin';
const client = createClient();

// sends the Gremlin script to Gremlin Server and logs the response
client.execute('tx=graph.newTransaction();tx.addVertex(T.label,"product","id",991);tx.commit()', {}, function (err, results) {
  if (err) {
    return console.error(err);
  }
  console.log(results);
});
How should I proceed so that I can consume the existing RabbitMQ orders and push them into Titan DB?
Due to some constraints I can not use java.
You're most likely looking for something like node-amqp, which is a Node.js client for RabbitMQ. What you want to do is:
Establish a connection to Gremlin Server
Establish a connection to RabbitMQ
Listen to a RabbitMQ queue for messages
Send these messages to Gremlin, creating graph elements
Things you must watch for that will otherwise likely kill your performance:
Send Gremlin queries with bound parameters
Batch messages: create multiple vertices and commit them in the same transaction (= the same Gremlin query, unless you are in session mode where you .commit() yourself). Numbers in the low thousands should work.
Watch out for back-pressure and make sure you don't flood your Titan instances with more messages than they can handle.
I'm not familiar with RabbitMQ but hopefully this should get you started.
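As a rough sketch of those steps (assuming the node-amqp package, a local broker with an 'orders' queue, and the same gremlin client as in the question; the message shape and the orderId property are made up for illustration):
var amqp = require('amqp');
var createClient = require('gremlin').createClient;

var client = createClient(); // connection to Gremlin Server (step 1)
var connection = amqp.createConnection({ host: 'localhost' }); // connection to RabbitMQ (step 2)

connection.on('ready', function () {
  connection.queue('orders', function (q) {
    // listen to the queue for messages (step 3)
    q.subscribe(function (message) {
      // send a Gremlin script per message (step 4), with bound parameters
      client.execute(
        'tx = graph.newTransaction(); tx.addVertex(T.label, "order", "id", orderId); tx.commit()',
        { orderId: message.orderId },
        function (err, results) {
          if (err) return console.error(err);
          console.log(results);
        }
      );
    });
  });
});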
Note: the Gremlin JavaScript driver interacts with Gremlin Server via a WebSocket connection, which is permanent and bidirectional. The client doesn't support the HTTP Channelizer yet (and that is not the kind of connection you want to establish in this scenario anyway).
I'll give a small premise of what I'm trying to do. I have a game concept in mind which requires multiple players sitting around a table somewhat like poker.
The normal interaction between different players is easy to handle via socket.io in conjunction with Node.js.
What I'm having a hard time figuring out is this: I have a cron job running in another process which gets new information every minute, and that information then needs to be sent to each of those players. Since this is a different process, I'm not sure how I send this information to certain clients.
socket.io does have information for this and I'm quoting it below:
In some cases, you might want to emit events to sockets in Socket.IO namespaces / rooms from outside the context of your Socket.IO processes.
There’s several ways to tackle this problem, like implementing your own channel to send messages into the process.
To facilitate this use case, we created two modules:
socket.io-redis
socket.io-emitter
From what I understand, I need these two modules to do what I mentioned earlier. What I do not understand, however, is why Redis is in the equation when I just need to send some messages.
Is it used to just store the messages temporarily?
Any help will be appreciated.
There are several ways to achieve this if you just need to emit after an external event. It depends on what you're using to get the new data you want to send:
/* if the other process sends an incoming HTTP POST, you can for example use
express and attach your io object in a custom middleware: */
// pass the io object along on the req object
app.use('/incoming', (req, res, next) => {
  req.io = io;
  next(); // call next() so the request reaches the route handler
});

// then you can do:
app.post('/incoming', (req, res, next) => {
  req.io.emit('incoming', req.body);
  res.send('data received from the HTTP POST request, then sent on the socket');
});

// if you fetch data every minute, why don't you just emit after your job
// (using node-schedule's scheduleJob here; io is taken from the enclosing scope):
var job = schedule.scheduleJob('0 */1 * * * *', () => {
  axios.get('/myApi/someRessource').then(response => io.emit('newData', response.data));
});
Since socket.io provides those two modules, I read that as you needing both of them. However, that isn't necessarily what you want. But yes, Redis is probably just used to store data temporarily, and it does a really good job there, since it behaves much like a message queue.
Your cron then wouldn't need a message queue or similar behaviour.
My suggestion, though, would be to run the cron with some Node package from within your process as a child_process, hook onto its readable stream, and then push directly to your sockets.
If the cron job process is also a Node.js process, you can exchange data through Redis's pub/sub mechanism.
Let me know what your cron job process is written in, and whether you need further help with the pub/sub mechanism.
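As a minimal sketch of that route using the two modules the question mentions (the Redis host/port, the port 3000, and the event name are placeholder assumptions): the Socket.IO server attaches the socket.io-redis adapter, and the cron process emits through socket.io-emitter:
// in the Socket.IO server process: attach the Redis adapter
var io = require('socket.io')(3000);
var redisAdapter = require('socket.io-redis');
io.adapter(redisAdapter({ host: '127.0.0.1', port: 6379 }));

// in the cron job process: emit through Redis without owning any sockets
var emitter = require('socket.io-emitter')({ host: '127.0.0.1', port: 6379 });
emitter.emit('newData', { refreshedAt: Date.now() }); // reaches clients connected to the server process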
Redis is one of the memory stores socket.io can use (if you configure it).
You only need Redis if you have a multi-server configuration (cluster), to establish connection and room/namespace sync between those Node.js instances. It has nothing to do with storing data in this case; it works as a pub/sub machine.
Being somewhat new to Node, I've read all over that I should be mindful of the event loop and careful not to block. I have a web application written in Node.js that has a feature allowing a user to upload a CSV file and save the data as a document in a MongoDB collection.
I've not been able to get a straight answer from anyone on whether this is a bad idea. My concern is that if a user uploads a large file, the application will become unresponsive for other users while each row of the file gets saved to Mongo.
Is my concern well-founded? Is there an established best practice here (i.e. use streams?) and what would you do if presented with this problem?
Some background info
The files will be, on average, between 10 KB and 100 KB in size, but a larger 8 MB file can happen (8 MB is the server's max upload size).
Using a different language is an option but only if there's a compelling reason to do so
I have some Go daemons running with the application and planned to offload the CSV work to one of those or a new Ruby process if Node couldn't handle it. The only issue with this is sharing user sessions between the Node app and Go or Ruby process
The production server has 2 CPU cores available to it so maybe blocking on one core is acceptable?
Update:
The implementation I'm using is the fast-csv Node module using streams in an Express app like this:
var csv = require('fast-csv'); // note: the package name is fast-csv
csv.fromPath('path/to/file.csv')
  .on('data', function (data) {
    // Save data to MongoDB
  })
  .on('end', function () {
    res.render('my_view', {});
  });
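For reference, one common way to keep the event loop and memory in check with this kind of stream is to pause it while each row is being written and resume afterwards; a rough sketch (the Row model and its create call are hypothetical placeholders, not part of the original code):
var csv = require('fast-csv');
var stream = csv.fromPath('path/to/file.csv')
  .on('data', function (data) {
    stream.pause(); // stop reading while the async insert is in flight
    Row.create(data, function (err) { // hypothetical Mongoose model
      if (err) console.error(err);
      stream.resume(); // pull the next row only after the write finishes
    });
  })
  .on('end', function () {
    res.render('my_view', {});
  });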