Bulk load data in titan db from nodejs - node.js

My current scenario is as follows:
I have a RabbitMQ queue which gives me the details of each order placed.
On the other side I have my Titan DB (Cassandra storage, ES index backend and Gremlin Server).
I also have a Node.js application which can interact with the Gremlin Server through its HTTP API using https://www.npmjs.com/package/gremlin . I am able to make hits to my graph database from there.
Now what I am trying to do is load data from RabbitMQ into Titan DB.
What I have been able to do so far is load data from a Node.js file using the gremlin node module:
var createClient = require('gremlin').createClient;
//import { createClient } from 'gremlin';
const client = createClient();

client.execute('tx=graph.newTransaction();tx.addVertex(T.label,"product","id",991);tx.commit()', {}, function (err, results) {
  if (err) {
    return console.error(err);
  }
  console.log(results);
});
How should I proceed so that I can harness the existing RabbitMQ queue of orders and push them into Titan DB?
Due to some constraints I cannot use Java.

You're most likely looking for something like node-amqp, which is a Node.js client for RabbitMQ. What you want to do is:
Establish a connection to Gremlin Server
Establish a connection to RabbitMQ
Listen to a RabbitMQ queue for messages
Send these messages to Gremlin, creating graph elements
Things you must watch for that will otherwise likely kill your performance:
Send Gremlin queries with bound parameters
Batch messages: create multiple vertices and commit them in the same transaction (= the same Gremlin query, unless in session mode where you .commit() yourself). Numbers in the low thousands should work.
Watch out for back-pressure and make sure you don't flood your Titan instances with more messages than they can handle.
I'm not familiar with RabbitMQ but hopefully this should get you started.
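A minimal sketch of those steps, assuming the amqplib client (an alternative to node-amqp), a queue named "orders", and the same gremlin driver as in the question; the queue name, batch size and vertex properties are placeholders, not part of the original setup:
var amqp = require('amqplib');
var createClient = require('gremlin').createClient;

var gremlin = createClient(); // defaults to localhost:8182
var BATCH_SIZE = 1000;        // a couple thousand per transaction is a reasonable start
var batch = [];

function flush(channel, messages) {
  // One parameterized script creates every vertex of the batch in a single transaction.
  var script = 'orders.each { o -> graph.addVertex(T.label, "product", "id", o.id) }; graph.tx().commit()';
  var bindings = { orders: messages.map(function (m) { return m.body; }) };
  gremlin.execute(script, bindings, function (err) {
    if (err) { return console.error(err); } // in production you would nack/requeue instead
    messages.forEach(function (m) { channel.ack(m.raw); });
  });
}

amqp.connect('amqp://localhost').then(function (conn) {
  return conn.createChannel();
}).then(function (channel) {
  channel.prefetch(BATCH_SIZE); // back-pressure: cap the number of unacknowledged messages
  return channel.assertQueue('orders').then(function () {
    return channel.consume('orders', function (msg) {
      batch.push({ body: JSON.parse(msg.content.toString()), raw: msg });
      if (batch.length >= BATCH_SIZE) {
        flush(channel, batch.splice(0, batch.length));
      }
    });
  });
});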
Note: the Gremlin JavaScript driver interacts with Gremlin Server via a WebSocket connection, which is permanent and bidirectional. The client doesn't support the HTTP Channelizer yet (which is not the kind of connection you would wish to establish in this scenario anyway).

Related

How to properly use database when scaling a NodeJS app?

I am wondering how I would properly use MySQL when I am scaling my Node.JS app using the cluster module. Currently, I've only come up with two solutions:
Solution 1:
Create a database connection on every "worker".
Solution 2:
Have the database connection on a master process and whenever one of the workers request some data, the master process will return the data. However, using this solution, I do not know how I would be able to get the worker to retrieve the data from the master process.
I think I made a "hacky" workaround: emitting with a unique number and then waiting for the master process to send the message back to the worker, with the event name being that unique number.
If you don't understand what I mean by this, here's some code:
// Worker process
return new Promise(function (resolve, reject) {
  process.send({
    // Other data here
    identifier: <unique number>
  })

  // having a custom event emitter on the worker
  worker.once(<unique number>, function (data) {
    // data being the data for the request with the unique number
    // resolving the promise with returned data
    resolve(data)
  })
})
//////////////////////////
// Master process

// Custom event emitter on the master process
master.on(<eventName>, function (data) {
  // logic
  // Sending data back to worker
  master.send(<other args>, data.identifier)
})
What would be the best approach to this problem?
Thank you for reading.
When you cluster in NodeJS, you should assume each process is completely independent. You really shouldn't be relaying messages like this to/from the master process. If you need multiple threads to access the same data, I don't think NodeJS is what you should be using. However, If you're just doing basic CRUD operations with your database, clustering (solution 1) is certainly the way to go.
For example, if you're trying to scale write ops to your database (assuming your database is properly scaled), each write op is independent from another. When you cluster, a single write request will be load balanced to one of your workers. Then in the worker, you delegate the write op to your database asynchronously. In this scenario, there is no need for a master process.
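A minimal sketch of solution 1 under those assumptions, using the mysql package (the connection settings and query are placeholders):
var cluster = require('cluster');
var http = require('http');
var mysql = require('mysql');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // The master only forks workers; it never touches the database.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  // Each worker owns its own pool; the driver queues queries internally.
  var pool = mysql.createPool({
    host: 'localhost',
    user: 'app',
    password: 'secret',
    database: 'app',
    connectionLimit: 10
  });

  http.createServer(function (req, res) {
    pool.query('SELECT 1 + 1 AS result', function (err, rows) {
      if (err) { res.statusCode = 500; return res.end('db error'); }
      res.end(String(rows[0].result));
    });
  }).listen(3000);
}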
If you haven't planned on using a proper microservice architecture, where each process would actually have its own database (or perhaps just in-memory storage), your best bet IMO is to use a connection pool created by the main process and have each child request a connection from that pool. That's probably the safest approach to avoid issues in the neighborhood of thread-safety errors.

SocketIO connection stop sending data after 4-5 hours

I have developed an application with ReactJS, ExpressJS, MongoDB and SocketIO.
I have two servers: Server A || Server B
The socket server is hosted on Server A and the application is hosted on Server B.
I am using Server A's socket on Server B as a client.
The main job of Server A's socket is to emit data after fetching it from Server A's MongoDB database.
Everything is working as expected, but after 4-6 hours it stops emitting data even though the socket connection still works.
I have checked using
socket.on('connection', function () {
  console.log("Connected")
})
I am not able to figure out what is wrong with the code.
My code: https://jsfiddle.net/ymqxo31d/
Can anyone help me out on this?
I had some programming errors.
I was getting data from MongoDB inside setInterval(), so after a little while it exhausts resources and the database connection starts failing every time.
Firstly, I created a single MongoDB connection and reused it every place it was needed.
Secondly, I removed setInterval and used setTimeout as below. (NOTE: if I keep using setInterval it fires at the defined interval regardless of whether the data was actually emitted or not [this also causes the heavy resource usage], but I need to emit the data to the socket only once it has been successfully fetched.)
setTimeout(emitData, 1000);

function emitData() {
  db.collection.find({}).toArray(function (err, data) {
    if (err) { return console.error(err); }
    socket.emit('updateData', data);
    setTimeout(emitData, 1000);
  });
}
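For the first fix, a minimal sketch of opening one MongoDB connection at startup and reusing it (assuming the 2.x mongodb driver; the connection URL and database name are placeholders):
var MongoClient = require('mongodb').MongoClient;

var db = null;
MongoClient.connect('mongodb://localhost:27017/mydb', function (err, database) {
  if (err) { throw err; }
  db = database; // reused by emitData() above instead of reconnecting on every call
  emitData();
});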

Which technology can connect with Cassandra as well as NodeJS?

I am using Spark Streaming to read from a TCP server and then insert the data into Cassandra, which I then have to push to the UI; for the pushing I decided to go with NodeJS. But I cannot find any technology that can talk to Cassandra as well as to NodeJS. Below is my architecture, and I am not able to find a technology which can replace the "?". I am also open to replacing Cassandra with MongoDB if it is possible to directly push data from Mongo to NodeJS, but as of now I am using Cassandra since it has native support for Hadoop.
Take a look at the DataStax nodejs-cassandra driver: https://github.com/datastax/nodejs-driver . It has Cassandra row streaming and piping functionality that you could use to push Cassandra data into Node, process it, and then export it via websockets per your desired architecture.
Leave your stream client open (this would need to run as a persistent Node server handling errors) and something like this should pick up new Cassandra data:
var cassandra = require('cassandra-driver');
// contact points and keyspace are placeholders for your own cluster
var client = new cassandra.Client({ contactPoints: ['127.0.0.1'], keyspace: 'mykeyspace' });

var streamCassandra = function () {
  client.stream('SELECT time, val FROM temperature WHERE station_id = ?', ['abc'])
    .on('readable', function () {
      // readable is emitted as soon as a row is received and parsed
      var row;
      while (row = this.read()) {
        console.log('time %s and value %s', row.time, row.val);
      }
    })
    .on('error', function (err) {
      // handle err
    })
    .on('end', streamCassandra);
};
Wrap your stream client into a recursive function that calls itself again on('end', streamCassandra). You could also poll the function every x seconds with a setInterval if you don't need that kind of concurrency. One of those approaches should work.
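Reusing the client from the snippet above, a rough sketch of the websocket export could look like this (assumes socket.io; the port, query and event name are placeholders):
var io = require('socket.io')(3000);

io.on('connection', function (socket) {
  client.stream('SELECT time, val FROM temperature WHERE station_id = ?', ['abc'], { prepare: true })
    .on('data', function (row) {
      // push each row to this browser as soon as it is parsed
      socket.emit('temperature', { time: row.time, val: row.val });
    })
    .on('error', function (err) {
      console.error(err);
    });
});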
Have you checked NiFi?
https://nifi.apache.org/
In your case, you could write your Spark Streaming results to Kafka, HDFS, or even directly to NiFi, but I personally prefer to write to Kafka or some other message queue.
From NiFi, you can write to Kafka, and also send requests to your Node JS app if that's what you need. In my case, I'm using Meteor, so just pushing from Kafka to MongoDB automatically refreshes the UI.
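As a rough sketch of that last hop, assuming the kafka-node client and socket.io on the Node side (the topic name, broker address and event name are placeholders):
var kafka = require('kafka-node');
var io = require('socket.io')(3000);

var client = new kafka.KafkaClient({ kafkaHost: 'localhost:9092' });
var consumer = new kafka.Consumer(client, [{ topic: 'spark-results', partition: 0 }], { autoCommit: true });

// Every record that Spark/NiFi wrote to Kafka is pushed straight to the connected browsers.
consumer.on('message', function (message) {
  io.emit('newData', JSON.parse(message.value));
});

consumer.on('error', function (err) {
  console.error(err);
});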
I hope it helps.

Should I keep database connection open?

When I connect to Rexster graph server with Grex should I keep the database connection open?
var grex = require('grex');
var client = grex.createClient();

client.connect({ graph: 'graph' }, function (err, client) {
  if (err) { console.error(err); }
  ...
});
I think I should because nodejs is single threaded so there's no chance of different requests trying to use the one connection at the same time.
Yes, you should. There's no reason to pay the overhead of connecting on every request. There will not be any issue of "mangling", as your code will run in a single thread anyway.
Furthermore, you could even have a pool of connections waiting to serve your requests in case you have a heavy-usage application. Some adapters do it for you automatically; for example, MongoClient has a default pool of 5 connections.
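A minimal sketch of reusing a single connection, using only the grex API shown in the question (the db.js module name is an assumption):
// db.js
var grex = require('grex');

var cached = null; // the connected client, shared by every require('./db') caller

module.exports = function getClient(callback) {
  if (cached) { return callback(null, cached); }
  var client = grex.createClient();
  client.connect({ graph: 'graph' }, function (err, connectedClient) {
    if (err) { return callback(err); }
    cached = connectedClient;
    callback(null, cached);
  });
};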

How to scale socket.io without redis

I'm currently searching for an alternative to scale my express app with socket.io. The problem is that I don't want to use redis as socket.io store. Are there any other possibilities to cluster socket.io except with Clusterhub?
EDIT: I tried to use fakeredis as replacement for redis, but it seems like it doesn't work with socket.io. From ActionHero.js I know that faye-websocket works with fakeredis.
This might well depend on your socket.io usage and the type of scaling you want to achieve (cluster vs scaling to multiple machines).
So, here is what I did to scale our usage of socket.io to multiple servers.
We have 3 servers behind a load balancer; when a socket connects, it connects to any of the 3 servers. Each of the three servers has an in-memory list of its sockets, and the three servers share an ordered list of internal server addresses, e.g. [server1, server2, server3].
What I do basically is a ring (internally we call it the "ring of sockets"):
If I need to emit an event to a socket from server1, I first look whether the socket is connected to server1. If not, I send an HTTP request to the next server (server2), which will check if the socket is there; if not, it will send the same request to server3, and so on until reaching the origin, in which case you might throw an error.
It's almost the same if I need to broadcast a message: I start from one server and then call an HTTP endpoint on the others.
The algorithm I use to determine the next node (next_node.js) is:
var nodes = process.env.NODES.split(',');
//this is usually: http://server1/,http://server2/,http://server3/
var url = require('url');
var current = require("os").hostname();
//origin is the node that started the lookup
exports.get = function (origin) {
var next_node_i = nodes.map(function (uri) {
return url.parse(uri).hostname;
}).reduce(function (prev, curr, i, arr){
return curr === current && i < arr.length - 1 ? i + 1 : prev;
}, 0);
var next_node = nodes[next_node_i];
if (origin && url.parse(next_node).hostname === origin) {
// if the next node is equal to the first node initiating the lookup
// it means the socket we are looking for is not connect to any node.
return null;
}
return next_node;
};
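Building on next_node.js, a rough sketch of the HTTP endpoint each server could expose to forward an emit around the ring (the route, payload shape and socket lookup are assumptions, not the original code; assumes Express 4.16+, the request package and socket.io 1.x):
var express = require('express');
var request = require('request');
var os = require('os');
var nextNode = require('./next_node');

module.exports = function (app, io) {
  app.post('/ring/emit', express.json(), function (req, res) {
    var socket = io.sockets.connected[req.body.socketId]; // is the socket on this server?
    if (socket) {
      socket.emit(req.body.event, req.body.data);
      return res.sendStatus(200);
    }

    // Not here: keep (or set) the origin and ask the next server in the ring.
    var origin = req.body.origin || os.hostname();
    var next = nextNode.get(origin);
    if (!next) { return res.sendStatus(404); } // went full circle: the socket is not connected anywhere

    request.post({
      url: next + 'ring/emit',
      json: { socketId: req.body.socketId, event: req.body.event, data: req.body.data, origin: origin }
    }, function (err) {
      res.sendStatus(err ? 502 : 200);
    });
  });
};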
Caveats:
Latency is low between these servers and a network partition is unlikely, since they are physically in the same datacenter. But even if one happened, a network partition is not that important for us.
We always run the ring in the same direction. An improved version would be to run it in both directions(?)
Servers share a secret to call these endpoints.
In my opinion this is a very easy way to achieve scaling in a lot of socket.io use cases; there might be a lot of other scenarios where this is not an option, but I hope this gives you some ideas.
If you're comfortable with Azure services, some of the guys on the Azure team have taken the liberty of writing a service bus store for socket.io.
Glenn Block Explains Socket.IO Scale-Out on Service Bus
