Should I share Redis connection between files/modules? - node.js

I'm developing a node.js app and I am in need of heavy Redis usage. The app will be clustered across 8 CPU cores.
Right now I have 100 concurrent connections to Redis, because each per-CPU worker has several modules that each run require('redis').createClient().
Scenario A:
file1.js:
var redis = require('redis').createClient();
file2.js
var redis = require('redis').createClient();
Scenario B:
redis.js
var redis = require('redis').createClient();
module.exports = redis;
file1.js
var redis = require('./redis');
file2.js
var redis = require('./redis');
Which approach is better: creating a new Redis instance in every new file I introduce (Scenario A), or creating one Redis connection globally (Scenario B) and sharing it across all the modules I have? What are the drawbacks/benefits of each solution?
Thanks in advance!

When I face a question such as this I generally think about three basic questions.
Which is more readable?
Which allows better code reuse?
Which is more efficient?
Not necessarily in this order as it depends on the scenario, but I believe in this case all three of these questions are in favor of option B.
If you ever needed to modify the options for createClient, you would then need to edit them in every file which uses it: in option A that is every file which uses redis, while in option B it is just redis.js. Also, if a newer or different product comes out and you want to replace redis, it would be feasible to make redis.js a wrapper for a different package or even a newer redis client, substantially cutting down conversion time.
Globals are generally a bad thing, but in this example redis.js should not be storing mutable state, so there is no problem having a global/singleton in this context.
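As a minimal sketch of what redis.js could look like in option B (the option values shown are only illustrative, assuming the node_redis client):

// redis.js - the single place where the client is created and configured.
// Changing options, or swapping in another client library, touches only this file.
var redis = require('redis');

var client = redis.createClient({
    host: '127.0.0.1',
    port: 6379
});

client.on('error', function (err) {
    console.error('Redis error:', err);
});

module.exports = client;

Every other module then just does var redis = require('./redis'); as in Scenario B, and Node's module cache guarantees they all share the same client instance.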

Both Node and Redis can handle lots of connections pretty well, so that's not a problem.
In your situation, you're creating Redis connections at the startup of your application, so the number of connections you're setting up is limited (in the sense that after your application is started, the number of connections will be constant).
The situation where you'd want to reuse the same connection is a highly dynamic one, for instance an HTTP server where you need to query Redis for every request. Creating a new connection for each request would be a waste of resources (creating and destroying connections all the time); reusing one connection for all requests is preferable.
As for which of the two scenarios I'd prefer, I'm leaning towards Scenario A myself.
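To illustrate the point about highly dynamic situations, here is a small sketch (assuming Express and the node_redis callback API): the client is created once at startup and reused by every request handler, instead of calling createClient() per request.

var express = require('express');
var redis = require('redis');

// Created once at application startup.
var client = redis.createClient();

var app = express();

app.get('/counter', function (req, res) {
    // Reuses the existing connection; nothing is connected or destroyed per request.
    client.incr('hits', function (err, count) {
        if (err) return res.status(500).send(err.message);
        res.send('Hits: ' + count);
    });
});

app.listen(3000);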

You can create a file to handle the connection and your Redis functions.
redis-con.js
const redis = require('redis');

let redisClient;

(async () => {
    redisClient = redis.createClient();

    redisClient.on("error", (error) => console.error(`Error Redis: ${error}`));

    await redisClient.connect();
})();

module.exports = redisClient;
Then you need to create functions to handle set, get and del.
Now, just import the connection.
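For example, a minimal sketch of such a wrapper (assuming node-redis v4, where set/get/del return promises; the file and function names are only illustrative):

// redis-functions.js
const redisClient = require('./redis-con');

async function setValue(key, value) {
    return redisClient.set(key, value);
}

async function getValue(key) {
    return redisClient.get(key);
}

async function delValue(key) {
    return redisClient.del(key);
}

module.exports = { setValue, getValue, delValue };

Then elsewhere (inside an async function):

const { setValue, getValue } = require('./redis-functions');
await setValue('foo', 'bar');
console.log(await getValue('foo'));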


How to wait for a Redis connection?

I'm currently trying to use Node.js Kue for processing jobs in a queue, but I believe I'm not doing it right.
Indeed, the way I'm working now, I have two different services (which in this case I'm running with Docker Compose): one Web API built with Express which sends jobs to the queue, and one processing module. The issue here is with the processing module.
I've coded it as follows:
var kue = require('kue');
var config = require('./config');

var queue = kue.createQueue({
    prefix: config.redis.queuePrefix,
    redis: {
        port: config.redis.port,
        host: config.redis.host
    }
});

queue.process('jobType', function (job, done) {
    // do processing here...
});
When we run this with Node, it sits there waiting for things to be placed on the queue to do the processing.
There are two issues however:
It requires Redis to be available before running this module. If we run it without Redis already available, it crashes because the host is not accessible, and the process ends.
If Redis suddenly becomes unavailable, the processing module also crashes because it cannot establish the connection, and the process is killed.
How can I avoid these problems?
My guess is that I should somehow make the code "wait" for Redis, but I have no idea on how to do this.
How can this be done in this case?
You can use a promise to wait until Redis is loaded, then run your module.
loadRedis().then(() => {
    // load your module
})
Or you can use a generator to "stop" until Redis is loaded.
function* () {
    const redisLoaded = yield loadRedis();
    // load your module
}
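loadRedis isn't defined in the answer; a minimal sketch of what it could look like with the node_redis v2/v3 callback client, resolving on the client's 'ready' event (config comes from the question's code):

var redis = require('redis');

function loadRedis() {
    return new Promise(function (resolve, reject) {
        var client = redis.createClient({
            port: config.redis.port,
            host: config.redis.host
        });
        // 'ready' fires once the connection is established and usable.
        client.once('ready', function () { resolve(client); });
        client.once('error', reject);
    });
}

loadRedis().then(function () {
    // Redis is reachable, so it is now safe to create the Kue queue
    // and start processing jobs.
}).catch(function (err) {
    console.error('Could not connect to Redis:', err);
    process.exit(1);
});

This only covers start-up availability; handling Redis going away later still requires listening for 'error'/'end' events (or Kue's own error events) and reconnecting or exiting gracefully.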

Node.js + Socket.IO scaling with redis + cluster

Currently, I'm faced with the task of scaling a Node.js app using Amazon EC2. From what I understand, the way to do this is to have each child server use all available processes via cluster, and to use sticky connections to ensure that every user connecting to the server is "remembered" as to which worker their data is currently on from previous sessions.
After doing this, the next best move from what I know is to deploy as many servers as needed and use nginx to load balance between all of them, again using sticky connections to know which "child" server each user's data is on.
So when a user connects to the server, is this what happens?
Client connection -> Find/Choose server -> Find/Choose process -> Socket.IO handshake/connection etc.
If not, please allow me to better understand this load balancing task. I also do not understand the importance of redis in this situation.
Below is the code I'm using to use all CPUs on one machine for a separate Node.js process:
var express = require('express'),
    cluster = require('cluster'),
    net = require('net'),
    sio = require('socket.io'),
    sio_redis = require('socket.io-redis');

var port = 3502,
    num_processes = require('os').cpus().length;

if (cluster.isMaster) {
    // This stores our workers. We need to keep them to be able to reference
    // them based on source IP address. It's also useful for auto-restart,
    // for example.
    var workers = [];

    // Helper function for spawning worker at index 'i'.
    var spawn = function(i) {
        workers[i] = cluster.fork();

        // Optional: Restart worker on exit
        workers[i].on('exit', function(code, signal) {
            console.log('respawning worker', i);
            spawn(i);
        });
    };

    // Spawn workers.
    for (var i = 0; i < num_processes; i++) {
        spawn(i);
    }

    // Helper function for getting a worker index based on IP address.
    // This is a hot path so it should be really fast. The way it works
    // is by converting the IP address to a number by removing the dots,
    // then compressing it to the number of slots we have.
    //
    // Compared against "real" hashing (from the sticky-session code) and
    // "real" IP number conversion, this function is on par in terms of
    // worker index distribution only much faster.
    var worker_index = function(ip, len) {
        var s = '';
        for (var i = 0, _len = ip.length; i < _len; i++) {
            if (ip[i] !== '.') {
                s += ip[i];
            }
        }

        return Number(s) % len;
    };

    // Create the outside facing server listening on our port.
    var server = net.createServer({ pauseOnConnect: true }, function(connection) {
        // We received a connection and need to pass it to the appropriate
        // worker. Get the worker for this connection's source IP and pass
        // it the connection.
        var worker = workers[worker_index(connection.remoteAddress, num_processes)];
        worker.send('sticky-session:connection', connection);
    }).listen(port);
} else {
    // Note we don't use a port here because the master listens on it for us.
    var app = express();

    // Here you might use middleware, attach routes, etc.

    // Don't expose our internal server to the outside.
    var server = app.listen(0, 'localhost'),
        io = sio(server);

    // Tell Socket.IO to use the redis adapter. By default, the redis
    // server is assumed to be on localhost:6379. You don't have to
    // specify them explicitly unless you want to change them.
    io.adapter(sio_redis({ host: 'localhost', port: 6379 }));

    // Here you might use Socket.IO middleware for authorization etc.

    console.log("Listening");

    // Listen to messages sent from the master. Ignore everything else.
    process.on('message', function(message, connection) {
        if (message !== 'sticky-session:connection') {
            return;
        }

        // Emulate a connection event on the server by emitting the
        // event with the connection the master sent us.
        server.emit('connection', connection);

        connection.resume();
    });
}
I believe your general understanding is correct, although I'd like to make a few comments:
Load balancing
You're correct that one way to do load balancing is having nginx load balance between the different instances, and inside each instance have cluster balance between the worker processes it creates. However, that's just one way, and not necessarily always the best one.
Between instances
For one, if you're using AWS anyway, you might want to consider using ELB. It was designed specifically for load balancing EC2 instances, and it makes the problem of configuring load balancing between instances trivial. It also provides a lot of useful features, and (with Auto Scaling) can make scaling extremely dynamic without requiring any effort on your part.
One feature ELB has, which is particularly pertinent to your question, is that it supports sticky sessions out of the box - just a matter of marking a checkbox.
However, I have to add a major caveat, which is that ELB can break socket.io in bizarre ways. If you just use long polling you should be fine (assuming sticky sessions are enabled), but getting actual websockets working is somewhere between extremely frustrating and impossible.
Between processes
While there are a lot of alternatives to using cluster, both within Node and without, I tend to agree cluster itself is usually perfectly fine.
However, one case where it does not work is when you want sticky sessions behind a load balancer, as you apparently do here.
First off, it should be made explicit that the only reason you even need sticky sessions in the first place is because socket.io relies on session data stored in-memory between requests to work (during the handshake for websockets, or basically throughout for long polling). In general, relying on data stored this way should be avoided as much as possible, for a variety of reasons, but with socket.io you don't really have a choice.
Now, this doesn't seem too bad, since cluster can support sticky sessions, using the sticky-session module mentioned in socket.io's documentation, or the snippet you seem to be using.
The thing is, since these sticky sessions are based on the client's IP, they won't work behind a load balancer, be it nginx, ELB, or anything else, since all that's visible inside the instance at that point is the load balancer's IP. The remoteAddress your code tries to hash isn't actually the client's address at all.
That is, when your Node code tries to act as a load balancer between processes, the IP it tries to use will just always be the IP of the other load balancer, that balances between instances. Therefore, all requests will end up at the same process, defeating cluster's whole purpose.
You can see the details of this issue, and a couple of potential ways to solve it (none of which is particularly pretty), in this question.
The importance of Redis
As I mentioned earlier, once you have multiple instances/processes receiving requests from your users, in-memory storage of session data is no longer sufficient. Sticky sessions are one way to go, although other, arguably better solutions exist, among them central session storage, which Redis can provide. See this post for a pretty comprehensive review of the subject.
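A sketch of the central-session-storage idea (assuming express-session with the connect-redis store; note that older connect-redis versions accept host/port directly, while newer ones expect a ready-made Redis client):

var session = require('express-session');
var RedisStore = require('connect-redis')(session);

app.use(session({
    store: new RedisStore({ host: 'localhost', port: 6379 }),
    secret: 'replace-with-a-real-secret',
    resave: false,
    saveUninitialized: false
}));

// Session data now lives in Redis, so any worker on any instance can
// serve any request, and sticky sessions are no longer required for it.

(socket.io's in-memory handshake data is a separate matter, as discussed above.)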
Seeing as your question is about socket.io, though, I'll assume you probably meant Redis's specific importance for websockets, so:
When you have multiple socket.io servers (instances/processes), a given user will be connected to only one such server at any given time. However, any of the servers may, at any time, wish to emit a message to a given user, or even a broadcast to all users, regardless of which server they're currently under.
To that end, socket.io supports "Adapters", of which Redis is one, that allow the different socket.io servers to communicate among themselves. When one server emits a message, it goes into Redis, and then all servers see it (Pub/Sub) and can send it to their users, making sure the message will reach its target.
This, again, is explained in socket.io's documentation regarding multiple nodes, and perhaps even better in this Stack Overflow answer.
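For reference, a minimal standalone sketch of the adapter setup described above (assuming the socket.io-redis package, as in the question's code; the room and event names are made up):

var io = require('socket.io')(3000);
var redisAdapter = require('socket.io-redis');

// Every socket.io process, on every instance, points at the same Redis server.
io.adapter(redisAdapter({ host: 'localhost', port: 6379 }));

io.on('connection', function (socket) {
    socket.join('user-' + socket.handshake.query.userId);
});

// Any process can emit; the adapter relays it over Redis Pub/Sub to the
// process that actually holds the target user's connection.
io.to('user-42').emit('notification', { text: 'hello' });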

MongoDB connections from AWS Lambda

I'm looking to create a RESTful API using AWS Lambda/API Gateway connected to a MongoDB database. I've read that connections to MongoDB are relatively expensive, so it's best practice to retain a connection for reuse once it's been established, rather than making new connections for every new query.
This is pretty straight forward for normal applications as you can establish a connection during start up and reuse it during the applications lifetime. But, since Lambda is designed to be stateless retaining this connection seems to be less straight forward.
Therefore, I'm wondering what would be the best way to approach this database connection issue? Am I forced to make new connections every time a Lambda function is invoked or is there a way to pool/cache these connections for more efficient queries?
Thanks.
AWS Lambda functions should be defined as stateless functions, so they can't hold state like a connection pool.
This issue was also raised in this AWS forum post. On Oct 5, 2015, AWS engineer Sean posted that you should not open and close a connection on each request; instead, create a pool during code initialization, outside the handler block. But two days later the same engineer posted that you should not do this.
The problem is that you don't have control over Lambda's runtime environment. We do know that these environments (or containers) are reused, as described in the blog post by Tim Wagner. But the lack of control can lead you to drain all your resources, like reaching a connection limit in your database. It's up to you.
Instead of connecting to MongoDB from your Lambda function, you can use RESTHeart to access the database over HTTP. The connection pool to MongoDB is then maintained by RESTHeart. Keep in mind that, performance-wise, you'll be opening a new HTTP connection to RESTHeart on each request rather than using an HTTP connection pool like you could in a traditional application.
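A rough sketch of what querying RESTHeart over HTTP from a Lambda handler could look like (the hostname, database and collection names are purely illustrative, and the exact URL layout depends on your RESTHeart configuration):

const https = require('https');

exports.handler = async function (event) {
    // RESTHeart exposes MongoDB resources over HTTP, roughly /<db>/<collection>.
    const body = await new Promise((resolve, reject) => {
        https.get('https://restheart.example.com/mydb/users', (res) => {
            let data = '';
            res.on('data', (chunk) => { data += chunk; });
            res.on('end', () => resolve(data));
        }).on('error', reject);
    });

    return JSON.parse(body);
};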
You should assume lambdas to be stateless but the reality is that most of the time the vm is simply frozen and does maintain some state. It would be inefficient for Amazon to spin up a new process for every request so they often re-use the same process and you can take advantage of this to avoid thrashing connections.
To avoid connecting for every request (in cases where the lambda process is re-used):
Write the handler assuming the process is re-used, such that you connect to the database and have the Lambda re-use the connection pool (the db promise returned from MongoClient.connect).
In order for the Lambda not to hang waiting for you to close the db connection (db.close()), after servicing a request tell it not to wait for an empty event loop.
Example:
const { MongoClient } = require('mongodb');

var db = MongoClient.connect(MongoURI);

module.exports.targetingSpec = (event, context, callback) => {
    context.callbackWaitsForEmptyEventLoop = false;
    db.then((db) => {
        // use db
    });
};
From the documentation about context.callbackWaitsForEmptyEventLoop:
callbackWaitsForEmptyEventLoop
The default value is true. This property is useful only to modify the default behavior of the callback. By default, the callback will wait until the Node.js runtime event loop is empty before freezing the process and returning the results to the caller. You can set this property to false to request AWS Lambda to freeze the process soon after the callback is called, even if there are events in the event loop. AWS Lambda will freeze the process, any state data and the events in the Node.js event loop (any remaining events in the event loop processed when the Lambda function is called next and if AWS Lambda chooses to use the frozen process). For more information about callback, see Using the Callback Parameter.
RESTHeart is a REST-based server that runs alongside MongoDB. It maps most CRUD operations in Mongo to GET, POST, etc. requests, with extensible support when you need to write a custom handler (e.g. a specialized geoNear or geoSearch query).
I ran some tests executing Java Lambda functions connecting to MongoDB Atlas.
As already stated by other posters, Amazon does reuse the instances, however these may get recycled and the exact behaviour cannot be determined. So one could end up with stale connections. In my case, data is collected and pushed to the Lambda function every 5 minutes.
The Lambda basically does:
Build up or reuse connection
Query one record
Write or update one record
close the connection or leave it open
The actual amount of data is quite low. Depending on time of the day it varies from 1 - 5 kB. I only used 128 MB.
The Lambdas ran in N. Virginia, as this is the region the free tier is tied to.
When opening and closing the connection each time most calls take between 4500 - 9000 ms. When reusing the connection most calls are between 300 - 900 ms. Checking the Atlas console the connection count stays stable. For this case reusing the connection is worth it. Building up a connection and even disconnecting from a replica-set is rather expensive using the Java driver.
For a large scale deployment one should run more comprehensive tests.
Yes, there is a way to cache/retain a connection to MongoDB, and it's called connection pooling. You can use it with Lambda functions as well, like this:
For more information you can follow these links:
Using Mongoose With AWS Lambda
Optimizing AWS Lambda (a bit out of date)
const mongoose = require('mongoose');

let conn = null;

const uri = 'YOUR CONNECTION STRING HERE';

exports.handler = async function(event, context) {
  // Make sure to add this so you can re-use `conn` between function calls.
  context.callbackWaitsForEmptyEventLoop = false;

  const models = [{ name: 'User', schema: new mongoose.Schema({ name: String }) }];
  conn = await createConnection(conn, models);

  // e.g.
  const doc = await conn.model('User').findOne({});
  console.log('doc: ', doc);
};

const createConnection = async (conn, models) => {
  // Because `conn` is in the global scope, Lambda may retain it between
  // function calls thanks to `callbackWaitsForEmptyEventLoop`.
  // This means your Lambda function doesn't have to go through the
  // potentially expensive process of connecting to MongoDB every time.
  // readyState 0 = disconnected, 3 = disconnecting.
  if (conn == null || [0, 3].includes(conn.readyState)) {
    conn = await mongoose.createConnection(uri, {
      // Buffering means mongoose will queue up operations if it gets
      // disconnected from MongoDB and send them when it reconnects.
      // With serverless, better to fail fast if not connected.
      bufferCommands: false, // Disable mongoose buffering
      bufferMaxEntries: 0,   // and MongoDB driver buffering
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useCreateIndex: true
    });

    for (const model of models) {
      const { name, schema } = model;
      conn.model(name, schema);
    }
  }

  return conn;
};
Unfortunately, you may have to create your own RESTful API to answer MongoDB requests until AWS comes up with one. So far they only have what you need for their own DynamoDB.
The short answer is yes, you need to create a new connection, AND close it before the Lambda finishes.
The long answer is that, during my tests, you can actually pass your DB connection down in your handler like so (a MySQL example, as that's what I have to hand). You can't rely on the connection always being there, so check my example below; it may be that once your Lambdas haven't been executed for a while, the state from the handler is lost (a cold start). I need to do more tests to find out, but I have noticed that if a Lambda is getting a lot of traffic, using the example below, it doesn't create a new connection.
// MySQL.database.js
import * as mysql from 'mysql'

export default mysql.createConnection({
  host: 'mysql db instance address',
  user: 'MYSQL_USER',
  password: 'PASSWORD',
  database: 'SOMEDB',
})
Then in your handler import it and pass it down to the lambda that's being executed.
// handler.js
import MySQL from './MySQL.database.js'

const funcHandler = (func) => {
  return (event, context, callback) => {
    func(event, context, callback, MySQL)
  }
}

const handler = {
  someHandler: funcHandler(someHandler),
}

export default handler
Now in your Lambda you do...
export default (event, context, callback, MySQL) => {
  context.callbackWaitsForEmptyEventLoop = false
  // Check if there is a MySQL connection; if not, then open one.
  // Do ya thing, query away etc etc
  callback(null, responder.success())
}
The responder example can be found here. Sorry it's ES5, because that's where the question was asked.
Hope this helps!
Official Best Practice for Connecting from AWS Lambda
You should define the client to the MongoDB server outside the AWS Lambda handler function. Don't define a new MongoClient object each time you invoke your function. Doing so causes the driver to create a new database connection with each function call. This can be expensive and can result in your application exceeding database connection limits.
As an alternative, do the following:
Create the MongoClient object once.
Store the object so your function can reuse the MongoClient across function invocations.
Step 1
Isolate the call to the MongoClient.connect() function into its own module so that the connections can be reused across functions. Let's create a file mongodb-client.js for that:
mongodb-client.js:
const { MongoClient } = require('mongodb');
// Export a module-scoped MongoClient promise. By doing this in a separate
// module, the client can be shared across functions.
const client = new MongoClient(process.env.MONGODB_URI);
module.exports = client.connect();
Step 2
Import the new module and use it in function handlers to connect to database.
some-file.js:
const clientPromise = require('./mongodb-client');

// Handler
module.exports.handler = async function(event, context) {
    // Get the MongoClient by calling await on the connection promise. Because
    // this is a promise, it will only resolve once.
    const client = await clientPromise;

    // Use the connection to return the name of the connected database for example.
    return client.db().databaseName;
}
Resources
For more info, check the docs.
We tested an AWS Lambda that connected every minute to our self managed MongoDB.
The connections were unstable and the Lambda failed.
We resolved the issue by wrapping MongoDB with the Nginx reverse-proxy stream module:
How to setup MongoDB behind Nginx Reverse Proxy
stream {
    server {
        listen <your incoming Mongo TCP port>;
        proxy_connect_timeout 1s;
        proxy_timeout 3s;
        proxy_pass stream_mongo_backend;
    }

    upstream stream_mongo_backend {
        server <localhost:your local Mongo TCP port>;
    }
}
In addition to saving the connection for reuse, increase the memory allocation for the Lambda function. AWS allocates CPU proportionally to the memory allocation, and when changing from 128 MB to 1.5 GB the connection time to MongoDB Atlas dropped from 4 s to 0.5 s.
Read more here: https://aws.amazon.com/lambda/faqs/
I was facing the same issue a while ago, but I resolved it by putting my MongoDB in the same AWS account, on EC2.
I created a MongoDB instance on EC2 in the same AWS account where my Lambda function resides.
Now I can access Mongo from the Lambda function via its private IP.

SocketIO on a Node.js cluster

I have a standalone Node.js app which has SocketIO server that listens on a certain port, e.g. 8888. Now I am trying to run this app in a cluster and because cluster randomly assigns workers to requests, SocketIO clients in XHR polling mode once handshaken and authorized with one worker get routed to another worker where they're not handshaken and the mess begins.
And because workers don't share anything, I can't find a workaround. Is there a known solution to this issue?
There is no "simple" solution. What you have to do is the following:
If a client connects to a worker, save the connection-id together with the worker-id and, potentially, an additional identification-id in a global store (i.e. one accessible by all workers), such as redis.
If a client gets routed to another worker, use the store to look up which worker is responsible for this client (either with the connection-id or with the additional identification-id), and then hand it over to that worker (either with the Node.js master/worker communication or via redis pub/sub).
I have implemented such a thing with sock.js, plus an additional degree of complexity: I have two Node.js servers with four workers each, so I had to use redis pub/sub for worker/worker communication, because it is not guaranteed that they are on the same machine.
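A rough sketch of that lookup-and-hand-over idea (assuming the node_redis callback client; the key and channel names are made up for illustration):

var redis = require('redis');
var store = redis.createClient();
var sub = redis.createClient(); // a client in subscriber mode can't run other commands

var WORKER_ID = String(process.pid);

// When a client handshakes with this worker, remember who owns the connection.
function registerConnection(connectionId) {
    store.set('owner:' + connectionId, WORKER_ID);
}

// When a request arrives for a connection this worker doesn't know,
// look up the owner and forward the message over its Pub/Sub channel.
function forwardToOwner(connectionId, message) {
    store.get('owner:' + connectionId, function (err, ownerId) {
        if (err || !ownerId) return;
        store.publish('worker:' + ownerId,
            JSON.stringify({ connectionId: connectionId, message: message }));
    });
}

// Each worker listens on its own channel for handed-over messages.
sub.subscribe('worker:' + WORKER_ID);
sub.on('message', function (channel, payload) {
    var data = JSON.parse(payload);
    // deliver data.message to the local socket identified by data.connectionId
});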
Actually there is a simple solution: using Redis to store sockets states.
Everything is explained in Socket.IO documentation:
The default 'session' storage in Socket.IO is in memory (MemoryStore). The MemoryStore only allows you to deploy socket.io on a single process. If you want to scale to multiple process and / or multiple servers you can use our RedisStore which uses the Redis NoSQL database as man in the middle.
So in order to change the store instance to RedisStore we add this:
var RedisStore = require('socket.io/lib/stores/redis')
, redis = require('socket.io/node_modules/redis')
, pub = redis.createClient()
, sub = redis.createClient()
, client = redis.createClient();
// Needs to be done after 'listen()'
io.set('store', new RedisStore({
redisPub : pub
, redisSub : sub
, redisClient : client
}));
Of course you will need to have a redis server running.

socketio and redisstore scaling efficiency

I am working on a pretty big project that involves sending data between clients. So, I am just researching on some new technologies out there. Anyways, thought I'd give Nodejs a try. I just have a question about socketio and redis.
When we are using the pub/sub functions in socketio, does every client connection create a new connection to redis? Or does socketio use a maximum of three connections in total (regardless of the number of clients) to do the pub/sub stuff?
From the source, it seems that each client connection has two associated subscriptions to Redis (this.store in the code), but that each socket.io server has only three connections to Redis (source).
this.store.subscribe('message:' + data.id, function (packet) {
    self.onClientMessage(data.id, packet);
});

this.store.subscribe('disconnect:' + data.id, function (reason) {
    self.onClientDisconnect(data.id, reason);
});
Redis should be able to handle a lot of connections as well as subscriptions, but benchmarking is recommended as always.
