I'm researching a good way to implement multiple databases for multi-tenant support using node.js + mongoose and mongodb.
I've found out that mongoose supports a method called createConnection() and I'm wondering about the best practice for using it. Currently I am storing all of those connections in an array, separated by tenant. It'd be like:
var mongoose = require('mongoose');

var connections = [
  { tenant: 'TenantA', connection: mongoose.createConnection('mongodb://localhost/tenant-a') },
  { tenant: 'TenantB', connection: mongoose.createConnection('mongodb://localhost/tenant-b') }
];
Let's say the user sends the tenant he will be logged into via a request header, and I read it in a very early middleware in express.
app.use(function (req, res, next) {
  var entry = connections.find(function (c) { return c.tenant === req.get('tenant'); });
  req.mongoConnection = entry && entry.connection;
  next();
});
The question is: is it OK to store those connections statically, or would it be better to create the connection every time a request is made?
Edit 2014-09-09 - More info on software requirements
At first we are going to have around 3 tenants, but our plan is to increase that number to 40 in a year or two. There are more read operations than write ones; it's basically a big data system with machine learning. It is not freemium software. The databases are quite big because of the amount of historical data, but it is not a problem to move very old data to another location (we have already thought about that). We plan to shard later if we run out of resources on our database machine; we could also separate some tenants onto different machines.
The thing that most intrigues me is that some people say it's not a good idea to have prefixed collections for multi-tenancy, but the reasons they give are very brief.
https://docs.compose.io/use-cases/multi-tenant.html
http://themongodba.wordpress.com/2014/04/20/building-fast-scalable-multi-tenant-apps-with-mongodb/
I would not recommend manually creating and managing those separate connections. I don't know the details of your multi-tenant requirements (number of tenants, size of databases, expected number of transactions, etc.), but I think it would be better to go with something like Mongoose's useDb function. Then Mongoose can handle all the connection pool details.
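For illustration, here is a minimal sketch of the useDb approach; the connection string, schema and header name are assumptions, not taken from the question:

var express = require('express');
var mongoose = require('mongoose');

var app = express();
var userSchema = new mongoose.Schema({ name: String });

// one base connection; useDb() derives per-tenant handles that share its socket pool
var baseConnection = mongoose.createConnection('mongodb://localhost:27017/base');

app.use(function (req, res, next) {
  var tenantDb = baseConnection.useDb(req.get('tenant'));
  req.models = { User: tenantDb.model('User', userSchema) };
  next();
});

Because useDb shares the underlying pool, the number of open sockets stays constant no matter how many tenants are active.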
update
The first direction I would explore is to set up each tenant on a separate node process. There are some interesting benefits to running your tenants in separate node processes. It makes sense from a security standpoint (isolated memory) and from a stability standpoint (one tenant process crash doesn't affect the others).
Assuming you're basing the tenancy off of the URL, you would set up a proxy server in front of the actual tenant servers. Its job would be to look at the URL and route to the correct process based on that information. This is a very straightforward node http proxy setup. Each tenant instance could be the exact same code base, but launched with a different config (which tells it what mongo connection string to use).
This means you're able to design your actual application as if it weren't multi-tenant. Each process only knows about one mongo database, and no multi-tenant logic is necessary. It also enables you to easily split up traffic later based on load. If you need to split up the tenants for performance reasons, you can do it transparently at the proxy level. The DNS can all stay the same, and you can just move the server that the instances are on behind the scenes. You can even have the proxy balance the requests for a tenant between multiple servers.
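A rough sketch of such a proxy using the node-http-proxy package; the hostnames and port assignments are made up for the example:

var http = require('http');
var httpProxy = require('http-proxy');

var proxy = httpProxy.createProxyServer({});

// each tenant's hostname maps to the port its dedicated node process listens on
var tenantPorts = {
  'tenant-a.example.com': 3001,
  'tenant-b.example.com': 3002
};

http.createServer(function (req, res) {
  var port = tenantPorts[req.headers.host];
  if (!port) {
    res.writeHead(404);
    return res.end('Unknown tenant');
  }
  proxy.web(req, res, { target: 'http://127.0.0.1:' + port });
}).listen(80);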
Related
We are trying to implement the strategy outlined in the following presentation (slides 13-18) using nodejs/mongo-native driver.
https://www.slideshare.net/mongodb/securing-mongodb-to-serve-an-awsbased-multitenant-securityfanatic-saas-application
In summary:
Create a connection pool to mongodb from node.js.
For every request for a tenant, get a connection from the pool and "authenticate" it. Use the authenticated connection to serve the request. After the response, return the connection to the pool.
I'm able to create a connection pool to mongodb without specifying any database using the mongo-native driver like so:
const client = new MongoClient('mongodb://localhost:27017', { useNewUrlParser: true, poolSize: 10 });
However, in order to get a db object, I need to do the following:
const db = client.db(dbName);
This is where I would like to authenticate the connection, but AFAICS this functionality has been deprecated/removed from the more recent mongo drivers (node.js and java).
Going by the presentation, it looks like this was possible to do with older versions of the Java driver.
Is it even possible for me to use a single connection pool and authenticate tenants to individual databases using the same connections?
The alternative we have is to have a connection pool per tenant, which is not attractive to us at this time.
Any help will be appreciated, including reasons why this feature was deprecated/removed.
it's me from the slides!! :) I remember that session, it was fun.
Yeah that doesn't work any more, they killed this magnificent feature like 6 months after we implemented it and we were out with it in Beta at the time. We had to change the way we work..
It's a shame since, to this day, in Mongo, "connection" (network stuff, SSL, cluster identification) and authentication are 2 separate actions.
Think about when you run the mongo shell: you provide the host, port, replica set if any, and you're in, connected! But not authenticated. You can then authenticate as user1, do stuff, and then authenticate as user2 and do stuff only user2 can do. And this is done on the same connection! Without going through the overhead of creating the channel again, SSL handshake and so on...
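For illustration, that shell flow looks roughly like this (host, database names, users and passwords are made up):

// $ mongo --host localhost --port 27017    <- connect only, no credentials yet
use tenant1
db.auth('user1', 'user1Password')   // authenticate as user1 on this connection
db.orders.find()                    // do stuff user1 is allowed to do
use tenant2
db.auth('user2', 'user2Password')   // now authenticate as user2, on the same connection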
Back then, the driver let us have a connection pool of "blank" connections that we could authenticate at will to the current tenant in context of that current execution thread.
Then they deprecated this capability, I think it was with Mongo 2.4. Now they only support connections that are authenticated at creation. We asked enterprise support; they didn't say why, but to me it looked like they decided this approach is not secure: "old" authentication may leak and linger on that "not so blank" reusable connection.
We made a change in our multi-tenancy infra implementation, from a large pool of blank connections to many (small) pools of authenticated connections, a pool per tenant. These pools per tenant can be extremely small, like 3 or 5 connections. This solution scaled nicely to several hundreds of tenants, but to meet thousands of tenants we had to make all kinds of optimizations to create pools as needed, close them after idle time, lazy creation for non-active or dormant tenants, etc. This allowed us to scale even more... We're still looking into solutions and optimizations.
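A rough Node.js sketch of that pool-per-tenant idea with lazy creation and idle cleanup; the connection strings, credentials, pool size and timings are illustrative only (driver 3.x style, matching the question above):

const { MongoClient } = require('mongodb');

const pools = new Map(); // tenantId -> { client, lastUsed }
const IDLE_MS = 10 * 60 * 1000; // close pools idle for 10 minutes

async function getTenantDb(tenantId) {
  let entry = pools.get(tenantId);
  if (!entry) {
    // lazy creation: a small dedicated pool per tenant, opened on first use
    const client = await MongoClient.connect(
      'mongodb://app_' + tenantId + ':secret@localhost:27017/' + tenantId,
      { useNewUrlParser: true, poolSize: 3 }
    );
    entry = { client };
    pools.set(tenantId, entry);
  }
  entry.lastUsed = Date.now();
  return entry.client.db(tenantId);
}

// periodically close the pools of dormant tenants
setInterval(() => {
  const now = Date.now();
  for (const [tenantId, entry] of pools) {
    if (now - entry.lastUsed > IDLE_MS) {
      entry.client.close();
      pools.delete(tenantId);
    }
  }
}, 60 * 1000);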
You could always go back to a global pool of authenticated connections to a Mongo user that has access to multiple databases. Yes, you can switch databases on that same authenticated connection. You just can't switch authentication.
This is an example with the pure Mongo Java driver; we used Spring, which provides similar functionality:
MongoClient mongoClient = new MongoClient();
DB cust1db = mongoClient.getDB("cust1");
cust1db.get...
DB cust2db = mongoClient.getDB("cust2");
cust2db.get...
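The rough equivalent with the Node.js driver would be something like this (connection string and credentials are assumptions):

const { MongoClient } = require('mongodb');

// one pool, authenticated once as a user with access to all tenant databases
const client = new MongoClient('mongodb://appUser:secret@localhost:27017', { useNewUrlParser: true, poolSize: 10 });

async function main() {
  await client.connect();
  const cust1db = client.db('cust1'); // switching databases reuses the same pool
  const cust2db = client.db('cust2');
  // cust1db.collection(...).find(...), cust2db.collection(...).find(...)
}

main().catch(console.error);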
Somewhat related: I would recommend looking at MongoDB encryption at rest; it's an enterprise feature. It's the only way to encrypt each database (each customer) with a different key.
I am trying to find a good way to horizontally scale a stateful NodeJS service.
The Problem
The problem is that most of the options I find online assume the service is stateless. The NodeJS cluster documentation says:
Node.js [Cluster] does not provide routing logic. It is, therefore important to design an application such that it does not rely too heavily on in-memory data objects for things like sessions and login.
https://nodejs.org/api/cluster.html
We are using Kubernetes so scaling across multiple machines would also be easy if my service was stateless, but it is not.
Current Setup
I have a list of objects that stay in memory, each object alone is a transaction boundary. Requests to this service always have the object ID in the url. Requests to the same object ID are put into a queue and processed one at a time.
Desired Setup
I would like to keep this interface to the external world but internally spread this list of objects across multiple nodes and based on the ID in the URL the request would be routed to the appropriate node.
What is the usual way to do this in NodeJS? I've seen people use the user session to make sure a given user always goes to the same node; what I would like to do is the same thing, but using the ID in the URL instead of the user session.
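For illustration, the routing decision could be as simple as hashing the object ID to pick a node deterministically; this is only a sketch and the node list is made up:

const crypto = require('crypto');

const nodes = ['http://10.0.0.1:3000', 'http://10.0.0.2:3000', 'http://10.0.0.3:3000'];

// the same ID always maps to the same node, so its in-memory state lives in one process
function nodeForObject(objectId) {
  const hash = crypto.createHash('md5').update(objectId).digest();
  return nodes[hash.readUInt32BE(0) % nodes.length];
}

Note that plain modulo hashing remaps most IDs when the node list changes; consistent hashing avoids that.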
I have a forum which contains groups; new groups are created all the time by users. Currently I'm using node-cache with a TTL to cache groups and their content (posts, likes and comments).
The server worked great at the beginning, but performance decreased as more people started using the app, so I decided to use the node.js Cluster module as the next step to improve performance.
The node-cache will cause a consistency problem: the same group could be cached in two workers, so if one of them changes it, the other will not know (unless you do something about it).
The first solution that came to my mind is using redis to store the whole group and its content with the help of redis data types (sets and hash objects), but I don't know how efficient this could be.
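For reference, a minimal sketch of what that could look like with the node-redis v4 promise API; the key layout and field names are made up:

const { createClient } = require('redis');
const redis = createClient();

async function cacheGroup(group) {
  // a hash for the group's scalar fields, a set for its post ids, with a TTL like node-cache
  await redis.hSet('group:' + group.id, { name: group.name, likes: String(group.likes) });
  if (group.postIds.length) {
    await redis.sAdd('group:' + group.id + ':posts', group.postIds);
  }
  await redis.expire('group:' + group.id, 600);
}

async function getGroup(groupId) {
  const [fields, postIds] = await Promise.all([
    redis.hGetAll('group:' + groupId),
    redis.sMembers('group:' + groupId + ':posts')
  ]);
  return Object.assign({ postIds: postIds }, fields);
}

// call await redis.connect() once at startup, before handling requests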
The other solution is using redis to map requests to the correct worker. In this case the cached data is distributed randomly across workers, so when a worker receives a request related to some group, it looks up the group's owner (the worker that holds this group instance in memory) in redis, asks it for the wanted data using node-ipc, and then returns it to the user.
Is there any problem with the first solution?
The second solution does not provide fairness (all the popular groups could land in the same worker); is there a solution for this?
Any suggestions?
Thanks in advance
Let's say that when a user logs into a webapp, he sees a list of information.
Let's say that list of information is served by one of two dynos (via heroku), but that the list of information originates from a single mongo database (i.e., the nodejs dynos are just passing the mongo information to a user when he logs into the webapp).
Question: Suppose I want to make it possible for a user to both modify and add to that list of information.
At a scale of 1,000-10,000 users, is the following strategy suitable:
User modifies/adds to data; HTTP POST sent to one of the two nodejs dynos with the updated data.
Dyno (whichever one it may be) takes modification/addition of data and makes a direct query into the mongo database to update the data.
Dyno sends confirmation back to the client that the update was successful (a sketch of this flow follows).
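A minimal sketch of those three steps; the route, database, collection and field names are assumptions:

const express = require('express');
const { MongoClient, ObjectId } = require('mongodb');

const app = express();
app.use(express.json());

const client = new MongoClient('mongodb://localhost:27017', { useNewUrlParser: true, poolSize: 10 });
const items = () => client.db('app').collection('items');

// steps 1 and 2: receive the modification and write it straight to MongoDB
app.post('/items/:id', async (req, res) => {
  const result = await items().updateOne(
    { _id: new ObjectId(req.params.id) },
    { $set: req.body }
  );
  // step 3: confirm to the client
  res.json({ ok: true, modified: result.modifiedCount });
});

client.connect().then(() => app.listen(3000));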
Is this OK? Would I likely have to add more dynos (heroku)? I'm basically worried that if a bunch of users are trying to access a single database at once, it will be slow, or that I'm somehow risking corrupting the entire database at the 1,000-10,000 person scale. Is this fear reasonable?
Short answer: yes, it's a reasonable fear. Longer answer: it depends.
MongoDB will queue the requests and handle them in the order it receives them. Depending on how much of the data is being served from memory, it may or may not be fast enough.
NodeJS follows the same pattern: it queues the requests it can't process yet and executes them when resources become available.
The only way to tell if performance is being hindered is by monitoring it, and seeing if resources consistently hit a threshold you're uncomfortable with passing. On the upside, during your discovery phase your clients will probably only notice a few milliseconds of delay.
The proper way to handle that is to spin up a new instance to handle the traffic as resources get consumed.
Your database likely won't corrupt, but if your data is important (and why would you collect it if it isn't?), you should be creating a replica set. I would probably go with a replica set of data before I go with a second instance of node.
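For reference, pointing the dynos at a replica set is mostly a connection-string change; the hostnames and replica set name below are made up:

const { MongoClient } = require('mongodb');

// the driver discovers the primary and fails over automatically
const client = new MongoClient(
  'mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017/app?replicaSet=rs0'
);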
I wrote a multi-process realtime WebSocket server which uses the session id to load-balance traffic to the relevant worker based on the port number that it is listening on. The session id contains the hostname, source port number, worker port number and the actual hash id which the worker uses to uniquely identify the client. A typical session id would look like this:
localhost_9100_8000_0_AoT_eIwV0w4HQz_nAAAV
I would like to know the security implications for having the worker port number (in this case 9100) as part of the session id like that.
I am a bit worried about Denial of Service (DoS) threats. In theory, this could allow a malicious user to generate a large number of HTTP requests targeted at a specific port number (for example by using a fake sessionID which contains that port number). But is this a serious threat (assuming you have decent firewalls)? How do big companies like Google handle sticky sessions from a security perspective?
Are there any other threats which I should consider?
The reason why I designed the server like this is to account for the initial HTTP handshake and also for when the client does not support WebSocket (in which case HTTP long-polling is used - And hence subsequent HTTP requests from a client need to go to the same worker in the backend).
So there are several sub-questions in your question. I'll try to split them up and answer them accordingly:
Is DoS-Attack on a specific worker a serious threat?
It depends. If you will have 100 users, probably not. But you can be sure that there are people out there who will have a look at your application and try to figure out its weaknesses and exploit them.
Now, is a DoS attack on single workers a serious possibility if you can just attack the whole server? I would actually say yes, because it is a more precise attack: you need fewer resources to kill the workers when you do it one by one. However, if you allow connections from the outside only on port 80 for HTTP and block everything else, this problem is solved.
How do big companies like Google handle dealing with sticky sessions?
Simple answer: who says they do? There are multiple other ways to solve the problem of sessions when you have a distributed system:
don't store any session state on the server; just have a key in the cookie with which you can identify the user again, similar to an automatic login (see the sketch after this list)
store the session state in a database or object storage (this will add a lot of overhead)
store session information in the proxy (or broker, http endpoint, ...) and send it together with the request to the next worker
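As an illustration of the first option, a self-contained token that any worker can verify without shared state might look like this (the secret and token format are made up; production code should also use a constant-time comparison and an expiry):

const crypto = require('crypto');

const SECRET = 'replace-with-a-real-secret'; // must be the same on every worker

// issue a token the client keeps in a cookie
function issueToken(userId) {
  const signature = crypto.createHmac('sha256', SECRET).update(userId).digest('hex');
  return userId + '.' + signature;
}

// any worker can verify the token without looking anything up
function verifyToken(token) {
  const [userId, signature] = token.split('.');
  const expected = crypto.createHmac('sha256', SECRET).update(userId).digest('hex');
  return signature === expected ? userId : null;
}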
Are there any other threats which I should consider?
There are always unforeseen threats, and that's the reason why you should never publish more information than necessary. For the same reason, most big companies don't even publish the correct name and version of their web server (for Google it is gws, for instance).
That being said, I see why you might want to keep your implementation, but maybe you can modify it slightly: store in your load balancer a dictionary with a hashed value of the hostname, source port number and worker port number, and use a session id made up of two hashes. Then the load balancer knows, by looking into the dictionary, which worker the request needs to be sent to. This info should be saved together with a timestamp of when it was last used, and every minute you can delete unused entries.
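A rough sketch of that dictionary idea; the hashing, key layout and timings are illustrative only:

const crypto = require('crypto');

// clientHash -> { routeHash, workerPort, lastSeen }
const sessionTable = new Map();

function registerSession(clientHash, hostname, sourcePort, workerPort) {
  const routeHash = crypto.createHash('sha256')
    .update(hostname + ':' + sourcePort + ':' + workerPort)
    .digest('hex');
  sessionTable.set(clientHash, { routeHash, workerPort, lastSeen: Date.now() });
  // the public session id exposes only two hashes, never the worker port itself
  return clientHash + '_' + routeHash;
}

function workerForSession(sessionId) {
  const clientHash = sessionId.split('_')[0];
  const entry = sessionTable.get(clientHash);
  if (entry) entry.lastSeen = Date.now();
  return entry ? entry.workerPort : null;
}

// every minute, drop entries that have not been used recently
setInterval(() => {
  const cutoff = Date.now() - 60 * 1000;
  for (const [key, entry] of sessionTable) {
    if (entry.lastSeen < cutoff) sessionTable.delete(key);
  }
}, 60 * 1000);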