replicaset vs multi-mongos vs multiple connections - node.js

What is the difference between these features of mongoose, and why would you use each of them?
For now I just need a method to transfer a document from one database to another.

Replica-Set
A replica-set is two or more MongoDB servers that mirror the same data. Reads can be served by any member of the set, but writes can only be handled by a single server (the "Master" or "Primary").
An application can only connect to the replica-set members it knows about, so you need to tell it the hostnames and ports of all of them. There are cases where you want to restrict an application to specific members; in that case you simply don't tell it about the other servers.
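As a rough illustration, connecting a Mongoose application to a three-member replica set might look like the following sketch (the hostnames and the set name rs0 are placeholders):

    // Minimal sketch, assuming Mongoose 5.x or later; hostnames and "rs0" are placeholders.
    const mongoose = require('mongoose');
    mongoose.connect(
      'mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017/mydb',
      { replicaSet: 'rs0' }
    );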
Multiple mongos
Another way to scale MongoDB across multiple servers is sharding. A sharded cluster consists of multiple replica-sets or stand-alone MongoDB servers, each of which holds only a part of the data. This improves both read and write performance, but it is technically more complex. When an application connects to a cluster, it doesn't connect to the MongoDB processes directly. Each connection goes through a MongoDB router (mongos) instead, which forwards each query to the mongod instances that are responsible for it. For increased performance and redundancy, a cluster can have multiple mongos servers; when that is the case, clients should pick one at random for each connection.
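For a sharded cluster, the connection string lists the mongos routers instead of the data-bearing servers. A hedged sketch (the hostnames are placeholders):

    // The driver can use any of the listed mongos routers and fail over to another
    // if one becomes unavailable.
    mongoose.connect('mongodb://mongos1.example.com:27017,mongos2.example.com:27017/mydb');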
Multiple connections
When your application opens multiple connections to the database, it can perform multiple requests in parallel. Usually the database driver handles this automatically, so you don't have to worry about it unless you need to connect to multiple databases at the same time or you need connections with different settings for some reason.
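Since the original question was about moving a document between databases, here is a minimal sketch using two separate Mongoose connections; the database names, the collection name items and the schema-less model are assumptions for illustration:

    const mongoose = require('mongoose');

    const source = mongoose.createConnection('mongodb://localhost:27017/sourceDb');
    const target = mongoose.createConnection('mongodb://localhost:27017/targetDb');

    // A schema-less model so documents of any shape can be copied as-is.
    const anySchema = new mongoose.Schema({}, { strict: false });
    const SourceItem = source.model('Item', anySchema, 'items');
    const TargetItem = target.model('Item', anySchema, 'items');

    async function transfer(id) {
      const doc = await SourceItem.findById(id).lean(); // read from the source database
      if (!doc) return;
      await TargetItem.create(doc);                     // write the copy to the target database
      await SourceItem.deleteOne({ _id: id });          // optional: remove the original
    }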

Related

How does Postgres handle more requests than connections

While going through the Postgres architecture, one of the things mentioned was that Postgres has a connection limit of 500 (which can be modified). And to fetch any data from Postgres, we first need to make a connection to it. So what happens if 10k simultaneous requests hit the DB? How do the requests map to the connection limit, since the limit is 500? Do we need to increase the limit, create more Postgres instances, or is concurrency in play?
If there are 10000 concurrent statements running on a single database, any hardware will be overloaded. You just cannot do that.
Even 500 is way too many concurrent requests, so that value is too high for max_connections (or for the number of concurrent active sessions to be precise).
The good thing is that you don't have to do that. You use a connection pool that acts as a proxy between the application and the database. If your database statements are sufficiently short, you can easily handle thousands of concurrent application users with a few dozen database connections. This protects the database from getting overloaded and avoids opening database connections frequently, which is expensive.
If you try to open more database connections than max_connections allows, you will get an error message. If more processes request a database connection from the pool than the limit allows, some sessions will hang and wait until a connection is available. Yet another point for using a connection pool!
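As a concrete sketch with node-postgres (the pool size of 20 and the users table are made up for illustration):

    const { Pool } = require('pg');

    // A few dozen pooled connections, well below the server's max_connections.
    const pool = new Pool({ max: 20 });

    // Thousands of concurrent callers can share the pool; requests simply wait
    // for a free connection instead of opening a new one each time.
    async function getUser(id) {
      const { rows } = await pool.query('SELECT * FROM users WHERE id = $1', [id]);
      return rows[0];
    }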

Should a single Mongoose connection be used for all requests? Or different connection for each request?

I'm coming from SQL databases, and I'm wondering whether single vs. multiple connections to MongoDB behave any differently than they do in SQL databases.
Are there any performance or security issues to using any of the approaches?
Code-wise there is no problem with connecting to your database only once. What you can do, however, is increase the number of sockets/connections kept open by Mongoose by specifying the poolSize option when connecting (see the mongoose connection docs; the default is 5). This makes sense, for example, if a few slow queries would otherwise block many fast ones. It can also make sense to create multiple database connections to separate packages/models from each other, but be careful not to introduce race conditions that way. Finally, depending on your requirements, it can make sense to use one connection for writes and another for reads, against either the same database or a replica/secondary, to improve performance.
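A hedged example of raising the pool size on a single connection (option name from Mongoose 5.x; in Mongoose 6+ the equivalent option is maxPoolSize):

    // 20 sockets instead of the default 5, so slow queries block fewer fast ones.
    mongoose.connect('mongodb://localhost:27017/mydb', { poolSize: 20 });

    // Optional: a separate read connection that prefers secondaries (assumes a
    // replica set named rs0; hostnames are placeholders).
    const readConn = mongoose.createConnection(
      'mongodb://db1.example.com:27017,db2.example.com:27017/mydb?replicaSet=rs0&readPreference=secondaryPreferred'
    );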

MongoDB Performance when connecting to multiple databases via parent-child connections

When connecting to a Mongo server containing multiple DBs, what is the more performant approach using the node-mongodb-native driver?
Let's say I have 8 DBs (db1...db8) on the same Mongo server. My Node app needs to connect to all 8, depending on the queries it receives. Which is the better option for me:
1) Create 8 separate connections (1 with each db)
OR
2) Create one parent connection to the server on the test db and then call db.db 8 times to create 8 child connections under that parent. As I read in the docs (http://mongodb.github.io/node-mongodb-native/2.0/api/Db.html#db), all 8 child connections will run on the same socket.
Has anyone researched into this or has some background or thoughts that can help me determine the right course of action?
How granular is MongoDB concurrency? This depends on the version. Since MongoDB 3, many operations lock at the document level; earlier versions would lock the entire collection. Some operations still lock the entire instance (i.e. the server), which means an operation (most likely one involving multiple databases) can sometimes block an entire instance and affect all databases within it. https://docs.mongodb.com/manual/faq/concurrency/#how-granular-are-locks-in-mongodb
Threading model: Node.js is asynchronous while MongoDB is not; MongoDB uses one thread per socket. If you perceive operations blocking each other, you should keep separate connection pools. http://mongodb.github.io/node-mongodb-native/2.2/reference/faq/
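To make option 2 concrete, a sketch with a recent node-mongodb-native driver (3.x+ API; the database and collection names are placeholders). All the Db handles share the client's underlying connection pool:

    const { MongoClient } = require('mongodb');

    async function main() {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      const dbs = {};
      for (let i = 1; i <= 8; i++) {
        dbs['db' + i] = client.db('db' + i); // lightweight handle, no extra sockets opened here
      }
      const doc = await dbs.db1.collection('items').findOne({});
      console.log(doc);
      await client.close();
    }
    main();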

Cassandra Failed to create a selector. Multithreading multiple concurrent cassandra connections

I am running an ExecutorService of more than 50 threads concurrently. Each thread is opening a connection to Cassandra and performing inserts using springframework.data.cassandra. The problem is when I open more than 50 connections at a time, I get the following error.
Caused by: org.jboss.netty.channel.ChannelException: Failed to create a selector.
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.openSelector(AbstractNioSelector.java:343)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.<init>(AbstractNioSelector.java:100)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.<init>(AbstractNioWorker.java:52)
at org.jboss.netty.channel.socket.nio.NioWorker.<init>(NioWorker.java:45)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:45)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.createWorker(NioWorkerPool.java:28)
at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.newWorker(AbstractNioWorkerPool.java:143)
at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.init(AbstractNioWorkerPool.java:81)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:39)
at org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:33)
at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.<init>(NioClientSocketChannelFactory.java:151)
at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.<init>(NioClientSocketChannelFactory.java:116)
at com.datastax.driver.core.Connection$Factory.<init>(Connection.java:532)
at com.datastax.driver.core.Cluster$Manager.<init>(Cluster.java:1201)
at com.datastax.driver.core.Cluster$Manager.<init>(Cluster.java:1144)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:121)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:108)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:177)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:1109)
If I open exactly 50 threads (or fewer), it works fine. Is there a way to configure this so I can allow more? In my cassandra.yaml file, rpc_max_threads is, according to the comments, unlimited by default.
My guess is you are overwhelming your OS by creating too many connections. You should only create 1 Cluster instance per Cassandra cluster. Clusters create Sessions, which manage their own connection pools. Both Cluster and Session are thread safe, so you can share them between threads.
Four simple rules for coding with the driver distills these concepts well:
When writing code that uses the driver, there are four simple rules that you should follow that will also make your code efficient:
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
...
A Cluster instance allows you to configure many important aspects of the way connections and queries are handled. At this level you can configure everything from the contact points (the addresses of the nodes to be contacted initially, before the driver performs node discovery) to the request routing policy, retry and reconnection policies, and so forth. Generally such settings are set once at the application level.
While the Session instance is centered around query execution, it also manages the per-node connection pools. The Session is a long-lived object and should not be used in a request/response, short-lived fashion. Your code should share the same Cluster and Session instances across your application.
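The question is about the Java driver, but the same rule carries over to the Node.js DataStax driver (this document's main environment), where a single Client object plays the role of Cluster plus Session. A hedged sketch, with placeholder contact points and keyspace:

    // cassandra.js - create the Client once and share it across the application,
    // instead of one client (and its connection pools) per worker thread or request.
    const cassandra = require('cassandra-driver');

    const client = new cassandra.Client({
      contactPoints: ['10.0.0.1', '10.0.0.2'], // placeholder node addresses
      localDataCenter: 'datacenter1',          // required by driver 4.x
      keyspace: 'my_keyspace'                  // placeholder keyspace
    });

    module.exports = client; // every module reuses this one long-lived instance

    // elsewhere:
    // const client = require('./cassandra');
    // await client.execute('INSERT INTO t (id, val) VALUES (?, ?)', [id, val], { prepare: true });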

Why are database connection pools better than a single connection?

I'm currently working on writing a multithreaded application that will need to access a database in order to serve requests. I see many people saying that using a pool of many persistent database connections is the way to go for this type of application, but I'm trying to wrap my head around why exactly this is the case.
Keep in mind that I'm designing this application in Erlang, so I'll be using threads/processes/workers a lot.
So let's compare two situations:
You have a single thread that owns a single database connection. All your client-handling-threads talk to this thread in order to make database queries.
You have a pool of threads, each with their own database connection. When a client-handling-thread wants to access the database, it gets one of these threads from the pool, and uses that to query the DB.
In the first case, I see many people saying it is bad because having one thread handle all database-related queries will cause a bottleneck. But here is my confusion: wouldn't the bottleneck in that single thread actually be the database itself? If all the thread is doing is querying the database through its connection handle, isn't waiting for the DB to respond the main source of latency? How will throwing more connections/threads at the problem solve it?
The database probably has well-developed multithreading abilities. Using a connection pool allows you to:
Make use of the DB's multithreading / load-balancing ability
Avoid the overhead of setting up and tearing down connections over and over
When the database is serving multiple connections, it can make its own decisions on how to prioritize requests. Imagine this scenario:
User A requests a set of records from Table A with 100,000 rows
User B requests a set of records from Table B with 50 rows
User C updates Table A
If multiple connections are used, the DB can take advantage of the fact that (1) and (2) can occur concurrently, and User B gets his 50 records without having to wait for User A to get all 100,000 of his. Only User C has to wait for User A to finish.
Also, setting up and tearing down TCP connections is a relatively expensive task. Using a pool allows one user to release the resource without tearing down the TCP connection, so the next user doesn't have to wait for a new connection. Your single-threaded approach wouldn't benefit from this aspect of connection-pooling, though.
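To make the reuse aspect concrete, here is a toy acquire/release pool in JavaScript; it is a sketch of the idea, not a real driver pool:

    // Connections are created up front and handed back to the pool instead of being
    // torn down, so the next caller skips the TCP and authentication handshake.
    class ConnectionPool {
      constructor(connections) {
        this.idle = [...connections]; // pre-opened connections
        this.waiters = [];            // callers waiting for a free connection
      }
      acquire() {
        if (this.idle.length) return Promise.resolve(this.idle.pop());
        return new Promise(resolve => this.waiters.push(resolve)); // wait instead of opening a new one
      }
      release(conn) {
        const waiter = this.waiters.shift();
        if (waiter) waiter(conn);   // hand the connection straight to a waiting caller
        else this.idle.push(conn);  // or park it for later reuse
      }
    }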

Resources