Global Redis Twemproxy Architecture - node.js

I'm launching a global service on a Node/Mongo/Redis stack. I've got an architecture question about my Redis/Twemproxy config. Here's the crux of the issue: all of my data is 'global' - that is to say, users from anywhere around the world need to access the same data. Unfortunately, it's a ~300ms hop across an ocean - so, to avoid slow reads, I need to host a copy of all my data on a server that's 'local' to the user.
This is pretty easy to accomplish with MongoDB. You simply create a replica set, with members all over the globe, and you set readPreference to 'nearest' (least lag). Done.
However, with Redis/Twemproxy, it's not that easy...
My current solution is to take a serious hit on write performance by writing to every global server (within the req/res cycle). This does lead to faster reads since I can let every user read from a local set of the data. If you do it the other way around, write 'local', read 'global' -- you save a bunch of space (you only have one copy of the data), but reads take a huge performance hit. If I had to choose, I need faster reads.
I've tried creating a 'master' cluster (AMER) and then slaving other 'global' clusters (ASIA, EUROPE) to that, but when I tried to read from the 'global' clusters, it returned nothing. This works with a single Redis instance, so, I'm assuming this has to do with the addition of Twemproxy, and key mapping.
Does anyone have any suggestions or ideas? What's the optimal way to configure a global Redis/Twemproxy architecture?
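One plausible cause of the empty reads from the slaved clusters can be sketched as follows. A proxy that shards by hashing keys over a server list will send the same key to a different backend if the list is ordered or named differently in each region. This is a simplified modulo-hash illustration, not Twemproxy's actual distribution algorithm, and the server names and the one-to-one replica layout are assumptions.

```python
import hashlib
from typing import Sequence


def pick_server(key: str, servers: Sequence[str]) -> str:
    """Route a key to one server by hashing it (simplified modulo sharding)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


def replica_of(server: str) -> str:
    """Name of the (hypothetical) ASIA node that replicates an AMER node."""
    return server.replace("amer", "asia")


# Hypothetical pools: asia-N is a replica of amer-N.
amer_pool = ["amer-1", "amer-2", "amer-3"]
asia_pool_same_order = ["asia-1", "asia-2", "asia-3"]
asia_pool_shuffled = ["asia-3", "asia-1", "asia-2"]

keys = [f"user:{n}" for n in range(100)]

# With matching order, every read lands on the replica of the node written to.
consistent = all(
    pick_server(k, asia_pool_same_order) == replica_of(pick_server(k, amer_pool))
    for k in keys
)

# With a reordered list, reads go to a node that never received the key.
misrouted = [
    k for k in keys
    if pick_server(k, asia_pool_shuffled) != replica_of(pick_server(k, amer_pool))
]
```

If something like this is what is happening, the thing to check is whether both proxy pools map a given key to corresponding master and replica nodes.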

Related

Running two instances of MongoDB

I am working on a highly I/O-intensive application (seat selection based on availability) using the MERN stack.
The app is expected to get 2000 concurrent users.
I want to know whether it's wise to use two instances of MongoDB, one on the RAM (in memory) and another on the Hard drive.
The RAM one to be used to store the available seats.
And the Hard drive one to backup the data after regular intervals.
But at the same time I know that if the server crashes my MongoDB data on the RAM is lost.
Could anyone guide me please?
I am using Socket IO instead of AJAX...
I don't think you need this. You can get a good server, with a good amount of RAM, and if you create your indexes correctly, everything should work fine.
Also Mongo 3 won't lock the entire database on each update, like Mongo 2 used to do.
I believe the best approach would be using something like Memcached in order to improve reads. Also, in order to improve database performance and have automated failover use sharding and replica sets.
Consider also that you would have headaches when your server restarted and you lose your data...
This seems unnecessary, because MongoDB already behaves exactly like that out-of-the-box.
The old engine (MMAPv1) was using memory-mapped files, which means that if you have as much RAM as you have data, it practically behaves like an in-memory database with automatic hard-drive backing.
The new engine (WiredTiger) works a bit differently in detail, but the same in general. It allows you to set a cache size (config key storage.wiredTiger.engineConfig.cacheSizeGB). When the cache is large enough, you again have an in-memory database with automatic hard-drive mirroring.
More about that in the storage FAQ.
What you are talking about is a scaling problem. You have two options when it comes to scaling: add the resources that are causing the bottleneck to your existing setup (more RAM and faster disks, usually) or expand your setup. You should first add resources, almost up to the point where adding more no longer gives you enough bang for the buck.
At some point, this "scaling up" will not be feasible any more and you have to distribute the load amongst more nodes.
MongoDB comes with a feature for distributing load amongst (logical) nodes: sharding.
Basically, it works like this: multiple replica sets each form a logical node called a shard. Each shard in turn only holds a subset of your data. Instead of connecting to the shards directly, you access your data via a mongos query router, which is aware of which shard holds the data to answer the query and where to write new data.
By carefully selecting your shard key, your reads and writes should be evenly distributed between the shards.
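To illustrate why the choice of shard key matters, here is a minimal sketch that hashes key values onto a fixed number of shards and counts how documents spread out. It only demonstrates the idea of hash-based distribution and is not how mongos routes requests; the shard count and key values are made up.

```python
import hashlib
from collections import Counter


def shard_for(shard_key_value: str, shard_count: int) -> int:
    """Map a shard key value to a shard index by hashing it."""
    digest = hashlib.sha256(shard_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARDS = 4

# High-cardinality key (one value per user): documents spread over all shards.
by_user = Counter(shard_for(f"user-{n}", SHARDS) for n in range(10_000))

# Low-cardinality key (two distinct values): at most two shards ever get data.
by_flag = Counter(shard_for(flag, SHARDS) for flag in ["yes", "no"] * 5_000)
```

A key with few distinct values concentrates load on a few shards no matter how many shards exist, which is the imbalance a careful shard key choice avoids.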
Side note: putting production data on a standalone instance instead of a replica set crosses the border of negligence in my book. Given the prices of today's (rented) hardware, it has never been easier to eliminate a single point of failure than with a MongoDB replica set.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is something like all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, I want to make a setup with a secondary for reads, where replication is not performed continuously, but starts at a configured time or, better, is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
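As a rough sketch of such a lag check, the function below takes a document shaped like the output of rs.status() and reports how far each secondary is behind. Note that it swaps in a different comparison from the one described above: it subtracts each secondary's optimeDate from the primary's optimeDate. The sample document and host names are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict


def secondary_lag(status: dict) -> Dict[str, timedelta]:
    """Return, per secondary, how far its optimeDate trails the primary's."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical, heavily trimmed status document.
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2015, 5, 1, 12, 0, 45)},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2015, 5, 1, 12, 0, 0)},
    ]
}

lag = secondary_lag(sample_status)
```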
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
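A minimal sketch of such a pipeline, written as plain Python data so it can be inspected without a server. The collection names (logs, logs_last_month) and the timestamp field (createdAt) are assumptions, not taken from the question.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def build_snapshot_pipeline(
    now: datetime,
    days: int = 30,
    target: str = "logs_last_month",   # hypothetical output collection
    time_field: str = "createdAt",     # hypothetical timestamp field
) -> List[dict]:
    """Build a $match + $out pipeline copying the last `days` days of documents."""
    start = now - timedelta(days=days)
    return [
        {"$match": {time_field: {"$gte": start, "$lt": now}}},
        {"$out": target},  # $out must be the final stage of the pipeline
    ]


pipeline = build_snapshot_pipeline(datetime(2015, 5, 1, tzinfo=timezone.utc))

# With a driver such as PyMongo this would be run roughly as (not executed here):
#   db.logs.aggregate(pipeline)
```

The same two-stage pipeline can be passed to the aggregate call of the Node.js driver; only the surrounding syntax changes.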
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Is there a limit of sub-databases in LMDB?

Posting here as I could not find any forums for lmdb key-value store.
Is there a limit for sub-databases? What is a reasonable number of sub-databases concurrently open?
I would like to have ~200 databases which seems like a lot and clearly indicates my model is wrong.
I suppose I could remodel and embed the id of each db in the key itself and keep only one db, but then I have longer keys and I also cannot drop a database if needed.
I'm interested though if LMDB uses some sort of internal prefixes for keys already.
Any suggestions how to address that problem appreciated.
Instead of calling mdb_dbi_open each time, keep your own map with database names to database handles returned from mdb_dbi_open. Reuse these handles for the lifetime of your program. This will allow you to have multiple databases within an environment and prevent the overhead with mdb_dbi_open.
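The handle map described above can be sketched as a small cache. Here open_handle is a hypothetical stand-in for whatever wraps mdb_dbi_open in your binding, and the fake opener exists only to show that each name is opened once.

```python
from typing import Callable, Dict, Generic, List, TypeVar

H = TypeVar("H")


class HandleCache(Generic[H]):
    """Open each named database once and reuse the handle afterwards."""

    def __init__(self, open_handle: Callable[[str], H]) -> None:
        self._open_handle = open_handle      # hypothetical wrapper around mdb_dbi_open
        self._handles: Dict[str, H] = {}

    def get(self, name: str) -> H:
        if name not in self._handles:
            self._handles[name] = self._open_handle(name)
        return self._handles[name]


# Stand-in opener that records calls and returns a fake integer handle.
open_calls: List[str] = []


def fake_open(name: str) -> int:
    open_calls.append(name)
    return len(open_calls)


cache = HandleCache(fake_open)
first = cache.get("users")
second = cache.get("users")    # served from the map, no second open
third = cache.get("orders")
```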
The documentation for mdb_env_set_maxdbs says:
Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.
http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc
The best way to know is to test the function call mdb_dbi_open performance to see if it is acceptable.

On-disk lookup table with node.js bindings

For a project I am creating a queuing library and basically store URLs in a Set (it's actually an object, where I set keys to true, but one can see it as an array), so the queue only takes every url once. This works really well, however I am facing the problem that there are many URLs and so the RAM usage becomes really high.
Therefore I want to use an on-disk key-value store (actually only keys are required, no idea whether there is some different approach) with the following requirements:
No need to load the whole data set into RAM
Speedy lookups
Node.js bindings
It doesn't have to be too safe (losing data once in a while isn't a huge problem, low RAM requirements are more important) and even though I use Node.JS in this scenario this lookup doesn't necessarily need to run async.
Actually, a side question would be whether there is some better way than an on-disk key-value approach. A search term would be nice. Searching for "lookup tables" somehow always leads me to data sets (IPs, ZIP codes, etc.).
I'd use a sql table with a single column (to store the url). Better control on memory usage than redis (which pretty much stores all in memory).
easy to check if there is already the same value
easy to insert
easy to remove one element
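A minimal sketch of the single-column table, using SQLite from the Python standard library as a stand-in for whichever SQL engine you pick. The table and class names are made up; pass a file path instead of ":memory:" to keep the data on disk.

```python
import sqlite3


class SeenUrls:
    """URL set backed by a one-column SQL table."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        # The primary key gives uniqueness and an index for fast lookups.
        self._db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

    def __contains__(self, url: str) -> bool:
        row = self._db.execute(
            "SELECT 1 FROM urls WHERE url = ?", (url,)
        ).fetchone()
        return row is not None

    def add(self, url: str) -> bool:
        """Store the URL; return True only if it was not already present."""
        if url in self:
            return False
        self._db.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        self._db.commit()
        return True

    def remove(self, url: str) -> None:
        self._db.execute("DELETE FROM urls WHERE url = ?", (url,))
        self._db.commit()


seen = SeenUrls()
added_first = seen.add("http://example.com/a")
added_again = seen.add("http://example.com/a")
```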
If it really "doesn't have to be too safe", another design would be to keep storing everything in memory but limit the number of URLs you store, for example by using an LRU cache.
You could either use a cache in node.js (easy to find via Google) or use a separate memcached server, possibly on the same machine.
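For the in-memory variant, a bounded LRU set could look roughly like this. It is a sketch in Python rather than Node.js, and the class name and capacity are arbitrary. Once the limit is reached, the least recently seen URL is forgotten, so a forgotten URL can be queued again later.

```python
from collections import OrderedDict


class LruUrlSet:
    """Remember at most `capacity` URLs, evicting the least recently seen."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._urls: "OrderedDict[str, None]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._urls)

    def add(self, url: str) -> bool:
        """Record the URL; return True if it was not currently remembered."""
        if url in self._urls:
            self._urls.move_to_end(url)      # mark as most recently seen
            return False
        self._urls[url] = None
        if len(self._urls) > self._capacity:
            self._urls.popitem(last=False)   # drop the least recently seen URL
        return True
```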

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table and a python program running on each of the windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't working most of the time. Clients poll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (xp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. In addition they are also low overhead, remotely stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as they're usually set up to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.
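The single-row idea can be sketched like this, with SQLite standing in for the real database purely so the example runs on its own. The table names, column names and the Client class are hypothetical.

```python
import sqlite3
from typing import List

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY AUTOINCREMENT, state TEXT NOT NULL);
CREATE TABLE latest_job (
    only_row INTEGER PRIMARY KEY CHECK (only_row = 1),
    job_id   INTEGER NOT NULL
);
INSERT INTO latest_job (only_row, job_id) VALUES (1, 0);
""")


def enqueue(conn: sqlite3.Connection) -> int:
    """Add a job and record its ID in the single-row marker table."""
    cur = conn.execute("INSERT INTO jobs (state) VALUES ('new')")
    job_id = cur.lastrowid
    conn.execute("UPDATE latest_job SET job_id = ? WHERE only_row = 1", (job_id,))
    conn.commit()
    return job_id


class Client:
    def __init__(self) -> None:
        self.last_seen = 0

    def poll(self, conn: sqlite3.Connection) -> List[int]:
        """Cheap check first; query the jobs table only if the marker moved."""
        (latest,) = conn.execute("SELECT job_id FROM latest_job").fetchone()
        if latest == self.last_seen:
            return []
        rows = conn.execute(
            "SELECT id FROM jobs WHERE id > ? AND state = 'new' ORDER BY id",
            (self.last_seen,),
        ).fetchall()
        self.last_seen = latest
        return [row[0] for row in rows]
```

Most polls then cost a single-row read, and the full query runs only when a new job has actually arrived.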
Unlike Postgres and SQL Server (or object stores like CouchDB), MySQL does not emit database events. However, there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
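A minimal sketch of the trigger-plus-changes-table pattern. It uses SQLite as a stand-in so it runs without a server; MySQL trigger syntax differs in detail, and the table, trigger and column names here are assumptions. The trigger queues only the transition to 'finished', and a worker drains the queue.

```python
import sqlite3
from typing import List, Tuple

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE changes (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id INTEGER NOT NULL,
    event  TEXT NOT NULL
);
-- Queue an event only when a job moves into the 'finished' state.
CREATE TRIGGER jobs_finished AFTER UPDATE OF state ON jobs
WHEN NEW.state = 'finished' AND OLD.state <> 'finished'
BEGIN
    INSERT INTO changes (job_id, event) VALUES (NEW.id, 'finished');
END;
""")


def drain_changes(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Return queued events in order and delete the ones that were read."""
    rows = conn.execute(
        "SELECT id, job_id, event FROM changes ORDER BY id"
    ).fetchall()
    if rows:
        conn.execute("DELETE FROM changes WHERE id <= ?", (rows[-1][0],))
        conn.commit()
    return [(job_id, event) for _, job_id, event in rows]


db.execute("INSERT INTO jobs (id, state) VALUES (1, 'new')")
db.execute("INSERT INTO jobs (id, state) VALUES (2, 'new')")
db.execute("UPDATE jobs SET state = 'processing' WHERE id = 1")  # filtered out
db.execute("UPDATE jobs SET state = 'finished' WHERE id = 2")    # queued
db.commit()

events = drain_changes(db)
```

Because events sit in the changes table until a worker deletes them, a worker that was down simply finds the backlog on its next run.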
You might think that MyISAM is the best choice for the changes table since it's mostly performing writes (or even MEMORY if you don't need to persist the events between database server outages). However, keep in mind that both MEMORY and MyISAM have only full-table locks, so your trigger on an InnoDB table might hit a bottleneck when performing an insert into a MEMORY or MyISAM table. You may also require InnoDB for the changes table if you're using an ON DELETE CASCADE with another InnoDB table (requires both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of your changes table to see whether there is anything to process. This feature won't work for InnoDB tables.
These articles describe in more depth some alternative ways to implement queues in MySQL and even avoid polling:
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you

Resources