On-disk lookup table with node.js bindings - node.js

For a project I am creating a queuing library that basically stores URLs in a Set (it is actually an object whose keys I set to true, but you can think of it as a set), so the queue only takes every URL once. This works really well, but I am facing the problem that there are many URLs, so the RAM usage becomes really high.
Therefore I want to use an on-disk key-value store (actually only keys are required; I have no idea whether there is some different approach) with the following requirements:
No need to load the whole data set into RAM
Speedy lookups
Node.js bindings
It doesn't have to be too safe (losing data once in a while isn't a huge problem; low RAM requirements are more important), and even though I use Node.js in this scenario the lookup doesn't necessarily need to run async.
A side question would be whether there is some better approach than an on-disk key-value store. A term for it would be nice. Searching for "lookup tables" somehow always leads me to data sets (IPs, ZIP codes, etc.).

I'd use an SQL table with a single column (to store the URL). That gives better control over memory usage than Redis (which pretty much stores everything in memory). A sketch follows the list below.
easy to check if there is already the same value
easy to insert
easy to remove one element
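For example, a minimal sketch using SQLite via the better-sqlite3 package (any SQLite binding would do; the file and table names are made up). A UNIQUE/PRIMARY KEY column plus INSERT OR IGNORE covers all three operations:

    // Minimal sketch with better-sqlite3; file and table names are illustrative.
    const Database = require('better-sqlite3');

    const db = new Database('urls.sqlite');
    db.exec('CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)');

    const insertStmt = db.prepare('INSERT OR IGNORE INTO seen_urls (url) VALUES (?)');
    const selectStmt = db.prepare('SELECT 1 FROM seen_urls WHERE url = ?');

    // Returns true if the URL was new (and records it), false if it was already stored.
    function addIfNew(url) {
      return insertStmt.run(url).changes === 1;
    }

    function hasUrl(url) {
      return selectStmt.get(url) !== undefined;
    }

As a bonus, better-sqlite3 is synchronous, which matches the "doesn't need to run async" point, and SQLite only keeps its page cache in RAM rather than the whole data set.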

If it really "doesn't have to be too safe", another design would be to keep storing everything in memory but limit the number of URLs you store, for example by using an LRU cache.
You could either use a cache in node.js (easy to find via Google) or use a separate memcached server, possibly on the same machine.
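A minimal sketch of that idea, assuming the lru-cache package from npm (v7+ exposes a named LRUCache export; older versions export the constructor directly):

    const { LRUCache } = require('lru-cache');

    // Keep at most 100,000 URLs in memory; the least recently seen ones are evicted first.
    const seen = new LRUCache({ max: 100_000 });

    function shouldEnqueue(url) {
      if (seen.has(url)) return false; // seen recently, skip it
      seen.set(url, true);
      return true;
    }

The trade-off is that a URL evicted from the cache can re-enter the queue later, which fits the "doesn't have to be too safe" constraint.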

Related

CouchDB 2 global_changes system table is getting insanely big

We have a system that basically writes 250 MB of data into our CouchDB 2 instance, which generates ~50 GB/day in the global_changes database.
This makes CouchDB 2 consume all the disk space.
Once it gets into this state, CouchDB 2 goes down and never comes back.
We would like to know whether there is any way of limiting the size of the global_changes database, or a way of managing it, like a set of best practices.
tldr: just delete it
http://docs.couchdb.org/en/latest/install/setup.html#single-node-setup
States:
Note that the last of these (referring to _global_changes) is not necessary if you do not expect to be using the global changes feed. Feel free to delete this database if you have created it, it has grown in size, and you do not need the function (and do not wish to waste system resources on compacting it regularly).
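If you do decide to delete it, it is an ordinary database, so a plain HTTP DELETE against CouchDB's API is enough. A minimal sketch with Node 18+'s built-in fetch, run from an ES module (host, port and admin credentials are placeholders):

    // Placeholders: adjust host, port and admin credentials to your setup.
    const auth = Buffer.from('admin:password').toString('base64');

    const res = await fetch('http://localhost:5984/_global_changes', {
      method: 'DELETE',
      headers: { Authorization: `Basic ${auth}` },
    });
    console.log(res.status); // 2xx on success, 404 if it was already gone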

Running two instances of MongoDB

I am working on a highly I/O-intensive application (seat selection based on availability) using the MERN stack.
The app is expected to get 2000 concurrent users.
I want to know whether it's wise to use two instances of MongoDB, one on the RAM (in memory) and another on the Hard drive.
The RAM one to be used to store the available seats.
And the Hard drive one to backup the data after regular intervals.
But at the same time I know that if the server crashes my MongoDB data on the RAM is lost.
Could anyone guide me please?
I am using Socket.IO instead of AJAX...
I don't think you need this. You can get a good server, with a good amount of RAM, and if you create your indexes correctly, everything should work fine.
Also Mongo 3 won't lock the entire database on each update, like Mongo 2 used to do.
I believe the best approach would be to use something like Memcached to improve reads. Also, to improve database performance and get automated failover, use sharding and replica sets.
Consider also that you would have headaches whenever your server restarts and you lose your data...
This seems unnecessary, because MongoDB already behaves exactly like that out-of-the-box.
The old engine (MMAPv1) used memory-mapped files, which means that if you have as much RAM as you have data, it practically behaves like an in-memory database with automatic hard-drive backing.
The new engine (WiredTiger) works a bit differently in detail, but the same in general. It lets you set a cache size (config key storage.wiredTiger.engineConfig.cacheSizeGB). When the cache is large enough, you again have an in-memory database with automatic hard-drive mirroring.
More about that in the storage FAQ.
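For reference, a minimal sketch of the relevant mongod.conf section (the 8 GB figure is only an example; size it to your working set):

    # mongod.conf -- example value only
    storage:
      wiredTiger:
        engineConfig:
          cacheSizeGB: 8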
What you are talking about is a scaling problem. You have two options when it comes to scaling: add resources to relieve the bottleneck in your existing setup (more RAM and faster disks, usually) or expand your setup. You should first add resources, almost up to the point where adding more does not give you a corresponding bang for the buck.
At some point, this "scaling up" will not be feasible any more and you have to distribute the load amongst more nodes.
MongoDB comes with a feature for distributing load amongst (logical) nodes: sharding.
Basically, it works like this: multiple replica sets each form a logical node called a shard. Each shard in turn holds only a subset of your data. Instead of connecting to the shards directly, you access your data via a mongos query router, which knows which shard holds the data to answer a query and where to write new data.
By carefully selecting your shard key, your reads and writes should be evenly distributed between the shards.
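From the application's point of view a sharded cluster looks like a single server: you just point the driver at the mongos router(s). A minimal sketch with the official Node.js driver (the hostnames, database name and the shard key "city" are made up):

    // Connect to the mongos query routers instead of the shards themselves.
    const { MongoClient } = require('mongodb');

    async function main() {
      const client = new MongoClient('mongodb://mongos1.example.net:27017,mongos2.example.net:27017/');
      await client.connect();

      const seats = client.db('booking').collection('seats');

      // mongos routes this write to the shard that owns the matching chunk of the shard key.
      await seats.insertOne({ city: 'Berlin', show: 42, seat: 'A1', available: true });

      // Queries that include the shard key are routed to a single shard (a "targeted" query).
      const free = await seats.find({ city: 'Berlin', available: true }).toArray();
      console.log(free.length);

      await client.close();
    }

    main().catch(console.error);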
Side note: putting production data on a standalone instance instead of a replica set crosses the border of negligence in my book. Given the prices of today's (rented) hardware, it has never been easier to eliminate a single point of failure than with a MongoDB replica set.

Updating a Lucene index frequently is causing performance degradation

I am trying to add Lucene.NET to a project where searching involves more and more complicated data, but the underlying tables are modified frequently (new rows inserted, or fields that are part of the Lucene index changed).
Is it good to use lucene.net searching here?
How can I find the modified fields and update the specific Lucene index entries that were already created? The Lucene index also contains documents that have been deleted from the table; how can I remove those from the index?
Right now, while loading, I:
remove index entries whose unique field is no longer present in the table
insert a document if it does not exist in the index, otherwise update every document matching the table's unique field
Loading the page now takes more time than normal because of these remove/insert/update index calls.
How can I proceed with it?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also threadsafe, so you can also keep an instance of an IS through which all your searches go. Then recreate it periodically (e.g. with a timer) and swap it in using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
Caveat
Having said that... hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So you are actually better off creating a separate service to host your Lucene bits. A simple self-hosted Web API Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk I/O load). It also means the service can be accessed from other parts of your system, tested separately, etc.
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it stays simple, you keep control over its capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably be indexing according to some schedule (periodically). The easiest approach is to keep the date of the last indexing run and then query for all the changes since then, indexing new records, updating changed ones and removing deleted ones. To keep track of removed entries in the database you will need a log of deleted records with the date each was removed; you can then query by that date to find what needs to be removed from the Lucene index.
Now simply run that job every 2 minutes or so.
That said, Lucene.net is not really suited for web applications; you should consider using ElasticSearch, SOLR or AzureSearch instead. Basically, servers that can handle load and multithreading better.

Is there a limit of sub-databases in LMDB?

Posting here as I could not find any forums for lmdb key-value store.
Is there a limit for sub-databases? What is a reasonable number of sub-databases concurrently open?
I would like to have ~200 databases, which seems like a lot and clearly indicates my model is wrong.
I suppose I could remodel and embed the id of each db in the key itself and keep only one db, but then I have longer keys and I also cannot drop a database if needed.
I'm interested though if LMDB uses some sort of internal prefixes for keys already.
Any suggestions how to address that problem appreciated.
Instead of calling mdb_dbi_open each time, keep your own map from database names to the database handles returned by mdb_dbi_open. Reuse these handles for the lifetime of your program. This still lets you have multiple databases within an environment while avoiding the overhead of mdb_dbi_open.
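A minimal sketch of that handle cache, assuming the node-lmdb binding (the path, maxDbs, mapSize and database names are illustrative; the same idea applies to the C API or any other binding):

    // Cache database handles instead of calling env.openDbi() (mdb_dbi_open) repeatedly.
    const lmdb = require('node-lmdb');

    const env = new lmdb.Env();
    // The ./data directory must already exist.
    env.open({ path: './data', maxDbs: 200, mapSize: 2 * 1024 * 1024 * 1024 });

    const dbiCache = new Map();

    function getDbi(name) {
      if (!dbiCache.has(name)) {
        dbiCache.set(name, env.openDbi({ name, create: true }));
      }
      return dbiCache.get(name);
    }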
From the documentation for mdb_env_set_maxdbs:
Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.
http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc
The best way to know is to test the performance of the mdb_dbi_open call and see whether it is acceptable.

Caching mongoose objects with Redis

Is it possible to cache, say, Mongoose document objects in Redis,
perhaps to implement a write-back or write-through cache with a timeout-based cache flush mechanism?
P.S.:
I am familiar with mongoose-redis-cache, but I guess it supports only lean queries, which do not quite serve the purpose here. (But I may be wrong).
As Mongoose objects wrap a MongoDB document, there'd be no reason you couldn't call
JSON.stringify(mongooseObject.toJSON())
which returns a string representing the MongoDB document. You could then store that result as a key/value pair in Redis.
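A minimal sketch of that read-through idea with the node-redis client (the key format, the 60-second TTL and the User model are made up; Model.hydrate turns the parsed JSON back into a full Mongoose document):

    const { createClient } = require('redis');

    const redis = createClient();
    const redisReady = redis.connect(); // connect once and reuse the promise

    async function getUserCached(User, id) {
      await redisReady;
      const key = `user:${id}`;

      const cached = await redis.get(key);
      if (cached) {
        // Rebuild a Mongoose document from the cached JSON.
        return User.hydrate(JSON.parse(cached));
      }

      const doc = await User.findById(id);
      if (doc) {
        // Expire after 60 seconds so stale entries eventually age out.
        await redis.set(key, JSON.stringify(doc.toJSON()), { EX: 60 });
      }
      return doc;
    }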
Where it starts to get more complex is that you'd need to first override the normal save and update functionality to write any modifications to your Redis store rather than to the database. While doable, Mongoose wasn't designed for that, and you'd probably be more successful just using the native MongoDB driver and managing general document interactions that way. There are a number of extremely handy operators that you'd need to handle independently ($push, for example, which adds a single value to an array).
The real issue, though, is that you lose nearly all of the power of MongoDB by not being able to use the query engine or the aggregation framework if all of the data isn't already stored in MongoDB (and even if it is, you're still bypassing your caching layer). And if you're not using any of that functionality, then MongoDB may not be the best match for your needs (you might instead consider something like CouchDB).
While I can see the potential value of a caching layer for a high-performance MongoDB system, a write-back style cache may be more complex than it's worth (and not necessarily safe).
Of course, a write-through cache would be simpler (although you have the complexity of two data-stores and making sure writes are committed consistently if you're going to trust the cache and DB).
(As an aside, I'm not sure how you'd actually manage timeouts, as I thought Redis deleted the values associated with keys once their assigned lifetime/timeout expires? I wouldn't want to lose data to the Redis cache if you were doing write-back.)
In Redis you can only cache raw JSON, but to cache whole Mongoose Document objects you can use my library, which handles both: caching results in Redis and Mongoose documents in memory. It also has event-based logic to clear both caches when related changes occur.
https://www.npmjs.com/package/speedgoose
