Does opening multiple LMDB environments within the same client process have a legitimate use case, or is it redundant?

§1 One LMDB environment corresponds to one database file on disk.
§2 In principle, the same client process could call the LMDB C API multiple times to instantiate several different LMDB environments.
The question is whether §2 is redundant, or whether a client could have a legitimate use case for it.

Here are some use cases that would probably require an app to manage its data in multiple LMDB environment files:
If the client app needed to separate its data at the file-system level, i.e. into two different files,
thinking that storing absolutely all of an app's data in a single .mdb file could be like keeping all the eggs in one basket.
For security purposes: if all data lived in a single file, it would be at greater risk of exploitation than if different pieces of the data were spread across different files.
A client process might not want one data file to grow beyond an upper bound, e.g. 1024 MiB, so it would create a new file-system-level file once that limit is reached.
So it's probably not overkill to allow a single client process to create multiple LMDB environment files.
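
As an illustration of §2, here is a minimal sketch using the Python lmdb binding (which an answer further down also references); the paths, map sizes, and keys are made up:

    import lmdb

    # Two independent environments, each backed by its own file on disk.
    # map_size caps how large each file may grow (the upper-bound use case above).
    users_env = lmdb.open('./users.mdb', subdir=False, map_size=1024 * 1024 * 1024)
    logs_env = lmdb.open('./logs.mdb', subdir=False, map_size=1024 * 1024 * 1024)

    with users_env.begin(write=True) as txn:
        txn.put(b'user:1', b'alice')

    with logs_env.begin(write=True) as txn:
        txn.put(b'log:1', b'login event')

Each environment gets its own data file, lock file, and memory map, so the two are fully independent at the file-system level.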

Related

How can tokio tasks access shared data in Rust?

I am creating a web server using tokio. Whenever a client connection comes in, a task (green thread) is created via tokio::spawn.
The main job of my web server is proxying. Target-server information for the proxy is stored as a global variable, and every task must access it. Since there are multiple target servers, they must be selected round-robin, so the global variable (a struct) must hold the index of the most recently selected server.
Concurrency problems occur because this shared information can be read and written by multiple tasks at the same time.
According to the docs, this can be solved either with Mutex and Arc or with channels.
I'm curious which one you usually prefer, or whether there is another way to solve the problem.
If it's shared data, you generally do want Arc, or you can leak a box to get a 'static reference (assuming the data is going to exist until the program exits), or you can use a global variable (though global variables tend to impede testability and should generally be considered an anti-pattern).
As far as what goes in the Arc/Box/global, that depends on what your data's access pattern will be. If you will often read but rarely write, then Tokio's RwLock is probably what you want; if you're going to be updating the data every time you read it, then use Tokio's Mutex instead.
Channels make the most sense when you have separate parts of the program with separate responsibilities. It doesn't work as well to update multiple workers with the same changes to data, because then you get into message ordering problems that can result in each worker's state disagreeing about something. (You get many of the problems of a distributed system without any of the benefits.)
Channels can work if there is a single entity responsible for maintaining the data, but at that point there isn't much benefit over using some kind of mutual exclusion mechanism; it winds up being the same thing with extra steps.
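
The question is Rust-specific, but the recommended shape (a round-robin index behind a shared lock) is language-independent. Here is a minimal sketch of it in Python's asyncio, where asyncio.Lock stands in for Arc<tokio::sync::Mutex<_>>; TargetPool and the server addresses are made up:

    import asyncio

    class TargetPool:
        # Round-robin selection over a fixed list of target servers.
        def __init__(self, targets):
            self.targets = targets
            self.index = 0
            self.lock = asyncio.Lock()  # tokio analogue: Arc<Mutex<usize>>

        async def next_target(self):
            # Hold the lock only for the read-modify-write of the index.
            async with self.lock:
                target = self.targets[self.index]
                self.index = (self.index + 1) % len(self.targets)
                return target

    async def handle_client(pool):
        target = await pool.next_target()
        print('proxying to', target)  # ... proxy the connection to `target` ...

    async def main():
        pool = TargetPool(['10.0.0.1:8080', '10.0.0.2:8080'])
        await asyncio.gather(*(handle_client(pool) for _ in range(5)))

    asyncio.run(main())

In Rust the equivalent is an Arc<tokio::sync::Mutex<State>> cloned into each spawned task; for a bare counter like this one, an AtomicUsize with fetch_add would avoid the lock entirely.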

Can LMDB be made concurrent for writes as well under specific circumstances?

MDB_NOLOCK, as described in the mdb_env_open() API docs:
MDB_NOLOCK Don't do any locking. If concurrent access is anticipated, the caller must manage all concurrency itself. For proper operation the caller must enforce single-writer semantics, and must ensure that no readers are using old transactions while a writer is active. The simplest approach is to use an exclusive lock so that no readers may be active at all when a writer begins.
1. What if a read-write txnA intends to modify a set of keys that has no key in common with the set of keys another read-write txnB intends to modify? Couldn't they be issued concurrently?
2. Isn't the single-writer semantics wasteful in such situations? One txn waits for the previous one to finish, even though the two intend to operate on entirely separate regions of an LMDB environment.
3. In an environment opened with MDB_NOLOCK, what if the client app determines at the application level that two write transactions intend to read/write mutually exclusive sets of keys anywhere in the environment, and sends such transactions concurrently anyway? What could go wrong?
4. Could such concurrent writes scale linearly with cores, like read-only txns do, given the app is able to manage the concurrent writes in the manner described in 3?
1. No: modifying key/value pairs also requires modifying the b-tree structure itself, so the two transactions would conflict with each other.
2. You should avoid doing long-running computations in the middle of a write transaction; try to do as much as possible beforehand. If you can't, then LMDB might not be a great fit for your application. Usually you can, though.
3. Very bad stuff: application crashes and DB corruption.
4. Writes are generally I/O-bound and will not scale with many cores anyway. There are some very hacky things you can do with LMDB's writemap and/or pwrite(2), but you are very much on your own there.
I'm going to assume that writing to the value part of a pre-existing key does not modify the b-tree, because you are not modifying any keys. So Doug Hoyte's answer stands, except possibly point 3:
The key phrase there is "are intending to RW to mutually exclusive sets of keys". Assuming the keys are pre-allocated and already in the DB, changing only the values should not matter. (LMDB does store variable-sized values, so it could matter if the new values have different sizes.)
So it should be possible to write with MDB_NOLOCK concurrently, as long as you can guarantee never to modify, add, or delete any keys during the concurrent writes.
Empirically I can state that working with LMDB opened with MDB_NOLOCK (or lock=False in Python) and simply modifying the values of pre-existing keys, or even only adding new key/values, seems to work well, even when the LMDB file is mounted on an NFS-like medium and queried from different machines.
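
A sketch of that setup with the Python binding (the path and key are made up, and whether concurrent writers are actually safe is exactly what the answers above debate):

    import lmdb

    # lock=False maps to MDB_NOLOCK: LMDB does no locking at all, so the
    # application itself must uphold the guarantees discussed above.
    env = lmdb.open('./data.mdb', subdir=False, lock=False)

    with env.begin(write=True) as txn:
        txn.put(b'counter', b'42')  # overwrite the value of a pre-existing key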
@Doug Hoyte - I would appreciate more context as to what specific circumstances might lead to a crash or corruption. In my case there are many small, short-lived writes to the same DB.

How to synchronize an object between multiple instances of a Node.js application

Is there any way to lock an object in a Node.js application?
When multiple instances of the application are running, some functions shouldn't run concurrently. Once instance A's function has completed, it should unlock that object/key or some identifier, and instance B should check whether it is unlocked before running its own function.
Any object or key can be used to identify the lock for locking and unlocking the function.
How can this be done in a Node.js application that has multiple instances?
As mentioned in the other answer, Redis may be your answer; however, it really depends on the resources available to you. There are some other possibilities, less complicated and certainly less powerful, which may also do the trick.
node-cache may also do the trick if you set it up correctly. It is nowhere near as powerful as Redis, but on the bright side it does not require as much setup and interaction with your environment.
So there are Redis and node-cache for memory locks. I should mention there are quite a few npm packages which do caching; it depends on what you need and how intricate your cache needs to be.
However, there are less elegant ways to do what you want, though less elegant is not necessarily worse.
You could use a JSON-file-based system and hold locks on the files for a TTL. lockfile or proper-lockfile will accomplish the task. You can read the information from the files when needed, delete them when required, and give them a TTL. Basically a cache system on disk.
The memory system is obviously faster. The file system requires just as much planning in your code as the memory system.
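
Those packages are Node's, but the underlying trick is just atomic, exclusive file creation. A minimal Python sketch of the same idea follows, with a made-up lock path and TTL; real lock-file packages handle the stale-lock races more carefully:

    import os, time

    LOCK_PATH = '/tmp/myapp.lock'  # hypothetical lock file shared by all instances
    TTL = 30                       # seconds after which a lock is considered stale

    def acquire_lock():
        while True:
            try:
                # O_CREAT | O_EXCL is atomic: only one process can create the file.
                fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.write(fd, str(os.getpid()).encode())
                os.close(fd)
                return
            except FileExistsError:
                # If the holder died, the file ages past the TTL and can be broken.
                if time.time() - os.path.getmtime(LOCK_PATH) > TTL:
                    os.unlink(LOCK_PATH)
                else:
                    time.sleep(0.1)

    def release_lock():
        os.unlink(LOCK_PATH)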
There is yet another way. This is possibly the most dangerous one, and you would have to think long and hard about the consequences in terms of security and need.
Node.js has its own process.env. As most know, this holds the environment variables, available everywhere by simply writing process.env.foo, where foo has been declared as an environment variable. A package such as dotenv allows you to add to these variables by way of a .env text file. Thus if you put sam=mongoDB in that file, then process.env.sam in your code will be interpreted as mongoDB. Tons of system-wide variables can be set up here.
So what good does that do, you may ask? Well, these variables can be changed in mid-flight, so if you need to lock some values and then change them, this is a simple mechanism for doing it. Beware the gotcha here, though: once the system goes down, or all processes stop and are started again, your environment variables will return to the defaults in the .env file.
Additionally, unless you are running on a reasonably safe system on AWS or Azure etc., I would not feel secure having my .env file open to the world. There is a way around this one too: you can encrypt the variables and put only the ciphertext in the file, decrypting it before actually using the full variable.
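
A sketch of that encrypt-the-values idea, using Python's cryptography package for illustration; the secret and the key handling are made up, and in practice the key itself must live somewhere safer than the .env file:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # store this key outside the .env file!
    f = Fernet(key)

    # At setup time: encrypt the secret and write the ciphertext into .env.
    ciphertext = f.encrypt(b'mongodb://user:pass@host/db')

    # At run time: read the ciphertext from the environment, decrypt on use.
    connection_string = f.decrypt(ciphertext)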
There are probably many more ways to lock and unlock, not the least of which is to use native Node.js structures: combine file-system events together with Crypto. But this demands a much deeper understanding of the actual Node.js library and structures.
Hope some of this helped.
I strongly recommend Redis in your case.
There are several ways to create an application/process-shared object; using locks is one of them, as you mentioned.
But they're all complicated. Unless you really need to do it yourself, Redis will be good enough: atomic ops across multiple processes, transactions, and so on.
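
For illustration, a minimal Redis-based lock in Python using redis-py; the key name and TTL are made up. Redis's SET with NX and EX is atomic across every connected process, which is what makes this work (a production implementation would make the release atomic too, e.g. with a Lua script):

    import time, uuid
    import redis

    r = redis.Redis(host='localhost', port=6379)

    def acquire(key, ttl=30):
        token = str(uuid.uuid4())  # identifies this instance as the lock holder
        # SET key token NX EX ttl: succeeds only if the key does not exist yet.
        while not r.set(key, token, nx=True, ex=ttl):
            time.sleep(0.1)        # another instance holds the lock; wait
        return token

    def release(key, token):
        if r.get(key) == token.encode():  # only delete the lock we still own
            r.delete(key)

    token = acquire('lock:critical-job')
    try:
        pass  # ... the function that must not run concurrently ...
    finally:
        release('lock:critical-job', token)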
Old thread, but I didn't want to use Redis, so I made my own open-source solution, which uses WebSocket connections:
https://github.com/OneAndonlyFinbar/sync-cache

Should I cache results of functions involving mass file I/O in a node.js server app?

I'm writing my first 'serious' Node/Express application, and I'm becoming concerned about the number of O(n) and O(n^2) operations I'm performing on every request. The application is a blog engine, which indexes and serves up articles stored in markdown format in the file system. The contents of the articles folder do not change frequently, as the app is scaled for a personal blog, but I would still like to be able to add a file to that folder whenever I want, and have the app include it without further intervention.
Operations I'm concerned about
When /index is requested, my route iterates over all files in the directory and stores them as objects.
When a "tag page" is requested (/tag/foo), I iterate over all the articles, and then over each article's array of tags, to determine which articles to present in an index format.
Now, I know this is probably premature optimisation, as performance is still satisfactory with fewer than 200 files, though definitely not lightning fast. I also know that in production, measures like this wouldn't be considered necessary/worthwhile unless backed by significant benchmarking results. But as this is purely a learning exercise/demonstration of ability, and as I'm (perhaps excessively) concerned about learning optimal habits and patterns, I worry I'm committing some kind of sin here.
Measures I have considered
I get the impression that a database might be a more typical solution than filesystem I/O. But this would mean monitoring the directory for changes and processing/adding new articles to the database, a whole separate piece of functionality. If I did this, would it make sense to watch that folder for changes even when a request isn't coming in? Or would it be better to check the freshness of the database, then retrieve results from it? I also don't know how much this helps ultimately, as database calls are still async/slower than internal state, aren't they? Or would a database query, e.g. "articles where tags contain x", be O(1) rather than O(n)? If so, that would clearly be ideal.
Also, I am beginning to learn about techniques/patterns for caching results, e.g. a property on the function containing the previous result, which could be checked for and served up without redoing the operation. But I'd need to check whether the folder had new files added, to know if it was OK to serve up the cached version, right? More fundamentally (and this is the essential newbie question at hand): is it considered OK to do this? Everyone talks about how Node apps should be stateless, and this would amount to maintaining state, right? Once again, I'm still a fairly raw beginner, so reading the source of mature apps isn't always as enlightening to me as I wish it were.
Also, have I fundamentally misunderstood how routes work in Node/Express? If I store a variable in index.js, are all the variables/objects created by it destroyed when the route is done and the page is served? If so, I apologise profusely for my ignorance, as that would negate basically everything discussed and make maintaining an external database (or just continuing to redo the file I/O) the only solution.
First off, the request and response objects that are part of each request last only for the duration of a given request and are not shared by other requests. They will be garbage collected as soon as they are no longer in use.
But, module-scoped variables in any of your Express modules last for the duration of the server. So, you can load some information in one request, store it in a module-level variable and that information will still be there when the next request comes along.
Since multiple requests can be "in flight" at the same time if you are using any async operations in your request handlers, then if you are sharing/updating information between requests you have to make sure the updates are atomic so that the data is shared safely. In node.js this is much simpler than in a multi-threaded web server, but there can still be issues if you do part of an update to a shared object, then an async operation, then the rest of the update. While the async operation is pending, another request could run and see the shared object in its half-updated state.
When you're not doing an async operation, your Javascript code is single-threaded, so other requests won't interleave until you go async.
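
The same rule sketched in Python's asyncio for illustration (Node's event loop behaves the same way): code between awaits runs without interleaving, but a read-modify-write that spans an await can lose updates.

    import asyncio

    counter = {'value': 0}

    async def increment_unsafely():
        current = counter['value']      # read shared state
        await asyncio.sleep(0)          # suspension point: other tasks run here
        counter['value'] = current + 1  # write back a now-stale value

    async def increment_safely(lock):
        async with lock:                # no task can interleave the whole R-M-W
            current = counter['value']
            await asyncio.sleep(0)
            counter['value'] = current + 1

    async def main():
        await asyncio.gather(*(increment_unsafely() for _ in range(10)))
        print(counter['value'])         # prints 1, not 10: nine updates lost

    asyncio.run(main())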
It sounds like you want to cache your parsed state into a simple in-memory Javascript structure and then intelligently update this cache of information when new articles are added.
Since you already have the code to parse your set of files and tags into in-memory Javascript variables, you can just keep that code. You will want to package that into a separate function that you can call at any time and it will return a newly updated state.
Then, you want to call it when your server starts and that will establish the initial state.
All your routes can be changed to operate on the cached state and this should speed them up tremendously.
Then, all you need is a scheme to decide when to update the cached state (e.g. when something in the file system changes). There are lots of options, and which to use depends a little on how often things will change and how quickly the changes need to be reflected to the outside world. Here are some options:
You could register a file-system watcher for a particular directory and, when it triggers, figure out what has changed and update your cache. You can make the update function as dumb (start over and parse everything from scratch) or as smart (figure out the one item that changed and update only that part of the cache) as it is worth making it; see the sketch after this list. I'd suggest you start simple and only invest more in it when you're sure that effort is needed.
You could just manually rebuild the cache once every hour. Updates would take an average of 30 minutes to show, but this would take 10 seconds to implement.
You could create an admin function in your server to instruct the server to update its cache now. This might be combined with option 2, so that if you added new content, it would automatically show within an hour, but if you wanted it to show immediately, you could hit the admin page to tell it to update its cache.
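
A dumb-but-simple variant of the first option, sketched in Python (the folder name and article format are made up): keep the parsed state in a module-level variable, rescan the folder's modification times on access, and rebuild only when something changed.

    import os

    ARTICLES_DIR = './articles'  # hypothetical content folder

    _cache = {'snapshot': None, 'state': None}

    def _dir_snapshot():
        # Cheap change detection: the set of (name, mtime) pairs in the folder.
        return {(e.name, e.stat().st_mtime) for e in os.scandir(ARTICLES_DIR)}

    def _parse_all_articles():
        articles = []
        for entry in os.scandir(ARTICLES_DIR):
            if entry.name.endswith('.md'):
                with open(entry.path) as f:
                    articles.append({'file': entry.name, 'body': f.read()})
        return articles

    def get_state():
        snap = _dir_snapshot()
        if snap != _cache['snapshot']:  # first call, or the folder changed
            _cache['state'] = _parse_all_articles()
            _cache['snapshot'] = snap
        return _cache['state']          # routes read this; O(1) after parsing

Building a tags-to-articles dict in the same parse step would also turn the /tag/foo scan into a single dictionary lookup.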

Is locking necessary during read operations by multiple threads?

Say my application has n threads trying to read the same collection object, say a List. Will there be any race condition or deadlock or similar problems? In other words, is it necessary to lock the List for read-only operations?
It totally depends on you whether you want to restrict the number of readers or not. Normally, if you look at Excel files in Windows shared across a network, a maximum of 10 people can open one for reading at a time. That number can be raised, or there need not be any restriction at all; it is your choice as a programmer whether to restrict or not. The one thing to keep in mind is that if the file is on a server and a million read requests arrive every second with no restriction imposed, your system is likely to slow down until it cannot serve anyone. If instead you impose a limit, say that only 100 users can read at a time, you can be sure your system will not be overloaded. This is a real-world scenario, considering the worst case.
But if you are asking only for learning's sake, I would say it is not required. If n users open the same collection for reading, you can ideally give all n of them read access to the object; no synchronisation mechanism is needed. Where there is no synchronisation, there can be no deadlock or anything of the sort.
Hope this removes your confusion. Thanks.
Not necessary, unless the read operation causes an internal state change in the collection object.
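
A concrete case of that caveat, sketched in Python: an LRU-style map whose get() reorders its entries, so even "readers" mutate internal state and must take the lock. A plain list or map that is never modified after construction can be read by any number of threads with no locking at all.

    import threading
    from collections import OrderedDict

    class LruMap:
        # Least-recently-used ordering: lookups move the entry to the end.
        def __init__(self):
            self._data = OrderedDict()
            self._lock = threading.Lock()

        def put(self, key, value):
            with self._lock:
                self._data[key] = value

        def get(self, key):
            # Looks like a read, but move_to_end mutates the ordering, so
            # concurrent unlocked calls could corrupt or misorder the structure.
            with self._lock:
                value = self._data[key]
                self._data.move_to_end(key)
                return value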
