Does node-cache uses locks - node.js

I'm trying to understand if the node-cache package uses locks for the cache object and can't find anything.
I tried to look at the source code and it doesn't look like it, but this answer suggests otherwise with the quote:
So there is Redis and node-cache for memory locks.
This cache is used in a CRUD server and I want to make sure that GET/UPDATE requests will not create a race condition on the data.

I don't see any evidence of locking in the code.
If two requests for the same key which is not in the cache are made one after the other, then it will launch two separate fetch() operations and whichever request comes back last is the one that will remain in the cache. This is probably not normally a problem, but an improved implementation could make only one request for that same key and have the second request just wait for the first request to provide the value that was already in flight.
Since the cache itself is all in-memory, all access to the cache is synchronous and thus regulated by Javascript's single threaded nature. So, the only place concurrency issues could affect things in the cache code itself are when they launch an asynchronous fetch() operation.
There are, of course, race conditions waiting to happen in how one uses the code that accesses the data just like there are with a database interface so the calling code has to be smart about how it uses the interface to avoid creating race conditions because of how it calls things.

Unfortunately no, you can write a unit test to confirm it.
I have written a library to fix that and also added read through method to easy the code usage:
https://github.com/KhanhPham2411/node-cache-async-lock

Related

QLDB Transaction Isolation

I was looking at the sample code here: https://docs.aws.amazon.com/qldb/latest/developerguide/getting-started.python.step-5.html, and noticed that the three functions get_document_id_by_gov_id, is_secondary_owner_for_vehicle and add_secondary_owner_for_vin are executed separately in three driver.execute_lambda calls. So if there are two concurrent requests that are trying to add a secondary owner, would this trigger a serialization conflict for one of the requests?
The reason I'm asking is that I initially thought we would have to run all three functions within the same execute_lambda call in order for the serialization conflict to happen since each execute_lambda call uses one session, which in turn uses one transaction. If they are run in three execute_lambda calls, then they would be spread out into three transactions, and QLDB wouldn't be able to detect a conflict. But it seems like my assumption is not true, and the only benefit of batching up the function calls would just be better performance?
Got an answer from a QLDB specialist so going to answer my own question: the operations should have been wrapped in a single transaction so my original assumption was actually true. They are going to update the code sample to reflect this.

Knot Resolver: Paralelism and concurrency in modules

Context
Dear Knot Resolver users, I have a module that hooks into Knot's finish phase,
static knot_layer_api_t _layer = {
.finish = &collect,
};
the purpose of the collect function static int collect(knot_layer_t *ctx) { is to ask an external oraculum via a REST API whether a particular domain is listed for containing a malware or phishing campaign and whether it should be resolved or sinkholed.
It works well as long as Knot Resolver is not targeted with hundreds of concurrent DNS requests.
When that happens, given the fact that the oraculum's API response time varies and could be as long as tens to hundreds of milliseconds on occasion,
clients start to temporarily perceive very long response times from Knot Resolver, far exceeding the hard timeout set on communication to oraculum's API.
Possible problem
I think that the scaling-with-processes actually
renders the module very inefficiently implemented, because queries are being queued and processed by
module one by one (in a particular process). That means if n queries almost-hit oraculum's API timeout limit t, the client
who sent its n+1 query to this particular kresd process, will perceive a very long response time of accumulated n*t.
Or would it? Am I completely off?
When I prototyped similar functionality in GoDNS using goroutines, GoDNS server (at the cost of hideous CPU usage) let numerous
DNS clients' queries talk to the oraculum and return to clients "concurrently".
Question
Is it O.K. to use Apache Portable Runtime threading or OpenMP threading and to start hiding the API's response time in the module? Isn't it a complete Knot Resolver antipattern?
I'm caching oraculum's API responses in a simple in memory ephemeral LRU cache that resides in each kresd process. Would it be possible to use kresd's own MVCC cache instead for my arbitrary structure?
Is it possible that the problem is elsewhere, for instance, that Knot Resolver doesn't expect any blocking delay in finish layer and thus some network queue is filled and subsequent DNS queries are rejected and/or intolerably delayed?
Thanks for pointers (pun intended)
A Knot Resolver developer here :-) (I also repeat some things answered by Jan already.)
Scaling-with-processes is able to work fine. Waiting for responses from name-servers is done by libuv (via event-loop and callbacks, all within a single thread).
Due to the single-threaded style, no layer function should be blocked (on I/O), as that would make everything block on it. AFAIK currently the only case when this can really happen is when (part of) the cache gets swapped-out.
There is the YIELD state http://knot-resolver.readthedocs.io/en/latest/lib.html?highlight=yield It's used when a sub-request is needed before processing of the layer can continue, but I currently don't know details of its working. I don't think it's directly applicable, as resuming the layers seems currently only triggered by a sub-request finishing.
Cache: if you put your module before the rrcache module and you change the RRset, it will get cached changed already.
Knot DNS developer here (not Resolver though). I think you are right. My understanding is that the layer code is executed synchronously in the daemon thread. The asynchrony appears only at the resolver network I/O level.
Internally the server runs libuv loop which just executes callbacks for events on primitives provided by libuv (sockets, timers, signals, etc.). The problem is that you just cannot suspend the running callback (C function) at an arbitrary point, escape back to libuv loop, and continue with the callback execution at some point later.
That said, asynchronous waiting for an event can happen only where this was expected. And the code driving layers doesn't expect that.
Answers:
I'm not very familiar with libapr or OpenMP. But I don't think this could be really solved without reworking the layer interface and making it asynchronous.
The shared cache could be used for sure. If you cannot find the API, jolly Knot DNS folks will happily accept a patch or help you writing one.
This is exactly the case. Knot Resolver doesn't expect blocking code in the layer finish callback.

Should I cache results of functions involving mass file I/O in a node.js server app?

I'm writing my first 'serious' Node/Express application, and I'm becoming concerned about the number of O(n) and O(n^2) operations I'm performing on every request. The application is a blog engine, which indexes and serves up articles stored in markdown format in the file system. The contents of the articles folder do not change frequently, as the app is scaled for a personal blog, but I would still like to be able to add a file to that folder whenever I want, and have the app include it without further intervention.
Operations I'm concerned about
When /index is requested, my route is iterating over all files in the directory and storing them as objects
When a "tag page" is requested (/tag/foo) I'm iterating over all the articles, and then iterating over their arrays of tags to determine which articles to present in an index format
Now, I know that this is probably premature optimisation as the performance is still satisfactory over <200 files, but definitely not lightning fast. And I also know that in production, measures like this wouldn't be considered necessary/worthwhile unless backed by significant benchmarking results. But as this is purely a learning exercise/demonstration of ability, and as I'm (perhaps excessively) concerned about learning optimal habits and patterns, I worry I'm committing some kind of sin here.
Measures I have considered
I get the impression that a database might be a more typical solution, rather than filesystem I/O. But this would mean monitoring the directory for changes and processing/adding new articles to the database, a whole separate operation/functionality. If I did this, would it make sense to be watching that folder for changes even when a request isn't coming in? Or would it be better to check the freshness of the database, then retrieve results from the database? I also don't know how much this helps ultimately, as database calls are still async/slower than internal state, aren't they? Or would a database query, e.g. articles where tags contain x be O(1) rather than O(n)? If so, that would clearly be ideal.
Also, I am beginning to learn about techniques/patterns for caching results, e.g. a property on the function containing the previous result, which could be checked for and served up without performing the operation. But I'd need to check if the folder had new files added to know if it was OK to serve up the cached version, right? But more fundamentally (and this is the essential newbie query at hand) is it considered OK to do this? Everyone talks about how node apps should be stateless, and this would amount to maintaining state, right? Once again, I'm still a fairly raw beginner, and so reading the source of mature apps isn't always as enlightening to me as I wish it was.
Also have I fundamentally misunderstood how routes work in node/express? If I store a variable in index.js, are all the variables/objects created by it destroyed when the route is done and the page is served? If so I apologise profusely for my ignorance, as that would negate basically everything discussed, and make maintaining an external database (or just continuing to redo the file I/O) the only solution.
First off, the request and response objects that are part of each request last only for the duration of a given request and are not shared by other requests. They will be garbage collected as soon as they are no longer in use.
But, module-scoped variables in any of your Express modules last for the duration of the server. So, you can load some information in one request, store it in a module-level variable and that information will still be there when the next request comes along.
Since multiple requests can be "in-flight" at the same time if you are using any async operations in your request handlers, then if you are sharing/updating information between requests you have to make sure you have atomic updates so that the data is shared safely. In node.js, this is much simpler than in a multi-threaded response handler web server, but there still can be issues if you're doing part of an update to a shared object, then doing some async operation, then doing the rest of an update to a shared object. When you do an async operation, another request could run and see the shared object.
When not doing an async operation, your Javascript code is single threaded so other requests won't interleave until you go async.
It sounds like you want to cache your parsed state into a simple in-memory Javascript structure and then intelligently update this cache of information when new articles are added.
Since you already have the code to parse your set of files and tags into in-memory Javascript variables, you can just keep that code. You will want to package that into a separate function that you can call at any time and it will return a newly updated state.
Then, you want to call it when your server starts and that will establish the initial state.
All your routes can be changed to operate on the cached state and this should speed them up tremendously.
Then, all you need is a scheme to decide when to update the cached state (e.g. when something in the file system changed). There are lots of options and which to use depends a little bit on how often things will change and how often the changes need to get reflected to the outside world. Here are some options:
You could register a file system watcher for a particular directory of your file system and when it triggers, you figure out what has changed and update your cache. You can make the update function as dumb (just start over and parse everything from scratch) or as smart (figure out what one item changed and update only that part of the cache) as it is worth doing. I'd suggest you start simple and only invest more in it when you're sure that effort is needed.
You could just manually rebuild the cache once every hour. Updates would take an average of 30 minutes to show, but this would take 10 seconds to implement.
You could create an admin function in your server to instruct the server to update its cache now. This might be combined with option 2, so that if you added new content, it would automatically show within an hour, but if you wanted it to show immediately, you could hit the admin page to tell it to update its cache.

Are there greenDAO thread safety best practices?

I'm having a go with greenDAO and so far it's going pretty well. One thing that doesn't seem to be covered by the docs or website (or anywhere :( ) is how it handles thread safety.
I know the basics mentioned elsewhere, like "use a single dao session" (general practice for Android + SQLite), and I understand the Java memory model quite well. The library internals even appear threadsafe, or at least built with that intention. But nothing I've seen covers this:
greenDAO caches entities by default. This is excellent for a completely single-threaded program - transparent and a massive performance boost for most uses. But if I e.g. loadAll() and then modify one of the elements, I'm modifying the same object globally across my app. If I'm using it on the main thread (e.g. for display), and updating the DB on a background thread (as is right and proper), there are obvious threading problems unless extra care is taken.
Does greenDAO do anything "under the hood" to protect against common application-level threading problems? For example, modifying a cached entity in the UI thread while saving it in a background thread (better hope they don't interleave! especially when modifying a list!)? Are there any "best practices" to protect against them, beyond general thread safety concerns (i.e. something that greenDAO expects and works well with)? Or is the whole cache fatally flawed from a multithreaded-application safety standpoint?
I've no experience with greenDAO but the documentation here:
http://greendao-orm.com/documentation/queries/
Says:
If you use queries in multiple threads, you must call forCurrentThread() on the query to get a Query instance for the current thread. Starting with greenDAO 1.3, object instances of Query are bound to their owning thread that build the query. This lets you safely set parameters on the Query object while other threads cannot interfere. If other threads try to set parameters on the query or execute the query bound to another thread, an exception will be thrown. Like this, you don’t need a synchronized statement. In fact you should avoid locking because this may lead to deadlocks if concurrent transactions use the same Query object.
To avoid those potential deadlocks completely, greenDAO 1.3 introduced the method forCurrentThread(). This will return a thread-local instance of the Query, which is safe to use in the current thread. Every time, forCurrentThread() is called, the parameters are set to the initial parameters at the time the query was built using its builder.
While so far as I can see the documentation doesn't explicitly say anything about multi threading other than this this seems pretty clear that it is handled. This is talking about multiple threads using the same Query object, so clearly multiple threads can access the same database. Certainly it's normal for databases and DAO to handle concurrent access and there are a lot of proven techniques for working with caches in this situation.
By default GreenDAO caches and returns cached entity instances to improve performance. To prevent this behaviour, you need to call:
daoSession.clear()
to clear all cached instances. Alternatively you can call:
objectDao.detachAll()
to clear cached instances only for the specific DAO object.
You will need to call these methods every time you want to clear the cached instances, so if you want to disable all caching, I recommend calling them in your Session or DAO accessor methods.
Documentation:
http://greenrobot.org/greendao/documentation/sessions/#Clear_the_identity_scope
Discussion: https://github.com/greenrobot/greenDAO/issues/776

Is there a blocking redis library for node.js?

Redis is very fast. For most part on my machine it is as fast as say native Javascript statements or function calls in node.js. It is easy/painless to write regular Javascript code in node.js because no callbacks are needed. I don't see why it should not be that easy to get/set key/value data in Redis using node.js.
Assuming node.js and Redis are on the same machine, are there any npm libraries out there that allow interacting with Redis on node.js using blocking calls? I know this has to be a C/C++ library interfacing with V8.
I suppose you want to ensure all your redis insert operations have been performed. To achieve that, you can use the MULTI commands to insert keys or perform other operations.
The https://github.com/mranney/node_redis module queues up the commands pushed in multi object, and executes them accordingly.
That way you only require one callback, at the end of exec call.
This seems like a common bear-trap for developers who are trying to get used to Node's evented programming model.
What happens is this: you run into a situation where the async/callback pattern isn't a good fit, you figure what you need is some way of doing blocking code, you ask Google/StackExchange about blocking in Node, and all you get is admonishment on how bad blocking is.
They're right - blocking, ("wait for the result of this before doing anything else"), isn't something you should try to do in Node. But what I think is more helpful is to realize that 99.9% of the time, you're not really looking for a way to do blocking, you're just looking for a way to make your app, "wait for the result of this before going on to do that," which is not exactly the same thing.
Try looking into the idea of "flow control" in Node rather than "blocking" for some design patterns that could be a clearer fit for what you're trying to do. Here's a list of libraries to check out:
https://github.com/joyent/node/wiki/modules#wiki-async-flow
I'm new to Node too, but I'm really digging Async: https://github.com/caolan/async
Blocking code creates a MASSIVE bottleneck.
If you use blocking code your server will become INCREDIBLY slow.
Remember, node is single threaded. So any blocking code, will block node for every connected client.
Your own benchmarking shows it's fast enough for one client. Have you benchmarked it with a 1000 clients? If you try this you will see why blocking code is bad
Whilst Redis is quick it is not instantaneous ... this is why you must use a callback if you want to continue execution ensuring your values are there.
The only way I think you could (and am not suggesting you do) achieve this use a callback with a variable that is the predicate for leaving a timer.

Resources