Do multiple node.js "requires" impact production run time? - node.js

We are integrating Amazon's node.js SDK into our project and while I do not think it matters due to require's cache and the fact that everything is compiled, I could not find a site that definitively states that multiple requires will not affect performance in run time.
Obviously it depends on what files you are requiring, the contents of those files, and whether or not they could block the event loop or have other code inside of them to slow performance.
I prefer to structure code based on functionality rather than just having a 10000+ line file that does not really relate to the task at hand. I just want to make sure I'm not shooting myself in the foot by break out functionality into separate modules and then requiring on an as needed basis.

Well, require() is a synchronous operation so it should ONLY be used during server initialization, never during an actual request. Therefore, the performance of require() should only affect your server startup time, not your request handling time.
Second, require() does have a cache behind it. It matches the fully resolved path of the module you are attempting to load. So, if you call require(somePath) and a module at that same path has previously been loaded, then the module handle is just immediately returned from the cache. No module is loaded from disk a second time. The module code is not executed a second time.
Obviously it depends on what files you are requiring, the contents of those files, and whether or not they could block the event loop or have other code inside of them to slow performance.
If you are requiring a module for the first time, it WILL block the event loop while loading that module because require() uses blocking, synchronous I/O when the module is not yet cached. That's why you should be doing this at server initialization time, not during a request handler.
I prefer to structure code based on functionality rather than just having a 10000+ line file that does not really relate to the task at hand. I just want to make sure I'm not shooting myself in the foot by break out functionality into separate modules and then requiring on an as needed basis.
Breaking code into logical modules is good for ease of maintenance, ease of testing and ease of reuse, so it's definitely a good thing.
I have seen people go too far where there are so many modules each with only a few lines of code in them that it backfires and makes the project unwieldly to work on, find things in, design test suites for, etc... So, there is a balance.

Related

Should I cache results of functions involving mass file I/O in a node.js server app?

I'm writing my first 'serious' Node/Express application, and I'm becoming concerned about the number of O(n) and O(n^2) operations I'm performing on every request. The application is a blog engine, which indexes and serves up articles stored in markdown format in the file system. The contents of the articles folder do not change frequently, as the app is scaled for a personal blog, but I would still like to be able to add a file to that folder whenever I want, and have the app include it without further intervention.
Operations I'm concerned about
When /index is requested, my route is iterating over all files in the directory and storing them as objects
When a "tag page" is requested (/tag/foo) I'm iterating over all the articles, and then iterating over their arrays of tags to determine which articles to present in an index format
Now, I know that this is probably premature optimisation as the performance is still satisfactory over <200 files, but definitely not lightning fast. And I also know that in production, measures like this wouldn't be considered necessary/worthwhile unless backed by significant benchmarking results. But as this is purely a learning exercise/demonstration of ability, and as I'm (perhaps excessively) concerned about learning optimal habits and patterns, I worry I'm committing some kind of sin here.
Measures I have considered
I get the impression that a database might be a more typical solution, rather than filesystem I/O. But this would mean monitoring the directory for changes and processing/adding new articles to the database, a whole separate operation/functionality. If I did this, would it make sense to be watching that folder for changes even when a request isn't coming in? Or would it be better to check the freshness of the database, then retrieve results from the database? I also don't know how much this helps ultimately, as database calls are still async/slower than internal state, aren't they? Or would a database query, e.g. articles where tags contain x be O(1) rather than O(n)? If so, that would clearly be ideal.
Also, I am beginning to learn about techniques/patterns for caching results, e.g. a property on the function containing the previous result, which could be checked for and served up without performing the operation. But I'd need to check if the folder had new files added to know if it was OK to serve up the cached version, right? But more fundamentally (and this is the essential newbie query at hand) is it considered OK to do this? Everyone talks about how node apps should be stateless, and this would amount to maintaining state, right? Once again, I'm still a fairly raw beginner, and so reading the source of mature apps isn't always as enlightening to me as I wish it was.
Also have I fundamentally misunderstood how routes work in node/express? If I store a variable in index.js, are all the variables/objects created by it destroyed when the route is done and the page is served? If so I apologise profusely for my ignorance, as that would negate basically everything discussed, and make maintaining an external database (or just continuing to redo the file I/O) the only solution.
First off, the request and response objects that are part of each request last only for the duration of a given request and are not shared by other requests. They will be garbage collected as soon as they are no longer in use.
But, module-scoped variables in any of your Express modules last for the duration of the server. So, you can load some information in one request, store it in a module-level variable and that information will still be there when the next request comes along.
Since multiple requests can be "in-flight" at the same time if you are using any async operations in your request handlers, then if you are sharing/updating information between requests you have to make sure you have atomic updates so that the data is shared safely. In node.js, this is much simpler than in a multi-threaded response handler web server, but there still can be issues if you're doing part of an update to a shared object, then doing some async operation, then doing the rest of an update to a shared object. When you do an async operation, another request could run and see the shared object.
When not doing an async operation, your Javascript code is single threaded so other requests won't interleave until you go async.
It sounds like you want to cache your parsed state into a simple in-memory Javascript structure and then intelligently update this cache of information when new articles are added.
Since you already have the code to parse your set of files and tags into in-memory Javascript variables, you can just keep that code. You will want to package that into a separate function that you can call at any time and it will return a newly updated state.
Then, you want to call it when your server starts and that will establish the initial state.
All your routes can be changed to operate on the cached state and this should speed them up tremendously.
Then, all you need is a scheme to decide when to update the cached state (e.g. when something in the file system changed). There are lots of options and which to use depends a little bit on how often things will change and how often the changes need to get reflected to the outside world. Here are some options:
You could register a file system watcher for a particular directory of your file system and when it triggers, you figure out what has changed and update your cache. You can make the update function as dumb (just start over and parse everything from scratch) or as smart (figure out what one item changed and update only that part of the cache) as it is worth doing. I'd suggest you start simple and only invest more in it when you're sure that effort is needed.
You could just manually rebuild the cache once every hour. Updates would take an average of 30 minutes to show, but this would take 10 seconds to implement.
You could create an admin function in your server to instruct the server to update its cache now. This might be combined with option 2, so that if you added new content, it would automatically show within an hour, but if you wanted it to show immediately, you could hit the admin page to tell it to update its cache.

NodeJS: Most efficient way to use "require"

What is the best way to use NodeJS's require function? By this, I'm referring to the placement of the require statement. Isf it better to load all dependencies from the beginning of the script, as you need them, or does it not make a notable difference whatsoever?
This article has a lot useful information regarding how require works, though I still can't come to a definitive conclusion as to which method would be most efficient.
Assuming you're using node.js for some sort of server environment, several things are generally true about that server environment:
You want fast response time to any given request.
The code that runs for processing requests should not use synchronous I/O operations because that seriously lessens the scalability of the server.
Server startup time is generally not something you need to optimize for (within reason) so if you're going to pay an initialization cost somewhere, it is usually better paid once at server startup time.
So, given that require() uses synchronous I/O when the module has not yet been cached, that means you really don't generally want to be doing require() operations inside a request handler. And, you want fast response times for your request handlers so you don't want require() calls inside your handler anyway.
All of these leads to a general rule of thumb that you load necessary modules at startup time into a module level variable that you can reuse from one request to the next and you don't load modules inside your request handlers.
In addition to all of this, if you put all your require() statements in a block near the top of your module, it makes your module a lot more self-documenting about what other modules it depends on and how it initializes those modules. If require() statements are sprinkled all over the code, then it makes it a lot harder for a developer to see what this module is using without a lot more study of the code.
It depends what performance characteristics you're looking for.
require() is not cheap; it has to read the JS file from disk, parse it, and execute any top-level code (and do all of that recursively for all files require()d by that file).
If you put all of your require()s on top, your code may take more time to start, but it won't suddenly slow down later. (note that moving the require() further down in the synchronous top-level code will make no difference beyond order of execution).
If you only require() other modules when first used asynchronously, your first request may be noticeably slower, as Node parses all of your dependencies. This also means that any errors from dependencies won't be caught until later. (note that all require() calls are cached, so the second request won't do any more work)
The other disadvantage to scattering require() calls throughout your code is that it makes it less readable; it's very nice to easily see exactly what each file depends on up top.

Testing background processes in nodejs (using tape)

This is a general question about testing, but I will frame it in the context of Node.js. I'm not as concerned with a particular technology, but it may matter.
In my application, I have several modules that are called upon to do work when my web server receives a request. In the case of some of these modules, I close the request before I call upon them.
What is a good way to test that these modules are doing what they are supposed to do?
The advice here for RSpec is to mock out the work these modules are doing and just ensure that the appropriate methods are being called. This makes sense to me, but in Node.js, since my modules are not global, I don't think I cannot mock out functions without changing my program architecture so that every instance receives instances of objects that it needs1.
[1] This is a well known programming paradigm, but I cannot remember its name right now.
The other option I see is to use setTimeout and take my best guess at when these modules are done with their work.
Neither of these seems ideal.
Am I missing something? Are background processes not tested?
Since you are speaking of integration tests of these background components, a few strategies come to mind.
Take all the asynchronicity out of their operation for test mode. I'm imagining you have some sort of queueing process (that could be a faulty assumption), you toss work into the queue, and then your modules pick up that work and do their task. You could rework your test harness such that the test harness stands in as the queuing mechanism and you effectively get direct control over when the modules execute.
Refactor your modules to take some sort of next callback function. They would end up functioning a bit like Express's middleware layer or how async's each function works, but into each module you'd pass some callback that you call when that module's task is complete. Once all of the modules have reported in, then you can check the state of the program.
Exactly what you already suggested-- wait some amount of time, and if it still isn't done, consider that a failure. Mocha sort of does that, in that if a given test is over a definable threshold, then it's a failure. I don't like this way though, because if you add more tests, they all have to wait the same amount of time.

RequireJS: To Bundle or Not to Bundle

I'm using RequireJS for my web application. I'm using EmberJS for the application framework. I've come to a point where, I think, I should start bundling my application into a single js file. That is where I get a little confused:
If I finally bundle everything into one file for deployment, then my whole application loads in one shot, instead of on demand. Isn't bundling contradictory to AMD in general and RequireJS in particular?
What further confuses me, is what I found on the RequireJS website:
Once you are finished doing development and want to deploy your code for your end users, you can use the optimizer to combine the JavaScript files together and minify it. In the example above, it can combine main.js and helper/util.js into one file and minify the result.
I found this similar thread but it doesn't answer my question.
If I finally bundle everything into one file for deployment, then my whole application loads in one shot, instead of on demand. Isn't bundling contradictory to AMD in general and RequireJS in particular?
It is not contradictory. Loading modules on demand is only one benefit of RequireJS. A greater benefit in my book is that modularization helps to use a divide-and-conquer approach. We can look at it in this way: even though all the functions and classes we put in a single file do not benefit from loading on demand, we still write multiple functions and multiple classes because it helps break down the problem in a structured way.
However, the multiplicity of modules we create in development do not necessarily make sense when running the application in a browser. The greatest cost of on-demand loading is sending multiple HTTP requests over the wire. Let's say your application has 10 modules and you send 10 requests to load it because you load these modules individually. Your total cost is going to be the cost you have to pay to load the bytes from the 10 files (let's call it Pc for payload cost), plus an overhead cost for each HTTP request (let's call it Oc, for overhead cost). The overhead has to do with the data and computations that have to occur to initiate and close these requests. They are not insignificant. So you are paying Pc + 10*Oc. If you send everything in one chunk you pay Pc + 1*Oc. You've saved 9*Oc. In fact the savings are probably greater because (since compression is often used at both ends to reduce the size of the data transmitted) compression is going to provide greater benefits if the entire data is compressed together than if it is compressed as 10 chunks. (Note: the above analysis omits details that are not useful to cover.)
Someone might object: "But you are comparing loading all the modules in separately versus loading all the modules in one chunk. If we load on demand then we won't load all the modules." As a matter of fact, most applications have a core of modules that will always be loaded, no matter what. These are the modules without which the application won't work at all. For some small applications this means all modules, so it make sense to bundle all of them together. For bigger applications, this means that a core set of modules will be used every single time the application runs, but a small set will be used only on occasion. In the latter case, the optimization should create multiple bundles. I have an application like this. It is an editor with modes for various editing needs. A good 90% of the modules belong to the core. They are going to be loaded and used anyway so it makes sense to bundle them. The code for the modes themselves is not always going to be used but all the files for a given mode are going to be needed if the mode is loaded at all so each mode should be its own bundle. So in this case a model with one core bundle and a series of mode bundles makes sense to a) optimize the deployed application but b) keep some of the benefits of loading on demand. That's the beauty of RequireJS: it does not require to do one or the other exclusively.
While developing you want to have single-focused, small files. This causes their number to increase. When running in production, many HTTP requests really harm performance. Then again you do not want to load the entire application upfront - this is also not optimal.
To address this, I have created a small project in GitHub, require-lazy, you can call it plugin to the builder - r.js. It can lazy load parts of your application with a simple syntax and then create separately donloadable bundles during the build process; so if your application consists of 2 views that need to be independently loaded, require-lazy will (ideally) build 3 js files: (1) the bootstrap code and common libraries, (2) view 1 with all its private scripts and (3) view 2 with all its private scripts.
Lazy loading is simply defined as:
define(["lazy!view1"], function(view1) { .... });
And view1 must be accessed with a promise:
view1.get().done(function(realView1) {
...
});
The project is available through npm, the build process through grunt and there is a bower component.
Comments are more than welcome.

Should I worry about File handles in logging from async methods?

I'm using the each() method of the async lib and experiencing some very odd (and inconsistent) errors that appear to be File handle errors when I attempt to log to file from within the child processes.
The array that I'm handing to this method frequently has hundreds of items and I'm curious if Node is having trouble running out of available file handles as it tries to log to file from within all these simultaneous processes. The problem goes away when I comment out my log calls, so it's definitely related to this somehow, but I'm having a tough time tracking down why.
All the logging is trying to go into a single file... I'm entirely unclear on how that works given that each write (presumably) blocks, which makes me wonder how all these simultaneous processes are able to run independently if they're all sitting around waiting on the file to become available to write to.
Assuming that this IS the source of my troubles, what's the right way to log from a process such as Asnyc.each() which runs N number of processes at once?
I think you should have some adjustable limit to how many concurrent/outstanding write calls you are going to do. No, none of them will block, but I think async.eachLimit or async.queue will give you the flexibility to set the limit low and be sure things behave and then gradually increase it to find out what resource constraints you eventually bump up against.

Resources