ZIP creation process with Node Express for large ZIP packages - node.js

Goal
We are standing up a low-volume site where users (browser clients) will select image files (284 KB per file) and then ask a Node Express server to bundle them into a ZIP for download to the web client.
Issues & Design Constraints
The resultant ZIP might be on the order of 50 MB to 5 GB. Therefore we would like to give the user a running progress bar while the ZIP is being constructed. (We assume the browser will give running updates as to the progress of the actual download.)
We expect a low volume of requests (1-2 requests at a time). However, we do not want to completely tie up our 4-core server processor, so we want to minimize synchronous calls that tie up the Express server.
Given the size of the ZIP, we cannot expect it to be assembled entirely in memory.
Are there any other issues we should worry about?
Question
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
So which of the following packages are very Node/ExpressJS friendly packages given the design constraints/goals listed above?
archiver: https://www.npmjs.com/package/archiver
jszip: https://www.npmjs.com/package/jszip
easyzip: https://www.npmjs.com/package/easy-zip
expresszip: https://www.npmjs.com/package/express-zip
zipstream: https://www.npmjs.com/package/zip-stream
What I am seeing above is that most packages first collect the files, then finalize them in memory, and then pipe the result to the HTTP response (probably not good for 5 GB of data, or am I missing something). Some seem to be able to use disk, but the question is whether one gets update events as each file is added.
Others seem to be fully async, and I don't see how you would get a running progress value as each file is added to the ZIP package.

Of the packages listed above, most were not appropriate:
JSZip is mainly for the browser.
easy-zip is a Node wrapper around JSZip, but it does not provide progress notifications during creation.
express-zip is an in-memory, Express-friendly res.zip() helper (but probably would not handle the size of the ZIP we are talking about).
zip-stream is the underlying utility underneath Archiver; Archiver adds the queuing services on top, so one should just use Archiver.
yazl might work, but its interface is more complex for progress tracking than Archiver's.
We chose Archiver, since it had most of the features desired (a minimal usage sketch follows after this list):
Express friendly
low memory footprint
as fast as 7zip for the particular image archives we create (we don't need to compress, the files are large, etc.); you might see a 25% performance hit for other types of archives
It does not let you append to existing archives (one feature we wanted), but adm-zip might fill that gap
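For reference, here is a minimal sketch of the kind of wiring we use: Archiver pipes the ZIP straight to the Express response and reports progress through its progress event. The file list, compression level, and console-based progress reporting are illustrative assumptions, not our production code (in practice the counter would be pushed to the browser via a websocket or a polling endpoint).
const express = require('express');
const archiver = require('archiver');

const app = express();

app.get('/download', (req, res) => {
  const files = ['images/img001.jpg', 'images/img002.jpg']; // hypothetical file list

  res.attachment('images.zip'); // sets Content-Disposition so the browser downloads it

  const archive = archiver('zip', { zlib: { level: 0 } }); // store only; the images are already compressed
  archive.on('error', (err) => res.destroy(err));

  // Fired as entries are appended; handy for a running "x of y files" counter.
  archive.on('progress', (p) => {
    console.log(`added ${p.entries.processed} of ${p.entries.total} files`);
  });

  archive.pipe(res); // stream directly to the HTTP response, not into memory

  for (const f of files) {
    archive.file(f, { name: f }); // queue each file from disk
  }
  archive.finalize(); // no more entries; flush the archive
});

app.listen(3000);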
As for the 7zip solution: we tend not to like reading the entrails of the standard output stream from a spawned child process.
It is messy to find strings in the streams;
it causes context switches to read the stream;
and you end up with a brittle solution trying to deal with whatever the output stream emits (e.g., in the case of 7zip it sometimes jumps the counter by 30%, sometimes by 1%), among other sources of brittleness.

We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
That appears to be a false assumption.
A command line like this will print progress to stdout as each new file is added to the archive:
7z a -bsp1 -bb3 test.7z *
So you can launch that from node.js using the child_process module and capture the progress from stdout as it happens. You will need to use spawn, not exec, so you can read the stdout data live as it arrives.
Running this as a child process will keep your nodejs process free to serve other requests and will allow the child process to manage its own memory, independent of nodejs.
The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.
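A rough sketch of that approach (the exact stdout format varies between 7-Zip versions, so the percentage parsing below is an assumption you would need to adapt, not a guaranteed contract):
const { spawn } = require('child_process');

const files = ['images/img001.jpg', 'images/img002.jpg']; // hypothetical file list
const zip = spawn('7z', ['a', '-bsp1', '-bb3', 'archive.zip', ...files]);

zip.stdout.on('data', (chunk) => {
  const text = chunk.toString();
  // 7z prints lines containing a percentage and the file being added;
  // the layout is version-dependent, so treat this regex as a starting point.
  const match = text.match(/(\d+)%/);
  if (match) {
    console.log(`approx ${match[1]}% complete`); // push to the client however you like
  }
});

zip.stderr.on('data', (chunk) => console.error(chunk.toString()));

zip.on('close', (code) => {
  console.log(code === 0 ? 'archive complete' : `7z exited with code ${code}`);
});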

Related

Stream files generated on request in memory

I have a loop where I generate files (around 500 KB each), and if there is too much data Node throws an out-of-memory error (no wonder, it's around 4 GB of data). I read about streams and I'm trying to understand how I can incorporate them into my app.
Most of the information I find is about streaming a file that is already on disk. What I want to do is create files on the fly (which I already do), send them one by one (or however chunks work) as they are generated, and hand them to the client in a zip when it's done (so it's easy on the RAM).
I don't ask for specific code - more about where to look so I can read about it.
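One concrete direction to read about (a sketch assuming the archiver package discussed earlier; generateFile is a placeholder for whatever produces each ~500 KB Buffer): append each generated file to a zip stream that is piped to the response, and only generate the next file once the previous entry has been flushed, so a single generated file is in memory at a time.
const archiver = require('archiver');

// res is the Express/http response; count is how many files to generate.
function streamGeneratedZip(res, count, generateFile) {
  const archive = archiver('zip', { zlib: { level: 1 } });
  archive.on('error', (err) => res.destroy(err));
  archive.pipe(res); // zip bytes flow to the client as they are produced

  let i = 0;
  const appendNext = () => {
    if (i >= count) { archive.finalize(); return; } // all files queued; close the archive
    archive.append(generateFile(i), { name: `file-${i}.dat` }); // one Buffer at a time
    i += 1;
  };
  archive.on('entry', appendNext); // fired after each entry has been written out
  appendNext();
}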

Resize image when uploading to server or when serving from server to client?

My website uses many images. On a slow day users will upload hundreds of new images.
I'm trying to figure out the best practice for manipulating image sizes.
This project uses Node.js with the gm module for manipulating images, but I don't think this question is Node- or gm-specific.
I came up with several strategies, but I can't make a decision as to which is the best, and I am not sure if I am missing an obvious best-practice strategy.
Please enlighten me with your thoughts and experience.
Option 1: Resize the file with gm on every client request.
Option 1 pros:
If I run gm function every time I serve a file, I can control the size, quality, compression, filters and so on whenever I need it.
On the server I only save one full-quality, full-size version of the file, which saves storage space.
Option 1 cons:
gm is very resource-intensive, and that means I will be abusing my RAM for every single image served to every single client.
It means I will be always working from a big file, which makes things even worse.
I will always have to fetch the file from my storage (in my case S3) to the server, then manipulate it, then serve it. It seems like it would create redundant bandwidth issues.
Option 2: resize the file on first upload and keep multiple sizes of the file on the server.
Option 2 pros:
I will only have to use gm on uploads.
Serving the files will require almost no resources.
Option 2 cons:
I will use more storage because I will be saving multiple versions of the same file (i.e full, large, medium, small, x-small) instead of only one version.
I will be limited to using only the sizes that were created when the user uploaded their image.
Not flexible - If in the future I decide I need an additional size version (x-x-small for instance) I will have to run a script that processes every image in my storage to create the new version of the image.
Option 3:
Use option 2 to only process files on upload, but retain a resize module when serving file sizes that don't have a stored version in my storage.
Option 3 pros:
I will be able to reduce resource usage significantly when serving files in a selection of set sizes.
Option 3 cons:
I would still take more storage as in option 2 vs option 1.
I will still have to process files when I serve them in cases where I don't have the file size I want
Option 4: I do not create multiple versions of files on upload. I do resize the images when I serve them, BUT whenever an image size is requested, that version of the file will be saved in my storage, so for future requests I will not have to process the image again.
Option 4 pros:
I will only use storage for the versions I use.
I could add a new file size whenever I need; it will be created automatically on an as-needed basis if it doesn't already exist.
Will use a lot of resources only once per file
Option 4 cons:
Files that are only accessed once will be both resource-intensive AND storage-intensive, because I will access the file, see that the size version I need doesn't exist, create the new version, use the resources needed, and save it to my storage, wasting storage space on a file that will only be used once (note: I can't know in advance how many times a file will be used).
I will have to check if the file already exists for every request.
So,
Which would you choose? Why?
Is there a better way than the ways I suggested?
The solution highly depends on the usage you have for your resources. If you have intensive utilisation, then option 2 is by far the better one. If not, option 1 could also work nicely.
From a qualitative point of view I think option 4 is the best, of course. But for simplicity and automation, I think option 2 is way better.
Because simplicity matters, I suggest mixing options 2 and 4: you will have a fixed list of sizes (e.g. large, medium, small), but you will not process them on upload; instead, process them when requested, as in option 4.
So in the end, in the worst case, you will arrive at the option 2 solution.
My final word would be that you should also use the <img> and/or <canvas> object in your website to perform the final sizing, so that the small computation overhead is not done on the server side.
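A sketch of that mixed approach, assuming local disk storage, the gm module, and a hypothetical whitelist of sizes (S3 transfer, cache headers, and input sanitisation are left out for brevity):
const express = require('express');
const fs = require('fs');
const path = require('path');
const gm = require('gm');

const app = express();
const SIZES = { small: 200, medium: 600, large: 1200 }; // fixed whitelist, as in option 2
fs.mkdirSync('cache', { recursive: true });

app.get('/images/:size/:name', (req, res) => {
  const width = SIZES[req.params.size];
  if (!width) return res.status(404).end(); // only whitelisted sizes are ever generated

  // NOTE: req.params.name should be validated against path traversal in real code.
  const original = path.resolve('uploads', req.params.name);
  const resized = path.resolve('cache', `${req.params.size}-${req.params.name}`);

  if (fs.existsSync(resized)) return res.sendFile(resized); // already generated earlier

  // First request for this size: generate it once (option 4), then serve it.
  gm(original).resize(width).write(resized, (err) => {
    if (err) return res.status(500).end();
    res.sendFile(resized);
  });
});

app.listen(3000);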

Node .fs Working with a HUGE Directory

Picture a directory with a ton of files. As a rough gauge of magnitude I think the most that we've seen so far is a couple of million but it could technically go another order higher. Using node, I would like to read files from this directory, process them (upload them, basically), and then move them out of the directory. Pretty simple. New files are constantly being added while the application is running, and my job (like a man on a sinking ship holding a bucket) is to empty this directory as fast as it's being filled.
So what are my options? fs.readdir is not ideal: it loads all of the filenames into memory, which becomes a problem at this kind of scale. Especially as new files are being added all the time, it would require repeated calls. (As an aside for anybody referring to this in the future, there is something being proposed to address this whole issue which may or may not have been realised within your timeline.)
I've looked at the myriad of fs drop-ins (graceful-fs, chokidar, readdirp, etc.), none of which has this particular use case within its remit.
I've also come across a couple of people suggesting that this can be handled with child_process, and there's a wrapper called inotifywait which tasks itself with exactly what I am asking but I really don't understand how this addresses the underlying problem, especially at this scale.
I'm wondering if what I really need to do is find a way to just get the first file (or, realistically, batch of files) from the directory without having the overhead of reading the entire directory structure into memory. Some sort of stream that could be terminated after a certain number of files had been read? I know Go has a parameter for reading the first n files from a directory but I can't find a node equivalent, has anybody here come across one or have any interesting ideas? Left-field solutions more than welcome at this point!
You can use your operating system's file-listing command and stream the result into Node.js.
For example, on Linux:
var cp = require('child_process');
// `ls` lists the directory; its stdout is a stream we can read chunk by chunk as it arrives.
var stdout = cp.exec('ls').stdout;
stdout.on('data', function (a) {
  console.log(a);
});
RunKit: https://runkit.com/aminanadav/57da243180f3bb140059a31d
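As an aside, newer Node versions (12.12+) expose fs.opendir, which hands back directory entries incrementally instead of loading every filename into memory; this may be what the proposal alluded to in the question turned into. A rough sketch of draining the directory a batch at a time (batchSize and processFile are placeholders):
const { opendir } = require('fs').promises;

// Process a huge directory in small batches; processFile would upload and then move the file.
async function drain(dirPath, batchSize, processFile) {
  const dir = await opendir(dirPath);
  let batch = [];
  for await (const entry of dir) {               // entries are yielded incrementally
    if (entry.isFile()) batch.push(entry.name);
    if (batch.length >= batchSize) {
      await Promise.all(batch.map(processFile)); // handle one batch before reading more
      batch = [];
    }
  }
  if (batch.length) await Promise.all(batch.map(processFile));
}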

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large amount of files (thousands of different files, with avg. size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- following its reading of a file from the 'incoming' folder on the 1.5TB hard drive -- will process the information and output the result back to the 'processed' folder on the 1.5TB drive. The processed information for every file is of roughly the same average size as the input files (about ~2MB per file).
Are there any 'dos' and 'don'ts' when building such an operation? Is it a problem to have 30 or so machines read (or write) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the amount of parallel files you read, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, some seconds per file), you'll pass a point at which adding more processes only slows matters to a crawl, since every process is reading and writing results and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves (see the sketch after this answer).
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?
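For illustration, a rough sketch of the atomic-claim idea against a MySQL table (the tasks schema, column names, and the mysql2 package are assumptions; Gearman or any proper queue would replace all of this):
const mysql = require('mysql2/promise');

// Assumed schema: tasks(id INT PRIMARY KEY, filename VARCHAR(255), claimed_by VARCHAR(64) NULL, done TINYINT)
async function claimNextTask(pool, workerId) {
  // Atomically mark one unclaimed task as ours; two workers can never claim the same row.
  const [result] = await pool.execute(
    'UPDATE tasks SET claimed_by = ? WHERE claimed_by IS NULL AND done = 0 ORDER BY id LIMIT 1',
    [workerId]
  );
  if (result.affectedRows === 0) return null; // nothing left to claim

  // Fetch the oldest task this worker has claimed but not yet finished.
  const [rows] = await pool.execute(
    'SELECT id, filename FROM tasks WHERE claimed_by = ? AND done = 0 ORDER BY id LIMIT 1',
    [workerId]
  );
  return rows[0] || null;
}

// Usage: const pool = mysql.createPool({ host: 'db', user: 'app', database: 'work' });
//        const task = await claimNextTask(pool, 'machine-07'); // workerId could be hostname + pid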

Uploading & extracting archive (zip, rar, targz, tarbz) automatically - security issue?

I'd like to create following functionality for my web-based application:
the user uploads an archive file (zip/rar/tar.gz/tar.bz etc.) whose content is several image files
the archive is automatically extracted after upload
the images are shown in an HTML list (whatever)
Are there any security issues involved with the extraction process? E.g. the possibility of malicious code execution contained within the uploaded files (or a well-prepared archive file), or anything else?
Aside from the possibility of exploiting the system with things like buffer overflows if it's not implemented carefully, there can be issues if you blindly extract a well-crafted compressed file that contains a large file full of redundant patterns (a zip bomb). The compressed version is very small, but when you extract it, it will take up the whole disk, causing denial of service and possibly crashing the system.
Also, if you are not careful enough, the client might hand you a zip file with server-side executable content (.php, .asp, .aspx, ...) inside and then request that file over HTTP, which, if the server is not configured properly, can result in arbitrary code execution on the server.
In addition to Medrdad's answer: Hosting user-supplied content is a bit tricky. If you are hosting a zip file, then that can be used to store Java class files (the format is also used for other things), and therefore the "same origin policy" can be broken. (There was the GIFAR attack, where a zip was attached to the end of another file, but that no longer works with the Java Plug-in/Web Start.) Image files should at the very least be checked that they actually are image files. Obviously there is a problem with web browsers having buffer overflow vulnerabilities, so now your site could be used to attack your visitors (this may make you unpopular). You may find some client-side software using, say, regexes to parse data, so data in the middle of the image file can be executed. Zip files may have naughty file names (for instance, directory traversal with ../ and strange characters).
What to do (not necessarily an exhaustive list):
Host user supplied files on a completely different domain.
The domain with user files should use different IP addresses.
If possible decode and re-encode the data.
There's another Stack Overflow question on zip bombs - I suggest decompressing with a streaming unzipper (e.g. ZipInputStream) and stopping if it gets too big (see the sketch after this list).
Where native code touches user data, do it in a chroot gaol.
White list characters or entirely replace file names.
Potentially you could use an IDS of some description to scan for suspicious data (I really don't know how much this gets done - make sure your IDS isn't written in C!).
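For the Node side, a rough sketch of that kind of defensive inspection before extraction, assuming the yauzl package (the size cap and file-name checks are illustrative; declared sizes in zip headers can lie, so a real implementation should also cap the bytes actually read while extracting):
const yauzl = require('yauzl');
const path = require('path');

const MAX_TOTAL_BYTES = 200 * 1024 * 1024; // arbitrary cap on total uncompressed size

function inspectZip(zipPath, cb) {
  let total = 0;
  yauzl.open(zipPath, { lazyEntries: true }, (err, zipfile) => {
    if (err) return cb(err);
    zipfile.readEntry();
    zipfile.on('entry', (entry) => {
      // Reject directory traversal and absolute paths in entry names.
      if (entry.fileName.includes('..') || path.isAbsolute(entry.fileName)) {
        zipfile.close();
        return cb(new Error('suspicious file name: ' + entry.fileName));
      }
      total += entry.uncompressedSize; // zip-bomb guard based on declared sizes
      if (total > MAX_TOTAL_BYTES) {
        zipfile.close();
        return cb(new Error('archive too large when extracted'));
      }
      zipfile.readEntry(); // move on to the next entry
    });
    zipfile.on('end', () => cb(null, total)); // looks sane; proceed to extract with care
    zipfile.on('error', cb);
  });
}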
