Stream files generated on request in memory - node.js

I have a loop where I generate files (around 500 KB each), and if there is too much data Node throws an out-of-memory error (no wonder, it's around 4 GB of data). I read about streams and I'm trying to understand how I can incorporate them into my app.
Most of the information I find is about streaming a file that already exists on disk. What I want to do is create the files on the fly (which I already do), send them one by one (or however chunks work) as they are generated, and hand them to the client as a zip when it's done (so it's easy on the RAM).
I'm not asking for specific code - more about where to look so I can read up on it.
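One place to start (hedged): the archiver package, which comes up again further down, can pipe a ZIP directly to the HTTP response while the entries are generated lazily. A minimal sketch, assuming an Express route and a hypothetical generateFileChunks() generator that yields each file's data in pieces:

```js
// A minimal sketch, assuming an Express route and the `archiver` package.
// generateFileChunks() is a hypothetical generator standing in for however
// the files are actually produced.
const express = require('express');
const archiver = require('archiver');
const { Readable } = require('stream');

const app = express();

app.get('/download', (req, res) => {
  res.attachment('generated.zip');                 // sets Content-Disposition

  const archive = archiver('zip', { zlib: { level: 1 } });
  archive.on('error', err => res.destroy(err));
  archive.pipe(res);                               // stream straight to the client

  for (let i = 0; i < 1000; i++) {
    // Wrap each generator in a lazy Readable so a file's data is only
    // produced when the archive actually pulls it - the whole ~4 GB never
    // sits in memory at once.
    archive.append(Readable.from(generateFileChunks(i)), { name: `file-${i}.bin` });
  }

  archive.finalize();
});

app.listen(3000);
```

Because the response applies backpressure, the ZIP data should only be produced roughly as fast as the client can download it.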

Related

Is there a way to make rtorrent write bytes "on-the-go" in the disk?

Hello stackoverflow users,
I am having a bit of trouble getting something to work.
What do I want to accomplish?
I want to read bytes (and write them somewhere) while the file is downloading in rtorrent.
First of all, I have rtorrent connected to Flood. In my Node.js app I have an interval to get the percentage of files downloaded in rtorrent.
Based on the percentage, I read x bytes from the file rtorrent is downloading and write them to the user.
However, after diving deep into how rtorrent writes to disk, I've discovered that rtorrent doesn't write to disk "on-the-go", which means my app won't work.
My question is: is there a way I can make rtorrent write the bytes directly to disk rather than after a certain period (I don't know what the period is; it just seems random after many, many tests)?
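For reference, a rough sketch of the polling/partial-read approach described above. getDownloadedPercentage() (the Flood/rtorrent query), totalSize, and userResponse are placeholders, and as noted it only works if rtorrent has actually flushed those bytes to disk:

```js
const fs = require('fs');

let offset = 0;                                   // bytes already forwarded

const timer = setInterval(async () => {
  const pct = await getDownloadedPercentage();    // hypothetical Flood/rtorrent query
  const available = Math.floor(totalSize * (pct / 100));

  if (available > offset) {
    // Forward only the newly available byte range (ignoring overlap between
    // ticks for the sake of the sketch).
    fs.createReadStream('/downloads/file.bin', { start: offset, end: available - 1 })
      .pipe(userResponse, { end: false });        // hypothetical client stream
    offset = available;
  }
  if (pct >= 100) clearInterval(timer);
}, 1000);
```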

Correct way to read and append file from different process in node.js

I have 2 node.js processes that both point to the same file on disk: one just appends to the file, the other just reads from it...
Is this a correct design if I am OK with reading half-committed data? And are there any other things/issues to look out for?
The reason for doing this is that I am trying to implement a Write-Ahead Log, which needs to be persistent and will not grow beyond 5 MB. Or is there a better way to do it on a single host?
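One possible shape for this setup, sketched under the assumption of a newline-framed log (the file name and record format are illustrative):

```js
const fs = require('fs');

// Process A: the append-only writer. flags: 'a' means every write lands at
// the current end of the file.
const wal = fs.createWriteStream('wal.log', { flags: 'a' });
function logRecord(record) {
  wal.write(JSON.stringify(record) + '\n');
}

// Process B: the reader, tailing whatever has been appended since its last
// look. Newline framing makes a half-committed trailing record easy to
// detect and skip on the next pass.
let readOffset = 0;
setInterval(() => {
  fs.stat('wal.log', (err, stat) => {
    if (err || stat.size <= readOffset) return;
    fs.createReadStream('wal.log', { start: readOffset, end: stat.size - 1 })
      .on('data', chunk => process.stdout.write(chunk))   // handle the new bytes
      .on('end', () => { readOffset = stat.size; });
  });
}, 500);
```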

Zip Create Process with Node Express of large ZIP packages

Goal
We are standing up a low-volume site where users (browser clients) will select image files (284 KB per file) and then request a Node Express server to bundle them into a ZIP for download to the web client.
Issues & Design Constraints
The resultant ZIP might be on the order of 50 MB - 5 GB. Therefore we would like to give the user a running progress bar while the ZIP is being constructed. (We assume the browser will give running updates as to the progress of the actual download.)
We expect a low volume of requests (1-2 requests at a time). However, we do not want to completely tie up our 4-core server processor, so we want to minimize synchronous calls that tie up the Express server.
Given the size of the ZIP, we cannot expect the ZIP to be assembled only in memory.
Are there any other issues we should worry about?
Question
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
So which of the following packages are very Node/ExpressJS friendly packages given the design constraints/goals listed above?
archiver: https://www.npmjs.com/package/archiver
jszip: https://www.npmjs.com/package/jszip
easyzip: https://www.npmjs.com/package/easy-zip
expresszip: https://www.npmjs.com/package/express-zip
zipstream: https://www.npmjs.com/package/zip-stream
What I am seeing above is that most packages first collect the files, then finalize them in memory, and then pipe them to the HTTP response (probably not good for 5 GB of data, or am I missing something). Some seem to be able to use the disk, but the question is whether you get update events as each file is added.
Others seem to be fully async, and I don't see how you would get a running progress value as each file is added to the ZIP package.
Of the packages listed above, most were not appropriate:
JSZip is mainly for the browser.
EasyZip is a node wrapper for JSZip, but it does not provide progress notifications during creation.
Express-Zip is an in-memory, Express-friendly res solution (but probably would not handle the size of the ZIP we are talking about).
Zip-Stream is the underlying utility underneath Archiver. Archiver adds the queuing services, so one should just use Archiver.
YAZL might work, but its interface is more complex for progress tracking than Archiver's.
We chose Archiver, since it had most of the features we wanted:
Express friendly
low memory footprint
as fast as 7-Zip for the particular image archives we create (we don't need to compress, files are large, etc.); you might see a 25% performance hit for other types of archives
It does not let you append to existing archives (one feature we wanted), but adm-zip might fill that gap.
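Roughly how we wire Archiver up (a sketch, not production code; the event names should be checked against the archiver version you install, and reportProgress() is a placeholder for however you push progress to the browser):

```js
const archiver = require('archiver');

function sendZip(res, files, totalCount) {
  res.attachment('images.zip');

  // store: true skips compression entirely, which suits already-large images.
  const archive = archiver('zip', { store: true });
  archive.on('error', err => res.destroy(err));

  let added = 0;
  archive.on('entry', () => {
    added += 1;
    reportProgress(added, totalCount);   // placeholder: websocket / SSE / polling
  });

  archive.pipe(res);                     // stream to the client, not into memory
  for (const f of files) archive.file(f.path, { name: f.name });
  archive.finalize();
}
```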
As for the 7zip solution: we tend not to like reading the entrails of a standard output stream from a spawned child process.
It is messy to find strings in the streams,
it causes context switches to read the stream,
and you end up with a brittle solution trying to deal with whatever the output stream emits (e.g. in the case of 7zip it sometimes jumps the counter by 30%, sometimes by 1%), along with other sources of brittleness.
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
That appears to be a false assumption.
A command line like this will show progress for each file added to the archive on stdout as each new file is added:
7z a -bsp1 -bb3 test.7z *
So, you can launch that from node.js using the child_process module, and you should be able to capture the stdout progress as it happens. You will need to use spawn, not exec, so you can get the stdout data live as it is produced.
Running this as a child process will keep your nodejs process free to serve other requests and will allow the child process to manage its own memory, independently of nodejs.
The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.
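A sketch of what that looks like from node.js (the stdout handling is deliberately simplistic; a real implementation would parse the -bsp1/-bb3 output lines properly):

```js
const { spawn } = require('child_process');

// shell: true lets the shell expand the * glob, matching the command above.
const child = spawn('7z', ['a', '-bsp1', '-bb3', 'test.7z', '*'], { shell: true });

child.stdout.on('data', chunk => {
  // -bb3 logs each file as it is added; -bsp1 emits percentage progress.
  console.log('7z:', chunk.toString().trim());
});

child.on('close', code => {
  console.log(code === 0 ? 'archive complete' : `7z exited with code ${code}`);
});
```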

Max writeable streams in Node JS

Prologue:
I'm experimenting with a "multi-tenant" flat file database system. On application start, prior to starting the server, each flat file database (which I'm calling a journal) is converted into a large JavaScript object in memory. From there the application starts its service.
The application's runtime behavior will be to serve requests from many different databases (one DB per domain). All reads come from the in-memory object alone, while any CRUD operation both modifies the in-memory object and streams the change to the journal.
Question:
If I have N of these database objects in memory, already loaded from flat files (let's say averaging around 1 MB each), what kind of limitations would I be dealing with by having N write streams open?
If you are using streams that have an open file handle behind them, then your limit for how many of them you can have open will likely be governed by the process limit on open file handles which will vary by OS and (in some cases) by how you have that OS configured. Each open stream also consumes some memory, both for the stream object and for read/write buffers associated with the stream.
If you are using some sort of custom stream that just reads/writes to memory, not to files, then there would be no file handle involved and you would just be limited by the memory consumed by the stream objects and their buffers. You could likely have thousands of these with no issues.
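A sketch of the per-tenant write-stream arrangement under discussion (paths and record format are illustrative):

```js
const fs = require('fs');

const journals = new Map();   // domain -> open append stream (one fd each)

function journalFor(domain) {
  if (!journals.has(domain)) {
    journals.set(domain, fs.createWriteStream(`journals/${domain}.log`, { flags: 'a' }));
  }
  return journals.get(domain);
}

function recordChange(domain, change) {
  // The in-memory object is updated elsewhere; here we only persist the change.
  journalFor(domain).write(JSON.stringify(change) + '\n');
}

// Each entry in the map holds an open file descriptor, so N is bounded by the
// OS open-file limit; closing idle streams (stream.end()) keeps you under it.
```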
Some reference posts:
Node.js and open files limit in linux
How do I change the number of open files limit in Linux?
Check the open FD limit for a given process in Linux

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large amount of files (thousands of different files, with avg. size of 2MB per file).
All the information is stored on one 1.5 TB network hard drive and will be accessed (read) by about 30 different machines. For efficiency, each machine will read (and process) different files (there are thousands of files that need to be processed).
Every machine -- after reading a file from the 'incoming' folder on the 1.5 TB drive -- will process the information and output the processed information back to the 'processed' folder on the same drive. The processed information for every file is roughly the same average size as the input (about 2 MB per file).
Are there any 'dos' and 'don'ts' when building such an operation? Is it a problem to have 30 or so machines read (or write) information on the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters.)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the amount of parallel files you read, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, a few seconds per file), you'll pass a point at which adding more processes only slows things down to a crawl, since every process is reading and writing results and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write, the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
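A hedged sketch of that atomic-claim idea, assuming a tasks table with (id, file, claimed_by, done) columns and the mysql2 client, both of which are illustrative rather than part of the original suggestion:

```js
const mysql = require('mysql2/promise');

async function claimNextTask(conn, workerId) {
  // The UPDATE is atomic, so only one worker can flip claimed_by for a row.
  const [result] = await conn.execute(
    'UPDATE tasks SET claimed_by = ? WHERE claimed_by IS NULL AND done = 0 LIMIT 1',
    [workerId]
  );
  if (result.affectedRows === 0) return null;      // nothing left to claim

  const [rows] = await conn.execute(
    'SELECT id, file FROM tasks WHERE claimed_by = ? AND done = 0 LIMIT 1',
    [workerId]
  );
  return rows[0] || null;
}
```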
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?
