Prologue:
I'm experimenting with a "multi-tenant" flat file database system. On application start, prior to starting the server, each flat file database (which I'm calling a journal) is converted into a large JavaScript object in memory. From there the application starts its service.
The application's runtime behavior will be to serve requests from many different databases (one DB per domain). All reads come from the in-memory object alone, while any writes (CRUD operations) both modify the in-memory object and are streamed to the journal.
Question:
If I have N of these database objects in memory, already loaded from flat files (say averaging around 1 MB each), what kind of limitations would I be dealing with by having N write streams?
If you are using streams that have an open file handle behind them, then your limit on how many of them you can have open will likely be governed by the process limit on open file handles, which varies by OS and (in some cases) by how you have that OS configured. Each open stream also consumes some memory, both for the stream object and for the read/write buffers associated with the stream.
If you are using some sort of custom stream that just reads/writes to memory, not to files, then there would be no file handle involved and you would just be limited by the memory consumed by the stream objects and their buffers. You could likely have thousands of these with no issues.
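As a rough illustration of the file-handle point above, here is a minimal sketch (assuming one append-only journal file per domain; the directory layout and function names are only illustrative) of holding N write streams open:

    // Minimal sketch: one append-only write stream per journal (illustrative layout).
    const fs = require('fs');
    const path = require('path');

    const journalStreams = new Map();

    function getJournalStream(domain) {
      let stream = journalStreams.get(domain);
      if (!stream) {
        // Each stream keeps one file descriptor open for the life of the process,
        // so N domains means N descriptors counted against the per-process limit.
        const file = path.join('journals', domain + '.journal'); // assumes ./journals exists
        stream = fs.createWriteStream(file, { flags: 'a' });
        journalStreams.set(domain, stream);
      }
      return stream;
    }

    function appendToJournal(domain, record) {
      // write() returns false when the internal buffer is full (backpressure);
      // a real implementation would wait for 'drain' before writing more.
      return getJournalStream(domain).write(JSON.stringify(record) + '\n');
    }

Both the descriptor count and the per-stream buffer memory scale linearly with N.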
Some reference posts:
Node.js and open files limit in linux
How do I change the number of open files limit in Linux?
Check the open FD limit for a given process in Linux
Related
I have a loop where I generate files (around 500 KB each), and if there is too much data Node throws an out-of-memory error (no wonder, it's around 4 GB of data). I read about streams and I'm trying to understand how I can incorporate them into my app.
Most of the information I find is about streaming a file that is already on disk. What I want to do is to create files on the fly (which I already do), send them one by one (or in whatever chunks streaming uses) as they are generated, and hand the result to the client as a ZIP when it's done (so it's easy on the RAM).
I'm not asking for specific code - more about where to look so I can read up on it.
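One direction worth reading about (a sketch, not the only option: it assumes the third-party archiver package and a hypothetical generateFile() helper) is to append generated content to a streaming ZIP that is piped straight to the HTTP response, so nothing is accumulated in memory:

    // Sketch: stream generated content into a ZIP without holding it all in RAM.
    // 'archiver' is an assumption; generateFile() is a hypothetical stand-in for
    // the existing on-the-fly file generation.
    const archiver = require('archiver');

    function sendGeneratedZip(res, names) {
      const archive = archiver('zip');
      archive.pipe(res);                      // chunks flow to the client as they are produced
      for (const name of names) {
        archive.append(generateFile(name), { name });   // Buffer or stream, queued and streamed
      }
      archive.finalize();                     // writes the central directory and ends the stream
    }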
Consider there are multiple small applications running at the same time; each of them opens a few files (each 10 MB - 100 MB in size), loads them into memory, performs some sort of calculation, then writes the result back to disk in a different file, which is also 10 MB - 100 MB.
My question is: are these read and write operations considered as sequential disk access or random disk access?
If we focus on one application, the access seems to be sequential, as all files are read from beginning to end, continuously (or from a specific offset to the end).
However, if we look at all the applications together, they are reading and writing files at the same time, which seems like random disk access.
Examples of sequential and random disk access, especially in the multi-thread / multi-application scenario, would be appreciated!
It does not matter how many applications or threads you use to access files.
The terms sequential access and random access apply to the media you are accessing.
Suppose you need to read a data block at an arbitrary offset in a file encrypted with a chained algorithm (where each block is encrypted using data that depends on the encryption of the previous block).
While you can seek to the block and read it, you still can't decrypt it (so you can't really access it).
As a result, such a medium does not support random access to the data: you have to decrypt the preceding stream to fetch the block at a specific offset of the file.
Such a medium supports only sequential access.
On the other hand, your non-encrypted file can contain structured information, so you can read the head of the file to get the geometry of your data and access an arbitrary block without reading all of the preceding bytes.
This is an example of random access.
Multiple applications can mix sequential and random access methods while accessing the data in the same file simultaneously.
If some app uses random access to fetch a piece of information from a huge file, you can say it uses random access for that particular operation, but it most likely reads other files as well, so it can use both approaches.
So random and sequential access are capabilities of the media, not properties of an application's operations.
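To make the distinction concrete in code (the file name and offsets are illustrative), a sequential read walks the file front to back, while a random-access read seeks straight to the block it wants:

    const fs = require('fs');

    // Sequential access: read the file front to back as a stream.
    fs.createReadStream('data.bin')
      .on('data', (chunk) => { /* process chunks in order */ });

    // Random access: jump straight to byte offset 4096 and read one 512-byte block,
    // without touching any of the preceding bytes.
    const fd = fs.openSync('data.bin', 'r');
    const block = Buffer.alloc(512);
    fs.readSync(fd, block, 0, 512, 4096);   // (fd, buffer, bufferOffset, length, position)
    fs.closeSync(fd);

Whether the second pattern is actually usable depends on the media, as described above: an encrypted chained stream forces you back to the first pattern.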
Goal
We are standing up a low-volume site where users (browser clients) will select image files (284 KB per file) and then ask a Node/Express server to bundle them into a ZIP for download to the web client.
Issues & Design Constraints
The resultant ZIP might be on the order of 50 MB - 5 GB. Therefore we would like to give the user a running progress bar while the ZIP is being constructed. (We assume the browser will give running updates as to the progress of the actual download.)
We expect a low volume of requests (1-2 requests at a time). However, we do not want to completely tie up our 4-core server, so we want to minimize synchronous calls that block the Express server.
Given the size of the ZIP, we cannot expect it to be assembled entirely in memory.
Are there any other issues we should worry about?
Question
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
So which of the following packages are very Node/ExpressJS friendly packages given the design constraints/goals listed above?
archiver: https://www.npmjs.com/package/archiver
jszip: https://www.npmjs.com/package/jszip
easyzip: https://www.npmjs.com/package/easy-zip
expresszip: https://www.npmjs.com/package/express-zip
zipstream: https://www.npmjs.com/package/zip-stream
What I am seeing above is that most packages first collect the files, then finalize them in memory, and then pipe them to the HTTP response (probably not good for 5 GB of data, or am I missing something?). Some seem to be able to use the disk, but the question is whether one gets update events as each file is added.
Others seem to be fully async, and I don't see how you would get a running progress value as each file is added to the ZIP package.
Of the packages listed above, most were not appropriate:
JSZip is mainly for the browser.
EasyZip is a node wrapper for JSZip, but it does not provide progress notifications during creation.
Express-Zip is an in-memory, Express-friendly res solution (but it probably would not handle the size of the ZIP we are talking about).
ZIP-Stream is the underlying utility underneath Archiver. Archiver adds the queuing services, so one should just use Archiver.
YAZL might work, but its interface is more complex for progress tracking than Archiver's.
We chose Archiver, since it had most of the features desired:
Express Friendly
low memory footprint
as fast as 7zip for the particular image archives we create (we don't need to compress; the files are large, etc.). You might take a 25% performance hit for other types of archives.
It does not let you append to existing archives (that was one feature we wanted), but adm-zip might fill that gap.
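For reference, this is roughly how the Archiver-based route looks wired into Express (simplified; the 'entry' event name comes from the Archiver docs for the version we used, so verify it against yours, and notifyProgress/listImageFiles are our own illustrative helpers):

    const express = require('express');
    const archiver = require('archiver');
    const app = express();

    app.get('/download', (req, res) => {
      res.attachment('images.zip');                              // sets Content-Disposition
      const archive = archiver('zip', { zlib: { level: 0 } });   // no compression needed for images

      archive.on('entry', (entry) => {
        // Fired once per file appended -- this is what drives the progress reporting.
        notifyProgress(entry.name);                              // illustrative helper (e.g. websocket push)
      });
      archive.on('error', (err) => res.destroy(err));

      archive.pipe(res);                                         // stream straight to the client
      for (const file of listImageFiles()) {                     // illustrative helper
        archive.file(file.path, { name: file.name });
      }
      archive.finalize();
    });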
As for the 7zip solution: we tend not to like reading the entrails of a standard output stream from a spawned child process.
It is messy to find strings in the stream;
it causes context switches to read the stream;
and you end up with a brittle solution trying to handle whatever the output stream emits (e.g. in the case of 7zip the counter sometimes jumps by 30%, sometimes by 1%), among other sources of brittleness.
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
That appears to be a false assumption.
A command line like this will print progress to stdout as each new file is added to the archive:
7z a -bsp1 -bb3 test.7z *
So you can launch that from node.js using the child_process module, and you should be able to capture the stdout progress as it happens. You will need to use spawn, not exec, so that you get the stdout data live as it arrives.
Running this as a child process will keep your nodejs process free to serve other requests and will allow the child process to manage its own memory, independent of nodejs.
The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.
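A minimal sketch of that child-process approach (the file list is illustrative; the flags are the ones shown above):

    // Run 7zip as a child process and watch its per-file output on stdout.
    // The parsing here is deliberately naive; 7zip's text output is not a stable API.
    const { spawn } = require('child_process');

    const files = ['img001.png', 'img002.png'];   // illustrative file list
    const child = spawn('7z', ['a', '-bsp1', '-bb3', 'images.7z', ...files]);

    child.stdout.on('data', (chunk) => {
      // With -bb3, 7z prints a line per file as it is added; with -bsp1 it prints
      // percentage progress. Both arrive here incrementally while the archive is built.
      process.stdout.write(chunk);                // or parse and forward to the client
    });

    child.on('close', (code) => console.log('7z exited with code', code));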
I need to concurrently process a large number of files (thousands of different files, with an average size of 2 MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- following its reading of a file from the 'incoming' folder on the 1.5 TB hard drive -- will process the information and then output the processed information back to the 'processed' folder on the 1.5 TB drive. The processed information for every file is of roughly the same average size as the input files (about 2 MB per file).
Are there any dos and don'ts when building such an operation? Is it a problem to have 30 or so machines reading (or writing) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, several seconds per file) you'll pass a point at which adding more processes only slows matters to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
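A sketch of the "work locally, write back once" pattern suggested above (paths and the calculation step are illustrative; shown in Node for consistency with the rest of this page, but the idea is language-agnostic):

    const fs = require('fs');
    const os = require('os');
    const path = require('path');

    function processOne(name) {
      const remoteIn  = path.join('/mnt/nas/incoming', name);    // illustrative mount point
      const localIn   = path.join(os.tmpdir(), name);
      const localOut  = localIn + '.out';
      const remoteOut = path.join('/mnt/nas/processed', name + '.out');

      fs.copyFileSync(remoteIn, localIn);     // one sequential read from the shared drive
      runCalculation(localIn, localOut);      // hypothetical CPU-bound step; all its I/O is local
      fs.copyFileSync(localOut, remoteOut);   // one sequential write back to the shared drive
      fs.unlinkSync(localIn);
      fs.unlinkSync(localOut);
    }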
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?
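A minimal sketch of the database "claim" idea mentioned above (the mysql2 package, table name, and columns are all assumptions for illustration):

    const mysql = require('mysql2/promise');

    async function claimTask(pool, workerId) {
      // The single UPDATE is atomic: only one worker can flip claimed_by from NULL,
      // so two workers can never grab the same row. The schema is illustrative.
      const [result] = await pool.execute(
        'UPDATE tasks SET claimed_by = ? WHERE claimed_by IS NULL ORDER BY id LIMIT 1',
        [workerId]
      );
      if (result.affectedRows === 0) return null;   // nothing left to claim

      const [rows] = await pool.execute(
        'SELECT * FROM tasks WHERE claimed_by = ? AND finished = 0 ORDER BY id LIMIT 1',
        [workerId]
      );
      return rows[0] || null;
    }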
I wish to understand how the kernel works when a user/app tries to create a file in a directory.
The background - We have a Java application which consumes messages over JMS, processes them, and then writes the XML to an outbound queue plus a local directory. Yesterday we observed unusual delays in writing to the directory. With 'ls | wc -l' we found >300,000 files in there. A quick strace on the process showed it full of mutex calls (more than 3/4 of the calls in the strace were mutex).
So I thought that new file creation was taking time because the system has to check certain things each time (e.g. file names, to make sure that a new file with a specific name can be created) amongst 300,000 files before it can create a file.
I cleared the directory and the application resumed normal service levels.
My questions
Was my analysis correct? (It seems so, because the app started working fine after the clear-down.)
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J
Please read about the Linux Filesystem, i-nodes and d-nodes.
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
The rule of thumb is to keep the directory under 12 direct blocks, depending on your block size. Many systems use 8K blocks, so a fast directory is under 12 × 8,192 = 98,304 bytes.
A directory entry is something like 16*4 = 64 bytes in size (IIRC), so plan on roughly 98,304 / 64 ≈ 1,500 files per directory as a practical upper limit.
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
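A sketch of one common way to build that hierarchy, fanning files out by a hash prefix so no single directory grows huge (the two-level layout is illustrative, and although this is shown in Node for consistency with the rest of this page, the same idea applies in Java):

    const crypto = require('crypto');
    const path = require('path');

    function shardedPath(baseDir, fileName) {
      const hash = crypto.createHash('md5').update(fileName).digest('hex');
      // e.g. "outbound/a3/f9/message-1234.xml" -- 256 x 256 buckets keeps each directory small
      return path.join(baseDir, hash.slice(0, 2), hash.slice(2, 4), fileName);
    }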
Mutex system calls are a result of the application (probably something in the JVM or the Java libraries) making mutex calls.
You will not see synchronisation internal to the kernel via strace, as strace only examines the system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.