My mental model of buffered streaming in Node.js is that it revolves around a fixed-size queue. On one end, a producer may push arbitrary-sized byte chunks into the queue as fast as it can, until the queue is full, at which point the producer blocks until there is room again. On the other end, a consumer may pull bytes from the queue as fast as it can, until the queue is empty, at which point pulling blocks until the producer has put enough bytes back into the queue.
Let's say I have an enormous JSON file I want to convert to a CSV. There are two big advantages I can think of to streaming over a naïve approach that loads the entire JSON file into memory, processes it into CSV lines, and writes the CSV file to disk all at once. For one, the amount of memory used when processing a large amount of data can be controlled by adjusting the queue size. I never need to have the full JSON file or CSV output in memory at once, yet I can process the whole thing. The second advantage is that the pipeline as a whole runs at the speed of the slower of the two ends. In the naïve approach, I can't start writing anything to the CSV until I have processed the entire JSON file, meaning the execution time is the sum of the time it takes to do the whole read and the whole write. With streaming, I can be writing while reading and processing, meaning the execution time should be governed by the slower end alone.
This last advantage, though, doesn't seem compatible with a single-threaded runtime like Node.js. With only one thread, it seems like it wouldn't be possible for both the producer and the consumer in a stream to work truly in parallel. It seems like they would still be blocking each other, and the speed of the stream would be no better than the naïve approach above, maybe even worse if there is a lot of overhead from pushing and pulling to the stream queue. Is this the case in Node.js? Is streaming actually slower than pulling the entire file into memory, processing it, then outputting it? If not, how does streaming improve on the naïve approach in Node? Is there any official documentation that explains how this works?
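To make this concrete, here's roughly the kind of pipeline I have in mind, simplified by assuming the input is newline-delimited JSON (one record per line) rather than a single giant JSON document, and that each record has made-up id and name fields; the file names are placeholders too:

const fs = require('fs');
const readline = require('readline');

async function convert() {
  const input = fs.createReadStream('huge.ndjson');   // placeholder input
  const output = fs.createWriteStream('out.csv');     // placeholder output
  const rl = readline.createInterface({ input, crlfDelay: Infinity });

  for await (const line of rl) {
    if (!line.trim()) continue;
    const record = JSON.parse(line);
    const row = `${record.id},${record.name}\n`;
    // Respect back-pressure: if the write buffer is full, wait for 'drain'
    // before pulling more lines, so memory use stays bounded.
    if (!output.write(row)) {
      await new Promise((resolve) => output.once('drain', resolve));
    }
  }
  output.end();
}

convert().catch(console.error);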
Related
I am using the createReadStream API to read huge data from a file.
Sometimes it takes 7 seconds to read, while sometimes it takes 30 seconds.
I would like to understand why reading the same file with the same data takes more time in some instances and less in others. Why isn't the time fixed, given that I am reading the same file with the same data?
Here are some possible reasons:
1. Disk caching. The OS has a disk cache and uses it when it knows it is safe to do so. This will often make the first read of some data slower (because it's being read directly from the disk) and later reads of the same data faster (if the OS decides the data can be safely cached, it is not too large for the cache, and it stays there).
2. Node.js event loop unpredictability. Reading a large set of data will necessarily require reading many blocks of the file, and each of those read operations completes through the Node.js event loop. If other events are also being inserted into the event loop, the disk-read-related events may sometimes have to wait their turn.
3. Garbage collector unpredictability. If you're dealing with large amounts of data (even if not all of it is in memory at once), you may be creating lots of objects in the Node.js heap, many of which need to be garbage collected. Eventually the garbage collector will have to run and may introduce a bit of a pause in the execution of your code. If this happens multiple times during an operation, the delay can become noticeable.
4. Disk busyness variability. A disk read/write head (assuming this is a spinning disk and the OS is reading from the actual disk) can only be on one track at a time. If it's busy reading something else that the OS asked it to read, your request may have to wait for some prior requests to finish. This wouldn't typically add up to many seconds, but it can lead to some variability. As an example of a worst case, the OS could be running a defrag operation on your hard drive which your disk operations would have to interleave with.
5. OS/CPU busyness. If the OS or CPU is busy doing something else, your app may not be getting full cycles to run.
6. Node.js threadpool busy. Node.js uses a threadpool with a default size of 4 for disk operations. If you happen to have multiple disk operations (or other operations that use the thread pool) in flight at the same time and max out the threadpool, then your operation may have to wait for a previous operation to finish before it is allocated a thread to run in. The size of the threadpool is customizable (see the snippet after this list), but making it larger than the number of actual CPU cores you have is probably not helpful.
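As a rough illustration of that last point: the pool size is controlled by the UV_THREADPOOL_SIZE environment variable, which has to be set before the pool is first used. The file name and the value 8 below are just placeholders.

// Sketch: enlarge libuv's threadpool before any fs work is scheduled.
// libuv reads this value when the pool is first used, so set it early.
process.env.UV_THREADPOOL_SIZE = '8';   // assumption: 8 cores available

const fs = require('fs');
fs.readFile('huge-file.dat', (err, data) => {
  if (err) throw err;
  // ... process data ...
});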
I'm thinking about implementing a video converter using Node.js with ffmpeg, but since it's a CPU-intensive task, it might block Express from handling other requests. I've found a couple of articles about this; some of them use worker threads while others use queues like Agendajs or Bull.
Which one is more suitable for my use case? The video converter doesn't have to respond with the actual video, all it has to do is just convert it and then upload it into an S3 bucket for later retrieval.
Two sub-problems, here:
First problem is keeping your interface responsive during the conversion. If the conversion may take a long time, and you have no good way of splitting it into small chunks (such that you can service requests in between), then you will need to handle it asynchronously, indeed.
So you'll probably want to create at least one worker thread to work in parallel with the main thread.
The second problem is - presumably - making the conversion run fast. Since - as you write - it's a CPU intensive task, it may profit from additional worker threads. This could mean:
2a. several threads working on a single (queued) conversion task, simultaneously
2b. several threads each working on separate conversion tasks at the same time
2c. a mix of both.
The good news is that you really won't have to worry about most of this yourself, because a) ffmpeg is already using multithreading where possible (this depends on the codec in use!), providing you with a ready-made solution for 2a. And b) node-fluent-ffmpeg (or node-ffmpeg) is already designed to call ffmpeg asynchronously, thus solving problem 1.
The only remaining question, then, is whether you want to run only one ffmpeg job at a time (queued), or start conversions as soon as they are requested (2b / 2c). The latter is going to be easier to implement. However, it could get you in trouble if a lot of jobs are running simultaneously: at the very least, each conversion job will buffer some input and some output data, which could get you into memory trouble.
This is where a queue comes into the picture. You'll want to put jobs in a simple queue and start them so that no more than n are running concurrently (a rough sketch follows). The optimal n will not necessarily be 1, but it is unlikely to be larger than 4 or so (again, because each single conversion is already making use of parallelism). You'll have to experiment with that a bit, always keeping in mind that the answer may differ from codec to codec.
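Purely as a sketch of that idea, using plain child_process instead of node-fluent-ffmpeg; the ffmpeg arguments, the limit of 2, and the S3 step are placeholders to experiment with:

const { spawn } = require('child_process');

const MAX_CONCURRENT = 2;   // assumption: tune per codec and machine
const pending = [];         // simple FIFO of waiting jobs
let running = 0;

function enqueue(inputPath, outputPath) {
  pending.push({ inputPath, outputPath });
  maybeStartNext();
}

function maybeStartNext() {
  if (running >= MAX_CONCURRENT || pending.length === 0) return;
  const { inputPath, outputPath } = pending.shift();
  running += 1;

  // Placeholder ffmpeg invocation; real arguments depend on the target codec.
  const job = spawn('ffmpeg', ['-i', inputPath, '-y', outputPath]);

  job.on('close', (code) => {
    running -= 1;
    if (code === 0) {
      // upload outputPath to S3 here, then clean up
    }
    maybeStartNext();
  });
}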
I have two processes, one of which is writing (appending) to a file, the other is reading from it. Both processes are running concurrently, but do not communicate. Another reader process may start before the writer process has finished.
This approach works, but read() often returns having read zero bytes, with no error. The ratio of zero-length reads to non-zero-length reads is high, which is inefficient.
Is there any way around this? This is on POSIX filesystems.
Without a communication channel, there's no guaranteed method to prevent zero-byte reads or even long periods of hanging without reading any data when reading a file that is actively being written. The Linux implementation of tail uses inotify to effectively create a communication channel and obtain information about the file write activity.
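If it helps to see the shape of that approach, here is a small sketch using Node's fs.watch (which is backed by inotify on Linux), written in Node only because the rest of this page is; the path and the chunk handling are placeholders:

const fs = require('fs');

const path = 'shared.log';              // placeholder path
const fd = fs.openSync(path, 'r');
let pos = 0;

function drain() {
  const buf = Buffer.alloc(64 * 1024);
  let n;
  // Read everything appended since the last position we saw.
  while ((n = fs.readSync(fd, buf, 0, buf.length, pos)) > 0) {
    pos += n;
    process.stdout.write(Buffer.from(buf.subarray(0, n)));  // copy out the bytes
  }
}

drain();                                // catch up with what's already there
fs.watch(path, () => drain());          // then read again whenever it changes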
It's an interesting enough problem that IBM has even published a Redbook describing an implementation that was able to do such "read-behind-write" at about 15 GB/sec:
Read-behind-write is a technique used by some high-end customers to lower latency and improve performance. The read-behind-write technique means that once the writer starts to write, the reader will immediately trail behind to read; the idea is to overlap the write time with read time. This concept is beneficial on machines with slow I/O performance. For a high I/O throughput machine such as pSeries 690, it may be worth considering first writing the entire file out in parallel and then reading the data back in parallel.

There are many ways that read-behind-write can be implemented. In the scheme implemented by Xdd, after the writer writes one record, it will wait for the reader to read that record before the writer can proceed. Although this scheme keeps the writer and reader in sync just one record apart, it takes system time to do the locking and synchronization between writer and reader.

If one does not care about how many records that a reader lags behind the writer, then one can implement a scheme for the writer to stream down the writes as fast as possible. The writer can update a global variable after a certain number of records are written. The reader can then pull the global variable to find out how many records it has to read.
Without a communications channel, you're pretty much left having to keep trying, perhaps calling sleep() or something similar after a number of zero-byte read() results.
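A rough sketch of that keep-trying loop, again in Node for consistency with the rest of this page; the poll interval, buffer size, and file name are arbitrary:

const fs = require('fs');

async function follow(path, onChunk) {
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(64 * 1024);
  let pos = 0;

  for (;;) {
    const n = fs.readSync(fd, buf, 0, buf.length, pos);
    if (n > 0) {
      pos += n;
      onChunk(Buffer.from(buf.subarray(0, n)));   // copy out the bytes we read
    } else {
      // Zero bytes read: the writer hasn't appended anything new yet,
      // so back off for a while instead of spinning.
      await new Promise((resolve) => setTimeout(resolve, 200));
    }
  }
}

follow('shared.log', (chunk) => process.stdout.write(chunk));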
I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
A very simple approach, although no results are guaranteed: open as many of the files at once as you can and read all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you are reading and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
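As one illustration of "read all of them at once", here is roughly what keeping many reads in flight could look like in Node (the question is language-agnostic). The concurrency value is a guess to tune, and note that Node's fs calls go through libuv's threadpool, so UV_THREADPOOL_SIZE caps how many reads are truly in flight at the OS level:

const fs = require('fs/promises');

const CONCURRENCY = 32;   // assumption: tune against the actual disk

async function readAll(paths, onFile) {
  let next = 0;

  async function worker() {
    while (next < paths.length) {
      const i = next++;                     // single-threaded JS: no race here
      const data = await fs.readFile(paths[i]);
      onFile(paths[i], data);               // process the ~500 KB of records
    }
  }

  // Many reads queued at once lets the OS I/O scheduler reorder them.
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
}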
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like ext2fsprogs or the kernel FS driver. And it involves reading the physical device underlying a mounted filesystem, which can be writing to it at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
Could you recommend using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that work on a number of files (which can be a single file). Then an idle thread can complete a single job. This might help overlap I/O operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache. (A Unix/Linux system uses all "free" memory as a disk buffer.)
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1: fetch filename from worklist
   if empty goto 2
   (maybe) fork a worker process or thread
   add to prefetch queue
   add to internal queue
   if fewer than XXX items on internal queue goto 1
2: fetch filename from internal queue
   process it
   goto 1
For the slave processes:
fetch from queue
if empty: quit
prefetch file
loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of dd if=name bs=500K, which could be wrapped into a popen() or a pipe+fork().
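Purely as an illustration of that simplified scheme (in Node rather than C, for consistency with the rest of this page), spawning the dd readers and keeping the main task roughly ten files behind might look like this; the names and numbers are placeholders:

const { spawn } = require('child_process');
const fs = require('fs');

const PREFETCH_AHEAD = 10;   // how many files the main task stays behind

// Slave equivalent: read the file and discard the bytes, purely to prime
// the OS disk cache before the main task gets to it.
function prefetch(name) {
  return new Promise((resolve) => {
    spawn('dd', [`if=${name}`, 'of=/dev/null', 'bs=500K']).on('close', resolve);
  });
}

async function run(worklist, processFile) {
  const queue = [];   // internal queue of { name, done } pairs

  for (const name of worklist) {
    queue.push({ name, done: prefetch(name) });

    // Main task: stay PREFETCH_AHEAD files behind the prefetchers.
    if (queue.length > PREFETCH_AHEAD) {
      const job = queue.shift();
      await job.done;
      processFile(job.name, fs.readFileSync(job.name));   // now served from cache
    }
  }
  for (const job of queue) {
    await job.done;
    processFile(job.name, fs.readFileSync(job.name));
  }
}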