Some kinds of I/O operate at exact frequencies. Under extreme latency requirements, it would be useful to know how much time there is until the next piece of data is due to arrive on a certain fd.
For example, consider a stream processor which must output data to hardware at some predetermined points in time. Suppose the stream's content depends on some input. In order to reduce latency from input to output, the stream processor should wait for input for as long as possible before rendering the next piece of data. In order to do that, though, the processor needs to know how much time is left before the data is required.
Are there extensions to the standard unix I/O library (unistd.h, read(), write(), file descriptors, etc.) that allow data streams to operate in a mode where you can determine the time until the next I/O operation? Is there a word for this kind of I/O extension?
You need to be more precise about your question. Technically speaking, you need to start a timer once one output is debugged and then stop it after a second output is debugged
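If the processor itself knows the output period, the "wait for input as long as possible" part can be sketched with an ordinary select() timeout. This does not answer whether an fd can report when its next piece of data is due; it assumes the deadline is computed by the program from a monotonic clock. The function name collect_until and its parameters are made up for illustration.

```python
import os
import select
import time


def collect_until(fd, deadline, chunk_size=4096):
    """Gather input from fd until the monotonic-clock deadline passes.

    `deadline` is an absolute time.monotonic() value, assumed to be the
    moment at which rendering of the next output must begin.
    """
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # out of time: render with whatever has arrived
        readable, _, _ = select.select([fd], [], [], remaining)
        if not readable:
            continue  # timed out; the loop re-checks the deadline
        chunk = os.read(fd, chunk_size)
        if not chunk:
            break  # end of file: the input side was closed
        chunks.append(chunk)
    return b"".join(chunks)
```

In practice the deadline would be set somewhat earlier than the hardware's due time, to leave room for rendering the data.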
I have two processes, one of which is writing (appending) to a file, the other is reading from it. Both processes are running concurrently, but do not communicate. Another reader process may start before the writer process has finished.
This approach works, but read() often returns having read zero bytes with no error. The ratio of zero-length reads to non-zero-length reads is high, which is inefficient.
Is there any way around this? This is on POSIX filesystems.
Without a communication channel, there's no guaranteed method to prevent zero-byte reads or even long periods of hanging without reading any data when reading a file that is actively being written. The Linux implementation of tail uses inotify to effectively create a communication channel and obtain information about the file write activity.
It's an interesting enough problem that IBM has even published a Redbook describing an implementation that was able to do such "read-behind-write" at about 15 GB/sec:
Read-behind-write is a technique used by some high-end customers to
lower latency and improve performance. The read-behind-write technique
means that once the writer starts to write, the reader will
immediately trail behind to read; the idea is to overlap the write
time with read time. This concept is beneficial on machines with slow
I/O performance. For a high I/O throughput machine such as pSeries
690, it may be worth considering first writing the entire file out in
parallel and then reading the data back in parallel.
There are many ways that read-behind-write can be implemented. In the
scheme implemented by Xdd, after the writer writes one record, it will
wait for the reader to read that record before the writer can proceed.
Although this scheme keeps the writer and reader in sync just one
record apart, it takes system time to do the locking and
synchronization between writer and reader.
If one does not care about how many records that a reader lags behind
the writer, then one can implement a scheme for the writer to stream
down the writes as fast as possible. The writer can update a global
variable after a certain number of records are written. The reader can
then pull the global variable to find out how many records it has to
read.
Without a communications channel, you're pretty much left having to keep trying, perhaps calling sleep() or something similar after a number of zero-byte read() results.
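A minimal sketch of that fallback, assuming a plain retry loop with a growing sleep after each zero-byte read. The name follow and all the timing parameters are hypothetical, and the give-up timeout is an arbitrary choice: without a communication channel the reader cannot actually know that the writer has finished.

```python
import os
import time


def follow(fd, chunk_size=65536, first_sleep=0.001, max_sleep=0.05,
           give_up_after=0.2):
    """Yield data appended to the file behind fd.

    After a zero-byte read the reader sleeps, doubling the sleep up to
    max_sleep, and stops once it has slept give_up_after seconds in a row.
    """
    sleep_for = first_sleep
    idle = 0.0
    while idle < give_up_after:
        chunk = os.read(fd, chunk_size)
        if chunk:
            # Data arrived: reset the backoff and hand the chunk over.
            sleep_for = first_sleep
            idle = 0.0
            yield chunk
        else:
            # Zero-byte read: the writer has not appended anything yet.
            time.sleep(sleep_for)
            idle += sleep_for
            sleep_for = min(sleep_for * 2, max_sleep)
```

The backoff lowers the ratio of zero-length reads at the cost of some added latency when new data does arrive.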
I was going through topics of Operating Systems using the text book by Galvin (the 9th edition). In Chapter 4 on multi-threading, I came across problem 14 which is as follows:
A system with two dual-core processors has four processors available for scheduling. A CPU -intensive application is running on this system. All input is performed at program start-up, when a single file must be opened. Similarly, all output is performed just before the program terminates, when the program results must be written to a single file. Between startup and termination, the program is entirely CPU - bound. Your task is to improve the performance of this application by multithreading it. The application runs on a system that uses the one-to-one threading model (each user thread maps to a kernel thread).
• How many threads will you create to perform the input and output? Explain.
• How many threads will you create for the CPU -intensive portion of the application? Explain.
For the first part, I think we could create 4 threads for taking input for reading from a file as well as for writing output to a file. This is because during either input or output, there is no updating of the data being carried out.
For the second part, the nature of operation to be carried out on data is not known, for example, whether (1) average of the data is to be printed or (2) a function to print the average of first and last data points, then print average of second and second last data points, and so on.
Therefore, for second part, one thread could be employed to handle the operation.
But I am not very sure of the answer I gave here being right. So, I would be very grateful if you could let me know the right answer for this.
The question is testing if you understand some principles about parallelizing work to increase speed. Some of these principles are:
In the usual case, reading and writing a single file cannot be sped up using multiple cores. The speed of file I/O is determined by the properties of where and how the file is stored. Throwing more threads at it is not going to help, because those threads are just going to be waiting for the I/O to complete.
How many threads you use for CPU intensive portion depends entirely on what is being computed. If the program is generating imagery for a movie, use 4 threads because that is completely parallel. If the workload is entirely serial, use 1 thread because adding more threads won't help (by definition).
Computing the averages in your example is almost completely parallel, so you should use four threads, not one.
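To see why, here is a rough sketch of how both example computations split into independent pieces for four workers. The function names are made up for illustration. Note the assumption: in CPython, threads do not run CPU-bound Python code in parallel, so this only shows the decomposition; a real speedup would need a process pool or a language with truly parallel threads.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_mean(data, workers=4):
    """Average of data, computed from one partial sum per worker."""
    if not data:
        raise ValueError("data must not be empty")
    size = (len(data) + workers - 1) // workers  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Only this final combination step is serial.
    return sum(partial_sums) / len(data)


def pair_averages(data, workers=4):
    """Average of first and last, second and second-to-last, and so on."""
    def one_pair(i):
        return (data[i] + data[-1 - i]) / 2

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pair is independent of every other pair.
        return list(pool.map(one_pair, range(len(data) // 2)))
```

In the first case only the final combination of four partial sums is serial; in the second case no pair depends on any other, so the work divides evenly.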
I have a simple task that is easily parallelizable. Basically, the same operation must be performed repeatedly on each line of a (large, several GB) input file. While I've made a multithreaded version of this, I noticed my I/O was the bottleneck. I decided to build a utility class that involves a single "file reader" thread that simply goes and reads straight ahead as fast as it can into a circular buffer. Then, multiple consumers can call this class and get their 'next line'. Given n threads, each thread i's starting line is line i in the file, and each subsequent line for that thread is found by adding n. It turns out that locks are not needed for this; a couple of key atomic ops are enough to preserve invariants.
I've tested the code and it seems faster, but upon second thought, I'm not sure why. Wouldn't it be just as fast to divide the large file into n input files ( you can 'seek' ahead into the same file to achieve the same thing, minimal preprocessing ), and then have each process simply call iostream::readLine on its own chunk? ( since iostream reads into its own buffer as well ). It doesn't seem that sharing a single buffer amongst multiple threads has any inherent advantage, since the workers are not actually operating on the same lines of data. Plus, there's no good way I don't think to parallelize so that they do work on the same lines. I just want to understand the performance gain I'm seeing, and know whether it is 'flukey' or scalable/reproducible across platforms...
When you are I/O limited, you can get a good speedup by using two threads: one reading the file, the second doing the processing. This way the reading never waits for processing (except for the very last line), and you will be reading 100% of the time.
The buffer should be large enough to give the consumer thread enough work in one go, which most often means it should consist of multiple lines (I would recommend at least 4000 characters, but probably even more). This keeps the cost of thread context switches from becoming impractically high.
Single threaded:
read 1
process 1
read 2
process 2
read 3
process 3
Double threaded:
read 1
process 1/read 2
process 2/read 3
process 3
On some platforms you can get the same speedup also without threads, using overlapped I/O, but using threads can be often clearer.
Using more than one consumer thread will bring no benefit as long as you are really I/O bound.
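A minimal sketch of the two-thread arrangement, assuming a bounded queue of line batches stands in for the buffer. The names read_batches and process_stream and the default sizes are hypothetical.

```python
import queue
import threading


def read_batches(lines, out_q, batch_size):
    """Reader thread: group lines into batches and queue them."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            out_q.put(batch)  # blocks while the consumer is far behind
            batch = []
    if batch:
        out_q.put(batch)
    out_q.put(None)  # sentinel: no more input


def process_stream(lines, handle_line, batch_size=256, max_batches=8):
    """Consume batches in the calling thread while the reader reads ahead."""
    q = queue.Queue(maxsize=max_batches)
    reader = threading.Thread(target=read_batches,
                              args=(lines, q, batch_size))
    reader.start()
    results = []
    while True:
        batch = q.get()
        if batch is None:
            break
        results.extend(handle_line(line) for line in batch)
    reader.join()
    return results
```

Passing an open file object as lines gives the read 1, process 1/read 2 overlap from the diagram above, one batch at a time; batching keeps the number of hand-offs between the threads low.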
In your case, there are at least two resources that your program competes for, the CPU and the harddisk. In a single-threaded approach, you request data then wait with an idle CPU for the HD to deliver it. Then, you handle the data, while the HD is idle. This is bad, because one of the two resources is always idle. This changes a bit if you have multiple CPUs or multiple HDs. Also, in some cases the memory bandwidth (i.e. the RAM connection) is also a limiting resource.
Now, your solution is right, you use one thread to keep the HD busy. If this thread blocks waiting for the HD, the OS just switches to a different thread that handles some data. If it doesn't have any data, it will wait for some. That way, CPU and HD will work in parallel, at least some of the time, increasing the overall throughput. Note that you can't increase the throughput with more than two threads, unless you also have multiple CPUs and the CPU is the limiting factor and not the HD. If you are writing back some data, too, you could improve performance with a third thread that writes to a second harddisk. Otherwise, you don't get any advantage from more threads.
I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
A very simple approach, although no results guaranteed. Open as many of the files at once as you can and read all of them at once - either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
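A rough sketch of the threaded variant, assuming a thread pool is an acceptable way to keep many reads in flight at once. The names read_records and read_all and the worker count are made up; RECORD_SIZE is the 4KB record size from the question.

```python
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 4096  # fixed record size stated in the question


def read_records(path):
    """Read one whole file and split it into fixed-size records."""
    with open(path, "rb") as f:
        data = f.read()
    return [data[i:i + RECORD_SIZE]
            for i in range(0, len(data), RECORD_SIZE)]


def read_all(paths, max_workers=32):
    """Yield every record of every file, with many reads in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits all files up front, so the kernel sees up to
        # max_workers outstanding reads and can order them itself.
        for records in pool.map(read_records, paths):
            yield from records
```

One caveat: finished files are held in memory until they are consumed, so with 10,000 files of ~500KB a slow consumer could end up buffering a lot of data. Whether the seeks really go down depends on the I/O scheduler, as the answer says.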
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. And this involves reading the physical device underlying a mounted filesystem, which can be writing to it at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
Could you use an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since operations are similar and data are independent you can try using a thread pool to submit jobs that work on a number of files (can be a single file). Then you can have an idle thread complete a single job. This might help overlapping IO operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files, and prime the disk buffer cache. ( a unix/linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
   if empty goto 2
   (maybe) fork a worker process or thread
   add to prefetch queue
   add to internal queue
   if fewer than XXX items on internal queue goto 1
2. fetch filename from internal queue
   process it
   goto 1
For the slave processes:
fetch from queue
if empty: quit
prefetch file
loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads or processes.
As a simplification: if seeking the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
, which could be wrapped into a popen() or a pipe+fork().
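The prefetch idea can be sketched roughly as follows. This version swaps in a single thread and a bounded in-process queue for the forked worker and the message queue or pipe described above, and it reads and discards each file much like the dd command would. The names prefetch and process_with_prefetch and the block size are hypothetical.

```python
import queue
import threading


def prefetch(paths, ready_q, block_size=512 * 1024):
    """Read each file and throw the data away, to warm the buffer cache."""
    try:
        for path in paths:
            with open(path, "rb") as f:
                while f.read(block_size):
                    pass  # the data itself is not needed here
            ready_q.put(path)  # blocks once `lookahead` files are waiting
    finally:
        ready_q.put(None)  # sentinel, sent even if a file failed to open


def process_with_prefetch(paths, process_file, lookahead=10):
    """Process files in order while a prefetcher runs a few files ahead."""
    ready_q = queue.Queue(maxsize=lookahead)
    worker = threading.Thread(target=prefetch, args=(paths, ready_q),
                              daemon=True)
    worker.start()
    results = []
    while True:
        path = ready_q.get()
        if path is None:
            break
        results.append(process_file(path))
    worker.join()
    return results
```

The bounded queue is what keeps the two sides synchronised: the prefetcher can never get more than lookahead files ahead of the main task, so prefetched data is less likely to be evicted from the cache before it is used.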