zero length read from file - linux

I have two processes, one of which is writing (appending) to a file, the other is reading from it. Both processes are running concurrently, but do not communicate. Another reader process may start before the writer process has finished.
This approach works, but read() often returns having read zero bytes with no error. The ratio of zero-length reads to non-zero-length reads is high, which is inefficient.
Is there any way around this? This is on POSIX filesystems.

Without a communication channel, there's no guaranteed method to prevent zero-byte reads or even long periods of hanging without reading any data when reading a file that is actively being written. The Linux implementation of tail uses inotify to effectively create a communication channel and obtain information about the file write activity.
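For illustration, here is a minimal sketch of that inotify approach in C. It assumes Linux, and "logfile" is a placeholder path; the reader blocks on the inotify descriptor instead of spinning on zero-byte read()s:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int fd = open("logfile", O_RDONLY);   /* placeholder path */
    int ifd = inotify_init();
    if (fd < 0 || ifd < 0) {
        perror("open/inotify_init");
        return 1;
    }
    /* IN_MODIFY fires each time the writer appends to the file. */
    if (inotify_add_watch(ifd, "logfile", IN_MODIFY) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char data[4096];
    char evbuf[sizeof(struct inotify_event) + NAME_MAX + 1];
    for (;;) {
        ssize_t n = read(fd, data, sizeof data);
        if (n > 0) {
            /* ...consume n bytes of data... */
        } else if (n == 0) {
            /* At end of file: sleep in the kernel until the writer
             * touches the file, then try read() again. */
            if (read(ifd, evbuf, sizeof evbuf) < 0)
                break;
        } else {
            perror("read");
            break;
        }
    }
    close(ifd);
    close(fd);
    return 0;
}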
It's an interesting enough problem that IBM has published a Redbook describing an implementation that was able to do such "read-behind-write" at about 15 GB/sec:
Read-behind-write is a technique used by some high-end customers to lower latency and improve performance. The read-behind-write technique means that once the writer starts to write, the reader will immediately trail behind to read; the idea is to overlap the write time with read time. This concept is beneficial on machines with slow I/O performance. For a high I/O throughput machine such as pSeries 690, it may be worth considering first writing the entire file out in parallel and then reading the data back in parallel.
There are many ways that read-behind-write can be implemented. In the scheme implemented by Xdd, after the writer writes one record, it will wait for the reader to read that record before the writer can proceed. Although this scheme keeps the writer and reader in sync just one record apart, it takes system time to do the locking and synchronization between writer and reader.
If one does not care about how many records a reader lags behind the writer, then one can implement a scheme for the writer to stream down the writes as fast as possible. The writer can update a global variable after a certain number of records are written. The reader can then poll the global variable to find out how many records it has to read.
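One plausible way to realize that "global variable" scheme between two unrelated processes is a counter in POSIX shared memory. The sketch below is my own illustration, not the Redbook's code; the name "/reclog_count" is an assumption, error handling is trimmed, and older systems need -lrt:

#include <fcntl.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

/* Both processes map the same counter; the shm name is an assumption. */
static _Atomic long *map_counter(void)
{
    int fd = shm_open("/reclog_count", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(_Atomic long));
    return mmap(NULL, sizeof(_Atomic long), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

/* Writer: after record i is safely written, publish i + 1. */
void publish_progress(_Atomic long *count, long records_written)
{
    atomic_store(count, records_written);
}

/* Reader: how many records may be read right now? */
long readable_records(_Atomic long *count)
{
    return atomic_load(count);
}

The reader then only calls read() when readable_records() exceeds the number of records it has already consumed, which eliminates most zero-byte reads.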
Without a communications channel, you're pretty much left having to keep trying, perhaps calling sleep() or something similar after a number of zero-byte read() results.
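A minimal sketch of that retry loop, with an exponential backoff capped at roughly half a second (the delays are arbitrary starting points, not tuned values):

#include <unistd.h>

/* Retry read() with capped exponential backoff: a busy writer is
 * followed closely, an idle file costs at most ~2 polls per second. */
ssize_t read_with_backoff(int fd, char *buf, size_t len)
{
    useconds_t delay = 1000;            /* start at 1 ms */
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n != 0)                     /* data, or a real error */
            return n;
        usleep(delay);                  /* EOF for now: back off */
        if (delay < 512000)
            delay *= 2;
    }
}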

Related

If I flush after every write, do no partial writes happen? [duplicate]


limit memory to load large file and look for a phone number.

There are one billion entries in a very large file; each entry contains a name and his/her phone number.
I need to design a system that takes a name as input and quickly finds his/her phone number. The tricky part is that memory cannot hold such a large file. I am thinking of streaming the data into memory, maybe with some multi-threading technique; however, I am not sure how to decide how many threads I should use.
I haven't found this type of question online and am not sure of the general rule for tackling it.
Could someone please give me some suggestions?
There are a couple of approaches I can think of:
Approach 1: Single file reader thread, multiple processing threads
- Read (stream) the file in a single thread
- Push each record into a queue as you read it
- Multiple threads monitor the queue for records; each pops a record from the queue and checks whether the name matches
- If the name matches, the thread puts the number in a result queue and signals the file reader and the other threads to stop further processing
- If the name doesn't match, it continues on to the next record from the queue
This will ensure you are not using too much memory, since records are popped from the queue by the processing threads. But the issue here is that I/O will most probably be slower than matching the input name, so you might not be able to saturate the processing threads (see the note below on how many threads to choose).
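As an illustration of Approach 1, here is a compact C/pthreads sketch. The "name,number" record format, queue depth, and thread count are all assumptions, and the early-stop signal mentioned above is omitted for brevity:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QDEPTH   1024        /* bounded queue keeps memory small */
#define NWORKERS 4           /* illustrative; see the note below */
#define RECLEN   256

static char queue[QDEPTH][RECLEN];
static int head, tail, done;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t nonfull  = PTHREAD_COND_INITIALIZER;
static const char *target;   /* the name being looked up */

static void *worker(void *arg)
{
    char rec[RECLEN];
    for (;;) {
        pthread_mutex_lock(&mu);
        while (head == tail && !done)
            pthread_cond_wait(&nonempty, &mu);
        if (head == tail && done) {
            pthread_mutex_unlock(&mu);
            return NULL;
        }
        strcpy(rec, queue[tail]);
        tail = (tail + 1) % QDEPTH;
        pthread_cond_signal(&nonfull);
        pthread_mutex_unlock(&mu);

        /* Records assumed to look like "name,number\n". */
        char *comma = strchr(rec, ',');
        if (comma && strlen(target) == (size_t)(comma - rec)
                  && strncmp(rec, target, comma - rec) == 0)
            printf("%s", comma + 1);   /* early-stop signal omitted */
    }
}

int main(int argc, char **argv)
{
    (void)argc;                            /* argument checks omitted */
    target = argv[1];
    FILE *f = fopen(argv[2], "r");
    pthread_t w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, worker, NULL);

    char line[RECLEN];
    while (fgets(line, sizeof line, f)) {  /* the single reader thread */
        pthread_mutex_lock(&mu);
        while ((head + 1) % QDEPTH == tail)
            pthread_cond_wait(&nonfull, &mu);
        strcpy(queue[head], line);
        head = (head + 1) % QDEPTH;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&mu);
    }
    pthread_mutex_lock(&mu);
    done = 1;                              /* wake and drain workers */
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&mu);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(w[i], NULL);
    fclose(f);
    return 0;
}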
Approach 2: Multiple file reader threads, multiple processing threads
- Read the file in multiple threads, dividing the file into parts so that each thread scans its own region
- The rest of the steps are similar to Approach 1
The tricky part here is how to divide the file parts amongst the multiple reader threads. Using a random access file could work, but will be messy.
A better way could be that instead of having one big file, you have multiple smaller files, each of which can easily be searched by its own thread. You could also look at first dividing the file into smaller chunks by the first letter of the first name, and then further into smaller chunks of, say, 10,000 records each.
This approach should give you much better parallelism.
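A sketch of how the byte ranges might be divided among reader threads using pread(), which needs no shared file offset; note that a real version must first re-align each range's start to a record boundary (e.g. the next newline), which is glossed over here:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

struct part { int fd; off_t start, end; };

/* Divide the file into nthreads equal byte ranges sharing one fd. */
void make_parts(const char *path, int nthreads, struct part *out)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    off_t chunk = st.st_size / nthreads;
    for (int i = 0; i < nthreads; i++) {
        out[i].fd = fd;
        out[i].start = (off_t)i * chunk;
        out[i].end = (i == nthreads - 1) ? st.st_size
                                         : (off_t)(i + 1) * chunk;
    }
}

/* Thread body: scan one byte range; pread() keeps threads from
 * interfering with each other's file position. */
void scan_part(struct part *p)
{
    char buf[1 << 16];
    for (off_t pos = p->start; pos < p->end; ) {
        ssize_t n = pread(p->fd, buf, sizeof buf, pos);
        if (n <= 0)
            break;
        /* ...match records in buf[0..n); clamp at p->end... */
        pos += n;
    }
}

Each scan_part() call would run in its own thread, e.g. via pthread_create().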
How many threads should you use
My recommendation would be to read Java Concurrency in Practice, chapter 11. Even if you don't use Java, it is a great resource for answering this question:
Java Concurrency in Practice - Chapter 11

Does Akka.IO Tcp have a bottleneck in bidirectional communication?

EDIT: This is a duplicate of Does Akka Tcp support full-duplex communication? (Please don't ask the same question multiple times; the same goes for duplicating it on mailing lists. This wastes the time of those who volunteer their help and reduces your chances of getting answers in the future.)
I've modified the echo server from https://github.com/akka/akka/blob/master/akka-docs/rst/scala/code/docs/io/EchoServer.scala#L96
case Received(data) =>
  connection ! Write(data, Ack(currentOffset))
  log.debug("same {}", sender.eq(connection)) // true
  buffer(data)
That means incoming and outgoing messages are handled by the same actor, so a single worker thread (the one that dequeues messages from the actor's mailbox) will process both read and write operations. That looks like a potential bottleneck.
In the "classical" world I can create one thread to read from a socket and another to write to it, and get simultaneous communication.
Update:
Discussion in the akka-dev Google group: https://groups.google.com/forum/#!topic/akka-dev/mcs5eLKiAVQ
While there is a single Actor that either reads or writes at any given point in time, each of these operations takes very few cycles since it only occurs when there are data to be read or buffer space available to be written to. The system call overhead of ~1µs means that with the default buffer sizes of 128kiB you should be able to transfer up to 100GiB/s in total, which sure is a bottleneck but probably not today and in practice (this roughly coincides with typical CPU memory bandwidth, so more data rate is currently impossible anyway). Once this changes we can split the reading and writing responsibilities between different selectors and wake up different Actors, but before doing that we’ll need to verify that there actually is a measurable effect.
The other question that needs answering is which operating system kernels actually allow concurrent operations on a single socket from multiple threads. I have not researched this yet, but I would not be surprised to find that fully independent locking will be hard to do and there might not (yet) be a reason to expend that effort.
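For reference, the back-of-envelope arithmetic behind that figure (my own, not from the quoted post): moving a full 128 KiB buffer per ~1 µs system call gives

128 KiB / 1 µs = 131072 B / 10^-6 s ≈ 1.3 × 10^11 B/s ≈ 122 GiB/s

which matches the quoted ~100 GiB/s order of magnitude.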

single file reader/multiple consumer model: good idea for multithreaded program?

I have a simple task that is easily parallelizable. Basically, the same operation must be performed repeatedly on each line of a (large, several GB) input file. While I've made a multithreaded version of this, I noticed my I/O was the bottleneck. I decided to build a utility class that involves a single "file reader" thread that simply reads straight ahead as fast as it can into a circular buffer. Then, multiple consumers can call this class and get their 'next line'. Given n threads, each thread i's starting line is line i in the file, and each subsequent line for that thread is found by adding n. It turns out that locks are not needed for this; a couple of key atomic ops are enough to preserve invariants.
I've tested the code and it seems faster, but upon second thought, I'm not sure why. Wouldn't it be just as fast to divide the large file into n input files (you can 'seek' ahead into the same file to achieve the same thing, with minimal preprocessing), and then have each process simply call iostream::readLine on its own chunk (since iostream reads into its own buffer as well)? It doesn't seem that sharing a single buffer amongst multiple threads has any inherent advantage, since the workers are not actually operating on the same lines of data. Plus, I don't think there's a good way to parallelize so that they do work on the same lines. I just want to understand the performance gain I'm seeing, and to know whether it is 'flukey' or scalable/reproducible across platforms.
When you are I/O limited, you can get a good speedup by using two threads: one reading the file, the second doing the processing. This way the reading never waits for processing (except for the very last line) and you keep reading 100% of the time.
The buffer should be large enough to give the consumer thread enough work in one go, which most often means it should hold multiple lines (I would recommend at least 4000 characters, but probably even more). This prevents the cost of thread context switching from becoming impractically high.
Single threaded:
read 1
process 1
read 2
process 2
read 3
process 3
Double threaded:
read 1
process 1/read 2
process 2/read 3
process 3
On some platforms you can get the same speedup without threads, using overlapped I/O, but using threads can often be clearer.
Using more than one consumer thread will bring no benefit as long as you are really I/O bound.
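For concreteness, here is a hedged C/pthreads sketch of that two-thread pipeline using two buffers: the reader fills one slot while the consumer drains the other, matching the "double threaded" timeline above. The buffer size is an arbitrary choice, and stdin stands in for the large file:

#include <pthread.h>
#include <stdio.h>

#define BUFSZ (1 << 20)          /* 1 MiB per slot; arbitrary */

static char buf[2][BUFSZ];
static size_t len[2];
static int ready[2];             /* slot filled and not yet processed */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *reader(void *arg)
{
    FILE *f = arg;
    for (int slot = 0; ; slot ^= 1) {
        pthread_mutex_lock(&mu);
        while (ready[slot])      /* wait until the consumer drained it */
            pthread_cond_wait(&cv, &mu);
        pthread_mutex_unlock(&mu);

        size_t n = fread(buf[slot], 1, BUFSZ, f);  /* I/O without lock */

        pthread_mutex_lock(&mu);
        len[slot] = n;
        ready[slot] = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mu);
        if (n == 0)              /* EOF: a zero-length slot ends the run */
            return NULL;
    }
}

int main(void)
{
    FILE *f = stdin;             /* stand-in for the large input file */
    pthread_t t;
    pthread_create(&t, NULL, reader, f);

    for (int slot = 0; ; slot ^= 1) {
        pthread_mutex_lock(&mu);
        while (!ready[slot])
            pthread_cond_wait(&cv, &mu);
        pthread_mutex_unlock(&mu);

        if (len[slot] == 0)
            break;
        /* ...process len[slot] bytes of buf[slot] here... */

        pthread_mutex_lock(&mu);
        ready[slot] = 0;         /* hand the slot back to the reader */
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mu);
    }
    pthread_join(t, NULL);
    return 0;
}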
In your case, there are at least two resources that your program competes for: the CPU and the hard disk. In a single-threaded approach, you request data and then wait with an idle CPU for the disk to deliver it. Then you handle the data while the disk is idle. This is bad, because one of the two resources is always idle. This changes a bit if you have multiple CPUs or multiple disks. Also, in some cases memory bandwidth (i.e. the RAM connection) is a limiting resource as well.
Now, your solution is right: you use one thread to keep the disk busy. If this thread blocks waiting for the disk, the OS just switches to a different thread that handles some data. If that thread doesn't have any data, it waits for some. That way, CPU and disk work in parallel at least some of the time, increasing the overall throughput. Note that you can't increase the throughput with more than two threads, unless you also have multiple CPUs and the CPU is the limiting factor rather than the disk. If you are writing back some data too, you could improve performance with a third thread writing to a second disk. Otherwise, you don't get any advantage from more threads.

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
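A sketch of what that might look like with POSIX AIO (link with -lrt on older systems; the fixed file size is taken from the question and error handling is trimmed):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>

#define FILESZ (500 * 1024)          /* ~500 KB per file, per the question */

/* Queue a whole-file read without waiting for it. */
struct aiocb *enqueue_file(const char *path)
{
    struct aiocb *cb = calloc(1, sizeof *cb);
    cb->aio_fildes = open(path, O_RDONLY);
    cb->aio_buf = malloc(FILESZ);
    cb->aio_nbytes = FILESZ;
    cb->aio_offset = 0;
    aio_read(cb);                    /* submit; returns immediately */
    return cb;
}

/* Later: block until one queued read finishes; returns bytes read. */
ssize_t finish_file(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    return aio_return(cb);
}

The idea is to call enqueue_file() for a large batch of files up front, then finish_file() on each, letting the kernel complete the queued reads in whatever order minimizes seeks.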
A very simple approach, although results are not guaranteed: open as many of the files at once as you can and read from all of them at once, using either threads or asynchronous I/O. This way the disk scheduler knows what you want to read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like ext2fsprogs or the kernel FS driver. And it involves reading the physical device underlying a mounted filesystem, which the filesystem may be writing to at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
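For what it's worth, Linux does expose a non-portable, typically root-only hook for the per-block mapping, so extracting driver logic isn't always necessary. A minimal sketch using the FIBMAP ioctl (FIEMAP is the newer, richer interface); here only the first block is queried, which is usually enough to sort files into approximate disk order:

#include <fcntl.h>
#include <linux/fs.h>        /* FIBMAP */
#include <sys/ioctl.h>
#include <unistd.h>

/* Physical block number of the file's first block, or -1. Sorting
 * the files by this key approximates their on-disk order. */
long first_physical_block(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int block = 0;                       /* in: logical block index */
    long phys = ioctl(fd, FIBMAP, &block) < 0 ? -1 : block;
    close(fd);                           /* out: physical block     */
    return phys;
}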
May I recommend using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that each work on a number of files (or on a single file). Then you can have an idle thread complete a single job. This might help overlap I/O operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a Unix/Linux system uses all "free" memory as disk buffer cache).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
2. if empty goto 2
3. (maybe) fork a worker process or thread
4. add to prefetch queue
5. add to internal queue
6. if fewer than XXX items on internal queue goto 1
7. fetch filename from internal queue
8. process it
9. goto 1
For the slave processes:
1. fetch from queue
2. if empty: quit
3. prefetch file
4. loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
, which could be wrapped into a popen() or a pipe+fork().
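As a variation on the dd idea, the slave could instead ask the kernel directly to prime the cache. A sketch using Linux's readahead(2) (posix_fadvise() with POSIX_FADV_WILLNEED is the portable spelling of the same request):

#define _GNU_SOURCE          /* for readahead(2) */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Prime the page cache for one file; the main process's later
 * read()s should then hit cache instead of seeking the disk. */
void prefetch(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;
    struct stat st;
    if (fstat(fd, &st) == 0)
        readahead(fd, 0, st.st_size);    /* asynchronous readahead */
    close(fd);
}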
