I have two processes, one of which is writing (appending) to a file, the other is reading from it. Both processes are running concurrently, but do not communicate. Another reader process may start before the writer process has finished.
This approach works, but read() often returns having read zero bytes with no error. The ratio of zero-length reads to non-zero-length reads is high, which is inefficient.
Is there any way around this? This is on POSIX filesystems.
Without a communication channel, there's no guaranteed method to prevent zero-byte reads or even long periods of hanging without reading any data when reading a file that is actively being written. The Linux implementation of tail uses inotify to effectively create a communication channel and obtain information about the file write activity.
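This is not how tail is actually structured, but as a minimal sketch of the same idea (Linux-specific, error handling mostly omitted): block on inotify until the writer has modified the file, then go back and read the newly appended data instead of busy-polling read().

/* Sketch: block on inotify instead of busy-polling read(). Linux-specific;
   error handling is abbreviated. */
#include <sys/inotify.h>
#include <limits.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    int ifd = inotify_init();
    inotify_add_watch(ifd, argv[1], IN_MODIFY);

    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            fwrite(buf, 1, (size_t)n, stdout);           /* consume the new data */
        } else if (n == 0) {
            /* At EOF: sleep until the writer appends more, then retry the read. */
            char ev[sizeof(struct inotify_event) + NAME_MAX + 1];
            read(ifd, ev, sizeof ev);                     /* blocks until IN_MODIFY */
        } else {
            perror("read");
            return 1;
        }
    }
}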
It's an interesting enough problem that IBM has even published a Redbook describing an implementation that was able to do such "read-behind-write" at about 15 GB/sec:
Read-behind-write is a technique used by some high-end customers to lower latency and improve performance. The read-behind-write technique means that once the writer starts to write, the reader will immediately trail behind to read; the idea is to overlap the write time with read time. This concept is beneficial on machines with slow I/O performance. For a high I/O throughput machine such as pSeries 690, it may be worth considering first writing the entire file out in parallel and then reading the data back in parallel.

There are many ways that read-behind-write can be implemented. In the scheme implemented by Xdd, after the writer writes one record, it will wait for the reader to read that record before the writer can proceed. Although this scheme keeps the writer and reader in sync just one record apart, it takes system time to do the locking and synchronization between writer and reader.

If one does not care about how many records that a reader lags behind the writer, then one can implement a scheme for the writer to stream down the writes as fast as possible. The writer can update a global variable after a certain number of records are written. The reader can then pull the global variable to find out how many records it has to read.
Without a communications channel, you're pretty much left having to keep trying, perhaps calling sleep() or something similar after a number of zero-byte read() results.
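Such a retry loop might look roughly like the sketch below; the back-off thresholds and delays are arbitrary placeholders, not tuned values.

/* Sketch: keep retrying read() and back off after repeated zero-byte results.
   The thresholds and delays are arbitrary placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[65536];
    unsigned idle = 0;                           /* consecutive zero-byte reads */
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            idle = 0;
            fwrite(buf, 1, (size_t)n, stdout);   /* hand the data to the consumer */
        } else if (n == 0) {
            if (++idle < 10)
                usleep(1000);                    /* 1 ms: quick retries right after EOF */
            else
                usleep(100 * 1000);              /* 100 ms: writer seems quiet, back off */
        } else {
            perror("read");
            return 1;
        }
    }
}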
EDIT: This is a duplicate of Does Akka Tcp support full-duplex communication? (please don’t ask the same question multiple times, the same goes for duplicating on mailing lists, this wastes the time of those who volunteer their help, reducing your chances of getting answers in the future)
I've modified the EchoServer example from https://github.com/akka/akka/blob/master/akka-docs/rst/scala/code/docs/io/EchoServer.scala#L96:
case Received(data) =>
  connection ! Write(data, Ack(currentOffset))
  log.debug("same {}", sender.eq(connection)) // true
  buffer(data)
That means incoming and outgoing messages are handled by the same actor, so a single worker thread (the one that takes messages from the mailbox) will process both read and write operations. That looks like a potential bottleneck.
In the "classical" world I can create one thread to read from a socket and another to write to it, and get simultaneous communication.
Update
Discussion in the Google group: https://groups.google.com/forum/#!topic/akka-dev/mcs5eLKiAVQ
While there is a single Actor that either reads or writes at any given point in time, each of these operations takes very few cycles since it only occurs when there are data to be read or buffer space available to be written to. The system call overhead of ~1µs means that with the default buffer sizes of 128kiB you should be able to transfer up to 100GiB/s in total, which sure is a bottleneck but probably not today and in practice (this roughly coincides with typical CPU memory bandwidth, so more data rate is currently impossible anyway). Once this changes we can split the reading and writing responsibilities between different selectors and wake up different Actors, but before doing that we’ll need to verify that there actually is a measurable effect.
The other question that needs answering is which operating system kernels actually allow concurrent operations on a single socket from multiple threads. I have not researched this yet, but I would not be surprised to find that fully independent locking will be hard to do and there might not (yet) be a reason to expend that effort.
I have a simple task that is easily parallelizable. Basically, the same operation must be performed repeatedly on each line of a (large, several GB) input file. While I've made a multithreaded version of this, I noticed my I/O was the bottleneck. I decided to build a utility class around a single "file reader" thread that simply reads straight ahead as fast as it can into a circular buffer. Multiple consumers can then call this class and get their 'next line'. Given n threads, each thread i's starting line is line i in the file, and each subsequent line for that thread is found by adding n. It turns out that locks are not needed for this; a couple of key atomic ops are enough to preserve the invariants.
I've tested the code and it seems faster, but on second thought, I'm not sure why. Wouldn't it be just as fast to divide the large file into n input files (you can 'seek' ahead into the same file to achieve the same thing, with minimal preprocessing), and then have each process simply call iostream::readLine on its own chunk (since iostream reads into its own buffer as well)? It doesn't seem that sharing a single buffer amongst multiple threads has any inherent advantage, since the workers are not actually operating on the same lines of data. Plus, I don't think there's a good way to parallelize so that they do work on the same lines. I just want to understand the performance gain I'm seeing, and to know whether it is a fluke or scalable/reproducible across platforms...
When you are I/O limited, you can get a good speedup by using two threads: one reading the file, the second doing the processing. This way the reading never waits for the processing (except for the very last line) and the disk is kept reading 100% of the time.
The buffer should be large enough to give the consumer thread enough work in one go, which most often means it should contain multiple lines (I would recommend at least 4000 characters, but probably even more). This keeps the thread context-switching cost from becoming impractically high.
Single threaded:
read 1
process 1
read 2
process 2
read 3
process 3
Double threaded:
read 1
process 1/read 2
process 2/read 3
process 3
On some platforms you can get the same speedup without threads by using overlapped I/O, but using threads is often clearer.
Using more than one consumer thread will bring no benefit as long as you are really I/O bound.
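A minimal sketch of this two-thread split using POSIX threads and a single hand-off slot is below. process_chunk() is a hypothetical stand-in for your per-line work, and a real version would use a ring of several buffers so the reader never stalls on the consumer.

/* Sketch: one reader thread, one processing thread, single hand-off slot.
   A production version would use a ring of buffers so the reader never stalls. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define CHUNK (64 * 1024)                     /* bytes handed over at a time */

static char slot[CHUNK];
static size_t slot_len;                       /* 0 = empty */
static int done;                              /* reader reached EOF */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void process_chunk(const char *p, size_t n) { (void)p; (void)n; /* your per-line work */ }

static void *reader(void *arg)
{
    FILE *f = arg;
    char local[CHUNK];
    size_t n;
    while ((n = fread(local, 1, CHUNK, f)) > 0) {
        pthread_mutex_lock(&mtx);
        while (slot_len != 0)                 /* wait until the consumer emptied the slot */
            pthread_cond_wait(&cv, &mtx);
        memcpy(slot, local, n);
        slot_len = n;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    pthread_t t;
    pthread_create(&t, NULL, reader, f);

    char local[CHUNK];
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (slot_len == 0 && !done)
            pthread_cond_wait(&cv, &mtx);
        if (slot_len == 0 && done) { pthread_mutex_unlock(&mtx); break; }
        size_t n = slot_len;
        memcpy(local, slot, n);
        slot_len = 0;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
        process_chunk(local, n);              /* overlaps with the reader's next fread() */
    }
    pthread_join(t, NULL);
    fclose(f);
}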
In your case, there are at least two resources that your program competes for: the CPU and the hard disk. In a single-threaded approach, you request data and then wait with an idle CPU for the disk to deliver it. Then you handle the data, while the disk is idle. This is bad, because one of the two resources is always idle. This changes a bit if you have multiple CPUs or multiple disks. Also, in some cases the memory bandwidth (i.e. the RAM connection) is also a limiting resource.
Now, your solution is right: you use one thread to keep the disk busy. If this thread blocks waiting for the disk, the OS just switches to a different thread that handles some data. If it doesn't have any data, it will wait for some. That way, CPU and disk will work in parallel, at least some of the time, increasing the overall throughput. Note that you can't increase the throughput with more than two threads, unless you also have multiple CPUs and the CPU is the limiting factor rather than the disk. If you are writing back some data, too, you could improve performance with a third thread that writes to a second hard disk. Otherwise, you don't get any advantage from more threads.
Related question: How to use SQLite in a multi-threaded application.
I've been trying to get decent performance out of SQLite3 in a multi-threaded program. I've been very impressed with its performance except for write latency. That's not its fault; it has to wait for the disk to spin to commit the data. But having reads blocked during those writes, even when they could be served from cache, is pretty intolerable.
My use case involves a large number of small read operations to get one tiny object by an indexed field, but latency is important for these operations because there are a lot of them. Writes are large and are accumulated into a single transaction. I don't want reads to have huge latency due to completing writes.
I first just used a single connection with a mutex to protect it. However, while the writing thread is waiting for the transaction to complete, readers are blocked on disk I/O because they can't acquire the mutex until the writer releases it. I tried using multiple connections, but then I get SQLITE_LOCKED from sqlite3_step, which means having to redesign all the reading code.
My write logic currently looks like this:
Acquire connection mutex.
START TRANSACTION
Do all writes. (Typically 10 to 100 small ones.)
END TRANSACTION -- here's where it blocks
Release mutex.
Is there some solution I'm not aware of? Is there an easy way to keep my readers from having to wait for the disk to finish rotating if the entry is in cache without having to rewrite all my reading code to handle SQLITE_LOCKED, reset, and retry?
To allow multiple readers and one writer to access the database concurrently, enable write-ahead logging.
WAL works well with small transactions, so you don't need to accumulate writes.
Please note that WAL does not work with networked file systems, and for optimal performance, requires regular checkpointing.
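For reference, a minimal sketch of turning WAL on from the C API (the database file name is just a placeholder, and error handling is abbreviated):

/* Sketch: switch an SQLite database into WAL mode so readers are not
   blocked by a writer. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("app.db", &db) != SQLITE_OK) {        /* placeholder path */
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    char *err = NULL;
    /* Persistent setting: the database stays in WAL mode for later connections too. */
    if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "pragma failed: %s\n", err);
        sqlite3_free(err);
    }
    /* Optional: trade a little durability for lower commit latency. */
    sqlite3_exec(db, "PRAGMA synchronous=NORMAL;", NULL, NULL, NULL);
    sqlite3_close(db);
    return 0;
}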
First of all, SQLite offers multi-threaded support on its own. You do not have to use your own mutexes; they only slow the entire program down. Consult the SQLite threading options if you have any doubts.
Using the write-ahead log may solve your problems, but it is a double-edged sword: as long as a read is ongoing, the inserted data will not be written back to the main database file and the WAL journal will keep growing. This is covered in detail in the Write-Ahead Logging documentation.
I am using sqlite in WAL mode in one of my applications. For small amounts of data it works well. However, when there is a lot of data (several hundred inserts per second, in peaks even more) I experience some issues which I don't seem to be able to fix through any meddling with sqlite configuration.
What you may consider is using several database files, each assigned to a certain time span. This will be applicable only when your queries depend on time.
But I am probably getting ahead of myself; the WAL journal should help. :)
I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
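A rough sketch of what that could look like with POSIX AIO is below; the number of in-flight requests and the completion handling are simplified placeholders (a real version would reap completions and keep resubmitting), and the record size matches the 4 KB records from the question.

/* Sketch: queue many reads at once with POSIX AIO so the kernel/filesystem
   can order them. Link with -lrt on some systems. Error handling abbreviated. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NFILES 64            /* how many requests to keep in flight (placeholder) */
#define RECSZ  (4 * 1024)    /* matches the 4 KB record size from the question */

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int nf = argc - 1 < NFILES ? argc - 1 : NFILES;

    struct aiocb cbs[NFILES];
    const struct aiocb *list[NFILES];
    char bufs[NFILES][RECSZ];

    /* Submit one read per file; all of them enter the queue before any completes. */
    for (int i = 0; i < nf; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = open(argv[i + 1], O_RDONLY);
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = RECSZ;
        cbs[i].aio_offset = 0;
        aio_read(&cbs[i]);
        list[i] = &cbs[i];
    }

    /* Wait for all of them, then collect the results. */
    int pending = nf;
    while (pending > 0) {
        aio_suspend(list, nf, NULL);
        pending = 0;
        for (int i = 0; i < nf; i++)
            if (aio_error(&cbs[i]) == EINPROGRESS)
                pending++;
    }
    for (int i = 0; i < nf; i++) {
        ssize_t n = aio_return(&cbs[i]);
        printf("%s: read %zd bytes\n", argv[i + 1], n);
        close(cbs[i].aio_fildes);
    }
    return 0;
}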
A very simple approach, although with no guaranteed results: open as many of the files at once as you can and read all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you are reading and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. And it involves reading the physical device underlying a mounted filesystem while the filesystem may be writing to it, which makes getting a consistent snapshot hard.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
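If portability is not a concern, one Linux-specific (non-standard, root-only) way to approximate that mapping is the FIBMAP ioctl. A minimal sketch that prints the first physical block of each file, which you could then use as a sort key for your file list:

/* Sketch: query the physical (on-disk) block number of a file's first block
   with the Linux-specific FIBMAP ioctl. Requires root; not portable. */
#include <linux/fs.h>      /* FIBMAP, FIGETBSZ */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) { perror(argv[i]); continue; }

        int blocksize = 0;
        ioctl(fd, FIGETBSZ, &blocksize);   /* filesystem block size */

        int block = 0;                     /* in: logical block 0; out: physical block */
        if (ioctl(fd, FIBMAP, &block) == 0)
            printf("%s: first physical block %d (blocksize %d)\n",
                   argv[i], block, blocksize);
        else
            perror("FIBMAP");
        close(fd);
    }
    return 0;
}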
Could I recommend using an SSD for the file storage? That should reduce seek times greatly, as there is no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs, each working on a number of files (or even a single file). An idle thread can then pick up and complete a single job. This might help overlap the I/O operations with execution.
A simple way would be to keep the original program, but fork an extra process that has no other task than to prefetch the files and prime the disk buffer cache (a Unix/Linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
2. if empty goto 2.
3. (maybe) fork a worker process or thread
4. add to prefetch queue
5. add to internal queue
6. if fewer than XXX items on internal queue goto 1
7. fetch filename from internal queue
8. process it
9. goto 1
For the slave processes:
1. fetch from queue
2. if empty: quit
3. prefetch file
4. loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
which could be wrapped into a popen() or a pipe+fork().
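A hedged sketch of such a prefetch helper using popen() around dd: the data is drained and discarded, since the only goal is to warm the page cache. The block size matches the command above, and the naive quoting of the file name is only illustrative.

/* Sketch: prefetch files into the page cache by piping them through dd.
   The data itself is thrown away; the read is done purely for its caching side effect. */
#include <stdio.h>

/* Hypothetical helper: returns dd's exit status, or -1 if popen() fails.
   NB: a real version must escape or avoid shell metacharacters in 'name'. */
static int prefetch(const char *name)
{
    char cmd[4096];
    snprintf(cmd, sizeof cmd, "dd if='%s' bs=500K 2>/dev/null", name);
    FILE *p = popen(cmd, "r");
    if (!p) return -1;
    char buf[65536];
    while (fread(buf, 1, sizeof buf, p) > 0)
        ;                                  /* drain and discard */
    return pclose(p);
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        prefetch(argv[i]);
    return 0;
}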