Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.

Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.

A very simple approach, although no results guaranteed. Open as many of the files at once as you can and read all of them at once - either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
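A minimal sketch of the thread-based variant, assuming Python and a plain thread pool. The names `read_all`, `count_records`, and `make_sample_files` are made up for illustration, and the pool size of 32 is an arbitrary guess that would need tuning. The idea is only that many reads are outstanding at once, so the kernel's I/O scheduler gets a chance to reorder them.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 4096  # 4KB fixed-size records, as described in the question


def count_records(path):
    """Read one whole file and return the number of complete records in it."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data) // RECORD_SIZE


def read_all(paths, max_workers=32):
    """Keep many reads in flight at once so the I/O scheduler can reorder them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(count_records, paths))


def make_sample_files(directory, count, records_per_file):
    """Create zero-filled sample files (hypothetical stand-ins for real data)."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"sample_{i}.bin")
        with open(path, "wb") as f:
            f.write(bytes(RECORD_SIZE * records_per_file))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        paths = make_sample_files(directory, count=8, records_per_file=3)
        print(read_all(paths), "records read")
```

Whether this actually beats a sequential loop depends on the disk and the scheduler, so it is worth timing both on the real data.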
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. It also involves reading the physical device underlying a mounted filesystem, while the filesystem may be writing to it at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
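The last step (order by physical block, then read) could look roughly like the sketch below. It assumes the hard part is already done: the `Extent` tuples, including their `physical_block` values, are hypothetical input that you would have to obtain yourself. The sketch only shows the sorting and the positioned reads.

```python
import os
import tempfile
from typing import NamedTuple


class Extent(NamedTuple):
    physical_block: int  # on-disk position (assumed to be known already)
    path: str
    file_offset: int
    length: int


def read_in_physical_order(extents):
    """Yield (path, file_offset, data) in ascending physical block order."""
    open_fds = {}
    try:
        for ext in sorted(extents, key=lambda e: e.physical_block):
            fd = open_fds.get(ext.path)
            if fd is None:
                # Note: with thousands of files this may hit the fd limit.
                fd = open_fds[ext.path] = os.open(ext.path, os.O_RDONLY)
            data = os.pread(fd, ext.length, ext.file_offset)
            yield ext.path, ext.file_offset, data
    finally:
        for fd in open_fds.values():
            os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "demo.bin")
        with open(path, "wb") as f:
            f.write(b"AAAABBBB")
        # Made-up block numbers, purely to show the reordering.
        demo = [Extent(200, path, 0, 4), Extent(100, path, 4, 4)]
        for _, offset, data in read_in_physical_order(demo):
            print(offset, data)
```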

Could you consider using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.

Since operations are similar and data are independent you can try using a thread pool to submit jobs that work on a number of files (can be a single file). Then you can have an idle thread complete a single job. This might help overlapping IO operations with execution.

A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a Unix/Linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
   if empty goto 2
   (maybe) fork a worker process or thread
   add to prefetch queue
   add to internal queue
   if fewer than XXX items on internal queue goto 1
2. fetch filename from internal queue
   process it
   goto 1
For the slave processes:
fetch from queue
if empty: quit
prefetch file
loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads or processes.
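As a worked example of that ratio, here is a small sketch with made-up timings (the 10 ms and 2 ms figures are assumptions, not measurements). The helper name `workers_needed` is hypothetical.

```python
import math


def workers_needed(seek_time_per_file, processing_time_per_file):
    """Round seek time / processing time up to a whole number of workers."""
    ratio = seek_time_per_file / processing_time_per_file
    return max(1, math.ceil(ratio))


if __name__ == "__main__":
    # Both arguments must use the same unit (milliseconds here).
    print(workers_needed(10, 2))
```

With negligible processing time, as in the question, the ratio becomes very large, so in practice the count would be capped at whatever the disk queue can usefully hold.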
As a simplification: if seeking the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
which could be wrapped into a popen() or a pipe+fork().
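A rough sketch of the prefetch idea, with one deliberate simplification: it uses a helper thread and a bounded in-process queue instead of a forked process and a pipe or message queue. `LOOKAHEAD`, `prefetch`, and `process_all` are illustrative names. The helper reads each file and throws the data away, which is enough to pull it into the OS page cache; the bounded queue keeps it at most about ten files ahead of the main loop.

```python
import os
import queue
import tempfile
import threading

LOOKAHEAD = 10  # how far the prefetcher may run ahead ("say ten")


def prefetch(paths, ready):
    """Read each file once and discard the data, priming the OS page cache."""
    for path in paths:
        with open(path, "rb") as f:
            while f.read(1 << 20):  # 1MB chunks until end of file
                pass
        ready.put(path)  # blocks while LOOKAHEAD files are already waiting
    ready.put(None)  # sentinel: nothing left to prefetch


def process_all(paths, process):
    """Run process(path) on every file while a helper thread prefetches."""
    ready = queue.Queue(maxsize=LOOKAHEAD)
    helper = threading.Thread(target=prefetch, args=(paths, ready), daemon=True)
    helper.start()
    results = []
    while (path := ready.get()) is not None:
        results.append(process(path))
    helper.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        paths = []
        for i in range(3):
            path = os.path.join(directory, f"file_{i}.bin")
            with open(path, "wb") as f:
                f.write(bytes(4096))
            paths.append(path)
        print(process_all(paths, os.path.getsize))
```

A single prefetcher still reads the files one at a time, so to give the disk scheduler something to reorder you would run several of them, as the answer suggests.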

Related

NodeJS is reading the same file blocking?

We have a backend expressjs server that will read off of the disk for many files whenever a front-end client connects.
At the OS level, are these reads blocking?
I.E., if two people connect at the same time, will whoever gets scheduled second have to wait to read the file until the first person who is currently reading it finishes?
We are just using fs.readFile to read files.
EDIT: I'm implementing caching anyway (it's a legacy codebase, don't hate me), I'm just curious if these reads are blocking and this might improve response time from not having to wait until the file is free to read.

fs.readFile() is not blocking for nodejs. It's a non-blocking, asynchronous operation. While one fs.readFile() operation is in progress, other nodejs code can run.
If two fs.readFile() calls are in operation at the same time, they will both proceed in parallel.
Nodejs itself uses a native OS thread pool with a default size of 4 for file operations so it will support up to 4 file operations in parallel. Beyond 4, it queues the next operation so when one of the 4 finishes, then the next one in line will start to execute.
Within the OS, it will time slice these different threads to achieve parallel operation. But, at the disk controller itself for a spinning drive, only one particular read operation can be occurring at once because the disk head can only be on one track at a given time. So, the underlying read operations reading from different parts of a spinning disk will eventually be serialized at the disk controller as it moves the disk head to read from a given track.
But, if two separate reads are trying to read from the same file, the OS will typically cache that info so the 2nd read won't have to read from the disk again, it will just get the data from an OS cache.
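To make the "up to 4 in parallel, the rest queued" behaviour concrete, here is an analogy in Python rather than Node itself. It is not how Node's pool is implemented; it is just a 4-worker thread pool with simulated reads (a short sleep stands in for the blocking file read) that records how many jobs were ever running at once. `run_jobs` and `POOL_SIZE` are illustrative names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4  # mirrors the default pool size mentioned above


def run_jobs(num_jobs, job_seconds=0.01):
    """Run simulated reads on a 4-thread pool; return (results, peak)."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def job(i):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(job_seconds)  # stand-in for a blocking file read
        with lock:
            active -= 1
        return i

    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        results = list(pool.map(job, range(num_jobs)))
    return results, peak


if __name__ == "__main__":
    results, peak = run_jobs(10)
    print(len(results), "jobs finished; peak concurrency:", peak)
```

However many jobs are submitted, the peak never exceeds the pool size; the remaining jobs simply wait their turn.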
I inherited this codebase and am going to implement some caching anyway, but was just curious if caching would also improve response time since we would be reading from non-blocking process memory instead of (potentially) blocking filesystem memory.
OS file caching is heavily, heavily optimized (it's a problem operating systems have spent decades working on). Implementing your own level of caching on top of the OS isn't where I would expect you to find the highest bang for the buck for improving performance. While there may be a temporary lock used in the OS file cache, that lock would only exist for the duration of a memory copy from cache to target read location, which is really, really short. Probably not something anyone would notice. And that temporary lock is not blocking nodejs at all.

single file reader/multiple consumer model: good idea for multithreaded program?

I have a simple task that is easily parallelizable. Basically, the same operation must be performed repeatedly on each line of a (large, several GB) input file. While I've made a multithreaded version of this, I noticed my I/O was the bottleneck. I decided to build a utility class that involves a single "file reader" thread that simply goes and reads straight ahead as fast as it can into a circular buffer. Then, multiple consumers can call this class and get their 'next line'. Given n threads, each thread i's starting line is line i in the file, and each subsequent line for that thread is found by adding n. It turns out that locks are not needed for this; a couple of key atomic ops are enough to preserve invariants.
I've tested the code and it seems faster, but on second thought, I'm not sure why. Wouldn't it be just as fast to divide the large file into n input files (you can 'seek' ahead into the same file to achieve the same thing, with minimal preprocessing), and then have each process simply call iostream::readLine on its own chunk (since iostream reads into its own buffer as well)? It doesn't seem that sharing a single buffer amongst multiple threads has any inherent advantage, since the workers are not actually operating on the same lines of data. Plus, I don't think there's any good way to parallelize so that they do work on the same lines. I just want to understand the performance gain I'm seeing, and know whether it is a fluke or scalable/reproducible across platforms...

When you are I/O limited, you can get a good speedup by using two threads, one reading the file and a second doing the processing. This way the reading will never wait for processing (except for the very last line) and you will be reading 100% of the time.
The buffer should be large enough to give the consumer thread enough work in one go, which most often means it should consist of multiple lines (I would recommend at least 4000 characters, but probably even more). This keeps the thread context-switching cost from becoming impractically high.
Single threaded:
read 1
process 1
read 2
process 2
read 3
process 3
Double threaded:
read 1
process 1/read 2
process 2/read 3
process 3
On some platforms you can get the same speedup also without threads, using overlapped I/O, but using threads can be often clearer.
Using more than one consumer thread will bring no benefit as long as you are really I/O bound.
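A minimal sketch of the two-thread layout, assuming Python: one reader thread hands batches of lines to one consumer through a bounded queue. `BATCH_LINES`, `reader`, and `process_file` are illustrative names, and the batch size of 1000 lines is an arbitrary stand-in for "enough work in one go". In CPython the interpreter lock is released during blocking reads, so the read still overlaps with processing.

```python
import os
import queue
import tempfile
import threading

BATCH_LINES = 1000  # hand over many lines at once to limit switching cost


def reader(path, batches):
    """Read straight ahead and pass the lines on in batches."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) >= BATCH_LINES:
                batches.put(batch)  # blocks if the consumer falls far behind
                batch = []
    if batch:
        batches.put(batch)
    batches.put(None)  # sentinel: end of file


def process_file(path, process_line):
    """Apply process_line to every line while a second thread does the I/O."""
    batches = queue.Queue(maxsize=8)
    io_thread = threading.Thread(target=reader, args=(path, batches), daemon=True)
    io_thread.start()
    results = []
    while (batch := batches.get()) is not None:
        results.extend(process_line(line) for line in batch)
    io_thread.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "input.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("1\n2\n3\n")
        print(process_file(path, int))
```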

In your case, there are at least two resources that your program competes for, the CPU and the harddisk. In a single-threaded approach, you request data then wait with an idle CPU for the HD to deliver it. Then, you handle the data, while the HD is idle. This is bad, because one of the two resources is always idle. This changes a bit if you have multiple CPUs or multiple HDs. Also, in some cases the memory bandwidth (i.e. the RAM connection) is also a limiting resource.
Now, your solution is right, you use one thread to keep the HD busy. If this thread blocks waiting for the HD, the OS just switches to a different thread that handles some data. If it doesn't have any data, it will wait for some. That way, CPU and HD will work in parallel, at least some of the time, increasing the overall throughput. Note that you can't increase the throughput with more than two threads, unless you also have multiple CPUs and the CPU is the limiting factor and not the HD. If you are writing back some data, too, you could improve performance with a third thread that writes to a second harddisk. Otherwise, you don't get any advantage from more threads.

Transferring data between process calls

I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read/write from and by that avoid real disk I/O?

Pipes aren't appropriate for this. Use posix shared memory or a posix message queue if you are absolutely sure files are too slow - which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue and exits. Call mq_unlink() when you no longer need the queue.
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.

It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.

... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what these reasons are...
Linux caches files in kernel memory in the page cache. Writes go to the page cache first; in other words, a write() syscall is a kernel call that only copies the data from user space to the page cache (it is a bit more complicated when the system is under stress). Some time later the kernel's flusher threads (pdflush in older kernels) write the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to be persisted across OS reboots, those files can be put in /dev/shm, which is normally the mount point of an in-memory filesystem (tmpfs), or in /tmp on systems where it is mounted the same way.
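A small sketch of keeping state in such a file between invocations, assuming Python and a made-up state layout (a single 8-byte counter). `bump_counter` is a hypothetical helper. In real use the path would be something under /dev/shm; the demo uses a temporary directory so it runs anywhere. There is no locking here, so it assumes invocations do not overlap.

```python
import mmap
import os
import struct
import tempfile

STATE_FORMAT = "<Q"  # one unsigned 64-bit counter (illustrative layout)
STATE_SIZE = struct.calcsize(STATE_FORMAT)


def bump_counter(path):
    """Load the counter left by the previous call, increment it, store it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if os.fstat(fd).st_size < STATE_SIZE:
            os.ftruncate(fd, STATE_SIZE)  # first call: zero-filled state
        with mmap.mmap(fd, STATE_SIZE) as state:
            (count,) = struct.unpack_from(STATE_FORMAT, state, 0)
            count += 1
            struct.pack_into(STATE_FORMAT, state, 0, count)
            return count
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        state_path = os.path.join(directory, "state.bin")
        for _ in range(3):  # each call stands in for one process invocation
            print(bump_counter(state_path))
```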

Data access synchronization between multiple threads

I'm trying to implement a multi threaded, recursive file search logic in Visual C++. The logic is as follows:
Threads 1,2 will start at a directory location and match the files present in the directory with the search criteria. If they find a child directory, they will add it to a work Queue. Once a thread finishes with the files in a directory, it grabs another directory path from the work queue. The work queue is a STL Stack class guarded with CriticalSections for push(),pop(),top() calls.
If the stack is empty at any point, the threads will wait for a minute amount of time before retrying. Also when all the threads are in waiting state, the search is marked as complete.
This logic works without any problems but I feel that I'm not gaining the full potential of using threads because there isn't a drastic performance gain compared to using a single thread. I feel the work Stack is the bottleneck but can't figure out how to do away with the locking part. I tried another variation where each thread has its own Stack and adds a work item to the global Stack only when the local stack size crosses a fixed number of work items. If the local Stack is empty, threads will try fetching from the global Stack. I didn't find a noticeable difference even with this variation. Does anyone have any suggestions for improving the synchronization logic?

I really doubt that your work stack is the bottleneck. The disk only has one head, and can only read one stream of data at a time. As long as your threads are processing the data as fast as the disk can supply it, there's not much else you can do that's going to have any significant effect on overall speed.
For other types of tasks your queue might become a significant bottleneck, but for this task, I doubt it. Keep in mind the time scales of the operations here. A simple operation that happens inside of a CPU takes considerably less than a nanosecond. A read from main memory takes on the order of tens of nanoseconds. Something like a thread switch or synchronization takes on the order of a couple hundred nanoseconds or so. A single head movement on the disk drive takes on the order of a millisecond or so (1,000,000 nanoseconds).
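For reference, the shared work-stack design from the question can be sketched as below, in Python rather than Visual C++. It is not expected to be any faster on a single disk, for the reasons above; it only shows one way to detect completion without sleeping and retrying, by counting outstanding directories with the queue's own `task_done()`/`join()` bookkeeping. `parallel_search` and its parameters are illustrative names.

```python
import os
import queue
import tempfile
import threading


def parallel_search(root, predicate, num_threads=2):
    """Return sorted paths of files under root whose name satisfies predicate."""
    work = queue.LifoQueue()  # a thread-safe stack of directories to scan
    matches = []
    matches_lock = threading.Lock()
    work.put(root)

    def worker():
        while True:
            directory = work.get()
            if directory is None:  # sentinel: the search is complete
                work.task_done()
                return
            try:
                with os.scandir(directory) as entries:
                    for entry in entries:
                        if entry.is_dir(follow_symlinks=False):
                            work.put(entry.path)  # queued before task_done()
                        elif predicate(entry.name):
                            with matches_lock:
                                matches.append(entry.path)
            except OSError:
                pass  # unreadable directory: skip it
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    work.join()  # returns once every queued directory has been scanned
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return sorted(matches)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        for name in ("a.txt", os.path.join("sub", "b.txt")):
            open(os.path.join(root, name), "w").close()
        print(parallel_search(root, lambda name: name.endswith(".txt")))
```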

In addition to @Jerry's answer, your bottleneck is the disk system. If you have a RAID array you might see some moderate improvement from using 2 or 3 threads.
If you have to search multiple drives (note: physical drives, not volumes on a single physical drive) you can use extra threads for each of them.

Does multithreading make sense for IO-bound operations?

When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task was to open an image file, convert to another format, and then save, disk operations can be performed concurrently with the image manipulation. My question is when the only operations performed are disk operations, whether concurrently queuing and responding to disk operations is better.

Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads...IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless...and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has as much to do with the OS scheduler as it has to do with the physical aspects of the disk(s) you're writing to.

That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).

I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.

It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy) the OS wakes it up, so threads are a simple way to hook into the OS scheduler and yet still write code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.

In most cases, using multi-thread for disk IO will not benefit efficiency. Let's imagine 2 circumstances:
Lock-Free File: We can split the file among the threads by giving each a different IO offset. For instance, a 1024-byte file is split into n pieces and each thread writes its own 1024/n bytes. This will cause a lot of unnecessary disk head movement because of the different offsets.
Lock File: Actually lock the IO operation for each critical section. This will cause a lot of unnecessary thread switches, and it turns out that only one thread can write the file at a time.
Correct me if I'm wrong.

No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OSes have to cope with multiple processes anyway, I doubt that there's an added overhead.

I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.

I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then splitting the set in two parts and making the copies in parallel... the second option will be noticeably slower.
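A sketch of such a test harness, assuming Python. `copy_sequential`, `copy_parallel`, and `timed` are illustrative names, and two workers stand in for "split the set in two parts". It makes no claim about which variant wins; that is what the measurement is for. Note that once the source files sit in the OS cache the second run reads from memory, so a fair comparison needs separate file sets (or a cold cache) for each run.

```python
import os
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def copy_sequential(sources, dest_dir):
    """Copy the files one after another."""
    for src in sources:
        shutil.copy(src, dest_dir)


def copy_parallel(sources, dest_dir, workers=2):
    """Copy the files using several threads at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda src: shutil.copy(src, dest_dir), sources))


def timed(copy_fn, sources, dest_dir):
    """Return the wall-clock seconds copy_fn takes."""
    start = time.perf_counter()
    copy_fn(sources, dest_dir)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        dirs = {}
        for name in ("src", "seq", "par"):
            dirs[name] = os.path.join(base, name)
            os.mkdir(dirs[name])
        sources = []
        for i in range(20):
            path = os.path.join(dirs["src"], f"file_{i}.bin")
            with open(path, "wb") as f:
                f.write(os.urandom(1 << 20))  # 1MB of sample data
            sources.append(path)
        print("sequential:", timed(copy_sequential, sources, dirs["seq"]))
        print("parallel:  ", timed(copy_parallel, sources, dirs["par"]))
```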
