I understand the cost difference between random and sequential writes, but which component (the OS, the storage driver, or something else) decides whether a write() system call ends up as a sequential write or a random write?
The environment I am talking about runs on RAID 5 SAN storage, presented to the server as multiple LUNs.
Thanks,
Soumya
Logically, a single write can be seen as both random and sequential, because it always affects a specific number of consecutive bytes. In the case of synchronous and direct I/O, where writes are not delayed by caching, the actual execution order and timing of writes are critical.
At any point in time, the block device has a queue containing zero or more writes. The size and location of those writes determine whether they can be considered random or sequential. If the queue contains small writes that affect non-consecutive blocks, they represent random writes and will "become" random writes, because each one results in only a small number of consecutive block writes. If, on the other hand, one of those writes is larger and therefore covers a large number of consecutive blocks, that one write represents a sequential write, because a significant number of consecutive block writes will be triggered as a result.
The situation is a little more nuanced because those I/O requests are executed in a specific order. If the disk queue contains writes for blocks 1, 2, 3, 4, 10, 11, 12, 13, two larger writes will be triggered that can be considered sequential. If, however, the write order issued by the application was different (e.g. due to multi-threading), a disk queue of 1, 10, 2, 11, 3, 12, 4, 13 would trigger eight writes that can be considered random, even though the application might have issued two sequential writes from a logical POV.
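To make that difference concrete, here is a toy sketch (Python, purely for illustration) that groups a request queue into runs of consecutive blocks: in submission order the queue above yields eight separate I/Os, while a reordered (sorted) queue collapses into two sequential runs.

    def coalesce(blocks):
        """Group block numbers into runs of consecutive blocks, preserving order."""
        runs = []
        for b in blocks:
            if runs and b == runs[-1][-1] + 1:
                runs[-1].append(b)      # extends the current sequential run
            else:
                runs.append([b])        # starts a new, effectively random, I/O
        return runs

    queue = [1, 10, 2, 11, 3, 12, 4, 13]
    print(len(coalesce(queue)))          # 8 separate I/Os in submission order
    print(len(coalesce(sorted(queue))))  # 2 sequential runs if reordering is allowed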
The situation changes / improves dramatically with two additional aspects:
Deferring writes using the OS page cache / storage controller cache
Reordering writes using I/O scheduling
In the case of asynchronous I/O (the usual default), writes only affect the OS page cache during the call and are executed in the background, or whenever a sync operation (e.g. fsync) is triggered. The same is true for the storage controller cache, which can defer writes using a similar strategy.
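For illustration only, this is roughly what forcing such a deferred write out of the cache looks like from the application side (a Python sketch; the file name is a placeholder):

    import os

    # A buffered write first lands in the process and OS caches; fsync asks the
    # kernel to push the cached pages down to the device (or controller cache).
    with open("datafile.bin", "wb") as f:
        f.write(b"\x00" * 4096)   # deferred: sits in the OS page cache
        f.flush()                 # flush Python's userspace buffer to the kernel
        os.fsync(f.fileno())      # request that the kernel write it to the device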
Using write deferral, individual writes can be aggregated into large writes if they affect consecutive blocks and might become sequential writes even though they would otherwise have been random writes. Even if those writes represent random writes of a database that updated individual database pages in the course of different SQL statements, they can become sequential writes if they can be aggregated because the timing happened to be right.
As outlined before, the write order affects this scenario. With write deferral, the OS and/or the controller is free to reorder writes based on whatever criteria are in place. The reordering process could favor specific blocks, small writes, old writes, or writes with better sequential locality. As in the example above, the ordering can turn writes that would otherwise have been considered sequential into random writes if the scheduling policy prioritizes other criteria.
In a SAN environment, a write is
triggered by an application
cached/aggregated/reordered by the OS
forwarded by the HBA
cached/aggregated/reordered by the storage array controller
aggregated/reordered and executed by the disk.
So while the disk has the final say in I/O scheduling (its queue can still be reordered), and is therefore where the final decision is made on whether a write ends up sequential or random, the storage array controller represents the last major stage of caching, aggregation, and reordering.
In short, it is the sequential locality and size of the I/O requests ultimately issued by the OS and the storage array controller, after potentially aggregating many individual writes in the page cache and controller cache, that makes writes either random or sequential.
Related
I am using the createReadStream API to read a huge amount of data from a file.
Sometimes it takes 7 seconds to read, while other times it takes 30 seconds.
I would like to understand why reading the same file with the same data takes more time in some instances and less time in others. Why is the time not fixed, given that I am reading the same file with the same data?
Here are some possible reasons:
1. Disk caching. The OS has a disk cache and uses it when it knows it is safe to do so. This often makes the first read of some data slower (because it is read directly from the disk) and later reads of the same data faster (provided the data is cacheable, not too large to fit in the cache, and still in the cache when you read it again); see the timing sketch after this list.
2. Nodejs event loop unpredictability. Reading a large set of data will necessarily require reading many blocks of the file, and each of those read operations goes through the nodejs event loop. If other events are also being inserted into the event loop, the disk-read-related events may sometimes have to wait their turn.
3. Garbage collector unpredictability. If you're dealing with large amounts of data (even if not all in memory at once), you may be creating lots of objects in the nodejs heap, many of which need to be garbage collected. Eventually, the garbage collector will have to run and may introduce a bit of a pause in the execution of your code. If this happens multiple times during an operation, the delay can become noticeable.
4. Disk busyness variability. A disk read/write head (assuming this is a spinning disk and the OS is reading from the actual disk) can only be on one track at a time. If it's busy reading something else that the OS asked it to read, your request may have to wait for some prior requests to finish. This wouldn't typically add up to many seconds, but it can lead to some variability. As an example of a worst case, the OS could be running a defrag operation on your hard drive which your disk operations would have to interleave with.
5. OS/CPU busyness. If the OS or CPU is busy doing something else, your app may not be getting full cycles to run.
6. Nodejs threadpool busy. Nodejs uses a threadpool with a default size of 4 for disk operations. If you happen to have multiple disk operations (or other operations that use the thread pool) in flight at the same time and max out the threadpool, then your operation may have to wait for some previous operation to finish before you get allocated a thread to run your disk operation in. The size of the threadpool is customizable, but making it larger than the number of actual CPU cores you have is probably not helpful.
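To make point 1 concrete, a small timing sketch (Python is used here only for brevity; the file name is a placeholder) usually shows the second read of the same file completing much faster, because it is served from the OS disk cache:

    import time

    def timed_read(path):
        # Read the whole file in 1 MB chunks and return the elapsed time.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1024 * 1024):
                pass
        return time.perf_counter() - start

    # The first read likely comes from disk, the second from the OS page cache.
    print("cold read:", timed_read("big_data_file.bin"))
    print("warm read:", timed_read("big_data_file.bin"))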
I have a huge data file (close to 4 TB) that I need to crunch. I am using 4 threads on my 4-core CPU. The first thread analyzes the first quarter of the file, and so on. All the threads need to add their results to the same single hash and single array after they have analyzed sections of their own quarter of the data file. So, are the "push", "pop", "shift", and "unshift" operations on hashes and arrays atomic and thread-safe, or do I have to resort to more complicated mechanisms like semaphores?
No, they are neither atomic nor threadsafe, and use from multiple threads will lead to crashes or data inconsistencies.
That said, even if they were, a design that involves lots of contention on the same data structure will scale poorly as you add more threads. This is because of the way hardware works in the face of parallelism; briefly:
Memory performance is heavily dependent on caches
Some cache levels are per CPU core
Writing to memory means getting it exclusively into the current core's cache
The process of moving data from one core's cache to another in order to write to it is costly (ballpark 60-100 cycle penalty)
You can use locking to attain correctness. For this, I don't recommend working with a lock directly, but instead suggest looking into a module like OO::Monitors, where you can encapsulate the hash in an object and have locking done at the boundaries.
If the number of pushes you do on the shared data structure is low compared to the amount of work done to produce the items to push, then you might not bottleneck on the locking and contention around the data structure. If you are doing thousands of pushes or similar per second, however, I suggest looking for an alternative design. For example:
Break the work up into a part for each worker
Use start to set off each worker, which returns a Promise. Put the Promises into an array.
Have each Promise return an array or hash of the items that it produced.
Merge the results from each one. For example, if each returns an array, then my @all-results = flat await @promises; or similar is enough to gather all of the results together (see the sketch after this list).
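The same fan-out/merge pattern, sketched in Python purely for illustration (the chunking and the worker body are placeholders; the point is that workers build private results and merging happens only once they are done):

    from concurrent.futures import ProcessPoolExecutor

    def analyze_chunk(chunk):
        # Placeholder: analyze one slice of the input, return a private result list.
        return [record for record in chunk if record]

    def run(chunks):
        # No shared hash/array while the workers run, hence no locking needed.
        with ProcessPoolExecutor() as pool:
            partial_results = pool.map(analyze_chunk, chunks)
        # Merge in the parent only after all workers have finished.
        return [item for part in partial_results for item in part]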
You might find your problem fits well into the parallel iterator paradigm, using hyper or race, in which case you don't even need to break up the work or set up the workers yourself; instead, you can pick a degree and batch size.
I have a logical store which has multiple physical disks assigned to it
STORE
X:\
Y:\
Z:\
...
I also have a pool of threads that write data (size unknown) to the STORE. Is there an algorithm which I can use (load balancing, scheduling... etc.) to help me determine on which physical disk I should write?
Factors to take under consideration:
Free disk space available.
Disk utilization (proper distribution of threads across physical disks).
Free space % on all disks should be more or less the same.
Notes:
Each thread has its own data to process, so a single thread can sleep if its data is not available.
Disks are not necessarily the same size.
One or more disks could be taken offline.
One or more disks could be added to the STORE.
UPDATE:
I should've explained the objective of these threads better in my question: these threads read from different data sources/streams and write immediately to disk(s). Buffering streams in memory is not much of an option because their size tends to grow huge quickly.
Whatever you go with is going to require some tuning. What I describe below is a simple and effective starting point that might very well fit your needs.
First, I doubt that you actually need three threads to handle writing to three disk drives. The amount of processing required to orchestrate this is actually quite small.
As a first cut, you could do a simple round-robin scheduling with one thread and asynchronous writes. That is, you just have a circular queue that you fill with [X, Y, Z]. When a request comes in, you take a disk from the front of the queue and initiate an asynchronous write to that drive.
When the next request comes in, you again take the first item from the queue and issue an asynchronous write.
When an asynchronous write completes, the disk to which the data was written is added to the end of the queue.
If a drive is taken offline, it's removed from the queue. If a drive is added to the store, you make a new entry for it in the queue.
An obvious problem with the above is what to do if you get more concurrent write requests than you have drives. Using the technique I described above, the thread would have to block until there is a drive available. If you have to support bursts of activity, you could easily create a request queue into which requests are written (with their associated data). The thread doing the orchestration, then, would read an item from the queue, get a disk drive from the drive queue, and start the asynchronous write.
Note that with this setup, no drive can be doing more than a single write at a time. That's typically not a problem because the drive hardware typically can't handle multiple concurrent writes.
Keeping free space percentage relatively the same across drives might not be much harder. You could keep track of the free space percentage on each drive easily enough, and rather than using a FIFO queue for the drives, use a priority queue so that you always write to the drive that has the highest free space percentage. That will work well as long as your average write size isn't a huge percentage of a drive's free space.
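A rough sketch of that free-space-driven selection, in Python for illustration (the mount points are placeholders; heapq plays the role of the priority queue, with negated keys so the drive with the most free space comes out first):

    import heapq
    import shutil

    class DrivePicker:
        """Hand out the drive with the highest free-space percentage."""

        def __init__(self, mount_points):
            self.heap = []
            for mp in mount_points:
                self._push(mp)

        def _push(self, mp):
            usage = shutil.disk_usage(mp)
            # heapq is a min-heap, so negate the key to pop the emptiest drive first.
            heapq.heappush(self.heap, (-usage.free / usage.total, mp))

        def acquire(self):
            # Caller must handle an empty heap (all drives busy or offline).
            return heapq.heappop(self.heap)[1]

        def release(self, mp):
            # Re-insert with a refreshed free-space figure once the write is done.
            self._push(mp)

    picker = DrivePicker(["X:\\", "Y:\\", "Z:\\"])   # illustrative mount points
    target = picker.acquire()
    # ... start the asynchronous write to `target`, and on completion:
    picker.release(target)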
Edit
Note that I said asynchronous writes. So you can have as many concurrent writes as you have drives. Those writes are running concurrently and will notify on an I/O completion port when done. There's no need for multiple threads.
As for the priority queue, there are plenty of those to choose from, although finding a good concurrent priority queue is a bit more work. In the past I've just used locks to synchronize access to my own priority queue implementation. I guess I should formalize that at some point.
You could play with what I describe above by, for example, adding two or more entries in the queue for each drive. More for faster drives, fewer for slower drives. It's unclear how well that would work, but it's probably worth trying. If those "drives" are high performance network storage devices, they might actually be able to handle multiple concurrent writes better than a typical local disk drive can. But at some point you'll have to buffer writes because your computer can almost certainly create data much faster than your drives can write. The key is making your buffer large enough to handle the normal bursts of data, and also robust enough to block the program briefly if the buffer fills up.
I have a simple task that is easily parallelizable. Basically, the same operation must be performed repeatedly on each line of a (large, several Gb) input file. While I've made a multithreaded version of this, I noticed my I/O was the bottleneck. I decided to build a utility class that involves a single "file reader" thread that simply goes and reads straight ahead as fast as it can into a circular buffer. Then, multiple consumers can call this class and get their 'next line'. Given n threads, each thread i's starting line is line i in the file, and each subsequent line for that thread is found by adding n. It turns out that locks are not needed for this; a couple of key atomic ops are enough to preserve invariants.
I've tested the code and it seems faster, but upon second thought, I'm not sure why. Wouldn't it be just as fast to divide the large file into n input files (you can 'seek' ahead into the same file to achieve the same thing, with minimal preprocessing), and then have each process simply call iostream::readLine on its own chunk (since iostream reads into its own buffer as well)? It doesn't seem that sharing a single buffer among multiple threads has any inherent advantage, since the workers are not actually operating on the same lines of data. Plus, I don't think there's a good way to parallelize so that they do work on the same lines. I just want to understand the performance gain I'm seeing, and know whether it is 'flukey' or scalable/reproducible across platforms...
When you are I/O limited, you can get a good speedup by using two threads: one reading the file, the second doing the processing. This way the reading never waits for processing (except for the very last line) and you will be reading 100% of the time.
The buffer should be large enough to give the consumer thread enough work in one go, which most often means it should hold multiple lines (I would recommend at least 4000 characters, and probably more). This keeps the cost of thread context switching from becoming impractically high.
Single threaded:
read 1
process 1
read 2
process 2
read 3
process 3
Double threaded:
read 1
process 1/read 2
process 2/read 3
process 3
On some platforms you can get the same speedup without threads by using overlapped I/O, but using threads can often be clearer.
Using more than one consumer thread will bring no benefit as long as you are really I/O bound.
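A minimal version of that two-thread split, sketched in Python with a bounded queue standing in for the buffer (process_line and the file name are placeholders):

    import queue
    import threading

    buf = queue.Queue(maxsize=4096)       # bounded buffer between reader and consumer

    def reader(path):
        with open(path, "r") as f:
            for line in f:
                buf.put(line)             # blocks only if the consumer falls behind
        buf.put(None)                     # sentinel: end of file

    def process_line(line):
        pass                              # placeholder for the real per-line work

    def consumer():
        while True:
            line = buf.get()
            if line is None:
                break
            process_line(line)

    t = threading.Thread(target=reader, args=("input.txt",))
    t.start()
    consumer()
    t.join()

In practice you would hand over batches of lines rather than single lines, as suggested above, to keep the hand-off overhead low.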
In your case, there are at least two resources that your program competes for: the CPU and the hard disk. In a single-threaded approach, you request data and then wait with an idle CPU for the HD to deliver it. Then you handle the data while the HD is idle. This is bad, because one of the two resources is always idle. This changes a bit if you have multiple CPUs or multiple HDs. In some cases, memory bandwidth (i.e. the RAM connection) is also a limiting resource.
Now, your solution is right: you use one thread to keep the HD busy. If this thread blocks waiting for the HD, the OS just switches to a different thread that handles some data. If it doesn't have any data, it will wait for some. That way, CPU and HD work in parallel, at least some of the time, increasing the overall throughput. Note that you can't increase the throughput with more than two threads, unless you also have multiple CPUs and the CPU is the limiting factor rather than the HD. If you are writing back some data, too, you could improve performance with a third thread that writes to a second hard disk. Otherwise, you don't get any advantage from more threads.
I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
A very simple approach, although no results are guaranteed: open as many of the files at once as you can and read all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
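A simple way to keep many reads in flight at once, sketched in Python with a thread pool (the directory, suffix, and worker count are illustrative):

    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def read_whole(path):
        # Each ~500KB file is read in one request, keeping requests large.
        return path.read_bytes()

    def process(content):
        pass                                        # placeholder per-file work

    files = sorted(Path("data").glob("*.rec"))      # illustrative location/suffix

    # With many outstanding reads, the kernel/disk scheduler can reorder them.
    with ThreadPoolExecutor(max_workers=32) as pool:
        for content in pool.map(read_whole, files):
            process(content)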
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step - getting the mapping of the files to physical blocks. There is no standard interface to do that, you could probably extract the logic from something like ext2fsprogs or the kernel FS driver. And this involves reading the physical device underlying a mounted filesystem, which can be writing to it at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
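For the do-it-yourself route, one possible approximation on Linux (root-only, and assuming the FIBMAP ioctl value of 1 from <linux/fs.h>) is to sort the files by the physical location of their first logical block; a hedged Python sketch:

    import fcntl
    import struct
    from pathlib import Path

    FIBMAP = 1  # from <linux/fs.h>; Linux-specific, needs root/CAP_SYS_RAWIO

    def first_physical_block(path):
        # Ask the filesystem where logical block 0 of this file lives on disk.
        with open(path, "rb") as f:
            buf = struct.pack("i", 0)                  # logical block index 0
            res = fcntl.ioctl(f.fileno(), FIBMAP, buf)
            return struct.unpack("i", res)[0]

    files = [p for p in Path("data").iterdir()]        # illustrative location
    files.sort(key=first_physical_block)               # then read in this order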
May I recommend using an SSD for the file storage? That should reduce seek times greatly, as there is no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that each work on a number of files (possibly a single file). An idle thread then picks up a single job. This might help overlap I/O operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache. (A unix/linux system uses all "free" memory as disk buffer.)
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1: fetch filename from worklist
   if empty goto 2
   (maybe) fork a worker process or thread
   add to prefetch queue
   add to internal queue
   if fewer than XXX items on internal queue goto 1
2: fetch filename from internal queue
   process it
   goto 1
For the slave processes:
fetch from queue
if empty: quit
prefetch file
loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
, which could be wrapped into a popen() or a pipe+fork().
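A compact equivalent of that prefetch scheme, sketched in Python with a bounded queue of filenames and a single prefetch thread instead of fork() and pipes (process_file and the read size are illustrative):

    import queue
    import threading

    ready = queue.Queue(maxsize=10)          # main task stays about ten files behind

    def prefetcher(names):
        for name in names:
            with open(name, "rb") as f:
                while f.read(1 << 20):       # read and discard: primes the page cache
                    pass
            ready.put(name)                  # hand the now-cached file to the main loop
        ready.put(None)                      # sentinel: no more files

    def process_file(name):
        pass                                 # placeholder for the real per-file work

    def main(worklist):
        threading.Thread(target=prefetcher, args=(worklist,), daemon=True).start()
        while True:
            name = ready.get()
            if name is None:
                break
            process_file(name)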