How are pipes implemented with respect to buffering? I might be creating many pipes but only ever sending/receiving a few bytes through them at a time, so I don't want to waste memory unnecessarily.
Edit: I understand what buffering is; I am asking how buffering is implemented in Linux pipes specifically, i.e. does the full 64 kB get allocated regardless of the high-water mark?
Buffers are used to even out the difference in speed between producer and consumer. Without a buffer, you would have to switch tasks after every byte produced, which would be very inefficient due to the cost of context switches, data and code caches never becoming hot, and so on. If your consumer can consume data about as fast as the producer produces it, your buffer use will usually be low (but read on). If the producer is much faster than the consumer, the buffer will fill up completely and the producer will be forced to wait until more space becomes available. The reversed case of a slow producer and fast consumer will use a very small part of the buffer most of the time.
Usage also depends on whether both of your processes actually run in parallel (e.g. on separate cores) or whether they share a core and are merely made to look concurrent by the OS's process scheduling. If you have real concurrency (separate core/CPU), your buffer will usually be used less.
Anyway, if your applications are not producing much data and their speeds are similar, the buffer will not be very full most of the time. However, I wouldn't be surprised if, at the OS level, the full 64 kB were allocated anyway. But unless you are using an embedded device, 64 kB is not much, so even if the maximum size is always allocated, I wouldn't worry about it.
By the way, it is not easy to modify the size of the pipe buffer; for example, in this discussion a number of tricks are suggested, but they are really workarounds that change the way data is consumed from the buffer, not the actual buffer size. You could check ulimit -p, but I'm not 100% sure it will give you the control you need.
EDIT: Looking at fs/pipe.c and include/linux/pipe_fs_i.h in the Linux source, it looks like the buffers do change their size. The minimum size of the buffer is one full page, though, so if you only need a few bytes, there will be some waste. I'm not sure at this point, but some code that uses PIPE_DEF_BUFFERS (which is 16, giving 64 kB with 4 kB pages) makes me wonder whether the buffer can fall below 64 kB at all (the one-page minimum could be just an additional restriction).
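For what it's worth, kernels from 2.6.35 onward also expose the pipe capacity through fcntl() via F_GETPIPE_SZ and F_SETPIPE_SZ. A minimal sketch, assuming Linux and Python 3 (the raw constant values are Linux-specific fallbacks for Python versions that don't define them in the fcntl module):

import fcntl
import os

# Linux-specific fcntl commands; Python exposes them from 3.10 onward.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

r, w = os.pipe()

# Ask the kernel for this pipe's current capacity (usually 64 kB).
print("default capacity:", fcntl.fcntl(w, F_GETPIPE_SZ))

# Request a smaller buffer; the kernel rounds up to at least one page.
fcntl.fcntl(w, F_SETPIPE_SZ, 4096)
print("new capacity:", fcntl.fcntl(w, F_GETPIPE_SZ))

os.close(r)
os.close(w)

As far as I can tell, the capacity reported here is only an upper bound; the pages backing the pipe are allocated as data is actually written, which fits the observation above that the buffers change their size.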
I have python code that sends data to socket (a rather large file). Should I divide it into 1kb chunks, or would just conn.sendall(file.read()) be acceptable?
It will make little difference to the sending operation. (I assume you are using a TCP socket for the purposes of this discussion.)
When you attempt to send 1K, the kernel will take that 1K, copy it into kernel TCP buffers, and return success (and probably begin sending to the peer at the same time). At that point, you will send another 1K and the same thing happens. Eventually, if the file is large enough and the network can't send it fast enough, or the receiver can't drain it fast enough, the kernel buffer space used by your data will reach some internal limit and your process will be blocked until the receiver drains enough data. (This limit can often be pretty high with TCP -- depending on the OS, you may be able to send a megabyte or two without ever hitting it.)
If you try to send in one shot, pretty much the same thing will happen: data will be transferred from your buffer into kernel buffers until/unless some limit is reached. At that point, your process will be blocked until data is drained by the receiver (and so forth).
However, with the first mechanism, you can send a file of any size without using undue amounts of memory -- your in-memory buffer (not including the kernel TCP buffers) only needs to be 1K long. With the sendall approach, file.read() will read the entire file into your program's memory. If you attempt that with a truly giant file (say 40G or something), that might take more memory than you have, even including swap space.
So, as a general purpose mechanism, I would definitely favor the first approach. For modern architectures, I would use a larger buffer size than 1K though. The exact number probably isn't too critical; but you could choose something that will fit several disk blocks at once, say, 256K.
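A minimal sketch of the chunked approach (conn and path are placeholders for your connected TCP socket and file):

import socket

CHUNK = 256 * 1024  # several disk blocks at once, as suggested above

def send_file(conn: socket.socket, path: str) -> None:
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:            # end of file
                break
            # sendall() loops internally until the kernel has accepted the
            # whole chunk, so no partial sends are lost here.
            conn.sendall(chunk)

Only one CHUNK-sized buffer lives in your program's memory at a time, no matter how large the file is.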
On a very large SMP machine with many CPUs, scripts are run with tens of simultaneous jobs (fewer than the number of CPUs) like this:
some_program -in FIFO1 >OUTPUT1 2>s_p1.log </dev/null &
some_program -in FIFO2 >OUTPUT2 2>s_p2.log </dev/null &
...
some_program -in FIFO40 >OUTPUT40 2>s_p40.log </dev/null &
splitter_program -in real_input.dat -out FIFO1,FIFO2...FIFO40
The splitter reads the input data flat out and distributes it to the FIFOs in order. (Records 1,41,81... to FIFO1; 2,42,82 to FIFO2, etc.) The splitter has low overhead and can pretty much process data as fast as the file system can supply it.
Each some_program processes its stream and writes it to its output file. However, nothing controls the order in which the file system sees these writes. The writes are also very small, on the order of 10 bytes. The script "knows" that there are 40 streams here and that they could be buffered in 20M (or whatever) chunks, and then each chunk written to the file system sequentially. That is, queued writes should be used to maximize write speed to the disks. The OS, however, just sees a bunch of writes at about the same rate on each of the 40 streams.
What happens in practice during a run is that the subprocesses get a lot of CPU time (in top, >80%), then a flush process appears (10% CPU) and all the others drop to low CPU (1%), then it goes back to the higher rate. These pauses go on for several seconds at a time. The flush means that the writes are overwhelming the file caching. Also, I think the OS and/or the underlying RAID controller is probably bouncing the physical disk heads around erratically, which is reducing the ultimate write speed to the physical disks. That is just a guess though, since it is hard to say what exactly is happening with a file cache (in a system with over 500 GB of RAM) and a RAID controller sitting between the writes and the disks.
Is there a program or method around for controlling this sort of IO, forcing the file system writes to queue nicely to maximize write speed?
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time. If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not reads its next input, and that would stall the splitter, as all I/O is synchronous. The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
There are dozens of parameters for tuning the file system, some of which might help. The scheduler was changed from cfq to deadline because the system was locking up for minutes at a time with the former.
If the problem is sheer I/O bandwidth then buffering won't solve anything. In that case, you need to shrink the data or send it to a higher-bandwidth sink to improve and level your performance. One way to do that would be to reduce the number of parallel jobs, as #thatotherguy said.
If in fact the problem is with the number of distinct I/O operations rather than with the overall volume of data, however, then buffering might be a viable solution. I am unfamiliar with the buffer program you mentioned, but I suppose it does what its name suggests. I don't completely agree with your buffering comments, however:
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time.
You don't necessarily need big chunks. It would probably be ideal to chunk at the native block size of the file system, or a small integer multiple thereof. That might be, say, 4096- or 8192-byte chunks.
Moreover, I don't see why you think you have an "orderly queueing of writes" now, or why you're confident that such a thing is needed.
If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not read its next input, and that would stall the splitter, as all I/O is synchronous.
Your splitter is writing to FIFOs. Though it may do so serially, that is not "synchronous" in the sense that the data needs to be drained out the other end before the splitter can proceed -- at least, not if the writes do not exceed the size of the FIFOs' buffers. FIFO buffer capacity varies from system to system, adapts dynamically on some systems, and is configurable (e.g. via fcntl()) on some systems. The default buffer size on modern Linux is 64 kB.
The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
I think this is a problem that pretty much solves itself. If one of the buffers backs up enough to block the splitter, then that ensures that the competing processes will, before too long, give the blocked buffer the opportunity to write. But this is also why you don't want enormous buffers -- you want to interleave disk I/O from different processes relatively finely to try to keep everything going.
The alternative to an external buffer program is to modify your processes to perform internal buffering. That might be an advantage because it removes a whole set of pipes (to an external buffering program) from the mix, and it lightens the process load on the machine. It does mean modifying your processing program, though, so perhaps it would be better to start with external buffering to see how well that does.
If the problem is that your 40 streams each have a high data rate, and your RAID controller cannot write to physical disk fast enough, then you need to redesign your disk system. Basically, divide it into 40 RAID-1 mirrors and write one file to each mirror set. That makes the writes sequential for each stream, but requires 80 disks.
If the data rate isn't the problem then you need to add more buffering. You might need a pair of threads. One thread to collect the data into memory buffers and another thread to write it into data files and fsync() it. To make the disk writes sequential it should fsync each output file one at a time. That should result in writing large sequential chunks of whatever your buffer size is. 8 MB maybe?
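A rough sketch of that two-thread arrangement for one output stream; the producing side is left out, and the buffer size, queue depth and output path are illustrative assumptions:

import os
import queue
import threading

BUF_SIZE = 8 * 1024 * 1024   # hand off ~8 MB at a time
OUT_PATH = "OUTPUT1"         # placeholder output file

chunks = queue.Queue(maxsize=4)   # bounded, so the collector can't run away

def collector(records):
    # Gather many tiny records into large contiguous buffers.
    buf = bytearray()
    for rec in records:
        buf += rec
        if len(buf) >= BUF_SIZE:
            chunks.put(bytes(buf))
            buf.clear()
    if buf:
        chunks.put(bytes(buf))
    chunks.put(None)              # sentinel: no more data

def writer():
    # Write each large buffer and force it out before taking the next one,
    # so the disk sees big sequential chunks.
    fd = os.open(OUT_PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            os.write(fd, chunk)
            os.fdatasync(fd)
    finally:
        os.close(fd)

def run(records):
    t = threading.Thread(target=writer)
    t.start()
    collector(records)
    t.join()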
Since writes are immediate anyway (copy to kernel buffer and return), what's the advantage of using io_submit for writes?
In fact, it (aio/io_submit) seems worse since you have to allocate the write buffers on the heap and can't use stack-based buffers.
My question is only about writes, not reads.
EDIT: I am talking about relatively small writes (a few kB at most), not MB or GB, so the buffer copy should not be a big problem.
Copying a buffer into the kernel is not necessarily instantaneous.
First the kernel needs to find a free page. If there is none (which is fairly likely under heavy disk-write pressure), it has to decide to evict one. If it decides to evict a dirty page (instead of evicting your process for instance), it will have to actually write it before it can use that page.
There is a related issue in Linux when saturating writes to a slow drive: the page cache fills up with dirty pages backed by that drive. Then, whenever the kernel needs a page, for any reason, it can take a long time to acquire one, and the whole system freezes as a result.
The size of each individual write is less relevant than the write pressure of the system. If you have a million small writes already queued up, this may be the one that has to block.
Whether the allocation lives on the stack or the heap is also less relevant. If you want efficient allocation of blocks to write, you can use a dedicated pool allocator (from the heap) and not pay for the general-purpose heap allocator.
aio_write() gets around this by not copying the buffer into the kernel at all; it may even be DMA'd straight out of your buffer (given the alignment requirements), which means you're likely to save a copy as well.
I would like to store a number of entries in a file (optimized for reading), and a good data structure for that seems to be a B+ tree. It offers O(log(n)/log(b)) access time, where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
Is it right that I only should use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or mmap() those instead? Or is there only a difference if I mmap many blocks and access only a few? What is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes, by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind is ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
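To make questions 2 and 3 concrete, here is a sketch of both approaches on a POSIX system; BLOCK_SIZE and the file name "tree.db" are illustrative assumptions:

import mmap
import os

BLOCK_SIZE = 4096
fd = os.open("tree.db", os.O_RDONLY)

def read_block(p):
    # Explicit block read: block p is copied into its own bytes object.
    return os.pread(fd, BLOCK_SIZE, p * BLOCK_SIZE)

# mmap alternative: map the whole file once; pages are pulled in through
# the page cache on demand as you touch them.
view = mmap.mmap(fd, os.fstat(fd).st_size, prot=mmap.PROT_READ)

def map_block(p):
    # Slicing a memoryview avoids copying the block at all.
    return memoryview(view)[p * BLOCK_SIZE:(p + 1) * BLOCK_SIZE]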
Filesystems which support delayed allocation don't decide where on disc new files go right away. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (for example, ReiserFS puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
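From Python, the closest portable call is os.statvfs(); its f_bsize field is the filesystem's preferred block size, and aligning direct I/O to that is a conservative choice (the path below is a placeholder):

import os

st = os.statvfs("/path/to/datafile")   # placeholder path
print("preferred block size:", st.f_bsize)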
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
In Windows, memory-mapped files work faster than the file API (ReadFile). I would guess it's the same on Linux, but you can conduct your own measurements.
I'm writing lots and lots of data that will not be read again for weeks. As my program runs, the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, while the amount of memory my app uses does not increase, and neither does the memory used by other processes.
This leads me to believe the memory is being consumed by the filesystem's cache. Since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers, so that my data is written directly to disk. I don't have dreams of improving performance or being a super ninja; my hope is to give a hint to the filesystem that I'm not going to be coming back for this data any time soon, so don't spend effort optimizing for those cases.
On Windows I've faced similar problems and fixed them using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH: the machine's memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen, but on Linux. On Windows there is the restriction of writing in sector-sized pieces; I'm happy with this restriction for the amount of gain I've measured.
Is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC flag that data and necessary metadata are
transferred. To guarantee synchronous I/O, O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, while trying to research this flag to confirm it's what you want, I found this interesting piece arguing that unbuffered I/O is a bad idea, with Linus describing it as "brain damaged". According to that, you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
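If you do go this route, here is a rough sketch of what it looks like from Python on Linux; the file name is a placeholder, and the data is staged in an anonymous mmap because O_DIRECT wants page-aligned, block-sized buffers:

import mmap
import os

BLOCK = 4096

fd = os.open("bulk.out", os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_SYNC, 0o644)

buf = mmap.mmap(-1, BLOCK)      # anonymous mapping: page-aligned memory
buf[:] = b"\0" * BLOCK          # fill the block with (padded) data
os.write(fd, buf)               # one aligned, block-sized write

buf.close()
os.close(fd)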
You can use O_DIRECT, but in that case you need to do the block I/O yourself: you must write in multiples of the FS block size and on block boundaries (it is possible that this is not strictly mandatory, but if you don't, performance will suffer enormously, because every unaligned write will need a read first).
Another much less impacting way of stopping your blocks using up the OS cache without using O_DIRECT, is to use posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or such like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea to call fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write; instead, wait until you've written a reasonable amount (50 MB, 100 MB maybe).
So, in short, after every significant chunk of writes:
Call fdatasync() followed by posix_fadvise( ... POSIX_FADV_DONTNEED).
This will flush the data to disc and immediately remove it from the OS cache, leaving space for more important things.
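A minimal sketch of that pattern; the output path and the 50 MB threshold are illustrative assumptions:

import os

FLUSH_EVERY = 50 * 1024 * 1024   # flush and drop roughly every 50 MB

def write_and_drop(path, chunks):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    pending = 0     # bytes written since the last flush
    dropped = 0     # file offset up to which pages have been discarded
    try:
        for chunk in chunks:
            os.write(fd, chunk)
            pending += len(chunk)
            if pending >= FLUSH_EVERY:
                os.fdatasync(fd)   # make the pages clean first...
                # ...then tell the kernel we won't be reading them back.
                os.posix_fadvise(fd, dropped, pending, os.POSIX_FADV_DONTNEED)
                dropped += pending
                pending = 0
    finally:
        os.close(fd)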
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.