How can I know when data is written to disk?

We'd like to measure the I/O time from an application by instrumenting the read() and write() routines on a Linux system. However, the calls to write() return very fast. According to my OS man page for write (man 2 write):
NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
(Linux man page, as of 2013-01-27)
So we understand that the write() call initiates an asynchronous operation that will at some point flush the data to disk.
So the question is: is there a way to know when the data (even if it has been grouped for caching purposes) is actually being written to disk? -- preferably, when that process starts and ends?
EDIT1: We're particularly interested in measuring the application's behavior, and we'd like to avoid changing the semantics of the application by changing the parameters to open() -- adding O_SYNC -- or injecting calls to sync(). By changing the application's semantics, you can no longer tell anything about the behavior of the original application.

You could open the file with O_SYNC, which in theory means that write() won't return until the data is written to disk. Though exactly what is written, data or metadata, depends on the file system and how it is mounted. This changes how your application really works, though.
If you're really interested in handling actual I/O to storage yourself (are you a database?) then O_DIRECT leaves you in control. Again, this is a change in behaviour and imposes additional constraints on your application. It may be what you need, it may not.
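For illustration, a minimal sketch (file name and sizes are made up) of timing a write() through a descriptor opened with O_SYNC, so the measured interval includes the flush to the device:

```c
/* Sketch: time a write() on an O_SYNC descriptor (hypothetical file/sizes). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    /* O_SYNC: write() returns only after the data (and the metadata needed
     * to retrieve it) reach the storage device.  O_DIRECT would additionally
     * bypass the page cache, but requires aligned buffers and offsets. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = write(fd, buf, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("wrote %zd bytes in %.3f ms\n", n, ms);
    close(fd);
    return 0;
}
```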
You really appear to be asking about benchmarking real performance, so the real question is what you want to know. Since a real system does so much caching, the "instant" return from write() is "real" in the sense of what the delays in your application actually are. If you're looking for I/O throughput, you might be better off looking at higher-level system statistics.

You basically can't know when the data is really written to disk, and the actual disk write may happen a long time (typically, a few minutes) after your process has terminated. Also, your disk itself has some cache (inside the disk controller). Be happy with that, since the page cache of your system is very effective (and makes your Linux system behave quickly).
You might consider calling the sync(2) system call, but you often should not: it can be slow, it still doesn't guarantee that anything is written, and it is often just asking the kernel to flush buffers later.
On a given opened file descriptor, you could consider fsync(2). As Joe answered, you might pass O_SYNC to open, but that would slow down the system.
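If you do add a measurement probe around a descriptor, a minimal sketch of timing fsync(2) with CLOCK_MONOTONIC (a deliberate perturbation of the application, per the EDIT above, but a smaller one than O_SYNC):

```c
/* Sketch: time how long fsync() takes on an already-written descriptor. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Returns the time fsync(fd) took, in milliseconds. */
double timed_fsync_ms(int fd)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) != 0)
        perror("fsync");
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```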
I strongly suggest (for performance reasons) trusting your kernel's page cache management and avoiding forcing any disk flush manually. See also the related posix_fadvise(2) & madvise(2) system calls.
If you benchmark a program, run it several times, and take into account what matters to you the most: an average of the measured times (perhaps excluding the best and/or worst of them), or the worst, or the best of them. The point is that the I/O time (or the CPU time, or the elapsed real time) of an application is quite ambiguous. You probably want to explain your benchmarking process when publishing benchmark results.

You can refer to this link; it might help you:
Flush Data to disk
As far as writing to disk is concerned, it is unpredictable; there is no definitive way of telling when it happens. But you can make sure that data is written to disk by calling sync.
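A minimal sketch of that call; syncfs() is a hedged extra here: it is Linux-specific (since kernel 2.6.39) and narrows the flush to the filesystem containing a given descriptor:

```c
/* Sketch: flush dirty data, system-wide or for one filesystem. */
#define _GNU_SOURCE   /* for syncfs() */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    sync();                           /* flush all filesystems */

    int fd = open(".", O_RDONLY);     /* any fd on the target filesystem */
    if (fd >= 0) {
        syncfs(fd);                   /* Linux >= 2.6.39: flush just this one */
        close(fd);
    }
    return 0;
}
```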

Related

Performance implications of using inter-process communication (IPC)

What type of usage is IPC intended for, and is it OK to send larger chunks of JSON (hundreds of characters) between processes using IPC? Should I be trying to send as tiny a message as possible using IPC, or would the performance gains from reducing message size not be worth the effort?
What type of usage is IPC intended for, and is it OK to send larger chunks of JSON (hundreds of characters) between processes using IPC?
At its core, IPC is what it says on the tin: a tool to use when you need to communicate information between processes, whatever that may be. The topic is very broad, and technically includes allocating shared memory and doing the communication manually, but given the tone of the question, and the tags, I'm assuming you're talking about the OS-provided facilities.
Wikipedia does a pretty good job discussing how IPC is used, and I don't think I can do much better, so I'll concentrate on the second question.
Should I be trying to send as tiny a message as possible using IPC, or would the performance gains from reducing message size not be worth the effort?
This smells a bit like a micro-optimization. I can't say definitively, because I'm not privy to the source code at Microsoft and Apple, and I really don't want to dig through the Linux kernel's implementation of IPC, but here are a couple of points:
IPC is a common operation, so OS designers are likely to optimize it for efficiency. There are teams of engineers that have considered the problem and figured out how to make this fast.
The bottleneck in communication across processes/threads is almost always synchronization. Delays are bad, but race conditions and deadlocks are worse. There are, however, lots of creative ways that OS designers can speed up the procedure, since the system controls the process scheduler and memory manager.
There are lots of ways to make the data transfer itself fast. For the OS, if the data needs to cross process boundaries, then there is some copying that may need to take place, but the OS copies memory all over the place all the time. Think about a command line utility, like netstat. When that executable is run, memory needs to be allocated, the process needs to be loaded from disk, and any address fixing that the OS needs to do is done, before the process can even start. This is done so quickly that you hardly even notice. On Windows netstat is about 40k, and it loads into memory almost instantly. (Notepad, another fast loader, is 10 times that size, but it still launches in a tiny amount of time.)
The big exception to #2 above is if you're talking about IPC between processes that aren't on the same computer. (Think Windows RPC) Then you're really bound by the speed of the networking/communication stack, but at that point a few kb here or there isn't going to make a whole lot of difference. (You could consider AJAX to be a form of IPC where the 'processes' are the server and your browser. Now consider how fast Google Docs operates.)
If the IPC is between processes on the same system, I don't think that it's worth a ton of effort shaving bytes from your message. Make your message easy to debug.
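If it helps to make that concrete, here is a small sketch (the JSON payload and field names are made up) of passing a readable message between a parent and child over a Unix socketpair; at these sizes the cost is dominated by the syscalls, not the bytes:

```c
/* Sketch: parent sends a small JSON message to a child over socketpair(). */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { perror("socketpair"); return 1; }

    if (fork() == 0) {                        /* child: read and print */
        char buf[512];
        close(sv[0]);
        ssize_t n = read(sv[1], buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; printf("child got: %s\n", buf); }
        return 0;
    }

    close(sv[1]);                             /* parent: send a debuggable message */
    const char *msg = "{\"cmd\":\"status\",\"id\":42}";  /* hypothetical payload */
    write(sv[0], msg, strlen(msg));
    close(sv[0]);
    wait(NULL);
    return 0;
}
```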
If the communication is happening between processes on different machines, then you may have something to think about. But having spent a lot of time debugging issues that would have been simple with a better data format, I'd say a few dozen extra milliseconds of transit time aren't worth making the data harder to parse and debug. Remember the three rules of optimization[1]:
Don't.
Don't... yet. (For experts)
Profile before you do.
[1] The first two rules are usually attributed to Michael A. Jackson (the computer scientist, not the singer).

Recovering data after process restart

I have this requirement on an x86-based Linux system running a 2.6.3x kernel.
My process has some dynamic data (not much, in the few-megabyte range) that has to be recovered if the process crashes. The obvious solution is to store the data in shared memory and read it again when the process restarts. Writes to shared memory have to be done carefully so that a process crash in the middle of an update won't leave the data corrupted in the shared memory.
Before coding this myself, I just wanted to check whether there is any open-source program/library that provides this functionality. Thanks.
-Santhosh.
I don't think your proposed design is sound. An OS crash (e.g. a power failure) may cause an mmap'd area to be partially sync'd to disc (maybe the pages are written out in a different order than you wrote them, etc.), which means your data structures will get corrupted in arbitrary ways.
If you need your database changes to be durable and atomic (maybe consistency and integrity wouldn't hurt either, right?) then I'd strongly recommend using an existing database system which supports ACID, or the appropriate subset. Maybe sqlite or Berkeley DB would do the trick.
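For a flavor of how little code the SQLite route takes, a minimal sketch (the table name and values are made up); the transaction is what buys you atomic, durable updates:

```c
/* Sketch: durable, atomic state updates via SQLite.  Build with -lsqlite3. */
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("state.db", &db) != SQLITE_OK) return 1;

    /* All-or-nothing: either both updates land on disk, or neither does. */
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS state(k TEXT PRIMARY KEY, v TEXT);", 0, 0, 0);
    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    sqlite3_exec(db, "INSERT OR REPLACE INTO state VALUES('progress','42');", 0, 0, 0);
    sqlite3_exec(db, "INSERT OR REPLACE INTO state VALUES('phase','two');", 0, 0, 0);
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);

    sqlite3_close(db);
    return 0;
}
```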
You could do it yourself, in principle, but not in the way that you've described: you'd need to create some kind of log file which was updated in a way that could be read back atomically, and be able to "replay" events from some known snapshot, which is technically challenging (see the sketch at the end of this answer).
Remember that:
An OS failure might cause a write initiated by msync() or similar to be only partially completed to durable disc
mmap does not guarantee to never write back data at other times, i.e. when you haven't called msync() for a while
Pages aren't necessarily written back in the same order that you modified the pages in memory - e.g. you can write to a[0] and then a[4096], and have a[4096] durable but a[0] not after a crash.
Even flushing an individual page is not absolutely guaranteed to be atomic.
I realise that using a library (e.g. bdb or sqlite) for every read or write operation to your data structure is an intrusive change, but if you want this kind of robustness, I think it's necessary.
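For completeness, the log-file approach described earlier in this answer would look roughly like the sketch below (the record format and checksum are made up, and this is far from a complete recovery scheme): append a checksummed, length-prefixed record and fsync() before acknowledging; on restart, replay records until the first bad checksum.

```c
/* Sketch: append one checksummed record to a write-ahead log. */
#include <stdint.h>
#include <unistd.h>

static uint32_t fnv1a(const void *p, size_t n)   /* simple checksum */
{
    const unsigned char *b = p;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *b++; h *= 16777619u; }
    return h;
}

/* fd should come from open("app.log", O_WRONLY | O_APPEND | O_CREAT, 0644).
 * A crash mid-append leaves a torn record that the checksum catches. */
int log_append(int fd, const void *data, uint32_t len)
{
    uint32_t hdr[2] = { len, fnv1a(data, len) };
    if (write(fd, hdr, sizeof(hdr)) != sizeof(hdr)) return -1;
    if (write(fd, data, len) != (ssize_t)len) return -1;
    return fsync(fd);   /* the record is durable only after this returns */
}
```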

Runtime integrity check of executed files

I just finished writing a Linux security module which verifies the integrity of executable files at the start of their execution (using digital signatures). Now I want to dig a little deeper and check the files' integrity at run time (i.e. check them periodically, since I am mostly dealing with processes that get started and run forever...) so that an attacker is not able to change the file within main memory without being identified (at least after some time).
The problem here is that I have absolutely no clue how I can check the file's current memory image. My authentication method mentioned above makes use of an mmap hook which gets called whenever a file is mmap'd before its execution, but as far as I know the LSM framework does not provide tools for periodic checks.
So my question: are there any hints on how I should start? How can I read a memory image and check its integrity?
Thank you
I understand what you're trying to do, but I'm really worried that this may be a security feature that gives you a warm fuzzy feeling for no good reason; and those are the most dangerous kinds of security features to have. (Another example of this might be the LSM sitting right next to yours, SELinux. Although I think I'm in the minority on this opinion...)
The program data of a process is not the only thing that affects its behavior. Stack overflows, where malicious code is written into the stack and jumped into, make integrity checking of the original program text moot. Not to mention the fact that an attacker can use the original unchanged program text to his advantage.
Also, there are probably some performance issues you'll run into if you are constantly computing DSA signatures inside the kernel. And you're adding that much more to the long list of privileged kernel code that could possibly be exploited later on.
In any case, to address the question: You can possibly write a kernel module that instantiates a kernel thread that, on a timer, hops through each process and checks its integrity. This can be done by using the page tables for each process, mapping in the read only pages, and integrity checking them. This may not work, though, as each memory page probably needs to have its own signature, unless you concatenate them all together somehow.
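As a rough userspace analogue of that sweep (a sketch only: the FNV checksum stands in for your DSA verification, and reading /proc/<pid>/mem requires ptrace permission), you could walk /proc/<pid>/maps and hash the read-only executable mappings:

```c
/* Sketch: hash the r-x mappings of a process via /proc (userspace analogue). */
#include <stdio.h>
#include <stdint.h>

static uint64_t fnv1a(uint64_t h, const unsigned char *p, size_t n)
{
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64], line[512];
    snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);
    FILE *maps = fopen(path, "r");
    snprintf(path, sizeof(path), "/proc/%s/mem", argv[1]);
    FILE *mem = fopen(path, "r");            /* needs ptrace permission */
    if (!maps || !mem) { perror("fopen"); return 1; }

    uint64_t h = 14695981039346656037ULL;
    while (fgets(line, sizeof(line), maps)) {
        unsigned long lo, hi;
        char perms[8];
        if (sscanf(line, "%lx-%lx %7s", &lo, &hi, perms) != 3)
            continue;
        if (perms[0] != 'r' || perms[1] != '-' || perms[2] != 'x')
            continue;                         /* text segments only: r-x */
        unsigned char buf[4096];
        for (unsigned long off = lo; off < hi; off += sizeof(buf)) {
            if (fseeko(mem, (off_t)off, SEEK_SET) != 0)
                break;
            size_t n = fread(buf, 1, sizeof(buf), mem);
            if (n == 0)
                break;
            h = fnv1a(h, buf, n);
        }
    }
    printf("text hash: %016llx\n", (unsigned long long)h);
    fclose(maps);
    fclose(mem);
    return 0;
}
```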
A good thing to note is that shared libraries only need to be integrity-checked once per sweep, since they are re-mapped across all the processes that use them. It takes sophistication to implement this, though, so maybe keep it under the "nice-to-have" section of your design.
If you disagree with my rationale that this may not be a good idea, I'd be very interested in your thoughts. I ran into this idea at work a while ago, and it would be nice to bring fresh ideas to our discussion.

The state of Linux async IO?

I ask here since googling leads you on a merry trip around archives with no hint as to what the current state is. If you go by Google, it seems that async IO was all the rage from 2001 to 2003, and by 2006 some stuff like epoll and libaio was turning up; kevent appeared but seems to have disappeared, and as far as I can tell there is still no good way to mix completion-based and readiness-based signaling, async sendfile (is that even possible?), and everything else in a single-threaded event loop.
So please tell me I'm wrong and it's all rosy! And, importantly, tell me what APIs to use.
How does Linux compare to FreeBSD and other operating systems in this regard?
AIO as such is still somewhat limited and a real pain to get started with, but it kind of works for the most part, once you've dug through it.
It has some, in my opinion, serious bugs, but those are really features. For example, when submitting a certain number of commands or a certain amount of data, your submitting thread will block. I don't remember the exact justification for this feature, but the reply I got back then was something like "yes of course, the kernel has a limit on its queue size, that is as intended". Which is acceptable if you submit a few thousand requests... obviously there has to be a limit somewhere. It might make sense from a DoS point of view, too (otherwise a malicious program could force the kernel to run out of memory by posting a billion requests). But still, it's something that you can realistically encounter with "normal" numbers (a hundred or so) and it will strike you unexpectedly, which is no good. Plus, if you only submit half a dozen or so requests and they're a bit larger (some megabytes of data) the same may happen, apparently because the kernel breaks them up into sub-requests. Which, again, kind of makes sense, but since the docs don't tell you, one would expect that it makes no difference (apart from taking longer) whether you read 500 bytes or 50 megabytes of data.
Also, there seems to be no way of doing buffered AIO, at least on any of my Debian and Ubuntu systems (although I've seen other people complain about the exact opposite, i.e. unbuffered writes in fact going via the buffers). From what I can see on my systems, AIO is only really asynchronous with buffering turned off, which is a shame (it is why I am presently using an ugly construct around memory mapping and a worker thread instead).
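For reference, a minimal sketch of the kernel AIO path being discussed, as it looks through libaio (link with -laio; the file name is made up), including the O_DIRECT open and aligned buffer that the buffering caveat above implies:

```c
/* Sketch: one async read via libaio (link with -laio). */
#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    posix_memalign(&buf, 4096, 4096);    /* O_DIRECT wants aligned buffers */

    io_context_t ctx = 0;
    io_setup(8, &ctx);                   /* kernel queue of 8 slots */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_submit(ctx, 1, cbs);              /* may block if the queue is full */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);  /* wait for the completion */
    printf("read returned %ld\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}
```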
An important issue with anything asynchronous is being able to epoll_wait() on it, which is important if you are doing anything else apart from disk IO (such as receiving network traffic). Of course there is io_getevents, but it is not so desirable/useful, as it only works for AIO and nothing else.
In recent kernels, there is support for eventfd. At first sight, it appears useless, since it is not obvious how it may be helpful in any way.
However, to your rescue, there is the undocumented function io_set_eventfd which lets you associate AIO with an eventfd, which is epoll_wait()-able. You have to dig through the headers to find out about it, but it's certainly there, and it works just fine.
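A sketch of that wiring, under the same libaio assumptions as above: attach an eventfd to each iocb with io_set_eventfd(), put the eventfd in your epoll set, and harvest completions non-blockingly when it fires:

```c
/* Sketch: make AIO completions epoll_wait()-able via an eventfd. */
#include <libaio.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <time.h>
#include <unistd.h>

/* efd = eventfd(0, 0) and epfd = epoll_create1(0), created by the caller. */
void submit_with_eventfd(io_context_t ctx, struct iocb *cb, int efd, int epfd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

    io_set_eventfd(cb, efd);          /* completion will bump the eventfd */
    struct iocb *cbs[1] = { cb };
    io_submit(ctx, 1, cbs);
}

/* Call when epoll reports the eventfd readable. */
void drain_completions(io_context_t ctx, int efd)
{
    uint64_t count;
    read(efd, &count, sizeof(count)); /* how many completions are pending */
    while (count--) {
        struct io_event ev;
        struct timespec zero = { 0, 0 };
        if (io_getevents(ctx, 0, 1, &ev, &zero) < 1)
            break;                    /* non-blocking harvest */
        /* handle ev.res / ev.data here */
    }
}
```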
Asynchronous disc IO is alive and kicking... it is actually supported and works reasonably well now, but it has significant limitations (though with enough functionality that some of the major users can usefully use it; for example, MySQL's InnoDB does in the latest version).
Asynchronous disc IO is the ability to invoke disc IO operations in a non-blocking manner (in a single thread) and wait for them to complete. This works fine; http://lse.sourceforge.net/io/aio.html has more info.
AIO does enough for a typical application (a database server) to be able to use it. AIO is a good alternative to creating lots of threads doing synchronous IO, or to using the scatter/gather of the preadv family of system calls which now exists.
It's possible to do a "shopping list" synchronous IO job using the newish preadv call, where the kernel will read a contiguous range of a file, starting at a given offset, into a bunch of separate buffers in one go. This is OK as long as you have only one range to read. (NB: an equivalent write function, pwritev, exists.)
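A quick sketch of such a shopping-list read (the buffer sizes are made up); reads at genuinely independent offsets still need one call each, or AIO:

```c
/* Sketch: scatter one contiguous file range into two buffers with preadv(). */
#define _DEFAULT_SOURCE   /* for preadv() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char header[64], body[4096];
    struct iovec iov[2] = {
        { .iov_base = header, .iov_len = sizeof(header) },
        { .iov_base = body,   .iov_len = sizeof(body)   },
    };

    /* One syscall fills both buffers, reading from file offset 0 onwards. */
    ssize_t n = preadv(fd, iov, 2, 0);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```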
poll, epoll, etc. are just fancy ways of doing select() that suffer from fewer limitations and scalability problems; they may not be able to be mixed with disc AIO easily, but in a real-world application you can probably get around this fairly trivially by using threads (some database servers tend to do these kinds of operations in separate threads anyway). poll() is good, and epoll is better, for large numbers of file descriptors; select() is OK too for small numbers of file descriptors (or, specifically, low file descriptor numbers).
(At the tail end of 2019 there's a glimmer of hope almost a decade after the original question was asked)
If you have a 5.1 or later Linux kernel you can use the io_uring interface which will hopefully usher in a better asynchronous I/O future for Linux (see one of the answers to the Stack Overflow question "Is there really no asynchronous block I/O on Linux?" for benefits io_uring provides over KAIO). Hopefully this will allow Linux to provide stiff competition to FreeBSD's asynchronous AIO without huge contortions!
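A minimal sketch via the liburing helper library (link with -luring; the file name is made up), assuming a 5.1+ kernel:

```c
/* Sketch: one async read through io_uring (link with -luring, kernel >= 5.1). */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);          /* 8-entry submission queue */

    int fd = open("testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);                    /* hand the request to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);            /* block until it completes */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```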
Most of what I've learned about asynchronous I/O in Linux was by working on the Lighttpd source. It is a single-threaded web server that handles many simultaneous connections, using what it believes is the best of whatever asynchronous I/O mechanisms are available on the running system. Take a look at the source; it supports Linux, BSD, and (I think) a few other operating systems.

Simultaneous or sequential writes -- does it matter in terms of speed?

Simultaneous or sequential write operations: does it matter in terms of speed?
With a multicore processor, does it make sense to parallelize all the file write operations using multiple threads, just to get a boost in speed? Of course, all those write operations are independent.
Generally, no.
As of now, the physical write to disk IS the bottleneck, by some orders of magnitude, and in most scenarios it is rather sequential. By parallelizing writes you have a good chance of worsening performance by incurring extra seeks. Sequential reads and writes will largely outperform interleaving in most cases.
Per-disk parallelization (TCQ and NCQ) mainly works by reducing the seeks that are naturally required when different clients concurrently request data from different sections of the disk. If you can avoid these seeks in the first place, you are better off.
In some scenarios (RAID 1, JBOD, or when different streams of data arrive rather slowly) the right scheduling can improve your throughput, but that requires intimate knowledge of the hardware at hand, and other processes not spoiling your fun.
At best, you can leave that as a decision to the end user (e.g. give an option to turn it off), and provide performance measures to guide him. (You might even prove me wrong ;))
That depends on the disks and their controller. Do they have TCQ/NCQ? Is it RAID?
If so that might make some sense. With one regular SATA disk w/o NCQ, it won't.
Write the simplest code first, and see whether that performs well enough with the target environment. (Different disks, operating system versions, CPUs, drivers etc may well affect the result significantly.)
If the simplest correct code isn't fast enough, then it makes sense to try to work out faster ways of performing IO. At a guess, it might make sense to parallelize the write operations if you're writing to different disks, but possibly not otherwise. That's only a complete guess though.
Purely by coincidence, I'm planning to benchmark a related situation soon. I have a blog post describing the tests I intend to perform, and will update the entry with a link to results when I've got some. It's not quite the same as what you're describing, but close enough to perhaps be of interest.
Technically, you can mmap a file and have multiple threads write to it, but the disk will probably still be the bottleneck.
If you need to maximize I/O throughput, a starting point would be to investigate the asynchronous I/O your environment supports.
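A sketch of that mmap-plus-threads variant (the file name and size are made up); the threads only dirty pages, and the page cache decides when the disk actually sees them:

```c
/* Sketch: two threads writing disjoint halves of an mmap'd file (-lpthread). */
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILESIZE (8 * 1024 * 1024)

static char *map;

static void *fill(void *arg)   /* arg selects which half to fill */
{
    size_t half = FILESIZE / 2;
    memset(map + (arg ? half : 0), arg ? 'b' : 'a', half);
    return NULL;
}

int main(void)
{
    int fd = open("out.dat", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, FILESIZE);
    map = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    pthread_t a, b;
    pthread_create(&a, NULL, fill, (void *)0);
    pthread_create(&b, NULL, fill, (void *)1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    msync(map, FILESIZE, MS_SYNC);   /* the disk is still the bottleneck here */
    munmap(map, FILESIZE);
    close(fd);
    return 0;
}
```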
This is a simple question, but the answer can be really, really complicated. Let's try to narrow down the scenario with some assumptions: the OS is Windows, and you have a relatively large number of writes that are truly independent.
You can skip the multi-threading by simply issuing the writes asynchronously (see the sketch after these points).
Issue them all at once - let the OS schedule the writes
It doesn't matter if the writes are to the same file or to different files. Note, this is only true if the above assumption about the writes being independent is true.
Worst case, this won't be any slower than a single plain old everyday disk on a parallel ATA controller: it will be slow.
Best case, the OS can schedule the writes very efficiently. This would be true in the case of a storage system with lots of spindles, or with a disk that supports NCQ.
The key thing to remember here is that disk I/O (in general) isn't CPU bound, so going out of your way to use multi-core won't help you; it will just make life complex.
Note, you can help things if you order the writes so they are sequential in a file (overall) or sequential on the disk by sorting them by their extent.
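Under the stated Windows assumption, a sketch of the approach (file name, sizes, and offsets are made up): two independent writes issued asynchronously with overlapped I/O, no extra threads involved:

```c
/* Sketch: two overlapped writes on Windows, scheduled by the OS (no threads). */
#include <stdio.h>
#include <windows.h>

int main(void)
{
    HANDLE h = CreateFileA("out.dat", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    static char a[65536], b[65536];
    OVERLAPPED ova = { 0 }, ovb = { 0 };
    ovb.Offset = sizeof(a);                  /* second write lands after the first */
    ova.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);
    ovb.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    /* Both calls return immediately (ERROR_IO_PENDING); the OS orders them. */
    WriteFile(h, a, sizeof(a), NULL, &ova);
    WriteFile(h, b, sizeof(b), NULL, &ovb);

    DWORD na, nb;
    GetOverlappedResult(h, &ova, &na, TRUE); /* TRUE = wait for completion */
    GetOverlappedResult(h, &ovb, &nb, TRUE);
    printf("wrote %lu + %lu bytes\n", (unsigned long)na, (unsigned long)nb);

    CloseHandle(ova.hEvent);
    CloseHandle(ovb.hEvent);
    CloseHandle(h);
    return 0;
}
```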
If you are talking about writing to one file, the answer is no. You can't parallelize writing to one file, since every process or thread has to acquire a lock for the file from the OS to do writes.
Otherwise, this depends on the hardware controllers and type of storage, the OS kernel, and the filesystem implementation.
