Recovering data after process restart - Linux

I have this requirement on an x86-based Linux system running a 2.6.3x kernel.
My process has some dynamic data (not much, in the few-megabyte range) that has to be recovered if the process crashes. The obvious solution is to store the data in shared memory and read it back when the process restarts. Writes to the shared memory have to be done carefully, so that a process crash in the middle of an update won't leave the data in shared memory corrupted.
Before coding this myself, I just wanted to check whether there is any open source program/library that provides this functionality. Thanks.
-Santhosh.

I don't think your proposed design is sound. An OS crash (e.g. a power failure) may cause an mmap'd area to be only partially synced to disc (the pages may be written out in a different order than you wrote them, etc.), which means your data structures will get corrupted in arbitrary ways.
If you need your database changes to be durable and atomic (maybe consistency and integrity wouldn't hurt either, right?) then I'd strongly recommend using an existing database system which supports ACID, or the appropriate subset. Maybe SQLite or Berkeley DB would do the trick.
You could do it yourself, in principle, but not in the way you've described: you'd need to create some kind of log file which is updated in a way that can be read back atomically, and be able to "replay" events from some known snapshot, which is technically challenging (see the sketch at the end of this answer).
Remember that:
- An OS failure might cause a write initiated by msync() or similar to be only partially completed to durable disc.
- mmap does not guarantee never to write back data at other times, i.e. when you haven't called msync() for a while.
- Pages aren't necessarily written back in the same order that you modified them in memory - e.g. you can write to a[0] and then a[4096], and after a crash have a[4096] durable but a[0] not.
- Even the flush of an individual page is not absolutely guaranteed to be atomic.
I realise that using a library (e.g. BDB or SQLite) for every read or write operation on your data structure is an intrusive change, but if you want this kind of robustness, I think it's necessary.
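If you do attempt the log-file route, the usual shape is an append-only file of records that each carry their own length and checksum, so that a torn write at the tail can be detected and discarded on replay. A minimal sketch, assuming a made-up record format (length, checksum, payload) and a toy checksum:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <unistd.h>

// Hypothetical record format: [u32 length][u32 checksum][payload].
// A record counts only if header and payload are fully present and
// the checksum matches; a torn tail simply fails these checks.
static uint32_t checksum(const void* p, size_t n) {
    // FNV-1a, as a stand-in; use a real CRC32 in practice.
    uint32_t h = 2166136261u;
    const unsigned char* b = static_cast<const unsigned char*>(p);
    for (size_t i = 0; i < n; ++i) { h ^= b[i]; h *= 16777619u; }
    return h;
}

bool append_record(int fd, const void* payload, uint32_t len) {
    std::vector<char> rec(8 + len);
    uint32_t sum = checksum(payload, len);
    std::memcpy(rec.data(), &len, 4);
    std::memcpy(rec.data() + 4, &sum, 4);
    std::memcpy(rec.data() + 8, payload, len);
    // One write() keeps the record contiguous in the file; it is
    // considered durable only once fsync() succeeds.
    if (write(fd, rec.data(), rec.size()) != (ssize_t)rec.size())
        return false;
    return fsync(fd) == 0;
}
```

On restart you scan from the beginning, verify each record's checksum, and stop at the first short or corrupt record; everything before that point is a consistent prefix of the history.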

Related

Does a Hazelcast cache read have the potential to block a write to the same cache

We are using a Hazelcast cache within our application.
We are looking at creating a read-only reporting system to display data from items within the cache.
Some of the cached items are quite large. We are concerned that a read from a new reporting app, of some of these large items, might block a write from occurring in one of our existing apps.
Looking at the documentation for Hazelcast's IMap, I can't see any mention of reads blocking writes.
I have read that a cache miss might cause data to be loaded across to the cache we are accessing. I think we are fine with that (as long as it doesn't cause write locks).
Any advice on this would be highly appreciated.
Technically, for a default-configured IMap the answer is yes, but it is unlikely to cause a problem. Every operation on a particular key is serviced by the same "partition thread", so any two operations on the same key will be serialized for a small portion of the entire operation. The partition thread only performs the local map operations on the member; the I/O is handed off to another thread, and the I/O is the part of the operation that varies with object size. So overall, operations run concurrently, with only a brief point of synchronization for operations on the same key. My suggestion would be to perform a high-concurrency test. In practice, I've never seen this be a problem.
If needed, there are a couple of options for allowing completely concurrent reads and writes. The first is to enable reading from backups, and the second is to enable a near cache on the reporting-system clients. Of course, in both cases an additional copy is involved, so a "get" may return a value that is slightly behind the current state.

Single write - single read big memory buffer sharing without locks

Let's suppose I have a big memory buffer used as a framebuffer, which is constantly written to by a thread (or even multiple threads, with the guarantee that no two threads write the same byte concurrently). These writes are nondeterministic in time, scattered through the codebase, and cannot be blocked.
I have another, single thread which periodically reads out (copies) the whole buffer to generate a display frame. This read should not be blocked either. Tearing is not a problem in my case. In other words, my only goal is that every change made by the writer thread(s) should eventually appear in the reading thread; the ordering, or some delay (negligible compared to the display refresh rate), does not matter.
Reading and writing the same memory location concurrently is a data race, which is undefined behavior in C++11, and this article lists some really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of a data race.
Still, I need some solution without completely redesigning this legacy code. Any advice on what is safe from a practical standpoint counts, independent of whether it is theoretically correct, and I am open to not-fully-portable solutions too.
Aside from the fact that I have a data race, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquiring/releasing an atomic guard variable used for nothing else), or by adding platform-specific memory-fence calls at key points in the writer thread(s).
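For illustration, the guard-variable variant I have in mind looks roughly like this (just a sketch; `frame_seq`, the buffer size, and the function names are made up):

```cpp
#include <atomic>
#include <cstring>

static char framebuffer[1 << 20];            // the shared buffer
static std::atomic<unsigned> frame_seq{0};   // guard, used for nothing else

// Writer thread(s): after a batch of plain writes to framebuffer,
// publish them with a release store on the guard.
void writer_publish() {
    frame_seq.fetch_add(1, std::memory_order_release);
}

// Reader thread: the acquire load synchronizes-with the release store,
// so all buffer writes made before that store are visible to the copy.
void reader_copy(char* dst) {
    frame_seq.load(std::memory_order_acquire);
    std::memcpy(dst, framebuffer, sizeof framebuffer);
}
```

Of course this only makes writes that happened before the release store visible; plain writes racing with the memcpy are exactly the residual problem this question is about.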
My ideas for targeting the data race:
1. Use assembly for the reading thread. I would try to avoid that.
2. Make the memory buffer volatile, thus preventing the compiler from performing the kind of nasty optimizations described in the referenced article.
3. Put the reading thread's code in a separate compilation unit and compile it with -O0.
4. Leave everything as is, and cross my fingers (as currently I do not notice issues) :)
What is the safest from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question makes a previous one, which was a little too generic, more concrete.)

How can I know when data is written to disk?

We'd like to measure the I/O time of an application by instrumenting the read() and write() routines on a Linux system. However, calls to write() return very quickly. According to my OS man page for write (man 2 write):
NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
Linux manual as of 2013-01-27
so we understand that write() merely hands the data to the kernel, which will flush it to disk asynchronously at some later point.
So the question is: is there a way to know when the data (even if it has been grouped for caching purposes) is actually being written to disk - preferably, when that process starts and ends?
EDIT1: We're particularly interested in measuring the application's behavior, and we'd like to avoid changing the semantics of the application by changing the parameters to open() (adding O_SYNC) or injecting calls to sync(). If you change the application's semantics, you can no longer tell anything about the behavior of the original application.
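For illustration, the kind of non-invasive instrumentation we mean is an LD_PRELOAD interposer around the libc calls; a rough sketch (the library name and log format are arbitrary):

```cpp
// Build: g++ -shared -fPIC -o libiotrace.so iotrace.cpp -ldl
// Run:   LD_PRELOAD=./libiotrace.so ./your_app
#include <dlfcn.h>
#include <time.h>
#include <cstdio>
#include <unistd.h>

extern "C" ssize_t write(int fd, const void* buf, size_t count) {
    using write_fn = ssize_t (*)(int, const void*, size_t);
    static write_fn real_write =
        reinterpret_cast<write_fn>(dlsym(RTLD_NEXT, "write"));
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = real_write(fd, buf, count);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    // glibc's own stdio bypasses this wrapper, so fprintf won't recurse.
    fprintf(stderr, "write(fd=%d, %zu bytes) -> %zd in %ld ns\n", fd, count, n,
            (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec));
    return n;
}
```

As the man page excerpt above implies, this measures only how long the copy into the page cache takes, but it does so without altering the application's behavior.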
You could open the file with O_SYNC, which in theory means that write() won't return until the data is written to disk. Though what is actually written - file data, or metadata too - depends on the file system and how it is mounted. This does change how your application really works, though.
If you're really interested in handling the actual I/O to storage yourself (are you a database?), then O_DIRECT leaves you in control. Again, this is a change in behaviour and imposes additional constraints on your application. It may be what you need; it may not.
You really appear to be asking about benchmarking real performance, so the real question is what you want to know. Since a real system does so much caching, the "instant" return from write() is "real" in the sense that it is what actually delays your application. If you're looking for I/O throughput, you might be better off looking at higher-level system statistics.
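For reference, O_SYNC is a one-line change at open() time (with the caveat above that every subsequent write then waits for the device); a sketch:

```cpp
#include <fcntl.h>

int open_for_sync_writes(const char* path) {
    // With O_SYNC, each write() blocks until the data (and the
    // metadata needed to retrieve it) has reached the device.
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}
```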
You basically can't know when the data is really written to disk, and the actual disk write may happen a long time (typically a few minutes) after your process has terminated. Also, the disk itself has some cache (inside the disk controller). Be happy with that, since the page cache is what makes your Linux system behave quickly.
You might consider calling the sync(2) system call, but you often should not: it can be slow, and it still doesn't guarantee that anything is written; it is often just asking the kernel to flush its buffers later.
On a given open file descriptor, you could consider fsync(2). As Joe answered, you might pass O_SYNC to open(), but that would slow down the system.
I strongly suggest (for performance reasons) trusting your kernel's page cache management and not forcing any disk flush manually. See also the related posix_fadvise(2) & madvise(2) system calls.
If you benchmark some program, run it several times and take into account what matters most to you: an average of the measured times (perhaps excluding the best and/or worst of them), or the worst or the best of them. The point is that the I/O time (or the CPU time, or the elapsed real time) of an application is something very ambiguous. You probably want to explain your benchmarking process when publishing benchmark results.
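If you nevertheless want a measurable point at which a given file's data is durable, timing an explicit fsync(2) on its descriptor is about the closest you can get (keeping in mind that this adds a flush the original application did not perform, and that the disk's own cache is still beyond your view); a rough sketch:

```cpp
#include <time.h>
#include <unistd.h>

// Returns the seconds fsync() took. By the time it returns, all data
// previously written to fd has been handed to the storage device.
double timed_fsync(int fd) {
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```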
You can refer to this link; it might help you:
Flush Data to disk
As far as when the write reaches the disk is concerned, it is unpredictable; there is no definitive way of telling. But you can make sure that the data is written to disk by calling sync.

Linux splice() + kernel AIO when writing to disk

With kernel AIO and O_DIRECT|O_SYNC, there is no copying into kernel buffers, and it is possible to get fine-grained notification when the data is actually flushed to disk. However, it requires the data to be held in user-space buffers for io_prep_pwrite().
With splice(), it is possible to move data to disk directly from kernel-space buffers (pipes) without ever having to copy it around. However, splice() returns immediately after the data is queued and does not wait for the actual writes to disk.
The goal is to move data from sockets to disk without copying it around, while getting confirmation that it has been flushed out. How can the two previous approaches be combined?
If I combine splice() with O_SYNC, I expect splice() to block, and one would have to use multiple threads to mask the latency. Alternatively, one could use the asynchronous io_prep_fsync()/io_prep_fdsync(), but these wait for all data to be flushed, not for a specific write. Neither is perfect.
What would be required is a combination of splice() and kernel AIO, allowing zero copy and asynchronous confirmation of writes, such that a single event-driven thread could move data from sockets to disk and get confirmations when required; but this doesn't seem to be supported. Is there a good workaround / alternative approach?
To get a confirmation of the writes, you can't use splice().
There's AIO stuff in userspace, but if you were doing it in the kernel it might come down to finding out which bios (block I/Os) are generated and waiting for those:
Block I/O structure:
http://www.makelinux.net/books/lkd2/ch13lev1sec3
If you want to use AIO, you will need to use io_getevents():
http://man7.org/linux/man-pages/man2/io_getevents.2.html
Here are some examples on how to perform AIO:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
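In outline, the userspace kernel-AIO path from those links looks like this; a minimal sketch assuming libaio is installed (error checking omitted, and the 4096-byte alignment required by O_DIRECT is a typical but device-dependent value):

```cpp
// Build: g++ aio_write.cpp -laio
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>

int main() {
    int fd = open("out.dat", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);

    // O_DIRECT requires buffer, offset, and length aligned to the
    // device's logical block size; 4096 is a common safe choice.
    void* buf;
    posix_memalign(&buf, 4096, 4096);
    std::memset(buf, 'x', 4096);

    io_context_t ctx = 0;
    io_setup(1, &ctx);                       // allow one in-flight request

    struct iocb cb;
    struct iocb* cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);   // 4096 bytes at offset 0
    io_submit(ctx, 1, cbs);

    // With O_DIRECT|O_SYNC, completion of this event is the per-write,
    // fine-grained confirmation discussed above.
    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, nullptr);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```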
If you do it from userspace and use msync(), it's still somewhat up in the air whether the data is actually on spinning rust yet.
msync() docs:
http://man7.org/linux/man-pages/man2/msync.2.html
You might have to soften your expectations in order to make it more robust, because it can be very expensive to be sure that the writes are physically on disk.
The 'highest' typical standard for write assurance in the face of something like power removal is a journal recording every operation that modifies the storage. The journal itself is append-only, and when you play it back you can see whether each entry is complete. The very last journal entry may not be complete, so something may still potentially be lost.

Poor performance from SQLite, big writes bring little reads to a crawl

Related question: How to use SQLite in a multi-threaded application.
I've been trying to get decent performance out of SQLite3 in a multi-threaded program. I've been very impressed with its performance, except for write latency. That's not its fault; it has to wait for the disk to spin to commit the data. But having reads blocked during those writes, even when they could be served from cache, is pretty intolerable.
My use case involves a large number of small read operations that each fetch one tiny object by an indexed field, and latency is important for these operations because there are a lot of them. Writes are large and are accumulated into a single transaction. I don't want reads to suffer huge latency due to completing writes.
I first just used a single connection with a mutex to protect it. However, while the writing thread is waiting for the transaction to complete, readers are blocked on its disk I/O because they can't acquire the mutex until the writer releases it. I tried using multiple connections, but then I get SQLITE_LOCKED from sqlite3_step(), which would mean redesigning all the reading code.
My write logic currently looks like this:
1. Acquire the connection mutex.
2. START TRANSACTION
3. Do all writes. (Typically 10 to 100 small ones.)
4. END TRANSACTION - here's where it blocks
5. Release the mutex.
Is there some solution I'm not aware of? Is there an easy way to keep my readers from having to wait for the disk when the entry is in cache, without rewriting all my reading code to handle SQLITE_LOCKED, reset, and retry?
To allow multiple readers and one writer to access the database concurrently, enable write-ahead logging.
WAL works well with small transactions, so you don't need to accumulate writes.
Please note that WAL does not work with networked file systems, and for optimal performance, requires regular checkpointing.
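Enabling WAL is a single pragma, issued once per database file (the mode persists across connections); a minimal sketch using the C API:

```cpp
#include <sqlite3.h>

// Switch the database to write-ahead logging. In WAL mode, readers
// do not block the writer and the writer does not block readers.
bool enable_wal(sqlite3* db) {
    char* err = nullptr;
    int rc = sqlite3_exec(db, "PRAGMA journal_mode=WAL;",
                          nullptr, nullptr, &err);
    if (err) sqlite3_free(err);
    return rc == SQLITE_OK;
}
```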
First of all, SQLite offers multi-threaded support on its own. You do not have to use your own mutexes; they only slow the entire program down. Consult the SQLite threading options if you have any doubts.
Using the write-ahead log may solve your problems, but it is a double-edged sword: as long as a read is ongoing, inserted data will not be written into the main database file and the WAL journal will grow. This is covered in detail in Write-Ahead Logging.
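If WAL growth becomes a problem, you can also checkpoint explicitly during quiet periods; a sketch, assuming SQLite 3.8.8+ for SQLITE_CHECKPOINT_TRUNCATE:

```cpp
#include <sqlite3.h>

// Force a checkpoint and truncate the WAL back to zero bytes.
// May block until readers currently pinning the WAL have finished.
int checkpoint_truncate(sqlite3* db) {
    int log_frames = 0, checkpointed = 0;
    return sqlite3_wal_checkpoint_v2(db, nullptr,
                                     SQLITE_CHECKPOINT_TRUNCATE,
                                     &log_frames, &checkpointed);
}
```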
I am using SQLite in WAL mode in one of my applications. For small amounts of data it works well. However, when there is a lot of data (several hundred inserts per second, in peaks even more), I experience issues that I don't seem to be able to fix by any meddling with the SQLite configuration.
What you may consider is using several database files, each assigned to a certain time span. This is applicable only if your queries depend on time.
But I am probably getting too far ahead; the WAL journal should help :)
