How to perform conditional IO in the file system? - linux

I'm trying to implement a multi-user key-value store over the file system, such as the local Linux or Windows file system, or a network-based one (SMB or NFS). My intent is to avoid the need for a server entirely, because servers require a VM, deployment, upgrades, etc., while filesystems are typically readily available.
The engine returns the timestamp of when the value was set. One operation that uses the timestamp is "put if not modified since", which is similar to compare-and-swap and supports synchronization among processes. It turns out that this is quite costly to implement without a server.
It seems that no file system supports "write if not modified" or any other form of conditional write semantics. At best I can lock a file, but then I need to read the modification date, compare it inside the process, and only then write the new content and release the lock. The minimum number of IOs to implement this is four: 1) lock the entire file; 2) read the modification date and compare locally; 3) write the new content; 4) unlock. And this ignores the IOs needed to open and close the file; file handles are pooled, so those are less frequent.
Is there any OS or filesystem facility, or algorithm that could reduce the number of IOs? Please remember that I need the solution to work over NFS or SMB...
Thanks

Filesystems already do read-ahead and write avoidance, so I/O calls only block on the disk when the data being read is not in the cache, or when the write cache is full and a flush is required. The performance problem with "write if not modified since" is the four syscalls, which can get expensive. One way to fix this would be to add a conditional-write kernel module: you would pass it the timestamp, the file name, and the data; it would perform the conditional write using internal calls and callbacks, and return the status and the new timestamp, reducing the overhead to a single syscall. Properly written, it should be filesystem-agnostic.
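For reference, here is a minimal userspace sketch (in C) of the four-step sequence from the question, using POSIX fcntl() record locks, which, unlike flock(), are generally honored over NFS. The helper name is made up and error handling is abbreviated:

#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Write `data` to `fd` only if the file has not been modified
 * since `expected_mtime`. Returns 0 on success, -1 otherwise. */
int put_if_not_modified(int fd, time_t expected_mtime,
                        const char *data, size_t len) {
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    struct stat st;
    int rc = -1;

    if (fcntl(fd, F_SETLKW, &lk) == -1)           /* 1) lock entire file    */
        return -1;
    if (fstat(fd, &st) == 0 &&                    /* 2) read mtime, compare */
        st.st_mtime == expected_mtime) {
        if (pwrite(fd, data, len, 0) == (ssize_t)len)  /* 3) write content  */
            rc = 0;
    }
    lk.l_type = F_UNLCK;                          /* 4) unlock              */
    fcntl(fd, F_SETLK, &lk);
    return rc;
}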

Related

node fs.fsync (when to use?)

I want to safely write a file, and I want to understand the proper use/place for fsync.
https://linux.die.net/man/2/fsync
After reading ^ that, I am puzzled as to where to effectively use it.
Question, do I:
const fs = require('fs');

// Write to a temp file, then rename it over the real file, then fsync.
fs.writeFile('temp/file.txt', data, 'utf-8', function (error) {
  if (error) {
    fs.unlink('temp/file.txt', function () { cb(error, undefined); });
  } else {
    fs.rename('temp/file.txt', 'real/file.txt', function () {
      // fs.fsync takes a file descriptor, not a path, so open first.
      fs.open('real/file.txt', 'r', function (err, fd) {
        fs.fsync(fd, function () {
          fs.close(fd, function () { cb(undefined, true); });
        });
      });
    });
  }
});
I'm writing something that performs many file changes. I have looked at modules that write atomic, but I would like to understand the process.
fsync is one of those functions where it's extremely rare that you'll need to use it.
All operating systems mask the fact that storage devices are slow by caching reads and writes. When you write to a file, it doesn't immediately write to the actual storage medium; it'll capture it in a cache, tell your program that the write has completed, and go and write the contents to the storage device in the background instead. The operating system will keep everything consistent though; if another application reads from that file, it'll see the new contents, as the OS will serve the contents from cache.
Note for a moment that this isn't universal; I believe Windows disables caching for removable storage devices to prevent data loss when people pull the drive out. There are also flags you can pass to open(), such as O_SYNC or O_DIRECT, to disable or bypass the cache.
For almost all use cases, you don't need to care that this happens. The only upshot for you is that your program can continue faster. There are some cases where this is problematic though:
If power is lost, the contents of the cache are lost, so the disk won't have all the new contents of the file.
If the drive is removed, writes will equally be lost. This is pretty typical for removable storage devices, and I'm pretty sure 90% of people ignore the "safely remove" prompt ;).
I think reading directly from a device (i.e. /dev/sdX in Linux) will bypass this cache, but I'm not 100% sure.
Examples of where it is needed are, say, databases. When you run an update query, the database will normally update its in-memory state, and write the mutation to a transaction log. Reliability is a good thing for a database though, so it will write to the transaction log and do an fsync on that file before responding to the user (or will have opened the transaction log as unbuffered) so there's some level of guarantee that the transaction has been persisted.
In your example, the fsync will ensure that the renamed file's contents have been flushed to disk. Strictly speaking, to guarantee that the rename itself survives a crash, you'd fsync the containing directory rather than (or in addition to) the file, since a rename is a change to the directory entry, not to the file's data.
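For comparison, here is a hedged C sketch of the full durable-rename sequence: write a temp file, fsync it, rename it into place, then fsync the containing directory so the rename itself is persisted. Paths are placeholders and error handling is abbreviated:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int write_file_atomically(const char *dir, const char *tmp,
                          const char *real, const char *data) {
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;
    write(fd, data, strlen(data));   /* write new contents to temp file */
    fsync(fd);                       /* flush file data to disk         */
    close(fd);

    if (rename(tmp, real) == -1)     /* atomically replace the old file */
        return -1;

    int dfd = open(dir, O_RDONLY);   /* fsync the directory so the      */
    if (dfd != -1) {                 /* rename itself is durable        */
        fsync(dfd);
        close(dfd);
    }
    return 0;
}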

How to update part of a file atomically?

I have a big file (several gigabytes), and I want to update a small section in it (overwrite some bytes with a new value). This must be done atomically (either the operation succeeds, or the file is left unchanged). How can I do that?
The purpose is to store progress information in a file that takes a lot of time to generate/upload (it can be on a remote file system). There will probably be times when I need to write at different locations in the file (and commit all the changes at once), but if needed I can rewrite the whole index, which is a contiguous block and relatively small compared to the rest of the file. Only one process and thread writes to the file at any given time.
Normal disks are not transactional, and don't provide atomicity guarantees.
If the underlying file system doesn't provide atomic writes (and most of them don't), then you'll need to create atomicity in your own application/data structure. This could be done via journaling (as many file systems and databases do), copy-on-write techniques, etc.
In Windows, Transactional NTFS (TxF) does exactly what you need, but your application will need to explicitly use the Win32 transactional file I/O APIs to do so.
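To make the journaling idea concrete, here is a hedged C sketch: persist a redo record in a side journal file and fsync it before applying the change in place. On recovery, a journal whose size matches its header is replayed; a shorter (torn) one is discarded. The two-file layout is an assumption for illustration:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Redo record: where to write in the big file, and how many bytes. */
struct journal_entry {
    uint64_t offset;
    uint64_t length;
};

int update_in_place(int data_fd, int journal_fd,
                    uint64_t off, const void *buf, uint64_t len) {
    struct journal_entry e = { off, len };

    /* 1) Persist the intent first: header, then payload. */
    pwrite(journal_fd, &e, sizeof e, 0);
    pwrite(journal_fd, buf, len, sizeof e);
    fsync(journal_fd);

    /* 2) Apply in place. A crash here is safe: recovery sees a
       complete journal record and replays this same pwrite. */
    pwrite(data_fd, buf, len, (off_t)off);
    fsync(data_fd);

    /* 3) Retire the journal; an empty journal means "nothing pending". */
    ftruncate(journal_fd, 0);
    fsync(journal_fd);
    return 0;
}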
I think a simple lockfile should be enough...
For example proper-lockfile:
const lockfile = require('proper-lockfile');

// Acquire the lock, do the work, then always release the lock.
lockfile.lock('some/file')
  .then(() => doStuff())
  .finally(() => lockfile.unlock('some/file'));
Note that any logic working with some/file has to respect the lockfile.

Persistent storage values handling in Linux

I have a QSPI flash on my embedded board.
I have a driver plus a process "Q" that handles reading from and writing to it.
I want to store variables like SW revisions, IP, operation time, etc.
I would like to ask for suggestions on how to handle the different access rights for reading and writing values from user space and other processes.
I was thinking of having a file for each variable. Then I can assign access rights to those files, and process Q can update the value in a file whenever it changes. So process Q would be the only writer, and other processes or users could only read.
But I am not sure about the writing side. I was thinking about using a message queue or ZeroMQ and building the software around it, but I am not sure whether that would be overkill. And I am not sure how to manage access rights that way anyway.
What would be the best approach? I would really appreciate if you could propose even totally different approach.
Thanks!
This question will probably be downvoted / flagged due to the "Please suggest an X" nature.
That said, if a file per variable is what you're after, you might want to look at implementing a FUSE file system that wraps your SPI driver/utility "Q" (or build it into "Q" if you control the source to "Q"). I'm doing this to store settings in an EEPROM on a current work project and it's turned out nicely. So I have, for example, a file that, when read, retrieves 6 bytes from EEPROM (or a cached copy) and provides a MAC address in standard hex/colon-separated notation.
The biggest advantage here is that it becomes trivial to access all your configuration/settings data from shell scripts (e.g. your init process) or other scripting languages.
Another neat feature of doing it this way is that you can use inotify (which comes "free", no extra code in the fusefs) to create applications that efficiently detect when settings are changed.
A disadvantage of this approach is that it's non-trivial to do atomic transactions on multiple settings and still maintain normal file semantics.
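For what it's worth, here is a minimal sketch of such a FUSE file system with libfuse 3, exposing a single read-only mac_address file; read_mac_address() stands in for the real EEPROM/QSPI access and is purely hypothetical. Build with gcc settingsfs.c $(pkg-config fuse3 --cflags --libs):

#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical wrapper: in the real system this would call into the
 * EEPROM/QSPI driver; here it just returns a fixed string. */
static const char *read_mac_address(void) {
    return "00:11:22:33:44:55\n";
}

static int settings_getattr(const char *path, struct stat *st,
                            struct fuse_file_info *fi) {
    (void) fi;
    memset(st, 0, sizeof *st);
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/mac_address") == 0) {
        st->st_mode = S_IFREG | 0444;      /* world-readable, no write */
        st->st_nlink = 1;
        st->st_size = strlen(read_mac_address());
    } else {
        return -ENOENT;
    }
    return 0;
}

static int settings_readdir(const char *path, void *buf,
                            fuse_fill_dir_t fill, off_t off,
                            struct fuse_file_info *fi,
                            enum fuse_readdir_flags flags) {
    (void) off; (void) fi; (void) flags;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    fill(buf, ".", NULL, 0, 0);
    fill(buf, "..", NULL, 0, 0);
    fill(buf, "mac_address", NULL, 0, 0);
    return 0;
}

static int settings_read(const char *path, char *buf, size_t size,
                         off_t off, struct fuse_file_info *fi) {
    (void) fi;
    if (strcmp(path, "/mac_address") != 0)
        return -ENOENT;
    const char *val = read_mac_address();
    size_t len = strlen(val);
    if ((size_t)off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, val + off, size);
    return (int)size;
}

static const struct fuse_operations settings_ops = {
    .getattr = settings_getattr,
    .readdir = settings_readdir,
    .read    = settings_read,
};

int main(int argc, char *argv[]) {
    return fuse_main(argc, argv, &settings_ops, NULL);
}

Once mounted (e.g. ./settingsfs /mnt/settings), shell scripts can simply cat /mnt/settings/mac_address.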

store some data in the struct inode

Hello, I am a newbie to kernel programming. I am writing a small kernel module based on the wrapfs template to implement a backup mechanism. This is purely for learning purposes.
I am extending wrapfs so that when a write call is made, wrapfs transparently makes a copy of that file in a separate directory before the write is performed on the file. But I don't want to create a copy on every write call.
A naive approach would be to check for the existence of the file in that directory, but doing that check on every call could be a severe penalty.
I could also detect the first write call and store a flag for that specific file using the private_data attribute, but that would not be persisted on disk, so I would need to check again later.
I was also thinking of making use of the modification time: save a reference time, and only create a copy if the file's modification time is older than it. I tried to use inode.i_mtime for this, but it was already modified before the write call, and applications can change that time anyway.
So I was thinking of storing some value in the on-disk inode that indicates whether a backup has been created. Is that possible? Any other suggestions or approaches are welcome.
You are essentially saying you want to do a Copy-On-Write virtual filesystem layer.
IMO, some of these have been done, and it would be easier to implement these in userland (using libfuse and the fuse module, e.g.). That way, you can be king of your castle and add your metadata in any which way you feel is appropriate:
just add (hidden) metadata files to each directory
use extended POSIX attributes (setfattr and friends; a userspace sketch follows the footnote below)
heck, you could even use a sqlite database
If you really insist on doing these things in-kernel, you'll have a lot more work, since accessing the metadata from kernel mode is going to take a lot more effort (you'd most likely want to emulate your own database using memory-mapped files, so as to minimize the amount of 'userland (style)' work required and to make it relatively easy to get atomicity and reliability right)1.
1: On How Everybody Gets File IO Wrong: see also here
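As a hedged sketch of the extended-attributes option from the list above, assuming a Linux filesystem with user xattrs enabled; the attribute name user.backup.done is invented for illustration:

#include <stdio.h>
#include <sys/xattr.h>

/* Mark a file as backed up by setting a user extended attribute. */
static int mark_backed_up(const char *path) {
    return setxattr(path, "user.backup.done", "1", 1, 0);
}

/* Returns 1 if the file is already marked, 0 otherwise. */
static int is_backed_up(const char *path) {
    char val[2];
    ssize_t n = getxattr(path, "user.backup.done", val, sizeof val);
    return n > 0 && val[0] == '1';
}

int main(void) {
    const char *f = "somefile.txt";
    if (!is_backed_up(f)) {
        /* ... copy the file to the backup directory here ... */
        mark_backed_up(f);
    }
    printf("%s backed up: %d\n", f, is_backed_up(f));
    return 0;
}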
You can use atime instead of mtime. In that case, setting the S_NOATIME flag on the inode prevents it from being updated (see the touch_atime() function in fs/inode.c). The only other thing you'll need is to mount your filesystem with the noatime option.

Possible to implement journaling with a single fsync per commit?

Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that Linux's and Mac OS's fsync and fdatasync are incorrect by default: on Mac OS, fsync does not force the drive's write cache to flush (you need fcntl(F_FULLFSYNC) for that). Windows is correct by default, but can emulate the Linux behavior for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
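A rough C sketch of that approach, assuming zlib's crc32() for the checksum; the record layout (length, payload, CRC commit marker) is invented for illustration. Pre-allocating the log keeps the file length fixed, so fdatasync() doesn't have to write the inode:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <zlib.h>          /* crc32(); link with -lz */

#define LOG_SIZE (64 * 1024 * 1024)

int open_log(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd != -1)
        posix_fallocate(fd, 0, LOG_SIZE);  /* pre-allocate blocks so     */
    return fd;                             /* appends don't change       */
}                                          /* the inode's length         */

/* Append one entry followed by a CRC commit marker, then issue a
 * single fdatasync(). On recovery, scan forward and stop at the
 * first marker whose CRC doesn't match its payload. */
int log_commit(int fd, off_t *tail, const void *payload, uint32_t len) {
    uint32_t crc = crc32(0L, payload, len);

    pwrite(fd, &len, sizeof len, *tail);
    pwrite(fd, payload, len, *tail + sizeof len);
    pwrite(fd, &crc, sizeof crc, *tail + sizeof len + len);  /* commit marker */
    *tail += sizeof len + len + sizeof crc;

    return fdatasync(fd);   /* one sync per commit */
}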
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do fewer than one sync write per commit under concurrent workloads.
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
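For illustration, a small C sketch of enforcing ordering this way; record B can only reach storage after record A because of the sync between them:

#include <unistd.h>

/* Write A, make sure it is durable, then write B. Without the
 * first fdatasync() the kernel and drive are free to reorder them. */
void write_ordered(int fd, const char *a, size_t alen,
                   const char *b, size_t blen, off_t off_a, off_t off_b) {
    pwrite(fd, a, alen, off_a);
    fdatasync(fd);      /* barrier: A is on storage before B starts */
    pwrite(fd, b, blen, off_b);
    fdatasync(fd);
}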
