FUSE's write sequence guarantees - Linux

Should write() implementations assume random access, or can they make some assumptions, such as that writes will always be performed sequentially, at increasing offsets?
You'll get extra points for a link to the part of a POSIX or SUS specification that describes the VFS interface.

Random, for certain. There's a reason the read and write interfaces take both a size and an offset. You'll notice there is no seek field in the fuse_operations struct: when a user program calls seek/lseek on a FUSE file, the offset in the kernel file descriptor is updated, but the FUSE filesystem isn't notified at all. Later reads and writes simply arrive with a different offset, and you should be able to handle that. If something about your implementation makes that impossible, you should probably return -EIO for the writes you can't satisfy.
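In practice this usually means the write handler just forwards the buffer to the backing store at the offset it was given. Below is a minimal sketch for a passthrough-style filesystem, assuming open() stored a real file descriptor in fi->fh; the function names and the libfuse 2.x API version are illustrative, not from the question:
#define FUSE_USE_VERSION 26   /* select the libfuse 2.x high-level API */
#include <fuse.h>
#include <unistd.h>
#include <errno.h>
/* Offsets can arrive in any order; pwrite() has no notion of a
 * "current position", so nothing special is needed to handle that. */
static int myfs_write(const char *path, const char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    ssize_t res = pwrite(fi->fh, buf, size, offset);
    if (res == -1)
        return -errno;   /* e.g. -EIO if the backing store cannot do it */
    return res;          /* number of bytes actually written */
}
static const struct fuse_operations myfs_ops = {
    .write = myfs_write,
    /* ... the rest of the operations ... */
};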

Unless there is something unusual about your FUSE filesystem that would prevent an existing file from being opened for writing, your implementation of the write operation must support writes to any offset; an application can write to any location in a file by lseek()-ing around while it is open, e.g.:
fd = open("file", O_WRONLY);
lseek(fd, 100, SEEK_SET);
write(fd, ...);
lseek(fd, 0, SEEK_SET);
write(fd, ...);

Related

Do I need to close a file before calling syncfs()?

On my embedded system, I want to make sure that the data is safely written when I close a file - if the system reports that the data was saved, the user should be able to remove power immediately.
I know that the proper way to do this is fsync(), fclose(), and fsync() on the directory (cf. this blog entry). However, it's a bit tricky to get a file descriptor for the directory in my case (I'd have to go through /proc/self/fd to recover the filename and derive the directory from there). It would be much simpler for me to just do syncfs() on the entire filesystem; I know that this is the only file that is open on the filesystem anyway.
Now my question is:
Is it sufficient to do syncfs()?
Do I need to fclose() the FILE * first (for the directory entry to be up-to-date)? Or is fflush() sufficient?
If it needs to be closed, is it useful to dup() the file descriptor before closing so I can use it directly for syncfs()?
First of all, don't mix standard library <stdio.h> calls (like fprintf(3) or fopen(3)) with system calls (like open(2), close(2) or sync(2)). The former are library routines that buffer data in in-process buffers the system knows nothing about; the latter are operating system interfaces that make the kernel responsible for the data from that point onwards. They are easy to tell apart: the former operate on FILE * descriptors, the latter on plain int descriptors.
So if you use a system call to ensure your data is properly synced to disk, it is absolutely necessary to fflush(3) your process's buffered data first, before you make the sync(2) or fsync(2) call.
No sync(2) is guaranteed to happen at fclose(3) or even close(2) time, nor in the atexit() callbacks your process runs before exit().
The operating system's buffers are write-delayed for performance reasons, and close(2) is not an event that triggers a flush. Just consider that many processes can be reading and writing the same file at the same time, and having every close(2) trigger a filesystem flush would be painful to achieve. The operating system flushes at regular intervals, on umount(2), on system shutdown, and on explicit sync(2) and fsync(2) calls.
If you need to keep the FILE *fd open, just do an fflush(fd) on that descriptor to ensure the operating system has first received all the data you fwrite(3)d or fprintf(3)ed.
So finally, if you are using <stdio.h> functions, first do an fflush() for every FILE * descriptor you have written to, or call fflush(NULL) to tell stdio to flush them all in one call. Then do the sync(2) or fsync(2) call to ensure all your data is physically on disk. No need to close anything.
FILE *fd;
...
fflush(fd);
fsync(fileno(fd));
/* Here you know that all data written to fd so far with fwrite(3)
 * (or write(2)) has been synced to disk. */
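As for the syncfs() the original question asks about: on Linux it syncs the whole filesystem containing the file referred to by any open descriptor (it needs _GNU_SOURCE and <unistd.h>), so a sketch under that assumption still flushes stdio first, but never needs to close anything:
fflush(fd);           /* push stdio's in-process buffer into the kernel */
syncfs(fileno(fd));   /* Linux-specific: sync the entire filesystem fd lives on */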
By the way, your approach of going through /dev/fd/<number> to recover the descriptor you previously had is flawed for two reasons:
Once you close your descriptor, /dev/fd/<number> is no longer the descriptor you want; normally it doesn't even exist anymore. Just try this:
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main()
{
    int fd;
    char fn[] = "/dev/fd/1";

    close(1);                  /* close standard output */
    fd = open(fn, O_RDONLY);   /* try to reopen it via /dev/fd */
    if (fd < 0) {
        fprintf(stderr, "%s: %s (errno=%d)\n", fn, strerror(errno), errno);
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
} /* main */
You cannot find the directory an open file belongs to with only the file descriptor. In a file with multiple hard links, there can be thousands of directory entries pointing to it, and nothing in the inode (or in the open file structure) records the path that was used to open the file. A common way to use temporary files is to create them and immediately unlink(2) them, so nobody can open them again; as long as you keep the file open you have access to it, but no path points to it anymore.
Enable the "sync" flag for your filesystem in /etc/fstab (the default is "async", i.e. disabled). When this flag is enabled, all changes to that filesystem are immediately flushed to disk. This makes the entire filesystem slow, but depending on your embedded system's requirements it can be a good option to consider.

Is overwriting a small file atomic on ext4?

Assume we have a file of FILE_SIZE bytes, and:
FILE_SIZE <= min(page_size, physical_block_size);
file size never changes (i.e. truncate() or append write() are never performed);
file is modified only by completely overwriting its contents using:
pwrite(fd, buf, FILE_SIZE, 0);
Is it guaranteed on ext4 that:
Such writes are atomic with respect to concurrent reads?
Such writes are transactional with respect to a system crash?
(i.e., after a crash the file's contents are entirely from some previous write, and we will never see a partial write or an empty file)
Is the second true:
with data=ordered?
with data=journal or alternatively with journaling enabled for a single file?
(using ioctl(fd, EXT4_IOC_SETFLAGS, EXT4_JOURNAL_DATA_FL))
when physical_block_size < FILE_SIZE <= page_size?
I've found a related question which links to a discussion from 2011. However:
I didn't find an explicit answer for my question 2.
I wonder, if the above is true, is it documented somewhere?
From my experiment it was not atomic.
Basically my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop, and the reader reads from the same file.
Writer Process:
char buf[][18] = {
    "xxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyy"
};
int i = 0;
while (1) {
    pwrite(fd, buf[i], 18, 0);
    i = (i + 1) % 2;
}
Reader Process
while (1) {
    pread(fd, readbuf, 18, 0);
    // check if readbuf is either buf[0] or buf[1]
}
After running both processes for a while, I could see readbuf come back as either xxxxxxxxxxxxxxxxyy or yyyyyyyyyyyyyyyyxx.
So it definitively shows that the writes are not atomic. In my case, 16-byte writes were always atomic.
The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16-byte atomicity that I saw is kernel-specific and may change in the future.
Details of the answer in the actual post:
write(2)/read(2) atomicity between processes in linux
I am familiar with filesystem theory in general, not with the implementation of ext4, so take this as an educated guess.
Yes, I believe single-sector reads and writes will be atomic, because:
The link you provided quotes: "Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call."
Disk sector (512-byte) writes are atomic according to Stephen Tweedie. In a private email conversation with him, he acknowledged that this guarantee is only as good as the hardware.
Ext filesystems overwrite data in place; there is no copy-on-write, and an overwrite needs no new allocation.
There is some effort to implement inline data, where very small files' contents fit in the inode itself. If you only need to store a few bytes, that may have an impact.
Not sure about one page, but it would make little sense in full journaling mode to send less than a page to the journal before committing.
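As a side note on the per-file journaling flag mentioned in the question: from userspace it is normally toggled through the generic inode-flags ioctls in <linux/fs.h> (the EXT4_IOC_SETFLAGS / EXT4_JOURNAL_DATA_FL names are ext4's aliases for them). A rough sketch on an already-open descriptor, with error handling mostly omitted:
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_JOURNAL_DATA_FL */
int flags;
/* read-modify-write the inode flags so existing flags are preserved */
if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
    flags |= FS_JOURNAL_DATA_FL;
    ioctl(fd, FS_IOC_SETFLAGS, &flags);
}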

(open + write) vs. (fopen + fwrite) to kernel /proc/

I have a very strange bug. If I do:
int fd = open("/proc/...", O_WRONLY);
write(fd, argv[1], strlen(argv[1]));
close(fd);
everything works, even for a very long string whose length is > 1024.
If I do:
FILE *fd = fopen("/proc/...", "wb");
fwrite(argv[1], 1, strlen(argv[1]), fd);
fclose(fd);
the string is cut off at around 1024 characters.
I'm running an ARM embedded device with a 3.4 kernel. I have debugged in the kernel and I can see that the string is already truncated by the time it reaches vfs_write, very early on (I spotted this function with a WARN_ON instruction to get the stack trace).
The problem is the same with fputs vs. puts.
I can use fwrite for a very long string (> 1024) if I write to a regular rootfs file, so the problem is really linked to how the kernel handles /proc.
Any idea what's going on?
The problem is probably with buffering.
The issue is that special files, such as those under /proc, are, well... special. They are not always simple streams of bytes, and may have to be written to (or read from) with specific sizes and/or offsets. You don't say which file you are writing to, so it's impossible to be sure.
Then, fwrite() assumes the output is a plain stream of bytes, so it does clever things such as buffering, and splitting and copying the data it is given; a string longer than the stdio buffer reaches the kernel as several separate write() calls. On a regular file that just works, but on a special file, odd things may happen.
Just to be sure, run strace on both versions of your program and compare the outputs. If you wish, post them for additional comments.
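One way to test that hypothesis is to turn off stdio buffering, so the whole string reaches the kernel in a single write(2), just as the plain open()/write() version does (a sketch; the elided /proc path is kept as in the question and error checking is omitted):
FILE *f = fopen("/proc/...", "wb");
setvbuf(f, NULL, _IONBF, 0);   /* unbuffered: fwrite() goes straight to write(2) */
fwrite(argv[1], 1, strlen(argv[1]), f);
fclose(f);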

In general, on uClinux, is ioctl faster than writing to the /sys filesystem?

I have an embedded system I'm working with, and it currently uses sysfs to control certain features.
However, there is a function that we would like to speed up, if possible.
I discovered that this subsystem also supports an ioctl interface, but before rewriting the code, I decided to research which interface is generally faster (on uClinux): sysfs or ioctl.
Does anybody understand both implementations well enough to give me a rough idea of the difference in overhead for each? I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls". Or "they are roughly the same because sysfs has a very simple interface".
Update 10/24/2013:
The specific case I'm currently doing is as follows:
int fd = open("/sys/power/state", O_WRONLY);
write(fd, "standby", 7);
close(fd);
In kernel/power/main.c, the code that handles this write looks like:
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
                           const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
    suspend_state_t state = PM_SUSPEND_STANDBY;
    const char * const *s;
#endif
    char *p;
    int len;
    int error = -EINVAL;

    p = memchr(buf, '\n', n);
    len = p ? p - buf : n;

    /* First, check if we are requested to hibernate */
    if (len == 7 && !strncmp(buf, "standby", len)) {
        error = enter_standby();
        goto Exit;
((( snip )))
Can this be sped up by moving to a custom ioctl() where the code to handle the ioctl call looks something like:
case SNAPSHOT_STANDBY:
    if (!data->frozen) {
        error = -EPERM;
        break;
    }
    error = enter_standby();
    break;
(so the ioctl() calls the same low-level function that the sysfs function did).
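For comparison, the userspace side of that custom ioctl would look roughly like this; SNAPSHOT_STANDBY, its request number and the device node are hypothetical, matching the handler sketched above rather than any existing kernel interface:
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#define SNAPSHOT_STANDBY  _IO('3', 0x42)   /* made-up request code for the sketch */
int fd = open("/dev/snapshot", O_WRONLY);  /* placeholder device node */
if (fd >= 0) {
    ioctl(fd, SNAPSHOT_STANDBY);           /* ends up in the case above */
    close(fd);
}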
If by sysfs you mean the sysfs() library call, notice this in man 2 sysfs:
NOTES
    This System-V derived system call is obsolete; don't use it. On systems
    with /proc, the same information can be obtained via /proc/filesystems;
    use that interface instead.
I can't recall noticing anything that had both an ioctl() and a sysfs interface, but they probably exist. I'd use the proc or sys handle anyway, since that tends to be less cryptic and more flexible.
If by sysfs you mean accessing files in /sys, that's the preferred method.
I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls".
Accessing procfs or sysfs files does not entail an I/O bottleneck, because they are not real files: they are kernel interfaces. So no, accessing this stuff through "the file layer" does not affect performance. This is a not-uncommon misconception in Linux systems programming, I think. Programmers can be squeamish about system calls that aren't, well, system calls, and paranoid that opening a file will somehow be slower. Of course, file I/O in the ABI is just system calls anyway; what makes a normal (disk) file read slow is not the calls to open, read and write, it's the hardware bottleneck.
I always use the low-level descriptor-based functions (open(), read()) instead of high-level streams when doing this, because at some point some experience led me to believe they were more reliable for this specific purpose (reading from /proc). I can't say whether that's definitively true.
The question was interesting, so I built a couple of modules, one for ioctl and one for sysfs; the ioctl one implements only a 4-byte copy_from_user and nothing more, and the sysfs one has nothing in its write interface.
Then I ran a couple of userspace tests of up to 1 million iterations each; here are the results:
time ./sysfs /sys/kernel/kobject_example/bar

real    0m0.427s
user    0m0.056s
sys     0m0.368s

time ./ioctl /run/temp

real    0m0.236s
user    0m0.060s
sys     0m0.172s
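The userspace side of such a micro-benchmark can be as simple as the loop below (a sketch: the path comes from argv, the payload is the 4 bytes the test module expects, and the ioctl variant would only swap the write() in the loop for an ioctl() carrying the same 4 bytes):
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_WRONLY);   /* sysfs attribute or device node */
    if (fd < 0)
        exit(EXIT_FAILURE);
    for (long i = 0; i < 1000000; i++)  /* 1 million iterations */
        write(fd, "ping", 4);           /* 4-byte payload */
    close(fd);
    return 0;
}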
Edit:
I agree with @goldilocks' answer: the hardware is the real bottleneck, and in a Linux environment with a well-written driver, choosing ioctl or sysfs doesn't make a big difference. But if you are using uClinux, on your hardware even a few CPU cycles can make a difference.
The test I did is for Linux, not uClinux, and it was never meant to be an absolute reference for profiling the two interfaces. My point is that you can write a book about how fast one or the other is, but only testing will tell you; it took me a few minutes to set this up.

Are file descriptors for Linux sockets always in increasing order?

I have a socket server in C on Linux. Each time I create a new socket it is assigned a file descriptor. I want to use these FDs as unique IDs for each client. If they are guaranteed to always be assigned in increasing order (which is the case on the Ubuntu system I am running), then I could just use them as array indices.
So the question: are the file descriptors assigned to Linux sockets guaranteed to always be in increasing order?
Let's look at how this works internally (I'm using kernel 4.1.20). File descriptors in Linux are allocated with __alloc_fd. When you make an open syscall, do_sys_open is called. This routine gets a free file descriptor from get_unused_fd_flags:
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    ...
    fd = get_unused_fd_flags(flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
get_unused_fd_flags calls __alloc_fd, setting the minimum and maximum fd:
int get_unused_fd_flags(unsigned flags)
{
    return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
}
__alloc_fd gets the file descriptor table for the process and takes the fd from next_fd, which was actually set the previous time it ran:
int __alloc_fd(struct files_struct *files,
               unsigned start, unsigned end, unsigned flags)
{
    ...
    fd = files->next_fd;
    ...
    if (start <= files->next_fd)
        files->next_fd = fd + 1;
So you can see how file descriptors do indeed grow monotonically... up to a certain point. When the fd reaches the maximum, __alloc_fd will try to find the smallest unused file descriptor:
    if (fd < fdt->max_fds)
        fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
At this point the file descriptors will no longer grow monotonically; instead they will jump around as the kernel looks for free file descriptors. After this, if the table gets full, it will be expanded:
    error = expand_files(files, fd);
At which point they will grow again monotonically.
Hope this helps
FDs are guaranteed to be unique for the lifetime of the socket. So yes, in theory, you could probably use the FD as an index into an array of clients. However, I'd caution against this for at least a couple of reasons:
As has already been said, there is no guarantee that FDs will be allocated monotonically. accept() would be within its rights to return a highly numbered FD, which would then make your array inefficient. So the short answer to your question is: no, they are not guaranteed to be monotonic.
Your server is likely to end up with lots of other open FDs (stdin, stdout and stderr to name but three), so again, your array is wasting space.
I'd recommend some other way of mapping from FDs to clients. Indeed, unless you're going to be dealing with thousands of clients, searching through a list of clients should be fine; it's not really an operation you should need to perform very often.
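One simple shape for that mapping (a sketch; MAX_CLIENTS and the struct fields are illustrative) is to keep the fd inside each client record and look it up with a linear search:
#include <stddef.h>
#define MAX_CLIENTS 64   /* illustrative limit */
struct client {
    int fd;
    /* ... per-client state ... */
};
static struct client clients[MAX_CLIENTS];
static size_t nclients;
/* Linear search is fine unless you really have thousands of clients. */
static struct client *client_by_fd(int fd)
{
    for (size_t i = 0; i < nclients; i++)
        if (clients[i].fd == fd)
            return &clients[i];
    return NULL;
}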
Do not depend on the monotonicity of file descriptors. Always refer to the remote system via an address:port pair.
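It is easy to see the non-monotonic behaviour from userspace: closing a descriptor frees its number for the next open()/accept() to reuse. A quick sketch (/dev/null is just a convenient always-openable path):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    int a = open("/dev/null", O_RDONLY);   /* e.g. 3 */
    int b = open("/dev/null", O_RDONLY);   /* e.g. 4 */
    close(a);                              /* free the lower number */
    int c = open("/dev/null", O_RDONLY);   /* gets 3 again, not 5 */
    printf("a=%d b=%d c=%d\n", a, b, c);
    close(b);
    close(c);
    return 0;
}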
