use aio_write() but still see data going through cache? - linux

I'm playing with this code on Linux 2.6.16.46:
io.aio_fildes = open(name, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC, 00300);
io.aio_buf = buffer;
io.aio_nbytes = size;
io.aio_sigevent = sigev;
io.aio_lio_opcode = LIO_WRITE;
aio_write( &io );
This should use the memory pointed by buffer for the IO operation. Still, I see the number of dirty pages go up as if I was writing to cache. Why is that?
On the build machine, there's no O_DIRECT support in open(). But since I'm not using write(), should that still be a problem?
I'm pretty sure there's direct IO support on the target.

figured this out. Direct/buffered IO is one thing, sync/async is another. To have async writes avoid cache one still needs to give O_DIRECT to the open() call, even if write() is not used.
There will likely be compiler errors at first - read man 2 open carefully.

Related

Is overwriting a small file atomic on ext4?

Assume we have a file of FILE_SIZE bytes, and:
FILE_SIZE <= min(page_size, physical_block_size);
file size never changes (i.e. truncate() or append write() are never performed);
file is modified only by completly overwriting its contents using:
pwrite(fd, buf, FILE_SIZE, 0);
Is it guaranteed on ext4 that:
Such writes are atomic with respect to concurrent reads?
Such writes are transactional with respect to a system crash?
(i.e., after a crash the file's contents is completely from some previous write and we'll never see a partial write or empty file)
Is the second true:
with data=ordered?
with data=journal or alternatively with journaling enabled for a single file?
(using ioctl(fd, EXT4_IOC_SETFLAGS, EXT4_JOURNAL_DATA_FL))
when physical_block_size < FILE_SIZE <= page_size?
I've found related question which links discussion from 2011. However:
I didn't find an explicit answer for my question 2.
I wonder, if the above is true, is it documented somewhere?
From my experiment it was not atomic.
Basically my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop and reader reads from the file
Writer Process:
char buf[][18] = {
"xxxxxxxxxxxxxxxx",
"yyyyyyyyyyyyyyyy"
};
i = 0;
while (1) {
pwrite(fd, buf[i], 18, 0);
i = (i + 1) % 2;
}
Reader Process
while(1) {
pread(fd, readbuf, 18, 0);
//check if readbuf is either buf[0] or buf[1]
}
After a while of running both processes, I could see that the readbuf is either xxxxxxxxxxxxxxxxyy or yyyyyyyyyyyyyyyyxx.
So it definitively shows that the writes are not atomic. In my case 16byte writes were always atomic.
The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16 byte atomicity that I saw was kernel specific and may/can change in future.
Details of the answer in the actual post:
write(2)/read(2) atomicity between processes in linux
I am familiar with theory about filesystems in general, not with implementation of Ext4. Take this as educated guess.
Yes, I believe one sector reads and writes will be atomic because
Link you provided quotes "Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call. "
Disk sector (512 bytes) writes are atomic according to Stephen Tweedie. In private email conversation with him, he acknowledged that this guarantee is only as good as the hardware.
Ext filesystems overwrite data in place, no copy on write. No allocation.
There is some effort to implement inline data, very small files data can fit in the inode itself. If you only need to store few bytes, that may have impact.
Not sure about one page, but it would make little sense in full journaling mode to send less than a page to the journal before commiting.

(open + write) vs. (fopen + fwrite) to kernel /proc/

I have a very strange bug. If I do:
int fd = open("/proc/...", O_WRONLY);
write(fd, argv[1], strlen(argv[1]));
close(fd);
everything is working including for a very long string which length > 1024.
If I do:
FILE *fd = fopen("/proc/...", "wb");
fwrite(argv[1], 1, strlen(argv[1]), fd);
fclose(fd);
the string is cut around 1024 characters.
I'm running an ARM embedded device with a 3.4 kernel. I have debugged in the kernel and I see that the string is already cut when I reach the very early function vfs_write (I spotted this function with a WARN_ON instruction to get the stack).
The problem is the same with fputs vs. puts.
I can use fwrite for a very long string (>1024) if I write to a standard rootfs file. So the problem is really linked how the kernel handles /proc.
Any idea what's going on?
Probably the problem is with buffers.
The issue is that special files, such as those at /proc are, well..., special, they are not always simple stream of bytes, and have to be written to (or read from) with specific sizes and or offsets. You do not say what file you are writing to, so it is impossible to be sure.
Then, the call to fwrite() assumes that the output fd is a simple stream of bytes, so it does smart fancy things, such as buffering and splicing and copying the given data. In a regular file it will just work, but in a special file, funny things may happen.
Just to be sure, try to run strace with both versions of your program and compare the outputs. If you wish, post them for additional comments.

In general, on ucLinux, is ioctl faster than writing to /sys filesystem?

I have an embedded system I'm working with, and it currently uses the sysfs to control certain features.
However, there is function that we would like to speed up, if possible.
I discovered that this subsystem also supports and ioctl interface, but before rewriting the code, I decided to search to see which is a faster interface (on ucLinux) in general: sysfs or ioctl.
Does anybody understand both implementations well enough to give me a rough idea of the difference in overhead for each? I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls". Or "they are roughly the same because sysfs has a very simple interface".
Update 10/24/2013:
The specific case I'm currently doing is as follows:
int fd = open("/sys/power/state",O_WRONLY);
write( fd, "standby", 7 );
close( fd );
In kernel/power/main.c, the code that handles this write looks like:
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
suspend_state_t state = PM_SUSPEND_STANDBY;
const char * const *s;
#endif
char *p;
int len;
int error = -EINVAL;
p = memchr(buf, '\n', n);
len = p ? p - buf : n;
/* First, check if we are requested to hibernate */
if (len == 7 && !strncmp(buf, "standby", len)) {
error = enter_standby();
goto Exit;
((( snip )))
Can this be sped up by moving to a custom ioctl() where the code to handle the ioctl call looks something like:
case SNAPSHOT_STANDBY:
if (!data->frozen) {
error = -EPERM;
break;
}
error = enter_standby();
break;
(so the ioctl() calls the same low-level function that the sysfs function did).
If by sysfs you mean the sysfs() library call, notice this in man 2 sysfs:
NOTES
This System-V derived system call is obsolete; don't use it. On systems with /proc, the same information can be obtained via
/proc/filesystems; use that interface instead.
I can't recall noticing stuff that had an ioctl() and a sysfs interface, but probably they exist. I'd use the proc or sys handle anyway, since that tends to be less cryptic and more flexible.
If by sysfs you mean accessing files in /sys, that's the preferred method.
I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls".
Accessing procfs or sysfs files does not entail an I/O bottleneck because they are not real files -- they are kernel interfaces. So no, accessing this stuff through "the file layer" does not affect performance. This is a not uncommon misconception in linux systems programming, I think. Programmers can be squeamish about system calls that aren't well, system calls, and paranoid that opening a file will be somehow slower. Of course, file I/O in the ABI is just system calls anyway. What makes a normal (disk) file read slow is not the calls to open, read, write, whatever, it's the hardware bottleneck.
I always use low level descriptor based functions (open(), read()) instead of high level streams when doing this because at some point some experience led me to believe they were more reliable for this specifically (reading from /proc). I can't say whether that's definitively true.
So, the question was interesting, I built a couple of modules, one for ioctl and one for sysfs, the ioctl implementing only a 4 bytes copy_from_user and nothing more, and the sysfs having nothing in its write interface.
Then a couple of userspace test up to 1 million iterations, here the results:
time ./sysfs /sys/kernel/kobject_example/bar
real 0m0.427s
user 0m0.056s
sys 0m0.368s
time ./ioctl /run/temp
real 0m0.236s
user 0m0.060s
sys 0m0.172s
edit
I agree with #goldilocks answer, HW is the real bottleneck, in a Linux environment with a well written driver choosing ioctl or sysfs doesn't make a big difference, but if you are using uClinux probably in your HW even few cpu cycles can make a difference.
The test I've done is for Linux not uClinux and it never wanted to be an absolute reference profiling the two interfaces, my point is that you can write a book about how fast is one or another but only testing will let you know, took me few minutes to setup the thing.

Testing whether memory is accessible in Linux

Given an untrusted memory address, is there a way in Linux to test whether it points to valid, accessible memory?
For example, in mach you can use vm_read_overwrite() to attempt to copy data from the specified location. If the address is invalid or inaccessible, it will return an error code rather than crashing the process.
write from that memory (into /dev/null, for example (EDIT: with /dev/null it might not work as expected, use a pipe)), and you'll receive EFAULT error if the address is unaccessible.
I have no idea how to test for writable memory without destroying its content if it is writable.
This a typical case of TOCTOU - you check at some point that the memory is writeable, then later on you try to write to it, and somehow (e.g. because the application deallocated it), the memory is no longer accessible.
There is only one valid way to actually do this, and that is, trap the fault you get from writing to it when you actually need to use it.
Of course, you can use tricks to try to figure out if the memory "may be writeable", but there is no way you can actually ensure it is writeable.
You may want to explain slightly more what you are actually trying to do, and maybe we can have some better ideas if you are more specific.
You can try msync:
int page_size = getpagesize();
void *aligned = (void *)((uintptr_t)p & ~(page_size - 1));
if (msync(aligned, page_size, MS_ASYNC) == -1 && errno == ENOMEM) {
// Non-accessibe
}
But this function may be slow and should not be used in performance critical circumstance.

How to get Linux kernel modules list from user space C code?

I want to get the list of kernel modules by C code, and later on print their version.
From a script this is simple:
cat /proc/modules
lsmod
and later on, run for all drivers found:
modinfo driver_name
From C code, I can open /proc/modules, and analyze the data there, but is there a simpler way of reading this drivers list?
From C code, I can open /proc/modules, and analyze the data there, but is there a simpler way of reading this drivers list?
Depends on your definition of simple. The concept in Unix land of everything being a file does make everything simpler in one respect, because:
int fd = open("/proc/modules" | O_RDONLY);
while ( read(fd, &buffer, BUFFER_LIMIT) )
{
// parse buffer
}
close(fd);
involves the same set of function calls as opening and reading any file.
The alternative mechanism would be for the kernel to allocate some memory in your process' address space pointing to that information (and you could probably do this with a custom system call) but there's really no need - as you've seen, this way works very well not just with C, but with scripts also.

Resources