Is overwriting a small file atomic on ext4? - linux

Assume we have a file of FILE_SIZE bytes, and:
FILE_SIZE <= min(page_size, physical_block_size);
file size never changes (i.e. truncate() or append write() are never performed);
file is modified only by completely overwriting its contents using:
pwrite(fd, buf, FILE_SIZE, 0);
Is it guaranteed on ext4 that:
Such writes are atomic with respect to concurrent reads?
Such writes are transactional with respect to a system crash?
(i.e., after a crash the file's contents come entirely from some previous write, and we will never see a partial write or an empty file)
Is the second guarantee true:
with data=ordered?
with data=journal or alternatively with journaling enabled for a single file?
(using ioctl(fd, EXT4_IOC_SETFLAGS, EXT4_JOURNAL_DATA_FL))
when physical_block_size < FILE_SIZE <= page_size?
I've found a related question which links to a discussion from 2011. However:
I didn't find an explicit answer to my question 2.
I wonder, if the above is true, is it documented somewhere?
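For reference, a minimal sketch of setting that per-file flag through the inode-flags ioctl, assuming the userspace names FS_IOC_GETFLAGS / FS_IOC_SETFLAGS and FS_JOURNAL_DATA_FL from <linux/fs.h> (they correspond to the ext4-internal names above, and this is what chattr +j does):
/* Sketch: enable per-file data journaling on an ext4 file descriptor. */
#include <sys/ioctl.h>
#include <linux/fs.h>

int enable_data_journaling(int fd)
{
    int flags;

    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == -1)
        return -1;
    flags |= FS_JOURNAL_DATA_FL;        /* same bit as EXT4_JOURNAL_DATA_FL */
    return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}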

From my experiment it was not atomic.
Basically my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop and the reader reads from the same file.
Writer Process:
char buf[][18] = {
    "xxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyy"
};
int i = 0;
while (1) {
    /* overwrite the same 18 bytes at offset 0 on every iteration */
    pwrite(fd, buf[i], 18, 0);
    i = (i + 1) % 2;
}
Reader Process:
while (1) {
    pread(fd, readbuf, 18, 0);
    /* check whether readbuf matches buf[0] or buf[1] */
}
After running both processes for a while, I could see that readbuf sometimes contained xxxxxxxxxxxxxxxxyy or yyyyyyyyyyyyyyyyxx.
So it definitively shows that the 18-byte writes are not atomic. In my case, 16-byte writes were always atomic.
The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16-byte atomicity that I saw is kernel-specific and may change in the future.
Details of the answer in the actual post:
write(2)/read(2) atomicity between processes in linux

I am familiar with filesystem theory in general, not with the implementation of ext4, so take this as an educated guess.
Yes, I believe single-sector reads and writes will be atomic, because:
The link you provided quotes: "Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call."
Disk sector (512-byte) writes are atomic according to Stephen Tweedie. In a private email conversation with him, he acknowledged that this guarantee is only as good as the hardware.
Ext filesystems overwrite data in place; there is no copy-on-write and no new allocation.
There is some effort to implement inline data, where the data of very small files can fit in the inode itself. If you only need to store a few bytes, that may have an impact.
Not sure about one page, but it would make little sense in full journaling mode to send less than a page to the journal before committing.

Related

How to monitor which files consume IOPS?

I need to understand which files consume the IOPS of my hard disk. Just using "strace" will not solve my problem: I want to know which files are really written to disk, not just to the page cache. I tried to use "systemtap", but I cannot understand how to find out which files (file names or inodes) consume my IOPS. Are there any tools that will solve my problem?
Yeah, you can definitely use SystemTap for tracing that. When an upper layer (usually the VFS subsystem) wants to issue an I/O operation, it calls the submit_bio and generic_make_request functions. Note that these don't necessarily correspond to a single physical I/O operation each; for example, writes to adjacent sectors can be merged by the I/O scheduler.
The trick is how to determine the file path name in generic_make_request. It is quite simple for reads, as this function is called in the same context as the read() call. Writes are usually asynchronous, so write() simply updates the page cache entry and marks it dirty, while submit_bio is called later by one of the writeback kernel threads, which has no information about the original calling process:
Writes can be deduced by looking at the page reference in the bio structure -- it has a mapping field pointing to a struct address_space. The struct file which corresponds to an open file contains f_mapping, which points to the same address_space instance, and the struct file also leads to the dentry holding the name of the file (the full path can be assembled with task_dentry_path).
So we need two probes: one to capture attempts to read/write a file and save its path and address_space into an associative array, and a second to capture generic_make_request calls (this is done with probe ioblock.request).
Here is an example script which counts IOPS:
// maps struct address_space to path name
global paths;
// IOPS per file
global iops;

// Capture attempts to read and write by VFS
probe kernel.function("vfs_read"),
      kernel.function("vfs_write") {
    mapping = $file->f_mapping;
    // Assemble full path name for the running task (task_current())
    // from the open file "$file" of type "struct file"
    path = task_dentry_path(task_current(), $file->f_path->dentry,
                            $file->f_path->mnt);
    paths[mapping] = path;
}

// Attach to generic_make_request()
probe ioblock.request {
    for (i = 0; i < $bio->bi_vcnt; i++) {
        // Each BIO request may have more than one page to write
        page = $bio->bi_io_vec[i]->bv_page;
        mapping = @cast(page, "struct page")->mapping;
        iops[paths[mapping], rw] <<< 1;
    }
}

// Once per second drain the iops statistics
probe timer.s(1) {
    println(ctime());
    foreach ([path+, rw] in iops) {
        printf("%3d %s %s\n", @count(iops[path, rw]),
               bio_rw_str(rw), path);
    }
    delete iops
}
This example script works for XFS, but needs to be updated to support AIO and volume managers (including btrfs). Plus, I'm not sure how it will handle metadata reads and writes, but it is a good start ;)
If you want to know more on SystemTap you can check out my book: http://myaut.github.io/dtrace-stap-book/kernel/async.html
Maybe iotop gives you a hint about which processes are doing I/O; from that, you get an idea of the related files.
iotop --only
the --only option is used to see only processes or threads actually doing I/O, instead of showing all processes or threads

Determine the number of "logical" bytes read/written in a Linux system

I would like to determine the number of bytes logically read/written by all processes via syscalls such as read() and write(). This is different from the number of bytes actually fetched from the storage layer (displayed by tools like iotop), since it includes (for example) reads that hit the page cache, and it also differs in when the writes are recognized: the logical write I/O happens immediately when the write call is issued, while the actual physical I/O may occur some time later depending on various factors (Linux usually buffers writes and does the physical I/O some time later).
I know how to do it on a per-process basis (see this question for example), but not how to get the system-wide count.
If you want to use the /proc filesystem for the total counts (and not per-second rates), it is quite easy.
This also works on quite old kernels (tested on the Debian Squeeze 2.6.32 kernel).
# cat /proc/1979/io
rchar: 111195372883082
wchar: 10424431162257
syscr: 130902776102
syscw: 6236420365
read_bytes: 2839822376960
write_bytes: 803408183296
cancelled_write_bytes: 374812672
For the system-wide count, just sum the numbers from all processes. This will only be good enough in the short term, however, because as processes die their statistics are removed from memory. You would need process accounting enabled to preserve them.
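A minimal C sketch of that summation (assuming you are allowed to read every /proc/<pid>/io, which generally requires root, and accepting that processes which have already exited are missed):
/* Walk /proc and sum the rchar/wchar counters of all live processes. */
#include <stdio.h>
#include <dirent.h>
#include <ctype.h>

int main(void)
{
    unsigned long long rchar_total = 0, wchar_total = 0, v;
    char path[64], line[128];
    struct dirent *de;
    DIR *proc = opendir("/proc");
    FILE *io;

    if (!proc)
        return 1;
    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                                   /* not a PID entry */
        snprintf(path, sizeof(path), "/proc/%s/io", de->d_name);
        io = fopen(path, "r");
        if (!io)
            continue;                                   /* permission denied or process gone */
        while (fgets(line, sizeof(line), io)) {
            if (sscanf(line, "rchar: %llu", &v) == 1)
                rchar_total += v;
            else if (sscanf(line, "wchar: %llu", &v) == 1)
                wchar_total += v;
        }
        fclose(io);
    }
    closedir(proc);
    printf("rchar total: %llu\nwchar total: %llu\n", rchar_total, wchar_total);
    return 0;
}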
The meaning of these counters is documented in the kernel source file Documentation/filesystems/proc.txt:
rchar - I/O counter: chars read
The number of bytes which this task has caused
to be read from storage. This is simply the sum of bytes which this
process passed to read() and pread(). It includes things like tty IO
and it is unaffected by whether or not actual physical disk IO was
required (the read might have been satisfied from pagecache)
wchar - I/O counter: chars written
The number of bytes which this task has
caused, or shall cause to be written to disk. Similar caveats apply
here as with rchar.
syscr - I/O counter: read syscalls
Attempt to count the number of read I/O
operations, i.e. syscalls like read() and pread().
syscw - I/O counter: write syscalls
Attempt to count the number of write I/O
operations, i.e. syscalls like write() and pwrite().
read_bytes - I/O counter: bytes read
Attempt to count the number of bytes which
this process really did cause to be fetched from the storage layer.
Done at the submit_bio() level, so it is accurate for block-backed
filesystems.
write_bytes - I/O counter: bytes written
Attempt to count the number of bytes which
this process caused to be sent to the storage layer. This is done at
page-dirtying time.
cancelled_write_bytes
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it
will have been accounted as having caused 1MB of write. In other
words: The number of bytes which this process caused to not happen, by
truncating pagecache. A task can cause "negative" IO too.
Here is a SystemTap script that tracks the logical IO. It is based on the script at https://sourceware.org/systemtap/SystemTap_Beginners_Guide/traceiosect.html
#! /usr/bin/env stap
# traceio.stp
# Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo#redhat.com>
# Copyright (C) 2009 Kai Meyer <kai#unixlords.com>
# Fixed a bug that allows this to run longer
# And added the humanreadable function
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
#
global reads, writes
probe vfs.read.return {
if ($return > 0) {
reads += $return
}
}
probe vfs.write.return {
if ($return > 0) {
writes += $return
}
}
function humanreadable(bytes) {
if (bytes > 1024*1024*1024) {
return sprintf("%d GiB", bytes/1024/1024/1024)
} else if (bytes > 1024*1024) {
return sprintf("%d MiB", bytes/1024/1024)
} else if (bytes > 1024) {
return sprintf("%d KiB", bytes/1024)
} else {
return sprintf("%d B", bytes)
}
}
probe timer.s(1) {
printf("reads: %12s writes: %12s\n", humanreadable(reads), humanreadable(writes))
# Note we don't zero out reads and writes,
# so the values are cumulative since the script started.
}

Writing out DMA buffers into memory mapped file

I need to write incoming DMA buffers as fast as possible to an HD partition used as a raw device (/dev/sda1) under embedded Linux (2.6.37). The buffers are aligned as required and are each 512 KB long. The process may continue for a very long time and fill as much as, for example, 256 GB of data.
I need to use the memory-mapped file technique (O_DIRECT is not applicable), but I can't figure out exactly how to do this.
So, in pseudo code "normal" writing:
fd = open("/dev/sda1", O_WRONLY);
while (1) {
    p = GetVirtualPointerToNewBuffer();
    if (InputStopped())
        break;
    write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx, the latest working test code looks like the following:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define KB 1024
#define TSIZE (64*KB)

void *TBuf;

int main(int argc, char **argv)
{
    int fdi = open("input.dat", O_RDONLY);
    //int fdo = open("/dev/sdb2", O_RDWR);
    int fdo = open("output.dat", O_RDWR);
    int i, offs = 0;
    void *addr;

    i = posix_memalign(&TBuf, TSIZE, TSIZE);
    if ((fdo < 1) || (fdi < 1)) {
        printf("Error in files\n");
        return -1;
    }
    while (1) {
        addr = mmap(TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
        if (addr == MAP_FAILED) {
            printf("Error MMAP=%d, %s\n", errno, strerror(errno));
            return -1;
        }
        i = read(fdi, TBuf, TSIZE);
        if (i != TSIZE) {
            printf("End of data\n");
            return 0;
        }
        i = munmap(addr, TSIZE);
        offs += TSIZE;
        sleep(1);
    }
}
UPDATE3:
1. To precisely imitate the DMA operation, I need to move the read() call before mmap(), because when the DMA finishes it provides me with the address where it has put the data. So, in pseudo code:
while (1) {
    read(fdi, TBuf, TSIZE);
    addr = mmap((void *)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fdo, offs);
    munmap(addr, TSIZE);
    offs += TSIZE;
}
This variant fails after(!) the first iteration: read() fails with "Bad address" (EFAULT) on TBuf.
Without fully understanding what I was doing, I substituted msync() for munmap(). This worked perfectly.
So the question here is: why does unmapping addr affect TBuf?
2. With the previous example working, I went to the real system with the DMA. The loop is the same, except that instead of the read() call there is a call which waits for a DMA buffer to be ready and provides its virtual address.
There are no errors and the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync a thing.
To test this, I removed the read() call from the working example - and indeed, nothing was recorded either.
So the question here is: how can I tell Linux that the mapped region contains new data and should be flushed?
Thanks a lot!!!
If I understand correctly, this makes sense if you mmap() the file (I'm not sure whether you can mmap() a raw partition/block device) and the data is written via DMA directly into this memory region.
For this to work, you need to be able to control p (where the new buffer is placed) or the address where the file is mapped. If you can't, you'll have to copy the memory contents (and lose some of the benefits of mmap).
So the pseudocode would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is only a first step, and I'm not completely sure it will work with DMA (I have no such experience).
I assume it is a 32-bit system, but even then a 1 GB mapped file size may be too big (if your RAM is smaller, you'll be swapping).
If this setup works, the next step would be to loop, mapping regions of the file at different offsets and unmapping the ones that have already been filled.
Most likely you'll need to align addr to a 4 KB boundary.
When you unmap a region, its data is synced to disk. So you'll need some testing to select an appropriate mapped-region size (while the next region is being filled by DMA, there must be enough time to unmap/write the previous one).
UPDATE:
I simply don't know what exactly happens when you fill an mmap'ed region via DMA (I'm not sure how exactly dirty pages are detected: what is done by hardware and what must be done by software).
UPDATE2: To the best of my knowledge:
DMA works the following way:
the CPU sets up the DMA transfer (the address in RAM where the transferred data should be written);
the DMA controller does the actual work, while the CPU can do its own work in parallel;
once the DMA transfer is complete, the DMA controller signals the CPU via an IRQ line (interrupt), so the CPU can handle the result.
This seems simple as long as virtual memory is not involved: DMA should work independently of the running process (of the actual VM table in use by the CPU). Yet there must be some mechanism to invalidate the CPU cache for the physical RAM pages modified by DMA (I don't know whether the CPU needs to do something or whether it is done automatically by hardware).
mmap() works the following way:
after a successful call to mmap(), the file on disk is attached to a range of process memory (most likely some data structure is filled in the OS kernel to hold this info);
I/O (reading or writing) in the mmapped range triggers a page fault, which the kernel handles by loading the appropriate blocks from the attached file;
writes to the mmapped range are tracked by hardware (I don't know exactly how: maybe writes to previously unmodified pages trigger some fault, which the kernel handles by marking those pages dirty; or maybe the marking is done entirely in hardware and the info is available to the kernel when it needs to flush modified pages to disk);
modified (dirty) pages are written to disk by the OS (as it sees fit), or the writeback can be forced via msync() or munmap().
In theory it should be possible to do DMA transfers into the mmapped range, but you need to find out how exactly pages are marked dirty (and whether you need to do something to inform the kernel which pages need to be written to disk).
UPDATE3:
Even if the pages modified by DMA are not marked dirty, you should be able to trigger the marking by rewriting (reading and then writing back the same value of) at least one value in each page (most likely every 4 KB) transferred. Just make sure this rewrite is not removed (optimised out) by the compiler.
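A small sketch of that per-page rewrite, assuming a 4 KB page size (in real code, query sysconf(_SC_PAGESIZE)); the volatile pointer keeps the compiler from optimising the self-assignment away:
/* Touch one byte per page so the kernel marks the pages dirty,
 * then force them out with msync(). */
#include <stddef.h>
#include <sys/mman.h>

#define PAGE_SZ 4096    /* assumed page size */

static void dirty_and_flush(void *addr, size_t len)
{
    volatile char *p = addr;
    size_t off;

    for (off = 0; off < len; off += PAGE_SZ)
        p[off] = p[off];              /* read and write back the same byte */

    msync(addr, len, MS_SYNC);        /* flush the dirtied pages to disk */
}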
UPDATE4:
It seems a file opened O_WRONLY can't be mmap'ed (see the question comments; my experiments confirm this too). That is a logical consequence of the mmap() mechanics described above. The same is confirmed here (with reference to the POSIX requirement that the file be readable regardless of the mapping protection flags).
Unless there is some way around it, this actually means that by using mmap() you can't avoid reading the results file (an unnecessary step in your case).
Regarding DMA transfers to the mapped range, I think it will be necessary to ensure the mapped pages are pre-allocated before the DMA starts (so there is real memory assigned to both the DMA and the mapped region). On Linux there is the MAP_POPULATE mmap flag, but from the manual it seems to work with MAP_PRIVATE mappings only (changes are not written to disk), so most likely it is unsuitable. Likely you'll have to trigger the page faults manually by accessing each mapped page; this should trigger reading of the results file.
If you still wish to use mmap and DMA together but avoid reading the results file, you'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling the faulted-in pages instead of reading them from disk).

FUSE's write sequence guarantees

Should write() implementations assume random access, or can some assumptions be made, such as that writes will only ever be performed sequentially and at increasing offsets?
You'll get extra points for a link to the part of a POSIX or SUS specification that describes the VFS interface.
Random, for certain. There's a reason why the read and write interfaces take both size and offset. You'll notice that there isn't a seek field in the fuse_operations struct; when a user program calls seek/lseek on a FUSE file, the offset in the kernel file descriptor is updated, but the FUSE fs isn't notified at all. Later reads and writes just start coming to you with a different offset, and you should be able to handle that. If something about your implementation makes it impossible, you should probably return -EIO on the writes you can't satisfy.
Unless there is something unusual about your FUSE filesystem that would prevent an existing file from being opened for write, your implementation of the write operation must support writes to any offset — an application can write to any location in a file by lseek()-ing around in the file while it's open, e.g.
fd = open("file", O_WRONLY);
lseek(fd, SEEK_SET, 100);
write(fd, ...);
lseek(fd, SEEK_SET, 0);
write(fd, ...);
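On the filesystem side, a minimal sketch of a write handler that honours arbitrary offsets (assuming the high-level libfuse API, and that a hypothetical open() handler stored a backing file descriptor in fi->fh):
/* Forward the (buf, size, offset) triple straight to the backing fd;
 * no sequential-access assumption is made. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <unistd.h>
#include <errno.h>

static int myfs_write(const char *path, const char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    ssize_t res;

    (void)path;                       /* the open file is identified by fi->fh */
    res = pwrite(fi->fh, buf, size, offset);
    if (res == -1)
        return -errno;
    return res;                       /* number of bytes written */
}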

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command in an embedded linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with the FAT specification, has to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
UN*X-style "holes" in files (aka "sparse files")
that's what you're creating on ext3 with your testcase - a file with no data blocks allocated to the first 1 GB - 1 MB of it, and a single 1 MB chunk of actually committed, zero-initialized blocks at the end.
NTFS-style "valid data length" information.
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata will keep two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).
Since the FAT specification supports neither technique, the filesystem would always have to allocate and zero-fill all "intermediate" data blocks if you skip over a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
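For comparison, the sparse-file behaviour described above can be reproduced with a few lines of C (a sketch; the file name and size are arbitrary):
/* Create a ~1 GiB file by seeking past the end and writing one byte.
 * On ext3/ext4 only the final block is allocated (a sparse file);
 * on VFAT the driver must zero-fill everything before it. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return 1;
    if (lseek(fd, (off_t)1023 * 1024 * 1024, SEEK_SET) == (off_t)-1)
        return 1;
    if (write(fd, "", 1) != 1)        /* commits a single byte at offset 1023 MiB */
        return 1;
    close(fd);
    return 0;
}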
How could the VFAT driver be modified to deal with this?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
* For moronic filesystems that do not allow holes in file.
* We may have to extend the file.
*/
int cont_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block, loff_t *bytes)
{
struct inode *inode = mapping->host;
unsigned blocksize = 1 << inode->i_blkbits;
unsigned zerofrom;
int err;
err = cont_expand_zero(file, mapping, pos, bytes);
if (err)
return err;
zerofrom = *bytes & ~PAGE_CACHE_MASK;
if (pos+len > *bytes && zerofrom & (blocksize-1)) {
*bytes |= (blocksize-1);
(*bytes)++;
}
return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy but also a pagecache thrasher (your log messages are a consequence of the call to cont_expand_zero(), which does all the work and is not asynchronous). If the filesystem were to split the two operations - one task to do the "real" write, and another one to do the zero filling - it would appear snappier.
The way this could be achieved while still using the default Linux filesystem utility interfaces would be to internally create two "virtual" files - one for the to-be-zerofilled area, and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to use a direct-I/O write for the zerofilling, in order to avoid thrashing the pagecache.
Note: While all this is technically possible for sure, the question is how worthwhile such a change would be. Who needs this operation all the time? What would the side effects be?
The existing (simple) code is perfectly acceptable for smaller skipping writes; you won't really notice its presence if you create a 1 MB file and write a single byte at the end. It'll bite you only if you go for file sizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
freshly format (e.g.) a SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional)
pre-populate the files, or
put a loopback filesystem image into them
In one of the cases I've worked on we folded the first two - i.e. modified mkdosfs to pre-allocate/pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It also has the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway or doing ext3-in-file-on-FAT (a pretty common thing - Linux appliance, SD card for data exchange with a Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever else). If your use case fits this (i.e. formatting the card is a useful/normal step of the "initialize it for use" process anyway), then try it out; a suitably modified mkdosfs is part of TomTom's dosfsutils sources - see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux with FAT, this will do essentially the same as a manual dd ..., i.e. wait for the zerofill. But the specification of the function doesn't mandate that it be synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low priority in the background, or only done if explicitly requested via fdatasync() / sync(), so that the app can e.g. allocate the blocks and then write the contents with non-zeroes itself ...). That's just technique / design though; I'm not aware of anyone having made that kernel modification yet, even just for experimenting.
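For completeness, a minimal sketch of using the posix_fallocate() call discussed above (the path and size are placeholders); on filesystems whose drivers support the fallocate() syscall (e.g. ext4) it returns quickly, while on VFAT, as noted, it currently ends up waiting for the zero-fill:
/* Preallocate `size` bytes for a file; returns 0 on success or an errno value. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int preallocate(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    int err;

    if (fd == -1)
        return errno;
    err = posix_fallocate(fd, 0, size);   /* 0 on success, else an errno value */
    close(fd);
    return err;
}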
