I need to understand which files consume the IOPS of my hard disk. Just using "strace" will not solve my problem: I want to know which files are really written to disk, not just to the page cache. I tried to use "systemtap", but I cannot figure out how to find which files (filenames or inodes) consume my IOPS. Are there any tools that will solve my problem?
Yes, you can definitely use SystemTap to trace that. When an upper layer (usually the VFS subsystem) wants to issue an I/O operation, it calls the submit_bio and generic_make_request functions. Note that these do not necessarily correspond to a single physical I/O operation; for example, writes to adjacent sectors can be merged by the I/O scheduler.
The trick is how to determine the file path name in generic_make_request. It is quite simple for reads, as this function is called in the same context as the read() call. Writes are usually asynchronous, so write() simply updates the page cache entry and marks it dirty, while submit_bio gets called by one of the writeback kernel threads, which has no information about the originally calling process:
Writes can be attributed by looking at the page reference in the bio structure -- it has a mapping field pointing to a struct address_space. A struct file that corresponds to an open file also contains f_mapping, which points to the same address_space instance, and its f_path points to the dentry holding the file's name (the full path can be built with task_dentry_path()).
So we need two probes: one to capture attempts to read or write a file and save the path and address_space into an associative array, and a second one to capture generic_make_request calls (this is done by probe ioblock.request).
Here is an example script which counts IOPS:
// maps struct address_space to path name
global paths;
// IOPS per file
global iops;

// Capture attempts to read and write by VFS
probe kernel.function("vfs_read"),
      kernel.function("vfs_write") {
    mapping = $file->f_mapping;

    // Assemble full path name for running task (task_current())
    // from open file "$file" of type "struct file"
    path = task_dentry_path(task_current(), $file->f_path->dentry,
                            $file->f_path->mnt);

    paths[mapping] = path;
}

// Attach to generic_make_request()
probe ioblock.request {
    for (i = 0; i < $bio->bi_vcnt; i++) {
        // Each BIO request may have more than one page to write
        page = $bio->bi_io_vec[i]->bv_page;
        mapping = @cast(page, "struct page")->mapping;
        iops[paths[mapping], rw] <<< 1;
    }
}

// Once per second drain iops statistics
probe timer.s(1) {
    println(ctime());
    foreach ([path+, rw] in iops) {
        printf("%3d %s %s\n", @count(iops[path, rw]),
               bio_rw_str(rw), path);
    }
    delete iops
}
This example script works for XFS, but would need to be updated to support AIO and volume managers (including btrfs). Plus, I'm not sure how it will handle metadata reads and writes, but it is a good start ;)
If you want to know more about SystemTap, you can check out my book: http://myaut.github.io/dtrace-stap-book/kernel/async.html
Maybe iotop gives you a hint about which processes are doing I/O; from that, you can get an idea of the related files.
iotop --only
The --only option is used to see only processes or threads actually doing I/O, instead of showing all processes or threads.
Assume we have a file of FILE_SIZE bytes, and:
FILE_SIZE <= min(page_size, physical_block_size);
file size never changes (i.e. truncate() or append write() are never performed);
file is modified only by completely overwriting its contents using:
pwrite(fd, buf, FILE_SIZE, 0);
Is it guaranteed on ext4 that:
Such writes are atomic with respect to concurrent reads?
Such writes are transactional with respect to a system crash?
(i.e., after a crash the file's contents is completely from some previous write and we'll never see a partial write or empty file)
Is the second guarantee true:
with data=ordered?
with data=journal or alternatively with journaling enabled for a single file?
(using ioctl(fd, EXT4_IOC_SETFLAGS, EXT4_JOURNAL_DATA_FL))
when physical_block_size < FILE_SIZE <= page_size?
I've found a related question which links to a discussion from 2011. However:
I didn't find an explicit answer for my question 2.
I wonder, if the above is true, is it documented somewhere?
From my experiment it was not atomic.
Basically my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop, and the reader reads from the same file.
Writer Process:
char buf[][18] = {
    "xxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyy"
};

int i = 0;
while (1) {
    pwrite(fd, buf[i], 18, 0);
    i = (i + 1) % 2;
}
Reader Process:
while (1) {
    pread(fd, readbuf, 18, 0);
    // check if readbuf is either buf[0] or buf[1]
}
After a while of running both processes, I could see that the readbuf is either xxxxxxxxxxxxxxxxyy or yyyyyyyyyyyyyyyyxx.
So it definitively shows that the writes are not atomic. In my case, 16-byte writes were always atomic.
The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16-byte atomicity that I saw is kernel-specific and may change in the future.
Details of the answer in the actual post:
write(2)/read(2) atomicity between processes in linux
I am familiar with theory about filesystems in general, not with implementation of Ext4. Take this as educated guess.
Yes, I believe single-sector reads and writes will be atomic, because:
The link you provided quotes: "Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call."
Disk sector (512 bytes) writes are atomic according to Stephen Tweedie. In a private email conversation with him, he acknowledged that this guarantee is only as good as the hardware.
Ext filesystems overwrite data in place -- no copy-on-write and no new allocation.
There is some effort to implement inline data, so that very small files' data can fit in the inode itself. If you only need to store a few bytes, that may have an impact.
Not sure about one page, but it would make little sense in full journaling mode to send less than a page to the journal before committing.
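For reference, the per-file data journaling mentioned in the question can be toggled through the generic inode-flags ioctl. This is only a minimal sketch, assuming an ext4 filesystem with a journal; FS_IOC_GETFLAGS / FS_IOC_SETFLAGS and FS_JOURNAL_DATA_FL come from <linux/fs.h>, and whether this makes the overwrite crash-transactional is exactly what the question asks -- the sketch only shows how to set the flag.
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-on-ext4>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);   /* the flag ioctls work on a read-only fd */
    long flags;
    if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        perror("open/FS_IOC_GETFLAGS");
        return 1;
    }
    flags |= FS_JOURNAL_DATA_FL;        /* journal the data blocks of this inode, not just metadata */
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        perror("FS_IOC_SETFLAGS");
    close(fd);
    return 0;
}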
I need to write incoming DMA buffers, in embedded Linux (2.6.37), as fast as possible to an HD partition used as a raw device, /dev/sda1. The buffers are aligned as required and are all 512 KB long. The process may continue for a very long time and fill as much as, for example, 256 GB of data.
I need to use the memory-mapped file technique (O_DIRECT is not applicable), but can't figure out exactly how to do this.
So, in pseudo code "normal" writing:
fd = open("/dev/sda1", O_WRONLY);
while (1) {
    p = GetVirtualPointerToNewBuffer();
    if (InputStopped())
        break;
    write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx, the latest working test code looks like the following:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define KB    1024
#define TSIZE (64*KB)

void* TBuf;

int main(int argc, char **argv) {
    int fdi = open("input.dat", O_RDONLY);
    //int fdo = open("/dev/sdb2", O_RDWR);
    int fdo = open("output.dat", O_RDWR);
    int i, offs = 0;
    void* addr;

    i = posix_memalign(&TBuf, TSIZE, TSIZE);
    if ((fdo < 1) || (fdi < 1)) {
        printf("Error in files\n");
        return -1;
    }
    while (1) {
        addr = mmap((void*)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
        if (addr == MAP_FAILED) {
            printf("Error MMAP=%d, %s\n", errno, strerror(errno));
            return -1;
        }
        i = read(fdi, TBuf, TSIZE);
        if (i != TSIZE) {
            printf("End of data\n");
            return 0;
        }
        i = munmap(addr, TSIZE);
        offs += TSIZE;
        sleep(1);
    }
}
UPDATE3:
1. To precisely imitate how the DMA works, I need to move the read() call before mmap(), because when the DMA finishes it provides me with the address where it has put the data. So, in pseudo code:
while (1) {
    read(fdi, TBuf, TSIZE);
    addr = mmap((void*)TBuf, TSIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fdo, offs);
    munmap(addr, TSIZE);
    offs += TSIZE;
}
This variant fails after (!) the first loop iteration: read() reports Bad address on TBuf.
Without understanding exactly what I was doing, I substituted munmap() with msync(). This worked perfectly.
So, the question here: why did unmapping addr affect TBuf?
2. With the previous example working, I went to the real system with the DMA. The loop is the same, except that instead of the read() call there is a call which waits for a DMA buffer to be ready and provides its virtual address.
There are no errors, the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync a thing.
To test this, I removed the read() call from the working example -- and indeed, nothing was recorded then either.
So, the question here: how can I tell Linux that the mapped region contains new data -- please flush it!
Thanks a lot!!!
If I understand correctly, this makes sense if you mmap() a file (I'm not sure whether you can mmap() a raw partition/block device) and the data is written via DMA directly into this memory region.
For this to work you need to be able to control p (where the new buffer is placed) or the address where the file is mapped. If you can't, you'll have to copy memory contents (and will lose some of the benefits of mmap).
So the pseudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is only a first step, and I'm not completely sure it will work with DMA (I have no such experience).
I assume it is a 32-bit system, but even then a 1 GB mapped file size may be too big (if your RAM is smaller, you'll be swapping).
If this setup works, the next step would be to make a loop that maps regions of the file at different offsets and unmaps the already-filled ones.
Most likely you'll need to align addr to a 4 KB boundary.
When you unmap a region, its data will be synced to disk, so you'll need some testing to select an appropriate mapped-region size (while the next region is being filled by DMA, there must be enough time to unmap/write the previous one).
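A rough sketch of that loop, under the assumptions above; chunk is a multiple of the page size, fd refers to the pre-sized results file, and fill_via_dma() is a hypothetical stand-in for handing the window to the DMA engine and waiting for it:
#include <stdio.h>
#include <sys/mman.h>

void fill_via_dma(void *dst, size_t len);   /* hypothetical: arrange DMA into dst and wait for completion */

static int write_all_chunks(int fd, off_t total_size, size_t chunk)
{
    for (off_t offs = 0; offs < total_size; offs += chunk) {
        void *adr = mmap(NULL, chunk, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offs);
        if (adr == MAP_FAILED) {
            perror("mmap");
            return -1;
        }
        fill_via_dma(adr, chunk);        /* fill this window */
        msync(adr, chunk, MS_SYNC);      /* force write-back before dropping the mapping */
        munmap(adr, chunk);
    }
    return 0;
}
msync(MS_SYNC) blocks until the pages reach the disk, while munmap() alone leaves the write-back to the kernel's own schedule.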
UPDATE:
What exactly happens when you fill an mmap'ed region via DMA I simply don't know (I'm not sure how exactly dirty pages are detected: what is done by the hardware, and what must be done by software).
UPDATE2: To my best knowledge:
DMA works the following way:
the CPU arranges the DMA transfer (the address in RAM where the transferred data should be written);
the DMA controller does the actual work, while the CPU can do its own work in parallel;
once the DMA transfer is complete, the DMA controller signals the CPU via an IRQ line (interrupt), so the CPU can handle the result.
This seems simple as long as virtual memory is not involved: DMA should work independently of the running process (the actual VM table in use by the CPU). Yet there must be some mechanism to invalidate the CPU cache for physical RAM pages modified by DMA (I don't know whether the CPU needs to do something, or whether it is done automatically by the hardware).
mmap() works the following way:
after a successful call to mmap(), the file on disk is attached to a range of the process's memory (most likely some data structure is filled in the OS kernel to hold this info);
I/O (reading or writing) in the mmaped range triggers a page fault, which is handled by the kernel loading the appropriate blocks from the attached file;
writes to the mmaped range are handled by hardware (I don't know how exactly: maybe writes to previously unmodified pages trigger some fault which is handled by the kernel marking those pages dirty; or maybe this marking is done entirely in hardware and the info is available to the kernel when it needs to flush the modified pages to disk);
modified (dirty) pages are written to disk by the OS (as it sees appropriate), or the write-back can be forced via msync() or munmap().
In theory it should be possible to do DMA transfers into an mmaped range, but you need to find out how exactly pages are marked dirty (whether you need to do something to inform the kernel which pages need to be written to disk).
UPDATE3:
Even if the pages modified by DMA are not marked dirty, you should be able to trigger the marking by rewriting (reading and then writing back the same value) at least one byte in each page (most likely each 4 KB) transferred. Just make sure this rewriting is not removed (optimised out) by the compiler.
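A minimal sketch of that trick, assuming addr and len describe the mmap'ed window and the page size is 4 KB; the volatile qualifier is there so the compiler cannot drop the apparently useless store:
#include <stddef.h>
#include <stdint.h>

static void mark_pages_dirty(volatile uint8_t *addr, size_t len)
{
    for (size_t off = 0; off < len; off += 4096)
        addr[off] = addr[off];   /* read a byte and write it back: the page becomes dirty */
}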
UPDATE4:
It seems a file opened O_WRONLY can't be mmap'ed (see the question comments; my experiments confirm this too). That is a logical conclusion from the mmap() workings described above. The same is confirmed here (with a reference to the POSIX standard's requirement to ensure the file is readable regardless of the mapping protection flags).
Unless there is some way around this, it actually means that by using mmap() you can't avoid reading the results file (an unnecessary step in your case).
Regarding DMA transfers into the mapped range, I think it will be a requirement to ensure the mapped pages are preallocated before the DMA starts (so there is real memory assigned to both the DMA and the mapped region). On Linux there is the MAP_POPULATE mmap flag, but from the manual it seems to work with MAP_PRIVATE mappings only (changes are not written to disk), so most likely it is unsuitable. Likely you'll have to trigger the page faults manually by accessing each mapped page, which should trigger reading of the results file.
If you still wish to use mmap and DMA together but avoid reading the results file, you'll have to modify the kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling the faulted-in pages instead of reading them from disk).
Should write() implementations assume random access, or can they make some assumptions -- for example, that writes will only ever be performed sequentially, at increasing offsets?
You'll get extra points for a link to the part of a POSIX or SUS specification that describes the VFS interface.
Random, for certain. There's a reason why the read and write interfaces take both size and offset. You'll notice that there isn't a seek field in the fuse_operations struct; when a user program calls seek/lseek on a FUSE file, the offset in the kernel file descriptor is updated, but the FUSE fs isn't notified at all. Later reads and writes just start coming to you with a different offset, and you should be able to handle that. If something about your implementation makes it impossible, you should probably return -EIO on the writes you can't satisfy.
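To make that concrete, here is a minimal sketch of a high-level-API write handler that honors whatever offset arrives; the in-memory node and lookup_node() are hypothetical stand-ins for your filesystem's own bookkeeping:
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>

struct node { char *data; size_t size; size_t capacity; };
struct node *lookup_node(const char *path);   /* hypothetical: find the in-memory file by path */

static int myfs_write(const char *path, const char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    (void)fi;
    struct node *n = lookup_node(path);
    if (!n)
        return -ENOENT;
    if ((size_t)offset + size > n->capacity)
        return -EFBIG;                        /* or grow the backing buffer here */
    memcpy(n->data + offset, buf, size);      /* write at whatever offset the kernel gave us */
    if ((size_t)offset + size > n->size)
        n->size = offset + size;
    return size;                              /* number of bytes accepted */
}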
Unless there is something unusual about your FUSE filesystem that would prevent an existing file from being opened for write, your implementation of the write operation must support writes to any offset — an application can write to any location in a file by lseek()-ing around in the file while it's open, e.g.
fd = open("file", O_WRONLY);
lseek(fd, 100, SEEK_SET);
write(fd, ...);
lseek(fd, 0, SEEK_SET);
write(fd, ...);
How do the Linux folks make /dev files? You can write to them, and what you write is immediately gone.
I can imagine some program which constantly reads from such a dev file:
FILE *fp;
char buffer[255] = {0};
int result;

fp = fopen(fileName, "r");
if (!fp) {
    printf("Open file error");
    return;
}

while (1)
{
    result = fscanf(fp, "%254c", buffer);
    printf("%s", buffer);
    memset(buffer, 0, 255);
    fflush(stdout);
    sleep(1);
}
fclose(fp);
But how do they delete the content in there? Closing the file and opening it again in "w" mode is not how they do it, because you can do e.g. cat > /dev/tty.
What are files? Files are names in a directory structure which denote objects. When you open a file like /home/joe/foo.txt, the operating system creates an object in memory representing that file (or finds an existing one, if the file is already open), binds a descriptor to it which is returned and then operations on that file descriptor (like read and write) are directed, through the object, into file system code which manipulates the file's representation on disk.
Device entries are also names in the directory structure. When you open some /dev/foo, the operating system creates an in-memory object representing the device, or finds an existing one (in which case there may be an error if the device does not support multiple opens!). If successful, it binds a new file descriptor to the device object and returns that descriptor to your program. The object is configured in such a way that operations like read and write on the descriptor are directed to the specific device driver for device foo, and correspond to doing some kind of I/O with that device.
Such entries in /dev/ are not files; a better name for them is "device nodes" (a justification for which is the name of the mknod command). Only when programmers and sysadmins are speaking very loosely do they call them "device files".
When you do cat > /dev/tty, there isn't anything which is "erasing" data "on the other end". Well, not exactly. Basically, cat is calling write on a descriptor, and this results in a chain of function calls which ends up somewhere in the kernel's tty subsystem. The data is handed off to a tty driver which will send the data into a serial port, or socket, or into a console device which paints characters on the screen or whatever. Virtual terminals like xterm use a pair of devices: a master and slave pseudo-tty. If a tty is connected to a pseudo-tty device, then cat > /dev/tty writes go through a kind of "trombone": they bubble up on the master side of the pseudo-tty, where in fact there is a while (1) loop in some user-space C program receiving the bytes, like from a pipe. That program is xterm (or whatever); it removes the data and draws the characters in its window, scrolls the window, etc.
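To make that dispatch concrete, here is a sketch of a tiny character-device driver: read() and write() on the corresponding /dev node end up in these callbacks, and the written data simply vanishes because the driver chooses not to store it. The major number 240 and the name "demo" are made up, and error handling is minimal:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>

static ssize_t demo_read(struct file *f, char __user *buf, size_t len, loff_t *off)
{
    return 0;       /* nothing to hand back: reads behave like reading /dev/null */
}

static ssize_t demo_write(struct file *f, const char __user *buf, size_t len, loff_t *off)
{
    return len;     /* claim all bytes were consumed; the data is simply dropped */
}

static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .read  = demo_read,
    .write = demo_write,
};

static int __init demo_init(void)
{
    return register_chrdev(240, "demo", &demo_fops);
}

static void __exit demo_exit(void)
{
    unregister_chrdev(240, "demo");
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
After loading such a module, mknod /dev/demo c 240 0 creates the node, and cat > /dev/demo then behaves much like writing to /dev/null: the bytes reach demo_write() and are simply discarded.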
Unix is designed so that devices (tty, printer, etc.) are accessed like everything else (as a file), so the files in /dev are special pseudo-files that represent the devices within the file system.
You don't want to delete the contents of such a device file, and honestly it could be dangerous for your system if you write to them willy-nilly without understanding exactly what you are doing.
Device files are not normal files, if "normal file" refers to an arbitrary sequence of bytes, often stored on a medium. But not all files are normal files.
More broadly, files are an abstraction referring to a system service and/or resource, a service being something you can send information to for some purpose (e.g., for a normal file, write data to storage) and a resource being something you request data from for some purpose (e.g., for a normal file, read data from storage). C defines a standard for interfacing with such a service/resource.
Device files fit within this definition, but they do not necessarily match my more specific "normal file" examples of reading and writing to and from storage. You can directly create dev files, but the only meaningful reason to do so is within the context of a kernel module. More often you may refer to them (e.g., with udev), keeping in mind they are actually created by the kernel and represent an interface with the kernel. Beyond that, the functioning of the interface differs from dev file to dev file.
I've also found a quite nice explanation:
http://lwn.net/images/pdf/LDD3/ch18.pdf
I'm trying to create a large empty file on a VFAT partition by using the `dd' command on an embedded Linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only one block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, and it produced the following kernel message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with FAT specification, will have to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
UN*X-style "holes" in files (aka "sparse files")
that's what you're creating on ext3 with your testcase -- a file with no data blocks allocated to the first 1 GB - 1 MB of it, and a single 1 MB chunk of actually committed, zero-initialized blocks at the end.
NTFS-style "valid data length" information.
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata will keep two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).
Without a specification supporting either technique, the filesystem would always have to allocate and zerofill all "intermediate" data blocks if you skip a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
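You can see this for yourself by comparing the file's logical size with the blocks actually allocated to it; a small sketch using stat(2):
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    /* st_blocks is counted in 512-byte units regardless of the filesystem block size */
    printf("logical size: %lld bytes, allocated: %lld bytes\n",
           (long long)st.st_size, (long long)st.st_blocks * 512LL);
    return 0;
}
On the ext3 file created with seek=1023, the allocated figure stays around 1 MB, while on VFAT it matches the full logical size -- exactly the difference described above.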
How could the VFAT driver be modified to deal with this ?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned flags,
                     struct page **pagep, void **fsdata,
                     get_block_t *get_block, loff_t *bytes)
{
    struct inode *inode = mapping->host;
    unsigned blocksize = 1 << inode->i_blkbits;
    unsigned zerofrom;
    int err;

    err = cont_expand_zero(file, mapping, pos, bytes);
    if (err)
        return err;

    zerofrom = *bytes & ~PAGE_CACHE_MASK;
    if (pos+len > *bytes && zerofrom & (blocksize-1)) {
        *bytes |= (blocksize-1);
        (*bytes)++;
    }

    return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy, but it is also a pagecache thrasher (your log messages are a consequence of the call to cont_expand_zero(), which does all the work and is not asynchronous). If the filesystem were to split the two operations -- one task to do the "real" write and another to do the zero filling -- it would appear snappier.
The way this could be achieved, while still using the default Linux filesystem utility interfaces, would be to internally create two "virtual" files -- one for the to-be-zerofilled area and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to use a direct-I/O write for the zerofilling, in order to avoid thrashing the pagecache.
Note: while all this is technically possible for sure, the question is how worthwhile it would be to make such a change. Who needs this operation all the time? What would the side effects be?
The existing (simple) code is perfectly acceptable for smaller skipping writes, you won't really notice its presence if you create a 1MB file and write a single byte at the end. It'll bite you only if you go for filesizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
freshly format (e.g.) a SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional)
pre-populate the files, or
put a loopback filesystem image into them
One of the cases I've worked on folded the first two together -- i.e. we modified mkdosfs to pre-allocate/pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It also has the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway, or doing ext3-in-file-on-FAT (a pretty common thing -- Linux appliance, SD card for data exchange with a Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever else). If your usecase fits this (i.e. formatting the card is a useful / normal step of the "initialize it for use" process anyway) then try it out; a suitably-modified mkdosfs is part of TomTom's dosfsutils sources, see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation: as mentioned, there's also posix_fallocate(). Currently on Linux with FAT, this will do essentially the same as a manual dd ..., i.e. wait for the zerofill. But the specification of the function doesn't mandate it being synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low priority in the background, or only done if explicitly requested via fdatasync() / sync(), so that the app can e.g. allocate the blocks and then write the contents with non-zeroes itself ...). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, even just for experimenting.
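For completeness, this is what the posix_fallocate() route looks like; as noted above, on VFAT today it still ends up zero-filling synchronously, while on filesystems with a real fallocate implementation it just reserves the blocks. The path and size below are only examples:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* reserve 1 GiB; posix_fallocate returns an error number instead of setting errno */
    int err = posix_fallocate(fd, 0, 1024LL * 1024 * 1024);
    if (err)
        fprintf(stderr, "posix_fallocate: %d\n", err);
    close(fd);
    return 0;
}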