how to determine page(cache) boundaries when writing to a file - linux

In linux when writing to a file, kernel maintains multiple in memory pages (4KB in size). Data is first written to the pages and background process bdflush sends these data to disk drive.
Is there a way to determine page boundaries when writing sequentially to a file ?
Can I assume it is always 1-4096:page 1 and 4097-8192:page 2 ?
or can it vary ?
say if I start writing from 10 (i.e. first 10 bytes already written to the file previously and I set file position to 10 before start writing) will the page boundary still be
1-4096 : page 1
OR
10-5096 : page 1 ?
Reason for asking,
I can use sync_file_range
http://man7.org/linux/man-pages/man2/sync_file_range.2.html
to flush data from kernel pages to disk drive in a orderly manner.
If I can determine page boundaries I can call sync_file_range only when a page boundary is reached, so that unnecessary sync_file_range calls are avoided.
Edit :
Only positive thing to suggest such boundary alignment I could find was in mmap page asking offset to be multiple of page size
,
offset must be a multiple of the page size as
returned by sysconf(_SC_PAGE_SIZE)

Related

Invalidate range by virtual address in dcache_inval_poc(start,end); ARMV8; Cache;

I'm confused by the implementation of the dcache_inval_poc (start, end) as follows: https://github.com/torvalds/linux/blob/v5.15/arch/arm64/mm/cache.S#L134. There is no sanity check for the "end" address, but what will happen if the range (start, end) passes from the upper layer, like dma_sync_single_for_cpu/dma_sync_single_for_device, beyond the L1 data cache size? eg: dcache_inval_poc(start, start+256KB), but L1 D-cache size is 32KB
After going through the source code of the dcache_inval_poc (start, end) https://github.com/torvalds/linux/blob/v5.15/arch/arm64/mm/cache.S#L152 , I tried to convert the loop code to Pseudo-Code in C as the following:
x0_kaddr = start;
while ( start < end){
dc_civac( x0_kaddr );
x0_kaddr += cache_line_size;
}
If "end - start" > L1 D-cache size, the loop will still run, however, the "x0_kaddr" address no longer exists in the D-cache.
Your confusion comes from fact that you thinking in terms of cache lines somehow mapped on top of some memory range. But function is Invalidate range by virtual address in terms of available mapped memory.
So far as start and end parameters are valid virtual addresses of general memory that's fine.
Memory range does not have to be cached as a whole, only some data out of given range might be cached or none at all.
So say there is 2MB buffer in physical DDR memory that's mapped and could be accessed by virtual addresses.
Say L1 is 32KB.
So up to 32KB out of 2MB buffer might be cached (or none at all). You don't know what part, if any, is in cache.
For that reason you run a loop over virtual addresses of your 2MB buffer. If data block of cache_line_size is in cache, that cache line would be invalidated. If data is not in cache and only in DDR memory, that's basically a nop.
It's good practice to provide start and end addresses aligned to cache_line_size, because memory controller would clip lower bits and you might miss cleaning some data in buffer tail.
PS: if you want to operate directly on cache lines, there is other functions for that. And they takes way and set parameters to address directly cache lines.

Linux file dirty page write back order

In Linux, for a single file what is the dirty page write back(to disk) order ? is it from beginning to end ? or out of order ?
Scenario 1 : without overwrite
A file(in disk) is created and large amount of data written (sequentially) quickly. Now I presume these would be in multiple page caches. When writing back the dirty pages is pages written back in order ?
e.g. Say server shutdown before file write completion.
Now after reboot can we have the disk file in below state
|--correct data --|---data unset/garbage--|--correct data--|
i.e. I understand last set of bytes in file can be incomplete, but can data in mid be incomplete
Scenario 2 : with overwrite (attempt to use file similar to a circular/ring buffer)
File created, data written, after reaching a max size, "fsync"
called (i.e. data+meta data synchronized).
Now, file pointer is
moved to the beginning of the file and data written sequentially.
(No fsync done)
Now due to a server shutdown can we have the disk file in below state after reboot
|--Newly written data--|--Old data--|--New data--|...
i.e. for new data, some pages were written to disk out of order
OR
can I assume it is always
|--Newly written data--|----Newly written data--|--Old data--|
i.e. old data and new data will not mix-up (if present old data would only be at the end of file)

Objective C For loops with #autorelease and ARC

As part of an app that allows auditors to create findings and associate photos to them (Saved as Base64 strings due to a limitation on the web service) I have to loop through all findings and their photos within an audit and set their sync value to true.
Whilst I perform this loop I see a memory spike jumping from around 40MB up to 500MB (for roughly 350 photos and 255 findings) and this number never goes down. On average our users are creating around 1000 findings and 500-700 photos before attempting to use this feature. I have attempted to use #autorelease pools to keep the memory down but it never seems to get released.
for (Finding * __autoreleasing f in self.audit.findings){
#autoreleasepool {
[f setToSync:#YES];
NSLog(#"%#", f.idFinding);
for (FindingPhoto * __autoreleasing p in f.photos){
#autoreleasepool {
[p setToSync:#YES];
p = nil;
}
}
f = nil;
}
}
The relationships and retain cycles look like this
Audit has a strong reference to Finding
Finding has a weak reference to Audit and a strong reference to FindingPhoto
FindingPhoto has a weak reference to Finding
What am I missing in terms of being able to effectively loop through these objects and set their properties without causing such a huge spike in memory. I'm assuming it's got something to do with so many Base64 strings being loaded into memory when looping through but never being released.
So, first, make sure you have a batch size set on the fetch request. Choose a relatively small number, but not too small because this isn't for UI processing. You want to batch a reasonable number of objects into memory to reduce loading overhead while keeping memory usage down. Try 50 or 100 and see how it goes, then consider upping the batch size a little.
If all of the objects you're loading are managed objects then the correct way to evict them during processing is to turn them into faults. That's done by calling refreshObject:mergeChanges: on the context. BUT - that discards any changes, and your loop is specifically there to make changes.
So, what you should really be doing is batch saving the objects you've modified and then turning those objects back into faults to remove the data from memory.
So, in your loop, keep a counter of how many you've modified and save the context each time you hit that count and refresh all the objects that were processed so far. The batch on the fetch and the batch size to save should be the same number.
There's probably a big difference in size between your "Finding" objects and the associated images. So your primary aim should be to redesign your database in a way so that unfaulting (loading) a Finding object does not automatically load the base64 encoded image.
That's actually one of the major strengths of Code Data: Loading part of an object hierarchy. Just try to move the base64 encoded data to an own (managed) object so that Core Data does not load it. It will still be loaded as needed when the reference is touched.

mmap'ed file - determine which page is dirty

I've a fixed size file. The file has been ftruncate()'ed to a size = N * getpagesize(). The file has fixed size records. I've a writer process which maps the entire file by mmap(...MAP_SHARED...) and modifies records randomly (accessed like an array). I've a reader process which also does mmap(...MAP_SHARED...). Now the reader process needs to determine which page has changed in its mapping because of writer process writing to a random record. Is there a way to do it in userspace? I'm on Linux - x86_64. Platform specific code/hacks are welcome. Thank you for your time.
Edit: I have no liberty to modify the writer process' code to give me an indication of the modified records in some way.
Relevant documentation:
http://lwn.net/Articles/230975/
https://www.kernel.org/doc/Documentation/vm/pagemap.txt
Determine the number of your virtual page (i.e. divide by 4096), multipy by 8, and seek to that position in /proc/*/pagemap
Read 8 bytes, this is the page frame number (PFN)
Open /proc/kpageflags, and seek to the PFN, read 8 bytes
If the DIRTY flag is set, the page is dirty (in other words, the writer has written to it)
Repeat for each page in your mapped file
It's going to be very, very ugly. Most likely, it's just not worth trying to do this and you're better off changing whatever painted you into this corner.
You can use a shared bitmap protected by a lock. The writer protects each page. If it writes into a protected page, it faults. You'll have to catch the fault, unprotect the page, lock the bitmap, and set the bit corresponding to that bit in the bitmap. This will tell the reader that the page was modified.
The reader operates as follows (this is the painful part):
Lock the bitmap.
Make a list of modified pages.
Communicate that list of modified pages to the writer. The writer must protect those pages again and clear their bits in the bitmap. The writer must wait for the reader to complete this before its starts reading or changes can be lost.
The reader can now read the modified pages.

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command in an embedded linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow ?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with FAT specification, will have to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
UN*X-style "holes" in files (aka "sparse files")
that's what you're creating on ext3 with your testcase - a file with no data blocks allocated to the first 1GB-1MB of it, and a single 1MB chunk of actually committed, zero-initialized blocks) at the end.
NTFS-style "valid data length" information.
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata will keep two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).
Without a specification supporting either technique, the filesystem would always have to allocate and zerofill all "intermediate" data blocks if you skip a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
How could the VFAT driver be modified to deal with this ?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
* For moronic filesystems that do not allow holes in file.
* We may have to extend the file.
*/
int cont_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata,
get_block_t *get_block, loff_t *bytes)
{
struct inode *inode = mapping->host;
unsigned blocksize = 1 << inode->i_blkbits;
unsigned zerofrom;
int err;
err = cont_expand_zero(file, mapping, pos, bytes);
if (err)
return err;
zerofrom = *bytes & ~PAGE_CACHE_MASK;
if (pos+len > *bytes && zerofrom & (blocksize-1)) {
*bytes |= (blocksize-1);
(*bytes)++;
}
return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy but also a pagecache trasher (your log messages are a consequence of the call to cont_expand_zero() which does all the work, and is not asynchronous). If the filesystem were to split the two operations - one task to do the "real" write, and another one to do the zero filling, it'd appear snappier.
The way this could be achieved while still using the default linux filesystem utility interfaces were by internally creating two "virtual" files - one for the to-be-zerofilled area, and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to go for a directio write to do the zerofilling, in order to avoid trashing the pagecache.
Note: While all this is technically possible for sure, the question is how worthwhile would it be to do such a change ? Who needs this operation all the time ? What would side effects be ?
The existing (simple) code is perfectly acceptable for smaller skipping writes, you won't really notice its presence if you create a 1MB file and write a single byte at the end. It'll bite you only if you go for filesizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
freshly format (e.g.) a SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional)
pre-populate the files, or
put a loopback filesystem image into them
One of the cases I've worked on we've folded the first two - i.e. modified mkdosfs to pre-allocate/ pre-create files when making the (FAT32) filesystem. That's pretty simple, when writing the FAT tables just create allocated cluster chains instead of clusters filled with the "free" marker. It's also got the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway or doing ext3-in-file-on-FAT (pretty common thing - linux appliance, sd card for data exchange with windows app/gui), then there's no need to zero out anything / double-write (once with zeroes, once with whatever-else). If your usecase fits this (i.e. formatting the card is a useful / normal step of the "initialize it for use" process anyway) then try it out; a suitably-modified mkdosfs is part of TomTom's dosfsutils sources, see mkdosfs.c search for the -N command line option handling.
When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux when using FAT, this will do essentially the same as a manual dd ..., i.e. wait for the zerofill. But the specification of the function doesn't mandate it being synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low-prio in background or only done if explicitly requested via fdsync() / sync() so that the app can e.g. alloc blocks, write the contents with non-zeroes itself ...). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, if only for experimenting.

Resources