Speeding up reads for a Linux application

My program reads a file, interleaving it as below:
The file to be read is large. It is split into four parts, each of which is split into many blocks. My program first reads block 1 of part 1, then jumps to block 1 of part 2, and so on, then comes back to block 2 of part 1, and so forth.
Performance drops in tests. I believe the reason is that the kernel's page cache doesn't work efficiently with this access pattern. But the file is too large to mmap(), and it is located on NFS.
How do I speed up reading in such a situation? Any comments and suggestions are welcome.

You may want to use posix_fadvise() to give the system hints about your usage, e.g. use POSIX_FADV_RANDOM to disable readahead, and possibly POSIX_FADV_WILLNEED to have the system try to read the next block into the page cache before you need it (if you can predict this).
You could also try POSIX_FADV_DONTNEED once you are done reading a block to have the system free the underlying cache pages, although this might not be necessary.
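A minimal C sketch of those hints (error checking elided; "fd", "next_off" and "block_len" are placeholder names, not from the question):

#define _POSIX_C_SOURCE 200112L
#include <sys/types.h>
#include <fcntl.h>

void setup_access_pattern(int fd)
{
    /* interleaved seeks defeat sequential readahead, so turn it off */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}

void prefetch_next_block(int fd, off_t next_off, off_t block_len)
{
    /* ask the kernel to start reading the next block in the background */
    posix_fadvise(fd, next_off, block_len, POSIX_FADV_WILLNEED);
}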

For each pair of blocks, read both in, process the first, and push the second onto a stack. When you come to the end of the file, start shifting values off the bottom of the stack, processing them one by one.

You can break up the reading into linear chunks. For example, if your code looks like this:
int index = 0;
for (int block=0; block<n_blocks; ++block) {
    for (int part=0; part<n_parts; ++part) {
        seek(file, part*n_blocks + block);
        data[part] = readChar(file);
    }
    send(data);
}
change it to this:
for (int chunk=0; chunk<n_chunks; ++chunk) {
    for (int part=0; part<n_parts; ++part) {
        seek(file, part*n_blocks + chunk*n_blocks_per_chunk);
        for (int block=0; block<n_blocks_per_chunk; ++block) {
            data[block*n_parts + part] = readChar(file);
        }
    }
    send(data);
}
Then optimize n_blocks_per_chunk for your cache.
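If it helps, here is a hedged C sketch of the same idea using pread() (my choice, not the answer's); as in the pseudocode, each block is one byte, and tmp is a temporary buffer I've introduced:

#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Read one part's chunk with a single contiguous pread(), then
 * interleave it into data[] exactly as the inner loop above does.
 * Assumes n_blocks_per_chunk <= sizeof tmp. */
void read_part_chunk(int fd, char *data, long part, long n_parts,
                     long n_blocks, long chunk, long n_blocks_per_chunk)
{
    char tmp[4096];
    off_t off = (off_t)part * n_blocks + (off_t)chunk * n_blocks_per_chunk;
    pread(fd, tmp, (size_t)n_blocks_per_chunk, off);  /* one linear read */
    for (long block = 0; block < n_blocks_per_chunk; ++block)
        data[block * n_parts + part] = tmp[block];
}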

Related

Is it safe to attempt (and fail) to write to a const on an STM32?

So, we are experimenting with an approach to perform some matrix math. This is embedded, so memory is limited, and since we will have large matrices it helps to keep some of them stored in flash rather than RAM.
I've written a matrix structure, two arrays (one const/flash and the other RAM), and "modify" and "get" functions. One matrix I initialize to the RAM data, and the other I initialize to the flash data, using a cast from const f32 * to f32 *.
What I find is that when I run this code on my STM32 embedded processor, the RAM matrix is modifiable, and the matrix pointing to the flash data simply doesn't change (the set to 12.0 doesn't "take", the value remains 2.0).
(before change) a=2, b=2, (after change) c=2, d=12
This is acceptable behavior; by design we will not attempt to modify matrices of flash data, but if we make a mistake we don't want it to crash.
If I run the same code on my Windows machine with Visual C++, however, I get an "access violation" when I try to modify the const array to 12.0.
It's not surprising that Windows would object, but I'd like to understand the difference in behavior better; it seems related to CPU architecture. Is it safe, on our STM32, to let the code attempt to write to a const array and have no effect? Or are there side effects, or reasons to avoid this?
#include <stdint.h>

typedef float    f32;  /* typedefs assumed by the question's code */
typedef uint16_t u16;

static const f32 constarray[9] = {1,2,3,1,2,3,1,2,3};
static f32 ramarray[9] = {1,2,3,1,2,3,1,2,3};

typedef struct {
    u16 rows;
    u16 cols;
    f32 * mat;
} matrix_versatile;

void modify_versatile_matrix(matrix_versatile * m, uint16_t r, uint16_t c, double new_value)
{
    m->mat[r * m->cols + c] = new_value;
}

double get_versatile_matrix_value(matrix_versatile * m, uint16_t r, uint16_t c)
{
    return m->mat[r * m->cols + c];
}

double a;
double b;
double c;
double d;

int main(void)
{
    matrix_versatile matrix_with_const_data;
    matrix_versatile matrix_with_ram_data;

    matrix_with_const_data.cols = 3;
    matrix_with_const_data.rows = 3;
    matrix_with_const_data.mat = (f32 *) constarray;  /* casts away const */

    matrix_with_ram_data.cols = 3;
    matrix_with_ram_data.rows = 3;
    matrix_with_ram_data.mat = ramarray;

    a = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    b = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);

    modify_versatile_matrix(&matrix_with_const_data, 1, 1, 12.0);
    modify_versatile_matrix(&matrix_with_ram_data, 1, 1, 12.0);

    c = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    d = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);
}
"... but if we make a mistake we don't want it to crash."
Attempting to write to ROM will not in itself cause a crash, but the code attempting the write is by definition buggy: it may crash anyway, and it will certainly not behave as intended.
That is almost entirely the wrong way to think about it; if you have a bug, you really want it to crash during development, not after deployment. If it silently does the wrong thing, you may never notice the bug, or the crash might occur somewhere far from the bug itself and so be very hard to find.
Architectures with an MMU or MPU may issue an exception if you attempt to write to memory marked as read-only. That is what is happening on Windows. With an exception handler that reports such errors, this can be a useful debugging aid: the error is reported exactly when it occurs, rather than crashing some time later when invalid data is accessed or an incorrect result is acted upon.
Some, but not all, STM32 parts include an MPU (see ST's application note).
The answer may depend on the series (STM32F1, STM32F4, STM32L1 etc), as they have somewhat different flash controllers.
I once made the same mistake on an STM32F429 and investigated a bit, so I can tell what would happen on an STM32F4.
Probably nothing.
The flash is protected by default, in order to be somewhat resilient to this kind of programming error. To modify the flash, one has to write certain key values to the FLASH->KEYR register. If a wrong value is written, the flash is locked until reset, so nothing really bad can happen unless the program writes both correct 32-bit key values. No unexpected interrupts can happen either, because the interrupt enable bit is protected by this key too. The failed attempt will set some error bits in FLASH->SR, so a program can check them and warn the user (preferably the tester).
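For reference, a minimal sketch of that unlock sequence on an F4 (CMSIS device header assumed; the two key values are fixed by the reference manual):

#include "stm32f4xx.h"

static void flash_unlock(void)
{
    if (FLASH->CR & FLASH_CR_LOCK) {
        FLASH->KEYR = 0x45670123U;  /* KEY1 */
        FLASH->KEYR = 0xCDEF89ABU;  /* KEY2; any wrong value locks the flash until reset */
    }
}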
However if there is some code there (e.g. a bootloader, or logging something into flash) that is supposed to write something in the flash, i.e. it unlocks the flash with the correct keys, then bad things can happen.
If the flash is left unlocked after a preceding write operation, then writing to a previously programmed area will change bits from 1 to 0, but not from 0 to 1. This means the flash will contain the bitwise AND of the old and the newly written value (e.g. writing 0x0C over an existing 0x0A leaves 0x08).
If the failed write attempt occurs first and the flash is unlocked afterwards, then no legitimate write or erase operation will succeed unless the status bits are properly cleared first.
If the intended and unintended accesses occur interleaved, e.g. in interrupt handlers, then all bets are off.
Even if the values are in immutable flash memory, there can still be unexpected results. Consider this code:
int foo(int *array) {
    array[0] = 1;
    array[1] = 3;
    array[2] = 5;
    return array[0];
}
An optimizing compiler might recognize that the return value should always be 1 and emit code to that effect. Or it might not, and instead reload array[0] from wherever it is stored, possibly getting a different value from flash. It may behave differently in debug and release builds, or when the function is called from different places, as it might be inlined differently.
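For illustration, a hedged variant (mine, not from the original answer): qualifying the parameter volatile forces the compiler to perform every access exactly as written, taking the guesswork out of which value is returned:

int foo_volatile(volatile int *array) {
    array[0] = 1;  /* each access must be performed as written */
    array[1] = 3;
    array[2] = 5;
    return array[0];  /* re-read from memory, not assumed to be 1 */
}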
If the pointer points to an unmapped area, neither RAM nor flash nor a memory-mapped register, then a fault will occur, and as the default fault handlers contain just an infinite loop, the program will hang unless a fault handler is installed that can deal with the situation. Needless to say, overwriting random RAM areas or registers can result in barely predictable behaviour.
UPDATE
I've tried your code on actual hardware. When I ran it verbatim, the compiler (gcc-arm-none-eabi-7-2018-q2-update, -O3 -lto) optimized everything away, since the variables were not used afterwards. Marking a, b, c, d as volatile resulted in c=2 and d=12; the compiler still treated the first array as const, and no accesses to the arrays were generated. constarray did not show up in the map file at all; the linker had eliminated it completely.
So I've tried a few things one at a time to force the optimizer to generate code that would actually access the arrays.
Disabling optimization (-O0)
Making all variables volatile
Inserting a couple of compile-time memory barriers (asm volatile("":::"memory");)
Doing some complex calculations in the middle
Each of these produced varying effects on different MCUs, but the results were always consistent on a given platform.
STM32F103: hard fault. Only halfword (16-bit) write accesses are allowed to the flash; 8- or 32-bit accesses always result in a fault. When I changed the data type to short, the code ran, of course without any effect on the flash.
STM32F417: Code runs, with no effects on the flash contents, but bits 6 and 7, PGPERR and PGSERR in FLASH->SR were set a few cycles after the first write attempt to constarray.
STM32L151: Code runs, with no effects on the flash controller status.

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparison itself was the issue. In most dynamically typed languages, memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, creating a buffer with readinto and comparing it against a previously prepared null-filled buffer is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)   # stays all zeroes: the reference buffer
And when reading:
f = open(FILENAME, 'rb')
while f.readinto(data):      # returns the number of bytes read, 0 at EOF
    if data != cmp:          # direct buffer comparison, no copies or casts
        print(f.tell())      # a non-null byte lives in this chunk
        break
Two things need to be taken into account:
The sizes of the compared buffers should be equal, and comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit)
The last buffer may not be the same size; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers is quick, there are no attempts to cast the bytes to strings (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

How to monitor which files consume IOPS?

I need to understand which files consume the IOPS of my hard disk. Just using "strace" will not solve my problem: I want to know which files are really written to disk, not just to the page cache. I tried "systemtap", but I cannot work out how to find which files (filenames or inodes) consume my IOPS. Are there any tools that will solve my problem?
Yeah, you can definitely use SystemTap for tracing that. When an upper layer (usually the VFS subsystem) wants to issue an I/O operation, it calls the submit_bio and generic_make_request functions. Note that these don't necessarily correspond to a single physical I/O operation; for example, writes from adjacent sectors can be merged by the I/O scheduler.
The trick is how to determine the file path name in generic_make_request. It is quite simple for reads, as this function is called in the same context as the read() call. Writes are usually asynchronous, so write() will simply update the page cache entry and mark it dirty, while submit_bio gets called by one of the writeback kernel threads, which has no information about the original calling process:
Writes can be deduced by looking at the page references in the bio structure -- each page has a mapping to a struct address_space. The struct file that corresponds to an open file also contains f_mapping, which points to the same address_space instance, and its dentry holds the name of the file (which can be resolved with task_dentry_path).
So we need two probes: one to capture attempts to read or write a file and save the path and address_space into an associative array, and a second to capture generic_make_request calls (this is done by probe ioblock.request).
Here is an example script which counts IOPS:
// maps struct address_space to path name
global paths;

// IOPS per file
global iops;

// Capture attempts to read and write by VFS
probe kernel.function("vfs_read"),
      kernel.function("vfs_write") {
    mapping = $file->f_mapping;

    // Assemble full path name for running task (task_current())
    // from open file "$file" of type "struct file"
    path = task_dentry_path(task_current(), $file->f_path->dentry,
                            $file->f_path->mnt);
    paths[mapping] = path;
}

// Attach to generic_make_request()
probe ioblock.request {
    for (i = 0; i < $bio->bi_vcnt; i++) {
        // Each BIO request may have more than one page to write
        page = $bio->bi_io_vec[i]->bv_page;
        mapping = @cast(page, "struct page")->mapping;
        iops[paths[mapping], rw] <<< 1;
    }
}

// Once per second, drain the IOPS statistics
probe timer.s(1) {
    println(ctime());
    foreach ([path+, rw] in iops) {
        printf("%3d %s %s\n", @count(iops[path, rw]),
               bio_rw_str(rw), path);
    }
    delete iops
}
This example script works for XFS, but needs to be updated to support AIO and volume managers (including btrfs). Plus I'm not sure how it will handle metadata reads and writes, but it is a good start ;)
If you want to know more about SystemTap you can check out my book: http://myaut.github.io/dtrace-stap-book/kernel/async.html
Maybe iotop can give you a hint about which processes are doing I/O, and in consequence an idea of the files involved.
iotop --only
The --only option is used to see only processes or threads actually doing I/O, instead of showing all processes or threads.

File random access in J2ME

Does J2ME have something similar to the RandomAccessFile class, or is there any way to emulate this particular (random access) functionality?
The problem is this: I have a rather large binary data file (~600 KB) and would like to create a mobile application for using that data. The format of the data is home-made and contains many index blocks and data blocks. Reading the data on other platforms (like PHP or C) usually goes like this:
Read 2 bytes for index key (K), another 2 for index value (V) for the data type needed
Skip V bytes from the start of the file to seek to the position where the data for index key K starts
Read the data
Profit :)
This happens many times during the program flow.
I'm investigating the possibility of doing the very same in J2ME, and while I admit I'm quite new to the whole Java thing, I can't seem to find anything beyond the InputStream (DataInputStream) classes, which don't have the basic seek/skip-to-byte/return-position functions I need.
So, what are my chances?
You can do something like this:
try {
    // "is" is the InputStream for the data file
    DataInputStream di = new DataInputStream(is);
    di.mark(9999);
    short key = di.readShort();
    short val = di.readShort();
    di.reset();
    di.skip(val);
    byte[] b = new byte[255];
    di.read(b);
} catch (Exception ex) {
    ex.printStackTrace();
}
I prefer not to use the mark/reset methods; I think it is better to save offsets relative to the val location rather than the start of the file, so you can skip these methods. I think they have some issues on some devices.
One more note: I don't recommend opening a 600 KB file; it will crash the application on many low-end devices. You should split this file into multiple files.

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on PowerPC processors), or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programming)?
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
    int val;
    int *pval;
    do
    {
        // fetch the pointer to the value
        pval = *ptr;

        // if it's NULL, then just return NULL; the smart pointer
        // will then become NULL as well
        if(pval == NULL)
            return NULL;

        // Grab the reference count
        val = lwarx(pval);

        // make sure the pointer we grabbed the value from
        // is still the same one referred to by 'ptr'
        if(pval != *ptr)
            continue;

        // Increment the reference count via 'stwcx'; if any other thread
        // has done anything that could potentially break this, it should
        // fail and try again
    } while(!stwcx(pval, val + 1));
    return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap, or add functions I've found for x86 so far).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86's double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This lets you atomically swap two consecutive 'natural-sized' words at once, instead of just the usual one. The basic idea is that one of the two words contains the value of interest, and the other contains an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping between attempts is so low that it's a reasonable substitute for most purposes.
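For the record, a minimal sketch of that word pair, assuming GCC or Clang on x86_64 built with -mcx16 so the 16-byte __sync builtin compiles to LOCK CMPXCHG16B (the type and helper names here are mine, not a standard API):

#include <stdint.h>
#include <string.h>

/* pointer of interest plus an always-incrementing mutation count;
 * the pair must be 16-byte aligned for cmpxchg16b */
typedef struct {
    void    *ptr;
    uint64_t count;
} tagged_ptr;

static _Alignas(16) __uint128_t shared_slot;  /* the shared word pair */

static int tagged_cas(__uint128_t *slot, tagged_ptr expected, tagged_ptr desired)
{
    __uint128_t e, d;
    memcpy(&e, &expected, sizeof e);  /* pack the structs into 128-bit words */
    memcpy(&d, &desired, sizeof d);
    /* with -mcx16 this compiles to LOCK CMPXCHG16B */
    return __sync_bool_compare_and_swap(slot, e, d);
}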
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If you have word-aligned pointers, the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
    uintptr_t val;
    int *unpacked;
    do
    {
        val = (uintptr_t)*ptr;
        unpacked = unpack(val);  // mask off the packed counter bits
        if(unpacked == NULL)
            return NULL;
        // pointer is on the bottom, counter in the top 24 bits,
        // so bump the counter with a single-word CAS on *ptr
    } while(!cas(ptr, val, val + ((uintptr_t)1 << 40)));
    return unpacked;
}
I don't know whether LWARX and STWCX invalidate the whole cache line; CAS and DCAS do. This means that unless you are willing to throw away a lot of memory (64 bytes for each independently "lockable" pointer), you won't see much improvement if you are really pushing your software under stress. The best results I've seen so far were when people consciously sacrificed 64 bytes, planned their structures around it (packing stuff that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. Cache-line invalidation can cost approximately 20 to 100 cycles, making it a bigger real performance issue than mere lock avoidance.
Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" units -- one request "leaks" and then releases all its memory in bulk at the end) or detailed allocation management, so that one structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (a Win32 function; in assembly: XADD).
The reason your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and the stwcx without invalidating the stwcx.
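A minimal sketch of that point in C11 terms (my example, not the answer's): on x86, atomic_fetch_add compiles to a LOCK XADD, which is all a plain reference-count increment needs.

#include <stdatomic.h>

/* hypothetical refcount helper: atomically bumps the count
 * and returns the new value */
static inline int increment_refcount(atomic_int *refcount)
{
    return atomic_fetch_add(refcount, 1) + 1;
}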
