Optimal way to write 2d array into unformatted file using nested implicit loops - io

I have a piece of code writing a 2D array var into a Fortran unformatted file. I'd prefer it to be written line by line, i.e. row by row. That forces me to loop in a non-optimal order, Fortran-wise: arrays are stored column-major, so the usual recommendation is to make the first index vary fastest, and row-by-row output does the opposite.
write(fileid, pos = pos) ((var(i,j),j=1,size(var,2)),i=1,size(var,1))
I'm wondering whether transposing the order in which I write to the file would be faster, or whether that is unnecessary because the I/O cost should dominate the cache/memory-access cost in this situation.

I am a retired Fortran compiler and I/O library developer, and my experience here is that you would be far better off writing the whole array, e.g. write(fileid, pos=pos) var. With implied loops out of memory order, as you have them, each element typically has to be transmitted to the I/O library separately, with a lot of processing overhead per element. Writing the whole array can result in a single I/O transfer directly from your array.
Given that you are using unformatted I/O, I don't understand the desire to write a row at a time, unless this is going to be read by some other software.
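The per-element overhead described above is not specific to Fortran. As a rough illustration only, here is the same contrast sketched in C++ (file names and sizes are made up): one library call per element versus a single contiguous transfer.

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> var(1000 * 1000, 1.0); // stand-in for the 2D array

    // One library call per element: high per-call overhead,
    // analogous to the out-of-memory-order implied loop.
    std::FILE* f = std::fopen("per_element.bin", "wb");
    if (!f) return 1;
    for (double x : var)
        std::fwrite(&x, sizeof x, 1, f);
    std::fclose(f);

    // One contiguous write of the whole buffer: a single transfer,
    // analogous to write(fileid, pos=pos) var.
    f = std::fopen("whole_array.bin", "wb");
    if (!f) return 1;
    std::fwrite(var.data(), sizeof(double), var.size(), f);
    std::fclose(f);
}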

Related

buffer memory in the delay operator in Modelica

In OpenModelica, are all variables of a model saved in the ring buffer only when the delay operator is used, or is it done automatically whether we use delay or not? And if so, can we access the ring buffer from an external C function?
Since Modelica models can fail at any point in time, all variables are saved in some sort of backup; for the C runtime this is kept in buffers in the so-called thread data. In some cases the OpenModelica compiler is able to revert the last step and try again, but slightly differently.
For example, an assert throws an error because some variable became negative when it wasn't allowed to, so it tries again with a smaller step size.
This backup is done regardless of whether the delay operator is present; it is always done.
For the delay operator a different data structure is used, which is the RINGBUFFER you are probably referring to. It is only allocated if there are delay operators in the Modelica model.
There are no API functions provided to access (this) internal data of an OpenModelica simulation. So accessing the ring buffer would only be possible if you write such a function yourself, which is of course possible.
The question is what you are trying to accomplish in the first place.
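For orientation only (this is not OpenModelica's actual RINGBUFFER type, just a generic C++ sketch of what such a structure looks like): a fixed-capacity buffer of (time, value) samples that overwrites the oldest entry once full, which a delay implementation can search to look up past values.

#include <cstddef>
#include <vector>

// Generic fixed-capacity ring buffer of (time, value) samples.
// capacity must be > 0.
struct RingBuffer {
    struct Sample { double time; double value; };
    std::vector<Sample> data;
    std::size_t head = 0;   // next slot to write
    std::size_t count = 0;  // number of valid samples

    explicit RingBuffer(std::size_t capacity) : data(capacity) {}

    void push(double t, double v) {
        data[head] = {t, v};              // overwrite the oldest slot when full
        head = (head + 1) % data.size();
        if (count < data.size()) ++count;
    }
};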

Async usage for Loading/Saving a list of objects to files

Let's say I have a list of same-class objects packed into a single file which I save to/load from at application startup.
What I'd like to do is use the power of async processing to speed up load-all time & save-all time - let's also assume that the files themselves are efficiently packed (using Protocol Buffers or the like).
What would be the best way to go about this? Would async processing actually help in this scenario?
One method I thought of is to "pre-determine" the amount of chunking by picking a number greater than 1, dividing the list up by that number, then saving/loading using that number as the number of tasks. However, this seems somewhat arbitrary, & I was curious if there are some libraries out there that might just make the decision for me based on some conditions.
I.e. I might call my "chunkable list" something like:
Chunkable<List<SomeObject>>
.. and then the program would just divide up the list correctly to read/save in an efficient way - e.g. save 10 files like "List_01", "List_XX" - then read from the chunks when performing a load-all.
The final ordering of the list, when saving or loading, is not important - just having the objects available as a single list.
For posterity, one conceptual answer here is to use a Partitioner in the Task Parallel Library.
For saving, I can have the Partitioner serialize pieces of the list and write out one file per piece as tasks complete, using a non-repeating naming format.
For loading, I can get a count/list of the existing chunks at a given location on disk, then have the TPL load and deserialize all the chunks and recombine them in whatever order they complete (using an Interlocked variable to make sure each file is only read once).
I will paste code in here once I have tested some.
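In the meantime, here is a minimal sketch of the chunk-per-task idea. The Partitioner/TPL specifics are .NET, so this is written in C++ with std::async instead; the file-naming scheme and the save_chunks name are made up for illustration.

#include <algorithm>
#include <cstdio>
#include <future>
#include <string>
#include <vector>

// Hypothetical chunked save: split the list into n_chunks pieces and
// write each piece to its own file from its own task.
void save_chunks(const std::vector<int>& items, std::size_t n_chunks) {
    std::size_t per = (items.size() + n_chunks - 1) / n_chunks;
    std::vector<std::future<void>> tasks;
    for (std::size_t c = 0; c < n_chunks; ++c) {
        tasks.push_back(std::async(std::launch::async, [&items, c, per] {
            std::size_t begin = c * per;
            std::size_t end = std::min(items.size(), begin + per);
            if (begin >= end) return;
            std::string name = "List_" + std::to_string(c) + ".bin";
            if (std::FILE* f = std::fopen(name.c_str(), "wb")) {
                std::fwrite(items.data() + begin, sizeof(int), end - begin, f);
                std::fclose(f);
            }
        }));
    }
    for (auto& t : tasks) t.get(); // wait for all chunk writers
}

Loading mirrors this: enumerate the chunk files, deserialize each in its own task, and append the results to a shared list under a lock (order does not matter here, per the question).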

What are the general design ideas for a thread-safe read-compute-write program based on its single-threaded version?

Consider that the sequential version of the program already exists and implements a sequence of "read-compute-write" operations on a single input file and a single output file. The "read" and "write" operations are performed by third-party library functions which are hard (but possible) to modify, while the "compute" function is performed by the program itself. The read/write library functions appear not to be thread-safe, since they operate on internal flags and internal memory buffers.
It was discovered that the program is CPU-bound, and it is planned to improve it by taking advantage of multiple CPUs (up to 80) by designing a multi-processor version of the program, using OpenMP for that purpose. The idea is to instantiate multiple "compute" functions with the same single input and single output.
It is obvious that something needs to be done to ensure consistent access to reads, data transfers, computations and data storage. Possible solutions are: (hard) rewrite the I/O library functions in a thread-safe manner; (moderate) write a thread-safe wrapper for the I/O functions that would also serve as a data cache.
Are there any general patterns that cover converting, wrapping or rewriting single-threaded code to comply with OpenMP thread-safety assumptions?
EDIT1: The program is fresh enough for changes to make it multi-threaded (or, more generally, parallel, whether implemented by multi-threading, multi-processing or other means).
As a quick response: if you are processing a single file and writing to another, with OpenMP it's easy to convert the sequential version of the program to a multi-threaded version without taking too much care of the I/O part, provided that the compute algorithm itself can be parallelized.
This is true because usually the main thread takes care of the I/O. If this cannot be achieved because the chunks of data are too big to read at once and the compute algorithm cannot process smaller chunks, you can use the OpenMP API to synchronize the I/O in each thread. This does not mean that the whole application will stop or wait until the other threads finish computing so new data can be read or written; it means only that the read and write parts need to be done atomically.
For example, if the flow of your sequential application is as follows:
1) Read
2) Compute
3) Write
then, given that it truly can be parallelized and each chunk of data needs to be read from within each thread, each thread could follow this design:
1) Synchronized read of a chunk from the input (only one thread at a time can execute this section)
2) Compute the chunk of data (done in parallel)
3) Synchronized write of the computed chunk to the output (only one thread at a time can execute this section)
If you need to write the chunks in the same order you read them, you need to buffer first, or adopt a different strategy such as fseek-ing to the correct position; but that really depends on whether the output file size is known from the start, …
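On the fseek idea: if every output chunk has a fixed, known size, each worker can simply write at its own offset and no ordering is needed at all. A POSIX sketch (an assumption, since the platform isn't stated; the function name is illustrative — pwrite keeps no shared file position, so concurrent workers don't interfere):

#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Write a finished chunk at the offset derived from its index.
// pwrite is positioned, so workers need no shared file offset and no lock.
void write_chunk_at(int fd, const unsigned char* buf, std::size_t len,
                    std::size_t chunk_index, std::size_t chunk_size) {
    pwrite(fd, buf, len, static_cast<off_t>(chunk_index * chunk_size));
}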
Pay special attention to the OpenMP scheduling strategy, because the default may not be the best for your compute algorithm. And if you need to share results between threads, such as the offset of the input file you have read, you may use the reduction operations provided by the OpenMP API, which is far more efficient than making a single part of your code run atomically across all threads just to update a global variable; OpenMP knows when it is safe to write.
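A minimal sketch of the synchronized read/compute/write pattern described above, assuming C stdio for the third-party I/O and a placeholder transform for the compute step (both are stand-ins, not the asker's actual library):

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    std::FILE* in = std::fopen("input.bin", "rb");
    std::FILE* out = std::fopen("output.bin", "wb");
    if (!in || !out) return 1;

    #pragma omp parallel
    {
        std::vector<unsigned char> buf(1 << 16);
        for (;;) {
            std::size_t n;
            // 1) Synchronized read: only one thread at a time touches `in`.
            #pragma omp critical(io_read)
            n = std::fread(buf.data(), 1, buf.size(), in);
            if (n == 0) break;

            // 2) Compute in parallel (placeholder transform).
            for (std::size_t i = 0; i < n; ++i)
                buf[i] = static_cast<unsigned char>(buf[i] ^ 0xFF);

            // 3) Synchronized write: chunks land in completion order here;
            //    see below for keeping the original order.
            #pragma omp critical(io_write)
            std::fwrite(buf.data(), 1, n, out);
        }
    }
    std::fclose(out);
    std::fclose(in);
    return 0;
}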
EDIT:
In regard to the "read, process, write" operation: as long as you keep each read and each write atomic between the workers, I can't think of a reason you'll run into trouble. Even when the data read is stored in an internal buffer, if every worker accesses it atomically, that data is acquired in the exact same order. You only need to pay special attention when saving the chunks to the output file, because you don't know the order in which the workers will finish processing their chunks, so you could have a chunk ready to be saved that was read after others that are still being processed. You just need each worker to keep track of the position of its chunk, and you can keep a list of pointers to chunks that need to be saved until you have a contiguous sequence of finished chunks since the last one saved to the output file. Some additional care may be needed here.
If you are worried about the internal buffer itself (keeping in mind I don't know the library you are talking about, so I could be wrong): if you make a request for some chunk of data, that internal buffer should only be modified after you requested that data and before the data is returned to you; and since you made that request atomically (meaning every other worker has to wait its turn), when the next worker asks for its piece of data, that internal buffer should be in the same state as when the last worker received its chunk. Even if the library specifically says it returns a pointer into the internal buffer rather than a copy of the chunk itself, you can make a copy into the worker's memory before releasing the lock on the whole atomic read operation.
If the pattern I suggested is followed correctly, I really don't think you would find any problem that you wouldn't also find in the sequential version of the algorithm.
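The bookkeeping for in-order output mentioned above can be small. A sketch (the class name and types are made up; each worker just needs to know its chunk's sequence index): finished chunks are parked in a map and flushed whenever the next expected index is present.

#include <cstdio>
#include <map>
#include <mutex>
#include <vector>

// Collects out-of-order finished chunks and writes them to the
// output file strictly in sequence order.
class OrderedWriter {
    std::FILE* out_;
    std::mutex m_;
    std::size_t next_ = 0;  // next chunk index to write
    std::map<std::size_t, std::vector<unsigned char>> pending_;
public:
    explicit OrderedWriter(std::FILE* out) : out_(out) {}

    // Called by a worker when it finishes chunk `index`.
    void submit(std::size_t index, std::vector<unsigned char> chunk) {
        std::lock_guard<std::mutex> lock(m_);
        pending_.emplace(index, std::move(chunk));
        // Flush everything that is now contiguous with what was written.
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            std::fwrite(it->second.data(), 1, it->second.size(), out_);
            pending_.erase(it);
            ++next_;
        }
    }
};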
With a little synchronization you can go even further. Consider something like this:
#pragma omp parallel sections num_threads(3)
{
    #pragma omp section
    {
        // Reader: fill the next input buffer.
        input();
        notify_read_complete();
    }
    #pragma omp section
    {
        wait_read_complete();
        // Nested parallel region doing the actual computation
        // (nested parallelism must be enabled for this to fan out).
        #pragma omp parallel num_threads(N)
        {
            do_compute_with_threads();
        }
        notify_compute_complete();
    }
    #pragma omp section
    {
        // Writer: flush the previously computed buffer.
        wait_compute_complete();
        output();
    }
}
So, the basic idea is that input() and output() read/write chunks of data, while the compute part works on a chunk of data as the other threads are reading/writing. It takes a bit of manual synchronization work in the notify*() and wait*() functions, but that's not magic; one possible implementation is sketched below.
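For example, the notify/wait pairs could be built on a condition variable (an assumption; OpenMP locks or flush-based flags would also work). Each pair, read_complete and compute_complete, would be one instance of something like:

#include <condition_variable>
#include <mutex>

// A reusable one-shot event: notify() releases the thread blocked in wait().
struct Event {
    std::mutex m;
    std::condition_variable cv;
    bool signaled = false;

    void notify() {
        { std::lock_guard<std::mutex> lock(m); signaled = true; }
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return signaled; });
        signaled = false;  // reset so the event can be reused for the next chunk
    }
};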
Cheers,
-michael

C++/CLI efficient multithreaded circular buffer

I have four threads in a C++/CLI GUI I'm developing:
1) A thread that collects raw data
2) The GUI itself
3) A background processing thread which takes chunks of raw data and produces useful information
4) A controller which joins the other three threads
I've got the raw data collector working and posting results to the controller, but the next step is to store all of those results so that the GUI and background processor have access to them.
New raw data is fed in one result at a time at regular (frequent) intervals. The GUI will access each new item as it arrives (the controller announces new data and the GUI then accesses the shared buffer). The data processor will periodically read a chunk of the buffer (a second's worth, for example) and produce a new result. So effectively there is one producer and two consumers which need access.
I've hunted around, but none of the CLI-supplied stuff sounds all that useful, so I'm considering rolling my own: a shared circular buffer which allows write locks for the collector and read locks for the GUI and data processor. This would allow multiple threads to read the data as long as those sections of the buffer are not being written to.
So my question is: Are there any simple solutions in the .net libraries which could achieve this? Am I mad for considering rolling my own? Is there a better way of doing this?
Is it possible to rephrase the problem so that:
The Collector collects a new data point ...
... which it passes to the Controller.
The Controller fires a GUI "NewDataPointEvent" ...
... and stores the data point in an array.
If the array is full (or otherwise ready for processing), the Controller sends the array to the Processor ...
... and starts a new array.
If the values passed between threads are not modified after they are shared, this might save you from needing the custom thread-safe collection class, and reduce the amount of locking required.
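A rough sketch of that rephrased flow, in plain C++ rather than C++/CLI (the class and callback names are invented; a .NET event would replace the callbacks): the controller appends points to the current array and hands the whole array off once full, so consumers only ever see arrays that are no longer being written.

#include <functional>
#include <memory>
#include <mutex>
#include <vector>

using Batch = std::vector<double>;  // one "array" of data points

class Controller {
    std::mutex m_;
    std::shared_ptr<Batch> current_ = std::make_shared<Batch>();
    std::size_t batch_size_;
    std::function<void(double)> on_new_point_;  // stands in for the GUI event
    std::function<void(std::shared_ptr<const Batch>)> on_batch_ready_;
public:
    Controller(std::size_t batch_size,
               std::function<void(double)> gui,
               std::function<void(std::shared_ptr<const Batch>)> proc)
        : batch_size_(batch_size), on_new_point_(gui), on_batch_ready_(proc) {}

    // Called by the collector for each new data point.
    void add_point(double p) {
        on_new_point_(p);  // announce to the GUI
        std::shared_ptr<const Batch> full;
        {
            std::lock_guard<std::mutex> lock(m_);
            current_->push_back(p);
            if (current_->size() >= batch_size_) {
                full = std::move(current_);           // handed off, never mutated again
                current_ = std::make_shared<Batch>(); // start a new array
            }
        }
        if (full) on_batch_ready_(full);  // processor gets an immutable batch
    }
};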

Is reading data in one thread while it is written to in another dangerous for the OS?

There is nothing in the way the program uses this data which will cause the program to crash if it reads the old value rather than the new value. It will get the new value at some point.
However, I am wondering if reading and writing at the same time from multiple threads can cause problems for the OS?
I have yet to see such problems, if it does. The program is developed on Linux using pthreads.
I am not interested in being told how to use mutexes/semaphores/locks/etc. so that my program only gets the new values; that is not what I'm asking.
No, the OS should not have any problem. The typical problem is that you don't want to read old values, or a value that is halfway through being updated and thus not valid (which may crash your app, or, if the next value depends on the former, give you a corrupted value and keep generating wrong values all the time). But if you don't care about that, the OS won't either.
Are the kernel/drivers reading that data for any reason (e.g. it contains structures passed to kernel APIs)? If not, then there isn't any issue, since the OS will never look at your hot memory.
Your own reads must ensure they are consistent, so that you don't read half of a value pre-update and half post-update and end up with a value that is neither the pre- nor the post-update one.
There is no danger for the OS. Only your program's data integrity is at risk.
Imagine your data consists of a set (structure) of values which cannot be updated in one atomic operation. The reading thread is bound to read inconsistent data at some point (data consisting of a mixture of old and new values). But you did not want to hear about mutexes...
Problems arise when multiple threads share access to data and access to that data is not atomic. For example, imagine a struct with 10 interdependent fields. If one thread is writing and one is reading, the reading thread is likely to see a struct that is halfway between one state and another (for example, with half of its members set).
If, on the other hand, the data can be read and written with a single atomic operation, you will be fine. For example, imagine a global variable that contains a count: one thread increments it on some condition, and another reads it and takes some action. In this case there is really no intermediate inconsistent state; it has either the new value or the old value.
Logically, you can think of locking as a tool that lets you make arbitrary blocks of code atomic, at least as far as the other threads of execution are concerned.
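To make the two cases concrete, a small sketch using C++11 atomics rather than raw pthreads (purely illustrative; the struct and counter are made up):

#include <atomic>
#include <thread>

// Case 1: a multi-field struct cannot be updated as a single atomic unit;
// a concurrent reader may observe half-old, half-new fields (a torn read,
// and formally a data race).
struct Pair { long a; long b; };  // intended invariant: a == b
Pair unsafe_pair{0, 0};

// Case 2: a single counter with atomic loads/stores; a reader always sees
// either the old or the new value, never anything in between.
std::atomic<long> safe_count{0};

int main() {
    std::thread writer([] {
        for (long i = 1; i <= 1000000; ++i) {
            unsafe_pair = Pair{i, i};                        // may be torn
            safe_count.store(i, std::memory_order_relaxed);  // never torn
        }
    });
    std::thread reader([] {
        for (int i = 0; i < 1000000; ++i) {
            Pair p = unsafe_pair;  // can observe p.a != p.b
            long c = safe_count.load(std::memory_order_relaxed);
            (void)p; (void)c;
        }
    });
    writer.join();
    reader.join();
}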
