OpenMP within a tight loop - I/O

In my loop each thread has its own file to write to, and after the loop is done the master thread gathers all the pieces into one big file, so that part is not an issue.
#pragma omp parallel
{
    FILE *foutput = fopen(...);  /* one output file per thread */
    int i_min = ...;  /* I manually and equally share index 'i' between the OpenMP threads */
    int i_max = ...;
    for (int i = i_min; i < i_max; ++i)
        for (int j = 0; j < 255; ++j)
            for (int k = 0; k < 255; ++k) {
                double value = some_function(i, j, k, var1, var2, var3);
                fprintf(foutput, "%lf\n", value);
            }
    fclose(foutput);
}
'some_function' uses the variables 'var1', 'var2', 'var3', each about 10 MB in size, which were defined earlier by the master thread. The point is that 'some_function' only reads these variables; it changes nothing!
So this code runs extremely slowly and I don't understand why. Reading shared variables from OpenMP threads is fine by itself and does not cause false sharing; maybe it is fprintf that makes everything so slow, and I should use binary files and write in blocks?
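In case formatted output is the bottleneck, here is a minimal sketch of the buffered binary approach; the block size, file naming, simplified some_function, and exact index split are assumptions for illustration:

#include <stdio.h>
#include <omp.h>

/* Stand-in for the question's some_function; only the I/O pattern matters here. */
static double some_function(int i, int j, int k) {
    return i + 0.5 * j + 0.25 * k;
}

int main(void) {
    #pragma omp parallel
    {
        int nt  = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int i_min = 256 * tid / nt;        /* manual, roughly equal split of i */
        int i_max = 256 * (tid + 1) / nt;

        char name[64];
        snprintf(name, sizeof name, "part_%d.bin", tid);  /* one file per thread */
        FILE *foutput = fopen(name, "wb");

        double buf[4096];                  /* block size: an arbitrary choice */
        size_t used = 0;

        for (int i = i_min; i < i_max; ++i)
            for (int j = 0; j < 255; ++j)
                for (int k = 0; k < 255; ++k) {
                    buf[used++] = some_function(i, j, k);
                    if (used == sizeof buf / sizeof buf[0]) {
                        fwrite(buf, sizeof buf[0], used, foutput);  /* one write per block */
                        used = 0;
                    }
                }

        fwrite(buf, sizeof buf[0], used, foutput);  /* flush the last partial block */
        fclose(foutput);
    }
    return 0;
}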

Related

Is it possible to parallelize or unroll this loop?

I am trying to see if I can improve the performance of the following C++ loop, which uses two-dimensional vectors (_external and _Table) and has a loop-carried dependency on the previous iteration. Additionally, the innermost loop computes an index, which makes the right-hand-side access of _Table non-sequential.
int N = 8000;
int M = 400;
int P = 100;
for (int i = 1; i <= N; i++) {
    for (int j = 0; j < M; j++) {
        for (int k = 0; k < P; k++) {
            int index = _external.at(j).at(k);
            _Table.at(j).at(i) += _Table.at(index).at(i-1);
        }
    }
}
What can I do to improve the performance of a loop like this?
Well it looks to me like the order in which these statements:
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
are executed is critical to correctness. (That is, if the iteration order for i, j, k changes, then the results will be different ... and incorrect.)
So I think you are only left with micro-optimizations, like hoisting the expressions _Table.at(j).at(i) and _external.at(j) out of the innermost loop.
Consider this:
for (int k = 0; k < P; k++) {
    int index = _external.at(j).at(k);
    _Table.at(j).at(i) += _Table.at(index).at(i-1);
}
This loop is repeatedly adding numbers to _Table.at(j).at(i). Since (by inspection) _Table.at(index).at(i-1) must be reading from a different cell of the table (because of i-1 versus i), you could do this:
int temp = 0;
for (int k = 0; k < P; k++) {
    int index = _external.at(j).at(k);
    temp += _Table.at(index).at(i-1);
}
_Table.at(j).at(i) += temp;
This will reduce the number of calls to at, and may also improve cache performance a bit.
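Combining that with the hoisting suggested above, a sketch of the full loop nest (assuming the tables are std::vector<std::vector<int>>, which the question only implies, and with the reserved-style names _Table and _external renamed):

#include <vector>

// Sketch: hoist the _external row lookup out of the k loop and accumulate
// into a local before a single write back to the table.
void update(std::vector<std::vector<int>>& table,
            const std::vector<std::vector<int>>& external,
            int N, int M, int P)
{
    for (int i = 1; i <= N; i++) {
        for (int j = 0; j < M; j++) {
            const std::vector<int>& indices = external.at(j);  // hoisted: same row for every k
            int temp = 0;
            for (int k = 0; k < P; k++) {
                // indices[k] skips the bounds check; assumes each row holds P entries
                temp += table.at(indices[k]).at(i - 1);
            }
            table.at(j).at(i) += temp;  // one write instead of P read-modify-writes
        }
    }
}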

Wrapping around negative numbers in Rust

I'm rewriting C code in Rust that relies heavily on u32 variables and wrapping arithmetic. For example, I have a loop defined like this:
#define NWORDS 24
#define ZERO_WORDS 11

int main()
{
    unsigned int i, j;
    for (i = 0; i < NWORDS; i++) {
        for (j = 0; j < i; j++) {
            if (j < (i - ZERO_WORDS + 1)) {
            }
        }
    }
    return 0;
}
Now, the expression i - ZERO_WORDS + 1 in the if statement needs to wrap around for small values of i (initially i = 0). I came across the wrapping_neg method, but it seems to just compute -self. Is there a more flexible way to work with u32 in Rust that also allows wrapping?
As mentioned in the comments, the literal answer to your question is to use u32::wrapping_sub and u32::wrapping_add:
const NWORDS: u32 = 24;
const ZERO_WORDS: u32 = 11;

fn main() {
    for i in 0..NWORDS {
        for j in 0..i {
            if j < i.wrapping_sub(ZERO_WORDS).wrapping_add(1) {}
        }
    }
}
However, I'd advocate avoiding wrapping operations unless you are performing hashing / cryptography / compression / something similar. Wrapping operations are non-intuitive. For example, j < i-ZERO_WORDS+1 doesn't give the same results as j+ZERO_WORDS < i+1: with i = 5, the left-hand subtraction wraps to a huge number and the comparison is true for every j, while the right-hand comparison is false for every j.
Even better would be to rewrite the logic. I can't even tell in which circumstances that if expression will be true without spending a lot of time thinking about it!
It turns out that the condition is true for i=9, j=8, but not for i=10, j=0. Perhaps all of this is clearer in the real code, but devoid of context it's very confusing.
This appears to have the same logic, but seems much more understandable to me:
i < ZERO_WORDS - 1 || i - j > ZERO_WORDS - 1;
Compare:
j < i.wrapping_sub(ZERO_WORDS).wrapping_add(1);
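A quick brute-force check over the question's ranges (a sketch; the assert and final println are just for demonstration) confirms the two forms agree:

const NWORDS: u32 = 24;
const ZERO_WORDS: u32 = 11;

fn main() {
    for i in 0..NWORDS {
        for j in 0..i {
            let wrapping = j < i.wrapping_sub(ZERO_WORDS).wrapping_add(1);
            let readable = i < ZERO_WORDS - 1 || i - j > ZERO_WORDS - 1;
            // the two conditions must match for every (i, j) pair in range
            assert_eq!(wrapping, readable, "mismatch at i={i}, j={j}");
        }
    }
    println!("both conditions agree for all i < {NWORDS}, j < i");
}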

OpenMP in Biham-Middleton-Levine BML model

I've got a serial version of BML and I'm trying to write a parallel one with OpenMP. Basically my main function contains a loop that calls two functions for horizontal and vertical moves, like this:
for (s = 0; s < nmovss; s++) {
    horizontal_movs(grid, N);
    copy_sides(grid, N);
    cur = 1-cur;
    vertical_movs(grid, N);
    copy_sides(grid, N);
    cur = 1-cur;
}
where cur is the current grid. The horizontal and vertical functions are similar, each with a nested loop:
for (i = 1; i <= n; i++) {
    for (j = 1; j <= n+1; j++) {
        if (grid[cur][i][j-1] == LR && grid[cur][i][j] == EMPTY) {
            grid[1-cur][i][j-1] = EMPTY;
            grid[1-cur][i][j] = LR;
        }
        else {
            grid[1-cur][i][j] = grid[cur][i][j];
        }
    }
}
The code produces a PPM image at every step, and with a certain input the serial version produces output we can assume is correct. But using #pragma omp parallel for inside the two functions, the PPM file ends up split into as many zones as there are threads (e.g. 4).
I suppose the problem is that every thread should do both functions in sequence before terminating, because the movements are strictly connected; I don't know how to arrange that. If I set the pragma at a higher level, like before the main loop, there is no speed-up. Obviously the PPM file must not be sliced into zones.
Going on, I tried this solution, which gives me a result identical to the serial code, though I don't exactly understand why:
#pragma omp parallel num_threads(thread_count) default(none) \
        shared(grid, n, cur) private(i, j)
{
    for (i = 1; i <= n+1; i++) {
        #pragma omp for
        for (j = 1; j <= n; j++) {
            if (grid[cur][i-1][j] == TB && grid[cur][i][j] == EMPTY) {
                grid[1-cur][i-1][j] = EMPTY;
                grid[1-cur][i][j] = TB;
            }
            else {
                grid[1-cur][i][j] = grid[cur][i][j];
            }
        }  /* implicit barrier at the end of each "omp for": every thread
              finishes row i before any thread starts row i+1 */
    }
}
Moreover, if I use just one more thread than the available cores (4), the execution time "explodes" instead of staying roughly the same.

Assigning a new string crashes?

I'm trying to write my first real program with dynamic arrays, but I've come across a problem I cannot understand. Basically, I am trying to take a dynamic array, copy it into a temporary one, grow the original array by one element, then copy everything back, so that the original array ends up one element longer than before. This worked perfectly with ints, but strings crash my program. Here's an example of the code I'm struggling with:
#include <iostream>
#include <string>

int main()
{
    int x = 3;
    std::string *q = new std::string[x];
    q[0] = "1";
    q[1] = "2";
    q[2] = "3";

    x++;
    std::string *temp = q;
    q = new std::string[x];
    q = temp;
    q[x-1] = "4";

    for (int i = 0; i < 5; i++)
        std::cout << q[i] << std::endl;
}
If I make q and temp pointers to int instead of string, the program runs just fine. Any help would be greatly appreciated; I've been stuck on this for an hour or two.
q = temp performs only a shallow pointer copy: it leaks the 4-element array you just allocated and points q back at the original 3-element allocation.
Since q once again refers to the array that was allocated with only 3 elements, assigning to the element at x - 1 is outside the bounds of that array.
If you have to do it this way for some reason, it should look like this:
auto temp = q;
q = new std::string[x];
for (int i = 0; i < x - 1; ++i)  // copy the old elements
    q[i] = temp[i];
delete [] temp;                  // free the old array
q[x-1] = "4";
However, this is obviously more complex and much more error-prone than the idiomatic way of doing this in C++. Better to use std::vector<std::string> instead.
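For comparison, a sketch of the same steps with std::vector, which handles the allocation, copying, and cleanup itself:

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> q = {"1", "2", "3"};
    q.push_back("4");  // grows the array; no manual copy or delete needed

    for (const std::string& s : q)
        std::cout << s << '\n';
}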

Convert For loop into Parallel.For loop

public void DoSomething(byte[] array, byte[] array2, int start, int count)
{
    int length = array.Length;
    int index = 0;
    while (count >= needleLen)  // needleLen is defined elsewhere in the original code
    {
        index = Array.IndexOf(array, array2[0], start, count - length + 1);
        int i = 0;
        int p = 0;
        for (i = 0, p = index; i < length; i++, p++)
        {
            if (array[p] != array2[i])
            {
                break;
            }
        }
        // ... (rest of the method not shown)
    }
}
Given that your for loop's body appears to depend on ordering, it's most likely not a candidate for parallelization.
However, you aren't showing the "work" involved here, so it's difficult to tell what it's doing. Since the loop relies on both i and p, and it appears that they would vary independently, it's unlikely the loop can be rewritten with a simple Parallel.For without reworking or rethinking your algorithm.
For a loop body to be a good candidate for parallelization, it typically needs to be order-independent and free of ordering constraints. The fact that you're basing your loop on two independently varying variables suggests those requirements are not met by this algorithm.
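By contrast, here is a sketch of the kind of loop that does convert cleanly to Parallel.For: each iteration reads and writes only its own slot, so ordering doesn't matter (the arrays and the square-root work are made up for illustration):

using System;
using System.Threading.Tasks;

class Example
{
    static void Main()
    {
        var input = new byte[1_000_000];
        var output = new double[input.Length];

        // Safe to parallelize: iteration i touches only input[i] and output[i],
        // and no iteration depends on the result of another.
        Parallel.For(0, input.Length, i =>
        {
            output[i] = Math.Sqrt(input[i]);
        });

        Console.WriteLine(output[0]);
    }
}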
