Eigen, parallel ConjugateGradient failed: More threads, more costs - multithreading

I want to use a parallel ConjugateGradient in Eigen 3.3.7 (gitlab) to solve Ax=b, but it showed that more threads, more computational costs.
I test the code in this question and change the matrix dimension from 90000 to 9000000. Here is the code (I name the file as test-cg-parallel.cpp):
// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;
// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
VectorXd& b, const VectorXd& boundary)
{
int n = int(boundary.size());
int id1 = i+j*n;
if(i==-1 || i==n) b(id) -= w * boundary(j); // constrained coefficient
else if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient
else coeffs.push_back(T(id,id1,w)); // unknown coefficient
}
void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
b.setZero();
ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
for(int j=0; j<n; ++j)
{
for(int i=0; i<n; ++i)
{
int id = i+j*n;
insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
insertCoefficient(id, i,j, 4, coefficients, b, boundary);
}
}
}
int main()
{
int n = 3000; // size of the image
int m = n*n; // number of unknowns (=number of pixels)
// Assembly:
vector<T> coefficients; // list of non-zeros coefficients
VectorXd b(m); // the right hand side-vector resulting from the constraints
buildProblem(coefficients, b, n);
SpMat A(m,m);
A.setFromTriplets(coefficients.begin(), coefficients.end());
// Solving:
// Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
clock_t time_start, time_end;
time_start=clock();
ConjugateGradient<SpMat, Lower|Upper> solver(A);
VectorXd x = solver.solve(b); // use the factorization to solve for the given right hand side
time_end=clock();
cout<<"time use:"<<1000*(time_end-time_start)/(double)CLOCKS_PER_SEC<<"ms"<<endl;
return 0;
}
I compile the code with gcc 7.4.0, Intel Xeon E2186G CPU with 6 cores(12 threads), compile and run details are as follows:
liu#liu-Precision-3630-Tower:~/test$ g++ test-cg-parallel.cpp -O3 -fopenmp -o cg
liu#liu-Precision-3630-Tower:~/test$ OMP_NUM_THREADS=1 ./cg
time use:747563ms
liu#liu-Precision-3630-Tower:~/test$ OMP_NUM_THREADS=4 ./cg
time use: 1.49821e+06ms
liu#liu-Precision-3630-Tower:~/test$ OMP_NUM_THREADS=8 ./cg
time use: 2.60207e+06ms
Can anyone give me some advices? Thanks a lot.

Related

random shuffle in RcppArrayFire

I am experiencing trouble getting this example to run correctly. Currently it produces the same random sample for every iteration and seed input, despite the seed changing as shown by af::getSeed().
#include "RcppArrayFire.h"
#include <random>
using namespace Rcpp;
using namespace RcppArrayFire;
// [[Rcpp::export]]
af::array random_test(RcppArrayFire::typed_array<f64> theta, int counts, int seed){
const int theta_size = theta.dims()[0];
af::array out(counts, theta_size, f64);
for(int f = 0; f < counts; f++){
af::randomEngine engine;
af_set_seed(seed + f);
//Rcpp::Rcout << af::getSeed();
af::array out_temp = af::randu(theta_size, u8, engine);
out(f, af::span) = out_temp;
// out(f, af::span) = theta(out_temp);
}
return out;
}
/*** R
theta <- 1:10
random_test(theta, 5, 1)
random_test(theta, 5, 2)
*/
The immediate problem is that you are creating a random engine within each iteration of the loop but set the seed of the global random engine. Either you set the seed of the local engine via engine.setSeed(seed), or you get rid of the local engine all together, letting af::randu default to using the global engine.
However, it would still be "unusual" to change the seed during each step of the loop. Normally one sets the seed only once, e.g.:
// [[Rcpp::depends(RcppArrayFire)]]
#include "RcppArrayFire.h"
// [[Rcpp::export]]
af::array random_test(RcppArrayFire::typed_array<f64> theta, int counts, int seed){
const int theta_size = theta.dims()[0];
af::array out(counts, theta_size, f64);
af::setSeed(seed);
for(int f = 0; f < counts; f++){
af::array out_temp = af::randu(theta_size, u8);
out(f, af::span) = out_temp;
}
return out;
}
BTW, it makes sense to parallelize this as long as your device has enough memory. For example, you could generate a random counts x theta_size matrix in one go using af::randu(counts, theta_size, u8).

NxN matrix is given and we have to find

N things to select for N people, you were given a NxN matrix and cost at each element, you needed to find the one combination with max total weight, such that each person gets exactly one thing.
I found difficulty in making its dp state.
please help me and if possible then also write code for it
C++ style code:
double max_rec(int n, int r, int* c, double** m, bool* f)
{
if (r < n)
{
double max_v = 0.0;
int max_i = -1;
for (int i = 0; i < n; i++)
{
if (f[i] == false)
{
f[i] = true;
double value = m[r][i] + max_rec(n, r + 1, c, m, f);
if (value > max_v)
{
max_v = value;
max_i = i;
}
f[i] = false;
}
}
c[i] = max_i;
return max_v;
}
return 0.0;
}
int* max_comb(int n, double** m)
{
bool* f = new bool[n];
int* c = new int[n];
max_rec(n, 0, c, m, f);
delete [] f;
return c;
}
Call max_comb with N and your NxN matrix (2d array). Returns the column indices of the maximum combination.
Time complexity: O(N!)
I know this is bad but the problem does not have a greedy structure.
And as #mszalbach said, try to attempt the problem yourself before asking.
EDIT: can reduce to polynomial time by memoizing.

Errors with repeated FFTW calls

I'm having a strange issue that I can't resolve. I made this as a simple example that demonstrates the problem. I have a sine wave defined between [0, 2*pi]. I take the Fourier transform using FFTW. Then I have a for loop where I repeatedly take the inverse Fourier transform. In each iteration, I take the average of my solution and print the results. I expect that the average stays the same with each iteration because there is no change to solution, y. However, when I pick N = 256 and other even values of N, I note that the average grows as if there are numerical errors. However, if I choose, say, N = 255 or N = 257, this is not the case and I get what is expect (avg = 0.0 for each iteration).
Code:
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>
#include <math.h>
int main(void)
{
int N = 256;
double dx = 2.0 * M_PI / (double)N, dt = 1.0e-3;
double *x, *y;
x = (double *) malloc (sizeof (double) * N);
y = (double *) malloc (sizeof (double) * N);
// initial conditions
for (int i = 0; i < N; i++) {
x[i] = (double)i * dx;
y[i] = sin(x[i]);
}
fftw_complex yhat[N/2 + 1];
fftw_plan fftwplan, fftwplan2;
// forward plan
fftwplan = fftw_plan_dft_r2c_1d(N, y, yhat, FFTW_ESTIMATE);
fftw_execute(fftwplan);
// set N/2th mode to zero if N is even
if (N % 2 < 1.0e-13) {
yhat[N/2][0] = 0.0;
yhat[N/2][1] = 0.0;
}
// backward plan
fftwplan2 = fftw_plan_dft_c2r_1d(N, yhat, y, FFTW_ESTIMATE);
for (int i = 0; i < 50; i++) {
// yhat to y
fftw_execute(fftwplan2);
// rescale
for (int j = 0; j < N; j++) {
y[j] = y[j] / (double)N;
}
double avg = 0.0;
for (int j = 0; j < N; j++) {
avg += y[j];
}
printf("%.15f\n", avg/N);
}
fftw_destroy_plan(fftwplan);
fftw_destroy_plan(fftwplan2);
void fftw_cleanup(void);
free(x);
free(y);
return 0;
}
Output for N = 256:
0.000000000000000
0.000000000000000
0.000000000000000
-0.000000000000000
0.000000000000000
0.000000000000022
-0.000000000000007
-0.000000000000039
0.000000000000161
-0.000000000000314
0.000000000000369
0.000000000004775
-0.000000000007390
-0.000000000079126
-0.000000000009457
-0.000000000462023
0.000000000900855
-0.000000000196451
0.000000000931323
-0.000000009895302
0.000000039348379
0.000000133179128
0.000000260770321
-0.000003233551979
0.000008285045624
-0.000016331672668
0.000067450106144
-0.000166893005371
0.001059055328369
-0.002521514892578
0.005493164062500
-0.029907226562500
0.093383789062500
-0.339111328125000
1.208251953125000
-3.937500000000000
13.654296875000000
-43.812500000000000
161.109375000000000
-479.250000000000000
1785.500000000000000
-5369.000000000000000
19376.000000000000000
-66372.000000000000000
221104.000000000000000
-753792.000000000000000
2387712.000000000000000
-8603776.000000000000000
29706240.000000000000000
-96833536.000000000000000
Any ideas?
libfftw has the odious habit of modifying its inputs. Back up yhat if you want to do repeated inverse transforms.
OTOH, it's perverse, but why are you repeating the same operation if you don't expect it give different results? (Despite this being the case)
As indicated in comments: "if you want to keep the input data unchanged, use the FFTW_PRESERVE_INPUT flag. Per http://www.fftw.org/doc/Planner-Flags.html"
For example:
// backward plan
fftwplan2 = fftw_plan_dft_c2r_1d(N, yhat, y, FFTW_ESTIMATE | FFTW_PRESERVE_INPUT);

CodeJam 2014: How to solve task "New Lottery Game"?

I want to know efficient approach for the New Lottery Game problem.
The Lottery is changing! The Lottery used to have a machine to generate a random winning number. But due to cheating problems, the Lottery has decided to add another machine. The new winning number will be the result of the bitwise-AND operation between the two random numbers generated by the two machines.
To find the bitwise-AND of X and Y, write them both in binary; then a bit in the result in binary has a 1 if the corresponding bits of X and Y were both 1, and a 0 otherwise. In most programming languages, the bitwise-AND of X and Y is written X&Y.
For example:
The old machine generates the number 7 = 0111.
The new machine generates the number 11 = 1011.
The winning number will be (7 AND 11) = (0111 AND 1011) = 0011 = 3.
With this measure, the Lottery expects to reduce the cases of fraudulent claims, but unfortunately an employee from the Lottery company has leaked the following information: the old machine will always generate a non-negative integer less than A and the new one will always generate a non-negative integer less than B.
Catalina wants to win this lottery and to give it a try she decided to buy all non-negative integers less than K.
Given A, B and K, Catalina would like to know in how many different ways the machines can generate a pair of numbers that will make her a winner.
For small input we can check all possible pairs but how to do it with large inputs. I guess we represent the binary number into string first and then check permutations which would give answer less than K. But I can't seem to figure out how to calculate possible permutations of 2 binary strings.
I used a general DP technique that I described in a lot of detail in another answer.
We want to count the pairs (a, b) such that a < A, b < B and a & b < K.
The first step is to convert the numbers to binary and to pad them to the same size by adding leading zeroes. I just padded them to a fixed size of 40. The idea is to build up the valid a and b bit by bit.
Let f(i, loA, loB, loK) be the number of valid suffix pairs of a and b of size 40 - i. If loA is true, it means that the prefix up to i is already strictly smaller than the corresponding prefix of A. In that case there is no restriction on the next possible bit for a. If loA ist false, A[i] is an upper bound on the next bit we can place at the end of the current prefix. loB and loK have an analogous meaning.
Now we have the following transition:
long long f(int i, bool loA, bool loB, bool loK) {
// TODO add memoization
if (i == 40)
return loA && loB && loK;
int hiA = loA ? 1: A[i]-'0'; // upper bound on the next bit in a
int hiB = loB ? 1: B[i]-'0'; // upper bound on the next bit in b
int hiK = loK ? 1: K[i]-'0'; // upper bound on the next bit in a & b
long long res = 0;
for (int a = 0; a <= hiA; ++a)
for (int b = 0; b <= hiB; ++b) {
int k = a & b;
if (k > hiK) continue;
res += f(i+1, loA || a < A[i]-'0',
loB || b < B[i]-'0',
loK || k < K[i]-'0');
}
return res;
}
The result is f(0, false, false, false).
The runtime is O(max(log A, log B)) if memoization is added to ensure that every subproblem is only solved once.
What I did was just to identify when the answer is A * B.
Otherwise, just brute force the rest, this code passed the large input.
// for each test cases
long count = 0;
if ((K > A) || (K > B)) {
count = A * B;
continue; // print count and go to the next test case
}
count = A * B - (A-K) * (B-K);
for (int i = K; i < A; i++) {
for (int j = K; j < B; j++) {
if ((i&j) < K) count++;
}
}
I hope this helps!
just as Niklas B. said.
the whole answer is.
#include <algorithm>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
#define MAX_SIZE 32
int A, B, K;
int arr_a[MAX_SIZE];
int arr_b[MAX_SIZE];
int arr_k[MAX_SIZE];
bool flag [MAX_SIZE][2][2][2];
long long matrix[MAX_SIZE][2][2][2];
long long
get_result();
int main(int argc, char *argv[])
{
int case_amount = 0;
cin >> case_amount;
for (int i = 0; i < case_amount; ++i)
{
const long long result = get_result();
cout << "Case #" << 1 + i << ": " << result << endl;
}
return 0;
}
long long
dp(const int h,
const bool can_A_choose_1,
const bool can_B_choose_1,
const bool can_K_choose_1)
{
if (MAX_SIZE == h)
return can_A_choose_1 && can_B_choose_1 && can_K_choose_1;
if (flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1])
return matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1];
int cnt_A_max = arr_a[h];
int cnt_B_max = arr_b[h];
int cnt_K_max = arr_k[h];
if (can_A_choose_1)
cnt_A_max = 1;
if (can_B_choose_1)
cnt_B_max = 1;
if (can_K_choose_1)
cnt_K_max = 1;
long long res = 0;
for (int i = 0; i <= cnt_A_max; ++i)
{
for (int j = 0; j <= cnt_B_max; ++j)
{
int k = i & j;
if (k > cnt_K_max)
continue;
res += dp(h + 1,
can_A_choose_1 || (i < cnt_A_max),
can_B_choose_1 || (j < cnt_B_max),
can_K_choose_1 || (k < cnt_K_max));
}
}
flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = true;
matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = res;
return res;
}
long long
get_result()
{
cin >> A >> B >> K;
memset(arr_a, 0, sizeof(arr_a));
memset(arr_b, 0, sizeof(arr_b));
memset(arr_k, 0, sizeof(arr_k));
memset(flag, 0, sizeof(flag));
memset(matrix, 0, sizeof(matrix));
int i = 31;
while (i >= 1)
{
arr_a[i] = A % 2;
A /= 2;
arr_b[i] = B % 2;
B /= 2;
arr_k[i] = K % 2;
K /= 2;
i--;
}
return dp(1, 0, 0, 0);
}

This function gives different result if compiled with Visual-C++ or gnu g++

The following function (extracted from a .cpp file) gives two different results (that is, the output buffer int_image is different) if executed on a PC with Visual Studio (cpu Intel i7 running Windows 7) or on my Android phone (P880). The two input buffers im1 and im2, of type int8 (a synonymous of char), are exactly the same (checked), as well as the parameters w and h. I can't understand why this is happening:
void Compute(int8* im1,
int8* im2,
int w,
int h,
int* int_image)
{
int index = 0;
int sum;
for(int i = 0; i<h; i++)
{
// reset this column sum
sum = 0;
for(int j = 0; j<w; j++)
{
int pn;
int8 v1, v2;
v1 = im1[index];
v2 = im2[index];
pn = v1*v2;
//pn = ((int)im1[index]) * ((int)im2[index]);
sum += pn;
if (i==0)
int_image[index] = sum;
else
int_image[index] = int_image[index - w] + sum;
index++;
}
}
}
Note.
The size of char images im1 and im2 can be such that an integer overflow can happen (but i think that this kind of situation is handled equally by the two compilers, but at this point i am not so sure).
Found the source of error. I defined int8 as char, believing that by default it was a signed char. Instead, it is unsigned on gcc, while signed on Visual C++. I do not know what the standard says, but i suggest to all programmers out there to use explicitly signed char and unsigned char when defining your own macro for int8 and uint8.
Interesting discussion about this on stackoverflow.

Resources