PyBind11 Sparse Solver - python-3.x

I am trying to build a fast sparse solver with Eigen and OpenMP for Python, using the PyBind11 package for the interface between the solver and Python. Basically, the solver works fine, but unfortunately it only runs on one core and I cannot figure out how to use all cores of my CPU, even though the OpenMP test function does use the entire CPU.
Here is the C++ code:
#include <iostream>
#include <cmath>
#include <omp.h>
#include <unistd.h>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;
#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>

void openmp_test()
{
    int N;
    N = 8;
    // omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i=0; i<N; i=i+1)
    {
        sleep(10);
    }
}

void eigen_test(int N)
{
    Eigen::SparseMatrix<double> A(N, N);
    Eigen::SparseMatrix<double> b(N, 1);
    Eigen::SparseMatrix<double> x(N, 1);
    Eigen::BiCGSTAB<Eigen::SparseMatrix<double>> solver;
    A.reserve(5*N);
    b.reserve(N);
    x.reserve(N);
    for(int i=0; i<N; i++)
    {
        b.insert(i, 0) = 20.0;
        for(int j=(i-2); j<=(i+2); j++)
        {
            if(j == i)
            {
                A.insert(j, i) = 10.0;
            }
            else if((j >= 0) && (j < N))
            {
                A.insert(j, i) = 5.0;
            }
        }
    }
    solver.compute(A);
    x = solver.solve(b);
}

PYBIND11_MODULE(mytest, m)
{
    m.def("openmp_test", &openmp_test, py::call_guard<py::gil_scoped_release>());
    m.def("eigen_test", &eigen_test, py::call_guard<py::gil_scoped_release>());
}
Here is the compile command:
g++ \
-O3 \
-Wall \
-shared \
-std=c++14 \
-fopenmp \
-fPIC \
-I /usr/local/lib/python3.10/dist-packages/pybind11/include \
-I /usr/include/python3.10 \
-I /workspaces/pybind11/external/eigen \
mytest.cpp \
-o mytest.so
And finally, here is the Python code:
from time import time
import mytest
N = 10 * 10**3
start = time()
mytest.openmp_test()
print("runtime: {:.3f} s".format(time() - start))
start = time()
mytest.eigen_test(N)
print("runtime: {:.3e} s".format(time() - start))
Does anyone have an idea how to fix this problem?
Thanks a lot.

See the Eigen documentation on multi-threading:
Currently, the following algorithms can make use of multi-threading:
general dense matrix - matrix products
PartialPivLU
row-major-sparse * dense vector/matrix products
ConjugateGradient with Lower|Upper as the UpLo template parameter.
BiCGSTAB with a row-major sparse matrix format.
LeastSquaresConjugateGradient
Only the row-major sparse matrix format is supported for multi-threading. Therefore, you will need to have:
Eigen::SparseMatrix<double, Eigen::RowMajor> A(N, N);
Eigen::BiCGSTAB<Eigen::SparseMatrix<double, Eigen::RowMajor>> solver;
I have noticed that inserting elements into the matrix takes longer than computing the solution. Since the former is sequential, it may not be feasible to observe the parallelism. You can increase the maximum number of solver iterations (e.g. solver.setMaxIterations(1e9)) to force the solve to take longer (and thus make the CPU occupation easier to observe). You can also print checkpoints to see which part of your code is executing at any given moment. For example:
void eigen_test(int N)
{
    Eigen::SparseMatrix<double, Eigen::RowMajor> A(N, N);
    Eigen::SparseMatrix<double> b(N, 1);
    Eigen::SparseMatrix<double> x(N, 1);
    Eigen::BiCGSTAB<Eigen::SparseMatrix<double, Eigen::RowMajor>> solver;
    solver.setMaxIterations(1e9);
    A.reserve(5*N);
    b.reserve(N);
    x.reserve(N);
    std::cout << "checkpoint: insert elems..." << std::endl;
    for(int i=0; i<N; i++)
    {
        b.insert(i, 0) = 20.0;
        for(int j=(i-2); j<=(i+2); j++)
        {
            if(j == i)
            {
                A.insert(j, i) = 10.0;
            }
            else if((j >= 0) && (j < N))
            {
                A.insert(j, i) = 5.0;
            }
        }
    }
    std::cout << "checkpoint: find solution..." << std::endl;
    solver.compute(A);
    x = solver.solve(b);
    std::cout << "checkpoint: done!" << std::endl;
}
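To double-check that the multi-threaded path is actually active, you can query (and cap) Eigen's OpenMP thread count at runtime; Eigen::setNbThreads and Eigen::nbThreads come with Eigen/Core and only take effect when the module is compiled with -fopenmp. A minimal sketch:
#include <Eigen/Core>
#include <iostream>

void thread_info()
{
    // Reports 1 unless the module was built with -fopenmp.
    Eigen::setNbThreads(4); // optional: cap the number of OpenMP threads Eigen uses
    std::cout << "Eigen is using " << Eigen::nbThreads() << " threads" << std::endl;
}
You can also set the OMP_NUM_THREADS environment variable before starting Python to control the thread count from the outside.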

Related

Why does asctime(localtime(&ltime)) work properly on gcc 4.8.4 but cause problems with gcc 9.4.0?

ltime=time(NULL);
sprintf(buffer,"\n%sTwo matrices of size %dx%d have been read. The number of threads is %d \n",asctime(localtime(&ltime)),side_Array,side_Array,val_M);
write(STDOUT_FILENO,buffer,strlen(buffer));
This is the code that causes problems with the output file that the threads write into. I know that localtime is not a thread-safe function; however, when I used localtime with gcc 4.8.4 it worked properly. I am sure that localtime is the problem, but I don't get why it doesn't cause a problem with gcc 4.8.4, and is there any possible solution by changing the makefile? (A thread-safe localtime_r sketch follows after the makefile below.)
Here is my main:
int main(int argc, char *argv[])
{
    ltime=time(NULL);
    sprintf(buffer,"\n%sTwo matrices of size %dx%d have been read. The number of threads is %d \n",asctime(localtime(&ltime)),side_Array,side_Array,val_M);
    write(STDOUT_FILENO,buffer,strlen(buffer));
    // reading file, creating arrays
    pthread_mutex_init (&mutex, NULL);
    threadsArr = malloc(sizeof(pthread_t) * val_M);
    for(i = 0; i < val_M; ++i){
        prm_i = malloc(sizeof(*prm_i));
        *prm_i = i;
        pthread_create(&(threadsArr[i]), NULL, threads, prm_i);
    }
    for(i = 0; i < val_M; ++i){
        pthread_join(threadsArr[i],NULL);
    }
    // Threads write their results into arr_DFT; because of localtime this calculation gives the wrong result.
    for(i = 0; i < side_Array; i++)
    {
        for(j = 0; j < side_Array; j++)
        {
            sprintf(buffer, "%.3f + (%.3fi),",arr_DFT[i][j].real, arr_DFT[i][j].imag);
            write(fd_csv, buffer, strlen(buffer));
        }
        write(fd_csv, newLine, strlen(newLine));
    }
}
Here is my thread:
void *threads(void *arg1){
    // mathematical calculations
    pthread_mutex_lock(&mutex);
    ++arrived;
    end = clock();
    time_spend = (double)(end - begin) / CLOCKS_PER_SEC;
    sprintf(printBuffer,"%sThread %d has arrived the rendezvous point in %.2f seconds.\n",asctime(localtime(&ltime)),tempVal+1, time_spend);
    write(STDOUT_FILENO, printBuffer, strlen(printBuffer));
    if(arrived < val_M)
        pthread_cond_wait(&conditionVar, &mutex);
    else
        pthread_cond_broadcast(&conditionVar);
    pthread_mutex_unlock(&mutex);
    sprintf(printBuffer,"%sThread %d is advancing the second part.\n",asctime(localtime(&ltime)), tempVal+1);
    write(STDOUT_FILENO, printBuffer, strlen(printBuffer));
    begin = clock();
    // mathematical calculations
    end = clock();
    time_spend = (double)(end - begin) / CLOCKS_PER_SEC;
    sprintf(printBuffer, "%sThread %d has finished the second part in %f seconds \n",asctime(localtime(&ltime)),tempVal+1, time_spend);
    write(STDOUT_FILENO, printBuffer, strlen(printBuffer));
    return NULL;
}
Here is my makefile:
all:
	gcc -Wall main.c -o hw5 -pthread -lm
clean:
	rm -f *o && rm -f hw5
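Since the question pinpoints localtime, here is a minimal sketch of the usual fix, assuming POSIX (glibc): use the re-entrant variants localtime_r and asctime_r, which write into caller-supplied buffers instead of the shared static storage that makes localtime/asctime unsafe across threads. The helper name log_timestamp is made up for illustration:
#include <stdio.h>
#include <string.h>
#include <time.h>

void log_timestamp(const char *msg)
{
    char timestr[26];                 /* asctime_r requires at least 26 bytes */
    char buffer[256];
    struct tm tm_snapshot;
    time_t now = time(NULL);
    localtime_r(&now, &tm_snapshot);  /* thread-safe: fills the caller's struct tm */
    asctime_r(&tm_snapshot, timestr); /* thread-safe variant of asctime */
    snprintf(buffer, sizeof(buffer), "%s%s\n", timestr, msg);
    fputs(buffer, stdout);
}
No makefile change alone can make localtime thread-safe; the different behavior between the gcc versions is most likely just the data race manifesting differently.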

OpenACC: How can I keep data between different calls of a function?

I'm trying to optimize an application with OpenACC. In main, I have an iteration loop of this type:
while(t<tstop){
    add(&data, nx);
}
where data is a variable of type Data, defined by this structure:
typedef struct Data_{
    double *x;
}Data;
The function I'm calling in the while loop is parallelizable, but what I can't manage is keeping the array x[] in device memory between the different calls of the function.
void add(Data *data, int n){
    #pragma acc data pcopy(data[0:1])
    #pragma acc data pcopy(data->x[0:n])
    #pragma acc parallel loop
    for(int i=0; i < n ; i++){
        data->x[i] += 1.;
    }
    #pragma acc exit data copyout(data->x[0:n])
    #pragma acc exit data copyout(data[0:1])
}
I know the program seems to make no sense; I just wrote something to reproduce the problem I have in the real code.
I tried to use an unstructured data region:
#pragma acc enter data copyin(data[0:1])
#pragma acc enter data copyin(data->x[0:n])
#pragma acc data present(data[:1], data->x[:n])
#pragma acc parallel loop
for(int i=0; i < n ; i++){
    data->x[i] += 1.;
}
#pragma acc exit data copyout(data->x[0:n])
#pragma acc exit data copyout(data[0:1])
but for some reason I get an error of this type:
FATAL ERROR: variable in data clause is partially present on the device: name=data
I'm not able to reproduce the partially-present error from the code snippet provided, so it's unclear why this error is occurring. In general, the error occurs when the size of the variable in the present table differs from the size being used in the data clause. If you can provide a reproducing example, I can take a look and determine why it's happening here.
To answer the topic question, device variables can be accessed anywhere within the scope of the data region they are in, even across subroutines. For unstructured data regions (i.e. enter data/exit data), the scope is defined at runtime between the enter and exit calls. For structured data regions, the scope is defined by the structured block.
Here's an example using the structure you define above (though I've included the size of x as part of the struct).
% cat test.c
#include <stdio.h>
#include <stdlib.h>
typedef struct Data_{
    double *x;
    int n;
}Data;
void add(Data *data){
    #pragma acc parallel loop present(data)
    for(int i=0; i < data->n ; i++){
        data->x[i] += 1.;
    }
}
int main () {
    Data *data;
    data = (Data*) malloc(sizeof(Data));
    data->n = 64;
    data->x = (double *) malloc(sizeof(double)*data->n);
    for(int i=0; i < data->n ; i++){
        data->x[i] = (double) i;
    }
    #pragma acc enter data copyin(data[0:1])
    #pragma acc enter data copyin(data->x[0:data->n])
    add(data);
    #pragma acc exit data copyout(data->x[0:data->n])
    #pragma acc exit data delete(data)
    for(int i=0; i < data->n ; i++){
        printf("%d:%f\n",i,data->x[i]);
    }
    free(data->x);
    free(data);
}
% pgcc test.c -ta=tesla -Minfo=accel; a.out
add:
     12, Generating present(data[:])
         Generating Tesla code
         13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
main:
     28, Generating enter data copyin(data[:1])
     29, Generating enter data copyin(data->x[:data->n])
     31, Generating exit data copyout(data->x[:data->n])
     32, Generating exit data delete(data[:1])
0:1.000000
1:2.000000
2:3.000000
3:4.000000
4:5.000000
5:6.000000
6:7.000000
7:8.000000
8:9.000000
9:10.000000
10:11.000000
11:12.000000
12:13.000000
13:14.000000
14:15.000000
15:16.000000
16:17.000000
17:18.000000
18:19.000000
19:20.000000
20:21.000000
21:22.000000
22:23.000000
23:24.000000
24:25.000000
25:26.000000
26:27.000000
27:28.000000
28:29.000000
29:30.000000
30:31.000000
31:32.000000
32:33.000000
33:34.000000
34:35.000000
35:36.000000
36:37.000000
37:38.000000
38:39.000000
39:40.000000
40:41.000000
41:42.000000
42:43.000000
43:44.000000
44:45.000000
45:46.000000
46:47.000000
47:48.000000
48:49.000000
49:50.000000
50:51.000000
51:52.000000
52:53.000000
53:54.000000
54:55.000000
55:56.000000
56:57.000000
57:58.000000
58:59.000000
59:60.000000
60:61.000000
61:62.000000
62:63.000000
63:64.000000
Also, here's a second example, but now with "data" being an array where the size of each "x" can be different.
% cat test2.c
#include <stdio.h>
#include <stdlib.h>
#define M 16
typedef struct Data_{
    double *x;
    int n;
}Data;
void add(Data *data){
    #pragma acc parallel loop present(data)
    for(int i=0; i < data->n ; i++){
        data->x[i] += 1.;
    }
}
int main () {
    Data *data;
    data = (Data*) malloc(sizeof(Data)*M);
    #pragma acc enter data create(data[0:M])
    for (int i =0; i < M; ++i) {
        data[i].n = i+1;
        data[i].x = (double *) malloc(sizeof(double)*data[i].n);
        for(int j=0; j < data[i].n ; j++){
            data[i].x[j] = (double)((i*data[i].n) + j);
        }
        #pragma acc update device(data[i].n)
        #pragma acc enter data copyin(data[i].x[0:data[i].n])
    }
    for (int i =0; i < M; ++i) {
        add(&data[i]);
    }
    for (int i =0; i < M; ++i) {
        #pragma acc update self(data[i].x[:data[i].n])
        for(int j=0; j < data[i].n ; j++){
            printf("%d:%d:%f\n",i,j,data[i].x[j]);
        }}
    for (int i =0; i < M; ++i) {
        #pragma acc exit data delete(data[i].x)
        free(data[i].x);
    }
    #pragma acc exit data delete(data)
    free(data);
}
% pgcc test2.c -ta=tesla -Minfo=accel; a.out
add:
     11, Generating present(data[:1])
         Generating Tesla code
         14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
main:
     22, Generating enter data create(data[:16])
     32, Generating update device(data->n)
         Generating enter data copyin(data->x[:data->n])
     38, Generating update self(data->x[:data->n])
     46, Generating exit data delete(data->x[:1])
     49, Generating exit data delete(data[:1])
0:0:1.000000
1:0:3.000000
1:1:4.000000
2:0:7.000000
2:1:8.000000
2:2:9.000000
3:0:13.000000
3:1:14.000000
3:2:15.000000
3:3:16.000000
4:0:21.000000
4:1:22.000000
4:2:23.000000
4:3:24.000000
4:4:25.000000
5:0:31.000000
5:1:32.000000
5:2:33.000000
5:3:34.000000
5:4:35.000000
5:5:36.000000
6:0:43.000000
6:1:44.000000
6:2:45.000000
6:3:46.000000
6:4:47.000000
6:5:48.000000
6:6:49.000000
7:0:57.000000
7:1:58.000000
7:2:59.000000
7:3:60.000000
7:4:61.000000
7:5:62.000000
7:6:63.000000
7:7:64.000000
8:0:73.000000
8:1:74.000000
8:2:75.000000
8:3:76.000000
8:4:77.000000
8:5:78.000000
8:6:79.000000
8:7:80.000000
8:8:81.000000
9:0:91.000000
9:1:92.000000
9:2:93.000000
9:3:94.000000
9:4:95.000000
9:5:96.000000
9:6:97.000000
9:7:98.000000
9:8:99.000000
9:9:100.000000
10:0:111.000000
10:1:112.000000
10:2:113.000000
10:3:114.000000
10:4:115.000000
10:5:116.000000
10:6:117.000000
10:7:118.000000
10:8:119.000000
10:9:120.000000
10:10:121.000000
11:0:133.000000
11:1:134.000000
11:2:135.000000
11:3:136.000000
11:4:137.000000
11:5:138.000000
11:6:139.000000
11:7:140.000000
11:8:141.000000
11:9:142.000000
11:10:143.000000
11:11:144.000000
12:0:157.000000
12:1:158.000000
12:2:159.000000
12:3:160.000000
12:4:161.000000
12:5:162.000000
12:6:163.000000
12:7:164.000000
12:8:165.000000
12:9:166.000000
12:10:167.000000
12:11:168.000000
12:12:169.000000
13:0:183.000000
13:1:184.000000
13:2:185.000000
13:3:186.000000
13:4:187.000000
13:5:188.000000
13:6:189.000000
13:7:190.000000
13:8:191.000000
13:9:192.000000
13:10:193.000000
13:11:194.000000
13:12:195.000000
13:13:196.000000
14:0:211.000000
14:1:212.000000
14:2:213.000000
14:3:214.000000
14:4:215.000000
14:5:216.000000
14:6:217.000000
14:7:218.000000
14:8:219.000000
14:9:220.000000
14:10:221.000000
14:11:222.000000
14:12:223.000000
14:13:224.000000
14:14:225.000000
15:0:241.000000
15:1:242.000000
15:2:243.000000
15:3:244.000000
15:4:245.000000
15:5:246.000000
15:6:247.000000
15:7:248.000000
15:8:249.000000
15:9:250.000000
15:10:251.000000
15:11:252.000000
15:12:253.000000
15:13:254.000000
15:14:255.000000
15:15:256.000000
Note: be careful about copying structs with dynamic data members. Copying out the struct itself, as in "#pragma acc exit data copyout(data[0:1])" above, will overwrite the host address of "x" with the device address. Instead, copy out only "data->x" and delete "data".
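In other words, the teardown pattern from the examples above (a sketch; data and n as in the question's add routine):
/* wrong: copying the struct back clobbers the host pointer data->x
   with its device address */
// #pragma acc exit data copyout(data[0:1])

/* right: copy back only the member array, then delete the struct shell */
#pragma acc exit data copyout(data->x[0:n])
#pragma acc exit data delete(data)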

C++ Threaded Template Vector Quicksort

Threaded quick sort method:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "MD5.h"
#include <thread>
using namespace std;

template<typename T>
void quickSort(vector<T> &arr, int left, int right) {
    int i = left, j = right; // Make local copies to modify
    T tmp;                   // Temporary variable used for swapping
    T pivot = arr[(left + right) / 2]; // Take the middle element as pivot (truncates on 0.5)
    while (i <= j) {
        while (arr[i] < pivot) // Is i < pivot?
            i++;
        while (arr[j] > pivot) // Is j > pivot?
            j--;
        if (i <= j) { // Swap
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    };
    thread left_t;  // Left thread
    thread right_t; // Right thread
    if (left < j)
        left_t = thread(quickSort<T>, ref(arr), left, j);
    if (i < right)
        right_t = thread(quickSort<T>, ref(arr), i, right);
    if (left < j)
        left_t.join();
    if (left < j) // note: wrong condition, this should test (i < right)
        right_t.join();
}

int main()
{
    vector<int> table;
    for (int i = 0; i < 100; i++)
    {
        table.push_back(rand() % 100);
    }
    cout << "Before" << endl;
    for (int val : table)
    {
        cout << val << endl;
    }
    quickSort(table, 0, 99);
    cout << "After" << endl;
    for (int val : table)
    {
        cout << val << endl;
    }
    char temp = cin.get();
    return 0;
}
The above program lags like mad and spams "abort() has been called".
I'm thinking it has something to do with the vectors and threading issues.
I've seen the question asked by Daniel Makardich; his utilizes a vector<int> while mine uses a vector<T>.
You don't have any problem with quick sort, but with passing a templated function to a thread. There is no plain function quickSort; you need to explicitly give the type to instantiate the function template:
#include <thread>
#include <iostream>

template<typename T>
void f(T a) { std::cout << a << '\n'; }

int main () {
    std::thread t;
    int a = 0;
    std::string b("b");
    // t = std::thread(f, a); // Won't work: f names a template, not a function
    t = std::thread(f<int>, a);
    t.join();
    t = std::thread(f<decltype(b)>, b); // a bit fancier, more dynamic way
    t.join();
    return 0;
}
I suspect in your case this should do:
left_t = thread(quickSort<T>, ref(arr), left, j);
And similarly for right_t. Also, you have a mistake there: you are trying to use operator()() instead of constructing an object, which is why the error is different.
Can't verify though, cause there's no minimal verifiable example =/
I don't know if it's possible to make the compiler use automatic type deduction for f passed as a parameter; if anyone knows, that would probably make this a better answer.
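For what it's worth, one way that does work without naming the instantiation, assuming a C++14 compiler: wrap the call in a generic lambda, which defers the template-argument deduction to the call inside the lambda body:
#include <thread>
#include <iostream>

template<typename T>
void f(T a) { std::cout << a << '\n'; }

int main () {
    int a = 0;
    // The lambda is a concrete callable, so std::thread can copy it;
    // f is only instantiated (as f<int>) at the call inside the body.
    std::thread t([](auto x){ f(x); }, a);
    t.join();
    return 0;
}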
The problem was with the thread joins, plus what @luk32 said: I needed to convert the threads to pointers to threads.
thread* left_t = nullptr;  // Left thread
thread* right_t = nullptr; // Right thread
if (left < j)
    left_t = new thread(quickSort<T>, ref(arr), left, j);
if (i < right)
    right_t = new thread(quickSort<T>, ref(arr), i, right);
if (left_t)
{
    left_t->join();
    delete left_t;
}
if (right_t)
{
    right_t->join();
    delete right_t;
}
It seems that a default-constructed thread object that never launched anything must not be joined (join() on it throws, which, uncaught, calls abort()), while a thread that was launched must be joined before it is destroyed, or the program likewise aborts.
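For what it's worth, a variant that avoids the heap entirely, assuming C++11: thread::joinable() is true only for a thread that was actually launched (and not yet joined), so the join conditions can no longer get out of sync with the launch conditions:
thread left_t;  // default-constructed: not joinable
thread right_t;
if (left < j)
    left_t = thread(quickSort<T>, ref(arr), left, j);
if (i < right)
    right_t = thread(quickSort<T>, ref(arr), i, right);
if (left_t.joinable())
    left_t.join();
if (right_t.joinable())
    right_t.join();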

The measured mcycle is less than minstret in Spike

I use spike to run the test program in "riscv-tools/riscv-tests/build/benchmarks":
$ spike multiply.riscv
And the output shows:
mcycle = 24096
minstret = 24103
Why is mcycle less than minstret?
Does it mean that Spike can run more than one instruction per cycle?
(I tried to trace the Spike code but cannot find how mcycle is counted.)
The printed mcycle and minstret values are not from Spike in this case; they come from the test (benchmark) itself. Here is the code:
https://github.com/ucb-bar/riscv-benchmarks/blob/master/common/syscalls.c
#define NUM_COUNTERS 2
static uintptr_t counters[NUM_COUNTERS];
static char* counter_names[NUM_COUNTERS];

static int handle_stats(int enable)
{
    int i = 0;
#define READ_CTR(name) do { \
        while (i >= NUM_COUNTERS) ; \
        uintptr_t csr = read_csr(name); \
        if (!enable) { csr -= counters[i]; counter_names[i] = #name; } \
        counters[i++] = csr; \
    } while (0)
    READ_CTR(mcycle);
    READ_CTR(minstret);
#undef READ_CTR
    return 0;
}
There is some code between the reads of mcycle and minstret, and the printed values tell you exactly how much: 24103 - 24096 = 7 counter ticks elapse between the two readings.
In Spike mcycle & minstret are always equal by definition (they are handled by the same code): https://github.com/riscv/riscv-isa-sim/blob/9e012462f53113dc9ed00d7fbb89aeafeb9b89e9/riscv/processor.cc#L347
case CSR_MINSTRET:
case CSR_MCYCLE:
    if (xlen == 32)
        state.minstret = (state.minstret >> 32 << 32) | (val & 0xffffffffU);
    else
        state.minstret = val;
    break;
The syscalls.c was linked into the multiply.riscv binary by https://github.com/ucb-bar/riscv-benchmarks/blob/master/multiply/bmark.mk:
multiply_riscv_bin = multiply.riscv
$(multiply_riscv_bin): ... $(patsubst %.c, %.o, ... syscalls.c ... )
There is an _init function in syscalls.c which calls the test's main and prints the values recorded via the SYS_stats "syscall" in handle_stats.
void _init(int cid, int nc)
{
    init_tls();
    thread_entry(cid, nc);
    // only single-threaded programs should ever get here.
    int ret = main(0, 0);
    char buf[NUM_COUNTERS * 32] __attribute__((aligned(64)));
    char* pbuf = buf;
    for (int i = 0; i < NUM_COUNTERS; i++)
        if (counters[i])
            pbuf += sprintf(pbuf, "%s = %d\n", counter_names[i], counters[i]);
    if (pbuf != buf)
        printstr(buf);
    exit(ret);
}
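If you want to reduce that skew in your own measurements, the usual trick is to sample each counter with a single csrr instruction. A minimal sketch, assuming a RISC-V toolchain with machine-mode access to the counters (the helper names rd_mcycle and rd_minstret are made up for illustration):
#include <stdint.h>

static inline uintptr_t rd_mcycle(void)
{
    uintptr_t x;
    asm volatile ("csrr %0, mcycle" : "=r" (x)); // one instruction per sample
    return x;
}

static inline uintptr_t rd_minstret(void)
{
    uintptr_t x;
    asm volatile ("csrr %0, minstret" : "=r" (x));
    return x;
}
Calling these back-to-back still executes a few instructions between the two reads, but far fewer than the READ_CTR bookkeeping above.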

How to solve http://www.spoj.com/problems/MST1/ when n is up to 10^9

Using a bottom-up DP approach, I am able to solve http://www.spoj.com/problems/MST1/ for n up to 10^8. If the input n is very large, up to 10^9, I will not be able to create a lookup table of that size. So what would be a better approach to solve the problem?
Is there any heuristic solution?
#include <iostream>
#include <climits>
#include <algorithm>
using namespace std;

int main()
{
    const int N_MAX = 20000001;
    int *DP = new int[N_MAX];
    DP[1] = 0;
    for (int i = 2; i < N_MAX; i++) {
        int minimum = DP[i - 1];
        if (i % 3 == 0) minimum = min(minimum, DP[i/3]);
        if (i % 2 == 0) minimum = min(minimum, DP[i/2]);
        DP[i] = minimum + 1;
    }
    int T, N; cin >> T;
    int c = 1;
    while (T--) {
        cin >> N;
        cout << "Case " << c++ << ": " << DP[N] << endl;
    }
    delete[] DP;
}
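One common way around the full table (a sketch, not verified against the judge): memoize a top-down recursion on only the states actually reached. Between two divisions you only ever need to subtract enough 1s to make the number divisible, which gives f(n) = min(n%2 + 1 + f(n/2), n%3 + 1 + f(n/3)) with f(1) = 0. The reachable states are roughly the values n/(2^a * 3^b), only O(log^2 n) of them, so an unordered_map handles n = 10^9 easily:
#include <iostream>
#include <unordered_map>
#include <algorithm>
using namespace std;

unordered_map<long long, long long> memo;

// Minimum steps to reduce n to 1 using the operations -1, /2, /3.
long long f(long long n)
{
    if (n <= 1) return 0;
    auto it = memo.find(n);
    if (it != memo.end()) return it->second;
    // Either subtract n%2 ones and halve, or subtract n%3 ones and divide by 3.
    long long best = min(n % 2 + 1 + f(n / 2),
                         n % 3 + 1 + f(n / 3));
    memo[n] = best;
    return best;
}

int main()
{
    int T; cin >> T;
    for (int c = 1; c <= T; c++) {
        long long N; cin >> N;
        cout << "Case " << c << ": " << f(N) << "\n";
    }
}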
