CUDD BDDs: building a boolean as disjunction of conjunctions but get runtime error: segmentation fault - cudd

Does anyone with experience using CUDD (not to be confused with CUDA) for manipulating BDDs know why I keep getting the dreaded "Segmentation fault (core dumped)"? I suspect it is related to referencing and dereferencing, which I confess I don't fully understand. Any hints or pointers are appreciated. Here is my code (I have commented out some things I have been trying):
#include <stdio.h>
#include <stdlib.h>
#include "cudd.h"
int main(int argc, char* argv[])
{
/*char filename[30];*/
DdManager * gbm; /* Global BDD manager. */
gbm = Cudd_Init(0, 0, CUDD_UNIQUE_SLOTS, CUDD_CACHE_SLOTS, 0); /* Initialize a new BDD manager with defaults. */
int const n = 3;
int i, j;
DdNode *var, *tmp, *tmp2, *BDD, *BDD_t;
BDD_t = Cudd_ReadLogicZero(gbm);
/*Cudd_Ref(BDD_t);*/
/* Outer loop: disjunction of the n terms*/
for (j = 0; j <= n - 1; j++) {
BDD = Cudd_ReadOne(gbm); /*Returns the logic one constant of the manager*/
/* Cudd_Ref(BDD);*/
/* Inner loop: assemble each of the n conjunctions */
for (i = j * (n - 1); i >= (j - 1) * (n - 1); i--) {
var = Cudd_bddIthVar(gbm, i); /*Create a new BDD variable*/
tmp = Cudd_bddAnd(gbm, var, BDD); /*Perform AND boolean operation*/
BDD = tmp;
}
tmp2 = Cudd_bddOr(gbm, BDD, BDD_t); /*Perform OR boolean operation*/
/*Cudd_RecursiveDeref(gbm, tmp);*/
BDD_t = tmp2;
}
Cudd_PrintSummary(gbm, BDD_t, 4, 0);
/* Cudd_bddPrintCover(mgr, BDD_t, BDD);*/
/* BDD = Cudd_BddToAdd(gbm, BDD_t);*/
/* printf(gbm,BDD_t, 2, 4);*/
Cudd_Quit(gbm);
return 0;
}

While you are correct that the Cudd_Ref'ing and Cudd_RecursiveDeref'ing in your code is not right (yet), the first problem you are hitting is a different one.
You never check the return values of the CUDD functions. Some of them return NULL (0) on error, and your code does not detect such cases. In fact, the call to "Cudd_bddIthVar" returns NULL (0) at least once, and the subsequent call to the BDD AND function then makes the CUDD library access memory at address 0+4, causing the segmentation fault.
There are multiple ways to fix this:
The best way is to always check for NULL return values and then notify the user of the program of the problem. Since this is your main() function, this could mean printing an error message and then returning 1.
At the very bare minimum, you can add assert(...) statements, so that at least in debug mode the problem becomes obvious. This is not recommended in general, because when compiling without debug mode such problems may go unnoticed.
In C++, there is also the possibility of working with exceptions, but you don't seem to be using C++.
Now why does "Cudd_bddIthVar(gbm, i)" return NULL? Because in the second iteration of the inner loop (with j = 0), the loop variable "i" has the value -1.
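To see where the negative index comes from, the original loop bounds can be replayed without CUDD. The following is only a sketch: the helper name original_inner_indices and its out/cap parameters are made up for illustration and are not part of the question's code or of CUDD.

```c
#include <assert.h>
#include <stddef.h>

/* Records the indices visited by the question's inner loop for a given
   outer index j, using the ORIGINAL loop bounds. Writes at most cap
   indices into out and returns how many indices the loop visits. */
size_t original_inner_indices(int n, int j, int *out, size_t cap)
{
    size_t count = 0;
    assert(out != NULL);
    for (int i = j * (n - 1); i >= (j - 1) * (n - 1); i--) {
        if (count < cap)
            out[count] = i;   /* this is the value passed to Cudd_bddIthVar */
        count++;
    }
    return count;
}
```

With n = 3 and j = 0 the loop visits 0, -1 and -2, so the second and third iterations ask for variables with negative indices.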
Now as far as Ref'fing and Deref'fing is concerned:
You need to call Cudd_Ref(...) on every BDD node that you want to keep using after the next call to a CUDD function. Constants and variables are exceptions.
You need to call Cudd_RecursiveDeref(...) on every BDD node that you initially Ref'd that is no longer needed.
This is because every BDD node has a counter telling you how often it is currently in use. Once the counter hits 0, the BDD node may be recycled. Ref'ing a node makes sure that it is not recycled while you are still using it.
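The counting rule can be pictured with a toy model. This is not how CUDD is implemented; ToyNode, toy_ref, toy_deref and toy_collect are hypothetical names that only mimic the idea of "a node may be recycled once its counter is 0".

```c
#include <assert.h>
#include <stddef.h>

/* Toy node: just a reference counter and a "recycled" flag. */
typedef struct {
    int ref;
    int recycled;
} ToyNode;

/* Announce that the node is in use (plays the role of Cudd_Ref). */
void toy_ref(ToyNode *node)
{
    assert(node != NULL && !node->recycled);
    node->ref++;
}

/* Announce that one user is done (plays the role of a deref call). */
void toy_deref(ToyNode *node)
{
    assert(node != NULL && node->ref > 0);
    node->ref--;
}

/* Pretend garbage collection: the node is recycled only when nobody
   holds a reference. Returns 1 if the node is now recycled. */
int toy_collect(ToyNode *node)
{
    assert(node != NULL);
    if (node->ref == 0)
        node->recycled = 1;
    return node->recycled;
}
```

In this model a node that has been Ref'd once survives toy_collect, and is recycled only after the matching toy_deref, which is the symmetry the rules above ask for.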
Your program could be fixed as follows:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "cudd.h"
int main(int argc, char* argv[])
{
/*char filename[30];*/
DdManager * gbm; /* Global BDD manager. */
gbm = Cudd_Init(0, 0, CUDD_UNIQUE_SLOTS, CUDD_CACHE_SLOTS, 0); /* Initialize a new BDD manager with defaults. */
assert(gbm!=0);
int const n = 3;
int i, j;
DdNode *var, *tmp, *tmp2, *BDD, *BDD_t;
BDD_t = Cudd_ReadLogicZero(gbm);
assert(BDD_t!=0);
Cudd_Ref(BDD_t);
/* Outer loop: disjunction of the n terms*/
for (j = 0; j <= n - 1; j++) {
BDD = Cudd_ReadOne(gbm); /*Returns the logic one constant of the manager*/
assert(BDD!=0);
Cudd_Ref(BDD);
/* Inner loop: assemble each of the n conjunctions */
for (i = j * (n - 1); i >= (j) * (n - 1); i--) {
var = Cudd_bddIthVar(gbm, i); /*Create a new BDD variable*/
assert(var!=0);
tmp = Cudd_bddAnd(gbm, var, BDD); /*Perform AND boolean operation*/
assert(tmp!=0);
Cudd_Ref(tmp);
Cudd_RecursiveDeref(gbm,BDD);
BDD = tmp;
}
tmp2 = Cudd_bddOr(gbm, BDD, BDD_t); /*Perform OR boolean operation*/
assert(tmp2!=0);
Cudd_Ref(tmp2);
Cudd_RecursiveDeref(gbm, BDD_t);
Cudd_RecursiveDeref(gbm, BDD);
BDD_t = tmp2;
}
Cudd_PrintSummary(gbm, BDD_t, 4, 0);
/* Cudd_bddPrintCover(mgr, BDD_t, BDD);*/
/* BDD = Cudd_BddToAdd(gbm, BDD_t);*/
/* printf(gbm,BDD_t, 2, 4);*/
Cudd_RecursiveDeref(gbm,BDD_t);
assert(Cudd_CheckZeroRef(gbm)==0);
Cudd_Quit(gbm);
return 0;
}
For brevity, I used assert(...) statements to check the conditions. Don't use this in production code - this is only to keep the code shorter during learning. Also look up in the CUDD documentation which calls can actually return NULL. Those that cannot do not need such a check. But most calls can return 0.
Note that:
The return value of Cudd_bddIthVar is not Cudd_Ref'd; it doesn't need to be.
The return value of Cudd_ReadLogicZero(gbm) is Cudd_Ref'd - this is because the variable is overwritten with nodes that have to be Ref'd later, and hence the code needs to have a call to RecursiveDeref(...) in that case. To make Ref's and Deref's symmetric, the node is needlessly Ref'd (which is allowed).
The last assert statement checks if there are any nodes still in use -- if that is the case before calling Cudd_Quit, this tells you that your code doesn't Deref correctly, which should be fixed. If you comment out any RecursiveDeref line and run the code, the assert statement should halt execution then.
I've rewritten your for-loop condition to ensure that no negative variable numbers occur. But your code may not do what it is supposed to now.

Thanks, @DCTLib. I've found that the C++ interface is much more convenient for formulating Boolean expressions. The only problem is how to go back and forth between the C and C++ interfaces, since ultimately I still need C for printing out the minterms (called cutsets in the world I inhabit, Reliability Eng.). Let me pose the question in a separate entry. It seems you know CUDD quite well. You should be maintaining that repo! It's a great product, but sparsely documented and with little interaction.

Related

What is wrong with the following code in Vivado HLS?

The following code should read a value from DDR, decrement it, write the result back to the same address, and read the next value, repeating 256 times.
Instead on the first run it decrements the first 2 values (axi_ddr[0] and [1]), and on consecutive runs it only decrements the first value (axi_ddr[0]).
#include "ap_cint.h"
#include <stdio.h>
#include "string.h"
void hls_test(volatile int256 axi_ddr[256], uint32 *axi_lite_status_control){
#pragma HLS INTERFACE s_axilite port=axi_lite_status_control register bundle=BUS_A
#pragma HLS INTERFACE s_axilite port=return bundle=BUS_A
#pragma HLS INTERFACE m_axi depth=256 port=axi_ddr bundle=DDR
int256 axi_ddr_reg;
int256 diff = 1;
uint9 i = 0;
if (*axi_lite_status_control == 1){
for(i = 0; i < 256; i++){
axi_ddr_reg = axi_ddr[i];
axi_ddr[i] = axi_ddr_reg -diff;
}
*axi_lite_status_control = 2;
}
}
Both simulation and cosimulation pass as intended, and I cannot figure out what is causing the issue.
I also tried C++, but it resulted in the same behavior. The only time it was different was when I forgot to give the variable diff an initial value, and then the value in all 256 DDR locations became 0x0.
Could somebody please point out what I am missing?
The code looks fine to me and it should work flawlessly. However, if you're saying that both simulation and cosimulation pass, then something might be wrong with either your test code or your hardware implementation.
Also, for the C++ version of the code, you should use the ap_uint<N> types defined in ap_int.h instead of the ones from ap_cint.h.

How to count branch mispredictions?

I've been given the task of measuring the branch misprediction penalty (in ticks), so I wrote this code:
int main (int argc, char ** argv) {
unsigned long long start, end;
FILE *f;
f = fopen("output", "w");
long long int k = 0;
unsigned long long min;
int n = atoi(argv[1]);// n1 = atoi(argv[2]);
for (int i = 1; i <= n + 40; i++) {
min = 9999999999999;
for(int r = 0; r < 1000; r++) {
start = rdtsc();
for (long long int j = 0; j < 100000; j++) {
if (j % i == 0) {
k++;
}
}
end = rdtsc();
if (min > end - start) min = end - start;
}
fprintf (f, "%d %llu \n", i, min); /* min is unsigned long long, so use %llu */
}
fclose (f);
return 0;
}
(rdtsc is a function that measures time in ticks)
The idea of this code is that it periodically (with a period equal to i) enters the branch (if (j % i == 0)), so at some point the predictor starts mispredicting. The other parts of the code mostly repeat the measurement, which I need in order to get more precise results.
Tests show that branch mispredictions start to happen around i = 47, but I do not know how to count the exact number of mispredictions in order to work out the exact number of ticks. Can anyone explain how to do this without using any side programs like VTune?
It depends on the processor you're using. In general, cpuid can be used to obtain a lot of information about the processor, and what cpuid does not provide is typically accessible via SMBIOS or other regions of memory.
Doing this in code on a general level, without the processor's support functions and manual, will not tell you what you want with a great degree of certainty. It may still be useful as an estimate, depending on what you're looking for and how your code is compiled (e.g. the flags you use during compilation).
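As an illustration of what such an estimate could look like, here is a sketch that counts mispredictions of the branch j % i == 0 under a textbook 2-bit saturating counter. This is an assumption made purely for illustration: real processors use history-based predictors that can learn short periodic patterns, so this toy model would not reproduce the threshold near i = 47 reported in the question. The name toy_mispredictions is hypothetical.

```c
#include <assert.h>

/* Counts mispredictions of a single 2-bit saturating counter
   (states 0..3, "taken" is predicted when state >= 2, start state 0)
   for the branch `j % period == 0` over j = 0 .. iterations-1.
   Toy model only; not a description of any real CPU. */
long long toy_mispredictions(long long iterations, int period)
{
    int state = 0;
    long long missed = 0;
    assert(period > 0);
    for (long long j = 0; j < iterations; j++) {
        int taken = (j % period == 0);
        int predicted = (state >= 2);
        if (predicted != taken)
            missed++;
        /* Move the counter towards the actual outcome. */
        if (taken && state < 3)
            state++;
        else if (!taken && state > 0)
            state--;
    }
    return missed;
}
```

Multiplying such a count by a measured difference in ticks would only give a rough figure, since the model and the hardware can disagree on which branches are mispredicted.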
In general, this is what is referred to as speculative execution. It is typically not observable by programs, because work that passes through the pipeline and is later determined to be unneeded is discarded.
Depending on how you use specific instructions in your program, you may be able to use such stale cache information for better or worse, but the logic involved would vary greatly depending on the CPU in use.
See also Spectre and RowHammer for interesting examples of using such techniques for privileged execution.
See the comments below for links which have code related to the use of cpuid as well as rdrand, rdseed and a few others (rdtsc).
It's not completely clear what you're looking for, but this should get you started and provide some useful examples.
See also Branch mispredictions

using malloc in dgels function of lapacke

I am trying to use the dgels function of LAPACKE.
When I use it with malloc, it does not give the correct values.
Can anybody please tell me what the mistake is when I use malloc to create a matrix?
Thank you.
/* Calling DGELS using row-major order */
#include <stdio.h>
#include <lapacke.h>
#include <conio.h>
#include <malloc.h>
int main ()
{
double a[3][2] = {{1,0},{1,1},{1,2}};
double **outputArray;
int designs=3;
int i,j,d,i_mal;
lapack_int info,m,n,lda,ldb,nrhs;
/* double outputArray[3][1] = {{6},{0},{0}};*/
outputArray = (double**) malloc(3* sizeof(double*));
for(i_mal=0;i_mal<3;i_mal++)
{
outputArray[i_mal] = (double*) malloc(1* sizeof(double));
}
for (i=0;i<designs;i++)
{
printf("put first value");
scanf("%lf",&outputArray[i][0]);
}
m = 3;
n = 2;
nrhs = 1;
lda = 2;
ldb = 1;
info = LAPACKE_dgels(LAPACK_ROW_MAJOR,'N',m,n,nrhs,*a,lda,*outputArray,ldb);
for(i=0;i<m;i++)
{
for(j=0;j<nrhs;j++)
{
printf("%lf ",outputArray[i][j]);
}
printf("\n");
}
getch();
return (info);
}
The problem may come from outputArray not being contiguous in memory. You may use something like this instead:
outputArray = (double**) malloc(3* sizeof(double*));
outputArray[0]=(double*) malloc(3* sizeof(double));
for (i=0;i<designs;i++){
outputArray[i]=&outputArray[0][i];
}
Don't forget to free the memory!
free(outputArray[0]);
free(outputArray);
Edit: Contiguous means that you have to allocate the memory for all values at once. See http://www.fftw.org/doc/Dynamic-Arrays-in-C_002dThe-Wrong-Way.html#Dynamic-Arrays-in-C_002dThe-Wrong-Way : some packages, like FFTW or LAPACK, require this layout for optimization. Because you called malloc three times, you created three separate blocks and things went wrong.
If you have a single right-hand side, there is no need for a 2D array (double**). outputArray[i] is a double*, that is, the start of the i-th row (row-major). With several right-hand sides, the correct line would be outputArray[i]=&outputArray[0][i*nrhs];
With the code above, you are building 3 rows and one column, that is, one RHS. The solution is of size n=2 and should end up in outputArray[0][0] and outputArray[1][0]. I hope I am not too wrong; check this on simple cases!
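The row-pointer formula above generalizes to any number of columns. A minimal sketch, assuming helper names (alloc_contiguous, free_contiguous) that are not part of LAPACKE or of the original code:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates rows x cols doubles in ONE zero-initialized block, plus a
   table of row pointers into that block. Returns NULL on failure. */
double **alloc_contiguous(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double **m = malloc(rows * sizeof *m);
    if (m == NULL)
        return NULL;
    m[0] = calloc(rows * cols, sizeof **m);
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }
    /* Row i starts i*cols elements after the start of the block. */
    for (size_t i = 1; i < rows; i++)
        m[i] = m[0] + i * cols;
    return m;
}

/* Releases both the data block and the row-pointer table. */
void free_contiguous(double **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

With such a layout, m[0] (equivalently *m) points at rows*cols consecutive doubles, which is the kind of buffer a row-major routine expects, with the leading dimension equal to cols.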
Bye,

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudocode, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added an extra kernel argument for the output. Added a dot product to sum over the float4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down). These must be summed by the CPU, or else passed to another kernel call to do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank switching issues by reading from local memory essentially sequentially. I would love confirmation of that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
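Since debugging on the GPU is awkward, the folding logic of the Sum kernel can be checked on the CPU first. The sketch below replays one work-group sequentially in plain C; toy_group_reduce is a hypothetical name, and the inner for loop stands in for the work-items that run between two barriers.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential stand-in for one work-group of the Sum kernel: folds the
   upper half of target onto the lower half until one value remains.
   Modifies target in place and returns the total. */
float toy_group_reduce(float *target, size_t block_size)
{
    assert(target != NULL && block_size > 0);
    size_t half = block_size / 2;
    while (half > 0) {
        /* Each iteration plays the role of one work-item (localId). */
        for (size_t lid = 0; lid < half; lid++)
            target[lid] += target[lid + half];
        /* Uneven block division: work-item 0 picks up the leftover. */
        if (half * 2 < block_size)
            target[0] += target[block_size - 1];
        block_size = half;
        half = block_size / 2;
    }
    return target[0];
}
```

Running it on sizes such as 5 or 7 exercises the uneven-division branch that a plain halving loop would get wrong.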
If you have support for OpenCL C 2.0 features, you can use the new work_group_reduce_add() function for sum reduction inside a single work-group.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
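The m/n bookkeeping of the host loop can be tried out on the CPU before involving OpenCL at all. This is only a sketch in plain C: toy_fold_sum is a hypothetical name, and the inner for loop stands in for one launch of foldKernel with m work-items and offset n.

```c
#include <assert.h>
#include <stddef.h>

/* CPU stand-in for the fold loop: repeatedly adds the top part of the
   array onto the bottom part, in place, and returns the total. */
float toy_fold_sum(float *values, size_t total)
{
    assert(values != NULL && total > 0);
    size_t t = total;
    while (t > 1) {
        size_t m = t / 2;        /* number of work-items in this pass */
        size_t n = (t + 1) / 2;  /* offset, and element count left afterwards */
        for (size_t gid = 0; gid < m; gid++)
            values[gid] += values[gid + n];
        t = n;
    }
    return values[0];
}
```

For an odd t the middle element at index n-1 is left untouched in that pass and is folded in later, which is how the loop copes with sizes that are not powers of two.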

Generating a comprehensive callgraph using GCC & Egypt

I am trying to generate a comprehensive callgraph (complete with low level calls to Linux, runtime, the lot).
I have statically compiled my source files with "-fdump-rtl-expand" and created RTL files, which I passed to a Perl script called Egypt (which I believe uses Graphviz/Dot) and generated a PDF file of the callgraph. This works perfectly, no problems at all.
Except that calls being made into some libraries are shown as <built-in>. Is there a way for these not to be printed as <built-in>, and to show the real calls made into the libraries instead?
Please let me know if the question is unclear.
http://i.imgur.com/sp58v.jpg
Basically, I am trying to keep the callgraph from showing <built-in> nodes.
Is there a way to do that?
-------- CODE ---------
#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>
unsigned long int t0, t5;
unsigned int NOSPAWN_THRESHOLD = 32;
int fib_nospawn(int n)
{
if (n < 2)
return n;
else
{
int x = fib_nospawn(n-1);
int y = fib_nospawn(n-2);
return x + y;
}
}
// spawning fibonacci function
int fib(long int n)
{
long int x, y;
if (n < 2)
return n;
else if (n <= NOSPAWN_THRESHOLD)
{
x = fib_nospawn(n-1);
y = fib_nospawn(n-2);
return x + y;
}
else
{
x = cilk_spawn fib(n-1);
y = cilk_spawn fib(n-2);
cilk_sync;
return x + y;
}
}
int main(int argc, char *argv[])
{
int n;
long int result;
long int exec_time;
n = atoi(argv[1]);
NOSPAWN_THRESHOLD = atoi(argv[2]);
result = fib(n);
printf("%ld\n", result);
return 0;
}
I compiled the Cilk Library from source.
I might have found the partial solution to the problem:
You need to pass the following option to egypt
--include-external
This produced a slightly more comprehensive callgraph, although the <built-in> node is still visible:
http://i.imgur.com/GWPJO.jpg?1
Can anyone suggest how I can get more depth in the callgraph?
You can use the GCC VCG Plugin: A gcc plugin, which can be loaded when debugging gcc, to show internal structures graphically.
gcc -fplugin=/path/to/vcg_plugin.so -fplugin-arg-vcg_plugin-cgraph foo.c
Call-graph is place to store data needed for inter-procedural optimization. All datastructures are divided into three components: local_info that is produced while analyzing the function, global_info that is result of global walking of the call-graph on the end of compilation and rtl_info used by RTL back-end to propagate data from already compiled functions to their callers.
