Generating a comprehensive callgraph using GCC & Egypt - linux

I am trying to generate a comprehensive callgraph (complete with low level calls to Linux, runtime, the lot).
I have statically compiled my source files with "-fdump-rtl-expand" and created RTL files, which I passed to a Perl script called egypt (which, I believe, uses Graphviz/Dot) and generated a PDF file of the callgraph. This works perfectly, no problems at all.
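For reference, the pipeline looks roughly like this (file names are placeholders):

gcc -c -fdump-rtl-expand fib.c           # writes an RTL dump such as fib.c.*.expand
egypt fib.c.*.expand > callgraph.dot     # egypt converts the RTL dump into a Graphviz graph
dot -Tpdf callgraph.dot -o callgraph.pdf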
Except, there are calls being made into some libraries that are getting shown as <built-in>. Is there a way for the callgraph to show the real calls made into those libraries instead of <built-in>?
Please let me know if the question is unclear.
http://i.imgur.com/sp58v.jpg
Basically, I am trying to keep the callgraph from containing <built-in> nodes. Is there a way to do that?
-------- CODE ---------
#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>
unsigned long int t0, t5;
unsigned int NOSPAWN_THRESHOLD = 32;
int fib_nospawn(int n)
{
    if (n < 2)
        return n;
    else
    {
        int x = fib_nospawn(n-1);
        int y = fib_nospawn(n-2);
        return x + y;
    }
}

// spawning fibonacci function
int fib(long int n)
{
    long int x, y;
    if (n < 2)
        return n;
    else if (n <= NOSPAWN_THRESHOLD)
    {
        x = fib_nospawn(n-1);
        y = fib_nospawn(n-2);
        return x + y;
    }
    else
    {
        x = cilk_spawn fib(n-1);
        y = cilk_spawn fib(n-2);
        cilk_sync;
        return x + y;
    }
}

int main(int argc, char *argv[])
{
    int n;
    long int result;
    long int exec_time;
    n = atoi(argv[1]);
    NOSPAWN_THRESHOLD = atoi(argv[2]);
    result = fib(n);
    printf("%ld\n", result);
    return 0;
}
I compiled the Cilk Library from source.

I might have found a partial solution to the problem:
You need to pass the following option to egypt
--include-external
This produced a slightly more comprehensive callgraph, although the <built-in> node is still visible:
http://i.imgur.com/GWPJO.jpg?1
Can anyone suggest how I can get more depth in the callgraph?

You can use the GCC VCG Plugin: a gcc plugin, which can be loaded when debugging gcc, to show internal structures graphically.
gcc -fplugin=/path/to/vcg_plugin.so -fplugin-arg-vcg_plugin-cgraph foo.c
The call-graph is the place to store data needed for inter-procedural optimization. All data structures are divided into three components: local_info, which is produced while analyzing a function; global_info, which is the result of walking the whole call-graph at the end of compilation; and rtl_info, which is used by the RTL back-end to propagate data from already compiled functions to their callers.

Related

How to run complex exponential "cexpf" or "cexp" in the RawKernel of cupy?

As the title says, I was calculating the exponential of an array of complex numbers in a RawKernel provided by cupy, but I don't know how to include or invoke the function "cexpf" or "cexp" correctly. The error message always tells me that "cexpf" is undefined. Does anybody know how to invoke the function in the correct way? Thank you a lot for the answer.
import cupy as cp
import time
add_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>
#include <cupy/complex/cexpf.h>
extern "C" __global__
void test(double* x, double* y, complex<float>* z){
    int tId_x = blockDim.x*blockIdx.x + threadIdx.x;
    int tId_y = blockDim.y*blockIdx.y + threadIdx.y;
    complex<float> value = complex<float>(x[tId_x],y[tId_y]);
    z[tId_x*blockDim.y*gridDim.y+tId_y] = cexpf(value);
}''',"test")
x = cp.random.rand(1,8,4096,dtype = cp.float32)
#x = cp.arange(0,4096,dtype = cp.uint32)
y = cp.random.rand(1,8,4096,dtype = cp.float32)
#y = cp.arange(4096,8192,dtype = cp.uint32)
z = cp.zeros((4096,4096), dtype = cp.complex64)
t1 = time.time()
add_kernel((128,128),(32,32),(x,y,z))
print(time.time()-t1)
print(z)
Looking at the headers it seems like you are supposed to just call exp and you don't need to include cupy/complex/cexpf.h yourself, as it is already included implicitly via cupy/complex.cuh.
add_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>
extern "C" __global__
void test(double* x, double* y, complex<float>* z){
    int tId_x = blockDim.x*blockIdx.x + threadIdx.x;
    int tId_y = blockDim.y*blockIdx.y + threadIdx.y;
    complex<float> value = complex<float>(x[tId_x],y[tId_y]);
    z[tId_x*blockDim.y*gridDim.y+tId_y] = exp(value);
}''',"test")
Generally, CuPy's custom-kernel C++ complex number API is taken from Thrust, so you can consult the Thrust documentation. Just skip the thrust:: namespace.
Thrust's API in turn implements the C++ std::complex API for the most part, so looking at the C++ standard library documentation might also be helpful when the Thrust documentation does not go deep enough. Just be careful, because Thrust might not give all the same guarantees, in order to avoid performance problems on the GPU.
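For comparison, a tiny host-side sketch with the standard library, whose std::complex API the device complex<float> mirrors:

#include <complex>
#include <cstdio>

int main()
{
    std::complex<float> value(0.5f, 0.25f);       // plays the role of complex<float>(x[tId_x], y[tId_y])
    std::complex<float> result = std::exp(value); // in the kernel this is simply exp(value)
    std::printf("%f %f\n", result.real(), result.imag());
    return 0;
}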

std::max giving error C2064: term does not evaluate to a function taking 2 arguments

I am fairly new to C++. I was practicing some data structures and algorithms. This code looks fine to me, but I am getting an error about a function not taking 2 arguments. Though I found similar errors asked on Stack Overflow, none of the cases match my problem.
#include <iostream>
#include <algorithm>
int ropecutting(int n, int *cuts){
    if (n == 0)
        return 0;
    if (n < 0)
        return -1;
    int res = std::max(ropecutting(n-cuts[0], cuts), ropecutting(n-cuts[1], cuts), ropecutting(n-cuts[2], cuts));
    if(res == -1) return -1;
    return res+1;
}

int main(){
    int n, cuts[3];
    std::cin >> n;
    for(int i = 0; i < 3; i ++)
        std::cin >> cuts[i];
    std::cout << ropecutting(n, cuts);
}
The error I get is,
main.cpp
G:\software_installation\Visual Studio Community 2017\VC\Tools\MSVC\14.16.27023\include\xlocale(319): warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc
G:\software_installation\Visual Studio Community 2017\VC\Tools\MSVC\14.16.27023\include\algorithm(5368): error C2064: term does not evaluate to a function taking 2 arguments
G:\software_installation\Visual Studio Community 2017\VC\Tools\MSVC\14.16.27023\include\algorithm(5367): note: see reference to function template instantiation 'const _Ty &std::max<int,int>(const _Ty &,const _Ty &,_Pr) noexcept(<expr>)' being compiled
with
[
_Ty=int,
_Pr=int
]
G:\software_installation\Visual Studio Community 2017\VC\Tools\MSVC\14.16.27023\include\algorithm(5368): error C2056: illegal expression
Hoping someone can point me in the right direction. Thank you.
Of the overloads of std::max, the only one which can be called with three arguments is
template < class T, class Compare >
constexpr const T& max( const T& a, const T& b, Compare comp );
So since it receives three int values, that function is attempting to use the third value as a functor to compare the other two, which of course doesn't work.
Probably the simplest way to get the maximum of three numbers is using the overload taking a std::initializer_list<T>. And a std::initializer_list can be automatically created from a braced list:
int res = std::max({ropecutting(n-cuts[0], cuts),
                    ropecutting(n-cuts[1], cuts),
                    ropecutting(n-cuts[2], cuts)});
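With that change the function compiles. Equivalently, you can nest the two-argument overload, though the braced form reads better with three calls:

int res = std::max(ropecutting(n-cuts[0], cuts),
                   std::max(ropecutting(n-cuts[1], cuts),
                            ropecutting(n-cuts[2], cuts)));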

CUDD BDDs: building a boolean as disjunction of conjunctions but get runtime error: segmentation fault

Does anyone with experience using CUDD (not to be confused with CUDA) for manipulating BDDs know why I keep getting the dreaded "segmentation fault (core dumped)"? I suspect it could be related to referencing/dereferencing, which I confess I don't fully understand. Any hints or pointers appreciated. (I commented out some things I have been trying.)
#include <stdio.h>
#include <stdlib.h>
#include "cudd.h"

int main(int argc, char* argv[])
{
    /*char filename[30];*/
    DdManager * gbm; /* Global BDD manager. */
    gbm = Cudd_Init(0, 0, CUDD_UNIQUE_SLOTS, CUDD_CACHE_SLOTS, 0); /* Initialize a new BDD manager with defaults. */
    int const n = 3;
    int i, j;
    DdNode *var, *tmp, *tmp2, *BDD, *BDD_t;
    BDD_t = Cudd_ReadLogicZero(gbm);
    /*Cudd_Ref(BDD_t);*/
    /* Outer loop: disjunction of the n terms*/
    for (j = 0; j <= n - 1; j++) {
        BDD = Cudd_ReadOne(gbm); /*Returns the logic one constant of the manager*/
        /* Cudd_Ref(BDD);*/
        /* Inner loop: assemble each of the n conjunctions */
        for (i = j * (n - 1); i >= (j - 1) * (n - 1); i--) {
            var = Cudd_bddIthVar(gbm, i); /*Create a new BDD variable*/
            tmp = Cudd_bddAnd(gbm, var, BDD); /*Perform AND boolean operation*/
            BDD = tmp;
        }
        tmp2 = Cudd_bddOr(gbm, BDD, BDD_t); /*Perform OR boolean operation*/
        /*Cudd_RecursiveDeref(gbm, tmp);*/
        BDD_t = tmp2;
    }
    Cudd_PrintSummary(gbm, BDD_t, 4, 0);
    /* Cudd_bddPrintCover(mgr, BDD_t, BDD);*/
    /* BDD = Cudd_BddToAdd(gbm, BDD_t);*/
    /* printf(gbm,BDD_t, 2, 4);*/
    Cudd_Quit(gbm);
    return 0;
}
While you are correct that the Cudd_Ref'ing and Cudd_RecursiveDeref'ing in your code is not correct (yet), the first and most immediate problem is actually a different one.
You never check the return values of the CUDD functions. Some of them return NULL (0) on error, and your code does not detect such cases. In fact, the call to "Cudd_bddIthVar" returns NULL (0) at least once, and the subsequent call to the BDD AND function then makes the CUDD library access the memory at address 0+4, causing the segmentation fault.
There are multiple ways to fix this:
The best way is to always check for NULL return values and then notify the user of the program of the problem. Since this is your main() function, this could mean printing an error message and then returning 1.
At the very bare minimum, you can add assert(...) statements, so that at least in debug mode the problem will become obvious. This is not recommended in general, as such problems may go unnoticed when not compiling in debug mode.
In C++, there is also the possibility to work with exceptions - but you don't seem to be using C++.
Now why does "Cudd_bddIthVar(gbm, i)" return NULL? Because in the second iteration, variable "i" of the loop has value -1.
Now as far as Ref'fing and Deref'fing is concerned:
You need to call Cudd_Ref(...) on every BDD node that you want to keep using after the next call to a CUDD function. Exceptions are constants and variables.
You need to call Cudd_RecursiveDeref(...) on every BDD node that you initially Ref'd that is no longer needed.
This is because every BDD node has a counter telling how often it is currently in use. Once the counter hits 0, the BDD node may be recycled. Ref'ing a node makes sure that this does not happen while the node is still in use.
Your program could be fixed as follows:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "cudd.h"

int main(int argc, char* argv[])
{
    /*char filename[30];*/
    DdManager * gbm; /* Global BDD manager. */
    gbm = Cudd_Init(0, 0, CUDD_UNIQUE_SLOTS, CUDD_CACHE_SLOTS, 0); /* Initialize a new BDD manager with defaults. */
    assert(gbm!=0);
    int const n = 3;
    int i, j;
    DdNode *var, *tmp, *tmp2, *BDD, *BDD_t;
    BDD_t = Cudd_ReadLogicZero(gbm);
    assert(BDD_t!=0);
    Cudd_Ref(BDD_t);
    /* Outer loop: disjunction of the n terms*/
    for (j = 0; j <= n - 1; j++) {
        BDD = Cudd_ReadOne(gbm); /*Returns the logic one constant of the manager*/
        assert(BDD!=0);
        Cudd_Ref(BDD);
        /* Inner loop: assemble each of the n conjunctions */
        for (i = j * (n - 1); i >= (j) * (n - 1); i--) {
            var = Cudd_bddIthVar(gbm, i); /*Create a new BDD variable*/
            assert(var!=0);
            tmp = Cudd_bddAnd(gbm, var, BDD); /*Perform AND boolean operation*/
            assert(tmp!=0);
            Cudd_Ref(tmp);
            Cudd_RecursiveDeref(gbm,BDD);
            BDD = tmp;
        }
        tmp2 = Cudd_bddOr(gbm, BDD, BDD_t); /*Perform OR boolean operation*/
        assert(tmp2!=0);
        Cudd_Ref(tmp2);
        Cudd_RecursiveDeref(gbm, BDD_t);
        Cudd_RecursiveDeref(gbm, BDD);
        BDD_t = tmp2;
    }
    Cudd_PrintSummary(gbm, BDD_t, 4, 0);
    /* Cudd_bddPrintCover(mgr, BDD_t, BDD);*/
    /* BDD = Cudd_BddToAdd(gbm, BDD_t);*/
    /* printf(gbm,BDD_t, 2, 4);*/
    Cudd_RecursiveDeref(gbm,BDD_t);
    assert(Cudd_CheckZeroRef(gbm)==0);
    Cudd_Quit(gbm);
    return 0;
}
For brevity, I used assert(...) statements to check the conditions. Don't use this in production code - this is only to keep the code shorter during learning. Also look up in the CUDD documentation which calls can actually return NULL. Those that cannot do not need such a check. But most calls can return 0.
Note that:
The return value of Cudd_bddIthVar is not Cudd_Ref'd - it doesn't need to be.
The return value of Cudd_ReadLogicZero(gbm) is Cudd_Ref'd - this is because the variable is overwritten with nodes that have to be Ref'd later, and hence the code needs to have a call to RecursiveDeref(...) in that case. To make Ref's and Deref's symmetric, the node is needlessly Ref'd (which is allowed).
The last assert statement checks if there are any nodes still in use -- if that is the case before calling Cudd_Quit, this tells you that your code doesn't Deref correctly, which should be fixed. If you comment out any RecursiveDeref line and run the code, the assert statement should halt execution then.
I've rewritten your for-loop condition to ensure that no negative variable numbers occur. But your code may not do what it is supposed to now.
Thanks, @DCTLib. I've found that the C++ interface is much more convenient for formulating Boolean expressions. The only problem is how to go back and forth between the C and C++ interfaces, since ultimately I still need C for printing out the minterms (called cutsets in the world I inhabit, Reliability Engineering). Let me pose that question in a separate entry. It seems you know CUDD quite well. You should be maintaining that repo! It's a great product, but sparsely documented and interacted with.

DLL Floating point results differ according to caller

This is a follow-up question to my earlier one asked yesterday.
The problems were occurring in a MSVS 2008 C++ DLL that has over 4000 lines of code, but I have managed to produce a simple case that demonstrates the problem as it occurs on my CPU (an AMD Phenom II X6 1050T).
Will it show the problem occurring on another system? I'd really like to know!
Here is a simple class (Point.cpp); it needs to be compiled as a DLL:
#include <math.h>

#define EXPORT extern "C" __declspec(dllexport)

namespace Test {
    struct Point {
        double x;
        double y;
        /* Constructor for a Point object */
        Point(double xx, double yy) : x(xx), y(yy) {}
        /* Copy constructor */
        Point(const Point &rhs) : x(rhs.x), y(rhs.y) {}
        double mag() const;
        Point norm() const;
    };

    double Point::mag() const {return sqrt(x*x + y*y);}

    Point Point::norm() const {
        double m = mag();
        return Point(x/m, y/m);
    }

    EXPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny) {
        Point P = Point(x, y);
        Point N = P.norm();
        *nx = N.x;
        *ny = N.y;
    }
}
Here is the test program (TestPoint.c), which needs to be linked to the lib created for the DLL:
#include <stdio.h>

#define IMPORT extern __declspec(dllimport)

IMPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny);

void dhex(double x) { // double to hex
    union {
        unsigned long n[2];
        double d;
    } value;
    value.d = x;
    printf("(0x%0x%0x)\n", value.n[1], value.n[0]);
}

double i64tod(unsigned long long n) { // hex to double
    double *DP = (double *) &n;
    return *DP;
}

int main(int argc, char **argv) {
    double vx, vy;
    double ux, uy;
    vx = i64tod(0xbfc7a30f3a53d351);
    vy = i64tod(0xc01b578b34e3ce1d);
    GetNorm(vx, vy, &ux, &uy);
    printf(" vx = %20.18f ", vx); dhex(vx);
    printf(" vy = %20.18f ", vy); dhex(vy);
    printf("\n");
    printf(" ux = %20.18f ", ux); dhex(ux);
    printf(" uy = %20.18f ", uy); dhex(uy);
    return 0;
}
On my system, with TestPoint compiled with VC++, the output is:
vx = -0.18466368053455054 (0xbfc7a30f3a53d351)
vy = -6.8354919685403077 (0xc01b578b34e3ce1d)
ux = -0.027005566159023012 (0xbf9ba758ddda1454)
uy = -0.99963528318903927 (0xbfeffd032227301b)
However, if the same test program is compiled with gcc, or indeed, it seems, ANY equivalent program (e.g. VB6, PowerBasic), the results (ux and uy) are subtly but definitely different (in the last hex digit):
vx = -0.184663680534550540 (0xbfc7a30f3a53d351)
vy = -6.835491968540307700 (0xc01b578b34e3ce1d)
ux = -0.027005566159023008 (0xbf9ba758ddda1453)
uy = -0.999635283189039160 (0xbfeffd032227301a)
This might seem an insignificant difference, but when it occurs in a physics engine, these differences accumulate in an alarming fashion.
If the engine is going to get different results depending on who calls it, I might have to abandon the use of VC++ altogether and try g++ instead.
Ok, I think I know how this happens. Looking at a disassembler listing of Point.dll, I noticed that the GetNorm function was pretty much what you'd expect, a couple of FMUL's and FDIV's. What was not present was an FLDCW instruction.
There weren't any FLDCW's in the MSVC calling program either, but I found FLDCW's in both the gcc and PowerBasic versions of the calling program.
So I tweaked one of the executables (the PowerBasic EXE was the easiest to find the right place to tweak), and hey presto, I then got answers that matched MSVC. Presumably the FLDCW had changed the FPU rounding mode, hence the difference in the least significant bits.
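If you need the two environments to agree bit-for-bit, one option is to set the control word explicitly on the MSVC side before calling into the DLL. A minimal sketch using MSVC's _controlfp_s from <float.h> (it forces 53-bit precision; whether the culprit is the precision bits or the rounding bits, both live in the same control word):

#include <float.h>   /* MSVC-specific: _controlfp_s, _PC_53, _MCW_PC */

/* Force 53-bit (double) precision for x87 math, roughly what the FLDCW in the
   gcc/PowerBasic callers ends up doing. Call this before GetNorm(). */
static void force_double_precision(void)
{
    unsigned int cw = 0;
    _controlfp_s(&cw, _PC_53, _MCW_PC);  /* cw receives the control word after the change */
}

Note that this only affects x87 code; SSE2 code paths and 64-bit builds ignore the precision-control bits.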

Optimize the Buddhabrot

I am currently working on my own implementation of the Buddhabrot. So far I am using the std::thread class from C++11 to concurrently work through the following iteration:
void iterate(float *res){
    //generate starting point
    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(-1.5,1.5);
    double ReC,ImC;
    double ReS,ImS,ReS_;
    unsigned int steps;
    unsigned int visitedPos[maxCalcIter];
    unsigned int succSamples(0);
    //iterate over it
    while(succSamples < samplesPerThread){
        steps = 0;
        ReC = distribution(generator)-0.4;
        ImC = distribution(generator);
        double p(sqrt((ReC-0.25)*(ReC-0.25) + ImC*ImC));
        while (( ((ReC+1)*(ReC+1) + ImC*ImC) < 0.0625) || (ReC < p - 2*p*p + 0.25)){
            ReC = distribution(generator)-0.4;
            ImC = distribution(generator);
            p = sqrt((ReC-0.25)*(ReC-0.25) + ImC*ImC);
        }
        ReS = ReC;
        ImS = ImC;
        for (unsigned int j = maxCalcIter; (ReS*ReS + ImS*ImS < 4)&&(j--); ){
            ReS_ = ReS;
            ReS *= ReS;
            ReS += ReC - ImS*ImS;
            ImS *= 2*ReS_;
            ImS += ImC;
            if ((ReS+0.5)*(ReS+0.5) + ImS*ImS < 4){
                visitedPos[steps] = int((ReS+2.5)*0.25*outputSize)*outputSize + int((ImS+2)*0.25*outputSize);
            }
            steps++;
        }
        if ((steps > minCalcIter)&&(ReS*ReS + ImS*ImS > 4)){
            succSamples++;
            for (int j = steps; j--;){
                //std::cout << visitedPos[j] << std::endl;
                res[visitedPos[j]]++;
            }
        }
    }
}
So basically, each thread keeps working until it has generated enough trajectories of sufficient length, which in expectation takes the same time in every thread.
But I really have the feeling that this function might be very unoptimized, since its code is so very readable. Can anybody come up with some fancy optimizations? When it comes to compiling, I just use:
g++ -O4 -std=c++11 -I/usr/include/OpenEXR/ -L/usr/lib64/ -lHalf -lIlmImf -lm buddha_cpu.cpp -o buddha_cpu
So any hints on crunching some more numbers/sec would be really appreciated. Also any links to further literature are totally welcome.
Did you check that -O4 is actually faster than -O2? Above -O2, it is not guaranteed to be.
Also, if this compilation is only for you, try -march=native. This will take advantage of your specific CPU architecture, but the resulting binary might crash with SIGSEGV on older/different machines.
You did not show any threads, if I see correctly. Make sure your threads do not write memory locations in the same cache line. Writing memory locations in the same cache line from different threads forces the CPU cores to synchronize their caches -- it's a huge performance degradation.
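A common way to avoid that is to give each thread its own accumulation buffer and merge the buffers once after joining. A rough sketch (nThreads and nBins are placeholders for your thread count and the size of res, and iterate() is the function from the question):

#include <cstddef>
#include <thread>
#include <vector>

void iterate(float *res);  // the per-thread worker from the question

// Each thread writes into a private buffer, so no two threads ever touch the
// same cache line; the buffers are merged once after all threads have joined.
void run(std::size_t nThreads, std::size_t nBins, float *res)
{
    std::vector<std::vector<float>> local(nThreads, std::vector<float>(nBins, 0.0f));
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < nThreads; ++t)
        pool.emplace_back([&local, t]{ iterate(local[t].data()); });
    for (auto &th : pool)
        th.join();
    for (auto &buf : local)             // single-threaded merge
        for (std::size_t i = 0; i < nBins; ++i)
            res[i] += buf[i];
}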
