Interpreting gfortran error traceback - linux

I'm running a Fortran 77 program written by someone else. I'm using the gfortran compiler (v5.4.0) on Linux (Ubuntu v.16.04). I'm not an experienced user of Fortran, gcc or bash scripting, so I'm struggling here.
When my program finishes running, I get the following message:
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
I had to look this up - I understand that some of my floating-point numbers need to be stored "denormal", a low-precision form for very small numbers (rather than flushing them to zero). These come from the unstable aerodynamic calculations in the program - I've seen this when doing the calculations longhand. It's unlikely that these denormal quantities are significantly affecting my results, but to try and find out where/why this was happening, I tried compiling with the following error options:
gfortran –g –fbacktrace –ffpe-trap=invalid,zero,overflow,underflow,denormal –O3 –mcmodel=medium –o ../program.exe
The program compiled, but at runtime it crashed and returned:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7F442F143E08
#1 0x7F442F142F90
#2 0x7F442EA8A4AF
#3 0x4428CF in subroutine2_ at code.f:3601 (discriminator 3)
#4 0x442C3F in subroutine1_ at code.f:3569
#5 0x4489DA in code_ at code.f:428
#6 0x42BdD1 in MAIN__ at main.f:235
Floating point exception (core dumped)
I can interpret these as a hierarchy of calls, working backwards from 6 to 3:
*6. At line 235 of "main.f", there was a problem. [this is a call to "code.f"]
*5. At line 428 of "code.f", there was a problem. [this is a call to "subroutine1" in "code.f"]
*4. At line 3569 of "code.f", in "subroutine1", there was a problem. [this is a call to "subroutine2" in "code.f"]
*3. At line 3601 of "code.f", in "subroutine2", there was a problem. [this is a conditional statement]
if (windspd_2m.ge.5.0) then...
So the DENORMAL error must be occurring in the "then" operations (I haven't included that code because (a) it involves a long, complicated series of dependencies, and (b) I can unravel the math errors, it's the debug errors I'm struggling with).
But for the above errors 2,1,0... I don't know how to interpret these strings of numbers/letters. I also don't know what "discriminator 3" means. I've googled these, but the only resources I've found explain them assuming a higher level of knowledge than I have. Can anyone help me to interpret these error codes, assuming very little pre-existing knowledge of Fortran, gcc, or bash scripting?

The first three stack frames are due to the implementation of backtracing in the GFortran runtime library (libgfortran). The backtrace can't resolve addresses in dynamic libraries symbolically, hence you get just the addresses. If you want to see symbolic output, you can add "-static" to your compile options.
Thus my first guess would be that the error is at code.f:3601, and since 5.0 is a constant it follows that windspd_2m ought to be denormal.

Related

What error codes should kernel modules use on Linux?

I am writing a kernel module that will return error codes should something go wrong. My problem is that these are the same codes will be returned by init_module. I currently have only one situation in which my kernel module will fail and the error code would have been -1 which is interpreted as there being a problem with permissions. This means it would be indistinguishable from a case of the process actually not having the permission to load modules. So, which codes should I use? A number lower than the lowest error defined in the kernel errno headers?
As Tsyvarev said, if the error number describes the problem with loading your module well, you may (and indeed should) return this error code (negated, from init_module()) usually. But you should make exceptions from this rule for errors used in insmod or documented for the system call init_module, because e. g. return -ENOENT from the init_module() function makes insmod misleadingly output Unknown symbol in module. Instead of those errors better use less mistakable error codes as EIDRM, EUCLEAN or ECANCELED.

Can't manage to get the Star-Schema DBMS benchmark data generator to run properly

One of the commonly (?) used DBMS benchmarks is called SSB, the Star-Schema Benchmark. To run it, you need to generate your schema, i.e. your tables with the data in them. Well, there's a generator program you can find in all sorts of places (on github):
https://github.com/rxin/ssb-dbgen
https://code.google.com/p/gpudb/source/checkout (then under tests/ssb/dbgen or something)
https://github.com/electrum/ssb-dbgen/
and possibly elsewhere. I'm not sure those all have exactly the same code, but I seem to be experiencing the same problem with them. I'm using a Linux 64-bit system (Kubuntu 14.04 if that helps); and am trying to build and run the `dbgen' program from that package.
When building, I get type/size-related warnings:
me#myhost:~/src/ssb-dbgen$ make
... etc. etc. ...
gcc -O -DDBNAME=\"dss\" -DLINUX -DDB2 -DSSBM -c -o varsub.o varsub.c
rnd.c: In function גrow_stopג:
rnd.c:60:6: warning: format ג%dג expects argument of type גintג, but argument 4 has type גlong intג [-Wformat=]
i, Seed[i].usage);
^
driver.c: In function גpartialג:
driver.c:606:4: warning: format ג%dג expects argument of type גintג, but argument 4 has type גlong intג [-Wformat=]
... etc. etc. ...
Then, I make sure all the right files are in place, try to generate my tables, and only get two of them! I try to explicitly generate the LINEORDER table, and get a strange failure:
eyal#vivaldi:~/src/ssb-dbgen$ ls
bcd2.c build.c driver.c HISTORY makefile_win print.c rnd.c speed_seed.o varsub.c
bcd2.h build.o driver.o history.html mkf.macos print.o rnd.h ssb-dbgen-master varsub.o
bcd2.o CHANGES dss.ddl load_stub.c permute.c qgen rnd.o text.c
bm_utils.c config.h dss.h load_stub.o permute.h qgen.c rxin-ssb-dbgen-master.zip text.o
bm_utils.o dbgen dss.ri Makefile permute.o qgen.o shared.h tpcd.h
BUGS dists.dss dsstypes.h makefile.suite PORTING.NOTES README speed_seed.c TPCH_README
me#myhost:~/src/ssb-dbgen$ ./dbgen -vfF -s 1
SSBM (Star Schema Benchmark) Population Generator (Version 1.0.0)
Copyright Transaction Processing Performance Council 1994 - 2000
Generating data for suppliers table [pid: 32303]done.
Generating data for customers table [pid: 32303]done.
Generating data for (null) [pid: 32303]done.
Generating data for (null) [pid: 32303]done.
Generating data for (null) [pid: 32303]done.
Generating data for (null) [pid: 32303]done.
me#myhost:~/src/ssb-dbgen$ ls *.tbl
customer.tbl supplier.tbl
me#myhost:~/src/ssb-dbgen$ ./dbgen -vfF -s 1 -T l
SSBM (Star Schema Benchmark) Population Generator (Version 1.0.0)
Copyright Transaction Processing Performance Council 1994 - 2000
Generating data for lineorder table [pid: 32305]*** buffer overflow detected ***: ./dbgen terminated
======= Backtrace: =========
... etc. etc. ...
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fcea1b79ec5]
./dbgen[0x401219]
======= Memory map: ========
... etc. etc. ...
Now, if I switch to a 32-bit Linux system, I don't get any of these warnings (although there two warnings about pointer-to-non-pointer conversion); but running the generation again produces only two tables. Now, other individual tables can be produced - but they don't correspond to one another at all, I would think...
Has anyone encountered a similar problem? Am I doing something wrong? Am I using the wrong sources somehow?
(This is almost a dupe of
SSB dbgen Linux - Segmentation Fault
... but I can't "take over" somebody else's question when they may have encountered other problems than mine. Also, that one has no answers...)
If anyone still encourage this problem, I found a solution here: https://github.com/electrum/ssb-dbgen/pull/1
specifically, you have to modify the two files shared.h and config.h
Regards.
Edit: change:
#ifdef SSBM
#define MAXAGG_LEN 10 /* max component length for a agg str */
to:
#ifdef SSBM
#define MAXAGG_LEN 20 /* max component length for a agg str */
I found workaround but You need Windows system.
Download and unzip this package:
https://github.com/LucidDB/thirdparty/blob/master/ssb.tar.bz2
In bin directory is dbgen.exe. Run it from windows console like f.g.:
...\bin\dbgen.exe -s 1 -T a
After that just copy created files to your Linux system. Not the best way, but effective :)
So, eventually, I ended up surveying all versions of ssb-dbgen on GitHub, and creating a unified repository:
https://github.com/eyalroz/ssb-dbgen/
this repository:
incorporates fixes for all bugs fixed in any of those versions, and a few others. In particular, the format mismatch due to different int sizes on Linux and Windows for 64-bit machines is resolved.
Switches the build to using CMake, rather than needing to manually edit Makefiles. Specifically, building on Windows and MacOS is supported. Building on more exotic systems is theoretically supported.
has CI build testing of commits to make sure that at least the building doesn't break.

debugging a strange error when new is called

I have a program that when run compiled with Microsoft Visual C++ 2008 Express crashes on the line
comparison_vectors = new vec_element[(rbfnetparams->comparison_vector_length)+1];
with the error Unhandled exception at 0x7c93426d in myprog.exe: 0xC0000005: Access violation reading location 0x00000000
rbfnetparams->comparison_vector_length evaluates to 4 (should do and checked in the debugger), and the thing still crashes here when I change the line as a test to:
comparison_vectors = new vec_element[5];
vec_element is a structure with several ints, doubles and a few bools, but no methods or constructor. The thing runs if I replace new with malloc, but then crashes on another new elsewhere. It does not crash every time this line is run, only sometimes, but seems to do so after the same number of iterations of this line each time. Memory usage is only 10MB at this point in the program.
This gets stranger as the same program DOES compile and run under gcc on Solaris, which usually shows up far more errors than Windows does.
Any help would be appreciated, as I am at a loss as to how to debug this one.
Access violation reading location 0x00000000 means "you dereferenced a NULL pointer." It looks like once in a while rbfnetparams is NULL when you reach this line, and thus you get the error.
I can't explain why comparison_vectors = new vec_element[5]; crashes. Is it the same error message?
Check if rbfnetparams is NULL before the line, and see if it gets hit (or add a conditional break point). Then decide if the fact that rbfnetparams is NULL is a symptom of a bigger bug somewhere else.
Dereferencing a NULL pointer is undefined. It's possible that the Solaris compiler does an optimization that masks the bug. That's allowed by the Standard (read the whole series referenced from that post).

gdb break when program opens specific file

Back story: While running a program under strace I notice that '/dev/urandom' is being open'ed. I would like to know where this call is coming from (it is not part of the program itself, it is part of the system).
So, using gdb, I am trying to break (using catch syscall open) program execution when the open call is issued, so I can see a backtrace. The problem is that open is being called alot, like several hundred times so I can't narrow down the specific call that is opening /dev/urandom. How should I go about narrowing down the specific call? Is there a way to filter by arguments, and if so how do I do it for a syscall?
Any advice would be helpful -- maybe I am going about this all wrong.
GDB is a pretty powerful tool, but has a bit of a learning curve.
Basically, you want to set up a conditional breakpoint.
First use the -i flag to strace or objdump -d to find the address of the open function or more realistically something in the chain of getting there, such as in the plt.
set a breakpoint at that address (if you have debug symbols, you can use those instead, omitting the *, but I'm assuming you don't - though you may well have them for library functions if nothing else.
break * 0x080482c8
Next you need to make it conditional
(Ideally you could compare a string argument to a desired string. I wasn't getting this to work within the first few minutes of trying)
Let's hope we can assume the string is a constant somewhere in the program or one of the libraries it loads. You could look in /proc/pid/maps to get an idea of what is loaded and where, then use grep to verify the string is actually in a file, objdump -s to find it's address, and gdb to verify that you've actually found it in memory by combining the high part of the address from maps with the low part from the file. (EDIT: it's probably easier to use ldd on the executable than look in /proc/pid/maps)
Next you will need to know something about the abi of the platform you are working on, specifically how arguments are passed. I've been working on arm's lately, and that's very nice as the first few arguments just go in registers r0, r1, r2... etc. x86 is a bit less convenient - it seems they go on the stack, ie, *($esp+4), *($esp+8), *($esp+12).
So let's assume we are on an x86, and we want to check that the first argument in esp+4 equals the address we found for the constant we are trying to catch it passing. Only, esp+4 is a pointer to a char pointer. So we need to dereference it for comparison.
cond 1 *(char **)($esp+4)==0x8048514
Then you can type run and hope for the best
If you catch your breakpoint condition, and looking around with info registers and the x command to examine memory seems right, then you can use the return command to percolate back up the call stack until you find something you recognize.
(Adapted from a question edit)
Following Chris's answer, here is the process that eventually got me what I was looking for:
(I am trying to find what functions are calling the open syscall on "/dev/urandom")
use ldd on executable to find loaded libraries
grep through each lib (shell command) looking for 'urandom'
open library file in hex editor and find address of string
find out how parameters are passed in syscalls (for open, file is first parameter. on x86_64 it is passed in rdi -- your mileage may vary
now we can set the conditional breakpoint: break open if $rdi == _addr_
run program and wait for break to hit
run bt to see backtrace
After all this I find that glib's g_random_int() and g_rand_new() use urandom. Gtk+ and ORBit were calling these functions -- if anybody was curious.
Like Andre Puel said:
break open if strcmp($rdi,"/dev/urandom") == 0
Might do the job.

Visual C++ App crashes before main in Release, but runs fine in Debug

When in release it crashes with an unhandled exception: std::length error.
The call stack looks like this:
msvcr90.dll!__set_flsgetvalue() Line 256 + 0xc bytes C
msvcr90.dll!__set_flsgetvalue() Line 256 + 0xc bytes C
msvcr90.dll!_getptd_noexit() Line 616 + 0x7 bytes C
msvcr90.dll!_getptd() Line 641 + 0x5 bytes C
msvcr90.dll!rand() Line 68 C
NEM.exe!CGAL::Random::Random() + 0x34 bytes C++
msvcr90.dll!_initterm(void (void)* * pfbegin=0x00000003, void (void)* * pfend=0x00345560) Line 903 C
NEM.exe!__tmainCRTStartup() Line 582 + 0x17 bytes C
kernel32.dll!7c817067()
Has anyone got any clues?
Examining the stack dump:
InitTerm is simply a function that walks a list of other functions and executes each in step - this is used for, amongst other things, global constructors (on startup), global destructors (on shutdown) and atexit lists (also on shutdown).
You are linking with CGAL, since that CGAL::Random::Random in your stack dump is due to the fact that CGAL defines a global variable called default_random of the CGAL::Random::Random type. That's why your error is happening before main, the default_random is being constructed.
From the CGAL source, all it does it call the standard C srand(time(NULL)) followed by the local get_int which, in turn, calls the standard C rand() to get a random number.
However, you're not getting to the second stage since your stack dump is still within srand().
It looks like it's converting your thread into a fiber lazily, i.e., this is the first time you've tried to do something in the thread and it has to set up fiber-local storage before continuing.
So, a couple of things to try and investigate.
1/ Are you running this code on pre-XP? I believe fiber-local storage (__set_flsgetvalue) was introduced in XP. This is a long shot but we need to clear it up anyway.
2/ Do you need to link with CGAL? I'm assuming your application needs something in the CGAL libraries, otherwise don't link with it. It may be a hangover from another project file.
3/ If you do use CGAL, make sure you're using the latest version. As of 3.3, it supports a dynamic linking which should prevent the possibility of mixing different library versions (both static/dynamic and debug/nondebug).
4/ Can you try to compile with VC8? The CGAL supported platforms do NOT yet include VC9 (VS2008). You may need to follow this up with the CGAL team itself to see if they're working on that support.
5/ And, finally, do you have Boost installed? That's another long shot but worth a look anyway.
If none of those suggestions help, you'll have to wait for someone more knowledgable than I to come along, I'm afraid.
Best of luck.
Crashes before main() are usually caused by a bad constructor in a global or static variable.
Looks like the constructor for class Random.
Do you have a global or static variable of type Random? Is it possible that you're trying to construct it before the library it's in has been properly initialized?
Note that the order of construction of global and static variables is not fixed and might change going from debug to release.
Could you be more specific about the error you're receiving? (unhandled exception std::length sounds weird - i've never heard of it)
To my knowledge, FlsGetValue automatically falls back to TLS counterpart if FLS API is not available.
If you're still stuck, take .dmp of your process at the time of crash and post it (use any of the numerous free upload services - and give us a link) (Sounds like a missing feature in SO - source/data file exchange?)

Resources