gfortran openmp no threading - multithreading

I have a simple Fortran code and, despite calling the omp_set_num_threads() subroutine, I cannot set the number of threads, i.e. the output says that only 1 thread is used. I also tried export OMP_NUM_THREADS=4, with no result.
I have no idea what is wrong with this piece of code:
program test
use omp_lib
implicit none
integer :: i, tnr,t
call omp_set_num_threads( 4 )
t = omp_get_num_threads()
write(*,*)'t:',t
!$omp parallel
!$omp do
do i = 1, 20
tnr = omp_get_thread_num()
write( *, * ) 'Thread', tnr, ':', i
end do
!$omp end do
!$omp end parallel
end program test
The output of that code is:
t: 1
Thread 0 : 1
Thread 0 : 2
Thread 0 : 3
Thread 0 : 4
Thread 0 : 5
Thread 0 : 6
Thread 0 : 7
Thread 0 : 8
Thread 0 : 9
Thread 0 : 10
Thread 0 : 11
Thread 0 : 12
Thread 0 : 13
Thread 0 : 14
Thread 0 : 15
Thread 0 : 16
Thread 0 : 17
Thread 0 : 18
Thread 0 : 19
Thread 0 : 20
Thanks for any kind of tip!
I use Gentoo Linux; the gcc-4.5.4 compiler has the openmp flag activated. The CPU is a 2nd-generation mobile Core i7.
ldd test:
linux-vdso.so.1 (0x00007fff85fce000)
libgfortran.so.3 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.4/libgfortran.so.3 (0x00007fe310460000)
libm.so.6 => /lib64/libm.so.6 (0x00007fe310169000)
libgomp.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.4/libgomp.so.1 (0x00007fe30ff5b000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.4/libgcc_s.so.1 (0x00007fe30fd45000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe30fb28000)
libc.so.6 => /lib64/libc.so.6 (0x00007fe30f77d000)
librt.so.1 => /lib64/librt.so.1 (0x00007fe30f574000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe310749000)
gfortran -v
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.5.4/gfortran
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.5.4/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-4.5.4/work/gcc-4.5.4/configure --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.5.4 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.4/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.5.4 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.5.4/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.5.4/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.4/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --disable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --enable-multilib --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.5.4/python --enable-checking=release --disable-libgcj --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.5.4 p1.0, pie-0.4.7'
Thread model: posix
gcc version 4.5.4 (Gentoo 4.5.4 p1.0, pie-0.4.7)
Output of testmp.f.140t.optimized (the one before the *.statistics):
;; Function test (MAIN__)
test ()
{
struct __st_parameter_dt dt_parm.1;
logical(kind=4) D.1545;
struct __st_parameter_dt dt_parm.0;
integer(kind=4) tnr;
integer(kind=4) t;
integer(kind=4) i;
integer(kind=4) i.8;
integer(kind=4) i.7;
integer(kind=4) i.6;
integer(kind=4) tnr.5;
integer(kind=4) i.4;
integer(kind=4) t.3;
<bb 2>:
omp_set_num_threads (&C.1537);
t.3_1 = omp_get_max_threads ();
t = t.3_1;
dt_parm.0.common.filename = &"testmp.f"[1]{lb: 1 sz: 1};
dt_parm.0.common.line = 11;
dt_parm.0.common.flags = 128;
dt_parm.0.common.unit = 6;
_gfortran_st_write (&dt_parm.0);
_gfortran_transfer_character (&dt_parm.0, &"t:"[1]{lb: 1 sz: 1}, 2);
_gfortran_transfer_integer (&dt_parm.0, &t, 4);
_gfortran_st_write_done (&dt_parm.0);
i = 1;
i.4_2 = i;
if (i.4_2 <= 20)
goto <bb 3>;
else
goto <bb 5>;
<bb 3>:
tnr.5_3 = omp_get_thread_num ();
tnr = tnr.5_3;
dt_parm.1.common.filename = &"testmp.f"[1]{lb: 1 sz: 1};
dt_parm.1.common.line = 16;
dt_parm.1.common.flags = 128;
dt_parm.1.common.unit = 6;
_gfortran_st_write (&dt_parm.1);
_gfortran_transfer_character (&dt_parm.1, &"Thread"[1]{lb: 1 sz: 1}, 6);
_gfortran_transfer_integer (&dt_parm.1, &tnr, 4);
_gfortran_transfer_character (&dt_parm.1, &":"[1]{lb: 1 sz: 1}, 1);
_gfortran_transfer_integer (&dt_parm.1, &i, 4);
_gfortran_st_write_done (&dt_parm.1);
i.6_4 = i;
D.1545_5 = i.6_4 == 20;
i.7_6 = i;
i.8_7 = i.7_6 + 1;
i = i.8_7;
if (D.1545_5 != 0)
goto <bb 5>;
else
goto <bb 4>;
<bb 4>:
goto <bb 3>;
<bb 5>:
return;
}
;; Function main (main)
main (integer(kind=4) argc, character(kind=1) * * argv)
{
static integer(kind=4) options.2[8] = {68, 255, 0, 0, 0, 1, 0, 1};
integer(kind=4) D.1552;
<bb 2>:
_gfortran_set_args (argc_1(D), argv_2(D));
_gfortran_set_options (8, &options.2[0]);
test ();
D.1552_3 = 0;
return D.1552_3;
}

Setting OMP_NUM_THREADS or calling omp_set_num_threads() sets the nthreads-var ICV (Internal Control Variable). To read its value back, call omp_get_max_threads(), not omp_get_num_threads().
Second, there is a data race in your code. By default OpenMP treats both tnr and t as shared variables. In that case the value of tnr displayed by the write statement is whatever the last thread to execute the assignment stored (note that GCC suppresses register optimisation when it comes to shared variables).
The correct code would be as follows:
program test
use omp_lib
implicit none
integer :: i, tnr,t
call omp_set_num_threads( 4 )
t = omp_get_max_threads()
write(*,*)'t:',t
!$omp parallel do private(tnr)
do i = 1, 20
tnr = omp_get_thread_num()
write( *, * ) 'Thread', tnr, ':', i
end do
!$omp end parallel do
end program test
Note that when a do construct is the only thing nested immediately inside a parallel region, one can use the combined parallel do construct and save two lines of code.
You have stored Fortran 90 code in a .f file, which is therefore recognised as fixed source form. In this case the OpenMP directives must obey the following rules:
The following sentinels are recognized in fixed form source files:
!$omp | c$omp | *$omp
Sentinels must start in column 1 and appear as a single word with no intervening characters. Fortran fixed form line length, white space, continuation, and column rules apply to the directive line. Initial directive lines must have a space or zero in column 6, and continuation directive lines must have a character other than a space or a zero in column 6. (emphasis mine)
I guess your directives start in the same column as the rest of the program code and are therefore treated simply as comments and not as OpenMP directives, as is evident from the content of the testmp.f.140t.optimized file.

First, you have to compile this code with
gfortran -fopenmp FILE
Second, omp_get_num_threads() returns the number of threads in the team executing the region you are currently in. As you called this function in a serial region, the answer is always 1.
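To make the difference concrete, here is a minimal C sketch (the question is in Fortran, but the same runtime routines exist in the C API). The struct and function names are made up for illustration, and the #ifdef _OPENMP guard is my own addition so that the file still builds when -fopenmp is left out, in which case every count simply stays at 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical container for the three values we want to compare. */
struct thread_counts {
    int max_threads;    /* omp_get_max_threads() in the serial part  */
    int serial_team;    /* omp_get_num_threads() in the serial part  */
    int parallel_team;  /* omp_get_num_threads() inside the region   */
};

/* Hypothetical helper: requests `requested` threads and records what
   the runtime reports before and inside a parallel region. */
struct thread_counts query_thread_counts(int requested)
{
    struct thread_counts c = {1, 1, 1};
#ifdef _OPENMP
    omp_set_num_threads(requested);
    c.max_threads = omp_get_max_threads();  /* reads nthreads-var back */
    c.serial_team = omp_get_num_threads();  /* 1 outside a parallel region */
    #pragma omp parallel
    {
        /* one thread records the size of the current team */
        #pragma omp single
        c.parallel_team = omp_get_num_threads();
    }
#else
    (void)requested;  /* built without OpenMP: everything stays serial */
#endif
    return c;
}
```

Calling query_thread_counts(4) from main and printing the three fields should show serial_team equal to 1 in every case; with -fopenmp, max_threads and parallel_team would normally both be 4, unless the runtime or environment limits the team size.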

Related

How to setup OpenMP to use whole hyperthreads for parallel processing?

Please help me. I want to use OpenMP for parallel processing in my program with all hardware threads. I set it up as follows:
#pragma omp parallel
{
omp_set_num_threads(272);
region my_routine processing;
}
When I execute it, I use "top" to check the CPU usage, and only sometimes does it reach 6800% (mostly it stays below 5500%), so it is not stable. I want it to stay at 6800% for the whole time my program is executing.
What am I doing wrong with OpenMP, or is there another method to use all the threads?
Thanks a lot.
This is my platform:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 272
On-line CPU(s) list: 0-271
Thread(s) per core: 4
Core(s) per socket: 68
Socket(s): 1
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping: 1
CPU MHz: 1392.507
BogoMIPS: 2799.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-271
NUMA node1 CPU(s):
Step 0: safety first: check with your cluster provider's HPC-support team whether they consider it harmful to put such a high level of workload on the cluster device(s) they own and operate.
Step 1: set up for a "smoke-on" flight-test
prepare the lstopo command ( apt-get or ask your admin to fix this, if necessary ), create the system's NUMA-topology map as a pdf-file and post it here
prepare the htop command ( apt-get or ask your admin to fix this, if necessary ), then configure it via F2:
set up METERS to show CPUs (1&2/4): first half in 2 shorter columns to the left MONITOR panel,
set up METERS to show CPUs (3&4/4): second half in 2 shorter columns to the left MONITOR panel,
set up COLUMNS to show at least the { PPID, PID, TGID, CPU, CPU%, STATUS, command } column fields
Step 2: with htop-monitor running, run the compiled OpenMP code
Expect something like this on the terminal CLI, yet the htop monitor will show the live landscape of the NUMA CPU workloads better than any single number:
Real time: 23.027 s
User time: 45.337 s
Sys. time: 0.047 s
Exit code: 0
stdout will read something like this:
WARMUP: OpenMP thread[ 0] instantiated as thread[ 0]
WARMUP: OpenMP thread[ 3] instantiated as thread[ 3]
...
WARMUP: OpenMP thread[272] instantiated as thread[272]
my_routine(): thread[ 0] START_TIME( 2078891848 )
my_routine(): thread[ 2] START_TIME( -528891186 )
...
my_routine(): thread[ 2] ENDED_TIME( 635748478 ) sum = 1370488.801186
HOT RUN: in thread[ 2] my_routine() returned 10.915321 ....
my_routine(): thread[ 4] ENDED_TIME( -1543969584 ) sum = 1370489.030301
HOT RUN: in thread[ 4] my_routine() returned 11.133672 ....
my_routine(): thread[ 1] ENDED_TIME( -213996360 ) sum = 1370489.060176
HOT RUN: in thread[ 1] my_routine() returned 11.158897 ....
...
my_routine(): thread[ 0] ENDED_TIME( -389214506 ) sum = 1370489.079366
HOT RUN: in thread[270] my_routine() returned 11.149798 ....
my_routine(): thread[ 3] ENDED_TIME( -586400566 ) sum = 1370489.125829
HOT RUN: in thread[269] my_routine() returned 11.091430 ....
OpenMP ver(201511)...finito
mock-up source ( on TiO.run ):
#include <omp.h> // ------------------------------------ compile flags: -fopenmp -O3
#include <stdio.h>
#define MAX_COUNT 999999999
#define MAX_THREADS 272
double my_routine()
{
printf( "my_routine(): thread[%3d] START_TIME( %20.6f )\n", omp_get_thread_num(), omp_get_wtime() ); // omp_get_wtime() returns a double, so %f rather than %d
double temp = omp_get_wtime(),
sum = 0;
for ( int count = 0; count < MAX_COUNT; count++ )
{
sum += ( omp_get_wtime() - temp ); temp = omp_get_wtime();
}
printf( "my_routine(): thread[%3d] ENDED_TIME( %20.6f ) sum = %15.6f\n", omp_get_thread_num(), omp_get_wtime(), sum ); // omp_get_wtime() returns a double, so %f rather than %d
return( sum );
}
void warmUp() // -------------------------------- prevents performance skewing in-situ
{ // NOP-alike payload, yet enforces all thread-instantiations to happen
#pragma omp parallel for num_threads( MAX_THREADS )
for ( int i = 0; i < MAX_THREADS; i++ )
printf( "WARMUP: OpenMP thread[%3d] instantiated as thread[%3d]\n", i, omp_get_thread_num() );
}
int main( int argc, char **argv )
{
omp_set_num_threads( MAX_THREADS ); warmUp(); // ---------- pre-arrange all threads
#pragma omp parallel for
for ( int i = 0; i < MAX_THREADS; i++ )
printf( "HOT RUN: in thread[%3d] my_routine() returned %34.6f ....\n", omp_get_thread_num(), my_routine() );
printf( "\nOpenMP ver(%d)...finito", _OPENMP );
}
I am running on CentOS 7 and am a little confused by your guide. This is my NUMA topology and htop monitor while the application runs. You can see it only uses 1 thread per core, and even that one thread does not reach 100%. How can I use 4 threads per core, or 100% of 1 thread per core?

openmp core assignation fails

My Centos 6 VM shows four cores when displaying the content of /proc/cpuinfo, and /sys/devices/system/cpu/online shows 0-3.
I am trying to run the following code on cores 2 and 3 using KMP_AFFINITY="explicit,proclist=[2-3]"
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
int main (int argc, char *argv[]) {
int nthreads, tid, cid;
#pragma omp parallel private(nthreads, tid)
{
tid = omp_get_thread_num();
cid = sched_getcpu();
printf("Hello from thread %d on core %d\n", tid, cid);
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
}
}
When compiled with icc (ICC) 16.0.1 20151021, it fails to detect the available cores and executes everything on core 0.
$ OMP_NUM_THREADS=4 ./a.out
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Whereas gcc (GCC) 4.4.7 20120313, with GOMP_CPU_AFFINITY="2-3", executes properly on cores 2 and 3, as set.
I used strace to check what's going on under the hood, and I noticed something strange:
[...]
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
read(3, "0-3\n", 8192) = 4
[...]
sched_getaffinity(0, 1048576, { 1 }) = 8
sched_setaffinity(0, 8, { 4521c26fbb1c38c1 }) = -1 EFAULT (Bad addres
[...]
Could this be a bug in Intel's implementation of OpenMP?
I cannot upgrade my compiler to fix it in this case. Is it possible to use the GCC OpenMP library instead of the Intel one when compiling with icc ?
Update:
I managed to compile the code with gcc and link it against iomp using the following command
gcc omp.c -L/opt/intel/compilers_and_libraries_2016/linux/lib/intel64_lin/ -liomp5
The execution outputs no warning, but the result is still not correct:
$ OMP_NUM_THREADS=4 ./a.out
Hello from thread 0 on core 0
Number of threads = 1
Same sched_setaffinity error as previously shown.
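One thing worth checking here: in the strace output, sched_getaffinity() appears to return a mask of { 1 }, which would mean the process is only allowed on CPU 0 before the OpenMP runtime does anything. That is my reading of the trace, not something confirmed in the thread. A minimal sketch for looking at the kernel's view from inside the process follows; the function name is hypothetical, and it assumes a Linux /proc with the Cpus_allowed_list field in /proc/self/status.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies the "Cpus_allowed_list" value of the
   calling process (e.g. "0-3") into out.
   Returns 0 on success, -1 if the field or /proc is unavailable. */
int read_allowed_cpu_list(char *out, size_t out_size)
{
    static const char key[] = "Cpus_allowed_list:";
    char line[512];
    FILE *fp;
    int rc = -1;

    if (out == NULL || out_size == 0)
        return -1;
    out[0] = '\0';

    fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, key, sizeof key - 1) == 0) {
            const char *value = line + (sizeof key - 1);
            /* skip the blanks or tabs that follow the colon */
            while (*value == ' ' || *value == '\t')
                value++;
            /* copy everything up to, but not including, the newline */
            snprintf(out, out_size, "%.*s", (int)strcspn(value, "\n"), value);
            rc = 0;
            break;
        }
    }
    fclose(fp);
    return rc;
}
```

If this reports only "0" while /sys/devices/system/cpu/online says 0-3, the restriction probably comes from how the process was launched (taskset, a cgroup/cpuset, or the VM configuration) rather than from the OpenMP runtime.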

OpenMP: splitting loop based on NUMA

I am running the following loop using, say, 8 OpenMP threads:
float* data;
int n;
#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
DO SOMETHING WITH data[i]
}
Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3
and second half (i = n/2, ..., n-1) with threads 4,5,6,7.
Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.
How do I achieve this with OpenMP?
Thank you
PS: Ideally, if the threads from one group are done with their half of the loop while the other half is still not done, I'd like the threads from the finished group to join the unfinished group in processing the other half of the loop.
I am thinking about something like below, but I wonder if I can do this with OpenMP and no extra book-keeping:
int n;
int i0 = 0;
int i1 = n / 2;
#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
int nt = omp_get_thread_num();
int j;
#pragma omp critical
{
if ( nt < 4 ) {
if ( i0 < n / 2 ) j = i0++; // First 4 threads process first half
else j = i1++; // of loop unless first half is finished
}
else {
if ( i1 < n ) j = i1++; // Second 4 threads process second half
else j = i0++; // of loop unless second half is finished
}
}
DO SOMETHING WITH data[j]
}
Probably the best approach is nested parallelization, first over NUMA nodes, then within each node; then you can use the infrastructure for dynamic scheduling while still breaking the data up amongst thread groups:
#include <omp.h>
#include <stdio.h>
int main(int argc, char **argv) {
const int ngroups=2;
const int npergroup=4;
const int ndata = 16;
omp_set_nested(1);
#pragma omp parallel for num_threads(ngroups)
for (int i=0; i<ngroups; i++) {
int start = (ndata*i+(ngroups-1))/ngroups;
int end = (ndata*(i+1)+(ngroups-1))/ngroups;
#pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
for (int j=start; j<end; j++) {
printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
}
}
return 0;
}
Running this gives
$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15
But note that the nested approach may well change the thread assignments over what the one-level threading would be, so you will probably have to play with KMP_AFFINITY or other mechanisms a bit more to get the bindings right again.
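The start/end arithmetic in the nested example is easy to get wrong when ndata is not a multiple of ngroups, so here is the same formula pulled out into a small helper. The function name is hypothetical; the arithmetic is copied from the example above. It yields half-open ranges [start, end) that tile 0..ndata without gaps or overlap.

```c
/* Hypothetical helper using the same rounding as the nested example:
   group `group` (0-based) of `ngroups` gets the range [*start, *end). */
void group_bounds(int ndata, int ngroups, int group, int *start, int *end)
{
    *start = (ndata * group + (ngroups - 1)) / ngroups;
    *end   = (ndata * (group + 1) + (ngroups - 1)) / ngroups;
}
```

For ndata = 10 and ngroups = 3 this gives [0, 4), [4, 7) and [7, 10): the end of one group is computed by the same expression as the start of the next, which is why the pieces always line up.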

LTL model checking using Spin and Promela syntax

I'm trying to reproduce ALGOL 60 code written by Dijkstra in the paper titled "Cooperating sequential processes". The code is the first attempt to solve the mutex problem; here is the syntax:
begin integer turn; turn:= 1;
parbegin
process 1: begin L1: if turn = 2 then goto L1;
critical section 1;
turn:= 2;
remainder of cycle 1; goto L1
end;
process 2: begin L2: if turn = 1 then goto L2;
critical section 2;
turn:= 1;
remainder of cycle 2; goto L2
end
parend
end
So I tried to reproduce the above code in Promela and here is my code:
#define true 1
#define Aturn true
#define Bturn false
bool turn, status;
active proctype A()
{
L1: (turn == 1);
status = Aturn;
goto L1;
/* critical section */
turn = 1;
}
active proctype B()
{
L2: (turn == 2);
status = Bturn;
goto L2;
/* critical section */
turn = 2;
}
never{ /* ![]p */
if
:: (!status) -> skip
fi;
}
init
{ turn = 1;
run A(); run B();
}
What I'm trying to do is verify that the fairness property will never hold because the label L1 is running infinitely.
The issue here is that my never claim block is not producing any error; the output I get simply says that my statement was never reached.
Here is the actual output from iSpin:
spin -a dekker.pml
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c
./pan -m10000
Pid: 46025
(Spin Version 6.2.3 -- 24 October 2012)
+ Partial Order Reduction
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 44 byte, depth reached 8, errors: 0
11 states, stored
9 states, matched
20 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.001 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in proctype A
dekker.pml:13, state 4, "turn = 1"
dekker.pml:15, state 5, "-end-"
(2 of 5 states)
unreached in proctype B
dekker.pml:20, state 2, "status = 0"
dekker.pml:23, state 4, "turn = 2"
dekker.pml:24, state 5, "-end-"
(3 of 5 states)
unreached in claim never_0
dekker.pml:30, state 5, "-end-"
(1 of 5 states)
unreached in init
(0 of 4 states)
pan: elapsed time 0 seconds
No errors found -- did you verify all claims?
I've read all the Spin documentation on the never{..} block but couldn't find my answer (here is the link). I've also tried using ltl{..} blocks (link), but that just gave me a syntax error, even though it's explicitly mentioned in the documentation that they can be outside the init and proctypes. Can someone help me correct this code, please?
Thank you
You've redefined 'true', which can't possibly be good. I axed that redefinition and the never claim fails. But the failure is immaterial to your goal: the initial state of 'status' is 'false', so the never claim exits, which counts as a failure.
Also, it is slightly bad form to assign 1 or 0 to a bool; assign true or false instead - or use bit. Why not follow the Dijkstra code more closely - use an 'int' or 'byte'. It is not as if performance will be an issue in this problem.
You don't need 'active' if you are going to call 'run' - just one or the other.
My translation of 'process 1' would be:
proctype A ()
{
L1: turn !=2 ->
/* critical section */
status = Aturn;
turn = 2;
/* remainder of cycle 1 */
goto L1;
}
but I could be wrong on that.

Weird SIGSEGV segmentation fault in std::string::assign() method from libstdc++.so.6

My program recently encountered a weird segfault when running. I want to know if somebody had met this error before and how it could be fixed. Here is more info:
Basic info:
CentOS 5.2, kernel version is 2.6.18
g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50)
CPU: Intel x86 family
libstdc++.so.6.0.8
My program will start multiple threads to process data. The segfault occurred in one of the threads.
Though it's a multi-thread program, the segfault seemed to occur on a local std::string object. I'll show this in the code snippet later.
The program is compiled with -g, -Wall and -fPIC, and without -O2 or other optimization options.
The core dump info:
Core was generated by `./myprog'.
Program terminated with signal 11, Segmentation fault.
#0 0x06f6d919 in __gnu_cxx::__exchange_and_add(int volatile*, int) () from /usr/lib/libstdc++.so.6
(gdb) bt
#0 0x06f6d919 in __gnu_cxx::__exchange_and_add(int volatile*, int) () from /usr/lib/libstdc++.so.6
#1 0x06f507c3 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libstdc++.so.6
#2 0x06f50834 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator=(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libstdc++.so.6
#3 0x081402fc in Q_gdw::ProcessData (this=0xb2f79f60) at ../../../myprog/src/Q_gdw/Q_gdw.cpp:798
#4 0x08117d3a in DataParser::Parse (this=0x8222720) at ../../../myprog/src/DataParser.cpp:367
#5 0x08119160 in DataParser::run (this=0x8222720) at ../../../myprog/src/DataParser.cpp:338
#6 0x080852ed in Utility::__dispatch (arg=0x8222720) at ../../../common/thread/Thread.cpp:603
#7 0x0052c832 in start_thread () from /lib/libpthread.so.0
#8 0x00ca845e in clone () from /lib/libc.so.6
Please note that the segfault begins within the basic_string::operator=().
The related code:
(I've shown more code than that might be needed, and please ignore the coding style things for now.)
int Q_gdw::ProcessData()
{
char tmpTime[10+1] = {0};
char A01Time[12+1] = {0};
std::string tmpTimeStamp;
// Get the timestamp from TP
if((m_BackFrameBuff[11] & 0x80) >> 7)
{
for (i = 0; i < 12; i++)
{
A01Time[i] = (char)A15Result[i];
}
tmpTimeStamp = FormatTimeStamp(A01Time, 12); // Segfault occurs on this line
And here is the prototype of this FormatTimeStamp method:
std::string FormatTimeStamp(const char *time, int len)
I think such string assignments should be very common, so I just don't understand why a segfault could occur here.
What I have investigated:
I've searched the web for answers. I looked here. The reply says to try recompiling the program with the _GLIBCXX_FULLY_DYNAMIC_STRING macro defined. I tried, but the crash still happens.
I also looked here. It also says to recompile the program with _GLIBCXX_FULLY_DYNAMIC_STRING, but the author seems to be dealing with a different problem from mine, so I don't think his solution works for me.
Updated on 08/15/2011
Here is the original code of this FormatTimeStamp. I understand the code doesn't look very nice (too many magic numbers, for instance), but let's focus on the crash issue first.
string Q_gdw::FormatTimeStamp(const char *time, int len)
{
string timeStamp;
string tmpstring;
if (time) // It is guaranteed that "time" is correctly zero-terminated, so don't worry about any overflow here.
tmpstring = time;
// Get the current time point.
int year, month, day, hour, minute, second;
#ifndef _WIN32
struct timeval timeVal;
struct tm *p;
gettimeofday(&timeVal, NULL);
p = localtime(&(timeVal.tv_sec));
year = p->tm_year + 1900;
month = p->tm_mon + 1;
day = p->tm_mday;
hour = p->tm_hour;
minute = p->tm_min;
second = p->tm_sec;
#else
SYSTEMTIME sys;
GetLocalTime(&sys);
year = sys.wYear;
month = sys.wMonth;
day = sys.wDay;
hour = sys.wHour;
minute = sys.wMinute;
second = sys.wSecond;
#endif
if (0 == len)
{
// The "time" doesn't specify any time so we just use the current time
char tmpTime[30];
memset(tmpTime, 0, 30);
sprintf(tmpTime, "%d-%d-%d %d:%d:%d.000", year, month, day, hour, minute, second);
timeStamp = tmpTime;
}
else if (6 == len)
{
// The "time" specifies "day-month-year" with each being 2-digit.
// For example: "150811" means "August 15th, 2011".
timeStamp = "20";
timeStamp = timeStamp + tmpstring.substr(4, 2) + "-" + tmpstring.substr(2, 2) + "-" +
tmpstring.substr(0, 2);
}
else if (8 == len)
{
// The "time" specifies "minute-hour-day-month" with each being 2-digit.
// For example: "51151508" means "August 15th, 15:51".
// As the year is not specified, the current year will be used.
string strYear;
stringstream sstream;
sstream << year;
sstream >> strYear;
sstream.clear();
timeStamp = strYear + "-" + tmpstring.substr(6, 2) + "-" + tmpstring.substr(4, 2) + " " +
tmpstring.substr(2, 2) + ":" + tmpstring.substr(0, 2) + ":00.000";
}
else if (10 == len)
{
// The "time" specifies "minute-hour-day-month-year" with each being 2-digit.
// For example: "5115150811" means "August 15th, 2011, 15:51".
timeStamp = "20";
timeStamp = timeStamp + tmpstring.substr(8, 2) + "-" + tmpstring.substr(6, 2) + "-" + tmpstring.substr(4, 2) + " " +
tmpstring.substr(2, 2) + ":" + tmpstring.substr(0, 2) + ":00.000";
}
else if (12 == len)
{
// The "time" specifies "second-minute-hour-day-month-year" with each being 2-digit.
// For example: "305115150811" means "August 15th, 2011, 15:51:30".
timeStamp = "20";
timeStamp = timeStamp + tmpstring.substr(10, 2) + "-" + tmpstring.substr(8, 2) + "-" + tmpstring.substr(6, 2) + " " +
tmpstring.substr(4, 2) + ":" + tmpstring.substr(2, 2) + ":" + tmpstring.substr(0, 2) + ".000";
}
return timeStamp;
}
Updated on 08/19/2011
This problem has finally been addressed and fixed. The FormatTimeStamp() function has nothing to do with the root cause, in fact. The segfault is caused by a writing overflow of a local char buffer.
This problem can be reproduced with the following simpler program(please ignore the bad namings of some variables for now):
(Compiled with "g++ -Wall -g main.cpp")
#include <cstdio>   // sprintf
#include <cstring>  // memset
#include <iostream>
#include <string>
void overflow_it(char * A15, char * A15Result)
{
int m;
int t = 0,i = 0;
char temp[3];
for (m = 0; m < 6; m++)
{
t = ((*A15 & 0xf0) >> 4) *10 ;
t += *A15 & 0x0f;
A15 ++;
std::cout << "m = " << m << "; t = " << t << "; i = " << i << std::endl;
memset(temp, 0, sizeof(temp));
sprintf((char *)temp, "%02d", t); // The buggy code: temp is not big enough when t is a 3-digit integer.
A15Result[i++] = temp[0];
A15Result[i++] = temp[1];
}
}
int main(int argc, char * argv[])
{
std::string str;
{
char tpTime[6] = {0};
char A15Result[12] = {0};
// Initialize tpTime
for(int i = 0; i < 6; i++)
tpTime[i] = char(154); // 154 would result in a 3-digit t in overflow_it().
overflow_it(tpTime, A15Result);
str.assign(A15Result);
}
std::cout << "str says: " << str << std::endl;
return 0;
}
Here are two facts we should remember before going on:
1). My machine is an Intel x86 machine, so it is little-endian. Therefore, for a variable "m" of type int whose value is, say, 10, its memory layout might look like this:
Starting addr:0xbf89bebc: m(byte#1): 10
0xbf89bebd: m(byte#2): 0
0xbf89bebe: m(byte#3): 0
0xbf89bebf: m(byte#4): 0
2). The program above runs within the main thread. Inside the overflow_it() function, the layout of the variables on the thread's stack looks like this (only the important variables are shown):
0xbfc609e9 : temp[0]
0xbfc609ea : temp[1]
0xbfc609eb : temp[2]
0xbfc609ec : m(byte#1) <-- Note that m follows temp immediately. m(byte#1) happens to be the byte temp[3].
0xbfc609ed : m(byte#2)
0xbfc609ee : m(byte#3)
0xbfc609ef : m(byte#4)
0xbfc609f0 : t
...(3 bytes)
0xbfc609f4 : i
...(3 bytes)
...(etc. etc. etc...)
0xbfc60a26 : A15Result <-- Data would be written to this buffer in overflow_it()
...(11 bytes)
0xbfc60a32 : tpTime
...(5 bytes)
0xbfc60a38 : str <-- Note that str takes up 4 bytes. Its starting address is **18 bytes** behind A15Result (0xbfc60a38 - 0xbfc60a26).
My analysis:
1). m is a counter in overflow_it() whose value is incremented by 1 on each pass through the for loop and is never supposed to exceed 6. Thus its value fits entirely in m(byte#1) (remember the machine is little-endian), which happens to be temp[3].
2). In the buggy line: when t is a 3-digit integer, such as 109, the sprintf() call overflows the buffer, because serializing the number 109 to the string "109" actually requires 4 bytes: '1', '0', '9' and a terminating '\0'. Because temp[] is only 3 bytes long, the final '\0' is written to temp[3], which is exactly m(byte#1), the byte that stores m's value. As a result, m is reset to 0 every time.
3). The programmer's expectation, however, is that the for loop in overflow_it() executes only 6 times, with m being incremented by 1 each time. Because m keeps being reset to 0, the loop actually runs far more than 6 times.
4). Let's look at the variable i in overflow_it(): every time the loop body executes, i is incremented by 2 and A15Result[i] is written. If you compile and run this program, you'll see that i finally adds up to 24, which means overflow_it() writes to the bytes ranging from A15Result[0] to A15Result[23]. Note that the object str starts only 18 bytes behind A15Result[0], so overflow_it() has swept through str and destroyed its memory layout.
5). I think the correct use of std::string, as it is a non-POD data structure, depends on the instantiated std::string object having a correct internal state. But in this program, str's internal layout has been forcibly changed from outside. This should be why the assign() method call would finally cause a segfault.
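For completeness, here is one way the buggy conversion loop could be written so that a 3-digit value is rejected instead of overflowing temp. This is only a sketch in plain C, not the fix that went into the real program: the function name, the signature and the error convention are all made up for illustration. The key change is snprintf() with the buffer size, plus a check of its return value.

```c
#include <stdio.h>

/* Hypothetical replacement for the conversion loop in overflow_it():
   turns each input byte into two decimal digits, as the original does.
   out needs room for 2*n characters plus the terminating '\0'.
   Returns 0 on success, -1 if out is too small or a value does not
   fit in two digits. */
int bytes_to_two_digit_text(const unsigned char *in, size_t n,
                            char *out, size_t out_size)
{
    char temp[3];  /* two digits + '\0' */

    if (out_size < 2 * n + 1)
        return -1;

    for (size_t m = 0; m < n; m++) {
        int t = ((in[m] & 0xf0) >> 4) * 10 + (in[m] & 0x0f);
        /* snprintf never writes past sizeof temp; its return value is
           the length the full text would have needed */
        int written = snprintf(temp, sizeof temp, "%02d", t);
        if (written < 0 || (size_t)written >= sizeof temp)
            return -1;  /* e.g. t == 100 needs three digits */
        out[2 * m]     = temp[0];
        out[2 * m + 1] = temp[1];
    }
    out[2 * n] = '\0';
    return 0;
}
```

With the input byte 154 from the reproduction above, t is 100, snprintf() reports that 3 characters would be needed, and the function returns -1 instead of clobbering the loop counter.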
Update on 08/26/2011
In my previous update on 08/19/2011, I said that the segfault was caused by a method call on a local std::string object whose memory layout had been broken and thus became a "destroyed" object. This is not an "always" true story. Consider the C++ program below:
//C++
class A {
public:
void Hello(const std::string& name) {
std::cout << "hello " << name;
}
};
int main(int argc, char** argv)
{
A* pa = NULL; //!!
pa->Hello("world");
return 0;
}
The Hello() call would typically appear to succeed, even if you assign an obviously bad pointer to pa (formally, calling a member function through a null or invalid pointer is undefined behaviour, so nothing is guaranteed). The reason is: the non-virtual methods of a class don't reside within the memory layout of the object, according to the C++ object model. The C++ compiler turns the A::Hello() method into something like, say, A_Hello_xxx(A * const this, ...), which could be a global function. Thus, as long as you don't operate on the "this" pointer, things could go pretty well.
This fact shows that a "bad" object is NOT the root cause that results in the SIGSEGV segfault. The assign() method is not virtual in std::string, thus the "bad" std::string object wouldn't cause the segfault. There must be some other reason that finally caused the segfault.
I noticed that the segfault comes from the __gnu_cxx::__exchange_and_add() function, so I then looked into its source code in this web page:
static inline _Atomic_word
__exchange_and_add(volatile _Atomic_word* __mem, int __val)
{ return __sync_fetch_and_add(__mem, __val); }
The __exchange_and_add() finally calls the __sync_fetch_and_add(). According to this web page, the __sync_fetch_and_add() is a GCC builtin function whose behavior is like this:
type __sync_fetch_and_add (type *ptr, type value, ...)
{
tmp = *ptr;
*ptr op= value; // Here the "op=" means "+=" as this function is "_and_add".
return tmp;
}
There it is! The passed-in ptr pointer is dereferenced here. In the 08/19/2011 program, the ptr is actually the "this" pointer of the "bad" std::string object within the assign() method. It is the dereference at this point that actually caused the SIGSEGV segmentation fault.
We could test this with the following program:
#include <bits/atomicity.h>
int main(int argc, char * argv[])
{
__sync_fetch_and_add((_Atomic_word *)0, 10); // Would result in a segfault.
return 0;
}
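As a counterpart to the crashing test, the following sketch shows the builtin on a valid address, to make the "return the old value, then add" behaviour described above concrete. The wrapper name is hypothetical; __sync_fetch_and_add itself is the GCC builtin discussed in the post, so this assumes GCC or a compatible compiler.

```c
/* Hypothetical wrapper around the GCC builtin: atomically adds delta
   to *counter and returns the value *counter held before the add. */
int fetch_then_add(int *counter, int delta)
{
    return __sync_fetch_and_add(counter, delta);
}
```

With int c = 5, the call fetch_then_add(&c, 10) returns 5 and leaves c at 15. The memory behind the pointer is both read and written, which is why a bad pointer faults at exactly this spot.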
There are two likely possibilities:
some code before line 798 has corrupted the local tmpTimeStamp object
the return value from FormatTimeStamp() was somehow bad.
The _GLIBCXX_FULLY_DYNAMIC_STRING is most likely a red herring and has nothing to do with the problem.
If you install debuginfo package for libstdc++ (I don't know what it's called on CentOS), you'll be able to "see into" that code, and might be able to tell whether the left-hand-side (LHS) or the RHS of the assignment operator caused the problem.
If that's not possible, you'll have to debug this at the assembly level. Going into frame #2 and doing x/4x $ebp should give you previous ebp, caller address (0x081402fc), LHS (should match &tmpTimeStamp in frame #3), and RHS. Go from there, and good luck!
I guess there could be some problem inside FormatTimeStamp function, but without source code it's hard to say anything. Try to check your program under Valgrind. Usually this helps to fix such sort of bugs.

Resources