Running SystemTap user-space probes inside a container - linux

I am learning SystemTap and I have created a simple C program to grasp the basics.
When I run the program and the probe on the host system, the probe works flawlessly, but when I repeat exactly the same steps inside a container (running in privileged mode), I run into problems.
When I leave both the program and the probe running for a few seconds, a WARNING message appears. This is the stap output after I stop the program:
[root@client ~]# stap -v tmp-probe.stp /root/tmp
Pass 1: parsed user script and 482 library scripts using 115544virt/94804res/16320shr/78424data kb, in 140usr/20sys/168real ms.
Pass 2: analyzed script: 2 probes, 1 function, 0 embeds, 1 global using 116996virt/97852res/17612shr/79876data kb, in 10usr/0sys/6real ms.
Pass 3: using cached /root/.systemtap/cache/73/stap_7357b5a96975af17d2210a04bada9b6a_1260.c
Pass 4: using cached /root/.systemtap/cache/73/stap_7357b5a96975af17d2210a04bada9b6a_1260.ko
Pass 5: starting run.
WARNING: probe process("/root/tmp").statement(0x40113a) at inode-offset 12761464:0000000060a7c1a2 registration error [man warning::pass5] (rc -5)
0
Pass 5: run completed in 0usr/70sys/3579real ms.
The binary has been compiled with: gcc -ggdb3 -O0 tmp.c -o tmp and contains these probe points:
process("/root/tmp").begin $syscall:long $arg1:long $arg2:long $arg3:long $arg4:long $arg5:long $arg6:long
process("/root/tmp").end $syscall:long $arg1:long $arg2:long $arg3:long $arg4:long $arg5:long $arg6:long
process("/root/tmp").plt("puts")
process("/root/tmp").plt("sleep")
process("/root/tmp").syscall $syscall:long $arg1:long $arg2:long $arg3:long $arg4:long $arg5:long $arg6:long
process("/root/tmp").mark("in_test")
process("/root/tmp").function("main@/root/tmp.c:11")
process("/root/tmp").function("test@/root/tmp.c:5")
tmp.c:
#include <stdio.h>
#include <sys/sdt.h>
#include <unistd.h>

void test() {
    STAP_PROBE(test, in_test);
    printf("here\n");
    sleep(1);
}

int main() {
    while (1) {
        test();
    }
}
tmp-probe.stp:
global cnt

probe process(@1).mark("in_test") {
    cnt++
}

probe process(@1).end {
    printf("%ld\n", cnt)
    exit()
}
A more verbose stap output:
[root@client ~]# stap -vv tmp-probe.stp /root/tmp
Systemtap translator/driver (version 4.6/0.186, rpm 4.6-4.fc35)
Copyright (C) 2005-2021 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
tested kernel versions: 2.6.32 ... 5.15.0-rc7
enabled features: AVAHI BOOST_STRING_REF DYNINST BPF JAVA PYTHON3 LIBRPM LIBSQLITE3 LIBVIRT LIBXML2 NLS NSS READLINE MONITOR_LIBS
Created temporary directory "/tmp/stapRsSKyO"
Session arch: x86_64 release: 5.16.20-200.fc35.x86_64
Build tree: "/lib/modules/5.16.20-200.fc35.x86_64/build"
Searched for library macro files: "/usr/share/systemtap/tapset/linux", found: 7, processed: 7
Searched for library macro files: "/usr/share/systemtap/tapset", found: 11, processed: 11
Searched: "/usr/share/systemtap/tapset/linux/x86_64", found: 20, processed: 20
Searched: "/usr/share/systemtap/tapset/linux", found: 407, processed: 407
Searched: "/usr/share/systemtap/tapset/x86_64", found: 1, processed: 1
Searched: "/usr/share/systemtap/tapset", found: 36, processed: 36
Pass 1: parsed user script and 482 library scripts using 115580virt/95008res/16520shr/78460data kb, in 130usr/30sys/163real ms.
derive-probes (location #0): process("/root/tmp").mark("in_test") of keyword at tmp-probe.stp:3:1
derive-probes (location #0): process("/root/tmp").end of keyword at tmp-probe.stp:7:1
Pass 2: analyzed script: 2 probes, 1 function, 0 embeds, 1 global using 117032virt/97860res/17624shr/79912data kb, in 10usr/0sys/6real ms.
Pass 3: using cached /root/.systemtap/cache/73/stap_7357b5a96975af17d2210a04bada9b6a_1260.c
Pass 4: using cached /root/.systemtap/cache/73/stap_7357b5a96975af17d2210a04bada9b6a_1260.ko
Pass 5: starting run.
Running /usr/bin/staprun -v -R /tmp/stapRsSKyO/stap_7357b5a96975af17d2210a04bada9b6a_1260.ko
staprun:insert_module:191 Module stap_7357b5a96975af17d2210a04bada9b_698092 inserted from file /tmp/stapRsSKyO/stap_7357b5a96975af17d2210a04bada9b6a_1260.ko
WARNING: probe process("/root/tmp").statement(0x40113a) at inode-offset 12761464:0000000060a7c1a2 registration error [man warning::pass5] (rc -5)
0
stapio:cleanup_and_exit:352 detach=0
stapio:cleanup_and_exit:369 closing control channel
staprun:remove_module:292 Module stap_7357b5a96975af17d2210a04bada9b_698092 removed.
Spawn waitpid result (0x0): 0
Pass 5: run completed in 10usr/70sys/2428real ms.
Running rm -rf /tmp/stapRsSKyO
Spawn waitpid result (0x0): 0
Removed temporary directory "/tmp/stapRsSKyO"
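
The `rc -5` in the warning is a negated Linux errno value, and the number before the colon in `inode-offset` appears to be the inode of the probed binary. Since user-space probes are registered against an inode, it may help to compare what the host and the container each report for the same file. The following is a minimal Python sketch for that comparison; the helper names are illustrative and not part of SystemTap, and reading the first `inode-offset` field as an inode number is an assumption.

```python
import errno
import os
import sys


def decode_rc(rc):
    """Translate a negative kernel return code such as -5 into its errno name."""
    return errno.errorcode.get(-rc, "unknown")


def file_identity(path):
    """Return (device, inode) for path as seen from the current mount namespace."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


if __name__ == "__main__":
    # errno 5 is EIO on Linux.
    print(decode_rc(-5))
    # sys.executable is only a stand-in here; pass /root/tmp in practice.
    print(file_identity(sys.executable))
```

Running `file_identity("/root/tmp")` once on the host and once inside the container shows whether both see the same device and inode for the binary.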

Related

Yocto bitbake core-image-sato with preempt-rt failed

I want to set up a Linux kernel with PREEMPT_RT using Yocto.
Following meta/recipes-rt/README, I added the lines below to build/conf/local.conf and ran bitbake core-image-sato, but the build fails.
MACHINE ?= "genericx86-64"
PREFERRED_PROVIDER_virtual/kernel = "linux-yocto-rt"
COMPATIBLE_MACHINE_genericx86-64 = "genericx86-64"
COMPATIBLE_MACHINE_quilt-native = "genericx86-64"
BitBake prints the following error:
NOTE: Bitbake server didn't start within 5 seconds, waiting for 90
Loading cache: 100% |#######################################################################################################################################################################| Time: 0:00:13
Loaded 1330 entries from dependency cache.
NOTE: Resolving any missing task queue dependencies
Build Configuration:
BB_VERSION = "1.46.0"
BUILD_SYS = "x86_64-linux"
NATIVELSBSTRING = "universal"
TARGET_SYS = "x86_64-poky-linux"
MACHINE = "genericx86-64"
DISTRO = "poky"
DISTRO_VERSION = "3.1.20"
TUNE_FEATURES = "m64 core2"
TARGET_FPU = ""
meta
meta-poky
meta-yocto-bsp = "dunfell:90a6f6a110ab14890e2f6a1616e74ee259fc0f8f"
Initialising tasks: 100% |##################################################################################################################################################################| Time: 0:00:48
Sstate summary: Wanted 14 Found 0 Missed 14 Current 1203 (0% match, 98% complete)
NOTE: Executing Tasks
ERROR: linux-yocto-rt-5.4.213+gitAUTOINC+2f18e629f7_03cd66d981-r0 do_kernel_metadata: Could not locate BSP definition for genericx86-64/preempt-rt and no defconfig was provided
ERROR: linux-yocto-rt-5.4.213+gitAUTOINC+2f18e629f7_03cd66d981-r0 do_kernel_metadata: Execution of '/media/fff/disk1T/yocto/demo3/poky/build/tmp/work/genericx86_64-poky-linux/linux-yocto-rt/5.4.213+gitAUTOINC+2f18e629f7_03cd66d981-r0/temp/run.do_kernel_metadata.1138429' failed with exit code 1
ERROR: Logfile of failure stored in: /media/fff/disk1T/yocto/demo3/poky/build/tmp/work/genericx86_64-poky-linux/linux-yocto-rt/5.4.213+gitAUTOINC+2f18e629f7_03cd66d981-r0/temp/log.do_kernel_metadata.1138429
Log data follows:
| DEBUG: Executing python function extend_recipe_sysroot
| NOTE: Direct dependencies are ['/media/fff/disk1T/yocto/demo3/poky/meta/recipes-kernel/kern-tools/kern-tools-native_git.bb:do_populate_sysroot']
| NOTE: Installed into sysroot: []
| NOTE: Skipping as already exists in sysroot: ['kern-tools-native', 'quilt-native']
| DEBUG: Python function extend_recipe_sysroot finished
| DEBUG: Executing shell function do_kernel_metadata
| NOTE: do_kernel_metadata: for summary/debug, set KCONF_AUDIT_LEVEL > 0
| ERROR: Could not locate BSP definition for genericx86-64/preempt-rt and no defconfig was provided
| WARNING: exit code 1 from a shell command.
| ERROR: Execution of '/media/fff/disk1T/yocto/demo3/poky/build/tmp/work/genericx86_64-poky-linux/linux-yocto-rt/5.4.213+gitAUTOINC+2f18e629f7_03cd66d981-r0/temp/run.do_kernel_metadata.1138429' failed with exit code 1
ERROR: Task (/media/fff/disk1T/yocto/demo3/poky/meta/recipes-kernel/linux/linux-yocto-rt_5.4.bb:do_kernel_metadata) failed with exit code '1'
NOTE: Tasks Summary: Attempted 3173 tasks of which 3172 didn't need to be rerun and 1 failed.
Summary: 1 task failed:
/media/fff/disk1T/yocto/demo3/poky/meta/recipes-kernel/linux/linux-yocto-rt_5.4.bb:do_kernel_metadata
Summary: There were 2 ERROR messages shown, returning a non-zero exit code.
My CPU is x86-64. The error says that no BSP definition exists for genericx86-64/preempt-rt. What should I change to build core-image-sato with PREEMPT_RT? Please leave a comment if you are familiar with this problem.
I tried checking out several Yocto branches (dunfell, langdale, kirkstone), but the result is the same. I would prefer to use dunfell.
Here is my build/conf/bblayers.conf:
POKY_BBLAYERS_CONF_VERSION = "2"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
/media/fff/disk1T/yocto/demo3/poky/meta \
/media/fff/disk1T/yocto/demo3/poky/meta-poky \
/media/fff/disk1T/yocto/demo3/poky/meta-yocto-bsp \
"

Failed build Yocto Gatesgarth "extensible SDK" (eSDK) - populate_sdk_ext fail

I'm working with Yocto "Gatesgarth" on a custom board based on i.MX6ULL.
I'm having problems generating the extensible SDK (eSDK).
Generating the standard SDK works correctly.
Details are below.
Details of system:
Board based on NXP i.MX6ULL
Yocto version "Gatesgarth 3.2.4 (May 2021)"
BB_VERSION = "1.48.0",
NATIVELSBSTRING = "ubuntu-18.04"
DISTRO_VERSION = "5.10-gatesgarth"
meta-qt5 is present
Build environment based on Docker Container
Environment Variable:
File: conf/local.conf
SDKMACHINE ?= 'x86_64'
File: test-image-mx6ull.bb
inherit core-image
inherit populate_sdk_qt5
inherit populate_sdk_ext
SDK_EXT_TYPE = "minimal"
SDK_INCLUDE_TOOLCHAIN = "1"
SDK_INCLUDE_PKGDATA = "0"
SDK_INCLUDE_NATIVESDK = "1"
The command executed is :
bitbake test-image-mx6ull -c populate_sdk_ext
Output:
ERROR: test-image-mx6ull-1.0-r0 do_populate_sdk_ext: Error executing a python function in exec_python_func() autogenerated:
The stack trace of python calls that resulted in this exception/failure was:
File: 'exec_python_func() autogenerated', lineno: 2, function: <module>
0001:
*** 0002:do_populate_sdk_ext(d)
0003:
File: '/yocto/sources/poky/meta/classes/populate_sdk_ext.bbclass', lineno: 720, function: do_populate_sdk_ext
0716: bb.fatal('The extensible SDK can currently only be built for the same architecture as the machine being built on - SDK_ARCH is set to %s (likely via setting
SDKMACHINE) which is different from the architecture of the build machine (%s). Unable to continue.' % (d.getVar('SDK_ARCH'), d.getVar('BUILD_ARCH')))
0717:
0718: d.setVar('SDK_INSTALL_TARGETS', get_sdk_install_targets(d))
0719: if d.getVar('SDK_INCLUDE_BUILDTOOLS') == '1':
*** 0720: buildtools_fn = get_current_buildtools(d)
0721: else:
0722: buildtools_fn = None
0723: d.setVar('SDK_REQUIRED_UTILITIES', get_sdk_required_utilities(buildtools_fn, d))
0724: d.setVar('SDK_BUILDTOOLS_INSTALLER', buildtools_fn)
File: '/yocto/sources/poky/meta/classes/populate_sdk_ext.bbclass', lineno: 556, function: get_current_buildtools
0552: import glob
0553: btfiles = glob.glob(os.path.join(d.getVar('SDK_DEPLOY'), '*-buildtools-nativesdk-standalone-*.sh'))
0554: btfiles.sort(key=os.path.getctime)
0555: print("MY-DEBUG - btfiles = {} - SDK_DEPLOY = {}".format(btfiles, d.getVar('SDK_DEPLOY')))
*** 0556: return os.path.basename(btfiles[-1])
0557:
0558:def get_sdk_required_utilities(buildtools_fn, d):
0559: """Find required utilities that aren't provided by the buildtools"""
0560: sanity_required_utilities = (d.getVar('SANITY_REQUIRED_UTILITIES') or '').split()
Exception: IndexError: list index out of range
DEBUG: Python function do_populate_sdk_ext finished
MY-DEBUG - btfiles = [] - SDK_DEPLOY = /yocto/build-mX6ull/tmp/deploy/sdk
Question:
Line 553 should fill the btfiles list,
but the list is empty, so line 556 raises the exception.
I have no idea what is wrong, what I have forgotten, or which Yocto variables need to be set for this to work.
I had a similar issue where I couldn't populate the eSDK.
It comes down to the GLIBC version, so try updating yours.
In my case I had to update the GLIBC version to 2.33 in the "yocto-uninative.inc" file, and that worked for me.
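
For reference, the IndexError in the trace above comes from indexing the last element of an empty glob result. The sketch below reproduces that lookup in standalone Python and returns None instead of raising; it is not a patch to populate_sdk_ext.bbclass, and the installer file name used in the demonstration is hypothetical.

```python
import glob
import os
import tempfile


def latest_buildtools(sdk_deploy):
    """Return the newest buildtools installer name in sdk_deploy, or None."""
    pattern = os.path.join(sdk_deploy, "*-buildtools-nativesdk-standalone-*.sh")
    matches = sorted(glob.glob(pattern), key=os.path.getctime)
    if not matches:
        # The code in the trace evaluates matches[-1] at this point, which
        # raises IndexError when no installer has been deployed.
        return None
    return os.path.basename(matches[-1])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as deploy:
        # Empty deploy directory: nothing matches the pattern.
        print(latest_buildtools(deploy))
        # Hypothetical installer name, created only for the demonstration.
        name = "x86_64-buildtools-nativesdk-standalone-3.2.4.sh"
        open(os.path.join(deploy, name), "w").close()
        print(latest_buildtools(deploy))
```

An empty result means that no file matching the buildtools pattern exists in SDK_DEPLOY when the task runs, which matches the MY-DEBUG line in the output.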

How to enable CUDA Aware OpenMPI?

I'm using OpenMPI and I need to enable CUDA aware MPI. Together with MPI I'm using OpenACC with the hpc_sdk software.
Following https://www.open-mpi.org/faq/?category=buildcuda I downloaded and installed UCX (not gdrcopy, I haven't managed to install it) with
./contrib/configure-release --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-fortran
and it prints:
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes
So UCX seems to be ok.
After this I re-configured OpenMPI with:
./configure --with-ucx=/home/marco/Downloads/ucx-1.9.0/install --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-mpi-fortran
and it prints:
CUDA support: yes
Open UCX: yes
If I try to run the application with: mpirun -np 2 -mca pml ucx -x ./a.out (as suggested on openucx.org) I get the errors:
match_arg (utils/args/args.c:163): unrecognized argument mca
HYDU_parse_array (utils/args/args.c:178): argument matching returned error
parse_args (ui/mpich/utils.c:1642): error parsing input array
HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
main (ui/mpich/mpiexec.c:148): error parsing parameters
I see that the source paths in the error messages belong to MPICH rather than OpenMPI, and I don't know why. If I type which mpicc, which mpiexec and which mpirun, I get the OpenMPI ones.
If I run with: mpiexec -n 2 ./a.out I get:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
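
One way to check which launcher actually runs is to classify its version banner. The sketch below is a rough heuristic in Python: the matched substrings ("Open MPI", "OpenRTE", "HYDRA") are assumptions based on typical banners, and the function names are illustrative.

```python
import shutil
import subprocess


def classify_launcher(version_text):
    """Guess the MPI family from the text printed by `mpirun --version`."""
    text = version_text.lower()
    if "open mpi" in text or "openrte" in text:
        return "Open MPI"
    if "hydra" in text:
        return "MPICH (Hydra)"
    return "unknown"


def launcher_on_path(name="mpirun"):
    """Run the first `name` found on PATH and classify its version banner."""
    exe = shutil.which(name)
    if exe is None:
        return "not found"
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return classify_launcher(result.stdout + result.stderr)


if __name__ == "__main__":
    for tool in ("mpirun", "mpiexec"):
        print(tool, "->", launcher_on_path(tool))
```

If the result is "MPICH (Hydra)", the shell is picking up a different launcher than the one `which` suggests, for example through an alias or a stale command hash.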
EDITED:
Doing the same but using the OpenMPI-4.0.5 that comes with NVIDIA HPC SDK it compiles and runs fine.
I get:
[marco-Inspiron-7501:1356251:0:1356251] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f05cfafa000)
==== backtrace (tid:1356251) ====
0 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7f060ae06dc7]
1 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7f060ae06b87]
2 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7f060ae06ce4]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f060c7433c0]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7f060befb885]
5 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7f060b2bd9e6]
6 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7f060b2bd775]
7 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7f060b2d35b5]
8 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7f060b2d111d]
9 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7f060b055577]
10 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7f060b054725]
11 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7f060b2d3614]
12 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7f060b2d22c7]
13 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7f060b2d15b1]
14 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7f060b2e85bd]
15 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7f060b2e7d15]
16 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7f060b2e721a]
17 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7f060b2e65ac]
18 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7f060dfc3b33]
=================================
[marco-Inspiron-7501:1356252:0:1356252] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd7f7afa000)
==== backtrace (tid:1356252) ====
0 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7fd82a711dc7]
1 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7fd82a711b87]
2 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7fd82a711ce4]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fd82c04e3c0]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7fd82b806885]
5 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7fd82abc89e6]
6 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7fd82abc8775]
7 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7fd82abde5b5]
8 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7fd82abdc11d]
9 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7fd82a960577]
10 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7fd82a95f725]
11 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7fd82abde614]
12 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7fd82abdd2c7]
13 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7fd82abdc5b1]
14 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7fd82abf35bd]
15 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7fd82abf2d15]
16 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7fd82abf221a]
17 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7fd82abf15ac]
18 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7fd82d8ceb33]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node marco-Inspiron-7501 exited on signal 11 (Segmentation fault).
The error is caused by #pragma acc host_data use_device(send_buf, recv_buf):
double send_buf[NX_GLOB + 2*NGHOST];
double recv_buf[NX_GLOB + 2*NGHOST];
#pragma acc enter data create(send_buf[:NX_GLOB+2*NGHOST], recv_buf[NX_GLOB+2*NGHOST])
// Top buffer
j = jend;
#pragma acc parallel loop present(phi[:ny_tot][:nx_tot], send_buf[:NX_GLOB+2*NGHOST])
for (i = ibeg; i <= iend; i++) send_buf[i] = phi[j][i];
#pragma acc host_data use_device(send_buf, recv_buf)
{
    MPI_Sendrecv (send_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                  recv_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
This was an issue in the 20.7 release when adding UCX support. You can lower the optimization level to -O1 to work around the problem, or update your NV HPC compiler to version 20.9, where we've resolved the issue.
https://developer.nvidia.com/nvidia-hpc-sdk-version-209-downloads

mpi4py irecv causes segmentation fault

I'm running the following code, which sends an array from rank 0 to rank 1, using the command mpirun -n 2 python -u test_irecv.py > output 2>&1.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
asyncr = 1
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    if asyncr: comm.isend(arrs, dest=1).wait()
    else: comm.send(arrs, dest=1)
else:
    if asyncr: arrv = comm.irecv(source=0).wait()
    else: arrv = comm.recv(source=0)
print('Done!', comm.Get_rank())
Running in synchronous mode with asyncr = 0 gives the expected output
Done! 0
Done! 1
However, running in asynchronous mode with asyncr = 1 gives the errors below.
I would like to know why it runs fine in synchronous mode but not in asynchronous mode.
Output with asyncr = 1:
Done! 0
[nia1477:420871:0:420871] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138)
==== backtrace ====
0 0x0000000000010e90 __funlockfile() ???:0
1 0x00000000000643d1 ompi_errhandler_request_invoke() ???:0
2 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
3 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
4 0x000000000008a8b5 __pyx_pf_6mpi4py_3MPI_7Request_34wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83838
5 0x000000000008a8b5 __pyx_pw_6mpi4py_3MPI_7Request_35wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83813
6 0x00000000000966a3 _PyMethodDef_RawFastCallKeywords() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/call.c:690
7 0x000000000009eeb9 _PyMethodDescr_FastCallKeywords() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/descrobject.c:288
8 0x000000000006e611 call_function() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:4563
9 0x000000000006e611 _PyEval_EvalFrameDefault() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3103
10 0x0000000000177644 _PyEval_EvalCodeWithName() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3923
11 0x000000000017774e PyEval_EvalCodeEx() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3952
12 0x000000000017777b PyEval_EvalCode() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:524
13 0x00000000001aab72 run_mod() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:1035
14 0x00000000001aab72 PyRun_FileExFlags() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:988
15 0x00000000001aace6 PyRun_SimpleFileExFlags() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:430
16 0x00000000001cad47 pymain_run_file() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:425
17 0x00000000001cad47 pymain_run_filename() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:1520
18 0x00000000001cad47 pymain_run_python() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2520
19 0x00000000001cad47 pymain_main() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2662
20 0x00000000001cb1ca _Py_UnixMain() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2697
21 0x00000000000202e0 __libc_start_main() ???:0
22 0x00000000004006ba _start() /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 420871 on node nia1477 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The versions are as follows:
Python: 3.7.0
mpi4py: 3.0.0
mpiexec --version gives mpiexec (OpenRTE) 3.1.2
mpicc -v gives icc version 18.0.3 (gcc version 7.3.0 compatibility)
Running with asyncr = 1 in another system with MPICH gave the following output.
Done! 0
Traceback (most recent call last):
File "test_irecv.py", line 14, in <module>
if asyncr: arrv = comm.irecv(source=0).wait()
File "mpi4py/MPI/Request.pyx", line 235, in mpi4py.MPI.Request.wait
File "mpi4py/MPI/msgpickle.pxi", line 411, in mpi4py.MPI.PyMPI_wait
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[23830,1],1]
Exit code: 1
--------------------------------------------------------------------------
[master:01977] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[master:01977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Apparently this is a known problem in mpi4py, as described in https://bitbucket.org/mpi4py/mpi4py/issues/65/mpi_err_truncate-message-truncated-when. Lisandro Dalcin says:
The implementation of irecv() for large messages requires users to pass a buffer-like object large enough to receive the pickled stream. This is not documented (as most of mpi4py), and even non-obvious and unpythonic...
The fix is to pass a large enough pre-allocated bytearray to irecv. A working example is as follows.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    comm.isend(arrs, dest=1).wait()
else:
    arrv = comm.irecv(bytearray(1<<20), source=0).wait()
print('Done!', comm.Get_rank())
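
The 1<<20 above is a fixed guess at an upper bound. A rough way to pick a bound is to pickle a representative object and add headroom, as in the sketch below. It uses only the standard library so that it runs without MPI; `recv_buffer_for` and its `slack` parameter are hypothetical names, and it is an assumption that a locally pickled sample approximates the stream mpi4py sends, so the sample should have the same type and size as the real message.

```python
import pickle


def recv_buffer_for(sample, slack=0.25):
    """Allocate a bytearray large enough for the pickled form of `sample`.

    The size is the largest pickle over all available protocols plus a
    fractional `slack`, so the estimate errs on the generous side.
    """
    needed = max(
        len(pickle.dumps(sample, protocol=p))
        for p in range(pickle.HIGHEST_PROTOCOL + 1)
    )
    return bytearray(int(needed * (1 + slack)) + 1)


if __name__ == "__main__":
    # Stand-in for np.zeros(10000); use the real array type in practice.
    sample = [0.0] * 10000
    buf = recv_buffer_for(sample)
    print(len(buf))
    # With mpi4py the buffer would then be passed as:
    #   arrv = comm.irecv(buf, source=0).wait()
```

The receiver has to know this bound before the message arrives, so the estimate is typically computed once from the largest message the application can send.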

Oprofile warning "could not check that the binary file"

We are profiling kernel modules with OProfile, and opreport shows the following warning:
warning: could not check that the binary file /lib/modules/2.6.32-191.el6.x86_64/kernel/fs/ext4/ext4.ko has not been modified since the profile was taken. Results may be inaccurate.
samples  %        symbol name
1622     9.8381   ext4_iget
1591     9.6500   ext4_find_entry
1231     7.4665   __ext4_get_inode_loc
783      4.7492   ext4_ext_get_blocks
752      4.5612   ext4_check_dir_entry
644      3.9061   ext4_mark_iloc_dirty
583      3.5361   ext4_get_blocks
583      3.5361   ext4_xattr_get
Could anyone please explain what the warning means, whether it affects the accuracy of the OProfile output, and whether there is any way to avoid it?
Any suggestions are appreciated. Thanks a lot!
Additional information:
In daemon/opd_mangling.c:
if (!sf->kernel)
    binary = find_cookie(sf->cookie);
else
    binary = sf->kernel->name;
...
fill_header(odb_get_data(file), counter,
            sf->anon ? sf->anon->start : 0, last_start,
            !!sf->kernel, last ? !!last->kernel : 0,
            spu_profile, sf->embedded_offset,
            binary ? op_get_mtime(binary) : 0);
For a kernel module, sf->kernel->name is the module name rather than a file path, so fill_header always records an mtime of 0, which produces the unwanted warning.
This failure indicates that a stat of the file in question failed. Run strace -e stat to see the specific failure mode.
time_t op_get_mtime(char const * file)
{
    struct stat st;

    if (stat(file, &st))
        return 0;

    return st.st_mtime;
}
...
if (!header.mtime) {
    // FIXME: header.mtime for JIT sample files is 0. The problem could be that
    //        in opd_mangling.c:opd_open_sample_file() the call of fill_header()
    //        think that the JIT sample file is not a binary file.
    if (is_jit_sample(file)) {
        cverb << vbfd << "warning: could not check that the binary file "
              << file << " has not been modified since "
              "the profile was taken. Results may be inaccurate.\n";
does it impact the accuracy of the oprofile output and is there anyway to avoid this warning?
Yes, it affects the output in that opreport has no opportunity to warn you that "the last modified time of the binary file does not match that of the sample file...". As long as you are certain that the binary you measured is the one installed now, the warning you are seeing is harmless.
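
To see why a bare module name ends up with an mtime of 0, the quoted op_get_mtime helper can be mirrored in a few lines of Python. This is only an illustration of the stat-or-zero behaviour, not OProfile code, and the module name passed in the demonstration is a made-up example.

```python
import os


def op_get_mtime(path):
    """Python rendering of the quoted C helper: 0 means stat() failed."""
    try:
        return int(os.stat(path).st_mtime)
    except OSError:
        return 0


if __name__ == "__main__":
    # A bare name that is not a path to an existing file: stat() fails.
    print(op_get_mtime("no-such-kernel-module-name"))
    # A real file yields its modification time instead.
    print(op_get_mtime(os.__file__))
```

Any name that does not resolve to an existing file, such as a module name instead of the full path to the .ko file, takes the first branch and is recorded as 0.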
