torch.nn.CrossEntropyLoss().ignore_index is crashing when importing transfomers library

torch.nn.CrossEntropyLoss().ignore_index is crashing when importing transfomers library - python-3.x

I am using layoutlm github which require python 3.6, transformer 2.9.0. I created an conda env:
name: env_test
channels:
- defaults
- conda-forge
dependencies:
- python=3.6
- pip=20.3.3
- pytorch=1.4.0
- cudatoolkit=10.1
- pip:
- transformers==2.9.0
I have the following test.py code to reproduce the issue:
import sys
import torch
from torch.nn import CrossEntropyLoss
from transformers import (
BertConfig,
__version__
)
print (sys.version)
print(torch.__version__)
print(__version__)
CrossEntropyLoss().ignore_index
print("success!")
Importing transformers library results in segmentation fault (core dumped) a when calling CrossEntropyLoss().ignore_index:
$python test.py
3.6.12 |Anaconda, Inc.| (default, Sep 8 2020, 23:10:56)
[GCC 7.3.0]
1.4.0
2.9.0
Segmentation fault (core dumped)
I tried to investigate a bit but I don't really see from where the problem is coming from:
gdb python
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) r test.py
Starting program: /home/jupyter/.conda-env/env_test/bin/python test.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
3.6.12 |Anaconda, Inc.| (default, Sep 8 2020, 23:10:56)
[GCC 7.3.0]
1.4.0
2.9.0
Program received signal SIGSEGV, Segmentation fault.
0x00007f97000055fb in ?? ()
(gdb) where
#0 0x00007f97000055fb in ?? ()
#1 0x00007f97f4755729 in void pybind11::cpp_function::initialize<void (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), void, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, pybind11::name, pybind11::scope, pybind11::sibling>(void (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /home/jupyter/.conda-env/env_test/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#2 0x00007f97f436bca6 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /home/jupyter/.conda-env/env_test/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#3 0x000055fbadd73a14 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1599604603603/work/Objects/methodobject.c:231
#4 0x000055fbaddfba5c in call_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4851
#5 0x000055fbade1e25a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:3335
#6 0x000055fbaddf5c1b in _PyFunction_FastCall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4933
#7 fast_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4968
#8 0x000055fbaddfbb35 in call_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4872
#9 0x000055fbade1e25a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:3335
#10 0x000055fbaddf5166 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4166
#11 0x000055fbaddf5e51 in fast_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4992
#12 0x000055fbaddfbb35 in call_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4872
#13 0x000055fbade1e25a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:3335
#14 0x000055fbaddf5166 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4166
#15 0x000055fbaddf5e51 in fast_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4992
#16 0x000055fbaddfbb35 in call_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4872
#17 0x000055fbade1e25a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:3335
#18 0x000055fbaddf5166 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4166
#19 0x000055fbaddf632c in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:5084
#20 0x000055fbadd73ddf in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1599604603603/work/Objects/abstract.c:2310
#21 0x000055fbadd78873 in _PyObject_Call_Prepend () at /tmp/build/80754af9/python_1599604603603/work/Objects/abstract.c:2373
#22 0x000055fbadd7381e in PyObject_Call () at /tmp/build/80754af9/python_1599604603603/work/Objects/abstract.c:2261
#23 0x000055fbaddcc88b in slot_tp_init () at /tmp/build/80754af9/python_1599604603603/work/Objects/typeobject.c:6420
#24 0x000055fbaddfbd97 in type_call () at /tmp/build/80754af9/python_1599604603603/work/Objects/typeobject.c:915
#25 0x000055fbadd73bfb in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1599604603603/work/Objects/abstract.c:2331
#26 0x000055fbaddfbbae in call_function () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4875
#27 0x000055fbade1e25a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:3335
#28 0x000055fbaddf6969 in _PyEval_EvalCodeWithName (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=<optimized out>, kwargs=0x0, kwnames=0x0, argcount=0, args=0x0,
locals=0x7f98035bf1f8, globals=0x7f98035bf1f8, _co=0x7f980357aae0) at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4166
#29 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:4187
#30 0x000055fbaddf770c in PyEval_EvalCode (co=co#entry=0x7f980357aae0, globals=globals#entry=0x7f98035bf1f8, locals=locals#entry=0x7f98035bf1f8) at /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:731
#31 0x000055fbade77574 in run_mod () at /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:1025
#32 0x000055fbade77971 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:978
#33 0x000055fbade77b73 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:419
#34 0x000055fbade77c7d in PyRun_AnyFileExFlags () at /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:81
#35 0x000055fbade7b663 in run_file (p_cf=0x7fff210dc16c, filename=0x55fbaefa6dc0 L"test.py", fp=0x55fbaefda800) at /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:340
#36 Py_Main () at /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:811
#37 0x000055fbadd4543e in main () at /tmp/build/80754af9/python_1599604603603/work/Programs/python.c:69
#38 0x00007f9803fd6bf7 in __libc_start_main (main=0x55fbadd45350 <main>, argc=2, argv=0x7fff210dc378, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff210dc368) at ../csu/libc-start.c:310
#39 0x000055fbade24d0b in _start () at ../sysdeps/x86_64/elf/start.S:103
(gdb
I am the following list of packages:
_libgcc_mutex 0.1 main defaults
_pytorch_select 0.2 gpu_0 defaults
blas 1.0 mkl defaults
ca-certificates 2020.12.8 h06a4308_0 defaults
certifi 2020.12.5 py36h06a4308_0 defaults
cffi 1.14.4 py36h261ae71_0 defaults
chardet 4.0.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cudatoolkit 10.1.243 h6bb024c_0 defaults
cudnn 7.6.5 cuda10.1_0 defaults
dataclasses 0.8 pypi_0 pypi
filelock 3.0.12 pypi_0 pypi
idna 2.10 pypi_0 pypi
intel-openmp 2020.2 254 defaults
joblib 1.0.0 pypi_0 pypi
ld_impl_linux-64 2.33.1 h53a641e_7 defaults
libedit 3.1.20191231 h14c3975_1 defaults
libffi 3.3 he6710b0_2 defaults
libgcc-ng 9.1.0 hdf63c60_0 defaults
libstdcxx-ng 9.1.0 hdf63c60_0 defaults
mkl 2020.2 256 defaults
mkl-service 2.3.0 py36he8ac12f_0 defaults
mkl_fft 1.2.0 py36h23d657b_0 defaults
mkl_random 1.1.1 py36h0573a6f_0 defaults
ncurses 6.2 he6710b0_1 defaults
ninja 1.10.2 py36hff7bd54_0 defaults
numpy 1.19.2 py36h54aff64_0 defaults
numpy-base 1.19.2 py36hfa32c7d_0 defaults
openssl 1.1.1i h27cfd23_0 defaults
pip 20.3.3 py36h06a4308_0 defaults
pycparser 2.20 py_2 defaults
python 3.6.12 hcff3b4d_2 defaults
pytorch 1.4.0 cuda101py36h02f0884_0 defaults
readline 8.0 h7b6447c_0 defaults
regex 2020.11.13 pypi_0 pypi
requests 2.25.1 pypi_0 pypi
sacremoses 0.0.43 pypi_0 pypi
sentencepiece 0.1.94 pypi_0 pypi
setuptools 51.0.0 py36h06a4308_2 defaults
six 1.15.0 py36h06a4308_0 defaults
sqlite 3.33.0 h62c20be_0 defaults
tk 8.6.10 hbc83047_0 defaults
tokenizers 0.7.0 pypi_0 pypi
tqdm 4.55.1 pypi_0 pypi
transformers 2.9.0 pypi_0 pypi
urllib3 1.26.2 pypi_0 pypi
wheel 0.36.2 pyhd3eb1b0_0 defaults
xz 5.2.5 h7b6447c_0 defaults
zlib 1.2.11 h7b6447c_3 defaults
What is responsible fo this core dump (I have a VM with 30 GB of memory) ? Seems to be related to transformers. Some dependency issues not catched by conda ? This piece of code seems to work with the latest version of transformers 4.1.1 but this is not compatible with layoutlm. Any suggestions?

It seems something was broken on layoutlm with pytorch 1.4 related issue. Switching to pytorch 1.6 fix the issue with the core dump, and the layoutlm code run without any modification.

Related

SMP threads not showing in GDB

Trying to debug multi CPU SoC (Amlogic A113X) and faced a problem.
So I have this debug configuration: A113X(JTAG) -> Segger J-Link V11 -> OpenOCD -> gdb-multiarch
Everything is connected and seems okay, but GDB shows just 1 thread (should be 4 - one for each CPU):
(gdb) info threads
Id Target Id Frame
* 1 Remote target 0xffffff8009853364 in arch_spin_lock (lock=<optimized out>) at ./arch/arm64/include/asm/spinlock.h:89
Meanwhile there are 4 debug targets according to telenet 'targets' command:
> targets
TargetName Type Endian TapName State
-- ------------------ ---------- ------ ------------------ ------------
0* A113X.a53.0 aarch64 little A113X.cpu halted
1 A113X.a53.1 aarch64 little A113X.cpu halted
2 A113X.a53.2 aarch64 little A113X.cpu unknown
3 A113X.a53.3 aarch64 little A113X.cpu halted
Core 2 is shut down in this particular case.
When I halt CPUs then get this output in GDB:
(gdb) continue
Continuing.
^CA113X.a53.1 halted in AArch64 state due to debug-request, current mode: EL1H
cpsr: 0x800001c5 pc: 0xffffff8009853364
MMU: enabled, D-Cache: enabled, I-Cache: enabled
A113X.a53.3 halted in AArch64 state due to debug-request, current mode: EL1H
cpsr: 0x800000c5 pc: 0xffffff80098532b4
MMU: enabled, D-Cache: enabled, I-Cache: enabled
Program received signal SIGINT, Interrupt.
0xffffff8009853364 in arch_spin_lock (lock=<optimized out>) at ./arch/arm64/include/asm/spinlock.h:89
89 asm volatile(
(gdb) where
#0 0xffffff8009853364 in arch_spin_lock (lock=<optimized out>) at ./arch/arm64/include/asm/spinlock.h:89
#1 do_raw_spin_lock (lock=<optimized out>) at ./include/linux/spinlock.h:148
#2 __raw_spin_lock (lock=<optimized out>) at ./include/linux/spinlock_api_smp.h:145
#3 _raw_spin_lock (lock=0xffffffc01fb86a00) at kernel/locking/spinlock.c:151
#4 0xffffff80090c2114 in try_to_wake_up (p=0xffffffc01963a880, state=<optimized out>, wake_flags=0) at kernel/sched/core.c:2110
#5 0xffffff80090c239c in wake_up_process (p=<optimized out>) at kernel/sched/core.c:2203
#6 0xffffff80090b1b7c in wake_up_worker (pool=<optimized out>) at kernel/workqueue.c:837
#7 insert_work (pwq=<optimized out>, work=<optimized out>, head=<optimized out>, extra_flags=<optimized out>) at kernel/workqueue.c:1310
#8 0xffffff80090b1d10 in __queue_work (cpu=0, wq=0xdf2, work=0x8df2) at kernel/workqueue.c:1460
#9 0xffffff80090b1fc8 in queue_work_on (cpu=8, wq=0xffffffc01dbb5c00, work=0xffffffc01ccb82a0) at kernel/workqueue.c:1485
#10 0xffffff800191d068 in ?? ()
#11 0xffffffc0138664e8 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Is something wrong with my OpenOCD configuration? SMP config looks fine because it halts all 4 cores. What can be wrong here? Thanks beforehand.
Here is openocd config:
telnet_port 4444
gdb_port 3333
source [find interface/jlink.cfg]
transport select jtag
adapter speed 1000
scan_chain
set _CHIPNAME A113X
set _DAPNAME $_CHIPNAME.dap
jtag newtap $_CHIPNAME cpu -irlen 4 -expected-id 0x5ba00477
dap create $_DAPNAME -chain-position $_CHIPNAME.cpu
echo "$_CHIPNAME.cpu"
set CA53_DBGBASE {0x80410000 0x80510000 0x80610000 0x80710000}
set CA53_CTIBASE {0x80420000 0x80520000 0x80620000 0x80720000}
set _num_ca53 4
set _ap_num 0
set smp_targets ""
proc setup_a5x {core_name dbgbase ctibase num boot} {
for { set _core 0 } { $_core < $num } { incr _core } {
set _TARGETNAME $::_CHIPNAME.$core_name.$_core
set _CTINAME $_TARGETNAME.cti
cti create $_CTINAME -dap $::_DAPNAME -ap-num $::_ap_num \
-baseaddr [lindex $ctibase $_core]
target create $_TARGETNAME aarch64 -dap $::_DAPNAME -cti $_CTINAME -coreid $_core
set ::smp_targets "$::smp_targets $_TARGETNAME"
}
}
setup_a5x a53 $CA53_DBGBASE $CA53_CTIBASE $_num_ca53 1
echo "SMP targets:$smp_targets"
eval "target smp $smp_targets"
targets $_CHIPNAME.a53.0
And output of OpenOCD:
Open On-Chip Debugger 0.11.0+dev-00640-ge83eeb44a (2022-04-21-10:10)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
A113X.cpu
SMP targets: A113X.a53.0 A113X.a53.1 A113X.a53.2 A113X.a53.3
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : J-Link V11 compiled Mar 3 2022 10:16:14
Info : Hardware version: 11.00
Info : VTarget = 3.309 V
Info : clock speed 1000 kHz
Info : JTAG tap: A113X.cpu tap/device found: 0x5ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x5)
Info : A113X.a53.0: hardware has 6 breakpoints, 4 watchpoints
Info : A113X.a53.1: hardware has 6 breakpoints, 4 watchpoints
Error: JTAG-DP STICKY ERROR
Warn : target A113X.a53.2 examination failed
Info : A113X.a53.3: hardware has 6 breakpoints, 4 watchpoints
Info : A113X.a53.0 cluster 0 core 0 multi core
Info : A113X.a53.1 cluster 0 core 1 multi core
Info : A113X.a53.3 cluster 0 core 3 multi core
Info : starting gdb server for A113X.a53.0 on 3333
Info : Listening on port 3333 for gdb connections
Info : accepting 'gdb' connection on tcp/3333
Info : New GDB Connection: 1, Target A113X.a53.0, state: halted
Warn : Prefer GDB command "target extended-remote :3333" instead of "target remote :3333"
A113X.a53.1 halted in AArch64 state due to debug-request, current mode: EL1H
cpsr: 0x800001c5 pc: 0xffffff8009853364
MMU: enabled, D-Cache: enabled, I-Cache: enabled
A113X.a53.3 halted in AArch64 state due to debug-request, current mode: EL1H
cpsr: 0x800000c5 pc: 0xffffff80098532b4
MMU: enabled, D-Cache: enabled, I-Cache: enabled

I tried to combine hwthread and targete smp to achieve multi-cores.
Add "-rtos hwthread" for each core, like
target create ...... -rtos hwthread
In my opinion, a core is regarded as a hwthread by OpenOCD, so set hwthread for each core. Following is the response of info threads.
(gdb) i threads
Id Target Id Frame
* 1 Thread 1 (Name: riscv.cpu0.0, state: debug-request) 0x8000000a in _mstart ()
2 Thread 2 (Name: riscv.cpu0.1, state: debug-request) 0x82000000 in ?? ()
3 Thread 3 (Name: riscv.cpu0.2, state: debug-request) 0x82000000 in ?? ()
4 Thread 4 (Name: riscv.cpu0.3, state: debug-request) 0x82000000 in ?? ()

Weird Backtrace After a Call Instruction Targeting Signal Functions

I tried to trace evince-3.28.4 execution using GDB. There is a callq instruction at some point in libdl, which is shown below (i.e., at _dl_lookup_symbol_x+840):
│0x7ffff7de03f5 <_dl_lookup_symbol_x+837> mov %rbx,%rsi │
>│0x7ffff7de03f8 <_dl_lookup_symbol_x+840> callq 0x7ffff7df0b00 <_dl_signal_cexception> │
│0x7ffff7de03fd <_dl_lookup_symbol_x+845> mov %rbx,%rdi │
When the execution reaches here, the backtrace is as follows:
#0 0x00007ffff7de03f8 in _dl_lookup_symbol_x (undef_name=0x7ffff744fa23 "gtk_progress_get_type", undef_map=0x7ffff7ffe170, ref=0x7fffffffd8d8, symbol_scope=0x7ffff7ffe4f8, version=0x0, type_class=0, flags=2, skip_map=<optimized out>) at dl-lookup.c:857
#1 0x00007ffff4bd6da6 in do_sym (flags=2, vers=0x0, who=0x7ffff486d48e <g_module_symbol+126>, name=0x7ffff744fa23 "gtk_progress_get_type", handle=0x7ffff7ffe170)
at dl-sym.c:151
#2 0x00007ffff4bd6da6 in _dl_sym (handle=0x7ffff7ffe170, name=0x7ffff744fa23 "gtk_progress_get_type", who=0x7ffff486d48e <g_module_symbol+126>) at dl-sym.c:254
#3 0x00007fffefcdf0e4 in dlsym_doit (a=a#entry=0x7fffffffdb20) at dlsym.c:50
#4 0x00007ffff4bd72df in __GI__dl_catch_exception (exception=exception#entry=0x7fffffffdab0, operate=0x7fffefcdf0d0 <dlsym_doit>, args=0x7fffffffdb20)
at dl-error-skeleton.c:196
#5 0x00007ffff4bd736f in __GI__dl_catch_error (objname=0x5555557d44a0, errstring=0x5555557d44a8, mallocedp=0x5555557d4498, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#6 0x00007fffefcdf735 in _dlerror_run (operate=operate#entry=0x7fffefcdf0d0 <dlsym_doit>, args=args#entry=0x7fffffffdb20) at dlerror.c:162
#7 0x00007fffefcdf166 in __dlsym (handle=handle#entry=0x7ffff7ffe170, name=name#entry=0x7ffff744fa23 "gtk_progress_get_type") at dlsym.c:70
#8 0x00007ffff486d48e in _g_module_symbol (symbol_name=0x7ffff744fa23 "gtk_progress_get_type", handle=0x7ffff7ffe170) at ../../../../gmodule/gmodule-dl.c:163
#9 0x00007ffff486d48e in g_module_symbol (module=module#entry=0x5555557d44c0, symbol_name=symbol_name#entry=0x7ffff744fa23 "gtk_progress_get_type", symbol=symbol#entry=0x7fffffffdba0) at ../../../../gmodule/gmodule.c:800
#10 0x00007ffff728f55e in _gtk_module_has_mixed_deps (module_to_check=module_to_check#entry=0x0) at ../../../../gtk/gtkmodules.c:594
#11 0x00007ffff728f703 in find_module (name=0x5555557db040 "gail") at ../../../../gtk/gtkmodules.c:227
#12 0x00007ffff728f703 in load_module (name=0x5555557db040 "gail", module_list=0x0) at ../../../../gtk/gtkmodules.c:292
#13 0x00007ffff728f703 in load_modules (module_str=module_str#entry=0x5555557d44f0 "gail:atk-bridge") at ../../../../gtk/gtkmodules.c:423
#14 0x00007ffff728fb64 in _gtk_modules_init (argc=0x0, argv=<optimized out>, gtk_modules_args=0x5555557d44f0 "gail:atk-bridge") at ../../../../gtk/gtkmodules.c:544
#15 0x00007ffff726786b in do_post_parse_initialization (argc=0x0, argv=0x0) at ../../../../gtk/gtkmain.c:755
#16 0x00007ffff726786b in post_parse_hook (context=<optimized out>, group=<optimized out>, data=0x5555557d0cd0, error=0x7fffffffdd98) at ../../../../gtk/gtkmain.c:798
#17 0x00007ffff54768a8 in g_option_context_parse (context=<optimized out>, argc=<optimized out>, argv=<optimized out>, error=<optimized out>)
at ../../../../glib/goption.c:2165
#18 0x0000555555573386 in main (argc=<optimized out>, argv=<optimized out>) at main.c:275
But when I enter ni (to jump to the next assembly instruction), it turns into this:
#0 0x00007ffff4bd72cd in __GI__dl_catch_exception (exception=exception#entry=0x7fffffffdab0, operate=0x7fffefcdf0d0 <dlsym_doit>, args=0x7fffffffdb20)
at dl-error-skeleton.c:194
#1 0x00007ffff4bd736f in __GI__dl_catch_error (objname=0x5555557d44a0, errstring=0x5555557d44a8, mallocedp=0x5555557d4498, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#2 0x00007fffefcdf735 in _dlerror_run (operate=operate#entry=0x7fffefcdf0d0 <dlsym_doit>, args=args#entry=0x7fffffffdb20) at dlerror.c:162
#3 0x00007fffefcdf166 in __dlsym (handle=handle#entry=0x7ffff7ffe170, name=name#entry=0x7ffff744fa23 "gtk_progress_get_type") at dlsym.c:70
#4 0x00007ffff486d48e in _g_module_symbol (symbol_name=0x7ffff744fa23 "gtk_progress_get_type", handle=0x7ffff7ffe170) at ../../../../gmodule/gmodule-dl.c:163
#5 0x00007ffff486d48e in g_module_symbol (module=module#entry=0x5555557d44c0, symbol_name=symbol_name#entry=0x7ffff744fa23 "gtk_progress_get_type", symbol=symbol#entry=0x7fffffffdba0) at ../../../../gmodule/gmodule.c:800
#6 0x00007ffff728f55e in _gtk_module_has_mixed_deps (module_to_check=module_to_check#entry=0x0) at ../../../../gtk/gtkmodules.c:594
#7 0x00007ffff728f703 in find_module (name=0x5555557db040 "gail") at ../../../../gtk/gtkmodules.c:227
#8 0x00007ffff728f703 in load_module (name=0x5555557db040 "gail", module_list=0x0) at ../../../../gtk/gtkmodules.c:292
#9 0x00007ffff728f703 in load_modules (module_str=module_str#entry=0x5555557d44f0 "gail:atk-bridge") at ../../../../gtk/gtkmodules.c:423
#10 0x00007ffff728fb64 in _gtk_modules_init (argc=0x0, argv=<optimized out>, gtk_modules_args=0x5555557d44f0 "gail:atk-bridge") at ../../../../gtk/gtkmodules.c:544
#11 0x00007ffff726786b in do_post_parse_initialization (argc=0x0, argv=0x0) at ../../../../gtk/gtkmain.c:755
#12 0x00007ffff726786b in post_parse_hook (context=<optimized out>, group=<optimized out>, data=0x5555557d0cd0, error=0x7fffffffdd98) at ../../../../gtk/gtkmain.c:798
#13 0x00007ffff54768a8 in g_option_context_parse (context=<optimized out>, argc=<optimized out>, argv=<optimized out>, error=<optimized out>)
at ../../../../glib/goption.c:2165
#14 0x0000555555573386 in main (argc=<optimized out>, argv=<optimized out>) at main.c:275
As can be seen, after a simple call and return, 4 elements are popped off the stack. Perhaps there is something special about the <_dl_signal_cexception(), __GI__dl_catch_exception()> pair. The stack is changed by some means other than call or return. It seems that _dl_signal_cexception() finally leads to a __longjmp() function at ../sysdeps/x86_64/__longjmp.S which modifies the backtrace. Can someone describe the process?

As can be seen, after a simple call and return, 4 elements are popped off the stack. Perhaps there is something special about the _dl_signal_cexception() __GI__dl_catch_exception()pair. The stack is changed by some means other than call or return.
Correct: the _dl_signal_exception doesn't return, it uses longjmp to transfer control not to its caller, but to its callers callers ... caller.
It seems that _dl_signal_cexception() finally leads to a __longjmp()
Correct.
Can someone describe the process?
You appear to not understand what longjmp does. Reading its man page and/or this example should help.
Update:
this approach for transition in control flow is somehow insane even compared to simple gotos ... any other cases that I should consider?
Other "interesting" control transfers are via makecontext, setcontext and swapcontext family of functions, and (in C++) throw and catch are pretty much equivalent to setjmp and longjmp.

How does Mesa recycle graphic resources?

I have a system running on an Intel debug board with DRM and Mesa.
This graphic system use Wayland/Weston and Mesa.
And applications are developed with OpenGL ES 2.0.
Now, I find, sometimes, if the application crashed, the Weston will crashed too.
By checking the coredump of Weston, I can find some invalid memory address was used.
But when I run Weston with Valgrind, there is not any report for invalid memory access.
So, I am thinking about if there were some shared-memory leak by mesa when the client crash.
Means, for example, an application draw a buffer, and commit it to Weston, after that, the application crashed, and mesa recycled all the buffers alloced by this application. But, Weston do not know this, it used the committed buffer and crashed.
Will those things happen? And what could I do to survive from this?
Core was generated by `weston --config=/usr/lib/weston/weston.ini --backend=/usr/lib/weston/ias-backen'.
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x00007fec68911b09 in raise (sig=5) at ../sysdeps/unix/sysv/linux/pt-raise.c:36
36 ../sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) bt
#0 0x00007fec68911b09 in raise (sig=5) at ../sysdeps/unix/sysv/linux/pt-raise.c:36
#1 <signal handler called>
#2 0x00007fec685904b8 in __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#3 0x00007fec6859358a in __GI_abort () at abort.c:89
#4 0x00007fec685ca90b in __libc_message (do_abort=do_abort#entry=2, fmt=fmt#entry=0x7fec686c68a0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#5 0x00007fec685d4896 in malloc_printerr (action=3, str=0x7fec686c2d31 "free(): invalid pointer", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5000
#6 0x00007fec685d507e in _int_free (av=0x7fec688fbb40 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3861
#7 0x00007fec655d320d in intel_miptree_release (mt=mt#entry=0x18884a8)
at src/mesa/drivers/dri/i965/intel_mipmap_tree.c:1036
#8 0x00007fec655d32a7 in intel_miptree_reference (dst=0x18884a8, src=0x1870170)
at src/mesa/drivers/dri/i965/intel_mipmap_tree.c:989
#9 0x00007fec655dc640 in intel_set_texture_image_mt (brw=brw#entry=0x7fec69a44040, image=image#entry=0x185c050, internal_format=<optimized out>, mt=<optimized out>)
at src/mesa/drivers/dri/i965/intel_tex_image.c:180
#10 0x00007fec655dc952 in intel_image_target_texture_2d (ctx=<optimized out>, target=3553, texObj=0x1888080, texImage=0x185c050, image_handle=<optimized out>)
at src/mesa/drivers/dri/i965/intel_tex_image.c:426
#11 0x00007fec653544ff in _mesa_EGLImageTargetTexture2DOES (target=3553, image=0x185fb60)
at src/mesa/main/teximage.c:3194
#12 0x00007fec660a2d27 in gl_renderer_attach_egl (format=<optimized out>, buffer=0x1889130, es=<optimized out>) at ../src/gl-renderer.c:1450
#13 gl_renderer_attach (es=<optimized out>, buffer=0x1889130) at ../src/gl-renderer.c:1919
#14 0x000000000040fdb5 in weston_surface_attach (buffer=0x1889130, surface=0x13ad650) at ../src/compositor.c:2266
#15 weston_surface_commit_state (surface=surface#entry=0x13ad650, state=state#entry=0x13ad778) at ../src/compositor.c:3190
#16 0x000000000041036f in weston_surface_commit (surface=surface#entry=0x13ad650) at ../src/compositor.c:3262
#17 0x00000000004104c7 in surface_commit (client=<optimized out>, resource=<optimized out>) at ../src/compositor.c:3289
#18 0x00007fec6835ad04 in ffi_call_unix64 () from /usr/lib/libffi.so.6
#19 0x00007fec6835a7fa in ffi_call () from /usr/lib/libffi.so.6
#20 0x00007fec696ff7ba in wl_closure_invoke (closure=<optimized out>, flags=<optimized out>, target=0x13ad940, opcode=6, data=0x13acd70) at ../src/connection.c:935
#21 0x00007fec696fc517 in wl_client_connection_data (fd=<optimized out>, mask=<optimized out>, data=0x13acd70) at ../src/wayland-server.c:407
#22 0x00007fec696fdd32 in wl_event_loop_dispatch (loop=0x11a7460, timeout=timeout#entry=-1) at ../src/event-loop.c:423
#23 0x00007fec696fc6b5 in wl_display_run (display=display#entry=0x11a7380) at ../src/wayland-server.c:1281
#24 0x00000000004092d1 in main (argc=1, argv=<optimized out>) at ../src/main.c:1049
(gdb)

nodejs + aerospike crashes

I am getting the following backtrace from an error that only happens after some subsequent requests to the server:
node: ../deps/uv/src/unix/core.c:171: uv__finish_close: Assertion `handle->flags & UV_CLOSING' failed.
Program received signal SIGABRT, Aborted.
0x00000030eee32925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) backtrace
#0 0x00000030eee32925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00000030eee34105 in abort () at abort.c:92
#2 0x00000030eee2ba4e in __assert_fail_base (fmt=<value optimized out>, assertion=0xb37538 "handle->flags & UV_CLOSING", file=0xb374a8 "../deps/uv/src/unix/core.c", line=<value optimized out>, function=<value optimized out>)
at assert.c:96
#3 0x00000030eee2bb10 in __assert_fail (assertion=0xb37538 "handle->flags & UV_CLOSING", file=0xb374a8 "../deps/uv/src/unix/core.c", line=171, function=0xb37690 "uv__finish_close") at assert.c:105
#4 0x0000000000994bb4 in uv__finish_close (loop=0xe6d840, mode=<value optimized out>) at ../deps/uv/src/unix/core.c:171
#5 uv__run_closing_handles (loop=0xe6d840, mode=<value optimized out>) at ../deps/uv/src/unix/core.c:221
#6 uv_run (loop=0xe6d840, mode=<value optimized out>) at ../deps/uv/src/unix/core.c:319
#7 0x0000000000942132 in node::Start(int, char**) ()
#8 0x00000030eee1ed1d in __libc_start_main (main=0x599710 <main>, argc=2, ubp_av=0x7fffffffdec8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffdeb8)
at libc-start.c:226
#9 0x00000000005999f1 in _start ()
Any idea why is this happening? I am using aerospike but I am not sure if it is related to the issue.
To reproduce it:
gdb --args node /bin/www
> run // until error occurs
> backtrace full

This question was asked on the Aerospike Community Edition user forum here.
Aerospike made an official release to NPM (Node.js 1.0.38) in April 2015; it fixes this UV assertion segfault when running query in the node.js API.
The user #Daniel reported that his problem is now fixed.

Which signal was delivered to process deadlocked in signal handler

I have a core dump from a process that deadlocked after invoking a signal handler. How do I determine which signal was delivered and who sent it?
The GDB-generated backtrace for the the thread that received the signal follows. The signal handler was called in frame 15.
(gdb) bt
#0 0x00007fa9c204654b in sys_futex (w=0x7fa9c2263d80, value=2, loop=<value optimized out>) at ./src/base/linux_syscall_support.h:1789
#1 base::internal::SpinLockDelay (w=0x7fa9c2263d80, value=2, loop=<value optimized out>) at ./src/base/spinlock_linux-inl.h:87
#2 0x00007fa9c204774c in SpinLock::SlowLock (this=0x7fa9c2263d80) at src/base/spinlock.cc:132
#3 0x00007fa9c2037ee3 in Lock (this=0x7fa9c2263d80, start=0x7fa9bb3c04c8, end=0x7fa9bb3c04c0, N=3) at src/base/spinlock.h:75
#4 tcmalloc::CentralFreeList::RemoveRange (this=0x7fa9c2263d80, start=0x7fa9bb3c04c8, end=0x7fa9bb3c04c0, N=3) at src/central_freelist.cc:247
#5 0x00007fa9c203bae4 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0x17efb40, cl=<value optimized out>, byte_size=32) at src/thread_cache.cc:162
#6 0x00007fa9c202b9cb in Allocate (size=<value optimized out>) at src/thread_cache.h:341
#7 do_malloc (size=<value optimized out>) at src/tcmalloc.cc:1068
#8 (anonymous namespace)::do_malloc_or_cpp_alloc (size=<value optimized out>) at src/tcmalloc.cc:1005
#9 0x00007fa9c204bfa8 in tc_realloc (old_ptr=0x0, new_size=32) at src/tcmalloc.cc:1517
#10 0x0000003a358c0f3b in ?? () from /usr/lib64/libstdc++.so.6
#11 0x0000003a358c2adf in ?? () from /usr/lib64/libstdc++.so.6
#12 0x0000003a358c2cae in __cxa_demangle () from /usr/lib64/libstdc++.so.6
#13 0x000000000085f6c7 in my_print_stacktrace ()
#14 0x00000000006a773a in handle_fatal_signal ()
#15 <signal handler called>
#16 tcmalloc::CentralFreeList::FetchFromSpans (this=0x7fa9c2263d80) at src/central_freelist.cc:298
#17 0x00007fa9c2037f88 in tcmalloc::CentralFreeList::RemoveRange (this=0x7fa9c2263d80, start=0x7fa9bb3c1468, end=0x7fa9bb3c1460, N=3) at src/central_freelist.cc:269
#18 0x00007fa9c203bae4 in tcmalloc::ThreadCache::FetchFromCentralCache (this=0x17efb40, cl=<value optimized out>, byte_size=32) at src/thread_cache.cc:162
...
For what it's worth, handle_fatal_signal() and my_print_stacktrace() are MySQL functions. The rest are from Google's tcmalloc.

I would try "frame 15" to move to the signal delivery frame, followed by "print $_siginfo.si_signo". See https://sourceware.org/gdb/onlinedocs/gdb/Signals.html
This works on Linux at least, which I presume from your backtrace that you are using. I'm not sure about other platforms.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

torch.nn.CrossEntropyLoss().ignore_index is crashing when importing transfomers library - python-3.x

It seems something was broken on layoutlm with pytorch 1.4 related issue. Switching to pytorch 1.6 fix the issue with the core dump, and the layoutlm code run without any modification.

Related

SMP threads not showing in GDB

Weird Backtrace After a Call Instruction Targeting Signal Functions

How does Mesa recycle graphic resources?

nodejs + aerospike crashes

Which signal was delivered to process deadlocked in signal handler

Categories

Resources