Theano segmentation fault with cuda-9.0 - theano

I tried installing and using Theano with CUDA 9.0 on a P100 node. The installation itself went smoothly, but I get a segmentation fault (see below).
I tried both Theano-0.9.0 and Theano-0.10.0beta1 in combination with libgpuarray/pygpu 0.6.8 and 0.6.9. Every combination results in a segfault.
Here is my setup:
* RHEL 7
* GCC: 4.8.5
* CUDA: 9.0
* cuDNN: 5.1.5
* Python: 2.7.13
* cmake: 3.7.2
[bsankara@c460 ~]$ python
Python 2.7.13 (default, Aug 10 2017, 07:33:11)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[52508,1],0] (PID 3946)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[c460:03946] *** Process received signal ***
[c460:03946] Signal: Segmentation fault (11)
[c460:03946] Signal code: Invalid permissions (2)
[c460:03946] Failing at address: 0x3fff8d48f5b0
[c460:03946] [ 0] [0x3fff9cdf0478]
[c460:03946] [ 1] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(load_libcuda+0x60)[0x3fff8631b5e0]
[c460:03946] [ 2] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x3f384)[0x3fff862df384]
[c460:03946] [ 3] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x41118)[0x3fff862e1118]
[c460:03946] [ 4] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(gpucontext_init+0x90)[0x3fff862c7930]
[c460:03946] [ 5] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x2c974)[0x3fff8638c974]
[c460:03946] [ 6] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x101050)[0x3fff9cc61050]
[c460:03946] [ 7] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x54318)[0x3fff863b4318]
[c460:03946] [ 8] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x56530)[0x3fff863b6530]
[c460:03946] [ 9] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554]
[c460:03946] [10] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8e64)[0x3fff9ccc9484]
[c460:03946] [11] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [12] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524]
[c460:03946] [13] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [14] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524]
[c460:03946] [15] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [16] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484]
[c460:03946] [17] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960]
[c460:03946] [18] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x188e50)[0x3fff9cce8e50]
[c460:03946] [19] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18ad54)[0x3fff9ccead54]
[c460:03946] [20] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18a540)[0x3fff9ccea540]
[c460:03946] [21] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x2f4)[0x3fff9cceb7b4]
[c460:03946] [22] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x15d038)[0x3fff9ccbd038]
[c460:03946] [23] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554]
[c460:03946] [24] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyObject_Call+0x74)[0x3fff9cbc1ab4]
[c460:03946] [25] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x68)[0x3fff9ccbfc68]
[c460:03946] [26] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3214)[0x3fff9ccc3834]
[c460:03946] [27] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [28] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484]
[c460:03946] [29] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960]
[c460:03946] *** End of error message ***
Segmentation fault
Any help would be appreciated. Thanks.

Grab a demo MPI C or C++ program from the web and compile it with mpicc / mpic++. Check that the compiler works and that the resulting executable runs and can manage point-to-point communication between different nodes in the cluster.
You probably used the wrong mpicc to compile Theano, and that compiler isn't binary compatible with the library for InfiniBand (or whatever hardware connects the computers in the cluster).
For example, if the InfiniBand library was compiled with gcc and Theano was compiled with an mpicc that is based on the Intel compiler, it won't work.
You can set an environment variable (OMPI_CC for C, OMPI_CXX for C++) to ask Open MPI's mpicc to use another compiler.
If you have multiple MPI implementations compiled by different compilers on that computer, try using ldd to find out which shared library object (those .so files) depends on which one.
The best case is of course to use the same compiler and the same MPI compiler wrapper to compile everything, and wrap the files into several modules.
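To make the ldd suggestion a bit more concrete, here is a minimal Python sketch that turns ldd-style output into a name-to-path mapping, so unresolved or unexpected libraries stand out. The sample text, library names and paths are hypothetical and only illustrate the format; in practice you would feed it the real output of ldd on the .so file in question.

```python
import re

# Matches lines such as "libfoo.so.1 => /path/libfoo.so.1 (0x...)" and
# "libfoo.so.1 => not found". Lines without "=>" (e.g. the vDSO) are skipped.
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(not found|\S+)")


def parse_ldd(output):
    """Map each library name in ldd output to its resolved path (None if missing)."""
    resolved = {}
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            name, target = match.groups()
            resolved[name] = None if target == "not found" else target
    return resolved


# Hypothetical sample output; names and paths are illustrative only.
SAMPLE = """\
\tlinux-vdso64.so.1 (0x00003fff9ce00000)
\tlibmpi.so.20 => /opt/openmpi/lib/libmpi.so.20 (0x00003fff9cb00000)
\tlibcuda.so.1 => not found
\tlibc.so.6 => /lib64/libc.so.6 (0x00003fff9c800000)
"""

if __name__ == "__main__":
    for name, path in parse_ldd(SAMPLE).items():
        print(name, "->", path or "NOT FOUND")
```

Comparing the resolved paths for two libraries shows quickly whether they pull in MPI or runtime libraries from different installations.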

The answer turned out to lie in the gcc version and libgpuarray. For some reason, gcc-4.8.5 has issues with libgpuarray, and that is what was causing the segmentation fault.
I installed gcc-5.4.0 in my user space and recompiled cmake and libgpuarray, as well as the others including Theano and numpy (just to be sure), and the segmentation fault no longer occurs.
The other change was that the cluster admins updated CUDA to 9.0.151 with the new driver 384.66.
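Since the compiler version mattered here, a quick sanity check after rebuilding is to confirm which compiler a given build reports. The sketch below is only an illustration: it parses a banner such as the one Python prints at startup. Note that platform.python_compiler() describes the compiler that built the Python interpreter, not libgpuarray, so treat it as one data point; the helper name compiler_version is made up for this example.

```python
import platform
import re


def compiler_version(banner):
    """Extract (major, minor, patch) from a banner like 'GCC 4.8.5 20150623 (...)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", banner)
    return tuple(int(part) for part in match.groups()) if match else None


if __name__ == "__main__":
    banner = platform.python_compiler()
    print(banner, "->", compiler_version(banner))
```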

Related

How can I debug a "Failed to install" message?

I am currently attempting to develop a cross-platform mobile app using Xamarin.forms. As a part of this application I need to include a 3rd party .framework in my Xamarin.iOS project. I have successfully created a Xamarin.iOS Bindings Library .dll and included it in my project. I am able to reference the library and compile without errors, however when I attempt to deploy the app to the iPhone simulator the app will start and then crash with a “Failed to install” message.
Error Message
If I remove any lines of code which reference this .dll the app will run fine.
Does anyone have any insight on how to solve this?
Potentially useful information:
I am developing in Visual Studio for Windows
Xamarin version: 16.7.000.440
Xamarin.iOS version: 13.20.2.2
XCode version: 12.0.1
iOS version: 14.0
Device Crash Log:
Incident Identifier: 882D82AB-5511-48C1-AFCD-4B86933B2A5C
CrashReporter Key: 1cc59f0bc819c0d806e2c1ccdf7b24a413699a4f
Hardware Model: iPad7,11
Process: MyApp.iOS [452]
Path: /private/var/containers/Bundle/Application/A31E102C-4BB8-431A-ABDF-E17A503E1778/MyApp.iOS.app/MyApp.iOS
Identifier: com.Crossroads.MyApp
Version: 1.0 (1.0)
Code Type: ARM-64 (Native)
Role: Foreground
Parent Process: launchd [1]
Coalition: com.Crossroads.MyApp [614]
Date/Time: 2020-10-16 09:46:53.9542 -0500
Launch Time: 2020-10-16 09:46:53.9106 -0500
OS Version: iPhone OS 13.5.1 (17F80)
Release Type: User
Baseband Version: n/a
Report Version: 104
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Description: DYLD, dependent dylib '@rpath/MyFramework.framework/MyFramework' not found for '/private/var/containers/Bundle/Application/A31E102C-4BB8-431A-ABDF-E17A503E1778/MyApp.iOS.app/MyApp.iOS', tried but didn't find: '@rpath/MyFramework.framework/MyFramework' '/System/Library/Frameworks/MyFramework.framework/MyFramework'
Highlighted by Thread: 0
Backtrace not available
Unknown thread crashed with ARM Thread State (64-bit):
x0: 0x0000000000000006 x1: 0x0000000000000001 x2: 0x000000016b701390 x3: 0x00000000000000c7
x4: 0x000000016b700f90 x5: 0x0000000000000000 x6: 0x0000000000000000 x7: 0x0000000000000000
x8: 0x0000000000000020 x9: 0x0000000000000009 x10: 0x6f4d706163617461 x11: 0x656b6f54656c6962
x12: 0x6f77656d6172662e x13: 0x63617461442f6b72 x14: 0x656c69626f4d7061 x15: 0x0020276e656b6f54
x16: 0x0000000000000209 x17: 0x0000000000000000 x18: 0x0000000000000000 x19: 0x0000000000000000
x20: 0x000000016b700f90 x21: 0x00000000000000c7 x22: 0x000000016b701390 x23: 0x0000000000000001
x24: 0x0000000000000006 x25: 0x0000000106cd4000 x26: 0x0000000000000001 x27: 0x0000000106cd4000
x28: 0x0000000000000000 fp: 0x000000016b700f60 lr: 0x0000000106cbbee8
sp: 0x000000016b700f20 pc: 0x0000000106cb4f68 cpsr: 0x00000000
esr: 0x00000000 Address size fault
Binary images description not available
Error Formulating Crash Report:
Failed to create CSSymbolicatorRef - corpse still valid ¯\_(ツ)_/¯
EOF
According to Apple's documentation, this error means you have linked the framework but have not embedded it.
The app crashes at launch, because the dynamic linker can’t locate the
missing framework.
So what you need to do is embed the framework. Here are the documents you can refer to:
Linking the dependencies
binding-objective-c
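When a crash log is long, it can help to pull out just the missing library and the image that asked for it. A rough Python sketch, assuming the Termination Description line has the same shape as in the log above; the function name missing_dylib is made up for illustration.

```python
import re

# dyld reports the missing library and the image that needed it on one line.
_DYLD_MISSING = re.compile(r"dependent dylib '([^']+)' not found for '([^']+)'")


def missing_dylib(description):
    """Return (missing_library, requesting_image), or None if the line does not match."""
    match = _DYLD_MISSING.search(description)
    return match.groups() if match else None


LINE = (
    "DYLD, dependent dylib '@rpath/MyFramework.framework/MyFramework' "
    "not found for '/private/var/containers/Bundle/Application/"
    "A31E102C-4BB8-431A-ABDF-E17A503E1778/MyApp.iOS.app/MyApp.iOS'"
)

if __name__ == "__main__":
    library, image = missing_dylib(LINE)
    print("missing:", library)
    print("needed by:", image)
```

A library path that starts with the rpath placeholder and is not found usually means the framework was never copied into the app bundle, which matches the "linked but not embedded" diagnosis.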

How to solve "Error: tool" with RSAGA?

I am using R 3.1.1 on Ubuntu 14 64 bit version. I installed SAGA GIS 2.1.2 and RSAGA 0.93-6.
So far all seems to work fine.
rsaga.env() works, I use:
work_env <- rsaga.env(modules='/usr/lib/x86_64-linux-gnu/saga/')
because on the 64 bit the modules are located somewhere else.
Getting the libraries works as well:
> rsaga.get.libraries(path=work_env$modules)
[1] "climate_tools" "contrib_perego" "db_odbc" "db_pgsql"
[5] "docs_html" "docs_pdf" "garden_3d_viewer" "garden_fractals"
[9] "garden_games" "garden_learn_to_program" "garden_webservices" "grid_analysis"
[13] "grid_calculus_bsl" "grid_calculus" "grid_filter" "grid_gridding"
[17] "grid_spline" "grid_tools" "grid_visualisation" "imagery_classification"
[21] "imagery_rga" "imagery_segmentation" "imagery_svm" "imagery_tools"
[25] "io_esri_e00" "io_gdal" "io_gps" "io_grid_grib2"
[29] "io_grid_image" "io_grid" "io_shapes_dxf" "io_shapes"
[33] "io_table" "io_virtual" "pj_georeference" "pj_proj4"
[37] "pointcloud_tools" "pointcloud_viewer" "shapes_grid" "shapes_lines"
[41] "shapes_points" "shapes_polygons" "shapes_tools" "shapes_transect"
[45] "sim_cellular_automata" "sim_ecosystems_hugget" "sim_erosion" "sim_hydrology"
[49] "sim_ihacres" "statistics_grid" "statistics_kriging" "statistics_points"
[53] "statistics_regression" "table_calculus" "table_tools" "ta_channels"
[57] "ta_compound" "ta_hydrology" "ta_lighting" "ta_morphometry"
[61] "ta_preprocessor" "ta_profiles" "ta_slope_stability" "tin_tools"
[65] "tin_viewer"
But when I try to get the modules or anything else it gives a weird error:
> rsaga.get.modules("ta_preprocessor", env=work_env)
Error: tool
$ta_preprocessor
NULL
I found out that RSAGA officially doesn't support SAGA GIS versions higher than 2.1.0, but when I try 2.1.0 I get the error described in another question: https://gis.stackexchange.com/questions/109497/rsaga-saga-cmd-2-1-0-error-inconsistency
How should I solve this error?
I kind of fixed it by compiling SAGA GIS 2.1.1 from source (http://sourceforge.net/p/saga-gis/wiki/Compiling%20a%20Linux%20Unicode%20version/). When I execute a tool with RSAGA I do get another error, "Error: module", but the execution seems to work fine.
Also, SAGA GIS sometimes quits with a segmentation fault... but not too often.

dlopen failed: cannot locate symbol "signal"

I am developing an Android app using NDK.
I have built OpenSSL as static libraries, libcrypto.a and libssl.a, which I linked with my custom C code.
When I try to load the library at runtime I get:
dlopen failed: cannot locate symbol "signal"...
Any idea how to fix this?
Thanks!
Update:
This comes from libcrypto:
libcrypto.a:
00000000 *UND* 00000000 signal
In my .so I see:
libtest.so:
NEEDED libc.so
...
00040240 <signal@plt>:
40240: e28fc601 add ip, pc, #1048576 ; 0x100000
40244: e28cca80 add ip, ip, #128, 20 ; 0x80000
40248: e5bcfd64 ldr pc, [ip, #3428]! ; 0xd64
So why is it complaining about "signal"?

IIS problem, web application

When I use the web application, the application logs me out. I think it might be an IIS recycle.
EventViewer Message:
.NET Runtime version 2.0.50727.4927 - Fatal Execution Engine Error (000007FEF582FA42) (80131506)
----------
Faulting application name: w3wp.exe, version: 7.5.7600.16385, time stamp: 0x4a5bd0eb
Faulting module name: mscorwks.dll, version: 2.0.50727.4927, time stamp: 0x4a27466f
Exception code: 0xc0000005
Fault offset: 0x00000000006be81f
Faulting process id: 0x%9
Faulting application start time: 0x%10
Faulting application path: %11
Faulting module path: %12
Report Id: %13
-------------
Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0
Problem signature:
P1: w3wp.exe
P2: 7.5.7600.16385
P3: 4a5bd0eb
P4: mscorwks.dll
P5: 2.0.50727.4927
P6: 4a27466f
P7: c0000005
P8: 00000000006be81f
P9:
P10:
Attached files:
These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_w3wp.exe_6a41af6fc5f73afd65a4b62225f4f0ff51ba820_60e9d666
Analysis symbol:
Rechecking for solution: 0
Report Id: d745615a-e67c-11df-83c0-d8d385b73c58
Report Status: 4
I analyzed the crash dump with WinDbg, but I don't know what the problem is or how to solve it:
0:056> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
Unable to load image C:\Windows\assembly\NativeImages_v2.0.50727_64\mscorlib\9a017aa8d51322f18a40f414fa35872d\mscorlib.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
Unable to load image C:\Windows\assembly\NativeImages_v2.0.50727_64\System.Web.RegularE#\bf11731ff6e75c72e9939a05151e7484\System.Web.RegularExpressions.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for System.Web.RegularExpressions.ni.dll
Unable to load image C:\Windows\assembly\NativeImages_v2.0.50727_64\System.Web\d753bba0990df9a19883f05d5b681d3b\System.Web.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for System.Web.ni.dll
Unable to load image C:\Windows\assembly\NativeImages_v2.0.50727_64\System.Data\46a0336046744a9f29986b208b8d38d4\System.Data.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for System.Data.ni.dll
Unable to load image C:\Windows\winsxs\amd64_microsoft.windows.gdiplus_6595b64144ccf1df_1.1.7600.16385_none_2b4f45e87195fcc4\GdiPlus.dll, Win32 error 0n2
*** WARNING: Unable to verify timestamp for GdiPlus.dll
Unable to load image C:\Windows\assembly\NativeImages_v2.0.50727_64\System\247913fa7ae6fcf04ea33d28d24ab611\System.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for System.ni.dll
GetPageUrlData failed, server returned HTTP status 500
URL requested: http://watson.microsoft.com/StageOne/w3wp_exe/7_5_7600_16385/4a5bd0eb/mscorwks_dll/2_0_50727_4927/4a27466f/c0000005/006be81f.htm?Retriage=1
FAULTING_IP:
mscorwks!COMCryptography::_GetKeyParameter+24f
000007fe`f5dde81f 418b4514 mov eax,dword ptr [r13+14h]
EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 000007fef5dde81f (mscorwks!COMCryptography::_GetKeyParameter+0x000000000000024f)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000014
Attempt to read from address 0000000000000014
PROCESS_NAME: w3wp.exe
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_PARAMETER1: 0000000000000000
EXCEPTION_PARAMETER2: 0000000000000014
READ_ADDRESS: 0000000000000014
FOLLOWUP_IP:
mscorwks!COMCryptography::_GetKeyParameter+24f
000007fe`f5dde81f 418b4514 mov eax,dword ptr [r13+14h]
MOD_LIST: <ANALYSIS/>
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
MANAGED_STACK: !dumpstack -EE
No export dumpstack found
MANAGED_BITNESS_MISMATCH:
Managed code needs matching platform of sos.dll for proper analysis. Use 'x64' debugger.
ADDITIONAL_DEBUG_TEXT: Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD]
LAST_CONTROL_TRANSFER: from 000007fef3a0bf50 to 000007fef5dde81f
FAULTING_THREAD: ffffffffffffffff
DEFAULT_BUCKET_ID: NOSOS
PRIMARY_PROBLEM_CLASS: NOSOS
BUGCHECK_STR: APPLICATION_FAULT_NOSOS_NULL_CLASS_PTR_DEREFERENCE_INVALID_POINTER_READ_WRONG_SYMBOLS_CALL_STACKIMMUNE
STACK_TEXT:
00000000`00000000 00000000`00000000 w3wp.exe+0x0
SYMBOL_NAME: w3wp.exe
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: w3wp
IMAGE_NAME: w3wp.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bd0eb
STACK_COMMAND: ** Pseudo Context ** ; kb
FAILURE_BUCKET_ID: NOSOS_c0000005_w3wp.exe!Unknown
BUCKET_ID: X64_APPLICATION_FAULT_NOSOS_NULL_CLASS_PTR_DEREFERENCE_INVALID_POINTER_READ_WRONG_SYMBOLS_CALL_STACKIMMUNE_w3wp.exe
Followup: MachineOwner
I solved this problem.
Solution Steps:
First I opened Control Panel > Action Center > Problem Reports.
I saw a list of problems, including my IIS crash.
I opened the item details and saved its dump files.
I downloaded WinDbg and opened the dump with it,
then entered the command !analyze -v
WinDbg analyzed it and showed text like this:
GetPageUrlData failed, server returned HTTP status 404
URL requested: http://watson.microsoft.com/StageOne/w3wp_exe/7_5_7600_16385/4a5bd0eb/mscorwks_dll/2_0_50727_4927/4a27466f/c0000005/006be81f.htm?Retriage=1
FAULTING_IP:
mscorwks!COMCryptography::_GetKeyParameter+24f
000007fe`f5dde81f 418b4514 mov eax,dword ptr [r13+14h]
EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 000007fef5dde81f (mscorwks!COMCryptography::_GetKeyParameter+0x000000000000024f)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000014
Attempt to read from address 0000000000000014
PROCESS_NAME: w3wp.exe
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_PARAMETER1: 0000000000000000
EXCEPTION_PARAMETER2: 0000000000000014
READ_ADDRESS: 0000000000000014
FOLLOWUP_IP:
mscorwks!COMCryptography::_GetKeyParameter+24f
000007fe`f5dde81f 418b4514 mov eax,dword ptr [r13+14h]
MOD_LIST: <ANALYSIS/>
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
MANAGED_STACK: !dumpstack -EE
No export dumpstack found
MANAGED_BITNESS_MISMATCH:
Managed code needs matching platform of sos.dll for proper analysis. Use 'x64' debugger.
ADDITIONAL_DEBUG_TEXT: Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD]
LAST_CONTROL_TRANSFER: from 000007fef3a0bf50 to 000007fef5dde81f
FAULTING_THREAD: ffffffffffffffff
DEFAULT_BUCKET_ID: NOSOS
PRIMARY_PROBLEM_CLASS: NOSOS
BUGCHECK_STR: APPLICATION_FAULT_NOSOS_NULL_CLASS_PTR_DEREFERENCE_INVALID_POINTER_READ_WRONG_SYMBOLS_CALL_STACKIMMUNE
STACK_TEXT:
00000000`00000000 00000000`00000000 w3wp.exe+0x0
SYMBOL_NAME: w3wp.exe
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: w3wp
IMAGE_NAME: w3wp.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bd0eb
STACK_COMMAND: ** Pseudo Context ** ; kb
FAILURE_BUCKET_ID: NOSOS_c0000005_w3wp.exe!Unknown
BUCKET_ID: X64_APPLICATION_FAULT_NOSOS_NULL_CLASS_PTR_DEREFERENCE_INVALID_POINTER_READ_WRONG_SYMBOLS_CALL_STACKIMMUNE_w3wp.exe
WATSON_STAGEONE_URL:
Followup: MachineOwner
0:056> .exr 0xffffffffffffffff
ExceptionAddress: 000007fef5dde81f (mscorwks!COMCryptography::_GetKeyParameter+0x000000000000024f)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000014
Attempt to read from address 0000000000000014
So I added this check to the Decrypt method: if (String.IsNullOrEmpty(value)) return String.Empty;
public static string Decrypt(string value)
{
    SymmetricAlgorithm algorithm = SymmetricAlgorithm.Create();
    ICryptoTransform decryptor = algorithm.CreateDecryptor(EncryptionKey, EncryptionVector);
    // Guard added: return early when there is nothing to decrypt
    if (String.IsNullOrEmpty(value))
        return String.Empty;
    byte[] encryptedBytes = Convert.FromBase64String(value);
    MemoryStream memoryStream = new MemoryStream(encryptedBytes);
    CryptoStream cryptoStream = new CryptoStream(memoryStream, decryptor, CryptoStreamMode.Read);
    ...
}
The problem was solved.
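For what it's worth, the read address in the dump already hints at why a null/empty guard helps. The faulting instruction is mov eax,dword ptr [r13+14h] and the read address is 0x14, which is what you would see if r13 were zero (the dump does not show r13 directly, so that part is an assumption). A minimal Python sketch of that reasoning, relying on the fact that Windows never maps the lowest 64 KB of a process's address space; the helper names are made up for illustration.

```python
# On Windows the lowest 64 KB of a process's address space is never mapped,
# so a faulting read there almost always means "NULL pointer + small offset".
NULL_GUARD_LIMIT = 0x10000


def looks_like_null_deref(read_address):
    """Return True when the faulting address falls in the reserved low region."""
    return 0 <= read_address < NULL_GUARD_LIMIT


def field_offset(read_address, base_register=0):
    """Offset that was added to the base register to form the faulting address."""
    return read_address - base_register


if __name__ == "__main__":
    address = int("0000000000000014", 16)  # READ_ADDRESS from the dump
    print(looks_like_null_deref(address))
    print(hex(field_offset(address)))
```

An address this small points at a field access through a null pointer in native code rather than at random memory corruption, which fits a null or empty value reaching the crypto routine.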
I know I'm late, but I just debugged a similar problem with WinDbg and finally managed to find the cause.
It's a bug reported to Microsoft:
http://connect.microsoft.com/VisualStudio/feedback/details/330926/cryptostream-flushfinalblock-fatal-on-64-bit-os-if-bytearray-is-null
I'm adding this to the discussion as a lead for others who search the web.
Tess Ferrandez has some great tutorials and information on how to use DebugDiag and WinDbg to nail down why this is happening:
If it is broken, fix it you should
There's also a lab to walk you through analysing worker process crashes:
.NET Debugging Demos Lab 5: Crash
.NET Debugging Demos Lab 2: Crash - Review
I ran into exactly the same symptoms, and the real reason was that I accidentally created an infinite recursion, which in turn caused a stack overflow. Please note that you need to restart the app pool after correcting the error.
The ASP.NET worker process is crashing with an access violation. This is usually the result of dereferencing a NULL or otherwise invalid pointer. Attempting to access a null reference in C# normally generates a managed exception, which ASP.NET is capable of catching, so I would assume that your web app is using COM interop or is invoking unmanaged (C++) code that crashes.
Unfortunately, that's about as much as we can tell you from the info above. You will need to debug your process to understand the exact cause of the crash.

how do you diagnose a kernel oops?

Given a Linux kernel oops, how do you go about diagnosing the problem? In the output I can see a stack trace which seems to give some clues. Are there any tools that would help find the problem? What basic procedures do you follow to track it down?
Unable to handle kernel paging request for data at address 0x33343a31
Faulting instruction address: 0xc50659ec
Oops: Kernel access of bad area, sig: 11 [#1]
tpsslr3
Modules linked in: datalog(P) manet(P) vnet wlan_wep wlan_scan_sta ath_rate_sample ath_pci wlan ath_hal(P)
NIP: c50659ec LR: c5065f04 CTR: c00192e8
REGS: c2aff920 TRAP: 0300 Tainted: P (2.6.25.16-dirty)
MSR: 00009032 CR: 22082444 XER: 20000000
DAR: 33343a31, DSISR: 20000000
TASK = c2e6e3f0[1486] 'datalogd' THREAD: c2afe000
GPR00: c5065f04 c2aff9d0 c2e6e3f0 00000000 00000001 00000001 00000000 0000b3f9
GPR08: 3a33340a c5069624 c5068d14 33343a31 82082482 1001f2b4 c1228000 c1230000
GPR16: c60f0000 000004a8 c59abbe6 0000002f c1228360 c340d6b0 c5070000 00000001
GPR24: c2aff9e0 c5070000 00000000 00000000 00000003 c2cc2780 c2affae8 0000000f
NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]
LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]
Call Trace:
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
[c2affb20] [c60bc7a4] ieee80211_deliver_data+0x1b4/0x380 [wlan]
[c2affb60] [c60bd420] ieee80211_input+0xab0/0x1bec [wlan]
[c2affbf0] [c6105b04] ath_rx_poll+0x884/0xab8 [ath_pci]
[c2affc90] [c018ec20] net_rx_action+0xd8/0x1ac
[c2affcb0] [c00260b4] __do_softirq+0x7c/0xf4
[c2affce0] [c0005754] do_softirq+0x58/0x5c
[c2affcf0] [c0025eb4] irq_exit+0x48/0x58
[c2affd00] [c000627c] do_IRQ+0xa4/0xc4
[c2affd10] [c00106f8] ret_from_except+0x0/0x14
--- Exception: 501 at __delay+0x78/0x98
LR = cfi_amdstd_write_buffers+0x618/0x7ac
[c2affdd0] [c0163670] cfi_amdstd_write_buffers+0x504/0x7ac (unreliable)
[c2affe50] [c015a2d0] concat_write+0xe4/0x140
[c2affe80] [c0158ff4] part_write+0xd0/0xf0
[c2affe90] [c015bdf0] mtd_write+0x170/0x2a8
[c2affef0] [c0073898] vfs_write+0xcc/0x16c
[c2afff10] [c0073f2c] sys_write+0x4c/0x90
[c2afff40] [c0010060] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfd98a50
LR = 0x10003840
Instruction dump:
419d02a0 98010009 800100a4 2f800003 419e0508 2f170000 419a0098 3d20c507
a0e1002e 81699624 39299624 7f8b4800 419e007c a0610016 7d264b78
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..
An Oops gives a bunch of information useful in diagnosing a crash. It starts with the address of the crash, the reason ("access of bad area") and the contents of the registers. The call trace answers the question "how did we get here". The first item in the list happened most recently. Working backwards, an interrupt happened (do_IRQ) because the Atheros WiFi adapter received a packet (ath_rx_poll). The routine passed it to the generic WiFi code (ieee80211_input) which in turn passed it up to the network stack (netif_receive_skb).
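Reading the trace by eye works, but for long traces it can be handy to split each frame into its parts. A small Python sketch, assuming frames look like the ones in the oops above ([stack] [address] symbol+0xoffset/0xsize, with an optional [module]); the function name parse_trace and the "kernel" label for frames without a module are choices made for this example.

```python
import re

# "[stack] [address] symbol+0xoffset/0xsize [module]" with the module optional.
_FRAME = re.compile(
    r"\[([0-9a-f]+)\]\s+\[([0-9a-f]+)\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
    r"(?:\s+\[(\w+)\])?"
)


def parse_trace(text):
    """Return one dict per call-trace frame, most recent call first."""
    frames = []
    for line in text.splitlines():  # match per line so frames never bleed together
        match = _FRAME.search(line)
        if not match:
            continue
        _stack, addr, symbol, offset, size, module = match.groups()
        frames.append({
            "address": int(addr, 16),
            "symbol": symbol,
            "offset": int(offset, 16),
            "size": int(size, 16),
            "module": module or "kernel",
        })
    return frames


TRACE = """\
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
"""

if __name__ == "__main__":
    for frame in parse_trace(TRACE):
        print(frame["module"], frame["symbol"], hex(frame["offset"]))
```

Grouping frames by module makes it easier to see where control crosses from the driver into the network stack and into the out-of-tree code.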
To figure out the exact code causing the problem, you can run
gdb /usr/src/linux/vmlinux
and then disassemble the function in question, which should be mesh_packet_in(): the NIP line places the faulting instruction (0xc50659ec) at mesh_packet_in+0x3d8, and the link register (0xc5065f04) points at mesh_packet_in+0x8f0 in the same function. You might also try the gdb command
(gdb) info line *0xc50659ec
to figure out which function contains this address.
You should first try to find the source of the code that has crashed. In this specific case, the analysis reports that the crash happened in mesh_packet_in of the manet driver, at offset 0x3d8 (the NIP line; offset 0x8f0 belongs to the link register). It also reports that the instructions around this point are 419d02a0 98010009 ... So inspect the module with "objdump -d" to confirm that the reported function/offset is correct. Then check the source to see what it is doing; you can use the register list to confirm again that you are looking at the right instruction.
When you know what C statement is faulting, you need to read the source to find out where the bogus data were coming from.
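The symbol+offset/size notation in the NIP and LR lines can be checked with a little arithmetic: subtracting the offset from the address gives the start of the function, and adding the size gives its end. A minimal Python sketch, assuming the notation means what it usually does in kernel oops output; the function name function_range is made up for this example.

```python
import re

_LOCATION = re.compile(r"\[([0-9a-f]+)\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def function_range(line):
    """Return (symbol, start, end) for a line like 'NIP [addr] sym+0xoff/0xsize'."""
    addr, symbol, offset, size = _LOCATION.search(line).groups()
    start = int(addr, 16) - int(offset, 16)
    return symbol, start, start + int(size, 16)


NIP = "NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]"
LR = "LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]"

if __name__ == "__main__":
    for line in (NIP, LR):
        symbol, start, end = function_range(line)
        print(symbol, hex(start), hex(end))
```

Both lines resolve to the same start address, so the faulting instruction and the return address sit inside the same function, and the offset to look up in the objdump listing is the one from the NIP line.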
http://oss.sgi.com/projects/kdb/
Install this into your kernel; then, when it oopses, you'll be dropped into a gdb-like interface that you can poke around in. However, it looks like the manet module is dereferencing a bad pointer.
