I have developed a library which I have been testing on an x86-64 machine, where it works and passes its tests successfully. When I use it in my Android application, the code stops in a constructor that just initializes all its member variables to their default values (pointers to null, booleans to false, ...). I have set the build target to x86_64, so I am sure it is not a problem of deploying the wrong architecture. How can I find the root of the problem? If I comment out the initialization in the constructor, it executes a good amount of code before failing with a SIGILL again. I am using an Android 8 x86_64 Intel image in the emulator. Logcat doesn't show anything either; the only error is the SIGILL.
It seems that most of the time, doing some pointer manipulation causes the problem. Simply initializing pointers with null or new causes the app to crash.
It turned out that instead of enabling SSE, I had enabled AVX, which is not supported by Android. Clang therefore optimized some parts of the code using AVX instructions, which resulted in the SIGILL.
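A cheap way to catch this at build time instead of at runtime is a compile-time guard on the predefined macros (a minimal sketch, assuming a Clang or GCC toolchain; __ANDROID__, __AVX__ and __AVX2__ are the compilers' standard predefined macros):

// Android x86/x86_64 devices and the stock emulator images do not guarantee
// AVX support, so fail the build if AVX code generation has been switched on.
#if defined(__ANDROID__) && (defined(__AVX__) || defined(__AVX2__))
#error "AVX code generation is enabled for an Android target; this will SIGILL on CPUs without AVX"
#endif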
First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem but a configuration problem.
I have a piece of CUDA code that does some mathematics in its kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and an Ubuntu 17.04 machine (2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. It is long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces everything to run on a single GPU, the first one, selected with cudaSetDevice(0).
When I copy the same input data and code to the linux machine (I am using git, it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
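(The error checking after the GPU call follows the usual runtime-API pattern; a minimal sketch, with myKernel, blocks and threads as placeholders for the real launch:)

#include <cstdio>
#include <cuda_runtime.h>

// After a kernel launch, cudaGetLastError() reports launch/configuration
// errors, while cudaDeviceSynchronize() surfaces errors raised while the
// kernel was running, such as "an illegal memory access was encountered".
void checkAfterLaunch()
{
    // myKernel<<<blocks, threads>>>(...);   // placeholder launch
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::fprintf(stderr, "kernel error: %s\n", cudaGetErrorString(err));
}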
If I change the kernel so that, instead of doing the math, it just writes a number into the output, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that there is a problem outside the code: not with the code itself, nor with the general configuration of the drivers/environment variables.
I read that xorg.conf can have an effect on the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from it, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still fail.
My question is: where else should I look? What Linux-specific configuration should I check to pinpoint the cause of the kernel halts?
The error did indeed end up being an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is machine specific: my Linux machine returns 8 while my Windows machine returns 4. As this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines the sizes of its variables in bits (such as uint32(1)), there was a mismatch on the Linux machine when doing memcpys. It turned out that this happened in a variable used as an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, producing the illegal memory error.
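A stripped-down illustration of the mismatch (just a sketch; fromMatlab stands in for the value actually handed over by MATLAB):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // MATLAB passes a uint32 index: always exactly 4 bytes.
    std::uint32_t fromMatlab = 42;

    // Bug: sizeof(unsigned long) is 4 on Win64 but 8 on Linux x86-64, so on
    // Linux this memcpy reads 4 bytes past the value MATLAB provided and the
    // index ends up as garbage.
    unsigned long index = 0;
    std::memcpy(&index, &fromMatlab, sizeof(unsigned long));

    // Fix: use a fixed-width type so both platforms copy exactly 4 bytes.
    std::uint32_t indexFixed = 0;
    std::memcpy(&indexFixed, &fromMatlab, sizeof(std::uint32_t));

    std::printf("sizeof(unsigned long) = %zu, fixed index = %u\n",
                sizeof(unsigned long), indexFixed);
    return 0;
}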
Too specific? yeah.
I'm compiling a rather large project for ARM. I'm using an AT91SAM9G25-EK as a devboard running a Debian ARM image. All libraries and executables in the image seem to be compiled for the armv4t instruction set.
My CPU is an ARM926EJ-S, which should run armv5tej code.
I'm using GCC to cross compile for my board. My CXX flags look like the following:
set(CMAKE_CXX_FLAGS "--signed-char --sysroot=${SYSROOT} -mcpu=arm926ej-s -mtune=arm926ej-s -mfloat-abi=softfp" CACHE STRING "" FORCE)
If I try to run this on my board, I get an Illegal Instruction signal (SIGILL) during initialization of one of my dependencies (using armv4t).
If I enable thumb mode (-mthumb -mthumb-interwork) it works, but uses Thumb for all the code, which in my case runs slower (I'm doing some serious number crunching).
In this case, if I specify one function to be compiled for ARM mode (using __attribute__((target("arm")))) it will run fine until that function is called, then exit with SIGILL.
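For illustration, the per-function override looks roughly like this (a sketch; hot_loop is a made-up name for the number-crunching function):

// The whole translation unit is built as Thumb (-mthumb), but this one hot
// function is requested as ARM code; the program runs fine until this
// function is called and then dies with SIGILL.
__attribute__((target("arm")))
double hot_loop(const double* data, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += data[i] * data[i];
    return acc;
}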
I'm lost. Is it a problem that I'm linking against libraries built for armv4t? Am I misunderstanding the way ARM modes work? Is it something in the Linux kernel?
What softfp means is to use the soft-float calling convention between functions, but still use the hardware FPU within them. Assuming your cross-compiler is configured with a default -mfpu option other than "none" (run arm-whatever-gcc -v and look for --with-fpu= to check), then you've got a problem, because as far as I can see from the Atmel datasheet, the SAM9G25 doesn't have an FPU.
My first instinct would be to slap GDB on there, catch the signal and disassemble the offending instruction to make sure, but the fact that Thumb code works OK is already a giveaway (Thumb before ARMv6T2 doesn't include any coprocessor instructions, and thus can't make use of an FPU).
In short, use -mfloat-abi=soft to ensure the ARM code actually uses software floating-point and avoids poking a non-existent FPU. And if the "serious number crunching" involves a lot of floating-point, perhaps consider getting a different MCU...
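If you want a build-time sanity check that software floating-point is really in effect, a guard on the predefined macro __SOFTFP__ (which GCC and Clang define under -mfloat-abi=soft) is one option; a minimal sketch:

// With -mfloat-abi=soft, GCC/Clang define __SOFTFP__ and will not emit VFP
// instructions. With softfp or hard plus a real -mfpu, __SOFTFP__ is absent
// and the compiler may generate FPU instructions the SAM9G25 cannot execute.
#if defined(__arm__) && !defined(__SOFTFP__)
#error "Hardware floating-point code generation is enabled for a target without an FPU"
#endif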
A few days ago I set up my computer and installed a fresh copy of Windows 8 because of some hardware changes. Among other things, I changed the video card from a Radeon HD 7870 to an Nvidia GTX 660.
After setting up Visual Studio 11 again, I downloaded my latest OpenGL project from GitHub and rebuilt it. I ran the application from Visual Studio and it crashed because of nvoglv32.dll:
Unhandled exception at 0x5D9F74E3 (nvoglv32.dll) in Application.exe: 0xC0000005: Access violation reading location 0x00000000.
In the old environment the application worked as expected, and I hadn't changed anything in the project or source code. The only difference was the language of the Visual Studio installation, which is English now and was German before. To rule that out, I created a new project and carried over all the settings, but the error remains.
In order to locate the crash, I checked that all initialization (window, shaders, ...) succeeds; the error occurs at the draw call glDrawElements(), which belongs to the geometry pass of my deferred renderer.
After some research I found out that nvoglv32.dll is from Nvidia and belongs to a service called Compatible OpenGL ICD. Does that somehow mean that my application runs in a compatibility mode? That sounds like a mode to support older applications, and I want mine to run in the regular mode! By the way, I have installed the latest stable drivers for my video card.
To be honest, I have no clue how to approach fixing this crash. What could cause it, and how can I fix it?
Update: I found a post on the GeForce Forums about my issue. Although there was no reply, the author was able to fix the problem by changing the order of two OpenGL calls:
Hi all,
After poking around with my application source code for a few hours, I found that calling the functions...
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, #)
glBindVertexArray(#)
...in that order causes the crash in nvoglv64.dll.
Reversing the order of these calls to...
glBindVertexArray(#)
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, #)
...prevents the crash and appears to be well-behaved.
Cheers,
Robert Graf
Since I do not use vertex arrays I cannot simply apply this fix, but there might be a similar issue. I will report my progress.
Update: I still have absolutely no clue how to solve my problem. I tried different video driver versions but it makes no difference. I completely rewrote the renderer using minimal shaders and simple forward rendering, but the crash still occurs at the first draw call.
In order to locate the crash, I noticed that all initialization (window, shaders, ...) succeeded and the error is at the draw call glDrawElements().
Most likely you had an out-of-bounds access in your code all the time, but the AMD Radeon Catalyst drivers reserved more address space, or caught it beforehand, and your Nvidia GeForce drivers don't.
Either you're passing glDrawElements a count that is larger than the number of elements to draw, or your index buffer contains values that index beyond the range of your vertex arrays. If it's the latter, then you're probably using client-side vertex arrays, as VBOs usually catch out-of-bounds accesses; also, those wouldn't crash your client-side program, but just render garbage.
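A quick way to test the latter is to scan the index data on the CPU before uploading/drawing and assert that nothing reaches past your vertex arrays; a small sketch, assuming 16-bit indices and a known vertexCount:

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Every index handed to glDrawElements(GL_TRIANGLES, indices.size(),
// GL_UNSIGNED_SHORT, ...) must address an existing vertex; anything equal to
// or beyond vertexCount reads outside your vertex arrays.
void validateIndices(const std::vector<std::uint16_t>& indices, std::size_t vertexCount)
{
    for (std::uint16_t idx : indices)
        assert(idx < vertexCount && "index points past the vertex arrays");
}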
Finally I came up with a solution to fix the crash.
The SFML framework I use to create the window (and more) provides a function to reset the OpenGL state of the context, and I was calling it right after window creation.
Even though I can't explain why, removing that function call solved the crash. Maybe it is because GLEW or something else isn't initialized yet at that moment.
sf::RenderWindow window;
window.create(sf::VideoMode(1024, 768), "Window Title");
window.resetGLStates(); // removing this line fixed the crash
window.setVerticalSyncEnabled(true);
The title is pretty straightforward: I can't get anything at all to run when building for x64, and I get a message box with this error code. Do you know what the problem may be here?
This is STATUS_INVALID_IMAGE_FORMAT; you can find these error codes listed in the ntstatus.h SDK header file.
It is certainly strongly correlated with building x64 code. You'll get this status code whenever your program has a dependency on 32-bit code, particularly in a DLL: your program will fail to start when it tries to load that DLL at startup, because a 64-bit process cannot contain any 32-bit code. The same applies the other way around, with a 32-bit process trying to load a 64-bit DLL.
Review all the dependencies of your program, particularly the import libraries you link against. Everything must be built to target x64. You can use SysInternals' ProcMon utility to find the DLL that fails to load, which is useful in case this is a DLL Hell problem.
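If you'd rather check a suspect DLL programmatically than with ProcMon, the target machine type can be read straight from its PE header; a small sketch (offsets per the PE/COFF layout, no Windows headers needed):

#include <cstdint>
#include <cstdio>
#include <fstream>

// Reads the COFF Machine field of a PE file: 0x8664 = x64, 0x014c = x86.
// A 64-bit process can only load DLLs that report 0x8664 here.
int main(int argc, char** argv)
{
    if (argc < 2) { std::printf("usage: %s <dll>\n", argv[0]); return 1; }

    std::ifstream f(argv[1], std::ios::binary);
    std::uint32_t peOffset = 0;
    f.seekg(0x3C);                                   // e_lfanew
    f.read(reinterpret_cast<char*>(&peOffset), 4);

    std::uint16_t machine = 0;
    f.seekg(peOffset + 4);                           // skip the "PE\0\0" signature
    f.read(reinterpret_cast<char*>(&machine), 2);

    std::printf("%s: machine = 0x%04x (%s)\n", argv[1], machine,
                machine == 0x8664 ? "x64" : machine == 0x014c ? "x86" : "other");
    return 0;
}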
Just an addition to the correct answer above: also check your .manifest files (resp. the #pragma comment(linker,"/manifestdependency...) entries) and make sure that you have processorArchitecture='x86' for 32-bit and processorArchitecture='amd64' for x64 code.
I'm running a small CUDA application: the QuickSort benchmark algorithm (see here). I have a dual-GPU system with an NVIDIA GTX 660 (device 0) and an 8600 GTS (device 1).
Under Windows 8 and Visual Studio, the application compiles and runs flawlessly on device 0. Under Linux (Ubuntu 12.04 LTS), the app compiles with nvcc and gcc but suddenly stops in its tracks, returning an "unspecified launch failure".
I have two issues:
1. After this error, my GPU cannot perform some other operations. For example, running the SDK example bandwidthTest blocks when it performs the first data transfer, while deviceQuery continues to work fine. How can I reset my GPU? I've already tried the cudaDeviceReset() method, but it doesn't help.
2. How can I find out what is going wrong under Linux? Does anyone have a clue or has seen this before?
Thanks in advance for your help!
Using the nvidia-smi utility, you can reset the GPU if it is a compatible model.
To my knowledge and experience, "unspecified launch failure" usually refers to a segmentation fault. Have you specified the right GPU to use? Try using cuda-memcheck to see if there is any out-of-bounds memory access.
In my experience, XID 31 is always caused by accessing a bad pointer (i.e., a memory access violation).
I'd pursue that trail first. Run your application under cuda-memcheck, like cuda-memcheck your_app args-to-your-app, and see if it finds any bad memory accesses.
Also try stepping through the code with cuda-gdb or Nsight Eclipse Edition.
I've found that using
cuda-memcheck -b ...
prevents the device from locking up.