Code works on Windows but fails with launch failures on Linux

First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem, but a configuration problem.
I have a piece of code that does some mathematics in CUDA kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and a Linux machine (Ubuntu 17.04, 2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. The kernels are long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces execution on a single GPU, the first one, selected with cudaSetDevice(0).
When I copy the same input data and code to the Linux machine (I am using git, so it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
If I change the kernel so that, instead of doing the math, it just writes a number to the output, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that the problem lies outside the code itself, but not in the general configuration of the drivers or environment variables.
I read that xorg.conf can have an effect on the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from it, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still error.
My question is: what else should I look at? What Linux-specific configuration should I check to pinpoint the cause of the kernel halts?

The error did indeed end up being an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is platform-specific: my Linux machine returns 8 while my Windows machine returns 4. As this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines the sizes of variables in bits (such as uint32(1)), there was a mismatch on the Linux machine when doing memcpys. It turns out that this happened in a variable that is an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, causing an illegal memory access.
Too specific? Yeah.
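To illustrate the size mismatch described above, here is a minimal sketch in plain C. It assumes the indices arrive as 32-bit values (as MATLAB's uint32 would provide) and that the faulty copy computed its byte count with sizeof(unsigned long). The function names copy_indices and buggy_byte_count are hypothetical and not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `count` 32-bit indices (e.g. MATLAB uint32 data).
   The byte count is derived from the fixed-width type, so it is 4 * count
   on every platform. */
void copy_indices(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, count * sizeof(uint32_t));
}

/* Byte count the assumed buggy version would use. sizeof(unsigned long) is
   4 on 64-bit Windows (LLP64) and 8 on 64-bit Linux (LP64), so on Linux this
   is twice the size of the actual 32-bit data. */
size_t buggy_byte_count(size_t count)
{
    return count * sizeof(unsigned long);
}
```

Using uint32_t (or whichever fixed-width type matches the caller's data) on both sides of the copy keeps the byte count identical on Windows and Linux.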

Related

CPU/Threads usage on M1 Pro (Apple Silicon) using OpenMP

hope someone knows the answer to this...
I have code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all of the 8 or 10 cores I have.
I am setting the thread count with export OMP_NUM_THREADS=10, and the code correctly identifies that it is supposed to be running with 10 threads (see the screenshot from my Activity Monitor below):
[Image: Activity Monitor screenshot]
The screenshot shows that the code is compiled for Apple Silicon and uses 10 threads, but uses little of the available CPU.
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but long for a comment...
If both LLVM and GCC behave the same then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads have been created). I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface to bind threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores. That could mean that statically scheduled loops are inefficient.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...
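On the point about statically scheduled loops and mixed fast/slow cores, the following is a small sketch of a dynamically scheduled loop in C. It is an illustration only: the function sum_squares and the chunk size of 1024 are made up for the example, and whether dynamic scheduling actually helps the libsharp-based code in the question is not something the thread establishes.

```c
#include <assert.h>

/* Sum of i*i for i in [0, n). With OpenMP enabled at compile time, chunks of
   1024 iterations are handed out on demand, so a faster core simply takes
   more chunks than a slower one. Without OpenMP the pragma is ignored and
   the loop runs serially with the same result. */
long long sum_squares(int n)
{
    assert(n >= 0);
    long long total = 0;
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += (long long)i * i;
    }
    return total;
}
```

With schedule(static), each thread would get a fixed share of the iterations up front, so the loop finishes only when the slowest core completes its share.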

CUDA device seems to be blocked

I'm running a small CUDA application: the QuickSort benchmark algorithm (see here). I have a dual-GPU system with an NVIDIA 660GTX (device 0) and an 8600GTS (device 1).
Under Windows 8 and Visual Studio, the application compiles and runs flawlessly on device 0. Under Linux (Ubuntu 12.04 LTS), the app compiles with nvcc and gcc but suddenly stops in its tracks, returning an "unspecified launch failure".
I have two issues:
1) After this error, my GPU cannot perform some other operations. For example, the SDK example bandwidthTest blocks when it performs the first data transfer, while deviceQuery continues to work. How can I reset my GPU? I've already tried cudaDeviceReset(), but it doesn't help.
2) How can I find out what is going wrong under Linux? Does anyone have a clue, or has anyone seen this before?
Thanks in advance for your help!
Using the nvidia-smi utility, you can reset the GPU if it is compatible.
To my knowledge and experience, "unspecified launch failure" usually refers to a segmentation fault. Have you specified the right GPU to use? Try cuda-memcheck to see whether there is any out-of-bounds memory access.
In my experience, XID 31 was always caused by accessing a bad pointer (a memory access violation).
I'd pursue this trail first. Run your application under cuda-memcheck, like this: cuda-memcheck your_app <args to your app>, and see whether it finds any invalid memory accesses.
Also try stepping through the code with cuda-gdb or Nsight Eclipse Edition.
I've found that using
cuda-memcheck -b ...
prevents the device from locking up.

Monitoring the instructions of a running program in ubuntu?

I'm a little stuck here.
The idea is that I'd like to get a file listing every instruction run by a program during its execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occurring on what address, and when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure that if I can obtain each instruction in the order it is executed on system 1, I can attempt to simulate/model the run time of an identical program running on system 2, because I'll be able to predict (although I know not with 100% accuracy) L1/L2 cache misses, L1/L2 cache hits, TLB hits/misses, page faults, time taken by floating-point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OSes as necessary, but would prefer to stay with Ubuntu, if possible.)
Is this possible, and how could I go about doing it? I know that debuggers let you step through everything, but I don't have the source available.
I think you can use qemu (or even bochs) or valgrind to monitor every executed instruction. They are x86 binary translation tools (except bochs, which is an interpreter of x86 code). There is a valgrind tool called cachegrind (plus the kcachegrind GUI), which can emulate the cache by instrumenting every memory access and simulating an L1/L2 cache model (sizes may be configured via command-line options).
To go deeper (into the pipeline), you may want to look at the free PTLsim (http://www.ptlsim.org/).

Address space identifiers using qemu for i386 linux kernel

Friends, I am working on an in-house architectural simulator which is used to simulate the timing effects of code running with different architectural parameters such as core, memory hierarchy and interconnects.
I am working on a module that takes the actual trace of a running program from an emulator like "PinTool" or "qemu-linux-user" and feeds this trace to the simulator.
Till now my approach was like this :
1) take objdump of a binary executable and parse this information.
2) Now the emulator has to just feed me an instruction-pointer and other info like load-address/store-address.
Such approaches work only if the program content is known.
But now I have been trying to take traces of an executable running on top of a standard Linux kernel. The problem now is that the base kernel image does not contain the code for LKMs (Loadable Kernel Modules). Also, the daemons are not known when starting a kernel.
So, my approach to this solution is :
1) use qemu to emulate a machine.
2) When an instruction is encountered for the first time, I will parse it and save this info for later.
3) create a helper function which sends the ip and load/store address when an instruction is executed.
I am stuck at step 2. How do I differentiate between different processes from qemu, which is just an emulator and does not know anything about the guest OS?
I can modify the scheduler of the guest OS but I am really not able to figure out the way forward.
Sorry if the question is very lengthy. I know I could have abstracted some part but felt that some part of it gives an explanation of the context of the problem.
In the first case, using qemu-linux-user to perform user mode emulation of a single program, the task is quite easy because the memory is linear and there is no virtual memory involved in the emulator. The second case of whole system emulation is a lot more complex, because you basically have to parse the addresses out of the kernel structures.
If you can get the virtual addresses directly out of QEMU, your job is a bit easier; then you just need to identify the process, and everything else functions just like in the single-process case. You might be able to get the PID by faking a system call to getpid().
Otherwise, this all seems quite a bit similar to debugging a system from a physical memory dump. There are some tools for this task. They are probably too slow to run for every instruction, though, but you can look for hints there.
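Once a process can be identified, the "parse an instruction only the first time it is seen" step from the question can be keyed on the pair (process identifier, instruction pointer). The sketch below is one possible shape for that lookup, not the asker's simulator code. The names trace_entry and trace_first_seen are hypothetical, asid stands for whatever value ends up identifying the guest process, and the table size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 1024u  /* illustrative size; must be a power of two */

typedef struct {
    uint64_t asid;  /* hypothetical identifier of the guest process */
    uint64_t ip;    /* guest virtual instruction pointer */
    int      used;  /* 0 while the slot is empty */
} trace_entry;

static trace_entry cache[CACHE_SLOTS];

/* Mix both keys so that equal ips in different processes spread out. */
static size_t slot_for(uint64_t asid, uint64_t ip)
{
    uint64_t h = (asid * 0x9E3779B97F4A7C15ull) ^ ip;
    h ^= h >> 29;
    return (size_t)(h & (CACHE_SLOTS - 1u));
}

/* Returns 1 if (asid, ip) is not in the table, meaning the caller should
   decode the instruction now; returns 0 if it is already recorded.
   The table is direct-mapped: a colliding entry is overwritten, so an
   evicted pair is reported as "first seen" again later. */
int trace_first_seen(uint64_t asid, uint64_t ip)
{
    trace_entry *e = &cache[slot_for(asid, ip)];
    if (e->used && e->asid == asid && e->ip == ip) {
        return 0;
    }
    e->used = 1;
    e->asid = asid;
    e->ip = ip;
    return 1;
}
```

The same instruction pointer seen under two different identifiers is treated as two distinct instructions, which is the property needed when several guest processes share virtual address ranges.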

How to test the kernel for kernel panics?

I am testing the Linux Kernel on an embedded device and would like to find situations / scenarios in which Linux Kernel would issue panics.
Can you suggest some test steps (manual or code automated) to create Kernel panics?
There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool to try to exercise the filesystem code extensively; it's good for testing drivers, block io and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed with crashing systems in particular in mind, but it may go a long way towards helping you and your team keep everything working as planned. (Note that the LTP isn't prescriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it, using a variety of scanning options: -sV --version-all will try to find versions of all services running (this can be stressful), -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing by responses what the output is.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, they'll spot bugs you'd never even think to look for. :)
You can try the following key combination:
SysRq + c
or
echo c >/proc/sysrq-trigger
Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine if sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and if you don't have the mprotect enabled in the source code crashme.c for your platform, then Crashme is impotent.
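To make the mprotect point concrete, here is a minimal sketch of marking a filled buffer as executable. It is not taken from crashme.c. It uses mmap rather than malloc so that the region is page-aligned (mprotect requires a page-aligned address), and the function name alloc_executable_page is hypothetical. The sketch only changes the protection; it does not jump into the buffer.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy `len` bytes into a fresh page, then switch the page from
   read+write to read+execute. Returns the page, or NULL on failure.
   The caller releases it with munmap(page, page_size). */
void *alloc_executable_page(const unsigned char *bytes, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (bytes == NULL || page <= 0 || len > (size_t)page) {
        return NULL;
    }
    void *mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return NULL;
    }
    memcpy(mem, bytes, len);
    /* Without this call, a system that marks data pages non-executable
       would fault as soon as control is transferred into the buffer. */
    if (mprotect(mem, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, (size_t)page);
        return NULL;
    }
    return mem;
}
```

Some hardened systems refuse PROT_EXEC on anonymous memory, in which case the function returns NULL; that is the situation in which a tool relying on executing generated bytes cannot do its job.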
It seems that operating systems on x64 architectures tend to have execution turned off for data segments. Recently I updated crashme.c at http://crashme.codeplex.com/ to use mprotect in the case of __APPLE__ and tested it on a MacBook Pro running Mac OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated CentOS and FreeBSD support soon.
