select() inside infinite loop uses significantly more CPU on RHEL 4.8 virtual machine than on a Solaris 10 machine

I have a daemon app written in C that is currently running with no known issues on a Solaris 10 machine. I am in the process of porting it over to Linux and have had to make minimal changes. During testing it passes all test cases, and there are no issues with its functionality. However, when I view its CPU usage while 'idle' on my Solaris machine, it uses around 0.03% CPU. On the virtual machine running Red Hat Enterprise Linux 4.8, that same process uses all available CPU (usually somewhere in the 90%+ range).
My first thought was that something must be wrong with the event loop. The event loop is an infinite loop (while(1)) with a call to select(). The timeval is set up so that timeval.tv_sec = 0 and timeval.tv_usec = 1000. This seems reasonable enough for what the process is doing. As a test I bumped timeval.tv_sec to 1. Even after doing that I saw the same issue.
Is there something I am missing about how select() works on Linux vs. Unix? Or does it work differently with an OS running on a virtual machine? Or maybe there is something else I am missing entirely?
One more thing: I am not sure which version of VMware Server is being used. It was just updated about a month ago, though.

I believe that Linux returns the remaining time by writing it into the time parameter of the select() call, while Solaris does not. That means that a programmer who isn't aware of the POSIX spec might not reset the time parameter between calls to select().
This would result in the first call having a 1000 usec timeout and all subsequent calls using a 0 usec timeout.
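For reference, here is a minimal sketch of the fix (illustrative code, not the original daemon): the timeval is reinitialized on every iteration, since Linux may overwrite it with the remaining time.

#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/* Minimal sketch: reset the timeout before every select() call,
 * because on Linux select() may modify it in place. */
void event_loop(int fd)
{
    while (1) {
        fd_set rfds;
        struct timeval tv;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        tv.tv_sec = 0;       /* re-initialize on every iteration */
        tv.tv_usec = 1000;

        int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready < 0) {
            perror("select");
            break;
        }
        if (ready > 0 && FD_ISSET(fd, &rfds)) {
            /* service the descriptor here */
        }
    }
}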

As Zan Lynx said, the timeval is modified by select() on Linux, so you should reassign the correct value before each select() call. I also suggest checking whether some of the file descriptors are in a particular state (e.g. end of file, peer connection closed...). Maybe the porting is exposing some latent bug in the analysis of the returned values (FD_ISSET and so on). It happened to me too some years ago when porting a select-driven loop: I was using the returned value in the wrong way, and a closed fd was added to the rd_set, causing select() to fail. On the old platform the wrong fd happened to have a value higher than maxfd, so it was ignored. Because of the same bug, the program didn't recognize the select failure (select() == -1) and looped forever.
Bye!

Related

CPU/Threads usage on M1 Pro (Apple Silicon) using openMP

hope someone knows the answer to this...
I have code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all of the 8 or 10 cores I have.
I am setting the threads variable correctly with export OMP_NUM_THREADS=10, such that the code correctly identifies it's supposed to be running with 10 threads (see the image below showing a screenshot from my Activity Monitor):
[Activity Monitor screenshot]
The screenshot shows that the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but too long for a comment...
If both LLVM and GCC behave the same then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads have been created). I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface to bind threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores. That could mean that statically scheduled loops are inefficient (see the sketch after this answer).
I'm also confused by the fact that you appear to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...
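To illustrate the scheduling point above, here is a hedged sketch (generic OpenMP C, not the poster's code): switching from the default static schedule to a dynamic one lets the performance cores pick up extra chunks instead of waiting for the efficiency cores to finish their fixed share.

#include <omp.h>
#include <stdio.h>

/* Sketch only: on a chip with fast and slow cores, a statically
 * scheduled loop finishes when the slowest core finishes its fixed
 * share. schedule(dynamic) hands out small chunks on demand. */
int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for schedule(dynamic, 1024) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (double)i * 1e-6;   /* stand-in for the real work */

    printf("max threads=%d, sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}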

Code working on Windows but launch failures on Linux

First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem, but a configuration problem.
I have a piece of code that does some mathematics in CUDA kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and an Ubuntu 17.04 machine (2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. It is long (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces everything to run on one GPU, the first one, selected with cudaSetDevice(0).
When I copy the same input data and code to the Linux machine (I am using git; it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
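For context, the post-launch error checking referred to here is roughly of this form (a generic sketch, not the actual code; the kernel, sizes and launch configuration are placeholders):

#include <stdio.h>
#include <cuda_runtime.h>

/* Placeholder kernel standing in for the real math kernel. */
__global__ void myKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)i;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_out = NULL;

    cudaSetDevice(0);                        /* force the first GPU, as in the question */
    cudaMalloc((void **)&d_out, n * sizeof(float));

    myKernel<<<(n + 255) / 256, 256>>>(d_out, n);

    cudaError_t err = cudaGetLastError();    /* launch errors */
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();           /* asynchronous errors, e.g.
                                                "an illegal memory access was encountered" */
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}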
If I change the kernel so that, instead of doing the math, it just writes a number to the output, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that there is a problem outside the code, not with the code itself, nor with the general configuration of the drivers/environment variables.
I read that xorg.conf can have an effect on the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from there, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still error out.
My question is: what else should I look at? What Linux-specific configuration should I check to pinpoint the cause of the kernel halts?
The error ended up being, indeed, an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is machine specific: my Linux machine returns 8 while my Windows machine returns 4. As this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines the sizes of variables in bits (such as uint32(1)), there was a mismatch on the Linux machine when doing memcpys. It turns out that this happened in a variable that is an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, causing an illegal memory access.
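To illustrate the mismatch (a generic sketch, not the original code): on LP64 Linux sizeof(unsigned long) is 8, while on 64-bit Windows it is 4, so any memcpy sized with it copies the wrong number of bytes for a 32-bit MATLAB array; fixed-width types avoid the problem.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t host_index[4] = {1, 2, 3, 4};          /* e.g. data arriving from MATLAB as uint32 */
    unsigned char staging[4 * sizeof(uint32_t)];

    /* Wrong on LP64 Linux: sizeof(unsigned long) == 8 there, so this
     * would copy 32 bytes from a 16-byte array and read garbage.     */
    /* memcpy(staging, host_index, 4 * sizeof(unsigned long)); */

    /* Portable: the element size matches the declared 32-bit type.   */
    memcpy(staging, host_index, 4 * sizeof(uint32_t));

    printf("unsigned long is %zu bytes here, uint32_t is %zu\n",
           sizeof(unsigned long), sizeof(uint32_t));
    return 0;
}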
Too specific? yeah.

Linux Suspend To RAM from idle loop

I have a question regarding STR (Suspend To RAM) in the Linux kernel.
I am working on a small embedded Linux (kernel 3.4.22) and I want to implement a mechanism that will put the system to sleep (suspend to RAM) while it has nothing to do.
This is done in order to save power.
The HW supports RAM self-refresh, meaning its contents will persist.
And I'll take care of all the remaining things that need to be done (e.g. saving the CPU context, etc.).
I want to trigger the Kernel PM (power management) subsystem from within the idle loop.
When the system has nothing to do, it should go into sleep.
The HW also supports a way to wake up the system.
Doing some research, I found out that Linux gives user space an option to switch to STR by writing "mem" to /sys/power/state (echo "mem" > /sys/power/state).
This will trigger the PM subsystem and will perform the relevant callbacks.
My questions are:
Is there any other standard alternative to go into STR besides writing to the sysfs file above?
Has anyone tried to put the system into STR from the idle loop code?
Thanks,
Why would you need another method? Linux treats everything as a file. Is it any surprise that the contents of a pseudo-file dictate the state of the system? Check for yourself: pm-utils is a popular tool set for managing the state of the system, and all of its commands are just writes to /sys files.
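For illustration, doing from C exactly what the echo command does (a minimal sketch; it needs root privileges and a kernel built with suspend support):

#include <stdio.h>

/* Minimal sketch: request suspend-to-RAM the same way
 * `echo mem > /sys/power/state` does. Execution resumes
 * past the close once the system wakes up again. */
int suspend_to_ram(void)
{
    FILE *f = fopen("/sys/power/state", "w");
    if (!f) {
        perror("fopen /sys/power/state");
        return -1;
    }
    if (fputs("mem", f) < 0) {
        perror("write /sys/power/state");
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0) {    /* the write is flushed here; suspend happens now */
        perror("close /sys/power/state");
        return -1;
    }
    return 0;
}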
This policy is actually platform dependent. You would have to look at the cpuidle driver for your platform to understand what it is doing. For example, on Atmel platforms it uses both RAM self-refresh and WFI.

How can a large number of assignments to the same array cause a pyopencl.LogicError when run on GPU?

I'm using pyOpenCL to do some complex calculations.
It runs fine on CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB).
I'm working on Mac OS X Lion (10.7.5)
The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in more iterations), but only when run on the GPU.
I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.
I simplified my OpenCL code as much as possible, and from what was left created some very simple code with extremely weird behavior that causes the pyopencl.LogicError. It consists of 2 nested loops in which a couple of assignments are made to the result array. These assignments need not even depend on the state of the loop.
This is run on a single thread (or work item, shape = (1,)) on the GPU.
__kernel void weirdError(__global unsigned int* result){
    unsigned int outer = (1<<30)-1;
    for(int i=20; i--; ){
        unsigned int inner = 0;
        while(inner != outer){
            result[0] = 1248;
            result[1] = 1337;
            inner++;
        }
        outer++;
    }
}
The strange part is that removing either one of the assignments to the result array removes the error. Also, decreasing the initial value for outer (down to (1<<20)-1 for example) also removes the error. In these cases, the code returns normally, with the correct result available in the corresponding buffer.
On CPU, it never raises an error.
The OpenCL code is run from Python using PyOpenCL.
Nothing fancy in the setup:
platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)
In this Python code I set result_buf to 0, then I run the OpenCL calculation that sets its values in a large iteration. Afterwards I try to collect the values from device memory, but that's where it goes wrong:
result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)
shape = (1,)
program.weirdError(queue, shape, None, result_buf)
cl.enqueue_copy(queue, result, result_buf)
The last line gives me:
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
How can this repeated assignment cause an error?
And more importantly: how can it be avoided?
I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.
DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.
My advice is to avoid such long loops inside a kernel. The work item is doing over 1 billion iterations, and that's a long shot. Most probably the driver kills your kernel because it takes too much time to execute. Reduce the number of iterations to the maximal amount that doesn't lead to an error and look at the execution time. If it takes something like seconds, that's too much.
As you said, reducing the number of iterations solves the problem, and that's the evidence in my opinion. Reducing the number of assignment operations also makes the kernel run faster, as IO operations are usually the slowest.
CPU doesn't face such difficulties for obvious reasons.
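One way to act on the "keep kernels short" advice, sketched in OpenCL C (not the original kernel, and the chunking parameters are made up): give the kernel an explicit slice of the iteration range and enqueue it several times from the host, so no single launch runs long enough to trip the display watchdog.

/* Sketch: each launch covers [start, end) instead of the whole
 * ~2^30 iteration range; the host enqueues this kernel repeatedly
 * with increasing offsets and a bounded chunk size. */
__kernel void weirdErrorChunked(__global unsigned int* result,
                                unsigned int start,
                                unsigned int end)
{
    for (unsigned int inner = start; inner != end; inner++) {
        result[0] = 1248;
        result[1] = 1337;
    }
}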
This timeout problem can be fixed in Windows and Linux, but apparently not in Mac.
Windows
This answer to a similar question (explaining the symptoms in Windows) tells both what is going on and how to fix it:
This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).
Your options are:
1. You need to make sure that your kernels are not too time-consuming (best).
2. You can turn off the watchdog timer: Timeout Detection and Recovery of GPUs.
3. You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.
This answer explains how to actually do (2) on Windows 7. But the MSDN page for these registry keys mentions that they should not be manipulated by any applications outside targeted testing or debugging. So it might not be the best option, but it is an option.
Linux
(From Cuda Release Notes, but also applicable to OpenCL)
GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display.
While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.
Mac
Apple apparently does not allow fiddling with this watchdog, and thus the only option seems to be using a second GPU (without a screen attached to it).

How to debug ARM Linux kernel (msleep()) lock up?

I am first of all looking for debugging tips. If someone can point out the one line of code to change or the one peripheral config bit to set to fix the problem, that would be terrific. But that's not what I'm hoping for; I'm looking more for how to go about debugging it.
Googling "msleep hang linux kernel site:stackoverflow.com" yields 13 answers and none is on the point, so I think I'm safe to ask.
I rebuilt an ARM Linux kernel for an embedded TI AM1808 ARM processor (Sitara/DaVinci?). I see all of the boot log up to the login: prompt coming out of the serial port, but trying to log in gets no response; it doesn't even echo what I typed.
After lots of debugging I arrived at the kernel and added debugging code between lines 828 and 830 (yes, the kernel version is 2.6.37). This is at the point in kernel mode right before '/sbin/init' is called:
http://lxr.linux.no/linux+v2.6.37/init/main.c#L815
Right before line 830 I added a forever loop with a printk, and I see the results. I have let it run for about a couple of hours and it counts to about 2 million. Sample line:
dbg:init/main.c:1202: 2088430
So it has spit out 60 million bytes without problem.
However, if I add msleep(1000) in the loop, it prints only once, i.e. msleep() does not return.
Details:
Adding a conditional printk at line 4073 in the scheduler, conditioned on a flag that gets set at the start of the forever test loop described above, shows that schedule() is no longer called when it hangs:
http://lxr.linux.no/linux+v2.6.37/kernel/sched.c#L4064
The only selections under .config/'Device Drivers' are:
Block devices
I2C support
SPI support
The kernel and its ramdisk are loaded using uboot/TFTP.
I don't believe it tries to use the Ethernet.
Since all of this happens before '/sbin/init', very little should be happening.
More details:
I have a very similar board with the same CPU. I can run the same uImage and the same ramdisk and it works fine there. I can login and do the usual things.
I have run a memory test (64 MB total; limit the kernel to 32 MB and test the other 32 MB; it's a single-chip DDR2) and found no problem.
One board uses UART0, and the other UART2, but boot log comes out of both so it should not be the problem.
Any debugging tips are greatly appreciated.
I don't have an appropriate JTAG so I can't use that.
If msleep doesn't return or doesn't make it to schedule, then in order to debug we can follow the call stack.
msleep() calls schedule_timeout_uninterruptible(timeout), which calls schedule_timeout(timeout), which in the default case exits without calling schedule() if the timeout in jiffies passed to it is < 0, so that is one thing to check.
If the timeout is positive, then setup_timer_on_stack(&timer, process_timeout, (unsigned long)current); is called, followed by __mod_timer(&timer, expire, false, TIMER_NOT_PINNED); before calling schedule().
If we aren't getting to schedule then something must be happening in either setup_timer_on_stack or __mod_timer.
The call trace for setup_timer_on_stack is: setup_timer_on_stack calls setup_timer_on_stack_key, which calls init_timer_on_stack_key, which is either external if CONFIG_DEBUG_OBJECTS_TIMERS is enabled, or calls init_timer_key(timer, name, key), which calls debug_init followed by __init_timer(timer, name, key).
__mod_timer first calls timer_stats_timer_set_start_info(timer); then a whole lot of other function calls.
I would advise starting by putting a printk or two in schedule_timeout, probably on either side of the setup_timer_on_stack call or on either side of the __mod_timer call.
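Roughly what that instrumentation could look like inside schedule_timeout() in a 2.6.37-era kernel/timer.c (a sketch; exact placement depends on the tree being used):

/* Excerpt of schedule_timeout() with the suggested printk probes added. */
expire = timeout + jiffies;

setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
printk(KERN_ERR "dbg: schedule_timeout: timer set up, expire=%lu\n", expire);

__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
printk(KERN_ERR "dbg: schedule_timeout: __mod_timer done, calling schedule()\n");

schedule();
printk(KERN_ERR "dbg: schedule_timeout: back from schedule()\n");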
This problem has been solved.
With liberal use of printk it was determined that schedule() does indeed switch to another task, the idle task. In this instance, being an embedded Linux, the original code base I copied from installed an idle task. That idle task is apparently not appropriate for my board; it locked up the CPU and thus caused the crash. Commenting out the call to the idle task
http://lxr.linux.no/linux+v2.6.37/arch/arm/mach-davinci/cpuidle.c#L93
works around the problem.
