Running perf record with Intel-PT event on compiled binaries from SPECCpu2006 crashes the server machine - linux

I am having a recurring problem when using perf with the Intel-PT event. I am profiling on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz machine, x86_64 architecture, with 32 hardware threads and virtualization enabled. I specifically use programs/source code from SpecCPU2006 for profiling.
Specifically, the first time I profile one of the compiled SpecCPU2006 binaries, everything works fine and the perf.data file is generated, as expected with Intel-PT. Since SpecCPU2006 programs are computationally intensive (they use 100% of the CPU at all times), the perf.data files are naturally large for most of the programs; I obtain roughly 7-10 GB perf.data files for most of the profiled programs.
However, when I try to profile the same compiled binary a second time, after the first run has completed successfully, my server machine freezes. Sometimes this happens on the third or fourth attempt instead (after the earlier runs completed successfully). The behavior is highly unpredictable, and once it happens I cannot profile any more binaries until I restart the machine.
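For context, the repeated runs look roughly like the following sketch (the binary name and workload arguments are placeholders, not the exact ones used; `intel_pt//` is perf's event syntax for the Intel PT PMU):

```shell
# Profile a SPEC CPU2006 binary with Intel PT (placeholder binary/args).
# Requires root or a permissive perf_event_paranoid setting.
perf record -e intel_pt// -o perf.data ./spec_binary spec_input.args

# A second run right after the first is what triggered the freeze:
perf record -e intel_pt// -o perf2.data ./spec_binary spec_input.args
```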
I have also posted the server error logs that appear once the machine stops responding.
Server error logs
Clearly there is an error message saying "Fixing recursive fault but reboot is needed!".
This happens particularly with SpecCPU2006 binaries that are large enough to take more than one minute to run without perf.
Is there any particular reason why this might happen? It should not be due to high CPU usage: running the programs without perf, or with perf using any other hardware event (as shown by perf list), completes successfully. This only seems to happen with Intel-PT.
Please guide me through the steps to solve this problem. Thanks.

It seems I have resolved this issue now, so I will post an answer.
The server crashed because of a NULL pointer dereference on a specific member of the perf_event structure; the member perf_event->handle was the culprit. This information, as suggested by @osgx, was obtained from the /var/log/syslog file. A portion of the error message was:
Apr 19 04:49:15 ###### kernel: [582411.404677] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ea
Apr 19 04:49:15 ###### kernel: [582411.404747] IP: [] perf_event_aux_event+0x2e/0xf0
One possible scenario where this structure member turns out to be NULL is if I start capturing packets before an earlier run of perf record has finished releasing all of its resources. This is handled properly in kernel version 4.10; I was using kernel version 4.4.
I upgraded my kernel to the newer version and it works fine now!
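To confirm which kernel is actually running before and after such an upgrade (the fix described above landed in 4.10):

```shell
# Print the running kernel release; anything older than 4.10 still has
# the perf_event AUX race described above.
uname -r
```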


Sysmgr volatile memory bug

As you may know, there is a bug with Nexus switches, SYSMGR-2-VOLATILE_DB_FULL, affecting versions below System version: 5.0(2)N2(1), that causes a switch to crash and reboot once the /dev/shm directory reaches 100% usage, unless the switch is updated to a later version.
In order to fill the directory you can run long commands such as "show run" (the output needs to be over 190 lines) and then check how usage increases by running:
show system internal flash
show system internal dir /dev/shm | i csm_acfg | count
I was wondering if there is a similar bug with 4500 switches?
Catalyst 4500 L3 Switch Software (cat4500es8-UNIVERSALK9-M), Version 03.11.00.E RELEASE SOFTWARE (fc3)
So what exactly happened...
I have a script that runs from time to time, collecting over 190 lines of output from all of our switches and performing some action remotely. Recently, a few minutes after the script ran, we had a massive outage: our core switch had a power outage (at least that is what I could see from the logs). The thing is, there are two 4500 chassis configured with SSO redundancy, so failover should have been instantaneous; however, everything was down for about 8 minutes before the standby switch became active.
Can anyone please advise if there is a similar bug with 4500 switches ?
Thank you.
After analysing the crash info I was able to find some things that caused the crash; however, I won't be able to say with 100% certainty what exactly crashed it.
There are a few errors called VFETQINTERRUPT and VFETQTOOMANYPARITYERRORS. Basically, VFETQINTERRUPT counts fast-accruing errors, and VFETQTOOMANYPARITYERRORS will crash-reboot the switch if it exceeds 100 errors in a short period of time, which could indicate a hardware error.
This is pretty much what happened in our environment: something caused 100+ errors, and the switch crash-rebooted.
There is a command to stop it from crash-rebooting; however, I'm not sure it should be used, since if there is a hardware issue it is better to fail over to the other supervisor:
platform fw-asic dbl hash memory parity-error reload never

"clocksource tsc unstable" shown when the linux kernel boots up

I am booting up a Linux kernel using a full-system simulator, and I'd like to run my benchmark on the booted system. However, during boot it shows the message "clocksource tsc unstable" and occasionally hangs at the beginning. Sometimes it lets me run my benchmark but then hangs in the middle: the application never finishes and seems to be stuck. Any idea how to fix this issue?
Thanks.
It suggests that the kernel didn't manage to calibrate the TSC (Time Stamp Counter) properly, i.e. the value is stale. This usually happens with VMs. The way to avoid it is to pass a predefined lpj (loops per jiffy) value as a kernel parameter (lpj=<value>). Try it; hopefully the issue will be fixed!
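As a sketch of how that value can be obtained (the boot-log line below is an assumed example, not taken from the question): grab the lpj value printed during a successful delay-loop calibration and pass it back on the kernel command line.

```shell
# Assumed example of the calibration line the kernel prints at boot:
line="Calibrating delay loop... 3194.88 BogoMIPS (lpj=1597440)"

# Pull out the lpj value so it can be reused as a boot parameter:
lpj=$(echo "$line" | grep -o 'lpj=[0-9]*' | cut -d= -f2)

# Append it to the kernel command line, e.g. in the bootloader config:
echo "append to kernel cmdline: lpj=$lpj"
# prints: append to kernel cmdline: lpj=1597440
```

With lpj= set, the kernel skips the calibration that depends on a stable TSC, which is exactly what an inaccurate simulator clock breaks.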

CUDA device seems to be blocked

I'm running a small CUDA application: the QuickSort benchmark algorithm (see here). I have a dual system with a NVIDIA 660GTX (device 0) and 8600GTS (device 1).
Under Windows 8 with Visual Studio, the application compiles and runs flawlessly on device 0. Under Linux (Ubuntu 12.04 LTS), the app compiles with nvcc and gcc but suddenly stops in its tracks, returning an "unspecified launch failure".
I have two issues:
After this error, my GPU cannot perform some other operations. For example, running the SDK example bandwidthTest blocks when it performs the first data transfer, although deviceQuery still runs fine. How can I reset my GPU? I've already tried the cudaDeviceReset() method, but it doesn't help.
How can I find out what is going wrong under Linux? Does anyone have a clue, or has anyone seen this before?
Thanks in advance for your help!
Using the nvidia-smi utility you can reset the GPU, if it is compatible.
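A hedged sketch of that reset (device index 0 is an assumption; GPU reset only works on supported boards, requires root, and the device must be idle):

```shell
# Show which GPUs nvidia-smi can see and their current state:
nvidia-smi
# Reset GPU 0 (supported boards only; requires root, device must be idle):
nvidia-smi --gpu-reset -i 0
```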
To my knowledge and experience, "unspecified launch failure" usually refers to a segmentation fault. Have you specified the right GPU to use? Try cuda-memcheck to see if there is any out-of-bounds memory access.
In my experience, XID 31 was always caused by accessing a bad pointer (i.e., a memory access violation).
I'd pursue this trail first. Run your application under cuda-memcheck, like this: cuda-memcheck your_app args, and see if it finds any wrong memory accesses.
Also try stepping through the code with cuda-gdb or Nsight Eclipse Edition.
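Putting those suggestions together as a sketch (`./quicksort` is an assumed binary name, not confirmed by the question; for useful cuda-gdb stepping the app should be built with `nvcc -g -G`):

```shell
# Check for out-of-bounds or misaligned device memory accesses:
cuda-memcheck ./quicksort args

# Step through device code interactively (build with: nvcc -g -G ...):
cuda-gdb --args ./quicksort args
```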
I've found that using
cuda-memcheck -b ...
prevents the device from locking up.

Performance counters collection - Linux

model name : Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz
perf version 3.2.33
python 2.7
OS: Linux. Distro: Ubuntu 12.10
I am trying to collect per-thread performance counters using perfmon.
These are the events that I am trying to collect: perf_events=['INSTRUCTIONS_RETIRED','L2_LINES_IN','LLC_MISSES']
I am only able to collect the data sometimes; the behavior is unpredictable.
I am trying to run the PARSEC CANNEAL benchmark thread.
File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 44, in apport_excepthook
This is the error I keep getting sometimes. I tried googling for scenarios where people had this kind of issue.
Google keywords: "perfmon perthread session traces unpredictable" | perfmon events collection intel unpredictable
I could also post snippets of the code, if necessary.
One more interesting thing I notice: if I run the same thread twice, without killing the previous thread, I am able to collect the data every time.
Does that by any chance mean that I don't have enough IPS to report in the first instance? That can't be the case, because top shows that the core is at 100%.
Thanks

How to test the kernel for kernel panics?

I am testing the Linux kernel on an embedded device and would like to find situations/scenarios in which the Linux kernel would panic.
Can you suggest some test steps (manual or code automated) to create Kernel panics?
There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool to try to exercise the filesystem code extensively; it's good for testing drivers, block io and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed with crashing systems in particular, but it may go a long way towards helping you and your team keep everything working as planned. (Note that the LTP isn't proscriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it using a variety of scanning options: -sV --version-all will try to find the versions of all running services (this can be stressful for the target), and -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing the OS from the responses.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, they'll spot bugs you'd never even think to look for. :)
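The nmap scans mentioned above look like this in practice (192.0.2.10 is a placeholder address; OS fingerprinting with -O needs root):

```shell
# Aggressive service/version probing (can be stressful for the target):
nmap -sV --version-all 192.0.2.10

# OS detection by sending unusual packets and matching the responses:
sudo nmap -O --osscan-guess 192.0.2.10
```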
You can try the following key combination
Alt + SysRq + c
or
echo c >/proc/sysrq-trigger
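Note that the trigger only works if the SysRq interface is enabled; a sketch (this intentionally panics the kernel, so only run it on a disposable test device):

```shell
# WARNING: the last line deliberately crashes the kernel.
# Enable the magic SysRq interface (1 = all functions allowed):
echo 1 > /proc/sys/kernel/sysrq

# Trigger an immediate crash/panic:
echo c > /proc/sysrq-trigger
```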
Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine if sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and if you don't have the mprotect enabled in the source code crashme.c for your platform, then Crashme is impotent.
It seems that operating systems on x64 architectures tend to have execution disabled for data segments. Recently I updated the crashme.c on http://crashme.codeplex.com/ to use mprotect in the case of __APPLE__ and tested it on a MacBook Pro running Mac OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated CentOS and FreeBSD support soon.
