Unable to inject errors with Einj (mce-test, ras-tools) - linux

I want to inject memory errors on my system to check whether the RAS/EDAC subsystem really works and logs errors on my memory (during boot or at runtime). I came across many tools, but I don't know which one to actually trust. The machine I want to test is a Sandy Bridge machine running Linux kernel 5.15.0-58-generic. Specifically, I want to test my system with the EINJ tool (https://docs.kernel.org/firmware-guide/acpi/apei/einj.html). Although I followed the earlier steps in the link (the BIOS supports EINJ, and the CONFIG_DEBUG_FS, CONFIG_ACPI_APEI and CONFIG_ACPI_APEI_EINJ config parameters are set on my kernel), the files mentioned in the document (/sys/kernel/debug/apei/einj etc.) are not present. How can I proceed with this tool? Or is there a better way/tool to inject memory errors to check the EDAC subsystem?
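A minimal sketch of how the usual causes of a missing einj directory could be checked: no EINJ ACPI table from the firmware, debugfs not mounted, or the einj module not loaded. The function name and its `root` parameter are hypothetical and exist only for illustration; the sketch assumes debugfs lives at its conventional mount point, /sys/kernel/debug.

```python
from pathlib import Path


def einj_diagnostics(root="/"):
    """List likely reasons why <debugfs>/apei/einj is missing.

    `root` is a hypothetical parameter that only exists so the check can be
    pointed at a fake directory tree; on a real machine leave it as "/" and
    run as root, because debugfs is normally readable by root only.
    """
    root = Path(root)
    findings = []

    # Linux exposes the ACPI tables provided by the firmware under this path.
    if not (root / "sys/firmware/acpi/tables/EINJ").exists():
        findings.append("no EINJ ACPI table: injection may be disabled in the BIOS")

    # The third field of each /proc/mounts line is the filesystem type.
    mounts = root / "proc/mounts"
    lines = mounts.read_text().splitlines() if mounts.exists() else []
    mounted = any(len(f) >= 3 and f[2] == "debugfs" for f in map(str.split, lines))

    if not mounted:
        findings.append("debugfs not mounted: mount -t debugfs none /sys/kernel/debug")
    elif not (root / "sys/kernel/debug/apei/einj").is_dir():
        findings.append("einj directory missing: if CONFIG_ACPI_APEI_EINJ=m, run modprobe einj")
    return findings
```

Calling `einj_diagnostics()` as root and printing each returned string gives a short checklist; an empty list means the three prerequisites checked here look fine and the problem lies elsewhere.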

Related

How do they debug the Linux kernel core?

Nowadays debugging has become so advanced that even the core kernel source code can be debugged in a virtual environment.
But after reading a couple of blogs about core kernel development, it was not clear to me whether the developers actually debug in a virtual environment.
They mention that they rely on printing messages rather than on a debugging tool, at least for core components.
So I would like to ask Linux kernel experts: what is good practice when debugging the kernel?
I've tried multiple approaches when trying to debug the kernel.
Sometimes, the easiest way is to just add a few printk statements based on my own conditional values, monitor the serial log and see what's going on. It's especially useful when the function in question is invoked quite often, but you are interested only in a subset of those invocations.
QEMU GDB debugging. I have a buildroot filesystem set up. This means the kernel is lean and it boots up really fast. I start qemu with the -s -S flags, and attach gdb with target remote :1234. Additionally, there aren't very many userspace processes in this setup, so it's easier to debug the kernel.
VMware stub. Assuming you are running an Ubuntu VM, it is possible to attach gdb to a VMware stub and debug the kernel. Personally, I have never had to pursue this route, but I look forward to trying it out someday.
If you have a kernel for a device that gets stuck in a bootloop and it does not print any debug information onto serial, it still might be helpful to try to boot it using QEMU. Sure, the boot will probably fail as the kernel tries to load drivers, but you should be able to attach gdb, get a stack trace and see what the root cause is (perhaps a recursive call).
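The QEMU + gdb approach above can be sketched as two command lines. This is only an illustration: the helper names and the file names passed to them are hypothetical, and the `nokaslr` boot parameter is my own addition (it keeps kernel addresses in line with the symbols in vmlinux); the answer itself only mentions -s -S and target remote :1234.

```python
def qemu_cmd(kernel, initrd=None, arch="x86_64"):
    """Build a QEMU command line that waits for a debugger before booting."""
    cmd = [
        f"qemu-system-{arch}",
        "-kernel", kernel,
        "-append", "console=ttyS0 nokaslr",  # serial console, no address randomisation
        "-nographic",
        "-s",  # shorthand for -gdb tcp::1234
        "-S",  # do not start the CPU until gdb tells it to continue
    ]
    if initrd:
        cmd += ["-initrd", initrd]
    return cmd


def gdb_cmd(vmlinux, port=1234):
    """Build the matching gdb command line; vmlinux must carry debug info."""
    return ["gdb", vmlinux, "-ex", f"target remote :{port}"]
```

Both lists can be handed to `subprocess.run` from two terminals. Once gdb is attached, setting a breakpoint and typing `continue` lets the frozen guest start booting.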

How to not stop RedHawk processing even if there is no request from RedHawk-IDE

I use RedHawk v2.1.0 to implement the AM demodulation part with three components.
Platform --> Xilinx Zynq 7035 (ARM Cortex-A9 x 2)
Operating System (OS) --> embedded Linux.
When I connect the RedHawk-IDE on the external PC over Ethernet and display the waveform between the components, an abnormal sound occurs.
At this time, when I disconnect the LAN cable, the AM demodulation processing of RedHawk inside the ARM stops.
RedHawk inside the ARM appears to be waiting for requests from RedHawk-IDE on the external PC.
From this, it seems that abnormal noise will occur when requests from RedHawk-IDE on the external PC are delayed.
How can I keep RedHawk's AM demodulation processing inside the ARM running without stopping while connecting the RedHawk-IDE of the external PC and monitoring the waveform?
The environment is as follows.
CPU:Xilinx Zynq ARM Cortex-A9 2cores 600MHz
OS:Embedded Linux Kernel 3.14 RealTimePatch
FrameLength:5.333ms(48kHz sampling, 256 data)
I have seen similar, if not identical issues, when running on an ARM board. Tracking down the exact issue may be difficult and in my experience hasn't been redhawk specific and has really been an issue with omniORB or its configuration. I believe one of the fixes for me was recompiling omniORB rather than using the omniORB package provided by my OS. (Which didn't make any sense to me at the time as I used the same flags & build process as the package maintainer)
First I would confirm this issue is specific to ARM. If it's easy enough, set up the same components, waveforms etc. on a second x86_64 host and validate that the problem does not occur there.
Second I would try a "quick fix" of setting the omniORB timeouts on the arm host using the /etc/omniORB.cfg file and setting:
clientCallTimeOutPeriod = 2000
clientConnectTimeOutPeriod = 2000
This will set a 2 second timeout on CORBA interactions for both the connect portion and the call completion portion. In the past this has served as a quick fix for me but does not address the underlying issue. If this "fixes" it for you then you've at least narrowed part of the issue down and you could enable omniORB debugging using the traceLevel configuration option to find what call is timing out. See this sample configuration file for all options
If you want to dive into the underlying issues you'd need to see what the IDE and framework are doing when things lock up. With the IDE this is easy; simply find the PID of the java process and run kill -3 <pid> and a full stack trace will be printed in the terminal that is running the IDE. This can give you hints as to what calls are locked up. For the framework you'll need to use GDB and connect to the process in question and tell GDB to print the stack trace. You'd have to do some investigation ahead of time to determine which process is locking up.
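As a rough sketch of the two stack-dump steps described above: sending signal 3 (SIGQUIT) is what `kill -3 <pid>` does, and gdb can attach to a process, print every thread's backtrace and detach in one batch run. The helper names are hypothetical, and finding the right PIDs is left to you.

```python
import os
import signal


def java_thread_dump(pid):
    """Same as `kill -3 <pid>`: the JVM prints all thread stacks to its own stdout."""
    os.kill(pid, signal.SIGQUIT)


def gdb_backtrace_cmd(pid):
    """gdb command line that attaches to pid, dumps every thread's stack and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
```

The gdb command usually needs the same user as the target process (or root), and the backtraces are far more readable if the framework was built with debug symbols.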
If it ends up being an issue with the Java CORBA implementation on x86_64 talking with the C++ CORBA implementation on ARM you could also try launching / configuring / interacting with the ARM board via the REDHAWK python API from your x86_64 host. This may have better compatibility since they both use the same omniORB CORBA implementation.

How to validate/test/benchmark for the set of features on EXT4 filesystem

I want to validate/test/benchmark a set of features I have added to ext4 in the kernel tree (fs/).
I came across Spruce, a Linux file system driver verification project.
The project is hosted at https://code.google.com/p/spruce/wiki/GettingStarted
and it targets x86.
I work on an ARM target, and I have a few questions before starting off.
Has anybody worked with Spruce before?
How can the Spruce project be used on ARM? Does it need to be ported?
Is cross compilation straightforward, or do changes need to be made?
I have gone through this paper: http://syrcose.ispras.ru/2012/files/submissions/25_syrcose2012_submission_21.pdf
It contains no information on ARM or its support.
Could someone with experience or knowledge of the Spruce project please explain or help?
Spruce was intended to work as follows. It provides a set of tests that make the kernel module for a given file system execute as many paths in the code as possible. It allows the use of external analyzers (such as the tools from the KEDR framework) to detect different kinds of errors: memory leaks, etc.
All that was primarily intended for x86.
While it might be possible to port the tests themselves to ARM, one will need to choose the analyzers that work on that platform too. KEDR tools are currently for x86 only but one may try Kmemleak, Fault injection facilities and other tools on ARM instead.
Spruce still seems to be a work in progress. I see you opened a ticket concerning ARM support in their issue tracker; I think that is the right thing to do.
I would also suggest to take a look at Phoronix Test Suite. It is currently widely used for testing and benchmarking, including the analysis of file system kernel modules. See this article for example. It seems to work on ARM although I haven't tried it there myself.
The best tool for testing/validating a file system is xfstests. I have written tools to make it easy to validate xfstests for ext4. See: http://thunk.org/gce-xfstests for more details.
There is also an alpha-test level support for using this on ARM directly: http://thread.gmane.org/gmane.comp.file-systems.ext4/53649/focus=53659
This has been used successfully to test ext4 on an Android device, although to be honest, most of the time what I do is to bludgeon an Android kernel until it will build on x86, and then use kvm-xfstests or gce-xfstests, since it's much more convenient. In particular with gce-xfstests, I can just do a "fire and forget", and then when the test completes I get a test report in my e-mail. Whereas with the Android arm xfstests tarball, the automation isn't done yet, so you have to manually set up an external USB-attached USB device, hook it up via some kind of USB C hub, or if you are going to use an OTG usb adapter, you need to make sure the Android device can receive power while it is also driving the OTG usb port, and you have to manually set up the chroot. Unless the BSP kernel has been badly abused so you can't figure out how to make it build on x86 (getting the MSM kernel to work on x86 wasn't easy), testing on gce-xfstests may be much simpler at the end of the day.

How is the Linux kernel tested?

How do the Linux kernel developers test their code locally and after they have it committed? Do they use some kind of unit testing and build automation? Test plans?
The Linux kernel has a heavy emphasis on community testing.
Typically, any developer will test their own code before submitting, and quite often they will be using a development version of the kernel from Linus, or one of the other unstable/development trees for a project relevant to their work. This means they are often testing both their changes and other people's changes.
There tends not to be much in the way of formal test plans, but extra testing may be asked for before features are merged into upstream trees.
As Dean pointed out, there's also some automated testing: The Linux Test Project and the kernel Autotest (good overview).
Developers will often also write automated tests targeted to test their change, but I'm not sure there's a (often used) mechanism to centrally collect these ad hoc tests.
It depends a lot on which area of the kernel is being changed of course - the testing you'd do for a new network driver is quite different to the testing you'd do when replacing the core scheduling algorithm.
Naturally, the kernel itself and its parts are tested prior to the release, but these tests cover only the basic functionality. There are some testing systems which perform testing of Linux Kernel:
The Linux Test Project (LTP) delivers test suites to the open source community that validate the reliability and stability of Linux. The LTP test suite contains a collection of tools for testing the Linux kernel and related features.
Autotest—a framework for fully automated testing. It is designed primarily to test the Linux kernel, though it is useful for many other purposes such as qualifying new hardware, virtualization testing, and other general user space program testing under Linux platforms. It's an open-source project under the GPL and is used and developed by a number of organizations, including Google, IBM, Red Hat, and many others.
Also there are certification systems developed by some major GNU/Linux distribution companies. These systems usually check complete GNU/Linux distributions for compatibility with hardware. There are certification systems developed by Novell, Red Hat, Oracle, Canonical, and Google.
There are also systems for dynamic analysis of the Linux kernel:
Kmemleak is a memory leak detector included in the Linux kernel. It provides a way of detecting possible kernel memory leaks in a way similar to a tracing garbage collector with the difference that the orphan objects are not freed, but only reported via /sys/kernel/debug/kmemleak.
Kmemcheck traps every read and write to memory that was allocated dynamically (i.e., with kmalloc()). If a memory address is read that has not previously been written to, a message is printed to the kernel log. It is also a part of the Linux kernel.
Fault Injection Framework (included in the Linux kernel) allows for infusing errors and exceptions into an application's logic to achieve a higher coverage and fault tolerance of the system.
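To make the Kmemleak entry above more concrete, here is a small sketch of how its debugfs file might be driven from a script: writing `scan` to the file requests an immediate scan, and reading it returns the report. The helper names are hypothetical, and the sketch assumes a kernel built with CONFIG_DEBUG_KMEMLEAK, debugfs mounted at /sys/kernel/debug, and root privileges.

```python
KMEMLEAK = "/sys/kernel/debug/kmemleak"


def trigger_scan(path=KMEMLEAK):
    """Ask kmemleak for an immediate memory scan (needs root)."""
    with open(path, "w") as f:
        f.write("scan")


def read_report(path=KMEMLEAK):
    """Return the current kmemleak report as text (needs root)."""
    with open(path) as f:
        return f.read()


def count_leaks(report):
    """Count suspected leaks; each entry starts with 'unreferenced object'."""
    return sum(1 for line in report.splitlines()
               if line.startswith("unreferenced object"))
```

A typical session would call `trigger_scan()`, wait a little, and then pass `read_report()` to `count_leaks()`. The reported objects are only suspects, so each backtrace still has to be reviewed by hand.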
How do the Linux kernel developers test their code locally and after they have it committed?
Do they use some kind of unit testing and build automation?
In the classic sense of words, no.
For example, Ingo Molnar is running the following workload:
1. build a new kernel with a random set of configuration options
2. boot into it
3. go to 1
Every build fail, boot fail, bug or runtime warning is dealt with. 24/7. Multiply by several boxes, and one can uncover quite a lot of problems.
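A rough sketch of the build half of that loop, assuming a kernel source tree and using the in-tree `make randconfig` target to pick the random configuration. The function names and the `runner` parameter are hypothetical, and the boot step is left out on purpose because it depends entirely on the test machine or emulator being used.

```python
import subprocess


def randconfig_steps(jobs=4):
    """Commands for one iteration: random .config, then a parallel build."""
    return [["make", "randconfig"], ["make", f"-j{jobs}"]]


def build_random_config(tree, jobs=4, runner=subprocess.run):
    """Run one iteration inside `tree`; return the failing step, or None on success."""
    for step in randconfig_steps(jobs):
        if runner(step, cwd=tree).returncode != 0:
            return step
    return None
```

Calling `build_random_config` in an endless loop and saving the generated `.config` whenever it returns a failing step gives a reproducible record of each broken configuration.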
Test plans?
No.
There may be a misunderstanding that there is a central testing facility, but there is none. Everyone does what he/she wants.
In-tree tools
A good way to find test tools in the kernel is to:
make help and read all targets
look under tools/testing
In v4.0, this leads me to:
kselftest under tools/testing/selftests. Run with make kselftest. Must be running built kernel already. See also: Documentation/kselftest.txt , https://kselftest.wiki.kernel.org/
ktest under tools/testing/ktest. See also: http://elinux.org/Ktest , http://www.slideshare.net/satorutakeuchi18/kernel-auto-testbyktest
Static analysers section of make help, which contains targets like:
checkstack: Perl: what does checkstack.pl in linux source do?
coccicheck for Coccinelle (mentioned by askb)
Kernel CI
https://kernelci.org/ is a project that aims to make kernel testing more automated and visible.
It appears to do only build and boot tests (TODO: how to test automatically that the boot worked?). Source should be at https://github.com/kernelci/.
Linaro seems to be the main maintainer of the project, with contributions from many big companies: https://kernelci.org/sponsors/
Linaro Lava
http://www.linaro.org/initiatives/lava/ looks like a CI system with focus on development board bringup and the Linux kernel.
ARM LISA
https://github.com/ARM-software/lisa
Not sure what it does in detail, but it is by ARM and Apache Licensed, so likely worth a look.
Demo: https://www.youtube.com/watch?v=yXZzzUEngiU
Step debuggers
Not really unit testing, but may help once your tests start failing:
QEMU + GDB: https://stackoverflow.com/a/42316607/895245
KGDB: https://stackoverflow.com/a/44226360/895245
My own QEMU + Buildroot + Python setup
I also started a setup focused on ease of development, but I ended up adding some simple testing capabilities to it as well: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/8217e5508782827320209644dcbaf9a6b3141724#test-this-repo
I haven't analyzed all the other setups in great detail, and they likely do much more than mine, however I believe that my setup is very easy to get started with quickly because it has a lot of documentation and automation.
It’s not very easy to automate kernel testing. Most Linux developers do the testing on their own, much like adobriyan mentioned.
However, there are a few things that help with debugging the Linux Kernel:
kexec: A system call that allows you to put another kernel into memory and reboot without going back to the BIOS, and if it fails, reboot back.
dmesg: Definitely the place to look for information about what happened during the kernel boot and whether it works/doesn't work.
Kernel Instrumentation: In addition to printk's (and an option called 'CONFIG_PRINTK_TIME' which lets you see, to microsecond accuracy, when the kernel printed each message), the kernel configuration allows you to turn on a lot of tracers that help developers debug what is happening.
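With CONFIG_PRINTK_TIME enabled, each dmesg line starts with a `[seconds.microseconds]` prefix. A small sketch of how such lines could be split into a timestamp and a message, for example to measure the gap between two boot events (the function name is hypothetical):

```python
import re

# "[    1.234567] message": seconds, six digits of microseconds, then the text.
_PRINTK_LINE = re.compile(r"^\[\s*(\d+)\.(\d{6})\]\s?(.*)$")


def parse_printk_line(line):
    """Return (microseconds_since_boot, message), or None if there is no timestamp."""
    m = _PRINTK_LINE.match(line)
    if m is None:
        return None
    secs, usecs, msg = m.groups()
    return int(secs) * 1_000_000 + int(usecs), msg
```

Keeping the timestamp as an integer number of microseconds avoids floating-point rounding when two timestamps are subtracted.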
Then, developers usually have others review their patches. Once the patches are reviewed locally and seen not to interfere with anything else, and the patches are tested to work with the latest kernel from Linus without breaking anything, the patches are pushed upstream.
Here's a nice video detailing the process a patch goes through before it is integrated into the kernel.
In addition to the other answers, this one puts more emphasis on functionality testing, hardware certification testing and performance testing of the Linux kernel.
A lot of testing actually happens through scripts, static code analysis tools, code reviews, etc., which are very efficient at catching bugs that would otherwise break something in the application.
Sparse – An open-source tool designed to find faults in the Linux kernel.
Coccinelle is a program matching and transformation engine which provides the language SmPL (Semantic Patch Language) for specifying desired matches and transformations in C code.
checkpatch.pl and other scripts - coding style issues can be found in the file Documentation/CodingStyle in the kernel source tree. The important thing to remember when reading it is not that this style is somehow better than any other style, just that it is consistent. This helps developers easily find and fix coding style issues. The script scripts/checkpatch.pl in the kernel source tree has been developed for it. This script can point out problems easily, and should always be run by a developer on their changes, instead of having a reviewer waste their time by pointing out problems later on.
There are also:
MMTests, which is a collection of benchmarks and scripts to analyze the results.
Trinity, which is a Linux system call fuzz tester.
Also the LTP pages at SourceForge are quite outdated and the project has moved to GitHub.
I would imagine they use virtualization to do quick tests. It could be something like QEMU, VirtualBox or Xen, and some scripts to perform configurations and automated tests.
Automated testing is probably done by trying either many random configurations or a few specific ones (if they are working with a specific issue). Linux has a lot of low-level tools (such as dmesg) to monitor and log debug data from the kernel, so I imagine that is used as well.
As far as I know, there is an automatic performance regression checking tool (named lkp/0-day) run and funded by Intel. It tests each valid patch sent to the mailing list and checks how the scores change across different microbenchmarks such as hackbench, fio, unixbench, netperf, etc.
Once there is a performance regression/improvement, a corresponding report is sent directly to the patch author, with the related maintainers in Cc.
LTP and Memtests are generally preferred tools.
adobriyan mentioned Ingo's loop of random configuration build testing. That is pretty much now covered by the 0-day test bot (aka kbuild test bot). A nice article about the infrastructure is presented here: Kernel Build/boot testing
The idea behind this set-up is to notify the developers ASAP so that they can rectify the errors soon enough (before the patches make it into Linus' tree in some cases as the kbuild infrastructure also tests against maintainer's subsystem trees).
After contributors submit their patch files and make a merge request, the Linux gatekeepers check the patch by integrating and reviewing it. Once that succeeds, they merge the patch into the relevant branch and make a new version release.
The Linux Test Project is the main source of test scenarios (test cases) to run against the kernel after applying patches. A run may take around 2 to 4 hours, depending on the setup.
Please note which file system the selected kernel is going to be tested against.
Example: ext4 generates different results against ext3 and so on.
Kernel Testing procedure.
Get latest kernel source from the repository (The Linux Kernel Archives or GitHub)
Apply the patch file (using a diff tool)
Build the new kernel.
Test against test procedures in LTP (Linux Test Project)
I have compiled the Linux kernel and made some modifications for Android (Android 6.0 (Marshmallow) and Android 7.0 (Nougat)), for which I use Linux version 3. I cross-compiled it on a Linux system, debugged the errors manually, then ran its boot image file on Android and checked whether it got stuck in a boot loop or not. If it runs perfectly, then it was compiled correctly for the system requirements.
For MotoG kernel Compilation
Note: The Linux kernel will change according to requirements which depend on system hardware

Current Linux Kernel debugging techniques

A Linux machine freezes a few hours after booting and running software (including custom drivers). I'm looking for a method to debug such a problem. Recently, there has been significant progress in Linux kernel debugging techniques, hasn't there?
I would appreciate it if you could share some experience on the topic.
If you can reproduce the problem inside a VM, there is indeed a fairly new (AFAIK) technique which might be useful: debugging the virtual machine from the host machine it runs on.
See for example this:
Debugging Linux Kernel in VMWare with Windows host
VMware Workstation 7 also enables a powerful technique that lets you record system execution deterministically and then replay it as desired, even backwards. So as soon as the system crashes you can go backwards and see what was happening then (and even try changing something and see if it still crashes). IIRC I read somewhere you can't do this and debug the kernel using VMware/gdb at the same time.
Obviously, you need a VMM for this. I don't know what VMM's other than VMware's VMM family support this, and I don't know if any free VMware versions support this. Likely not; one can't really expect a commercial company to give away everything for free. The trial version is 30 days.
If your custom drivers are for hardware inside the machine, then I suppose this probably won't work.
SystemTap seems to be to Linux what Dtrace is to Solaris .. however I find it rather hostile to use. Still, you may want to give it a try. NB: compile the kernel with debug info and spend some time with the kernel instrumentation hooks.
This is why so many are still using printk() after empirically narrowing a bug down to a specific module.
I'm not recommending it, just pointing out that it exists. I may not be smart enough to appreciate some underlying beauty .. I just write drivers for odd devices.
There are many and varied techniques depending on the sort of problems you want to debug. In your case the first question is "is the system really frozen?". You can enable the magic sysrq key and examine the system state at freeze and go from there.
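A minimal sketch of driving the magic SysRq facility from a script instead of the keyboard, for the case where the machine is only partially hung and a root shell (for example over ssh) still responds. The function name and its `proc` parameter are hypothetical; the two procfs files are the standard SysRq interface.

```python
from pathlib import Path


def sysrq(command, proc="/proc"):
    """Enable the SysRq key combination and trigger one command (needs root).

    Useful commands: 't' dumps all tasks, 'w' dumps blocked tasks,
    'l' prints backtraces of all active CPUs, 'm' prints memory info.
    """
    if len(command) != 1:
        raise ValueError("a SysRq command is a single character")
    proc = Path(proc)
    # 1 enables every SysRq function for the Alt+SysRq keyboard combination.
    (proc / "sys/kernel/sysrq").write_text("1")
    # Writing the letter here has the same effect as Alt+SysRq+<letter>.
    (proc / "sysrq-trigger").write_text(command)
```

The output goes to the kernel log, so it helps to have a serial console or netconsole capturing it. If the machine is completely frozen, only the keyboard combination (enabled ahead of time) has a chance of working.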
Probably the most directly powerful method is to enable the kernel debugger and connect to it via a serial cable.
One option is to use Kprobes. A quick search on Google will show you all the information you need. It isn't particularly hard to use. Kprobes was created by IBM, I believe, as a solution for kernel debugging. It is essentially an elaborate form of printk(); however, it allows you to handle any "breakpoints" you insert using handlers. It may be what you are looking for. All you need to do is write and 'insmod' a module into the kernel which will handle any "breakpoints" you specify in the module when they are hit.
Hope that can be a useful option...
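Note that the sketch below is not the module-based approach described above but a related one: the kernel's kprobe-based event tracing, where a probe is registered by appending a one-line definition to the `kprobe_events` file in tracefs, with no module to write. The helper names are hypothetical, as is the probed symbol `my_driver_irq`; the sketch assumes tracefs is mounted at /sys/kernel/tracing (on older setups it sits under /sys/kernel/debug/tracing) and that the kernel has kprobe events enabled.

```python
def kprobe_event(name, symbol, fetchargs=()):
    """Format a probe definition: 'p:<event name> <symbol> [fetch args...]'."""
    return " ".join([f"p:{name}", symbol, *fetchargs])


def register_kprobe(definition, tracefs="/sys/kernel/tracing"):
    """Append the definition to kprobe_events (needs root)."""
    with open(f"{tracefs}/kprobe_events", "a") as f:
        f.write(definition + "\n")
```

For example, `register_kprobe(kprobe_event("myprobe", "my_driver_irq"))` would create an event that still has to be switched on by writing 1 to `events/kprobes/myprobe/enable` under the same tracefs directory; hits then show up in the `trace` file.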
The way I debugged this kind of bug was to run my OS inside VirtualBox and compile the kernel with kgdb built in. Then I set up a serial console on the VirtualBox VM so that I could attach gdb to the kernel inside the guest OS via the serial console. Any time the OS hangs, just like with the magic sysrq key, I can press ctrl-c in gdb to stop the kernel and inspect it at that point in time.
Normally it is just too difficult to pinpoint the culprit process from kernel stack traces, so I think the best way is still the generic "top" command, plus looking at the application logs to see what is causing the hang. This will need a reboot to see the logs, of course.

Resources