Let us consider that I have an application that is to be executed on a 1st node. This application, however, cannot execute some function on this 1st node because the node lacks the required capabilities. To make the application run flawlessly, I am planning to steal the process's stack, heap and registers using ptrace and send them over to a fully capable 2nd node. On this 2nd node, I would like to execute the same process (i.e. the same executable on the same architecture, e.g. x86) until the exact same point the 1st process has executed, apply the previously stolen stack, heap and register values onto this process, continue executing it there, transfer the results back to the 1st node, and resume executing the application from that point.
I have also disabled ASLR (address space layout randomization), so that there is a one-to-one mapping between the processes executed on the two nodes.
On applying this logic, the program ends up with "stack smashing detected".
Is there anything that I am missing here, or is the idea itself not feasible?
NOTE: I am also skipping the part about copying the kernel stack, as the processes on both sides are executed exactly up to the same instruction. Please also note that this is a very simple program that I tried, as I don't want the complexity of the heap to be involved yet.
#include <unistd.h>
#include <stdio.h>
#include <signal.h>

void add_one(int *p)
{
    *p += 2;
}

int main(int argc, char **argv)
{
    int i = 0;
    add_one(&i);
    return 0;
}
The listing above shows the program I experimented with. I disassembled it and found the address of the function add_one, the point at which I steal the stack and process registers and send them over to apply onto the other, identical process on node 2.
Any help on how to do such migrations and the things that I am missing would really help me in moving forward.
If you want to do this you need to at least disable stack canaries, because those will 100% mismatch when carrying the execution over to another machine, even if you copied the entire address space.
-fno-stack-protector will do.
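For what it's worth, here is a minimal sketch of the capture side of that plan, assuming a tracee you can attach to, stack bounds already read from /proc/<pid>/maps, and a buffer at least as large as the region rounded up to the word size. The restore side on node 2 would use PTRACE_SETREGS and PTRACE_POKEDATA in the opposite direction. The helper capture() is purely illustrative, not the poster's actual code:

#include <errno.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Capture the general-purpose registers and one memory region (e.g. the
 * stack, whose bounds come from /proc/<pid>/maps) of a traced process. */
static int capture(pid_t pid, unsigned long start, unsigned long end,
                   struct user_regs_struct *regs, unsigned char *buf)
{
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;
    waitpid(pid, NULL, 0);                      /* wait until the tracee stops */

    if (ptrace(PTRACE_GETREGS, pid, NULL, regs) == -1)
        return -1;                              /* general-purpose registers   */

    for (unsigned long addr = start; addr < end; addr += sizeof(long)) {
        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
        if (word == -1 && errno != 0)
            return -1;
        memcpy(buf + (addr - start), &word, sizeof(word));
    }

    ptrace(PTRACE_DETACH, pid, NULL, NULL);     /* or leave it stopped         */
    return 0;
}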
Context
I'm using an i.MX6 (IMXULL) application processor, and want to know in software when the power-off button has been pressed:
Luckily, the IMX6ULL reference manual explains that this should be possible:
Section 10.5: ONOFF Button
The chip supports the use of a button input signal to request main SoC power state changes (i.e. On or Off) from the PMU. The ONOFF logic inside of SNVS_LP allows for connecting directly to a PMIC or other voltage regulator device. The logic takes a button input signal and then outputs a pmic_en_b and set_pwr_off_irq signal. [...] The logic has two different modes of operation (Dumb and Smart mode).
The Dumb PMIC Mode uses pmic_en_b to issue a level signal for on and off. Dumb pmic mode has many different configuration options which include (debounce, off to on time, and max time out).
(Also available in condensed form here on page 18)
Attempt
Therefore, I have built a trivially simple kernel module to try and capture this interrupt:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/syscalls.h>
#include <linux/interrupt.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("John Doe <j.doe@acme.inc>");

// Forward declaration
irqreturn_t irq_handler (int, void *);

// Number of interrupt to capture
#define INTERRUPT_NO 36

static int __init pwr_ctl_init (void)
{
    pr_err("init()\n");
    return request_irq(INTERRUPT_NO, irq_handler, IRQF_SHARED, "onoff-button",
                       (void *)irq_handler);
}

static void __exit pwr_ctl_exit (void)
{
    pr_err("exit()\n");
    // For a shared IRQ, free_irq() must receive the same dev_id cookie
    // that was passed to request_irq().
    free_irq(INTERRUPT_NO, (void *)irq_handler);
}

irqreturn_t irq_handler (int irq, void *dev_irq)
{
    pr_err("interrupt!\n");
    return IRQ_HANDLED;
}

module_init(pwr_ctl_init);
module_exit(pwr_ctl_exit);
Problem
However, I cannot find any information about what the number of the interrupt is. When searching on the internet, all I get is this one NXP forum post:
ONOFF button doesn't interrupt
Which hints that it should be 36. However, I have found that this isn't the case on my platform. When I check /proc/interrupts, 36 is already occupied by 20b4000.ethernet. Because the reference manual also mentions that the interrupt is generated by the SNVS low power system, I checked the device tree and found the following information:
snvs_poweroff: snvs-poweroff {
    compatible = "syscon-poweroff";
    regmap = <&snvs>;
    offset = <0x38>;
    value = <0x60>;
    mask = <0x60>;
    status = "disabled";
};

snvs_pwrkey: snvs-powerkey {
    compatible = "fsl,sec-v4.0-pwrkey";
    regmap = <&snvs>;
    interrupts = <GIC_SPI 4 IRQ_TYPE_LEVEL_HIGH>;
    linux,keycode = <KEY_POWER>;
    wakeup-source;
    status = "disabled";
};
This information seems useful for knowing that SNVS is the interrupt controller, but not how to capture this set_pwr_off_irq signal.
Conclusion
How do I capture the ON/OFF interrupt supposedly generated by SNVS?
How do I determine the number of an interrupt from the device-tree (if that is applicable at all)?
Am I misunderstanding something about how the ONOFF feature works? Is it possible to capture this from a kernel module at all?
Edit
This edit answers some user questions, and then goes into new information about the problem I have since discovered:
User Questions
Processor: The processor is an NXP i.MX 6UltraLite / 6ULL / 6ULZ ARM Cortex A7.
New Information
SNVS Driver: Using my build system's kernel configuration, I have verified that the snvs_pwrkey driver (see here) is enabled, and modified it. My modification consists of adding a single printk statement to the interrupt routine to see if the button trips it. This did not work
I have tried updating the driver to a newer version, which claims to support newer i.MX6 processors. This also did not work
I have tried to load the driver as a kernel module for easier debugging. This is not possible, as the kernel configuration requires the driver to be enabled, and I cannot stop it from being statically built into the kernel.
The answer is rather anticlimactic. In short, there was a device-tree overlay that was disabling my changes to snvs_pwrkey, even when I had enabled it. Once I located and removed the overlay, the driver (snvs_pwrkey.c) was working as expected.
As for the IRQ number, it turns out that the IRQ for the power button is 45 as interpreted through Linux. The interrupt is not configured for sharing, so my kernel module could not be loaded.
If you want to capture power button toggle events, I suggest modifying the driver to add some output, and then perhaps adding a udev rule to capture button presses. I will update my answer with an example ASAP.
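In the meantime, since snvs_pwrkey is an input driver that reports KEY_POWER, a small userspace reader of the input events is enough to confirm that button presses arrive. This is a minimal sketch, assuming the snvs_pwrkey device shows up as /dev/input/event0 (check /proc/bus/input/devices for the actual node on your board):

#include <fcntl.h>
#include <linux/input.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal reader for power-button presses reported by snvs_pwrkey. */
int main(void)
{
    struct input_event ev;
    int fd = open("/dev/input/event0", O_RDONLY);   /* assumed event node */
    if (fd == -1) {
        perror("open");
        return 1;
    }
    while (read(fd, &ev, sizeof(ev)) == sizeof(ev)) {
        if (ev.type == EV_KEY && ev.code == KEY_POWER && ev.value == 1)
            printf("power button pressed\n");
    }
    close(fd);
    return 0;
}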
$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?
Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means CPU time spent executing system calls within the kernel, as opposed to library code, which still runs in user space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and is running on a computer with more than one processor) it can exceed the wall clock time reported by Real, and often does. Note also that these figures include the User and Sys time of all child processes (and their descendants) once they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
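As a small illustration of how child times get folded in, the sketch below forks a child that burns CPU, waits for it, and then reads the accumulated child times with getrusage(); run under time, that work also shows up in the parent's figures. The helper name burn_cpu and the iteration count are just for this example:

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spin for a while so the child accumulates user CPU time. */
static void burn_cpu(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 500000000UL; ++i)
        x += i;
}

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {            /* child: do the work, then exit            */
        burn_cpu();
        _exit(EXIT_SUCCESS);
    }
    wait(NULL);                /* reap the child; its times are now        */
                               /* attributed to this process's children    */
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);
    printf("children user: %ld.%06lds, sys: %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return EXIT_SUCCESS;
}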
Origins of the statistics reported by time (1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait(2) (POSIX) or times(2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday(2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc or fread/fwrite will invoke these kernel functions, and that will then count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in the kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might use malloc and the like in the background, which will then again spend some time in 'sys'.
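To see the split directly, here is a small sketch (the use of /dev/null and the iteration counts are arbitrary choices for this example): the arithmetic loop should show up almost entirely as user time, while the write() loop spends most of its time in sys.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Mostly 'user': pure computation, no kernel involvement. */
    volatile double x = 1.0;
    for (long i = 0; i < 100000000L; ++i)
        x = x * 1.000000001;

    /* Mostly 'sys': each write() is a system call executed in the kernel. */
    int fd = open("/dev/null", O_WRONLY);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }
    char buf[1] = {0};
    for (long i = 0; i < 1000000L; ++i)
        write(fd, buf, sizeof(buf));
    close(fd);
    return EXIT_SUCCESS;
}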
To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s
• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as
multiple processors may work in parallel.
Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep syscall
Non-busy sleep as done by the sleep syscall only counts in real, but not for user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    sleep(1);
    return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%c\n", getchar());
    return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second, it outputs just like the sleep example something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

uint64_t niters;

void* my_thread(void *arg) {
    uint64_t *argument, i, result;
    argument = (uint64_t *)arg;
    result = *argument;
    for (i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    *argument = result;
    return NULL;
}

int main(int argc, char **argv) {
    size_t nthreads;
    pthread_t *threads;
    uint64_t rc, i, *thread_args;

    /* CLI args. */
    if (argc > 1) {
        niters = strtoll(argv[1], NULL, 0);
    } else {
        niters = 1000000000;
    }
    if (argc > 2) {
        nthreads = strtoll(argv[2], NULL, 0);
    } else {
        nthreads = 1;
    }
    threads = malloc(nthreads * sizeof(*threads));
    thread_args = malloc(nthreads * sizeof(*thread_args));

    /* Create all threads */
    for (i = 0; i < nthreads; ++i) {
        thread_args[i] = i;
        rc = pthread_create(
            &threads[i],
            NULL,
            my_thread,
            (void*)&thread_args[i]
        );
        assert(rc == 0);
    }

    /* Wait for all threads to complete */
    for (i = 0; i < nthreads; ++i) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
    }

    free(threads);
    free(thread_args);
    return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads for a fixed 10^10 iterations on my 8 hyperthread CPU:
Plot data.
From the graph, we see that:
for a CPU intensive single core application, wall and user are about the same
for 2 cores, user is about 2x wall, which means that the user time is counted across all threads: user basically doubled, while wall stayed the same.
this continues up to 8 threads, which matches the number of hyperthreads on my computer.
After 8, wall starts to increase as well, because we don't have any extra CPUs to put more work into in a given amount of time!
The ratio plateaus at this point.
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory bound, then we would get a fall in performance much earlier with less cores because the memory accesses would be a bottleneck as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer that ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use sendfile, which does a file copy operation in kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy would be a CPU-intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
    char *source_path, *dest_path;
    int source, dest;
    struct stat stat_source;

    if (argc > 1) {
        source_path = argv[1];
    } else {
        source_path = "sendfile.in.tmp";
    }
    if (argc > 2) {
        dest_path = argv[2];
    } else {
        dest_path = "sendfile.out.tmp";
    }

    source = open(source_path, O_RDONLY);
    assert(source != -1);
    dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    assert(dest != -1);
    assert(fstat(source, &stat_source) != -1);
    assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
    assert(close(source) != -1);
    assert(close(dest) != -1);
    return EXIT_SUCCESS;
}
GitHub upstream.
which gives basically mostly system time as expected:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for both as for a single process, but the wall time is larger, likely because the two processes are competing for disk read access.
So it seems that time does in fact attribute kernel work to the process that started it.
Bash source code
When you run just time <cmd> on Ubuntu, it uses the Bash keyword, as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep the Bash source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
gettimeofday() and getrusage() if both are available
times() otherwise
all of which are Linux system calls and POSIX functions.
GNU time source code
If we call it as:
/usr/bin/time
then it uses the standalone GNU time implementation (packaged separately from Coreutils).
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
a non-POSIX BSD wait3 call if that is available
times and gettimeofday otherwise
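As a rough sketch of what either implementation boils down to (using the portable times()/gettimeofday() route rather than wait3(), with minimal error handling to keep it short), here is a tiny time-like wrapper; the formatting and names are just for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/times.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minimal "time"-like wrapper: ./a.out <command> [args...] */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return EXIT_FAILURE;
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);              /* wall-clock start            */

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[1], &argv[1]);           /* run the command             */
        _exit(127);
    }
    wait(NULL);                              /* collect the child           */

    gettimeofday(&end, NULL);                /* wall-clock end              */

    struct tms t;
    times(&t);                               /* child CPU times land in     */
                                             /* tms_cutime / tms_cstime     */
    double tick = (double)sysconf(_SC_CLK_TCK);
    double real = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("real %.3fs\nuser %.3fs\nsys  %.3fs\n",
           real, t.tms_cutime / tick, t.tms_cstime / tick);
    return EXIT_SUCCESS;
}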
Real shows the total turnaround time for a process,
while User shows the execution time for user-mode instructions,
and Sys is the time spent executing system calls.
Real time also includes waiting time (waiting for I/O, etc.).
In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, etc.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > usr+sys).
And finally, sometimes the reverse is true (usr+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but usr = 30m.
I want to mention another scenario where the real time is much, much bigger than user + sys. I've created a simple server which responds after a long time:
real 4.784
user 0.01s
sys 0.01s
The issue is that in this scenario the process spends its time waiting for the response, which happens neither in user space nor in the kernel.
Something similar happens when you run the find command. In that case, the time is spent mostly on requesting and waiting for a response from the SSD.
I must mention that, at least on my AMD Ryzen CPU, user is always larger than real in multi-threaded programs (or single-threaded programs compiled with -O3).
eg.
real 0m5.815s
user 0m8.213s
sys 0m0.473s
I'm new to this forum and I would like to ask the experts a question.
I wrote the following program (part of a bigger thing, but this is the code that causes me trouble):
#include <unistd.h>
#include <fcntl.h>

int main()
{
    int fd;
    fd = open("/dev/watchdog", O_RDONLY);
    lseek(fd, 0, SEEK_END);
    return 0;
}
The thing that bothers me is that after I run this program as root, the system crashes after 20-30 seconds, and I can't figure out why. This does not happen as a regular user.
Could you please enlighten me regarding this issue?
Thanks!
PS. Yes, I know that /dev/watchdog is a character device and is not seekable, but this still seems really weird.
It looks like /dev/watchdog is doing what it's supposed to do. Once you open /dev/watchdog, you have to keep writing to it, otherwise the system reboots. It is probably not the lseek that is crashing it, it is the lack of writing. See the Linux man pages for watchdog for more info.
When you ran as a non-root user, your open of /dev/watchdog probably just failed, so the system did not reboot. Your code is not checking for an error from open().
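For reference, a minimal sketch of the usual userspace pattern follows, assuming the watchdog driver was built without the nowayout option so that the magic 'V' close actually disarms it; the loop counts are arbitrary for this example:

#include <fcntl.h>
#include <linux/watchdog.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) {
        perror("open /dev/watchdog");
        return 1;
    }

    /* Pet the watchdog a few times; a real daemon would do this forever. */
    for (int i = 0; i < 10; ++i) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);   /* equivalent to writing any byte */
        sleep(1);
    }

    /* Write the magic character 'V' before closing so the driver disarms
     * the watchdog instead of rebooting when the timeout expires. */
    write(fd, "V", 1);
    close(fd);
    return 0;
}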
I wrote a very simple program in C++ on Linux which downloads images from a website over HTTP (basically a simple HTTP client request), using the cURL library: http://curl.haxx.se/libcurl/c/allfuncs.html
#define CURL_STATICLIB
#include <stdio.h>
#include <stdlib.h>
#include </usr/include/curl/curl.h>
#include </usr/include/curl/stdcheaders.h>
#include </usr/include/curl/easy.h>

size_t write_data(void *ptr, size_t size, size_t nmemb, FILE *stream) {
    size_t written = fwrite(ptr, size, nmemb, stream);
    return written;
}

int main(void) {
    CURL *curl;
    FILE *fp;
    CURLcode res;
    char *url = "http://www.example.com/test_img.png";
    char outfilename[FILENAME_MAX] = "/home/c++_proj/output/web_req_img.png";

    curl = curl_easy_init();
    if (curl) {
        fp = fopen(outfilename, "wb");
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
        res = curl_easy_perform(curl);
        /* always cleanup */
        curl_easy_cleanup(curl);
        fclose(fp);
    }
    return 0;
}
I verified the code, and it works fine. I can see that the image is downloaded and that I can view it (with no errors or warnings). Since I plan on expanding my code, I tried to install ddd and use the debugger, but the debugger doesn't work: my program exits with some sort of signal error when I try to run it under ddd.
This is the error:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1"
Program received signal SIGILL, Illegal instruction.
0xb6a5c4c0 in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
First I thought that I hadn't properly installed ddd, so I went back to gdb, but I get the exact same error when I run the program. (And I believe I am using the latest versions of gdb and ddd.)
Then I tried to use ddd on another simple program, that doesn't involve cURL library, and it worked fine !!!
Does anyone know why this is the case, and what is the solution? Do I somehow need to point ddd at the cURL libraries while it is running? In the past, I don't recall having to do this with other sets of libraries! Maybe it is something about cURL that ddd doesn't like? But the program runs fine by itself without the debugger! I would appreciate some help.
I am guessing it may be part of some instruction-set detection code. Just let the program continue and see if it handles the signal by itself (since it runs outside of gdb, it probably does). Alternatively, you can tell gdb not to bother you with SIGILL at all before you run the program: handle SIGILL pass nostop noprint.
It's only a problem if the program dies, which was not clear from your question.
Program received signal SIGILL, Illegal instruction.
0xb6a5c4c0 in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
Does anyone know why this is the case, and what is the solution?
Jester gave you the solution. Here's the reason why it happens.
libcrypto.so is OpenSSL's crypto library. OpenSSL performs CPU feature probes by executing an instruction to see if it is available. If a SIGILL is generated, then the feature is not available and an appropriate fallback function is used instead.
The reason you see them on ARM and not on IA-32 is that on Intel's IA-32 the cpuid instruction is non-privileged. Any program can execute cpuid to detect CPU features, so there is no need for SIGILL-based feature probing.
In contrast to IA-32, ARM's equivalent of cpuid is a privileged instruction. Your program would need Exception Level 1 (EL1), but it runs at EL0. To sidestep the need for privileges, programs on ARM set up a jmp_buf and install a SIGILL handler. They then try the instruction in question, and the SIGILL handler indicates whether the instruction or feature is available.
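A rough sketch of that probe pattern is shown below. The helper names cpu_supports and probe_noop are hypothetical, and the probe body is a no-op so the example runs anywhere; a real library would place the instruction being tested (e.g. an AArch64 crypto instruction in inline assembly) inside the probe:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf probe_env;

static void sigill_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* jump back: the instruction is unsupported */
}

/* Returns 1 if probe() executed without raising SIGILL, 0 otherwise. */
static int cpu_supports(void (*probe)(void))
{
    struct sigaction sa, old;
    int supported = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigill_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGILL, &sa, &old);

    if (sigsetjmp(probe_env, 1) == 0) {
        probe();                /* raises SIGILL if the CPU lacks the feature */
        supported = 1;
    }
    sigaction(SIGILL, &old, NULL);
    return supported;
}

/* Hypothetical probe: replace the body with the instruction to test. */
static void probe_noop(void)
{
}

int main(void)
{
    printf("feature %s\n", cpu_supports(probe_noop) ? "available" : "missing");
    return 0;
}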
OpenSSL recently changed to SIGILL-free feature detection on some Apple platforms because Apple corrupts things. Also see PR 3108, SIGILL-free processor capabilities detection on MacOS X. Other libraries are doing similar. Also see How to determine ARMv8 features at runtime?
OpenSSL also documents the SIGILL behavior in their FAQ. See item 17 in the OpenSSL FAQ for more details: When debugging I observe SIGILL during OpenSSL initialization: why? Also see SSL_library_init cause SIGILL when running under gdb on Stack Overflow.
For Android developers you can disable SIGILL in Android Studio:
https://developer.oculus.com/documentation/native/android/mobile-studio-debug/#troubleshooting