"Illegal instruction" when running ARM code targeting my CPU - linux

I'm compiling a rather large project for ARM. I'm using an AT91SAM9G25-EK as a devboard running a Debian ARM image. All libraries and executables in the image seem to be compiled for the armv4t instruction set.
My CPU is an ARM926EJ-S, which should run armv5tej code.
I'm using GCC to cross compile for my board. My CXX flags look like the following:
set(CMAKE_CXX_FLAGS "--signed-char --sysroot=${SYSROOT} -mcpu=arm926ej-s -mtune=arm926ej-s -mfloat-abi=softfp" CACHE STRING "" FORCE)
If I try to run this on my board, I get an Illegal Instruction signal (SIGILL) during initialization of one of my dependencies (using armv4t).
If I enable thumb mode (-mthumb -mthumb-interwork) it works, but uses Thumb for all the code, which in my case runs slower (I'm doing some serious number crunching).
In this case, if I specify one function to be compiled for ARM mode (using __attribute__((target("arm")))) it will run fine until that function is called, then exit with SIGILL.
I'm lost. Is it a problem that I'm linking against libraries built for armv4t? Am I misunderstanding the way ARM modes work? Is it something in the Linux kernel?

What softfp means is to use the soft-float calling convention between functions, but still use the hardware FPU within them. Assuming your cross-compiler is configured with a default -mfpu option other than "none" (run arm-whatever-gcc -v and look for --with-fpu= to check), then you've got a problem, because as far as I can see from the Atmel datasheet, the SAM9G25 doesn't have an FPU.
My first instinct would be to slap GDB on there, catch the signal and disassemble the offending instruction to make sure, but the fact that Thumb code works OK is already a giveaway (Thumb before ARMv6T2 doesn't include any coprocessor instructions, and thus can't make use of an FPU).
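On the target, that check might look something like this (a sketch; my_app is a placeholder, and gdb must be present on the board or attached remotely via gdbserver):
$ gdb ./my_app
(gdb) run
Program received signal SIGILL, Illegal instruction.
(gdb) x/i $pc
A coprocessor/VFP instruction at that address would confirm the FPU theory.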
In short, use -mfloat-abi=soft to ensure the ARM code actually uses software floating-point and avoids poking a non-existent FPU. And if the "serious number crunching" involves a lot of floating-point, perhaps consider getting a different MCU...
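Concretely, the check and the fix might look like this (a sketch; the arm-linux-gnueabi- triplet and the file name crunch.c are placeholders for your own toolchain and sources):
$ arm-linux-gnueabi-gcc -v 2>&1 | grep -o 'with-fpu=[^ ]*'
$ arm-linux-gnueabi-gcc -mcpu=arm926ej-s -mfloat-abi=soft -O2 -c crunch.c
The first command shows any default FPU the toolchain was configured with; the second rebuilds with pure software floating-point so no coprocessor instructions are emitted.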

Related

Can we convert an ELF from one CPU architecture to another, in Linux? [duplicate]

How can I run x86 binaries (for example an .exe file) on ARM? As I see on Wikipedia, I need to convert the binary data for the emulated platform into binary data suitable for execution on the targeted platform. But the question is: how can I do it? Do I need to open the file in a hex editor and make changes? Or something else?
To successfully do this, you'd have to do two things: one relatively easy, one very hard. Neither is something you want to do by hand in a hex editor.
Convert the machine code from x86 to ARM. This is the easy one, because you should be able to map each x86 opcode to one or more ARM opcodes. There are different ways to do this, some more efficient than others, but it can be done with a pretty straightforward mapping.
Remap function calls (and other jumps). This one is hard, because monkeying with the opcodes is going to change all the offsets for the jump and return points. If you have dynamically linked libraries (.so), and we assume that all the libraries are available at exactly the same version in both places (a sketchy assumption at best), you'd have to remap the loads.
It's essentially a machine->machine compiler and linker.
So, can you do it? Sure.
Is it easy? No.
There may be a commercial tool out there, but I'm not aware of it.
You cannot do this with a binary (see note 1 below); here, binary means an object with no symbol information, unlike an ELF file. Even with an ELF file, this is difficult to impossible. The issue is determining code from data. If you resolve this issue, then you can make de-compilers and other tools.
Even if you have an ELF file, a compiler will insert constants used in the code into the text segment. You have to look at many op-codes and reconstruct basic blocks to figure out where a function starts and ends.
A better mechanism is to emulate the x86 on the ARM. Here, you can use JIT technology to do the translation as it is encountered, but you approximately double the code space. Also, the code will execute horribly slowly. The ARM has 16 registers and the x86 is register-starved (though it usually has hidden registers). A compiler's big job is to allocate these registers. QEMU is one technology that does this. I am unsure whether it goes in the x86-to-ARM direction; and it will have a tough job, as noted.
Note 1: The x86 has asymmetric op-code sizing. In order to recognize a function prologue and epilogue, you would have to scan an image multiple times. To do this, I think the problem would be something like O(n!), where n is the number of bytes in the image, and then you might have trouble with in-line assembler and library routines coded in assembler. It may be possible, but it is extremely hard.
To run an ARM executable on an x86 machine, all you need is qemu-user.
Example:
You have busybox compiled for the AArch64 architecture (ARM64) and you want to run it on an x86_64 Linux system.
Assuming a static compile, this runs ARM64 code on an x86 system:
$ qemu-aarch64-static ./busybox
And this runs x86_64 code on an ARM system:
$ qemu-x86_64-static ./busybox
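On many Debian-style systems, installing qemu-user-static also registers binfmt_misc handlers, so foreign-architecture binaries can be launched transparently. A sketch, assuming such a system:
$ sudo apt-get install qemu-user-static binfmt-support
$ ./busybox   # the kernel now hands the ARM64 binary to qemu automatically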
What I am curious about is whether there is a way to embed both in a single program.
Read the x86 binary file as UTF-8, then copy from the ELF start to the last character. Then go to the ARM binary and delete as you copy in the x86. Then copy the x86 in the clipboard to the head. I tried it and it's working.

Is there a __yield() intrinsic on Arm?

I am trying to compile some code for ARM (v7a) that had a
#if defined(__arm__)
__yield();
#endif
added in this pull-request
The other branches have YieldProcessor() for MSC and _mm_pause() or __builtin_ia32_pause() for x86 and x86-64.
The symbol __yield is not found by the compiler, an arm-v7a-linux-gnueabihf-gcc 7.3.1 with the -mcpu=cortex-a9 -mtune=cortex-a9 -march=armv7-a options. Is such a symbol defined on some other ARM platform, in a later GCC, or in Clang?
In the headers that come with the compiler all I could find is __gnu_parallel::__yield being an inline wrapper over sched_yield(), which I suppose is equivalent to the std::this_thread::yield() that the code calls after 100 iterations calling __yield(). So I think it's not that. But I didn't see __yield in gcc documentation either.
The __yield intrinsic is specified as part of the ARM C Language Extensions (see 8.4 "Hints"). It emits the yield instruction, which is the rough equivalent of x86 pause. It is intended precisely for situations like waiting on a spinlock; it keeps the CPU from hammering on the cache line excessively (which hurts performance), possibly saves some power, and, in case of a hyperthreading CPU, makes more computational units available to the other logical processor.
(Note that it is purely a CPU function, and not an OS or library call; it doesn't yield a CPU timeslice to the operating system like the similarly named pause() or sched_yield() or std::this_thread::yield() calls would do.)
Although GCC supports some of the ACLE intrinsics, it seems to be missing this one. You should be able to substitute with asm volatile("yield");. The yield instruction has no architectural effect (it executes like nop) so no register or memory clobbers are needed.
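A quick way to sanity-check that substitute (a sketch; the cross-compiler triplet and the file name yield_test.c are placeholders):
$ echo 'void spin_hint(void){ __asm__ volatile("yield"); }' > yield_test.c
$ arm-linux-gnueabihf-gcc -march=armv7-a -O2 -c yield_test.c && echo assembled OK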
So in C++ you have std::this_thread::yield(),
and in C11 you have thrd_yield():
https://en.cppreference.com/w/c/thread
Both are available on ARM.

When compiling programs to run inside a VM, what should march and mtune be set to?

With VMs being slaves to whatever the host machine provides, what compiler flags should be passed to gcc?
I would normally think that -march=native would be what you would use when compiling for a dedicated box, but the level of fine detail that -march=native resolves to, as indicated in this article, makes me extremely wary of using it.
So... what to set -march and -mtune to inside a VM?
For a specific example...
My specific case right now is compiling Python (and more) in a Linux guest inside a KVM-based "cloud" host where I have no real control over the host hardware (aside from 'simple' stuff like CPU GHz, CPU count, and available RAM). Currently, cpuinfo tells me I've got an "AMD Opteron(tm) Processor 6176", but I honestly don't know (yet) if that is reliable, or whether the guest can get moved to different architectures on me to meet the host's infrastructure shuffling needs (sounds hairy/unlikely).
All I can really guarantee is my OS, which is a 64-bit linux kernel where uname -m yields x86_64.
Some incomplete and out-of-order excerpts from section 3.17.14, "Intel 386 and AMD x86-64 Options", of the GCC 4.6.3 manual (which I hope are pertinent).
-march=cpu-type
Generate instructions for the machine type cpu-type.
The choices for cpu-type are the same as for -mtune.
Moreover, specifying -march=cpu-type implies -mtune=cpu-type.
-mtune=cpu-type
Tune to cpu-type everything applicable about the generated code,
except for the ABI and the set of available instructions.
The choices for cpu-type are:
generic
Produce code optimized for the most common IA32/AMD64/EM64T processors.
native
This selects the CPU to tune for at compilation time by determining
the processor type of the compiling machine.
Using -mtune=native will produce code optimized for the local machine
under the constraints of the selected instruction set.
Using -march=native will enable all instruction subsets supported by
the local machine (hence the result might not run on different machines).
What I found most interesting is that specifying -march=cpu-type implies -mtune=cpu-type. My take on the rest was that if you are specifying both -march and -mtune, you're probably getting too close to tweak overkill.
My suggestion would be to just use -m64, and you should be safe enough since you're running inside an x86-64 Linux, correct?
But if you don't need to run in another environment and you're feeling lucky and fault tolerant then -march=native might also work just fine for you.
-m32
The 32-bit environment sets int, long and pointer to 32 bits
and generates code that runs on any i386 system.
-m64
The 64-bit environment sets int to 32 bits and long and pointer
to 64 bits and generates code for AMD's x86-64 architecture.
For what it's worth ...
Out of curiosity I tried using the technique described in the article you referenced. I tested gcc v4.6.3 in 64-bit Ubuntu 12.04 which was running as a VMware Player guest. The VMware VM was running in Windows 7 on a desktop using an Intel Pentium Dual-Core E6500 CPU.
Using that technique, the gcc option -m64 resolved to just -march=x86-64 -mtune=generic.
However, compiling with -march=native resulted in gcc using all of the much more specific compiler options below.
-march=core2 -mtune=core2 -mcx16
-mno-abm -mno-aes -mno-avx -mno-bmi -mno-fma -mno-fma4 -mno-lwp
-mno-movbe -mno-pclmul -mno-popcnt -mno-sse4.1 -mno-sse4.2
-mno-tbm -mno-xop -msahf --param l1-cache-line-size=64
--param l1-cache-size=32 --param l2-cache-size=2048
So, yes, as the gcc documentation states, when "Using -march=native ... the result might not run on different machines". To play it safe you should probably only use -m64 or its apparent equivalent -march=x86-64 -mtune=generic for your compiles.
I can't see how you would have any problem with this, since the intent of those compiler options is that gcc will produce code capable of running correctly on any x86-64/amd64-compliant CPU. (No?)
I am frankly astounded at how specific the gcc -march=native CPU options turned out to be. I have no idea how a CPU's L1 cache size being 32k could be used to fine tune the generated code. But apparently if there is a way to do this, then using -march=native will allow gcc to do it.
I wonder if this might result in any noticeable performance improvements?
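For anyone who wants to repeat the experiment, the technique from the article boils down to asking the gcc driver what it passes to cc1 (a sketch; the exact output will vary with the machine and gcc version):
$ gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
Newer gcc versions can also report it more directly:
$ gcc -march=native -Q --help=target | grep -E 'march|mtune'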
One would like to think that the CPU architecture reported by the guest OS is what you should optimize for. Otherwise, I'd call it a bug. There can be decent reasons for bugs sometimes, but...
Note that not all hypervisors will necessarily be the same.
It might be a good idea to check on a mailing list for your specific hypervisor.

Is QEMU good for learning programming in assembler for ARM and PowerPC?

I want to learn programming in assembler for PowerPC and ARM, but I'm unable to buy real hardware for this purpose. I'm thinking about using QEMU for that. However, I'm not sure whether it emulates both architectures well enough that I'll be able to compile and run my programs in native assembler on it.
QEMU works well for testing program correctness (i.e. whether the code would run properly on an actual ARM or PowerPC), but it is not good for testing program efficiency: the emulation is not cycle-accurate, and speed measured with QEMU cannot be reliably (or even unreliably) correlated with speed on true hardware.
Also, QEMU will not trap unaligned memory accesses, which is not a problem for PowerPC emulation (the PowerPC tolerates unaligned accesses) but may be for ARM (an unaligned access, e.g. reading a 32-bit word in RAM from an address which is not a multiple of 4, will work fine with QEMU but would trigger an exception on a true ARM processor).
Apart from these points, QEMU is fine for assembly development on ARM or MIPS (haven't tried PowerPC, because I found an old iBook on eBay for that; but I have done ARM and MIPS assembly with QEMU and then ran the resulting code on true hardware, and this worked). You can either emulate a whole system and run Debian in it (in which case the compiler, linker, text editor... will also run in emulation), or use the "user-mode emulation" where the ARM/MIPS executable is run directly, with a wrapper which converts system calls into those for the host PC (this assumes that the host is a PC running Linux). The latter is more convenient (you have access to your normal home directory, programming tools are native...) but requires installing cross-development tools. See buildroot for that (and link with -static, this will avoid many headaches).
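For instance, a minimal user-mode round trip might look like this (a sketch, assuming Debian-style cross binutils and qemu-user are installed; hello.s is a placeholder for your own source file):
$ arm-linux-gnueabi-as -o hello.o hello.s
$ arm-linux-gnueabi-ld -o hello hello.o
$ qemu-arm ./hello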
Since I have found signs that Debian for PowerPC and for ARM can run on QEMU, I suppose this won't be a problem.

How to reduce compilation cost in GCC and make?

I am trying to build some big libraries, like Boost and OpenCV, from their source code via make and GCC under Ubuntu 8.10 on my laptop. Unfortunately, the compilation of those big libraries seems to be a big burden on my laptop (Acer Aspire 5000). Its fan makes higher and higher noises until, all of a sudden, my laptop shuts itself down without the OS gracefully turning off.
So I wonder, how can I reduce the compilation cost with make and GCC?
I wouldn't mind if the compilation took much longer or used more space, as long as it can finish without my laptop shutting itself down.
Is building the debug version of libraries always less costly than building the release version, because there is no optimization?
Generally speaking, is it possible to just install some part of a library instead of the full library? Can the rest be built and added in later if it turns out to be needed?
Is it correct that if I restart my laptop, I can resume compilation from around where it was when my laptop shut itself down? For example, I noticed that this is true for OpenCV: the progress percentage shown during its compilation does not restart from 0%. But I am not sure about Boost, since there is no obvious information for me to tell, and its compilation seems to take much longer.
UPDATE:
Thanks, brianegge and Levy Chen! How do I use the wrapper script for GCC and/or g++? Is it like defining an alias for GCC or g++? And how do I call a script to check sensors and wait until the CPU temperature drops before continuing?
I'd suggest creating a wrapper script for gcc and/or g++
#!/bin/bash
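# pause briefly before each compile so the CPU gets a chance to cool down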
sleep 10
exec gcc "$@"
Save the above as "gccslow" or something, and then:
export CC="gccslow"
Alternatively, you can name the script gcc and put it at the front of your path. If you do that, be sure to use the full path to the real gcc inside the script; otherwise, the script will call itself recursively.
A better implementation could call a script to check sensors and wait until the CPU temperature drops before continuing.
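To answer the update: such a wrapper might look like this (a sketch, assuming the lm-sensors tools are installed; the "Core 0" label, the crude parsing, and the 75-degree threshold are assumptions you would adapt to your machine):
#!/bin/bash
# gccslow: wait until the CPU cools below MAX_TEMP, then compile
MAX_TEMP=75
while true; do
    temp=$(sensors | awk '/^Core 0/ { gsub(/[+°C]/, "", $3); print int($3); exit }')
    [ -z "$temp" ] && break                 # no reading available: don't wait forever
    [ "$temp" -lt "$MAX_TEMP" ] && break    # cool enough to proceed
    sleep 5
done
exec gcc "$@"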
For your latter question: a well-written Makefile will define dependencies as a directed acyclic graph (DAG), and it will try to satisfy those dependencies by compiling them in the order given by the DAG. Thus, once a file is compiled, its dependency is satisfied and it need not be compiled again.
It can, however, be tricky to write good Makefiles, so sometimes the author will resort to a brute-force approach and recompile everything from scratch.
For your question: for such well-known libraries, I will assume the Makefiles are written properly and the build will resume from the last operation (with the caveat that it needs to rescan the DAG and recalculate the compilation order, which should be relatively cheap).
Instead of compiling the whole thing, you can compile each target separately, as shown below. You have to examine the Makefile to identify them.
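For example, with the CMake-generated Makefiles that OpenCV uses (a sketch; opencv_core is just an example target name):
$ make help              # lists the available targets
$ make -j1 opencv_core   # build one target at a time, single-job, to keep the load down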
Tongue-in-cheek: What about putting the laptop into the fridge while compiling?
