Does ARM sit idle while NEON is doing its operations? - linux

This might look similar to ARM and NEON can work in parallel?, but it's not the same question; I have a different issue (maybe a problem with my understanding):
In the protocol stack, the checksum is computed on the GPP; I'm now handing that task over to NEON as part of a function.
Here is the checksum function I have written for NEON, posted on Stack Overflow: Checksum code implementation for Neon in Intrinsics
Now, suppose this function is called from Linux:
ip_csum()
{
    …
    …
    csum = do_csum();   /* function call from ARM */
    …
    …
}

do_csum()
{
    …
    …
    /* NEON-optimised code */
    …
    …
    /* returns the final checksum to ip_csum (Linux/ARM) */
}
In this case, what happens to the ARM while NEON is doing the calculations? Does the ARM sit idle, or does it move on to other operations?
As you can see, do_csum is called and we are waiting on its result (or at least that is what it looks like).
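For illustration, the general shape of such an intrinsics-based inner loop might be roughly like this (a simplified sketch, not the exact code from the link; carry folding is abbreviated and tail bytes are left to the ARM side):

#include <arm_neon.h>
#include <stdint.h>

/* Simplified sketch of a NEON-intrinsics checksum inner loop (illustrative only). */
static uint32_t do_csum_neon(const uint16_t *buf, int n16)
{
    uint32x4_t acc = vdupq_n_u32(0);
    int i;
    for (i = 0; i + 8 <= n16; i += 8) {
        uint16x8_t v = vld1q_u16(buf + i);          /* load 8 x 16-bit words   */
        acc = vaddw_u16(acc, vget_low_u16(v));      /* widen and accumulate    */
        acc = vaddw_u16(acc, vget_high_u16(v));
    }
    uint64x2_t p = vpaddlq_u32(acc);                /* horizontal add of lanes */
    uint32_t sum = (uint32_t)(vgetq_lane_u64(p, 0) + vgetq_lane_u64(p, 1));
    while (sum >> 16)                               /* fold carries to 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}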
NOTE:
Speaking in terms of the Cortex-A8.
do_csum, as you can see from the link, is coded with intrinsics.
Compilation is done with the GNU toolchain.
It would be good if you also covered multi-threading or any other concept that comes into the picture when these inter-operations happen.
Questions:
Does the ARM sit idle while NEON is doing its operations (in this particular case)?
Or does it shelve the current ip_csum-related code and take up another process/thread until NEON is done? (I'm quite lost as to what happens here.)
If it is sitting idle, how can we make the ARM work on something else until NEON is done?

(Image from TI Wiki Cortex A8)
The ARM (or rather the integer pipeline) does not sit idle while NEON instructions are processing. In the Cortex-A8, NEON sits at the "end" of the processor pipeline: instructions flow through the pipeline, ARM instructions are executed at the "beginning" of the pipeline, and NEON instructions are executed at the end. Every clock pushes the instructions down the pipeline.
Here are some hints on how to read the diagram above:
Every cycle, if possible, the processor fetches an instruction pair (two instructions).
Fetching is pipelined so it takes 3 cycles for the instructions to propagate into the decode unit.
It takes 5 cycles (D0-D4) for the instruction to be decoded. Again, this is all pipelined, so it affects the latency but not the throughput; more instructions keep flowing through the pipeline where possible.
Now we reach the execute/load store portion. NEON instructions flow through this stage (but they do that while other instructions are possibly executing).
We get to the NEON portion, if the instruction fetched 13 cycles ago was a NEON instruction it is now decoded and executed in the NEON pipeline.
While this is happening, integer instructions that followed that instruction can execute at the same time in the integer pipeline.
The pipeline is a fairly complex beast: some instructions are multi-cycle, some have dependencies and will stall if those dependencies are not met, and other events such as branches will flush the pipeline.
If you are executing a sequence that is 100% NEON instructions (which is pretty rare, since there are usually some ARM registers, control flow, etc. involved), then there is some period where the integer pipeline isn't doing anything useful. Most code will have the two executing concurrently for at least some of the time, and cleverly engineered code can maximize performance with the right instruction mix.
This online tool Cycle Counter for Cortex A8 is great for analyzing the performance of your assembly code and gives information about what is executing in what units and what is stalling.

In the Application Level Programmers' Model, you can't really distinguish between the ARM and NEON units.
While NEON is a separate hardware unit (available as an option on Cortex-A series processors), it is the ARM core that drives it, in a tight fashion. It is not a separate DSP that you can communicate with asynchronously.
You can write better code by fully utilizing the pipelines of both units, but this is not the same as having a separate core.
The NEON unit is there because it can do some operations (SIMD) much faster than the ARM unit, at a low clock frequency.
It is like having a friend who is good at math: whenever you have a hard question, you can ask him. While waiting for the answer you can do some small things ("if the answer is this, I'll do that; if not, I'll do this instead"), but if you depend on the answer to go on, you need to wait for him before going further. You could calculate the answer yourself, but asking will be much faster, even including the communication time between the two of you, compared to doing all the math yourself. You can even extend the analogy: you also need to buy lunch for that friend (energy consumption), but in many cases it is worth it.
Anyone who says the ARM core can do other things while the NEON unit is working is talking about instruction-level parallelism, not anything like task-level parallelism.

ARM is not "idle" while NEON operations are executed, but controls them.
To fully use the power of both units, one can carefully plan an interleaved sequence of operations:
loop:
    SUBS     r0, r0, r1            ; ARM operation
    VADD.I16 q0, q0, q1            ; NEON operation
    LDR      r3, [r1, r2, LSL #2]  ; ARM operation
    VLD1.32  {d0}, [r1]!           ; NEON operation using an ARM register
    BNE      loop                  ; ARM operation controlling the flow of both units
The ARM Cortex-A8 can execute up to 2 instructions per clock cycle. If both of them are independent NEON operations, there is no point in putting an ARM instruction in between. On the other hand, if one knows that the latency of a VLD (load) is large, one can place many ARM instructions between the load and the first use of the loaded value. In either case the combined usage must be planned in advance and interleaved.

Can a single byte instruction be executed while being only partially overwritten?

I have made an experiment in which a new thread executes a shellcode with this simple infinite loop:
NOP
JMP REL8 0xFE (-0x2)
This generates the following shellcode bytes:
0x90, 0xEB, 0xFE
After this infinite loop there are other instructions, and the sequence ends by overwriting the destination byte back to -0x2 (making it an infinite loop again) and performing an absolute jump to send the thread back to the infinite loop.
Now I was asking myself whether the jump instruction could be executed while its single destination byte is only partially overwritten by the other thread.
For example, let's say that the other thread overwrites the destination of the jump (0xFE, or 11111110 in binary) with 0x0 (00000000) to release the thread from the infinite loop.
Could it happen that the jump goes to, say, 0x1E (00011110) because the destination byte wasn't completely overwritten at that moment?
Before asking this question here, I ran the experiment myself in a C++ program and let it run for some hours without it ever missing a single jump.
If you want to have a look at the code I made for this experiment, I have uploaded it to GitHub.
According to this experiment, it seems to be impossible for an instruction to be executed while it is only partially overwritten.
However, I have very little knowledge of assembly and of processors, which is why I am asking the question here:
Can anyone confirm my observation, please? Is it indeed impossible for an instruction to be executed while it is being partially overwritten by another thread? Does anyone know for sure why?
Thank you very much for your help and knowledge; I did not know where to look for such information.
No, byte stores are always atomic on x86, even for cross-modifying code.
See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals for cross modifying code. And maybe Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs
Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.
Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the jmp instruction so you can store to it from C by assigning to extern char jmp_bytes[2]?
And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a pause loop would allow a lower latency exit from the loop than a self-modifying code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. At least several times the delay of a simple branch miss.
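A minimal sketch of that data-spin alternative (names like stop_flag are made up for illustration; compile with -pthread):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <immintrin.h>                  /* _mm_pause() */

static atomic_int stop_flag;            /* shared flag instead of patched code */

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&stop_flag, memory_order_acquire))
        _mm_pause();                    /* hint to the CPU that this is a spin-wait loop */
    puts("worker released");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    /* the other thread releases the worker with a plain store, no SMC needed: */
    atomic_store_explicit(&stop_flag, 1, memory_order_release);
    pthread_join(t, NULL);
    return 0;
}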
Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (and reducing the thermal headroom the CPU has to turbo above its rated clock speed when there is work to do).
The mechanism used by current CPUs is that if a store near the EIP/RIP of any instruction in flight in the pipeline is detected, it does a machine clear (perf counter machine_clears.smc, aka machine nuke). It doesn't even try to handle it "efficiently". But if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache-line boundary), the target CPU core could see it in parts and potentially decode it with some bytes updated and other bytes not. A single byte, however, is always updated atomically, so tearing within a byte is not possible.
x86 on paper doesn't guarantee this, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec requires can actually be the most efficient way to meet all the required guarantees and run fast (and/or avoid breaking existing code in widely-used software).

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, i.e.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but numbers returned are always system dependent, which in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system, (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image, bootimage.iso, in a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting along the way some form of timing data to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with ones made on machine Y. Instead, I'd like both user X and user Y to be able to run their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X's or Y's real run time (unlike MAME, where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware a time-based benchmark can be unreliable, as the user often can't be sure that some background process (or bug, or hardware error like a bad sector, or a virus) isn't degrading some aspect of performance. A more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming Puzzles and Code Golf site, orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print-to-stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware1.
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
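As a rough illustration of that kind of formula, here is a sketch (all the latency constants are assumed "typical" values, not measurements, and the function name is made up):

#include <stdio.h>

/* Convert machine-independent event counts into "bogoseconds" on a nominal
 * 1 GHz, 1-instruction-per-cycle machine with fixed miss penalties. */
static double bogoseconds(double instructions, double l1_misses,
                          double ll_misses, double branch_misses)
{
    const double nominal_hz   = 1e9;    /* nominal 1 GHz reference clock     */
    const double l1_miss_cost = 10.0;   /* assumed cycles for an L2 hit      */
    const double ll_miss_cost = 200.0;  /* assumed cycles for a trip to RAM  */
    const double branch_cost  = 15.0;   /* assumed branch mispredict penalty */

    double cycles = instructions * 1.0          /* nominal CPI of 1.0 */
                  + l1_misses * l1_miss_cost
                  + ll_misses * ll_miss_cost
                  + branch_misses * branch_cost;
    return cycles / nominal_hz;
}

int main(void)
{
    /* e.g. 15e9 instructions with no misses => 15 bogo-seconds */
    printf("%.2f bogo-seconds\n", bogoseconds(15e9, 0, 0, 0));
    return 0;
}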
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x862 emulators and add your simple machine model right in there. You don't need to be cycle-accurate; just pick a nominal width and model it. Whatever underlying emulation cachegrind is doing might be a good match, and you get the cache and branch-prediction modeling for free.
1 Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
2 You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

Question about cycle counting accuracy when emulating a CPU

I am planning on creating a Sega Master System emulator over the next few months, as a hobby project in Java (I know it isn't the best language for this, but I find it very comfortable to work in, and as a frequent user of both Windows and Linux I thought a cross-platform application would be great). My question regards cycle counting.
I've looked over the source code of another Z80 emulator, and of other emulators as well, and the execute loop in particular intrigues me: when it is called, an int is passed as an argument (let's say 1000 as an example). Now, I get that each opcode takes a different number of cycles to execute, and that as opcodes are executed, their cycle counts are subtracted from the overall figure. Once the number of cycles remaining is <= 0, the execute loop finishes.
My question is that many of these emulators don't take into account that the last instruction executed can push the number of cycles to a negative value, meaning that between execution loops one may end up with, say, 1002 cycles executed instead of 1000. Is this significant? Some emulators compensate for this on the next execute loop and some don't; which approach is best? Allow me to illustrate, as I'm not particularly good at putting myself across:
public void execute(int numOfCycles)
{   // execution loop method, called with e.g. 1000
    while (numOfCycles > 0)
    {
        int instruction = readInstruction();
        switch (instruction)
        {
            case 0x40:
                // ...do whatever this opcode does...
                numOfCycles -= 5;
                break;
            // let's say, for argument's sake, this case executes when numOfCycles is 3
        }
    }
}
At the end of this particular example, numOfCycles would end up at -2. This will only ever be a small inaccuracy, but does it matter overall, in people's experience? I'd appreciate anyone's insight on this. I plan to interrupt the CPU after every frame, as this seems appropriate, so I know 1000 cycles is low; it's just an example.
Many thanks,
Phil
Most emulators/simulators deal just with CPU clock ticks.
That is fine for games etc. You have some timer (or whatever), and you run the CPU simulation until it has simulated the duration of that timer interval; then it sleeps until the next timer interval occurs. This is very easy to implement, and you can decrease the timing error with the compensation you are asking about, but as said, for games this is usually unnecessary.
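A minimal sketch of that compensation (C-style, with illustrative names; execute_one_instruction() is a hypothetical stand-in for your opcode dispatch and returns the cycles the opcode took):

static int leftover_cycles = 0;                 /* overshoot carried from the last batch */

int execute_one_instruction(void);              /* hypothetical opcode dispatch */

void execute_batch(int budget)
{
    int remaining = budget - leftover_cycles;   /* pay back the previous overshoot */
    while (remaining > 0)
        remaining -= execute_one_instruction();
    leftover_cycles = -remaining;               /* 0, or the cycles we overshot by */
}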
The timer-based approach has one significant drawback: your code runs for only a fraction of real time. If the timer interval (timing granularity) is big enough, this can be noticeable even in games. For example, if you hit a keyboard key at a time when the emulation sleeps, it is not detected (keys sometimes don't work). You can remedy this by using a smaller timing granularity, but on some platforms that is very hard. In that case the timing error can be more "visible" in software-generated sound (at least to those people who can hear it and are not deaf-ish to such things, like me).
If you need something more sophisticated:
For example, if you want to connect real HW to your emulation/simulation, then you need to emulate/simulate buses. Also, things like a floating bus or system contention are very hard to add to approach #1 (doable, but with great pain).
If you port the timing and emulation to machine cycles, things get much, much easier, and suddenly things like contention, HW interrupts and floating buses almost solve themselves. I ported my ZX Spectrum Z80 emulator to this kind of timing and saw the light. Many things became obvious (like errors in Z80 opcode documentation, timings, etc.). Contention also became very simple (just a few lines of code instead of horrible decoding tables with almost one entry per instruction type). HW emulation also got pretty easy: I added things like FDC controllers and AY chip emulation to the Z80 this way (no hacks, it really runs on their original code, even floppy formatting), so no more TAPE-loading hacks that fail for custom loaders like TURBO.
To make this work, I created my emulation/simulation of the Z80 in a way that uses something like microcode for each instruction. As I very often corrected errors in the Z80 instruction set (there is no single 100% correct document out there that I know of, even if some claim to be bug-free and complete), I came up with a way to deal with it without painfully reprogramming the emulator.
Each instruction is represented by an entry in a table, with info about timing, operands, functionality, and so on; the whole instruction set is the table of all these entries for all instructions. I then built a MySQL database for my instruction set, built similar tables for each instruction-set document I found, and painfully compared all of them, selecting/repairing what was wrong and what was correct. The result is exported to a single text file which is loaded at emulation startup. It sounds horrible, but in reality it simplifies things a lot and even speeds up the emulation, as instruction decoding is now just dereferencing pointers. An example instruction-set data file can be found in What's the proper implementation for hardware emulation.
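Roughly, such a table entry could look like this (the fields are illustrative, not the actual format used above):

/* Illustrative shape of a per-instruction table entry with timing and a
 * pointer to its "microcode" handler. */
typedef struct {
    const char   *mnemonic;        /* e.g. "LD A,(HL)"                        */
    unsigned char opcode[4];       /* encoding bytes, including prefixes      */
    int           length;          /* instruction length in bytes             */
    int           t_states;        /* base timing in T-states                 */
    int           t_states_taken;  /* timing when a condition/branch is taken */
    void        (*execute)(void);  /* handler implementing the operation      */
} insn_entry;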
A few years back I also published a paper on this (sadly the institution that held that conference no longer exists, so the servers hosting those old papers are down for good, but luckily I still have a copy). Here is an image from it that describes the problem:
a) Full throttle has no synchronization, just raw speed
b) #1 has big gaps, causing HW synchronization problems
c) #2 needs to sleep a lot with very small granularity (which can be problematic and slow things down), but the instructions are executed very near their real time...
The red line is the host CPU processing speed (obviously whatever is above it takes a bit more time, so it should be cut and inserted before the next instruction, but that would be hard to draw properly)
The magenta line is the emulated/simulated CPU processing speed
Alternating green/blue colors represent the next instruction
Both axes are time
[edit1] A more precise image
The one above was hand-painted... This one is generated by a VCL/C++ program:
generated by these parameters:
const int iset[]={4,6,7,8,10,15,21,23}; // possible timings [T]
const int n=128,m=sizeof(iset)/sizeof(iset[0]); // number of instructions to emulate, size of iset[]
const int Tps_host=25; // max possible simulation speed [T/s]
const int Tps_want=10; // wanted simulation speed [T/s]
const int T_timer=500; // simulation timer period [T]
So the host can simulate at 250% of the wanted speed, and the simulation granularity is 500 T. The instructions were generated pseudo-randomly...
There was quite an interesting article on Ars Technica recently about console emulation; it also links to quite a few emulators that might make for good research:
Accuracy takes power: one man's 3GHz quest to build a perfect SNES emulator
The relevant bit is that the author mentions, and I am inclined to agree, that most games will appear to function pretty correctly even with timing deviations of +/-20%. The issue you mention looks unlikely to ever introduce more than a fraction of a percent of timing error, which is probably imperceptible while playing the final game. The authors probably didn't consider it worth dealing with.
I guess it depends on how accurate you want your emulator to be, and I do not think it has to be that accurate. Think of emulating the x86 platform: there are so many variants of processors, and each has different execution latencies and issue rates.

How can I count instructions executed on Red Hat Enterprise Linux (x86-64)?

I want to find out how many x86-64 instructions are executed during a given run of a program running on Red Hat Enterprise Linux. I know I can get this information from valgrind but the slowdown is considerable. I also know that we are using Intel Core 2 Quad CPUs (model Q6700) which have hardware performance counters built in. But I don't know of any way to get access to the total number of instructions executed from within a C program.
Performance Application Programming Interface (PAPI) appears to be along the lines of what you are looking for.
From the website:
PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.
Vince Weaver, a Post Doctoral Research Associate with the Innovative Computing Laboratory at the University of Tennessee, did some PAPI-related work. The research listed on his web page at UTK looks like it may provide some additional information.
libpapi is the library you are looking for.
AMD and Intel chips provide the insn counts.
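A minimal sketch using PAPI's classic high-level counter API (PAPI_start_counters/PAPI_stop_counters; newer PAPI releases replaced this interface, so treat the exact calls as version-dependent; link with -lpapi):

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[1] = { PAPI_TOT_INS };   /* total instructions completed */
    long long counts[1];

    if (PAPI_start_counters(events, 1) != PAPI_OK)
        return 1;

    /* ... the code you want to measure ... */

    if (PAPI_stop_counters(counts, 1) != PAPI_OK)
        return 1;

    printf("instructions retired: %lld\n", counts[0]);
    return 0;
}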
The program below accesses the cycle-counter register from C (sorry, non-portable code, but it works fine with gcc). This one counts cycles, which is not the same thing as instructions: modern processors can both spend several cycles on one instruction and execute several instructions at once. Cycles are usually more interesting than the number of instructions, but it depends on your actual purpose.
Other performance counters can certainly be accessed the same way (actually I don't even know whether there are others), but I would have to look up the actual instruction encoding to use.
static __inline__ unsigned long long rdtsc(void)
{
    /* RDTSC returns the timestamp counter in EDX:EAX; combining the halves
       explicitly also works on x86-64, where the old "=A" constraint does not. */
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}
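A possible way to use it, measuring elapsed timestamp-counter ticks around a region of code:

#include <stdio.h>

int main(void)
{
    unsigned long long start = rdtsc();
    /* ... code under measurement ... */
    unsigned long long end = rdtsc();
    printf("elapsed TSC ticks: %llu\n", end - start);
    return 0;
}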
There are a couple of ways you could go about it, depending on exactly what you need. If you just want a static view of the instructions in the binary, you can simply run objdump on it, which will give you the assembly. If you want more detailed information about the instructions actually hit on a given run of the program, you may want to look into DynamoRIO, which provides that functionality. It is similar to valgrind, but I believe it has a smaller effect on performance. I was able to throw together a basic instruction counter with it back in September relatively quickly and easily.
If that's no good, you could try checking out PAPI, which is an API that should let you get at the performance counters on your processors. I've never used it, so I can't speak for it, but a friend of mine used it in a project about 6 months ago and said he found it very helpful.

What are coding conventions for using floating-point in Linux device drivers?

This is related to this question.
I'm not an expert on Linux device drivers or kernel modules. I've been reading "Linux Device Drivers" [O'Reilly] by Rubini & Corbet and a number of online sources, but I haven't been able to find anything on this specific issue yet.
When is a kernel or driver module allowed to use floating-point registers?
And when it does, who is responsible for saving and restoring their contents?
(Assume x86-64 architecture)
If I understand correctly, whenever a KM is running, it is using a hardware context (or hardware thread, or register set -- whatever you want to call it) that has been preempted from some application thread. If you write your KM in C, the compiler will correctly ensure that the general-purpose registers are properly saved and restored (much as in an application), but that doesn't automatically happen with floating-point registers. For that matter, a lot of KMs can't even assume that the processor has any floating-point capability.
Am I correct in guessing that a KM that wants to use floating-point has to carefully save and restore the floating-point state? Are there standard kernel functions for doing this?
Are the coding conventions for this spelled out anywhere? Are they different for SMP and non-SMP drivers? Are they different for older non-preemptive kernels and newer preemptive kernels?
Linus's answer provides this pretty clear quote to use as a guideline:
In other words: the rule is that you really shouldn't use FP in the kernel.
Short answer: kernel code can use floating point if the use is surrounded by kernel_fpu_begin()/kernel_fpu_end(). These functions handle saving and restoring the FPU context. They also call preempt_disable()/preempt_enable(), which means no sleeping, page faults, etc. in the code between those calls. Google the function names for more information.
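A minimal sketch of that pattern as a stand-alone module (the header is <asm/fpu/api.h> on recent x86 kernels, <asm/i387.h> on older ones; the FP work itself is just a placeholder):

#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/fpu/api.h>            /* kernel_fpu_begin()/kernel_fpu_end() */

static void do_float_work(void)
{
    kernel_fpu_begin();             /* saves FPU state, disables preemption */
    /* FP/SSE work goes here: it must not sleep, fault, or schedule */
    kernel_fpu_end();               /* restores FPU state, re-enables preemption */
}

static int __init fpu_demo_init(void)
{
    do_float_work();
    return 0;
}

static void __exit fpu_demo_exit(void)
{
}

module_init(fpu_demo_init);
module_exit(fpu_demo_exit);
MODULE_LICENSE("GPL");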
"If I understand correctly, whenever a KM is running, it is using a hardware context (or hardware thread or register set -- whatever you want to call it) that has been preempted from some application thread."
No, a kernel module can run in user context as well (e.g. when userspace makes syscalls on a device provided by the KM). It has, however, no bearing on the floating-point issue.
"If you write your KM in C, the compiler will correctly ensure that the general-purpose registers are properly saved and restored (much as in an application), but that doesn't automatically happen with floating-point registers."
That is not because of the compiler, but because of the kernel context-switching code.
