I want to implement code using both ARM (integer) instructions and ASIMD (NEON) instructions in parallel. My first question is whether this can be done on ARMv8. Based on this thread, it is possible on ARMv7, but data transfer between NEON and ARM registers takes a considerable amount of time.
Second, I am looking for a way to implement my assembly code in parallel. Here is what I am trying to do:
.
.
.
<ASIMD instruction>
<ASIMD instruction>
<ASIMD instruction>
<Data MOV between ASIMD vectors and ARM Reg>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<Data MOV between ARM Reg and ASIMD vectors>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
.
.
.
I am wondering if I can do this using two threads. I am working on an ARM Cortex-A53 microprocessor. I also have access to an ARM Cortex-A57, but I think these platforms are roughly the same and have equal capabilities.
I think your comments on threading are misplaced here, or you have a background in a hyper-threaded (or other simultaneous multithreading) architecture. Neither the Cortex-A57 nor the Cortex-A53 is an SMT microarchitecture, so at any time you will only have one thread executing on one core. This means your idea of having one thread for Advanced SIMD instructions and one thread for integer/A32/T32 instructions (what you call "ARM instructions") is not going to result in good overall utilisation of a multi-core system.
The thread you linked to discusses a model for the Cortex-A8 microarchitecture in which data dependencies carried through Neon instructions back to A32 instructions cause pipeline bubbles (note that the other comment saying this has to do with memories being synced is incorrect). While it is the case that there is some cost to moving data from Advanced SIMD registers to core registers, the cost is much lower than that thread suggests (see, for example, the Cortex-A57 Software Optimisation Guide, which gives latency numbers for each instruction).
The performance benefits you gain from making use of the vectorised Advanced SIMD instructions will depend on the blend of instructions you intend to use in the A32 and Advanced SIMD portions of your algorithm. Moving the data around too often will have the obvious impact on your execution speed - the more time you spend moving data, the less time you are spending doing the work you intend to do!
The instruction interleaving you propose above is a common way to expose instruction level parallelism, and is likely to work well within a single thread.
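For illustration, here is a minimal sketch of that interleaving in C with NEON intrinsics rather than hand-written assembly (the function, names and data layout are invented for the example, and it assumes AArch64 since vaddvq_u32 is A64-only):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Sums a[] with ASIMD while computing an independent scalar checksum.
   The two dependency chains don't touch, so the core's integer and
   ASIMD pipelines can overlap them. Tail elements omitted for brevity. */
uint32_t sum_and_scale(const uint32_t *a, uint32_t *b, size_t n, uint32_t k)
{
    uint32x4_t vacc = vdupq_n_u32(0);
    uint32_t sacc = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint32x4_t v = vld1q_u32(&a[i]);      /* ASIMD load             */
        vacc = vaddq_u32(vacc, v);            /* ASIMD accumulate       */
        sacc += k * a[i];                     /* independent scalar op  */
        vst1q_u32(&b[i], vmulq_n_u32(v, k));  /* ASIMD multiply + store */
    }
    /* One cross-register-file move at the end, not one per iteration. */
    return sacc + vaddvq_u32(vacc);
}

Note that the only ASIMD-to-core-register transfer is the final reduction; keeping such moves out of the loop body is what avoids the transfer cost discussed above.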
I am not sure what you mean by "in parallel". Neither Cortex-A53 nor Cortex-A57 supports simultaneous multithreading (although it is possible to have several CPUs in the same chip, which is a different matter).
What you can do on the Cortex-A57 (and to a much smaller degree on the A53) is exploit the fact that execution is mostly out-of-order. If you don't have dependencies between the instructions, a long-latency instruction can execute, and during that time you can execute the shorter instructions. But exploiting this by hand is very difficult, and the best approach may be to trust that the CPU will do as much out-of-order execution as it can.
How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop can be used to monitor general CPU usage, but not specifically SIMD instruction usage.
I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).
See How to characterize a workload by obtaining the instruction type breakdown? and How do I determine the number of x86 machine instructions executed in a C program? - specifically sde64 -mix -- ./my_program to print the instruction mix for your program for that run; there is example output in libsvm compiled with AVX vs. no AVX.
I don't think there's a good way to make this work like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.
It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. Not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.
You'd need hardware instruction-counting support for this to be really efficient the way top is.
For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.
e.g. from perf list output:
fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired. Each count represents 4
computations. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT
DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as
they perform multiple calculations per element]
So it's not even counting uops, it's counting FLOPs. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)
You can use perf to count their execution globally across processes and on all cores, or for a single process:
## count FP math instructions only, not SIMD integer, load/store, or anything else
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u ./my_program
# (events written out in full: shell brace-expansion would separate them with spaces, not the commas -e needs)
(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD and scalar instructions on XMM registers don't count, IMO.)
(You can attach perf to a running process by using -p PID instead of a command, or use perf top, as suggested in Ubuntu - how to tell if AVX or SSE is currently being used by CPU app?)
You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.
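For example (a minimal sketch: the event name is the Skylake-era one from the perf list output above, so check perf list on your machine, and sleep 10 merely time-bounds the system-wide count):

# system-wide, user-space only, for roughly ten seconds
perf stat -a -e fp_arith_inst_retired.256b_packed_single:u sleep 10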
Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.
I have to implement the RISC-V architecture (the ISA for a pipelined processor) in C++. Since the whole ISA cannot be implemented, can someone tell me the most important instructions, approximately 40, that I should include?
please help
The most important subset is RV32I. It is about 40 instructions in size.
https://riscv.org/specifications/
Chapter 2
RV32I was designed to be sufficient to form a compiler target and to support modern operating system environments. The ISA was also designed to reduce the hardware required in a minimal implementation. RV32I contains 47 unique instructions, though a simple implementation might cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM hardware instruction that always traps and might be able to implement the FENCE and FENCE.I instructions as NOPs, reducing hardware instruction count to 38 total. RV32I can emulate almost any other ISA extension (except the A extension, which requires additional hardware support for atomicity).
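To make the scale concrete, here is a hedged sketch in C (the question says C++, where the same code works) of how an interpreter might decode and execute one RV32I instruction, ADDI; the field positions come from the I-type format in the spec linked above, everything else is illustrative:

#include <stdint.h>

/* RV32I ADDI: I-type, opcode 0x13, funct3 0. x[] is the register file. */
void execute_addi(uint32_t inst, uint32_t x[32])
{
    uint32_t opcode = inst & 0x7f;
    uint32_t rd     = (inst >> 7)  & 0x1f;
    uint32_t funct3 = (inst >> 12) & 0x07;
    uint32_t rs1    = (inst >> 15) & 0x1f;
    /* imm[11:0] lives in inst[31:20]; the cast-and-shift sign-extends it
       (assumes arithmetic right shift, as on GCC/Clang). */
    int32_t imm = (int32_t)inst >> 20;

    if (opcode == 0x13 && funct3 == 0 && rd != 0)  /* x0 stays hard-wired to 0 */
        x[rd] = x[rs1] + (uint32_t)imm;
}

The remaining RV32I instructions decode the same way, so a complete interpreter is essentially one switch over opcode, funct3 and funct7.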
Some cryptographic functions require a consistent execution duration to avoid timing attacks. I have read that such functions are hard to write for x86, for reasons potentially including the internally emulated (micro-coded) nature of the ISA and out-of-order execution. Preventing timing attacks on x86 is therefore not easy, because it depends on factors that are complex and/or unknown at any given moment.
In a standard RISC-V core, are instruction timings predictably consistent relative to each another? What about in the case of a standard core with out-of-order processing or proprietary implementations of the base ISA?
RISC-V could be implemented in a machine with deterministic latencies; this has to do more with the implementation than the ISA.
See this project for a RISC-V implementation that supports predictable-latency execution: https://github.com/pretis/flexpret. It was developed for the embedded space, but would seem to be suitable for your proposed application as well.
It is important to differentiate an ISA from an implementation of it. Nothing in the RISC-V spec mandates instruction execution latencies. Most implementations will do whatever gives them the highest performance. A security-paranoid processor could be designed to have consistent latencies for all instructions and yet still conform to the RISC-V spec.
A nice feature of RISC-V is that plenty of opcode space was intentionally left unused to make room for ISA extensions. There appear to be no publicly announced plans for a crypto extension, so constant-time requirements could be incorporated into such an extension when one is made, if needed.
I'm not sure about the core in general, but I've read that the RISC-V Cryptography Extensions Volume I (riscv-crypto-spec-scalar-v1.0.1.pdf) requires this of cryptographic instructions:
This instruction must always be implemented such that its execution latency does not depend on the data being operated on.
So in the context of cryptographic-specific instructions, yes.
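For context, that data-independence property is what constant-time code builds on. A minimal sketch in C (illustrative names; it is constant-time only if the underlying XOR/OR instructions have data-independent latency, which is exactly what the quoted requirement guarantees for the crypto instructions):

#include <stddef.h>
#include <stdint.h>

/* Runtime depends only on len, never on the byte values compared. */
int ct_memcmp(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];   /* no data-dependent branch or load */
    return diff != 0;          /* 0 if equal, 1 if different */
}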
"is there a standard for how long each instruction should take to complete relative to other operations?"
No.
Such behavior is consistent with all other major ISAs, as far as I am aware.
An out-of-order processor will execute instructions as their dependencies resolve. Cache misses and the potentially random nature of issue select will mean that successive loop iterations will behave differently with regards to when instructions execute relative to one another. Any number of other micro-architecture issues get in the way, including instruction fetch misses, dcache misses, resource stalls causing replays, etc. Even a typical in-order core will face such issues.
how does the RISC-V team plan to address potential standard or non-standard complexity that a cryptographic library developer must find some way to address?
I can't speak for the RISC-V team, but if I may hazard a guess, I suspect that this (and similar) areas will involve the wider community to discuss and address such issues.
I was wondering where one would go about getting CPU opcode cycle counts for various machines. An example of what I'm talking about can be seen at this link:
https://web.archive.org/web/20150217051448/http://www.obelisk.demon.co.uk/6502/reference.html
If you examine the MAME source code, especially under src\emu\cpu, you'll see that most of the CPU models keep track of the cycle count in a similar way. My question is: where does one go about getting this information, or reverse engineering it if it's not available? I've never seen any 'official' ASM programmer's guide contain cycle count info. My initial guess is that a small program is thrown into the real hardware's boot ROM, and if the ISA has an opcode equivalent to RDTSC, something like this is done:
RDTSC
; opcode of choice
RDTSC
But what would you do if such support weren't available? I know that for older hardware the MAME team has access to nothing but the ROMs and scattered documentation.
Up through about the Pentium, cycle counts were easy to find for Intel and AMD processors (and most competitors). Starting with the Pentium Pro and AMD K5, however, the CPU went to a dynamic execution model, in which instructions can be executed out of order. In this case, the time taken to execute an instruction depends heavily upon the data it uses, and whether (for example) it depends on data from a previous instruction (in which case, it has to wait for that instruction to complete before it can execute).
There are also constraints on things like how many instructions can be decoded per cycle (e.g. at least one, plus two more as long as they're "simple") and how many can be retired per cycle (usually around three or four).
As a result, on a modern CPU it's almost meaningless to talk about the cycles for a given instruction in isolation. Meaningful results require a stream of instructions, so you look not only at that instruction, but what comes before and after it. An instruction that's a serious bottleneck in one instruction stream might be essentially free in another stream (e.g. if you have one multiplication mixed in with a lot of adds, the multiplication might be almost free -- but if it's surrounded by a lot of other multiplications, it might be relatively expensive).
The accepted RDTSC approach should include a serializing instruction to ensure that all previous instructions have retired before getting the count. This adds overhead to the measurement, but you can simply "count" zero instructions and subtract that value from the measured result.
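A minimal sketch of that idea in C for x86-64 (assumes GCC or Clang; _mm_lfence plus __rdtsc is one common fencing idiom, and the exact serialization guarantees vary by microarchitecture):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static inline uint64_t tsc_serialized(void)
{
    _mm_lfence();            /* wait for earlier instructions to finish */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* keep later instructions from moving up  */
    return t;
}

int main(void)
{
    uint64_t empty = tsc_serialized();   /* measure the overhead...    */
    empty = tsc_serialized() - empty;    /* ...of an empty measurement */

    uint64_t t0 = tsc_serialized();
    volatile int x = 1;                  /* instruction(s) under test  */
    x += x;
    uint64_t t1 = tsc_serialized();

    printf("approx cycles: %llu\n", (unsigned long long)(t1 - t0 - empty));
    return 0;
}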
Agner Fog's PDF manuals cover this very well:
http://www.agner.org/optimize/#manuals
Today I came across this question:
You have this code:

static int counter = 0;

void worker() {
    for (int i = 1; i <= 10; i++)
        counter++;
}
If worker were called from two different threads, what value would counter have after both of them finished?
I know that in principle it could be anything. But my gut tells me that counter++ will most likely be translated into a single assembler instruction, and that if both threads execute on the same core, counter will be 20.
But what if those threads run on different cores or processors? Could there be a race condition in their microcode? Can one assembler instruction always be viewed as an atomic operation?
Specifically for x86, and regarding your example: counter++, there are a number of ways it could be compiled. The most trivial example is:
inc counter
This translates into the following micro operations:
load counter to a hidden register on the CPU
increment the register
store the updated register in counter
This is essentially the same as:
mov eax, counter
inc eax
mov counter, eax
Note that if some other agent updates counter between the load and the store, it won't be reflected in counter after the store. This agent could be another thread in the same core, another core in the same CPU, another CPU in the same system, or even some external agent that uses DMA (Direct Memory Access).
If you want to guarantee that this inc is atomic, use the lock prefix:
lock inc counter
lock guarantees that nobody can update counter between the load and the store.
Regarding more complicated instructions, you usually can't assume that they'll execute atomically, unless they support the lock prefix.
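If you are writing this in C rather than assembly, a hedged sketch of the portable way to get that locked read-modify-write is C11 atomics; on x86-64, GCC and Clang typically compile the loop body below to a single lock add:

#include <stdatomic.h>

static atomic_int counter = 0;

void worker(void)
{
    for (int i = 1; i <= 10; i++) {
        /* one atomic read-modify-write; no other agent can slip in between */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }
}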
The answer is: it depends!
There is some confusion here about what an assembler instruction is. Normally, one assembler instruction is translated into exactly one machine instruction. The exception is when you use macros, but you should be aware of that.
That said, the question boils down to: is one machine instruction atomic?
In the good old days it was. But today, with complex CPUs, long-running instructions, hyperthreading, and so on, it is not. Some CPUs do guarantee that certain increment/decrement instructions are atomic; the reason is that they are handy for very simple synchronization.
Some CPU operations are less problematic, though. A simple fetch of one piece of data that the processor can read in one piece is of course atomic, because there is nothing to be divided at all. But with unaligned data, it becomes complicated again.
The answer is: it depends. Carefully read the vendor's machine instruction manual. When in doubt, assume it is not atomic!
Edit:
Oh, I see now that you also ask about counter++. The claim that it is "most likely to be translated" into a single instruction cannot be trusted at all. This depends largely on the compiler, of course, and it gets more complicated when the compiler applies different optimizations.
Not always - on some architectures one assembly instruction is translated into one machine code instruction, while on others it is not.
In addition, you can never assume that the programming language you are using compiles a seemingly simple line of code into one assembly instruction. Moreover, on some architectures you cannot assume that one machine instruction executes atomically.
Use proper synchronization techniques instead, dependent on the language you are coding in.
Increment/decrement operations on integer variables of 32 bits or less, on a single 32-bit processor without Hyper-Threading Technology, are atomic.
On a processor with Hyper-Threading Technology, or on a multi-processor system, increment/decrement operations are NOT guaranteed to execute atomically.
Invalidated by Nathan's comment:
If I remember my Intel x86 assembler correctly, the INC instruction only works on registers and does not work directly on memory locations.
So a counter++ would not be a single instruction in assembler (ignoring the post-increment part). It would be at least three instructions: load the counter variable into a register, increment the register, and store the register back to counter. And that is just for the x86 architecture.
In short, don't rely on it being atomic unless the language specification says it is and the compiler you are using supports that guarantee.
Another issue is that if you don't declare the variable as volatile, the generated code would probably not update the memory on every loop iteration; the memory might only be updated at the end of the loop.
No, you cannot assume this, unless it is clearly stated in the compiler's or architecture's specification. Moreover, no one can guarantee that a single assembler instruction is indeed atomic; in practice, each assembler instruction is translated into a number of micro-operations (uops).
Also, the issue of race conditions is tightly coupled with the memory model (coherence, sequential consistency, release consistency, etc.); for each one, the answer and result could be different.
In most cases, no. In fact, on x86, you can perform the instruction
push [address]
which, in C, would be something like:
*--stack = *address;
This performs two memory transfers in one instruction.
That's basically impossible to do in one clock cycle, not least because one memory transfer is also not possible in one cycle!
Might not be an actual answer to your question, but (assuming this is C#, or another .NET language) if you want counter++ to really be atomic across threads, you could use System.Threading.Interlocked.Increment(ref counter).
See other answers for actual information on the many different ways why/how counter++ could not be atomic. ;-)
On many other processors, the separation between the memory system and the processor is bigger. (Often these processors can be little- or big-endian depending on the memory system, like ARM and PowerPC.) This also has consequences for atomic behaviour if the memory system can reorder reads and writes.
For this purpose, there are memory barriers (http://en.wikipedia.org/wiki/Memory_barrier)
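As a hedged illustration (a minimal C11 sketch, not any particular library's API): a release fence ordering a payload write before the flag write that publishes it.

#include <stdatomic.h>

/* The fence keeps the payload write from being reordered
   after the flag write that makes it visible. */
void publish(int *data, atomic_int *flag)
{
    *data = 42;                                 /* payload        */
    atomic_thread_fence(memory_order_release);  /* memory barrier */
    atomic_store_explicit(flag, 1, memory_order_relaxed);
}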
So in short: while atomic instructions are enough on Intel (with the relevant lock prefixes), more must be done on non-Intel architectures, since the memory I/O might not happen in the same order.
This is a known problem when porting "lock-free" solutions from Intel to other architectures.
(Note that multiprocessor (not just multicore) systems on x86 also seem to need memory barriers, at least in 64-bit mode.)
I think that you'll get a race condition on access.
If you wanted to ensure an atomic operation in incrementing counter then you'd need to use ++counter.