for-loop (downto, reversed) vs for-loop (ascending) vs while-loop in Delphi - micro-optimization

I have a micro-optimization question. I have three ways of processing an array through a typed pointer. Which one is better?
1
for I := 0 to ArrCount - 1 do
begin // the loop variable I is not used in the body
  Inc(P); // P is a typed pointer
  // do something
end;
2
for I := ArrCount - 1 downto 0 do
begin // the loop variable I is not used in the body
  Inc(P); // P is a typed pointer
  // do something
end;
3
while ArrCount > 0 do
begin
  Inc(P); // P is a typed pointer
  // do something
  Dec(ArrCount);
end;

The answer that I will give to this question is rather more mundane than perhaps you are expecting. The fastest of these variants is the one that, wait for it, is timed to run most quickly.
It's entirely plausible that on different architectures you'll find that different variants win.
It's also conceivable that different variants will win depending on what is in the body of the loop.
It's also quite possible that the body of the loop takes sufficient time that the loop itself is negligible in comparison.
In short, it depends. Since only you know what happens inside the body, only you can answer the specific question.
As an aside, if the loop body does not refer to the loop variable, then the compiler rewrites the ascending loop as if it were a descending loop. So there may in fact be only two variants here. Indeed, that might mean that all three variants lead to identical compiled code!
Some advice:
Never optimise without profiling.
Never optimise code that is not a bottleneck.
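For instance, a minimal timing harness might look like this (a sketch only; TStopwatch lives in System.Diagnostics and needs Delphi 2010 or later, so on D2007 you could fall back on QueryPerformanceCounter, and RunVariant1 is a hypothetical wrapper around one of the loops above):
uses
  System.Diagnostics;

procedure TimeVariant1;
var
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  RunVariant1; // hypothetical: runs one of the three loops above
  SW.Stop;
  Writeln('Variant 1: ', SW.ElapsedMilliseconds, ' ms');
end;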
Now, if you want me to take a guess, I predict that for any loop body that is more than a trivial nop, you'll find it hard to find any measurable difference between these variants.
I also see that you are using a pointer to walk across an array. You might find that if this code is a bottleneck, and if the loop body just handles this array iteration, using arr[] indexing is more effective than pointer arithmetic. But again, it depends on many things; you have to profile, and look at the code the compiler produces.
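For example, here are the two styles side by side (a sketch; Process is a hypothetical routine):
procedure WalkWithPointer(const Arr: array of Integer);
var
  I: Integer;
  P: PInteger;
begin
  P := @Arr[0];
  for I := 0 to High(Arr) do
  begin
    Process(P^); // walk the array through a typed pointer
    Inc(P);
  end;
end;

procedure WalkWithIndex(const Arr: array of Integer);
var
  I: Integer;
begin
  // The compiler can fold the scaled index into the addressing mode,
  // so this is often at least as fast as the pointer walk.
  for I := 0 to High(Arr) do
    Process(Arr[I]);
end;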

Funny, but looking at the disassembly window, the generated code depends on whether the loop variable is used inside the loop.
1) Not used - the code is almost identical:
Project17.dpr.12: for i := 0 to 3 do
0040914D B804000000 mov eax,$00000004
Project17.dpr.13: Inc(j);
00409152 43 inc ebx
Project17.dpr.12: for i := 0 to 3 do
00409153 48 dec eax
00409154 75FC jnz $00409152
Project17.dpr.15: for i := 3 downto 0 do
00409156 B8FCFFFFFF mov eax,$fffffffc
Project17.dpr.16: Inc(j);
0040915B 43 inc ebx
Project17.dpr.15: for i := 3 downto 0 do
0040915C 40 inc eax
0040915D 75FC jnz $0040915b
2) Used - the first variant is a bit faster, because xor is faster than mov:
Project17.dpr.12: for i := 0 to 3 do
0040914D 33C0 xor eax,eax
Project17.dpr.13: Inc(j, i);
0040914F 03D8 add ebx,eax
00409151 40 inc eax
Project17.dpr.12: for i := 0 to 3 do
00409152 83F804 cmp eax,$04
00409155 75F8 jnz $0040914f
Project17.dpr.15: for i := 3 downto 0 do
00409157 B803000000 mov eax,$00000003
Project17.dpr.16: Inc(j, i);
0040915C 03D8 add ebx,eax
0040915E 48 dec eax
Project17.dpr.15: for i := 3 downto 0 do
0040915F 83F8FF cmp eax,-$01
00409162 75F8 jnz $0040915c
You can check the third variant yourself.
PS: I am using D2007 for this test.

What is the point of atomic.Load and atomic.Store

In Go's memory model nothing is stated about atomics and their relation to memory fencing.
Although many internal packages seem to rely on the memory ordering that could be provided if atomics created memory fences around them. See this issue for details.
After not understanding how it really works, I went to the sources, in particular src/runtime/internal/atomic/atomic_amd64.go, and found the following implementations of Load and Store:
//go:nosplit
//go:noinline
func Load(ptr *uint32) uint32 {
    return *ptr
}
Store is implemented in asm_amd64.s in the same package.
TEXT runtime∕internal∕atomic·Store(SB), NOSPLIT, $0-12
    MOVQ ptr+0(FP), BX
    MOVL val+8(FP), AX
    XCHGL AX, 0(BX)
    RET
Both look as if they had nothing to do with parallelism.
I did look into other architectures, but the implementations seem to be equivalent.
However, if atomics are indeed weak and provide no memory ordering guarantees, then the code below could fail, but it does not.
In addition, I tried replacing the atomic calls with plain assignments, but it still produces a consistent and "successful" result in both cases.
func try() {
    var a, b int32
    go func() {
        // atomic.StoreInt32(&a, 1)
        // atomic.StoreInt32(&b, 1)
        a = 1
        b = 1
    }()
    for {
        // if n := atomic.LoadInt32(&b); n == 1 {
        if n := b; n == 1 {
            if a != 1 {
                panic("fail")
            }
            break
        }
        runtime.Gosched()
    }
}

func main() {
    n := 1000000000
    for i := 0; i < n; i++ {
        try()
    }
}
The next thought was that the compiler does some magic to provide ordering guarantees. So below is the listing of the variant with the atomic Store and Load calls not commented out. The full listing is available on the pastebin.
// Anonymous function implementation with atomic calls inlined
TEXT "".try.func1(SB) gofile../path/atomic.go
atomic.StoreInt32(&a, 1)
0x816 b801000000 MOVL $0x1, AX
0x81b 488b4c2408 MOVQ 0x8(SP), CX
0x820 8701 XCHGL AX, 0(CX)
atomic.StoreInt32(&b, 1)
0x822 b801000000 MOVL $0x1, AX
0x827 488b4c2410 MOVQ 0x10(SP), CX
0x82c 8701 XCHGL AX, 0(CX)
}()
0x82e c3 RET
// Important "cycle" part of try() function
0x6ca e800000000 CALL 0x6cf [1:5]R_CALL:runtime.newproc
for {
0x6cf eb12 JMP 0x6e3
runtime.Gosched()
0x6d1 90 NOPL
checkTimeouts()
0x6d2 90 NOPL
mcall(gosched_m)
0x6d3 488d0500000000 LEAQ 0(IP), AX [3:7]R_PCREL:runtime.gosched_m·f
0x6da 48890424 MOVQ AX, 0(SP)
0x6de e800000000 CALL 0x6e3 [1:5]R_CALL:runtime.mcall
if n := atomic.LoadInt32(&b); n == 1 {
0x6e3 488b442420 MOVQ 0x20(SP), AX
0x6e8 8b08 MOVL 0(AX), CX
0x6ea 83f901 CMPL $0x1, CX
0x6ed 75e2 JNE 0x6d1
if a != 1 {
0x6ef 488b442428 MOVQ 0x28(SP), AX
0x6f4 833801 CMPL $0x1, 0(AX)
0x6f7 750a JNE 0x703
0x6f9 488b6c2430 MOVQ 0x30(SP), BP
0x6fe 4883c438 ADDQ $0x38, SP
0x702 c3 RET
As you can see, no fences or locks are in place again.
Note: all tests are done on x86_64 and i5-8259U
The question:
So, is there any point of wrapping simple pointer dereference in a function call or is there some hidden meaning to it and why do these atomics still work as memory barriers? (if they do)
I don't know Go at all, but it looks like the x86-64 implementations of .load() and .store() are sequentially-consistent. Presumably on purpose / for a reason!
//go:noinline on the load means the compiler can't reorder around a blackbox non-inline function, I assume. On x86 that's all you need for the load side of sequential-consistency, or acq-rel. A plain x86 mov load is an acquire load.
The compiler-generated code gets to take advantage of x86's strongly-ordered memory model, which is sequential consistency + a store buffer (with store forwarding), i.e. acq/rel. To recover sequential consistency, you only need to drain the store buffer after a release-store.
.store() is written in asm, loading its stack args and using xchg as a seq-cst store.
XCHG with memory has an implicit lock prefix which is a full barrier; it's an efficient alternative to mov+mfence to implement what C++ would call a memory_order_seq_cst store.
It flushes the store buffer before later loads and stores are allowed to touch L1d cache. Why does a std::atomic store with sequential consistency use XCHG?
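For comparison, here is roughly what those two operations look like in C++ terms (a sketch, not Go's actual code): on x86-64, mainstream compilers emit a plain mov for the seq-cst load and an xchg for the seq-cst store, matching the Go implementations above.
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> g;

uint32_t load()
{
    return g.load(std::memory_order_seq_cst);  // compiles to a plain mov
}

void store(uint32_t v)
{
    g.store(v, std::memory_order_seq_cst);     // compiles to xchg
}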
See
https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
C/C++11 mappings to processors
describes the sequences of instructions that implement relaxed load/store, acq/rel load/store, seq-cst load/store, and various barriers, on various ISAs. So you can recognize things like xchg with memory.
Does lock xchg have the same behavior as mfence? (TL:DR: yes except for maybe some corner cases with NT loads from WC memory, e.g. from video RAM). You may see a dummy lock add $0, (SP) used as an alternative to mfence in some code.
IIRC, AMD's optimization manual even recommends this. It's good on Intel as well, especially on Skylake where mfence was strengthened by microcode update to fully block out-of-order exec even of ALU instructions (like lfence) as well as memory reordering. (To fix an erratum with NT loads.)
https://preshing.com/20120913/acquire-and-release-semantics/

In MSVC, why do InterlockedOr and InterlockedAnd generate a loop instead of a simple locked instruction?

On MSVC for x64 (19.10.25019),
InterlockedOr(&g, 1)
generates this code sequence:
prefetchw BYTE PTR ?g@@3JC
mov eax, DWORD PTR ?g@@3JC ; g
npad 3
$LL3@f:
mov ecx, eax
or ecx, 1
lock cmpxchg DWORD PTR ?g@@3JC, ecx ; g
jne SHORT $LL3@f
I would have expected the much simpler (and loopless):
mov eax, 1
lock or [?g@@3JC], eax
InterlockedAnd generates analogous code to InterlockedOr.
It seems wildly inefficient to have to use a loop for this operation. Why is this code generated?
(As a side note: the whole reason I was using InterlockedOr was to do an atomic load of the variable - I have since learned that InterlockedCompareExchange is the way to do this. It is odd to me that there is no InterlockedLoad(&x), but I digress...)
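For reference, the usual InterlockedCompareExchange idiom for an atomic load is a compare-exchange of the value with itself (a sketch; InterlockedLoad is a hypothetical name, not an actual Win32 API):
#include <windows.h>

// Hypothetical helper: if *p is 0 it is replaced with 0 (no visible
// change); otherwise nothing is written. Either way the call returns
// the value *p held, i.e. it acts as an atomic load with a full barrier.
LONG InterlockedLoad(LONG volatile *p)
{
    return InterlockedCompareExchange(p, 0, 0);
}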
The documented contract for InterlockedOr has it returning the original value:
InterlockedOr
Performs an atomic OR operation on the specified LONG values. The function prevents more than one thread from using the same variable simultaneously.
LONG __cdecl InterlockedOr(
_Inout_ LONG volatile *Destination,
_In_ LONG Value
);
Parameters:
Destination [in, out]
A pointer to the first operand. This value will be replaced with the result of the operation.
Value [in]
The second operand.
Return value
The function returns the original value of the Destination parameter.
This is why the unusual code that you've observed is required. The compiler cannot simply emit an OR instruction with a LOCK prefix, because the OR instruction does not return the previous value. Instead, it has to use the odd workaround with LOCK CMPXCHG in a loop. In fact, this apparently unusual sequence is the standard pattern for implementing interlocked operations when they aren't natively supported by the underlying hardware: capture the old value, perform an interlocked compare-and-exchange with the new value, and keep trying in a loop until the old value from this attempt is equal to the captured old value.
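In C, the emitted loop corresponds roughly to this sketch (MyInterlockedOr is a hypothetical name; _InterlockedCompareExchange is the documented intrinsic from <intrin.h>):
#include <intrin.h>

long MyInterlockedOr(long volatile *dest, long value)
{
    long old, desired;
    do {
        old = *dest;            // capture the current value
        desired = old | value;  // compute the new value
        // The store happens only if *dest still equals old; the
        // intrinsic returns the value *dest held before the call.
    } while (_InterlockedCompareExchange(dest, desired, old) != old);
    return old;                 // the original value, per the contract
}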
As you observed, you see the same thing with InterlockedAnd, for exactly the same reason: the x86 AND instruction doesn't return the original value, so the code generator has to fall back on the general pattern involving compare-and-exchange, which is directly supported by the hardware.
Note that, at least on x86 where InterlockedOr is implemented as an intrinsic, the optimizer is smart enough to figure out whether you're using the return value or not. If you are, then it uses the workaround code involving CMPXCHG. If you are ignoring the return value, then it goes ahead and emits code using LOCK OR, just like you would expect.
#include <intrin.h>

LONG InterlockedOrWithReturn()
{
    LONG val = 42;
    return _InterlockedOr(&val, 8);
}

void InterlockedOrWithoutReturn()
{
    LONG val = 42;
    LONG old = _InterlockedOr(&val, 8);
}
InterlockedOrWithoutReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
lock or DWORD PTR [rsp+8], 8
ret 0
InterlockedOrWithoutReturn ENDP
InterlockedOrWithReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
prefetchw BYTE PTR [rsp+8]
mov eax, DWORD PTR [rsp+8]
LoopTop:
mov ecx, eax
or ecx, 8
lock cmpxchg DWORD PTR [rsp+8], ecx
jne SHORT LoopTop
ret 0
InterlockedOrWithReturn ENDP
The optimizer is equally smart about InterlockedAnd, and should be for the other Interlocked* functions as well.
As intuition would tell you, the LOCK OR implementation is more efficient than the LOCK CMPXCHG in a loop. Not only is there the expanded code size and the overhead of looping, but you risk branch prediction misses, which can cost a large number of cycles. In performance-critical code, if you can avoid relying on the return value for interlocked operations, you can gain a performance boost.
However, what you really should be using in modern C++ is std::atomic, which allows you to specify the desired memory model/semantics, and then let the standard library maintainers deal with the complexity.
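A sketch of the std::atomic equivalent (compilers can apply the same optimization: when the result of fetch_or is unused they are free to emit a single lock or, while using the result forces the cmpxchg loop on x86):
#include <atomic>

std::atomic<long> g{0};

void SetFlag()
{
    g.fetch_or(1, std::memory_order_seq_cst);        // result unused
}

long SetFlagAndGetOld()
{
    return g.fetch_or(1, std::memory_order_seq_cst); // result used
}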

VHDL - String indexing - RAM usage and total logic elements increase by over 100% each

I'm hoping someone with more VHDL experience can enlighten me! To summarise, I have an LCD entity and a Main entity which instantiates it. The LCD takes an 84-character wide string ("msg"), which causes me huge problems as soon as I index it using a variable or signal, and I have no idea why. The string displays hex values: each clock cycle I read a 16-bit value, and I need to update 4 characters of the string, one per nybble of that value. This doesn't need to happen in a single clock cycle, since a new value is read only after a large number of cycles; however, experimenting with incrementing a "t" variable, and changing only one character of the string per "t", makes no difference for whatever reason.
The error is: "Error (170048): Selected device has 26 RAM location(s) of type M4K" However, the current design needs more than 26 to successfully fit
Here is the compilation report with the problem:
Flow Status Flow Failed - Tue Aug 08 18:49:21 2017
Quartus II 64-Bit Version 13.0.1 Build 232 06/12/2013 SP 1 SJ Web Edition
Revision Name Revision1
Top-level Entity Name Main
Family Cyclone II
Device EP2C5T144C6
Timing Models Final
Total logic elements 6,626 / 4,608 ( 144 % )
Total combinational functions 6,190 / 4,608 ( 134 % )
Dedicated logic registers 1,632 / 4,608 ( 35 % )
Total registers 1632
Total pins 50 / 89 ( 56 % )
Total virtual pins 0
Total memory bits 124,032 / 119,808 ( 104 % )
Embedded Multiplier 9-bit elements 0 / 26 ( 0 % )
Total PLLs 1 / 2 ( 50 % )
The RAM summary table contains 57 rows of "LCD:display|altsyncram:Mux####_rtl_0|altsyncram_####:auto_generated|ALTSYNCRAM".
Here is the LCD entity:
entity LCD is
  generic(
    delay_time  : integer := 50000;
    half_period : integer := 7
  );
  port(
    clk  : in  std_logic;
    SCE  : out std_logic := '1';
    DC   : out std_logic := '1';
    RES  : out std_logic := '0';
    SCLK : out std_logic := '1';
    SDIN : out std_logic := '0';
    op   : in  std_logic_vector(2 downto 0);
    msg  : in  string(1 to 84);
    jx   : in  integer range 0 to 255 := 0;
    jy   : in  integer range 0 to 255 := 0;
    cx   : in  integer range 0 to 255 := 0;
    cy   : in  integer range 0 to 255 := 0
  );
end entity;
The following code is what causes the problem, where a, b, c and d are variables which are incremented by 4 after each read:
msg(a) <= getHex(data(3 downto 0));
msg(b) <= getHex(data(7 downto 4));
msg(c) <= getHex(data(11 downto 8));
msg(d) <= getHex(data(15 downto 12));
Removing some of these lines causes the memory and logic element usages to both drop, but they still seem absurdly high, and I don't understand the cause.
Replacing a, b, c and d with integers, like 1, 2, 3 and 4 causes the problem to go away completely, with the logic elements at 22%, and RAM usage at 0%!
If anybody has any ideas at all, I'd be very grateful! I will post the full code below in case anybody needs it... but be warned, it's a bit messy, and I feel like the problem could be simple. Many thanks in advance!
Main.vhd
LCD.vhd
There are a few issues here.
The first is that HDL synthesis tools do an awful lot of optimization. What this basically means is that if you don't properly connect up inputs and outputs to/from something, it is likely (but not certain) to get eliminated by the optimizer.
The second is that you have to be very careful with loops and functions. Basically, loops will be unrolled and functions will be inlined, so a small amount of code can generate an awful lot of logic.
The third is that under some circumstances arrays will be translated to memory elements.
As pointed out in a comment, this loop is the root cause of the large amount of memory usage.
for j in 0 to 83 loop
  for i in 0 to 5 loop
    pixels((j*6) + i) <= getByte(msg(j+1), i);
  end loop;
end loop;
This has the potential to use a hell of a lot of memory resources. Each call to "getByte" requires a read port on (parts of) "ram", but block RAMs only have two read ports. So "ram" gets duplicated to satisfy the need for more read ports. The inner loop reads different parts of the same location, so basically each iteration of the outer loop needs an independent read port on the RAM. That is about 40 copies of the RAM; reading the Cyclone II datasheet, each copy will require 2 M4K blocks.
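If the string really does need to be indexed with signals, one workaround is to serialize the reads with counters so that only one read port is needed. A sketch (reusing the msg, pixels and getByte declarations from LCD.vhd; ci and cj are new counter signals): the refresh then takes 84*6 clock cycles, which should be acceptable since a new value arrives only after a large number of cycles anyway.
-- in the architecture declarative part:
signal ci : integer range 0 to 5 := 0;   -- byte index within a character
signal cj : integer range 0 to 83 := 0;  -- character index within msg

-- in the architecture body:
process(clk)
begin
  if rising_edge(clk) then
    pixels(cj*6 + ci) <= getByte(msg(cj+1), ci);
    if ci = 5 then
      ci <= 0;
      if cj = 83 then
        cj <= 0;
      else
        cj <= cj + 1;
      end if;
    else
      ci <= ci + 1;
    end if;
  end if;
end process;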
So why doesn't this happen when you use numbers instead of the variables a,b,c and d?
If the compiler can figure out that something is a constant, it can compute it at compile time. This would limit the number of assignments to "pixels" that actually have to be translated to memory blocks, rather than just having their result hardcoded. Still, I'm surprised it's dropping to zero.
I notice your code doesn't actually have any inputs other than the clock and an "rx" input that doesn't actually seem to be used for anything, so it is quite possible that the synthesizer is figuring out a hell of a lot of stuff at build time. Often eliminating one bit of code can allow another bit to be eliminated, until you have nothing left.

Delphi Copy-on-Write for String

I have code like this:
function Test: string;
var
  r, s: string;
begin
  r := 'Hello';
  Writeln(NativeInt(PChar(@r[1])));
  s := r;
  Writeln(NativeInt(PChar(@s[1])));
  Result := r;
  Writeln(NativeInt(PChar(@Result[1])));
end;
People say that Delphi uses copy-on-write for strings, but the above function prints 3 different addresses for the variables r, s, and Result. This is confusing. Is there only one copy of the 'Hello' string in memory?
Whenever you take the address of an element of a string, that counts as a write in the eyes of the compiler. As far as it is concerned, you now have a raw pointer to the internals of the string, and who knows what you plan to do with it. So, from its perspective, it plays safe. It decides to make a unique copy of the string, so that you are free to do whatever dastardly deed you plan to do with your raw pointer.
The code you have compiles to this:
Project2.dpr.13: r := 'Hello';
00419EF8 8D45FC lea eax,[ebp-$04]
00419EFB BAA89F4100 mov edx,$00419fa8
00419F00 E827D2FEFF call @UStrLAsg
Project2.dpr.14: Writeln(NativeInt(PChar(@r[1])));
00419F05 8D45FC lea eax,[ebp-$04]
00419F08 E883D3FEFF call @UniqueStringU
00419F0D 8BD0 mov edx,eax
00419F0F A18CE64100 mov eax,[$0041e68c]
00419F14 E853B2FEFF call @Write0Long
00419F19 E82EB5FEFF call @WriteLn
00419F1E E845A1FEFF call @_IOTest
Project2.dpr.15: s := r;
00419F23 8D45F8 lea eax,[ebp-$08]
00419F26 8B55FC mov edx,[ebp-$04]
00419F29 E8FED1FEFF call @UStrLAsg
Project2.dpr.16: Writeln(NativeInt(PChar(@s[1])));
00419F2E 8D45F8 lea eax,[ebp-$08]
00419F31 E85AD3FEFF call @UniqueStringU
00419F36 8BD0 mov edx,eax
00419F38 A18CE64100 mov eax,[$0041e68c]
00419F3D E82AB2FEFF call @Write0Long
00419F42 E805B5FEFF call @WriteLn
00419F47 E81CA1FEFF call @_IOTest
Project2.dpr.17: Result := r;
00419F4C 8BC3 mov eax,ebx
00419F4E 8B55FC mov edx,[ebp-$04]
00419F51 E88ED1FEFF call @UStrAsg
Project2.dpr.18: Writeln(NativeInt(PChar(@Result[1])));
00419F56 8BC3 mov eax,ebx
00419F58 E833D3FEFF call @UniqueStringU
00419F5D 8BD0 mov edx,eax
00419F5F A18CE64100 mov eax,[$0041e68c]
00419F64 E803B2FEFF call @Write0Long
00419F69 E8DEB4FEFF call @WriteLn
00419F6E E8F5A0FEFF call @_IOTest
The calls to UniqueStringU are performing the copy in copy-on-write.
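If you want to observe the sharing without triggering that copy, cast the string variable itself instead of taking the address of one of its elements; casting to Pointer just reads the payload address and does not call UniqueStringU. A sketch (Test2 is a hypothetical variant of the function above):
function Test2: string;
var
  r, s: string;
begin
  r := 'Hello';
  Writeln(NativeInt(Pointer(r)));      // address of the character data
  s := r;
  Writeln(NativeInt(Pointer(s)));      // same address: the copy is deferred
  Result := r;
  Writeln(NativeInt(Pointer(Result))); // same address again
  s[1] := 'Y';                         // a genuine write...
  Writeln(NativeInt(Pointer(s)));      // ...now s points at its own copy
end;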

Trouble With Reading In A String With A Subroutine In LC3

So I believe that the way I store the string works. I am just having some issues passing the string out of the subroutine. I heard that in order to pass something out of a subroutine you need to store it in R1, but I can't get it to store into my WORD array.
.orig x3000
        AND R1,R1,#0
        LEA R0,PROMPT
        PUTS
        JSR GETS
        ST R1,WORD
        LEA R0,WORD
        PUTS
        HALT
; ---------Data Area-------------
WORD    .blkw 20
PROMPT  .stringz "Enter String: "
; -------------------------------
GETS    LEA R1,MEMORYBLOCK ; saves the address of the storage memory block
loop    GETC               ; input character -> r0
        PUTC               ; r0 -> console
        LD R2,EMPTY        ; check for
        ADD R2,R2,R0       ; end of line
        BRz finish
        LD R2,COUNTDOWN
        ADD R2,R2,#-1
        BRz finish
        ST R2,COUNTDOWN
        STR R0,R1,#0       ; r0 -> ( memory address stored in r1 + 0 )
        ADD R1,R1,#1       ; increments the memory pointer so that it
                           ; always points at the next available block
        BR loop
finish  LEA R1,MEMORYBLOCK
        RET
; ----Subroutine Data Area-------
EMPTY       .fill xfff6
COUNTDOWN   .fill #10
MEMORYBLOCK .blkw 20
; -------------------------------
.end
The biggest problem here is the concept of "returning a string". What you're actually doing at the end of GETS is returning the memory address at which the string starts. When you then store this into WORD in the calling code, you are storing the memory address of the first character of the input string (i.e. the memory address of MEMORYBLOCK) into the first word of WORD. You aren't copying the entire string from MEMORYBLOCK into WORD.
The easiest "fix" for what you're trying to do would be to change
LEA R0,WORD
to
LD R0,WORD
and then for good measure:
WORD .blkw 20
to
WORD .fill 0
as now you're just using it to store a single value (i.e. the memory address of MEMORYBLOCK).
However, at this point you haven't made a copy of the string. If you want one, you will need to write a loop that walks through MEMORYBLOCK and copies each character into WORD instead.
The final, cheaper way to do this is to just use MEMORYBLOCK directly from the calling code. It's not really any less valid in a program of this size, unless there are project requirements that say otherwise.
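If you do go with the copy, a sketch of the loop might look like this (the labels are hypothetical, and it assumes GETS is amended to store a terminating zero after the last character, which PUTS needs in order to print the buffer anyway):
; after JSR GETS, R1 holds the address of the input string
        LEA R2,WORD        ; destination pointer
COPY    LDR R0,R1,#0       ; load the next character from the source
        BRz CDONE          ; stop at the terminating zero
        STR R0,R2,#0       ; store it into WORD
        ADD R1,R1,#1       ; advance the source pointer
        ADD R2,R2,#1       ; advance the destination pointer
        BR COPY
CDONE   STR R0,R2,#0       ; copy the terminating zero as well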
