I have some code in a loop
for(int i = 0; i < n; i++)
{
u[i] = c * u[i] + s * b[i];
}
So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?
UPDATE
I learnt vectorization (turns out it's not so hard if you use intrinsics) and implemented my loop in SSE. However, when setting the SSE2 flag in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler on the other hand was much faster than my SSE code or the VC++ compiler.
Here is the code I wrote for reference
#include <emmintrin.h> // SSE2 intrinsics

double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for (int i = 0; i < n; i++)
{
    u[i] = 0;
}

int j = 0;
__m128d *uSSE = (__m128d*) u;
__m128d cStore = _mm_set1_pd(c);
__m128d sStore = _mm_set1_pd(s);
for (j = 0; j <= n - 2; j += 2)
{
    __m128d uStore = _mm_set_pd(u[j+1], u[j]);
    __m128d omegaStore = _mm_set_pd(omegaCache[j+1], omegaCache[j]);
    __m128d cu = _mm_mul_pd(cStore, uStore);
    __m128d so = _mm_mul_pd(sStore, omegaStore);
    uSSE[j/2] = _mm_add_pd(cu, so);
}
// scalar cleanup for the last element when n is odd
for (; j < n; ++j)
{
    u[j] = c * u[j] + s * omegaCache[j];
}
Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:
int i;
for (i = 0; i < n - 3; i += 4)
{
    load elements u[i, i+1, i+2, i+3]
    load elements b[i, i+1, i+2, i+3]
    vector multiply u * c
    vector multiply s * b
    add partial results
    store back to u[i, i+1, i+2, i+3]
}

// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for ( ; i < n; i++)
    u[i] = c * u[i] + s * b[i];
For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.
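For concreteness, here is a minimal sketch of that pseudocode in SSE2 intrinsics for doubles (two elements per register rather than four, so the step is 2; it assumes u and b are 16-byte aligned and do not overlap, and the function name is made up):
#include <emmintrin.h>

void scale_add(double * __restrict u, const double * __restrict b,
               double c, double s, int n)
{
    __m128d vc = _mm_set1_pd(c);
    __m128d vs = _mm_set1_pd(s);
    int i = 0;
    for (; i <= n - 2; i += 2)
    {
        __m128d vu = _mm_load_pd(&u[i]);   /* aligned 128-bit load */
        __m128d vb = _mm_load_pd(&b[i]);
        __m128d r  = _mm_add_pd(_mm_mul_pd(vc, vu), _mm_mul_pd(vs, vb));
        _mm_store_pd(&u[i], r);            /* aligned 128-bit store */
    }
    for (; i < n; i++)                     /* scalar remainder */
        u[i] = c * u[i] + s * b[i];
}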
_mm_set_pd is not vectorized. If taken literally, it reads the two doubles using scalar operations, then combines the two scalar doubles and copies them into the SSE register. Use _mm_load_pd instead.
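Applied to the loop above, that means one aligned load per operand (a sketch, assuming u and omegaCache are both 16-byte aligned):
__m128d uStore     = _mm_load_pd(&u[j]);          /* instead of _mm_set_pd(u[j+1], u[j]) */
__m128d omegaStore = _mm_load_pd(&omegaCache[j]); /* likewise for the second operand */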
Probably yes, but you have to help the compiler with some hints.
__restrict__ placed on pointers tells the compiler that the two pointers do not alias.
If you know the alignment of your vectors, communicate that to the compiler (Visual C++ may have some facility for it).
I am not familiar with Visual C++ myself, but I have heard it is no good for vectorization. Consider using the Intel compiler instead.
Intel allows pretty fine-grained control over the generated assembly: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm
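For example, the aliasing hint could look like this (a sketch; VC++ spells the keyword __restrict, GCC/Clang accept __restrict__):
/* With restrict-qualified pointers the compiler may assume u and b
   do not overlap, which removes the runtime aliasing check that
   otherwise blocks or slows down auto-vectorization. */
void update(double * __restrict u, const double * __restrict b,
            double c, double s, int n)
{
    for (int i = 0; i < n; i++)
        u[i] = c * u[i] + s * b[i];
}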
Yes, this is a great candidate for vectorization, assuming there is no overlap between the u and b arrays. But the code is bound by memory access (load/store). Vectorization helps reduce the cycles per loop, but the instructions will still stall on cache misses for the u and b arrays. The Intel C/C++ compiler generates the following code with default flags for the Xeon x5500 processor. The compiler unrolls the loop by 8 and employs the SIMD ADD (addpd) and SIMD MULTIPLY (mulpd) instructions, using the xmm[0-15] SIMD registers. In each cycle, the processor can issue 2 SIMD instructions, yielding 4-way scalar ILP, assuming you have the data ready in the registers.
Here u, b, c and s are double precision (8 bytes).
..B1.14: # Preds ..B1.12 ..B1.10
movaps %xmm1, %xmm3 #5.1
unpcklpd %xmm3, %xmm3 #5.1
movaps %xmm0, %xmm2 #6.12
unpcklpd %xmm2, %xmm2 #6.12
# LOE rax rcx rbx rbp rsi rdi r8 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3
..B1.15: # Preds ..B1.15 ..B1.14
movsd (%rsi,%rcx,8), %xmm4 #6.21
movhpd 8(%rsi,%rcx,8), %xmm4 #6.21
mulpd %xmm2, %xmm4 #6.21
movaps (%rdi,%rcx,8), %xmm5 #6.12
mulpd %xmm3, %xmm5 #6.12
addpd %xmm4, %xmm5 #6.21
movaps 16(%rdi,%rcx,8), %xmm7 #6.12
movaps 32(%rdi,%rcx,8), %xmm9 #6.12
movaps 48(%rdi,%rcx,8), %xmm11 #6.12
movaps %xmm5, (%rdi,%rcx,8) #6.3
mulpd %xmm3, %xmm7 #6.12
mulpd %xmm3, %xmm9 #6.12
mulpd %xmm3, %xmm11 #6.12
movsd 16(%rsi,%rcx,8), %xmm6 #6.21
movhpd 24(%rsi,%rcx,8), %xmm6 #6.21
mulpd %xmm2, %xmm6 #6.21
addpd %xmm6, %xmm7 #6.21
movaps %xmm7, 16(%rdi,%rcx,8) #6.3
movsd 32(%rsi,%rcx,8), %xmm8 #6.21
movhpd 40(%rsi,%rcx,8), %xmm8 #6.21
mulpd %xmm2, %xmm8 #6.21
addpd %xmm8, %xmm9 #6.21
movaps %xmm9, 32(%rdi,%rcx,8) #6.3
movsd 48(%rsi,%rcx,8), %xmm10 #6.21
movhpd 56(%rsi,%rcx,8), %xmm10 #6.21
mulpd %xmm2, %xmm10 #6.21
addpd %xmm10, %xmm11 #6.21
movaps %xmm11, 48(%rdi,%rcx,8) #6.3
addq $8, %rcx #5.1
cmpq %r8, %rcx #5.1
jl ..B1.15 # Prob 99% #5.1
It depends on how you placed u and b in memory. If the two memory blocks are far from each other, SSE wouldn't boost much in this scenario.
It is suggested that the arrays u and b be laid out AoS (array of structures) instead of SoA (structure of arrays), because then you can load both of them into a register in a single instruction.
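For illustration, the interleaved (AoS) layout suggested above could look like this (a sketch with made-up type and field names):
/* One 16-byte load can then fetch u[i] and b[i] together,
   since each pair is adjacent in memory. */
typedef struct { double u, b; } pair_t;

void update_pairs(pair_t *p, double c, double s, int n)
{
    for (int i = 0; i < n; i++)
        p[i].u = c * p[i].u + s * p[i].b;
}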
I have read https://github.com/golang/go/issues/25484 about no-copy conversion from []byte to string.
I am wondering if there is a way to convert a string to a byte slice without a memory copy?
I am writing a program which processes terabytes of data; if every string is copied twice in memory, it will slow things down considerably. And I do not care about mutable/unsafe, it is for internal usage only, I just need the speed to be as fast as possible.
Example:
var s string
// some processing on s, for some reasons, I must use string here
// ...
// then output to a writer
gzipWriter.Write([]byte(s)) // !!! Here I want to avoid the memory copy, no WriteString
So the question is: is there a way to prevent the memory copying? I know maybe I need the unsafe package, but I do not know how. I have searched for a while with no answer so far, and none of the related answers SO showed works.
Getting the content of a string as a []byte without copying in general is only possible using unsafe, because strings in Go are immutable, and without a copy it would be possible to modify the contents of the string (by changing the elements of the byte slice).
So using unsafe, this is what it could look like (corrected, working solution):
func unsafeGetBytes(s string) []byte {
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
	)[:len(s):len(s)]
}
This solution is from Ian Lance Taylor.
One thing to note here: the empty string "" has no bytes, as its length is zero. This means there is no guarantee what the Data field may be; it may be zero or an arbitrary address shared among the zero-size variables. If an empty string may be passed, that must be checked explicitly (although there's no need to get the bytes of an empty string without copying...):
func unsafeGetBytes(s string) []byte {
	if s == "" {
		return nil // or []byte{}
	}
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
	)[:len(s):len(s)]
}
Original, wrong solution was:
func unsafeGetBytesWRONG(s string) []byte {
	return *(*[]byte)(unsafe.Pointer(&s)) // WRONG!!!!
}
See Nuno Cruces's answer below for reasoning.
Testing it:
s := "hi"
data := unsafeGetBytes(s)
fmt.Println(data, string(data))
data = unsafeGetBytes("gopher")
fmt.Println(data, string(data))
Output (try it on the Go Playground):
[104 105] hi
[103 111 112 104 101 114] gopher
BUT: You wrote you want this because you need performance. You also mentioned you want to compress the data. Please know that compressing data (using gzip) requires a lot more computation than just copying a few bytes! You will not see any noticeable performance gain by using this!
Instead, when you want to write strings to an io.Writer, it's recommended to do it via the io.WriteString() function, which will do so without making a copy of the string if possible (by checking for and calling a WriteString() method, which, if it exists, most likely does it better than copying the string). For details, see What's the difference between ResponseWriter.Write and io.WriteString?
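A minimal sketch of that path, using only the standard library: bytes.Buffer implements WriteString, so io.WriteString() hands the string over without a []byte conversion:
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	var buf bytes.Buffer
	// io.WriteString detects buf's WriteString method and calls it,
	// so the string is not first converted (copied) to a []byte.
	io.WriteString(&buf, "written without a []byte conversion")
	fmt.Println(buf.String())
}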
There are also ways to access the contents of a string without converting it to []byte, such as indexing, or using a loop where the compiler optimizes away the copy:
s := "something"
for i, v := range []byte(s) { // Copying s is optimized away
// ...
}
Also see related questions:
[]byte(string) vs []byte(*string)
What are the possible consequences of using unsafe conversion from []byte to string in go?
What is the difference between the string and []byte in Go?
Does conversion between alias types in Go create copies?
How does type conversion internally work? What is the memory utilization for the same?
After some extensive investigation, I believe I've discovered the most efficient way of getting a []byte from a string as of Go 1.17 (this is for i386/x86_64 gc; I haven't tested other architectures.) The trade-off of being efficient code here is being inefficient to code, though.
Before I say anything else, it should be made clear that the differences are ultimately very small and probably inconsequential -- the info below is for fun/educational purposes only.
Summary
With some minor alterations, the accepted answer illustrating the technique of slicing a pointer to array is the most efficient way. That being said, I wouldn't be surprised if unsafe.Slice becomes the (decisively) better choice in the future.
unsafe.Slice
unsafe.Slice currently has the advantage of being slightly more readable, but I'm skeptical about its performance. It looks like it makes a call to runtime.unsafeslice. The following is the gc amd64 1.17 assembly of the function provided in Atamiri's answer (FUNCDATA omitted). Note the stack check (lack of NOSPLIT):
unsafeGetBytes_pc0:
TEXT "".unsafeGetBytes(SB), ABIInternal, $48-16
CMPQ SP, 16(R14)
PCDATA $0, $-2
JLS unsafeGetBytes_pc86
PCDATA $0, $-1
SUBQ $48, SP
MOVQ BP, 40(SP)
LEAQ 40(SP), BP
PCDATA $0, $-2
MOVQ BX, ""..autotmp_4+24(SP)
MOVQ AX, "".s+56(SP)
MOVQ BX, "".s+64(SP)
MOVQ "".s+56(SP), DX
PCDATA $0, $-1
MOVQ DX, ""..autotmp_5+32(SP)
LEAQ type.uint8(SB), AX
MOVQ BX, CX
MOVQ DX, BX
PCDATA $1, $1
CALL runtime.unsafeslice(SB)
MOVQ ""..autotmp_5+32(SP), AX
MOVQ ""..autotmp_4+24(SP), BX
MOVQ BX, CX
MOVQ 40(SP), BP
ADDQ $48, SP
RET
unsafeGetBytes_pc86:
NOP
PCDATA $1, $-1
PCDATA $0, $-2
MOVQ AX, 8(SP)
MOVQ BX, 16(SP)
CALL runtime.morestack_noctxt(SB)
MOVQ 8(SP), AX
MOVQ 16(SP), BX
PCDATA $0, $-1
JMP unsafeGetBytes_pc0
Other unimportant fun facts about the above (easily subject to change): compiled size of 3326B; has an inline cost of 7; correct escape analysis: s leaks to ~r1 with derefs=0.
Carefully Modifying *reflect.SliceHeader
This method has the advantage/disadvantage of letting one modify the internal state of a slice directly. Unfortunately, due to its multiline nature and use of uintptr, the GC can easily mess things up if one is not careful about keeping a reference to the original string. (Here I avoided creating temporary pointers to reduce inline cost and to avoid needing to add runtime.KeepAlive):
func unsafeGetBytes(s string) (b []byte) {
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Len = len(s)
	return
}
The corresponding assembly on amd64 (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ BP, 24(SP)
LEAQ 24(SP), BP
MOVQ AX, "".s+40(SP)
MOVQ BX, "".s+48(SP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ "".s+40(SP), DX
MOVQ DX, "".b(SP)
MOVQ "".s+48(SP), CX
MOVQ CX, "".b+16(SP)
MOVQ "".s+48(SP), BX
MOVQ BX, "".b+8(SP)
MOVQ "".b(SP), AX
MOVQ 24(SP), BP
ADDQ $32, SP
RET
Other unimportant fun facts about the above (easily subject to change): compiled size of 3700B; has an inline cost of 20; subpar escape analysis: s leaks to {heap} with derefs=0.
Unsafer version of modifying SliceHeader
Adapted from Nuno Cruces' answer. This relies on the inherent structural similarity between StringHeader and SliceHeader, so in a sense it breaks "more easily". Additionally, it temporarily creates an illegal state where cap(b) (being 0) is less than len(b).
func unsafeGetBytes(s string) (b []byte) {
	*(*string)(unsafe.Pointer(&b)) = s
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
	return
}
Corresponding assembly (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ BP, 24(SP)
LEAQ 24(SP), BP
MOVQ AX, "".s+40(FP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ AX, "".b(SP)
MOVQ BX, "".b+8(SP)
MOVQ BX, "".b+16(SP)
MOVQ "".b(SP), AX
MOVQ BX, CX
MOVQ 24(SP), BP
ADDQ $32, SP
NOP
RET
Other unimportant details: compiled size 3636B, inline cost of 11, with subpar escape analysis: s leaks to {heap} with derefs=0.
Slicing a pointer to array
This is the accepted answer (shown here for comparison) -- its primary disadvantage is its ugliness (viz. magic number 0x7fff0000). There's also the tiniest possibility of getting a string bigger than the array, and an unavoidable bounds check.
func unsafeGetBytes(s string) []byte {
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
	)[:len(s):len(s)]
}
Corresponding assembly (FUNCDATA removed).
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $24-16
SUBQ $24, SP
MOVQ BP, 16(SP)
LEAQ 16(SP), BP
PCDATA $0, $-2
MOVQ AX, "".s+32(SP)
MOVQ BX, "".s+40(SP)
MOVQ "".s+32(SP), AX
PCDATA $0, $-1
TESTB AL, (AX)
NOP
CMPQ BX, $2147418112
JHI unsafeGetBytes_pc54
MOVQ BX, CX
MOVQ 16(SP), BP
ADDQ $24, SP
RET
unsafeGetBytes_pc54:
MOVQ BX, DX
MOVL $2147418112, BX
PCDATA $1, $1
NOP
CALL runtime.panicSlice3Alen(SB)
XCHGL AX, AX
Other unimportant details: compiled size 3142B, inline cost of 9, with correct escape analysis: s leaks to ~r1 with derefs=0
Note the runtime.panicSlice3Alen -- this is the bounds check verifying that len(s) is within 0x7fff0000.
Improved slicing pointer to array
This is what I've concluded to be the most efficient method as of Go 1.17. I basically modified the accepted answer to eliminate the bounds check, and found a "more meaningful" constant (math.MaxInt32) to use than 0x7fff0000. Using MaxInt32 preserves 32-bit compatibility.
func unsafeGetBytes(s string) []byte {
	const MaxInt32 = 1<<31 - 1
	return (*[MaxInt32]byte)(unsafe.Pointer((*reflect.StringHeader)(
		unsafe.Pointer(&s)).Data))[:len(s)&MaxInt32:len(s)&MaxInt32]
}
Corresponding assembly (FUNCDATA removed):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $0-16
PCDATA $0, $-2
MOVQ AX, "".s+8(SP)
MOVQ BX, "".s+16(SP)
MOVQ "".s+8(SP), AX
PCDATA $0, $-1
TESTB AL, (AX)
ANDQ $2147483647, BX
MOVQ BX, CX
RET
Other unimportant details: compiled size 3188B, inline cost of 13, and correct escape analysis: s leaks to ~r1 with derefs=0
In Go 1.17, I'd recommend unsafe.Slice as more readable:
unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
I think that this also works (doesn't violate any unsafe.Pointer rules), with the benefit that it works for a const s:
*(*[]byte)(unsafe.Pointer(&struct{string; int}{s, len(s)}))
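Wrapped in a function, it could look like this (a sketch; the helper name is made up). The anonymous struct mirrors a slice header: the string contributes the data pointer and length, and the trailing int supplies the capacity:
func unsafeBytes(s string) []byte { // hypothetical helper name
	return *(*[]byte)(unsafe.Pointer(&struct {
		string
		int
	}{s, len(s)}))
}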
Commentary below is regarding the accepted answer as it originally stood. The accepted answer now mentions an (authoritative) solution from Ian Lance Taylor. I'm keeping it as it points out a common error.
The accepted answer is wrong, and may produce the panic @RFC mentioned in the comments. The explanation by @icza about GC and keep-alive is misguided.
The reason capacity is zero (or even an arbitrary value) is more prosaic.
A slice is:
type SliceHeader struct {
	Data uintptr
	Len  int
	Cap  int
}
A string is:
type StringHeader struct {
	Data uintptr
	Len  int
}
Converting a byte slice to a string can be "safely" done as the strings.Builder does it:
func (b *Builder) String() string {
	return *(*string)(unsafe.Pointer(&b.buf))
}
This will copy the Data pointer and Len from the slice to the string.
The opposite conversion is not "safe" because Cap doesn't get set to the correct value.
The following (originally presented by me as the fix) is also wrong, because it violates unsafe.Pointer rule #1:
var buf = *(*[]byte)(unsafe.Pointer(&str))
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
This is the correct code that fixes the panic:
var buf []byte
*(*string)(unsafe.Pointer(&buf)) = str
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
I should add that all these conversions are unsafe in the sense that strings are expected to be immutable, and byte arrays/slices mutable.
But if you know for sure that the byte slice won't be mutated, you won't get bounds (or GC) issues with the above conversions.
In Go 1.17, one can now use unsafe.Slice, so the accepted answer can be rewritten as follows:
func unsafeGetBytes(s string) []byte {
	return unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
}
I managed to achieve the goal with this:
func TestString(t *testing.T) {
	b := []byte{'a', 'b', 'c', '1', '2', '3', '4'}
	s := *(*string)(unsafe.Pointer(&b))
	sb := *(*[]byte)(unsafe.Pointer(&s))

	addr1 := unsafe.Pointer(&b)
	addr2 := unsafe.Pointer(&s)
	addr3 := unsafe.Pointer(&sb)
	fmt.Print("&b=", addr1, "\n&s=", addr2, "\n&sb=", addr3, "\n")

	hdr1 := (*reflect.StringHeader)(unsafe.Pointer(&b))
	hdr2 := (*reflect.SliceHeader)(unsafe.Pointer(&s))
	hdr3 := (*reflect.SliceHeader)(unsafe.Pointer(&sb))
	fmt.Print("b.data=", hdr1.Data, "\ns.data=", hdr2.Data, "\nsb.data=", hdr3.Data, "\n")

	b[0] = 'X'
	sb[1] = 'Y' // if sb is from a string directly, this will cause a nil panic
	fmt.Print("s=", s, "\nsb=")
	for _, c := range sb {
		fmt.Printf("%c", c)
	}
	fmt.Println()
}
Output:
=== RUN TestString
&b=0xc000218000
&s=0xc00021a000
&sb=0xc000218020
b.data=824635867152
s.data=824635867152
sb.data=824635867152
s=XYc1234
sb=XYc1234
These variables all share the same memory.
Go 1.20 (February 2023)
You can use unsafe.StringData to greatly simplify YenForYang's answer:
StringData returns a pointer to the underlying bytes of str. For an empty string the return value is unspecified, and may be nil.
Since Go strings are immutable, the bytes returned by StringData must not be modified.
func main() {
	str := "foobar"
	d := unsafe.StringData(str)
	b := unsafe.Slice(d, len(str))
	fmt.Printf("%T, %s\n", b, b) // []uint8, foobar (byte is an alias of uint8)
}
Go tip playground: https://go.dev/play/p/FIXe0rb8YHE?v=gotip
Remember that you can't assign to b[n]. The memory is still read-only.
Simple, no reflect, and I think it is portable. s is your string and b is your byte slice:
var b []byte
bb := (*[3]uintptr)(unsafe.Pointer(&b))[:]
copy(bb, (*[2]uintptr)(unsafe.Pointer(&s))[:])
bb[2] = bb[1]
// use b
Remember, the byte values should not be modified (that will panic). Re-slicing is OK (for example: bytes.Split(b, []byte{','})).
I must admit I'm a bit lost with macros.
I want to build a macro that does the following task and
I'm not sure how to do it. I want to perform a scalar product
of two arrays, say x and y, which have the same length N.
The result I want to compute is of the form:
z = sum_{i=0}^{N-1} x[i] * y[i].
x is a const whose elements are 0, 1, or -1 and are known at compile time,
while y's elements are determined at runtime. Because of the
structure of x, many computations are useless (terms multiplied by 0
can be removed from the sum, and multiplications of the form 1 * y[i], -1 * y[i] can be transformed into y[i], -y[i] respectively).
As an example if x = [-1, 1, 0], the scalar product above would be
z=-1 * y[0] + 1 * y[1] + 0 * y[2]
To speed up my computation I can unroll the loop by hand and rewrite
the whole thing without x[i], and I could hard code the above formula as
z = -y[0] + y[1]
But this procedure is not elegant, error prone
and very tedious when N becomes large.
I'm pretty sure I can do that with a macro, but I don't know where to
start (the different books I read are not going too deep into macros and
I'm stuck)...
Would any of you have an idea how to solve this problem (if it is possible) using macros?
Thank you in advance for your help!
Edit: As pointed out in many of the answers, the compiler is smart enough to optimize away the loop in the case of integers. I am not only using integers but also floats (the x array is i32, but in general y is f64), so the compiler is not smart enough (and rightfully so) to optimize the loop. The following piece of code gives the following asm.
const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [f64; 8]) -> f64 {
    X.iter().zip(y.iter()).map(|(i, j)| (*i as f64) * j).sum()
}
playground::dot_x:
xorpd %xmm0, %xmm0
movsd (%rdi), %xmm1
mulsd %xmm0, %xmm1
addsd %xmm0, %xmm1
addsd 8(%rdi), %xmm1
subsd 16(%rdi), %xmm1
movupd 24(%rdi), %xmm2
xorpd %xmm3, %xmm3
mulpd %xmm2, %xmm3
addsd %xmm3, %xmm1
unpckhpd %xmm3, %xmm3
addsd %xmm1, %xmm3
addsd 40(%rdi), %xmm3
mulsd 48(%rdi), %xmm0
addsd %xmm3, %xmm0
subsd 56(%rdi), %xmm0
retq
First of all, a (proc) macro can simply not look inside your array x. All it gets are the tokens you pass it, without any context. If you want it to know about the values (0, 1, -1), you need to pass those directly to your macro:
let result = your_macro!(y, -1, 0, 1, -1);
But you don't really need a macro for this. The compiler optimizes a lot, as also shown in the other answers. However, as you already mention in your edit, it will not optimize away 0.0 * y[i], as the result of that is not always 0.0. (It could be -0.0 or NaN, for example.) What we can do here is simply help the optimizer a bit by using a match or if, to make sure it does nothing for the 0.0 * y case:
const X: [i32; 8] = [0, -1, 0, 0, 0, 0, 1, 0];

fn foobar(y: [f64; 8]) -> f64 {
    let mut sum = 0.0;
    for (&x, &y) in X.iter().zip(&y) {
        if x != 0 {
            sum += x as f64 * y;
        }
    }
    sum
}
In release mode, the loop is unrolled and the values of X inlined, resulting in most iterations being thrown away as they don't do anything. The only thing left in the resulting binary (on x86_64), is:
foobar:
    xorpd   xmm0, xmm0
    subsd   xmm0, qword ptr [rdi + 8]
    addsd   xmm0, qword ptr [rdi + 48]
    ret
(As suggested by #lu-zero, this can also be done using filter_map. That will look like this: X.iter().zip(&y).filter_map(|(&x, &y)| match x { 0 => None, _ => Some(x as f64 * y) }).sum(), and gives the exact same generated assembly. Or even without a match, by using filter and map separately: .filter(|(&x, _)| x != 0).map(|(&x, &y)| x as f64 * y).sum().)
Pretty good! However, this function calculates 0.0 - y[1] + y[6], since sum started at 0.0 and we only subtract and add things to it. The optimizer is again not willing to optimize away a 0.0. We can help it a bit more by not starting at 0.0, but starting with None:
fn foobar(y: [f64; 8]) -> f64 {
    let mut sum = None;
    for (&x, &y) in X.iter().zip(&y) {
        if x != 0 {
            let p = x as f64 * y;
            sum = Some(sum.map_or(p, |s| s + p));
        }
    }
    sum.unwrap_or(0.0)
}
This results in:
foobar:
    movsd   xmm0, qword ptr [rdi + 48]
    subsd   xmm0, qword ptr [rdi + 8]
    ret
Which simply does y[6] - y[1]. Bingo!
You may be able to achieve your goal with a macro that returns a function.
First, write this function without a macro. This one takes a fixed number of parameters.
fn main() {
    println!("Hello, world!");
    let func = gen_sum([1, 2, 3]);
    println!("{}", func([4, 5, 6])) // 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
}

fn gen_sum(xs: [i32; 3]) -> impl Fn([i32; 3]) -> i32 {
    move |ys| ys[0]*xs[0] + ys[1]*xs[1] + ys[2]*xs[2]
}
Now, completely rewrite it, because the prior design doesn't work well as a macro. We had to give up on fixed-size arrays, as macros appear unable to allocate fixed-size arrays.
Rust Playground
fn main() {
    let func = gen_sum!(1, 2, 3);
    println!("{}", func(vec![4, 5, 6])) // 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
}

#[macro_export]
macro_rules! gen_sum {
    ( $( $x:expr ),* ) => {
        {
            let mut xs = Vec::new();
            $(
                xs.push($x);
            )*
            move |ys: Vec<i32>| {
                if xs.len() != ys.len() {
                    panic!("lengths don't match")
                }
                let mut total = 0;
                for i in 0 as usize .. xs.len() {
                    total += xs[i] * ys[i];
                }
                total
            }
        }
    };
}
What does this do/What should it do
At compile time, it generates a lambda. This lambda accepts a list of numbers and multiplies it by a vec that was generated at compile time. I don't think this was exactly what you were after, as it does not optimize away zeroes at compile time. You could optimize away zeroes at compile time, but you would necessarily incur some cost at run time by having to check where the zeroes were in x to determine which elements of y to multiply by. You could even make this lookup process constant-time using a hashset. It's still probably not worth it in general (where I presume 0 is not all that common). Computers are better at doing one thing that's "inefficient" than they are at detecting that the thing they're about to do is "inefficient" and then skipping it. This abstraction breaks down when a significant portion of the operations they do are "inefficient".
Follow-up
Was that worth it? Does it improve run times? I didn't measure, but it seems like understanding and maintaining the macro I wrote isn't worth it compared to just using a function. Writing a macro that does the zero optimization you talked about would probably be even less pleasant.
In many cases, the optimisation stage of the compiler will take care of this for you. To give an example, this function definition
const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [i32; 8]) -> i32 {
    X.iter().zip(y.iter()).map(|(i, j)| i * j).sum()
}
results in this assembly output on x86_64:
playground::dot_x:
mov eax, dword ptr [rdi + 4]
sub eax, dword ptr [rdi + 8]
add eax, dword ptr [rdi + 20]
sub eax, dword ptr [rdi + 28]
ret
You won't be able to get any more optimised version than this, so simply writing the code in a naïve way is the best solution. Whether the compiler will unroll the loop for longer vectors is unclear, and it may change with compiler versions.
For floating-point numbers, the compiler is not normally able to perform all the optimisations above, since the numbers in y are not guaranteed to be finite – they could also be NaN, inf or -inf. For this reason, multiplying with 0.0 is not guaranteed to result in 0.0 again, so the compiler needs to keep the multiplication instructions in the code. You can explicitly allow it to assume all numbers are finite, though, by using the fmul_fast() intrinsic function:
#![feature(core_intrinsics)]
use std::intrinsics::fmul_fast;

const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [f64; 8]) -> f64 {
    X.iter().zip(y.iter()).map(|(i, j)| unsafe { fmul_fast(*i as f64, *j) }).sum()
}
This results in the following assembly code:
playground::dot_x: # #playground::dot_x
# %bb.0:
xorpd xmm1, xmm1
movsd xmm0, qword ptr [rdi + 8] # xmm0 = mem[0],zero
addsd xmm0, xmm1
subsd xmm0, qword ptr [rdi + 16]
addsd xmm0, xmm1
addsd xmm0, qword ptr [rdi + 40]
addsd xmm0, xmm1
subsd xmm0, qword ptr [rdi + 56]
ret
This still redundantly adds zeros between the steps, but I would not expect this to result in any measurable overhead for realistic CFD simulations, since such simulations tend to be limited by memory bandwidth rather than CPU. If you want to avoid these additions as well, you need to use fadd_fast() for the additions to allow the compiler to optimise further:
#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [f64; 8]) -> f64 {
    let mut result = 0.0;
    for (&i, &j) in X.iter().zip(y.iter()) {
        unsafe { result = fadd_fast(result, fmul_fast(i as f64, j)); }
    }
    result
}
This results in the following assembly code:
playground::dot_x: # #playground::dot_x
# %bb.0:
movsd xmm0, qword ptr [rdi + 8] # xmm0 = mem[0],zero
subsd xmm0, qword ptr [rdi + 16]
addsd xmm0, qword ptr [rdi + 40]
subsd xmm0, qword ptr [rdi + 56]
ret
As with all optimisations, you should start with the most readable and maintainable version of the code. If performance becomes an issue, you should profile your code and find the bottlenecks. As the next step, try to improve the fundamental approach, e.g. by using an algorithm with a better asymptotic complexity. Only then should you turn to micro-optimisations like the one you suggested in the question.
If you can spare an #[inline(always)], using an explicit filter_map() should probably be enough to have the compiler do what you want.
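A sketch of what that could look like, reusing the X constant from the question (the zero coefficients are dropped before any float math, so the optimizer never sees 0.0 * y[i]):
const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

#[inline(always)]
pub fn dot_x(y: [f64; 8]) -> f64 {
    // filter_map discards the zero-coefficient terms entirely,
    // leaving only the +/- y[i] contributions for the optimizer.
    X.iter()
        .zip(y.iter())
        .filter_map(|(&x, &y)| if x != 0 { Some(x as f64 * y) } else { None })
        .sum()
}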
I am trying to insert probes at different instructions with kprobes in a function of a kernel module.
But register_kprobe is returning the EINVAL (-22) error for the instruction addresses 0xffffffffa33c1085 and 0xffffffffa33c109b from the assembly code below (it passes for all other instruction addresses).
Instructions giving errors:
0xffffffffa33c1085 <test_increment+5>: mov 0x21bd(%rip),%eax # 0xffffffffa33c3248
0xffffffffa33c109b <test_increment+27>: mov %esi,0x21a7(%rip) # 0xffffffffa33c3248
I observed that both these instructions use the rip register. I tried functions of other modules and observed the same error with instructions that use the rip register.
Why is register_kprobe failing? Does it have any constraints involving rip? Any help is appreciated.
The system has kernel 3.10.0-514 on x86_64 installed.
kprobe function:
kp = kzalloc(sizeof(struct kprobe), GFP_KERNEL);
kp->post_handler = exit_func;
kp->pre_handler = entry_func;
kp->addr = sym_addr;
atomic_set(&pcount, 0);

ret = register_kprobe(kp);
if (ret != 0) {
    printk(KERN_INFO "register_kprobe returned %d for %s\n", ret, str);
    kfree(kp);
    kp = NULL;
    return ret;
}
probed function:
int race = 0;

void test_increment()
{
    race++;
    printk(KERN_INFO "VALUE=%d\n", race);
    return;
}
assembly code:
crash> dis -l test_increment
0xffffffffa33c1080 <test_increment>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa33c1085 <test_increment+5>: mov 0x21bd(%rip),%eax # 0xffffffffa33c3248
0xffffffffa33c108b <test_increment+11>: push %rbp
0xffffffffa33c108c <test_increment+12>: mov $0xffffffffa33c2024,%rdi
0xffffffffa33c1093 <test_increment+19>: mov %rsp,%rbp
0xffffffffa33c1096 <test_increment+22>: lea 0x1(%rax),%esi
0xffffffffa33c1099 <test_increment+25>: xor %eax,%eax
0xffffffffa33c109b <test_increment+27>: mov %esi,0x21a7(%rip) # 0xffffffffa33c3248
0xffffffffa33c10a1 <test_increment+33>: callq 0xffffffff81659552 <printk>
0xffffffffa33c10a6 <test_increment+38>: pop %rbp
0xffffffffa33c10a7 <test_increment+39>: retq
Thanks
Turns out, register_kprobe does have limitations with instructions involving rip-relative addressing on x86_64.
Here is the snippet of the __copy_instruction function causing the error (call chain: register_kprobe -> prepare_kprobe -> arch_prepare_kprobe -> arch_copy_kprobe -> __copy_instruction):
#ifdef CONFIG_X86_64
	if (insn_rip_relative(&insn)) {
		s64 newdisp;
		u8 *disp;
		kernel_insn_init(&insn, dest);
		insn_get_displacement(&insn);
		/*
		 * The copied instruction uses the %rip-relative addressing
		 * mode. Adjust the displacement for the difference between
		 * the original location of this instruction and the location
		 * of the copy that will actually be run. The tricky bit here
		 * is making sure that the sign extension happens correctly in
		 * this calculation, since we need a signed 32-bit result to
		 * be sign-extended to 64 bits when it's added to the %rip
		 * value and yield the same 64-bit result that the sign-
		 * extension of the original signed 32-bit displacement would
		 * have given.
		 */
		newdisp = (u8 *) src + (s64) insn.displacement.value - (u8 *) dest;
		if ((s64) (s32) newdisp != newdisp) {
			pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
			pr_err("\tSrc: %p, Dest: %p, old disp: %x\n", src, dest, insn.displacement.value);
			return 0;
		}
		disp = (u8 *) dest + insn_offset_displacement(&insn);
		*(s32 *) disp = (s32) newdisp;
	}
#endif
http://elixir.free-electrons.com/linux/v3.10/ident/__copy_instruction
A new displacement value is calculated based on the new instruction address (where the original instruction is copied). If that value doesn't fit into 32 bits, the function returns 0, which results in the EINVAL error. Hence the failure.
As a workaround, we can attach the kprobe after the previous instruction (via its post-handler) or before the next instruction (via its pre-handler), based on need (this works for me).
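For example, instead of probing the rip-relative mov at <test_increment+27>, a sketch based on the registration code above could place a pre-handler on the following instruction (the callq at offset 33 in this particular disassembly), which fires at the same point a post-handler on +27 would:
kp = kzalloc(sizeof(struct kprobe), GFP_KERNEL);
kp->symbol_name = "test_increment";
kp->offset = 33;               /* the callq at <test_increment+33> */
kp->pre_handler = entry_func;  /* runs after the +27 mov has executed,
                                  just before the call */
ret = register_kprobe(kp);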
I am facing a problem that I could not resolve, so I have to turn to the community to help me out. The problem is related to PPC function hooking.
The area where I am hooking is this:
.text:8220D810 mflr r12
.text:8220D814 bl __savegprlr_20
.text:8220D818 stfd f31, var_70(r1)
.text:8220D81C stwu r1, -0x100(r1)
.text:8220D820 lis r11, off_82A9CCC0#ha // => This is where I am hooking the function
.text:8220D824 lis r22, dword_82BBAE68#ha // These 4 instructions are overwritten
.text:8220D828 lis r10, 8 # 0x87700 //Patched
.text:8220D82C mr r26, r3 //Patched
.text:8220D830 li r20, 0
.text:8220D834 lwz r9, off_82A9CCC0#l(r11)
.text:8220D838 ori r23, r10, 0x7700 # 0x87700
.text:8220D83C lwz r11, dword_82BBAE68#l(r22)
.text:8220D840 cmplwi cr6, r11, 0
.text:8220D844 stw r9, 0x100+var_7C(r1)
.text:8220D848 bne cr6, loc_8220D854
.text:8220D84C mr r30, r20
.text:8220D850 b loc_8220D85C
Here it jumps to my code cave, shown below. The patched instructions are correctly written in the PredictPlayerHook function and are not the problem.
The problem is that if I call a function in the hook, e.g. here I call "GetCurrentCmdNumber(0);", it causes the game to crash. Without calling any functions the game doesn't crash and the code cave works without any issues, but if I try to call any function within the code cave (PredictPlayerHook), it just crashes. I can't debug it, so I don't know where it crashes.
void __declspec(naked) PredictPlayerHook()
{
    DWORD R11, Return, cmdNumber;
    __asm lis r11, 0x82AA      // patched instructions
    __asm lis r22, 0x82BC      // patched instructions
    __asm lis r10, 0x8         // patched instructions
    __asm mr r26, r3           // patched instructions
    __asm mflr r0              // mflr grabs the link register and stores it into the first operand; r0 now holds the link register
    __asm stw r0, -0x14(r1)    // save the link register inside the stack frame
    __asm stwu r1, -0x90(r1)   // push the stack frame
    // cmdNumber = GetCurrentCmdNumber(0);
    __asm addi r1, r1, 0x90    // pop the stack frame
    __asm lwz r0, -0x14(r1)    // read the link register back from the stack
    __asm mtlr r0
    __asm stw r11, R11
    //Return = 0x82200230;
    __asm lis r11, 0x8220      // return address is correct; in the IDA segment it is +0xD600 ahead of the original address
    __asm ori r11, r11, 0x0230
    __asm mtctr r11
    __asm lwz r11, R11
    __asm bctr
}
Here is the function itself, and it's correct. I can use it in a hook placed elsewhere in the game, so it has no issues.
typedef int (__cdecl* CL_GetCurrentCmdNumber)(int localClientNum);
CL_GetCurrentCmdNumber GetCurrentCmdNumber = (CL_GetCurrentCmdNumber)0x82261F90;
A very particular code sequence in VC++ generated the following instruction (for Win32):
unpcklpd xmm0,xmmword ptr [ebp-40h]
2 questions arise:
(1) As far as I understand the Intel manual, unpcklpd accepts as its 2nd argument a 128-bit-aligned memory address. If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug?
(2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be??
The particular exception thrown is access violation at 0xFFFFFFFF, but AFAIK that's just a code for misalignment.
[Edit:]
Here's some source that demonstrates the bad code generation - but typically doesn't cause a crash. (that's mostly what I'm wondering about)
[Edit 2:]
The code sample now reproduces the actual crash. This one also crashes outside the debugger - I suspect the difference occurs because the debugger launches the program at different typical base addresses.
// mock.cpp
#include <stdio.h>

struct mockVect2d
{
    double x, y;
    mockVect2d() {}
    mockVect2d(double a, double b) : x(a), y(b) {}
    mockVect2d operator + (const mockVect2d& u) {
        return mockVect2d(x + u.x, y + u.y);
    }
};

struct MockPoly
{
    MockPoly() {}
    mockVect2d* m_Vrts;
    double m_Area;
    int m_Convex;
    bool m_ParClear;
    void ClearPar() { m_Area = -1.; m_Convex = 0; m_ParClear = true; }
    MockPoly(int len) { m_Vrts = new mockVect2d[len]; }
    mockVect2d& Vrt(int i) {
        if (!m_ParClear) ClearPar();
        return m_Vrts[i];
    }
    const mockVect2d& GetCenter() { return m_Vrts[0]; }
};

struct MockItem
{
    MockItem() : Contour(1) {}
    MockPoly Contour;
};

struct Mock
{
    Mock() {}
    MockItem m_item;
    virtual int GetCount() { return 2; }
    virtual mockVect2d GetCenter() { return mockVect2d(1.0, 2.0); }
    virtual MockItem GetItem(int i) { return m_item; }
};

void testInner(int a)
{
    int c = 8;
    printf("%d", c);
    Mock* pMock = new Mock;
    int Flag = true;
    int nlr = pMock->GetCount();
    if (nlr == 0)
        return;
    int flr = 1;
    if (flr == nlr)
        return;
    if (Flag)
    {
        if (flr < nlr && flr > 0) {
            int c = 8;
            printf("%d", c);
            MockPoly pol(2);
            mockVect2d ctr = pMock->GetItem(0).Contour.GetCenter();
            // The mess happens here:
            // ; 74 : pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
            //
            //   call ?Vrt@MockPoly@@QAEAAUmockVect2d@@H@Z ; MockPoly::Vrt
            //   movdqa xmm0, XMMWORD PTR $T4[ebp]
            //   unpcklpd xmm0, QWORD PTR tv190[ebp]   **** crash!
            //   movdqu XMMWORD PTR [eax], xmm0
            pol.Vrt(0) = ctr + mockVect2d(1.0, 0.);
            pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
        }
    }
}

void main()
{
    testInner(2);
    return;
}
If you prefer, download a ready vcxproj with all the switches set from here. This includes the complete ASM too.
Update: this is now a confirmed VC++ compiler bug, hopefully to be resolved in VS2015 RTM.
Edit: The connect report, like many others, is now garbage. However the compiler bug seems to be resolved in VS2017 - not in 2015 update 3.
Since no one else has stepped up, I'm going to take a shot.
1) If the address is relative to a stack frame alignment cannot be forced. Is this really a compiler bug?
I'm not sure it is true that you cannot force alignment for stack variables. Consider this code:
struct foo
{
    char a;
    int b;
    unsigned long long c;
};

int wmain(int argc, wchar_t* argv[])
{
    foo moo;
    moo.a = 1;
    moo.b = 2;
    moo.c = 3;
}
Looking at the startup code for main, we see:
00E31AB0 push ebp
00E31AB1 mov ebp,esp
00E31AB3 sub esp,0DCh
00E31AB9 push ebx
00E31ABA push esi
00E31ABB push edi
00E31ABC lea edi,[ebp-0DCh]
00E31AC2 mov ecx,37h
00E31AC7 mov eax,0CCCCCCCCh
00E31ACC rep stos dword ptr es:[edi]
00E31ACE mov eax,dword ptr [___security_cookie (0E440CCh)]
00E31AD3 xor eax,ebp
00E31AD5 mov dword ptr [ebp-4],eax
Adding __declspec(align(16)) to moo gives
01291AB0 push ebx
01291AB1 mov ebx,esp
01291AB3 sub esp,8
01291AB6 and esp,0FFFFFFF0h <------------------------
01291AB9 add esp,4
01291ABC push ebp
01291ABD mov ebp,dword ptr [ebx+4]
01291AC0 mov dword ptr [esp+4],ebp
01291AC4 mov ebp,esp
01291AC6 sub esp,0E8h
01291ACC push esi
01291ACD push edi
01291ACE lea edi,[ebp-0E8h]
01291AD4 mov ecx,3Ah
01291AD9 mov eax,0CCCCCCCCh
01291ADE rep stos dword ptr es:[edi]
01291AE0 mov eax,dword ptr [___security_cookie (12A40CCh)]
01291AE5 xor eax,ebp
01291AE7 mov dword ptr [ebp-4],eax
Apparently the compiler (VS2010 compiled debug for Win32), recognizing that we will need specific alignments for the code, takes steps to ensure it can provide that.
2) Exceptions are thrown from at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be??
So, a couple of thoughts:
"and even then not always" - Not standing over your shoulder when you run this, I can't say for certain. However it seems plausible that just by random chance, stacks could get created with the alignment you need. By default, x86 uses 4byte stack alignment. If you need 16 byte alignment, you've got a 1 in 4 shot.
As for the rest (from https://msdn.microsoft.com/en-us/library/aa290049%28v=vs.71%29.aspx#ia64alignment_topic4):
On the x86 architecture, the operating system does not make the alignment fault visible to the application. ...you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
TLDR: Using __declspec(align(16)) should give you the alignment you want, even for stack variables. For unaligned accesses, the OS will catch the exception and handle it for you (at a cost of performance).
Edit1: Responding to the first 2 comments below:
Based on MS's docs, you are correct about the alignment of stack parameters, but they propose a solution as well:
You cannot specify alignment for function parameters. When data that has an alignment attribute is passed by value on the stack, its alignment is controlled by the calling convention. If data alignment is important in the called function, copy the parameter into correctly aligned memory before use.
Neither your sample on Microsoft Connect nor the code above produces the same code for me (I'm only on VS2010), so I can't test this. But given this code from your sample:
struct mockVect2d
{
    double x, y;
    mockVect2d(double a, double b) : x(a), y(b) {}
    // ...
};
It would seem that aligning either mockVect2d or the 2 doubles might help.
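For instance, forcing 16-byte alignment on the struct itself might look like this (a sketch, untested against this particular code-generation bug):
// MSVC-specific: every mockVect2d instance, including stack locals,
// is now placed on a 16-byte boundary, so SSE loads of the x/y pair
// are always aligned.
struct __declspec(align(16)) mockVect2d
{
    double x, y;
    mockVect2d() {}
    mockVect2d(double a, double b) : x(a), y(b) {}
};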