Range optimization opportunities [duplicate] - rust

These two loops are equivalent in C++ and Rust:
#include <cstdint>

uint64_t sum1(uint64_t n) {
    uint64_t sum = 0;
    for (uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}
pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    for j in 0u64..=num {
        sum += 1;
    }
    return sum;
}
However, the C++ version generates very terse assembly:
sum1(unsigned long):                  # @sum1(unsigned long)
        xor     eax, eax
.LBB0_1:                              # =>This Inner Loop Header: Depth=1
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret
while the Rust version is much longer, with two checks per loop iteration instead of one:
example::sum1:
        xor     eax, eax
        xor     ecx, ecx
.LBB0_1:
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     rax, 1
        cmp     rdx, rdi
        jae     .LBB0_3
        cmp     rcx, rdi
        jbe     .LBB0_1
.LBB0_3:
        ret
Godbolt: https://godbolt.org/z/xYW94qxjK
What is Rust intrinsically trying to prevent that C++ is carefree about?

Overflow in the iterator state.
The C++ version will loop forever when given a large enough input:
#include <cstdint>

std::uint64_t sum1(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        __asm__ ("");
        sum += 1;
    }
    return sum;
}

#include <iostream>

int main() {
    std::cout << sum1(UINT64_C(0xffffffff'ffffffff)) << std::endl;
    return 0;
}
This is because when the loop counter j finally reaches 0xffffffff'ffffffff, incrementing it wraps around to 0, which means the loop condition j <= n is always satisfied and the loop never exits.
Strictly speaking, invoking the original version of sum1 with 0xffffffff'ffffffff infamously triggers undefined behaviour, though not because of the wrap-around itself (unsigned overflow is well-defined in C++), but because infinite loops without externally-visible side effects are UB ([intro.progress]/1). This is why in my version I added an empty __asm__ statement to the function to act as a dummy ‘side effect’, preventing the compiler from taking ‘advantage’ of that in its optimisation passes.
The Rust version, on the other hand, is not only perfectly well-defined, but iterates exactly as many times as the cardinality of the range:
use std::num::Wrapping;

fn sum1(num: u64) -> u64 {
    let mut sum = Wrapping(0);
    for _ in 0..=num {
        sum += Wrapping(1);
    }
    return sum.0;
}

fn main() {
    println!("{}", sum1(0xffffffff_ffffffff));
}
The above program (slightly modified to avoid getting bogged down in debug versus release mode differences with respect to the summation) will terminate after exactly 18 446 744 073 709 551 616 iterations and print 0; the Rust version carefully maintains iterator state so that overflow does not happen in the iterator.
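For illustration, here is a minimal sketch of how an inclusive-range iterator can avoid counter overflow by carrying an extra "done" flag. This is my own simplification, not the actual standard-library RangeInclusive implementation, but the second per-iteration check visible in the assembly above plays roughly the same role as the flag here:
// Hypothetical illustration of an overflow-safe inclusive counter over u64;
// the real std::ops::RangeInclusive is more general and differs in detail.
struct InclusiveCounter {
    next: u64,
    done: bool,
}

impl Iterator for InclusiveCounter {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.done {
            return None;
        }
        let current = self.next;
        match current.checked_add(1) {
            Some(n) => self.next = n,  // safe to advance
            None => self.done = true,  // reached u64::MAX: stop without wrapping
        }
        Some(current)
    }
}

fn main() {
    let it = InclusiveCounter { next: u64::MAX - 2, done: false };
    // Yields MAX-2, MAX-1 and MAX, then stops; the counter never wraps.
    assert_eq!(it.count(), 3);
}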

@user3840170 correctly identified the difference: an overflow check in Rust, and none in C++.
Still, the question remains as to why there are two checks per loop iteration in the Rust version instead of one, and the answer to that is that LLVM is not sufficiently smart and/or the RangeInclusive design is not well adapted to LLVM¹.
The optimal code generation for such loops is to split the loop, transforming:
for j in 0u64..=num {
    sum += 1;
}
Into:
for j in 0u64..num { // equivalent to for (auto j = 0; j < num; ++j)
    sum += 1;
}
if 0 <= num {
    sum += 1;
}
This would lead to a single branch in the main loop, and allow LLVM to replace the loop with a closed-form formula².
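As a compilable sketch of that transformation (my own illustration of what the split might look like, not actual compiler output; for u64 the 0 <= num guard is always true, so the epilogue runs unconditionally):
// Manually split version of the 0..=num loop: the hot loop iterates over the
// half-open range 0..num (one comparison per iteration), and the final
// element `num` is handled once after the loop.
pub fn sum1_split(num: u64) -> u64 {
    let mut sum: u64 = 0;
    for _ in 0..num {
        sum += 1;
    }
    sum += 1; // the `0 <= num` case; always true for u64
    sum
}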
The Loop Splitting optimization applies to RangeInclusive and most other Chain iterators; indeed, a RangeInclusive can be thought of as a chain of a once iterator and a half-open range iterator (in either order). It is not always a win: like inlining, it duplicates the loop body, which, if large, may lead to significant code-size overhead.
Unfortunately, LLVM fails to split the loop. I am not sure whether the optimization is missing altogether, or whether it simply fails to apply here for some reason. I'm looking forward to the rustc_codegen_gcc backend: GCC gained Loop Splitting in GCC 7, so it may be able to generate better code there.
¹ See this comment I left about performance issues with RangeInclusive. I spent significant time banging my head against this issue in 2019, and I dearly wish RangeInclusive (and all ranges) were NOT Iterator; it's a costly mistake.
² Chain iterators in general perform much better with .for_each(...), specifically because there the loop is (manually) split. See the playground for (0..=num).for_each(|_| sum += 1) being reduced to num + 1.
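For reference, a sketch of the for_each formulation mentioned in the second footnote; whether it actually reduces to num + 1 depends on the compiler version, so treat that as reported rather than guaranteed:
// Internal iteration: the range drives the closure itself, which lets the
// final element be handled outside the hot loop.
pub fn sum1_for_each(num: u64) -> u64 {
    let mut sum: u64 = 0;
    (0..=num).for_each(|_| sum += 1);
    sum
}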

"These two loops are equivalent in C++ and Rust"
Your two code snippets don't share the same behavior: for (uint64_t j = 0; j <= n; ++j) doesn't handle n == UINT64_MAX (it loops forever in that case), while for j in 0u64..=num does (it never enters an infinite loop).
An equivalent Rust version could be:
pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    let mut j = 0;
    while j <= num {
        sum = sum.wrapping_add(1);
        j = j.wrapping_add(1);
    }
    sum
}
which currently produces the following assembly (Godbolt):
example::sum1:
        xor     eax, eax
.LBB0_1:
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

Related

Why does iteration over an inclusive range generate longer assembly in Rust?


Modifying Individual Bytes in Rust [duplicate]

This question already has answers here:
Converting number primitives (i32, f64, etc) to byte representations
In C, if I want to modify the most significant byte of an integer, I can do it in two different ways:
#include <stdint.h>
#include <stdio.h>

int main() {
    uint32_t i1 = 0xDDDDDDDD;
    printf("i1 original: 0x%X\n", i1);
    uint32_t i1msb = 0xAA000000;
    uint32_t i1_mask = 0x00FFFFFF;
    uint32_t i1msb_mask = 0xFF000000;
    i1 = (i1 & i1_mask) | (i1msb & i1msb_mask);
    printf("i1 modified: 0x%X\n", i1);

    uint32_t i2 = 0xDDDDDDDD;
    printf("i2 original: 0x%X\n", i2);
    uint32_t *i2ptr = &i2;
    char *c2ptr = (char *) i2ptr;
    c2ptr += 3;
    *c2ptr = 0xAA;
    printf("i2 modified: 0x%X\n", i2);
}
i1 original: 0xDDDDDDDD
i1 modified: 0xAADDDDDD
i2 original: 0xDDDDDDDD
i2 modified: 0xAADDDDDD
The masking approach works in both C and Rust, but I have not found any way to do the direct byte-manipulation approach in (safe) Rust. Although it has some endianness issues that masking does not, I think these can all be resolved at compile time to provide a safe, cross-platform interface, so why is direct byte manipulation not allowed in safe Rust? Or if it is, please correct me, as I have been unable to find a function for it - I am thinking something like
fn change_byte(i: &u32, byte_num: usize, new_val: u8) -> Result<(), ErrorType> { ... }
which checks that byte_num is within range before performing the operation, and in which byte_num always goes from least significant byte to most significant.
Rust doesn't have a built-in function to modify an integer's individual bytes in place. It does, however, have methods to convert integers to byte arrays and byte arrays to integers. For example, one might write
fn set_low_byte(number: u32, byte: u8) -> u32 {
    let mut arr = number.to_be_bytes();
    arr[3] = byte;
    u32::from_be_bytes(arr)
}
There are also methods for little and native endian byte order. The bitmask approach is also perfectly doable, and the pointer stuff will also work if you don't mind unsafe.
playground link
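For completeness, here is a sketch along the lines of the change_byte helper the question asks for. The name and shape come from the question; I've changed the first parameter to &mut u32 so the function can actually mutate its argument, used String as a stand-in error type, and byte_num counts from the least significant byte:
// Hypothetical helper in the spirit of the question's change_byte:
// byte_num counts from the least significant byte (0..=3 for u32).
fn change_byte(i: &mut u32, byte_num: usize, new_val: u8) -> Result<(), String> {
    let mut bytes = i.to_le_bytes(); // little-endian order: index 0 = LSB
    let slot = bytes
        .get_mut(byte_num)
        .ok_or_else(|| format!("byte_num {} out of range", byte_num))?;
    *slot = new_val;
    *i = u32::from_le_bytes(bytes);
    Ok(())
}

fn main() {
    let mut x: u32 = 0xDDDDDDDD;
    change_byte(&mut x, 3, 0xAA).unwrap(); // modify the most significant byte
    assert_eq!(x, 0xAADDDDDD);
    println!("{:#X}", x);
}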

How to concatenate bounded-length strings in SIMD/AVX2 code

I have 32 strings of length 1 to 4 stored in AVX2 uint8x32 registers: one register each for length, byte0, byte1, byte2, and byte3. I'd like to concatenate all the strings and write them densely to memory. If all the strings were of equal length this would be straightforward: I'd shuffle the bytes to their target positions using pshufb and use some blend calls to mix the byte0/byte1/byte2/byte3 registers together. (Alternatively perhaps I could use vpunpck* instructions. Not yet figured out...)
However, the variable-length aspect makes this harder: where each output byte comes from is now a nontrivial function of the lengths. I can't figure out how to implement this efficiently in AVX2 code. Help?
Bottom line: I'd like a reimplementation of the following function, written in (as fast as possible) vector code rather than scalar code:
(godbolt link)
int concat_strings(char* dst, __m256i len_v, __m256i byte0_v, __m256i byte1_v, __m256i byte2_v, __m256i byte3_v) {
    char len[32];
    char byte0[32];
    char byte1[32];
    char byte2[32];
    char byte3[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(len), len_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte0), byte0_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte1), byte1_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte2), byte2_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte3), byte3_v);
    int pos = 0;
    for (int i = 0; i < 32; ++i) {
        dst[pos + 0] = byte0[i];
        dst[pos + 1] = byte1[i];
        dst[pos + 2] = byte2[i];
        dst[pos + 3] = byte3[i];
        pos += len[i];
    }
    return pos;
}
Help?

VC++ SSE code generation - is this a compiler bug?

A very particular code sequence in VC++ generated the following instruction (for Win32):
unpcklpd xmm0,xmmword ptr [ebp-40h]
2 questions arise:
(1) As far as I understand the Intel manual, unpcklpd requires its second (memory) operand to be 16-byte (128-bit) aligned. If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug?
(2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be?
The particular exception thrown is access violation at 0xFFFFFFFF, but AFAIK that's just a code for misalignment.
[Edit:]
Here's some source that demonstrates the bad code generation - but typically doesn't cause a crash. (that's mostly what I'm wondering about)
[Edit 2:]
The code sample now reproduces the actual crash. This one also crashes outside the debugger - I suspect the difference occurs because the debugger launches the program at different typical base addresses.
// mock.cpp
#include <stdio.h>

struct mockVect2d
{
    double x, y;
    mockVect2d() {}
    mockVect2d(double a, double b) : x(a), y(b) {}
    mockVect2d operator + (const mockVect2d& u) {
        return mockVect2d(x + u.x, y + u.y);
    }
};

struct MockPoly
{
    MockPoly() {}
    mockVect2d* m_Vrts;
    double m_Area;
    int m_Convex;
    bool m_ParClear;
    void ClearPar() { m_Area = -1.; m_Convex = 0; m_ParClear = true; }
    MockPoly(int len) { m_Vrts = new mockVect2d[len]; }
    mockVect2d& Vrt(int i) {
        if (!m_ParClear) ClearPar();
        return m_Vrts[i];
    }
    const mockVect2d& GetCenter() { return m_Vrts[0]; }
};

struct MockItem
{
    MockItem() : Contour(1) {}
    MockPoly Contour;
};

struct Mock
{
    Mock() {}
    MockItem m_item;
    virtual int GetCount() { return 2; }
    virtual mockVect2d GetCenter() { return mockVect2d(1.0, 2.0); }
    virtual MockItem GetItem(int i) { return m_item; }
};

void testInner(int a)
{
    int c = 8;
    printf("%d", c);
    Mock* pMock = new Mock;
    int Flag = true;
    int nlr = pMock->GetCount();
    if (nlr == 0)
        return;
    int flr = 1;
    if (flr == nlr)
        return;
    if (Flag)
    {
        if (flr < nlr && flr > 0) {
            int c = 8;
            printf("%d", c);
            MockPoly pol(2);
            mockVect2d ctr = pMock->GetItem(0).Contour.GetCenter();
            // The mess happens here:
            // ; 74 :   pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
            //
            //   call     ?Vrt@MockPoly@@QAEAAUmockVect2d@@H@Z   ; MockPoly::Vrt
            //   movdqa   xmm0, XMMWORD PTR $T4[ebp]
            //   unpcklpd xmm0, QWORD PTR tv190[ebp]   **** crash!
            //   movdqu   XMMWORD PTR [eax], xmm0
            pol.Vrt(0) = ctr + mockVect2d(1.0, 0.);
            pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
        }
    }
}

void main()
{
    testInner(2);
    return;
}
If you prefer, download a ready vcxproj with all the switches set from here. This includes the complete ASM too.
Update: this is now a confirmed VC++ compiler bug, hopefully to be resolved in VS2015 RTM.
Edit: The Connect report, like many others, is now dead. However, the compiler bug seems to be resolved in VS2017, though not in VS2015 Update 3.
Since no one else has stepped up, I'm going to take a shot.
1) If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug?
I'm not sure it is true that you cannot force alignment for stack variables. Consider this code:
struct foo
{
    char a;
    int b;
    unsigned long long c;
};

int wmain(int argc, wchar_t* argv[])
{
    foo moo;
    moo.a = 1;
    moo.b = 2;
    moo.c = 3;
}
Looking at the startup code for main, we see:
00E31AB0 push ebp
00E31AB1 mov ebp,esp
00E31AB3 sub esp,0DCh
00E31AB9 push ebx
00E31ABA push esi
00E31ABB push edi
00E31ABC lea edi,[ebp-0DCh]
00E31AC2 mov ecx,37h
00E31AC7 mov eax,0CCCCCCCCh
00E31ACC rep stos dword ptr es:[edi]
00E31ACE mov eax,dword ptr [___security_cookie (0E440CCh)]
00E31AD3 xor eax,ebp
00E31AD5 mov dword ptr [ebp-4],eax
Adding __declspec(align(16)) to moo gives
01291AB0 push ebx
01291AB1 mov ebx,esp
01291AB3 sub esp,8
01291AB6 and esp,0FFFFFFF0h <------------------------
01291AB9 add esp,4
01291ABC push ebp
01291ABD mov ebp,dword ptr [ebx+4]
01291AC0 mov dword ptr [esp+4],ebp
01291AC4 mov ebp,esp
01291AC6 sub esp,0E8h
01291ACC push esi
01291ACD push edi
01291ACE lea edi,[ebp-0E8h]
01291AD4 mov ecx,3Ah
01291AD9 mov eax,0CCCCCCCCh
01291ADE rep stos dword ptr es:[edi]
01291AE0 mov eax,dword ptr [___security_cookie (12A40CCh)]
01291AE5 xor eax,ebp
01291AE7 mov dword ptr [ebp-4],eax
Apparently the compiler (VS2010, debug build for Win32), recognizing that the code needs a specific alignment, takes steps to ensure it can provide it.
2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be?
So, a couple of thoughts:
"and even then not always" - Not standing over your shoulder when you run this, I can't say for certain. However it seems plausible that just by random chance, stacks could get created with the alignment you need. By default, x86 uses 4byte stack alignment. If you need 16 byte alignment, you've got a 1 in 4 shot.
As for the rest (from https://msdn.microsoft.com/en-us/library/aa290049%28v=vs.71%29.aspx#ia64alignment_topic4):
On the x86 architecture, the operating system does not make the alignment fault visible to the application. ...you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
TLDR: Using __declspec(align(16)) should give you the alignment you want, even for stack variables. For unaligned accesses, the OS will catch the exception and handle it for you (at a cost of performance).
Edit1: Responding to the first 2 comments below:
Based on MS's docs, you are correct about the alignment of stack parameters, but they propose a solution as well:
You cannot specify alignment for function parameters. When data that has an alignment attribute is passed by value on the stack, its alignment is controlled by the calling convention. If data alignment is important in the called function, copy the parameter into correctly aligned memory before use.
Neither your sample on Microsoft Connect nor the code above produces the same code for me (I'm only on VS2010), so I can't test this. But given this code from your sample:
struct mockVect2d
{
    double x, y;
    mockVect2d(double a, double b) : x(a), y(b) {}
It would seem that aligning either mockVect2d or the 2 doubles might help.

does msvc have analog of gcc's ({ })

Does MSVC have an analog of GCC's ({ }) construct?
I assume the answer is no.
Please note that this is a question about compiler capabilities, not a question of taste or style.
Not that I'm recommending, by asking this question, that anybody start using the ({ }) construct.
The reference for the ({ }) construct is http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_4.html#SEC62, officially called "Statements and Declarations in Expressions". It allows embedding statements (like for, goto) and declarations into expressions.
In a way, yes. This is a compound statement expression, which one can think of as a lambda function that is immediately invoked, and only called once.
Recent versions of MSVC should support lambda functions, so that would be something like:
[](){ /* your compound statement expression here */ }();
EDIT: removed a surplus parenthesis
EDIT 2: For your amusement, here is an example of how to use either variation with some (admittedly totally silly) real code. Don't mind too much the actual usefulness of the code, but how expressive it is and how nicely the compiler even optimizes it:
#include <string.h>
#include <stdio.h>

int main()
{
    unsigned int a =
        ({
            unsigned int count = 0;
            const char* str = "a silly thing";
            for(unsigned int i = 0; i < strlen(str); ++i)
                count += str[i] == 'i' ? 1 : 0;
            count;
        });

    unsigned int b =
        [](){
            unsigned int count = 0;
            const char* str = "a silly thing";
            for(unsigned int i = 0; i < strlen(str); ++i)
                count += str[i] == 'i' ? 1 : 0;
            return count;
        }();

    printf("Number of 'i' : %u\t%u\n", a, b);
    return 0;
}
... which gcc 4.5 compiles to:
movl $2, 8(%esp)
movl $2, 4(%esp)
movl $LC0, (%esp)
call _printf
No, MSVC does not have an equivalent form.

Resources