How to load 4 integer values efficiently in neon register on msvc compiler? - visual-c++

How do I implement below operation efficiently on msvc compiler?
uint32x4_t temp = { 1, 2, 3, 4 };
I have to load 4 different values in neon register very efficiently since I am working to optimize performance. Above statement works for android clang but fails on msvc compiler since uint32x4_t is typedef'ed to __n128.
Following is the structure of __n128:
typedef union __declspec(intrin_type) _ADVSIMD_ALIGN(8) __n128
{
unsigned __int64 n128_u64[2];
unsigned __int32 n128_u32[4];
unsigned __int16 n128_u16[8];
unsigned __int8 n128_u8[16];
__int64 n128_i64[2];
__int32 n128_i32[4];
__int16 n128_i16[8];
__int8 n128_i8[16];
float n128_f32[4];
struct
{
__n64 low64;
__n64 high64;
} DUMMYNEONSTRUCT;
} __n128;

In C99, when initializing a union with a initializer list, you can specify the particular members that are initialized, as follows:
uint32x4_t temp = { .n128_u32 = {1,2,3,4} };
However, this C99 syntax is only supported in Visual Studio 2013 and higher. Visual Studio 2012 and below do not support this feature, and thus, you can only initialize a union with a static initializer based on the first entry (n128_u64). You could come up with an initializer that fits your uint32 data into uint64. Since they are constants, it will not take any additional execution time. Looks really ugly:
uint32x4_t temp = { { 1 << 32 | 2, 3 << 32, 4 } };
If this code needs to be portable between compilers, a better option would be to create a preprocessor macro, that handles formatting of constants.

Related

why typedef throwing error :(S) Initializer must be a valid constant expression

in f1.h header using typedef for structure. sample code snippet shown below
typedef struct{
int a;
union u
{
int x;
char y;
}xyz;
}mystruct;
In f2.h header using the structure mysturct to get the offset. Code snippet shown below
static mystruct ktt
//#define OFFSET_T(b, c) ((int*)((&((mystruct*)0)->b)) - (int*)((&((mystruct*)0)->c)))
#define OFFSET_T(b, c) ((char*) &ktt.b - (char *) &ktt.c)
static struct Mystruct1{
int n;
}mystruct1 = {OFFSET_T(xyz,a)};
when i'm doing compilation in AIX machine using xlc compiler it is throwing the error as "1506-221(S) Initializer must be a valid constant expression".
i tried both the macro's but both are getting same error. Is there anything wrong in f2.h macro while performing size of structure to get offset ??
The expression in question needs to be an arithmetic constant expression in order to be portable. Neither macro qualifies, since operands of pointer type are involved and arithmetic constant expressions are restricted such that those operands are not allowed. In C11, this is found in subclause 6.6 paragraph 8.
That said, the code using the first macro (source reproduced below) does compile on multiple versions of the xlc compiler on AIX.
typedef struct{
int a;
union u
{
int x;
char y;
}xyz;
}mystruct;
static mystruct ktt;
#define OFFSET_T(b, c) ((int*)((&((mystruct*)0)->b)) - (int*)((&((mystruct*)0)->c)))
//#define OFFSET_T(b, c) ((char*) &ktt.b - (char *) &ktt.c)
static struct Mystruct1{
int n;
}mystruct1 = {OFFSET_T(xyz,a)};
The compiler invocation I used was:
xlc offsetcalc.c -c -o /dev/null
The version information for one of the older versions I tried is:
IBM XL C/C++ for AIX, V10.1
Version: 10.01.0000.0021
The version information for one of the newest versions I tried is:
IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)
Version: 13.01.0003.0004

How to assign value to union in VC++

There is an union in C and embedded into C++ as below:
typedef union MyUnion MyUnion_;
union MyUnion{
ULONG mLong;
char mChar;
...
};
When I trying to init it like:
MyUnion_ test;
test = (MyUnion_)NULL;
this is can compile by Mingw32, but gives
error: C2440: 'type cast': cannot convert from 'void *' to 'MyUnion_'
in VC++ (VS2015). So how to do cast & initialize of union in VC++ compiler?
Now I am doing like this:
MyUnion_ test;
test.mLong = NULL;
but this makes the program look bad when passing union as a parameter.
void test(MyUnion_ u)
ULONG i = 0;
// mingw32
test((MyUnion_)i);
// vc++
MyUnion_ temp;
temp.mLong = i;
test(temp);
Using a compiler that supports the C++11 uniform initialization syntax you can use a braced initializer with a single value which will be used to initialize the first non-static field of the union …
MyUnion test{ 0 };
You could use NULL instead of zero in the code above but it seems confusing to initialise mLong (which is a ULONG) with NULL.
You can also use braced initialization in an assignment statement if you have to set the variable after it was declared …
MyUnion test{ 0 };
// ...
test = { 3 };
Note that the braced initializer syntax may also be available in older compilers that offer experimental support for what used to be called C++0x
Visual Studio 2015 C++ supports braced initializers unless you are compiling a file with a .c extension or are using the /TC switch to compile as C code (rather than C++ code).
Older C++ (and C) compilers
When using compilers that don't support braced initialization the older assignment initialization syntax can be used in declarations ...
MyUnion_ test = { 0 };
… but not in assignment statements.
Casting to union type
According to this IBM Knowledge Center article casting to a union type is an extension to C99 "... implemented to facilitate porting programs developed with GNU C" - which suggests it's not standard C.
This Microsoft documentation indicates there are no legal casts in C for a union, struct or array.
In C++ a cast to a union type is possible if a suitable constructor exists...
union MyUnion {
unsigned long mLong;
char mChar;
MyUnion(unsigned long val) { mLong = val; };
};
// valid cast
test = (MyUnion)3ul;
// invalid cast - no suitable constructor exists
void * ptr = 0;
test = (MyUnion)ptr;
Default constructor?
typedef union MyUnion MyUnion_;
union MyUnion {
ULONG mLong;
char mChar;
MyUnion(): mLong(0) {}
};
int main()
{
MyUnion_ temp;
return 0;
}

using bitfields as a sorting key in modern C (C99/C11 union)

Requirement:
For my tiny graphics engine, I need an array of all objects to draw. For performance reasons this array needs to be sorted on the attributes. In short:
Store a lot of attributes per struct, add the struct to an array of structs
Efficiently sort the array
walk over the array and perform operations (modesetting and drawing) depending on the attributes
Approach: bitfields in a union (i.e.: let the compiler do the masking and shifting for me)
I thought I had an elegant plan to accomplish this, based on this article: http://realtimecollisiondetection.net/blog/?p=86. The idea is as follows: each attribute is a bitfield, which can be read and written to (step 1). After writing, the sorting procedure look at the bitfield struct as an integer, and sorts on it (step 2). Afterwards (step 3), the bitfields are read again.
Sometimes code says more than a 1000 words, a high-level view:
union key {
/* useful for accessing */
struct {
unsigned int some_attr : 2;
unsigned int another_attr : 3;
/* ... */
} bitrep;
/* useful for sorting */
uint64_t intrep;
};
I would just make sure that the bit-representation was as large as the integer representation (64 bits in this case). My first approach went like this:
union key {
/* useful for accessing */
struct {
/* generic part: 11 bits */
unsigned int layer : 2;
unsigned int viewport : 3;
unsigned int viewportLayer : 3;
unsigned int translucency : 2;
unsigned int type : 1;
/* depends on type-bit: 64 - 11 bits = 53 bits */
union {
struct {
unsigned int sequence : 8;
unsigned int id : 32;
unsigned int padding : 13;
} cmd;
struct {
unsigned int depth : 24;
unsigned int material : 29;
} normal;
};
};
/* useful for sorting */
uint64_t intrep;
};
Note that in this case, there is a decision bitfield called type. Based on that, either the cmd struct or the normal struct gets filled in, just like in the mentioned article. However this failed horribly. With clang 3.3 on OSX 10.9 (x86 macbook pro), the key union is 16 bytes, while it should be 8.
Unable to coerce clang to pack the struct better, I took another approach based on some other stack overflow answers and the preprocessor to avoid me having to repeat myself:
/* 2 + 3 + 3 + 2 + 1 + 5 = 16 bits */
#define GENERIC_FIELDS \
unsigned int layer : 2; \
unsigned int viewport : 3; \
unsigned int viewportLayer : 3; \
unsigned int translucency : 2; \
unsigned int type : 1; \
unsigned int : 5;
/* 8 + 32 + 8 = 48 bits */
#define COMMAND_FIELDS \
unsigned int sequence : 8; \
unsigned int id : 32; \
unsigned int : 8;
/* 24 + 24 = 48 bits */
#define MODEL_FIELDS \
unsigned int depth : 24; \
unsigned int material : 24;
struct generic {
/* 16 bits */
GENERIC_FIELDS
};
struct command {
/* 16 bits */
GENERIC_FIELDS
/* 48 bits */
COMMAND_FIELDS
} __attribute__((packed));
struct model {
/* 16 bits */
GENERIC_FIELDS
/* 48 bits */
MODEL_FIELDS
} __attribute__((packed));
union alkey {
struct generic gen;
struct command cmd;
struct model mod;
uint64_t intrep;
};
Without including the __attribute__((packed)), the command and model structs are 12 bytes. But with the __attribute__((packed)), they are 8 bytes, exactly what I wanted! So it would seems that I have found my solution. However, my small experience with bitfields has taught me to be leery. Which is why I have a few questions:
My questions are:
Can I get this to be cleaner (i.e.: more like my first big union-within-struct-within-union) and still keep it 8 bytes for the key, for fast sorting?
Is there a better way to accomplish this?
Is this safe? Will this fail on x86/ARM? (really exotic architectures are not much of a concern, I'm targeting the 2 most prevalent ones). What about setting a bitfield and then finding out that an adjacent one has already been written to. Will different compilers vary wildly on this?
What issues can I expect from different compilers? Currently I'm just aiming for clang 3.3+ and gcc 4.9+ with -std=c11. However it would be quite nice if I could use MSVC as well in the future.
Related question and webpages I've looked up:
Variable-sized bitfields with aliasing
Unions within unions
for those (like me) scratching their heads about what happens with bitfields that are not byte-aligned and endianness, look no further: http://mjfrazer.org/mjfrazer/bitfields/
Sadly no answer got me the entirety of the way there.
EDIT: While experimenting, setting some values and reading the integer representation. I noticed something that I had forgotten about: endianness. This opens up another can of worms. Is it even possible to do what I want using bitfields or will I have to go for bitshifting operations?
The layout for bitfields is highly implementation (=compiler) dependent. In essence, compilers are free to place consecutive bitfields in the same byte/word if it sees fit, or not. Thus without extensions like the packed attribute that you mention, you can never be sure that your bitfields are squeezed into one word.
Then, if the bitfields are not squeezed into one word, or if you have just some spare bits that you don't use, you may be even more in trouble. These so-called padding bits can have arbitrary values, thus your sorting idea could never work in a portable setting.
For all these reasons, bitfields are relatively rarely used in real code. What you can see more often is the use of macros for the bits of your uint64_t that you need. For each of your bitfields that you have now, you'd need two macros, one to extract the bits and one to set them. Such a code then would be portable on all platforms that have a C99/C11 compiler without problems.
Minor point:
In the declaration of a union it is better to but the basic integer field first. The default initializer for a union uses the first field, so this would then ensure that your union would be initialized to all bits zero by such an initializer. The initializer of the struct would only guarantee that the individual fields are set to 0, the padding bits, if any, would be unspecific.

Compiler Support of members of SSE vector types like m128_f32[x]

This might sound stupid but is there a way to activate support of the inner members of an SSE vector type ?
I know this works fine on MSVC, And I ve found some comments on forums and SO like this.
The question, is can I activate this on CLang at least without creating my own unions ?
Thank you
[edit, workaround]
Currently I decided to create a vec4 type to help me.
here is the code
#include <emmintrin.h>
#include <cstdint>
#ifdef _WIN32
typedef __m128 vec4;
typedef __m128i vec4i;
typedef __m128d vec4d;
#else
typedef union __declspec(align(16)) vec4{
float m128_f32[4];
uint64_t m128_u64[2];
int8_t m128_i8[16];
int16_t m128_i16[8];
int32_t m128_i32[4];
int64_t m128_i64[2];
uint8_t m128_u8[16];
uint16_t m128_u16[8];
uint32_t m128_u32[4];
} vec4;
typedef union __declspec(align(16)) vec4i{
uint64_t m128i_u64[2];
int8_t m128i_i8[16];
int16_t m128i_i16[8];
int32_t m128i_i32[4];
int64_t m128i_i64[2];
uint8_t m128i_u8[16];
uint16_t m128i_u16[8];
uint32_t m128i_u32[4];
} vec4i;
typedef union __declspec(align(16)) vec4d{
double m128d_f64[2];
} vec4d;
#endif
On recent clangs, this Just Works without you needing to do anything at all:
#include <immintrin.h>
float foo(__m128 x) {
return x[1];
}
AFAIK it Just Works in recent GCC builds as well.
However, I should note the following:
Consider carefully whether or not you really need to do element-wise access in your vector code. If you can keep your operations in-lane, they will almost certainly be significantly more efficient.
If you really do need to do a significant number of lanewise or horizontal operations, and you don’t need portability, consider using Clang extended vectors (or “OpenCL vectors") instead of the basic SSE intrinsic types. You can pass them to intrinsics just like __m128 and friends, but they also have much nicer syntax for vector-scalar operations, lane wise operations, vector literals, etc.

Visual C++: How is checked_array_iterator useful?

On compiling code at Warning Level 4 (/W4), I get C4996 warnings on std::copy() calls whose parameters are C arrays (not STL containers like vectors). The recommended solution to fix this seems to be to use stdext::checked_array_iterator.
What is the use of stdext::checked_array_iterator? How does it work?
Why does it not give any compile warning on this piece of erroneous code compiled under Visual C++ 2010?:
#include <algorithm>
#include <iterator>
using namespace std;
int main()
{
int arr0[5] = {100, 99, 98, 97, 96};
int arr1[3];
copy( arr0, arr0 + 5, stdext::checked_array_iterator<int*>( arr1, 3 ) );
return 0;
}
This page, Checked Iterators, describe how it works, but this quote sums it up: Checked iterators ensure that you do not overwrite the bounds of your container.
So if you go outside the bounds of the iterator it'll either throw and exception or call invalid_parameter.

Resources