I'm implementing a sound mixer. It works well without SIMD instructions, but I'm having a hard time figuring out how to extract my sound data into separate channels.
My data comes in an interleaved format: L0R0 L1R1 L2R2 L3R3...
I load them into a __m128i in the same format, so I have 4 stereo samples in the register.
I'd like them to be in separate channels: L0L1L2L3 R0R1R2R3. This is the part that I'm missing.
So the input is 8 x i16 (4 x i32 of interleaved pairs).
I would like the output as left = 4 x f32 and right = 4 x f32, then do the mixing.
After the mixing, I can interleave the channels and I get L0R0 L1R1 L2R2... back:
__m128 *src0 = mixed_channel0;              // left channel, 4 x f32 per vector
__m128 *src1 = mixed_channel1;              // right channel, 4 x f32 per vector
__m128i *dest = (__m128i *)buffer;          // interleaved i16 output
for (u32 sample_index = 0; sample_index < sample_chunk_count; ++sample_index)
{
    __m128 s0 = _mm_load_ps((f32 *)src0++);
    __m128 s1 = _mm_load_ps((f32 *)src1++);
    __m128i l = _mm_cvtps_epi32(s0);        // f32 -> i32
    __m128i r = _mm_cvtps_epi32(s1);
    __m128i lr0 = _mm_unpacklo_epi32(l, r); // L0R0 L1R1 as i32
    __m128i lr1 = _mm_unpackhi_epi32(l, r); // L2R2 L3R3 as i32
    *dest++ = _mm_packs_epi32(lr0, lr1);    // saturate back to interleaved i16
}
Basically I need to do the opposite:
__m128i input = [L0R0, L1R1, L2R2, L3R3] packed pairs of 16bit ints
// magic happens, then
__m128 left = [L0, L1, L2, L3] packed 32bit floats
__m128 right = [R0, R1, R2, R3] packed 32bit floats
Even if I mask out the low/high-order i16s, how can I convert them to f32s? After masking I would get:
__m128i right = [xx, R0, xx, R1, xx, R2, xx, R3]
__m128i left = [L0, xx, L1, xx, L2, xx, L3, xx]
If I could convert them to 4 x i32s, it would be easy to convert them to f32s with _mm_cvtepi32_ps and I would be done.
Thanks.
Mask and shift to go from pairs of 16-bit samples to 32-bit samples.
// clunky calling convention, but should inline ok.
__m128 unpack_leftright_16bit_channels(__m128i input, __m128 &right_retval) {
    // input = [L0R0, L1R1, L2R2, L3R3]  packed pairs of 16bit ints
    __m128i sign_extended_left = _mm_srai_epi32(input, 16);
    __m128i high_right = _mm_slli_epi32(input, 16);
    __m128i sign_extended_right = _mm_srai_epi32(high_right, 16);
    right_retval = _mm_cvtepi32_ps(sign_extended_right);
    // __m128 right = [R0, R1, R2, R3]  packed 32bit floats
    __m128 left = _mm_cvtepi32_ps(sign_extended_left);
    // __m128 left = [L0, L1, L2, L3]  packed 32bit floats
    return left;
}
This compiles to what you'd expect with gcc 5.3 or clang 3.7.
This will bottleneck on shuffle throughput on most microarchitectures (see Agner Fog's insn tables and microarch pdf, and other links in the x86 tag wiki). It might be worth using SSSE3 pshufb to do the logical left-shift, only using actual shift instructions for the arithmetic right shifts that need to leave a copy of the sign bit in the upper half of each 32-bit element. Without AVX, pshufb shuffles in-place, just like pslld shifts in-place (thanks, Intel :(), so it doesn't avoid the extra MOV instruction to make a 2nd copy of the input.
On Skylake, immediate vector shifts run on p0/p1, and so does cvtdq2ps. Using pshufb for the left shift would increase throughput to one float output vector per clock, since shuffles run on port 5.
Pre-Skylake, immediate vector shifts only run on a single port, e.g. p0 in Haswell. At least that's not the same port as int->float: Haswell runs cvtdq2ps on p1. So again, pshufb would increase throughput to one ps vector per clock.
It seems like there should be a better way to do this, like with an AND mask or something. But it seems that 2 shifts, or a shuffle+shift, are the best way to sign-extend the low 16 bits of every 32-bit element into the full 32-bit element.
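For reference, here is a minimal sketch of the shuffle+shift variant (SSSE3), with the same clunky calling convention as above. The pshufb control mask is my own construction of a "logical left shift by 16 within each 32-bit element" (move the low two bytes of each dword into the high half, zero the low half); the rest mirrors the function above, so treat it as a sketch rather than tested code.

#include <immintrin.h>

__m128 unpack_leftright_16bit_channels_ssse3(__m128i input, __m128 &right_retval) {
    // per-dword byte shuffle: [b0 b1 b2 b3] -> [0 0 b0 b1], i.e. a logical <<16
    const __m128i shl16 = _mm_set_epi8(13, 12, (char)0x80, (char)0x80,
                                        9,  8, (char)0x80, (char)0x80,
                                        5,  4, (char)0x80, (char)0x80,
                                        1,  0, (char)0x80, (char)0x80);
    __m128i high_right = _mm_shuffle_epi8(input, shl16);      // runs on the shuffle port
    __m128i sign_extended_right = _mm_srai_epi32(high_right, 16);
    __m128i sign_extended_left  = _mm_srai_epi32(input, 16);  // arithmetic shift keeps the sign
    right_retval = _mm_cvtepi32_ps(sign_extended_right);
    return _mm_cvtepi32_ps(sign_extended_left);
}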
TLDR: given 64-bit registers rs1 (signed) = 0xffff'ffff'ffff'fff6 and rs2 (unsigned) = 0x10, does the RISC-V mulhsu instruction return 0x0000'0000'0000'000f, 0xffff'ffff'ffff'ffff, or something else entirely to rd?
I am working on implementing a simulated version of the RISC-V architecture and have run into a snag when implementing the RV64M mulh[[s]u] instruction. I'm not sure if mulhsu returns a signed or unsigned number. If it does return a signed number, then what is the difference between mulhsu and mulh?
Here is some pseudocode demonstrating the problem (s64 and u64 denote signed and unsigned 64-bit registers, respectively):
rs1.s64 = 0xffff'ffff'ffff'fff6; //-10
rs2.u64 = 0x10; // 16
execute(mulhsu(rs1, rs2));
// which of these is correct? Note: rd only returns the upper 64 bits of the product
EXPECT_EQ(0x0000'0000'0000'000f, rd);
EXPECT_EQ(0xffff'ffff'ffff'ffff, rd);
EXPECT_EQ(<some other value>, rd);
Should rd be signed? unsigned?
From the instruction manual:
MUL performs an XLEN-bit × XLEN-bit multiplication of rs1 by rs2 and places the lower XLEN bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2 × XLEN-bit product, for signed × signed, unsigned × unsigned, and signed rs1 × unsigned rs2 multiplication, respectively. If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
The answer to your question is EXPECT_EQ(0xffff'ffff'ffff'ffff, rd);.
mulhsu does a multiplication of a sign-extended rs1.s64 by a zero-extended rs2.u64.
You can see that in the compiler machine description riscv.md.
So mulhsu (64-bit) returns the equivalent of ((s128) rs1.s64 * (u128) rs2.u64) >> 64, where s128 is a signed 128-bit int and u128 an unsigned 128-bit int.
The difference between the three high multiplies is:
mulhsu is a multiplication between a sign extended register and a zero extended register.
mulh is a multiplication of two sign extended registers.
mulhu is a multiplication of two zero extended registers.
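If it helps, here is a minimal sketch of how a simulator could model the three variants, assuming a host compiler with __int128 (GCC/Clang on a 64-bit host); the widening conversions express the sign/zero extension, and the test values are the ones from the question.

#include <stdint.h>
#include <assert.h>

typedef __int128          s128;
typedef unsigned __int128 u128;

// mulh: rs1 and rs2 both sign-extended to 128 bits
static uint64_t mulh(int64_t rs1, int64_t rs2) {
    return (uint64_t)(((s128)rs1 * (s128)rs2) >> 64);
}
// mulhu: rs1 and rs2 both zero-extended to 128 bits
static uint64_t mulhu(uint64_t rs1, uint64_t rs2) {
    return (uint64_t)(((u128)rs1 * (u128)rs2) >> 64);
}
// mulhsu: rs1 sign-extended, rs2 zero-extended; the full product always
// fits in a signed 128-bit int, and >> 64 on GCC/Clang is arithmetic
static uint64_t mulhsu(int64_t rs1, uint64_t rs2) {
    return (uint64_t)(((s128)rs1 * (s128)(u128)rs2) >> 64);
}

int main(void) {
    uint64_t rs1 = 0xfffffffffffffff6ULL; // -10 when treated as signed
    uint64_t rs2 = 0x10;                  // 16
    assert(mulhsu((int64_t)rs1, rs2)        == 0xffffffffffffffffULL);
    assert(mulh((int64_t)rs1, (int64_t)rs2) == 0xffffffffffffffffULL);
    assert(mulhu(rs1, rs2)                  == 0x000000000000000fULL);
    return 0;
}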
I have 32 length-1-to-4 strings stored in AVX2 uint8x32 registers, one register for each of length, byte0, byte1, byte2, byte3. I'd like to concatenate all the strings and write them densely to memory. If all the strings were equal length this would be straightforward: I'd shuffle the bytes to their target positions using pshufb and use some blend calls to mix the byte0/byte1/byte2/byte3 registers together. (Alternatively perhaps I could use vpunpck* instructions. Not yet figured out...)
However, the variable-length aspect makes this harder: where each output byte comes from is now a nontrivial function of the lengths. I can't figure out how to implement this efficiently in AVX2 code. Help?
Bottom line: I'd like a reimplementation of the following function, written in (as fast as possible) vector code rather than scalar code:
(godbolt link)
int concat_strings(char* dst, __m256i len_v, __m256i byte0_v, __m256i byte1_v, __m256i byte2_v, __m256i byte3_v) {
    char len[32];
    char byte0[32];
    char byte1[32];
    char byte2[32];
    char byte3[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(len), len_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte0), byte0_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte1), byte1_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte2), byte2_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte3), byte3_v);
    int pos = 0;
    for (int i = 0; i < 32; ++i) {
        dst[pos + 0] = byte0[i];
        dst[pos + 1] = byte1[i];
        dst[pos + 2] = byte2[i];
        dst[pos + 3] = byte3[i];
        pos += len[i];
    }
    return pos;
}
Help?
I'm trying to create a randomly generated "planet" (circle), and I want the areas of water, land and foliage to be decided by Perlin noise, or something similar. Currently I have this (pseudo)code:
for (int radius = 0; radius < circleRadius; radius++) {
    for (float theta = 0; theta < TWO_PI; theta += 0.1) {
        float x = radius * cosine(theta);
        float y = radius * sine(theta);
        int colour = whateverFunctionIMake(x, y);
        setPixel(x, y, colour);
    }
}
Not only does this not work (there are "gaps" in the circle because of precision issues), it's also incredibly slow. Even if I increase the resolution by changing the increment to 0.01, it still has missing pixels and gets even slower (with an increment of 0.01 I get 10 fps on my mediocre computer, using Java; I know it's not the best, but this is certainly not acceptable for a game).
How might I achieve a similar result whilst being much less computationally expensive?
Thanks in advance.
Why not use:
(x-x0)^2 + (y-y0)^2 <= r^2
so simply:
int x0=?, y0=?, r=?; // your planet position and size
int x, y, xx, rr, col;
for (rr=r*r, x=-r; x<=r; x++)
    for (xx=x*x, y=-r; y<=r; y++)
        if (xx + (y*y) <= rr)
        {
            col = whateverFunctionIMake(x, y);
            setPixel(x0+x, y0+y, col);
        }
All on integers, no floating point or other slow operations, and no gaps... Do not forget to use a random seed for the coloring function...
[Edit1] some more stuff
Now if you want speed, you need direct pixel access (on most platforms Pixels, SetPixel, PutPixels etc. are slow, because they do a lot of extra work like range checking, color conversions etc.). Once you have direct pixel access, or render into your own array/image, you also need to clip against the screen (so you do not have to check on every pixel whether it is inside the screen) to avoid access violations when your circle overlaps the screen edge.
As mentioned in the comments, you can get rid of the x*x and y*y inside the loop by using the previous value (as both x and y are only incrementing). For more info about it see:
32bit SQRT in 16T without multiplication
The math is like this:
(x+1)^2 = (x+1)*(x+1) = x^2 + 2x + 1
So instead of xx = x*x we just do xx += x+x+1 before x is incremented, or xx += x+x-1 if x has already been incremented.
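A tiny sketch of just that update in isolation (the function and its names are mine, only for illustration): it keeps xx equal to x*x across the whole sweep without a single multiplication inside the loop.

// walk x from -r to r while maintaining the invariant xx == x*x
void scan_columns(int r)
{
    int xx = r * r;          // x*x for the starting value x = -r
    for (int x = -r; x <= r; )
    {
        // ... per-column work using x and xx goes here ...
        xx += x + x + 1;     // (x+1)^2 = x^2 + 2x + 1, applied before x++
        x++;
    }
}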
When put all together I got this:
void circle(int x, int y, int r, DWORD c)
{
    // my pixel access
    int **Pixels = Main->pyx;            // Pixels[y][x]
    int xs = Main->xs;                   // resolution
    int ys = Main->ys;
    // circle
    int sx, sy, sx0, sx1, sy0, sy1;      // [screen]
    int cx, cy, cx0, cy0;                // [circle]
    int rr = r*r, cxx, cyy, cxx0, cyy0;  // [circle^2]
    // BBOX + screen clip
    sx0 = x-r; if (sx0 >= xs) return; if (sx0 <  0) sx0 = 0;
    sy0 = y-r; if (sy0 >= ys) return; if (sy0 <  0) sy0 = 0;
    sx1 = x+r; if (sx1 <  0) return; if (sx1 >= xs) sx1 = xs-1;
    sy1 = y+r; if (sy1 <  0) return; if (sy1 >= ys) sy1 = ys-1;
    cx0 = sx0-x; cxx0 = cx0*cx0;
    cy0 = sy0-y; cyy0 = cy0*cy0;
    // render
    for (cxx=cxx0, cx=cx0, sx=sx0; sx <= sx1; sx++, cxx+=cx, cx++, cxx+=cx)
        for (cyy=cyy0, cy=cy0, sy=sy0; sy <= sy1; sy++, cyy+=cy, cy++, cyy+=cy)
            if (cxx+cyy <= rr)
                Pixels[sy][sx] = c;
}
This renders a circle with radius 512 px in ~35 ms, so ~23.5 Mpx/s of fill rate on my setup (AMD A8-5500 3.2 GHz, Win7 64-bit, single-threaded VCL/GDI 32-bit app built with BDS2006 C++). Just change the direct pixel access to the style/API you use...
[Edit2]
To measure speed on x86/x64 you can use the RDTSC instruction. Here is some ancient C++ code I used ages ago (in a 32-bit environment without native 64-bit types):
double _rdtsc()
{
    LARGE_INTEGER x; // unsigned 64bit integer variable from windows.h I think
    DWORD l, h;      // standard unsigned 32 bit variables
    asm {
        rdtsc
        mov l, eax
        mov h, edx
    }
    x.LowPart  = l;
    x.HighPart = h;
    return double(x.QuadPart);
}
It returns the number of clocks your CPU has elapsed since power-up. Beware that you should account for overflows, as on fast machines a 32-bit counter overflows within seconds. Also, each core has a separate counter, so set the thread affinity to a single CPU. On a variable-speed clock, heat up the CPU with some computation before measuring. To convert to time, just divide by the CPU clock frequency; to obtain it, just do this:
t0 = _rdtsc();
Sleep(250);           // 250 ms
t1 = _rdtsc();
fcpu = (t1-t0)*4;     // clocks per second
and measurement:
t0 = _rdtsc();
// measured stuff
t1 = _rdtsc();
time = (t1-t0)/fcpu;
If t1 < t0 you have overflowed and need to add a constant to the result, or measure again. Also, the measured process must take less than the overflow period. To enhance precision, watch out for OS timing granularity. For more info see:
Measuring Cache Latencies
Cache size estimation on your system? setting affinity example
Negative clock cycle measurements with back-to-back rdtsc?
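For what it's worth, on current compilers you can read the same counter without inline asm through the __rdtsc() intrinsic. A rough sketch (the 250 ms calibration mirrors the snippet above; x86intrin.h and usleep assume a GCC/Clang POSIX build, so adjust the headers for your toolchain):

#include <x86intrin.h>   // __rdtsc()
#include <unistd.h>      // usleep()

// returns the elapsed time of measured() in seconds, using a crude
// calibration of ticks-per-second over ~250 ms
double measure_seconds(void (*measured)(void))
{
    unsigned long long c0 = __rdtsc();
    usleep(250 * 1000);                    // ~250 ms
    unsigned long long c1 = __rdtsc();
    double fcpu = (double)(c1 - c0) * 4.0; // ticks per second

    unsigned long long t0 = __rdtsc();
    measured();                            // the code under test
    unsigned long long t1 = __rdtsc();
    return (double)(t1 - t0) / fcpu;
}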
I'm writing a multithreaded application and having a problem on the SPARC platform. Ultimately my question comes down to atomicity of this platform and how I could be obtaining this result.
Some pseudocode to help clarify my question:
// Global variable
typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((packed)) pkd_struct_t;

pkd_struct_t shared;
Thread 1:
swap_value() {
    pkd_struct_t prev = shared;
    printf("%d%d\n", prev.a, prev.b);
    ...
}
Thread 2:
use_value() {
    pkd_struct_t next;
    next.a = 0; next.b = 0;
    shared = next;
    printf("%d%d\n", shared.a, shared.b);
    ...
}
Thread 1 and Thread 2 are accessing the shared variable "shared". One is setting it, the other is getting it. If Thread 2 sets "shared" to zero, I'd expect Thread 1 to read the value either from before OR after the store, since "shared" is aligned on a 4-byte boundary. However, I will occasionally see Thread 1 read a value of the form 0xFFFFFF00. That is, the high-order 24 bits are OLD, but the low-order byte is NEW. It appears I've gotten an intermediate value.
Looking at the disassembly, the use_value function simply does an "ST" instruction. Given that the data is aligned and isn't crossing a word boundary, is there any explanation for this behavior? If ST is indeed NOT atomic to use this way, does this explain the result I see (only 1 byte changed?!?)? There is no problem on x86.
UPDATE 1:
I've found the problem, but not the cause. GCC appears to be generating assembly that reads the shared variable byte-by-byte (thus allowing a partial update to be observed). Comments added, but I am not terribly comfortable with SPARC assembly. %i0 is a pointer to the shared variable.
xxx+0xc: ldub [%i0], %g1 // ld unsigned byte g1 = [i0] -- 0 padded
xxx+0x10: ...
xxx+0x14: ldub [%i0 + 0x1], %g5 // ld unsigned byte g5 = [i0+1] -- 0 padded
xxx+0x18: sllx %g1, 0x18, %g1 // g1 = [i0+0] left shifted by 24
xxx+0x1c: ldub [%i0 + 0x2], %g4 // ld unsigned byte g4 = [i0+2] -- 0 padded
xxx+0x20: sllx %g5, 0x10, %g5 // g5 = [i0+1] left shifted by 16
xxx+0x24: or %g5, %g1, %g5 // g5 = g5 OR g1
xxx+0x28: sllx %g4, 0x8, %g4 // g4 = [i0+2] left shifted by 8
xxx+0x2c: or %g4, %g5, %g4 // g4 = g4 OR g5
xxx+0x30: ldub [%i0 + 0x3], %g1 // ld unsigned byte g1 = [i0+3] -- 0 padded
xxx+0x34: or %g1, %g4, %g1 // g1 = g4 OR g1
xxx+0x38: ...
xxx+0x3c: st %g1, [%fp + 0x7df] // store g1 on the stack
Any idea why GCC is generating code like this?
UPDATE 2: Adding more info to the example code. Apologies -- I'm working with a mix of new and legacy code and it's difficult to separate what's relevant. Also, I understand that sharing a variable like this is highly discouraged in general. However, this is actually in a lock implementation where higher-level code will be using it to provide atomicity, and using pthreads or platform-specific locking is not an option here.
Because you've declared the type as packed, it gets one byte alignment, which means it must be read and written one byte at a time, as SPARC does not allow unaligned loads/stores. You need to give it 4-byte alignment if you want the compiler to use word load/store instructions:
typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((packed, aligned(4))) pkd_struct_t;
Note that packed is essentially meaningless for this struct, so you could leave that out.
Answering my own question here -- this has bugged me for too long and hopefully I can save someone a bit of frustration at some point.
The problem is that although the shared data is aligned, because it is packed GCC reads it byte-by-byte.
There is some discussion here on how packing leads to load/store bloat on SPARC (and other RISC platforms, I'd assume...), but in my case it has led to a race.
I am trying to understand how the data obtained from XGetImage is laid out in memory:
XImage *img = XGetImage(display, root, 0, 0, width, height, AllPlanes, ZPixmap);
Now suppose I want to decompose each pixel value in red, blue, green channels. How can I do this in a portable way? The following is an example, but it depends on a particular configuration of the XServer and does not work in every case:
for (int x = 0; x < width; x++)
    for (int y = 0; y < height; y++) {
        unsigned long pixel = XGetPixel(img, x, y);
        unsigned char blue  = pixel & blue_mask;
        unsigned char green = (pixel & green_mask) >> 8;
        unsigned char red   = (pixel & red_mask) >> 16;
        //...
    }
In the above example I am assuming a particular order of the RGB channels in pixel and also that pixels have a 24-bit depth: in fact, I have img->depth=24 and img->bits_per_pixel=32 (the screen also has a 24-bit depth). But this is not the generic case.
As a second step I want to get rid of XGetPixel and use or describe img->data directly. The first thing I need to know is whether there is anything in Xlib that gives me exactly the information I need to interpret how the image is built, starting from the img->data field, namely:
the order of R,G,B channels in each pixel;
the number of bits for each pixel;
the number of bits for each channel;
if possible, a corresponding FOURCC
The shift is a simple function of the mask:
int get_shift(int mask) {
    int shift = 0;
    while (mask) {
        if (mask & 1) break;
        shift++;
        mask >>= 1;
    }
    return shift;
}
Number of bits in each channel is just the number of 1 bits in its mask (count them). The channel order is determined by the shifts (if the red shift is 0, then the first channel is R, etc.).
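For example, here is a small sketch of putting the two pieces together; count_bits and the usage snippet are my own additions for illustration, get_shift is the function above, and img is assumed to be a valid XImage* obtained from XGetImage.

// number of set bits in a channel mask == number of bits in that channel
int count_bits(unsigned long mask) {
    int n = 0;
    for (; mask; mask >>= 1)
        n += (int)(mask & 1);
    return n;
}

/* usage with an XImage *img:
   int red_shift = get_shift(img->red_mask);
   int red_bits  = count_bits(img->red_mask);
   unsigned long pixel = XGetPixel(img, x, y);
   unsigned int  red   = (pixel & img->red_mask) >> red_shift;  // red_bits wide
*/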
I think the valid values for bits_per_pixel are 1, 2, 4, 8, 15, 16, 24 and 32 (15 and 16 bits are the same 2 bytes per pixel format, but the former has 1 bit unused). I don't think it's worth anyone's time to support anything but 24 and 32 bpp.
X11 is not concerned with media files, so no 4CC code.
This can be read from the XImage structure itself.
the order of R,G,B channels in each pixel;
This is contained in this field of the XImage structure:
int byte_order; /* data byte order, LSBFirst, MSBFirst */
which tells you whether it's RGB or BGR (because it only depends on the endianness of the machine).
the number of bits for each pixel;
can be obtained from this field:
int bits_per_pixel; /* bits per pixel (ZPixmap) */
Note that bits_per_pixel is the storage size of each pixel (it may be larger than the image depth, e.g. 32 bpp for a 24-bit-depth image); the per-channel widths come from the channel masks:
unsigned long red_mask; /* bits in z arrangement */
unsigned long green_mask;
unsigned long blue_mask;
the number of bits for each channel;
See above, or you can use the code from #n.m.'s answer to count the bits yourself.
Yeah, it would be great if they put the bit shift constants in that structure too, but apparently they decided not to, since the pixels are aligned to bytes anyway, in "standard order" (RGB). Xlib makes sure to convert it to that order for you when it retrieves the data from the X server, even if they are stored internally in a different format server-side. So it's always in RGB format, byte-aligned, but depending on the endianness of the machine, the bytes inside an unsigned long can appear in a reverse order, hence the byte_order field to tell you about that.
So in order to extract the channels, just use the 0, 8 and 16 shifts after masking with red_mask, green_mask and blue_mask; just make sure you pick the right shifts depending on byte_order and it should work fine.