Redhawk FEI sample rate tolerance check failure? - redhawksdr

I'm building a Redhawk 2.1.2 FEI device and encountering a tolerance check failure when I try to do an allocation through the IDE (haven't tried python interface or anything). The request is for 8 MHz and I get a return value of 7999999.93575246725231409 Hz which is waaaay within the 20% tolerance, but I still get this error:
2017-12-24 11:27:10 DEBUG FrontendTunerDevice:484 - allocateCapacity - SR requested: 8000000.000000 SR got: 7999999.935752
2017-12-24 11:27:10 INFO FrontendTunerDevice:490 - allocateCapacity(0): returned sr 7999999.935752 does not meet tolerance criteria of 20.000000 percent
The offending code from frontendInterfaces/libsrc/cpp/fe_tuner_device.cpp:
// check tolerances
if( (floatingPointCompare(frontend_tuner_allocation.sample_rate,0)!=0) &&
(floatingPointCompare(frontend_tuner_status[tuner_id].sample_rate,frontend_tuner_allocation.sample_rate)<0 ||
floatingPointCompare(frontend_tuner_status[tuner_id].sample_rate,frontend_tuner_allocation.sample_rate+frontend_tuner_allocation.sample_rate * frontend_tuner_allocation.sample_rate_tolerance/100.0)>0 ))
{
std::ostringstream eout;
eout<<std::fixed<<"allocateCapacity("<<int(tuner_id)<<"): returned sr "<<frontend_tuner_status[tuner_id].sample_rate<<" does not meet tolerance criteria of "<<frontend_tuner_allocation.sample_rate_tolerance<<" percent";
LOG_INFO(FrontendTunerDevice<TunerStatusStructType>, eout.str());
throw std::logic_error(eout.str().c_str());
}
And the function from frontendInterfaces/libsrc/cpp/fe_tuner_device.h:
inline double floatingPointCompare(double lhs, double rhs, size_t places = 1){
return round((lhs-rhs)*pow(10,places));
/*if(round((lhs-rhs)*(pow(10,places))) == 0)
return 0; // equal
if(lhs<rhs)
return -1; // lhs < rhs
return 1; // lhs > rhs*/
}
I actually copied into a non-Redhawk C++ program that I use to test the devices interfaces and the checks pass. I broke everything out to find the difference and noticed that in Redhawk the sample rate being returned from the Device (or at least printed to the screen) is slightly different than the one outside Redhawk - by like a tiny fraction of a Hz:
// in Redhawk using cout::precision(17)
Sample Rate: 7999999.93575246725231409
// outside Redhawk using cout::precision(17)
Sample Rate: 7999999.96948242187500000
I don't know why there's a difference in the actual sample rates returned but in the Redhawk version it's just enough to make the second part of the check fail:
floatingPointCompare(7999999.93575246725231409,8000000.00000000000000000)<0
1
Basically because:
double a = 7999999.93575246725231409 - 8000000.00000000000000000; // = -0.06424753274768591
double b = pow(10,1); // = 10.00000000000000000
double c = a*b; // = -0.6424753274
double d = round(c); // = -1.00000000000000000
So if a returned sample rate is less than the request by more than 0.049999 Hz then it will fail the allocation regardless of the tolerance %? Maybe I'm just missing something here.

The tolerance checks are specified to be the minimum amount plus a delta and not a variance (plus or minus) from the requested amount.

There should be a document somewhere that describes this in detail but I went to the FMRdsSimulator device's source.
// For FEI tolerance, it is not a +/- it's give me this or better.
float minAcceptableSampleRate = request.sample_rate;
float maxAcceptableSampleRate = (1 + request.sample_rate_tolerance/100.0) * request.sample_rate;
So that should explain why the allocation was failing.

Related

Is there a more efficient way of texturing a circle?

I'm trying to create a randomly generated "planet" (circle), and I want the areas of water, land and foliage to be decided by perlin noise, or something similar. Currently I have this (psudo)code:
for (int radius = 0; radius < circleRadius; radius++) {
for (float theta = 0; theta < TWO_PI; theta += 0.1) {
float x = radius * cosine(theta);
float y = radius * sine(theta);
int colour = whateverFunctionIMake(x, y);
setPixel(x, y, colour);
}
}
Not only does this not work (there are "gaps" in the circle because of precision issues), it's incredibly slow. Even if I increase the resolution by changing the increment to 0.01, it still has missing pixels and is even slower (I get 10fps on my mediocre computer using Java (I know not the best) and an increment of 0.01. This is certainly not acceptable for a game).
How might I achieve a similar result whilst being much less computationally expensive?
Thanks in advance.
Why not use:
(x-x0)^2 + (y-y0)^2 <= r^2
so simply:
int x0=?,y0=?,r=?; // your planet position and size
int x,y,xx,rr,col;
for (rr=r*r,x=-r;x<=r;x++)
for (xx=x*x,y=-r;y<=r;y++)
if (xx+(y*y)<=rr)
{
col = whateverFunctionIMake(x, y);
setPixel(x0+x, y0+y, col);
}
all on integers, no floating or slow operations, no gaps ... Do not forget to use randseed for the coloring function ...
[Edit1] some more stuff
Now if you want speed than you need direct pixel access (in most platforms Pixels, SetPixel, PutPixels etc are slooow. because they perform a lot of stuff like range checking, color conversions etc ... ) In case you got direct pixel access or render into your own array/image whatever you need to add clipping with screen (so you do not need to check if pixel is inside screen on each pixel) to avoid access violations if your circle is overlapping screen.
As mentioned in the comments you can get rid of the x*x and y*y inside loop using previous value (as both x,y are only incrementing). For more info about it see:
32bit SQRT in 16T without multiplication
the math is like this:
(x+1)^2 = (x+1)*(x+1) = x^2 + 2x + 1
so instead of xx = x*x we just do xx+=x+x+1 for not incremented yet x or xx+=x+x-1 if x is already incremented.
When put all together I got this:
void circle(int x,int y,int r,DWORD c)
{
// my Pixel access
int **Pixels=Main->pyx; // Pixels[y][x]
int xs=Main->xs; // resolution
int ys=Main->ys;
// circle
int sx,sy,sx0,sx1,sy0,sy1; // [screen]
int cx,cy,cx0, cy0 ; // [circle]
int rr=r*r,cxx,cyy,cxx0,cyy0; // [circle^2]
// BBOX + screen clip
sx0=x-r; if (sx0>=xs) return; if (sx0< 0) sx0=0;
sy0=y-r; if (sy0>=ys) return; if (sy0< 0) sy0=0;
sx1=x+r; if (sx1< 0) return; if (sx1>=xs) sx1=xs-1;
sy1=y+r; if (sy1< 0) return; if (sy1>=ys) sy1=ys-1;
cx0=sx0-x; cxx0=cx0*cx0;
cy0=sy0-y; cyy0=cy0*cy0;
// render
for (cxx=cxx0,cx=cx0,sx=sx0;sx<=sx1;sx++,cxx+=cx,cx++,cxx+=cx)
for (cyy=cyy0,cy=cy0,sy=sy0;sy<=sy1;sy++,cyy+=cy,cy++,cyy+=cy)
if (cxx+cyy<=rr)
Pixels[sy][sx]=c;
}
This renders a circle with radius 512 px in ~35ms so 23.5 Mpx/s filling on mine setup (AMD A8-5500 3.2GHz Win7 64bit single thread VCL/GDI 32bit app coded by BDS2006 C++). Just change the direct pixel access to style/api you use ...
[Edit2]
to measure speed on x86/x64 you can use RDTSC asm instruction here some ancient C++ code I used ages ago (on 32bit environment without native 64bit stuff):
double _rdtsc()
{
LARGE_INTEGER x; // unsigned 64bit integer variable from windows.h I think
DWORD l,h; // standard unsigned 32 bit variables
asm {
rdtsc
mov l,eax
mov h,edx
}
x.LowPart=l;
x.HighPart=h;
return double(x.QuadPart);
}
It returns clocks your CPU has elapsed since power up. Beware you should account for overflows as on fast machines the 32bit counter is overflowing in seconds. Also each core has separate counter so set affinity to single CPU. On variable speed clock before measurement heat upi CPU by some computation and to convert to time just divide by CPU clock frequency. To obtain it just do this:
t0=_rdtsc()
sleep(250);
t1=_rdtsc();
fcpu = (t1-t0)*4;
and measurement:
t0=_rdtsc()
mesured stuff
t1=_rdtsc();
time = (t1-t0)/fcpu
if t1<t0 you overflowed and you need to add the a constant to result or measure again. Also the measured process must take less than overflow period. To enhance precision ignore OS granularity. for more info see:
Measuring Cache Latencies
Cache size estimation on your system? setting affinity example
Negative clock cycle measurements with back-to-back rdtsc?

gstreamer read decibel from buffer

I am trying to get the dB level of incoming audio samples. On every video frame, I update the dB level and draw a bar representing a 0 - 100% value (0% being something arbitrary such as -20.0dB and 100% being 0dB.)
gdouble sum, rms;
sum = 0.0;
guint16 *data_16 = (guint16 *)amap.data;
for (gint i = 0; i < amap.size; i = i + 2)
{
gdouble sample = ((guint16)data_16[i]) / 32768.0;
sum += (sample * sample);
}
rms = sqrt(sum / (amap.size / 2));
dB = 10 * log10(rms);
This was adapted to C from a code sample, marked as the answer, from here. I am wondering what it is that I am missing from this very simple equation.
Answered: jacket was correct about the code loosing the sign, so everything ended up being positive. Also the code 10 * log(rms) is incorrect. It should be 20 * log(rms) as I am converting amplitude to decibels (as a measure of outputted power).
The level element is best for this task (as #ensonic already mentioned) its intended for exactly what you need..
So basically you add to your pipe element called "level", then enable the messages triggering.
Level element then emits messages which contains values of RMS Peak and Decay. RMS is what you need.
You can setup callback function connected to such message event:
audio_level = gst_element_factory_make ("level", "audiolevel");
g_object_set(audio_level, "message", TRUE, NULL);
...
g_signal_connect (bus, "message::element", G_CALLBACK (callback_function), this);
bus variable is of type GstBus.. I hope you know how to work with buses
Then in callback function check for the element name and get the RMS like is described here
There is also normalization algorithm with pow() function to convert to value between 0.0 -> 1.0 which you can use to convert to % as you stated in your question.

Controlling TI OMAP l138 frequency leads to "Division by zero in kernel"

My team is trying to control the frequency of an Texas Instruments OMAP l138. The default frequency is 300 MHz and we want to put it to 372 MHz in a "complete" form: we would like not only to change the default value to the desired one (or at least configure it at startup), but also be capable of changing the value at run time.
Searching on the web about how to do this, we found an article which tells that one of the ways to do this is by an "echo" command:
echo 372000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
We did some tests with this command and it runs fine with one problem: sometimes the first call to this echo command leads to a error message of "Division by zero in kernel":
In my personal tests, this error appeared always in the first call to the echo command. All the later calls worked without error. If, then, I reset my processor and calls the command again, the same problem occurs: the first call leads to this error and later calls work without problem.
So my questions are: what is causing this problem? And how could I solve it? (Obviously the answer "always type it twice" doesn't count!)
(Feel free to mention other ways of controlling the OMAP l138's frequency at real time as well!)
Looks to me like you have division by zero in davinci_spi_cpufreq_transition() function. Somewhere in this function (or in some function that's being called in davinci_spi_cpufreq_transition) there is a buggy division operation which tries to divide by some variable which is (in your case) has value of 0. And this is obviously error case which should be handled properly in code, but in fact it isn't.
It's hard to tell which code exactly leads to this, because I don't know which kernel you are using. It would be much more easier if you can provide link to your kernel repository. Although I couldn't find davinci_spi_cpufreq_transition in upstream kernel, I found it here.
davinci_spi_cpufreq_transition() function appears to be in drivers/spi/davinci_spi.c. It calls davinci_spi_calc_clk_div() function. There are 2 division operations there. First is:
prescale = ((clk_rate / hz) - 1);
And second is:
if (hz < (clk_rate / (prescale + 1)))
One of them is probably causing "division by zero" error. I propose you to trace which one is that by modifying davinci_spi_calc_clk_div() function in next way (just add lines marked as "+"):
static void davinci_spi_calc_clk_div(struct davinci_spi *davinci_spi)
{
struct davinci_spi_platform_data *pdata;
unsigned long clk_rate;
u32 hz, cs_num, prescale;
pdata = davinci_spi->pdata;
cs_num = davinci_spi->cs_num;
hz = davinci_spi->speed;
clk_rate = clk_get_rate(davinci_spi->clk);
+ printk(KERN_ERR "### hz = %u\n", hz);
prescale = ((clk_rate / hz) - 1);
if (prescale > 0xff)
prescale = 0xff;
+ printk("### prescale + 1 = %u\n", prescale + 1UL);
if (hz < (clk_rate / (prescale + 1)))
prescale++;
if (prescale < 2) {
pr_info("davinci SPI controller min. prescale value is 2\n");
prescale = 2;
}
clear_fmt_bits(davinci_spi->base, 0x0000ff00, cs_num);
set_fmt_bits(davinci_spi->base, prescale << 8, cs_num);
}
My guess -- it's "hz" variable which is 0 in your case. If it's so, you also may want to add next debug line to davinci_spi_setup_transfer() function:
if (!hz)
hz = spi->max_speed_hz;
+ printk(KERN_ERR "### setup_transfer: setting speed to %u\n", hz);
davinci_spi->speed = hz;
davinci_spi->cs_num = spi->chip_select;
With all those modifications made, rebuild your kernel and you will probably get the clue why you have that "div by zero" error. Just look for lines started with "###" in your kernel boot log. In case you don't know what to do next -- attach those debug lines and I will try to help you.

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"

What is openCL equivalent for this cuda "cudaMallocPitch "code.?

My PC has an AMD processor with an ATI 3200 GPU which doesn't support OpenCL. The rest of the codes all running by "Falling back to CPU itself".
I am converting one of the code from CUDA to OpenCL but stuck in some particular part for which there is no exact conversion code in OpenCL. since i have less experience in OpenCL I can't make out this, please suggest me some solution if any of you think will work,
The CUDA code is,
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
instances->cols * sizeof(float), instances->rows);
for( int i = 0; i < instances->rows; i++ ){
error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i),
(void*)(instances->data + (instances->cols*i)),
instances->cols * sizeof(float) ,cudaMemcpyHostToDevice);
If I remove the pitch value from the above I end up with an problem which doesn't write to the device memory "gpu_data".
Somebody please convert this code to OpenCL and reply. I have converted it to OpenCL, but its not working and the data is not written to "gpu_data". My converted OpenCL code is
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE, ((instances->cols)*(instances->rows))*sizeof(float), NULL, &ret);
for( int i = 0; i < instances->rows; i++ ){
ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0, ((instances->cols)*(instances->rows))*sizeof(float),(void*)(instances->data + (instances->cols*i)) , 0, NULL, NULL);
Sometimes it runs well for this code and gets stuck in the reading part i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0,sizeof( float ) * instances->cols* 1 , instances->data, 0, NULL, NULL);
overhere. And it gives error like
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
when break is pressed , it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. In the call stack it is displaying:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really dont know what it means. someone please help me to rid of this problem.
First of all, in the CUDA code you're doing a horribly inefficient thing to copy the data. The CUDA runtime has the function cudaMemcpy2D that does exactly what you are trying to do by looping over different rows.
What cudaMallocPitch does is to compute an optimal pitch (= distance in byte between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocates a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (CL_DEVICE_MEM_BASE_ADDR_ALIGN property with clGetDeviceInfo: note that the returned value is in bits, so you have to divide by 8 to get it in bytes); let's call this base (2) find the largest multiple of base that is no less than your natural data pitch (sizeof(type) times number of columns); this will be your pitch.
You then allocate pitch times number of rows bytes, and pass the pitch information to kernels.
Also, when copying data from the host to the device and converesely, you want to use clEnqueue{Read,Write}BufferRect, that are specifically designed to copy 2D data (they are the counterparts to cudaMemcpy2D).

Resources