What is the OpenCL equivalent for this CUDA "cudaMallocPitch" code?

My PC has an AMD processor with an ATI 3200 GPU, which doesn't support OpenCL; the rest of the code runs by "falling back to the CPU itself".
I am converting one of my codes from CUDA to OpenCL but am stuck on one particular part for which there is no exact equivalent in OpenCL. Since I have little experience with OpenCL I can't work this out, so please suggest a solution if you think one will work.
The CUDA code is,
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
                                  instances->cols * sizeof(float), instances->rows);
for( int i = 0; i < instances->rows; i++ ){
    error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i),
                       (void*)(instances->data + (instances->cols*i)),
                       instances->cols * sizeof(float), cudaMemcpyHostToDevice);
}
If I remove the pitch value from the above, I end up with a problem where the data is not written to the device memory "gpu_data".
Could somebody please convert this code to OpenCL? I have converted it myself, but it's not working and the data is not written to "gpu_data". My converted OpenCL code is
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          ((instances->cols)*(instances->rows))*sizeof(float), NULL, &ret);
for( int i = 0; i < instances->rows; i++ ){
    ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0,
                               ((instances->cols)*(instances->rows))*sizeof(float),
                               (void*)(instances->data + (instances->cols*i)), 0, NULL, NULL);
}
Sometimes it runs fine up to this code but gets stuck in the reading part, i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0, sizeof(float) * instances->cols * 1, instances->data, 0, NULL, NULL);
over here. And it gives an error like
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
When Break is pressed, it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. In the call stack it is displaying:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really don't know what this means. Could someone please help me get rid of this problem?

First of all, in the CUDA code you're copying the data in a horribly inefficient way. The CUDA runtime has the function cudaMemcpy2D that does exactly what you are trying to do by looping over the rows.
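For reference, a sketch of the single-call replacement, reusing gpu_data, pitch, and instances from the question:
error = cudaMemcpy2D(gpu_data, pitch,                  /* dst and its pitch */
                     instances->data,                  /* src */
                     instances->cols * sizeof(float),  /* src pitch */
                     instances->cols * sizeof(float),  /* width in bytes */
                     instances->rows,                  /* height in rows */
                     cudaMemcpyHostToDevice);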
What cudaMallocPitch does is compute an optimal pitch (= the distance in bytes between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocate a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch yourself and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (the CL_DEVICE_MEM_BASE_ADDR_ALIGN property with clGetDeviceInfo: note that the returned value is in bits, so you have to divide by 8 to get it in bytes); let's call this base; (2) finding the smallest multiple of base that is no less than your natural data pitch (sizeof(type) times the number of columns); this will be your pitch.
You then allocate pitch times number of rows bytes, and pass the pitch information to kernels.
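A sketch of those two steps (assuming a device and context are already set up; the other names are from the question):
cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                sizeof(align_bits), &align_bits, NULL);
size_t base = align_bits / 8;                              /* bits -> bytes */
size_t natural_pitch = instances->cols * sizeof(float);
size_t pitch = ((natural_pitch + base - 1) / base) * base; /* smallest multiple of base >= natural pitch */
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          pitch * instances->rows, NULL, &ret);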
Also, when copying data from the host to the device and conversely, you want to use clEnqueue{Read,Write}BufferRect, which are specifically designed to copy 2D data (they are the counterparts to cudaMemcpy2D).
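With the pitched buffer from above, the per-row loop collapses to one call (a sketch; clEnqueueWriteBufferRect requires OpenCL 1.1+, and the first region component is in bytes):
size_t buffer_origin[3] = {0, 0, 0};
size_t host_origin[3]   = {0, 0, 0};
size_t region[3] = {instances->cols * sizeof(float),  /* row width in bytes */
                    (size_t)instances->rows,          /* number of rows */
                    1};                               /* one 2D slice */
ret = clEnqueueWriteBufferRect(command_queue, gpu_data, CL_TRUE,
                               buffer_origin, host_origin, region,
                               pitch, 0,                           /* device row/slice pitch */
                               instances->cols * sizeof(float), 0, /* host row/slice pitch */
                               instances->data, 0, NULL, NULL);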

Why does NtQueryObject return a wrong (insufficient) required size via WOW64?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously, on WOW64 (i.e. 32-bit) the value in Size2 upon return from the second query is 16 bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 bytes for this sort of thing, and indeed neither 8280 nor 8296 is on a 16-byte alignment boundary, but on an 8-byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32-byte alignment boundary), but this seems highly irregular, to be honest. And I'd rather understand what's going on than implement a workaround that breaks because I missed something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual with the NT native API, sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required, so my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points of suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
    UNICODE_STRING TypeName;
    ULONG Reserved [22];    // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK, public use of NtQueryObject is supposed to be limited to kernel mode, which of course means it always matches the OS's native bitness (x86 code can't run in the kernel on a native x64 OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary, allocate the resulting amount from some PEB::ProcessHeap and store it in TLS slot 3; otherwise reuse the buffer already stored there as scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size the 64-bit required length to 32 bit, ending up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length from the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32))
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep; the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to take my up_size_from_32bit() from the example below and apply it to the returned required size. This way you are allocating enough for the 64-bit buffer while querying the 32-bit one, which should never overrun during the copy loop.
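Applied to the query from the question, that workaround could look like this (a sketch; up_size_from_32bit() is defined in the example program further down):
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
// Expect STATUS_INFO_LENGTH_MISMATCH; Size now holds the (understated) required length.
ULONG Padded = up_size_from_32bit(Size);  // room for the 64-bit layout, so the copy loop cannot overrun
BYTE* Buf = (BYTE*)::calloc(1, Padded);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Padded, &Size);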
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
    ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;

typedef struct
{
    ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;

constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);

constexpr ULONG down_size_to_32bit(ULONG len)
{
    return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}

constexpr ULONG up_size_from_32bit(ULONG len)
{
    return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}

// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
    return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
    wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
    wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
    return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 32 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

How does BPF calculate the number of CPUs for a PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out the proper way for a program that reads BPF maps to determine the number of per-CPU elements in a PERCPU_ARRAY on a given system.
Failing that, I think the second best approach is to pick a read buffer that is "large enough." Here the problem is similar: what is that number, and is there a way to learn it at runtime?
The answer comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
    int cpus = libbpf_num_possible_cpus();

    if (cpus < 0) {
        p_err("Can't get # of possible cpus: %s", strerror(-cpus));
        exit(-1);
    }
    return cpus;
}

int libbpf_num_possible_cpus(void)
{
    static const char *fcpu = "/sys/devices/system/cpu/possible";
    static int cpus;
    int err, n, i, tmp_cpus;
    bool *mask;
    /* ---8<--- snip */
}
So that's how they do it!
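For illustration, a stripped-down version of the same idea (a hypothetical helper, not the libbpf code; it only handles the common "0-N" form of the mask file, whereas libbpf parses the full comma-separated range list):
#include <stdio.h>

static int num_possible_cpus(void)
{
    FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
    int start = 0, end = 0;

    if (!f)
        return -1;
    /* "0-127" parses both fields; a lone "0" parses only the first */
    if (fscanf(f, "%d-%d", &start, &end) < 1) {
        fclose(f);
        return -1;
    }
    if (end < start)
        end = start;    /* single-CPU form, e.g. "0" */
    fclose(f);
    return end - start + 1;
}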

Computing on variable length arrays in OpenCL

I am using OpenCL (Xcode, Intel GPU), and I am trying to implement a kernel that calculates moving averages and deviations. I want to pass several double arrays of varying lengths to the kernel. Is this possible to implement, or do I need to pad smaller arrays with zeroes so all the arrays are the same size?
I am new to OpenCL and GPGPU, so please forgive my ignorance of any nomenclature.
You can pass any buffer to the kernel; the kernel does not need to use all of it.
For example, if your kernel reduces a buffer, it can query the number of work items (elements to reduce) at run time using get_global_size(0). You then enqueue the kernel with the proper global size.
An example (unoptimized):
__kernel void reduce_step(__global float* data)
{
    int id = get_global_id(0);
    int size = get_global_size(0);
    int size2 = size/2;
    int size2p = (size+1)/2;
    if(id < size2) // Only reduce up to size2; the odd element stays in place
        data[id] += data[id+size2p];
}
Then you can call it like this.
void reduce_me(std::vector<cl_float>& data){
    size_t size = data.size();

    //Copy to a buffer already created, equal or bigger size than data size
    // ... TODO, check sizes of buffer or change the buffer set to the kernel args.
    queue.enqueueWriteBuffer(buffer,CL_FALSE,0,sizeof(cl_float)*size,data.data());

    //Reduce until 1024
    while(size > 1024){
        queue.enqueueNDRangeKernel(reduce_kernel,cl::NullRange,cl::NDRange(size),cl::NullRange);
        size = (size+1)/2; // matches the kernel's (size+1)/2 handling of odd sizes
    }

    //Read out and trim
    queue.enqueueReadBuffer(buffer,CL_TRUE,0,sizeof(cl_float)*size,data.data());
    data.resize(size);
}

ALSA lib mmap direct write

I am just messing around with the ALSA library and can't really figure out how to do playback with a direct write.
I am using SND_PCM_ACCESS_MMAP_INTERLEAVED.
I am trying to write a square wave.
I created a buffer of shorts to hold the square wave and have tested it with snd_pcm_writei(), where it works.
I then call snd_pcm_mmap_begin() and use the pointers given in areas to write to the device:
while(1)
{
    int msg;
    frames_available = snd_pcm_avail_update(handle);
    snd_pcm_mmap_begin(handle, &areas, &offset, &limit_frames);
    frames_to_write = frames; // frames is the size of the buffer in frames
    if (frames_to_write > limit_frames)
        frames_to_write = 0;
    int offset_frames = (areas[0].first + offset*areas[0].step)/16;
    short* write_ptr = (short*)areas[0].addr + offset_frames;
    // fill the buffer with stuff
    for(int i = 0; i < frames_to_write; i++)
    {
        write_ptr[i] = buffer[i];
    }
    msg = snd_pcm_mmap_commit(handle, offset, frames_to_write);
}
The sound produced is choppy and gets cut off soon after. It gets cut off because limit_frames reaches 0. I notice that limit_frames stays at 0 even if frames_available is nonzero.
EDIT:
I used memcpy() instead of a for loop and that solved the choppiness. It still gets cut off, though. Now I'm curious why memcpy() solves the choppiness. Shouldn't the for loop and memcpy() both copy the memory contiguously?
Using mmap does not make sense if all you're doing is copying samples from another buffer; that's exactly what snd_pcm_writei() does.
Anyway, before calling snd_pcm_mmap_begin(), you must set its last parameter to the number of frames you intend to write, and when it returns a smaller number, you should write that smaller number instead of 0.
When you have more than one channel, a frame is larger than one sample.
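Putting that together, a minimal sketch of one corrected iteration, reusing handle, areas, offset, and buffer from the question (mono 16-bit as above; error handling omitted):
snd_pcm_uframes_t limit_frames = frames;   // in/out parameter: request the whole buffer
snd_pcm_avail_update(handle);              // refresh the available-frames count
snd_pcm_mmap_begin(handle, &areas, &offset, &limit_frames);
// limit_frames now holds how many frames may actually be written
short* write_ptr = (short*)areas[0].addr
                 + (areas[0].first + offset*areas[0].step)/16;
memcpy(write_ptr, buffer, limit_frames * sizeof(short));
snd_pcm_mmap_commit(handle, offset, limit_frames);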

Why is FFTW faster on Windows than on Linux?

I wrote two identical programs on Linux and Windows using the FFTW libraries (fftw3.a, fftw3.lib) and computed the duration of the fftwf_execute(m_wfpFFTplan) statement (a 16-point FFT).
For 10000 runs:
On Linux: the average time is 0.9
On Windows: the average time is 0.12
I am confused as to why this is roughly seven times faster on Windows than on Linux.
Processor: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
Each OS (Windows XP 32-bit and Linux openSUSE 11.4 32-bit) is installed on the same machine.
I downloaded fftw.lib (for Windows) from the internet and don't know its configuration. On Linux I once built FFTW with this config:
./configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx
and it results in a lib that is four times faster than the default config (0.4 ms).
A 16-point FFT is very small. What you will find is that FFTs smaller than, say, 64 points are hard-coded assembler with no loops, to get the highest possible performance. This means they can be highly susceptible to variations in instruction sets, compiler optimisations, and even 64- or 32-bit word sizes.
What happens when you run a test of FFT sizes from 16 to 1048576 in powers of 2? I say this because a particular hard-coded asm routine on Linux might not be the best optimised for your machine, whereas you might have been lucky with the Windows implementation for that particular size. A comparison across all sizes in this range will give you a better indication of the Linux vs. Windows performance.
Have you calibrated FFTW? When first run, FFTW guesses the fastest implementation per machine; however, if you have special instruction sets, a particular cache size, or other processor features, these can have a dramatic effect on execution speed. As a result, performing a calibration will test the speed of various FFT routines and choose the fastest per size for your specific hardware. Calibration involves repeatedly creating plans and saving the FFTW "wisdom" file that is generated. The saved calibration data (generating it is a lengthy process) can then be re-used. I suggest doing it once when your software starts up and re-using the file each time. I have noticed 4-10x performance improvements for certain sizes after calibrating!
Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note this code is pasted verbatim from a DSP library I worked on so some function calls are specific to my library. I hope the FFTW specific calls are helpful.
// Calibration FFTW
void DSP::forceCalibration(void)
{
    // Try to import FFTw Wisdom for fast plan creation
    FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");

    // If wisdom does not exist, ask user to calibrate
    if (fftw_wisdom == 0)
    {
        int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine."\
            "Would you like to perform a one-time calibration?\n\n"\
            "Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
            "\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
            "\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!",
            MB_YESNO | MB_ICONSTOP, 0);
        if (iStatus2 == IDYES)
        {
            // Perform calibration for all powers of 2 from 8 to 4194304
            // (most heavily used FFTs - for signal processing)
            AfxMessageBox("About to perform calibration.\n"\
                "Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
                "Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
                "\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
            startTimer();

            // Create a whole load of FFTw Plans (wisdom accumulates automatically)
            for (int i = 8; i <= 4194304; i *= 2)
            {
                // Create new buffers and fill
                DSP::cFFTin = new fftw_complex[i];
                DSP::cFFTout = new fftw_complex[i];
                DSP::fconv_FULL_Real_FFT_rdat = new double[i];
                DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
                for(int j = 0; j < i; j++)
                {
                    DSP::fconv_FULL_Real_FFT_rdat[j] = j;
                    DSP::cFFTin[j][0] = j;
                    DSP::cFFTin[j][1] = j;
                    DSP::cFFTout[j][0] = 0.0;
                    DSP::cFFTout[j][1] = 0.0;
                }

                // Create a plan for complex FFT.
                // Use the measure flag to get the best possible FFT for this size.
                // FFTw "remembers" which FFTs were the fastest during this test;
                // at the end of the test, the results are saved to disk and re-used
                // upon every initialisation of the DSP Library.
                DSP::pCF = fftw_plan_dft_1d
                    (i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);
                // Destroy the plan
                fftw_destroy_plan(DSP::pCF);

                // Create a plan for real forward FFT
                DSP::pCF = fftw_plan_dft_r2c_1d
                    (i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);
                // Destroy the plan
                fftw_destroy_plan(DSP::pCF);

                // Create a plan for real inverse FFT
                DSP::pCF = fftw_plan_dft_c2r_1d
                    (i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);
                // Destroy the plan
                fftw_destroy_plan(DSP::pCF);

                // Destroy the buffers. Repeat for each size.
                delete [] DSP::cFFTin;
                delete [] DSP::cFFTout;
                delete [] DSP::fconv_FULL_Real_FFT_rdat;
                delete [] DSP::fconv_FULL_Real_FFT_cdat;
            }
            double time = stopTimer();
            // Must be large enough for the full message below (100 bytes is not)
            char * strOutput = (char*) malloc (512);
            sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
                "Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
                "to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
            AfxMessageBox(strOutput);
            isCalibrated = 1;

            // Save accumulated wisdom
            char * strWisdom = fftw_export_wisdom_to_string();
            FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
            fprintf(fftw_wisdomsave, "%s", strWisdom);
            fclose(fftw_wisdomsave);
            DSP::pCF = NULL;
            DSP::cFFTin = NULL;
            DSP::cFFTout = NULL;
            fconv_FULL_Real_FFT_cdat = NULL;
            fconv_FULL_Real_FFT_rdat = NULL;
            free(strOutput);
        }
    }
    else
    {
        // obtain file size
        fseek (fftw_wisdom, 0, SEEK_END);
        long lSize = ftell (fftw_wisdom);
        rewind (fftw_wisdom);

        // allocate memory to contain the whole file
        char * strWisdom = (char*) malloc (lSize);

        // copy the file into the buffer
        fread (strWisdom, 1, lSize, fftw_wisdom);

        // import the buffer to fftw wisdom
        fftw_import_wisdom_from_string(strWisdom);
        fclose(fftw_wisdom);
        free(strWisdom);
        isCalibrated = 1;
        return;
    }
}
The secret sauce is to create the plan using the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest for your particular type of FFT (real, complex, 1D, 2D) and size:
DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout,
FFTW_FORWARD, FFTW_MEASURE);
Finally, all benchmark tests should be performed with a single FFT plan stage outside of execute, called from code that is compiled in release mode with optimizations on and detached from the debugger. Benchmarks should be performed in a loop with many thousands (or even millions) of iterations, taking the average run time as the result. As you probably know, the planning stage takes a significant amount of time and execute is designed to be performed multiple times with a single plan.
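As a sketch of that pattern (a standalone micro-benchmark; the size and run count are arbitrary, and fftwf_* is FFTW's single-precision API):
#include <fftw3.h>
#include <chrono>
#include <cstdio>

int main()
{
    const int N = 16, runs = 100000;
    fftwf_complex *in  = fftwf_alloc_complex(N);
    fftwf_complex *out = fftwf_alloc_complex(N);

    // Plan once (expensive); FFTW_MEASURE clobbers the arrays, so fill afterwards.
    fftwf_plan plan = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);
    for (int j = 0; j < N; ++j) { in[j][0] = (float)j; in[j][1] = 0.0f; }

    // Execute many times and average.
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < runs; ++k)
        fftwf_execute(plan);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("average: %f us\n",
                std::chrono::duration<double, std::micro>(t1 - t0).count() / runs);

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}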
