why does a a nodejs array shift/push loop run 1000x slower above array length 87369? - node.js

Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in .05 seconds; the latter, in 80 seconds -- 1600 times slower. (observed on 32-bit debian linux with node v0.10.29)
q = [];
// preload the queue with some data
for (i=0; i<87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i=0; i<100000; i++) {
q.shift();
q.push({});
if (i%10000 === 0) process.stdout.write(".");
}
64-bit debian linux v0.10.29 crawls starting at 85984 and runs in .06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.

Shift is a very slow operation for arrays as you need to move all the elements but V8 is able to use a trick to perform it fast when the array contents fit in a page (1mb).
Empty arrays start with 4 slots and as you keep pushing, it will resize the array using formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
j = (j + 1) + Math.floor(j / 2) + 16
console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array size ends up actually being 124517 items which makes it too large.
You can actually preallocate your array just to the right size and it should be able to fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need larger than that, use the right data structure

I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElemens (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second because each one takes 1ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
of which the memmove is 0.000040 seconds, the bulk is the heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
if (!InNewSpace(address)) {
for (int i = 0; i < len; i++) {
store_buffer_.Mark(address + start + i * kPointerSize);
}
}
}
which is (store-buffer-inl.h)
void StoreBuffer::Mark(Address addr) {
ASSERT(!heap_->cell_space()->Contains(addr));
ASSERT(!heap_->code_space()->Contains(addr));
Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
*top++ = addr;
heap_->public_set_store_buffer_top(top);
if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
ASSERT(top == limit_);
Compact();
} else {
ASSERT(top < limit_);
}
}
when the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements isn't called until a handful of times at the end, and just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it's not falling in place for me yet.
Edit: forget that last edit about output buffering artifacts, I was filtering duplicates.

this bug had been reported to google, who closed it without studying the issue.
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
the response:
he GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?

Related

How to improve the memory usage in nodejs code?

I tried one code at hackerearth : https://www.hackerearth.com/practice/data-structures/stacks/basics-of-stacks/practice-problems/algorithm/fight-for-laddus/description/
The speed seems fine however the memory usage exceed the 256mb limit by nearly 2.8 times.
In java and python the memory is 5 times less however the time is nearly twice.
What factor can be used to optimise the memory usage in nodejs code implementation?
Here is nodejs implementation:
// Sample code to perform I/O:
process.stdin.resume();
process.stdin.setEncoding("utf-8");
var stdin_input = "";
process.stdin.on("data", function (input) {
stdin_input += input; // Reading input from STDIN
});
process.stdin.on("end", function () {
main(stdin_input);
});
function main(input) {
let arr = input.split("\n");
let testCases = parseInt(arr[0], 10);
arr.splice(0,1);
finalStr = "";
while(testCases > 0){
let inputArray = (arr[arr.length - testCases*2 + 1]).split(" ");
let inputArrayLength = inputArray.length;
testCases = testCases - 1;
frequencyObject = { };
for(let i = 0; i < inputArrayLength; ++i) {
if(!frequencyObject[inputArray[i]])
{
frequencyObject[inputArray[i]] = 0;
}
++frequencyObject[inputArray[i]];
}
let finalArray = [];
finalArray[inputArrayLength-1] = -1;
let stack = [];
stack.push(inputArrayLength-1);
for(let i = inputArrayLength-2; i>=0; i--)
{
let stackLength = stack.length;
while(stackLength > 0 && frequencyObject[inputArray[stack[stackLength-1]]] <= frequencyObject[inputArray[i]])
{
stack.pop();
stackLength--;
}
if (stackLength > 0) {
finalArray[i] = inputArray[stack[stackLength-1]];
} else {
finalArray[i] = -1;
}
stack.push(i);
}
console.log(finalArray.join(" ") + "\n")
}
}
What factor can be used to optimize the memory usage in nodejs code implementation?
Here are some things to consider:
Don't buffer any more input data than you need to before you process it or output it.
Try to avoid making copies of data. Use the data in place if possible. Remember that all string operations create a new string that is likely a copy of the original data. And, many array operations like .map(), .filter(), etc... create new copies of the original array.
Keep in mind that garbage collection is delayed and is typically done during idle time. So, for example, modifying strings in a loop may create a lot of temporary objects that all must exist at once, even though most or all of them will be garbage collected when the loop is done. This creates a poor peak memory usage.
Buffering
The first thing I notice is that you read the entire input file into memory before you process any of it. Right away for large input files, you're going to use a lot of memory. Instead, what you want to do is read enough of a chunk to get the next testCase and then process it.
FYI, this incremental reading/processing will make the code significantly more complicated to write (I've written an implementation myself) because you have to handle partially read lines, but it will hold down memory use a bunch and that's what you asked for.
Copies of Data
After reading the entire input file into memory, you immediately make a copy of it all with this:
let arr = input.split("\n");
So, now you've more than doubled the amount of memory the input data is taking up. Instead of just one string for all the input, you now still have all of that in memory, but you've now broken it up into hundreds of other strings (each with a little overhead of its own for a new string and of course a copy of each line).
Modifying Strings in a Loop
When you're creating your final result which you call finalStr, you're doing this over and over again:
finalStr = finalStr + finalArray.join(" ") + "\n"
This is going to create tons and tons of incremental strings that will likely end up all in memory at once because garbage collection probably won't run until the loop is over. As an example, if you had 100 lines of output that were each 100 characters long so the total output (not count line termination characters) was 100 x 100 = 10,000 characters, then constructing this in a loop like you are would create temporary strings of 100, 200, 300, 400, ... 10,000 which would consume 5000 (avg length) * 100 (number of temporary strings) = 500,000 characters. That's 50x the total output size consumed in temporary string objects.
So, not only does this create tons of incremental strings each one larger than the previous one (since you're adding onto it), it also creates your entire output in memory before writing any of it out to stdout.
Instead, you can incremental output each line to stdout as you construct each line. This will put the worst case memory usage at probably about 2x the output size whereas you're at 50x or worse.

Set lookup in Nodejs does not seem to have O(1)

I wrote a test to test the lookup speed of Set in Nodejs (v8.4).
const size = 5000000;
const lookups = 1000000;
const set = new Set();
for (let i = 0; i < size; i++) {
set.add(i);
}
const samples = [];
for (let i = 0; i < lookups; i++) {
samples.push(Math.floor(Math.random() * size));
}
const start = Date.now();
for (const key of samples) {
set.has(key);
}
console.log(`size: ${size}, time: ${Date.now() - start}`);
After running it with size = 5000, 50000, 500000, and 5000000, the result is surprising to me:
size: 5000, time: 29
size: 50000, time: 41
size: 500000, time: 81
size: 5000000, time: 130
I expected the time it takes is relatively constant. But it increases substantially as the number of items in the Set increases. Isn't the lookup supposed to be O(1)? What am I missing here?
Update 1:
After reading some comments and answers, I understand the point everyone is trying to make here. Maybe my question should be 'What is causing the increase in time?'. In hash map implementation, with the same number of lookups, the reason for increase in lookup time can only be there are more key collisions.
Update 2:
After more research, here is what I found:
V8 uses ordered hash table for both Set and Map implementation
According to this link, there are performance impact on the lookup time for ordered hash map, while unordered hash map's performance stays constant.
However, V8's ordered hash table implementation is based on this, and that doesn't seem to add any overhead to the look up time with increasing number of items.
Regardless of whether the JS Set implementation is actually O(1) or not (I'm not sure it is), you should not expect O(1) operations to result in speed that is identical across calls. It is a means of measuring the operation complexity rather than the actual throughput speed.
To demonstrate this, consider the use case of sorting an array of numbers. You can sort using array.sort which I believe is O(n * log(n)) in Node.js. You can also create a (bad, but amusing) O(n) implementation using timeouts (ignore complexity of adding to the array, etc):
// input data
let array = [
681, 762, 198, 347, 340,
73, 989, 967, 409, 752,
660, 914, 711, 153, 691,
35, 112, 907, 970, 67
];
// buffer of new
let sorted = [];
// O(n) sorting algorithm
array.forEach(function (num) {
setTimeout(sorted.push.bind(sorted, num), num);
});
// ensure sort finished
setTimeout(function () {
console.log(sorted);
}, 2000);
Of course, the first implementation is faster - but in terms of complexity, the second one is "better". The point is that you should only really be using O to estimate, it does not guarantee any specific amount of time. If you called the O(n) above with an array of 20 numbers (so the same length) but it had only two digit numbers, it would be a large execution time difference.
Stupid example, but it should hopefully support the point I'm trying to make :)
Caching and memory locality. V8's implementation of Set lookup has O(1) theoretical complexity, but real hardware has its own constraints and characteristics. Specifically, not every memory access has the same speed. Theoretical complexity analysis is only concerned with the number of operations, not the speed of each operation.
Update for updated question:
This answers your updated question! When you make many requests to a small Set, it will be likely that the CPU has cached the relevant chunks of memory, making many of the lookups faster than they would be if the data had to be retrieved from memory. There don't have to be more collisions for this effect to happen; it is simply the case that accessing a small memory region repeatedly is faster than spreading out the same number of accesses over a large memory region.
In fact, you can measure the same effect (with smaller magnitude) with an array:
const size = 5000000;
const lookups = 1000000;
const array = new Array(size);
for (let i = 0; i < size; i++) {
array[i] = 1;
}
const start = Date.now();
var result = 0;
for (var i = 0; i < lookups; i++) {
var sample = Math.floor(Math.random() * size);
result += array[sample];
}
const end = Date.now();
console.log(`size: ${size}, time: ${end - start}`);
A million lookups of random indices on a 5,000-element array will be faster than a million lookups of random indices on a 5,000,000 element array.
The reason is that for a smaller data structure, there's a greater likelihood that the random accesses will read elements that are already in the CPU's cache.
In theory you could be right, a Set could have a lookup of O(1), but the JS set definition is very specific on the algorithm. See ECMA Script definition. There is a loop over all elements included.
Try have a look at various HashSet implementation you can find for example here, there might be one with O(1) .has-speed.

NodeJS, Promises and performance

My question is about performance in my NodeJS app...
If my program run 12 iteration of 1.250.000 each = 15.000.000 iterations all together - it takes dedicated servers at Amazon the following time to process:
r3.large: 2 vCPU, 6.5 ECU, 15 GB memory --> 123 minutes
4.8xlarge: 36 vCPU, 132 ECU, 60 GB memory --> 102 minutes
I have some code similair to the code below...
start();
start(){
for(var i=0; i<12; i++){
function2(); // Iterates over a collection - which contains data split up in intervals - by date intervals. This function is actually also recursive - due to the fact - that is run through the data many time (MAX 50-100 times) - due to different intervals sizes...
}
}
function2(){
return new Promise{
for(var i=0; i<1.250.000; i++){
return new Promise{
function3(); // This function simple iterate through all possible combinations - and call function3 - with all given values/combinations
}
}
}
}
function3(){
return new Promise{ // This function simple make some calculations based on the given values/combination - and then return the result to function2 - which in the end - decides which result/combination was the best...
}}
This is equal to 0.411 millisecond / 441 microseconds pér iteration!
When i look at performance and memory usage in the taskbar... the CPU is not running at 100% - but more like 50%...the entire time?
The memory usage starts very low - but KEEPS growing in GB - every minute until the process is done - BUT the (allocated) memory is first released when i press CTRL+C in the Windows CMD... so its like the NodeJS garbage collection doesn't not work optimal - or may be its simple the design of the code again...
When i execute the app i use the memory opt like:
node --max-old-space-size="50000" server.js
PLEASE tell me every thing you thing i can do - to make my program FASTER!
Thank you all - so much!
It's not that the garbage collector doesn't work optimally but that it doesn't work at all - you don't give it any chance to.
When developing the tco module that does tail call optimization in Node i noticed a strange thing. It seemed to leak memory and I didn't know why. It turned out that it was because of few console.log()
calls in various places that I used for testing to see what's going on because seeing a result of recursive call millions levels deep took some time so I wanted to see something while it was doing it.
Your example is pretty similar to that.
Remember that Node is single-threaded. When your computations run, nothing else can - including the GC. Your code is completely synchronous and blocking - even though it's generating millions of promises in a blocking manner. It is blocking because it never reaches the event loop.
Consider this example:
var a = 0, b = 10000000;
function numbers() {
while (a < b) {
console.log("Number " + a++);
}
}
numbers();
It's pretty simple - you want to print 10 million numbers. But when you run it it behaves very strangely - for example it prints numbers up to some point, and then it stops for several seconds, then it keeps going or maybe starts trashing if you're using swap, or maybe gives you this error that I just got right after seeing the Number 8486:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
What's going on here is that the main thread is blocked in a synchronous loop where it keeps creating objects but the GC has no chance to release them.
For such long running tasks you need to divide your work and get into the event loop once in a while.
Here is how you can fix this problem:
var a = 0, b = 10000000;
function numbers() {
var i = 0;
while (a < b && i++ < 100) {
console.log("Number " + a++);
}
if (a < b) setImmediate(numbers);
}
numbers();
It does the same - it prints numbers from a to b but in bunches of 100 and then it schedules itself to continue at the end of the event loop.
Output of $(which time) -v node numbers1.js 2>&1 | egrep 'Maximum resident|FATAL'
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Maximum resident set size (kbytes): 1495968
It used 1.5GB of memory and crashed.
Output of $(which time) -v node numbers2.js 2>&1 | egrep 'Maximum resident|FATAL'
Maximum resident set size (kbytes): 56404
It used 56MB of memory and finished.
See also those answers:
How to write non-blocking async function in Express request handler
How node.js server serve next request, if current request have huge computation?
Maximum call stack size exceeded in nodejs
Node; Q Promise delay
How to avoid jimp blocking the code node.js

can we inspect value in the cache after a read instruction has executed?

Recently, I write a program to transpose a matrix.
for (int i = 0; i < 1000; i++)
{
for (int j = 0; j < 1000; j++)
{
ndata[j][i] = odata[i][j];
}
}
From above code, we know inner loop is cache friendly for odata, however not friendly for ndata, which will lead a lot of cache miss, I want to inspect value of the L1 cache and L2 cache after a read instruction has executed. How can I do this?
First of all - 1000*1000*2 elements (of what, int?) won't fit in any L1 I know of, maybe an L3.
As for your question - there's no simple way to inspect the contents of a cache (expect for running it on a CPU or cache simulator that produces that information), you may measure the access time of that line but by that you would a) affect the contents of the cache or their LRU weights, and b) probably get meaningless results unless you measure accessing multiple such lines in a single measurement and amortize.
By the way, if you're interested in improving this code, just add SW prefetches for ndata[j+1][i] on each iteration.

Why FFTW on Windows is faster than on Linux?

I wrote two identical programs in Linux and Windows using the fftw libraries (fftw3.a, fftw3.lib), and compute the duration of the fftwf_execute(m_wfpFFTplan) statement (16-fft).
For 10000 runs:
On Linux: average time is 0.9
On Windows: average time is 0.12
I am confused as to why this is nine times faster on Windows than on Linux.
Processor: Intel(R) Core(TM) i7 CPU 870 # 2.93GHz
Each OS (Windows XP 32 bit and Linux OpenSUSE 11.4 32 bit) are installed on same machines.
I downloaded the fftw.lib (for Windows) from internet and don't know that configurations. Once I build FFTW with this config:
/configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx
in Linux and it results in a lib that is four times faster than the default configs (0.4 ms).
16 FFT is very small. What you will find is FFTs smaller than say 64 will be hard coded assembler with no loops to get the highest possible performance. This means they can be highly susceptible to variations in instruction sets, compiler optimisations, even 64 or 32bit words.
What happens when you run a test of FFT sizes from 16 -> 1048576 in powers of 2? I say this as a particular hard-coded asm routine on Linux might not be the best optimized for your machine, whereas you might have been lucky on the Windows implementation for that particular size. A comparison of all sizes in this range will give you a better indication of the Linux vs. Windows performance.
Have you calibrated FFTW? When first run FFTW guesses the fastest implementation per machine, however if you have special instruction sets, or a particular sized cache or other processor features then these can have a dramatic effect on execution speed. As a result performing a calibration will test the speed of various FFT routines and choose the fastest per size for your specific hardware. Calibration involves repeatedly computing the plans and saving the FFTW "Wisdom" file generated. The saved calibration data (this is a lengthy process) can then be re-used. I suggest doing it once when your software starts up and re-using the file each time. I have noticed 4-10x performance improvements for certain sizes after calibrating!
Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note this code is pasted verbatim from a DSP library I worked on so some function calls are specific to my library. I hope the FFTW specific calls are helpful.
// Calibration FFTW
void DSP::forceCalibration(void)
{
// Try to import FFTw Wisdom for fast plan creation
FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");
// If wisdom does not exist, ask user to calibrate
if (fftw_wisdom == 0)
{
int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine."\
"Would you like to perform a one-time calibration?\n\n"\
"Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
"\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
"\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!",
MB_YESNO | MB_ICONSTOP, 0);
if (iStatus2 == IDYES)
{
// Perform calibration for all powers of 2 from 8 to 4194304
// (most heavily used FFTs - for signal processing)
AfxMessageBox("About to perform calibration.\n"\
"Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
"Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
"\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
startTimer();
// Create a whole load of FFTw Plans (wisdom accumulates automatically)
for (int i = 8; i <= 4194304; i *= 2)
{
// Create new buffers and fill
DSP::cFFTin = new fftw_complex[i];
DSP::cFFTout = new fftw_complex[i];
DSP::fconv_FULL_Real_FFT_rdat = new double[i];
DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
for(int j = 0; j < i; j++)
{
DSP::fconv_FULL_Real_FFT_rdat[j] = j;
DSP::cFFTin[j][0] = j;
DSP::cFFTin[j][1] = j;
DSP::cFFTout[j][0] = 0.0;
DSP::cFFTout[j][1] = 0.0;
}
// Create a plan for complex FFT.
// Use the measure flag to get the best possible FFT for this size
// FFTw "remembers" which FFTs were the fastest during this test.
// at the end of the test, the results are saved to disk and re-used
// upon every initialisation of the DSP Library
DSP::pCF = fftw_plan_dft_1d
(i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Create a plan for real forward FFT
DSP::pCF = fftw_plan_dft_r2c_1d
(i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Create a plan for real inverse FFT
DSP::pCF = fftw_plan_dft_c2r_1d
(i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);
// Destroy the plan
fftw_destroy_plan(DSP::pCF);
// Destroy the buffers. Repeat for each size
delete [] DSP::cFFTin;
delete [] DSP::cFFTout;
delete [] DSP::fconv_FULL_Real_FFT_rdat;
delete [] DSP::fconv_FULL_Real_FFT_cdat;
}
double time = stopTimer();
char * strOutput;
strOutput = (char*) malloc (100);
sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
"Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
"to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
AfxMessageBox(strOutput);
isCalibrated = 1;
// Save accumulated wisdom
char * strWisdom = fftw_export_wisdom_to_string();
FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
fprintf(fftw_wisdomsave, "%s", strWisdom);
fclose(fftw_wisdomsave);
DSP::pCF = NULL;
DSP::cFFTin = NULL;
DSP::cFFTout = NULL;
fconv_FULL_Real_FFT_cdat = NULL;
fconv_FULL_Real_FFT_rdat = NULL;
free(strOutput);
}
}
else
{
// obtain file size.
fseek (fftw_wisdom , 0 , SEEK_END);
long lSize = ftell (fftw_wisdom);
rewind (fftw_wisdom);
// allocate memory to contain the whole file.
char * strWisdom = (char*) malloc (lSize);
// copy the file into the buffer.
fread (strWisdom,1,lSize,fftw_wisdom);
// import the buffer to fftw wisdom
fftw_import_wisdom_from_string(strWisdom);
fclose(fftw_wisdom);
free(strWisdom);
isCalibrated = 1;
return;
}
}
The secret sauce is to create the plan using the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest for your particular type of FFT (real, complex, 1D, 2D) and size:
DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout,
FFTW_FORWARD, FFTW_MEASURE);
Finally, all benchmark tests should also be performed with a single FFT Plan stage outside of execute, called from code that is compiled in release mode with optimizations on and detached from the debugger. Benchmarks should be performed in a loop with many thousands (or even millions) of iterations and then take the average run time to compute the result. As you probably know the planning stage takes a significant amount of time and the execute is designed to be performed multiple times with a single plan.

Resources