Timing is not reliable playing audio using sdl2/mixer in nim - audio

I'm trying to build a simple metronome to learn the Nim programming language, and though I can get audio to play, the timing doesn't work. I'm running this on Mac OS X, and there is always a lag every third or fourth 'click'.
Here's my code:
# nim code to create a metronome
import times, os
import sdl2, sdl2/mixer
sdl2.init(INIT_AUDIO)
var click : ChunkPtr
var channel : cint
var audio_rate : cint
var audio_format : uint16
var audio_buffers : cint = 4096
var audio_channels : cint = 2
if mixer.openAudio(audio_rate, audio_format, audio_channels, audio_buffers) != 0:
  quit("There was a problem")
click = mixer.loadWAV("click.wav")
var bpm = 120
var next_click = getTime()
let dur = initDuration(milliseconds = toInt(60000 / bpm))
var last_click = getTime()
while true:
  var now = getTime()
  if now >= next_click:
    next_click = next_click + dur
    # discard mixer.playChannelTimed(0, click, 0, cint(500))
    discard mixer.playChannel(0, click, 0)
  os.sleep(1)
Any idea why the lag?
(by the way, the click.wav file is only one channel and 0.2 seconds long)

The call to os.sleep(1) is not reliable for high-precision timing. On Mac OS X it calls nanosleep, whose documentation states:
If the interval specified in req is not an exact multiple of the
granularity underlying clock (see time(7)), then the interval will be
rounded up to the next multiple. Furthermore, after the sleep
completes, there may still be a delay before the CPU becomes free to
once again execute the calling thread.
As such, you need to find a different, more reliable waiting method, or simply remove that delay and burn CPU cycles in the hope of being more precise (your program could still get preempted by the OS anyway).
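One common middle ground between sleeping and pure busy-waiting can be sketched as follows. This is a Python illustration rather than Nim, and the names `wait_until`, `click_deadlines` and `spin_margin` are made up for the example. It assumes two things: deadlines are computed from the start time (so a late wake-up does not shift later clicks), and the coarse sleep deliberately stops a couple of milliseconds early so that a short spin covers the last stretch.

```python
import time


def click_deadlines(start, bpm, count):
    """Absolute click times derived from the start time, so errors do not accumulate."""
    period = 60.0 / bpm
    return [start + i * period for i in range(count)]


def wait_until(deadline, clock=time.perf_counter, sleep=time.sleep, spin_margin=0.002):
    """Block until clock() reaches deadline; return how late we are, in seconds."""
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return -remaining
        if remaining > spin_margin:
            # Coarse sleep that stops short of the deadline, leaving room for oversleep.
            sleep(remaining - spin_margin)
        # Otherwise fall through and poll the clock again (busy-wait for the last stretch).


if __name__ == "__main__":
    start = time.perf_counter()
    for deadline in click_deadlines(start, bpm=120, count=4):
        late = wait_until(deadline)
        print(f"click ({late * 1000:.3f} ms late)")  # the sound would be triggered here
```

The spin margin is a guess. It trades CPU time for precision, and the OS can still preempt the process during the spin.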

Related

Godot: Rendering Terrain In Separate Thread Slower Than Main Thread?

I'll admit, I'm new to multithreading, and I was hoping to dabble in it with my C++ project first. But I've hit a snag in my Godot game project: rendering the terrain gave me a small lag spike every time new terrain was generated, so I wanted to move it to a separate thread. The only problem is that I can't find good resources on Godot multithreading, so I was simply going off the documentation. I practically copied the design from the documentation, but it ended up making my game slower, and it even lags the main thread, not just the generation thread.
I've done a lot of my own research and I know SO is really keen on that or they kick you out so I want to list it here:
Godot Docs, only teaches you about starting up a thread, mutexes, and semaphores.
From what I understand mutexes lock a resource so only that thread can access it and only that thread can unlock it. From tests on my machine constant locks and unlocks don't seem to cause much overhead.
Semaphores from what I understand are a tool to signal some thread from a different thread, whereas mutexes only can unlock and lock from the same thread, one thread can signal a semaphore while another thread waits for that signal. This too doesn't seem to cause much overhead
Doing some practical experiments, it seems that if I get a handle on a chunk and call its render method, the method doesn't run on the calling thread, which I assume is the culprit. If that's the case, though, I don't understand why the rendering would be SLOWER than doing it all on the main thread, unless there's an overhead to calling a function on an object that was created on the main thread. That confuses me even more: isn't all memory shared between threads, so why would it need to do anything extra to call a function?
Using "call_deferred" seems to make the separate thread slightly faster but heavily slows the main thread. And tbh I'm not completely knowledgeable on call_deferred; it seems to call the function during idle time. I experimented with it because of my next research point, which is
Thread Safe APIs, after reading this I understand that interacting with the active tree isn't thread-safe, which means using call_deferred is preferred to interact with it during idle time. It is stated that it is preferred to construct one scene on a separate thread, then use call_deferred to do only one call to add_child. This seems to help get around that Thread Safety issue so that's what I did
That's the best research I could do and I hope it shows I really have tried what I could. It's absolutely not the best that's possible I'm sure, it's just that's the extent of my expertise in research which is why I came here (Y'all seem to have expertise far beyond what I can imagine having haha)
However, taking what I understood from all of this, I decided to create a system where, once an array of indices of positions to generate is written to, the main thread posts to a semaphore, which starts the other thread's generation algorithm. The thread sits in a while loop that begins with semaphore.wait() to wait for the signal that the array has been written and is ready. It goes through the indices and calls the render function for the chunks around each point (I didn't mention it, but the indices point into a list of Vector2 chunk positions to render around). For now the only point is the player's position, so the array always has one element. The render function of each chunk builds a Node2D with all the tiles before making a single add_child call through call_deferred to get around the thread-safety issues. One issue is that there is one call_deferred per chunk; when I tried to fix that it wouldn't work at all, which was also weird.
So here I am with the code:
GameMap Code (Simplified)
# Made up of MapChunks
var map = {}
var chunk_loaders = [Vector2(0,0)]
var render_distance = 7 # Must be Odd
var chunk_tick_dist = 19 # Must be Odd
var noise_list = {"map_noise" : null, "foliage_noise" : null}
var chunk_gen_thread
var chunk_gen_thread_exit = true
var mutex
var semaphore
var indices_to_generate = []
onready var player = get_node("../Player")
func _ready():
    mutex = Mutex.new()
    semaphore = Semaphore.new()
    chunk_gen_thread = Thread.new()
    chunk_gen_thread.start(self, "chunk_generation_thread")
    generate_noise()
    regen_chunks(get_chunk_from_world(player.position.x, player.position.y), 0)
func _exit_tree():
    mutex.lock()
    chunk_gen_thread_exit = true
    mutex.unlock()
    semaphore.post()
    chunk_gen_thread.wait_to_finish()
func chunk_generation_thread(userData):
    while true:
        semaphore.wait() # Wait for chunk gen signal
        # Protect run loop with mutex
        mutex.lock()
        var should_exit = !chunk_gen_thread_exit
        mutex.unlock()
        if should_exit:
            break
        # Regen Chunks
        mutex.lock()
        for i in indices_to_generate:
            var lc_pos = Vector2(chunk_loaders[i].x - floor(render_distance/2), chunk_loaders[i].y - floor(render_distance/2))
            var upper_lc = lc_pos - Vector2(1, 1)
            for x in render_distance:
                for y in render_distance:
                    var chunk_pos = Vector2(lc_pos.x+x, lc_pos.y+y)
                    var chunk = retrieve_chunk(chunk_pos.x, chunk_pos.y)
                    chunk.rerender_chunk()
            for x in render_distance+2:
                for y in render_distance+2:
                    if x != 0 and x != render_distance+1 and y != 0 and y != render_distance+1:
                        continue
                    var chunk = Vector2(upper_lc.x+x, upper_lc.y+y)
                    var unrender_chunk = retrieve_chunk(chunk.x, chunk.y)
                    unrender_chunk.unrender()
        mutex.unlock()
func regen_chunks(chunk_position, chunk_loader_index):
    mutex.lock()
    if chunk_loader_index >= chunk_loaders.size():
        chunk_loaders.append(Vector2(0,0))
    chunk_loaders[chunk_loader_index] = chunk_position
    indices_to_generate = [chunk_loader_index]
    mutex.unlock()
    semaphore.post()
func retrieve_chunk(x, y):
    mutex.lock()
    if !map.has(Vector2(x, y)):
        create_chunk(x, y)
    mutex.unlock()
    return map[Vector2(x, y)]
func create_chunk(x, y):
    var new_chunk = MapChunk.new()
    add_child(new_chunk)
    new_chunk.generate_chunk(x, y)
MapChunk Code (Simplified)
var thread_scene
onready var game_map = get_parent()
func _ready():
    thread_scene = Node2D.new()
func generate_chunk(x, y):
    chunk_position = Vector2(x, y)
    rerender_chunk()
func rerender_chunk():
    if !un_rendered:
        return
    un_rendered = false
    lc_position.x = chunk_position.x*(CHUNK_WIDTH)
    lc_position.y = chunk_position.y*(CHUNK_HEIGHT)
    thread_scene.queue_free()
    thread_scene = Node2D.new()
    chunk_map.resize(CHUNK_WIDTH)
    for x in CHUNK_WIDTH:
        chunk_map[x] = []
        chunk_map[x].resize(CHUNK_HEIGHT)
        for y in CHUNK_HEIGHT:
            var cell_value = game_map.get_noise_value("map_noise", lc_position.x+x, lc_position.y+y)
            assign_ground_cell(cell_value, x, y)
    self.call_deferred("add_child", thread_scene)
func unrender():
    if un_rendered:
        return
    un_rendered = true
    for x in CHUNK_WIDTH:
        for y in CHUNK_HEIGHT:
            if chunk_map[x][y].occupying_tile != null:
                chunk_map[x][y].occupying_tile.call_deferred("queue_free")
            chunk_map[x][y].call_deferred("queue_free")
func assign_ground_cell(cell_value, x, y):
    if cell_value < 0.4:
        chunk_map[x][y] = game_map.create_tile("GRASS", lc_position.x+x, lc_position.y+y)
        generate_grass_foliage(x, y)
    elif cell_value < 0.5:
        chunk_map[x][y] = game_map.create_tile("SAND", lc_position.x+x, lc_position.y+y)
    else:
        chunk_map[x][y] = game_map.create_tile("WATER", lc_position.x+x, lc_position.y+y)
    thread_scene.add_child(chunk_map[x][y])
func generate_grass_foliage(x, y):
    var cell_value = game_map.get_noise_value("foliage_noise", lc_position.x+x, lc_position.y+y)
    if cell_value >= 0.4:
        chunk_map[x][y].occupying_tile = game_map.create_tile("TREE", lc_position.x+x, lc_position.y+y)
        chunk_map[x][y].occupying_tile.parent_tile = chunk_map[x][y]
        chunk_map[x][y].occupying_tile.z_index = 3
    elif cell_value >= 0.2 and cell_value < 0.4:
        chunk_map[x][y].occupying_tile = game_map.create_tile("GRASS_BLADE", lc_position.x+x, lc_position.y+y)
        chunk_map[x][y].occupying_tile.parent_tile = chunk_map[x][y]
        chunk_map[x][y].occupying_tile.z_index = 1
    if chunk_map[x][y].occupying_tile != null:
        thread_scene.add_child(chunk_map[x][y].occupying_tile)
KEEP IN MIND
All of this code works fine if it's all on the main thread!!
There is nothing wrong with the chunk generation code itself! It works completely fine if I remove the thread.start thing from the ready function. It all works except there's like a 0.5-second lag spike every time it's called that I'm trying to get rid of. I am almost 89% sure this should purely be a thread problem. (I'm sure I could improve the chunk gen algorithm more but I also really want to understand threads)
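For readers who want the semaphore-driven worker design described in this question in a compact form, here is a rough sketch in Python (not GDScript, and not a diagnosis of the slowdown). `ChunkWorker`, `build_chunk` and `finished` are hypothetical names. The sketch assumes the expensive step can be written as a function that touches no shared scene state. The lock is held only long enough to swap out the request list, and finished results reach the main thread through a queue, which plays a role loosely similar to a single deferred add_child.

```python
import queue
import threading


class ChunkWorker:
    """Background worker woken by a semaphore; results are handed back via a queue."""

    def __init__(self, build_chunk):
        self._build_chunk = build_chunk      # expensive function of a chunk position
        self._requests = []
        self._lock = threading.Lock()
        self._wake = threading.Semaphore(0)
        self._exit = False
        self.finished = queue.Queue()        # drained by the main thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def request(self, positions):
        with self._lock:
            self._requests.extend(positions)
        self._wake.release()                 # comparable to semaphore.post()

    def stop(self):
        with self._lock:
            self._exit = True
        self._wake.release()
        self._thread.join()

    def _run(self):
        while True:
            self._wake.acquire()             # comparable to semaphore.wait()
            with self._lock:
                if self._exit:
                    return
                batch, self._requests = self._requests, []
            # The lock is released here, so the slow work cannot stall the main thread.
            for pos in batch:
                self.finished.put((pos, self._build_chunk(pos)))


if __name__ == "__main__":
    worker = ChunkWorker(lambda pos: f"tiles for {pos}")
    worker.request([(0, 0), (0, 1)])
    for _ in range(2):
        print(worker.finished.get(timeout=5))
    worker.stop()
```

The main design choice is that the worker never holds the lock while building. In the code above the mutex stays locked for the whole regeneration loop, which may or may not be related to the lag being asked about.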

First note played in AKSequencer is off

I am using AKSequencer to create a sequence of notes that are played by an AKMidiSampler. My problem is that at higher tempos the first note always plays with a little delay, no matter what I do.
I tried prerolling the sequence, but it doesn't help. Substituting the AKMidiSampler with an AKSampler or an AKSamplePlayer (and using a callback track to play them) hasn't helped either, though it made me think that the problem probably resides in the sequencer or in the way I create the notes.
Here's an example of what I'm doing (I tried to make it as simple as I could):
import UIKit
import AudioKit
class ViewController: UIViewController {
    let sequencer = AKSequencer()
    let sampler = AKMIDISampler()
    let callbackInst = AKCallbackInstrument()
    var metronomeTrack : AKMusicTrack?
    var callbackTrack : AKMusicTrack?
    let numberOfBeats = 8
    let tempo = 280.0
    var startTime : TimeInterval = 0
    override func viewDidLoad() {
        super.viewDidLoad()
        print("Begin setup.")
        // Load .wav sample in AKMidiSampler
        do {
            try sampler.loadWav("tick")
        } catch {
            print("sampler.loadWav() failed")
        }
        // Create tracks for the sequencer and set midi outputs
        metronomeTrack = sequencer.newTrack("metronomeTrack")
        callbackTrack = sequencer.newTrack("callbackTrack")
        metronomeTrack?.setMIDIOutput(sampler.midiIn)
        callbackTrack?.setMIDIOutput(callbackInst.midiIn)
        // Setup and start AudioKit
        AudioKit.output = sampler
        do {
            try AudioKit.start()
        } catch {
            print("AudioKit.start() failed")
        }
        // Set sequencer tempo
        sequencer.setTempo(tempo)
        // Create the notes
        var midiSequenceIndex = 0
        for i in 0 ..< numberOfBeats {
            // Add notes to tracks
            metronomeTrack?.add(noteNumber: 60, velocity: 100, position: AKDuration(beats: Double(midiSequenceIndex)), duration: AKDuration(beats: 0.5))
            callbackTrack?.add(noteNumber: MIDINoteNumber(midiSequenceIndex), velocity: 100, position: AKDuration(beats: Double(midiSequenceIndex)), duration: AKDuration(beats: 0.5))
            print("Adding beat number \(i+1) at position: \(midiSequenceIndex)")
            midiSequenceIndex += 1
        }
        // Set the callback
        callbackInst.callback = {status, noteNumber, velocity in
            if status == .noteOn {
                let currentTime = Date().timeIntervalSinceReferenceDate
                let noteDelay = currentTime - ( self.startTime + ( 60.0 / self.tempo ) * Double(noteNumber) )
                print("Beat number: \(noteNumber) delay: \(noteDelay)")
            } else if ( noteNumber == midiSequenceIndex - 1 ) && ( status == .noteOff) {
                print("Sequence ended.\n")
                self.toggleMetronomePlayback()
            } else {return}
        }
        // Preroll the sequencer
        sequencer.preroll()
        print("Setup ended.\n")
    }
    @IBAction func playButtonPressed(_ sender: UIButton) {
        toggleMetronomePlayback()
    }
    func toggleMetronomePlayback() {
        if sequencer.isPlaying == false {
            print("Playback started.")
            startTime = Date().timeIntervalSinceReferenceDate
            sequencer.play()
        } else {
            sequencer.stop()
            sequencer.rewind()
        }
    }
}
Could anyone help? Thank you.
As Aure commented, the start up latency is a known problem. Even with preroll, there is still noticeable latency, especially at higher tempos.
But if you are using a looping sequence, I found that you can sometimes mitigate how noticeable the latency is by setting the 'starting point' of the sequence to a position after the final MIDI event, but within the loop length. If you can find a good position, you can get the latency effects out of the way before it loops back to your content.
Make sure to call setTime() before you need it (e.g., after stopping the sequence, not when you are ready to play), because the setTime() call itself can introduce about 200 ms of wonkiness.
Edit:
As an afterthought, you could do the same thing on a non-looping sequence by enabling looping and using an arbitrarily long sequence length. If you needed playback to stop at the end of the MIDI content, you could do this with an AKCallbackInstrument triggered by a MIDI event placed just after the final note.
After a bit of testing I actually found out that it is not the first note that plays off but the subsequent notes that play in advance. Moreover, the amount of notes that play exactly on time when starting the sequencer depends on the set tempo.
The funny thing is that if the tempo is < 400 there will be one note played on time and the others in advance, if it is 400 <= bpm < 800 there will be two notes played correctly and the others in advance and so on, for every 400 bpm increment you get one more note played correctly.
So... since the notes are played in advance and not late, the solution that solved it for me is:
1) Use a sampler that is not connected directly to a track's MIDI output but has its .play() method called inside a callback.
2) Keep track of when the sequencer gets started.
3) At every callback, calculate when the note should play relative to the start time and record what time it actually is, so you can then calculate the offset.
4) Use the computed offset to dispatch the .play() call asynchronously, delayed by that offset.
And that's it, I tested this on multiple devices and now all the notes play perfectly on time.
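The arithmetic behind steps 2 to 4 can be sketched like this. It is plain Python rather than Swift/AudioKit, `expected_time` and `delay_before_play` are hypothetical helper names, and it assumes one note per beat as in the question's code. A note that arrives early is delayed by the difference between its expected time and the current time; a note that is already late is played immediately.

```python
def expected_time(start_time, tempo_bpm, beat_index):
    """Time at which beat number beat_index should sound, given the sequencer start time."""
    return start_time + (60.0 / tempo_bpm) * beat_index


def delay_before_play(start_time, tempo_bpm, beat_index, now):
    """Seconds to wait before calling play(); zero if the callback is already late."""
    return max(0.0, expected_time(start_time, tempo_bpm, beat_index) - now)


if __name__ == "__main__":
    # Beat 2 at 240 bpm is due 0.5 s after the start; the callback fired 0.1 s early.
    print(delay_before_play(start_time=100.0, tempo_bpm=240.0, beat_index=2, now=100.4))
```

In the callback, the returned value would be the delay handed to whatever delayed-dispatch mechanism is used to trigger the sampler.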
I had the same issue, preroll didn't help, but I have managed to solve it with a dedicated sampler for the first notes.
I used a delay on the other sampler, about 0.06 of a second, works like a charm.
Kind of a silly solution but it did the job and I could go on with the project :)
//This is for fixing AK bug that plays the first playback not in delay
let fixDelay = AKDelay()
fixDelay.dryWetMix = 1
fixDelay.feedback = 0
fixDelay.lowPassCutoff = 22000
fixDelay.time = 0.06
fixDelay.start()
let preDelayMixer = AKMixer()
let preFirstMixer = AKMixer()
[playbackSampler,vocalSampler] >>> preDelayMixer >>> fixDelay
[firstNoteVocalSampler, firstRoundPlaybackSampler] >>> preFirstMixer
[fixDelay,preFirstMixer] >>> endMixer

Qt5: How to execute a "task" based on a weekly scheduler?

I am using Qt5 on Windows7 platform.
I have an app running around the clock that is supposed to connect to some remote devices in order to open or close the service on them. The connection is done via TCP.
For each day of the week there is/should be the possibility to set the hour&minute for both operations/tasks: open-service and close-service, as in the code below:
#define SUNDAY 0
#define MONDAY 1
//...
#define SATURDAY 6
struct Day_OpenCloseService
{
    bool automaticOpenService;
    int openHour;
    int openMinute;
    bool automaticCloseService;
    int closeHour;
    int closeMinute;
};
QVector<Day_OpenCloseService> Week_OpenCloseService(7);
Week_OpenCloseService[SUNDAY].automaticOpenService = true;
Week_OpenCloseService[SUNDAY].openHour = 7;
Week_OpenCloseService[SUNDAY].openMinute = 0;
Week_OpenCloseService[SUNDAY].automaticCloseService = false;
//
Week_OpenCloseService[MONDAY].automaticOpenService = true;
Week_OpenCloseService[MONDAY].openHour = 4;
Week_OpenCloseService[MONDAY].openMinute = 30;
Week_OpenCloseService[MONDAY].automaticCloseService = true;
Week_OpenCloseService[MONDAY].closeHour = 23;
Week_OpenCloseService[MONDAY].closeMinute = 0;
// ...
Week_OpenCloseService[SATURDAY].automaticOpenService = true;
Week_OpenCloseService[SATURDAY].openHour = 6;
Week_OpenCloseService[SATURDAY].openMinute = 15;
Week_OpenCloseService[SATURDAY].automaticCloseService = false;
Week_OpenCloseService[SATURDAY].closeHour = 23;
Week_OpenCloseService[SATURDAY].closeMinute = 59;
If automaticOpenService is true for a day, then an open-service will be executed at the specified hour&minute, in a new thread (I suppose).
If automaticOpenService is false, then no open-service is executed for that day of the week.
And the same goes for the automaticCloseService...
Now, the question is:
How to start the open-service and close-service tasks, based on the above "scheduler"?
Ok, the open-service and close-service tasks are not implemented yet, but they will be just some simple commands via TCP connection to the remote devices (which are listening on a certain port).
I'm still weighing on how to implement that, too... (single-thread, multi-thread, concurrent, etc).
A basic implementation of a scheduler will hold a list of upcoming tasks (maybe with just two items in the list in your case) that is kept sorted by the time at which those tasks need to be executed. Since you are using Qt, you could use QDateTime objects to represent the times at which your upcoming tasks need to be done.
Once you have that list set up, it's just a matter of calculating how many seconds remain between the current time and the timestamp of the first item in the list, and then waiting that number of seconds. The QDateTime::secsTo() method is very useful here as it will do just that calculation for you. You can then call QTimer::singleShot() so that a signal is emitted after that many seconds (note that singleShot() takes its interval in milliseconds, so multiply by 1000).
When the QTimer's signal is emitted and your slot method is called, your slot method will check the QDateTime of the first item in the list; if the current time is greater than or equal to that item's QDateTime, then it's time to execute the task and then pop that item off the head of the list (and maybe reschedule a new task for tomorrow?). Repeat until either the list is empty or the first item in the list has a QDateTime that is still in the future, in which case you'd go back to step 1 again. Repeat indefinitely.
Note that multithreading isn't required to accomplish this task under Qt (and using multithreading wouldn't make the task any easier, either, so I'd avoid it if possible).
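The sorted-list idea above can be sketched without Qt, here in Python with the standard library only. `next_occurrence` and `run_due_tasks` are hypothetical helpers, the weekly rescheduling is an assumption based on the question's per-weekday table, and note that Python counts Monday as 0 whereas the question's defines use SUNDAY as 0. The value returned by `run_due_tasks` is what would be passed (converted to milliseconds) to a single-shot timer.

```python
import heapq
from datetime import datetime, timedelta


def next_occurrence(now, weekday, hour, minute):
    """First datetime strictly after now on the given weekday (Monday=0) at hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def run_due_tasks(heap, now, execute):
    """Run every task that is due, reschedule it one week later,
    and return the seconds until the next task (None if the heap is empty)."""
    while heap and heap[0][0] <= now:
        when, name = heapq.heappop(heap)
        execute(name)
        heapq.heappush(heap, (when + timedelta(days=7), name))
    if not heap:
        return None
    return (heap[0][0] - now).total_seconds()


if __name__ == "__main__":
    now = datetime.now()
    tasks = [(next_occurrence(now, 0, 4, 30), "open-service"),
             (next_occurrence(now, 0, 23, 0), "close-service")]
    heapq.heapify(tasks)
    print("seconds until next task:", run_due_tasks(tasks, now, print))
```

A heap keeps the earliest task at the front without re-sorting the whole list after each reschedule.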

Why FFTW on Windows is faster than on Linux?

I wrote two identical programs on Linux and Windows using the FFTW libraries (fftw3.a, fftw3.lib), and measured the duration of the fftwf_execute(m_wfpFFTplan) statement (a 16-point FFT).
For 10000 runs:
On Linux: average time is 0.9
On Windows: average time is 0.12
I am confused as to why this is nine times faster on Windows than on Linux.
Processor: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
Both OSes (Windows XP 32 bit and Linux OpenSUSE 11.4 32 bit) are installed on the same machine.
I downloaded the fftw.lib (for Windows) from the internet and don't know how it was configured. I once built FFTW on Linux with this config:
./configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx
and it resulted in a lib that is four times faster than the default configs (0.4 ms).
A 16-point FFT is very small. What you will find is that FFTs smaller than, say, 64 points will be hard-coded assembler with no loops, to get the highest possible performance. This means they can be highly susceptible to variations in instruction sets, compiler optimisations, even 64- or 32-bit words.
What happens when you run a test of FFT sizes from 16 -> 1048576 in powers of 2? I say this as a particular hard-coded asm routine on Linux might not be the best optimized for your machine, whereas you might have been lucky on the Windows implementation for that particular size. A comparison of all sizes in this range will give you a better indication of the Linux vs. Windows performance.
Have you calibrated FFTW? When first run FFTW guesses the fastest implementation per machine, however if you have special instruction sets, or a particular sized cache or other processor features then these can have a dramatic effect on execution speed. As a result performing a calibration will test the speed of various FFT routines and choose the fastest per size for your specific hardware. Calibration involves repeatedly computing the plans and saving the FFTW "Wisdom" file generated. The saved calibration data (this is a lengthy process) can then be re-used. I suggest doing it once when your software starts up and re-using the file each time. I have noticed 4-10x performance improvements for certain sizes after calibrating!
Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note this code is pasted verbatim from a DSP library I worked on so some function calls are specific to my library. I hope the FFTW specific calls are helpful.
// Calibration FFTW
void DSP::forceCalibration(void)
{
    // Try to import FFTw Wisdom for fast plan creation
    FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");
    // If wisdom does not exist, ask user to calibrate
    if (fftw_wisdom == 0)
    {
        int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine. "\
            "Would you like to perform a one-time calibration?\n\n"\
            "Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
            "\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
            "\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!",
            MB_YESNO | MB_ICONSTOP, 0);
        if (iStatus2 == IDYES)
        {
            // Perform calibration for all powers of 2 from 8 to 4194304
            // (most heavily used FFTs - for signal processing)
            AfxMessageBox("About to perform calibration.\n"\
                "Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
                "Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
                "\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
            startTimer();
            // Create a whole load of FFTw Plans (wisdom accumulates automatically)
            for (int i = 8; i <= 4194304; i *= 2)
            {
                // Create new buffers and fill
                DSP::cFFTin = new fftw_complex[i];
                DSP::cFFTout = new fftw_complex[i];
                DSP::fconv_FULL_Real_FFT_rdat = new double[i];
                DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
                for(int j = 0; j < i; j++)
                {
                    DSP::fconv_FULL_Real_FFT_rdat[j] = j;
                    DSP::cFFTin[j][0] = j;
                    DSP::cFFTin[j][1] = j;
                    DSP::cFFTout[j][0] = 0.0;
                    DSP::cFFTout[j][1] = 0.0;
                }
                // Create a plan for complex FFT.
                // Use the measure flag to get the best possible FFT for this size
                // FFTw "remembers" which FFTs were the fastest during this test.
                // at the end of the test, the results are saved to disk and re-used
                // upon every initialisation of the DSP Library
                DSP::pCF = fftw_plan_dft_1d
                    (i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);
                // Destroy the plan
                fftw_destroy_plan(DSP::pCF);
                // Create a plan for real forward FFT
                DSP::pCF = fftw_plan_dft_r2c_1d
                    (i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);
                // Destroy the plan
                fftw_destroy_plan(DSP::pCF);
                // Create a plan for real inverse FFT
                DSP::pCF = fftw_plan_dft_c2r_1d
                    (i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);
                // Destroy the plan
                fftw_destroy_plan(DSP::pCF);
                // Destroy the buffers. Repeat for each size
                delete [] DSP::cFFTin;
                delete [] DSP::cFFTout;
                delete [] DSP::fconv_FULL_Real_FFT_rdat;
                delete [] DSP::fconv_FULL_Real_FFT_cdat;
            }
            double time = stopTimer();
            char * strOutput;
            // 256 bytes: the formatted message below is longer than 100 characters
            strOutput = (char*) malloc (256);
            sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
                "Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
                "to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
            AfxMessageBox(strOutput);
            isCalibrated = 1;
            // Save accumulated wisdom
            char * strWisdom = fftw_export_wisdom_to_string();
            FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
            if (fftw_wisdomsave != NULL)
            {
                fprintf(fftw_wisdomsave, "%s", strWisdom);
                fclose(fftw_wisdomsave);
            }
            // The exported string is allocated by FFTW and must be released by the caller
            free(strWisdom);
            DSP::pCF = NULL;
            DSP::cFFTin = NULL;
            DSP::cFFTout = NULL;
            fconv_FULL_Real_FFT_cdat = NULL;
            fconv_FULL_Real_FFT_rdat = NULL;
            free(strOutput);
        }
    }
    else
    {
        // obtain file size.
        fseek (fftw_wisdom , 0 , SEEK_END);
        long lSize = ftell (fftw_wisdom);
        rewind (fftw_wisdom);
        // allocate memory to contain the whole file, plus a terminating NUL.
        char * strWisdom = (char*) malloc (lSize + 1);
        // copy the file into the buffer and terminate the string.
        size_t nRead = fread (strWisdom,1,lSize,fftw_wisdom);
        strWisdom[nRead] = '\0';
        // import the buffer to fftw wisdom
        fftw_import_wisdom_from_string(strWisdom);
        fclose(fftw_wisdom);
        free(strWisdom);
        isCalibrated = 1;
        return;
    }
}
The secret sauce is to create the plan using the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest for your particular type of FFT (real, complex, 1D, 2D) and size:
DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout,
FFTW_FORWARD, FFTW_MEASURE);
Finally, all benchmarks should create the FFT plan once, outside the timed loop, and time only the execute call, in code that is compiled in release mode with optimizations on and run detached from the debugger. Run the execute call in a loop with many thousands (or even millions) of iterations and take the average run time. As you probably know, the planning stage takes a significant amount of time, and execute is designed to be called many times with a single plan.
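The shape of such a benchmark can be sketched as follows. This is a stand-in written in plain Python, not FFTW. `make_plan` precomputes a table of twiddle factors to play the role of the planning stage, `execute` is a naive O(n^2) DFT that plays the role of the execute stage, and both names are invented for the illustration. The point is only where the timer starts and stops.

```python
import cmath
import time


def make_plan(n):
    """Precompute the twiddle-factor table once (stands in for the planning stage)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)] for k in range(n)]


def execute(plan, signal):
    """Naive DFT using the precomputed table (stands in for the execute stage)."""
    return [sum(w * x for w, x in zip(row, signal)) for row in plan]


def average_execute_time(n, runs):
    """Average seconds per execute call, with planning kept outside the timed loop."""
    plan = make_plan(n)                      # done once, not timed
    signal = [complex(i, 0) for i in range(n)]
    start = time.perf_counter()
    for _ in range(runs):
        execute(plan, signal)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    print(f"average: {average_execute_time(16, 10000) * 1e6:.2f} microseconds per transform")
```

Timing a single call, or including `make_plan` inside the loop, would mostly measure setup cost and timer resolution rather than the transform itself.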

What are Lua coroutines even for? Why doesn't this code work as I expect it?

I'm having trouble understanding this code... I was expecting something similar to threading where I would get an output with random "nooo" and "yaaaay"s interspersed with each other as they both do the printing asynchronously, but rather I discovered that the main thread seems to block on the first calling of coroutine.resume() and thus prevents the next from being started until the first has yielded.
If this is the intended operation of coroutines, what are they useful for, and how would I achieve the goal I was hoping for? Would I have to implement my own scheduler for these coroutines to operate asynchronously? That seems messy, and I may as well use functions!
co1 = coroutine.create(function ()
    local i = 1
    while i < 200 do
        print("nooo")
        i = i + 1
    end
    coroutine.yield()
end)
co2 = coroutine.create(function ()
    local i = 1
    while i < 200 do
        print("yaaaay")
        i = i + 1
    end
    coroutine.yield()
end)
coroutine.resume(co1)
coroutine.resume(co2)
Coroutines aren't threads.
Coroutines are like threads that are never actively scheduled. So yes, you are kinda correct that you would have to write your own scheduler to have both coroutines run simultaneously.
However you are missing the bigger picture when it comes to coroutines. Check out wikipedia's list of coroutine uses. Here is one concrete example that might guide you in the right direction.
-- level script
-- a volcano erupts every 2 minutes
function level_with_volcano( interface )
    while true do
        wait(seconds(5))
        start_eruption_volcano()
        wait(frames(10))
        s = play("rumble_sound")
        wait( end_of(s) )
        start_camera_shake()
        -- more stuff
        wait(minutes(2))
    end
end
The above script could be written to run iteratively with a switch statement and some clever state variables, but it is much clearer when written as a coroutine. The above script could also be a thread, but do you really need to dedicate a kernel thread to this simple code? A busy game level could have hundreds of these coroutines running without impacting performance. However, if each of these were a thread, you might get away with 20-30 before performance started to suffer.
A coroutine is meant to allow me to write code that stores state on the stack so that I can stop running it for a while (the wait functions) and start it again where I left off.
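To make the "never actively scheduled" point concrete, here is a small round-robin sketch. It uses Python generators as a stand-in for Lua coroutines, and `printer` and `round_robin` are names invented for the example. Each task yields after every item, and a tiny hand-written scheduler decides who runs next, which produces the interleaving the question was expecting.

```python
from collections import deque


def printer(word, count):
    """A task that hands control back to the scheduler after each word."""
    for _ in range(count):
        yield word


def round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    output = []
    while ready:
        task = ready.popleft()
        try:
            output.append(next(task))   # resume the task until its next yield
        except StopIteration:
            continue                    # task finished, drop it
        ready.append(task)              # otherwise queue it for another turn
    return output


if __name__ == "__main__":
    print(round_robin([printer("nooo", 2), printer("yaaaay", 3)]))
```

Nothing runs in parallel here. The interleaving comes entirely from where the tasks choose to yield and the order in which the scheduler resumes them.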
Since there have been a number of comments asking how to implement the wait function that would make deft_code's example work, I've decided to write a possible implementation. The general idea is that we have a scheduler with a list of coroutines, and the scheduler decides when to return control to the coroutines after they give up control with their wait calls. This is desirable because it makes asynchronous code be readable and easy to reason about.
This is only one possible use of coroutines, they are a more general abstraction tool that can be used for many different purposes (such as writing iterators and generators, writing stateful stream processing objects (for example, multiple stages in a parser), implementing exceptions and continuations, etc.).
First: the scheduler definition:
local function make_scheduler()
    local script_container = {}
    return {
        continue_script = function(frame, script_thread)
            if script_container[frame] == nil then
                script_container[frame] = {}
            end
            table.insert(script_container[frame],script_thread)
        end,
        run = function(frame_number, game_control)
            if script_container[frame_number] ~= nil then
                local i = 1
                --recheck length every time, to allow coroutine to resume on
                --the same frame
                local scripts = script_container[frame_number]
                while i <= #scripts do
                    local success, msg =
                        coroutine.resume(scripts[i], game_control)
                    if not success then error(msg) end
                    i = i + 1
                end
            end
        end
    }
end
Now, initialising the world:
local fps = 60
local frame_number = 1
local scheduler = make_scheduler()
scheduler.continue_script(frame_number, coroutine.create(function(game_control)
    while true do
        --instead of passing game_control as a parameter, we could
        --have equivalently put these values in _ENV.
        game_control.wait(game_control.seconds(5))
        game_control.start_eruption_volcano()
        game_control.wait(game_control.frames(10))
        s = game_control.play("rumble_sound")
        game_control.wait( game_control.end_of(s) )
        game_control.start_camera_shake()
        -- more stuff
        game_control.wait(game_control.minutes(2))
    end
end))
The (dummy) interface to the game:
local game_control = {
    seconds = function(num)
        return math.floor(num*fps)
    end,
    minutes = function(num)
        return math.floor(num*fps*60)
    end,
    frames = function(num) return num end,
    end_of = function(sound)
        return sound.start+sound.duration-frame_number
    end,
    wait = function(frames_to_wait_for)
        scheduler.continue_script(
            frame_number+math.floor(frames_to_wait_for),
            coroutine.running())
        coroutine.yield()
    end,
    start_eruption_volcano = function()
        --obviously in a real game, this could
        --affect some datastructure in a non-immediate way
        print(frame_number..": The volcano is erupting, BOOM!")
    end,
    start_camera_shake = function()
        print(frame_number..": SHAKY!")
    end,
    play = function(soundname)
        print(frame_number..": Playing: "..soundname)
        return {name = soundname, start = frame_number, duration = 30}
    end
}
And the game loop:
while true do
    scheduler.run(frame_number,game_control)
    frame_number = frame_number+1
end
co1 = coroutine.create(
    function()
        for i = 1, 100 do
            print("co1_"..i)
            coroutine.yield(co2)
        end
    end
)
co2 = coroutine.create(
    function()
        for i = 1, 100 do
            print("co2_"..i)
            coroutine.yield(co1)
        end
    end
)
for i = 1, 100 do
    coroutine.resume(co1)
    coroutine.resume(co2)
end
