I'm new to audio processing and dealing with data that's being streamed in real-time. What I want to do is:
listen to a built-in microphone
chunk samples together into 0.1-second chunks
convert the chunk into a periodogram via the short-time Fourier transform (STFT)
apply some simple functions
convert back to time series data via the inverse STFT (ISTFT)
play back the new audio on headphones
I've been looking around for "real time spectrograms" to give me a guide on how to work with the data, but no dice. I have, however, discovered some interesting packages, including PortAudio.jl, DSP.jl and MusicProcessing.jl.
It feels like I'd need to make use of multiprocessing techniques to just store the incoming data into suitable chunks, whilst simultaneously applying some function to a previous chunk, whilst also playing another previously processed chunk. All of this feels overcomplicated, and has been putting me off from approaching this project for a while now.
Any help will be greatly appreciated, thanks.
As always, start with a simple version of what you really need ... for now, ignore pulling in audio from a microphone; instead write some code to synthesize a sine curve of a known frequency and use that as your input audio, or read in audio from a WAV file. The benefit is that the input is known and reproducible, unlike microphone audio.
this post shows how to use some of the libs you mention http://www.seaandsailor.com/audiosp_julia.html
You speak of a "real time spectrogram" ... this is simply repeatedly processing a window of audio, so let's initially simplify that as well ... once you are able to read in the WAV audio file, send it into an FFT call, which will return that audio curve in its frequency domain representation ... as you correctly state, this frequency domain data can then be sent into an inverse FFT call to give you back the original time domain audio curve
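To make that round trip concrete, here is a minimal sketch in Python (used only for illustration, the same idea carries over to Julia). It uses a naive DFT written with the standard library as a stand-in for a real FFT call; the sample rate, window length and tone frequency are assumptions picked so the tone lands exactly on one frequency bin.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform (stand-in for a real FFT call)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse of dft(); for real input the imaginary parts come back ~0."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

SAMPLE_RATE = 8000   # Hz, assumed for this example
N = 64               # window length in samples
TONE_HZ = 1000       # falls exactly on bin 8, since each bin is 8000 / 64 = 125 Hz wide

# Synthesize a known, reproducible input instead of using the microphone
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(N)]

# Time domain -> frequency domain
spectrum = dft(signal)

# Find the strongest bin in the first half of the spectrum
peak_bin = max(range(N // 2), key=lambda k: abs(spectrum[k]))
peak_hz = peak_bin * SAMPLE_RATE / N

# Frequency domain -> time domain
recovered = [c.real for c in idft(spectrum)]
max_error = max(abs(a - b) for a, b in zip(signal, recovered))
```

Here `peak_hz` should match the tone you synthesized and `max_error` should be at floating-point noise level, which is a quick sanity check before you start modifying the spectrum between the two calls.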
After you get the above working, wrap it in a call which supplies a sliding window of audio samples to give you the "real time" benefit of being able to parse incoming audio from your microphone ... keep in mind that FFT implementations are typically fastest when the number of audio samples in the window you feed into your FFT and IFFT calls is a power of 2 ... let's say your window is 16384 samples ... your Julia server will need to juggle multiple demands: (1) pluck the next buffer of samples from your microphone feed, (2) send a window of samples into your FFT and IFFT calls ... be aware the number of audio samples in your sliding window will typically be larger than the size of your incoming microphone buffer, hence the notion of a sliding window ... over time, add each new mic buffer to the front of this window and remove the same number of samples from the tail end of the window
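A minimal sketch of that sliding window, again in Python for illustration only. The sizes are tiny so the behaviour is easy to follow, and `sliding_windows` and `mic_buffers` are hypothetical names standing in for your real microphone feed.

```python
from collections import deque

WINDOW_SIZE = 8   # tiny for illustration; the text above suggests something like 16384
BUFFER_SIZE = 2   # size of each incoming "microphone" buffer

def sliding_windows(buffers, window_size):
    """Yield the current window each time a new buffer arrives, once it is full."""
    # A bounded deque drops the oldest samples automatically as new ones are added
    window = deque(maxlen=window_size)
    for buf in buffers:
        window.extend(buf)
        if len(window) == window_size:
            yield list(window)

# Fake microphone feed: [[0, 1], [2, 3], ..., [10, 11]]
mic_buffers = [[i, i + 1] for i in range(0, 12, BUFFER_SIZE)]

windows = list(sliding_windows(mic_buffers, WINDOW_SIZE))
```

Each yielded window overlaps the previous one by `WINDOW_SIZE - BUFFER_SIZE` samples, and each one is what you would hand to the FFT call.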
I'm making a game in which there is a series of events (which happen, say, every 30 frames in a 60fps setting) that I want to sync with the music (at 120 bpm). In usual cases, e.g. rhythm games, syncing the events to the music is easier, because humans seem to perceive much smaller gaps in music than in video. However, in my case, the game heavily depends on frame-based time, and a lot of things will break if I change the schedule of my series of events.
After a lot of experiments, it seems to me almost impossible to tweak the music without disturbing the human ear: A jump of ~1ms is noticeable, a ~10ms discrepancy between video and audio is noticeable, a 0.5% change in the pitch is noticeable. And I don't have handy tools to speed up audio without changing the pitch.
What is the easiest way out in this circumstance? Is there any reference on this subject that I can refer to? Any advice is appreciated!
The method that I successfully use (in Java) is to route the playback signal through a path that allows the counting of PCM frames (audio frames run at rates like 44100 fps, as opposed to screen updates, which run at rates like 60 fps). I don't know about other languages, but with Java, this can be done by outputting via a SourceDataLine. As the audio frame count is incremented, it can be compared to the next item (the pending item) in a collection of events that need to trigger other systems or threads. Java has an excellent class for handling the collection of events: ConcurrentSkipListSet. It is thread-safe, and it keeps its elements sorted via a Comparator, which can be set to order them by the desired PCM frame count.
Some example code showing the counting of frames can be seen in the tutorial Using Files and Format Converters, if you search on the page for the phrase "Here, do something useful with the audio data". They are counting bytes, not PCM frames, but the example does give the basic idea.
Why is counting PCM frames effective? I think this has to do with the fact that this code (in Java) is the closest we get to the point where audio data is fed to the native code controlling the sound system, and that this code employs a blocking queue. Thus, the write operations only happen when the audio system is ready to receive and play back more sound data, and audio systems have to be very accurate in how they maintain their rate of processing. The amount of time variance that occurs here (especially if the thread is given a high priority) is smaller than the time variance incurred by choices made by the JVM as it juggles multiple threads and processes.
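As a rough sketch of the frame-counting idea (in Python for brevity, not the Java code I actually use): the loop below only simulates the blocking writes, and `run_playback` plus the event names are made up for the example. A heap plays the role of the sorted event collection.

```python
import heapq

def run_playback(total_frames, frames_per_write, events):
    """Simulate block-wise audio output and fire events by PCM frame count.

    events is a list of (frame_index, name) tuples; returns them in firing order.
    """
    pending = list(events)
    heapq.heapify(pending)   # earliest target frame is always at pending[0]
    fired = []
    frames_written = 0
    while frames_written < total_frames:
        # In a real player, a blocking write to the audio line would happen here
        frames_written += min(frames_per_write, total_frames - frames_written)
        # Fire every event whose target frame has now been reached
        while pending and pending[0][0] <= frames_written:
            fired.append(heapq.heappop(pending))
    return fired

SAMPLE_RATE = 44100                          # PCM frames per second
BPM = 120
frames_per_beat = SAMPLE_RATE * 60 // BPM    # PCM frames between beats

events = [(i * frames_per_beat, f"beat {i}") for i in range(4)]
fired = run_playback(total_frames=SAMPLE_RATE * 2, frames_per_write=1024, events=events)
```

The timing resolution of this scheme is one write block (1024 frames here, an assumed size), so smaller blocks give tighter triggering at the cost of more overhead.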
I'm trying to create a MOV file with two audio tracks and one video track, and I'm trying to do so without AVAssetExportSession or AVComposition, as I want to have the resultant file ready almost immediately after the AVCaptureSession ends. An export after the capture session may only take a few seconds, but not in the case of a 5 minute capture session. This looks like it should be possible, but I feel like I'm just a step away:
There's source #1 - video and audio recorded via AVCaptureSession (handled via AVCaptureVideoDataOutput and AVCaptureAudioDataOutput).
There's source #2 - an audio file read in with an AVAssetReader. Here I use an AVAssetWriterInput and requestMediaDataWhenReadyOnQueue. I call setTimeRange on its AVAssetReader, from CMTimeZero to the duration of the asset, and this shows correctly as 27 seconds when logged out.
I have each of the three inputs working on a queue of its own, and all three are concurrent. Logging shows that they're all handling sample buffers - none appear to be lagging behind or stuck in a queue that isn't processing.
The important point is that the audio file works on its own, using all the same AVAssetWriter code. If I set my AVAssetWriter to output a WAVE file and refrain from adding the writer inputs from #1 (the capture session), I finish my writer session when the audio-from-file samples are depleted. The audio file reports as being of a certain size, and it plays back correctly.
With all three writer inputs added, and the file type set to AVFileTypeQuickTimeMovie, the requestMediaDataWhenReadyOnQueue process for the audio-from-file still appears to read the same data. The resultant MOV file shows three tracks, two audio and one video. The durations of the captured audio and video are not identical, but they've obviously worked, and the video plays back with both intact. The third track (the second audio track), however, shows a duration of zero.
Does anyone know if this whole solution is possible, and why the duration of the from-file audio track is zero when it's in a MOV file? If there was a clear way for me to mix the two audio tracks I would, but for one, AVAssetReaderAudioMixOutput takes two AVAssetTracks, and I essentially want to mix an AVAssetTrack with captured audio, and they aren't managed or read in the same way.
I'd also considered that the QuickTime Movie won't accept certain audio formats, but I'm making a point of passing the same output settings dictionary to both audio AVAssetWriterInputs, and the captured audio does play and report its duration (and the from-file audio plays when in a WAV file with those same output settings), so I don't think this is an issue.
Thanks.
I discovered that the reason for this is:
I correctly use the Presentation Time Stamp of the incoming capture session data (I use the PTS of the video data at the moment) to begin a writer session (startSessionAtSourceTime), and that meant that the timestamp of the audio data read from file had the wrong timestamp - outwith the time range that was dictated to the AVAssetWriter session. So I had to further process the data from the audio file, changing its timing information by using CMSampleBufferCreateCopyWithNewTiming.
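// Duration of the buffer that was read from the audio file
CMTime bufferDuration = CMSampleBufferGetOutputDuration(nextBuffer);
CMSampleBufferRef timeAdjustedBuffer;
CMSampleTimingInfo timingInfo;
timingInfo.duration = bufferDuration;
// Re-stamp the buffer so that it falls inside the writer session's time range
timingInfo.presentationTimeStamp = _presentationTimeUsedToStartSession;
timingInfo.decodeTimeStamp = kCMTimeInvalid;
// Copy the sample data, replacing only its timing information
CMSampleBufferCreateCopyWithNewTiming(kCFAllocatorDefault, nextBuffer, 1, &timingInfo, &timeAdjustedBuffer);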
Wireless connections like Bluetooth are limited by transmission bandwidth, resulting in a limited bitrate and audio sampling frequency.
Can a high-definition audio output like 24-bit/96 kHz be created by combining two separate audio streams of 24-bit/48 kHz each, transmitted from a source to receiver speakers/earphones?
I tried to understand how a DSP (digital signal processor) works, but I am unable to find the exact technical terms that describe this kind of audio splitting and recombining technique for increasing the audio resolution.
No, you would have to upsample the two original audio streams to 96 kHz. Combining two audio streams will not increase audio resolution; all you're really doing is summing two streams together.
You'll probably want to read this free DSP resource for more information.
Here is a simple construction which could be used to create two audio streams at 24bit/48kHz from a higher resolution 24bit/96kHz stream, which could later be recombined to recreate a single audio stream at 24bit/96kHz.
Starting with an initial high resolution source at 24bit/96kHz {x[0],x[1],x[2],...}:
Take every even sample of the source (i.e. {x[0],x[2],x[4],...} ), and send it over your first 24bit/48kHz channel (i.e. producing the stream y1 such that y1[0]=x[0], y1[1]=x[2], ...).
At the same time, take every odd sample {x[1],x[3],x[5],...} of the source, and send it over your second 24bit/48kHz channel (i.e. producing the stream y2 such that y2[0]=x[1], y2[1]=x[3], ...).
At the receiving end, you should then be able to reconstruct the original 24bit/96kHz audio signal by interleaving the samples from your first and second channel. In other words you would be recreating an output stream out with:
out[0] = y1[0]; // ==x[0]
out[1] = y2[0]; // ==x[1]
out[2] = y1[1]; // ==x[2]
out[3] = y2[1]; // ==x[3]
out[4] = y1[2]; // ==x[4]
out[5] = y2[2]; // ==x[5]
...
That said, transmitting those two streams of 24bit/48kHz would require an effective bandwidth of 2 * 24 bit * 48000 Hz = 2304 kbps, which is exactly the same as transmitting one stream of 24bit/96kHz (1 * 24 bit * 96000 Hz = 2304 kbps). So, while this allows you to fit the audio stream in channels of fixed bandwidth, you are not reducing the total bandwidth requirement this way.
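The even/odd split, the interleaving at the receiver, and the bandwidth arithmetic can be sketched as follows (Python, purely illustrative; the function names are made up for this example and plain integers stand in for 24-bit samples).

```python
def split_even_odd(x):
    """Split one stream into two half-rate streams: even samples and odd samples."""
    return x[0::2], x[1::2]

def interleave(y1, y2):
    """Rebuild the full-rate stream by alternating samples from y1 and y2."""
    out = []
    for a, b in zip(y1, y2):
        out.extend((a, b))
    # If the source had an odd number of samples, y1 holds one extra at the end
    if len(y1) > len(y2):
        out.append(y1[-1])
    return out

def bitrate_kbps(bits_per_sample, sample_rate_hz, streams=1):
    """Raw (uncompressed, single-channel) bitrate in kilobits per second."""
    return streams * bits_per_sample * sample_rate_hz / 1000

x = list(range(10))          # stand-in for the 96 kHz source samples
y1, y2 = split_even_odd(x)   # the two 48 kHz streams
out = interleave(y1, y2)     # reconstructed 96 kHz stream

two_streams = bitrate_kbps(24, 48000, streams=2)
one_stream = bitrate_kbps(24, 96000)
```

`out` comes back identical to `x`, and `two_streams` equals `one_stream`, which is the point made above: the split changes how the data is packaged, not how much of it there is.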
Could you please provide your definition of "combining"? Based on the data rates, it seems like you want to do a multiplex (combining two mono channels into a stereo channel). If the desire is to "add" two channels together (two monos into a single mono, or two stereo channels into one stereo), then you should not have to increase your sampling rate (you are adding two band-limited signals, so increasing the sampling rate is not necessary).
I want to build a SoundWave by sampling an audio stream.
I read that a good method is to get the amplitude of the audio stream and represent it with a Polygon. But suppose we have an AudioGraph with just a DeviceInputNode and a FileOutputNode (a simple recorder).
How can I get the amplitude from a node of the AudioGraph?
What is the best way to periodize this sampling? Is a DispatcherTimer good enough?
Any help will be appreciated.
First, everything you care about is kind of here:
uwp AudioGraph audio processing
But since you have a different starting point, I'll explain some more core things.
An AudioGraph node is already periodized for you -- it's generally how audio works. I think Win10 defaults to periods of 10ms and/or 20ms, but this can be set (theoretically) via the AudioGraphSettings.DesiredSamplesPerQuantum setting, with the AudioGraphSettings.QuantumSizeSelectionMode = QuantumSizeSelectionMode.ClosestToDesired; I believe the success of this functionality actually depends on your audio hardware and not the OS specifically. My PC can only do 480 and 960. This number is how many samples of the audio signal to accumulate per channel (mono is one channel, stereo is two channels, etc...), and this number will also set the callback timing as a by-product.
Win10 and most devices default to a 48000Hz sample rate, which means they are measuring/outputting data that many times per second. So with my QuantumSize of 480 for every frame of audio, I am getting 48000/480 or 100 frames every second, which means I'm getting them every 10 milliseconds by default. If you set your quantum to 960 samples per frame, you would get 50 frames every second, or a frame every 20ms.
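The arithmetic is easy to check for any quantum size. A tiny sketch (Python just for illustration, and `quantum_timing` is a made-up helper, not part of the AudioGraph API):

```python
def quantum_timing(sample_rate_hz, samples_per_quantum):
    """Return (quanta per second, milliseconds between callbacks)."""
    quanta_per_second = sample_rate_hz / samples_per_quantum
    period_ms = 1000 / quanta_per_second
    return quanta_per_second, period_ms

rate_480, period_480 = quantum_timing(48000, 480)
rate_960, period_960 = quantum_timing(48000, 960)
```

With the two quantum sizes my PC supports, this gives 100 callbacks per second (every 10ms) and 50 callbacks per second (every 20ms).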
To get a callback for that frame of audio every quantum, you need to register a handler for the AudioGraph.QuantumProcessed event. You can directly reference the link above for how to do that.
So by default, a frame of data is stored in an array of 480 floats in the range [-1, +1]. To get the amplitude, you just average the absolute values of this data.
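In code, that averaging is only a couple of lines. A sketch in Python for illustration (in a UWP app you would do the equivalent in C# over the float buffer you get in the callback; `mean_abs_amplitude` and the test frame are invented for the example):

```python
def mean_abs_amplitude(frame):
    """Average of the absolute sample values in one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(abs(sample) for sample in frame) / len(frame)

# A fake 480-sample frame alternating between +0.5 and -0.5
frame = [0.5, -0.5] * 240
amplitude = mean_abs_amplitude(frame)
silence = mean_abs_amplitude([0.0] * 480)
```

One value like this per quantum gives you a point for the Polygon every 10ms or 20ms, depending on your quantum size.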
This part, including handling multiple channels of audio, is explained more thoroughly in my other post.
Have fun!
I am working on a musical project in which I need accurate timing. I have already tried NSTimer and NSDate, but I am getting delays while playing the beats (i.e. tick-tock). So I have decided to use the Audio Queue API to play my sound file, which is in .wav format and lives in the main bundle. I have been struggling with this for two weeks. Can anybody please help me with this problem?
Use uncompressed sounds and a single Audio Queue, continuously running. Don't stop it between beats. Instead mix in raw PCM samples of each Tock sound (etc.) starting some exact number of samples apart. Play silence (zeros) in between. That will produce sub-millisecond accurate timing.
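The buffer-filling part of that can be sketched like this (Python, purely to show the sample arithmetic; in practice this logic would live in your Audio Queue callback, and `render_click_track` and the four-sample "click" are invented for the example):

```python
def render_click_track(click, interval_samples, n_clicks):
    """Return a continuous buffer of silence with `click` mixed in at exact sample offsets."""
    total = interval_samples * n_clicks
    out = [0.0] * total                      # silence (zeros) everywhere by default
    for i in range(n_clicks):
        start = i * interval_samples         # exact onset, measured in samples
        for j, sample in enumerate(click):
            if start + j < total:
                out[start + j] += sample     # mix (add) the click into the stream
    return out

SAMPLE_RATE = 44100                          # assumed output rate in Hz
BPM = 120
interval = SAMPLE_RATE * 60 // BPM           # samples between beats

click = [1.0, -1.0, 0.5, -0.5]               # stand-in for the raw PCM of a tick sound
track = render_click_track(click, interval, 4)

# Positions where each click begins (its first sample is the only 1.0 in the click)
onsets = [i for i, s in enumerate(track) if s == 1.0]
```

Because the spacing is counted in samples rather than measured with a timer, each beat starts exactly `interval` samples after the previous one.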