Combining multiple live input channels into one output channel (audio)

I am trying to make my own basic mixer and want to know how I could take multiple channels of input audio and output all of them as one mixed audio source, with a controllable level for each input channel. Right now I am trying to use pyo, but I am unable to mix the channels in real time.

Here is some pseudocode that combines multiple input channels into a single output channel, where each input channel has its own volume control in the array mix_volume:
all_chan    // assume all channels live in a two-dimensional array where
            // dimension 0 is the channel and dimension 1 indexes the audio samples
mix_volume  // array of per-channel multiplication factors to control volume,
            // each element a floating-point value between 0.0 and 1.0
output_chan // define and/or allocate your output channel buffer
max_index = length(all_chan[0]) // audio buffer size

for index := 0; index < max_index; index++ {
    curr_sample := 0 // output audio curve height for the current audio sample
    for curr_chan := 0; curr_chan < num_channels; curr_chan++ {
        curr_sample += all_chan[curr_chan][index] * mix_volume[curr_chan]
    }
    output_chan[index] = curr_sample / num_channels // output audio buffer
}
The trick to performing the above on a live stream is to populate the all_chan audio buffers inside an event loop: copy the audio sample values for each channel into those buffers, then execute the above code from inside that loop. Typically you will want your audio buffers to hold about 2^12 (4096) audio samples. Experiment with larger or smaller buffer sizes: too small and this event loop becomes very CPU intensive, too large and you incur an audible delay. Have fun!
You may want to use a compiled language like Go; YMMV.
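For illustration, here is a minimal C sketch of the same mixing loop as it could sit inside an audio callback. The function name, channel count, and buffer layout are assumptions for the example, not part of any particular audio API:

#include <stddef.h>

#define NUM_CHANNELS 4
#define BUFFER_SIZE  4096  /* 2^12 samples, as suggested above */

/* Hypothetical callback: the audio framework refills all_chan each period. */
void mix_callback(float all_chan[NUM_CHANNELS][BUFFER_SIZE],
                  const float mix_volume[NUM_CHANNELS],
                  float output_chan[BUFFER_SIZE])
{
    for (size_t i = 0; i < BUFFER_SIZE; ++i) {
        float curr_sample = 0.0f;
        for (size_t ch = 0; ch < NUM_CHANNELS; ++ch)
            curr_sample += all_chan[ch][i] * mix_volume[ch];
        output_chan[i] = curr_sample / NUM_CHANNELS; /* divide to avoid clipping */
    }
}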

Related

How to analyze MP3 for beat/drums timestamps, trigger actions and playback at the same time (Rust)

I want to trigger an action (for example, let a bright light flash) when the beat or drums in an MP3 file are present during playback. I don't know what procedure/approach I should take, theoretically.
First I thought about statically analyzing the MP3 in a first step. The result of the analysis would be the timestamps at which the action should be triggered. Then I start the MP3, and another thread triggers the actions at the specified timings. This should be easy because I can use the rodio crate for playback. But the static analysis part is still the hard one.
Analysis algorithm:
My idea was to read the raw audio data from an MP3 using the minimp3 crate and do an FFT with the rustfft crate. Once I have the spectrum from the FFT, I could look for high magnitude in the low frequencies, and this should be the beat of the song.
I tried combining minimp3 and rustfft, but I have absolutely no clue what the data I get back really means, and I can't really write a test for it either.
This is my approach so far:
use minimp3::{Decoder, Frame, Error};
use std::fs::File;
use std::sync::Arc;
use rustfft::FFTplanner;
use rustfft::num_complex::Complex;
use rustfft::num_traits::{Zero, FromPrimitive, ToPrimitive};

fn main() {
    let mut decoder = Decoder::new(File::open("08-In the end.mp3").unwrap());
    loop {
        match decoder.next_frame() {
            Ok(Frame { data, sample_rate, channels, .. }) => {
                // we only need mono data; data is interleaved:
                // data[0] is the first sample of the left channel,
                // data[1] the first sample of the right channel, ...
                let mut mono_audio = vec![];
                for i in 0..data.len() / channels {
                    // average the left and right sample of frame i (assumes stereo)
                    let sum = data[channels * i] as i32 + data[channels * i + 1] as i32;
                    let avg = (sum / 2) as i16;
                    mono_audio.push(avg);
                }
                // unnormalized spectrum; now check where the beat/drums are
                // by checking for high volume in the low frequencies
                let spectrum = calc_fft(&mono_audio);
            },
            Err(Error::Eof) => break,
            Err(e) => panic!("{:?}", e),
        }
    }
}
fn calc_fft(raw_mono_audio_data: &Vec<i16>) -> Vec<i16> {
    // Perform a forward FFT over the whole buffer
    let len = raw_mono_audio_data.len();
    let mut input: Vec<Complex<f32>> = vec![];
    let mut spectrum: Vec<Complex<f32>> = vec![Complex::zero(); len];

    // from Vec<i16> to Vec<Complex<f32>>
    raw_mono_audio_data.iter().for_each(|val| {
        let compl = Complex::from_i16(*val).unwrap();
        input.push(compl);
    });

    let mut planner = FFTplanner::new(false);
    let fft = planner.plan_fft(len);
    fft.process(&mut input, &mut spectrum);

    // back to Vec<i16>
    let mut output_i16 = vec![];
    spectrum.iter().for_each(|val| {
        if let Some(val) = val.to_i16() {
            output_i16.push(val);
        }
    });
    output_i16
}
My problem is also that the FFT function doesn't have any parameter where I can specify the sample rate (which is 48,000 Hz). All I get from decoder.next_frame() is a Vec<i16> with 2304 items.
Any ideas how I can achieve that, and what do the numbers I currently get actually mean?
TL;DR:
Decouple analysis and audio data preparation. (1) Read the MP3/WAV data, join the two channels to mono (easier analysis), and take slices from the data with a length that is a power of 2 (for the FFT; if required, fill with additional zeroes). Then (2) feed that data to the spectrum-analyzer crate and learn from its code (which is excellently documented) how the presence of certain frequencies can be obtained from the FFT.
Longer version
Decouple the problem into smaller problems/subtasks.
analysis of audio data in discrete windows => beat: yes or no
a "window" is usually a fixed-size view into the ongoing stream of audio data
choose a strategy here: for example a lowpass filter, an FFT, a combination, ... search for "beat detection algorithm" in the literature
if you are doing an FFT, you should always extend your data window to the next power of 2 (e.g. fill with zeroes; see the sketch after this list)
read the MP3, convert it to mono, and then pass the audio samples step by step to the analysis algorithm
you can use the sampling rate and the sample index to calculate the point in time
=> attach "beat: yes/no" to timestamps inside the song
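Here is a sketch of that zero-padding step in C (the helper names are made up; the question's code is Rust, but the idea carries over unchanged):

#include <stdlib.h>
#include <string.h>

/* Smallest power of two >= n. */
static size_t next_pow2(size_t n) {
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Copy a window into a zero-filled buffer whose length is a power of two,
   so it can be handed straight to an FFT. Caller frees the result. */
float *zero_pad_window(const float *window, size_t len, size_t *padded_len) {
    *padded_len = next_pow2(len);
    float *out = calloc(*padded_len, sizeof(float)); /* calloc zero-fills */
    if (out != NULL)
        memcpy(out, window, len * sizeof(float));
    return out;
}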
The analysis part should be kept generally usable, so that it works for live audio as well as for files. Music is usually discretized at 44100 Hz or 48000 Hz with 16-bit resolution. All common audio libraries will give you an interface to access audio input from the microphone with these properties. If you read an MP3 or a WAV instead, the music (the audio data) is usually in the same format. If you analyze windows of length 2048 at 44100 Hz, for example, each window covers 1/f * n == T * n == n/f == (2048/44100) s == ~46.4 ms. The shorter the time window, the faster your beat detection can react, but the lower its accuracy will be - it's a tradeoff :)
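To make the timing, and what the numbers from the FFT mean, concrete: window k starts at sample k*n, i.e. at k*n/f seconds, and FFT output bin k corresponds to the frequency k*f/n Hz. A small C sketch using the example values from above:

#include <stdio.h>

int main(void) {
    const double f = 44100.0; /* sampling rate in Hz */
    const int n = 2048;       /* window length in samples (power of 2) */

    printf("window length: %.1f ms\n", 1000.0 * n / f);
    for (int k = 0; k < 3; ++k)
        printf("window %d starts at %.3f s\n", k, k * n / f);

    /* Each FFT output bin k represents the frequency k * f / n. */
    for (int k = 0; k < 4; ++k)
        printf("FFT bin %d = %.1f Hz\n", k, k * f / n);
    return 0;
}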
Your algorithm could keep knowledge about previous windows and overlap them to reduce noise/spurious data.
To view existing code that solves these sub-problems, I suggest the following crates:
https://crates.io/crates/lowpass-filter : a simple low-pass filter to get the low frequencies of a data window => (probably) a beat
https://crates.io/crates/spectrum-analyzer : spectrum analysis of an audio window with an FFT, with excellent documentation in the repository about how it is done
The beat-detector crate is a solution that pretty much implements the original content of this question: it connects live audio input with the analysis algorithm.

How to detect a basic audio signal within a much bigger one (mpg123 output signal)

I am new to signal processing and I don't really understand the basics (and more). Sorry in advance for any mistakes in my understanding so far.
I am writing C code to detect a basic signal (an 18 Hz simple sinusoid of 2 seconds duration; generating it with Audacity is pretty simple) inside a much bigger MP3 file. I read the MP3 file and copy it until I match the sound signal.
The signal to match is (1st channel: 18 Hz sine signal, 2nd channel: nothing/doesn't matter).
To match the sound, I calculate the frequency content of the MP3 until I find a good percentage of 18 Hz frequency over ~2 seconds. As this frequency is not very common, I don't have to match it very precisely.
I used mpg123 to convert my file and fill buffers with what it returns. I initialised it to convert the MP3 to mono RAW audio:
init:

int ret;
const long *rates;
size_t rate_count, i;

mpg123_rates(&rates, &rate_count);
mpg123_handle *m = mpg123_new(NULL, &ret);
if (m == NULL)
{
    //err
} else {
    /* check m for NULL before using it */
    mpg123_format_none(m);
    for (i = 0; i < rate_count; ++i)
        mpg123_format(m, rates[i], MPG123_MONO, MPG123_ENC_SIGNED_32);
    mpg123_open_feed(m);
}
(...)
unsigned char out[8*MAX_MP3_BUF_SIZE];
ret = mpg123_decode(m, buf->data, buf->size, out, 8*MAX_MP3_BUF_SIZE, &size);
(...)
But I have no idea how to take the resulting buffer and calculate the FFT on it to get the frequency.
//FREQ calculation with libfftw3
int transform_size = MAX_MP3_BUF_SIZE * 2;
/* r2c plan: real (double) input, complex output; only the first
   transform_size/2 + 1 output bins are filled */
double *fftin = (double*) fftw_malloc(sizeof(double) * transform_size);
fftw_complex *fftout = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (transform_size/2 + 1));
fftw_plan p = fftw_plan_dft_r2c_1d(transform_size, fftin, fftout, FFTW_ESTIMATE);
I can get good RAW audio (PCM?) into a buffer; if I write it to a file, it can be read and converted into a wave with sox:
sox --magic -r 44100 -e signed -b 32 -c 1 rps.raw rps.wav
Any help is appreciated. My knowledge of signal processing is poor, and I am not even sure what to do with the FFT to get the frequency of the signal. The code is just FYI; it is contained in a much bigger project (for which a simple grep is not an option).
Don't use MP3 for this. There's a good chance your 18 Hz will disappear or at least become distorted: 18 Hz is well below the audible range, and MP3 and other lossy algorithms use a variety of techniques to remove sounds that we're not going to hear.
Assuming PCM, since you only need one frequency band, consider using the Goertzel algorithm. It is more efficient than an FFT/DFT for this use case.
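For illustration, a minimal C sketch of the Goertzel algorithm (the function name is made up, and it assumes mono samples already decoded to floats). It returns the squared magnitude at the target frequency for one block of samples; compare it against a threshold over consecutive blocks covering ~2 seconds to detect the marker tone. To resolve 18 Hz the block must be long enough, since the bin width is sample_rate/n (e.g. n = 8192 at 44100 Hz gives ~5.4 Hz).

#include <math.h>

/* Squared magnitude of one frequency band in a block of n mono samples. */
double goertzel_power(const float *samples, int n,
                      double target_hz, double sample_rate)
{
    int k = (int)(0.5 + n * target_hz / sample_rate); /* nearest DFT bin */
    double omega = 2.0 * M_PI * k / n;
    double coeff = 2.0 * cos(omega);
    double s_prev = 0.0, s_prev2 = 0.0;

    for (int i = 0; i < n; ++i) {
        double s = samples[i] + coeff * s_prev - s_prev2;
        s_prev2 = s_prev;
        s_prev = s;
    }
    /* Standard Goertzel magnitude-squared formula. */
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2;
}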

AAC stream resampled incorrectly

I have a very particular problem I wish I could find the answer to.
I'm trying to read an AAC stream from a URL (an online streaming radio, e.g. live.noroc.tv:8000/radionoroc.aacp) with the NAudio library and get IEEE 32-bit floating-point samples.
Besides that, I would like to resample the stream to a particular sample rate and channel count (rate 5512, mono).
Below is the code which accomplishes that:
int tenSecondsOfDownloadedAudio = 5512 * 10;
float[] buffer = new float[tenSecondsOfDownloadedAudio];
using (var reader = new MediaFoundationReader(pathToUrl))
{
    var ieeeFloatWaveFormat = WaveFormat.CreateIeeeFloatWaveFormat(5512, 1); // mono
    using (var resampler = new MediaFoundationResampler(reader, ieeeFloatWaveFormat))
    {
        var waveToSampleProvider = new WaveToSampleProvider(resampler);
        int readSamples = 0;
        float[] tempBuffer = new float[5512]; // 1 second buffer
        while (readSamples < tenSecondsOfDownloadedAudio)
        {
            int read = waveToSampleProvider.Read(tempBuffer, 0, tempBuffer.Length);
            if (read == 0)
            {
                Thread.Sleep(500); // allow streaming buffer to get loaded
                continue;
            }
            // copy only the samples actually read, without overrunning buffer
            int toCopy = Math.Min(read, buffer.Length - readSamples);
            Array.Copy(tempBuffer, 0, buffer, readSamples, toCopy);
            readSamples += toCopy;
        }
    }
}
These particular samples are then written to a Wave audio file using the following simple method:
using (var writer = new WaveFileWriter("path-to-audio-file.wav", WaveFormat.CreateIeeeFloatWaveFormat(5512, 1)))
{
    writer.WriteSamples(samples, 0, samples.Length);
}
What I've encountered is that NAudio does not read 10 seconds of audio (as was requested) but only 5, though the buffer array gets fully loaded with samples (which at this rate and channel count should amount to 10 seconds of audio).
Thus the final audio file plays the stream twice as slowly as it should (a 5-second stream is played over 10 seconds).
Is this somehow related to different bit depths (should I record at 64 bits per sample as opposed to 32)?
I do my testing on Windows Server 2008 R2 x64, with the MFT codecs installed.
I would really appreciate any suggestions.
The problem seems to be that MediaFoundationReader fails to handle HE-AACv2 in an ADTS container, which is a standard online radio stream format and most likely the one you are dealing with.
Adobe products have the same problem, mistreating this format in exactly the same way: they stretch the first half of the audio to the whole duration. See: Corrupted AAC files recorded from online stream.
Supposedly it has something to do with an HE-AACv2 stereo stream actually being a mono stream with an additional info channel for Parametric Stereo.

Panning stereo audio samples

Suppose I've got a 16-bit PCM audio file. I want to pan all of it completely to the left. How would I do this purely through byte manipulation? Do I just mix the samples of the right channel into those of the left channel?
I'd also like to ask (since it seems related): how would I go about turning stereo samples into mono samples?
I'm doing this in Haxe, but code in something like C (or just an explanation of the method) should be sufficient. Thanks!
You'll first need to convert the raw bytes into int arrays (e.g. arrays of 16-bit samples widened to int). Your output for the left channel will be the sum of the two inputs divided by 2:
for (int i = 0; i < numFrames; ++i)
{
    *pOutputL++ = (*pInputL++ + *pInputR++) >> 1;
    *pOutputR++ = 0;
}
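Since the question asks about doing it purely through byte manipulation, here is a C sketch that works directly on the raw bytes, assuming 16-bit little-endian interleaved stereo PCM (the function name is made up). Note that the mixed value written to the left channel is also exactly the mono sample, which answers the second question:

#include <stdint.h>
#include <stddef.h>

/* Pan hard left: mix both channels into the left output, silence the right.
   Each stereo frame is 4 bytes: left lo, left hi, right lo, right hi. */
void pan_hard_left(const uint8_t *in, uint8_t *out, size_t num_frames)
{
    for (size_t i = 0; i < num_frames; ++i) {
        /* Reassemble the two little-endian samples of this frame. */
        int16_t left  = (int16_t)(in[4*i + 0] | (in[4*i + 1] << 8));
        int16_t right = (int16_t)(in[4*i + 2] | (in[4*i + 3] << 8));
        int16_t mixed = (int16_t)(((int32_t)left + right) >> 1);
        out[4*i + 0] = (uint8_t)(mixed & 0xFF);        /* left channel */
        out[4*i + 1] = (uint8_t)((mixed >> 8) & 0xFF);
        out[4*i + 2] = 0;                              /* right channel silent */
        out[4*i + 3] = 0;
    }
}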

LibAV - what approach to take for realtime audio and video capture?

I'm using libav to encode raw RGB24 frames to H.264 and mux them into FLV. This works fine, and I've streamed for more than 48 hours without any problems! My next step is to add audio to the stream. I'll be capturing live audio, and I want to encode it in real time using Speex, MP3 or Nellymoser.
Background info
I'm new to digital audio, and therefore I might be doing things wrong. But basically my application gets a float buffer with interleaved audio. This audioIn function gets called by the application framework I'm using. The buffer contains 256 samples per channel, and I have 2 channels. Because I might be mixing up terminology, this is how I use the data:
// input = array with audio samples
// bufferSize = 256
// nChannels = 2
void audioIn(float * input, int bufferSize, int nChannels) {
    // convert from float to S16
    short* buf = new signed short[bufferSize * 2];
    for (int i = 0; i < bufferSize; ++i) { // loop over all sample frames
        int dx = i * 2;
        buf[dx + 0] = input[dx + 0] * numeric_limits<short>::max(); // convert sample of the first channel
        buf[dx + 1] = input[dx + 1] * numeric_limits<short>::max(); // convert sample of the second channel
    }
    // hand this to the libav wrapper.
    av.addAudioFrame((unsigned char*)buf, bufferSize, nChannels);
    delete[] buf;
}
Now that I have a buffer where each sample is 16 bits, I pass this short* buffer to my wrapper's av.addAudioFrame() function. In this function I create a buffer before I encode the audio. From what I read, the AVCodecContext of the audio encoder sets the frame_size. This frame_size must match the number of samples in the buffer when calling avcodec_encode_audio2(). Why I think this is because of what is documented here, especially the line:
"If it is not set, frame->nb_samples must be equal to avctx->frame_size for all frames except the last." (Please correct me if I'm wrong about this.)
After encoding I call av_interleaved_write_frame() to actually write the frame.
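For reference, the usual pattern when the capture block size (256 samples here) doesn't match the encoder's frame_size is a sample FIFO such as libav's AVAudioFifo, which the question already mentions using. A rough sketch, assuming the FIFO and codec context are created elsewhere; queue_and_encode and encode_one_frame are made-up names:

#include <libavcodec/avcodec.h>
#include <libavutil/audio_fifo.h>

/* Provided elsewhere: reads exactly codec_ctx->frame_size samples from the
   FIFO into an AVFrame, then runs avcodec_encode_audio2() and
   av_interleaved_write_frame(). */
extern int encode_one_frame(AVAudioFifo *fifo, AVCodecContext *codec_ctx);

int queue_and_encode(AVAudioFifo *fifo, AVCodecContext *codec_ctx,
                     void **input_samples, int nb_input_samples)
{
    /* Push whatever the capture callback delivered (e.g. 256 samples). */
    if (av_audio_fifo_write(fifo, input_samples, nb_input_samples) < 0)
        return -1;

    /* Encode only while a full codec frame's worth of samples is queued;
       whatever is left over stays in the FIFO for the next callback. */
    while (av_audio_fifo_size(fifo) >= codec_ctx->frame_size) {
        if (encode_one_frame(fifo, codec_ctx) < 0)
            return -1;
    }
    return 0;
}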
When I use MP3 as the codec, my application runs for about 1-2 minutes and then my server, which is receiving the video/audio stream (FLV, TCP), disconnects with the message "Frame too large: 14485504". This message is generated because the RTMP server is getting a frame which is way too big, and this is probably because I'm not interleaving correctly with libav.
Questions:
There are quite a few bits I'm not sure of, even after going through the source code of libav, so I would appreciate a working example of encoding audio that comes from a buffer from "outside" libav (i.e. your own application). How do you create a buffer that is large enough for the encoder? How do you make "realtime" streaming work when you need to wait for this buffer to fill up?
As I wrote above, I need to keep track of a buffer before I can encode. Does someone else have some code which does this? I'm using AVAudioFifo now. The functions which encode the audio and fill/read the buffer are here too: https://gist.github.com/62f717bbaa69ac7196be
I compiled with --enable-debug=3 and disabled optimizations, but I'm not seeing any debug information. How can I make libav more verbose?
Thanks!
