Finding the "noise level" of an audio recording programmatically - audio

I am tasked with something seemingly trivial which is to
find out how "noisy" a given recording is.
This recording was made with a voice recorder, an
OLYMPUS VN-733 PC, which was fairly cheap (this is not an
advertisement; I mention it only because I in no way
aim to do anything "professional" here, I simply need to
solve a seemingly simple problem).
To preface this, I have already obtained several datasets
from different outdoor locations, in particular parks and
roadsides. The aim is to capture the noise that exists at
these specific locations and then to compare this noise,
on average, with the other locations.
In other words:
I must find out how noisy location A is compared to location
B and C.
I have made 1-minute recordings at each location so that
at least the time span of a recording is comparable
to the other locations (and I was using the very
same voice recorder at all positions, at the same
height etc.).
A sample file can be found at:
http://shevegen.square7.ch/test.mp3
(This may eventually be moved later on; it just serves as
an example of how these recordings sound right now. I am
unhappy about the initial noisy clipping sound; ideally
I'd only capture the background noise of the cars etc.,
but for now this must suffice.)
Now my specific question is, how can I find out how "noisy"
or "loud" this is?
The primary goal is to compare them to the other .mp3
files, which would suffice for my purpose just fine.
But ideally it would be nice to calculate how "loud"
every individual .mp3 is on average and then compare
it to the other ones (there are several recordings
per given geolocation, so I could even merge them
together).
There are some similar questions, but none that I was
able to find answers this in an
objective manner, or perhaps I did not understand the
problem at hand. I have all the audio datasets already
but I have no idea how to find out how "loud" any one
of them is individually; there are some apps on smartphones
that claim they can do this automatically, but since
I do not have a smartphone, this is a dead end for me.
Any general advice will be much appreciated.

Noise is a difficult notion to define, so I will focus on loudness.
You could compute the energy of each file. For that, you need to access the samples of the audio signal (generally via a built-in function of your programming language). Then you can compute the RMS energy of the signal.
That would be the most basic processing.
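As a rough illustration of the RMS idea, here is a minimal sketch in Python. It assumes the .mp3 files have already been decoded to PCM samples scaled to the range -1.0 to 1.0 (the decoding step is not shown, and the two synthetic signals below are only stand-ins for real recordings). The function names are made up for this example.

```python
import math

def rms(samples):
    """Root-mean-square level of samples scaled to [-1.0, 1.0]."""
    if not samples:
        raise ValueError("no samples")
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def rms_dbfs(samples):
    """RMS level in dB relative to full scale (an RMS of 1.0 is 0 dBFS)."""
    level = rms(samples)
    return float("-inf") if level == 0 else 20 * math.log10(level)

# Synthetic stand-ins for two decoded recordings (assumption: real data
# would come from decoding the .mp3 files first).
RATE = 8000
quiet = [0.05 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]
loud = [0.5 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]

# Difference in level between the two recordings, in dB.
print(rms_dbfs(loud) - rms_dbfs(quiet))
```

The result is only a relative figure: it lets you rank recordings made with the same recorder and the same settings, but it is not a calibrated sound pressure level.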

Related

Can I use waveform of the song to proceed audio comparison?

I am planning to develop a music app that includes a function to find similar songs, just like KKBOX and Shazam do, but I'm not familiar with this area. I've found that they apply an FFT to compare songs so that the user can search for similar ones.
However, I am wondering: what if I generate the waveform of each song and then compare the waveforms directly? Is this idea feasible?
As your objective is to find "similar" songs, comparing a 2d waveform is highly unlikely to work. However, it's a good idea to first explore the feasibility of your approach, before rejecting it out of hand.
I would suggest picking a set of 5 songs:
1 song and 1 song you think is very similar to it
1 song that's different from the first one, and a song by the same band on the same album (or from the same time period)
1 audio file that's completely different (e.g. from an audiobook or podcast)
Run through the librosa tutorials (https://librosa.org/doc/main/tutorial.html) and/or some of the walkthroughs on Medium (e.g. https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d), but stopping before you get to the part that uses MFCC. Just focus on the waveform images.
Looking at the visualizations for your songs and thinking through this problem, reason about a) why the waveform comparison ought to work, and b) why the waveform comparison won't work.
So think about things like tempo, timbre and timing - what would be the effect on the waveform of playing the same song on different instruments, with a different effects treatment, at a different tempo, or in a different order (same song, but changing order of verses and chorus).
Setting aside the non-trivial question of which waveform you'd be using (amplitude? of what frequency/frequencies?), at this point you should see how many problems there are with just looking at the waveform, and why MFCC (or similar) is better. Additionally, you'll be better prepared to think about how MFCC parameters might be selected: how much of the song you need to sample, and when you should start the sampling.
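To make one of those problems concrete, here is a small sketch (my own illustration, not part of any library) that compares two synthetic waveforms sample by sample. Both are the same 440 Hz tone; the second merely starts a quarter of a cycle later. The sample rate, tone and function names are arbitrary assumptions.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length sample sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

RATE = 8000  # samples per second (arbitrary for this illustration)

def tone(freq, phase=0.0, length=RATE):
    """One second of a sine tone at the given frequency and phase."""
    return [math.sin(2 * math.pi * freq * n / RATE + phase) for n in range(length)]

original = tone(440)
shifted = tone(440, phase=math.pi / 2)  # same note, a quarter cycle later

print(correlation(original, original))  # close to 1: identical waveforms
print(correlation(original, shifted))   # close to 0: "different" waveforms
```

A listener would call the two tones identical, yet a direct waveform comparison says they have nothing in common, and that is before tempo, instruments or effects enter the picture.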
Is your idea possible? Probably not in the way you are thinking. Maybe you could experiment with transforming the data of the song in some way and then comparing that representation (e.g. looking at changes in amplitude or tempo). The problem with audio is that it encapsulates a lot of features in its signal:
key
tempo
effects treatment (e.g. reverb)
instruments
tone
dynamics
etc.
Watch a tutorial on audio mixing and you'll see/hear just how much the output signal of the exact same song can be changed without actually changing the song being played.
Innovation sometimes emerges when curious people try things that 'probably won't work', so anything is worth a shot, but once you've figured out for yourself why something won't work, it's useful to accept commonly used techniques, and look for opportunities for innovation in other ways.

How can I synchronize two audio recordings *without* timestamps?

Let's say I have two separate recordings of the same concert (created on a user's phone and then uploaded to our server). These recordings are then aligned according to their creation timestamp. However, when these recordings are played together or quickly toggled between, it is revealed that their creation timestamps must be off because there is a perceptible delay.
Since the time stamp is not a reliable way to align these recordings, what is an alternative? I would really prefer not to have to learn about audio signal processing to solve this problem, but recognize this may be the only way. So, I guess my question is:
1. Can I get away with doing some kind of clock synchronization? Is that even possible if the internal device clocks are clearly off by an unknown amount? If yes, a general outline of how this would work and key words would be appreciated.
2. If #1 is not an option, I guess I need to learn about audio signal processing? Again, a general outline of how to tackle the problem from that angle and some key words would be appreciated.
There are 2 separate issues you need to deal with. Issue 1 is the alignment of the start time of the recordings. I doubt you can expect that both users pressed record at the exact same moment. Even if they did, they may be located at different distances from the speaker, and it takes time for sound to travel. Aligning the start times by hand is pretty trivial, because the human brain is good at comparing the similarities of sound. Programmatically it's a different story. You might try using something like cross-correlation or looking over on dsp.stackexchange.com. There is no exact method though.
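For what it's worth, the cross-correlation idea can be sketched in a few lines of plain Python. This is a brute-force illustration on synthetic noise, not a production method: the function name and the test signal are my own assumptions, and for real recordings you would want an FFT-based correlation (for example `scipy.signal.correlate`) because this loop is far too slow for minutes of audio.

```python
import random

def best_lag(reference, other, max_lag):
    """Return the lag L (in samples) that maximises the cross-correlation,
    i.e. the L for which other[n] best lines up with reference[n + L].
    A positive L means `other` started recording L samples later."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, value in enumerate(other):
            m = n + lag
            if 0 <= m < len(reference):
                score += reference[m] * value
        if score > best_score:
            best, best_score = lag, score
    return best

# Synthetic stand-in for a concert: the second "device" records the same
# signal but starts 37 samples later.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(400)]
other = reference[37:337]

print(best_lag(reference, other, max_lag=100))
```

Real recordings differ in microphone, position and background noise, so the correlation peak will be much less clean than in this toy case.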
Issue 2 is that the clocks driving the A/D converters on the two devices are not going to be running at exactly the same rate. So even if you synchronize the start time, eventually the two are going to drift apart. The time it takes to noticeably drift is a function of the difference between the two clock frequencies. If they are relatively close you may not notice in a short recording. To counteract this you need to stretch the time of one of the recordings. This increases or decreases the duration of the recording without affecting the pitch. There are plenty of audio recording apps that allow you to time stretch, but they don't give you any help in figuring out by how much. Start by googling "time stretching" or again have a look at dsp.stackexchange.com.
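One possible way to figure out "by how much" (my own suggestion, not something the apps do for you) is to measure the offset between the recordings twice, once near the beginning and once near the end, and derive the stretch factor from how far the offset has moved. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def stretch_factor(lag_start, lag_end, span_samples):
    """Factor by which the second recording's duration must be multiplied
    so that it stays aligned with the first.

    lag_start, lag_end: offset of the second recording relative to the
        first (in samples), measured near the beginning and near the end.
    span_samples: distance between the two measurement points, counted
        in samples of the second recording.
    """
    if span_samples <= 0:
        raise ValueError("span_samples must be positive")
    return (span_samples + (lag_end - lag_start)) / span_samples

# Made-up example: over one minute at 44.1 kHz the offset grows by
# 132 samples (about 3 ms).
span = 60 * 44100
factor = stretch_factor(0, 132, span)
print(f"stretch by {factor:.7f} ({(factor - 1) * 1e6:.1f} ppm)")
```

A factor this close to 1 is typical of clock drift, which is why the effect only becomes audible in longer recordings.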
I realize neither of these are direct answers - rather suggestions.
Take a look at this document, which describes how you can align recordings using Sonic Visualiser (GPL) and a plugin.
I've not used it before, but found the document (and this question) when I was faced with a similar problem.

Can FFT be used to find drum solos/breaks in audio files?

Is it possible with FFT to find a drum solo, or a drum break, in an audio file? Is this something FFT is able to do and are there any resources online that could aid me with learning?
In general, an FFT is not a good choice for detecting the onset of percussion sounds:
An FFT is always calculated over a window of samples (in effect a period of time) and yields the magnitude of signal within the bin and its phase offset. You can therefore determine that there is signal at that particular bin, but not its onset time. The best time resolution available is the window period. Of course, you can make the period shorter at the expense of frequency resolution.
Percussion sounds tend to look like noise and spread across the spectrum. This would be OK if you only had percussion sounds, but is not great in real-life polyphonic content.
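The trade-off between time and frequency resolution mentioned in the first point is simple arithmetic: a window of N samples at sample rate fs spans N/fs seconds and gives bins that are fs/N Hz wide. A quick sketch (the window sizes and the 44.1 kHz rate are just example values):

```python
def fft_resolution(window_size, sample_rate):
    """Return (window duration in seconds, bin width in Hz) for one FFT window."""
    return window_size / sample_rate, sample_rate / window_size

for size in (256, 1024, 4096):
    seconds, hertz = fft_resolution(size, 44100)
    print(f"{size:5d} samples: {seconds * 1000:6.1f} ms per window, "
          f"{hertz:6.1f} Hz per bin")
```

The product of the two numbers is always 1, so every gain in timing precision is paid for with coarser frequency bins.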
However, you might be able to find some inference from the different characteristics of the spectra of a drum solo vs instrumental sections of a track.
The problem of finding the time at which percussion sounds start in music is described in academic journals as onset detection and is one of the many techniques used for feature extraction; the wider field is known as Music Information Retrieval (MIR). Your problem sounds like one of identifying sections in audio files, and this might be described as partitioning.
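To give a feel for what onset detection means in practice, here is a deliberately naive time-domain sketch: it flags a frame as an onset when its energy jumps well above that of the previous frame. The function names, frame size and thresholds are assumptions chosen for this synthetic example; the detectors used in MIR tools are considerably more sophisticated.

```python
import math

def frame_energies(samples, frame_size):
    """Sum of squared samples for each consecutive, non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def detect_onsets(samples, frame_size, ratio=4.0, floor=1e-6):
    """Sample positions of frames whose energy exceeds `ratio` times the
    previous frame's energy (with `floor` guarding against silence)."""
    energies = frame_energies(samples, frame_size)
    onsets = []
    for i in range(1, len(energies)):
        if energies[i] > ratio * max(energies[i - 1], floor):
            onsets.append(i * frame_size)
    return onsets

# Synthetic "drum hits": two decaying bursts separated by silence.
burst = [math.exp(-n / 200) * math.sin(2 * math.pi * 200 * n / 8000)
         for n in range(1500)]
signal = [0.0] * 1000 + burst + [0.0] * 500 + burst

print(detect_onsets(signal, frame_size=100))
```

On real polyphonic music a plain energy jump like this fires on loud notes from any instrument, which is exactly the difficulty described above.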
A good place to start is Sonic Visualiser which is a tool written specifically for MIR applications. Plug-ins exist for various types of feature extraction. From these you will be able to easily find the large body of academic work in this area. There is an added bonus that the existing plug-ins are all open source too.
I'd look here, there was a bit of discussion with great pointers on the Gamedev SE: https://gamedev.stackexchange.com/questions/9761/beat-detection-and-fft :-)

Choosing an audio API

I'm struggling to choose between a vast number of audio programming languages and APIs. I'm very (totally) new to audio programming so please bear with me.
Software
I need to be able to:
Alter volume of different sounds before outputting them to anything (these sounds can have a variety of different origins, for example mp3s and microphone input)
phase shift sounds
superimpose sounds that I have tweaked (as per items 1 and 2)
control the output to each of 8 channels independently of one another
make this all happen on Windows 7
These capabilities need be abstracted by a graphical frontend I will probably make myself. What I want to be able to do is create 'sound sources' and move them around a 3D environment along either pre-defined trajectories and/or in relation to the movement of whoever is inside the rig. The reason I want to do pitch bending is so I can mess with red-shift stuff.
I don't want to have to construct full tracks before-hand and just play them. I want the sound that is played to depend on external input from sensors as well as what I am doing on the frontend.
As far as I know this means I can't use any existing full audio making app.
The Question
I've been looking around for the API or language I should use, and I have not turned up a blank; quite the opposite, actually. I'm struggling to narrow down my search. A lot of my problem stems from the fact that I have no experience in audio programming.
So, does anyone know off-hand of an API or language that meets my criteria?
Hardware stuff and goals
(I left this until last because I'm not sure how relevant it is)
My goal is to make three rings of speakers at different heights and to have enough control over them to be able to simulate any number of 'sound sources' within the array. The idea is to have someone stand in the middle of the rig and be able to make it sound like there are lots of things moving around them. To get this working I'm planning on doing a little trig and using 8 channels of audio from my PC. The maths is pretty straightforward; it's just the rest that I need to worry about.
What I want to do next is attach a bunch of cameras to the thing and do some simple image recognition stuff to be able to 'attach sound sources' to different objects. Eg. If someone is standing in the right place it can be made to seem as though all red balls quack like a duck, and all orange ones moan hauntingly.
This is not to detract from Richard Small's answer, but to comment on some of the other options out there:
If you are looking for something higher-level with which you can prototype and develop this faster, you want Max/MSP or its open-source competitor Pure Data. These are designed for musicians who are technically minded, but not so much for programmers. As a result, you can build this sort of thing quickly and efficiently.
You also have some lower-level options: PortAudio can handle your audio I/O, but you would have to do the sound generation, effects and so on yourself or with other libraries. Cinder and openFrameworks both provide interfaces for audio, cameras, and other stuff for "creative programming". I'm afraid I don't know if they meet your full requirements, but they are powerful and popular for this sort of thing, so I encourage you to look at them.
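If you go the lower-level route, the "little trig" for placing a source is something you would write yourself. As a rough sketch only, and assuming a simplified layout of a single ring of 8 equally spaced speakers (not the three rings described in the question), per-channel gains could be derived with constant-power panning between the two nearest speakers. The function name and layout are hypothetical.

```python
import math

def ring_gains(azimuth_deg, speaker_count=8):
    """Per-speaker gains for a source at the given azimuth on a ring of
    equally spaced speakers (speaker 0 at 0 degrees), using constant-power
    panning between the two nearest speakers."""
    spacing = 360.0 / speaker_count
    position = (azimuth_deg % 360.0) / spacing  # fractional speaker index
    lower = int(position) % speaker_count
    upper = (lower + 1) % speaker_count
    fraction = position - int(position)
    gains = [0.0] * speaker_count
    gains[lower] = math.cos(fraction * math.pi / 2)
    gains[upper] = math.sin(fraction * math.pi / 2)
    return gains

# A source halfway between speaker 0 and speaker 1.
print([round(g, 3) for g in ring_gains(22.5)])
```

Each output channel is then the source signal multiplied by its gain; the squared gains always sum to 1, so the perceived loudness stays roughly constant as the source moves around the ring.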
The two major ones these days tend to be Wwise and FMOD.
These two engines may even in fact be overkill for what you need, but I can almost guarantee that they will be capable of anything you require.

Frequency differences from MP3 to mic

I'm trying to compare sound clips based on microphone recording. Simply put, I play an MP3 file while recording from the speakers, then attempt to match the two files. I have algorithms in place that work, but I'm seeing a slight difference I'd like to sort out to get better accuracy.
The microphone seems to favor some frequencies (it adds amplitude) and to be slightly off on others (peaks are wider on the mic).
I'm wondering what the cause of this difference is, and how to compensate for it.
Background:
Because of speed issues in how I'm doing comparison I select certain frequencies with certain characteristics. The problem is that a high percentage of these (depending on how many I choose) don't match between MP3 and mic.
It's called the response characteristic of the microphone. Unfortunately, you can't easily get around it without buying a different, presumably more expensive, microphone.
If you can measure the actual microphone frequency response by some method (which generally requires a reference acoustic system and an anechoic chamber), you can compensate for it by applying an equaliser tuned to exactly the inverse characteristic, as discussed here. But in practice, as Kilian says, it's much simpler to get a more precise microphone. I'd recommend a condenser or an electrostatic one.
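The "inverse characteristic" is easy to express once you have the measurement: each band gets the opposite gain in dB. A minimal sketch, in which the band frequencies and deviations are entirely hypothetical values and the function names are my own:

```python
def inverse_eq_gains(response_db):
    """Map each band's measured deviation (dB) to the correcting gain (dB)."""
    return {band: -deviation for band, deviation in response_db.items()}

def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (gain_db / 20)

# Hypothetical measured response: band centre frequency (Hz) -> deviation (dB).
measured = {125: -3.0, 1000: 0.0, 4000: 6.0}

for band, gain_db in inverse_eq_gains(measured).items():
    print(f"{band:5d} Hz: apply {gain_db:+.1f} dB (x{db_to_linear(gain_db):.3f})")
```

When comparing spectra rather than filtering audio, the same correction can be applied directly by multiplying each band's magnitude by its linear factor.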
