Signal/Sound Processing: Making text vibrate to music - audio

I'm working on a simple music visualization. Probably not relevant, but I am doing the sound processing using the new WebKit Audio Data API and the dsp.js library.
I want to make a text vibrate (grow/shrink) to the rhythm of the music. What is the best way to do this?
What I've done so far is ran the signals through a FFT. I look at the bottom 10% of frequencies (bass notes?) and when the amplitude surpasses a certain threshold, I animate the text.
Does this sound right? Or am I completely off?

You say you've done it, and then you ask if you are way off? Well, you tell us: does it work for your application?
One potential problem is that the FFT is slow, both in that there may be a lag between your input and output and there will be a lot of CPU used. I don't expect this will matter for your application, but, in general, you are better off using a low-pass filter. When the output of the low-pass goes above some level, you can use that to trigger something for some short amount of time.
Another issue is simply that this is only a very basic beat detection algorithm. It might work for bass-heavy "four on the floor" music, but you'll need to figure out where the threshold goes and how to keep it moving when the bass stops or something. You may want to research beat detection algorithms. The open source aubio has some.
http://aubio.org/

Related

Can I use waveform of the song to proceed audio comparison?

I am planning to develop a music app which includes a function to find the similar song just like what KKBOX and Shazam are doing, but I'm not familiar in this area. I've found that they applied FFT to proceed the comparison of the songs so that the user can search the similar song.
However, i am thinking that what if I generate the waveform of the song, and then directly compare the waveform of the songs. I would like to ask is it possible for my idea?
As your objective is to find "similar" songs, comparing a 2d waveform is highly unlikely to work. However, it's a good idea to first explore the feasibility of your approach, before rejecting it out of hand.
I would suggest picking a set of 5 songs
1 song and 1 song you think is very similar to it
1 song that's different from the first one, and a song by the same band on the same album (or from the same time period)
1 audio file that's completely different (e.g. from an audiobook or podcast)
Run through the librosa tutorials (https://librosa.org/doc/main/tutorial.html) and/or some of the walkthroughs on Medium (e.g. https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d), but stopping before you get to the part that uses MFCC. Just focus on the waveform images.
Looking at the visualizations for your songs and thinking through this problem, reason about a)why the waveform-comparison ought to work, and b)why the waveform-comparison won't work.
So think about things like tempo, timbre and timing - what would be the effect on the waveform of playing the same song on different instruments, with a different effects treatment, at a different tempo, or in a different order (same song, but changing order of verses and chorus).
Setting aside the non-trivial quetion of which waveform you'd be using (amplitude? of what frequency/frequencies?), at this point, you should see how many problems there are with just looking at the waveform, and why MFCC (or similar) is better. Additionally, you'll be better prepared to think about how MFCC parameters might be selected - how much of the song do you need to sample, when should you start the sampling.
Is your idea possible? Probably not in the way you are thinking - maybe you could experiment with something like transforming the data of the song in some way and then comparing that representation (e.g. looking at changes in amplitude or tempo) The problem with audio is that it encapsulates a lot of features in its signal:
key
tempo
effects treatment (e.g. reverb)
instruments
tone
dynamics
etc.
Watch a tutorial on audio mixing and you'll see/hear just how much the output signal of the exact same song can be changed without actually changing the song being played.
Innovation sometimes emerges when curious people try things that 'probably won't work', so anything is worth a shot, but once you've figured out for yourself why something won't work, it's useful to accept commonly used techniques, and look for opportunities for innovation in other ways.

Methods for simulating moving audio source

I'm currently researching an problem regarding DOA (direction of arrival) regression for an audio source, and need to generate training data in the form of audio signals of moving sound sources. In particular, I have the stationary sound files, and I need to simulate a source and microphone(s) with the distances between them changing to reflect movement.
Is there any software online that could potentially do the trick? I've looked into pyroomacoustics and VA as well as other potential libraries, but none of them seem to deal with moving audio sources, due to the difficulties in simulating the doppler effect.
If I were to write up my own simulation code for dealing with this, how difficult would it be? My use case would be an audio source and a microphone in some 2D landscape, both moving with their own velocities, where I would want to collect the recording from the microphone as an audio file.
Some speculation here on my part, as I have only dabbled with writing some aspects of what you are asking about and am not experienced with any particular libraries. Likelihood is good that something exists and will turn up.
That said, I wonder if it would be possible to use either the Unreal or Unity game engine. Both, as far as I can remember, grant the ability to load your own cues and support 3D including Doppler.
As far as writing your own, a lot depends on what you already know. With a single-point mike (as opposed to stereo) the pitch shifting involved is not that hard. There is a technique that involves stepping through the audio file's DSP data using linear interpolation for steps that lie in between the data points, which is considered to have sufficient fidelity for most purposes. Lot's of trig, too, to track the changes in velocity.
If we are dealing with stereo, though, it does get more complicated, depending on how far you want to go with it. The head masks high frequencies, so real time filtering would be needed. Also it would be good to implement delay to match the different arrival times at each ear. And if you start talking about pinnas, I'm way out of my league.
As of now it seems like Pyroomacoustics does not support moving sound sources. However, do check a possible workaround suggested by the developers here in Issue #105 - where the idea of using a time-varying convolution on a dense microphone array is suggested.

How can I synchronize two audio recordings *without* timestamps?

Let's say I have two separate recordings of the same concert (created on a user's phone and then uploaded to our server). These recordings are then aligned according to their creation timestamp. However, when these recordings are played together or quickly toggled between, it is revealed that their creation timestamps must be off because there is a perceptible delay.
Since the time stamp is not a reliable way to align these recordings, what is an alternative? I would really prefer not to have to learn about audio signal processing to solve this problem, but recognize this may be the only way. So, I guess my question is:
Can I get away with doing some kind of clock synchronization? Is that even possible if the internal device clocks are clearly off by an unknown amount? If yes, a general outline of how this would work and key words would be appreciated.
If #1 is not an option, I guess I need to learn about audio signal processing? Again, a general outline of how to tackle the problem from that angle and some key words would be appreciated.
There are 2 separate issues you need to deal with. Issue 1 is the alignment of the start time of the recordings. I doubt you can expect that both user's pressed record at the exact same moment. Even if they did they may be located different distances from the speaker and it takes time for sound to travel. Aligning the start times by hand is pretty trivial. The human brain is good at comparing the similarities of sound. Programmatically it's a different story. You might try using something like cross correlation or looking over on dsp.stackexchange.com. There is no exact method though.
Issue 2 is that the clocks driving the A/D converters on the two devices are not going to be running at the same exact rate. So even if you synchronize the start time, eventually the two are going to drift apart. The time it takes to noticeably drift is a function of the difference of the two clock frequencies. If they are relatively close you may not notice in a short recording. To counter act this you need to stretch the time of one of the recordings. This increases or decreases the duration of the recording without affecting the pitch. There are plenty of audio recording apps that allow you to time stretch but they don't give you any help in figuring out by how much. Start be googling "time stretching" or again have a look at dsp.stackexchange.com.
I realize neither of these are direct answers - rather suggestions.
Take a look at this document, describes how you can align recordings using Sonic Visualizer(GPL) and a plugin.
I've not used it before, but found the document (and this question) when I was faced with a similar problem.

Frequency differences from MP3 to mic

I'm trying to compare sound clips based on microphone recording. Simply put I play an MP3 file while recording from the speakers, then attempt to match the two files. I have the algorithms in place that works, but I'm seeing a slight difference I'd like to sort out to get better accuracy.
The microphone seem to favor some frequencies (add amplitude), and be slightly off on others (peaks are wider on the mic).
I'm wondering what the cause of this difference is, and how to compensate for it.
Background:
Because of speed issues in how I'm doing comparison I select certain frequencies with certain characteristics. The problem is that a high percentage of these (depending on how many I choose) don't match between MP3 and mic.
It's called the response characteristic of the microphone. Unfortunately, you can't easily get around it without buying a different, presumably more expensive, microphone.
If you can measure the actual microphone frequency response by some method (which generally requires having some etalon acoustic system and an anechoic chamber), you can compensate for it by applying an equaliser tuned to exactly inverse characteristic, like discussed here. But in practice, as Kilian says, it's much simpler to get a more precise microphone. I'd recommend a condenser or an electrostatic one.

Touchscreen using sound input?

i don't really know if it is actually possible, but i believe that it can be made. How possible is it to make a program that recognizes different sound bouncing from the screen and turn it into a position that will obviously be later fed to the mouse.
I know that it sounds kind of dumb, but lately i've been noticing that a very dull, strong sound is made when touching the screen, and that sound varies when doing so at different positions. Probably the microphone "hears" differently because the screen acts as a drum with the casing. Anyways, what do you think, anyone has any experience programming with sound?
First of all most domestic touch screens work by detecting pressure based on a criss-cross mesh layer underneath the display layer.
However I have seen an example where a touch interface was interrogated onto a pane of glass, it used 4 microphones to determine the corners, when you tapped a certain part of the screen it measures the delay in the sound getting to each microphone, therefore allowing one to triangulate the touch.
This is the methodology you would use, you don't even need to set up the hardware to test it, you could throw up an interface in VB, when you click in a box it sends out a circular wave and just calculate using the times it takes to reach the 4 points where the pointer is.
EDIT
As nikie suggested, drag & drop, or any kind of gestures would be impossible using the microphone method, as the technique needs a wave of sound to detect the input.
http://computer.howstuffworks.com/question7161.htm
I don't know if this will get you far, but you can investigate the techniques used in MIDI drums for returning various nuances of play.

Resources