I recently wondered whether it is possible to zoom or move things in FFmpeg based on an audio source.
I have already played around with complex filters, since they allow some audio visualization, but didn't really manage to move/zoom things based on sound. See good examples of complex filters used for audio visualization at: https://hhsprings.bitbucket.io/docs/programming/examples/ffmpeg/audio_visualization/index.html
My current situation is that I have multiple inputs, one of which should react to the sound, maybe even to specific frequencies.
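As far as I know, FFmpeg has no filter that maps audio level directly onto zoom, so one workaround is a two-pass, offline approach: measure a per-frame loudness envelope first, then use it to drive the zoom while re-rendering the frames. Below is a rough sketch of that idea, assuming Python with librosa, OpenCV, and numpy installed; "video.mp4", "audio.wav", and the 20% maximum zoom are placeholder assumptions, and the audio still has to be muxed back in with ffmpeg afterwards.

```python
# Rough sketch: audio-reactive zoom computed offline (not a real-time FFmpeg filter).
# Assumes librosa, opencv-python, and numpy; "video.mp4" / "audio.wav" are placeholders.
import cv2
import librosa
import numpy as np

VIDEO_IN, AUDIO_IN, VIDEO_OUT = "video.mp4", "audio.wav", "zoomed.mp4"

cap = cv2.VideoCapture(VIDEO_IN)
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# One RMS (loudness) value per video frame.
y, sr = librosa.load(AUDIO_IN, sr=None, mono=True)
hop = int(sr / fps)
rms = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
rms = rms / (rms.max() + 1e-9)              # normalize to 0..1

writer = cv2.VideoWriter(VIDEO_OUT, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    level = rms[min(frame_idx, len(rms) - 1)]
    zoom = 1.0 + 0.2 * level                # up to 20% zoom on loud frames
    cw, ch = int(w / zoom), int(h / zoom)
    x0, y0 = (w - cw) // 2, (h - ch) // 2
    writer.write(cv2.resize(frame[y0:y0 + ch, x0:x0 + cw], (w, h)))
    frame_idx += 1

cap.release()
writer.release()
# Then mux the audio back in, e.g.:
#   ffmpeg -i zoomed.mp4 -i audio.wav -c:v copy -c:a aac -shortest out.mp4
```

The same envelope could also drive position offsets instead of zoom, and band-limiting the audio before the RMS step (e.g. keeping only the low end) would make it react to specific frequencies.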
I am planning to develop a music app that includes a function to find similar songs, much like what KKBOX and Shazam are doing, but I'm not familiar with this area. I've found that they apply an FFT to compare songs so that the user can search for a similar song.
However, I am wondering: what if I generate the waveform of the song and then directly compare the waveforms of the songs? Is this idea feasible?
As your objective is to find "similar" songs, comparing 2D waveforms is highly unlikely to work. However, it's a good idea to first explore the feasibility of your approach before rejecting it out of hand.
I would suggest picking a set of 5 songs:
1 song and 1 song you think is very similar to it
1 song that's different from the first one, and a song by the same band on the same album (or from the same time period)
1 audio file that's completely different (e.g. from an audiobook or podcast)
Run through the librosa tutorials (https://librosa.org/doc/main/tutorial.html) and/or some of the walkthroughs on Medium (e.g. https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d), but stop before you get to the part that uses MFCC. Just focus on the waveform images.
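For the waveform-only step, plotting a few songs side by side is usually enough to start forming an opinion. A minimal sketch, assuming librosa and matplotlib are installed and the file names are placeholders:

```python
# Minimal sketch: plot raw waveforms of a few songs for visual comparison.
import librosa
import librosa.display
import matplotlib.pyplot as plt

paths = ["song_a.mp3", "song_a_similar.mp3", "audiobook_clip.mp3"]   # placeholders

fig, axes = plt.subplots(len(paths), 1, figsize=(10, 2 * len(paths)), sharex=True)
for ax, path in zip(axes, paths):
    y, sr = librosa.load(path, mono=True)
    librosa.display.waveshow(y, sr=sr, ax=ax)   # called waveplot() in older librosa versions
    ax.set_title(path)
plt.tight_layout()
plt.show()
```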
Looking at the visualizations for your songs and thinking through this problem, reason about (a) why the waveform comparison ought to work, and (b) why the waveform comparison won't work.
So think about things like tempo, timbre and timing - what would be the effect on the waveform of playing the same song on different instruments, with a different effects treatment, at a different tempo, or in a different order (same song, but changing order of verses and chorus).
Setting aside the non-trivial question of which waveform you'd be using (amplitude? of what frequency/frequencies?), at this point you should see how many problems there are with just looking at the waveform, and why MFCC (or similar) is better. Additionally, you'll be better prepared to think about how MFCC parameters might be selected: how much of the song do you need to sample, and when should you start sampling?
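To make the contrast concrete, here is a small, hedged sketch of the MFCC route (a summary statistic, not a production similarity search): each song is reduced to a mean MFCC vector and songs are compared by cosine similarity. The file names and the choice of 13 coefficients are just assumptions for illustration.

```python
# Sketch: compare two songs via mean MFCC vectors + cosine similarity.
import librosa
import numpy as np

def mfcc_signature(path, n_mfcc=13):
    y, sr = librosa.load(path, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # collapse time into one vector

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sig_a = mfcc_signature("song_a.mp3")     # placeholder paths
sig_b = mfcc_signature("song_b.mp3")
print("similarity:", cosine_similarity(sig_a, sig_b))
```

Averaging over time throws away a lot of structure, but even this crude signature tends to separate songs far better than comparing raw waveforms sample by sample.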
Is your idea possible? Probably not in the way you are thinking - maybe you could experiment with something like transforming the data of the song in some way and then comparing that representation (e.g. looking at changes in amplitude or tempo). The problem with audio is that it encapsulates a lot of features in its signal (a few of which can be extracted directly, as sketched after this list):
key
tempo
effects treatment (e.g. reverb)
instruments
tone
dynamics
etc.
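Several of these are directly measurable; for example, librosa can give a rough global tempo estimate and a chroma profile (a crude proxy for key/harmony). A hedged sketch with a placeholder file name:

```python
# Sketch: pull a couple of the listed features out of a song with librosa.
import librosa
import numpy as np

y, sr = librosa.load("song.mp3", mono=True)        # placeholder path

tempo, _beats = librosa.beat.beat_track(y=y, sr=sr)    # rough global tempo (BPM)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)       # 12 pitch classes x frames
pitch_profile = chroma.mean(axis=1)                    # crude key/harmony fingerprint

print(f"estimated tempo: {float(tempo):.1f} BPM")
print("strongest pitch class (0 = C):", int(np.argmax(pitch_profile)))
```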
Watch a tutorial on audio mixing and you'll see/hear just how much the output signal of the exact same song can be changed without actually changing the song being played.
Innovation sometimes emerges when curious people try things that 'probably won't work', so anything is worth a shot, but once you've figured out for yourself why something won't work, it's useful to accept commonly used techniques, and look for opportunities for innovation in other ways.
Where should I begin?
I am trying to build a real-time stem-split music visualizer for VJing and the like. What sets this apart is that I would like to split the input audio stream into its stems (either algorithmically or using something like Spleeter) and then use each stem's data to control different aspects of the visualization.
For example:
The isolated drums could drive a BPM-synced video.
I'm hoping to achieve this by making a short looping video at a fixed BPM (say, 60), detecting the BPM of the stream, and then adjusting the playback speed of the video so that it stays in sync (see the sketch after this list).
The isolated synth stream could control DMX lights.
I want to try and encode this data in, say, the last row of pixels in the above video. By reading the colour, intensity, and movement data from the pixels, moves and timings could be read and sent to the lights in real-time. I'm doing this so that the user can encode all the data needed for a scene into one video file.
The isolated vocals could be synced and displayed on screen using MusixMatch.
The isolated bassline could be parsed into MIDI data and visualized on screen.
All of the above can be controlled live.
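For the drum-stem idea specifically, the core calculation is small: estimate the BPM of the drum audio and divide by the loop's native BPM to get a playback rate. A hedged sketch, assuming Python with librosa and a placeholder file name (a live version would run the same estimate on a rolling buffer of the incoming stream):

```python
# Sketch: derive a video playback rate from the detected tempo of a drum stem.
import librosa

LOOP_BPM = 60.0                    # the BPM the looping video was authored at

y, sr = librosa.load("drums.wav", mono=True)      # placeholder path
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)    # rough global tempo estimate

playback_rate = float(tempo) / LOOP_BPM
print(f"detected tempo: {float(tempo):.1f} BPM -> play the loop at {playback_rate:.2f}x")
```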
Now the problem is that I am relatively inexperienced with programming, and I am not sure where to start: which language to use, which IDE, how to display visuals, how to interact with audio input streams, how to use DMX, and how to visualize MIDI data. I know this is currently quite a bit out of my depth, but I'll manage with the right resources. Please give me some advice on where to begin for a project like this.
I am working on a project that involves using a lot of found audio clips (some new, some very old archival recordings of poor quality, etc.).
I am trying to figure out a way to get all the audio clips to a similar quality (if that is possible) and to play at a similar volume.
I have use of both Audacity and Ableton... any suggestions would be great.
What you are asking for is commonly called normalization. There are several tools that can do it, including command-line tools as well as Audacity.
You'll find the tool in Audacity under Effect > Normalize...
You can select multiple audio tracks.
You could also consider using a limiter and/or a compressor on your track. Have a look in the Live effect reference for more info on these: https://www.ableton.com/en/manual/live-audio-effect-reference/
The results will not be as good as applying normalization by hand, but it will be a lot quicker.
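If there are many clips, a script can do a first pass of peak normalization before any by-hand work in Audacity or Live. A rough sketch using pydub (just one option; any audio library with gain control would do), with placeholder folder names and a -1 dBFS peak target:

```python
# Sketch: batch peak-normalize a folder of clips to a common target level.
# Assumes pydub is installed (plus ffmpeg for non-WAV formats); paths are placeholders.
from pathlib import Path
from pydub import AudioSegment

TARGET_DBFS = -1.0                          # peak target; lower it for more headroom

out_dir = Path("normalized")
out_dir.mkdir(exist_ok=True)

for path in Path("found_clips").glob("*.wav"):
    clip = AudioSegment.from_file(str(path))
    gain = TARGET_DBFS - clip.max_dBFS      # gain needed to bring the peak to the target
    clip.apply_gain(gain).export(str(out_dir / path.name), format="wav")
```

Peak normalization only matches the loudest sample in each clip; if the results still feel uneven, loudness normalization (e.g. ffmpeg's loudnorm filter) or the compressor/limiter route mentioned above evens out the perceived volume better.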
I'm currently researching a problem regarding DOA (direction of arrival) regression for an audio source, and need to generate training data in the form of audio signals of moving sound sources. In particular, I have the stationary sound files, and I need to simulate a source and microphone(s) with the distances between them changing to reflect movement.
Is there any software online that could potentially do the trick? I've looked into pyroomacoustics and VA as well as other potential libraries, but none of them seem to deal with moving audio sources, due to the difficulties in simulating the doppler effect.
If I were to write up my own simulation code for dealing with this, how difficult would it be? My use case would be an audio source and a microphone in some 2D landscape, both moving with their own velocities, where I would want to collect the recording from the microphone as an audio file.
Some speculation here on my part, as I have only dabbled with writing some aspects of what you are asking about and am not experienced with any particular libraries. Likelihood is good that something exists and will turn up.
That said, I wonder if it would be possible to use either the Unreal or Unity game engine. Both, as far as I can remember, grant the ability to load your own cues and support 3D including Doppler.
As far as writing your own, a lot depends on what you already know. With a single-point mic (as opposed to stereo), the pitch shifting involved is not that hard. There is a technique that involves stepping through the audio file's sample data using linear interpolation for steps that lie in between the data points, which is considered to have sufficient fidelity for most purposes. Lots of trig, too, to track the changes in velocity.
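That linear-interpolation idea is compact enough to sketch. The following is a rough, hedged example (mono, 2D, free field, no reflections, numpy only): the signal the microphone receives at time t is the dry signal read back at t minus the current propagation delay, and reading at those fractional sample positions is what produces the Doppler shift.

```python
# Rough sketch: render what a single moving microphone hears from a moving source.
# Mono, 2D, free field; positions are sampled once per output sample.
import numpy as np

C = 343.0  # speed of sound in m/s (room temperature; an assumption)

def simulate_moving_source(dry, sr, src_path, mic_path):
    """dry: 1-D source signal; src_path/mic_path: (N, 2) arrays of x/y positions."""
    n = len(src_path)
    t = np.arange(n) / sr                            # receiver time axis
    dist = np.linalg.norm(src_path - mic_path, axis=1)
    # Approximation: use the distance at receive time to get the emission time.
    emit_index = (t - dist / C) * sr
    out = np.interp(emit_index, np.arange(len(dry)), dry, left=0.0, right=0.0)
    out /= np.maximum(dist, 0.1)                     # simple 1/r attenuation
    return out
```

Feeding it two (N, 2) position arrays sampled at the output rate, e.g. a source flying past a stationary microphone, produces the familiar pitch drop as it passes; the trigonometry lives entirely in how those paths are generated.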
If we are dealing with stereo, though, it does get more complicated, depending on how far you want to go with it. The head masks high frequencies, so real-time filtering would be needed. It would also be good to implement delay to match the different arrival times at each ear. And if you start talking about pinnae, I'm way out of my league.
As of now it seems like pyroomacoustics does not support moving sound sources. However, do check the possible workaround suggested by the developers in Issue #105, where the idea of using a time-varying convolution on a dense microphone array is proposed.
I am curious: how would one implement the simplest audio engine ever? I have in mind something like a stream for audio data using the default audio device. Having played a lot with RtAudio, I think this would be possible if one could drop some of the features. Does anyone have an idea where to start?
I would do it (did do it) like this:
http://ccan.ozlabs.org/info/wwviaudio.html
Well, there is no reason why you can't create an audio engine that has a trivially simple interface:
audioEngine.PlayStream(myStream)
The audio engine would then periodically read data from that stream and send it to the soundcard. The reason audio engines tend to be more complicated than this is that there are all kinds of parameters you might want to control, including playback latency, sample rate, and bit depth, as well as the frequent need to convert audio between formats. Add in the problems of repositioning streams, synchronizing multiple streams, supporting multiple audio driver APIs, etc., and soon you have an audio engine as complicated as any other.
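As a toy illustration of that "periodically read from a stream and hand it to the soundcard" idea, here is a hedged sketch in Python using the sounddevice library (an assumption on my part; RtAudio or any PortAudio binding follows the same callback pattern):

```python
# Toy "audio engine": the soundcard callback pulls samples from every registered
# stream and mixes them. Assumes sounddevice and numpy are installed.
import numpy as np
import sounddevice as sd

SR = 44100

class AudioEngine:
    def __init__(self):
        self.streams = []                    # each stream: callable(frames) -> float32 array

    def play_stream(self, stream):           # the PlayStream(myStream) from above
        self.streams.append(stream)

    def _callback(self, outdata, frames, time, status):
        mix = np.zeros(frames, dtype=np.float32)
        for stream in self.streams:
            mix += stream(frames)
        outdata[:, 0] = mix                  # mono output

    def run(self, seconds):
        with sd.OutputStream(samplerate=SR, channels=1, callback=self._callback):
            sd.sleep(int(seconds * 1000))

def sine(freq, amp=0.2):
    phase = [0.0]
    def pull(frames):
        t = (np.arange(frames) + phase[0]) / SR
        phase[0] += frames
        return (amp * np.sin(2 * np.pi * freq * t)).astype(np.float32)
    return pull

engine = AudioEngine()
engine.play_stream(sine(440.0))
engine.run(2.0)
```

Everything listed above (latency control, format conversion, repositioning and synchronizing streams, multiple driver APIs) is exactly what turns a toy like this into a real engine.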
Thank you for your answers.
To Mark Heath:
Yes, of course I know that there might be a lot of parameters to tweak, be it filter cutoff, resonance, delay timing, etc.
I was just curious how to build an audio engine that is as simple and as modular as possible. The main intention I had in mind was to rebuild the Game Boy sound chip (again, there are already a lot of implementations, e.g. JavaBoy).
To smcameron:
It seems that ccan/wwviaudio depends on libvorbis / portaudio (version >= 19), which would yield much the same effect as using RtAudio (which, compared to other realtime audio interfaces with built-in ASIO support, is rather small). However, I will give it a try.
regards,
audax