Algorithm to remove vocal from audio file

I know this was already asked more than 10 years ago, but I want to believe that some progress has been made since then (we have deepfakes nowadays, so AI has clearly come a long way).
I tried some tutorials with Audacity but was disappointed with the result (to be fair, the output is not that bad, but it is not good enough for production).
What reputable algorithm could I use to process an MP3 file myself and remove the vocals, including vocal echo, while preserving the drums and centered instruments?

This task is known in the community as "Vocal Source Separation", "Vocal Signal Separation", or "Singing Voice Source Separation". These are specialized forms of "Music Source Separation", which is itself an instance of the more general "Source Separation" task.
Here are some papers: Music Source Separation.
One of the most actively developed open source solutions is Spleeter, which has been used commercially in various audio products. There is an online tool based on it that you can try out at Splitter.ai. The "2 stem" version will give you one track with vocals and one track with everything else.
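For context on why the older, non-learned approach tends to disappoint: the classic "vocal remover" trick subtracts one stereo channel from the other, which cancels anything mixed identically into both. The sketch below is only an illustration of that idea on toy data (the sample values and function name are made up, and this is not what Spleeter does). It shows that centered drums vanish along with the vocal, which is exactly what the question wants to avoid.

```python
def cancel_center(left, right):
    """Subtract the right channel from the left, sample by sample.

    Anything mixed identically into both channels (the "center") cancels out.
    """
    return [(a - b) / 2.0 for a, b in zip(left, right)]


# Toy stereo mix (made-up values): vocal and kick drum are centered,
# the guitar is present in the left channel only.
vocal = [0.5, -0.2, 0.3, 0.1]
kick = [0.9, 0.0, 0.0, 0.0]
guitar = [0.1, 0.2, -0.1, -0.2]

left = [v + k + g for v, k, g in zip(vocal, kick, guitar)]
right = [v + k for v, k in zip(vocal, kick)]

# Only the guitar survives (at half level); the kick is gone with the vocal.
residual = cancel_center(left, right)
```

Learned separation models avoid this limitation because they do not rely on stereo placement alone.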

Related

APCS final project: Converting an audio file to a simpler MIDI file

Let's say I have the audio file for Happy Birthday. I want to convert that audio file into an audio file that sounds like this: happy birthday.
First, I'd like to know whether I have the ability to program this. Can a high schooler who's almost finished with APCS program this?
If I can:
How would I change the bpm of the song? I've searched through a bunch of websites, but they weren't very helpful.
I know that audio files can be represented in waveforms. How would I scan for each individual wave in an audio file (I need this to isolate the notes)?

This is a very ambitious project, actually. One reason is that it involves using digital signal processing tools like the FFT (fast Fourier transform) to analyze the sound and pick out the pitches. You might be able to find a library that can do this, but coding such a tool yourself would involve a steep learning curve.
If you would like to look further into this, there is a good online resource called "The Scientist and Engineer's Guide to Digital Signal Processing". I was able to work through and understand the discrete Fourier transform with only high school math (lots of trig) and a bit of calculus. It was a lift, though.
Trying to analyze rhythm is also no easy task. Even with the advanced tools provided in professional notation systems such as Finale, people have trouble playing rhythms in time well enough for the best transcription tools. Algorithms that "quantize" the beats help, but they also limit the amount of detail that can be included in the playback.
My guess is that as interesting and worthwhile as this project would be, to bring it to completion before the semester ends would require putting together prebuilt pieces. A lot of programming is done that way, these days.
If you scale the project back to something like getting your code to analyze a short sample of a single note and report its pitch, that would be both impressive and doable with a lot of work. It could be done with a plain DFT algorithm instead of an FFT, reducing the amount of material you'd have to learn first. That way, you'd only have to work your way up to understanding and implementing the material at this link, which covers calculating the DFT. Notice that there is example code in BASIC. The code examples throughout this book are a big help.
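As a rough idea of what that scaled-back version could look like, here is a minimal sketch of a naive DFT that reports the strongest frequency in a short clip. It is not taken from the book; the function names, the 8000 Hz sample rate, and the synthetic 440 Hz test tone are all assumptions chosen for illustration.

```python
import math


def dft_magnitudes(samples):
    """Naive O(N^2) discrete Fourier transform; magnitudes for bins 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest bin, ignoring the DC bin 0."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)


# Synthetic single note: a 440 Hz sine sampled at 8000 Hz.
# With 400 samples each bin is 20 Hz wide, so 440 Hz lands exactly on bin 22.
sample_rate = 8000
note = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
print(dominant_frequency(note, sample_rate))  # prints 440.0
```

A real recording will rarely land exactly on a bin, so the reported pitch is only accurate to within the bin width (sample rate divided by the number of samples).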

Realtime Sound Routing...Trigger a Sound with Another Sound

I'm looking for a program that can recognize individual audio samples from my computer and reroute them to trigger WAV files from a library. In my project it would need to work in real time, since latency would be undesirable. I tried using dictation software that recognizes words to trigger opening a file, and that's the direction I want to go in, but instead of words I want it to react to sounds, and in real time. I'm not sure where to go and am just looking for some guidance. Does anyone have any suggestions of what I should do?

That's a fairly broad question, but I can tell you how I would do it. (Hardly the only way, but where I would start.)
If you're looking for real time input, the Java Sound library (excellent tutorial here) allows for that. (Just note that microphone input from a web page is difficult on anything, due to major security concerns, so this would be a desktop application.)
If it needs to be real time, the first thing I would suggest is stream and multithread the hell out of it. I would suggest the Java 8 Stream API, but since you're looking for subsamples that match a specific pattern, then each data point will have to be aware of the state of its neighbors, and that isn't easy with streams.
You will probably want to know if a sound roughly resembles an audio profile, so for that, I would pick a tolerance on just how close you want it to be for a match (remembering that samples may not line up 100% anyway, so "exact" is not an option), and then look up Hidden Markov Models. I suggest these because they're what voice recognition software typically uses, and while your sounds may not be voices, it will give you an idea of what has already been done.
You'll also want to maintain a limited list of audio samples in memory. Specifically, you will likely need the most recent data, because an audio signal is a time-variant signal, and you can't get a match from just one point. I wouldn't make it much longer than the longest sample you're looking to recognize, as audio takes up a boatload of memory.
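To make the "limited list of recent samples plus a tolerance" idea concrete, here is a minimal sketch. It uses Python rather than Java for brevity, and it swaps in plain normalized correlation instead of a Hidden Markov Model, so treat it as an illustration of the buffering and thresholding only. The class name, the 0.9 threshold, and the toy data are all assumptions.

```python
import math
from collections import deque


def similarity(window, template):
    """Normalized correlation in [-1, 1]; 1.0 means identical shape."""
    dot = sum(a * b for a, b in zip(window, template))
    norm = math.sqrt(sum(a * a for a in window) * sum(b * b for b in template))
    return dot / norm if norm else 0.0


class SoundMatcher:
    """Keeps only the most recent samples, no longer than the template."""

    def __init__(self, template, threshold=0.9):
        self.template = list(template)
        self.threshold = threshold  # tolerance: how close counts as a match
        self.recent = deque(maxlen=len(self.template))

    def feed(self, sample):
        """Add one sample; return True when the recent window resembles the template."""
        self.recent.append(sample)
        if len(self.recent) < len(self.template):
            return False
        return similarity(self.recent, self.template) >= self.threshold


# Toy data: the stream contains a slightly distorted copy of the template.
template = [0.0, 1.0, 0.0, -1.0]
matcher = SoundMatcher(template)
stream = [0.1, 0.0, 0.0, 0.0, 0.9, 0.1, -1.1, 0.0]
hits = [i for i, s in enumerate(stream) if matcher.feed(s)]
print(hits)  # prints [6]
```

The `deque` with `maxlen` drops the oldest sample automatically, which keeps memory bounded to the length of the sound being looked for.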
Lastly (for audio), I would recommend picking a standard format for comparison. Make it as good as gets you decent results, and start high. You will want to convert everything to that format before you compare it.
Once you recognize a specific sound, it's basically a Command Pattern. Specific sounds can be mapped, even with a java.util.HashMap, to specific files, which (if there are few enough) you might even have pre-loaded.
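The mapping step could be sketched like this, with a Python dict standing in for java.util.HashMap. The labels, file paths, and the `play` callback are hypothetical placeholders; a real program would hand the path to whatever playback library it uses.

```python
# Hypothetical mapping from recognized sound labels to WAV files.
sound_to_file = {
    "clap": "samples/snare.wav",
    "snap": "samples/hihat.wav",
}


def make_dispatcher(mapping, play):
    """Return a handler that plays the file mapped to a recognized label."""
    def on_recognized(label):
        path = mapping.get(label)
        if path is None:
            return False  # unknown sound: nothing to trigger
        play(path)
        return True
    return on_recognized


# For the demo, "playing" just records which file would have been triggered.
played = []
dispatch = make_dispatcher(sound_to_file, played.append)
dispatch("clap")
dispatch("cough")
print(played)  # prints ['samples/snare.wav']
```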
Lastly, it's worth looking at the Java Speech API. It's not part of the JDK and it's quite dated, but you might get some good advice from its implementation.
This is of course the advice of a Java-preferring programmer, but I imagine that there might be some decent libraries in Python and Ruby to help you as well; and of course there's something in C somewhere. This may sound like a lot, but most of the material is already implemented and ready-to-go.
Hopefully this helps. Let's look forward to other answers.

Can FFT be used to find drum solos/breaks in audio files?

Is it possible with FFT to find a drum solo, or a drum break, in an audio file? Is this something FFT is able to do and are there any resources online that could aid me with learning?

In general, an FFT is not a good choice for detecting the onset of percussion sounds:
An FFT is always calculated over a window of samples (in effect a period of time) and yields the magnitude and phase offset of the signal within each bin. You can therefore determine that there is signal in a particular bin, but not its onset time. The best time resolution available is the window period. Of course, you can make the period shorter at the expense of frequency resolution.
Percussion sounds tend to look like noise and spread across the spectrum. This would be OK if you only had percussion sounds, but it is not great in real-life polyphonic content.
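The trade-off between time and frequency resolution is simple arithmetic: one window of N samples spans N / sample_rate seconds, and each bin is sample_rate / N Hz wide. A small sketch, assuming a 44100 Hz sample rate and a helper name made up for illustration:

```python
def fft_resolution(window_size, sample_rate):
    """Return (time span of one window in seconds, width of one bin in Hz)."""
    return window_size / sample_rate, sample_rate / window_size


for n in (256, 1024, 4096):
    seconds, hertz = fft_resolution(n, 44100)
    print(f"N={n:5d}  window={seconds * 1000:6.1f} ms  bin width={hertz:6.1f} Hz")
```

At N = 1024 the window is roughly 23 ms long with bins roughly 43 Hz wide. The product of the two quantities is always 1, so halving one doubles the other.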
However, you might be able to find some inference from the different characteristics of the spectra of a drum solo vs instrumental sections of a track.
The problem of finding the time at which percussion sounds start in music is described in academic journals as onset detection, and it is one of the many techniques used for feature extraction; the wider field is known as Music Information Retrieval (MIR). Your problem sounds like one of identifying sections in audio files, which might be described as partitioning.
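To give a feel for what onset detection means in practice, here is a minimal time-domain sketch that flags frames whose energy jumps sharply relative to the previous frame. This is one of the simplest possible detectors and is only an illustration; the function names, the ratio of 4.0, and the toy signal are assumptions, and real MIR tools use more robust measures such as spectral flux.

```python
def frame_energies(samples, frame_size):
    """Sum of squared samples for each non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def detect_onsets(samples, frame_size, ratio=4.0, floor=1e-6):
    """Start indices of frames whose energy exceeds `ratio` times the previous frame's."""
    energies = frame_energies(samples, frame_size)
    onsets = []
    for i in range(1, len(energies)):
        previous = max(energies[i - 1], floor)  # avoid comparing against exact zero
        if energies[i] > floor and energies[i] > ratio * previous:
            onsets.append(i * frame_size)
    return onsets


# Toy signal: silence, a decaying hit, silence, another decaying hit.
signal = ([0.0] * 8 + [1.0, -0.8, 0.6, -0.4]
          + [0.0] * 8 + [0.9, -0.7, 0.5, -0.3])
print(detect_onsets(signal, frame_size=4))  # prints [8, 20]
```

The time resolution here is one frame (four samples in the toy example), with no frequency information at all, which is the opposite end of the trade-off described above.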
A good place to start is Sonic Visualiser which is a tool written specifically for MIR applications. Plug-ins exist for various types of feature extraction. From these you will be able to easily find the large body of academic work in this area. There is an added bonus that the existing plug-ins are all open source too.

I'd look here, there was a bit of discussion with great pointers on the Gamedev SE: https://gamedev.stackexchange.com/questions/9761/beat-detection-and-fft :-)

Choosing an audio API

I'm struggling to choose between a vast number of audio programming languages and APIs. I'm very (totally) new to audio programming so please bear with me.
Software
I need to be able to:
Alter volume of different sounds before outputting them to anything (these sounds can have a variety of different origins, for example mp3s and microphone input)
phase shift sounds
superimpose sounds that I have tweaked (as per items 1 and 2)
control the output to each of 8 channels independently of one another
make this all happen on Windows 7
These capabilities need to be abstracted by a graphical frontend that I will probably make myself. What I want to be able to do is create 'sound sources' and move them around a 3D environment along pre-defined trajectories and/or in relation to the movement of whoever is inside the rig. The reason I want to do pitch bending is so I can mess with red-shift stuff.
I don't want to have to construct full tracks before-hand and just play them. I want the sound that is played to depend on external input from sensors as well as what I am doing on the frontend.
As far as I know, this means I can't use any existing full audio-making app.
The Question
I've been looking around for the API or language I should use, and I have not drawn a blank; quite the opposite, actually. I'm struggling to narrow down my search. A lot of my problem stems from the fact that I have no experience in audio programming.
So, does anyone know off-hand of an API or language that meets my criteria?
Hardware stuff and goals
(I left this until last because I'm not sure how relevant it is)
My goal is to make three rings of speakers at different heights and to have enough control over them to be able to simulate any number of 'sound sources' within the array. The idea is to have someone stand in the middle of the rig and be able to make it sound like there are lots of things moving around them. To get this working I'm planning on doing a little trig and using 8 channels of audio from my PC. The maths is pretty straightforward; it's just the rest that I need to worry about.
What I want to do next is attach a bunch of cameras to the thing and do some simple image recognition to be able to 'attach sound sources' to different objects. E.g., if someone is standing in the right place, it can be made to seem as though all red balls quack like a duck and all orange ones moan hauntingly.

This is not to detract from Richard Small's answer, but to comment on some of the other options out there:
If you are looking for something higher-level with which you can prototype and develop this faster, you want Max/MSP or its open source competitor Pure Data. These are designed for musicians who are technically minded, but not so much for programmers. As a result, you can build this sort of thing quickly and efficiently.
You also have some lower-level options. PortAudio can handle your audio I/O, but you would have to do the sound generation, effects, and so on yourself or with other libraries. Cinder and openFrameworks both provide interfaces for audio, cameras, and other stuff for "creative programming". I'm afraid I don't know if they meet your full requirements, but they are powerful and popular for this sort of thing, so I encourage you to look at them.

The two major ones these days tend to be:
Wwise
FMOD
These two engines may even in fact be overkill for what you need, but I can almost guarantee that they will be capable of anything you require.

Audio support for programming languages

I want to start on a hobby project that focuses on displaying the audio files in a folder in a certain fashion, and that can play such an audio file and show basic playback controls. However, I'm struggling to find a suitable programming language for this.
The displaying part shouldn't be too hard and can probably be done in most programming languages. The audio part is what concerns me the most, since it's not the main focus of the project and should only do limited things (so it shouldn't be too hard), and I do not know anything about sound support in the programming languages I currently know (Java, C and C++).
Specifically i would like to be able to do these things:
Play a sound file
Stop/pause a playing song
Adjust volume
Show a bar that displays the current position in the song
Most files will be .mp3 files, but being able to process other formats is certainly a plus. Since this is just a small project, it's OK if it runs only on Windows. Scalability would be nice but is not required.
It would be nice to have a small overview of the audio support and audio libraries of programming languages (I'm always up for something new) that can accomplish these simple things in a not too complicated way, as well as personal experiences.
In this way I hope to gain a better understanding of which programming language fits my project best. (I would very much like to not have to change language midway through the project.)
--
Edit:
This is only for a later stage of the project, if the first part is successful: I will want to change the file names of the audio files that are displayed (to make them follow a specific format).

I haven't written many audio processing programs, but I know a lot of libraries exist for C and C++. For Java perhaps, too, but I don't know Java. I have used audio with SDL in a game, but that doesn't have many features and I don't recommend it.
There's this question asking for a library in C, and there are a couple of similar questions that SO brings up on the side. You may want to take a look at those.
You would also need to look for a library that loads different file types. SDL, at least, only opens .wav files, which I believe most playback libraries support. For MP3, you will most likely need an additional library. I know Audacity uses LAME for MP3, so I'm guessing that should be good.
Some of the functionality you want is also doable by yourself. For example, knowing the length of the music and the amount you have already read, you will know how far into the audio you are. Adjusting the volume is also just a multiplication (in the simplest case) that you can apply to the audio data if the library doesn't provide it.
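Both of those do-it-yourself pieces are only a few lines. The sketch below assumes signed 16-bit PCM samples and uses Python for brevity; the function names and sample values are made up for illustration, and the same arithmetic carries over directly to C, C++, or Java.

```python
def scale_volume(samples, gain):
    """Multiply signed 16-bit samples by `gain`, clipping to the valid range."""
    return [max(-32768, min(32767, int(round(s * gain)))) for s in samples]


def playback_position(frames_read, total_frames):
    """Fraction of the track already played, from 0.0 to 1.0."""
    return frames_read / total_frames if total_frames else 0.0


quieter = scale_volume([1000, -2000, 30000], 0.5)  # [500, -1000, 15000]
louder = scale_volume([1000, -2000, 30000], 2.0)   # [2000, -4000, 32767], last value clipped

# 30 seconds read out of a 120 second track at 44100 frames per second.
progress = playback_position(44100 * 30, 44100 * 120)  # 0.25
```

The clipping matters: without it, a gain above 1.0 can overflow the 16-bit range and produce harsh distortion.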
A very good choice seems to be PortAudio which is used by Audacity, and also recommended in the accepted answer of the question I mentioned above.

I've done audio apps in both Java and C++. Java development goes way faster because it's a more powerful language and has garbage collection, but JavaSound is a pretty awful solution for audio. Of course, there are wrappers for FFmpeg and other stuff, so you can get a lot of things working. Here's an example of a Java audio app: http://www.indabamusic.com/help/mantis
OTOH, C++ gives you lots of control, low latency, and a wealth of libraries. (Another answer mentioned PortAudio, which is, indeed, great.) But you will definitely find it also has a much longer development cycle.
You can certainly do everything you want to do with either language.
