Audio content analysis for online audiovisual data - audio

I want to work on a project where I have to segment and classify online audiovisual data based on its audio content, i.e. different parts of the audio visual data will be segmented and classified as silence, music, speech, speech+background music, etc based on their audio content.
I am aware that I have to obtain the audio part from the audiovisual data and extract features like zero crossing, spectral peaks, etc. and find out segment boundaries in order to segment audio data.
But I'm lost in the beginning itself.
I do not know how to start off with the project. The output of the software are segments of audiovisual data under different categories like silence, speech, music, etc.
It will be really helpful if someone lets me know
Which programming language is convenient for this purpose?
What steps should i follow in order to develop this software?
I have no background in digital signal processing. It'll be really helpful if I get some guidance

I'd suggest to look into a multimedia framework such as GStreamer. It is crossplatform, but the easiest to get started on Linux where it originates from. It already comes with all kind of plugins to receve, demux and decode audio and video. It also has a couple of analyzers (such as level and spectrum analyzers for audio as well as voice activity detection). Those could be a good starting point for your experiments. Gstreamer itself is written in C, but applications can use the language bindings to python, perl, c#, c++, java, ...

Related

APCS final project: Converting an audio file to a simpler MIDI file

Lets say I have the audio file for Happy Birthday. I want to convert that audio file into an audio file that sounds like this : happy birthday.
First, I'd like to know if I have the ability to program this? Can a highschooler who's almost finished with APCS program this?
If I can:
How would I change the bpm of the song? I've searched through a bunch of websites, but they weren't very helpful.
I know that audio files can be represented in waveforms. How would I scan for each individual wave in an audio file (I need this to isolate the notes)?
This is a very ambitious project, actually. One reason is that it involves using digital signal processing tools like FFT (Fast fourier transforms) to analyze the sound to pick out the pitches. You might be able to find a library that can do this, but as far as coding such a tool, that would involve a steep learning curve.
If you would like to look further into this, there is a good online resource called "The Scientists and Engineers Guide to Digital Signal Processing". I was able to work through and understand the discrete fourier transform with only high school math (lots of trig) and a bit of calculus. It was a lift, though.
Trying to analyze rhythm is also no easy task. Even with advanced tools provided in professional notation system such as Finale, people have trouble playing rhythms in time well enough for the best transcription tools. Algorithms that "quantize" the beats help but also limit the amount of detail that can be included in the playback.
My guess is that as interesting and worthwhile as this project would be, to bring it to completion before the semester ends would require putting together prebuilt pieces. A lot of programming is done that way, these days.
If you scale the project back to something like just getting your code to analyze a short sample of a single note and give its pitch, that would be both impressive and doable with a lot of work. It could be done with a DFT algorithm instead of requiring FFT, reducing the amount of info you'd have to acquire first. That way, you'd only have to work your way up to understanding and implementing the material on this link which is about calculating the DFT. Notice that there is example code in BASIC. The code examples throughout this book are a big help.

Audio signal comparison from Radar

I an working on a student project. We have a radar that gives an audio detection using headphones to indicate the type of target. Target types are (eg car/truck/man). Radar distinguishes between these targets based on doppler variation, down converts this into audible range and operator can hear it through headphone. System has provided sample audio files corresponding to each type of target(man/car/truck) to train the operator to know as to what he is hearing when live signal is fed and accordingly decide what target it is.
I intend that a software can do the job of this operator.
I want to compare live audio signal input from Radar with 7 different test audio files and want the software to tell me which file matches the input.
kindly educate me .... can these audio fingerprinting softwares do my job.
What you're trying to be done can be implemented in GNU Radio, in a lot of ways.
You could, for example, take the audio signal as input to an audio source, connect that to a set of xlating FIR filters, which you'd design using the gr_filter_design tool; you then would estimate the (potentially decimated) signal in these bands by converting the complex samples to their power (complex to Mag^2) and would then further low-pass and decimate, to then select the band with the highest energy. All this can be done in a nice graphical way in the GNU Radio Companion (gnuradio-companion), which will then generate Python code, which is used to set up the signal flow graph based on the C++ GNU Radio framework.
I recommend you read the Guided Tutorials and see where you get from there.

Basics of Digital Audio

I have recently started going through sound card drivers in Linux[ALSA].
Can a link or reference be suggested where I can get good basics of Audio like :
Sampling rate,bit size etc.
I want to know exactly how samples are stored in Audio files on a computer and reverse of this which is how samples(numbers) are played back.
The Audacity tutorial is a good place to start. Another introduction that covers similar ground. The PureData tutorial at flossmanuals is also a good starting point. Wikipedia is a good source once you have the basics down.
Audio is input into a computer via an analog-to-digital converter (ADC). Digital audio is output via a digital-to-analog converter (DAC).
Sample rate is the number of times per second at which the analog signal is measured and stored digitally. You can think of the sample rate as the time resolution of an audio signal. Bit size is the number of bits used to store each sample. You can think of it as analogous to the color depth of an image pixel.
David Cottle's SuperCollider book also has a great introduction to digital audio.
I was in the same situation, and certainly this kind of information is out there but you need to do some research first. This is what I have found:
Digital Audio processing is a branch of DSP (Digital Signal
Processing).
DSP is one of the most powerful technologies that will
shape science and engineering in the twenty-first century.
Revolutionary changes have already been made in a broad range of
fields: communications, medical imaging, radar & sonar, high fidelity
music reproduction, and oil prospecting, to name just a few. Each of
these areas has developed a deep DSP technology, with its own
algorithms, mathematics, and specialized techniques…
This quote was taken from a very helpful guide that covers every topic in depth called the “The Scientist and Engineer's Guide to
Digital Signal Processing”. And though you are not asking for DSP specifically there’s a chapter that covers all digital audio related topics with a very good explanation.
You can find it in the chapter 22 - Audio Processing, and covers all this topics:
Human Hearing: how the sound is perceived by our ears, this is the
basis of how then the sound is then generated artificially.
Timbre: explains the properties of sound, like loudness, pitch and
timbre.
Sound Quality vs. Data Rate: once you know the previous concepts
we start to translate it to the electronic side.
High Fidelity Audio: gives you a picture of how sound is then
processed digitally.
Companding: here you can find how sound is then processed and
compressed for telecommunications.
Speech Synthesis and Recognition: More processes applied to the
sound, like filters, synthesis, etc.
Nonlinear Audio Processing: this is more advanced but understandable,
for sound treatment and other topics.
It explains the basics of sound in the real world, in case you might want to take a look, and then it explains how the sound is processed in the computer including what you are asking for.
But there are other topics that can be found in wikipedia that are more specific, let’s say the “Digital audio” page that explains every detail of this topic, this site can be used as a reference for further research, just in the beginning you can find a few links to sample rate, sound waves, digital forms, standards, bit depth, telecommunications, etc. There are a few things you might need to study more, like the nyquist-shannon theorem, fourier transforms, complex numbers and so on, but this is only used in very specific and advanced topics that you might not review or use. But I mention it just in case you are interested. You can find information in both the DSP guide book and wikipedia although you need to study some math.
I’ve been using python to develop and study these subjects with code since it has a lot of useful libraries, like numpy, sound device, scipy, etc. And then you can start plating with sound. On youtube you can find lots of videos that also guide you on how to do this. I’ve found synthesis, filters, voice recognition, you can create wav files with just code, which is great. But also I’ve seen projects in C/C++, Javascript, and other languages, so it might help you to keep learning and coding fun things.
There are a few other references across the internet but you might need to know what you are looking for, this book and the wikipedia page would be the best starting points for me, since it gives you the basics and explains in depth every topic. Then depending on the goal you want to achieve you can then start looking for more information.

How to convert human voice into digital format?

I am working on a project where biometric system is used to secure the system. We are planning to use human voice to secure the system.
Idea is to allow the person to say some words or sentences and system will store that voice in digital format. Next time person wants to enter the system, he/she has to speak some words which may or may not be different from the words used earlier.
We don't want to match words but want to match voice frequency.
I have read some research papers regarding this system but those papers don't have any implementation details.
So just want to know whether there is any software/API which can convert analog voice into digital format and will also tell us the frequency of voice.
Until now I was working on normal web based applications so I know normal APIs and platforms like Java EE, C#, etc but I don't have any experience about this kind of application.
Please enlighten !!!
http://www.loquendo.com/en/products/speaker-verification/
http://www.nuance.com/for-business/by-solution/contact-center-customer-care/cccc-solutions-services/verifier/index.htm
(two links removed due to reported virus content)
http://www.persay.com/products.asp
This is as good a starting point as any : http://marsyas.info/
It's a open source software framework for audio processing. They've listed a bunch of projects that have used their framework in various ways so you could probably draw inspiration from it.
http://marsyas.info/about/projects. The Telligence project in particular seems the closest to your needs as it it was used to gender classify audio : http://marsyas.info/about/projects#5Teligence
There are two steps on a project like this one I believe:
First step would be to record the voice from an analog input into digital format (let's assume wav-pcm). For this you can use DirectShow API in C#, or standard Wav-In as in this project: http://www.codeproject.com/KB/audio-video/cswavrec.aspx. You may consider compressing your audio files later on, there are many options for this, in Windows you may consider Windows Media Format SDK to avoid licensing issues with other formats.
Second step is to build or use a voice recognition framework, if you want to build a recognition framework you will probably need to define a set of "features" for your sound fragments and select+implement a recognition algorithm. There are many aproaches available for this, IEEE amd ACM.org websties are usually good sources. If you want to use an existing framework you may want to consider Nuance Recognizer (commercial) or http://cmusphinx.sourceforge.net (open source).
Hope this helps.

Library for extracting words (speech) out from audio stream?

I have an audio stream and I would extract words (speech) from it. So for example having audio.wav I would get 001.wav, 002.wav, 003.wav, etc where each XXX.wav is one word.
I am looking for a library or program to do it -- platform does not matter, but I prefer open-source solution.
Thank you in advance for help.
Nuance, the company that makes Dragon Naturally Speaking, has a number of Software Development Kits.
The Audio Mining kit seems to match your requirements:
Dragon NaturallySpeaking SDK
AudioMining is a speaker-independent
speech recognition toolkit that
enables the indexing of 100% of the
speech information within audio files.
The technology uses highly accurate
speech recognition to turn audio files
into XML text with timestamp
information. This can be integrated
with standard text-search products to
enable rapid access to specific audio
content.
The speech to speech+metadata is far and away the hardest part to get right. Once you have the speech + metadata, extracting the words as individual audio files is much more straightforward.

Resources