How to compare two audio files to check if they have a similar sound - audio

Suppose I have two short audio files that each contain a sound. The first file has the sound 'hello' (audio 1) and the second has 'bye' (audio 2), both spoken by the same person. A third file also has 'hello' (audio 3), spoken by the same person but in a different recording.
How can I detect that audio 3 is similar to audio 1 (irrespective of the speaker)? I'm dealing with sounds in general, not only speech, so there could just as well be a whistle in place of the words.

You would have to program a statistical analysis of each file, then use pattern matching to determine the level of similarity between them.
For spoken words, the simplest solution would be to license the API version of a speech engine such as Dragon, convert the audio files to text, and compare the text output. That route only helps with speech, not with sounds such as a whistle.
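One possible reading of "statistical analysis plus pattern matching" is sketched below. It is an illustration only, not the method the answer had in mind: each signal is cut into frames, two simple statistics are computed per frame (RMS energy and zero-crossing rate), and the two frame sequences are compared with dynamic time warping so that recordings of different lengths can still line up. The function names, the frame length, and the synthetic test tones are all assumptions; real recordings would first have to be decoded into lists of samples, and richer features would probably be needed for speech.

```python
import math

def frame_features(samples, frame_len=256):
    """Return one (rms, zero_crossing_rate) pair per non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((rms, crossings / (frame_len - 1)))
    return feats

def dtw_distance(a, b):
    """Dynamic time warping cost between two feature sequences.

    The total cost is divided by len(a) + len(b) so that long and short
    recordings give comparable numbers. Lower means more similar.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def tone(freq, n_samples, rate=8000):
    """Synthetic stand-in for a decoded recording (hypothetical test data)."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n_samples)]

audio1 = frame_features(tone(440, 4096))   # "hello"
audio2 = frame_features(tone(1500, 4096))  # "bye"
audio3 = frame_features(tone(440, 5120))   # "hello" again, different length

print(dtw_distance(audio1, audio3) < dtw_distance(audio1, audio2))
```

Deciding whether two sounds count as "similar" would then come down to a threshold on the distance, which has to be tuned on real recordings.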

Related

Create an SRT file from mp3 + text

TL;DR: what approach would you take to match an existing text file to an mp3 and generate an SRT file from the result?
The situation is this:
I have people reading over text word by word
They upload the mp3 files into our system
Then the mp3 shall have SRT files so we can highlight the text when the voice is played over it.
I know there is some speech recognition software out there (still with mediocre results), but this is a bit different: we only need to match the audio against text we already have.
How would you approach this problem? Any idea?
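Whatever tool does the matching, the last step is mechanical. A minimal sketch, assuming the alignment step yields a list of (word, start_seconds, end_seconds) tuples: the function names and the sample timings are hypothetical, and only the SRT cue layout (a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the text, a blank line) is fixed by the format.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm form used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def words_to_srt(timed_words):
    """Build SRT text from (word, start, end) tuples, one cue per word."""
    cues = []
    for index, (word, start, end) in enumerate(timed_words, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{word}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)

# Hypothetical output of an alignment step.
aligned = [("hello", 0.0, 0.5), ("world", 0.5, 1.25)]
print(words_to_srt(aligned))
```

One cue per word suits word-by-word highlighting; grouping several words into one cue would only change the loop.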
Cheers from Sai Gon, Vietnam
Till
P.S. I started using Stackoverflow a couple of years ago but just had to create a new account linked to my Google account.

Is a MIDI file defined as a sound file type?

I've just finished an assessment on Data Structures where one of the questions was about MIDI files. The descriptor (a criterion for assessors) states the following:
"Candidates will need to demonstrate their knowledge... by showing they can describe:
At least two Standard File types for images, sounds, video or compression."
Does a MIDI file fall under any of these categories? From what I have searched online, it doesn't. So I am somewhat bemused at a question regarding MIDI files being in this assessment.
Cheers
Not really, no. MIDI tells a synthesizer how to make sounds, and what you can make is constrained by the synthesizer. It cannot be used to represent arbitrary sound data, although it has been used, in a way not originally intended, to send small sound samples to a synthesizer.
Compressed sound formats are an entirely different thing from MIDI.
(By the way, the quoted question needs some commas to make sense.)
When I send a MIDI file to my synthesizer, I expect it to produce sounds.

Creating preview audio clips from m4a files

Another post here answered the question of creating 30 second preview clips from WAV audio files (Create mp3 previews from wav and aiff files). My needs overlap with that post, but the details that differ are beyond my knowledge.
Requirements/Options: clip length; beginning & ending fade length; input filetypes: m4a/AAC/AIFF; output filetype: mp3; kbps (e.g. 192); original files unaltered; suffix new mp3 names with " (Preview)"
Limitations: no uploading of original files to a server (desktop processing); no compiling (unix/Terminal/Bash script only); recursive processing of files in sub-directories
Any/All assistance and advice is welcome.
You'll most likely get the best results with a DAW (digital audio workstation) and an audio file converter.
For a DAW, Reaper comes with a 60 day trial, and it has everything you need to cut the songs where you need and to do fade ins/fade outs, and other effects if you'd like.
www.reaper.fm
Simply use a converter to convert the m4a file to .wav, .mp3 or whatever you prefer, and then if you need it back in m4a, convert it back. I say this because some DAWs can't work with m4a files; if the one you choose can, no conversion is necessary.
There are many options for both the DAW and the converter. I recommend Reaper as the DAW; most converters do essentially the same thing, so it doesn't make much difference which one you choose.
Hope this helps!
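For the scripted, recursive route the question asks about, a command-line encoder is an alternative to a DAW. The sketch below is one way to do it, assuming ffmpeg is installed with an MP3 encoder; it is not what the answer above recommends. It only builds the ffmpeg argument list per file, using the `-t`, `-af afade`, `-b:a`, `-vn` and `-n` options, and leaves the originals untouched. The function names and defaults are hypothetical, and files shorter than the clip length are not handled (the fade-out would start past the end).

```python
import subprocess
from pathlib import Path

INPUT_SUFFIXES = {".m4a", ".aac", ".aif", ".aiff"}

def preview_command(src, clip_seconds=30, fade_seconds=2, kbps=192):
    """Return the ffmpeg argument list that writes '<name> (Preview).mp3'."""
    src = Path(src)
    dst = src.with_name(f"{src.stem} (Preview).mp3")
    fade_out_start = clip_seconds - fade_seconds
    filters = (
        f"afade=t=in:st=0:d={fade_seconds},"
        f"afade=t=out:st={fade_out_start}:d={fade_seconds}"
    )
    return [
        "ffmpeg", "-n",            # -n: never overwrite an existing preview
        "-i", str(src),
        "-vn",                     # ignore embedded cover art
        "-t", str(clip_seconds),   # keep only the first clip_seconds
        "-af", filters,
        "-b:a", f"{kbps}k",
        str(dst),
    ]

def find_sources(root):
    """Recursively list input files below root, by file extension."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in INPUT_SUFFIXES
    )

def make_previews(root):
    """Run ffmpeg once per source file (requires ffmpeg on the PATH)."""
    for src in find_sources(root):
        subprocess.run(preview_command(src), check=True)
```

Calling `make_previews("/path/to/music")` from a terminal session would process every matching file in the sub-directories; the path here is a placeholder.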

Efficiently generating a time index of pre-transcribed speech using its audio source and open source tools

On TED.com they have transcriptions and they go to the appropriate section of the video when clicking a part of the transcription.
I want to do this for 80 hours of audios and transcriptions I have, on Linux with OSS.
This is the approach I'm thinking:
Start small with a 30 minute sample
Split the audio up into 2 minute WAV file formatted chunks, even if it breaks words up
Run the phrase spotter from CMU Sphinx's long-audio-aligner on each chunk, with the transcript
Take the time index for identified words/phrases found in each bit and calculate the actual estimated time of the ngrams in the original audio file.
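The last step above is plain arithmetic: a hit found at some offset inside chunk number k (counting from zero) sits at k times the chunk length plus that offset in the original file. A minimal sketch, with a hypothetical function name and the 2 minute chunk length from the plan:

```python
CHUNK_SECONDS = 120  # 2 minute chunks, as in the plan above

def absolute_time(chunk_index, offset_in_chunk, chunk_seconds=CHUNK_SECONDS):
    """Map a time inside a chunk (0-based index) to a time in the original file."""
    return chunk_index * chunk_seconds + offset_in_chunk

# A phrase spotted 12.5 s into the fourth chunk (index 3):
print(absolute_time(3, 12.5))  # 372.5
```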
Does this seem like an efficient approach? Has anyone actually done this?
Are there alternate approaches that are worth trying like dumb word counting that may be accurate enough?
You can just feed all your audio and text into a long audio aligner and it will give you the timestamps of the words. Using these timestamps you can jump to a specific word in a file.
I'm not sure why you want to split your audio or do anything else.
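Once the aligner has produced word timestamps, the TED-style behaviour is a lookup in both directions. A minimal sketch, assuming the aligner output has been loaded as a list of (word, start_seconds) pairs sorted by time; the function names and the sample data are hypothetical:

```python
from bisect import bisect_right

def seek_time(timed_words, word_index):
    """Time to seek to when the reader clicks the word at word_index."""
    return timed_words[word_index][1]

def word_at(timed_words, playback_seconds):
    """Index of the word being spoken at playback_seconds (for highlighting)."""
    starts = [start for _, start in timed_words]
    # Last word whose start time is <= the playback position.
    return max(bisect_right(starts, playback_seconds) - 1, 0)

# Hypothetical aligner output.
words = [("on", 0.0), ("ted", 0.4), ("com", 0.9)]
print(seek_time(words, 1))  # 0.4
print(word_at(words, 0.5))  # 1
```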

Library for extracting words (speech) out from audio stream?

I have an audio stream and I would like to extract the words (speech) from it. For example, given audio.wav I would like to get 001.wav, 002.wav, 003.wav, etc., where each XXX.wav contains one word.
I am looking for a library or program to do it -- platform does not matter, but I prefer open-source solution.
Thank you in advance for help.
Nuance, the company that makes Dragon Naturally Speaking, has a number of Software Development Kits.
The Audio Mining kit seems to match your requirements:
Dragon NaturallySpeaking SDK
AudioMining is a speaker-independent speech recognition toolkit that enables the indexing of 100% of the speech information within audio files. The technology uses highly accurate speech recognition to turn audio files into XML text with timestamp information. This can be integrated with standard text-search products to enable rapid access to specific audio content.
Getting from speech to speech plus metadata (text with timestamps) is by far the hardest part to get right. Once you have that, extracting the words as individual audio files is much more straightforward.
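The straightforward part can be sketched with Python's standard `wave` module. This assumes the recognition step has already produced one (start_seconds, end_seconds) span per word and that the input is an uncompressed WAV file; the function name, the span values, and the output naming pattern are hypothetical.

```python
import wave

def extract_words(src_path, word_spans, out_pattern="{:03d}.wav"):
    """Write one WAV file per (start_seconds, end_seconds) span.

    Returns the list of paths written. Spans must lie inside the file;
    wave raises an error when asked to seek past the end.
    """
    written = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for number, (start, end) in enumerate(word_spans, start=1):
            first = int(start * rate)
            count = max(0, int(end * rate) - first)
            src.setpos(first)
            frames = src.readframes(count)
            out_path = out_pattern.format(number)
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

Called as `extract_words("audio.wav", spans)`, this would write 001.wav, 002.wav, and so on into the current directory.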
