Efficiently generating time index of pre-transcribed speech using it's audio source and open source tools - linux

On TED.com they have transcriptions and they go to the appropriate section of the video when clicking a part of the transcription.
I want to do this for 80 hours of audios and transcriptions I have, on Linux with OSS.
This is the approach I'm thinking:
Start small with a 30 minuite sample
Split the audio up into 2 minute WAV file formatted chunks, even if it breaks words up
Run the phrase spotter from CMU Sphinx's long-audio-aligner on each chunk, with the transcript
Take the time index for identified words/phrases found in each bit and calculate the actual estimated time of the ngrams in the original audio file.
Does this seem like an efficient approach? Has anyone actually done this?
Are there alternate approaches that are worth trying like dumb word counting that may be accurate enough?

You can just feed all your audio and text in a long audio aligner and it will give you the timestamps of the words. Using this timestamps you can jump to the specific word in a file.
I'm not sure why do you want to split your audio or do something else.

Related

Node.js: Is it possible to extract sub-clips from a broader video file based on start/stop time stamps?

I have a flow where iOS app users will record a large video file and upload it to our server. After the fact, the user might want to extract certain portions of that larger video based on specific time stamps and generate a highlight reel that can be viewed and shared locally back on the iOS device.
As a FE developer I don't really have much experience with where to even start here. Our BE will be built in NodeJS. It seems to me that this should be a relatively straightforward problem to solve, but I don't know.
Are there APIs that make movie manipulation easy? Can I easily extract a clip based on a start and stop time and save that as a separate file? Are those costly tasks? Or not too bad?
I'm guessing that the response to this call would be a list of a series of file names that have been generated as a result of these clips being generated, that the iOS app could then pull down and load.
It's not quite as straightforward as it might seem as video files are quite structured with header information and indexing into the individual video and audio tracks and frames. Any splitting up or cropping needs to allow for this and also create new files with the correct headers and indexing etc.
Fortunately, there are indeed libraries that you can use to do this type of thing, one of the most powerful being ffmpeg.
There are projects which allow the ffmpeg command line tool be used programatically - the advantage of this approach is that you get to leverage the vast community knowledge base for ffmpeg command line.
One of the popular ones for nodejs is:
https://github.com/damianociarla/node-ffmpeg
You can then look at the ffmpeg documentation or community answers to find the particularly functionality you need - for example to crop video at a start and end time as you asked:
https://stackoverflow.com/a/42827058/334402
https://superuser.com/a/704118
The general idea is quite simple and will be of the format:
ffmpeg -i yourInputVideo.mp4 -ss 01:30:00 -to 02:30:00 -c copy copy yourNewOutputVideo.mp4
It's worth taking a look at the seeking info in the ffmpeg online documentation (https://ffmpeg.org/ffmpeg.html) to help understand the examples, especially the second one above:
-ss position (input/output)
When used as an input option (before -i), seeks in this input file to position. Note that in most formats it is not possible to seek exactly, so ffmpeg will seek to the closest seek point before position. When transcoding and -accurate_seek is enabled (the default), this extra segment between the seek point and position will be decoded and discarded. When doing stream copy or when -noaccurate_seek is used, it will be preserved.
When used as an output option (before an output url), decodes but discards input until the timestamps reach position.
position must be a time duration specification, see (ffmpeg-utils)the Time duration section in the ffmpeg-utils(1) manual.

How to compare two audio files to check if they have a similar sound

I have a situation here. Suppose I have two short audio files which contains some sounds. Suppose, first file has sound 'hello'(audio 1) and second file has 'bye'(audio 2) spoken by someone. There is another audio file which has 'hello'(audio 3) spoken by the same person but is a different recording.
How can I detect that audio 3 is similar to audio 1 (irrespective of the speaker)? I'm here dealing with sounds and not only speech. So there can be a whistle sound also in place of the words.
You would have to program a statistical analysis of each file, then use pattern matching to determine the level of similarity between them.
The simplest solution for words would be to license an api version of a speech engine such as Dragon, then convert the audio files to text output and compare them.

Hashing raw audio data

I'm looking for a solution to this task: I want to open any audio file (MP3,FLAC,WAV), then proceed it to the extracted form and hash this data. The thing is: I don't know how to get this extracted audio data. DirectX could do the job, right? And also, I suppose if I have fo example two MP3 files, both 320kbps and only ID3 tags differ and there's a garbage inside on of the files mixed with audio data (MP3 format allows garbage to be inside) and I extract both files, I should get the exactly same audio data, right? I'd only differ if one file is 128 and the other 320, for example. Okay so, the question is, is there a way to use DirectX to get this extracted audio data? I imagine it'd be some function returning byte array or something. Also, it would be handy to just extract whole file without playback. I want to process hundreds of files so 3-10min/s each (if files have to be played at natural speed for decoding) is way worse that one second for each file (only extracting)
I hope my question is understandable.
Thanks a lot for answers,
Aaron
Use http://sox.sourceforge.net/ (multiplatform). It's faster than realtime as you'd like, and it's designed for batch mode much more than DirectX. For example, sox -r 48k -b 16 -L -c 1 in.mp3 out.raw. Loop that over your hundreds of files using whatever scripting language you like (bash, python, .bat, ...).

Searching for audio

I am looking for a toolkit or library to search contents of audio files for am audio sample.
For example I have 5 seconds of speech that I know it exists in hundreds of hours of audio, and I want to find exact file and position of this sub-samples.
The sample is %99 similar but maybe converted to different audio format so it may have minor differences in waveform.
I prefer .NET library if there is such an option.
Thank you.
What you are trying to do is not an easy DSP problem to solve, and there is no one foolproof method. There is however an excellent recent article on audio fingerprinting on codeproject which goes into some depth on an algorithm that searches for duplicate MP3s, with code in C#. You may be able to adapt the algorithm to your needs.

Estimating the time-position in an audio using data?

I am wondering on how to estimate where I am currently in an audio with regards to time, by using the data.
For example, I read data by byte[8192] blocks. How can I know how much byte[8192] is equivalent to in time?
If this is some sort of raw-ish encoding, like PCM, this is simple. The length in time is a function of the sample rate, bit depth, and number of channels. 30 seconds of 16-bit audio at 44.1kHz in mono is 2.5MB. However, you also need to factor in headers and container format crapola. WAV files for example can have a lot of other stuff in them.
Compressed formats are much more tricky. You can never be sure where you are without playing through the file to get to where you are. Of course you can always guesstimate based on the percentage of the file length, if that is good enough for your case.
I think this is not what he was asking.
First you have to tell us what kind of data you are using. WAV? MP3? Usually without knowing where that block came from - so you know if you have some kind of frame information and where to find it - you are not able to determine that block's position.
If you have the full stream and this data then you can do a search

Resources