TL;DR: what approach would you take to align an existing text file with an MP3 and generate an SRT file?
The situation is this:
I have people reading over text word by word
They upload the mp3 files into our system
Each MP3 then needs an SRT file so we can highlight the text as the voice plays over it.
I know there is speech recognition software out there (still with mediocre results), but this is a bit different: we only need to match the audio against text we already have.
How would you approach this problem? Any idea?
Cheers from Sai Gon, Vietnam
Till
P.S. I started using Stackoverflow a couple of years ago but just had to create a new account linked to my Google account.
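This is usually called forced alignment: an aligner (e.g. CMU Sphinx's long-audio-aligner or aeneas) takes the audio plus the known transcript and emits per-word timestamps, and turning those timestamps into SRT is then just formatting. A minimal sketch of that last step; the word/timestamp data here is made up for illustration, not real aligner output:

```python
# Sketch: turn per-word timestamps (as produced by a forced aligner)
# into SRT text. The word list below is invented for illustration.

def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group (word, start_sec, end_sec) triples into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w for w, _, _ in chunk)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)

words = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9), ("this", 1.0, 1.2),
         ("is", 1.3, 1.4), ("a", 1.5, 1.55), ("test", 1.6, 2.0)]
print(words_to_srt(words))
```

For word-level highlighting you can shrink `max_words` to 1 so each word gets its own cue.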
Related
I am working on a small school project where I have to take a lot of audio files and transcribe them into .txt files. I am a beginner at programming.
So far I've tried alexkras's method using Google's Cloud Speech API, but I can't use it for mass transcription. It works by converting the audio to .wav with external software (this can be done with ffmpeg too, so not a big deal) and splitting the new .wav file into chunks under 60 seconds, since Cloud Speech can only transcribe up to 60 seconds at a time unless you upload the files to GCS. Uploading is also a problem for mass transcription, because some .wav files are large (a one-hour podcast I used turned into an 800 MB file) and the process slows down.
The next thing I tried was using the gcloud SDK to transcribe audio files directly on GCS with a small snippet in my terminal. The problem I observed here is that the transcription is not complete, and the output looks like this:
Example from Google:
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
This is not ideal; maybe there is a way to turn it into a text file, but the transcriptions I've done so far are not complete: I got fewer than 30 lines of text from an 11-minute video.
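Getting plain text out of the JSON shape shown above is straightforward: walk the results and take the top alternative of each. A short sketch (the sample response here is trimmed down from the example above):

```python
import json

# Sketch: flatten a LongRunningRecognizeResponse-style JSON (the shape
# shown above) into plain text, taking the top alternative per result.
response_json = """
{
  "results": [
    {"alternatives": [{"confidence": 0.9840146,
                       "transcript": "how old is the Brooklyn Bridge"}]}
  ]
}
"""

def response_to_text(raw):
    data = json.loads(raw)
    lines = []
    for result in data.get("results", []):
        alts = result.get("alternatives", [])
        if alts:
            lines.append(alts[0]["transcript"])
    return "\n".join(lines)

print(response_to_text(response_json))  # how old is the Brooklyn Bridge
```

This only reformats what the API returned; it won't fix the missing-content problem if the recognition itself dropped audio.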
The most effective method I've tried is alexkras's, but as I said above it has problems too (in my case). I've also been looking into machine-learning approaches to speech-to-text so it can recognise or transcribe accented audio as well.
Do you know of any method to transcribe audio in bulk effectively? I would have been so happy with alexkras's method if it weren't for the file splitting and the GCS uploads. I would greatly appreciate any help, suggestions, or guidance with this. Thank you.
You can try the Watson STT API; the file/stream size limit is 100 MB, which means that with the proper encoding you can decode files up to several hours long. You can use sox or ffmpeg for the audio conversion if needed; the lightest-weight codec is audio/ogg.
https://www.ibm.com/watson/developercloud/speech-to-text/api/v1/#recognize_sessionless12
See the curl example there to get started.
I've just been exploring the AWS Transcribe product. It requires an AWS account, which can be obtained free, with a credit card for payment if you go over the free limits.
It provides up to 60 minutes per month of audio transcription. If you go beyond 60 minutes of audio, you'll need to pay a bit less than $1.50 per hour of audio transcribed.
The transcription results in a JSON file that is not easy to read, but there is a PHP script on GitHub that turns the JSON file into a very easy-to-read transcript.
I've found it to be pretty accurate, and relatively easy to use. I'd look into it if I were you.
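If PHP isn't your thing, pulling the readable text out of the Transcribe result is a few lines in any language. A sketch assuming the documented output shape (`results.transcripts[n].transcript`); the sample JSON here is made up for illustration:

```python
import json

# Sketch: extract a readable transcript from an AWS Transcribe JSON result,
# assuming the standard output shape. Sample data invented for illustration.
raw = """
{
  "jobName": "demo",
  "results": {
    "transcripts": [{"transcript": "hello world this is a test"}]
  },
  "status": "COMPLETED"
}
"""

def transcript_text(raw_json):
    data = json.loads(raw_json)
    parts = data["results"]["transcripts"]
    return " ".join(t["transcript"] for t in parts)

print(transcript_text(raw))  # hello world this is a test
```

The full result also carries per-item timestamps under `results.items` if you need more than the flat text.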
How do I create a simple MP3 file? It could be just one 440 Hz tone playing for one minute, for example. I know there is an MP3 specification, but it's not straightforward, and it's easy to make a mistake and produce an invalid file. Does anyone already have example source code that creates such a simple MP3 file? Thank you :)
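Writing valid MP3 frames by hand is genuinely fiddly, so one pragmatic route is to generate an uncompressed WAV (trivial with the standard library) and hand the encoding to an existing tool such as lame or ffmpeg (e.g. `ffmpeg -i tone.wav tone.mp3`). A sketch of the WAV half:

```python
import math
import struct
import wave

# Sketch: write a 60-second 440 Hz sine tone as 16-bit mono WAV, then
# encode it to MP3 with an external tool (lame/ffmpeg) rather than
# implementing the MP3 spec by hand.
RATE = 44100      # samples per second
FREQ = 440.0      # A4 tone
SECONDS = 60
AMP = 0.5         # half of full scale, to stay clear of clipping

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(RATE)
    frames = bytearray()
    for n in range(RATE * SECONDS):
        sample = int(AMP * 32767 * math.sin(2 * math.pi * FREQ * n / RATE))
        frames += struct.pack("<h", sample)
    w.writeframes(frames)
```

If you truly need to emit MP3 directly without any external encoder, you'd have to build correct frame headers and MDCT-encoded audio data per the spec, which is a much bigger project.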
I have a situation here. Suppose I have two short audio files that contain some sounds: the first has the sound 'hello' (audio 1) and the second has 'bye' (audio 2), spoken by someone. There is a third audio file with 'hello' (audio 3) spoken by the same person, but in a different recording.
How can I detect that audio 3 is similar to audio 1 (irrespective of the speaker)? I'm dealing with sounds here, not only speech, so there could be a whistle in place of the words.
You would have to program a statistical analysis of each file, then use pattern matching to determine the level of similarity between them.
The simplest solution for words would be to license an api version of a speech engine such as Dragon, then convert the audio files to text output and compare them.
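One concrete form of the statistical matching suggested above is dynamic time warping (DTW): extract per-frame features from each file (MFCCs, via a library such as librosa) and compare the feature sequences while allowing one to stretch relative to the other, which handles the same sound spoken faster or slower. A toy, pure-Python sketch on made-up 1-D feature tracks:

```python
# Toy sketch of dynamic time warping (DTW). In practice you'd run this
# over per-frame feature vectors (e.g. MFCCs); the 1-D "feature tracks"
# below are invented for illustration.

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a frame of a
                                 d[i][j - 1],      # skip a frame of b
                                 d[i - 1][j - 1])  # match frames
    return d[n][m]

hello_1 = [1, 3, 5, 5, 3, 1]     # pretend feature track for audio 1
hello_3 = [1, 3, 3, 5, 5, 3, 1]  # same sound, slightly slower (audio 3)
bye_2   = [5, 1, 0, 0, 1]        # a different sound (audio 2)

# A smaller DTW distance means a better match:
print(dtw_distance(hello_1, hello_3))  # 0.0
print(dtw_distance(hello_1, bye_2))
```

Because DTW works on any feature sequence, the same machinery covers whistles and other non-speech sounds, where a speech engine like Dragon would not apply.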
Another post here answered the question of creating 30-second preview clips from WAV audio files (Create mp3 previews from wav and aiff files). My needs slightly overlap, but the differing details are beyond my knowledge.
Requirements/Options: clip length; beginning & ending fade length; input filetypes: m4a/AAC/AIFF; output filetype: mp3; kbps (e.g. 192); original files unaltered; suffix new mp3 names with " (Preview)"
Limitations: no uploading of original files to a server (desktop processing); no compiling (unix/Terminal/Bash script only); recursive processing of files in sub-directories
Any/All assistance and advice is welcome.
You'll most likely get the best results with a DAW (digital audio workstation) and an audio file converter.
For a DAW, Reaper comes with a 60-day trial, and it has everything you need to cut the songs where needed, do fade-ins/fade-outs, and add other effects if you'd like.
www.reaper.fm
Simply use a converter to convert the m4a file to .wav, .mp3, or whatever you prefer, and then convert back to m4a if you need it in that format. I say this because some DAWs can't work with m4a files; if the one you choose can, then no conversion is necessary.
There are many options for which DAW and which converter to use. I recommend Reaper for the DAW; most converters do essentially the same thing, so it doesn't make much difference which one you choose.
Hope this helps!
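Given the question's no-compile, script-only, desktop-processing constraints, ffmpeg alone can also do the clipping, fades, and re-encoding without a DAW. A sketch that builds the ffmpeg command (`-ss`/`-t` for the clip window, the `afade` filter for fades, `-b:a` for bitrate); the directory layout and filenames are invented, and recursive processing would just wrap this in `os.walk`:

```python
import os

# Sketch: build an ffmpeg command for a 30 s preview with 2 s fade in/out,
# writing "<name> (Preview).mp3" and leaving the original file untouched.
# Run the returned list with subprocess.run(cmd, check=True).

def preview_cmd(src, clip=30, fade=2, kbps=192):
    base, _ = os.path.splitext(src)
    dst = f"{base} (Preview).mp3"
    fades = (f"afade=t=in:st=0:d={fade},"
             f"afade=t=out:st={clip - fade}:d={fade}")
    return ["ffmpeg", "-i", src, "-ss", "0", "-t", str(clip),
            "-af", fades, "-b:a", f"{kbps}k", dst]

cmd = preview_cmd("albums/track01.m4a")
print(" ".join(cmd))
```

ffmpeg reads m4a/AAC/AIFF natively, so no separate conversion step is needed; the fade-out start time assumes the clip begins at 0:00.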
On TED.com they have transcriptions and they go to the appropriate section of the video when clicking a part of the transcription.
I want to do this for 80 hours of audios and transcriptions I have, on Linux with OSS.
This is the approach I'm thinking:
Start small with a 30-minute sample
Split the audio into 2-minute WAV chunks, even if that breaks words up
Run the phrase spotter from CMU Sphinx's long-audio-aligner on each chunk, with the transcript
Take the time index of the words/phrases identified in each chunk and calculate the estimated time of the n-grams in the original audio file
Does this seem like an efficient approach? Has anyone actually done this?
Are there alternate approaches that are worth trying like dumb word counting that may be accurate enough?
You can just feed all your audio and text into a long audio aligner and it will give you the timestamps of the words. Using these timestamps you can jump to a specific word in a file.
I'm not sure why you would want to split your audio or do anything else.
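Once the aligner has produced per-word timestamps, the TED-style behaviour is just a lookup in both directions: click a word to seek, and find the current word to highlight during playback. A sketch assuming a list of per-word start times (the data here is made up, not real aligner output):

```python
import bisect

# Sketch: per-word start times from a long-audio aligner, used both to
# seek on click and to find the word to highlight at playback time t.
# The word list and timings below are invented for illustration.
words = ["On", "TED.com", "they", "have", "transcriptions"]
starts = [0.0, 0.4, 1.1, 1.4, 1.7]  # aligner start time of each word (sec)

def seek_time(word_index):
    """Audio position (seconds) to jump to when a word is clicked."""
    return starts[word_index]

def word_at(t):
    """Index of the word being spoken at playback time t."""
    return max(0, bisect.bisect_right(starts, t) - 1)

print(seek_time(2))            # 1.1
print(words[word_at(1.5)])     # have
```

With 80 hours of audio you'd store one such timestamp list per file, produced in a single aligner pass each.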