What would you recommend using for audio file transcribing into a .txt? - audio

I am working on a small school project where I have to take a lot of audio files and transcribe them into .txt files. I am a beginner at programming.
So far I've tried alexkras's method using Google's Cloud Speech API. I can't use it for mass transcription, though. The audio first has to be converted to .wav with external software (this can be done with ffmpeg too, so it's not a big deal), and then the .wav file has to be split into parts shorter than 60 seconds, because Cloud Speech can only transcribe under 60 seconds at a time. That splitting costs a lot of transcription quality unless you upload the files to GCS instead, but uploading is also a problem for mass transcription: some .wav files are large (a 1-hour podcast I used turned into an 800 MB file), which slows the process down.
The next thing I tried was the gcloud SDK, transcribing audio files directly on GCS with a short command in my terminal. The problem I observed here is that the transcription is not complete, and the output looks like this:
Example from Google:
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
This is not ideal. Maybe there is a way to convert it into a text file, but the transcriptions I've done so far are not complete: I got fewer than 30 lines of text in total from an 11-minute video.
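For the "convert it into a text file" part, a minimal Python sketch is shown here. It assumes the response has been saved to a JSON file with the same shape as the example above; the function names and file paths are made up for illustration. It only reformats what the API returned, so it does not fix an incomplete transcription.

```python
import json


def transcripts_from_response(response: dict) -> list[str]:
    """Return the first alternative's transcript from each result, in order."""
    lines = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            lines.append(alternatives[0].get("transcript", "").strip())
    return lines


def save_transcript(json_path: str, txt_path: str) -> None:
    """Read a saved response (hypothetical path) and write one line per result."""
    with open(json_path, encoding="utf-8") as f:
        response = json.load(f)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(transcripts_from_response(response)) + "\n")


# Same shape as the example response in the question.
example = {
    "results": [
        {
            "alternatives": [
                {"confidence": 0.9840146,
                 "transcript": "how old is the Brooklyn Bridge"}
            ]
        }
    ]
}
print(transcripts_from_response(example))
```

Each entry in "results" covers one stretch of audio, so joining them in order gives the full text that was returned.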
The most effective method I've tried is alexkras's method, but as I said above, it has problems too (in my case). I've also been looking into machine learning methods for speech-to-text, so that accented speech can be transcribed as well.
Do you know of any method that would help me transcribe audio files into text in bulk? I would have been happy with alexkras's method if it weren't for the splitting of files or the uploading to GCS. I would greatly appreciate any help, suggestions or guidance. Thank you.
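If the splitting step is the main obstacle, it can at least be automated. The following is a rough sketch using only Python's standard wave module; the function name, the output file naming and the 55-second default are assumptions chosen for illustration. It cuts at fixed positions, so words at chunk boundaries can still be broken.

```python
import os
import wave


def split_wav(src_path: str, out_dir: str, max_seconds: float = 55) -> list[str]:
    """Split a WAV file into consecutive chunks of at most max_seconds each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * max_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            path = os.path.join(out_dir, f"chunk_{index:04d}.wav")
            with wave.open(path, "wb") as dst:
                # Copy channel count, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

Calling split_wav("podcast.wav", "chunks") would then give a list of chunk paths that can be sent to the API one by one in a loop.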

You can try the Watson STT API. The file/stream size limit is 100 MB, which means that with the proper encoding you can transcribe files up to several hours long. You can use sox or ffmpeg for the audio conversion if needed; the lightest-weight codec is audio/ogg.
https://www.ibm.com/watson/developercloud/speech-to-text/api/v1/#recognize_sessionless12
See the curl example to get you started.
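To see why the encoding matters for the 100 MB limit, here is a small back-of-the-envelope calculation. The bitrates are assumptions picked for illustration (uncompressed 16-bit, 16 kHz mono WAV versus an Ogg file encoded at roughly 24 kbit/s), and 100 MB is taken as 100,000,000 bytes.

```python
def max_hours(limit_bytes: int, bitrate_bps: int) -> float:
    """Hours of audio that fit in limit_bytes at a constant bitrate."""
    seconds = limit_bytes * 8 / bitrate_bps
    return seconds / 3600


LIMIT_BYTES = 100 * 1000 * 1000  # 100 MB, assumed to mean decimal megabytes

encodings = [
    ("16-bit 16 kHz mono WAV", 16000 * 16),   # 256 kbit/s, uncompressed
    ("Ogg at about 24 kbit/s (assumed)", 24000),
]
for label, bitrate in encodings:
    print(f"{label}: about {max_hours(LIMIT_BYTES, bitrate):.1f} hours")
```

A conversion along the lines of `ffmpeg -i input.wav -ac 1 -c:a libopus -b:a 24k output.ogg` would produce such a file, assuming an ffmpeg build that includes libopus.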

I've just been exploring the AWS Transcribe product. It requires an AWS account, which can be obtained free, with a credit card for payment if you go over the free limits.
It provides up to 60 minutes per month of audio transcription. If you go beyond 60 minutes of audio, you'll need to pay a bit less than $1.50 per hour of audio transcribed.
The transcription comes back as a JSON file that is not easy to read. However, there is a PHP script on GitHub that turns the JSON file into a very easy-to-read transcript.
I've found it to be pretty accurate, and relatively easy to use. I'd look into it if I were you.
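If you only need the plain text, a few lines of Python can also pull it out of the JSON. This sketch assumes the output keeps the full text under results.transcripts[].transcript; check that against one of your own result files, since the function name and the sample data below are made up for illustration.

```python
import json


def transcribe_json_to_text(data: dict) -> str:
    """Join the transcript strings found under results.transcripts."""
    transcripts = data.get("results", {}).get("transcripts", [])
    return "\n".join(t.get("transcript", "") for t in transcripts)


# Hypothetical sample with the assumed layout.
sample = json.loads(
    '{"jobName": "demo", "results": {"transcripts": '
    '[{"transcript": "Hello and welcome to the show."}], "items": []}}'
)
print(transcribe_json_to_text(sample))
```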

Related

Create an SRT file from mp3 + text

TL;DR: what approach would you take to match an existing text file to an mp3 and generate an SRT file?
The situation is this:
I have people reading over text word by word
They upload the mp3 files into our system
The mp3s should then get SRT files so we can highlight the text while the voice plays over it.
I know there are some speech recognition software out there (with still mediocre results), but this is a bit different: we only need to somehow match what we already have.
How would you approach this problem? Any idea?
Cheers from Sai Gon, Vietnam
Till
P.S. I started using Stackoverflow a couple of years ago but just had to create a new account linked to my Google account.
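The SRT half of the problem is mechanical once timings exist. The sketch below assumes that some alignment step (for example a forced aligner, which is not shown here) has already produced a start time, end time and text for each cue; the function names and sample cues are made up for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical timings, as an aligner might report them.
print(to_srt([(0.0, 0.4, "Hello"), (0.4, 1.0, "world")]))
```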

Manipulating audio to bypass content ID detection

I'm using YouTube's "auto-generated" captions feature to generate transcripts of mp3 files. I do this by first converting the mp3 to a blank mp4, uploading to YouTube, waiting for the auto generated captions to appear, then extracting the SRT file.
The issue I'm having though is that a few of the mp3 files I've uploaded have been flagged as having copyrighted content, and as such no auto-generated captions have been made for them.
I have no desire to publish the mp3s on YouTube, they're uploaded as unlisted videos and all I require are the SRT files. Is there a way to manipulate the audio to bypass YouTube's content ID system? I've tried altering the pitch in Audacity, but it doesn't matter how subtle or extreme the pitch change is, they're still flagged as having copyrighted content. Is there anything else I can do to the audio other than adjusting the pitch that might work?
I'm hoping this post doesn't breach any rules on here, and I can't stress enough that I'm not looking to publish these mp3s, I just want the auto-generated SRTs.
No one can know how to cheat on Content ID
Obviously, as Content ID is a private algorithm developed by Google, no one can know for sure how it detects copyrighted audio in a video.
But, we can assume that one of the first things they did was to make their algorithm pitch-independent. Otherwise, everyone would change the pitch of their videos and cheat on Content ID easily.
How to use Youtube to get your subtitles anyway
If I am not mistaken, Content ID blocks you because of musical content, rather than vocal content. Thus, to address your original problem, one solution would be to detect musical content (based on spectral analysis) and cut it from the original audio. If the problem is with pure vocal content as well, you could try to filter it heavily and that might work.
Other solutions
Since YouTube is made by Google, why not directly use the Speech API that Google offers, which most likely performs the audio transcription on YouTube? And if the results are not satisfying, you could try other services (IBM, Microsoft, Amazon and others have their own).

Converting Audio From Unknown Format

I would like to create a utility in either PHP or Perl to convert an audio file created by Nortel's Callpilot voice mail system into a wave file. The problem is that the format, which has the .vbk file extension, is unknown to virtually every audio player. To date, I have not found one that will play a .vbk file. I've looked at audio file conversion libraries on CPAN and tried many of them, but they don't recognize the file. I was not successful with PHP's audio format manipulation either. Nortel does provide a converter; however, it does not suit my needs. I would like to have this run via cron on a CentOS system. I don't know how to reverse engineer this format. There seem to be just scraps of info on this format on the web. This page indicates that it is "based on the H.232 format":
https://www.odesk.com/o/jobs/job/Reverse-Engineer-Nortel-VBK-Audio-Format_~~f501f11679f3f6bb/
I know this is a very old thread, but I've recently been looking into converting Nortel's vbk format as well. Importing the vbk files into Audacity with the raw data option worked for me with these settings: Encoding: U-Law, Byte order: little-endian, Channels: 1 Channel (Mono), Sample rate: 8000 Hz. I'm not sure whether they have multiple formats for their vbk files, but mine were from a BCM50 phone system.
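For a cron job, the same import settings can be applied without Audacity. The sketch below is written in Python rather than the PHP or Perl the question asked for, and it assumes the whole file is headerless 8-bit mu-law (G.711) at 8000 Hz mono, as in the Audacity settings above. If a .vbk file carries a header, those bytes will come out as a short burst of noise at the start. The function names are made up for illustration.

```python
import struct
import wave


def ulaw_to_linear(u: int) -> int:
    """Decode one 8-bit G.711 mu-law byte to a signed 16-bit PCM value."""
    u = ~u & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# Precompute all 256 possible values once.
ULAW_TABLE = [ulaw_to_linear(b) for b in range(256)]


def ulaw_file_to_wav(src_path: str, dst_path: str, sample_rate: int = 8000) -> None:
    """Treat src_path as raw mono mu-law bytes and write a 16-bit PCM WAV."""
    with open(src_path, "rb") as f:
        raw = f.read()
    pcm = struct.pack(f"<{len(raw)}h", *(ULAW_TABLE[b] for b in raw))
    with wave.open(dst_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(sample_rate)
        out.writeframes(pcm)
```

Since mu-law samples are single bytes, the byte-order setting from the Audacity dialog has no effect on the decoding.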
Well, this is the joy of closed proprietary systems. But there is a chance they could play nice. Try to contact Callpilot and see if they'll give you the format specs. It's worth a shot.
As for reverse engineering, you need to be able to generate known content, like a constant tone at 60Hz for exactly 1 second, then at 50Hz, then for 10 seconds. Compare the results. Isolate the data from the metadata. There is going to be compression involved, so try a handful of common compression schemes; researching Nortel's practices will probably tell you more. If you can feed that into a player and get a tone back out, you're on your way.
There's probably more informed and structured ways to go about reverse engineering, but from my experience it's a lot of trial and error.
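As a rough illustration of the "known content" idea, the Python sketch below writes a test tone to a WAV file and measures how many leading bytes two files share. Getting the tone into the voicemail system so that it produces a .vbk file is not covered, and the function names, sample rate and amplitude are assumptions chosen for illustration.

```python
import math
import struct
import wave


def write_tone(path: str, freq_hz: float, seconds: float,
               sample_rate: int = 8000, amplitude: float = 0.5) -> None:
    """Write a mono 16-bit sine tone of the given frequency and length."""
    count = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack(
            "<h",
            int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)),
        )
        for i in range(count)
    )
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(sample_rate)
        out.writeframes(frames)


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes two recordings share (a candidate fixed header)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

If two recordings of different tones share a long identical prefix, that region is a reasonable first guess for fixed metadata, and the bytes after it for audio data.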

Efficiently generating time index of pre-transcribed speech using its audio source and open source tools

On TED.com they have transcriptions and they go to the appropriate section of the video when clicking a part of the transcription.
I want to do this for 80 hours of audios and transcriptions I have, on Linux with OSS.
This is the approach I'm thinking:
Start small with a 30-minute sample
Split the audio up into 2-minute WAV chunks, even if it breaks words up
Run the phrase spotter from CMU Sphinx's long-audio-aligner on each chunk, with the transcript
Take the time index for identified words/phrases found in each bit and calculate the actual estimated time of the ngrams in the original audio file.
Does this seem like an efficient approach? Has anyone actually done this?
Are there alternate approaches that are worth trying like dumb word counting that may be accurate enough?
You can just feed all your audio and text into a long audio aligner and it will give you the timestamps of the words. Using these timestamps you can jump to a specific word in a file.
I'm not sure why you want to split your audio or do anything else.
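Whichever route produces the word timings, the index itself is simple. The sketch below shows two small pieces in Python: mapping chunk-local times back to positions in the original file (only needed if the audio is split), and looking up which word is being spoken at a given playback time. The function names and the sample numbers are made up for illustration, and the aligner call itself is not shown.

```python
import bisect


def to_global_times(chunk_index: int, chunk_seconds: float, local_hits):
    """Shift (word, start_seconds) pairs from one chunk onto the full timeline."""
    offset = chunk_index * chunk_seconds
    return [(word, offset + start) for word, start in local_hits]


def word_at(start_times, t: float) -> int:
    """Index of the last word that starts at or before time t.

    start_times must be sorted in ascending order.
    """
    return max(bisect.bisect_right(start_times, t) - 1, 0)


# A word found 3.5 s into the third 2-minute chunk (index 2).
print(to_global_times(2, 120, [("hello", 3.5)]))
# Which word is playing at 0.7 s, given these word start times?
print(word_at([0.0, 0.5, 1.2], 0.7))
```

Going the other way, from a clicked word to a seek position, is just a lookup of that word's start time.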

Searching for audio

I am looking for a toolkit or library to search the contents of audio files for an audio sample.
For example, I have 5 seconds of speech that I know exists somewhere in hundreds of hours of audio, and I want to find the exact file and position of this sub-sample.
The sample is 99% similar but may have been converted to a different audio format, so it may have minor differences in waveform.
I prefer .NET library if there is such an option.
Thank you.
What you are trying to do is not an easy DSP problem to solve, and there is no one foolproof method. There is however an excellent recent article on audio fingerprinting on codeproject which goes into some depth on an algorithm that searches for duplicate MP3s, with code in C#. You may be able to adapt the algorithm to your needs.
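To make the problem concrete, here is a brute-force sketch in Python (not .NET, and not the fingerprinting algorithm from the article): it slides the sample over the longer signal and scores each position with normalized cross-correlation. It assumes both signals are already decoded to lists of samples at the same sample rate, and it is far too slow for hundreds of hours of audio; it only shows what "find the position of a nearly identical clip" means. The function name and the synthetic test data are made up for illustration.

```python
import math
import random


def best_match(haystack, needle):
    """Return (position, score) of the best normalized cross-correlation match."""
    m = len(needle)
    needle_mean = sum(needle) / m
    nd = [x - needle_mean for x in needle]
    needle_norm = math.sqrt(sum(x * x for x in nd))
    best_pos, best_score = -1, -1.0
    for pos in range(len(haystack) - m + 1):
        window = haystack[pos:pos + m]
        window_mean = sum(window) / m
        wd = [x - window_mean for x in window]
        window_norm = math.sqrt(sum(x * x for x in wd))
        if needle_norm == 0 or window_norm == 0:
            continue  # silent or constant window, correlation undefined
        score = sum(a * b for a, b in zip(nd, wd)) / (needle_norm * window_norm)
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, best_score


# Synthetic check: hide a clip at position 120, then change its gain and offset.
rng = random.Random(0)
signal = [rng.uniform(-1, 1) for _ in range(300)]
clip = [0.8 * x + 0.01 for x in signal[120:150]]
position, score = best_match(signal, clip)
print(position, round(score, 3))
```

Normalizing by mean and energy makes the score insensitive to volume changes, which is one of the differences a format conversion can introduce; a fingerprinting approach is still the better fit at scale.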
