Microsoft Speech Recognition defaults vs API - windows-10

I've been using Microsoft Speech Recognition in Windows 10: doing the training exercises, dictating text into WordPad and correcting it, adding words to the dictionary, and so on. I would like to use the software to transcribe .wav files. It appears one can do this with the Windows Speech Recognition API, but that seems to involve creating and loading one's own grammar files. This suggests to me that I would effectively be creating a new speech recognizer, one that uses the same building blocks but is a different program from the one that runs when I click "Start Speech Recognition" in the Start menu. In particular, it would perform differently because of differences in training or configuration.
Am I wrong about this? And if I'm not, is there still a way to retrieve all the data the default speech recognizer uses, so I can reproduce its behavior exactly? If I need to create a separate speech recognizer, with its own grammar files, separate training history and so on, in order to transcribe .wav files, then so be it, but I'd like to better understand what's going on here.

The Woundify open source project contains examples of how to convert wav files to text (STT).

Related

How to detect handwriting using Google Cloud Vision API

TL;DR: how can I detect the presence of handwriting in an image?
I'm using Google's Python Vision API to scan for text in images, with generally good results. Most of the time the images contain printed text, but sometimes there is handwriting.
As noted in the documentation, you sometimes get better results for handwritten text using document_text_detection rather than the standard text_detection API call. My own tests back this up, but also show that the standard text_detection call generally works best for printed text in JPEG images.
So I'd like to use the standard text_detection by default, and only run images through document_text_detection if there is handwriting. However, I can't find a reliable way to detect the presence of handwritten text in an image using the Vision APIs.
I tried label detection, but there does not appear to be a specific label for handwriting. Occasionally it will spit out "Calligraphy" but not reliably.
Does anyone know of a way to accomplish this?
I haven't used the Google Cloud Vision API, but you can try object detection models. I would suggest creating a labeled dataset from the document images for your use case with a tool like LabelImg, then training an object detection model such as YOLOv3 [paper] [code]. I have worked on similar problems, and it should work.

Perform Speech-to-Text in Python using pre-transcribed text as guide

I'm working on a Python application that's meant to align video clips based on what actors are saying on screen.
As an example, I have a scene where actors are reading dialogue from a script. They do the 3-minute scene 10 times.
I am currently transcribing what they say using speech-to-text, but because the actors are reading the same dialogue repeatedly, I want to use the pre-transcribed dialogue (the movie script) to help guide the speech-to-text engine to be more accurate.
For example:
"Are you telling me you built a time machine out of a Delorean?"
Speech to text returns:
"Are you talking me you building a time machine out of a daylight?"
I should be able to figure out where the mistakes are and estimate the correct line using the original script and lock everything against the movie script.
I'm currently using CMUSphinx in Python to get my STT data and it works very well. But I'm having some trouble with the logic on this next part.
I'll post some code shortly!
EDIT: Discovered that the search term I was looking for is "audio aligner" and "long audio aligner." These seem to be tools included in some STT packages. CMUSphinx in particular may have the ability to do this built in. Exploring that.
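The correction step described in the question can be sketched with Python's standard difflib module. This is only a minimal illustration, not a real audio aligner: it assumes the script is available as a list of text lines, and the helper names best_script_line and word_differences, along with the second script line, are made up for this example.

```python
import difflib


def best_script_line(recognized, script_lines):
    """Return the script line most similar to the recognized text, or None."""
    # cutoff=0.0 means "always return the closest line, however poor the match".
    matches = difflib.get_close_matches(recognized, script_lines, n=1, cutoff=0.0)
    return matches[0] if matches else None


def word_differences(recognized, script_line):
    """List (recognized_words, script_words) pairs where the two texts disagree."""
    rec = recognized.lower().split()
    ref = script_line.lower().split()
    matcher = difflib.SequenceMatcher(a=rec, b=ref, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((rec[i1:i2], ref[j1:j2]))
    return diffs


script = [
    "Are you telling me you built a time machine out of a Delorean?",
    "A hypothetical second line of the script.",  # placeholder, not from the film
]
heard = "Are you talking me you building a time machine out of a daylight?"

line = best_script_line(heard, script)
diffs = word_differences(heard, line)
for wrong, right in diffs:
    print(" ".join(wrong), "->", " ".join(right))
```

Matching on whole words keeps the output readable. For a 3-minute take, the same comparison could be run over a sliding window of script lines rather than one line at a time.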

Are there any open-source phoneme sets (for speech synthesis)?

I am trying to make a super basic speech synthesizer, and I need some form of phoneme audio files so that I can piece them together and build words. Are there any open phoneme sets that I would be able to use for this?
For a super basic speech synthesizer it's worth checking out espeak (http://espeak.sourceforge.net); it's better than gluing sound files together.
This may be more than you're looking for, but have you checked into http://www.vocaloid.com/en/ by any chance? There are many speech products on the market. You might also be interested in http://msdn.microsoft.com/en-us/library/hh361572(v=office.14).aspx

Library for extracting words (speech) out from audio stream?

I have an audio stream and I would like to extract words (speech) from it. So, for example, given audio.wav I would get 001.wav, 002.wav, 003.wav, etc., where each XXX.wav is one word.
I am looking for a library or program to do it -- platform does not matter, but I prefer open-source solution.
Thank you in advance for help.
Nuance, the company that makes Dragon Naturally Speaking, has a number of Software Development Kits.
The Audio Mining kit seems to match your requirements:
Dragon NaturallySpeaking SDK
AudioMining is a speaker-independent speech recognition toolkit that enables the indexing of 100% of the speech information within audio files. The technology uses highly accurate speech recognition to turn audio files into XML text with timestamp information. This can be integrated with standard text-search products to enable rapid access to specific audio content.
Getting from raw speech to speech plus metadata (the recognized words and their timestamps) is far and away the hardest part to get right. Once you have the speech plus metadata, extracting the words as individual audio files is much more straightforward.
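As a rough sketch of that second, easier step: assuming a recognizer has already produced a (start, end) time in seconds for each word, Python's standard wave module can cut the source file into numbered files. The function name split_wav_by_timestamps and the out_pattern parameter are made up for this example, and the input is assumed to be an uncompressed PCM WAV file with timestamps inside its duration.

```python
import wave


def split_wav_by_timestamps(src_path, segments, out_pattern="{:03d}.wav"):
    """Write one WAV file per (start_sec, end_sec) segment; return the paths written."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for index, (start, end) in enumerate(segments, start=1):
            first = int(start * rate)                # first frame of the word
            count = max(0, int(end * rate) - first)  # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            path = out_pattern.format(index)
            with wave.open(path, "wb") as dst:
                # Same channels, sample width, and rate as the source;
                # the frame count in the header is fixed up on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
    return paths
```

Calling split_wav_by_timestamps("audio.wav", [(0.20, 0.55), (0.60, 1.10)]) would write 001.wav and 002.wav in the current directory.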

Libraries of audio samples (spoken text)

For a project we're currently working on, we need a library of spoken words in many different languages.
Two options seem possible: text-to-speech or "real" recordings by native speakers. As the quality is important to us, we're thinking about going the latter path.
In order to create a prototype for our application, we're looking for libraries that contain as many words in different languages as possible. To get a feeling for the quality of our approach, this library should not be made up of synthesized speech.
Do you know of any available/accessible libraries?
A co-worker just found this community based library, which is nice, but rather small in size:
Forvo.com
I've just found this on the Audacity wiki: VoxForge. From their site:
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK has distribution restrictions).
There is also Old time radio, not sure if this is the sort of spoken word you're after though.
My guess is that you won't find a library anywhere that consists of just individual words. Whatever you find, you're going to have to open the audio up in an editor (like Pro Tools or Cool Edit) and chop it up into individual words.
You would probably be better off creating a list of all the words you need for each language, and then finding native speakers to read them while you record. You can have them read slowly, so that you'll have an easy time chopping up each individual word.
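If the speakers do read slowly with clear pauses, the chopping can be partly automated by looking for stretches of silence. The following is a minimal sketch, assuming the recording has already been loaded as a sequence of signed integer samples (for example 16-bit mono PCM). The function name find_spoken_segments and the default threshold and min_silence values are assumptions that would need tuning for each recording.

```python
def find_spoken_segments(samples, threshold=500, min_silence=2000):
    """Return (start, end) sample index pairs for runs louder than threshold.

    Loud runs separated by fewer than min_silence quiet samples are merged,
    so short dips inside a word do not split it.
    """
    segments = []
    start = None      # index where the current spoken run began
    last_loud = None  # index of the most recent loud sample
    for i, value in enumerate(samples):
        if abs(value) >= threshold:
            if start is None:
                start = i
            elif i - last_loud > min_silence:
                # The pause was long enough: close the previous word.
                segments.append((start, last_loud + 1))
                start = i
            last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

Each returned pair is a half-open range of sample indices. Dividing by the sample rate gives times in seconds that can be used to cut the recording into one file per word.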
One I used to use a lot: http://shtooka.net/index.php
Easy access to the recordings.
