Identify start/stop times of spoken words within a phrase using Sphinx - cmusphinx

I'm trying to identify the start/end time of individual words within a phrase. I have a WAV file of the phrase AND the text of the utterance.
Is there an intelligent way of combining these two pieces of data (audio, text) to improve Sphinx's recognition abilities? What I'd like as output are accurate start/stop times for each word within the phrase.
(I know you can pass -time yes to pocketsphinx to get the time data I'm looking for -- however, the speech recognition itself is not very accurate.)
The solution cannot be for a specific speaker, as the corpus I'm working with contains a lot of different speakers, although they are all using US English.

We have a specific tool for that: the audio aligner in sphinx4. You can check
http://cmusphinx.sourceforge.net/2014/07/long-audio-aligner-landed-in-trunk/
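The sphinx4 aligner is a Java tool. If you would rather stay with pocketsphinx, recent releases also expose forced alignment from Python: the decoder is constrained to your known transcript instead of doing free recognition, which usually gives much more reliable timings than the -time output of an unconstrained pass. A minimal sketch, assuming the pocketsphinx 5.x Python bindings and a 16 kHz 16-bit mono WAV; the file name and transcript are placeholders:

    import wave
    from pocketsphinx import Decoder

    # Default US English model; the audio must match samprate (16 kHz, 16-bit mono).
    decoder = Decoder(samprate=16000)

    # Constrain decoding to the known transcript (forced alignment).
    decoder.set_align_text("hello world")        # placeholder transcript

    with wave.open("phrase.wav", "rb") as wav:   # placeholder file
        audio = wav.readframes(wav.getnframes())

    decoder.start_utt()
    decoder.process_raw(audio, full_utt=True)
    decoder.end_utt()

    # Frames are 10 ms by default, so divide by 100 for seconds.
    for seg in decoder.seg():
        print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)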

Related

Is there a way to get timestamps of speaker switch times using Google Cloud's speech to text service?

I know there is a way to get words delineated by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when a speaker changes in a longer file. I know that Descript must do something like this under the hood, which is what I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order in which they occurred.
I know I could probably extract timestamps for each word and then iterate through the results, getting the timestamps for when a previous result is a different speaker than the current result. This seems very tedious for a long audio file and I'm not sure how accurate this is.
Google "Speech to text" - phone model does what you are looking at by giving result end times for each identified speaker.
Check more here https://cloud.google.com/speech-to-text/docs/phone-model
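If you do end up iterating over words after all, it is less tedious than it sounds: with diarization enabled, the client returns a single word list with speaker tags, and you only emit a boundary when the tag changes. A sketch assuming the google-cloud-speech 2.x Python client; the file name and speaker counts are placeholders:

    from google.cloud import speech

    client = speech.SpeechClient()

    with open("conversation.wav", "rb") as f:    # placeholder file
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=2,    # adjust to your audio
            max_speaker_count=2,
        ),
    )

    response = client.recognize(config=config, audio=audio)

    # With diarization, the last result carries the full word list with
    # speaker tags; emit a timestamp whenever the tag changes.
    current = None
    for word in response.results[-1].alternatives[0].words:
        if word.speaker_tag != current:
            print(f"speaker {word.speaker_tag} at {word.start_time.total_seconds():.2f}s")
            current = word.speaker_tag

Those change points are exactly the cut positions for splitting the file into per-speaker clips. For audio longer than about a minute you would swap recognize for long_running_recognize and wait on the returned operation.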

How to get a list of words whose parts of speech never change

I am working on an NLP project. I need to find all the words in English whose part of speech never changes (i.e., words that always have a single part of speech in any sentence). Can anyone suggest how to find them, and is there a specific name for this kind of word?
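There is no definitive list, since a corpus can only show that a word was never observed with more than one part of speech, but you can approximate one empirically. A rough sketch assuming NLTK and its POS-tagged Brown corpus:

    from collections import defaultdict
    from nltk.corpus import brown  # needs nltk.download('brown') and nltk.download('universal_tagset')

    # Collect the set of coarse POS tags observed for each word form.
    tags_seen = defaultdict(set)
    for word, tag in brown.tagged_words(tagset="universal"):
        tags_seen[word.lower()].add(tag)

    # Keep words that only ever appear with a single tag in this corpus.
    unambiguous = sorted(w for w, tags in tags_seen.items() if len(tags) == 1)
    print(len(unambiguous), "words have a single observed tag in Brown")

In the tagging literature these are usually just called (POS-)unambiguous words. Note that rare words will show up simply because they occur too few times to be seen in a second role, so filtering by frequency helps.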

How to train custom speech model in Microsoft cognitive services Speech to text

I'm doing a POC with speech-to-text. I need to recognize specific words like "D-STUM" (daily stand-up meeting). The problem is, every time I tell my program to recognize "D-STUM", I get "Destiny", "This theme", etc.
I already went to speech.microsoft.com/.../customspeech, and I've recorded around 40 WAV files of people saying "D-STUM". I've also created a file named "trans.txt" which lists every WAV file with the word "D-STUM" after each file name, like this:
D_stum_1.wav D-STUM
D_stum_2.wav D-STUM
D_stum_3.wav D-STUM
D_stum_4.wav D-STUM
...
Then I uploaded a zip containing the WAV files and the trans.txt file, trained a model with that data, and created an endpoint. I referenced this endpoint in my software and launched it.
I expected my custom speech-to-text model to recognize people saying "D-STUM" and display "D-STUM" as text. I have never had "D-STUM" displayed after customizing the model.
Did I do something wrong? Is this the right way to do custom training?
Is 40 samples not enough for the model to be trained properly?
Thank you for your answers.
Custom Speech has several ways to get a better understanding of specific words:
By providing audio samples with their transcriptions, as you have done
By providing text samples (without audio)
Based on my previous use cases, I would highly suggest creating a training file with 5 to 10 sentences in it, each one containing "D-STUM" in its usage context, and then duplicating those sentences 10 to 20 times in the file (see the sketch below).
This worked for us for getting specific words understood.
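As a concrete illustration of that suggestion, a throwaway sketch in Python; the sentences are invented placeholders, so substitute real usage from your meetings:

    # Hypothetical generator for a Custom Speech related-text file.
    sentences = [
        "Let's discuss this during the D-STUM.",
        "The D-STUM starts at nine thirty.",
        "She missed the D-STUM yesterday.",
        "Every team runs its own D-STUM.",
        "Please add this topic to the D-STUM agenda.",
    ]

    with open("related_text.txt", "w", encoding="utf-8") as f:
        for _ in range(15):            # duplicate the block 10 to 20 times
            for sentence in sentences:
                f.write(sentence + "\n")

Upload the resulting file as a plain-text (related text) dataset alongside your audio data and retrain the model.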
Additionally, if you are using "en-US" or "de-DE" as the target language, you can use a pronunciation file, see here

How to automatically detect sentence fragments in a text file

I am working on a project and need a tool or an API to detect sentence fragments in large texts. There are many solutions, such as OpenNLP, for detecting sentences in a given file. However, I wasn't able to find any explicit solution to the problem of finding words, phrases or even character combinations which do not belong to any grammatically correct sentence.
Any help will be greatly appreciated.
Thanks,
Lorderon
You could use n-grams as a workaround:
Suppose you have a large collection of text with real sentences for reference. You could extract all sequences of 1, 2, 3, 4, 5, or more words and then check whether the fragments from your text exist as n-grams in that reference.
You can download n-grams directly from Google: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html but you might need a lot of bandwidth.
You could also count the n-grams yourself. In that case you can take the parsed Wikipedia data sets from my website
http://glm.rene-pickhardt.de/data/ and the source code from https://github.com/renepickhardt/generalized-language-modeling-toolkit in order to create the n-grams yourself (or use any other n-gram toolkit like SRILM, Kylm, OpenGrm, ...).
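To make the check concrete: extract the n-grams of each candidate span and score it by how many of them occur in the reference collection. A small sketch assuming NLTK; the reference text here is a tiny stand-in, and in practice you would load one of the collections above:

    from nltk.tokenize import word_tokenize  # needs nltk.download('punkt')
    from nltk.util import ngrams

    N = 3
    reference_text = "the quick brown fox jumps over the lazy dog"  # stand-in corpus

    def ngram_set(text, n):
        tokens = word_tokenize(text.lower())
        return set(ngrams(tokens, n))

    reference = ngram_set(reference_text, N)

    def looks_like_fragment(candidate, threshold=0.5):
        # A span is suspicious if most of its n-grams are unseen.
        grams = ngram_set(candidate, N)
        if not grams:        # shorter than N tokens: cannot judge
            return True
        hits = sum(1 for g in grams if g in reference)
        return hits / len(grams) < threshold

    print(looks_like_fragment("the quick brown fox jumps"))  # False: trigrams attested
    print(looks_like_fragment("fox brown quick the dog"))    # True: trigrams unseen

The order N and the threshold are the knobs: a higher N catches subtler fragments but needs a much larger reference corpus to avoid false alarms.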

Building your own text corpus

It may sound stupid, but do you know how to build a text corpus? I have searched everywhere and found existing corpora, but I wonder how they were built. For example, if I want to build a corpus with positive and negative tweets, do I just have to make two files? But what about the contents of those files? I don't get it.
In this example he stores the pos and neg tweets in a Redis DB.
But what about the contents of those files?
This depends mostly on what library you're using. XML (with a variety of tags) is common, as is one sentence per line. The tricky part is getting the data in the first place.
For example, if I want to build a corpus with positive and negative tweets
Does this mean that you want to know how to mark the tweets as positive and negative? If so, what you're looking for is called text classification or sentiment analysis.
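To make the one-sentence-per-line layout concrete, here is a small sketch assuming NLTK; the directory names are just a convention invented for the example:

    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # Expected layout, one tweet per line:
    #   tweets/pos/tweets.txt
    #   tweets/neg/tweets.txt
    reader = CategorizedPlaintextCorpusReader(
        "tweets/",
        r".*\.txt",
        cat_pattern=r"(pos|neg)/.*",   # category = containing directory
    )

    print(reader.categories())                # ['neg', 'pos']
    print(reader.words(categories="pos")[:10])

So yes, two files is a perfectly reasonable "inside": plain text, one item per line, with the positive/negative label carried by the file or directory name.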
If you want to find a bunch of tweets, I'd check one of these pages (just from a quick search of my own).
Clickonf5: http://clickonf5.org/5438/download-tweets-pdf-xml-format-local-machine-server/
Quora: http://quora.com/What-is-the-best-tool-to-download-and-archive-Twitter-data-of-certain-hashtags-and-mentions-for-academic-research
Google Groups: http://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/kfislDfxunI
For general learning about how to create a corpus, I would check out the Handbook of Natural Language Processing Wiki by Richard Xiao.
