Do I need to add updated phoneme sequences of words to the .dict file while adapting an acoustic model using CMUSphinx? - cmusphinx

I am trying to adapt the en-us acoustic model with Indian English accent recordings. Since many words are pronounced differently in this accent, do I need to add the updated phoneme representations of those words? Currently I am following this link: https://cmusphinx.github.io/wiki/tutorialadapt/#accumulating-observation-counts and nothing is mentioned there about updating the .dict file.
PS: Should I add new words directly to the dictionary?

There is an Indian English model in the downloads; you should use it instead. It comes with an Indian English dictionary.
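If you still decide to add accent-specific variants to a dictionary yourself, the sketch below shows one way to append an alternate pronunciation. It assumes the usual CMUSphinx dictionary layout (one `word PHONE PHONE ...` entry per line, with alternates written as `word(2)`, `word(3)`, and so on). The function name `add_pronunciation` and the sample entries are hypothetical and are not taken from any shipped dictionary.

```python
import re


def add_pronunciation(entries, word, phones):
    """Return a new list of dictionary lines with an extra pronunciation for `word`.

    Assumes entries look like "word PH1 PH2" and alternates like "word(2) PH1 PH2".
    """
    # Match the base entry and any numbered alternates of the same word.
    pattern = re.compile(r"^" + re.escape(word) + r"(\(\d+\))?$")
    existing = [
        line for line in entries
        if line.split() and pattern.match(line.split()[0])
    ]
    # Do nothing if this exact phone sequence is already listed.
    if any(line.split()[1:] == phones.split() for line in existing):
        return list(entries)
    # First pronunciation uses the bare word; later ones get the next index.
    key = word if not existing else f"{word}({len(existing) + 1})"
    return list(entries) + [f"{key} {phones}"]


# Hypothetical sample entries, for illustration only.
sample = ["schedule S K EH JH UH L", "water W AO T ER"]
updated = add_pronunciation(sample, "schedule", "SH EH JH UH L")
print(updated[-1])
```

The new line is appended at the end here; if your tools expect a sorted dictionary, sort the result before writing it back.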

Related

How to remove a word from Aspell's British dictionary

When I check my texts with aspell (with the British dictionary), the word "froward" is accepted (because it is a real English word). However I never use it, so in my texts "froward" is always a misspelling of "forward". Therefore I want aspell to reject "froward".
How can I remove a word from Aspell's standard dictionary? Is there a way to create a "blacklist" of words? There is no way to mark it in .aspell.en.pws, because the personal dictionary only contains a "whitelist".
You can't.
Aspell does not support it.
Submit an issue or a pull request on the official repo if you care.
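As a workaround outside Aspell, you could run a separate blacklist check over the text before or after spell checking. This is only a minimal sketch: the function name `find_blacklisted` and the sample text are made up for illustration, and it does not touch Aspell's dictionaries at all.

```python
import re


def find_blacklisted(text, blacklist):
    """Return (line_number, word) pairs for every blacklisted word in `text`."""
    banned = {w.lower() for w in blacklist}
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Treat runs of letters and apostrophes as words.
        for word in re.findall(r"[A-Za-z']+", line):
            if word.lower() in banned:
                hits.append((line_no, word))
    return hits


sample = "We look forward to it.\nPlease froward the mail."
print(find_blacklisted(sample, ["froward"]))
```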

After using pdftotext: find page of string from txt

I am currently coding in Python and managed to use pdftotext to extract the text from a PDF.
The resulting text is split into a list of strings. Using regular expressions I am able to find the specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, and by distance I mean the number of words between them.
However, after finding the positions of the words I would like to be able to refer back to the original PDF. Specifically, I am interested in the page and maybe even the line (if PDF supports this kind of structure) where these words are located.
One idea I have is to run this process on each page of the PDF separately, so that when I find these words I know which page they are on. But this has the big disadvantage that page breaks do not necessarily fall at natural boundaries, so I would lose the ability to find the words if they happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?
You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. The one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that will allow you to extract the text and examine the content of paragraphs, the sentences within those paragraphs, and the words within each sentence. You'll not only be able to tell the distance between the words but also whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
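If you would rather stay with pdftotext, here is a rough Python sketch of a different approach from the toolkit above. It relies on pdftotext separating pages with a form feed character (`\f`) in its default output, so you can tag each word with its page number while keeping a single word list that spans page breaks. The function names and the sample string are hypothetical, and line numbers are not recovered.

```python
import re


def words_with_pages(text):
    """Return a list of (word, page_number) pairs for pdftotext-style text."""
    tagged = []
    # Pages are assumed to be separated by form feed characters.
    for page_no, page in enumerate(text.split("\f"), start=1):
        for word in re.findall(r"\w+", page):
            tagged.append((word, page_no))
    return tagged


def distance_and_pages(tagged, first, second):
    """Return (words in between, page of first, page of second).

    Uses the first occurrence of `first` and the next occurrence of
    `second` after it; raises ValueError if either word is missing.
    """
    words = [w.lower() for w, _ in tagged]
    i = words.index(first.lower())
    j = words.index(second.lower(), i + 1)
    return j - i - 1, tagged[i][1], tagged[j][1]


# Made-up text standing in for pdftotext output with one page break.
sample = "The contract starts on Monday and\fends on Friday."
tagged = words_with_pages(sample)
print(distance_and_pages(tagged, "starts", "ends"))
```

Because the word list is continuous, two words on either side of a page break are still found, and each one keeps its own page number.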

text to phonemes converter

I'm searching for a tool that converts text to phonemes (like text-to-speech software does).
I could program one myself, but it would not be error-free and would take a lot of time.
So my question is:
is there a simple tool for converting e.g.
"hello" to "HH AH0 L OW1"
Maybe some command-line tool, so I can capture the stdout?
I'm searching for the phonemes in 'Arpabet' style (see the 'hello' example).
espeak does something like that, but the output is not in Arpabet style and the phonemes are
not separated by a delimiter.
If you had searched for Arpabet on wiki you would have found your answer. The CMU guys have prepared scripts which convert most English words to their respective Arpabet phonetic breakdown.
If you want the phone sequence for a couple of words you can use their interface here. But if you want it for a big file, you might have to run their scripts on your own. They used to have a working page here, but it does not seem to be working now.
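For a whole file, a dictionary lookup can be sketched in a few lines of Python. This assumes you have the CMU pronouncing dictionary as plain text in its usual layout (comment lines starting with `;;;`, one `WORD  PHONES` entry per line, alternates marked like `WORD(1)`). The three entries below are a tiny illustrative sample rather than the real file, and the function names are hypothetical.

```python
import re

# Tiny illustrative sample in CMUdict-style layout (not the full dictionary).
SAMPLE_DICT = """\
;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""


def load_pronunciations(dict_text):
    """Map each lowercase word to a list of its Arpabet pronunciations."""
    prons = {}
    for line in dict_text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        head, *phones = line.split()
        # Strip the "(1)", "(2)", ... suffix used for alternate entries.
        word = re.sub(r"\(\d+\)$", "", head).lower()
        prons.setdefault(word, []).append(" ".join(phones))
    return prons


def text_to_phonemes(text, prons):
    """Return the first listed pronunciation of each word, or "<unknown>"."""
    return [prons.get(w.lower(), ["<unknown>"])[0] for w in text.split()]


prons = load_pronunciations(SAMPLE_DICT)
print(text_to_phonemes("hello world", prons))
```

Words missing from the dictionary come back as `<unknown>`; a real converter would need a fallback such as letter-to-sound rules for those.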

List of Acronyms for Text Mining

I'm trying to do some text mining and my sample text has a lot of acronyms in it. I would like to try to flatten the acronyms into the phrases.
Does anyone know where I can get a list? http://www.acronymslist.com is a pretty awesome site and I was looking for something that just had an open list I could download.
Thanks,
mj
http://en.wikipedia.org/wiki/List_of_acronyms_and_initialisms
Wikipedia content is freely reusable, subject to the terms of the Creative Commons Attribution-ShareAlike License.
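Once you have a list in some form, the flattening step itself could look roughly like the Python sketch below. The `ACRONYMS` table is a hypothetical stand-in for whatever list you end up downloading, and the function name `expand_acronyms` is made up for this example.

```python
import re

# Hypothetical lookup table; replace with entries from your downloaded list.
ACRONYMS = {
    "NLP": "natural language processing",
    "TTS": "text to speech",
}


def expand_acronyms(text, table):
    """Replace each whole-word acronym in `text` with its expansion."""
    # Longest keys first so that a longer acronym wins over a shorter prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


print(expand_acronyms("NLP and TTS tools", ACRONYMS))
```

Matching is case-sensitive and limited to whole words, which avoids expanding ordinary lowercase words that happen to spell an acronym. Acronyms with several possible meanings would need context to pick the right phrase.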

Analyzing Text for Accents

This is the first part of another question of mine that had a recommendation to make it two questions: Adding Accents to Speech Generation.
Summary: The other question asks how to add an accent programmatically to generated speech. Not an accent mark or inflection, but a full accent such as a British, Scottish, or Russian one.
The first question (same as this one) asks how the original text could be analyzed to determine what accents need to be added and where.
Basically, how could text be analyzed to find these accents and generate a set of instructions that could be used to add any accent to any generated speech?

Resources