Use German dictionary and language model with Sphinx4 - cmusphinx

I can use the en-us things that come with Sphinx4, no problem:
cfg.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us")
cfg.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict")
cfg.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin")
I can use this to transcribe an English sound file recording.
Now I want to use this with German recordings. On the website I find a link to Acoustic and Language Models. That page lists an archive, 'German Voxforge', which contains the files for the acoustic model path. But as far as I can see, it does not contain a dictionary or a language model.
How do I get the dictionary and language model path for German in Sphinx4?

You create them yourself. You can build a language model from subtitles or Wikipedia dumps. The documentation is here.
The latest German models are actually not on the CMUSphinx page; they are at github/gooofy. In this gooofy project you can find dictionary documentation, models and related materials.

I tried the German model with pocketsphinx and got errors because the *.lm.bin language model files were rejected as "invalid".
I switched to the *.lm.gz file and it works fine.
The proper configuration list is:
fst = voxforge-de.fst
hmm folder = model_parameters/voxforge.cd_cont_6000
dictionary = cmusphinx-voxforge-de.dic
language model = cmusphinx-voxforge-de.lm.gz
To get the "hmm" folder, unpack this archive:
cmusphinx-de-voxforge-5.2.tar.gz
I think the same files should work for Sphinx4, so please give it a try.
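As a rough illustration of the configuration listed above, here is a minimal Python sketch that assembles a command line for the pocketsphinx_continuous tool. It assumes a pocketsphinx build that still ships pocketsphinx_continuous with the -hmm, -dict, -lm and -infile options; the function name and the example paths are hypothetical, and where the dictionary and language model sit on disk depends on how you downloaded them. The fst file from the list is not passed to the decoder here.

```python
import shlex


def build_pocketsphinx_command(hmm_dir, dict_path, lm_path, audio_file):
    """Return an argument list for pocketsphinx_continuous (hypothetical helper)."""
    return [
        "pocketsphinx_continuous",
        "-hmm", str(hmm_dir),       # acoustic model folder
        "-dict", str(dict_path),    # pronunciation dictionary
        "-lm", str(lm_path),        # language model (*.lm.gz per the answer above)
        "-infile", str(audio_file), # recording to transcribe
    ]


if __name__ == "__main__":
    # Example paths are assumptions; adjust them to your own layout.
    command = build_pocketsphinx_command(
        "model_parameters/voxforge.cd_cont_6000",
        "cmusphinx-voxforge-de.dic",
        "cmusphinx-voxforge-de.lm.gz",
        "recording.wav",
    )
    # Print the command so it can be inspected or pasted into a shell.
    print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be handed to subprocess.run once the paths point at real files.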

How to import any UML/XMI files to StarUML?

I am trying to import a UML Diagram (of a C++ project) I designed in a program called Visual Paradigm.
This program allows me to save the UML diagram in various formats, and when I choose the XMI format (supported by StarUML through an extension) it allows me to pick the XMI version to save the file.
The problem comes when I try to import the file into StarUML: when I try to load an XMI file (I tried every version) that comes from Visual Paradigm, it says "Failed to load the file".
On the other hand, if I save the diagram in UML2 format and then try to open it, StarUML just does nothing.
Do you have any suggestions to work this problem out?
Here is a zip archive with another simpler project containing source code and XMI files (different versions) generated by Visual Paradigm: Project.rar
In the StarUML GitHub issues there is something very similar to your issue.
I had the same problem and the proposed workaround worked for me: search for the
file "xmi-reader.js", then in the function "loadFromFile" change the line:
var XMINode = dom.getElementsByTagName('XMI')[0]
to
var XMINode = dom.getElementsByTagName('xmi:XMI')[0]
Adding the namespace prefix "xmi:" to the element name makes it work.
Depending on your version of StarUML, the file name could be xmi21-reader.js.
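To see why the prefix matters, here is a small sketch using Python's xml.dom.minidom, whose getElementsByTagName also matches on the qualified tag name. It is only an illustration of the lookup behaviour, not StarUML's code; the sample document and the helper name count_root_matches are made up for this example, and the namespace URI is just a placeholder for whatever your exported file declares.

```python
from xml.dom import minidom

# Hand-written stand-in for an exported file whose root element is namespaced.
SAMPLE_XMI = (
    '<xmi:XMI xmi:version="2.1" '
    'xmlns:xmi="http://schema.omg.org/spec/XMI/2.1">'
    '</xmi:XMI>'
)


def count_root_matches(xml_text):
    """Count elements found under the plain and the prefixed tag name."""
    dom = minidom.parseString(xml_text)
    return {
        "XMI": len(dom.getElementsByTagName("XMI")),
        "xmi:XMI": len(dom.getElementsByTagName("xmi:XMI")),
    }


if __name__ == "__main__":
    print(count_root_matches(SAMPLE_XMI))
```

With a prefixed root element, the lookup for the plain name finds nothing, which is consistent with the reader failing before the workaround is applied.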

How to generate API documentation from docstrings, for functional code

All I want is to generate API docs from function docstrings in my source code, presumably through sphinx's autodoc extension, to comprise my lean API documentation. My code follows the functional programming paradigm, not OOP, as demonstrated below.
I'd probably, as a second step, add one or more documentation pages for the project, hosting things like introductory comments, code examples (leveraging doctest I guess) and of course linking to the API documentation itself.
What might be a straightforward flow to accomplish documentation from docstrings here? Sphinx is a great popular tool, yet I find its getting started pages a bit dense.
What I've tried, from within my source directory:
$ mkdir documentation
$ sphinx-apidoc -f --ext-autodoc -o documentation .
No error messages, yet this doesn't find (or handle) the docstrings in my source files; it just creates one rst file per source file, with contents like the following:
tokenizer module
================
.. automodule:: tokenizer
    :members:
    :undoc-members:
    :show-inheritance:
Basically, my source files look like the following, without much module ceremony or object-oriented content in them (I like functional programming, even though it's Python this time around). I've truncated the sample source file, of course; it contains more functions than shown below.
tokenizer.py
from hltk.util import clean, safe_get, safe_same_char
"""
Basic tokenization for text
not supported:
+ forms of pseudo ellipsis (...)
support for the above should be added only as part of an automata rewrite
"""
always_swallow_separators = u" \t\n\v\f\r\u200e"
always_separators = ",!?()[]{}:;"
def is_one_of(char, chars):
    '''
    Returns whether the input `char` is any of the characters of the string `chars`
    '''
    return chars.count(char)
Or would you recommend a different tool and flow for this use case?
Many thanks!
If you find Sphinx too cumbersome and particular to use for simple projects, try pdoc:
$ pdoc --html tokenizer.py
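One detail in the sample source may matter whichever tool you pick: a string literal placed after an import statement is not the module docstring, so tools that read `__doc__` will not see it. The following is a minimal standard-library sketch of how function docstrings can be collected. `load_module` and `collect_docstrings` are hypothetical helper names, and the embedded source is a shortened stand-in for tokenizer.py with the hltk import swapped for `os` so that it runs anywhere.

```python
import inspect
import types

# Shortened stand-in for tokenizer.py; the string after the import is NOT a docstring.
SOURCE = '''
import os
"""
Basic tokenization for text
"""

def is_one_of(char, chars):
    """
    Returns whether the input `char` is any of the characters of the string `chars`
    """
    return chars.count(char)
'''


def load_module(name, source):
    """Execute source text inside a fresh module object and return it."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


def collect_docstrings(module):
    """Map each function defined in the module to its cleaned docstring."""
    docs = {}
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        if obj.__module__ == module.__name__:
            docs[name] = inspect.getdoc(obj)
    return docs


if __name__ == "__main__":
    tokenizer = load_module("tokenizer", SOURCE)
    print("module docstring:", tokenizer.__doc__)
    print(collect_docstrings(tokenizer))
```

Moving the triple-quoted description above the import would make it the module docstring.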

How to use Stanford Open IE with nltk

I am on an NLP project right now and I need to use Stanford Open information extraction tool with python (nltk if possible). I found a python wrapper
but it's poorly documented and does not give full functionality interface to Stanford Open IE. Any suggestion?
One approach is to use the CoreNLP Server, which outputs OpenIE triples (see, e.g., corenlp.run). Among other libraries, Stanford's Stanza library is written in Python and can call a server instance to get annotations. Make sure to include all the required annotators: tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie.
I just found another way, using pycorenlp together with a CoreNLP server:
from pycorenlp import StanfordCoreNLP
# replace the placeholder below with the URL string of your running server
nlp = StanfordCoreNLP(<<url_to_your_server>>)
text = "'the quick brown fox jumps over the lazy dog.'"
output = nlp.annotate(text, properties={
    'annotators': 'tokenize, ssplit, pos, depparse, parse, openie',
    'outputFormat': 'json'
})
and the fields available for each sentence can be listed with
print(output['sentences'][0].keys())
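Once the JSON output is available, the triples can be pulled out of it. The sketch below assumes each sentence carries an "openie" list whose entries have "subject", "relation" and "object" keys, which is how the CoreNLP server's JSON output is usually structured; check the keys printed above against your own server version. `extract_triples` is a hypothetical helper, and `sample_output` is hand-written illustration data, not real server output.

```python
def extract_triples(output):
    """Collect (subject, relation, object) tuples from a CoreNLP-style JSON dict."""
    triples = []
    for sentence in output.get("sentences", []):
        # Sentences without OpenIE results simply contribute nothing.
        for triple in sentence.get("openie", []):
            triples.append(
                (triple["subject"], triple["relation"], triple["object"])
            )
    return triples


# Hand-written example in the assumed shape, for illustration only.
sample_output = {
    "sentences": [
        {
            "openie": [
                {
                    "subject": "quick brown fox",
                    "relation": "jumps over",
                    "object": "lazy dog",
                }
            ]
        },
        {"tokens": []},  # a sentence with no extracted triples
    ]
}

if __name__ == "__main__":
    for subject, relation, obj in extract_triples(sample_output):
        print(subject, "|", relation, "|", obj)
```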

CMU Sphinx for Indian English

I have tried CMU Sphinx and it works fine with American English. Now I want to use CMU Sphinx to recognize Indian-accented English. What exactly are the steps/changes I should make?
What you will have to do is adapt the acoustic model. Check the CMU Sphinx wiki, which explains the procedure for both training and adapting acoustic models. The link that works for now: http://cmusphinx.sourceforge.net/wiki/
According to what the site says,
CMUSphinx provides ways for adaptation which is sufficient for most cases when more accuracy is required. Adaptation is known to work well when you are using different recording environments (close-distance or far microphone or telephone channel), or when a slightly different accent is present (UK English or even Indian English) or even another language.
One thing you can also do is download pre-trained files from here:
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
The files inside these .tar.gz are a bit different from the structure I have in my version of the lib, so I had to follow the steps in the following link to make it work:
https://github.com/Uberi/speech_recognition/issues/192
I'll show the steps I took, which are basically what the link above says, but since the link may die, here they are:
On my computer (Ubuntu 18.04.4), the dictionaries are kept here:
~/.local/lib/python2.7/site-packages/speech_recognition/pocketsphinx-data
Inside the above folder, I had a subfolder en-US, which contains the following files (F) and directories (D):
D acoustic-model
F language-model.lm.bin
F LICENSE.txt
F pronounciation-dictionary.dict
So I downloaded the .tar.gz for Indian English and made it look like the en-US directory, like this:
tar zxvf cmusphinx-en-in-8khz-5.2.tar.gz
mv cmusphinx-en-in-8khz-5.2 en-IN
cd en-IN
mv en-us.lm.bin language-model.lm.bin
mv en_in.dic pronounciation-dictionary.dict
mv en_in.cd_cont_5000 acoustic-model
cd ..
Then I moved it to the correct directory.
mv en-IN ~/.local/lib/python2.7/site-packages/speech_recognition/pocketsphinx-data
From this point, I was able to use en-IN.
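Before pointing the library at the new folder, it can help to verify that the renamed files match the en-US layout listed above. This is a minimal standard-library sketch; `missing_entries` is a hypothetical helper, and the expected names are taken from the en-US listing in this answer, including its spelling of "pronounciation".

```python
from pathlib import Path

# Names copied from the en-US layout shown above.
EXPECTED_ENTRIES = {
    "acoustic-model": "dir",
    "language-model.lm.bin": "file",
    "pronounciation-dictionary.dict": "file",
}


def missing_entries(language_dir):
    """Return the expected entries that are absent from a language folder."""
    language_dir = Path(language_dir)
    missing = []
    for name, kind in EXPECTED_ENTRIES.items():
        path = language_dir / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The folder path is an assumption; use wherever your pocketsphinx-data lives.
    print(missing_entries("pocketsphinx-data/en-IN"))
```

An empty list means the folder has the same shape as en-US. With the speech_recognition package, the language is then selected through the `language` argument of `recognize_sphinx`, for example `language="en-IN"`.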

How to create a GrammaticalRelation in Stanford CoreNLP

I have recently upgraded to the latest version of Stanford CoreNLP. The code I previously used to get the subject or object in a sentence was
System.out.println("subject: "+dependencies.getChildWithReln(dependencies.getFirstRoot(), EnglishGrammaticalRelations.NOMINAL_SUBJECT));
but this now returns null.
I have tried creating a relation with
GrammaticalRelation subjreln =
edu.stanford.nlp.trees.GrammaticalRelation.valueOf("nsubj");
without success. If I extract a relation using code like
GrammaticalRelation target = (dependencies.childRelns(dependencies.getFirstRoot())).iterator().next();
Then run the same request,
System.out.println("target: "+dependencies.getChildWithReln(dependencies.getFirstRoot(), target));
then I get the desired result, confirming that the parsing worked fine (I also know this from printing out the full dependencies).
I suspect my problem has to do with the switch to universal dependencies, but I don't know how to create the GrammaticalRelation from scratch in a way that will match what the dependency parser found.
Since version 3.5.2 the default dependency representation in CoreNLP is Universal Dependencies. This new representation is implemented in a different class (UniversalEnglishGrammaticalRelations), so the GrammaticalRelation objects are now defined somewhere else.
All you have to do to use the new version is to replace EnglishGrammaticalRelations with UniversalEnglishGrammaticalRelations:
System.out.println("subject: "+dependencies.getChildWithReln(dependencies.getFirstRoot(), UniversalEnglishGrammaticalRelations.NOMINAL_SUBJECT));
Note, however, that some relations in the new representation are different and some old ones no longer exist (nsubj still does). We are currently compiling migration guidelines from the old representation to the new Universal Dependencies relations. They are still incomplete, but they already contain all relation names and their class names in CoreNLP.
