How to create Dictionary file from vocab? - cmusphinx

How to create dictionary(.dict) file for our specific domain Language model. I'm using CMU tool kit to create ARPA format Language model, but in that there is no option to create .dict file. Thanks in advance.

There is a short tutorial page that explains several ways to generate the dictionary for Sphinx.
In general, for English there is an existing dictionary that covers quite many words. If it does not contain any of your specific domain words, the pronunciations should be generated by grapheme-to-phoneme (G2P) system listed in the first link. G2P learns from an existing dictionary and generates pronunciations for the new ones.
One thing to take into account is the acoustic model. If you use some of the already trained Sphinx models, you should make sure the pronunciations are generated with the same phoneme set as the training dictionary.

Related

Recommended annotation tool to create a Named Entities Recognition data set

I'm new to NLP. I am looking for recommendations for an Annotation tool to create a labeled NER dataset from raw texts.
In details:
I'm trying to create a labeled data set for specific types of Entities in order to develop my own NER project (rule based at first).
I assumed there will be some friendly frameworks that allows create tagging projects, tag text data, create a labeled dataset, and even share projects so several people could work on the same project, but I'm struggling to find one (I admit "friendly" or "intuitive" are subjective, yet this is my experience).
So far I've tried several Frameworks:
I tried LightTag. It makes the tagging itself fast and easy (i.e. marking the words and giving them labels) but the entire process of creating a useful dataset is not as intuitive as I expected (i.e. uploading the text files, split to different tagging objects, save the tags, etc.)
I've installed and tried LabelStudio and found it less mature then LightTag (don't mean to judge here :))
I've also read about spaCy's Prodigy, which offers a paid annotation tool. I would consider purchasing it, but their website only offers a live demo of the the tagging phase and I can't access if their product is superior to the other two products above.
Even in StackOverflow the latest question I found on that matter is over 5 years ago.
Do you have any recommendation for a tool to create a labeled NER dataset from raw text?
⚠️ Disclaimer
I am the author of Acharya. I would limit my answers to the points raised in the question.
Based on your question, Acharya would help you in creating the project and upload your raw text data and annotate them to create a labeled dataset.
It would allow you to mark records individually for train or test in the dataset and would give data-centric reports to identify and fix annotation/labeling errors.
It allows you to add different algorithms (bring your own algorithm) to the project and train the model regularly. Once trained, it can give annotation suggestions from the trained models on untagged data to make the labeling process faster.
If you want to train in a different setup, it allows you to export the labeled dataset in multiple supported formats.
Currently, it does not support sharing of projects.
Acharya community edition is in alpha release.
github page (https://github.com/astutic/Acharya)
website (https://acharya.astutic.com/)
Doccano is another open-source annotation tool that you can check out https://github.com/doccano/doccano
I have used both DOCCANO (https://github.com/doccano/doccano) and BRAT (https://brat.nlplab.org/).
Find the latter very good and it supports more functions. Both are free to use.

NLP : Is Gazetteer a cheat

In NLP there is a concept of Gazetteer which can be quite useful for creating annotations. As far as i understand,
A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.
So it is essentially a lookup. Isn't this kind of a cheat? If we use a Gazetteer for detecting named entities, then there is not much Natural Language Processing going on. Ideally, i would want to detect named entities using NLP techniques. Otherwise how is it any better than a regex pattern matcher.
Does that make sense?
Depends on how you built/use your gazetteer. If you are presenting experiments in a closed domain and you custom picked your gazetteer, then yes, you are cheating.
If you are using some openly available gazetteer and performing experiments on a large dataset or using it in an application in the wild where you don't control the input then you are fine.
We found ourselves in a similar situation. We partition our dataset and use the training data to automatically build our gazetteers. As long as you report your methodology you should not feel like cheating (let the reviewers complain).

Can extract generic entities using Lingpipe other than People, Org and Loc?

I have read through Lingpipe for NLP and found that we have a capability there to identify mentions of names of people, locations and organizations. My questions is that if I have a training set of documents that have mentions of let's say software projects inside the text, can I use this training set to train a named entity recognizer? Once the training is complete, I should be able to feed a test set of textual documents to the trained model and I should be able to identify mentions of software projects there.
Is this generic NER possible using NER? If so, what features should I be using that I should feed?
Thanks
Abhishek S
Provided that you have enough training data with tagged software projects that would be possible.
If using Lingpipe, I would use character n-grams model as the first option for your task. They are simple and usually do the work. If results are not good enough some of the standard NER features are:
tokens
part of speech (POS)
capitalization
punctuaction
character signatures: these are some ideas: ( LUCENE -> AAAAAA -> A) , (Lucene -> Aaaaaa -> Aa ), (Lucene-core --> Aaaaa-aaaa --> Aa-a)
it may also be useful to compose a gazzeteer (list of software projects) if you can obtain that from Wikipedia, sourceforge or any other internal resource.
Finally, for each token you could add contextual features, tokens before the current one (t-1, t-2...), tokens after the current one (t+1,t+2...) as well as their bigram combinations (t-2^t-1), (t+1^t+2).
Of course you can. Just get train data with all categories you need and follow tutorial http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html.
No feature tuning is required since lingpipe uses only hardcoded one (shapes, sequnce word and ngramms)

Can I identify intranet page content using Named Entity Recognition?

I am new to Natural Language Processing and I want to learn more by creating a simple project. NLTK was suggested to be popular in NLP so I will use it in my project.
Here is what I would like to do:
I want to scan our company's intranet pages; approximately 3K pages
I would like to parse and categorize the content of these pages based on certain criteria such as: HR, Engineering, Corporate Pages, etc...
From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train the NLTK solution and run each page through to determine the category.
Is this the right approach? I appreciate any direction and ideas...
Thanks
It looks like you want to do text/document classification, which is not quite the same as Named Entity Recognition, where the goal is to recognize any named entities (proper names, places, institutions etc) in text. However, proper names might be very good features when doing text classification in a limited domain, it is for example likely that a page with the name of the head engineer could be classified as Engineering.
The NLTK book has a chapter on basic text classification.

Improving entity naming with custom file/code in NLTK

We've been working with the NLTK library in a recent project where we're
mainly interested in the named entities part.
In general we're getting good results using the NEChunkParser class.
However, we're trying to find a way to provide our own terms to the
parser, without success.
For example, we have a test document where my name (Shay) appears in
several places. The library finds me as GPE while I'd like it to find
me as PERSON...
Is there a way to provide some kind of a custom file/
code so the parser will be able to interpret the named entity as I
want it to?
Thanks!
The easy solution is to compile a list of entities that you know are misclassified, then filter the NEChunkParser output in a postprocessing module and replace these entities' tags with the tags you want them to have.
The proper solution is to retrain the NE tagger. If you look at the source code for NLTK, you'll see that the NEChunkParser is based on a MaxEnt classifier, i.e. a machine learning algorithm. You'll have to compile and annotate a corpus (dataset) that is representative for the kind of data you want to work with, then retrain the NE tagger on this corpus. (This is hard, time-consuming and potentially expensive.)

Resources