Which tool is better, phonetisaurus or logios with cmudict-0.7b? - cmusphinx

Phonetisaurus is no longer maintained, and the new online LM Tool v3 uses logios.
So which tool is recommended for phonetic dictionary generation?
If logios, does anyone know of good documentation for the procedure?

Related

Standalone and open source library in Java that allows document clustering similar to carrot2

I am looking to cluster short text documents, each a few hundred characters long.
I have been using the carrot2 workbench and I really like its capabilities, but the API is archaic and difficult to understand and use.
I am looking for a replacement that has similar capabilities (clustering algorithms) but a better API.
It should be in Java or Python, and it has to be open source and free as in beer.
So lingpipe (http://alias-i.com/lingpipe/) does not qualify.
Thanks.
scikit-learn is in Python, supports a wide range of machine learning algorithms (including clustering) and is very well documented.
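To give a rough idea of what this looks like for short documents, here is a minimal sketch, assuming scikit-learn is installed. The toy documents, the choice of TF-IDF features with k-means, and the cluster count of 2 are all assumptions made for illustration; they are not part of the question and not a recommendation over other algorithms in the library.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): two travel snippets, two programming snippets.
docs = [
    "cheap flights to paris",
    "cheap flights to rome",
    "python code error traceback",
    "python code error exception",
]

# Turn each document into a TF-IDF weighted bag-of-words vector.
vectors = TfidfVectorizer().fit_transform(docs)

# Partition the vectors into two clusters; random_state fixes the seed.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(vectors)

for label, doc in zip(labels, docs):
    print(label, doc)
```

On real data the number of clusters is usually not known in advance, so it would have to be tuned or replaced with an algorithm that does not need it.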

How to design a full-text indexing system?

Lucene is a great open source indexing library. My problem is not how to use this kind of indexing tool, but how to learn and understand the way such tools are designed.
Maybe I should read the source code of Lucene, but I can't seem to find any tutorial about how this great work is done.
So, is there any other way, or a book, that can help me gain a concrete understanding of how to design such an indexing system?
Thank you.
The science behind Lucene is called Information Retrieval. Once you understand the algorithms and data structures behind Information Retrieval, Lucene or Sphinx become merely tools for solving your tasks. A good first step is to study the inverted index data structure.
A great book about Information Retrieval algorithms and data structures can be found here: http://nlp.stanford.edu/IR-book/ This Stanford text is a good resource and a good starting point for learning how Information Retrieval systems are designed.
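To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions (whitespace tokenization, lowercasing, everything held in memory) and not how Lucene itself is implemented; the function names `build_inverted_index` and `search_all` are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

docs = [
    "Lucene is an indexing library",
    "Sphinx is a search server",
    "an inverted index maps terms to documents",
]
index = build_inverted_index(docs)
print(search_all(index, "is an"))  # prints [0]
```

The key design point is that a query only touches the posting lists of its own terms instead of scanning every document. Real systems add tokenization rules, compression of posting lists, and ranking on top of this structure.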

Which phrase extraction tool is the state of the art now?

I know of the following open source tools, but I haven't found any comparisons of how good they are respectively.
Tools with ready to use phrase extraction:
KEA
MAUI (http://code.google.com/p/maui-indexer/)
Dragon, xTract (http://dragon.ischool.drexel.edu/xtract.asp)
Lingpipe (http://alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html)
Mahout (https://cwiki.apache.org/MAHOUT/collocations.html)
Anything else
Did anyone ever see such a comparison?
MAUI outperforms KEA in my experiments.
There is a comparison of unsupervised automatic keyphrase extraction methods (a COLING 2010 paper), but it does not analyse supervised methods; I plan to do that in the near future.
In addition, I have explored a richer set of features that improved the performance of automatic keyphrase extraction, although it is still far from perfect. I might release an extended version of MAUI with those features next year.
Please read the following papers or email me for more details:
Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization
Keyphrase Cloud Generation of Broadcast News
I like Mallet because it has a command line tool that is really easy to use.

Natural language de-identification

I am looking for a natural language tool that can automatically de-identify English text. For example, every email address should be renamed or obscured. Proper names should also be de-identified, as should addresses and the like.
There is a MITRE Identification Scrubber Toolkit. I don't know how well it works.
My questions:
Are there any other tools out there?
Does anyone have experience with the MITRE tool? How well does it work?
Thanks.
De-identification (perhaps more often referred to as anonymization) is a very active research area as its success is obviously a requirement for the use of authentic text corpora in such fields as NLP for healthcare, medicine and the like. I recommend that you look at the tools listed in the answer to this question on CrossValidated. If you follow the links further, you will find research papers describing how these tools work with further references and results evaluations.
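For the email-address case mentioned in the question, a simple pattern-based pass already goes a long way. The following is a minimal sketch using only Python's standard `re` module; the pattern, the function name `scrub_emails`, and the `[EMAIL]` placeholder are assumptions for illustration. It does not handle names or postal addresses, which need the statistical tools discussed above.

```python
import re

# Deliberately simple pattern: it covers common addresses, not the full RFC 5322 grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scrub_emails(text, placeholder="[EMAIL]"):
    """Replace every substring that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)

sample = "Contact jane.doe@example.com or bob@example.org for details."
print(scrub_emails(sample))  # prints: Contact [EMAIL] or [EMAIL] for details.
```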

Generating questions from text (NLP)

What approaches are there to generating questions from a sentence? Let's say I have a sentence "Jim's dog was very hairy and smelled like wet newspaper" - which toolkit is capable of generating a question like "What did Jim's dog smell like?" or "How hairy was Jim's dog?"
Thanks!
Unfortunately there isn't one, exactly. There is some code written as part of Michael Heilman's PhD dissertation at CMU; perhaps you'll find it and its corresponding papers interesting?
If it helps, the topic you want information on is called "question generation". This is pretty much the opposite of what Watson does, even though "here is an answer, generate the corresponding question" is exactly how Jeopardy is played. But actually, Watson is a "question answering" system.
In addition to the link to Michael Heilman's PhD provided by dmn, I recommend checking out the following papers:
Automatic Question Generation and Answer Judging: A Q&A Game for Language Learning (Yushi Xu, Anna Goldie, Stephanie Seneff)
Automatic Question Generation from Sentences (Husam Ali, Yllias Chali, Sadid A. Hasan)
As of 2022, Haystack provides a comprehensive suite of tools for question generation and question answering using recent Transformer models and transfer learning.
From their website,
Haystack is an open-source framework for building search systems that work intelligently over large document collections. Recent advances in NLP have enabled the application of question answering, retrieval and summarization to real world settings and Haystack is designed to be the bridge between research and industry.
NLP for Search: Pick components that perform retrieval, question answering, reranking and much more
Latest models: Utilize all transformer based models (BERT, RoBERTa, MiniLM, DPR) and smoothly switch when new ones get published
Flexible databases: Load data into and query from a range of databases such as Elasticsearch, Milvus, FAISS, SQL and more
Scalability: Scale your system to handle millions of documents and deploy them via REST API
Domain adaptation: All tooling you need to annotate examples, collect user-feedback, evaluate components and finetune models.
In my own experience, I was about 95% successful in generating questions and answers for training purposes during my internship. I also have a sample web user interface to demonstrate this, along with the code: My Web App and Code.
Huge shoutout to the developers on the Slack channel for helping noobs in AI like me! Implementing and deploying an NLP model has never been easier than with Haystack. I believe this is the only tool out there that lets one develop and deploy this easily.
Disclaimer: I do not work for deepset.ai or Haystack, am just a fan of haystack.
As of 2019, question generation from text has become possible. There are several research papers on this task.
The current state-of-the-art question generation model uses language modeling with different pretraining objectives. The research paper, a code implementation and a pre-trained model are available for download on the Papers with Code website.
This model can be used to fine-tune on your own dataset (instructions for finetuning are given here).
I would suggest checking out this link for more solutions. I hope it helps.

Resources