Devanagari text processing (NLP): where to start - nlp

I am new to Devanagari NLP. Are there any groups or resources that would help me get started with NLP in a Devanagari-script language (mostly Nepali, or similar languages such as Hindi)? I also want to develop fonts for Devanagari and build a font-processing application. If anyone working in this field could give me some advice, it would be highly appreciated.
Thanks in advance

I am new to Devanagari NLP. Is there any group or resources that would help me get started with NLP in a Devanagari-script language (mostly Nepali, or similar languages such as Hindi)?
You can use the pretrained embeddings provided by fastText [https://fasttext.cc/docs/en/pretrained-vectors.html#content] together with deep learning RNN models such as LSTMs for text classification and sentiment analysis.
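As a starting point, here is a minimal sketch of loading those vectors with gensim; the file name wiki.ne.vec is an assumption, so substitute whichever pretrained Nepali or Hindi file you download.

    # Load pretrained fastText vectors (text .vec format) with gensim.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format('wiki.ne.vec', binary=False)

    # Nearest neighbours in embedding space surface related Devanagari words,
    # a quick sanity check before wiring the vectors into a model.
    print(vectors.most_similar('नेपाल', topn=5))

For classification you would then typically build an embedding matrix from these vectors and feed it to an LSTM layer in Keras or PyTorch.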
You can find datasets for named entity recognition here [http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5].
For processing Indian languages, you can refer to the Indic NLP Library [https://github.com/anoopkunchukuttan/indic_nlp_library].
NLTK also supports Indian languages; for POS tagging and related NLP tasks, refer here [http://www.nltk.org/_modules/nltk/corpus/reader/indian.html].
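For instance, a hedged sketch of training a baseline POS tagger on the Hindi data bundled with NLTK (the sample sentence is just an illustration):

    import nltk
    nltk.download('indian')  # the POS-tagged Indian Languages corpus

    from nltk.corpus import indian
    from nltk.tag import UnigramTagger

    # Train a simple baseline tagger on the tagged Hindi sentences.
    train_sents = list(indian.tagged_sents('hindi.pos'))
    tagger = UnigramTagger(train_sents)
    print(tagger.tag(['यह', 'एक', 'वाक्य', 'है']))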

Are there any groups or resources that would help me get started with NLP in a Devanagari-script language?
The Bhasa Sanchar project under Madan Puraskar Pustakalaya has developed a Nepali corpus. You may request the corpus for non-commercial purposes from the contact provided in the link above.
Python's NLTK has a Hindi-language corpus. You can import it with:
from nltk.corpus import indian
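For example, a quick look at what the corpus contains:

    import nltk
    nltk.download('indian')  # fetch the corpus on first use

    from nltk.corpus import indian

    print(indian.fileids())                      # ['bangla.pos', 'hindi.pos', ...]
    print(indian.tagged_words('hindi.pos')[:5])  # the first five (word, tag) pairs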
To gain insight into Devanagari-based NLP, I suggest you go through research papers. Nepali is an under-resourced language; much work is yet to be done, and it might be difficult to find content for it.
You should probably look into language detection, text classification, and sentiment analysis, among others (preferably based on a POS tagging library trained on the corpus), to grasp the basics.
For the second part of the question:
I am pretty sure font development doesn't come under the domain of Natural Language Processing. Did you mean something else?

Related

Can anyone give a brief overview of how to proceed with Named Entity Recognition in the Tamil language?

I have looked at Stanford NER and Polyglot; neither supports the Tamil language.
I would like to use ML along with some rule-based NLP processing to do the entity recognition.
Neither Stanford NER nor Polyglot is rule-based. If you are only considering rule-based systems, you should probably look for existing frameworks that process Tamil correctly, or turn to generic ones (e.g. GATE).
Have a look at this paper, which reports on existing NER systems for Tamil; you may contact the authors.
If you find no system available, it should be rather easy to train one using existing datasets such as NER-FIRE2013 and NER-FIRE2014; ask the organizers how to obtain access to those datasets.
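To make the training route concrete, here is a hedged sketch using sklearn-crfsuite; the load_sentences helper and the file name are hypothetical, since the FIRE datasets must be requested from the organizers.

    # Train a CRF sequence tagger on sentences of (token, label) pairs.
    import sklearn_crfsuite

    def token_features(sent, i):
        word = sent[i][0]
        return {
            'word': word,
            'suffix3': word[-3:],  # suffixes carry a lot of signal in Tamil morphology
            'is_first': i == 0,
            'prev_word': '' if i == 0 else sent[i - 1][0],
        }

    train_sents = load_sentences('ner-fire2013.train')  # hypothetical loader
    X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
    y_train = [[label for _, label in s] for s in train_sents]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit(X_train, y_train)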
Hope this helps!

Accuracy: ANNIE vs Stanford NLP vs OpenNLP with UIMA

My work is planning on using a UIMA cluster to run documents through to extract named entities and so on. As I understand it, UIMA has very few NLP components packaged with it. I've been testing GATE for a while now and am fairly comfortable with it. It does OK on normal text, but when we run it through some representative test data, the accuracy drops way down. The text data we have internally is sometimes all caps, sometimes all lowercase, or a mix of the two in the same document. Even using ANNIE's all-caps rules, the accuracy still leaves much to be desired. I've recently heard of Stanford NLP and OpenNLP but haven't had time to extensively train and test them. How do those two compare in terms of accuracy with ANNIE? Do they work with UIMA like GATE does?
Thanks in advance.
It's not possible/reasonable to give a general estimate of the performance of these systems. As you said, accuracy declines on your test data. That happens for several reasons: one is the language characteristics of your documents; another is the characteristics of the annotations you are expecting to see. AFAIK every NER task has similar but still subtly different annotation guidelines.
Having said that, on your questions:
ANNIE is the only free, open-source, rule-based NER system in Java I could find. It's written for news articles and, I guess, tuned for the MUC-6 task. It's good for proofs of concept, but getting a bit outdated. Its main advantage is that you can start improving it without any knowledge of machine learning or NLP, well, maybe a little Java. Just study JAPE and give it a shot.
OpenNLP, Stanford NLP, etc. come by default with models for news articles and perform better than ANNIE (just looking at results; I never tested them on a big corpus). I liked the Stanford parser better than OpenNLP, again just looking at documents, mostly news articles.
Without knowing what your documents look like, I really can't say much more. You should decide whether your data is suitable for rules, or whether you should go the machine-learning way and use OpenNLP, the Stanford parser, the Illinois tagger, or anything else. The Stanford parser seems more appropriate for just pouring in your data, training, and producing results, while OpenNLP seems more appropriate for trying different algorithms and playing with parameters.
For your GATE-versus-UIMA question, I tried both and found a more vibrant community and better documentation for GATE. Sorry for giving personal opinions :)
Just for the record, answering the UIMA angle: for both Stanford NLP and OpenNLP, excellent packaging as UIMA analysis engines is available via the DKPro Core project.
I would like to add one more note. UIMA and GATE are two frameworks for the creation of Natural Language Processing (NLP) applications. However, Named Entity Recognition (NER) is a basic NLP component, and you can find implementations of NER independent of UIMA and GATE. The good news is that you can usually find a wrapper for a decent NER in both UIMA and GATE. To make this clear, consider this example:
OpenNLP NER
A wrapper for OpenNLP NER in GATE
A wrapper for OpenNLP NER in UIMA
It is the same for the Stanford NER component.
Coming back to your question, this website lists state-of-the-art NER systems:
http://www.aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_(State_of_the_art)
For example, in the MUC-7 competition, the best participant, named LTG, achieved an accuracy of 93.39%.
http://www.aclweb.org/aclwiki/index.php?title=MUC-7_(State_of_the_art)
Note that if you want to use such a state-of-the-art implementation, you may have some issues with its license.

Analysing meaning of sentences

Are there any tools that analyze the meaning of given sentences? Recommendations are greatly appreciated.
Thanks in advance!
I am also looking for similar tools. One thing I found recently is this sentiment analysis tool built by researchers at Stanford.
It provides a model for analyzing the sentiment of a given sentence. It's interesting that even this seemingly simple idea is quite involved to model accurately. It uses machine learning to achieve higher accuracy as well. There is a live demo where you can input sentences to analyze.
http://nlp.stanford.edu/sentiment/
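If you want to call the same pipeline programmatically rather than through the demo page, here is a hedged sketch against a locally running CoreNLP server; the startup command and port follow the standard server instructions and are assumptions about your setup.

    # Query a local CoreNLP server for sentence-level sentiment.
    # Start the server first, e.g.:
    #   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
    import json
    import requests

    props = {'annotators': 'sentiment', 'outputFormat': 'json'}
    resp = requests.post(
        'http://localhost:9000/',
        params={'properties': json.dumps(props)},
        data='This movie was surprisingly good.'.encode('utf-8'),
    )
    for sentence in resp.json()['sentences']:
        print(sentence['sentiment'])  # e.g. 'Positive'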
I also saw this RelEx semantic dependency relationship extractor.
http://wiki.opencog.org/w/Sentence_algorithms
Some natural language understanding tools can analyze the meaning of sentences, including NLTK and Attempto Controlled English. There are several implementations of discourse representation structures and semantic parsers with a similar purpose.
There are also several parsers that can be used to generate a meaning representation from the text that is being parsed.

Details on the following Natural Language Processing terms?

Named Entity Extraction (extract people, cities, organizations)
Content Tagging (extract topic tags by scanning the document)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning the document... Bayesian)
Text Extraction (HTML page cleaning)
Are there libraries that I can use to do any of the above NLP functions?
I don't really feel like forking out cash to AlchemyAPI.
There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:
Python: Natural Language Toolkit NLTK
Java: OpenNLP, GATE, and Stanford's JavaNLP
.NET: SharpNLP
If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.
You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
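For instance, a minimal NER sketch with NLTK (the sentence is an arbitrary example):

    import nltk
    for pkg in ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'):
        nltk.download(pkg)  # models and data the chunking pipeline needs

    sentence = "Barack Obama visited Paris with IBM executives."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

    # Named entities come back as labelled subtrees (PERSON, GPE, ORGANIZATION, ...).
    for subtree in tree.subtrees():
        if subtree.label() != 'S':
            print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))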
What the Alchemy people call structured data extraction looks like it's just HTML scraping that is robust against changes to the underlying HTML, as long as the page still renders the same way visually. So it's not really an NLP task.
For extracting text from HTML, just use boilerpipe. It's fast, good, and free.
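A short sketch with the python-boilerpipe wrapper (pip install boilerpipe; it needs a JVM, since boilerpipe itself is a Java library, and the URL here is a placeholder):

    from boilerpipe.extract import Extractor

    # ArticleExtractor is tuned for news-style pages.
    extractor = Extractor(extractor='ArticleExtractor',
                          url='http://example.com/some-article.html')
    print(extractor.getText())  # the main text with boilerplate stripped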
The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Natural Language Processing Package

I have started working on a project that requires natural language processing. We have to do spell checking as well as map sentences to phrases and their synonyms. I first thought of using GATE, but I am confused about what to use. I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide what suits my purpose best. I am working on a web application which will use this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong about this), but it can do part-of-speech tagging for text input.
For finding/matching synonyms you could use something like WordNet http://wordnet.princeton.edu/
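For example, collecting synonyms through NLTK's WordNet interface (a minimal sketch):

    import nltk
    nltk.download('wordnet')

    from nltk.corpus import wordnet as wn

    # Every lemma of every synset of 'car' is a candidate synonym.
    synonyms = {lemma.name() for synset in wn.synsets('car')
                for lemma in synset.lemmas()}
    print(synonyms)  # includes 'auto', 'automobile', 'machine', 'motorcar', ...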
If you're doing something really domain specific: I would recommend coming up with your own ontology for domain specific terms.
If you are using Python, you can develop a spell checker with PyEnchant.
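A quick sketch (pip install pyenchant):

    import enchant

    d = enchant.Dict('en_US')
    print(d.check('hello'))   # True
    print(d.suggest('helo'))  # suggestions such as 'hello', 'help', ...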
NLTK is good for developing a sentiment analysis system too. I have some prototypes of the same as well.
Jaggu
If you are using deep-learning-based models and have sufficient data, you can implement task-specific models for any purpose. With the development of deep-learning-based language models, you can use word-embedding-based models together with lexicon resources to obtain synonyms and antonyms (see the short sketch after the links below). You can also follow the links below for more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/
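As a concrete illustration of the lexicon side, here is a minimal antonym lookup through NLTK's WordNet interface; embedding neighbours (e.g. most_similar in gensim) would supply the related-word side.

    from nltk.corpus import wordnet as wn

    # Antonymy is a lexical relation, so it comes from WordNet, not embeddings.
    print(wn.lemma('good.a.01.good').antonyms())  # [Lemma('bad.a.01.bad')]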
