Well-documented NLP libraries in any language supporting Slavic languages? - nlp

Do you have any tips for well-documented, developer-friendly NLP libraries for text analysis (morphology, concept extraction) of Slavic languages like Czech, Polish, etc.?
The API could be in any language: Java, Python, C, Node, whatever.
A nice example of such a lib for stemming is this one: https://github.com/dundalek/czech-stemmer
I am studying the best options for text analysis. I want to be able to get the most out of a sentence in a specific topic. Let's say I have a medical sentence; thanks to the dictionary words in my database, I want to do an analysis based on an NLP algorithm.
Thanks!

Try polyglot. It supports both Polish and Czech.
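For example, here is a minimal sketch of morphological analysis with polyglot (the sentence is my own example; it assumes polyglot's morphology models for Czech have been downloaded first, e.g. `polyglot download morph2.cs`):
```python
from polyglot.text import Text

# Czech sentence, with the language hinted explicitly
sentence = Text(u"Nemocnice objednala nové vybavení.", hint_language_code="cs")
for word in sentence.words:
    print(word, word.morphemes)  # each word segmented into its morphemes
```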

Related

How to Detect Present Simple Tense in English Sentences using a Rule-Based Approach?

I have a straightforward task of determining the sentence structure, specifically to identify if a sentence written in plain English is in the "present simple" tense. I am aware of a couple of libraries that could help with this task:
OpenNLP
CoreNLP
However, it seems that both of these libraries use machine learning in the background and require pre-trained language models. I am looking for a more lightweight solution, possibly using a rule-based approach. Is it possible to use OpenNLP or CoreNLP without machine learning for my task?
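To illustrate the rule-based idea: a hedged sketch is below, using NLTK's tagger (still a small statistical model, but far lighter than CoreNLP) rather than OpenNLP/CoreNLP, with the tense decision itself made by plain rules over the POS tags. It requires `nltk.download('punkt')` and `nltk.download('averaged_perceptron_tagger')`:
```python
import nltk

def is_present_simple(sentence: str) -> bool:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    has_present_verb = any(t in ("VBP", "VBZ") for t in tags)  # eat / eats
    has_modal = "MD" in tags                                   # will, can, ...
    has_participle = any(t in ("VBG", "VBN") for t in tags)    # eating / eaten
    return has_present_verb and not has_modal and not has_participle

print(is_present_simple("She walks to work."))       # True
print(is_present_simple("She is walking to work."))  # False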

Semantic analysis of text

Which tools would you recommend to look into for semantic analysis of text?
Here is my problem: I have a corpus of words (keywords, tags).
I need to process sentences, input by users, and find whether they are semantically close to the words in my corpus.
Any kind of suggestions (books or actual toolkits / APIs) are very welcome.
Regards,
Some useful links to begin with:
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
http://kmandcomputing.blogspot.com/2008/06/opinion-mining-with-rapidminer-quick.html
http://rapid-i.com/content/blogcategory/38/69/
http://www.cs.cornell.edu/People/pabo/movie-review-data/otherexperiments.html
http://wordnet.princeton.edu/
Tools/Libraries:
OpenNLP
LingPipe
If you consider your corpus as an ontology, Apache Stanbol - http://incubator.apache.org/stanbol/ - might be useful. It uses DBpedia as the default ontology while analyzing text. Although it is incubating, the enhancer component is good enough for adoption. So, you can give it a try.
You can try some WordNet similarity measurements. Ted Pedersen has a compilation of those metrics in WordNet::Similarity, which you can experiment with and look into. There are counterpart implementations in other languages (e.g. Java).
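For a Python counterpart, here is a minimal sketch using NLTK's WordNet interface (requires `nltk.download('wordnet')`):
```python
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
print(dog.path_similarity(cat))  # shortest-path similarity in the hypernym tree
print(dog.wup_similarity(cat))   # Wu-Palmer similarity; closer to 1 = more similar
```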

Details on the following Natural Language Processing terms?

Named Entity Extraction (extract people, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning doc... Bayesian)
Text extraction (HTML page cleaning)
Are there libraries that I can use to do any of the above functions of NLP?
I don't really feel like forking out cash to AlchemyAPI.
There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:
Python: Natural Language Toolkit (NLTK)
Java: OpenNLP, Gate, and Stanford's JavaNLP
.NET: SharpNLP
If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.
You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
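For instance, a short sketch of named entity recognition with NLTK (my own example; it needs the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words NLTK data packages):
```python
import nltk

sentence = "Barack Obama visited IBM's offices in New York."
# tokenize -> POS-tag -> chunk tagged tokens into named entities
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        print(subtree.label(), "->", " ".join(w for w, t in subtree.leaves()))
```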
What the Alchemy people call structured data extraction looks like it's just HTML scraping that is robust against changes to the underlying HTML, as long as the page still visually renders the same way. So it's not really an NLP task.
For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.
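From Python, one way to call it is through the python-boilerpipe wrapper (an assumption on my part; boilerpipe itself is a Java library, and the wrapper needs JPype installed):
```python
from boilerpipe.extract import Extractor

# hypothetical URL, for illustration only
extractor = Extractor(extractor="ArticleExtractor",
                      url="http://example.com/some-article")
print(extractor.getText())  # main article text, boilerplate stripped
```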
The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Natural Language Processing Package

I have started working on a project which requires Natural Language Processing. We have to do spell checking as well as map sentences to phrases and their synonyms. I first thought of using GATE, but I am confused about what to use. I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide what suits my purpose best. I am working on a web application which will use this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do part-of-speech tagging for text input.
For finding/matching synonyms you could use something like WordNet: http://wordnet.princeton.edu/
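A quick sketch of synonym lookup through NLTK's WordNet interface (requires `nltk.download('wordnet')`):
```python
from nltk.corpus import wordnet as wn

# collect lemma names across all synsets of the word
synonyms = {lemma.name().replace("_", " ")
            for synset in wn.synsets("purchase")
            for lemma in synset.lemmas()}
print(synonyms)  # e.g. {'purchase', 'buy', ...}
```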
If you're doing something really domain specific: I would recommend coming up with your own ontology for domain specific terms.
If you are using Python, you can develop a spell checker with Python Enchant. For example:
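A tiny sketch with PyEnchant (assumes the system enchant library and an English dictionary are installed):
```python
import enchant

d = enchant.Dict("en_US")
print(d.check("language"))    # True: correctly spelled
print(d.check("langauge"))    # False: misspelled
print(d.suggest("langauge"))  # ['language', ...] suggested corrections
```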
NLTK is good for developing a sentiment analysis system too; I have some prototypes of these myself.
Jaggu
If you are using deep learning based models, and if you have sufficient data, you can implement task-specific models for any purpose. With the development of deep learning based language models, you can use word embedding based models together with lexicon resources to obtain synonyms and antonyms. You can also follow the links below for more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/

text mining library or lingual library?

I have a bunch of data harvested from a forum I own, and I would like to do some text mining or use a linguistic library to extract useful information.
Any text mining or data mining library in any language will do.
Thank you.
I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of newsgroup postings from 2006 on the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/).
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Another useful package for this is Gary King's readme package.
You may like to have a look at the Python NLTK (Natural Language ToolKit): it's specifically designed for this kind of thing.
There is also a great book you can buy to get you started.
Mallet is a Java library designed for text mining. Once you have preprocessed the text data, a general data mining tool like Weka would also handle your task.
If you have access to SPSS or SAS, their products should be easier to use.
Try GATE; it has a GUI, and of course you can use the Java API for more power:
http://gate.ac.uk/family/developer.html
You can also use Weka for processing text and doing text mining; have a look at these useful lectures:
http://sentimentmining.net/weka/
Stanford CoreNLP is good for English text and has things like named entity recognition. Take a look at: http://nlp.stanford.edu/software/corenlp.shtml
GATE, which Ehsan already recommended, is also good, but it can be a bit complicated if you need to write your own components. For large-scale stuff it's great though.
UIMA is similar to GATE, but not as easy to use because it doesn't feature an extensive GUI like GATE. (http://uima.apache.org)
I would recommend the following Python libraries:
nltk
keras
tensorflow
Note: before any text analysis, you should clean the data based on your requirements. For example:
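A minimal cleaning sketch with NLTK (my own example; assumes the punkt and stopwords NLTK data have been downloaded):
```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean(post):
    post = re.sub(r"<[^>]+>", " ", post)   # strip leftover HTML tags
    tokens = word_tokenize(post.lower())   # lowercase and tokenize
    stop = set(stopwords.words("english"))
    return [t for t in tokens if t.isalpha() and t not in stop]

print(clean("<p>This forum post talks about text mining!</p>"))
# ['forum', 'post', 'talks', 'text', 'mining']
```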
