Text mining library or linguistic library?

I have a bunch of data harvested from a forum I own, and would like to do some text mining or use a linguistics library to extract useful information.
Any text mining or data mining library in any language will do.
Thank you.

I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysis of postings from 2006 on the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/).
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Another useful package for this is Gary King's ReadMe package.

You may like to have a look at the Python NLTK (Natural Language ToolKit): it's specifically designed for this kind of thing.
There is also a great book you can buy to get you started.
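As a quick taste, here is a minimal sketch (assuming NLTK is installed and its tokenizer models are downloaded) of tokenizing forum posts and counting term frequencies; the posts list is a stand-in for your harvested forum data:

    import nltk
    from nltk import FreqDist, word_tokenize

    nltk.download("punkt")  # one-time download of the tokenizer models

    posts = ["Great thread, thanks for sharing!",
             "Thanks, this thread really helped me."]

    # Tokenize every post and count term frequencies across the corpus
    tokens = [tok.lower() for post in posts for tok in word_tokenize(post)]
    print(FreqDist(tokens).most_common(5))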

Mallet is a Java library designed for text mining. Once you have preprocessed the text data, a general data mining tool like Weka would also handle your task.
If you have access to SPSS or SAS, their products should be easier to use.

Try GATE; it has a GUI, and of course you can use the Java API for more power:
http://gate.ac.uk/family/developer.html
You can also use Weka for processing text and doing text mining; have a look at these useful lectures:
http://sentimentmining.net/weka/

Stanford CoreNLP is good for English text, and has things like named entity recognition. Take a look at: http://nlp.stanford.edu/software/corenlp.shtml
GATE, which Ehsan already recommended, is also good, but it can be a bit complicated if you need to write your own components. For large-scale stuff it's great though.
UIMA is similar to GATE, but not as easy to use because it doesn't feature an extensive GUI like GATE. (http://uima.apache.org)

I would recommend the following Python libraries:
nltk
keras
tensorflow
Note: before any text analysis you should clean the data based on your requirements.
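For that cleaning step, a minimal sketch (assuming NLTK is installed and its English stopword list is downloaded) might look like this; the regex and stopword choices are placeholders to adapt to your own requirements:

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # one-time download of the stopword list
    STOP = set(stopwords.words("english"))

    def clean(text):
        """Lowercase, keep letters only, and drop English stopwords."""
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        return " ".join(w for w in text.split() if w not in STOP)

    print(clean("Check out http://example.com -- it's GREAT!!!"))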

Related

Well-documented NLP libraries in any language supporting Slavic languages?

Do you have any tips for well-documented, developer-friendly NLP libraries for text analysis (morphology, text concepts) for Slavic languages like Czech, Polish, etc.?
The API could be in any language: Java, Python, C, Node, whatever.
A nice lib for stemming, as an example, could be this one: https://github.com/dundalek/czech-stemmer
I am studying the best options for text analysis. I want to be able to get the most out of a sentence on a specific topic. Let's say I have a medical sentence; thanks to the dictionary words in my database, I will be able to do an analysis based on an NLP algorithm.
Thanks!
Try polyglot. It supports both Polish and Czech.
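A minimal sketch of what that looks like for Polish (assuming polyglot is installed and the relevant Polish models have been fetched with its downloader; the exact model names are an assumption, so check polyglot's docs):

    # One-time model download, e.g.: polyglot download embeddings2.pl pos2.pl
    from polyglot.text import Text

    text = Text("Ala ma kota i psa.", hint_language_code="pl")
    print(text.words)     # tokenization
    print(text.pos_tags)  # part-of-speech tags (morphology)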

Toolkits to design a TTS (Text-to-speech) system for a custom language?

I'd like to create a TTS system for a Native American language (Wayuunaiki).
The language is written in the Latin (Western) alphabet.
I also have information about the phonetics (the rules to convert each word into IPA symbols).
I'm planning to create a database of voice recordings from the native people. Then I want to somehow train that data, using the IPA equivalency information to generate a more accurate speech model.
I'm totally new to Natural Language Processing, so my question is: which tools can I use to accomplish what I'm planning?
I've heard that HTK and CMU Sphinx are quite good at speech recognition. No idea about speech generation. I've also heard about Festival, but I read that it only supports a predefined set of well-known languages: English, Spanish, and so on.
Excuse my typing faults. I'm still learning English. Thanks in advance!
You can add a new language in Festival; it's actually specifically designed to simplify new-language creation. For more details read the Festvox book:
http://festvox.org/bsv/
Another toolkit to consider is OpenMary; see their documentation too:
https://github.com/marytts/marytts/wiki/New-Language-Support
It is more modern and might be easier for you.
In any case you will have to spend some time and write the code that describes your language. Usually it's about 300 lines of code. After that you can record a single-speaker TTS database and run the voice-building process. The more you record, the better the result will be.
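Those language description files are toolkit-specific (Festival's letter-to-sound rules are written in Scheme), but conceptually they encode a grapheme-to-IPA mapping, like this hypothetical, language-agnostic Python sketch; the rules below are invented for illustration, not real Wayuunaiki phonology:

    # Longest-match rules first, so digraphs win over single letters
    RULES = [("ch", "tʃ"), ("sh", "ʃ"),
             ("a", "a"), ("e", "e"), ("i", "i"), ("o", "o"), ("u", "u"),
             ("k", "k"), ("n", "n"), ("t", "t")]

    def to_ipa(word):
        """Greedily apply letter-to-sound rules left to right."""
        ipa, i = [], 0
        while i < len(word):
            for graph, phone in RULES:
                if word[i:].startswith(graph):
                    ipa.append(phone)
                    i += len(graph)
                    break
            else:
                i += 1  # no rule for this character; skip it
        return "".join(ipa)

    print(to_ipa("chinta"))  # -> tʃinta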
Use the Festival toolkit for text-to-speech (tip: use a Linux operating system).

Semantic analysis of text

Which tools would you recommend to look into for semantic analysis of text?
Here is my problem: I have a corpus of words (keywords, tags).
I need to process sentences, input by users and find if they are semantically close to words in the corpus that I have.
Any kind of suggestions (books or actual toolkits / APIs) are very welcome.
Regards,
Some useful links to begin with:
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
http://kmandcomputing.blogspot.com/2008/06/opinion-mining-with-rapidminer-quick.html
http://rapid-i.com/content/blogcategory/38/69/
http://www.cs.cornell.edu/People/pabo/movie-review-data/otherexperiments.html
http://wordnet.princeton.edu/
Tools/Libraries:
Open NLP
lingpipe
If you consider your corpus as an ontology, Apache Stanbol - http://incubator.apache.org/stanbol/ - might be useful. It uses DBpedia as the default ontology while analyzing text. Although it is incubating, the enhancer component is good enough for adoption. So, you can give it a try.
You can try some WordNet similarity measurements. Ted Pedersen has a compilation of those metrics in WordNet::Similarity, which you can experiment with and look into. There are counterpart implementations in other languages (e.g. Java).
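If you would rather work in Python, NLTK exposes WordNet similarity measures as well; a minimal sketch (assuming NLTK is installed and its WordNet data is downloaded):

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet")  # one-time download of the WordNet data

    dog = wn.synsets("dog")[0]
    cat = wn.synsets("cat")[0]

    # Path-based and Wu-Palmer similarity between the two concepts
    print(dog.path_similarity(cat))
    print(dog.wup_similarity(cat))

To compare a user sentence against your corpus of keywords, you could take the maximum pairwise similarity between sentence words and corpus words.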

Details on the following Natural Language Processing terms?

Named Entity Extraction (extract people, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning doc... Bayesian)
Text extraction (HTML page cleaning)
Are there libraries that I can use to do any of the above functions of NLP?
I don't really feel like forking out cash to AlchemyAPI.
There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:
Python: Natural Language Toolkit NLTK
Java: OpenNLP, Gate, and Stanford's JavaNLP
.NET: SharpNLP
If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.
You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
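For example, here is a minimal NER sketch with NLTK (assuming its tokenizer, tagger, and chunker models have been downloaded; the sentence is just an illustration):

    import nltk

    # One-time model downloads for tokenizing, tagging, and NE chunking
    for pkg in ("punkt", "averaged_perceptron_tagger",
                "maxent_ne_chunker", "words"):
        nltk.download(pkg)

    sentence = "Barack Obama visited IBM headquarters in New York."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)  # Tree with PERSON/ORGANIZATION/GPE chunks
    print(tree)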
What the Alchemy people call structured data extraction looks like it's just HTML scraping that is robust against changes to the underlying HTML, as long as the page still visually renders the same way. So it's not really an NLP task.
For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.
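If you want to call boilerpipe from Python, there is a port, boilerpy3; the class and method names below are taken from its docs as I remember them, so treat this as an assumption and double-check:

    from boilerpy3 import extractors

    extractor = extractors.ArticleExtractor()
    html = "<html><body><h1>Title</h1><p>Main article text.</p></body></html>"
    print(extractor.get_content(html))  # boilerplate-stripped main text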
The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Natural Language Processing Package

I have started working on a project which requires Natural Language Processing. We have to do spell checking, as well as map sentences to phrases and their synonyms. I first thought of using GATE, but I am confused about what to use. I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide on what suits my purpose best. I am working on a web application which will use this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do part-of-speech tagging for text input.
For finding/matching synonyms you could use something like WordNet http://wordnet.princeton.edu/
If you're doing something really domain specific: I would recommend coming up with your own ontology for domain specific terms.
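Following up on the WordNet suggestion above, a minimal synonym lookup via NLTK's WordNet interface (assuming the WordNet data is downloaded):

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet")  # one-time download

    # Lemma names across all synsets of "car" are synonym candidates
    synonyms = {lemma.name() for syn in wn.synsets("car")
                for lemma in syn.lemmas()}
    print(synonyms)  # includes e.g. 'auto', 'automobile', 'machine'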
If you are using Python, you can develop a spell checker with PyEnchant.
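A minimal sketch with PyEnchant (assuming the package and an English dictionary are installed):

    import enchant

    d = enchant.Dict("en_US")
    word = "helo"
    if not d.check(word):       # False when the word is misspelled
        print(d.suggest(word))  # candidate corrections, e.g. 'hello'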
NLTK is good for developing a sentiment analysis system too. I have some prototypes of this as well.
Jaggu
If you are using deep-learning-based models, and if you have sufficient data, you can implement task-specific models for any purpose. With the development of deep-learning-based language models, you can use word-embedding-based models together with lexicon resources to obtain synonyms and antonyms. You can also follow the links below for more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/
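As a sketch of the word-embedding approach (assuming gensim and its downloadable "glove-wiki-gigaword-50" vectors; both the library choice and the model name are my assumptions, not something from the links above):

    import gensim.downloader as api

    # Downloads the pretrained GloVe vectors on first use
    vectors = api.load("glove-wiki-gigaword-50")

    # Nearest neighbours in embedding space are synonym/related-word
    # candidates; note that antonyms can also land nearby, which is
    # why the lexicon filtering mentioned above is useful.
    print(vectors.most_similar("illness", topn=5))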
