NLP: Morphological manipulations

I am trying to build an NLP system for an assignment, for which I am allowed to use external libraries.
I am using parse trees to break sentences down into their constituent parts (nouns, verbs, etc.).
I am looking for a library or software that would let me identify which lexical form a word is in, and possibly translate it to some other form for me.
Basically, I need something with functions like isPlural, singularize, getInfinitive, etc.
I have considered the Ruby Linguistics package and a simple Porter Stemmer (for infinitives) but neither is very good.
This does not seem like a very hard problem, just very tedious.
Does anyone know of a good package/library/software that could do things like that?

Typically, in order to build a parse tree of a sentence, one needs to first determine the part-of-speech and lemma information of the words in the sentence. So, you should have this information already.
But in any case, in order to map wordforms to their lemmas, and synthesize wordforms from lemmas, take a look at morpha and morphg, and also the Java version of (or front-end to) morphg contained in the SimpleNLG package. There are methods like getInfinitive, getPastParticiple, etc. See e.g. the API for the Verb class.
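If a Python alternative is acceptable, a rough equivalent of the lemma lookup can be put together with NLTK's WordNetLemmatizer plus the third-party inflect package for noun number; this is just a sketch of the idea, not the morpha/morphg or SimpleNLG API itself:
# Wordform -> lemma with NLTK, noun number with inflect.
from nltk.stem import WordNetLemmatizer   # pip install nltk; nltk.download('wordnet')
import inflect                            # pip install inflect

lemmatizer = WordNetLemmatizer()
engine = inflect.engine()

# Verb/noun form -> dictionary form (lemma)
print(lemmatizer.lemmatize("swimming", pos="v"))   # swim
print(lemmatizer.lemmatize("geese", pos="n"))      # goose

# singular_noun() returns the singular if the word is plural,
# or False if it is already singular, which doubles as an isPlural test.
print(engine.singular_noun("parrots"))  # parrot
print(engine.singular_noun("parrot"))   # False, i.e. not plural
print(engine.plural("parrot"))          # parrots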

Related

How to detect sentence stress with Python NLP packages (spaCy or NLTK)?

Can we detect sentence stress (the stress on certain words, or the pauses between words, in a sentence) using common NLP packages such as spaCy or NLTK?
How can we tell content words from structure words using spaCy or NLTK?
Since NLP packages can detect dependencies, it should be possible to identify which words are stressed in natural speech.
I don't think that NLTK or spaCy supports this directly. You can find content words with either tool, sure, but that's only part of the picture. You want to look for software related to prosody or intonation, which you might find as a component of a text-to-speech system.
Here's a very recently published research paper with code that might be a good place to start: https://github.com/Helsinki-NLP/prosody/ . The annotated data and the references could be useful even if the code might not be exactly the kind of approach you're looking for.
I assume you do not have a training data set labeled with which words to stress. The simplest approach would then be to assume that stressed words all belong to certain parts of speech. Nouns and verbs would be a good start, excluding modal verbs for example.
NLTK comes with PoS-Taggers.
But as natural speech depends a lot on context, it is probably difficult even for humans to agree on a single answer for what to stress in a sentence.
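A minimal sketch of that PoS-based heuristic with NLTK's built-in tagger; treating nouns, verbs, adjectives and adverbs as the "stressable" content words is an assumption, not a real prosody model:
# Content-word candidates via part-of-speech tags.
import nltk
# requires the 'punkt' and tagger resources via nltk.download()

CONTENT_TAGS = ("NN", "VB", "JJ", "RB")   # nouns, verbs, adjectives, adverbs

def candidate_stress_words(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # modals are tagged MD, so they fall out automatically
    return [word for word, tag in tagged if tag.startswith(CONTENT_TAGS)]

print(candidate_stress_words("Parrots can swim surprisingly well."))
# e.g. ['Parrots', 'swim', 'surprisingly', 'well']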

How can we extract the main verb from a sentence?

For example, "parrots do not swim." Here the main verb is "swim". How can we extract that by language processing? Are there any known algorithms for this purpose?
You can run a dependency parsing algorithm on the sentence and then find the dependent of the root relation. For example, running the sentence "Parrots do not swim" through the Stanford Parser online demo, I get the following dependencies:
nsubj(swim-4, Parrots-1)
aux(swim-4, do-2)
neg(swim-4, not-3)
root(ROOT-0, swim-4)
Each of these lines provides information about a different grammatical relation between two words in the sentence (see below). You need the last line, which says that swim is the root of the sentence, i.e. the main verb. So to extract the main verb, perform dependency parsing first and find the dependency that reads root(ROOT-0, X). X will be the main verb.
There are several readily available dependency parsers, such as the one available with Stanford CoreNLP or Malt parser. I prefer Stanford because it is comparable in accuracy, but has better documentation and supports multithreaded parsing (if you have lots of text). The Stanford parser outputs XML, so you will have to parse that to get the dependency information above.
For the sake of completeness, a brief explanation of the rest of the output. The first line says that parrots, the first word in the sentence, is the subject of swim, the 4th word. The second line says that do is an auxiliary verb related to swim, and the third says that not negates swim. For a more detailed explanation of the meaning of each dependency, see the Stanford typed dependency manual.
Edit:
Depending on how you define main verb, some sentences may have more than one main verb, e.g. "I like cats and hate snakes." The dependency parse for this contains the dependencies:
root(ROOT-0, like-2)
conj(like-2, hate-5)
which together say that, according to the parser, the main verb is like, but hate is conjoined to it. For your purposes you might want to consider both like and hate to be main verbs.
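If a Python pipeline is an option, the same idea can be sketched with spaCy's dependency parser (a different parser than the Stanford one used above): find the token labeled ROOT and pick up any verbs conjoined to it.
# Main verb extraction via spaCy's dependency labels.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def main_verbs(sentence):
    doc = nlp(sentence)
    verbs = []
    for token in doc:
        if token.dep_ == "ROOT":
            verbs.append(token.text)
            # coordinated verbs ("like ... and hate ...") hang off the root as conjuncts
            verbs.extend(c.text for c in token.conjuncts if c.pos_ == "VERB")
    return verbs

print(main_verbs("Parrots do not swim."))          # expected: ['swim']
print(main_verbs("I like cats and hate snakes."))  # expected: ['like', 'hate']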
To get the verb (or any other part of speech) there are many supervised and unsupervised algorithms available, such as the Viterbi algorithm, Hidden Markov Models, the Brill tagger, Constraint Grammar, etc. There are also libraries like NLTK (Natural Language Toolkit) for Python (with similar options available for Java) that already implement these algorithms. Annotating POS accurately in a document or sentence is a complex job and requires in-depth knowledge of the field; begin with the basics, and continued effort might eventually lead you to an algorithm that outperforms the existing ones.

How can I determine the language of a web page, like Chrome does?

I am trying to get a corpus for a certain language. But when I fetch a web page, how can I determine its language?
Chrome can do it, but what's the principle?
I can come up with some ad-hoc methods, like an educated guess based on the character set, IP address, HTML tags, etc. But is there a more formal method?
I suppose the common method is looking at things like letter frequencies, common letter sequences and words, character sets (as you describe)... there are lots of different ways. An easy one would be to just get a bunch of dictionary files for various languages and test which one gets the most hits from the page, then offer, say, the next three as alternatives.
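A minimal sketch of that dictionary-hit idea; the wordlist file names are placeholders for any per-language word lists with one word per line:
# Rank candidate languages by how many page tokens appear in each wordlist.
import re

def load_wordlist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

wordlists = {
    "en": load_wordlist("words_en.txt"),   # placeholder paths
    "es": load_wordlist("words_es.txt"),
    "de": load_wordlist("words_de.txt"),
}

def guess_language(text):
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in wordlists.items()}
    # best guess first; the runners-up can be offered as alternatives
    return sorted(scores, key=scores.get, reverse=True)

print(guess_language("El zorro marrón salta sobre el perro perezoso"))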
If you are just interested in collecting corpora of different languages, you can look at country specific pages. For example, <website>.es is likely to be in Spanish, and <website>.de is likely to be in German.
Also, Wikipedia is translated into many languages. It is not hard to write a scraper for a particular language.
The model that determines a web page's language in Chrome is called the Compact Language Detector v3 (CLD3), and its C++ code is open source (sort of; it's not reproducible). There are also official Python bindings for it:
pip install gcld3
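Minimal usage looks roughly like this (names taken from the gcld3 package documentation; treat the exact API as an assumption):
# Language identification with the gcld3 bindings.
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Este es un texto escrito en español.")
print(result.language, result.is_reliable, result.probability)  # e.g. es True 0.99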

Natural Language Processing Package

I have started working on a project which requires Natural Language Processing. We have to do spell checking as well as map sentences to phrases and their synonyms. I first thought of using GATE, but I am confused about what to use. I found an interesting post here which got me even more confused.
http://lordpimpington.com/codespeaks/drupal-5.1/?q=node/5
Please help me decide what suits my purpose best. I am working on a web application which will use this NLP tool as a service.
You didn't really give much info, but try this: http://www.nltk.org/
I don't think NLTK does spell checking (I could be wrong on this), but it can do part-of-speech tagging for text input.
For finding/matching synonyms you could use something like WordNet http://wordnet.princeton.edu/
If you're doing something really domain-specific, I would recommend coming up with your own ontology for domain-specific terms.
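For the synonym part, a small sketch of a lookup through NLTK's WordNet interface (WordNet can of course also be queried through other wrappers):
# Synonym lookup via WordNet synsets.
from nltk.corpus import wordnet as wn    # requires nltk.download('wordnet')

def synonyms(word, pos=None):
    names = set()
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " "))
    return sorted(names)

print(synonyms("quick", pos=wn.ADJ))   # e.g. ['agile', 'fast', 'nimble', 'quick', ...]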
If you are using Python you can develop a spell checker with Python Enchant.
NLTK is also good for developing a sentiment analysis system; I have built some prototypes of this as well.
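A minimal PyEnchant spell-checking sketch (assumes an Enchant backend with an en_US dictionary is installed):
# Spell checking with PyEnchant (pip install pyenchant).
import enchant

d = enchant.Dict("en_US")
print(d.check("language"))      # True
print(d.check("langauge"))      # False
print(d.suggest("langauge"))    # e.g. ['language', ...]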
If you are using deep learning based models, and you have sufficient data, you can implement task-specific models for almost any purpose. With the development of deep learning based language models, you can use word-embedding models together with lexicon resources to obtain synonyms and antonyms. You can also follow the links below for more resources.
https://stanfordnlp.github.io/CoreNLP/
https://www.nltk.org/
https://wordnet.princeton.edu/

Libraries or tools for generating random but realistic text

I'm looking for tools for generating random but realistic text. I've implemented a Markov Chain text generator myself and while the results were promising, my attempts at improving them haven't yielded any great successes.
I'd be happy with tools that consume a corpus or that operate based on a context-sensitive or context-free grammar. I'd like the tool to be suitable for inclusion into another project.
Most of my recent work has been in Java so a tool in that language is preferred, but I'd be OK with C#, C, C++, or even JavaScript.
This is similar to this question, but larger in scope.
Extending your own Markov chain generator is probably your best bet, if you want "random" text. Generating something that has context is an open research problem.
Try (if you haven't):
Tokenise punctuation separately, or include punctuation in your chain if you're not doing so already. This includes paragraph marks (a minimal sketch of this follows below).
If you're using a 2- or 3-token history Markov chain, try resetting to a 1-token history when you encounter full stops or newlines.
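As a rough illustration of the first suggestion, here is a minimal word-level chain in Python (rather than Java) that treats punctuation and paragraph breaks as their own tokens; the corpus file name is a placeholder:
# Tiny Markov chain generator with punctuation-aware tokenisation.
import random
import re
from collections import defaultdict

def tokenize(text):
    # words, punctuation marks, and blank lines (paragraph breaks) as separate tokens
    return re.findall(r"\n\s*\n|[\w']+|[.,!?;:]", text)

def build_chain(tokens, order=1):
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain

def generate(chain, order=1, length=50):
    out = list(random.choice(list(chain)))
    for _ in range(length):
        choices = chain.get(tuple(out[-order:]))
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

corpus = open("corpus.txt", encoding="utf-8").read()   # any plain-text corpus
print(generate(build_chain(tokenize(corpus))))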
Alternatively, you could use WordNet in two passes with your corpus:
Analyse sentences to determine common sequences of word types, i.e. nouns, verbs, adjectives, and adverbs. WordNet covers these. Everything else (pronouns, conjunctions, whatever) is excluded, but you could essentially pass those straight through.
This would turn "The quick brown fox jumps over the lazy dog" into "The [adjective] [adjective] [noun] [verb(s)] over the [adjective] [noun]"
Reproduce sentences by randomly choosing a template sentence and replacing [adjective], [noun], and [verb] with actual adjectives, nouns, and verbs.
There are quite a few problems with this approach too: for example, you need context from the surrounding words to know which homonym to choose. Looking up "quick" in WordNet yields the sense of being fast, but also the bit of your fingernail.
I know this doesn't solve your requirement for a library or a tool, but might give you some ideas.
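A rough sketch of that two-pass template idea using NLTK's tagger and WordNet as the word source; note that the replacements come back as uninflected lemmas, and this ignores the homonym problem mentioned above:
# Turn a sentence into a PoS template and refill it with random WordNet words.
import random
import nltk
from nltk.corpus import wordnet as wn

# Penn tag prefixes mapped to WordNet parts of speech; everything else passes through.
POS_MAP = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}
_CACHE = {}

def random_word(wn_pos):
    if wn_pos not in _CACHE:
        _CACHE[wn_pos] = list(wn.all_synsets(wn_pos))
    synset = random.choice(_CACHE[wn_pos])
    return random.choice(synset.lemma_names()).replace("_", " ")

def fill_template(sentence):
    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = POS_MAP.get(tag[:2])
        out.append(random_word(wn_pos) if wn_pos else word)
    return " ".join(out)

print(fill_template("The quick brown fox jumps over the lazy dog"))
# prints a randomly re-filled sentence with the same part-of-speech skeleton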
I've used many data sets for this purpose, including Wikinews articles.
I extracted the text from them using this tool:
http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py
