Difference between word stemming and depluralization - nlp

In understanding string matching: What is the exact difference between word stemming and depluralization?
Or do they mean the same thing?

First, stemming refers to the process of reducing a word to its stem. However, that may mean a number of different things. Most linguists differentiate between at least two ways of doing it:
Removing grammatical, but not derivational morphemes. Grammatical morphemes are components of the word that are related to its grammatical role in a particular sentence, e.g. number, case, gender, tense, aspect etc.
Removing both grammatical and derivational morphemes. Derivational morphemes are components of the word that are related to its derivation from another word, e.g. the "-er" in "worker" is related to how it is derived (or can be considered as derived) from "work".
Therefore, depluralization, which is a rather unusual term, but obviously refers to removing a plural morpheme (such as the "-s" at the end of "computers"), is part of a kind of stemming, specifically a part of the removal of grammatical (but not derivational) morphemes.
In English, the morphology of nouns is largely limited to plural ("computers") and genitive (second case, "computer's"), hence as far as English is concerned, depluralization may be seen as (almost) synonymous with (grammatical) stemming, at least to the extent that stemming is applied to nouns, and, to some degree, adjectives, (which it is e.g. in the context of information retrieval). However, wherever verbs are considered, past tense, passive voice and other inflectional forms are subject to stemming (but not to depluralization).
Furthermore, in languages other than English, even nouns may have a very rich morphology, including morphemes for such things as case, politeness level, or special kinds of plural (such as dual). And then, depluralization (if you want to use that term at all) would refer to only a very small part of the overall stemming process.
Another related term is lemmatization, which is often used synonymously with stemming. One distinction between the two that I have found many people (including myself) make is this:
Stemming is used to refer to a rule-based or machine-learning based technique that removes parts of a word (mostly endings) that look like grammatical morphemes
Lemmatization is used to refer to a process that does the same, but using an actual dictionary of the language to deal with highly irregular forms (such as the plural "women")
(But, again, not everyone will agree with this distinction.)

They are not the same. There are a few approaches to stemming a word; depluralization is one strategy.
Just one quick example: a stemmer might stem "childish" into "child", or the word "stemmer" into "stem", while a depluralization algorithm will not.

Stemming is converting multiple words with the same root to one word.
Ex. "cats", "catlike", "catty" to "cat"
Depluralization is converting plural words into singular.
Ex. "cats" to "cat"
Additional info for stemming and algorithms
http://en.wikipedia.org/wiki/Stemming#Algorithms
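For illustration, here is a minimal sketch (assuming Python with NLTK installed) that contrasts an off-the-shelf rule-based stemmer with a naive depluralization step; the depluralize helper is a made-up toy, and the printed stems are whatever the Porter stemmer actually produces, which will not always match the tidy forms quoted above.

    from nltk.stem import PorterStemmer

    def depluralize(word):
        # toy heuristic: strip a plural "-s"/"-es"; a serious depluralizer also needs
        # exception lists for irregular plurals such as "women" -> "woman"
        if word.endswith("es") and not word.endswith("ss"):
            return word[:-2]
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
        return word

    stemmer = PorterStemmer()
    for w in ["cats", "catlike", "catty", "childish", "computers"]:
        print(w, "->", stemmer.stem(w), "(stemmed),", depluralize(w), "(depluralized)")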

Related

NLP: retrieve vocabulary from text

I have some texts in different languages, potentially with some typos or other mistakes, and I want to retrieve their vocabulary. I'm not experienced with NLP in general, so maybe I use some words improperly.
By vocabulary I mean a collection of words of a single language in which every word is unique and inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are all considered think).
This is the master problem, so let's reduce it to retrieving the vocabulary of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation to each other. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service via some API. Yes, I accept this approach too, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if the word thought appeared in the text as both a noun and a verb, it could be considered already present in the vocabulary at the second match.
So we have reduced the problem to retrieving the vocabulary of an English text without mistakes, and without considering the words' tags.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try the nltk lemmatizer for Python or Stanford NLP/Clear NLP for Java. Actually nltk uses WordNet, so it is really a combination of the 1st and 2nd approaches.
In order to cope with mistakes, use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About the part-of-speech tag - unfortunately, nltk's lemmatizer doesn't consider the POS tag (or context in general) on its own, so you should provide it with a tag found by nltk POS tagging. Again, this is already discussed here (and in related/linked questions). I'm not sure about Stanford NLP here - I would guess it considers context. As far as I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages - google for lemmatization models, since the algorithm for most languages (at least within the same family) is almost the same; the differences are in the training data. Take a look here for an example for German; it is a wrapper for several lemmatizers, as far as I can see.
However, you can always use a stemmer at the cost of precision, and stemmers are more easily available for different languages.
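As a minimal sketch of the NLTK route described above (the WordNet lemmatizer fed with tags from NLTK's POS tagger), assuming the relevant NLTK data packages (punkt, averaged_perceptron_tagger, wordnet) have been downloaded:

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def to_wordnet_pos(treebank_tag):
        # map Penn Treebank tags to the coarse categories the WordNet lemmatizer expects
        if treebank_tag.startswith("V"):
            return wordnet.VERB
        if treebank_tag.startswith("J"):
            return wordnet.ADJ
        if treebank_tag.startswith("R"):
            return wordnet.ADV
        return wordnet.NOUN

    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize("She thought the thoughts were interesting")
    vocabulary = {lemmatizer.lemmatize(tok.lower(), to_wordnet_pos(tag))
                  for tok, tag in nltk.pos_tag(tokens)}
    print(vocabulary)

Spelling correction, if needed, would be applied to the tokens before this step.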

Unknown word handling in Part of speech Tagger

What is the correct way to handle unknown words in a part-of-speech tagger?
I am confused about the order of checks: should I first check whether the word starts with a capital letter, or first check for a suffix?
Should I learn from the corpus that capitalized words tend to be nouns, or just assign them the noun tag blindly?
Which would be the better approach?
Your question is probably too broad to answer properly but given your level of abstraction, here are a few things to consider when deciding how "it depends".
Capitalization is not a good universal strategy because different languages have different capitalization norms. In German, every properly spelled Noun is written with a Capital Letter, whereas some languages do not distinguish between upper and lower case at all (and some scripts lack this distinction -- Arabic, Hebrew, Thai, Devanagari, not to mention Far Eastern scripts which of course are a completely different challenge altogether).
In English, obviously, capitalization is a good indicator that you are probably looking at a proper noun, but the absence of capitalization does not help you decide the correct POS at all.
Suffix matching is one of many possible categories for deciding the POS of an unknown word. Your choice of wording -- "the suffix" -- implies you have a very simplistic understanding of word formation. Some languages have suffix derivation and inflection but there are many other patterns. Swahili inflection uses prefixes, Arabic and Hebrew use infixes (which are however not marked orthographically), some languages mark plural through reduplication, etc.
Though it's no longer state of the art, a look at the Brill tagger is probably a good start for a better understanding of possible strategies.
A competing approach is to use syntactic constraints to disambiguate the role of each word. An application of constraint grammar is to use the POS tags of surrounding words to decide the most likely reading of an ambiguous or unknown word.
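As a small illustration (not the Brill tagger itself), NLTK's backoff taggers can combine these ideas: known words are tagged from training data, unknown words fall back to a suffix-based guess, and anything left over gets a blind noun tag. This assumes the Brown corpus has been downloaded via nltk.download('brown').

    from nltk.corpus import brown
    from nltk.tag import DefaultTagger, AffixTagger, UnigramTagger

    train_sents = brown.tagged_sents(categories="news")
    guess_noun = DefaultTagger("NN")                        # blind noun fallback
    by_suffix = AffixTagger(train_sents, affix_length=-3,   # last three letters
                            backoff=guess_noun)
    tagger = UnigramTagger(train_sents, backoff=by_suffix)  # known words first

    print(tagger.tag("The flurbles were grumbling loudly".split()))

Here the made-up word "flurbles" is handled by the suffix tagger or, failing that, by the noun fallback.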
Are you trying to write your own POS-tagger?
If not, I suggest you use the Stanford POS-tagger, or some other open source software. It will attempt to assign each word in a sentence the correct POS-tag. You can download it here:
http://nlp.stanford.edu/software/tagger.shtml
This paper presents a simple lexicon-based approach for tagging unknown words. It shows that the lexicon-based approach obtains promising tagging results for unknown words in 13 languages, including Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese.
In addition, you can also find in the paper accuracy results (for known words and unknown words) of 3 POS and morphological taggers on the 13 languages.

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved to the stage where stemming would become an archaic technology if "perfect" lemmatizers existed. That's because stemmers change the surface form of a word/token into some meaningless stems.
Then again, the definition of the "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.
Stemmers
[in]: having
[out]: hav
Lemmatizers
[in]: having
[out]: have
So the question is: are English stemmers useful at all today, given that we have a plethora of lemmatization tools for English?
If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?
How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?
Q1: "[..] are English stemmers useful at all today, given that we have a plethora of lemmatization tools for English?"
Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving with driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
Q2: "[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?"
What is your definition of a lemma: does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take into account semantics?
If you want to include derivation (which most people would say includes verbing nouns etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ...? It really depends on the application.
If you take into account semantics (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.
Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"
What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, template, ...).
With the possible exception of agglutinative languages, I would argue that a lookup table (say a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation (ranging from trivial - take the first one, or take the first one consistent with the word's POS tag - to much more sophisticated). The more sophisticated disambiguations are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although a combination of machine learning and manually created rules has been done too (see e.g. this).
Obviously, for most languages, you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you can use two-level morphology. Or you can do something in between, such as Hana (myself) (note that these are all full morphological analyzers that include lemmatization as one of their features). Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
There are way too many options and it really all depends on what you want to do with the results.
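To make the lookup-table idea concrete, here is a toy sketch: a full-form dictionary mapping surface forms to (lemma, tag) candidates, a trivial disambiguation step, and a backup rule for out-of-vocabulary items. The entries are invented for the example and are nothing like a real lexicon.

    # tiny stand-in for a compressed trie / full-form lexicon
    FULLFORM = {
        "drove":  [("drive", "VBD")],
        "drives": [("drive", "VBZ"), ("drive", "NNS")],
        "women":  [("woman", "NNS")],
    }

    def lemmatize(form, pos_tag=None):
        candidates = FULLFORM.get(form.lower())
        if candidates is None:
            # toy backup rule for unknown words such as proper names
            return form.lower()[:-1] if form.lower().endswith("s") else form.lower()
        if pos_tag is not None:
            for lemma, tag in candidates:
                if tag == pos_tag:
                    return lemma
        return candidates[0][0]  # trivial disambiguation: take the first one

    print(lemmatize("women"), lemmatize("drives", "NNS"), lemmatize("Smithsons"))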
One classical application of either stemming or lemmatization is the improvement of search engine results: By applying stemming (or lemmatization) to the query as well as (prior to indexing) to all tokens indexed, users searching for, say, "having" are able to find results containing "has".
(Arguably, verbs are somewhat uncommon in most search queries, but the same principle applies to nouns, especially in languages with a rich noun morphology.)
For the purpose of search result improvement, it is not actually important whether the stem (or lemma) is meaningful ("have") or not ("hav"). It only needs to be able to represent the word in question and all its inflectional forms. In fact, some systems use numbers or other kinds of id-strings instead of either stem or lemma (or base form or whatever it may be called).
Hence, this is an example of an application where stemmers (by your definition) are as good as lemmatizers.
However, I am not quite convinced that your (implied) definitions of "stemmer" and "lemmatizer" are generally accepted. I am not sure there is any generally accepted definition of these terms, but the way I define them is as follows:
Stemmer: A function that reduces inflectional forms to stems or base forms, using rules and lists of known suffixes.
Lemmatizer: A function that performs the same reduction, but using a comprehensive full-form dictionary to be able to deal with irregular forms.
Based on these definitions, a lemmatizer is essentially a higher-quality (and more expensive) version of a stemmer.
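Under these definitions, NLTK's Porter stemmer (suffix rules) and WordNet lemmatizer (full-form dictionary) make a convenient side-by-side comparison; a sketch assuming NLTK and its WordNet data are installed. Irregular forms such as "drove" and "women" are where the dictionary-based approach pays off.

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stem = PorterStemmer().stem
    lemma = WordNetLemmatizer().lemmatize
    for word in ["having", "drove", "women"]:
        print(word, "->", stem(word), "(stemmer),",
              lemma(word, pos="v"), "(lemmatizer as verb),",
              lemma(word, pos="n"), "(lemmatizer as noun)")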
The answer is highly dependent on the task or specific field of study within the Natural Language Processing (NLP) that we are talking about.
It is worth pointing out that it has been shown that in some specific tasks, like Sentiment Analysis (a popular sub-field of NLP), using a stemmer or lemmatizer as a feature when developing a system (training a machine learning model) does not have a noticeable effect on the model's accuracy, no matter how good the tool is. Even though it makes the performance a little bit better, there are more important features, like dependency parsing, with considerably more potential to be worked on in such systems.
It is also important to mention that the characteristics of the language we are working on should be taken into consideration.
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spellings. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. Sometimes, the same word can have multiple different lemmas, so we should identify the part-of-speech (POS) tag for the word in that specific context. Here are some examples to illustrate the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem, it would return 'Car' and this is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim, etc.
Lemmatization is computationally expensive since it involves look-up tables and what not. If you have a large dataset and performance is an issue, go with stemming. Remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
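A quick way to check the 'Caring'/'Stripes' examples above yourself (assuming NLTK with its WordNet data installed; the exact outputs depend on the stemmer and lemmatizer versions):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    wnl = WordNetLemmatizer()
    print(wnl.lemmatize("stripes", pos="v"), wnl.lemmatize("stripes", pos="n"))
    print(wnl.lemmatize("caring", pos="v"), PorterStemmer().stem("caring"))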

Filtering out meaningless phrases

I have an algorithm (that I can't change) that outputs a list of phrases. These phrases are intended to be "topics". However, some of them are meaningless on their own. Take this list:
is the fear
freesat
are more likely to
first sight
an hour of
sue apple
depression and
itunes
How can I filter out those phrases that don't make sense on their own, to leave a list like the following?
freesat
first sight
sue apple
itunes
This will be applied to sets of phrases in many languages, but English is the priority.
It's got to be grammatically acceptable in that it can't rely on other words in the original sentence that it was extracted from; e.g. it can't end in 'and'.
Although this is still an underspecified question, it sounds like you want some kind of grammar checker. I suggest you try applying a part-of-speech tagger to each phrase, compile a list of patterns of POS tags that are acceptable (e.g. anything that ends in a preposition would be unacceptable) and use that to filter your input.
At a high level, it seems that phrases which were only nouns or adjective-noun combos would give much better results.
Examples:
"Blue Shirt"
"Happy People"
"Book"
First of all, this problem can be as complex as you want it to be. For third-party reading/solutions, I came across:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
http://research.microsoft.com/en-us/groups/nlp/
http://sharpnlp.codeplex.com/ (note the part of speech tagger)
If you need 100% accuracy, then I wouldn't write such a tool myself.
However, if the problem domain is limited...
I would start by throwing out conjunctions, prepositions, contractions, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stopwords which #HappyTimeGopher suggested).
After that, you could create a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives and compare each word in the raw phrases to it. Anything which didn't occur in the dictionary, or didn't occur in the correct sequence, could be thrown out or ranked lower.
This could be useful if you were given 100 input values and wanted to select the best 5. Finding the values in the dictionary would mean that it's likely the word/phrase was good.
I've auto-generated such a dictionary before by building a raw index from thousands of documents pertaining to a vertical industry. I then spent a few hours with SQL and Excel stripping out problems easily spotted by a human. The resulting list wasn't perfect but it eliminated most of the blatantly dumb/pointless terminology.
As you may have guessed, none of this is foolproof, although checking adjective-to-noun sequence would help somewhat. Consider the case of "Greatest Hits" versus "Car Hits [Wall]".
Proper nouns (e.g. person names) don't work well with the dictionary approach, since it's probably not feasible to build a dictionary of all variations of given/surnames.
To summarize:
use a list of stopwords
generate a dictionary of words, classifying them by part(s) of speech
run raw phrases through dictionary and stopwords
(optional) rank on how confident you are on a match
if needed, accept phrases which didn't violate known patterns (this would handle many proper nouns)
If you've access to the text these phrases were generated from, it may be easier to just create your own topic tags.
Failing that, I'd probably just remove anything that contained a stop word. See this list, for example:
http://www.ranks.nl/resources/stopwords.html
I wouldn't break out POS tagging or anything stronger for this.
It seems you could create a list that filters out three things:
Prepositions: https://en.wikipedia.org/wiki/List_of_English_prepositions
Conjunctions: https://en.wikipedia.org/wiki/Conjunction_(grammar)
Verb forms of to-be: http://www.englishplus.com/grammar/00000040.htm
If you filter on these things you'd get pretty far. Are you more concerned with false negatives or positives? If false negatives aren't a huge problem, this is how I would approach it.
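Putting the word-list suggestions from these answers into a concrete form, here is a minimal sketch that drops any phrase starting or ending with a preposition, conjunction, form of "to be", article, or similar stopword. The word set below is a short stand-in, not the linked lists.

    BAD_EDGE_WORDS = {
        "is", "are", "was", "were", "be", "been", "a", "an", "the",
        "of", "to", "in", "on", "at", "and", "or", "but", "more",
    }

    def looks_meaningful(phrase):
        words = phrase.lower().split()
        return bool(words) and words[0] not in BAD_EDGE_WORDS and words[-1] not in BAD_EDGE_WORDS

    phrases = ["is the fear", "freesat", "are more likely to", "first sight",
               "an hour of", "sue apple", "depression and", "itunes"]
    print([p for p in phrases if looks_meaningful(p)])

On the example list this keeps exactly freesat, first sight, sue apple and itunes, but a real filter would also want POS patterns and a larger stopword list.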

noun countability

Are there any resources for determining the countability of nouns? Either some way to work it out, or a dictionary that records whether a noun is likely to be countable or not?
I'm not interested in whether the noun can be countable, but rather whether it is likely to be countable. For instance, rice can become rices, which means it can be countable, but in most cases it won't be.
This is a tough one. Many English words can be both (beer, time, glass, language, etc etc) depending on the context/meaning.
Figuring out (un)countability from the word alone or from a regular dictionary is impossible or impractical.
You can try to figure it out from a large text corpus by seeing how the word is used:
if there's a plural form or not
if there's an indefinite article before it or none
if it's used with many/few, much/little, a piece of(?), etc
But many words can function as both nouns and adjectives and that complicates matters. For example in an air pump, air functions as an adjective and an refers to pump, not to air.
Likewise, many words can function as both nouns and verbs and have identical forms. For example, in she pressures him, pressures isn't a plural of pressure.
Also, some uncountable nouns can have an indefinite article before them when they are made more specific, e.g. knowledge vs a good practical knowledge.
You can gather statistics from an analyzed corpus and based on it judge whether or not a word is more likely to be countable or uncountable.
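As a rough sketch of that corpus heuristic (assuming NLTK with the Brown corpus downloaded), one can compare how often a noun occurs with a singular versus a plural tag; the naive "+s" pluralization is an obvious simplification that ignores irregular plurals.

    from collections import Counter
    from nltk.corpus import brown

    # one pass over the tagged Brown corpus, keeping only plain singular/plural noun tags
    counts = Counter((w.lower(), t) for w, t in brown.tagged_words() if t in ("NN", "NNS"))

    def countability_evidence(noun):
        # many plural hits suggest a freely countable noun; few or none suggest a mass noun
        return {"singular": counts[(noun, "NN")], "plural": counts[(noun + "s", "NNS")]}

    print("rice:", countability_evidence("rice"))
    print("computer:", countability_evidence("computer"))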
There are several existing English lexica that contain information about count/mass/etc. distinctions, none of which quite agree with each other because they focus on slightly different distinctions and it's a complicated task. Two are ComLex and CUVPlus (which I can't find a download link for at the moment, although you can find it mentioned in many places).
Check out the work by Timothy Baldwin and Francis Bond in 2003 on learning noun countability from corpora. If you have many occurrences of an unfamiliar noun in a corpus, you can do fairly well at figuring out whether this noun can possibly be a count noun, can possibly be a mass noun, etc. However, individual instances can still be quite difficult to classify. If you have the sentence "the wug was white" and according to your lexicon "wug" can be either count or mass, there's not enough information in the immediate context to help you classify it.
I'm not sure if there is an 'official' dictionary saying if a noun is likely to be countable or not, but I can come up with two ways you could go about this:
Either assume that a noun is likely to be uncountable if somebody put it in a 'list of mass nouns' or 'list of uncountable nouns' (you find quite a lot if you google for those phrases, for example this).
Or do a little corpus study and see how often the word is used in which way: searching for "rice" in the Corpus of Contemporary American English gives 22265 hits, while the word "rices" is found only 69 times.
It depends on the context and whether the noun may have plural on its own. Different senses of the same word may differ, e.g.:
expectation: the feeling vs. what is being expected
salt: table salt vs. a type of a chemical element
Our API, GlobalNLP, returns the countability of nouns (among other things) in a particular context in this method: https://nlp.linguasys.com/docs/services/53fccbb15cfea30d9c48f8d6/operations/542a6da01c78d80a3cd6692a
