Natural Language Processing (NLP), especially for English, has evolved to the stage where stemming would become an archaic technology if "perfect" lemmatizers existed. This is because stemmers change the surface form of a word/token into a meaningless stem.
Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.
Stemmers
[in]: having
[out]: hav
Lemmatizers
[in]: having
[out]: have
So the question is: are English stemmers of any use at all today, given that we have a plethora of lemmatization tools for English?
If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?
How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?
Q1: "[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English"
Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving with driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
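To make the clustering point concrete, here is a minimal sketch of a suffix-stripping stemmer. The suffix list and the `toy_stem` / `group_by_stem` names are made up for illustration; this is not the Porter algorithm or any library's implementation, and it does nothing for irregular forms such as drove.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "es", "e", "s")  # hypothetical rule list, tried in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Cluster words that share the same (possibly meaningless) stem."""
    groups = {}
    for w in words:
        groups.setdefault(toy_stem(w), []).append(w)
    return groups


print(group_by_stem(["drive", "driving", "drives", "having", "have"]))
```

The stems driv and hav are not words, but for retrieval all that matters is that related forms end up under the same key.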
Q2: "[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify, and adverbify preprocesses?"
What is your definition of a lemma, does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take into account semantics?
If you want to include derivation (which most people would say includes verbing nouns etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want to change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ...? It really depends on the application.
If you take into account semantics (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.
Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"
What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, template, ...).
With the possible exception of agglutinative languages, I would argue that a lookup table (say, a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation (ranging from trivial, i.e. take the first one, or take the first one consistent with the word's POS tag, to much more sophisticated). The more sophisticated disambiguations are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although a combination of machine learning and manually created rules has been done too (see e.g. this).
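A rough sketch of the lookup-plus-disambiguation idea, assuming a tiny hand-written table. A plain dict stands in for the compressed trie, and the table entries, the tag names and the `backup_rule` heuristic are all hypothetical:

```python
# Hypothetical full-form table: surface form -> list of (lemma, POS) analyses.
# A real system would generate this from a morphological description and
# store it compactly (e.g. in a trie); a dict is used here for brevity.
LOOKUP = {
    "having": [("have", "VERB")],
    "saw": [("see", "VERB"), ("saw", "NOUN")],
    "women": [("woman", "NOUN")],
}


def backup_rule(word):
    """Fallback for unknown words: strip a plural-looking "s", else keep as is."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def lemmatize(word, pos=None):
    analyses = LOOKUP.get(word.lower())
    if analyses is None:
        return backup_rule(word)
    if pos is not None:
        # Take the first analysis consistent with the given POS tag.
        for lemma, tag in analyses:
            if tag == pos:
                return lemma
    # Trivial disambiguation: take the first analysis.
    return analyses[0][0]


print(lemmatize("saw"), lemmatize("saw", "NOUN"), lemmatize("computers"))
```

The POS tag would normally come from a tagger run before the lookup; here it is simply passed in by the caller.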
Obviously, for most languages, you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization as one of their features.) Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
There are way too many options and it really all depends on what you want to do with the results.
One classical application of either stemming or lemmatization is the improvement of search engine results: By applying stemming (or lemmatization) to the query as well as (prior to indexing) to all tokens indexed, users searching for, say, "having" are able to find results containing "has".
(Arguably, verbs are somewhat uncommon in most search queries, but the same principle applies to nouns, especially in languages with a rich noun morphology.)
For the purpose of search result improvement, it is not actually important whether the stem (or lemma) is meaningful ("have") or not ("hav"). It only needs to be able to represent the word in question and all its inflectional forms. In fact, some systems use numbers or other kinds of id-strings instead of either stem or lemma (or base form or whatever it may be called).
Hence, this is an example of an application where stemmers (by your definition) are as good as lemmatizers.
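As an illustration of the "any id will do" point, here is a small sketch of an inverted index where queries and documents are normalized to the same arbitrary class ids. The `FORM_TO_ID` table and its numbers are invented for the example:

```python
from collections import defaultdict

# Hypothetical mapping from surface forms to an arbitrary class id.
FORM_TO_ID = {"have": 17, "has": 17, "having": 17, "had": 17,
              "car": 42, "cars": 42}


def normalize(token):
    """Map a token to its class id; unknown tokens are keyed by lowercase form."""
    token = token.lower()
    return FORM_TO_ID.get(token, token)


def build_index(docs):
    """Build key -> set of document ids, normalizing every token first."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query):
    """Return the documents containing every (normalized) query term."""
    results = [index.get(normalize(t), set()) for t in query.split()]
    return set.intersection(*results) if results else set()


docs = {1: "She has two cars", 2: "Having fun"}
index = build_index(docs)
print(search(index, "having"))
```

A search for "having" finds the document containing "has" because both map to the same id, and the id itself never needs to be a readable word.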
However, I am not quite convinced that your (implied) definitions of "stemmer" and "lemmatizer" are generally accepted. I am not sure if there is any generally accepted definition of these terms, but the way I define them is as follows:
Stemmer: A function that reduces inflectional forms to stems or base forms, using rules and lists of known suffixes.
Lemmatizer: A function that performs the same reduction, but using a comprehensive full-form dictionary to be able to deal with irregular forms.
Based on these definitions, a lemmatizer is essentially a higher-quality (and more expensive) version of a stemmer.
The answer is highly dependent on the task or specific field of study within Natural Language Processing (NLP) that we are talking about.
It is worth pointing out that in some specific tasks, like Sentiment Analysis (a popular sub-field of NLP), it has been shown that using a stemmer or lemmatizer as a feature in the development of a system (training a machine learning model) does not have a noticeable effect on the accuracy of the model, no matter how good the tool is. Even where it makes the performance a little bit better, there are more important features, like dependency parsing, that have considerable potential to be worked on in such systems.
It is also important to take into consideration the characteristics of the language we are working on.
Stemming just removes the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. Sometimes the same word can have multiple different lemmas, so we should identify the Part of Speech (POS) tag for the word in that specific context. Here are some examples to illustrate the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem it by simply stripping the suffix, it would return 'Car', which is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim etc.
Lemmatization is computationally expensive since it involves look-up tables and whatnot. If you have a large dataset and performance is an issue, go with stemming. Remember you can also add your own rules to a stemmer. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
I am looking into extracting the meaning of expressions used in everyday speech. For instance, it is apparent to a human that the sentence "The meal we had at restaurant A tasted like food at my granny's." means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are used to get the meaning of words when they have multiple meanings, I am not sure if it would be the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You should use tools from Sentiment Analysis to get the gist of the sentence's emotional message. There are more sophisticated approaches which attempt to extract what quality is assigned to what object in the sentence (this you can get from POS-tagged sentences plus some hand-crafted Information Extraction rules).
However, you may also want to explore paraphrasing the common language into more formal language and looking for those phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (there are sometimes slang dictionaries available, but I am not aware of any for English right now). You could then map the colloquial expressions to more formal ones which are likely to be caught by some embedding space (frequently used in Sentiment Analysis).
I have some texts in different languages, potentially with some typos or other mistakes, and I want to retrieve their vocabulary. I'm not experienced with NLP in general, so maybe I use some words improperly.
By vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are all considered think).
This is the master problem, so let's reduce it to retrieving the vocabulary of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation with each other. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service via some API. Yes, I also accept this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if the word thought occurred in the text as both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieving the vocabulary of an English text without mistakes, and without considering the tags of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem with the other constraints as well (mistakes and multiple languages, not only Indo-European ones), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly the same (difference).
Try the nltk lemmatizer for Python or Stanford NLP/Clear NLP for Java. Actually, nltk uses WordNet, so it is really a combination of the 1st and 2nd approaches.
In order to cope with mistakes, use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
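A minimal sketch of "correct first, then reduce to the base form", using only the standard library. The `BASE_FORMS` table is a hypothetical stand-in for a real lexicon such as WordNet, and `difflib.get_close_matches` is used as a crude spelling corrector; a dedicated spell-checking library would do better.

```python
import difflib
import re

# Hypothetical form -> base-form table; a real one would come from a lexicon.
BASE_FORMS = {"think": "think", "thinks": "think", "thought": "think",
              "word": "word", "words": "word"}


def correct(word, cutoff=0.8):
    """Replace a word by the closest known form, if one is similar enough."""
    matches = difflib.get_close_matches(word, list(BASE_FORMS), n=1, cutoff=cutoff)
    return matches[0] if matches else word


def vocabulary(text):
    """Collect the set of base forms, correcting typos before the lookup."""
    vocab = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        known = correct(token)
        vocab.add(BASE_FORMS.get(known, known))
    return vocab


print(vocabulary("He thinks. She thougt of words"))
```

Unknown words pass through unchanged, so the result is only as good as the table behind it.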
About the part-of-speech tag: unfortunately, the nltk lemmatizer doesn't consider the POS tag (or context in general) by itself, so you should provide it with the tag, which can be found with nltk POS tagging. Again, it is already discussed here (and in related/linked questions). I'm not sure about Stanford NLP here; I guessed it would consider context, but then I was also sure that NLTK did. As far as I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages: google for lemmatization models, since the algorithm for most languages (at least from the same family) is almost the same, and the differences are in the training data. Take a look here for an example for German; it is a wrapper for several lemmatizers, as far as I can see.
However, you can always use a stemmer at the cost of precision, and stemmers are more easily available for different languages.
In understanding string matching: What is the exact difference between word stemming and depluralization?
Or do they mean the same thing?
First, stemming refers to the process of reducing a word to its stem. However, that may mean a number of different things. Most linguists differentiate between at least two ways of doing it:
Removing grammatical, but not derivational morphemes. Grammatical morphemes are components of the word that are related to its grammatical role in a particular sentence, e.g. number, case, gender, tense, aspect etc.
Removing both grammatical and derivational morphemes. Derivational morphemes are components of the word that are related to its derivation from another word, e.g. the "-er" in "worker" is related to how it is derived (or can be considered as derived) from "work".
Therefore, depluralization, which is a rather unusual term, but obviously refers to removing a plural morpheme (such as the "-s" at the end of "computers"), is part of a kind of stemming, specifically a part of the removal of grammatical (but not derivational) morphemes.
In English, the morphology of nouns is largely limited to plural ("computers") and genitive (second case, "computer's"), hence as far as English is concerned, depluralization may be seen as (almost) synonymous with (grammatical) stemming, at least to the extent that stemming is applied to nouns, and, to some degree, adjectives, (which it is e.g. in the context of information retrieval). However, wherever verbs are considered, past tense, passive voice and other inflectional forms are subject to stemming (but not to depluralization).
Furthermore, in languages other than English, even nouns may have a very rich morphology, including morphemes for such things as case, politeness level, or special kinds of plural (such as dual). And then, depluralization (if you want to use that term at all) would refer to only a very small part of the overall stemming process.
Another related term is lemmatization, which is often used synonymously with stemming. One distinction between the two that I found many people (including myself) to make is this:
Stemming is used to refer to a rule-based or machine-learning based technique that removes parts of a word (mostly endings) that look like grammatical morphemes
Lemmatization is used to refer to a process that does the same, but using an actual dictionary of the language to deal with highly irregular forms (such as the plural "women")
(But, again, not everyone will agree with this distinction.)
They are not the same. There are a few approaches to stemming a word; depluralization is one strategy.
Just one quick example: a stemmer might stem "childish" into "child", or the word "stemmer" into "stem", while a depluralization algorithm will not.
Stemming is converting multiple words with the same root to one word.
Ex. "cats", "catlike", "catty" to "cat"
Depluralization is converting plural words into singular.
Ex. "cats" to "cat"
Additional info for stemming and algorithms
http://en.wikipedia.org/wiki/Stemming#Algorithms
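The difference described in the answers above can be sketched in a few lines. Both functions are toy versions with invented rule lists, not any published stemming algorithm: `depluralize` only touches a few regular plural endings, while `toy_stem` additionally strips some derivational suffixes.

```python
def depluralize(word):
    """Handle only a few regular English plural patterns."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "xes", "sses")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


DERIVATIONAL = ("like", "ish", "er", "y")  # hypothetical suffix list


def toy_stem(word):
    """Depluralize, then strip one derivational suffix if present."""
    word = depluralize(word)
    for suffix in DERIVATIONAL:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Crudely undo consonant doubling: "catt" -> "cat", "stemm" -> "stem".
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["cats", "catlike", "catty", "childish", "stemmer"]:
    print(w, depluralize(w), toy_stem(w))
```

The doubling rule is deliberately naive and will over-strip some words (it would turn "taller" into "tal"), which is the kind of error real stemmers accept in exchange for simplicity.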
I'm new to Natural Language Processing and I'm confused about the terms used.
What is tokenization? POS tagging? Entity identification?
Does tokenization only split the text into parts that can have a meaning, or does it also give a meaning to these parts? And what is it called when I determine that something is a noun, verb or adjective? And what if I want to divide the text into dates, names, currency?
I need a simple explanation about the areas/terms used in NLP.
Let's use an example like
My cat's name is Pat. He likes to sit on the mat.
Tokenization takes these sentences and splits them into what we call tokens, which are basically the words. The tokens for these sentences are my, cat's, name, is, Pat, he, likes, to, sit, on, the, mat. (Sometimes you may see cat's as two tokens; this depends on personal preference and intention lol.)
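A small regex-based sketch of the two tokenization choices just mentioned. The `split_clitics` flag is a made-up name for this example, and unlike the word list above, this version also emits punctuation as separate tokens:

```python
import re


def tokenize(text, split_clitics=False):
    """Split text into word and punctuation tokens.

    With split_clitics=True, "cat's" becomes two tokens ("cat", "'s").
    """
    if split_clitics:
        pattern = r"'\w+|\w+|[^\w\s]"
    else:
        pattern = r"\w+(?:'\w+)?|[^\w\s]"
    return re.findall(pattern, text)


print(tokenize("My cat's name is Pat."))
print(tokenize("My cat's name is Pat.", split_clitics=True))
```

Real tokenizers handle many more cases (abbreviations, URLs, hyphenation), but the two patterns show that the one-token versus two-token decision is just a design choice.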
POS stands for Part-Of-Speech, so to tag these sentences for parts-of-speech would be to run it through a program called a POS tagger, which will label each token in the sentence for its part-of-speech. The output from the tagger written by a group at Stanford in this case is:
My_PRP$ cat_NN 's_POS name_NN is_VBZ Pat_NNP ._.
He_PRP likes_VBZ to_TO sit_VB on_IN the_DT mat_NN ._.
(Here is a good example of cat's being treated as two tokens.)
Entity Identify is more often called Named Entity Recognition. It is the process of taking a text like ours and identifying things that are mostly proper nouns but can also include dates or anything else that you teach the recognizer to, well, recognize. For our example a Named Entity Recognition system would insert a tag like
<NAME>Pat</NAME>
for our cat's name. If there was another sentence like
Pat is a part-time consultant for IBM in Yorktown Heights, New York.
now the recognizer would label three entities (four total since Pat would be labeled twice).
<NAME>Pat</NAME>
<ORGANIZATION>IBM</ORGANIZATION>
<LOCATION>Yorktown Heights, New York</LOCATION>
Now how all of these tools actually work is a whole other story. :)
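For a rough feel of the simplest possible recognizer, here is a gazetteer (fixed list) lookup that wraps known names in tags. The `GAZETTEER` entries are hypothetical, and real Named Entity Recognition systems are usually trained statistical models rather than a fixed list like this:

```python
import re

# Hypothetical gazetteer: known entity strings and their labels.
GAZETTEER = {
    "Pat": "NAME",
    "IBM": "ORGANIZATION",
    "Yorktown Heights, New York": "LOCATION",
}


def tag_entities(text):
    """Wrap every gazetteer entry found in the text in a <LABEL>...</LABEL> tag."""
    # Try longer entries first so multi-word names win over their parts.
    entries = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(e) for e in entries) + r")\b")

    def wrap(match):
        entity = match.group(1)
        label = GAZETTEER[entity]
        return f"<{label}>{entity}</{label}>"

    return pattern.sub(wrap, text)


print(tag_entities("Pat is a part-time consultant for IBM in Yorktown Heights, New York."))
```

A list like this has no way of recognizing names it has never seen, which is exactly the gap that trained recognizers try to close.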
To add to dmn's explanation:
In general, there are two themes you should care about in NLP:
Statistical vs Rule-Based Analysis
Lightweight vs Heavyweight Analysis
Statistical Analysis uses statistical machine learning techniques to classify text and in general has good precision and good recall. Rule-Based Analysis techniques basically use hand-built rules and have very good precision but terrible recall (basically they identify the cases in your rules, but nothing else).
Lightweight vs Heavyweight Analysis are the two approaches you'll see in the field. In general, academic work is heavyweight, featuring parsers, fancy classifiers and lots of very high tech NLP stuff. In industry, by and large the focus is on data; a lot of the academic stuff scales poorly, and going beyond standard statistical or machine learning techniques doesn't bring you much. For example, parsing is largely useless (and slow), whereas keyword and n-gram analysis is actually pretty useful, especially when you have a lot of data. For example, Google Translate apparently isn't that fancy behind the scenes - they just have so much data they can crush everybody else no matter how refined the others' translation software is.
The upshot of this is that in industry there's a lot of machine learning and math, but the NLP that is used is not very sophisticated, because the sophisticated stuff really doesn't work well. Far preferred is using user data like clicks on related subjects and Mechanical Turk... and this works very well, as people are far better at understanding natural language than computers.
Parsing breaks a sentence down into phrases (say verb phrase, noun phrase, prepositional phrase, etc.) and produces a grammatical tree. You can use the online version of the Stanford Parser to play with examples and get a feel for what a parser does. For example, let's say we have the sentence
My cat's name is Pat.
Then we do POS tagging:
My/PRP$ cat/NN 's/POS name/NN is/VBZ Pat/NNP ./.
Using the POS tags and a trained statistical parser, we get a parse tree:
(ROOT
  (S
    (NP
      (NP (PRP$ My) (NN cat) (POS 's))
      (NN name))
    (VP (VBZ is)
      (NP (NNP Pat)))
    (. .)))
We can also do a slightly different type of parse called a dependency parse:
poss(cat-2, My-1)
poss(name-4, cat-2)
possessive(cat-2, 's-3)
nsubj(Pat-6, name-4)
cop(Pat-6, is-5)
N-Grams are basically sets of adjacent words of length n. You can look at n-grams in Google's data here. You can also do character n-grams which are used heavily for spelling correction.
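Both kinds of n-gram are easy to compute yourself. A short sketch (the function names are just for this example):

```python
def word_ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """Return all substrings of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(word_ngrams("he likes to sit".split(), 2))
print(char_ngrams("mat", 2))
```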
Sentiment Analysis is analyzing text to extract how people feel about something or in what light things (such as brands) are mentioned. This involves a lot of looking at words that denote emotion.
Semantic Analysis is analyzing the meaning of text. Often this takes the form of taxonomies and ontologies where you group concepts together (dog,cat belong to animal and pet) but it is a very undeveloped field. Resources like WordNet and Framenet are useful here.
To answer the more specific part of your question: tokenization is breaking the text into parts (usually words), not caring too much about their meaning. POS tagging is disambiguating between possible parts of speech (noun, verb, etc.), it takes place after tokenization. Recognizing dates, names etc. is named entity recognition (NER).
I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:
Tennis: serve, volley, foot-fault, set point, return, advantage
Snooker: nothing
Chess: chess move, checkerboard (whose own meronym relationships show 'square' & 'diagonal')
Bookcase: shelve
Weighting of terms will eventually be required, but that is not really a concern now.
Anyone have any suggestions on how to do this?
Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.
The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).
The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):
golf: [ball, iron, tee, bag, club]
photography: [camera, film, photograph, art, image]
fishing: [fish, net, hook, trap, bait, lure, rod]
The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:
Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in Wordnet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc so I can't speak to how complete it is, or how easy it is to use.
n-gram frequency analysis. As mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
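To give an idea of the tree walk behind the first option (semantic similarity), here is a toy path-based score over a made-up mini-taxonomy. The `PARENT` table and the 1 / (1 + distance) formula are assumptions for illustration; this is not the WordNet::Similarity API, and WordNet itself is far larger and not a simple tree.

```python
# Hypothetical mini-taxonomy: child -> parent.
PARENT = {"dog": "pet", "cat": "pet", "pet": "animal", "animal": "entity",
          "racket": "equipment", "equipment": "entity"}


def ancestors(term):
    """Return the path from a term up to the root, starting with the term."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_similarity(a, b):
    """Score two terms by the path length through their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = [node for node in path_a if node in path_b]
    if not common:
        return 0.0  # the terms are not connected in the taxonomy
    lca = common[0]
    distance = path_a.index(lca) + path_b.index(lca)
    return 1.0 / (1 + distance)


print(path_similarity("dog", "cat"), path_similarity("dog", "racket"))
```

If a term is missing from the taxonomy altogether, it has no path to anything else, which is the limitation noted for WordNet-based approaches.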
[...]
Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in Wordnet then semantic similarity won't work and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.
I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
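If you build your own counts, the n = 2 case is only a few lines. This sketch counts which words directly follow a target word in a corpus; the corpus string is invented, and a real run would use something the size of a Wikipedia dump:

```python
from collections import Counter
import re


def following_words(corpus, target, top=3):
    """Count the words that directly follow `target` (i.e. bigrams starting with it)."""
    tokens = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(b for a, b in zip(tokens, tokens[1:]) if a == target)
    return counts.most_common(top)


corpus = "The tennis ball hit the net. A tennis racket and a tennis ball."
print(following_words(corpus, "tennis"))
```

On real data you would also want to filter the results to nouns and drop very frequent function words.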
For more information, check out this related Stack Overflow question.