Infinitive form disambiguation - nlp

How can I decide whether a word in a sentence is an infinitive or not?
For example, here "fixing" is an infinitive:
Fixing the door was also easy but fixing the window was very hard.
But in
I am fixing the door
it is not. How do people disambiguate these cases?

To elaborate on my comment:
In PoS tagging, choosing between a gerund (VBG) and a noun (NN) is quite subtle and has many special cases. My understanding is fixing should be tagged as a gerund in your first sentence, because it can be modified by an adverb in that context. Citing from the Penn PoS tagging guidelines (page 19):
"While both nouns and gerunds can be preceded by an article or a possessive pronoun, only a noun (NN) can be modified by an adjective, and only a gerund (VBG) can be modified by an adverb."
EXAMPLES:
Good/JJ cooking/NN is something to enjoy.
Cooking/VBG well/RB is a useful skill.

Assuming you meant 'automatically disambiguate', this task requires a bit of processing (pos-tagging and syntactic parsing). The idea is to find instances of a verb that are not preceded by an agreeing Subject Noun Phrase. If you also want to catch infinitive forms like "to fix", just add that to the list of forms you are looking for.
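As a rough illustration of the idea above, here is a minimal sketch that works on already-tagged tokens. It does not check subject agreement with a parser; it swaps in a much cruder proxy (is the -ing form preceded by a form of "be"?). The function name, the BE_FORMS set and the hand-written tags are all assumptions made for the example, not output from a real tagger.

```python
# Hypothetical helper: a crude proxy for "has an agreeing subject + auxiliary".
# The Penn Treebank tags below are written by hand, not produced by a tagger.
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def classify_ing_forms(tagged):
    """Label each VBG token as 'progressive' or 'gerund-like'."""
    results = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "VBG":
            continue
        # Walk left over adverbs (RB) to find the nearest non-adverb token.
        j = i - 1
        while j >= 0 and tagged[j][1] == "RB":
            j -= 1
        prev_word = tagged[j][0].lower() if j >= 0 else None
        label = "progressive" if prev_word in BE_FORMS else "gerund-like"
        results.append((word, i, label))
    return results

progressive = [("I", "PRP"), ("am", "VBP"), ("fixing", "VBG"),
               ("the", "DT"), ("door", "NN")]
nominal = [("Fixing", "VBG"), ("the", "DT"), ("door", "NN"), ("was", "VBD"),
           ("also", "RB"), ("easy", "JJ"), ("but", "CC"), ("fixing", "VBG"),
           ("the", "DT"), ("window", "NN"), ("was", "VBD"), ("very", "RB"),
           ("hard", "JJ")]

print(classify_ing_forms(progressive))
print(classify_ing_forms(nominal))
```

A sentence such as "The problem was fixing the door" would fool this check, which is why the parser-based approach is the more reliable one.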

Related

Is there an information extraction tool that finds subject and verb/relation pairs in a sentence, like ClausIE, ReVerb, etc.?

I have used ClausIE and it returns subject, verb and object triples from a sentence. But these tools don't work when the text is short and not even a complete sentence. I just want a library, or some other approach, that returns only the subject-verb pairs from a short text or phrase.
An example short text is "Proposal 32 accepted". There should be some dependency, or maybe some rules, that identify the term "Proposal" as the subject and the term "accepted" as the verb/relation.
I have tried the Stanford online parser on the above text, but it doesn't return anything, maybe because there is no object in the text.
Any advice would be appreciated.
The problem is that you have a subject ("Proposal 32") and a verb ("accepted"), but no object, so there is no triple.
What you could do, though, is identify the subject and the verb by parsing the text, for example with the Stanford online parser.
For example:
- The sentence is probably "declarative" if Stanford uses the "S" tag.
- if the sentence is declarative, then:
- the Subject is usually the Noun group that is in front of the main Verb group. In Stanford online that's the first NP in front of the first VP.
Now: if you:
- Add "is" in front of the main verb you get: "Proposal 32 is accepted".
- Which is: "Proposal 32 = accepted", which is a logical comparison that any programming language understands
The problem, of course, is that you don't always get these simple short sentences. There are probably some packages out there that can deal with this out of the box, but none that I know of.
What you can do is make some rules of your own, based on English grammar. The system would only understand sentences covered by the rules you write, but maybe that's all you need. If you only have to deal with these very short combinations, a few well-designed rules can do the job.
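To make the "few rules of your own" idea concrete, here is a minimal sketch of a single rule: take the contiguous noun group immediately before the first verb as the subject. The function name, the tag set and the hand-tagged input are assumptions for illustration; in practice the tags would come from a POS tagger.

```python
def subject_verb_pair(tagged):
    """Return (subject, verb) using one rule: the noun group before the first verb."""
    noun_group_tags = {"DT", "JJ", "CD", "NN", "NNS", "NNP", "NNPS", "PRP"}
    for i, (word, tag) in enumerate(tagged):
        if tag.startswith("VB"):
            # Collect the contiguous noun-group tokens immediately left of the verb.
            j = i
            while j > 0 and tagged[j - 1][1] in noun_group_tags:
                j -= 1
            subject = " ".join(w for w, _ in tagged[j:i])
            return (subject, word) if subject else None
    return None  # no verb found

# Hand-written Penn Treebank tags (assumed, not tagger output).
pair = subject_verb_pair([("Proposal", "NN"), ("32", "CD"), ("accepted", "VBN")])
print(pair)
```

The rule returns None when there is no verb or nothing noun-like in front of it, so unsupported inputs fail visibly instead of producing a wrong pair.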

Identify prepositions and individual POS

I am trying to find correct parts of speech for each word in paragraph. I am using Stanford POS Tagger. However, I am stuck at a point.
I want to identify prepositions from the paragraph.
Penn Treebank Tagset says that:
IN Preposition or subordinating conjunction
How can I be sure whether the current word is a preposition or a subordinating conjunction? How can I extract only the prepositions from a paragraph in this case?
You can't be sure. The reason for this somewhat strange PoS is that it's really hard to automatically determine if, for example, for is a preposition or a subordinate conjunction. So in order for automatic taggers to have a better precision, this distinction is simply ignored. Note that there is also a tag TO, which is given to any occurrence of to, regardless of its function as a preposition, infinitive particle or whatever (I think there are others).
If you need to identify prepositions properly, you need to retrain a tagger with a modified tag set, or maybe train a classifier which takes PoS-tagged text and only does this final disambiguation.
I have made some progress in telling whether a word is actually a preposition or a subordinating conjunction.
I parsed the following sentence:
She left early because Mike arrived with his new girlfriend.
(here because is a subordinating conjunction)
After POS tagging
She_PRP left_VBD early_RB because_IN Mike_NNP arrived_VBD with_IN
his_PRP$ new_JJ girlfriend_NN ._.
Here, to check whether because is a preposition or not, I parsed the sentence.
In the parse tree, the IN node for because has SBAR (subordinate clause) as its direct parent.
with is also tagged IN, but its direct parent is PP, so it is a preposition.
Example 2 :
Keep your hand on the wound until the nurse asks you to take it off.
(here until is a subordinating conjunction)
The POS tagging is:
Keep_VB your_PRP$ hand_NN on_IN the_DT wound_NN until_IN the_DT
nurse_NN asks_VBZ you_PRP to_TO take_VB it_PRP off_RP ._.
So until and on are both marked as IN.
However, the picture gets clearer when we actually parse the sentence: until sits directly under SBAR and on directly under PP.
So I conclude that because and until are subordinating conjunctions, while with and on are prepositions.
I tried many variations of sentences, and this worked for almost all of them, except for some cases with before and after.
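The parent-label check described in this answer can be sketched without any NLP library, given a bracketed parse. The tree string below is written by hand in the style of a constituency parser's output and is an assumption, as are the function names; a real parser may bracket the sentence differently.

```python
import re

def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def classify_in_words(tree, parent=None, out=None):
    """Classify every IN leaf by the label of its direct parent node."""
    out = [] if out is None else out
    label, children = tree
    if label == "IN":
        kinds = {"PP": "preposition", "SBAR": "subordinating conjunction"}
        out.append((children[0], kinds.get(parent, "unknown")))
    for child in children:
        if isinstance(child, tuple):
            classify_in_words(child, label, out)
    return out

# Hand-written parse (assumed), not actual parser output.
TREE = """
(ROOT
  (S
    (NP (PRP She))
    (VP (VBD left)
      (ADVP (RB early))
      (SBAR (IN because)
        (S
          (NP (NNP Mike))
          (VP (VBD arrived)
            (PP (IN with)
              (NP (PRP$ his) (JJ new) (NN girlfriend)))))))
    (. .)))
"""

result = classify_in_words(parse_tree(TREE))
print(result)
```

Words whose IN node hangs under neither PP nor SBAR are reported as "unknown", which is where cases like before and after would need extra rules.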

Why doesn't the WordNet dictionary contain the word 'she'?

Does anyone know why WordNet doesn't contain the word 'she'? Thanks.
The answer to this is in the WordNet FAQ (which I just discovered existed), and also in this other question.
Basically, she is a pronoun - a word that kind of stands in place for a noun. Instead of referring to Betty by her name - which is a proper noun - you may refer to her as she.
Pronouns by themselves (without Betty, in this case) don't actually contain any meaning. Some people, like the WordNet people, call that kind of word closed-class words. By design, WordNet only includes open-class words.
From the Wordnet FAQ:
Q. Why is WordNet missing: of, an, the, and, about, above, because, etc. [and pronouns]
A. WordNet only contains "open-class words": nouns, verbs,
adjectives, and adverbs. Thus, excluded words include determiners,
prepositions, pronouns, conjunctions, and particles.
WordNet isn't a standard dictionary. For example, if you search for "he" you end up with the element helium and the 5th letter of the Hebrew alphabet as definitions. However, you don't end up with a definition of a word referring to a man.
My best guess is that "she" isn't included because it doesn't have a definition of anything other than a word referring to a woman. I say best guess because I'm not a language expert, so I can't definitively say it has no other definitions. If you look up "he" on thefreedictionary.com, though, it does have references to helium and Hebrew. If you look up "she", it only has definitions related to gender.
tl;dr The reason "she" doesn't exist as a word in wordnet is that, by their rules (or whatever you want to call it) it isn't a word.

What Is the Difference Between POS Tagging and Shallow Parsing?

I'm currently taking a Natural Language Processing course at my University and am still confused by some basic concepts. I got the definition of POS tagging from the book Foundations of Statistical Natural Language Processing:
Tagging is the task of labeling (or tagging) each word in a sentence
with its appropriate part of speech. We decide whether each word is a
noun, verb, adjective, or whatever.
But I can't find a definition of shallow parsing in the book, even though it describes shallow parsing as one of the uses of POS tagging. So I searched the web and found no direct explanation of shallow parsing, except on Wikipedia:
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
I frankly don't see the difference, but that may be because of my English or because I'm not understanding a simple basic concept. Can anyone please explain the difference between shallow parsing and POS tagging? Is shallow parsing also often called shallow semantic parsing?
Thanks in advance.
POS tagging would give a POS tag to each and every word in the input sentence.
Parsing the sentence (using the Stanford PCFG parser, for example) would convert the sentence into a tree whose leaves hold POS tags (which correspond to words in the sentence), while the rest of the tree tells you exactly how these words join together to make the overall sentence. For example, an adjective and a noun might combine into a 'Noun Phrase', which might combine with another adjective to form another Noun Phrase (e.g. quick brown fox). The exact way the pieces combine depends on the parser in question.
You can see what parser output looks like at http://nlp.stanford.edu:8080/parser/index.jsp
A shallow parser or 'chunker' comes somewhere in between these two. A plain POS tagger is really fast but does not give you enough information and a full blown parser is slow and gives you too much. A POS tagger can be thought of as a parser which only returns the bottom-most tier of the parse tree to you. A chunker might be thought of as a parser that returns some other tier of the parse tree to you instead. Sometimes you just need to know that a bunch of words together form a Noun Phrase but don't care about the sub-structure of the tree within those words (i.e. which words are adjectives, determiners, nouns, etc and how do they combine). In such cases you can use a chunker to get exactly the information you need instead of wasting time generating the full parse tree for the sentence.
POS tagging is the process of deciding the type of every token in a text, e.g. NOUN, VERB, DETERMINER, etc. A token can be a word or a punctuation mark.
Shallow parsing, or chunking, is the process of dividing a text into syntactically related groups.
POS tagging output
My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./.
Chunking output
[NP My dog] [VP likes] [NP his food]
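To show how the second output can be derived from the first, here is a minimal sketch of a tag-based chunker. It simply merges adjacent tokens whose tags belong to the same group; the function name and the tag groupings are assumptions chosen for this example, and real chunkers use richer patterns or trained models.

```python
def chunk(tagged):
    """Group tagged tokens into flat NP / VP chunks; other tokens act as boundaries."""
    np_tags = {"DT", "PRP$", "JJ", "NN", "NNS", "NNP", "NNPS", "PRP"}
    chunks = []
    previous_kind = None
    for word, tag in tagged:
        if tag in np_tags:
            kind = "NP"
        elif tag.startswith("VB") or tag == "MD":
            kind = "VP"
        else:
            kind = None  # punctuation, prepositions, etc. end the current chunk
        if kind is not None:
            if kind == previous_kind:
                chunks[-1][1].append(word)   # extend the current chunk
            else:
                chunks.append((kind, [word]))  # start a new chunk
        previous_kind = kind
    return " ".join(f"[{kind} {' '.join(words)}]" for kind, words in chunks)

tagged = [("My", "PRP$"), ("dog", "NN"), ("likes", "VBZ"),
          ("his", "PRP$"), ("food", "NN"), (".", ".")]
print(chunk(tagged))
```

Because it only looks at adjacent tags, this sketch would wrongly merge two neighbouring noun phrases (as in "gave the dog his food") into one chunk.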
The Constraint Grammar framework is illustrative. In its simplest, crudest form, it takes as input POS-tagged text, and adds what you could call Part of Clause tags. For an adjective, for example, it could add #NN> to indicate that it is part of an NP whose head word is to the right.
In a POS tagger, we tag words using a "tagset" like {noun, verb, adj, adv, prob...},
while a shallow parser tries to identify sub-components of the sentence, such as named entities and phrases, like
"I'm currently (taking a Natural (Language Processing course) at (my University)) and (still confused with some basic concept.)"
D. Jurafsky and J. H. Martin say in their book that a shallow parse (partial parse) is a parse that doesn't extract all the possible information from the sentence, but only the information that is valuable in the specific case.
Chunking is just one of the approaches to shallow parsing. As mentioned above, it extracts only information about basic non-recursive phrases (e.g. verb phrases or noun phrases).
Other approaches, for example, produce flattened parse trees. These trees may contain information about part-of-speech tags, but defer decisions that may require semantic or contextual factors, such as PP attachments, coordination ambiguities, and nominal compound analyses.
So, a shallow parse is a parse that produces a partial parse tree. Chunking is an example of such parsing.

What is the difference between lemmatization and stemming?

When do I use each?
Also, is the NLTK lemmatization dependent upon parts of speech?
Wouldn't it be more accurate if it was?
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
From the NLTK docs:
Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.
Lemmatisation is closely related to stemming. The difference is that a
stemmer operates on a single word without knowledge of the context,
and therefore cannot discriminate between words which have different
meanings depending on part of speech. However, stemmers are typically
easier to implement and run faster, and the reduced accuracy may not
matter for some applications.
For instance:
The word "better" has "good" as its lemma. This link is missed by
stemming, as it requires a dictionary look-up.
The word "walk" is the base form for word "walking", and hence this
is matched in both stemming and lemmatisation.
The word "meeting" can be either the base form of a noun or a form
of a verb ("to meet") depending on the context, e.g., "in our last
meeting" or "We are meeting again tomorrow". Unlike stemming,
lemmatisation can in principle select the appropriate lemma
depending on the context.
Source: https://en.wikipedia.org/wiki/Lemmatisation
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem, it would return 'Car' and this is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim etc.
Lemmatization is computationally expensive since it involves look-up tables and what not. If you have large dataset and performance is an issue, go with Stemming. Remember you can also add your own rules to Stemming. If accuracy is paramount and dataset isn't humongous, go with Lemmatization.
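The contrast in these examples can be sketched with two toy functions: a suffix stripper and a dictionary lookup. Both are stand-ins invented for illustration (the suffix list and the small lexicon are assumptions); real stemmers such as Porter apply many more rules and may produce different output for the same words.

```python
SUFFIXES = ("ing", "es", "s")  # tried in this order; an assumed, deliberately tiny rule set

def toy_stem(word):
    """Strip the first matching suffix: no dictionary, no context."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexicon.
TOY_LEXICON = {"caring": "care", "walking": "walk", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return TOY_LEXICON.get(word, word)

for w in ("Caring", "walking", "mice"):
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

The stripper is fast because it only touches the end of the string, while the lookup is only as good as the lexicon behind it, which mirrors the speed versus accuracy trade-off described above.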
There are two aspects to show their differences:
A stemmer will return the stem of a word, which needn't be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root. Lemmatisation, in contrast, returns the dictionary form of a word, which must be a valid word.
In lemmatisation, the part of speech of a word should be first determined and the normalisation rules will be different for different part of speech, while the stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general "term conflation" procedures, which may also address lexico-semantic, syntactic, or orthographic variations.
The real difference between stemming and lemmatization is threefold:
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;
Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;
In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexica. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.
Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.
As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.
As for when you would use one or the other, it's a matter of how much your application depends on getting the meaning of a word in context correct. If you're doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you're doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.
As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it "dove" and "v" yields "dive" while "dove" and "n" yields "dove".
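The behaviour described here (the part of speech changes the lemma, and a missing part of speech falls back to noun) can be mimicked with a toy lookup keyed by word and POS. This is not NLTK's implementation; the table and the function name are hypothetical and exist only to show why the tag matters.

```python
# Hypothetical (word, pos) -> lemma table; a real lemmatizer consults a full lexicon.
TOY_LEMMAS = {
    ("dove", "v"): "dive",
    ("stripes", "v"): "strip",
    ("stripes", "n"): "stripe",
    ("meeting", "v"): "meet",
}

def toy_lemmatize_pos(word, pos="n"):
    """Look up the lemma for (word, pos); unknown pairs are returned unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize_pos("dove", "v"))   # verb reading
print(toy_lemmatize_pos("dove"))        # no POS given, so the noun default applies
print(toy_lemmatize_pos("meeting", "v"), toy_lemmatize_pos("meeting", "n"))
```

With a default of noun, forgetting to pass the tag silently leaves verb forms such as "dove" or "meeting" unchanged, so a POS tagger usually runs first.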
An example-driven explanation of the differences between lemmatization and stemming:
Lemmatization handles matching “car” to “cars” along
with matching “car” to “automobile”.
Stemming handles matching “car” to “cars” .
Lemmatization implies a broader scope of fuzzy word matching that is
still handled by the same subsystems. It implies certain techniques
for low level processing within the engine, and may also reflect an
engineering preference for terminology.
[...] Taking FAST as an example,
their lemmatization engine handles not only basic word variations like
singular vs. plural, but also thesaurus operators like having “hot”
match “warm”.
This is not to say that other engines don’t handle synonyms, of course
they do, but the low level implementation may be in a different
subsystem than those that handle base stemming.
http://www.ideaeng.com/stemming-lemmatization-0601
Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn't have any meaning.
Examples,
"beautiful" -> "beauti"
"corpora" -> "corpora"
Stemming can be done very quickly.
Lemmatization, on the other hand, is the process of converting the given word into its base form according to the dictionary meaning of the word.
Examples,
"beautiful" -> "beautiful"
"corpora" -> "corpus"
Lemmatization takes more time than stemming.
I think stemming is a rough hack people use to get all the different forms of the same word down to a base form, which need not be a legitimate word on its own.
Something like the Porter Stemmer uses simple suffix rules to eliminate common word endings.
Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word.
Something like Morpha uses FSTs to bring nouns and verbs to their base form.
Huang et al. describe stemming and lemmatization as follows. The choice between them depends on the problem and on the computational resources available.
Stemming identifies the common root form of a word by removing or replacing word suffixes (e.g. “flooding” is stemmed as “flood”), while lemmatization identifies the inflected forms of a word and returns its base form (e.g. “better” is lemmatized as “good”).
Huang, X., Li, Z., Wang, C., & Ning, H. (2020). Identifying disaster related social media for rapid response: a visual-textual fused CNN architecture. International Journal of Digital Earth, 13(9), 1017–1039. https://doi.org/10.1080/17538947.2019.1633425
Stemming and lemmatization both produce a base form of inflected words; the only difference is that a stem may not be an actual word, whereas a lemma is an actual word of the language.
Stemming follows an algorithm with a fixed sequence of steps to perform on each word, which makes it faster. In lemmatization, by contrast, you also use a corpus to supply the lemma, which makes it slower than stemming. You may also have to specify the part of speech to get the proper lemma.
The above points show that if speed is the priority, stemming should be used, since lemmatizers scan a corpus, which consumes time and processing power. Whether to use a stemmer or a lemmatizer depends on the problem you're working on.
For more info, visit the link:
https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
Stemming
is the process of reducing morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers.
Often when searching text for a certain keyword, it helps if the search returns variations of the word.
For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats].
Lemmatization
looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’.
I referred to this link:
https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
In short, the difference between these algorithms is that only lemmatization includes the meaning of the word in the evaluation. In stemming, only a certain number of letters are cut off from the end of the word to obtain a word stem. The meaning of the word does not play a role in it.
In short:
Lemmatization: uses context to transform words to their
dictionary(base) form also known as Lemma
Stemming: uses the stem of the word, most of the time removing derivational affixes.

Resources