'EditContactPopup' vs 'ContactEditPopup': which one is right?

I'm not a native English speaker and want to know the proper ordering of words.
I always confuse the two:
verb + noun + Popup (or whatever)
noun + verb + Popup (or whatever)
Which one should I use?

First, take Popup entirely out of the equation; it is just a descriptor. Then, as for whether you should go for verb + noun or noun + verb, it depends on whether the noun is the subject or the object of the sentence.
We can then follow the subject-verb-object sentence structure. In this particular case, you want EditContact, because the noun, contact, is the object of the verb.
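As a rough illustration of that rule, here is a minimal Python sketch. The helper name component_name and its noun_is_object flag are made up for this example; they only encode the word order described above and are not part of any library.

```python
def component_name(verb: str, noun: str, descriptor: str = "Popup",
                   noun_is_object: bool = True) -> str:
    """Build a PascalCase name that follows subject-verb-object order."""
    # Noun is the object: the verb comes first ("edit contact").
    # Noun is the subject: the noun comes first ("contact changed").
    parts = [verb, noun] if noun_is_object else [noun, verb]
    parts.append(descriptor)  # the descriptor always goes last
    return "".join(p[:1].upper() + p[1:] for p in parts)


print(component_name("edit", "contact"))  # EditContactPopup
```

With noun_is_object=False the same helper puts the noun first, which matches the case where the noun is the subject.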


Stem Spanish words in isolation to validate that they are "words" in SpaCy's (or any) dictionary

I have a list of 20,000 words. I want to know which of the 20k words are "weird" in some way. This is part of a text cleaning task.
Albóndiga is fine, huticotai is no Spanish word I know... neither is 56%$3estapa
This means I must compare declined/conjugated words in isolation to some source of truth. Everyone recommends SpaCy. Fine.
Somehow, though, using the code below and a test file with a few dozen words, spaCy thinks they are all "ROOT" words. If you speak Spanish, you'll know that isn't so.
Technically, I don't want to lemmatize anything! I want to stem the words. I just want to pare down the 20k-long wordlist to something I, as a Spanish-speaking linguist, can look at to determine what sort of crazy desmadre (B.S.) is going on.
Here is an example of the output I get:
trocito NOUN ROOT trocito
ayuntamiento NOUN ROOT ayuntamiento
eyre NOUN ROOT eyre
suscribíos NOUN ROOT suscribío
mezcal ADJ ROOT mezcal
marivent VERB ROOT mariventir
inversores NOUN ROOT inversor
stenger VERB ROOT stenger
Clearly, "stenger" is not a Spanish word, though naïvely, spaCy thinks it is. Mezcal is a NOUN (and a very good time). You get the picture.
Here is my code:
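import spacy
# Small Spanish pipeline; each word is processed as a one-token "sentence".
nlp = spacy.load("es_core_news_sm")
new_lst = []
# Read one word per line, dropping surrounding whitespace.
with open("vocabu_suse.txt", "r", encoding="utf-8") as lst:
    for i in lst:
        new_lst.append(i.strip())
for i in new_lst:
    j = nlp(i)
    for token in j:
        print(token.text, token.pos_, token.dep_, token.lemma_)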
I'm pretty confused about what you're trying to do here. To be clear, my understanding is that your main objective is to find what words in your list aren't junk, and you tried using lemmatization to check them, and the lemmatization results seem wrong.
This means I must compare declined/conjugated words in isolation to some source of truth. Everyone recommends SpaCy. Fine.
spaCy is great for many NLP tasks but dealing with lists of words with no context is really not something it's intended for, and I don't think it's going to help you much here.
Finding Real Words
To address your main problem...
First, to find out what words aren't junk, you can use a wordlist you trust or a relatively normal large corpus.
If you don't have a wordlist you trust, you can see if the words are in the vocab of the spaCy models. spaCy's vocabs can include junk words, but since they're built by frequency they should only include common errors. This wordfreq repo may also be helpful.
To check whether a word has a word vector in spaCy, in the medium or large models you can use tok.is_oov, which is True when the token has no vector.
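A minimal sketch of that check, assuming a pipeline with vectors such as es_core_news_md is installed. The helper name split_by_vocab is made up for this example; only Token.is_oov comes from spaCy.

```python
def split_by_vocab(words, nlp):
    """Split words into (known, unknown) using each token's is_oov flag.

    `nlp` is any callable that returns an iterable of tokens with an
    `is_oov` attribute, e.g. a loaded spaCy pipeline.
    """
    known, unknown = [], []
    for word in words:
        doc = nlp(word)
        # A word counts as unknown if any of its tokens has no vector.
        if any(tok.is_oov for tok in doc):
            unknown.append(word)
        else:
            known.append(word)
    return known, unknown


# Usage, assuming the medium Spanish model has been downloaded:
#   import spacy
#   nlp = spacy.load("es_core_news_md")
#   known, unknown = split_by_vocab(["albóndiga", "huticotai"], nlp)
```

The unknown list is only a set of candidates for manual review, since model vocabularies can contain junk and miss rare but valid words.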
If you want to go from a corpus, use a corpus you have or Wikipedia or something, and discard words below a certain word count threshold. (I understand inflection makes this more difficult, but with a sufficiently large corpus you should still be finding real words with some frequency.)
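The corpus approach can be sketched with the standard library alone. The function name frequent_words and the min_count threshold are assumptions for illustration; a sensible threshold depends on the size of your corpus.

```python
import re
from collections import Counter


def frequent_words(corpus_text, candidates, min_count=2):
    """Split candidates into (keep, review) by their corpus frequency."""
    # \w+ also matches accented letters, so Spanish words stay intact.
    counts = Counter(re.findall(r"\w+", corpus_text.lower()))
    keep = [w for w in candidates if counts[w.lower()] >= min_count]
    review = [w for w in candidates if counts[w.lower()] < min_count]
    return keep, review


corpus = "El mezcal es bueno. Me gusta el mezcal. stenger"
keep, review = frequent_words(corpus, ["mezcal", "stenger", "huticotai"])
print(keep)    # ['mezcal']
print(review)  # ['stenger', 'huticotai']
```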
About the ROOT
The ROOT tag is a dependency tag, not a tag of word form. In a dependency parse of a sentence the ROOT is typically the main verb. In a sentence of one word, that word is always the root.
I think you want the tag_ attribute, which is a detailed language-specific POS tag. The pos_ attribute is a coarse-grained tag designed for multi-lingual applications.
About Lemmatization
I don't understand this statement.
technically, I don't want to lemmatize anything! I want to stem the words.
Lemmatization is almost always better than stemming. "Stemming" usually refers to a rule-based process that works based on token patterns, so for your particular problem it won't help you at all - it'll just tell you if words have common word endings or something. You could have a word like "asdfores" and a stemmer would happily tell you it's the plural of "asdfor". Lemmatizers are often based on databases of words that have been checked so it's closer to what you need.
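To make the contrast concrete, here is a toy sketch. The suffix rules and the tiny LEXICON are invented for illustration; they are not a real Spanish stemmer or lemmatizer.

```python
SUFFIXES = ("es", "s")  # toy plural rules, longest suffix tried first

# Hypothetical stand-in for a checked word database.
LEXICON = {"inversores": "inversor", "trocito": "trocito"}


def naive_stem(word):
    """Strip a known suffix based on the token pattern alone."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the lemma from the lexicon, or None for unknown words."""
    return LEXICON.get(word)


print(naive_stem("asdfores"))    # asdfor
print(lookup_lemma("asdfores"))  # None
```

The stemmer accepts the junk word without complaint, while the lookup returns None, which is the signal you need when cleaning a wordlist.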
In either case, spaCy doesn't do stemming.
If you are trying to clean up a list of words, you could use a PoS tagger that works with a dictionary, such as Freeling. I think that the output analysis level morfo could be suitable for your needs. As an example:
cat input.txt | analyzer -f es.cfg --flush --outlv morfo --noprob
Output:
trocito trocito NCMS000 -1
ayuntamiento ayuntamiento NCMS000 -1
eyre
suscribíos suscribir+os VMM02P0+PP2CP00 -1
mezcal mezcal NCMS000 -1
marivent
inversores inversor AQ0MP00 -1 inversor NCMP000 -1
stenger
You should check the entries without a tag assigned ("eyre", "marivent", "stenger"), since they are not present in the dictionary.
Please note that the tool includes some kind of "guessing" for malformed words, so maybe you should pay attention to entries including '_' in their lemma as well.
$ echo "hanticipación" | analyzer -f es.cfg --flush --outlv morfo --noprob
Output:
hanticipación h_anticipación NCFS000 -1
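A short sketch for post-processing that output in Python. It assumes one input word per line and the layout shown above (the word form followed by zero or more lemma/tag/probability triples); the function name flag_entries is made up for this example.

```python
def flag_entries(morfo_output):
    """Return (unknown, guessed) word forms from morfo-level output."""
    unknown, guessed = [], []
    for line in morfo_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        form, analyses = fields[0], fields[1:]
        if not analyses:
            # No analysis at all: the form is not in the dictionary.
            unknown.append(form)
            continue
        # Analyses come as (lemma, tag, probability) triples.
        lemmas = analyses[0::3]
        if any("_" in lemma for lemma in lemmas):
            guessed.append(form)
    return unknown, guessed


sample = """trocito trocito NCMS000 -1
eyre
inversores inversor AQ0MP00 -1 inversor NCMP000 -1
hanticipación h_anticipación NCFS000 -1"""
print(flag_entries(sample))  # (['eyre'], ['hanticipación'])
```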

Is there any information extraction tool that finds subject and verb/relation pairs in a sentence, like ClausIE, ReVerb, etc.?

I have used ClausIE, and it returns subject, verb and object triples from a sentence. But it doesn't work when the text is short and not even a complete sentence. I just want a library or some other method that can return just the subject-verb pairs from a short text or phrase.
An example short text is "Proposal 32 accepted". There should be some dependency, or maybe rules, that can identify that the term "Proposal" is the subject and the term "accepted" is the verb/relation.
I have tried the Stanford online parser on the above text, but it doesn't return anything, maybe because there is no object in the text.
Any advice would be appreciated.
The problem is that you have a Subject ("Proposal 32") and a Verb ("accepted"), but no Object, so there is no triple.
What you could do instead is try to identify the Subject and the Verb from a parse, for example with the Stanford online parser.
For example:
- The sentence is probably "declarative" if Stanford uses the "S" tag.
- If the sentence is declarative, then:
  - the Subject is usually the Noun group that is in front of the main Verb group. In Stanford online that's the first NP in front of the first VP.
Now, if you:
- Add "is" in front of the main verb, you get: "Proposal 32 is accepted".
- That is: "Proposal 32 = accepted", which is a logical comparison that any programming language understands.
The problem, of course, is that you don't always get these simple short sentences. There are probably some packages out there that can deal with this out of the box, but none that I know of.
What you can do is make some rules of your own, based on English grammar. It would only understand sentences covered by the rules that you make, but maybe that's all you need. If you only have to deal with these very short combinations, a few well-designed rules can do the job.
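One such rule could look like the sketch below. It assumes the text has already been POS-tagged with Penn Treebank tags by some tagger; the function name subject_verb and the tag set used for noun groups are choices made for this example only.

```python
# Tags treated as part of a noun group (an assumption of this sketch).
NOUN_GROUP_TAGS = {"DT", "JJ", "CD", "NN", "NNS", "NNP", "NNPS", "PRP"}


def subject_verb(tagged):
    """Return (subject, verb) for the noun group right before the first verb.

    `tagged` is a list of (word, tag) pairs. Returns None when no noun
    group directly precedes a verb.
    """
    subject = []
    for word, tag in tagged:
        if tag.startswith("VB"):
            return (" ".join(subject), word) if subject else None
        if tag in NOUN_GROUP_TAGS:
            subject.append(word)
        else:
            subject = []  # something else interrupted the noun group
    return None


print(subject_verb([("Proposal", "NN"), ("32", "CD"), ("accepted", "VBN")]))
# ('Proposal 32', 'accepted')
```

This covers only the "noun group followed by verb" pattern, which is the point of hand-written rules: they handle exactly the cases you write them for.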

Identify prepositions and individual POS

I am trying to find the correct part of speech for each word in a paragraph. I am using the Stanford POS Tagger. However, I am stuck at one point.
I want to identify the prepositions in the paragraph.
The Penn Treebank tagset says that:
IN Preposition or subordinating conjunction
How can I be sure whether the current word is a preposition or a subordinating conjunction? How can I extract only the prepositions from the paragraph in this case?
You can't be sure. The reason for this somewhat strange PoS is that it's really hard to automatically determine whether, for example, "for" is a preposition or a subordinating conjunction. So in order for automatic taggers to have better precision, this distinction is simply ignored. Note that there is also a tag TO, which is given to any occurrence of "to", regardless of its function as a preposition, infinitive particle or whatever (I think there are others).
If you need to identify prepositions properly, you need to retrain a tagger with a modified tag set, or maybe train a classifier which takes PoS-tagged text and only does this final disambiguation.
I have made some progress on telling whether a word is actually a preposition or a subordinating conjunction.
I parsed the following sentence:
She left early because Mike arrived with his new girlfriend.
(here "because" is a subordinating conjunction)
After POS tagging:
She_PRP left_VBD early_RB because_IN Mike_NNP arrived_VBD with_IN
his_PRP$ new_JJ girlfriend_NN ._.
To determine whether "because" is a preposition or not, I parsed the sentence.
In the parse tree, the direct parent of the IN node for "because" is SBAR (subordinate clause).
"with" is also tagged IN, but its direct parent is PP, so it is a preposition.
Example 2:
Keep your hand on the wound until the nurse asks you to take it off.
(here "until" is a subordinating conjunction)
POS tagging gives:
Keep_VB your_PRP$ hand_NN on_IN the_DT wound_NN until_IN the_DT
nurse_NN asks_VBZ you_PRP to_TO take_VB it_PRP off_RP ._.
So "until" and "on" are both marked as IN.
However, the picture gets clearer when we actually parse the sentence.
So finally I conclude that "because" is a subordinating conjunction and "with" is a preposition.
I tried this on many variations of sentences. It worked for almost all of them, except some cases with "before" and "after".
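The parent check described above can be sketched in plain Python. The bracketed parse in the example is hand-written to match the description (SBAR above "because", PP above "with"), not actual parser output, and the helper names parse_tree and classify_in are made up here.

```python
def parse_tree(text):
    """Parse a Penn-style bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def classify_in(tree):
    """Label every IN leaf by the category of its direct parent."""
    kinds = {"PP": "preposition", "SBAR": "subordinating conjunction"}
    results = []

    def walk(node, parent_label):
        label, children = node
        if label == "IN" and children and isinstance(children[0], str):
            results.append((children[0], kinds.get(parent_label, "unknown")))
        for child in children:
            if isinstance(child, tuple):
                walk(child, label)

    walk(tree, None)
    return results


parse = """(ROOT (S (NP (PRP She)) (VP (VBD left) (ADVP (RB early))
  (SBAR (IN because) (S (NP (NNP Mike)) (VP (VBD arrived)
  (PP (IN with) (NP (PRP$ his) (JJ new) (NN girlfriend))))))) (. .)))"""
print(classify_in(parse_tree(parse)))
# [('because', 'subordinating conjunction'), ('with', 'preposition')]
```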

Infinitive form disambiguation

How can I decide whether a word in a sentence is an infinitive or not?
For example, here "fixing" is infinitive:
Fixing the door was also easy but fixing the window was very hard.
But in
I am fixing the door
it is not. How do people disambiguate these cases?
To elaborate on my comment:
In PoS tagging, choosing between a gerund (VBG) and a noun (NN) is quite subtle and has many special cases. My understanding is fixing should be tagged as a gerund in your first sentence, because it can be modified by an adverb in that context. Citing from the Penn PoS tagging guidelines (page 19):
"While both nouns and gerunds can be preceded by an article or a possessive pronoun, only a noun (NN) can be modified by an adjective, and only a gerund (VBG) can be modified by an adverb."
EXAMPLES:
Good/JJ cooking/NN is something to enjoy.
Cooking/VBG well/RB is a useful skill.
Assuming you meant 'automatically disambiguate', this task requires a bit of processing (pos-tagging and syntactic parsing). The idea is to find instances of a verb that are not preceded by an agreeing Subject Noun Phrase. If you also want to catch infinitive forms like "to fix", just add that to the list of forms you are looking for.
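As a much simpler stand-in for the parse-based approach, the sketch below uses a single heuristic on POS-tagged input: an -ing form (VBG) directly after a form of "be", ignoring adverbs in between, is treated as progressive, and any other VBG as a gerund-like use. The heuristic, the BE_FORMS list and the function name classify_ing are assumptions of this example, and the input tags are written by hand.

```python
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being",
            "'m", "'s", "'re"}


def classify_ing(tagged):
    """Label each VBG token as 'progressive' or 'gerund'."""
    results = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "VBG":
            continue
        j = i - 1
        while j >= 0 and tagged[j][1] == "RB":  # skip adverbs
            j -= 1
        if j >= 0 and tagged[j][0].lower() in BE_FORMS:
            results.append((word, "progressive"))
        else:
            results.append((word, "gerund"))
    return results


print(classify_ing([("I", "PRP"), ("am", "VBP"), ("fixing", "VBG"),
                    ("the", "DT"), ("door", "NN")]))
# [('fixing', 'progressive')]
```

A heuristic like this will miss cases that need real syntax, such as a "be" form that belongs to a different clause.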

What Is the Difference Between POS Tagging and Shallow Parsing?

I'm currently taking a Natural Language Processing course at my university and am still confused about some basic concepts. I got the definition of POS tagging from the book Foundations of Statistical Natural Language Processing:
Tagging is the task of labeling (or tagging) each word in a sentence
with its appropriate part of speech. We decide whether each word is a
noun, verb, adjective, or whatever.
But I can't find a definition of shallow parsing in the book; it only mentions shallow parsing as one of the uses of POS tagging. So I searched the web and found no direct explanation of shallow parsing, except on Wikipedia:
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
I frankly don't see the difference, but it may be because of my English or just me not understanding a simple basic concept. Can anyone please explain the difference between shallow parsing and POS tagging? Is shallow parsing also often called shallow semantic parsing?
Thanks in advance.
POS tagging would give a POS tag to each and every word in the input sentence.
Parsing the sentence (using the Stanford PCFG parser, for example) would convert the sentence into a tree whose leaves hold POS tags (which correspond to words in the sentence), but the rest of the tree would tell you exactly how these words join together to make the overall sentence. For example, an adjective and a noun might combine to be a 'Noun Phrase', which might combine with another adjective to form another Noun Phrase (e.g. quick brown fox). The exact way the pieces combine depends on the parser in question.
You can see what parser output looks like at http://nlp.stanford.edu:8080/parser/index.jsp
A shallow parser or 'chunker' comes somewhere in between these two. A plain POS tagger is really fast but does not give you enough information, and a full-blown parser is slow and gives you too much. A POS tagger can be thought of as a parser which only returns the bottom-most tier of the parse tree to you. A chunker might be thought of as a parser that returns some other tier of the parse tree to you instead. Sometimes you just need to know that a bunch of words together form a Noun Phrase but don't care about the sub-structure of the tree within those words (i.e. which words are adjectives, determiners, nouns, etc. and how they combine). In such cases you can use a chunker to get exactly the information you need instead of wasting time generating the full parse tree for the sentence.
POS tagging is the process of deciding the type of every token in a text, e.g. NOUN, VERB, DETERMINER, etc. A token can be a word or a punctuation mark.
Shallow parsing, or chunking, is the process of dividing a text into syntactically related groups.
POS tagging output
My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./.
Chunking output
[NP My dog] [VP likes] [NP his food]
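To show how chunking builds on POS tags, here is a crude sketch that groups consecutive tokens by a tag-to-chunk mapping. The mapping in NP_TAGS and the function names are assumptions of this example; real chunkers use richer grammars or trained models, and this one would wrongly merge two adjacent noun phrases.

```python
from itertools import groupby

# Tags that this sketch treats as belonging to a noun phrase.
NP_TAGS = {"DT", "PRP$", "JJ", "NN", "NNS", "NNP", "NNPS"}


def chunk_type(tag):
    """Map a Penn Treebank tag to a chunk label, or None for 'outside'."""
    if tag in NP_TAGS:
        return "NP"
    if tag.startswith("VB"):
        return "VP"
    return None


def chunk(tagged_text):
    """Turn 'word/TAG word/TAG ...' into a bracketed chunk string."""
    tagged = [tuple(tok.rsplit("/", 1)) for tok in tagged_text.split()]
    chunks = []
    for kind, group in groupby(tagged, key=lambda pair: chunk_type(pair[1])):
        if kind is not None:  # punctuation and other tags are left out
            chunks.append(f"[{kind} {' '.join(word for word, _ in group)}]")
    return " ".join(chunks)


print(chunk("My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./."))
# [NP My dog] [VP likes] [NP his food]
```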
The Constraint Grammar framework is illustrative. In its simplest, crudest form, it takes as input POS-tagged text, and adds what you could call Part of Clause tags. For an adjective, for example, it could add #NN> to indicate that it is part of an NP whose head word is to the right.
In a POS tagger, we tag words using a "tagset" like {noun, verb, adj, adv, prob...},
while a shallow parser tries to identify sub-components of the sentence, such as named entities and phrases, like
"I'm currently (taking a Natural (Language Processing course) at (my University)) and (still confused with some basic concept.)"
D. Jurafsky and J. H. Martin say in their book that a shallow parse (partial parse) is a parse that doesn't extract all the possible information from the sentence, but only the information that is valuable in the specific case.
Chunking is just one of the approaches to shallow parsing. As was mentioned, it extracts only information about basic non-recursive phrases (e.g. verb phrases or noun phrases).
Other approaches, for example, produce flattened parse trees. These trees may contain information about part-of-speech tags, but defer decisions that may require semantic or contextual factors, such as PP attachments, coordination ambiguities, and nominal compound analyses.
So a shallow parse is a parse that produces a partial parse tree. Chunking is an example of such parsing.

Resources