Is it possible to handle synonyms, or to have two or more specific words treated as a single feature, in a bag-of-words model in the Stanford Classifier?
For instance:
I would want "would" and "could" to be considered as a single feature.
I don't exactly understand your question. Please be a bit more specific about what you are trying to classify.
But generally, you can always transform your input before giving it to any classifier, e.g. replacing "hey, could I help you" with "X, Y I help you", where X is a placeholder for the group {hi, hey, hello, ...} and Y for a group like {would, could, ...}.
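A minimal sketch of that substitution in Python (the group names and word lists here are invented for illustration):

groups = {
    "hi": "GREETING", "hey": "GREETING", "hello": "GREETING",
    "would": "MODAL", "could": "MODAL",
}

def collapse(tokens):
    # Map every word that belongs to a group to its placeholder;
    # leave all other tokens unchanged.
    return [groups.get(t.lower(), t) for t in tokens]

print(collapse(["hey", ",", "could", "i", "help", "you"]))
# ['GREETING', ',', 'MODAL', 'i', 'help', 'you']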
These groups are sometimes called "synsets", e.g. in WordNet (http://wordnet.princeton.edu/). Here are the synsets of "hello" in WordNet: [1]. If this is helpful, there are APIs to access WordNet.
You can of course also create these word groups manually. Keep in mind, though, that there are a lot of ambiguous words for which assigning one of these groups is quite hard.
[1] http://wordnetweb.princeton.edu/perl/webwn?s=hello&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
I have a list of 20,000 words. I want to know which of the 20k words are "weird" in some way. This is part of a text cleaning task.
Albóndiga is fine; huticotai is not a Spanish word I know... and neither is 56%$3estapa.
This means I must compare declined/conjugated words in isolation to some source of truth. Everyone recommends SpaCy. Fine.
Somehow, though, using the code below and a test file with a few dozen words, spaCy thinks they are all "ROOT" words. If you speak Spanish, you'll know that's not the case.
Technically, I don't want to lemmatize anything! I want to stem the words. I just want to pare down the 20k-long wordlist to something I, as a Spanish-speaking linguist, can look at to determine what sorts of crazy desmadre (B.S.) is going on.
Here is an example of the output I get:
trocito NOUN ROOT trocito
ayuntamiento NOUN ROOT ayuntamiento
eyre NOUN ROOT eyre
suscribíos NOUN ROOT suscribío
mezcal ADJ ROOT mezcal
marivent VERB ROOT mariventir
inversores NOUN ROOT inversor
stenger VERB ROOT stenger
Clearly, "stenger" is not a Spanish word, though naïvely, spaCy thinks it is. Mezcal is a NOUN (and a very good time). You get the picture.
Here is my code:
import spacy

nlp = spacy.load("es_core_news_sm")

# Read the wordlist, one word per line
new_lst = []
with open("vocabu_suse.txt", 'r') as lst:
    for i in lst:
        new_lst.append(i.strip())

# Run each word through the pipeline in isolation
for i in new_lst:
    j = nlp(i)
    for token in j:
        print(token.text, token.pos_, token.dep_, token.lemma_)
I'm pretty confused about what you're trying to do here. To be clear, my understanding is that your main objective is to find which words in your list aren't junk; you tried using lemmatization to check them, and the lemmatization results seem wrong.
This means I must compare declined/conjugated words in isolation to some source of truth. Everyone recommends SpaCy. Fine.
spaCy is great for many NLP tasks but dealing with lists of words with no context is really not something it's intended for, and I don't think it's going to help you much here.
Finding Real Words
To address your main problem...
First, to find out what words aren't junk, you can use a wordlist you trust or a relatively normal large corpus.
If you don't have a wordlist you trust, you can see if the words are in the vocab of the spaCy models. spaCy's vocabs can include junk words, but since they're built by frequency they should only include common errors. The wordfreq repo may also be helpful.
To check whether a word has a word vector in spaCy, you can use tok.is_oov in the medium or large models.
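For example, a sketch assuming the medium Spanish model (which ships with vectors) is installed via python -m spacy download es_core_news_md:

import spacy

nlp = spacy.load("es_core_news_md")

for word in ["trocito", "stenger", "marivent"]:
    token = nlp(word)[0]
    # is_oov is True when the token is outside the model's vocabulary
    print(word, token.is_oov)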
If you want to go from a corpus, use a corpus you have or Wikipedia or something, and discard words below a certain word count threshold. (I understand inflection makes this more difficult, but with a sufficiently large corpus you should still be finding real words with some frequency.)
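A rough sketch of that frequency-threshold approach (corpus.txt and the threshold value are placeholders):

from collections import Counter

THRESHOLD = 5  # tune to your corpus size

# Count token frequencies in a large, reasonably clean corpus
counts = Counter()
with open("corpus.txt") as f:
    for line in f:
        counts.update(line.lower().split())

with open("vocabu_suse.txt") as f:
    wordlist = [line.strip() for line in f]

# Words rarely or never seen in the corpus are candidates for review
suspicious = [w for w in wordlist if counts[w.lower()] < THRESHOLD]
print(len(suspicious), "words to review")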
About the ROOT
The ROOT label is a dependency relation, not a property of the word itself. In a dependency parse of a sentence, the ROOT is typically the main verb; in a sentence of one word, that word is always the root.
I think you want the tag_ attribute, which is a detailed language-specific POS tag. The pos_ attribute is a coarse-grained tag designed for multi-lingual applications.
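For instance (exactly what tag_ contains depends on the model):

import spacy

nlp = spacy.load("es_core_news_sm")
token = nlp("trocito")[0]
# pos_ is the coarse universal tag; tag_ is the finer, language-specific one
print(token.pos_, token.tag_)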
About Lemmatization
I don't understand this statement.
technically, I don't want to lemmatize anything! I want to stem the words.
Lemmatization is almost always better than stemming. "Stemming" usually refers to a rule-based process that works on token patterns, so for your particular problem it won't help you at all: it'll just tell you whether words have common word endings. You could have a word like "asdfores" and a stemmer would happily tell you it's the plural of "asdfor". Lemmatizers are often based on databases of words that have been checked, so they're closer to what you need.
In either case, spaCy doesn't do stemming.
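To illustrate the point about stemmers, here is a sketch with NLTK's Snowball stemmer for Spanish; it happily processes a made-up word:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("spanish")
# A nonsense word still gets "stemmed", because only suffix patterns
# are checked - there is no dictionary lookup involved.
print(stemmer.stem("asdfores"))  # something like "asdfor"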
If you are trying to clean up a list of words, you could use a PoS tagger that works with a dictionary, such as FreeLing. I think the morfo output analysis level could suit your needs. As an example:
cat input.txt | analyzer -f es.cfg --flush --outlv morfo --noprob
Output:
trocito trocito NCMS000 -1
ayuntamiento ayuntamiento NCMS000 -1
eyre
suscribíos suscribir+os VMM02P0+PP2CP00 -1
mezcal mezcal NCMS000 -1
marivent
inversores inversor AQ0MP00 -1 inversor NCMP000 -1
stenger
You should check the entries without a tag assigned ("eyre", "marivent", "stenger"), which are not present in the dictionary.
Please note that the tool includes some kind of "guessing" for malformed words, so you may want to pay attention to entries including '_' in their lemma as well.
$ echo "hanticipación" | analyzer -f es.cfg --flush --outlv morfo --noprob
Output:
hanticipación h_anticipación NCFS000 -1
I understand the implicit value of part-of-speech tagging and have seen mentions about its use in parsing, text-to-speech conversion, etc.
Could you tell me how the output of a PoS tagger is formatted?
Also, could you explain how such output is used by other tasks/parts of an NLP system?
One purpose of PoS tagging is to disambiguate homonyms.
For instance, take this sentence:
I fish a fish
The same sentence in French would be Je pêche un poisson.
Without tagging, fish would be translated the same way in both cases, which would lead to
a wrong translation. However, after PoS tagging, the sentence would be
I_PRON fish_VERB a_DET fish_NOUN
From a computer's point of view, the two words are now distinct. This way, they can be processed much more effectively (in our example, fish_VERB will be translated to pêche and fish_NOUN to poisson).
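You can see this with any off-the-shelf tagger, e.g. spaCy's small English model (assumed installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("I fish a fish"):
    print(token.text, token.pos_)
# The two occurrences of "fish" should receive different tags
# (VERB vs NOUN), which is what lets a translator choose between
# pêche and poisson.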
Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).
Considering the format of the output, it doesn't really matter as long as you get a sequence of token/tag pairs. Some POS taggers allow you to specify some specific output format, others use XML or CSV/TSV, and so on.
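For instance, the same token/tag pairs can be rendered inline or as TSV:

pairs = [("I", "PRON"), ("fish", "VERB"), ("a", "DET"), ("fish", "NOUN")]

# Inline word_TAG format, as in the answer above
print(" ".join(f"{word}_{tag}" for word, tag in pairs))

# TSV: one token/tag pair per line
for word, tag in pairs:
    print(f"{word}\t{tag}")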
I've been working on a question answering engine in C#. I have implemented the features of most modern systems and am achieving good results. Despite the aid of WordNet, one problem I haven't been able to solve yet is changing the user input to the correct term.
For example
changing Weight -> Mass
changing Tall -> Height
My question is whether there is some sort of resource that can aid me in this task of mapping the user's terms to the correct ones.
Thank You
Looking at all the synsets in WordNet for both Mass and Weight, I can see that there is no shared synset, and thus no meaning in common. Words that actually do have the same meaning can be matched by means of their synset labels, as I'm sure you've realized.
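You can check this yourself with NLTK's WordNet interface (assuming the wordnet corpus has been downloaded):

from nltk.corpus import wordnet as wn

# Intersect the synsets of the two words; per the observation above,
# the result should be empty - they share no sense.
shared = set(wn.synsets("mass")) & set(wn.synsets("weight"))
print(shared)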
In my own natural language engine (http://nlp.abodit.com) I allow users to use any synset label in the grammar they define but I would still create two separate grammar rules in this case, one recognizing questions about mass and one recognizing questions about weight.
However, there are also files for WordNet that give you class relationships between synsets. For example, if you type 'define mass' into my demo page you'll see:
4. wn30:synset-mass-noun-1
the property of a body that causes it to have weight in a gravitational field
--type--> wn30:synset-fundamental_quantity-noun-1
--type--> wn30:synset-physical_property-noun-1
ITokenText, IToken, INoun, Singular
And if you do the same for 'weight' you'll also see that it too has a class relationship to 'physical property'.
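Those class relationships are exposed through NLTK as hypernyms; a quick sketch:

from nltk.corpus import wordnet as wn

# Direct parent synsets; both paths should lead to "physical property",
# matching the relationships shown above.
print(wn.synset("mass.n.01").hypernyms())
print(wn.synset("weight.n.01").hypernyms())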
In my system you can write a rule that recognizes a question about a 'physical property' and perhaps a named object and then try to figure out which physical property they are likely to be asking about. And, perhaps, if you can't match maybe just tell them all about the physical properties of the object.
The method signature in my system would be something like ...
... QuestionAboutPhysicalProperties (... IPhysicalProperty prop,
INamedObject obj, ...)
... and in code I would look at the properties of obj and try to find one called 'prop'.
The only way I know how to do this effectively requires a large corpus of user query sessions and a happiness measure on those sessions; you then look for substitutions of word x for word y (possibly given some context z) that improve user happiness.
Here is a reasonable paper on generating query substitutions.
And here is a new paper on generating synonyms from anchor text, which doesn't require a query log.
I am doing POS tagging. Given the following tokens in the training set, is it better to consider each token as Word1/POStag and Word2/POStag, or to consider them as one word, i.e. Word1/Word2/POStag?
Examples (the POS tag is not required to be included):
Bard/EMS
Interstate/Johnson
Polo/Ralph
IBC/Donoghue
ISC/Bunker
Bendix/King
mystery/comedy
Jeep/Eagle
B/T
Hawaiian/Japanese
IBM/PC
Princeton/Newport
editing/electronic
Heller/Breene
Davis/Zweig
Fleet/Norstar
a/k/a
1/2
Any suggestion is appreciated.
The examples don't seem to fall into one category with respect to the use of the slash -- a/k/a is a phrase acronym, 1/2 is a number, mystery/comedy indicates something in between the two words, etc.
I feel there is no treatment of the component words that would work for all the cases in question, and therefore the better option is to handle them as unique words. At the decoding stage, when the tagger will probably be presented with more previously unseen examples of such words, the decision can often be made based on the context rather than the word itself.
I'm currently taking a Natural Language Processing course at my university and am still confused about some basic concepts. I got this definition of POS tagging from the Foundations of Statistical Natural Language Processing book:
Tagging is the task of labeling (or tagging) each word in a sentence
with its appropriate part of speech. We decide whether each word is a
noun, verb, adjective, or whatever.
But I can't find a definition of shallow parsing in the book, even though it describes shallow parsing as one of the uses of POS tagging. So I began to search the web and found no direct explanation of shallow parsing, but Wikipedia says:
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
I frankly don't see the difference, but that may be because of my English or just me not understanding a simple basic concept. Can anyone please explain the difference between shallow parsing and POS tagging? Is shallow parsing also often called shallow semantic parsing?
Thanks in advance.
POS tagging would give a POS tag to each and every word in the input sentence.
Parsing the sentence (using the Stanford PCFG parser, for example) would convert the sentence into a tree whose leaves hold POS tags (which correspond to words in the sentence), but the rest of the tree tells you how exactly these words join together to make the overall sentence. For example, an adjective and a noun might combine to form a 'Noun Phrase', which might combine with another adjective to form another Noun Phrase (e.g. quick brown fox); the exact way the pieces combine depends on the parser in question.
You can see what parser output looks like at http://nlp.stanford.edu:8080/parser/index.jsp
A shallow parser or 'chunker' comes somewhere in between these two. A plain POS tagger is really fast but does not give you enough information, and a full-blown parser is slow and gives you too much. A POS tagger can be thought of as a parser which only returns the bottom-most tier of the parse tree; a chunker might be thought of as a parser that returns some other tier of the parse tree instead. Sometimes you just need to know that a bunch of words together form a Noun Phrase but don't care about the sub-structure of the tree within those words (i.e. which words are adjectives, determiners, nouns, etc. and how they combine). In such cases you can use a chunker to get exactly the information you need instead of wasting time generating the full parse tree for the sentence.
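For example, spaCy exposes exactly this kind of flat NP information through noun_chunks, without your having to walk a full parse tree (a sketch assuming en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")
# Flat noun-phrase spans, with no internal structure exposed
for chunk in doc.noun_chunks:
    print(chunk.text)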
POS tagging is the process of deciding the type of every token in a text, e.g. NOUN, VERB, DETERMINER, etc. A token can be a word or punctuation.
Meanwhile, shallow parsing or chunking is the process of dividing a text into syntactically related groups.
POS tagging output
My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./.
Chunking output
[NP My Dog] [VP likes] [NP his food]
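A chunker like this can be sketched with NLTK's rule-based RegexpParser (the grammar here is a toy one):

import nltk

# NP = optional determiner/possessive, optional adjectives, one or more nouns
grammar = r"NP: {<PRP\$|DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [("My", "PRP$"), ("dog", "NN"), ("likes", "VBZ"),
          ("his", "PRP$"), ("food", "NN")]
print(chunker.parse(tagged))
# (S (NP My/PRP$ dog/NN) likes/VBZ (NP his/PRP$ food/NN))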
The Constraint Grammar framework is illustrative. In its simplest, crudest form, it takes as input POS-tagged text, and adds what you could call Part of Clause tags. For an adjective, for example, it could add #NN> to indicate that it is part of an NP whose head word is to the right.
In POS tagging, we tag words using a "tagset" like {noun, verb, adj, adv, prob...},
while a shallow parser tries to identify sub-components such as named entities and phrases in the sentence, like
"I'm currently (taking a Natural (Language Processing course) at (my University)) and (still confused with some basic concept.)"
D. Jurafsky and J. H. Martin say in their book that a shallow parse (partial parse) is a parse that doesn't extract all the possible information from the sentence, but just extracts the information valuable for the specific case.
Chunking is just one of the approaches to shallow parsing. As mentioned, it extracts only information about basic non-recursive phrases (e.g. verb phrases or noun phrases).
Other approaches, for example, produce flattened parse trees. These trees may contain information about part-of-speech tags, but defer decisions that may require semantic or contextual factors, such as PP attachments, coordination ambiguities, and nominal compound analyses.
So, a shallow parse is a parse that produces a partial parse tree, and chunking is an example of such parsing.