How to get inflections for a word using WordNet - nlp

I want to get the inflectional forms for a word using WordNet.
E.g. if the word is make, then its inflections are
made, makes, making
I tried all the options of the wn command, but I did not get the inflections for a word.
Any idea how to get these?

I am not sure WordNet was intended to inflect words; it works in the opposite direction. This writeup, https://github.com/jdee/dubsar/wiki/Inflections, describes how WordNet(R) uses the Morphy algorithm to determine the head term associated with an inflected form. I needed some inflections for a Python project a while ago and used https://github.com/pwdyson/inflect.py and https://bitbucket.org/cnu/montylingua3/overview/ (the latter required some hacking; also take a look at the original http://web.media.mit.edu/~hugo/montylingua/).
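For the number inflections (singular/plural), a minimal sketch of the inflect package mentioned above; note that inflect handles number only, not tense, so it will not produce forms like made or making:

import inflect

p = inflect.engine()
print(p.plural("watch"))         # 'watches'
print(p.plural_verb("is"))       # 'are'
print(p.singular_noun("makes"))  # 'make'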

This Python package, LemmInflect, provides functions to get all inflections of a word.
Copying their examples here:
>>> from lemminflect import getInflection, getAllInflections, getAllInflectionsOOV
>>> getInflection('watch', tag='VBD')
('watched',)
>>> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}
>>> getAllInflections('watch', upos='VERB')
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}
>>> getAllInflectionsOOV('xxwatch', upos='NOUN')
{'NN': ('xxwatch',), 'NNS': ('xxwatches',)}
Check out https://lemminflect.readthedocs.io/en/latest/inflections/ for more details.

Related

Searching a lot of keywords on twitter via tweepy

I am trying to write Python code with Tweepy that will track all tweets from a specific country, from a given date onward, that contain some of my chosen keywords. I have chosen a lot of keywords, around 24-25.
My keywords are vigilance anticipation interesting ecstasy joy serenity admiration trust acceptance terror fear apprehensive amazement surprise distraction grief sadness pensiveness loathing disgust boredom rage anger annoyance.
For more understanding, my code so far is:
places = api.geo_search(query="Canada", granularity="country")
place_id = places[0].id
public_tweets = tweepy.Cursor(api.search,
                              q="place:" + place_id + " since:2020-03-01",
                              lang="en",
                              ).items(num_tweets)
Please help me with this question as soon as possible.
Thank You
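
One approach for the keywords: OR-join them into the same query string. A minimal sketch, assuming the api, place_id and num_tweets from the question, with a hypothetical subset of the keyword list:

keywords = ["joy", "fear", "anger", "disgust"]  # hypothetical subset of the full list
query = "(" + " OR ".join(keywords) + ") place:" + place_id + " since:2020-03-01"
public_tweets = tweepy.Cursor(api.search, q=query, lang="en").items(num_tweets)

Note that the standard search API limits query length, so a list of 24-25 keywords may need to be split across a few queries.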

Extracting labels from owl ontologies when the label isn't in the ontology but can be found at the URI

Please bear with me as I am new to semantic technologies.
I am trying to use the package rdflib to extract labels from classes in ontologies. However, some ontologies don't contain the labels themselves but have the URIs of classes from other ontologies. How does one extract the labels from the URIs of the external ontologies?
The intuition behind my attempts centers on identifying classes that don't contain labels locally (if that is the right way of putting it) and then "following" their URIs to the external ontologies to extract the labels. However, the way I have implemented it does not work.
import rdflib

g = rdflib.Graph()
# I have no trouble extracting labels from this ontology:
# g.load("http://purl.obolibrary.org/obo/po.owl#")
# However, this ontology contains no labels locally:
g.load("http://www.bioassayontology.org/bao/bao_complete.owl#")

owlClass = rdflib.namespace.OWL.Class
rdfType = rdflib.namespace.RDF.type

for s in g.subjects(predicate=rdfType, object=owlClass):
    # Where a label is present...
    if g.label(s) != '':
        # Do something with the label...
        print(g.label(s))
    # This is what I have added to try to follow the URI to the external ontology.
    elif g.label(s) == '':
        g2 = rdflib.Graph()
        g2.parse(location=s)
        # Do something with the label...
        print(g.label(s))
Am I taking completely the wrong approach? All help is appreciated! Thank you.
I think you can be much more efficient than this. You are trying to do a web request, remote ontology download, and search every time you encounter a URI that doesn't have a label given in http://www.bioassayontology.org/bao/bao_complete.owl, which is most of them, and that's a very large number. So your script will take forever and thrash the web servers delivering those remote ontologies.
Looking at http://www.bioassayontology.org/bao/bao_complete.owl, I see that most of the URIs without labels there are from OBO, and perhaps a couple of other ontologies, but mostly OBO.
What you should do is download OBO once and load it with rdflib. Then if you run your script above on the joined (union) graph of http://www.bioassayontology.org/bao/bao_complete.owl and OBO, you'll have all of OBO's content at your fingertips, so g.label(s) will find a much higher proportion of labels.
Perhaps there are a couple of other source ontologies providing labels for http://www.bioassayontology.org/bao/bao_complete.owl that you may need as well, but my quick browsing sees only OBO.
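
In rdflib terms, the union can be built by parsing both files into the same Graph. A minimal sketch, assuming you have downloaded the OBO content to a local file obo.owl (hypothetical filename):

import rdflib

g = rdflib.Graph()
g.parse("http://www.bioassayontology.org/bao/bao_complete.owl")
g.parse("obo.owl")  # hypothetical local copy; parsing into the same graph forms the union

for s in g.subjects(rdflib.namespace.RDF.type, rdflib.namespace.OWL.Class):
    if g.label(s) != '':
        print(s, g.label(s))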

Get default stop word list in elastic search

I am trying to find out what the predefined stop word lists for Elasticsearch are, but I have found no documented read API for this.
So, I want to find the word lists for these predefined variables (_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_).
I found the English stop word list in the documentation, but I want to check whether it is the one my server really uses, and also check the stop word lists for the other languages.
The stop words used by the English Analyzer are the same as the ones defined in the Standard Analyzer, namely the ones you found in the documentation.
The stop word files for all other languages can be found in the Lucene repository in the analysis/common/src/resources/org/apache/lucene/analysis folder.
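
One way to check what your server actually uses, rather than what the documentation says, is to pass text through the _analyze API and see which tokens are dropped. A minimal sketch, assuming an Elasticsearch instance on localhost:9200:

import requests

# Stop words are removed from the token stream, so anything missing from
# the output was treated as a stop word by the analyzer.
resp = requests.get(
    "http://localhost:9200/_analyze",
    json={"analyzer": "english", "text": "this is not a test of the stop words"},
)
print([t["token"] for t in resp.json()["tokens"]])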

Labelling text using Notepad++ or any other tool

I have several .dat files containing information about hotel reviews, as below:
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1
*/
I want to classify the reviews into two classes, pos and neg, i.e. have two folders, pos and neg, containing several files, with reviews rated above 3 classified as positive and reviews rated below 3 classified as negative.
How can I quickly and efficiently automate this process?
You could write a Python script to read the overall score, looping over the lines using readline(), finding the "Overall" score with some string parsing, and then moving the file into the right directory. These are all very simple things to do in Python; just break the task down into steps and search for answers to those steps.
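A minimal sketch of that approach; the reviews input folder name is an assumption, while pos, neg and the 3-star threshold follow the question:

import os
import re
import shutil

os.makedirs("pos", exist_ok=True)
os.makedirs("neg", exist_ok=True)

for name in os.listdir("reviews"):  # hypothetical folder holding the .dat files
    if not name.endswith(".dat"):
        continue
    path = os.path.join("reviews", name)
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    # Pull the numeric score out of the "<Overall>4" line.
    m = re.search(r"<Overall>\s*(-?\d+)", text)
    if m is None:
        continue
    score = int(m.group(1))
    if score > 3:
        shutil.copy(path, "pos")
    elif score < 3:
        shutil.copy(path, "neg")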
Notepad++ can do replacements with regular expressions, and it allows the definition of macros. Use them to convert the file to an XML file; check out the help file.
Then you can read it with any scripting language and do what you want.
Alternatively, you could change the file to a form you can load into Excel and do the analysis there.
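For example, in the regex approach above, a find pattern of ^<([^>]+)>(.*)$ with a replacement of <field name="\1">\2</field> (hypothetical element and attribute names) turns each tagged line into a well-formed XML element; using an attribute for the name avoids invalid element names such as "No. Reader".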

Different Output for Stanford Parser Online Tool and Stanford Parser Code

I am working with the Stanford Parser to extract grammatical dependency structures from review sentences. My problem is that for some reason the output generated by my code is not the same as the one generated by the Stanford online tool. Below is an example.
Review Sentence: The picture quality of the camera is not good.
My code output (using the EnglishPCFG model and the typedDependenciesCollapsed structure):
root(ROOT-0, -LSB--1),
det(quality-4, The-2),
nn(quality-4, picture-3),
nsubj(-RSB--11, quality-4),
det(camera-7, the-6),
prep_of(quality-4, camera-7),
cop(-RSB--11, is-8),
neg(-RSB--11, not-9),
amod(-RSB--11, good-10),
ccomp(-LSB--1, -RSB--11)
Stanford Online tool Output:
det(quality-3, The-1)
nn(quality-3, picture-2)
nsubj(good-9, quality-3)
det(camera-6, the-5)
prep_of(quality-3, camera-6)
cop(good-9, is-7)
neg(good-9, not-8)
root(ROOT-0, good-9)
I am looking for the reason for this difference. What kind of model and dependency structure does the online parser use? I apologize if I am missing something obvious. Any help would be highly appreciated.
I can add a code snippet if required.
Update:
I changed my code to ignore the LSB and RSB generated by the SP tokenizer, but the grammatical structure generated is still different from that of the online tool. Here is an example:
Review Sentence: The size and picture quality of the camera is perfect.
My Code Output:
det(quality-5, The-1),
nn(quality-5, size-2),
conj_and(size-2, picture-4),
nsubj(perfect-10, quality-5),
det(camera-8, the-7),
prep_of(quality-5, camera-8),
cop(perfect-10, is-9),
root(ROOT-0, perfect-10)
Stanford Online Tool Output:
det(quality-5, The-1)
nn(quality-5, size-2)
conj_and(size-2, picture-4)
nn(quality-5, picture-4)
nsubj(perfect-10, quality-5)
det(camera-8, the-7)
prep_of(quality-5, camera-8)
cop(perfect-10, is-9)
root(ROOT-0, perfect-10)
Note the missing nn dependency in my code output. I am trying to get my head around why this is happening. Any help would be appreciated.
Update (Relevant code snippet below):
List<String> rawWords2 = Arrays.asList("-LSB-", "The", "size", "and", "picture", "quality", "of", "the", "camera", "is", "perfect", "-RSB-");
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
// Skip the bracket tokens at either end before parsing.
Tree parse = lp.apply(Sentence.toWordList(rawWords2.subList(1, rawWords2.size() - 1).toArray(new String[0])));
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
List<TypedDependency> tdl = (List<TypedDependency>) gs.typedDependenciesCollapsed();
System.out.println(tdl.toString());
Output to screen is as mentioned earlier in the post.
Another observation: I worked with the Stanford library to show me the dependency relation between quality and picture, which, as shown by the Stanford online tool, is nn, but the dependency shown by the library is dep (i.e. it can't find a more suitable dependency). Now the question is why the Stanford online tool shows an nn dependency between quality and picture whereas the Stanford library shows dep.
The major issue for whether you get that extra nn dependency or not is whether there is propagation of dependencies across coordination (size is an nn of quality and is coordinated with picture, therefore picture is made an nn of quality too). The online output shows the collapsed output with propagation, whereas you are calling the API method that doesn't include propagation. You can see either output from the command line using options, as shown at the bottom of this post. In the API, to get coordination propagation, you should instead call
gs.typedDependenciesCCprocessed()
(instead of gs.typedDependenciesCollapsed()).
Other comments:
Where are the square brackets (-LSB-) coming from? They shouldn't be introduced by the tokenizer; if they are, it's a bug. Can you say what you do for them to be generated? I suspect they may be coming from your preprocessing. Unexpected things like that in a sentence will tend to degrade the parse quality very badly.
The online parser isn't always up-to-date with the latest released version. I'm not sure if it is up-to-date right now. But I don't think that is the main issue here.
We are doing some work evolving the dependencies representation. This is deliberate, but it will create problems if you have code that depends substantively on how the dependencies were defined in an older version. We would be interested to know (perhaps by email to the parser-user list) if your accuracy is coming down for reasons other than that your code was written to expect the dependency names as they were in an earlier version.
Example of difference using the command line:
[manning]$ cat > camera.txt
The size and picture quality of the camera is perfect.
[manning]$ java edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependencies -outputFormatOptions collapsedDependencies edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz camera.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.4 sec].
Parsing file: camera.txt
Parsing [sent. 1 len. 11]: The size and picture quality of the camera is perfect .
det(quality-5, The-1)
nn(quality-5, size-2)
conj_and(size-2, picture-4)
nsubj(perfect-10, quality-5)
det(camera-8, the-7)
prep_of(quality-5, camera-8)
cop(perfect-10, is-9)
root(ROOT-0, perfect-10)
Parsed file: camera.txt [1 sentences].
Parsed 11 words in 1 sentences (6.94 wds/sec; 0.63 sents/sec).
[manning]$ java edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependencies -outputFormatOptions CCPropagatedDependencies edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz camera.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.2 sec].
Parsing file: camera.txt
Parsing [sent. 1 len. 11]: The size and picture quality of the camera is perfect .
det(quality-5, The-1)
nn(quality-5, size-2)
conj_and(size-2, picture-4)
nn(quality-5, picture-4)
nsubj(perfect-10, quality-5)
det(camera-8, the-7)
prep_of(quality-5, camera-8)
cop(perfect-10, is-9)
root(ROOT-0, perfect-10)
Parsed file: camera.txt [1 sentences].
Parsed 11 words in 1 sentences (12.85 wds/sec; 1.17 sents/sec).
According to my observations, it seems the Stanford online parser still uses an older version at its backend.
I have been using the Stanford Parser for a year now, and we used version 3.2.0 for a long time. When version 3.3.0 was released with the additional feature of sentiment analysis, I tried the newer version, but its dependencies were observed to vary slightly from version 3.2.0, and the efficiency of our product came down.
If your requirement is just to extract dependencies and not to use sentiment analysis, I would suggest using version 3.2.0.
Check the end of this page to download earlier versions of the parser.
