How to programmatically access the WordNet hierarchy? - nlp

Suppose that for any word I want to access its IS-A parent and HAS-A values. Is this possible using any API?

You can use the Python API of the Natural Language Toolkit (NLTK). In WordNet, the IS-A relation is called hypernym (opposite: hyponym) and the HAS-A relation is called meronym (opposite: holonym).
>>> from nltk.corpus import wordnet
>>> book = wordnet.synsets('book')[0]
>>> book.hypernyms()
[Synset('publication.n.01')]
>>> book.part_meronyms()
[Synset('running_head.n.01'), Synset('signature.n.05')]
I also found the NodeBox Linguistics API easier to use:
>>> import en
>>> en.noun.hypernym('book')
[['publication']]

Shameless plug:
I'm writing a Scala library for accessing WordNet. While not all of the similarity measures have been implemented yet, all word senses and relations are available. I use it for my research, so it's in active development.
import com.github.mrmechko.swordnet._
SKey("book",SPos.Noun)
//> List(SKey("publication%1:10:00::"))
SKey("publication%1:10:00::").getRelation(SRelationType.hypernym) //Hypernyms
SKey("publication%1:10:00::").getRelation(SRelationType.hyponym) //Hyponyms etc
SWordNet is available on GitHub and Sonatype

You can use the command line. The command "wn book -hypen" gives the hypernyms of the noun "book"; for meronyms, use "wn book -meron".
Also, the -o option gives the synset offset.
Here is the link for further information.

Related

How to get a sort of inverse lemmatizations for every language?

I found the spaCy lib, which allows me to apply lemmatization to words (blacks -> black in English; bianchi -> bianco in Italian). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible words starting from the canonical form.
For example, from "black" (English) get "blacks", or from "bianco" (Italian) get "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at LemmInflect. It initially uses a dictionary approach to lemmatization, and there is a .csv file in there with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
import gzip

with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f:
        parts = line.split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)
Alternatively, if you need a system to inflect words, that is exactly what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md and in the ReadTheDocs documentation.
I should note that this is for English only. I haven't seen much code for inflecting words, and you may have some difficulty finding it for other languages.

What are some of the data preparation steps or techniques one needs to follow when dealing with multi-lingual data?

I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word-embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list that I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the spaCy library (https://spacy.io/api/doc); its creator has a YouTube video in which he discusses the implementation of NLP in other languages. Below you will find code that will lemmatize and remove stopwords. As far as punctuation goes, you can always set specific characters, such as accent marks, to be ignored.
Personally, I use KNIME, which is free and open source, to do preprocessing. You will have to install NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing. The Stop Word Filter (since 2.9) and the Snowball Stemmer node can be applied to Spanish. Make sure to select the right language in the dialog of the node. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
import gensim.parsing.preprocessing
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words shorter than 4 letters
# and stopwords (him, her, them, for, there, etc.), since those are not topics,
# then append the tokens to a list
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stopword2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps! Let me know if you have any questions :)

How to get dependency information about a word?

I have already successfully parsed sentences to get dependency information using the Stanford parser (version 3.9.1, run in the Eclipse IDE) with the command "TypedDependencies". But how can I get dependency information about a single word (its parent, siblings, and children)? I have searched the Javadoc, and it seems the class SemanticGraph is used to do this job, but it needs an IndexedWord as input. How do I get an IndexedWord? Do you have any simple samples?
You can create a SemanticGraph from a List of TypedDependencies and then you can use the methods getChildren(IndexedWord iw), getParent(IndexedWord iw), and getSiblings(IndexedWord iw). (See the javadoc of SemanticGraph).
To get the IndexedWord of a specific word, you can, for example, use the SemanticGraph method getNodeByIndex(int i), which returns the IndexedWord of the i-th token in the sentence.

spaCy token.tag_ full list

The official documentation of token.tag_ in spaCy is as follows:
A fine-grained, more detailed tag that represents the word-class and some basic morphological information for the token. These tags are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent. The tagger is trained to predict these fine-grained tags, and then a mapping table is used to reduce them to the coarse-grained .pos tags.
But it doesn't list the full available tags and each tag's explanation. Where can I find it?
Finally I found it inside spaCy's source code: glossary.py. And this link explains the meaning of different tags.
Available values for token.tag_ are language specific. By "language" here, I don't mean English or Portuguese; I mean 'en_core_web_sm' or 'pt_core_news_sm'. In other words, they are language-model specific, and they are defined in the TAG_MAP, which is customizable and trainable. If you don't customize it, it will be the default TAG_MAP for that language.
As of the writing of this answer, spacy.io/models lists all of the pretrained models and their labelling schemes.
Now, for the explanations. If you are working with English or German text, you're in luck! You can use spacy.explain() or access its glossary on GitHub for the full list. If you are working with other languages, token.pos_ values are always those of Universal Dependencies and will work regardless.
To finish up, if you are working with other languages, for a full explanation of the tags you are going to have to look for them in the sources listed on the models page for your model of interest. For instance, for Portuguese I had to track down the explanations for the tags in the Portuguese UD Bosque corpus used to train the model.
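For instance, spacy.explain() maps a tag string to its human-readable description (a hedged sketch, assuming spaCy is installed):

```python
import spacy

print(spacy.explain('NN'))     # fine-grained Penn tag
print(spacy.explain('VBZ'))    # another fine-grained tag
print(spacy.explain('PROPN'))  # coarse-grained POS labels are covered too
```

It returns None for unknown labels, so it is safe to call on anything token.tag_ produces.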
Here is the list of tags:
TAG_MAP = [
".",
",",
"-LRB-",
"-RRB-",
"``",
"\"\"",
"''",
",",
"$",
"#",
"AFX",
"CC",
"CD",
"DT",
"EX",
"FW",
"HYPH",
"IN",
"JJ",
"JJR",
"JJS",
"LS",
"MD",
"NIL",
"NN",
"NNP",
"NNPS",
"NNS",
"PDT",
"POS",
"PRP",
"PRP$",
"RB",
"RBR",
"RBS",
"RP",
"SP",
"SYM",
"TO",
"UH",
"VB",
"VBD",
"VBG",
"VBN",
"VBP",
"VBZ",
"WDT",
"WP",
"WP$",
"WRB",
"ADD",
"NFP",
"GW",
"XX",
"BES",
"HVS",
"_SP",
]
The list of tags and POS values spaCy uses, for the universal scheme as well as for English and German, is at the link below:
https://spacy.io/api/annotation
You can get an explanation using
from spacy import glossary
tag_name = 'ADP'
glossary.explain(tag_name)
Version: 3.3.0
Source: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
You can also list spaCy's coarse-grained POS names with:
dir(spacy.parts_of_speech)

English word segmentation in NLP?

I am new to the NLP domain, but my current research needs some text parsing (also called keyword extraction) from URL addresses, e.g. this fake URL:
http://ads.goole.com/appid/heads
Two constraints are put on my parsing,
The first "ads" and the last "heads" should be treated as distinct, because the "ads" inside "heads" is merely a suffix rather than an advertisement.
The "appid" should be parsed into two parts, "app" and "id", both of which carry semantic meaning on the Internet.
I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word by its grammatical role, which falls short of my expectations. The Google engine shows more smartness about "appid": it suggests "app id".
I cannot inspect Google's search history to see why it suggests "app id"; presumably it is because many people have searched for those words. Are there any offline methods to perform similar parsing?
UPDATE:
Please skip regex suggestions, because even simple URLs can contain a potentially unknown number of word compositions like "appid".
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but it should get you started. They even give URLs as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
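The core of that approach can be sketched in a few lines of plain Python: a recursive dynamic program that tries every split point and scores candidates with a unigram language model. The word counts below are made up for illustration; a real segmenter would load them from a large corpus:

```python
from functools import lru_cache
from math import log

# toy unigram counts; in practice these come from a large corpus
COUNTS = {'ads': 500, 'app': 900, 'id': 800, 'heads': 400, 'head': 600}
TOTAL = sum(COUNTS.values())

def word_logprob(word):
    # unknown words get a tiny probability, penalized by length
    count = COUNTS.get(word, 0.01 / 10 ** len(word))
    return log(count / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words): the max-probability split of `text`."""
    if not text:
        return 0.0, []
    return max(
        (word_logprob(text[:i]) + segment(text[i:])[0],
         [text[:i]] + segment(text[i:])[1])
        for i in range(1, len(text) + 1)
    )

print(segment('appid')[1])   # -> ['app', 'id'] under this toy model
```

Memoization via lru_cache keeps the recursion tractable; the wordsegment module mentioned below implements the same idea with real Google n-gram counts and a bigram model.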
The Python wordsegment module can do this. It's an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.
Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words (note that in recent versions of wordsegment you must call load() first to load the data files):
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.
