Extract relationship concepts from sentences - nlp

Is there a current model or how could I train a model that takes a sentence involving two subjects like:
[Meiosis] is a type of [cell division]...
and decides if one is the child or parent concept of the other? In this case, cell division is the parent of meiosis.

Are the subjects already identified, i.e., do you know beforehand for each sentence which words or sequences of words represent the subjects? If you do, I think what you are looking for is relationship extraction.
Unsupervised approach
A simple unsupervised approach is to look for patterns using part-of-speech tags, e.g.:
First you tokenize and get the PoS-tags for each sentence:
sentence = "Meiosis is a type of cell division."
tokens = nltk.word_tokenize("Meiosis is a type of cell division.")
tokens
['Meiosis', 'is', 'a', 'type', 'of', 'cell', 'division', '.']
token_pos = nltk.pos_tag(tokens)
token_pos
[('Meiosis', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('type', 'NN'), ('of', 'IN'),
('cell', 'NN'), ('division', 'NN'), ('.', '.')]
Then you build a parser for a specific PoS-tag pattern, i.e., a pattern that mediates relationships between two subjects/entities/nouns:
verb = "<VB|VBD|VBG|VBN|VBP|VBZ>*<RB|RBR|RBS>*"
word = "<NN|NNS|NNP|NNPS|JJ|JJR|JJS|RB|WP>"
preposition = "<IN>"
rel_pattern = "({}|{}{}|{}{}*{})+ ".format(verb, verb, preposition, verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
reverb_pattern = nltk.RegexpParser(grammar_long)
NOTE: This pattern is based on this paper: http://www.aclweb.org/anthology/D11-1142
You can then apply the parser to all the tokens/PoS-tags except the ones which are part of the subjects/entities:
reverb_pattern.parse(token_pos[1:5])
Tree('S', [Tree('REL_PHRASE', [('is', 'VBZ')]), ('a', 'DT'), ('type', 'NN'), ('of', 'IN')])
If the parser outputs a REL_PHRASE, then there is a relationship between the two subjects. You then need to analyse all these patterns and decide which ones represent a parent-of relationship. One way to achieve that is by clustering them, for instance.
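As a rough illustration of that clustering step (a minimal sketch, assuming scikit-learn and a hypothetical list of relation phrases collected by the parser over many sentences):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical relation phrases extracted by the REL_PHRASE parser
rel_phrases = ["is a type of", "is a kind of", "is part of", "belongs to", "causes"]

# represent each phrase as a TF-IDF vector over its words
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(rel_phrases)

# group similar phrases; the clusters can then be inspected and labelled
# manually (e.g., one cluster may correspond to "parent-of")
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(list(zip(rel_phrases, kmeans.labels_)))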
Supervised approach
If your sentences are already tagged with subjects/entities and with the type of relationship, i.e., a supervised scenario, then you can build a model where the features are the words between the two subjects/entities and the type of relationship is the label:
sent: "[Meiosis] is a type of [cell division]."
label: parent of
You can build a vector representation of is a type of and train a classifier to predict the label parent of. You will need many examples for this; how many also depends on how many different classes/labels you have.
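A minimal sketch of that supervised setup, assuming scikit-learn and a small hypothetical set of (inter-entity words, label) training pairs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical training data: the words between the two entities, and the relation label
texts  = ["is a type of", "is a kind of", "is part of", "is found in", "causes"]
labels = ["parent-of", "parent-of", "part-of", "located-in", "causes"]

# TF-IDF vectors of the inter-entity words, fed into a simple linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["is a type of"]))  # expected: ['parent-of']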

Related

How to know which contextual embedding to use at test time

Models like BERT generate contextual embeddings for words with different contextual meanings, like 'bank', 'left'.
I don't understand which contextual embedding the model chooses to use at test time. Given a test sentence for classification, when we load the pre-trained BERT, how do we initialize the word (token) embedding so that the right contextual embedding is used over the other embeddings of the same word?
More specifically, there is a convert_to_id() function which converts a word to an id. How does that one id represent the correct contextual embedding for the input sentence at test time? Thank you.
I searched all over online but only found explanations of the difference between static vs. contextual embeddings; the high-level concept is easy to get, but how this is really achieved is unclear. I also searched for some code examples, but the convert_to_id() function confused me further, as I noted in my question.
TL;DR There's only one embedding for the word "left." There's also no way to know which meaning the word has if the sequence is only one word. BERT uses the representation of its start-of-sequence token (i.e., [CLS]) to represent each sequence, and that representation will differ depending on what context the word "left" is used in.
Given your example of text classification, the input sentence is first tokenized using the WordPiece tokenizer, and the [CLS] token's representation is fed to a feedforward layer for classification.
You can't really debate context when given single words, so I'll use two different sentences:
I left my house this morning.
You'll see my house on your left.
The steps to performing text classification typically are:
Tokenize your input text and receive the necessary input_ids and attention_masks.
Feed this tokenized input into your model and receive the outputs.
Feed the [CLS] token's representation to a classifier layer (typically a feedforward network).
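As a minimal sketch of those three steps (assuming PyTorch and a hypothetical, untrained two-label linear head, purely for illustration):

import torch
from transformers import AutoModel, AutoTokenizer

bert = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# hypothetical classification head: pooled [CLS] output -> 2 labels
classifier = torch.nn.Linear(bert.config.hidden_size, 2)

inputs = tokenizer("I left my house this morning.", return_tensors='pt')  # step 1
outputs = bert(**inputs)                                                  # step 2
logits = classifier(outputs.pooler_output)                                # step 3 (head is untrained here)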
The two sentences are tokenized to (using bert-base-uncased):
['[CLS]', 'i', 'left', 'my', 'house', 'this', 'morning', '.', '[SEP]']
['[CLS]', 'you', "'", 'll', 'see', 'my', 'house', 'on', 'your', 'left', '.', '[SEP]']
The [CLS] token's representation for each sentence will be different because the sentences have different words (i.e., contexts). The result is therefore different.
>>> from transformers import AutoModel, AutoTokenizer
>>> bert = AutoModel.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> sentence1 = "I left my house this morning."
>>> sentence2 = "You'll see my house on your left."
>>> sentence1_inputs = tokenizer(sentence1, return_tensors='pt')
>>> sentence2_inputs = tokenizer(sentence2, return_tensors='pt')
>>> sentence1_outputs = bert(**sentence1_inputs)
>>> sentence2_outputs = bert(**sentence2_inputs)
>>> cls1 = sentence1_outputs[1]  # index 1 is the pooled [CLS] representation
>>> cls2 = sentence2_outputs[1]
>>> print(cls1[0][:5])
tensor([-0.8832, -0.3484, -0.8044, 0.6536, 0.6409], grad_fn=<SliceBackward0>)
>>> print(cls2[0][:5])
tensor([-0.8791, -0.4069, -0.8994, 0.7371, 0.7010], grad_fn=<SliceBackward0>)
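If you want to look at the contextual embedding of the word "left" itself rather than the [CLS] vector, you can continue the session above: find the position of 'left' in each tokenized sentence and compare the hidden states at those positions. A minimal sketch, reusing the tokenizer, inputs, and outputs defined above:

import torch

# position of the token 'left' in each tokenized sentence
tokens1 = tokenizer.convert_ids_to_tokens(sentence1_inputs['input_ids'][0].tolist())
tokens2 = tokenizer.convert_ids_to_tokens(sentence2_inputs['input_ids'][0].tolist())
idx1, idx2 = tokens1.index('left'), tokens2.index('left')

# last_hidden_state holds one contextual vector per token; the same vocabulary id
# for 'left' produces different vectors because the surrounding words differ
left1 = sentence1_outputs.last_hidden_state[0, idx1]
left2 = sentence2_outputs.last_hidden_state[0, idx2]
print(torch.nn.functional.cosine_similarity(left1, left2, dim=0))

The similarity is typically well below 1, which is the point: the static vocabulary id is the same, but the contextual representation is computed fresh from the whole sentence.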

Identifying phrases which contrast two corpora

I would like to identify compound phrases in one corpus (e.g. (w_1, w_2) in Corpus 1) which not only appear significantly more often than their constituents (e.g. (w_1),(w_2) in Corpus 1) within the corpus but also more than they do in a second corpus (e.g. (w_1, w_2) in Corpus 2). Consider the following informal example. I have the two corpora each consisting of a set of documents:
[['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy'], ...]
[['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic'], ...].
In this case, I would like new_york to be detected as a compound phrase. However, when corpus 2 is replaced by
[['i', 'go', 'to', 'new', 'york'], ['i', 'like', 'new', 'york'], ...],
I would like new_york to be relatively disregarded.
I could just use a ratio of n-gram scores between corresponding phrases in the two corpora, but I don't see how to scale that to general n. Normally, phrase detection for n-grams with n > 2 is done by recursing on n and gradually inserting compound phrases into the documents by thresholding a score function. This ensures that at step n, if you want to score the n-gram (w_1, ..., w_n), you can always normalize by the constituent m-grams for m < n. But with a different corpus, these are not guaranteed to appear.
A reference to the literature or a relevant hack will be appreciated.
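To make the bigram-level version of that ratio idea concrete, here is a rough count-based sketch (the score, min-count filter, and smoothing are placeholders, and it does not address the recursion issue for n > 2):

from collections import Counter
from itertools import chain

def bigram_scores(corpus, min_count=2):
    """Score (w1, w2) by count(w1, w2) / (count(w1) * count(w2)) within one corpus."""
    unigrams = Counter(chain.from_iterable(corpus))
    bigrams = Counter(b for doc in corpus for b in zip(doc, doc[1:]))
    return {b: c / (unigrams[b[0]] * unigrams[b[1]])
            for b, c in bigrams.items() if c >= min_count}

corpus1 = [['i', 'live', 'in', 'new', 'york'], ['new', 'york', 'is', 'busy']]
corpus2 = [['los', 'angeles', 'is', 'sunny'], ['los', 'angeles', 'has', 'bad', 'traffic']]

scores1, scores2 = bigram_scores(corpus1), bigram_scores(corpus2)

# contrast: within-corpus score divided by the other corpus's score,
# smoothed so bigrams absent from corpus 2 are still defined
eps = 1e-6
contrast = {b: s / (scores2.get(b, 0) + eps) for b, s in scores1.items()}
print(max(contrast, key=contrast.get))  # ('new', 'york') on this toy data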

What is the input format of fastText and why doesn't my model give me a meaningful similarity output?

My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you', 'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show', 'Bird', 'Box', 'from', 'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Consequently, I want to find the similarity between say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Obviously, it must have "rule" as the top similarity since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but it doesn't work either.
Was my input format correct? I've seen lots of documents using .cor or .bin formats, and I don't know what those are.
Thanks for any reply!
model.wv.most_similar('rule') asks that model's set of word vectors (.wv) to return the words most similar to 'rule'. That is, you've provided neither a document (multiple words) as a query, nor is there any way for the FastText model to return either a document itself or the name of any document. It returns only words, as it has done.
While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those results don't look very much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly contrasting, realistic uses of the relevant words.
How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)
In any real FastText/Word2Vec/etc. model, trained with adequate data and parameters, no single sentence (like your first sentence) can tell you much about what the results "should" be. Those only emerge from the full, rich dataset.
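If the underlying goal is word-to-document similarity, one common workaround (not something FastText does for you directly) is to average the document's word vectors and compare that mean vector with the word's vector. A minimal sketch, assuming gensim 4.x and a sentences_cleaned corpus far larger than the excerpt shown:

import numpy as np
from gensim.models.fasttext import FastText

# assumes sentences_cleaned is a large list of tokenized sentences
model = FastText(sentences_cleaned, vector_size=100, min_count=2, epochs=10)

def word_doc_similarity(word, doc_tokens, wv):
    """Cosine similarity between a word vector and the mean vector of a document."""
    doc_vec = np.mean([wv[t] for t in doc_tokens], axis=0)
    word_vec = wv[word]
    return float(np.dot(word_vec, doc_vec) /
                 (np.linalg.norm(word_vec) * np.linalg.norm(doc_vec)))

doc = ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household']
print(word_doc_similarity('rule', doc, model.wv))
print(word_doc_similarity('new', doc, model.wv))  # works even for out-of-vocabulary words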

Ignoring filler words in part of speech pattern NLTK

I have a rule-based text matching program that I've written, which operates based on rules created using specific POS patterns. So, for example, one rule is:
pattern = [('PRP', "i'll"), ('VB', ('jump', 'play', 'bite', 'destroy'))]
In this case when analyzing my input text this will only return results in a string that fit grammatically to this specific pattern so:
I'll jump
I'll play
I'll bite
I'll destroy
My question involves extracting the same meaning when people use the same text but add a superlative or any other type of word that doesn't change the context. Right now it only does exact matches, but it won't catch phrases like the first string in this example:
I'll 'freaking' jump
'Dammit' I'll play
I'll play 'dammit'
The word doesn't have to be specific; it's just about making sure the program can still identify the same pattern with the addition of a non-contextual superlative or any other type of word with the same purpose. This is the flagger I've written, along with an example string:
string_list = [('Its', 'PRP$'), ('annoying', 'NN'), ('when', 'WRB'), ('a', 'DT'), ('kid', 'NN'), ('keeps', 'VBZ'), ('asking', 'VBG'), ('you', 'PRP'), ('to', 'TO'), ('play', 'VB'), ('but', 'CC'), ("I'll", 'NNP'), ('bloody', 'VBP'), ('play', 'VBP'), ('so', 'RB'), ('it', 'PRP'), ('doesnt', 'VBZ'), ('cry', 'NN')]
def find_match_pattern(string_list, pattern_dict):
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()  # does a sentiment analysis on the output string

    filt_ = ['Filter phrases']  # not the patterns, just phrases I know I don't want
    filt_tup = [x.lower() for x in filt_]

    for rule, pattern in pattern_dict.items():  # pattern_dict is an OrderedDict, courtesy of collections
        num_matched = 0
        for idx, tuple in enumerate(string_list):  # string_list is the input string that has been POS tagged
            matched = False
            if tuple[1] == list(pattern.keys())[num_matched]:
                if tuple[0] in pattern[tuple[1]]:
                    num_matched += 1
                else:
                    num_matched = 0
            else:
                num_matched = 0
            if num_matched == len(pattern):  # if the number of matching words equals the length of the pattern do this
                matched_string = ' '.join([i[0] for i in string_list])  # joined for the sentiment analysis score
                vs = analyzer.polarity_scores(matched_string)
                sentiment = vs['compound']
                if matched_string in filt_tup:
                    break
                elif (matched_string not in filt_tup) or (sentiment < -0.8):
                    matched = True
                    print(matched, '\n', matched_string, '\n', sentiment)
                    return (matched, sentiment, matched_string, rule)
I know it's a really abstract (or down-the-rabbit-hole) question, so it may be more of a discussion, but if anyone has experience with this it would be awesome to see what you recommend.
Your question can be answered using spaCy's dependency tagger. spaCy provides a Matcher with many optional and switchable token attributes.
In the case below, instead of relying only on specific words or parts of speech, the focus is on certain syntactic functions, such as the nominal subject and the auxiliary verbs.
Here's a quick example:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab, validate=True)

pattern = [{'DEP': 'nsubj', 'OP': '+'},  # OP + means at least one nominal subject - usually a pronoun
           {'DEP': 'aux', 'OP': '?'},    # OP ? means it can have one or zero auxiliary verbs
           {'POS': 'ADV', 'OP': '?'},    # now it looks for an adverb; also not required (OP ?)
           {'POS': 'VERB'}]              # finally, generalized with a verb; you can make one pattern per verb or write a loop to do it

matcher.add("NVAV", None, pattern)

phrases = ["I'll really jump.",
           "Okay, I'll play.",
           "Dammit I'll play",
           "I'll play dammit",
           "He constantly plays it",
           "She usually works there"]

for phrase in phrases:
    doc = nlp(phrase)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print('Matched:', span.text)
Matched: I'll really jump
Matched: I'll play
Matched: I'll play
Matched: I'll play
Matched: He constantly plays
Matched: She usually works
You can always test your patterns in the live example: Spacy Live Example
You can extend it as you wish. Read more here: https://spacy.io/usage/rule-based-matching

CoreNLP API for N-grams?

Does CoreNLP have an API for getting unigrams, bigrams, trigrams, etc.?
For example, I have a string "I have the best car ". I would love to get:
I
I have
the
the best
car
based on the string I am passing.
If you are coding in Java, check out getNgrams* functions in the StringUtils class in CoreNLP.
You can also use CollectionUtils.getNgrams (which is what StringUtils class uses too)
You can use CoreNLP to tokenize, but for grabbing n-grams, do it natively in whatever language you're working in. If, say, you're piping this into Python, you can use list slicing and some list comprehensions to split them up:
>>> tokens
['I', 'have', 'the', 'best', 'car']
>>> unigrams = [tokens[i:i+1] for i,w in enumerate(tokens) if i+1 <= len(tokens)]
>>> bigrams = [tokens[i:i+2] for i,w in enumerate(tokens) if i+2 <= len(tokens)]
>>> trigrams = [tokens[i:i+3] for i,w in enumerate(tokens) if i+3 <= len(tokens)]
>>> unigrams
[['I'], ['have'], ['the'], ['best'], ['car']]
>>> bigrams
[['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]
>>> trigrams
[['I', 'have', 'the'], ['have', 'the', 'best'], ['the', 'best', 'car']]
CoreNLP is great for doing NLP heavy lifting, like dependencies, coref, POS tagging, etc. It seems like overkill if you just want to tokenize though, like bringing a fire truck to a water gun fight. Using something like TreeTagger might equally fulfill your needs for tokenization.
