How to determine the relationship between two entities when there is more than one relation, when creating distant-supervision training data?

I understand the concept of distant supervision. As I understand it, the process for creating training data is:
1. Extract named entities from sentences.
2. From each sentence, take two entities, "e1" and "e2".
3. Look these two entities up in a knowledge base (Freebase etc.) to find the relationship between them.
I got confused at this step: what if there is more than one relation between these two entities (e1 and e2)? If so, which relation should I select?

It depends on the model you're training.
Are you learning a model for one kind of relationship and doing bootstrapping? Then pay attention only to that relationship and drop the others from your DB.
Are you trying to learn a bunch of relationships? Then use the presence or absence of each relation as a feature in your model. This is how Universal Schemas work.
(Figure: the Universal Schema feature matrix, with one row per entity pair and one column per relation or textual pattern.)
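To make that multi-label view concrete, here is a minimal sketch (the KB triples are made-up illustrative data, not real Freebase content): each (e1, e2) pair becomes one training row, and every relation the KB lists for that pair becomes a positive label, so you never have to pick a single relation.

# Build one multi-hot label vector per entity pair from KB triples.
from collections import defaultdict

kb_triples = [
    ("Barack_Obama", "born_in", "Honolulu"),
    ("Barack_Obama", "lived_in", "Honolulu"),  # a second relation for the same pair
    ("Honolulu", "located_in", "Hawaii"),
]

relations = sorted({rel for _, rel, _ in kb_triples})

labels = defaultdict(set)
for e1, rel, e2 in kb_triples:
    labels[(e1, e2)].add(rel)

# Each pair keeps all of its relations; no single relation is "selected".
for pair, rels in labels.items():
    row = [1 if rel in rels else 0 for rel in relations]
    print(pair, dict(zip(relations, row)))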

Related

Markup for a relation extraction task with an LSTM model

I'm trying to figure out how to properly mark up data for a relation extraction task (I'm going to use an LSTM model).
So far, I have figured out that entities are highlighted using the
<e1>, </e1>, <e2> and </e2>
tags, and that the class of the relationship is indicated in a separate column.
But what should I do when one entity in a sentence has relations, of the same type or of different types, to two other entities at once? Or when there are four entities in one sentence and two relations are defined?
I see two options. The first is to introduce new tags
<e3>, </e3>, <e4> and </e4>
and do multi-class classification, but I haven't seen that done anywhere. The second option is to make a copy of the sentence and split the relations across the copies (see the sketch below).
Can you please tell me how to do this markup?
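For what it's worth, here is a minimal sketch of that second option: duplicate the sentence once per relation instance, so each copy carries exactly one <e1>/<e2> pair and one relation label. The sentence, token spans, and relation names are all made up for illustration.

sentence = "Alice works for Acme and lives in Boston".split()

# (head_span, tail_span, relation) triples over token indices.
relation_instances = [
    ((0, 1), (3, 4), "works_for"),   # Alice -> Acme
    ((0, 1), (7, 8), "lives_in"),    # Alice -> Boston
]

def tag(tokens, span, open_tag, close_tag):
    start, end = span
    return tokens[:start] + [open_tag] + tokens[start:end] + [close_tag] + tokens[end:]

for head, tail, rel in relation_instances:
    # Tag the later span first so the earlier span's indices stay valid.
    first, second = sorted([("e1", head), ("e2", tail)], key=lambda x: x[1][0])
    tokens = list(sentence)
    tokens = tag(tokens, second[1], f"<{second[0]}>", f"</{second[0]}>")
    tokens = tag(tokens, first[1], f"<{first[0]}>", f"</{first[0]}>")
    print(" ".join(tokens), rel, sep="\t")

Each copy then goes into the training data as its own example, with the relation class in the separate column as before.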

Auto NLP Entity Extraction

I have trained a model using AutoML Natural Language entity extraction. So far I have trained the model to extract a single keyword under each entity from text; however, I want to tag a single keyword under two entities to create a hierarchy. Example: currently the keyword "Lazada" is tagged under "Lazada_Ecommerce", but I want to tag this single keyword under two entities, a sub-entity "Lazada" and a main entity "Ecommerce". It would be a great help if someone could suggest whether this is possible with the Google AutoML entity extraction model, and how.
Google AutoML entity extraction does not support entity hierarchies. The result of a prediction includes an array of entities, one for each detected entity in the text.
https://cloud.google.com/automl/docs/reference/rpc/google.cloud.automl.v1#google.cloud.automl.v1.PredictResponse
includes property 'payload' which is an array of:
https://cloud.google.com/automl/docs/reference/rpc/google.cloud.automl.v1#google.cloud.automl.v1.AnnotationPayload
Note: If a "sub-entity" can only have one "main entity", then you could manage entity hierarchies external to the model, i.e., train the model to predict "Lazada" and other sub-entities, and externally identify that "Lazada" and others belong to a main "Ecommerce" category. However, if your entity model could have a "Lazada" entity underneath multiple main entities then your current solution would be appropriate (e.g., "Lazada_Ecommerce", "Lazada_SomeOtherMainEntity", etc.).
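If that one-main-entity-per-sub-entity assumption holds, the external hierarchy can be as simple as a lookup table applied to the predicted labels. A sketch, where the mapping and the predicted labels are illustrative and not part of the AutoML API:

# Hypothetical external mapping from sub-entity label to main entity.
SUB_TO_MAIN = {
    "Lazada": "Ecommerce",
    "Shopee": "Ecommerce",
}

def with_hierarchy(predicted_labels):
    """Attach the main entity to each predicted sub-entity label."""
    return [
        {"sub_entity": label, "main_entity": SUB_TO_MAIN.get(label)}
        for label in predicted_labels
    ]

print(with_hierarchy(["Lazada"]))
# [{'sub_entity': 'Lazada', 'main_entity': 'Ecommerce'}]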

Event extraction vs N-ary relation extraction

What is the difference between an event and an n-ary relation?
I am trying to differentiate between the two and eventually focus on the task of extraction. For example, from the given sentence:
Peter completed B.Sc. in physics from Boston University.
Extracted n-ary relation:
r(Peter, B.Sc., physics, Boston University)
(Assuming that the entities are already labeled)
For the problem of event extraction, we have datasets like ACE 2005 event extraction corpus. However, I haven't come across any corpus for n-ary relation extraction. Is anyone aware of any such corpus which might facilitate n-ary relation extraction?
Once you get your entities from your text using libraries like OpenNLP and Stanford NLP, you need to add them to your vocabulary, as has been done here. The plan there was to produce a suggested vocabulary for describing that a class represents an n-ary relation, and for defining mappings between n-ary relations in RDF and OWL and other languages.
They've considered many attributes for describing a relation, and they've used blank nodes in RDF to represent instances of a relation:
:Christine
    a :Person ;
    :has_diagnosis _:Diagnosis_Relation_1 .

_:Diagnosis_Relation_1
    a :Diagnosis_Relation ;
    :diagnosis_probability :HIGH ;
    :diagnosis_value :Breast_Tumor_Christine .
Similarly, you can do:
:Peter
    a :Person ;
    :has_degree _:Degree_1 .

_:Degree_1
    a :Degree ;
    :degree_name :B.Sc ;
    :degree_value :Physics ;
    :degree_place :Boston_University .
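If you want to sanity-check the pattern programmatically, here is a quick sketch with rdflib; the example.org namespace is made up, and the local name is written :BSc to avoid the dot in :B.Sc.

from rdflib import Graph

ttl = """
@prefix : <http://example.org/> .
:Peter a :Person ; :has_degree _:Degree_1 .
_:Degree_1
    a :Degree ;
    :degree_name :BSc ;
    :degree_value :Physics ;
    :degree_place :Boston_University .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# Follow has_degree to the blank node, then read off its attributes.
query = """
PREFIX : <http://example.org/>
SELECT ?p ?o WHERE { :Peter :has_degree ?d . ?d ?p ?o }
"""
for p, o in g.query(query):
    print(p, o)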
Follow this for some insight.
Hope this helps!

Topic modeling on MALLET

I'm currently getting into topic modeling (as a beginner).
I was thinking of using MALLET as a tool to help me understand this area. My problem is that I'd like to train a model on, say, 1000 documents, and then use that model on a new single document to generate its potential topics.
But as far as I can tell from the MALLET tutorial, it always says the tool or API is useful on a corpus of texts, i.e., it is used to find topics across several documents.
Is there a way it can find the topics of a single document based on the model (or the inference parameters) it learned/constructed from the 1000 documents?
Is there any other tool that can do this?
Thanks a lot!
You can refer to the example code src/cc/mallet/examples/TopicModel.java, which shows how to train a topic model and infer topics for a new instance; the trained model's getInferencer() method returns a TopicInferencer that can assign topic proportions to an unseen document without retraining.
Also, when you run plain LDA on a directory, the model assigns topic proportions to each of the documents in that directory based on a model already trained on part of your corpus. So topic proportions are assigned with a certain probability to each document, ranked by the probability of each topic appearing in that specific document.
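If you're open to other tools, gensim supports exactly this workflow: train LDA once on the 1000 documents, keep the model, and infer topic proportions for a single new document without retraining. A sketch with toy data:

from gensim import corpora
from gensim.models import LdaModel

train_docs = [
    ["topic", "modeling", "lda", "corpus"],
    ["neural", "network", "training", "model"],
    # ... in practice, ~1000 tokenized documents
]

dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5)

# Infer topic proportions for one unseen document.
new_doc = ["lda", "topic", "inference"]
new_bow = dictionary.doc2bow(new_doc)
print(lda.get_document_topics(new_bow))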

Incrementally Trainable Entity Recognition Classifier

I'm doing some semantic-web/nlp research, and I have a set of sparse records, containing a mix of numeric and non-numeric data, representing entities labeled with various features extracted from simple English sentences.
e.g.
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry
...
A single entity may appear more than once (but with a different UID each time), and may have overlapping features with its other occurrences. A second data set represents which of the above UIDs are definitely the same.
e.g.
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
What algorithm(s) would I use to incrementally train a classifier that could take a set of features and instantly recommend the N most similar UIDs, along with the probability that those UIDs actually represent the same entity? Optionally, I'd also like a recommendation of missing features to populate, so that I could re-classify and get more certain matches.
I researched traditional approximate nearest neighbor algorithms, such as FLANN and ANN, and I don't think they would be appropriate, since they're not trainable (in a supervised-learning sense), nor are they typically designed for sparse, non-numeric input.
As a very naive first attempt, I was thinking of using a naive Bayes classifier, converting each SameAs relation into a set of training samples. So, for each entity A with B sameas relations, I would iterate over each of them and train the classifier like:
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        # Train in both directions so the left/right pairing is symmetric.
        classifier.train(
            ['left_' + f for f in entity_features] +
            ['right_' + f for f in other_entity_features],
            cls=entity)
        classifier.train(
            ['left_' + f for f in other_entity_features] +
            ['right_' + f for f in entity_features],
            cls=other_entity)
And then use it like:
>>> print(classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty', isa='cat', eats='mice'), n=7))
[(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print(classifier.findSameAs(dict(isa='cat', eats='mice'), n=7))
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print(classifier.findMissingFeatures(dict(isa='cat', eats='mice'), n=4))
['obj_called', 'has_fur', 'has_claws', 'lives_at_zoo']
How viable is this approach? The initial batch training would be horribly slow, at least O(N^2), but incremental training support would allow updates to happen more quickly.
What are better approaches?
I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief Google search turned up the paper "Active Semi-Supervision for Pairwise Constrained Clustering", which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are, or are not, in the same cluster. You can derive this easily from your sameas data, assuming that, for example, uids 87w39423 and 4535k3l535 are definitely distinct things.
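Deriving those pairwise constraints from your sameas data could look like this sketch (the uids come from the question; the "distinct rows are distinct things" assumption is the one noted above):

# Must-link: uids listed together in a sameas row refer to the same thing.
# Cannot-link: uids from different rows, assuming rows denote distinct things.
sameas = {
    "87w39423": ["234k2j", "234l24jlsd", "dsdf9887s"],
    "4535k3l535": ["09d8fgdg0d9", "l2jk34kl", "sd9f08sf"],
}

rows = [[uid] + others for uid, others in sameas.items()]

must_link = [(a, b)
             for row in rows
             for i, a in enumerate(row)
             for b in row[i + 1:]]

cannot_link = [(a, b)
               for i, row in enumerate(rows)
               for other in rows[i + 1:]
               for a in row
               for b in other]

print(len(must_link), "must-link,", len(cannot_link), "cannot-link")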
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here, for example you could use a simple Hamming distance on the features, but the choice of metric function here is a little bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest neighbour algorithms.
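As one concrete (and admittedly arbitrary) choice, here is a sketch of a Jaccard distance over the sparse feature sets, treating each key=value pair as a single feature:

def jaccard_distance(features_a, features_b):
    """1 - |intersection| / |union| over two feature sets."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

e1 = {"speaker=876", "session=43242", "obj_called=the monkey", "ate=the banana"}
e2 = {"speaker=876", "session=43243", "obj_called=the monkey", "ate=the banana"}
print(jaccard_distance(e1, e2))  # 0.4: the records differ only in session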
You can compute confidence scores using the distance from the cluster centres. If you want an actual probability of membership, then you would want a probabilistic clustering model, such as a Gaussian mixture model. There's quite a lot of software for Gaussian mixture modelling, though I don't know of any that is semi-supervised or incremental.
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
You may want to take a look at this method:
"Large Scale Online Learning of Image Similarity Through Ranking" Gal Chechik, Varun Sharma, Uri Shalit and Samy Bengio, Journal of Machine Learning Research (2010).
[PDF] [Project homepage]
More thoughts:
What do you mean by 'entity'? Is the entity the thing referred to by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. is 'John' similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (the task) behind the mapping?
