Three-part related entities not specifically identified by a sentence - nlp

How do I train a Watson Knowledge Studio machine learning annotator to identify education information that is not part of a proper sentence, such as two bullet points? How do I design a type system that identifies the entities without breaking them all apart? I've considered using relation annotations, but according to the official documentation, relation types should only be annotated if the sentence specifically mentions the relation. For example, "Mary works for IBM" is an instance of the employedBy relation type (Mary employedBy IBM). However, their own videos show them annotating "Ford F-150" with a manufacturedBy relation even though the sentence doesn't specifically state the relation: "The Ford F-150 struck a light pole." (F-150 manufacturedBy Ford)
This is the kind of text I'm working with:
B.A., City University of New York, 1995
M.A., New York University, 1997
Ph.D, Columbia University, 1999
I could annotate these with degree, school, and graduationYear entities, but I'll end up getting back "1995", "1997", "1999", "B.A.", "City University of New York", "Columbia University", "M.A.", "New York University", "Ph.D"; a jumble that I can't work with because I can no longer tell which degree belongs with which school and which graduation year.

For expressions such as the two bullet points, you may be able to improve sentence detection in WKS by using the dictionary-based tokenizer:
https://console.bluemix.net/docs/services/knowledge-studio/create-project.html#wks_tokenizer
I imported your example text into WKS and checked the tokenization result: the text was split into 3 sentences.
In this case you can annotate relations among degree, school and graduation year.
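Independently of WKS, the "jumble" problem can also be handled after annotation. Here is a minimal sketch, assuming the annotator returns each mention with its entity type and character offset (that output shape and the name group_by_line are assumptions for illustration, not the WKS API). It regroups mentions into one record per line:

```python
from bisect import bisect_right

text = (
    "B.A., City University of New York, 1995\n"
    "M.A., New York University, 1997\n"
    "Ph.D, Columbia University, 1999"
)

# Hypothetical flat annotator output, in the jumbled order from the question.
flat = [
    ("graduationYear", "1995"), ("graduationYear", "1997"),
    ("graduationYear", "1999"), ("degree", "B.A."),
    ("school", "City University of New York"),
    ("school", "Columbia University"), ("degree", "M.A."),
    ("school", "New York University"), ("degree", "Ph.D"),
]
# Offsets are recovered with str.find only to keep the example self-contained;
# a real annotator would report them directly.
mentions = [(etype, text.find(surface), surface) for etype, surface in flat]


def group_by_line(text, mentions):
    """Return one {entity_type: surface} dict per line of text."""
    line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
    records = [{} for _ in line_starts]
    for etype, start, surface in mentions:
        # Index of the last line that starts at or before this mention.
        line_no = bisect_right(line_starts, start) - 1
        records[line_no][etype] = surface
    return records


records = group_by_line(text, mentions)
for record in records:
    print(record)
```

Each record then holds the degree, school and graduation year that appeared on the same line, which is the grouping that relation annotations would otherwise have to express.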

Related

Word2vec word embeddings: how to get different embeddings for different words occurring in the same context?

Suppose I have two documents:
document 1 : Where can I buy this product1 in paris.
document 2 : Where can I buy this product2 in paris.
Assume product1 and product2 are not in the word2vec vocabulary and I need to train my own word2vec model.
Since the context is the same, will word2vec consider product1 and product2 to be synonyms?
Will they have similar word embeddings?
If yes, how can I make them unrelated to each other? Should I go for a doc2vec model in this case?
The concept behind word embeddings is that the context of a word determines its meaning. If two words were always to occur in exactly the same context, they would be identical (this never happens). This works well for pretty much any word, except for names.
Names don't have a 'linguistic' meaning; their meaning is a pointer to something in the real world outside of language. Their context then depends on the use of that something in language: the name of a car brand is usually used in different contexts from coffee brands. "I'll drive my new X" works well with VW, but not so well with Lavazza. Hence they occur in different contexts and thus have a different meaning.
If the products are of the same kind (e.g. VW vs Mercedes), then their contexts will be the same. But they might also be subtly different: you wouldn't use language to boast about your new Skoda in the same way you would about your new Bentley. So the embeddings for "Skoda" and "Bentley" will be similar, but not identical. But if there are essentially no differences, the context, and thus the embedding, will be the same. Incidentally, that is why people often confuse the names of their kids when they are young -- you are using the names in pretty much exactly the same contexts, so they're sometimes tricky to keep apart.
The solution to this dilemma is to find more data where product1 and product2 are used in different contexts. In your examples they are simply presented as something you want to buy in Paris. You need to find examples where they are used, repaired, or break; anything that differentiates them. And no other context-based representation will be able to solve this for you without such data.
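To illustrate the point, here is a minimal sketch that uses raw context-word counts as a stand-in for what word2vec sees. It is not word2vec itself; the window size and the two extra "differentiating" sentences are made up for illustration:

```python
from collections import Counter
from math import sqrt


def context_counts(docs, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


docs = [
    "Where can I buy this product1 in paris",
    "Where can I buy this product2 in paris",
]
same = cosine(context_counts(docs, "product1"), context_counts(docs, "product2"))

# Hypothetical extra sentences that use the products differently.
more_docs = docs + ["my product1 broke after a week", "I repaired product2 myself"]
differ = cosine(context_counts(more_docs, "product1"), context_counts(more_docs, "product2"))

print(same, differ)
```

With only the two original documents the context vectors are identical, so the similarity is at its maximum; once the products appear in different contexts, it drops. A trained word2vec model would behave analogously, though not with these exact numbers.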

Determine a category from keywords

After a clustering process I have a bunch of words that have some similarity. I would like to categorize these words.
For example, if I have these words:
Linked Data
Domain Ontology Semantic Web
Use Case
Semantic Annotation
Maybe the right category is Semantic Web.
I know this kind of problem could be solved with NLP, but I am new to NLP and I don't know where to start. Could anyone tell me what the correct approach is, or whether this is achievable at all?
Note: I found similar problems that were solved with collocation and POS tagging. Could I apply these to this specific problem?
You could search for papers on topic labelling; it is generally considered a pretty hard problem. A paper such as the following is probably a good place to start, though. The authors have a few others that are relevant as well.
Lau, J. H., Grieser, K., Newman, D., & Baldwin, T. (2011, June). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 1536-1545). Association for Computational Linguistics.

Could I define entity type automatically?

I am trying to develop software that assigns suitable attributes to entity names depending on the entity type.
For example, if I have entities such as doctor, nurse, employee, customer, patient, lecturer, donor, user, developer, designer, driver, passenger and technician, they will all have attributes such as name, sex, date of birth, email address, home address and telephone number, because all of them are people.
As a second example, words such as university, college, hospital, hotel and supermarket can share attributes such as name, address and telephone number, because all of them could be organizations.
Are there any Natural Language Processing tools or software that could help me achieve my goal?
I need to identify the entity type as person or organization and then attach suitable attributes according to the entity type.
I have looked at Named Entity Recognition (NER) tools such as the Stanford Named Entity Recognizer, which can extract entities such as Person, Location, Organization, Money, Time, Date and Percent, but it was not really useful.
I could do it by building my own gazetteer, but I would prefer not to take that option unless I fail to do it automatically.
Any help, suggestions and ideas will be appreciated.
If I understand correctly, you are mainly interested in knowing if a given word can be mapped to a general category of Human, Organization, etc.
You should use WordNet, which provides a hierarchy covering the general English lexicon. Try it a bit in the user interface to get a feel for how it works.
WordNet encodes relations between words. One of these relations is hypernymy, a fancy word for a general-to-particular relation.
Some examples:
Vehicle is a hypernym of boat.
Vehicle is a hypernym of car.
Human is a hypernym of worker which is a hypernym of plumber.
Hyponymy is the inverse relation of hypernymy:
Boat is a hyponym of vehicle.
Car is a hyponym of vehicle.
Plumber is a hyponym of worker, itself a hyponym of human.
These relations are transitive, so in my last example plumber is also a hyponym of human. This gives you the solution to your problem: any word that has human as hypernym should be mapped to Human and have people attributes.
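To make the transitivity concrete, here is a minimal sketch that walks up a tiny hand-written hypernym table and picks the attributes of the first known category it reaches. The table and the function names are hypothetical illustration data, not taken from WordNet, and each word has a single hypernym here, whereas real WordNet synsets can have several:

```python
# Toy hypernym links: word -> its more general word (hypothetical data).
HYPERNYM = {
    "plumber": "worker",
    "nurse": "worker",
    "worker": "human",
    "hospital": "organization",
    "boat": "vehicle",
}

# Attribute lists taken from the two examples in the question.
ATTRIBUTES = {
    "human": ["name", "sex", "date of birth", "email address",
              "home address", "telephone number"],
    "organization": ["name", "address", "telephone number"],
}


def hypernym_chain(word):
    """Return all hypernyms of `word`, from most specific to most general."""
    chain, seen = [], {word}
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word in seen:  # guard against accidental cycles in the table
            break
        seen.add(word)
        chain.append(word)
    return chain


def attributes_for(word):
    """Return the attributes of the first category found among the hypernyms."""
    for candidate in [word] + hypernym_chain(word):
        if candidate in ATTRIBUTES:
            return ATTRIBUTES[candidate]
    return []


print(hypernym_chain("plumber"))
print(attributes_for("plumber"))
print(attributes_for("hospital"))
print(attributes_for("boat"))
```

Because plumber reaches human through worker, it receives the people attributes, while boat reaches no known category and gets none.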
There are libraries to access WordNet from Java and Python, as well as from many other languages. Here is the documentation for using WordNet with the NLTK Python module.
A short example to determine if a word is a hyponym of "human":
# Requires the WordNet corpus, e.g. via nltk.download('wordnet')
from nltk.corpus import wordnet as wn
# 'person.n.01' is the WordNet synset for a human being
human = wn.synset('person.n.01')
# closure() follows the hyponym links transitively
hyponyms_of_human = set(human.closure(lambda s: s.hyponyms()))
fireman = wn.synsets('fireman')
salad = wn.synsets('salad')
print(any(x in hyponyms_of_human for x in fireman)) # outputs True
print(any(x in hyponyms_of_human for x in salad)) # outputs False

Performing semantic analysis in text

I want to perform semantic analysis on some text, similar to YAGO[1]. But I have no structure in the text to identify entities and relationships. One way is to use POS tagging and then identify the subjects and predicates in the sentences[2]. But I still cannot establish what relationships exist between them.
How should I go about this?
For example:
Albert Einstein was born in 1879.
Should result in:
AlbertEinstein BORNIN 1879
subject relation predicate
My aim is to look for better approaches to find subjects, predicates and relationships in raw text.
What you are trying to do is essentially Natural Language Understanding, a subfield of Natural Language Processing, which in turn is a subfield of Computational Linguistics (often thought of as the engineering arm).
You could do semantic parsing or relation extraction; either is fine for this task. If you read through Suchanek et al. (2007), you will realise that the approach is ontology based: the relations are extracted into a predefined ontological template where the axioms are predefined (e.g. BORNIN). I personally think this is far too restrictive for general intelligence, but it works great for weak AI problems [narrow domains]. Much more interesting work has been happening over the years, such as ontology-driven information extraction, where the algorithms are trained on the ontology rather than on a corpus annotated by an ontology. Studies that come to mind are McDowell's PhD thesis and the Yildiz & Miksch (2007) paper.
That aside, and without going off topic, there is a really interesting open source, GUI-driven Python project called iepy, currently being developed by a firm called Machinalis and based on Django. It allows for rule-based and machine-learning-based information extraction. I highly recommend you check it out; I have tried and tested it myself. Also, I'm not affiliated with this company.
https://github.com/machinalis/iepy
According to the documentation:
IEPY is an open source tool for Information Extraction focused on
Relation Extraction.
To give an example of Relation Extraction, if we are trying to find a
birth date in:
"John von Neumann (December 28, 1903 – February 8, 1957) was a
Hungarian and American pure and applied mathematician, physicist,
inventor and polymath." then IEPY's task is to identify "John von
Neumann" and "December 28, 1903" as the subject and object entities of
the "was born in" relation.
It's aimed at:
users needing to perform Information Extraction on a large dataset.
scientists wanting to experiment with new IE algorithms.
The task you are trying to solve is called relation extraction, while semantic analysis has a much broader meaning (honestly, I can't say for sure what it means now).
Relation extraction is an open research problem, so I suggest reviewing the field first: for example, start from chapter 2.3 of the book Mining Text Data or the paper A Review of Relation Extraction (which is a little older, from 2007). Then continue by following the citing and cited-by links. Finally, try to implement the approach that looks most promising: for example, if you know that your data is rather formal (all sentences are short and share a similar strict structure), then try something like pattern-based approaches; and so on.
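As a rough idea of what a pattern-based approach looks like, here is a minimal sketch with a single hand-written regular expression for the "was born in" example from the question. The pattern and the function name extract_born_in are assumptions for illustration; a real system would need many patterns and would still miss paraphrases:

```python
import re

# One hand-written surface pattern: capitalised name, "was born in", 4-digit year.
BORN_IN = re.compile(
    r"(?P<subject>[A-Z][\w.]*(?: [A-Z][\w.]*)*) was born in (?P<year>\d{4})\b"
)


def extract_born_in(sentence):
    """Return (subject, 'BORNIN', year) if the pattern matches, else None."""
    match = BORN_IN.search(sentence)
    if match is None:
        return None
    subject = match.group("subject").replace(" ", "")
    return (subject, "BORNIN", match.group("year"))


triple = extract_born_in("Albert Einstein was born in 1879.")
no_match = extract_born_in("The Ford F-150 struck a light pole.")
print(triple)
print(no_match)
```

This only works when the sentences really share a strict structure, which is the condition stated above for choosing pattern-based methods.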
Stanford parser can do it :) You need to look at the dependency parser though. Have a look at the bottom of this page: http://nlp.stanford.edu/software/lex-parser.shtml:
subject: nsubj(snapped, rain),
or direct object: dobj(shut, hub))
...
Or have a look at this page (Stanford Dependencies): http://nlp.stanford.edu/software/stanford-dependencies.shtml
And to understand the annotations have a look at this: http://nlp.stanford.edu/software/dependencies_manual.pdf
And for your particular example, use the Stanford "collapsed" dependency parser, which for a given sentence will produce predicates like born_in(Einstein,1879), which is very similar to what you want.

Difference between named entity recognition and resolution?

What is the difference between named entity recognition and named entity resolution? Would appreciate a practical example.
Named entity recognition is picking up the names and classifying them in running text. E.g., given (1)
John Terry to face criminal charges over alleged racist abuse
an NE recognizer will output
[PER John Terry] to face criminal charges over alleged racist abuse
NE resolution or normalization means finding out which entity in the outside world a name refers to. E.g., in the above example, the output would be annotated with a unique identifier for the footballer John Terry, like his Wikipedia URL:
[https://en.wikipedia.org/wiki/John_Terry John Terry] to face criminal charges
over alleged racist abuse
as opposed to, e.g.
https://en.wikipedia.org/wiki/John_Terry_%28actor%29
https://en.wikipedia.org/wiki/John_Terry_%28baseball%29
or any of the other John Terrys that Wikipedia knows.
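A toy sketch of the resolution step, assuming a recognizer has already found the name. The URLs are the ones listed above; the keyword sets, the test sentences and the function name resolve are made up for illustration, and real resolvers use far richer context models than keyword overlap:

```python
import re

# Toy knowledge base: candidate entities per name, each with hypothetical
# context keywords that hint at that entity.
KNOWLEDGE_BASE = {
    "John Terry": [
        ("https://en.wikipedia.org/wiki/John_Terry",
         {"football", "footballer", "chelsea", "captain"}),
        ("https://en.wikipedia.org/wiki/John_Terry_%28actor%29",
         {"film", "actor", "television"}),
        ("https://en.wikipedia.org/wiki/John_Terry_%28baseball%29",
         {"baseball", "league", "season"}),
    ],
}


def resolve(name, sentence):
    """Pick the candidate whose keywords overlap most with the sentence."""
    candidates = KNOWLEDGE_BASE.get(name, [])
    if not candidates:
        return None
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    # On ties, max() keeps the first candidate in the list.
    best_url, _ = max(candidates, key=lambda c: len(c[1] & context))
    return best_url


footballer = resolve("John Terry", "John Terry, the Chelsea football captain, spoke today")
actor = resolve("John Terry", "John Terry starred in the film")
print(footballer)
print(actor)
```

Recognition only decides that "John Terry" is a person; resolution is the separate step that chooses among the candidate identifiers.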

Resources