I want to perform semantic analysis on some text similar to YAGO[1]. But I have no structure in the text to identify entities and relationships. One way is I use POS tagging and then identify subject and predicates in the sentences[2]. But still I cannot establish what relationships exist between them.
How should I go about this?
For example:
Albert Einstein was born in 1879.
Should result in:
AlbertEinstein BORNIN 1879
subject relation predicate
My aim to look for better approaches to find subjects, predicates and relationships in raw text.
What you are trying to do is essentially Natural Language Understanding, a subfield of Natural Language Processing, which again is a subfield of Computational Linguistics ~ often thought as the engineering arm.
You could do semantic parsing or relation extraction. Either are fine for this task. I decided to read through Suchanek et al (2007) and you will realise that it is ontology based, where the relations are extracted into a predefined ontological template where aixoms are predifed (e.g. BORNIN). I personally think this is far to restrictive for general intelligence but works great with weak ai problems [narrow domains]. Much more interesting work has been happening over the years such as ontology driven information extraction, where the algorithms are trained on the ontology rather than having a corpus annotated by an ontology. One PhD study that comes to mind is McDowell Thesis and the Yildiz & Miksch (2007) paper.
Regardless and without going off topic, there is a really interesting open source Python GUI driven project called iepy at the moment being developed by a firm called Machinalis which is based on django. It allows for rule based and machine learning based information extraction. I highly recommend you check it out -> Tried and tested by myself. Also, I'm not affiliated with this company.
https://github.com/machinalis/iepy
According to the documentation:
IEPY is an open source tool for Information Extraction focused on
Relation Extraction.
To give an example of Relation Extraction, if we are trying to find a
birth date in:
"John von Neumann (December 28, 1903 – February 8, 1957) was a
Hungarian and American pure and applied mathematician, physicist,
inventor and polymath." then IEPY's task is to identify "John von
Neumann" and "December 28, 1903" as the subject and object entities of
the "was born in" relation.
It's aimed at: users needing to perform Information Extraction on a
large dataset. scientists wanting to experiment with new IE
algorithms.
The task you attempt to solve is called relation extraction, while semantic analysis has much broader meaning (honestly, I can't say for sure what does it mean now).
Relation extraction is an open research problem, so I suggest to review the field - for example, start from the chapter 2.3 of Mining text data book or A Review of Relation Extraction paper (which is a little older - 2007). Then continue research by following citing or cited-by links; finally, try to implement approach that looks most promising: for example, if you know that your data is rather formal (all sentences are short and share similar strict structure), then try something like pattern-based approaches; and so on.
Stanford parser can do it :) You need to look at the dependency parser though. Have a look at the bottom of this page: http://nlp.stanford.edu/software/lex-parser.shtml:
subject: nsubj(snapped, rain),
or direct object: dobj(shut, hub))
...
Or have a look at this page (Stanford Dependencies): http://nlp.stanford.edu/software/stanford-dependencies.shtml
And to understand the annotations have a look at this: http://nlp.stanford.edu/software/dependencies_manual.pdf
And for your particular example, use Stanford "collapsed" dependency parser which for a given sentence will produce predicates like born_in(Einstein,1879), which is very similar to what you want.
Related
I have a corpus of a few 100-thousand legal documents (mostly from the European Union) – laws, commentary, court documents etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc). But on the single-document level, I wish I had better tools to allow fast comprehension. I am open for ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I played around with google's new Parsey McParface, and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense out of documents you need to perform some sort of semantic analysis. You have two main possibilities with their exemples:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents then you may apply some post-processing to determine which information is relevant. Finding which information is relevant is task related and I don't think you can find a generic tool that extracts "the relevant" information.
I see you have an interesting usecase. You've also mentioned the presence of a corpus(which a really good plus). Let me relate a solution that I had sketched for extracting the crux from research papers.
To make sense out of documents, you need triggers to tell(or train) the computer to look for these "triggers". You can approach this using a supervised learning algorithm with a simple implementation of a text classification problem at the most basic level. But this would need prior work, help from domain experts initially for discerning "triggers" from the textual data. There are tools to extract gists of sentences - for example, take noun phrases in a sentence, assign weights based on co-occurences and represent them as vectors. This is your training data.
This can be a really good start to incorporating NLP into your domain.
Don't use triggers. What you need is a word sense disambiguation and domain adaptation. You want to make sense of is in the documents i.e understand the semantics to figure out the meaning. You can build a legal ontology of terms in skos or json-ld format represent it ontologically in a knowledge graph and use it with dependency parsing like tensorflow/parseymcparseface. Or, you can stream your documents in using a kappa based architecture - something like kafka-flink-elasticsearch with added intermediate NLP layers using CoreNLP/Tensorflow/UIMA, cache your indexing setup between flink and elasticsearch using redis to speed up the process. To understand relevancy you can apply specific cases from boosting in your search. Furthermore, apply sentiment analysis to work out intents and truthness. Your use case is one of an information extraction, summarization, and semantic web/linked data. As EU has a different legal system you would need to generalize first on what is really a legal document then narrow it down to specific legal concepts as they relate to a topic or region. You could also use here topic modelling techniques from LDA or Word2Vec/Sense2Vec. Also, Lemon might also help from converting lexical to semantics and semantics to lexical i.e NLP->ontology ->ontology->NLP. Essentially, feed the clustering into your classification of a named entity recognition. You can also use the clustering to assist you in building out the ontology or seeing what word vectors are in a document or set of documents using cosine similarity. But, in order to do all that it be best to visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.
I am using Wordnet for finding synonyms of ontology concepts. How can i find choose the appropriate sense for my ontology concept. e.g there is an ontlogy concept "conference" it has following synsets in wordnet
The noun conference has 3 senses (first 3 from tagged texts)
(12) conference -- (a prearranged meeting for consultation or exchange of information or discussion (especially one with a formal agenda))
(2) league, conference -- (an association of sports teams that organizes matches for its members)
(2) conference, group discussion -- (a discussion among participants who have an agreed (serious) topic)
now 1st and 3rd synsets have apprpriate sense for my ontology concept. How can i choose only these two from wordnet?
The technology you're looking for is in the direction of semantic disambiguation / representation.
The most "traditional approach" is Word Sense Disambiguation (WSD), take a look at
https://en.wikipedia.org/wiki/Word-sense_disambiguation
https://stackoverflow.com/questions/tagged/word-sense-disambiguation
Anyone know of some good Word Sense Disambiguation software?
Then comes the next generation of Word Sense induction / Topic modelling / Knowledge representation:
https://en.wikipedia.org/wiki/Word-sense_induction
https://en.wikipedia.org/wiki/Topic_model
https://en.wikipedia.org/wiki/Knowledge_representation_and_reasoning
Then comes the most recent hype:
Word embeddings, vector space models, neural nets
Sometimes people skip the semantic representation and goes directly to do text similarity and by comparing pairs of sentences, the differences/similarities before getting to the ultimate aim of the text processing.
Take a look at Normalize ranking score with weights for a list of STS related work.
On the other direction, there's
ontology creation (Cyc, Yago, Freebase, etc.)
semantic web (https://en.wikipedia.org/wiki/Semantic_Web)
semantic lexical resources (WordNet, Open Multilingual WordNet, etc.)
Knowledge base population (http://www.nist.gov/tac/2014/KBP/)
There's also a recent task on ontology induction / expansion:
http://alt.qcri.org/semeval2015/task17/
http://alt.qcri.org/semeval2016/task13/
http://alt.qcri.org/semeval2016/task14/
Depending on the ultimate task, maybe either of the above technology would help.
You can also try Babelfy, which provides Word Sense Disambiguation and Named Entity Disambiguation.
Demo:
http://babelfy.org/
API:
http://babelfy.org/guide
Take a look at this list: 100 Best GitHub: Word-sense Disambiguation
and search by WordNet - there are several appropriate libraries.
I didn't use any of them, but this one seems to be promising, because it is based on classic yet effective idea (namely, Lesk algorithm) upgraded by modern word-embedding methods. Actually, before finding it, I was going to suggest to try almost the same ideas.
Note also that all methods try to find the meaning (WordNet sysnet, in your case) that is most similar to the context of the current word/collocation, so it is crucial to have context of the words you're trying to disambiguate. For example, words can come from some text and most libraries rely on that.
After clustering process I have a bunch of words that have some similarity. I would like to categorize these words.
For example, If I have this words:
Linked Data
Domain Ontology Semantic Web
Use Case
Semantic Annotation
Maybe the right category is Semantic Web.
I know this kind of problems could be solved wit NLP, but I new in NLP and I don't know where to start. Anyone could say me what the correct way is? or If it's reachable?
Note: I found similar problems They have solved with collocation and POS tagging. Could I apply it for this specific problem?
You could search for papers on Topic Labelling - It is generally considered a pretty hard problem. A paper such as the following is probably a good place to start though. The authors have a few others that are relevant as well.
Lau, J. H., Grieser, K., Newman, D., & Baldwin, T. (2011, June). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 1536-1545). Association for Computational Linguistics.
I am using Wordnet for finding synonyms of ontology concepts. How can i find choose the appropriate sense for my ontology concept. e.g there is an ontlogy concept "conference" it has following synsets in wordnet
The noun conference has 3 senses (first 3 from tagged texts)
(12) conference -- (a prearranged meeting for consultation or exchange of information or discussion (especially one with a formal agenda))
(2) league, conference -- (an association of sports teams that organizes matches for its members)
(2) conference, group discussion -- (a discussion among participants who have an agreed (serious) topic)
now 1st and 3rd synsets have apprpriate sense for my ontology concept. How can i choose only these two from wordnet?
The technology you're looking for is in the direction of semantic disambiguation / representation.
The most "traditional approach" is Word Sense Disambiguation (WSD), take a look at
https://en.wikipedia.org/wiki/Word-sense_disambiguation
https://stackoverflow.com/questions/tagged/word-sense-disambiguation
Anyone know of some good Word Sense Disambiguation software?
Then comes the next generation of Word Sense induction / Topic modelling / Knowledge representation:
https://en.wikipedia.org/wiki/Word-sense_induction
https://en.wikipedia.org/wiki/Topic_model
https://en.wikipedia.org/wiki/Knowledge_representation_and_reasoning
Then comes the most recent hype:
Word embeddings, vector space models, neural nets
Sometimes people skip the semantic representation and goes directly to do text similarity and by comparing pairs of sentences, the differences/similarities before getting to the ultimate aim of the text processing.
Take a look at Normalize ranking score with weights for a list of STS related work.
On the other direction, there's
ontology creation (Cyc, Yago, Freebase, etc.)
semantic web (https://en.wikipedia.org/wiki/Semantic_Web)
semantic lexical resources (WordNet, Open Multilingual WordNet, etc.)
Knowledge base population (http://www.nist.gov/tac/2014/KBP/)
There's also a recent task on ontology induction / expansion:
http://alt.qcri.org/semeval2015/task17/
http://alt.qcri.org/semeval2016/task13/
http://alt.qcri.org/semeval2016/task14/
Depending on the ultimate task, maybe either of the above technology would help.
You can also try Babelfy, which provides Word Sense Disambiguation and Named Entity Disambiguation.
Demo:
http://babelfy.org/
API:
http://babelfy.org/guide
Take a look at this list: 100 Best GitHub: Word-sense Disambiguation
and search by WordNet - there are several appropriate libraries.
I didn't use any of them, but this one seems to be promising, because it is based on classic yet effective idea (namely, Lesk algorithm) upgraded by modern word-embedding methods. Actually, before finding it, I was going to suggest to try almost the same ideas.
Note also that all methods try to find the meaning (WordNet sysnet, in your case) that is most similar to the context of the current word/collocation, so it is crucial to have context of the words you're trying to disambiguate. For example, words can come from some text and most libraries rely on that.
I'm building a small prototype of a Movies semantic search engine based on the data of LinkedIMDB
I've defined some Query Types as an example of use cases
search by entity name search by
entity type
search common features between two entities ...etc
So far I've developed a SPARQL engine that takes any type of those Queries and send the Query to the endpoint and preview the result.
The problem here is that I want to make a natural language or semi natural language interface for it in order for users to invoke those sentences using Natural language search Queries. But I don't know from where to start.
I've found some papers that are trying to extract triplets from the text but I don't feel that's the key to the solution.
Also I've found some LSA techniques to interpret Natural language search Queries but I feel it's not applicable to semantic search domain.
Any idea or resources to start reading from?
Is there a best practice than the natural language interface?
A lot of work has been done in the field of natural languge -> SQL conversion. Maybe you should take that as a starting point and see how you can modify the available examples for SPARQL. (Also, designing a controlled natural language could make your task easier.)
Another path to explore can be this article: Supporting Domain Experts to Construct Conceptual Ontologies: A Holistic Approach published at the Journal of Web Semantics, http://www.websemanticsjournal.org/index.php/ps/article/view/189 Even though it is about using natural language for ontology construction, the approach explained there (along with open source code) can turn into a fruitful exploration.
Have you seen FREya # https://github.com/nmvijay/freya it is an NLP to SPARQL convertor.
FREyA is an interactive Natural Language Interface for querying ontologies which combines usability enhancement methods such as feedback and clarification dialogs in order to:
1) improve recall by generating the dialog and enriching the domain lexicon from the user's vocabulary, whenever an "unknown" term appears in a question
2) improve precision by resolving ambiguities more effectively through the dialog. The suggestions shown to the user are found through ontology reasoning and are initially ranked using the combination of string similarity and synonym detection. The system then learns from the user's selections, and improves its performance over time.