I want to know whether I can use OpenIE (or some other available option) in a mode where I specify the entities myself, instead of OpenIE extracting them from the text, and it then finds the relation between them.
E.g. "Obama was president of the US."
Input - Obama, US
Output - president of
The kbp annotator can extract relations (a fixed set of relations from the KBP competition, such as "born in"). There is documentation on using the full pipeline here: https://stanfordnlp.github.io/CoreNLP/api.html. One limitation is that it won't extract arbitrary relations, just the specific KBP ones.
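If you're calling CoreNLP from Python, here is a minimal sketch using the stanza client; it assumes a local CoreNLP installation (with CORENLP_HOME set), and the exact protobuf field names (e.g. kbpTriple) may vary between versions:

```python
# Sketch: run the CoreNLP kbp annotator via the stanza client and print
# the extracted (subject, relation, object) triples.
from stanza.server import CoreNLPClient

text = "Obama was president of the US."

with CoreNLPClient(
    # kbp needs the upstream annotators (POS, NER, parse, coref)
    annotators=["tokenize", "ssplit", "pos", "lemma",
                "ner", "parse", "coref", "kbp"],
    memory="4G",
) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.kbpTriple:
            print(triple.subject, triple.relation, triple.object)
```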
No promises, but down the road we want to integrate a relation extractor into our Python code base, and make it trainable on any relation you want. Though you would have to have training data for your specific relation type.
Related
For the Stanford NER 3-class model, Location, Person, and Organization recognizers are available. Is it possible to add additional classes to this model? For example: Sports, as one class to tag sport names.
Or, if not, is there any model where I can add additional classes?
Note: I didn't exactly mean to add "sports" as a class. I was wondering whether there is a possibility to add a custom class to that model. If it's not possible with Stanford, is it possible with spaCy?
Sports don't really fall into the entity category: there are a limited number of them and they are pretty fixed, unlike the names of people or locations, so you can list them all.
I would simply set up a list of sport names and use string matching to find them.
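For example, a minimal sketch using spaCy's PhraseMatcher (the sports list here is just illustrative; you would plug in your own):

```python
# Sketch: gazetteer-style string matching with spaCy's PhraseMatcher.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

sports = ["ice hockey", "football", "tennis", "basketball"]
matcher.add("SPORT", [nlp.make_doc(s) for s in sports])

doc = nlp("She played ice hockey before switching to tennis.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "ice hockey", "tennis"
```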
I am using Rasa 2.0 to build an FAQ chatbot. I have a large dataset, and specifying entities while defining intents does not seem efficient to me.
I have the intents and examples defined in nlu.yml and would like to extract entities.
Here is an example of what I want to achieve:
User message -> I want a hospital in Delhi.
Entity -> Delhi, hospital
Is it possible to do so?
Entity detection is not a solved problem. There are pre-trained models that integrate with Rasa, like Duckling and spaCy, and while these tools certainly contribute a lot of knowledge, they will make errors. If you're interested in more background on why these models can fail, you may enjoy this YouTube video about human name detection.
That's why a popular alternative is to use name lists. There are downloadable lists of cities around the world, as well as lists of baby names, that can serve as a rule-based alternative. You can configure this in Rasa via the RegexEntityExtractor, but if you have name lists with 1000+ items, then a FlashTextExtractor might be preferable.
If you've got labelled examples you can also train Rasa itself to recognise the entities, but in order to do this you will need to have labels available.
specifying entities while defining intents does not seem efficient to me
Labelling might not be super fun, but it is super effective. Without labelling your received utterances, you won't know what intents your users are interested in.
You could use entity annotations in your nlu training data; for example, assuming you have defined building_type and city as entity names:
I want a [hospital](building_type) in [Delhi](city).
Alternatively, you could try out these options:
annotate a smaller sample (for example, those entities that are essential for your FAQ assistant)
use the RegexEntityExtractor to write some rules
if you have a list of entities, you can use lookup tables to generate the regular expressions (see the sketch after this list)
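To make that concrete, here is a rough sketch of what the training data and pipeline config could look like in Rasa 2.0; the intent, entity names, and lookup values are placeholders:

```yaml
# nlu.yml -- entity annotations plus a lookup table (illustrative names)
version: "2.0"
nlu:
- intent: find_facility
  examples: |
    - I want a [hospital](building_type) in [Delhi](city)
    - Is there a [clinic](building_type) near [Mumbai](city)?
- lookup: city
  examples: |
    - Delhi
    - Mumbai
    - Chennai

# config.yml -- enable the RegexEntityExtractor so lookup tables are used
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexEntityExtractor
    use_lookup_tables: true
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
```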
I'm working on an NLP project with millions of sentences, each containing two entities. For each sentence, I want to find out whether the two entities have a relationship or not.
So I want to find a word list like:
['related to','induced by','the treatment of','The effects of','the treatment of','treated with','best for','in response to','approved for','response with','associated with','efficacy of ','in treating','applied to','efficacy in','efficacy and safety','efficacy at','impact on','approved','causing','but none of ','linked to','cause of','associated with','leading to','caused by','the relationship between','responsible for']
I have searched GitHub but I can't find such a list.
What should I do?
As you can see, there are a vast number of ways in which a possible semantic relationship between two entities can be lexicalised (i.e. expressed by a word/expression) in language. Furthermore, this will be very dependent on the domain (e.g. politics, healthcare, engineering, astronomy, social sciences, etc.). I'm not aware of any "ontology of relations".
By contrast, there will be less variety in the syntactic structures at play (i.e. dependency relations or constituent structure, depending on the syntactic formalism you use). You should be able to identify (many of) these more easily than the actual list of words used (although having a list of words would be very useful). For example, for a given verb, if one entity (a noun or noun phrase) is the subject and another entity is the direct object, then that verb is likely to express a relation between the two. The same goes for indirect objects, oblique objects, etc.
You can use a library like spaCy to retrieve the grammatical (dependency) relations between verbs and nominal entities which you can then use to identify semantic relations. For example:
The Moon orbits the Earth.
spaCy dependencies: nsubj(orbits, Moon) obj(orbits, Earth)
semantic relation: orbit(Moon, Earth)
Trump was impeached by Congress.
spaCy dependencies: nsubjpass(impeached, Trump) agent(impeached, by) pobj(by, Congress)
semantic relation: impeach(Congress, Trump)
spaCy also takes care of Named Entity Recognition for you, although it is trained on a specific corpus that may not match your domain. Note that I have used the lemma of the word to represent the relation (not the inflected verb form).
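As a starting point, here is a minimal sketch of that idea with spaCy, assuming the en_core_web_sm model is installed; it only covers the two simple patterns above (active nsubj/dobj and passive nsubjpass/agent):

```python
# Sketch: extract relation(subject, object) triples from spaCy's
# dependency parse, using the verb's lemma as the relation name.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    triples = []
    for token in nlp(text):
        if token.pos_ != "VERB":
            continue
        # Active voice: "The Moon orbits the Earth" -> orbit(Moon, Earth)
        subj = [c for c in token.children if c.dep_ == "nsubj"]
        obj = [c for c in token.children if c.dep_ == "dobj"]
        if subj and obj:
            triples.append((token.lemma_, subj[0].text, obj[0].text))
        # Passive voice: "X was impeached by Y" -> impeach(Y, X)
        psubj = [c for c in token.children if c.dep_ == "nsubjpass"]
        agents = [gc for c in token.children if c.dep_ == "agent"
                  for gc in c.children if gc.dep_ == "pobj"]
        if psubj and agents:
            triples.append((token.lemma_, agents[0].text, psubj[0].text))
    return triples

print(extract_triples("The Moon orbits the Earth."))
# -> [('orbit', 'Moon', 'Earth')]
print(extract_triples("Trump was impeached by Congress."))
# -> [('impeach', 'Congress', 'Trump')]
```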
These are just simple examples: the number of configurations is large, and more complex verbal predicates exist (e.g. phrasal verbs), but you can pick up many semantic relations with a few patterns of grammatical dependencies, just looking at simple verbs.
This requires a bit of work beyond the rough sketch above, but maybe this will help you make a start...?
I have a corpus of a few hundred thousand legal documents (mostly from the European Union): laws, commentary, court documents, etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc.). But at the single-document level, I wish I had better tools for fast comprehension. I am open to ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents, as opposed to boilerplate? The recently leaked TTIP papers run to thousands of pages of data tables, but one sentence somewhere in there may destroy an industry.
I played around with Google's new Parsey McParseface and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense of documents you need to perform some sort of semantic analysis. You have two main possibilities, each with examples:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents, you can apply some post-processing to determine which information is relevant. Finding the relevant information is task-specific, and I don't think you can find a generic tool that extracts "the relevant" information.
I see you have an interesting use case. You've also mentioned the presence of a corpus (which is a really good plus). Let me relate a solution that I had sketched for extracting the crux of research papers.
To make sense of documents, you need "triggers", and you have to tell (or train) the computer to look for them. You can approach this with a supervised learning algorithm, implemented at the most basic level as a simple text classification problem. But this would need prior work, with initial help from domain experts to discern the "triggers" in the textual data. There are tools to extract the gist of sentences: for example, take the noun phrases in a sentence, assign weights based on co-occurrences, and represent them as vectors. This is your training data.
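As a rough illustration only (the labels and sentences below are invented placeholders, not real training data), the most basic version of such a classifier could look like this in scikit-learn:

```python
# Sketch: a bag-of-ngrams relevant-vs-boilerplate sentence classifier.
# In practice you would train on many expert-labelled sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "This provision shall enter into force on 1 January.",  # boilerplate
    "Imports of product X are hereby prohibited.",           # relevant
]
labels = ["boilerplate", "relevant"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression())
clf.fit(sentences, labels)

print(clf.predict(["The annex is amended as follows."]))
```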
This can be a really good start to incorporating NLP into your domain.
Don't use triggers. What you need is word sense disambiguation and domain adaptation. You want to make sense of what is in the documents, i.e. understand the semantics, to figure out the meaning.

You can build a legal ontology of terms in SKOS or JSON-LD format, represent it in a knowledge graph, and use it with dependency parsing like TensorFlow's Parsey McParseface. Or you can stream your documents in using a kappa-based architecture (something like Kafka-Flink-Elasticsearch) with intermediate NLP layers using CoreNLP/TensorFlow/UIMA, caching your indexing setup between Flink and Elasticsearch with Redis to speed up the process. To understand relevancy you can apply specific cases of boosting in your search. Furthermore, apply sentiment analysis to work out intents and truthfulness.

Your use case is one of information extraction, summarization, and semantic web/linked data. As the EU has a different legal system, you would need to generalize first on what a legal document really is, then narrow that down to specific legal concepts as they relate to a topic or region. You could also use topic modelling techniques here, such as LDA or Word2Vec/Sense2Vec. Lemon might also help with converting from lexical to semantic form and back, i.e. NLP->ontology and ontology->NLP.

Essentially, feed the clustering into your named entity recognition classification. You can also use the clustering to help build out the ontology, or to see which word vectors occur in a document or set of documents using cosine similarity. But in order to do all that, it would be best to first visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.
I need to build a classifier which identifies NEs (named entities) in a specific domain. So, for instance, if my domain is hockey or football, the classifier should accept NEs in that domain but NOT all the pronouns it sees on web pages. My ultimate goal is to improve text classification through NER.
For people working in this area: please suggest how I should build such a classifier.
thanks!
If all you want is to ignore pronouns, you can run any POS tagger followed by any NER algorithm (the Stanford package is a popular implementation) and then ignore any named entities which are pronouns. However, the pronouns might refer to named entities, which may or may not turn out to be important for the performance of your classifier. The only way to tell for sure is to try.
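For instance, a minimal sketch of that filtering step with spaCy (assuming the en_core_web_sm model is installed; the Stanford pipeline would work similarly):

```python
# Sketch: run tagger + NER, then keep only entities with no pronoun tokens.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He said Wayne Gretzky dominated hockey in Edmonton.")

entities = [ent for ent in doc.ents
            if not any(tok.pos_ == "PRON" for tok in ent)]
print([(ent.text, ent.label_) for ent in entities])
# -> e.g. [('Wayne Gretzky', 'PERSON'), ('Edmonton', 'GPE')]
```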
A slightly unrelated comment: a NER system trained on domain-specific data (e.g. hockey) is more likely to pick up entities from that domain, because it will have seen some of the contexts in which those entities appear. Depending on the system, it might also pick up entities from other domains (which, if I understand your question correctly, you do not want) because of syntax, word-shape patterns, etc.
I think something like AutoNER might be useful for this. Essentially, the input to the system is text documents from a particular domain and a list of domain-specific entities that you'd like the system to recognize (like hockey players in your case).
According to their results in this paper, they perform well on recognizing chemical names and disease names among others.