I am using Rasa 2.0 to build an FAQ chatbot, wherein I have a large dataset, and specifying entities while defining intents does not seem efficient to me.
I have the intents and examples defined in nlu.yml and would like to extract entities.
Here is an example of what I want to achieve,
User message -> I want a hospital in Delhi.
Entity -> Delhi, hospital
Is it possible to do so?
Entity detection is not a solved problem. There are pre-trained models that integrate with Rasa, like Duckling and spaCy, and while these tools certainly contribute a lot of knowledge, they will make errors. If you're interested in learning more of the background on why these models can fail, you can enjoy this YouTube video that explains human name detection.
That's why a popular alternative is to use name lists. There are lists of cities around the world, as well as lists of baby names, that you can download and use as a rule-based alternative. You can configure this in Rasa via the RegexEntityExtractor, but if you have name lists with 1000+ items then the FlashTextEntityExtractor (from the rasa-nlu-examples project) might be preferable.
If you've got labelled examples you can also train Rasa itself to recognise the entities. But in order to do this you will need to have labels around.
"specifying entities while defining intents does not seem efficient to me"
Labelling might not be super fun, but it is super effective. Without labelling your received utterances you won't know what intents your users are interested in.
You could use entity annotations in your nlu training data; for example, assuming you have defined building_type and city as entity names:
I want a [hospital](building_type) in [Delhi](city).
Alternatively, you could try out these options:
annotate a smaller sample (for example, those entities that are essential for your FAQ assistant)
use the RegexEntityExtractor to write some rules
if you have a list of entities, you can use lookup tables to generate the regular expressions
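The lookup-table option can be illustrated in plain Python. This is not Rasa's RegexEntityExtractor itself, only a rough sketch of how a name list turns into a regular expression; the lists, `build_pattern` and `extract_entities` are made up for the example.

```python
import re

# Hypothetical lookup tables; in practice these would be loaded from files.
LOOKUPS = {
    "city": ["Delhi", "Mumbai", "New Delhi"],
    "building_type": ["hospital", "school", "bank"],
}


def build_pattern(values):
    """Compile one case-insensitive regex that matches any value in the list."""
    # Longest values first, so "New Delhi" is preferred over "Delhi".
    ordered = sorted(values, key=len, reverse=True)
    alternatives = "|".join(re.escape(value) for value in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


PATTERNS = {name: build_pattern(values) for name, values in LOOKUPS.items()}


def extract_entities(text):
    """Return a list of (entity_name, matched_text) pairs found in text."""
    found = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((name, match.group(0)))
    return found


print(extract_entities("I want a hospital in Delhi."))
```

A list like this only finds what is on the list, spelled as listed, which is the trade-off against a trained extractor.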
I would like to create a semantic context for my data before vectorizing the actual data in Weaviate (https://github.com/semi-technologies/weaviate).
Let's say we have a taxonomy with a set of domain-specific concepts together with links to their related concepts. Could you advise me on the best way to encode not only those concepts but also the relations between them using the contextionary?
Depending on your use case, there are a few answers possible.
You can create the "semantic context" in a Weaviate schema and use a vectorization module to vectorize the data according to this schema.
You have domain-specific concepts in your data that the out-of-the-box vectorization modules don't know about (e.g., specific abbreviations).
You want to capture the semantic context of (i.e., vectorize) the graph itself before adding it to Weaviate.
The first is the easiest and most straightforward; the last is the most esoteric.
Create a schema and use a vectorizer for your data
In your case, you would create a schema based on your taxonomy and load the data using an out-of-the-box vectorizer (this configurator helps you to build a Docker-compose file).
I would recommend starting with this anyway, because it will determine your data model and how you can search through and/or classify data. It might even be the case that for your use case this step already solves the problem because the out-of-the-box vectorizers are (bias alert) pretty decent.
Domain-specific concepts
At the moment of writing, Weaviate has two vectorizers, the contextionary and the transformers modules.
If you want to extend Weaviate with custom context, you can extend the contextionary or fine tune and distribute custom transformers.
If you do this, I would still highly recommend taking the first step, because it will simply improve the results.
Capture semantic context of your graph
I don't think this is what you want, but it is possible, if quite esoteric. In principle, you can store your vectorized graph in Weaviate, but you need to generate the vectors on your own. For example, at the moment of writing, we are looking at RDF2Vec.
PS:
Because people often ask about the role of ontologies and taxonomies in Weaviate, I've written this blog post.
I'm trying to create a model in LUIS that allows me to detect whether a brand (any brand) is mentioned in an utterance. I've tried different approaches but I'm struggling to get it working.
First I have an intent searchBrand with some examples utterances:
'Help me find info about Channel'
'I want to know more about Adidas'
...
What I want is that LUIS recognizes that a brand has been mentioned in the utterance (as an entity).
I believe I have these options:
Use a List Entity: impossible since I would have to fill the list with every possible brand that exists and, moreover, the user would have to write the brand exactly as it is, not allowing typos (e.g. ralf lauren)
Use a ML Entity: I believe this could be the right approach. I've tried the following without success:
Create a ML Entity "brands"
Add a Structure with 1 component "brand"
Add to the component a Descriptor with a list of different brands as an example
Once I label the entities in the utterances, the model correctly recognizes the brands that I added to the Descriptor, but it fails to recognize other brands or typos
Another option is a pattern entity. It fits somewhere between the two options you listed. You do need to train it with the patterns, and if the pattern is off at all it will not recognize the entity (and won't recognize the intent either unless you've separately trained it with utterances, which you should). However, it seems like the phrasings in your case would be consistent enough that you could define a few patterns for this, and as you train your bot from endpoint utterances you can add additional patterns as needed. Here is an example:
As I put this together I realized I'm ignoring [help me] and [find], essentially the pattern is "info about {brand}", which may or may not be appropriate depending on your other intents. If you say something different like "Tell me more about Adidas", the intent will be recognized (I trained it with your sample utterances), but the pattern, and therefore entity, will not.
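To make the behaviour described above concrete, here is a rough stand-in written as a Python regular expression. It is not LUIS pattern syntax and not how LUIS matches internally; `PATTERN` and `match_brand` are hypothetical names, and the regex only mimics the idea of fixed words followed by a `{brand}` slot.

```python
import re

# Rough stand-in for a pattern such as "info about {brand}".
# The brand slot is whatever follows the fixed words, up to the end of the
# utterance, minus one optional trailing punctuation mark.
PATTERN = re.compile(r"\binfo about (?P<brand>.+?)[.?!]?$", re.IGNORECASE)


def match_brand(utterance):
    """Return the text in the brand slot, or None if the pattern does not apply."""
    match = PATTERN.search(utterance.strip())
    return match.group("brand") if match else None


for utterance in [
    "Help me find info about Channel",
    "Tell me more about Adidas",
]:
    print(utterance, "->", match_brand(utterance))
```

The second utterance does not contain the fixed words, so the slot is never filled, which mirrors the limitation noted above: the phrasing has to match one of the defined patterns.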
Tutorial on using Patterns in LUIS
I got it working following this:
Create a ML Entity "brands"
Add to the entity a Descriptor with a list of different brands as an example. Remember to normalize the elements in the Descriptor
Add brands to the Descriptor
Label entities as "brands" inside utterances in intent "searchBrands"
Train & test the model
It is very important to normalize everything in LUIS. I had the brands inside the Descriptor capitalized and LUIS couldn't recognize new ones. Once I normalized the brands, LUIS started suggesting new ones and recognizing more when testing the model.
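The answer does not say exactly which normalization was applied. As a minimal sketch, assuming it means lowercasing and tidying whitespace before the brands go into the Descriptor, the list could be prepared like this (`normalize` and `normalize_list` are hypothetical helpers, not part of any LUIS SDK):

```python
def normalize(phrase):
    """Lowercase a phrase and collapse runs of whitespace (assumed normalization)."""
    return " ".join(phrase.lower().split())


def normalize_list(phrases):
    """Normalize every phrase and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for phrase in phrases:
        cleaned = normalize(phrase)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


print(normalize_list(["Adidas", " adidas ", "Ralph  Lauren", "NIKE"]))
```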
I was looking through the documentation and testing Google's Natural Language API and noticed it gets a number of people, events, organizations, and locations incorrect - it appears to be using Wikipedia as a major data source so if it is not in Wikipedia it seems to have trouble identifying the type of various words. Also, if certain words appear in a name (proper noun) it seems to always identify an entity as a certain type which is not always correct.
For instance: "Congress" seems to always identify as an organization [government] even when it is part of an event name. The name "WordCamp" shows as a location, but it is an event.
Is there a way to train the Natural Language engine or provide a custom set of organizations, locations, events, etc. so that it provides more accurate type information for entities that are not extremely popular?
I am the Product manager for this product. Custom entity types are not currently supported. As per your comment about not getting some entity types right, this is true for any NLP system but our goal is to keep improving. We are working on ways for you to provide us feedback on instances that we get wrong to improve our accuracy and will share the details shortly. Note we have trained our models on multiple data sources and not just Wikipedia data. The API returns the most relevant Wikipedia article for an entity detected so if an entity has multiple interpretations, we will only return the most commonly used interpretation.
I have a corpus of a few hundred thousand legal documents (mostly from the European Union): laws, commentary, court documents etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc). But on the single-document level, I wish I had better tools to allow fast comprehension. I am open for ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I played around with Google's new Parsey McParface, and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense of documents you need to perform some sort of semantic analysis. There are two main options, each with an example:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents then you may apply some post-processing to determine which information is relevant. Finding which information is relevant is task related and I don't think you can find a generic tool that extracts "the relevant" information.
I see you have an interesting use case. You've also mentioned the presence of a corpus (which is a really good plus). Let me relate a solution that I had sketched for extracting the crux from research papers.
To make sense of documents, you need "triggers" that tell (or train) the computer what to look for. You can approach this using a supervised learning algorithm, with a simple implementation of a text classification problem at the most basic level. But this would need prior work, with help from domain experts initially to discern the "triggers" in the textual data. There are tools to extract gists of sentences: for example, take the noun phrases in a sentence, assign weights based on co-occurrences, and represent them as vectors. This is your training data.
This can be a really good start to incorporating NLP into your domain.
Don't use triggers. What you need is word sense disambiguation and domain adaptation. You want to make sense of what is in the documents, i.e. understand the semantics to figure out the meaning. You can build a legal ontology of terms in SKOS or JSON-LD format, represent it in a knowledge graph, and use it with dependency parsing such as TensorFlow's Parsey McParseface. Or you can stream your documents in using a kappa-based architecture, something like Kafka-Flink-Elasticsearch with added intermediate NLP layers using CoreNLP/TensorFlow/UIMA, and cache your indexing setup between Flink and Elasticsearch using Redis to speed up the process. To understand relevancy, you can apply boosting for specific cases in your search. Furthermore, apply sentiment analysis to work out intents and truthfulness. Your use case is one of information extraction, summarization, and semantic web/linked data. As the EU has a different legal system, you would need to generalize first on what a legal document really is, then narrow it down to specific legal concepts as they relate to a topic or region. You could also use topic modelling techniques here, such as LDA or Word2Vec/Sense2Vec. Lemon might also help with converting lexical to semantic and semantic to lexical, i.e. NLP -> ontology and ontology -> NLP. Essentially, feed the clustering into your named entity recognition classifier. You can also use the clustering to assist you in building out the ontology, or to see which word vectors are in a document or set of documents using cosine similarity. But in order to do all that, it would be best to first visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.
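As a small illustration of the cosine similarity mentioned above, here is a sketch that compares two texts using plain word counts. Word counts are a stand-in chosen to keep the example self-contained; the answer has learned word vectors (Word2Vec/Sense2Vec) in mind, and `bag_of_words` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens in a text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("the member state shall notify the commission")
doc_b = bag_of_words("the commission shall notify the member state")
doc_c = bag_of_words("tariff schedules for agricultural goods")

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Word counts ignore word order and meaning, so the first two documents score as identical; dense vectors from a trained model would be the next step.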
I need to build a classifier which identifies NEs in a specific domain. So for instance if my domain is Hockey or Football, the classifier should accept NEs in that domain but NOT all pronouns it sees on web pages. My ultimate goal is to improve text classification through NER.
For people working in this area, please suggest how I should build such a classifier.
thanks!
If all you want is to ignore pronouns, you can run any POS tagger followed by any NER algorithm (the Stanford package is a popular implementation) and then ignore any named entities which are pronouns. However, the pronouns might refer to named entities, which may or may not turn out to be important for the performance of your classifier. The only way to tell for sure is to try.
A slightly unrelated comment: an NER system trained on domain-specific data (e.g. hockey) is more likely to pick up entities from that domain because it will have seen some of the contexts entities appear in. Depending on the system, it might also pick up entities from other domains (which you do not want, if I understand your question correctly) because of syntax, word shape patterns, etc.
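The filtering step can be sketched independently of any particular toolkit. The example below assumes the tagger and NER stages have already produced `(token, pos_tag, ner_label)` triples, with Penn Treebank pronoun tags (`PRP`, `PRP$`) and `"O"` marking tokens outside any entity; the sample data and the function name are made up for illustration.

```python
# Penn Treebank tags for personal and possessive pronouns.
PRONOUN_TAGS = {"PRP", "PRP$"}


def keep_named_entities(tagged_tokens):
    """Keep (token, label) pairs that carry an entity label and are not pronouns."""
    return [
        (token, label)
        for token, pos, label in tagged_tokens
        if label != "O" and pos not in PRONOUN_TAGS
    ]


# Hypothetical output of a POS tagger followed by an NER step.
sample = [
    ("Gretzky", "NNP", "PERSON"),
    ("scored", "VBD", "O"),
    ("for", "IN", "O"),
    ("Edmonton", "NNP", "ORGANIZATION"),
    ("and", "CC", "O"),
    ("he", "PRP", "PERSON"),  # a pronoun the NER step labelled anyway
    ("celebrated", "VBD", "O"),
]

print(keep_named_entities(sample))
```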
I think something like AutoNER might be useful for this. Essentially, the input to the system is text documents from a particular domain and a list of domain-specific entities that you'd like the system to recognize (like Hockey players in your case).
According to their results in this paper, they perform well on recognizing chemical names and disease names among others.