Can we test or evaluate entity extraction in Rasa NLU? - nlp

Is it possible to evaluate how well my model extracts entities (and maps synonym values) in Rasa NLU?
I have tried the rasa_nlu -evaluate mode however, it seems to only work for intent classification, although my JSON data file contains entities information and I'd really like to know if my entity extraction is up to the mark given various scenarios. I've used Tracy to generate test dataset.

Actually yes - you should get the score to you entities as well.
Are you sure you added some to your training data?
do you have it NER algo that fetches them? something like this?
pipeline:
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
batch_size: 64
epochs: 1500
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "ner_crf"
ner_crf is conditional random field for extracting the "name entity recognition"
To make sure you follow the model building correctly have a look at this tutorial:
https://hackernoon.com/build-simple-chatbot-with-rasa-part-1-f4c6d5bb1aea

As the documentation says https://rasa.com/docs/nlu/0.12.0/evaluation/, if your are using either ner_crf or ner_duckling, the evaluation method automatically takes entity extraction performance unto account. If you only use ner_synonyms the evaluate method won't compute an output table.
Other possible pitfalls could be:
If you parse a single sentence including a desired entity, does your trained model extract an entity? This could be a clue to the situation that your model was not able to evolve a pattern recognizing entities.
Also a problem could be that by randomly splitting the data into train and test set, there's no entity in your test set to extract. Your algorithm could have learned the pattern but is not forced to apply this pattern. Did you check wether your test set contains entities?

If I understand right, perhaps you are interested in something like https://github.com/RasaHQ/rasa_nlu/issues/1472? So, this issue was written because for intents you could get overall score and you could see how each intent was classified, but you could only get the overall score for entities and not how each entity was classified.
So in short, this is still an open issue and not possible in Rasa. However, it was an issue I was asked to look at just yesterday, so I will let you know if I make any progress on it.

Related

How to handle two entity extraction methods in NLP

I am using two different entity extraction methods (https://rasa.com/docs/nlu/entities/) while building my NLP model in the RASA framework to build a chatbot.
The bot should handle different questions which have custom entities as well as some general ones like location or organisation.
So I use both components ner_spacy and ner_crf to create the model. After that I build a small helper script in python to evaluate the model performance. There I noticed that the model struggles to choose the correct enity.
For example for a word 'X' it choosed the pre-defined enity 'ORG' from SpaCy, but it should be recogniced as a custom enity which I defined in the training data.
If I just use the ner_crf extractor I face huge problems in identifing location enities like capitals. Also one of my biggest problems are single answer enities.
Q : "What´s your favourite animal?"
A : Dog
My model is not able to extract this single entity 'animal' for this single answer. If I answer this question with two words like 'The Dog', the model has no problems to extract the animal entity with the value 'Dog'.
So my question is, is it clever to use two different components to extract entities? One for custom enities and the other one for pre-defined enities.
If I use two methods, what´s the mechanism in the model which extractor is used?
By the way, currently I´m just testing things out, so my training samples are not that huge it should be (less then 100 examples). Could the problem been solved if I have much more training examples?
You are facing 2 problems here. I am suggesting few ways that i found helpful.
1. Custom entity recognition:
To solve this you need to add more training sentences with all possible lengths of entities. ner_crf is going to predict better when there are identifiable markers around entities (e.g. prepositions)
2. Extracting entities from single word answer :
As a workaround, i suggest you to do below manipulations on client end.
When you are sending question like What´s your favorite animal?, append a marker to question to indicate to client that a single answer is expected. e.g.
You can send ##SINGLE## What´s your favorite animal? to client.
Client can remove the ##SINGLE## from question and show it to user. But when client sends user's response to server, it doesn't send Dog, it send something like User responded with single answer as Dog
You can train your model to extract entities from such an answer.

Is it possible to use DialogFlow simply to parse text?

Is it possible to use DialogFlow to simply parse some text and return the entities within that text?
I'm not interested in a conversation or bot-like behaviour, simply text in and list of entities out.
The entity recognition seems to be better with DialogFlow than Google Natural Language Processing and the ability to train might be useful also.
Cheers.
I've never considered this... but yeah, it should be possible. You would upload the entities with synonyms. Then, remove the "Default Fallback Intent", and make a new intent, called "catchall". Procedurally generate sentences with examples of every entity being mentioned, alone or in combination (in whatever way you expect to need to extract them). In "Settings", change the "ML Settings" so the "ML Classification Threshold" is 0.
In theory, it should now classify every input as "catchall", and return all the entities it finds...
If you play around with tagging things as sys.any, this could be pretty effective...
However, you may want to look into something that is built for this. I have made cool stuff with Aylien's NLP API. They have entity extraction, and the free tier gives you 1,000 hits per day.
EDIT: If you can run some code, instead of relying on SaaS, you could check out Rasa NLU for Entity Extraction. With a SpaCy backend it would do well on recognizing pre-trained entities, and with a different backend you can use custom entities.

Pattern Recognition OR Named Entity Recognition for Information Extraction in NLP

There are some event description texts.
I want to extract the entrance fee of the events.
Sometimes the entrance fee is conditional.
What I want to achieve is to extract the entrance fee and it's conditions(if available). It's fine to retrieve the whole phrase or sentence which tells the entrance fee + it's conditions.
Note I: The texts are in German language.
Note II: Often the sentences are not complete, as they are mainly event flyers or advertisements.
What would be the category of this problem in NLP? Is it Named Entity Recognition and could be solved by training an own model with Apache openNLP?
Or I thought maybe easier would be to detect the pattern via the usual keywords in the use-case(entrance, $, but, only till, [number]am/pm, ...).
Please shed some light on me.
Input Examples:
- "If you enter the club before 10pm, the entrance is for free. Afterwards it is 6$."
- "Join our party tonight at 11pm till 5am. The entrance fee is 8$. But for girls and students it's half price."
This is broadly a structure learning problem. You might have to combine Named-Entity-Recognition/Tagging with Coreference Resolution. Read some papers on these as well as related github code and take it from there. Here is good discussion of state of the art tools for these at the moment https://www.reddit.com/r/MachineLearning/comments/3dz3fl/dl_architectures_for_entity_recognition_and_other/
Hope that helps.
You might try Stanford's CoreNLP for the named entity extraction part. It should be able to help you pick out the money values, and there is also a link to models trained for German language as well (https://nlp.stanford.edu/software/CRF-NER.shtml).
Given that it's fine to extract the entire sentence that contains the information, I'd suggest taking a binary sentence classification approach. You could probably get quite far just by using ngrams and some named entity information as features. That would mean that you'd need you'd want to build a pipeline that would automatically segment your documents into sentence-like chunks. You could try a sentence segmentation tool (also provided by Stanford CoreNLP) as a first go https://stanfordnlp.github.io/CoreNLP/. Since this would form the basis for all further work, you'd want to ensure that the results are at least decent. Perhaps the structure of the document itself gives you enough information to segment it without even using a sentence segmentation tool.
After you have this pipeline in place, you'd want to annotate the sentences extracted from a large set of documents as relevant or non-relevant to make it a binary classification task. Then train a model based on that dataset. Finally, when you apply it to unseen data, first use the sentence segmentation approach, and then classify each sentence.

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

Improving entity naming with custom file/code in NLTK

We've been working with the NLTK library in a recent project where we're
mainly interested in the named entities part.
In general we're getting good results using the NEChunkParser class.
However, we're trying to find a way to provide our own terms to the
parser, without success.
For example, we have a test document where my name (Shay) appears in
several places. The library finds me as GPE while I'd like it to find
me as PERSON...
Is there a way to provide some kind of a custom file/
code so the parser will be able to interpret the named entity as I
want it to?
Thanks!
The easy solution is to compile a list of entities that you know are misclassified, then filter the NEChunkParser output in a postprocessing module and replace these entities' tags with the tags you want them to have.
The proper solution is to retrain the NE tagger. If you look at the source code for NLTK, you'll see that the NEChunkParser is based on a MaxEnt classifier, i.e. a machine learning algorithm. You'll have to compile and annotate a corpus (dataset) that is representative for the kind of data you want to work with, then retrain the NE tagger on this corpus. (This is hard, time-consuming and potentially expensive.)

Resources