How to handle two entity extraction methods in NLP

I am using two different entity extraction methods (https://rasa.com/docs/nlu/entities/) while building my NLP model in the RASA framework to build a chatbot.
The bot should handle different questions which have custom entities as well as some general ones like location or organisation.
So I use both components, ner_spacy and ner_crf, to create the model. After that I built a small helper script in Python to evaluate the model's performance. There I noticed that the model struggles to choose the correct entity.
For example, for a word 'X' it chose the pre-defined entity 'ORG' from SpaCy, but it should be recognized as a custom entity that I defined in the training data.
If I use only the ner_crf extractor, I have serious problems identifying location entities such as capitals. Another of my biggest problems is single-answer entities.
Q: "What's your favourite animal?"
A: Dog
My model is not able to extract the single entity 'animal' from this one-word answer. If I answer the question with two words, like 'The Dog', the model has no problem extracting the animal entity with the value 'Dog'.
So my question is: is it sensible to use two different components to extract entities, one for custom entities and the other for pre-defined entities?
If I use two methods, what mechanism in the model decides which extractor is used?
By the way, I'm currently just testing things out, so my training set is not as large as it should be (fewer than 100 examples). Could the problem be solved by having many more training examples?

You are facing two problems here. Here are a few approaches that I found helpful.
1. Custom entity recognition:
To solve this, you need to add more training sentences covering all possible lengths of entities. ner_crf predicts better when there are identifiable markers around entities (e.g. prepositions).
2. Extracting entities from a single-word answer:
As a workaround, I suggest doing the following manipulation on the client side.
When you send a question like "What's your favorite animal?", add a marker to the question to indicate to the client that a single-word answer is expected. For example, you can send "##SINGLE## What's your favorite animal?" to the client.
The client can remove the ##SINGLE## marker from the question and show it to the user. But when the client sends the user's response to the server, it doesn't send "Dog"; it sends something like "User responded with single answer as Dog".
You can train your model to extract entities from such an answer.
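To make the client-side part of this workaround concrete, here is a minimal sketch in Python. The function names prepare_question and wrap_answer are hypothetical, and wrapping only answers that consist of exactly one word is my own assumption; adapt both to your client.

```python
SINGLE_MARKER = "##SINGLE##"
ANSWER_TEMPLATE = "User responded with single answer as {answer}"


def prepare_question(raw_question):
    """Strip the marker and report whether a single-word answer is expected.

    Returns a (text_to_display, expects_single_answer) tuple.
    """
    if raw_question.startswith(SINGLE_MARKER):
        return raw_question[len(SINGLE_MARKER):].strip(), True
    return raw_question, False


def wrap_answer(user_answer, expects_single_answer):
    """Wrap a one-word answer in a carrier sentence before sending it on.

    Assumption: answers with more than one word are sent unchanged,
    since the model already handles them.
    """
    answer = user_answer.strip()
    if expects_single_answer and len(answer.split()) == 1:
        return ANSWER_TEMPLATE.format(answer=answer)
    return answer
```

The carrier sentence gives ner_crf some context around the entity, so the training data should contain examples written in exactly that wrapped form.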


Can we test or evaluate entity extraction in Rasa NLU?

Is it possible to evaluate how well my model extracts entities (and maps synonym values) in Rasa NLU?
I have tried the rasa_nlu -evaluate mode; however, it seems to work only for intent classification, although my JSON data file contains entity information, and I'd really like to know whether my entity extraction is up to the mark across various scenarios. I've used Tracy to generate the test dataset.
Actually, yes: you should get scores for your entities as well.
Are you sure you added some to your training data?
Do you have an NER component in your pipeline that extracts them? Something like this:
pipeline:
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  batch_size: 64
  epochs: 1500
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "ner_crf"
ner_crf is a conditional random field component that performs named entity recognition.
To make sure you follow the model building correctly have a look at this tutorial:
https://hackernoon.com/build-simple-chatbot-with-rasa-part-1-f4c6d5bb1aea
As the documentation (https://rasa.com/docs/nlu/0.12.0/evaluation/) says, if you are using either ner_crf or ner_duckling, the evaluation method automatically takes entity extraction performance into account. If you only use ner_synonyms, the evaluate method won't compute an output table.
Other possible pitfalls could be:
If you parse a single sentence that includes a desired entity, does your trained model extract an entity? If it does not, that suggests your model was not able to learn a pattern for recognizing entities.
Another problem could be that, after randomly splitting the data into train and test sets, there is no entity left in your test set to extract. Your algorithm could have learned the pattern but never gets the chance to apply it. Did you check whether your test set contains entities?
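A quick way to check the last point is to count the entity annotations in your test file. This is only a sketch: it assumes the Rasa NLU JSON layout with a top-level "rasa_nlu_data" object holding "common_examples", and count_entities is a hypothetical helper name.

```python
from collections import Counter


def count_entities(dataset):
    """Count annotated entities per entity type in a Rasa NLU JSON dataset.

    Assumption: examples live under rasa_nlu_data -> common_examples and
    each example may carry an "entities" list of dicts with an "entity" key.
    """
    examples = dataset.get("rasa_nlu_data", {}).get("common_examples", [])
    counts = Counter()
    for example in examples:
        for entity in example.get("entities", []):
            counts[entity["entity"]] += 1
    return counts
```

Load the test split with json.load and pass the resulting dict in. An empty Counter means the evaluation has no entities to score at all.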
If I understand right, perhaps you are interested in something like https://github.com/RasaHQ/rasa_nlu/issues/1472? So, this issue was written because for intents you could get overall score and you could see how each intent was classified, but you could only get the overall score for entities and not how each entity was classified.
So in short, this is still an open issue and not possible in Rasa. However, it was an issue I was asked to look at just yesterday, so I will let you know if I make any progress on it.
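In the meantime, a per-entity breakdown can be computed by hand if you have the gold annotations and the parsed predictions side by side. The sketch below is not part of Rasa; per_entity_scores is a hypothetical helper, and it assumes exact span matching, with each sentence represented as a set of (start, end, entity) tuples.

```python
from collections import defaultdict


def per_entity_scores(gold, predicted):
    """Compute precision and recall per entity type with exact span matching.

    gold and predicted are lists with one set of (start, end, entity)
    tuples per sentence, aligned by position.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for gold_spans, pred_spans in zip(gold, predicted):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1
            else:
                fp[span[2]] += 1
        for span in gold_spans:
            if span not in pred_spans:
                fn[span[2]] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        scores[label] = {
            # Report 0.0 when a label was never predicted or never annotated.
            "precision": tp[label] / predicted_total if predicted_total else 0.0,
            "recall": tp[label] / gold_total if gold_total else 0.0,
        }
    return scores
```

A span that is found at the right position but labelled with the wrong type counts as a false positive for the predicted type and a false negative for the gold type.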

Is it possible to use DialogFlow simply to parse text?

Is it possible to use DialogFlow to simply parse some text and return the entities within that text?
I'm not interested in a conversation or bot-like behaviour, simply text in and list of entities out.
The entity recognition seems to be better with DialogFlow than Google Natural Language Processing and the ability to train might be useful also.
Cheers.
I've never considered this... but yeah, it should be possible. You would upload the entities with synonyms. Then, remove the "Default Fallback Intent", and make a new intent, called "catchall". Procedurally generate sentences with examples of every entity being mentioned, alone or in combination (in whatever way you expect to need to extract them). In "Settings", change the "ML Settings" so the "ML Classification Threshold" is 0.
In theory, it should now classify every input as "catchall", and return all the entities it finds...
If you play around with tagging things as sys.any, this could be pretty effective...
However, you may want to look into something that is built for this. I have made cool stuff with Aylien's NLP API. They have entity extraction, and the free tier gives you 1,000 hits per day.
EDIT: If you can run some code, instead of relying on SaaS, you could check out Rasa NLU for Entity Extraction. With a SpaCy backend it would do well on recognizing pre-trained entities, and with a different backend you can use custom entities.

Sentence correction using NLP

I'm trying to build a chat assistant on my website, and it should answer queries like "Can you track my order?" and "How's performance of XXX". The majority of the work lies in understanding the user's query.
I'm using named entity recognizers and text parsers to process the queries. Before this, I pass the query through a spell checker to correct errors such as
Can you track my ordr?
to
Can you track my order?
It's working in most of the cases but failing in cases like,
Can you track my water?
In this case, the spelling corrector doesn't correct the word 'water', and the NER is not able to identify the entity as 'order'.
The problem is 'Can you track my water?' may be a correct sentence in some other context but it's definitely a mistake in my context (domain). So I should be able to correct this sentence.
I'm stuck here.
Is there any way I can correct these sentences using predefined queries and/or statistical data from user-entered queries?
I don't know of a way you can change "water" to "order".
But if you have a predefined set of questions then you may give the user suggestions to select from, just before he submits the question.
NER can only recognize and classify entities; it should not be used to replace parts of sentences, because the user may have intended what he said.
What you can do is suggest the most probable word based on your set.
References:
What is the best way to find the most similar sentence?
Find semantically similar word
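As a rough illustration of the suggestion idea, the sketch below ranks a predefined set of questions by string similarity using Python's standard difflib module. The query list, the suggest_queries name, and the 0.6 cutoff are assumptions for the example; a semantic similarity measure would be a drop-in replacement for the string ratio.

```python
import difflib

# Hypothetical set of predefined questions for the domain.
KNOWN_QUERIES = [
    "Can you track my order?",
    "Can you cancel my order?",
    "How is the performance of XXX?",
]


def suggest_queries(user_text, known_queries=KNOWN_QUERIES, limit=3, cutoff=0.6):
    """Return up to `limit` known queries that look most like the user text."""
    # Compare case-insensitively, but return the original spelling.
    by_lowercase = {query.lower(): query for query in known_queries}
    matches = difflib.get_close_matches(
        user_text.lower(), list(by_lowercase), n=limit, cutoff=cutoff
    )
    return [by_lowercase[match] for match in matches]
```

The user then picks one of the suggestions, so nothing is rewritten behind their back.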
You could use n-gram models to find the most probable word and then substitute it. In your case, you would replace the word ordr with the word order. If you want to go deeper, you could use a machine learning model to handle the issue.
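Here is a minimal sketch of the n-gram idea using bigrams built from domain queries. All names are hypothetical, and the replacement rule is an assumption of mine: any word the domain vocabulary has never seen is swapped for the most frequent follower of the previous word. That is deliberately aggressive, so in practice you would want a confidence threshold or a confirmation step.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_bigrams(queries):
    """Collect follower counts per word and the domain vocabulary."""
    followers = defaultdict(Counter)
    vocab = set()
    for query in queries:
        tokens = tokenize(query)
        vocab.update(tokens)
        for previous, word in zip(tokens, tokens[1:]):
            followers[previous][word] += 1
    return followers, vocab


def correct(query, followers, vocab):
    """Replace out-of-domain words with the likeliest follower of the previous word."""
    tokens = tokenize(query)
    corrected = tokens[:1]  # The first token has no left context, so keep it.
    for word in tokens[1:]:
        candidates = followers.get(corrected[-1])
        if word not in vocab and candidates:
            word = candidates.most_common(1)[0][0]
        corrected.append(word)
    return " ".join(corrected)
```

With a handful of order-tracking queries as the corpus, both "ordr" and "water" after "my" would be replaced by "order", because neither word appears in the domain vocabulary.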

How can I use Natural Language Processing to check if a paragraph contains predefined topics?

We have a system that allows users to answer a question as free text and we want to check whether their answer contains any of our predefined topics. These topics will be defined prior to answers being submitted.
We tried to use a method similar to spam detection, but this is only good for determining whether something is true/false, incorrect/correct. We need the response to say which of the predefined topics a piece of text contains. Is there an algorithm that would solve this problem?
You could try "bag of words" for feature extraction and a "naive Bayes classifier with a multinomial model" for classification.
This page describes it in more detail: link.
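To show what that combination looks like, here is a small pure-Python sketch of a bag-of-words multinomial naive Bayes with add-one smoothing. The function names and the idea of returning a ranked list of topics are my own assumptions; in a real project a library implementation would be the more sensible choice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (the bag of words)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(samples):
    """Build word and document counts from (text, topic) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, topic in samples:
        tokens = tokenize(text)
        word_counts[topic].update(tokens)
        doc_counts[topic] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def rank_topics(model, text):
    """Return all topics ordered from most to least probable for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    scores = {}
    for topic, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][topic]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokens:
            # Add-one (Laplace) smoothed log likelihood.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        scores[topic] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Because the result is a ranking rather than a yes/no decision, you can report the top topic or keep every topic whose score is close to the best one when a text may cover several topics.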
You could also try explicit semantic analysis (ESA)[1][2]. Given a set of documents that represent concepts (in your case, your topics), you can train a model, and given any new sentence as input you get a ranked list of the closest concepts "evoked" by that sentence. Of course, this assumes you have a document with some text describing every concept you want to identify (that's why the most common approach is to use Wikipedia pages as concepts), but if that is the case you could give it a try.
[1] https://en.wikipedia.org/wiki/Explicit_semantic_analysis
[2] http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf

Text classification using Java

I need to categorize a text or word to a particular category. For example, the text 'Pink Floyd' should be categorized as 'music' or 'Wikimedia' as 'technology' or 'Einstein' as 'science'.
How can this be done? Is there a way I can use the DBpedia for the same? If not, the database has to be trained from time to time, right?
This is a text classification problem. Manning, Raghavan and Schütze's Information Retrieval book chapter is a nice introduction. I think you do not need DBPedia nor NER for this, just a small labeled training data set with enough labeled examples for all of your classes.
Yes, DBpedia may be a good choice for this kind of problem. You'll have to:
1. squash the DBpedia category structure so you get the right granularity (e.g., Pink Floyd is listed under Capitol Records artists and a host of other categories, but not directly under Music). Maybe pick a few large categories and try to find whether your concepts are listed indirectly in them;
2. normalize text: Einstein is listed as Albert Einstein, not einstein;
3. deal with ambiguity due to terms describing multiple concepts and concepts belonging to multiple top-level categories.
These problems may be solvable using machine learning, but I only see how it can be done if you extract these terms, along with relevant features, from running text. But in that case, you might just as well classify the entire text into one of the categories you choose in step 1.
This is the well-studied named entity recognition problem. Unless you have a particular need to roll your own technology (hint: it's a hard problem in general), using Gate, or perhaps one of the online services that builds on it (e.g. TSO's Data Enrichment Service), would be a good option. An alternative online service is OpenCalais.
Map your categories to DBpedia.
Index the selected DBpedia categories with Lucene and label the data with your category names.
Run a search for your data; tokenization and normalization will be done by Lucene.
This approach is somewhat related to KNN classification.
Yes, DBpedia is a good choice for text classification, as you can use its predicates/relations to query and extract meaningful information for a particular category.
You can look into the endpoint for querying DBpedia:
http://dbpedia.org/sparql
Further, learn the basic syntax of SPARQL to query on the endpoint from the following link:
http://www.w3.org/TR/rdf-sparql-query/
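As a starting point, the sketch below builds a request URL that asks the endpoint for the categories of a resource. It is written in Python only to keep it short; the same request can be sent from Java. The helper names are hypothetical, and two details are assumptions: that categories are attached through the dct:subject predicate, and that the endpoint accepts the query and format parameters shown.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"


def build_category_query(resource_name):
    """Build a SPARQL query listing the categories of a DBpedia resource.

    Assumption: categories are linked with the dct:subject predicate.
    """
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?category WHERE {\n"
        f"  <http://dbpedia.org/resource/{resource_name}> dct:subject ?category .\n"
        "}\n"
        "LIMIT 50"
    )


def build_request_url(resource_name):
    """Build the GET URL for the query, asking for JSON results.

    Assumption: the endpoint reads the `query` and `format` parameters.
    """
    params = {
        "query": build_category_query(resource_name),
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"
```

Calling build_request_url("Pink_Floyd") gives a URL you can fetch with any HTTP client. The categories returned are fine-grained, so they still have to be mapped to broad labels such as 'music'.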
