Substitute multiple word with single entity in chat text dataset - python-3.x

I have chat data with about 500k rows. I want to replace multiple variants of an entity [e.g. NEW YORK, New York, new york, Newyork] with a single canonical form, "New York", using Python.
I tried to do this with regex, but it takes too long to process, and I have many such words. Is there a faster alternative in Python?
Are there any good resources for learning more about the spaCy and Rasa APIs?

Could you provide a simple example of what you need to do, e.g. using a training object? Do you need to change the entity name or the entity value?
As for docs to study Rasa and spaCy: both have good documentation on their own sites/GitHub.
About Rasa, you can find good things here:
https://rasa.com/docs/nlu/
https://medium.com/rasa-blog
https://forum.rasa.com/
About SpaCy:
https://spacy.io/usage/
https://explosion.ai/blog/
Also, you can find more real examples in posts on Medium.
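On the speed problem from the original question: instead of running one substitution per variant, you can compile all variants into a single regex and make one pass over each row. A minimal sketch, with an illustrative variant map:

```python
import re

# Map every variant (lowercased) to its canonical form; illustrative only.
canonical = {
    "new york": "New York",
    "newyork": "New York",
}

# One compiled regex over all variants, longest first so that
# "new york" is tried before shorter alternatives.
pattern = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, canonical), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def normalize(text: str) -> str:
    # Single pass: each match is looked up in the canonical map.
    return pattern.sub(lambda m: canonical[m.group(0).lower()], text)

print(normalize("I love newyork and NEW YORK"))  # → I love New York and New York
```

For hundreds of patterns, one combined alternation like this is usually much faster than looping over per-variant re.sub calls; for very large keyword lists, a trie-based keyword matcher is another common option.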

Related

Is there a way to have a reference term in addition to a label with Doccano?

Hi, I would like to know if we can have something like the following example in Doccano.
Let's say we have a sentence like this: "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company), but I also want to say that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports:
- Sequence Labelling, good for Named Entity Recognition (NER)
- Text Classification, good e.g. for Sentiment Analysis
- Sequence To Sequence, good for Machine Translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest framing this as a NER problem and having different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break the dataset up into smaller entity-focused datasets. For example, you could take only documents that contain MS and label the mentions as one of the few synonyms.
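If you frame it as NER, a record in Doccano's sequence-labelling JSONL format would look roughly like this (the combined label name is just one illustrative way to encode the link):

```python
import json

# One Doccano-style sequence-labelling record: the text plus
# [start, end, label] character spans. Here the label encodes both
# the entity type and the linked term, as suggested above.
record = {
    "text": "MS is an IT company",
    "labels": [[0, 2, "Company-Microsoft"]],
}
line = json.dumps(record)
print(line)
```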

How to handle two entity extraction methods in NLP

I am using two different entity extraction methods (https://rasa.com/docs/nlu/entities/) while building an NLP model in the Rasa framework for a chatbot.
The bot should handle questions that contain custom entities as well as some general ones like location or organisation.
So I use both components, ner_spacy and ner_crf, to create the model. Afterwards I built a small helper script in Python to evaluate the model's performance, and noticed that the model struggles to choose the correct entity.
For example, for a word 'X' it chose the pre-defined entity 'ORG' from spaCy, but it should have been recognised as a custom entity that I defined in the training data.
If I just use the ner_crf extractor, I have big problems identifying location entities like capitals. Single-word answer entities are also one of my biggest problems.
Q : "What's your favourite animal?"
A : Dog
My model is not able to extract the single entity 'animal' from this one-word answer. If I answer the question with two words, like 'The Dog', the model has no problem extracting the animal entity with the value 'Dog'.
So my question is: is it sensible to use two different components to extract entities, one for custom entities and the other for pre-defined ones?
If I use two methods, which mechanism in the model decides which extractor is used?
By the way, I'm currently just testing things out, so my training set is not as big as it should be (fewer than 100 examples). Could the problem be solved with many more training examples?
You are facing two problems here. Here are a few approaches I have found helpful.
1. Custom entity recognition:
To improve this, add more training sentences containing entities of all possible lengths. ner_crf predicts better when there are identifiable markers around entities (e.g. prepositions).
2. Extracting entities from single-word answers:
As a workaround, I suggest the following manipulation on the client side.
When you send a question like "What's your favourite animal?", append a marker to tell the client that a single answer is expected. For example, send ##SINGLE## What's your favourite animal? to the client.
The client removes the ##SINGLE## marker from the question before showing it to the user. But when the client sends the user's response to the server, it doesn't send Dog; it sends something like User responded with single answer as Dog.
You can then train your model to extract entities from such an answer.
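The marker workaround above could be sketched like this (the marker string and the carrier sentence are just examples):

```python
MARKER = "##SINGLE##"

def prepare_question(question: str) -> str:
    # Server side: flag questions that expect a one-word answer.
    return f"{MARKER} {question}"

def wrap_answer(question: str, answer: str) -> str:
    # Client side: if the question carried the marker, wrap the bare
    # answer in a carrier sentence so ner_crf has context around it.
    if question.startswith(MARKER):
        return f"User responded with single answer as {answer}"
    return answer

q = prepare_question("What's your favourite animal?")
print(q.replace(MARKER, "").strip())  # what the user sees
print(wrap_answer(q, "Dog"))          # what goes back to the NLU server
```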

Is it possible to use DialogFlow simply to parse text?

Is it possible to use DialogFlow to simply parse some text and return the entities within that text?
I'm not interested in a conversation or bot-like behaviour, simply text in and list of entities out.
The entity recognition seems to be better with DialogFlow than Google Natural Language Processing and the ability to train might be useful also.
Cheers.
I've never considered this... but yeah, it should be possible. You would upload the entities with synonyms. Then, remove the "Default Fallback Intent", and make a new intent, called "catchall". Procedurally generate sentences with examples of every entity being mentioned, alone or in combination (in whatever way you expect to need to extract them). In "Settings", change the "ML Settings" so the "ML Classification Threshold" is 0.
In theory, it should now classify every input as "catchall", and return all the entities it finds...
If you play around with tagging things as @sys.any, this could be pretty effective...
However, you may want to look into something that is built for this. I have made cool stuff with Aylien's NLP API. They have entity extraction, and the free tier gives you 1,000 hits per day.
EDIT: If you can run some code, instead of relying on SaaS, you could check out Rasa NLU for Entity Extraction. With a SpaCy backend it would do well on recognizing pre-trained entities, and with a different backend you can use custom entities.
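The "procedurally generate sentences" step mentioned above could look something like this (the entity values and sentence templates are placeholders):

```python
import itertools

# Hypothetical entities with example values, mirroring what you would
# upload to DialogFlow as entities with synonyms.
entities = {
    "city": ["New York", "Boston"],
    "team": ["Red Sox", "Yankees"],
}

def generate_phrases():
    values = [v for synonyms in entities.values() for v in synonyms]
    # Each value mentioned alone...
    phrases = [f"I want info about {v}" for v in values]
    # ...and in ordered pairs, so the intent sees entities in combination.
    phrases += [f"Tell me about {a} and {b}"
                for a, b in itertools.permutations(values, 2)]
    return phrases

phrases = generate_phrases()
print(len(phrases))  # 4 singles + 12 ordered pairs = 16
```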

Label text documents - Supervised Machine Learning

I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then want to categorize them with labels like sports, politics, technology, etc. I've successfully stripped the message bodies out of my emails, and I'm now looking to start classifying.
To create labels like sports, technology, politics and entertainment, I need a set of words for each one to do the labelling. For example, the
Sports label would have the label data: Football, Soccer, Hockey…
Where can I find label data online to help me?
You can use DMOZ.
Be aware that there are different kinds of text. For example, one of the most common words in email text will be Hi or Hello, but in wiki text Hi and Hello will not be common words.
What you're trying to do is called topic modeling:
https://en.wikipedia.org/wiki/Topic_model
The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this.
A good place to start can be here:
https://nlp.stanford.edu/software/tmt/tmt-0.4/
You can look at their topics, and you can probably also use the tool to assign some initial topics to your data and then build on top of them.
You can use the BBC dataset.
It has labeled news articles which can help.
For feature extraction: remove stopwords, do stemming, use n-grams with tf-idf, and then choose the best features.
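That pipeline (stopword removal, then tf-idf weighting) can be sketched in plain Python. The stopword list below is a tiny illustrative one, and stemming and n-grams beyond unigrams are left out for brevity:

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "in", "are", "and", "to"}  # tiny illustrative list

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stopwords
    # (stemming and higher-order n-grams omitted for brevity).
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many docs each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return scores

docs = ["The football match is in the stadium",
        "The new phone is fast",
        "The election results are in"]
scores = tfidf(docs)
```

From here, "choosing the best features" could mean keeping the top-scoring terms per label, or using a proper selector such as chi-squared feature selection from a library like scikit-learn.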

How to extract full entities from a bunch of text (not partial entities)

This is probably a classical NLP problem, but how do I extract the FULL entity in a bunch of tweets?
For instance, suppose there are a bunch of tweets that mention "Boston" and "marathon" in the same tweet. How do I know I should extract "Boston marathon" and not just "Boston" or "marathon"?
Similarly, suppose there's a lot of tweets that mention "Game of Thrones". How would I know the entity to be extracted is Game of Thrones, not just Game?
Another thing to try may be extracting collocations: word pairs that co-occur more often than chance would predict, like "Boston marathon".
Most named entity recognisers use the so-called IOB (inside-outside-beginning) tagging scheme precisely because of the scenario you are asking about. For instance, the sentence
John saw Game of Thrones.
should be tagged as
John/B-PERSON saw/O Game/B-MISC of/I-MISC Thrones/I-MISC.
Notice how the second and third tokens of "Game of Thrones" are tagged as being inside a named entity, which begins at "Game". Of course, there is no guarantee that the tagger you are using will produce this exact sequence of tags.
You can read more about IOB in the NLTK book.
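Decoding IOB tags back into full entity spans is straightforward; a minimal sketch (the tagger output here is hand-written, matching the example above):

```python
def iob_to_spans(tokens, tags):
    # Merge B-/I- tags into full entity spans, so "Game of Thrones"
    # comes out as one entity rather than three separate tokens.
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["John", "saw", "Game", "of", "Thrones"]
tags = ["B-PERSON", "O", "B-MISC", "I-MISC", "I-MISC"]
print(iob_to_spans(tokens, tags))
# → [('John', 'PERSON'), ('Game of Thrones', 'MISC')]
```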
You can also try the DBpedia Spotlight endpoint:
http://spotlight.dbpedia.org/rest/spot/?text=
I'm currently extracting named entities from an event database. I've tried several libraries: NLTK, PHP scripts, etc. But the best one I have found is Stanford NER: http://nlp.stanford.edu:8080/ner/.
With the english.all.3class.distsim.crf.ser.gz classifier:
<PERSON>John</PERSON> saw Game of Thrones.
With the english.conll.4class.distsim.crf.ser.gz classifier:
John saw <ORGANIZATION>Game of Thrones</ORGANIZATION>.
Just ignore the classified type.
I use the different classifiers to extract the entities from the text. After that I run the Stanford Parser (http://nlp.stanford.edu:8080/parser/) and look at the collapsed typed dependencies:
nsubj(saw-2, John-1)
root(ROOT-0, saw-2)
dobj(saw-2, Game-3)
prep_of(Game-3, Thrones-5)
I use the Stanford dependencies manual (http://nlp.stanford.edu/software/dependencies_manual.pdf) to decide which named entities I want to keep or not.
