How to convert text to JSON: a way to standardize outputs with Hugging Face and PyTorch

Total newbie to Hugging Face and AI here.
My goal is to convert an input text to a standardized structure that would allow me, later on, to process tabulated data in JSON format.
For example,
Input: “Give me a list of all clients having purchased milk”
Output: {"intention": "retrieve", "object": "client", "conditions": ["purchase", "milk"]}
Input: “Please, machine, do me a favor and delete users not having logged in after 2022”
Output: {"intention": "delete", "object": "user", "conditions": ["logged-in", "2022-12-31"]}
The output JSON structure has fixed keys (intention, object, conditions), and values can be either discrete (for example, intention can only be one of ['retrieve', 'delete', 'modify']) or variable (for example, conditions can contain any piece of data).
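To make the target structure concrete, it could be written down as a small validator. This is only a sketch: the name validate_output and the two constants are made up for illustration, and it assumes that object is always a string and conditions is always a list.

```python
import json

# Assumed from the description above: three fixed keys, three allowed intentions.
ALLOWED_INTENTIONS = {"retrieve", "delete", "modify"}
REQUIRED_KEYS = {"intention", "object", "conditions"}


def validate_output(raw: str) -> dict:
    """Parse a JSON string and check it against the fixed structure."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"expected exactly the keys {sorted(REQUIRED_KEYS)}")
    if data["intention"] not in ALLOWED_INTENTIONS:
        raise ValueError(f"unknown intention: {data['intention']!r}")
    if not isinstance(data["object"], str):
        raise ValueError("object must be a string")
    if not isinstance(data["conditions"], list):
        raise ValueError("conditions must be a list")
    return data


example = '{"intention": "retrieve", "object": "client", "conditions": ["purchase", "milk"]}'
result = validate_output(example)
```

Whatever model ends up producing the output, a check like this would catch structures that do not match before they are processed further.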
My approach would be to use named entity recognition (NER) to identify the relevant entities and their properties, and syntactic parsing to determine the structure of the user’s prompt. For example, “Give me a list” would set the intention entity to retrieve.
After reading, watching, and practicing, I think I’m now totally lost and not even sure the NER approach is advisable in this context.
Any help would be much appreciated!

Related

How to handle two entity extraction methods in NLP

I am using two different entity extraction methods (https://rasa.com/docs/nlu/entities/) while building my NLP model in the RASA framework to build a chatbot.
The bot should handle different questions which have custom entities as well as some general ones like location or organisation.
So I use both components, ner_spacy and ner_crf, to create the model. After that I built a small helper script in Python to evaluate the model performance. There I noticed that the model struggles to choose the correct entity.
For example, for a word 'X' it chose the pre-defined entity 'ORG' from spaCy, but it should be recognized as a custom entity which I defined in the training data.
If I just use the ner_crf extractor, I face huge problems in identifying location entities like capitals. Also, one of my biggest problems is single-answer entities.
Q: "What's your favourite animal?"
A: Dog
My model is not able to extract the single entity 'animal' from this one-word answer. If I answer this question with two words like 'The Dog', the model has no problem extracting the animal entity with the value 'Dog'.
So my question is: is it sensible to use two different components to extract entities, one for custom entities and the other for pre-defined entities?
If I use two methods, what mechanism in the model decides which extractor is used?
By the way, I'm currently just testing things out, so my training set is not as large as it should be (fewer than 100 examples). Could the problem be solved if I had many more training examples?
You are facing two problems here. Here are a few approaches that I found helpful.
1. Custom entity recognition:
To solve this, you need to add more training sentences covering all possible entity lengths. ner_crf predicts better when there are identifiable markers around entities (e.g. prepositions).
2. Extracting entities from a single-word answer:
As a workaround, I suggest doing the following manipulations on the client end.
When you send a question like "What's your favorite animal?", prepend a marker to the question to tell the client that a single-word answer is expected. For example, you can send "##SINGLE## What's your favorite animal?" to the client.
The client removes the ##SINGLE## marker from the question before showing it to the user. But when the client sends the user's response to the server, it does not send "Dog"; it sends something like "User responded with single answer as Dog".
You can then train your model to extract entities from such an answer.
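The client-side part of this workaround could look roughly like the sketch below. The function names prepare_question and prepare_answer are hypothetical, and the wrapper sentence is just the example wording from above; it only needs to match the sentences the model was trained on.

```python
from typing import Tuple

MARKER = "##SINGLE##"
# Wrapper sentence taken from the example above; adjust it to your training data.
TEMPLATE = "User responded with single answer as {answer}"


def prepare_question(bot_message: str) -> Tuple[str, bool]:
    """Strip the marker and report whether a single-word answer is expected."""
    if bot_message.startswith(MARKER):
        return bot_message[len(MARKER):].strip(), True
    return bot_message, False


def prepare_answer(user_text: str, expects_single: bool) -> str:
    """Wrap a short answer in a full sentence before sending it to the server."""
    if expects_single:
        return TEMPLATE.format(answer=user_text.strip())
    return user_text


question, single = prepare_question("##SINGLE## What's your favorite animal?")
payload = prepare_answer("Dog", single)
```

Here question is what the user sees, and payload is what gets sent back to the NLU server instead of the bare word.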

Can we test or evaluate entity extraction in Rasa NLU?

Is it possible to evaluate how well my model extracts entities (and maps synonym values) in Rasa NLU?
I have tried the rasa_nlu -evaluate mode; however, it seems to only work for intent classification, although my JSON data file contains entity information, and I'd really like to know whether my entity extraction is up to the mark in various scenarios. I've used Tracy to generate the test dataset.
Actually, yes: you should get a score for your entities as well.
Are you sure you added some to your training data?
Do you have an NER component in your pipeline that extracts them? Something like this:
pipeline:
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  batch_size: 64
  epochs: 1500
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "ner_crf"
ner_crf uses a conditional random field to perform named entity recognition.
To make sure you follow the model building correctly have a look at this tutorial:
https://hackernoon.com/build-simple-chatbot-with-rasa-part-1-f4c6d5bb1aea
As the documentation (https://rasa.com/docs/nlu/0.12.0/evaluation/) says, if you are using either ner_crf or ner_duckling, the evaluation method automatically takes entity extraction performance into account. If you only use ner_synonyms, the evaluate method won't compute an output table.
Other possible pitfalls could be:
If you parse a single sentence that includes a desired entity, does your trained model extract it? If not, this could indicate that your model was not able to learn a pattern for recognizing entities.
Another problem could be that, after randomly splitting the data into train and test sets, there is no entity in your test set to extract. Your algorithm could have learned the pattern but never gets the chance to apply it. Did you check whether your test set contains entities?
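A quick way to check the second pitfall is to count the entity labels on each side of the split. The sketch below assumes the Rasa NLU JSON training format (rasa_nlu_data / common_examples); the helper names and the sample data are made up, and the split function is only a stand-in for however you actually split your data.

```python
import random
from collections import Counter


def count_entities(examples):
    """Count how often each entity label occurs in a list of examples."""
    counts = Counter()
    for example in examples:
        for entity in example.get("entities", []):
            counts[entity["entity"]] += 1
    return counts


def split_examples(examples, test_fraction=0.2, seed=0):
    """Random train/test split (stand-in for your own splitting code)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Tiny made-up dataset in the Rasa NLU JSON layout.
data = {"rasa_nlu_data": {"common_examples": [
    {"text": "I like my dog", "intent": "inform",
     "entities": [{"start": 10, "end": 13, "value": "dog", "entity": "animal"}]},
    {"text": "Dog", "intent": "inform",
     "entities": [{"start": 0, "end": 3, "value": "Dog", "entity": "animal"}]},
    {"text": "I live in Berlin", "intent": "inform",
     "entities": [{"start": 10, "end": 16, "value": "Berlin", "entity": "city"}]},
    {"text": "hello", "intent": "greet", "entities": []},
    {"text": "bye", "intent": "goodbye", "entities": []},
]}}

examples = data["rasa_nlu_data"]["common_examples"]
train, test = split_examples(examples)
# Entity labels that occur in training but never in the test split.
missing_in_test = set(count_entities(train)) - set(count_entities(test))
```

If missing_in_test is not empty, the evaluation cannot say anything about those entity types, no matter how well the model learned them.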
If I understand correctly, perhaps you are interested in something like https://github.com/RasaHQ/rasa_nlu/issues/1472? That issue was written because for intents you can get an overall score and see how each intent was classified, but for entities you can only get the overall score and not how each entity was classified.
So in short, this is still an open issue and not possible in Rasa. However, it is an issue I was asked to look at just yesterday, so I will let you know if I make any progress on it.

Parsing addresses with ambiguous data

I have phone numbers and village names collected from villagers via forms. For various reasons, the data is inaccurate or incomplete.
The idea is to validate these two data points before adding them to the database/store.
The phone numbers are being formatted programmatically and validated via an external API. (That gives me the service provider and province information).
The problem is with the addresses.
No standardized address line. Tons of ambiguity.
Numeric street names and door numbers exist.
Input string will sometimes contain an addressee.
Possible solutions I can think of:
Reverse geocoding helps, but it is not very accurate in the Indian context. The Google TOS also prohibits automated queries (correct me if I'm wrong here).
Soundexing. Again, not very accurate with Indian data.
I understand it's difficult to parse such highly unstructured data, but I'm looking for ways to achieve at least enough accuracy to map addresses to the nearest point of interest.
Queries
Given a village name from a villager who might misspell or abbreviate it, how do I get the correct official name of the village and its location?
Are there any ways to sanitize bad locations/addresses or decode complex or poorly formed addresses?
Are there any machine learning solutions that can help, so that the system learns from every computation? (I have zero knowledge of ML, so do correct me if I'm wrong here.)
What you want is a geolocation system that works with informal text input. I have previously used a text-based geolocation model trained on Twitter data.
To solve your problem, you need training data in the form of:
informal_text village_name
If you have access to such data (e.g. using the addresses which can be geolocated) then you can train a text-based classifier that given a new informal address can predict where on the map it points to. In your case every village becomes a class label. You can use scikit-learn to train the classifier.
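A minimal sketch of such a classifier with scikit-learn is shown below. The training pairs are invented for illustration, and the choice of character n-gram TF-IDF features with logistic regression is an assumption on my part (character n-grams tend to tolerate spelling variation); other features or classifiers may work better on real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented (informal_text, village_name) pairs; real pairs would come from
# addresses that could already be geolocated.
training_pairs = [
    ("rampur", "Rampur"),
    ("ram pur vill", "Rampur"),
    ("rampoor post", "Rampur"),
    ("kodaikanal", "Kodaikanal"),
    ("kodai kanal tk", "Kodaikanal"),
    ("kodaikanel", "Kodaikanal"),
    ("bhimtal", "Bhimtal"),
    ("bheemtal", "Bhimtal"),
    ("bhim tal near lake", "Bhimtal"),
]
texts = [text for text, _ in training_pairs]
labels = [village for _, village in training_pairs]

# Character n-grams within word boundaries, so misspellings still share features.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predict the village (class label) for a new, misspelled input.
prediction = model.predict(["Kodaikanl"])[0]
```

Each new address that gets validated by hand can be appended to the training pairs and the model refitted, which is how the system would keep learning over time.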

Where to find a state-of-the-art relation extraction dataset

I am looking for a dataset which contains large quantities of relation tuples. For example, a search for "people" and "location" yields "lives in", "worked in", etc. University of Washington's OpenIE (http://OpenIE.cs.washington.edu) is a good tool, but their dataset is only accessible through the web. Where can I download a database or library like this?
I've been collecting all the public datasets containing relationships between named-entities or nominals.
You can find them here:
https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets
OpenIE itself provides a large dataset of 11 GB for this purpose. Check this:
http://knowitall.cs.washington.edu/paralex/
Although it is an automatic answering system, you can consider its intermediate relation extraction results for your purpose.
Another method you could implement is syntactic parsing. Use a syntactic parser and write rules to extract the subject, object, and other entities as per your requirement.

What is the best method to extract relevant info from Email?

My friend has a small business where customers order services by email. He receives several emails a day, and sorting through them is becoming cumbersome.
There are about 10 different kinds of tasks the customer can request, and for each there are one or two words that specify it. The other info present in the emails is the place where the service is to be delivered, the time, and the names of the people involved. The email also contains an ID, a long number with a fairly standard format.
The emails are very unstructured, but all contain the key info above. My question is: what is the best method to sweep through these emails and extract the key info (such as type of service, place, people's names, the ID, etc.)?
I thought about some kind of pre-processing, then passing the result through AlchemyAPI, and then testing the Alchemy output using neural networks for each feature (key info). This can be supervised learning, as I can run a feedback loop all the time: once the info is entered, I can have someone validate it.
Any ideas? Thanks
I guess some parts (ID, task, time) can be captured by a regular expression and dictionary matching. Have a look at GATE's JAPE tool.
It should be fairly easy to assemble a dictionary and then use lookups for the "task". You can also reuse the available JAPE rules for date/time and write a new one for the ID (a simple regex could also be fine).
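Outside of GATE, the same dictionary-plus-regex idea can be sketched in plain Python. Everything specific here is an assumption: the task keywords are invented, the ID is assumed to be a run of at least 10 digits, and the time pattern only covers simple "HH:MM" forms.

```python
import re

# Hypothetical task dictionary: trigger word -> canonical task name.
TASK_KEYWORDS = {
    "install": "installation",
    "installation": "installation",
    "repair": "repair",
    "fix": "repair",
    "inspection": "inspection",
}
# Assumed ID format: 10 or more consecutive digits; adjust to the real format.
ID_PATTERN = re.compile(r"\b\d{10,}\b")
# Rough 24-hour time pattern such as "14:30" or "9:05".
TIME_PATTERN = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")


def extract_fields(email_body: str) -> dict:
    """Extract task, ID and time; a field is None when nothing matches."""
    words = re.findall(r"[a-z]+", email_body.lower())
    # First word found in the dictionary decides the task.
    task = next((TASK_KEYWORDS[w] for w in words if w in TASK_KEYWORDS), None)
    id_match = ID_PATTERN.search(email_body)
    time_match = TIME_PATTERN.search(email_body)
    return {
        "task": task,
        "id": id_match.group(0) if id_match else None,
        "time": time_match.group(0) if time_match else None,
    }


sample = "Hi, please fix the heater at 12 Main St tomorrow at 14:30. Order 20230012345678."
fields = extract_fields(sample)
```

Fields that come back as None are good candidates for the manual validation step mentioned in the question.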
For matching the location and people's names you should be careful: OpenCalais and AlchemyAPI can give you good results if names and places are used in well-defined sentences, and will probably make more mistakes with tabular or unusual formats. Also, you can never be sure you captured the place and person correctly, so don't rely on that for processing orders directly.
If you have more information about the mails' structure or the expected names and places (e.g. you have a "clients" table with all possible names), you would probably want to do your own tagging; otherwise I'd stick to OpenCalais or AlchemyAPI plus some regular expressions.
P.S. I assume all mails are in English.

Resources