Is training phrases order guaranteed? - dialogflow-es

I would like to tie each Dialogflow training phrase to our app's record via its name (the Dialogflow term for what is usually called a 'key', 'internal-id', or 'code'). To do that, when creating an intent (which has several training phrases) via the Dialogflow API (e.g. batch_update_intents), some binding is required between the input parameter for each training phrase and the training phrase Dialogflow creates (which comes back with a generated name).
Here, 'tie' or 'bind' means that my app's record refers to a Dialogflow training phrase by its unique name. For example, tp1 below refers to the Dialogflow training phrase "Is today fine?" with the name '9ed938...':
| Training Phrase        | My APP | Dialogflow name | Dialogflow parts       |
|------------------------|--------|-----------------|------------------------|
| "Is today fine?"       | tp1    | 9ed938...       | "Is", "today", ...     |
| "What weather today?"  | tp2    | b3415c...       | "What", "weather", ... |
If the order of the created training phrases is guaranteed to be exactly the same as the order of the training phrases in the input parameters, it is OK to bind them by position. Otherwise, there is no way to tie them (or perhaps by matching on the training phrase text?).
So my question is: is the order of the created training phrases guaranteed to match the order of the input parameters?

Assuming that Google uses the public Protobuf definitions for Dialogflow internally, the training phrases of an intent are stored as a repeated field, which does preserve the order of its entries. That, and the fact that the external API uses a JSON array, which is also supposed to preserve its order, should make it possible to rely on the order in which you created them.
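If you do rely on that, a minimal sketch of binding by position might look like the following, using the google-cloud-dialogflow Python client (shown with create_intent for brevity; the same idea applies to batch_update_intents). The project ID, display name, and the local keys tp1/tp2 are placeholders standing in for your app's records:

from google.cloud import dialogflow

# Hypothetical project ID and local records; only the positional binding matters here.
project_id = "my-project-id"
records = {"tp1": "Is today fine?", "tp2": "What weather today?"}

intents_client = dialogflow.IntentsClient()
parent = dialogflow.AgentsClient.agent_path(project_id)

training_phrases = [
    dialogflow.Intent.TrainingPhrase(parts=[dialogflow.Intent.TrainingPhrase.Part(text=text)])
    for text in records.values()
]
intent = dialogflow.Intent(display_name="weather", training_phrases=training_phrases)

# Request the full view so the generated training-phrase names come back in the response.
response = intents_client.create_intent(
    request={
        "parent": parent,
        "intent": intent,
        "intent_view": dialogflow.IntentView.INTENT_VIEW_FULL,
    }
)

# Bind app record keys to the generated names by position (relying on order preservation).
bindings = dict(zip(records.keys(), (tp.name for tp in response.training_phrases)))
print(bindings)  # e.g. {'tp1': '9ed938...', 'tp2': 'b3415c...'}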

Related

FastText inconsistent on one-label model classification

I'm using the official FastText Python library (v0.9.2) for intent classification.
import fasttext
model = fasttext.train_supervised(input='./test.txt',
                                  loss='softmax',
                                  dim=200,
                                  bucket=2000000,
                                  epoch=25,
                                  lr=1.0)
Where test.txt contains just one sample, like:
__label__greetings hi
and when I predict two utterances, the results are:
print(model.words)
print('hi', model.predict('hi'))
print('bye', model.predict('bye'))
app_1 | ['hi']
app_1 | hi (('__label__greetings',), array([1.00001001]))
app_1 | bye ((), array([], dtype=float64))
This is my expected output; meanwhile, if I set two samples for the same label:
__label__greetings hi
__label__greetings hello
The result for the OOV utterance is not correct:
app_1 | ['hi', '</s>', 'hello']
app_1 | hi (('__label__greetings',), array([1.00001001]))
app_1 | bye (('__label__greetings',), array([1.00001001]))
I understand that the problem is with the </s> token (maybe from the \n in the text file?): when there isn't any word in the vocabulary, the text is replaced by </s>. Is there any training option or way to skip this behavior?
Thanks!
In addition to gojomo's answer, we can say that your training dataset is far too small.
If you don't have a significant annotated dataset, you can try zero shot classification: starting from a pretrained language model, you only set some labels and let the model try to classify sentences.
Here you can see and test an interesting demo.
Read also this good article about zero shot classification, with theory and implementation.
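For instance, a tiny sketch of zero-shot classification using the Hugging Face transformers pipeline (not tied to the demo or article linked above; the model name and candidate labels are just examples):

from transformers import pipeline

# A pretrained NLI model does the classification; no task-specific training data is needed.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier("hi there, how are you?",
                    candidate_labels=["greetings", "farewell", "other"])
print(result["labels"][0], result["scores"][0])  # top label and its score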
FastText is a big, data-hungry algorithm that starts with random-initialization. You shouldn't expect results to be sensible or indeed match any set of expectations on toy-sized datasets - where (for example) 100%-minus-epsilon of your n-gram buckets won't have received any training.
I also wouldn't expect supervised mode to ever reliably predict no labels on realistic data-sets – it expects all of its training data to have labels, and I've not seen mention of its use to predict an implied 'ghost' category of "not in training data" versus a single known label (as in 'one-class classification').
(Speculatively, I think you might have to feed FastText supervised mode explicitly __label__not-greetings labeled contrast data – perhaps just synthesized random strings if you've got nothing else – in order for it to have any hope of meaningfully predicting "not-greetings".)
Given that, I'd not consider your first result for the input bye correct, nor the second result not correct. Both are just noise results from an undertrained model being asked to make a kind of distinction it's not known for being able to make.
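To make the contrast-data suggestion concrete, here is a rough sketch of adding synthesized __label__not-greetings examples before training; the file name, the number of negatives, and the random-string generator are arbitrary choices:

import random
import string

import fasttext

# Write the real greeting samples plus synthesized junk labeled as a contrast class.
with open("train.txt", "w") as f:
    for text in ["hi", "hello"]:
        f.write(f"__label__greetings {text}\n")
    for _ in range(100):
        junk = " ".join(
            "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 8)))
            for _ in range(random.randint(1, 4))
        )
        f.write(f"__label__not-greetings {junk}\n")

model = fasttext.train_supervised(input="train.txt", epoch=25, lr=1.0)
print(model.predict("hi"))   # expected to favour __label__greetings
print(model.predict("bye"))  # now has a chance of landing in __label__not-greetings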

To check if a string of words is a sentence

I have a text file from which I have to eliminate all the statements that do not make any sense; in other words, I have to check for each statement whether it is a sentence or not.
For example:
1. John is a heart patient.
2. Dr. Green, Rob is the referring doctor for the patient.
3. Jacob Thomas, M.D. is the ordering provider
4. Xray Shoulder PA, Oblique, TRUE Lateral, 18° FOSSA LAT LT; Status: Complete;
Sentences 1, 2, and 3 make some sense,
but sentence 4 does not, so I want to eliminate it.
May I know how it could be done?
This task seems very difficult; however, assuming you have the training data, you could likely use XGBoost, which uses boosted decision trees (and random forests). You would train it to answer positive or negative (yes, it makes sense, or no).
You would then need to come up with features. You could use the features from the NLTK part of speech (POS) tags. The number of occurrences of each of the types of tags in the sentence would be a good first model. That can set your benchmark for how good an "easy" solution is.
You also may be able to look into the utility of a (word/sentence)-to-vector model such as gensim for creating features for your model.
First I would see what happens with just the number of occurrences of each POS tag and XGBoost. Train and test a model and see how well it does. Then look at adding other features, such as position, or using doc2vec as your input to XGBoost.
Last resort would be a neural network (which would only be recommended if the prior ideas fail, and you have lots and lots of data). If you did use a neural net I would think an LSTM would likely be useful.
You would have to experiment and the amount of data matters, but you can start simple and then test and add to your model iteratively.
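As a rough sketch of the POS-count baseline described above (assuming you have labeled data; here the four example sentences from the question stand in for a real training set, with placeholder labels):

import nltk
import numpy as np
from xgboost import XGBClassifier

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Count occurrences of coarse POS-tag prefixes as the feature vector.
TAG_PREFIXES = ["NN", "NNP", "VB", "JJ", "RB", "DT", "IN", "CD", ".", ","]

def pos_counts(sentence):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return [sum(tag.startswith(p) for tag in tags) for p in TAG_PREFIXES]

# Placeholder labels: 1 = reads like a sentence, 0 = does not.
texts = [
    "John is a heart patient.",
    "Dr. Green, Rob is the referring doctor for the patient.",
    "Jacob Thomas, M.D. is the ordering provider",
    "Xray Shoulder PA, Oblique, TRUE Lateral, 18° FOSSA LAT LT; Status: Complete;",
]
labels = [1, 1, 1, 0]

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(np.array([pos_counts(t) for t in texts]), np.array(labels))

# Hypothetical new statement to score.
print(clf.predict(np.array([pos_counts("The patient has a broken arm.")])))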
It's very hard to be 100% confident but let's try.
You can use Amazon Comprehend (Natural Language Processing and Text Analytics) and create your own metrics over the sentences. For example:
John is a heart patient.
Amazon will give you: "." Punctuation, "a" Determiner, "heart" Noun, "is" Verb, "John" Proper Noun, "patient" Noun.
That is 1 punctuation, 1 determiner, 2 nouns, 1 verb, and 1 proper noun. You will probably need a noun and a verb for a valid sentence.
In your last sentence we have:
3 punctuation, 1 numeral, and 11 proper nouns. There is no action (verb), so this sentence probably isn't valid.
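A small sketch of that heuristic with the boto3 Comprehend client (the "needs a verb and a noun" rule is just the rule of thumb above; the region is an example and AWS credentials are assumed to be configured):

from collections import Counter

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

def looks_like_sentence(text):
    # detect_syntax returns one token per word with its part-of-speech tag.
    tokens = comprehend.detect_syntax(Text=text, LanguageCode="en")["SyntaxTokens"]
    counts = Counter(token["PartOfSpeech"]["Tag"] for token in tokens)
    # Rule of thumb from above: require at least one verb and one (proper) noun.
    return counts["VERB"] >= 1 and (counts["NOUN"] + counts["PROPN"]) >= 1

print(looks_like_sentence("John is a heart patient."))
print(looks_like_sentence("Xray Shoulder PA, Oblique, TRUE Lateral, 18° FOSSA LAT LT; Status: Complete;"))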

Best practice for wit.ai training

I am developing an app that uses wit.ai as a service. Right now, I am having problems training it. In my app I have 3 intents:
to call
to text
to send picture
Here are my training examples:
Call this number 072839485 and text this number 0623744758 and send picture to this number 0834952849.
Call this number 072839485, 0834952849 and 0623744758
In my first training example I labeled that sentence with all 3 intents, and labeled 072839485 as phone_number with role to_call_phone_number, 0623744758 as phone_number with role to_text_phone_number, and 0834952849 as phone_number with role to_send_pic_phone_number.
In my second training example I labeled all 3 numbers as phone_number with the to_call_phone_number role.
After a lot of training, wit still outputs the wrong labels. Given a sentence like this:
Call this number 072637464, 07263485 and 0273847584
Wit says 072637464 is to_call_phone_number, but 07263485 and 0273847584 are to_send_pic_phone_number.
Am I not training it correctly? Can someone give me some suggestions about best practices for training wit?
There aren't many best practices out there for wit.ai training at the moment, but with this particular example in mind I would recommend the following:
Pay attention to the type of entity in addition to just the value. If you choose free-text or keyword, you'll get different responses from the wit engine. For example: in your training if the number is a keyword, it'll associate the particular number with the intent/role rather than the position. This is probably the reason your training isn't working correctly.
One good practice would be to train your bot with specific examples first which will provide the bot with more information (such as user providing keyword 'photograph' and number) and then general examples which will apply to more cases (such as your second example).
Think about the user's perspective and what would seem natural to them. Work with those training examples first. Generate a list of possible training examples labelling them from general to specific and then train intents/roles/entities based on those examples rather than thinking about intents and roles first.

Google Prediction API for FAQ/Recommendation system

I want to build an automated FAQ system where users can ask questions and, based on the questions and their answers in the training data, the application would suggest a set of answers.
Can this be achieved via Prediction API?
If yes, how should I create my training data?
I have tested the Prediction API for sentiment analysis, but I have doubts and confusion about using it as a FAQ/recommendation system.
My training data has the following structure:
"Question":"How to create email account?"
"Answer":"Step1: xxxxxxxx Step2: xxxxxxxxxxxxx Step3: xxxxx xxx xxxxx"
"Question":"Who can view my contact list?"
"Answer":"xxxxxx xxxx xxxxxxxxxxxx x xxxxx xxx"
Train it with the question as the input and the answer as the output; when you then send a question as input to predict, it can return the corresponding answer.
For a simple FAQ this will work well.
But if you complete it in PHP, please help me out too.
In order to use the Prediction API, you must first train it against a set of training data. At the end of the training process, the Prediction API creates a model for your data set. Each model is either categorical (if the answer column is string) or regression (if the answer column is numeric). The model remains until you explicitly delete it. The model learns only from the original training session and any Update calls; it does not continue to learn from the Predict queries that you send to it.
Training data can be submitted in one of the following ways:
A comma-separated value (CSV) file. Each row is an example consisting of a collection of data plus an answer (a category or a value) for that example, as you saw in the two data examples above. All answers in a training file must be either categorical or numeric; you cannot mix the two. After uploading the training file, you will tell the Prediction API to train against it (see the sketch after this list).
Training instances embedded directly into the request. The training instances can be embedded into the trainingInstances parameter. Note: due to limits on the size of an HTTP request, this would only work with small datasets (< 2 MB).
Via Update calls. First an empty model is trained by passing in empty storageDataLocation and trainingInstances parameters into an Insert call. Then, the training instances are passed in using the Update call to update the empty model. Note: since not all classifiers can be updated, this may result in lower model accuracy than batch training the model on the entire dataset.
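As a small sketch of the first option, here is how the FAQ pairs from the question could be written out as a categorical CSV training file; the column order (answer first, then the question text as the feature) is an assumption based on the Prediction API's answer-column convention:

import csv

# FAQ pairs taken from the question above.
faq = [
    ("How to create email account?", "Step1: xxxxxxxx Step2: xxxxxxxxxxxxx Step3: xxxxx xxx xxxxx"),
    ("Who can view my contact list?", "xxxxxx xxxx xxxxxxxxxxxx x xxxxx xxx"),
]

# Assumption: the answer (the categorical label) goes in the first column,
# followed by the question text as the feature column.
with open("faq_training.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for question, answer in faq:
        writer.writerow([answer, question])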
You can find more information in this Help Center article.
NB: Google Prediction API client library for PHP is still in Beta.

NLP: Words and Polarity

Is anyone aware of any repository that has words and their polarities as scores?
Example
| Word   | Polarity |
|--------|----------|
| bad    | -1       |
| worst  | -3       |
| better | 1        |
| best   | 3        |
Thanks
What you are looking for is a sentiment lexicon. A sentiment lexicon is a dictionary of words, in which each word has a corresponding sentiment score (ranging from very negative to very positive). There are several sentiment lexicons that you could use, such as sentiwordnet, sentistrength, and AFINN just to name a few. The easiest to use among these is AFINN which I recommend you to start with. Later you can upgrade to a more suitable one based on your application. You can find information about AFINN here and download it from here.
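As a quick illustration, the AFINN list can be used from Python via the afinn package (word-level scores range from -5 to +5, analogous to the polarity column asked about):

from afinn import Afinn

afinn = Afinn()

# Score individual words; AFINN assigns integer polarities from -5 to +5.
for word in ["bad", "worst", "better", "best"]:
    print(word, afinn.score(word))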
While Alex Nevidomsky is generally correct in his comment, in sentiment analysis problems there are many ways to circumvent such limitations by learning the context of a word. Let me know if you have any further questions.
