Google Prediction API for FAQ/Recommendation system - nlp

I want to build an automated FAQ system where a user can ask a question and, based on the questions and their answers in the training data, the application suggests a set of answers.
Can this be achieved via the Prediction API?
If yes, how should I create my training data?
I have tested the Prediction API for sentiment analysis, but I have doubts and confusion about using it as an FAQ/recommendation system.
My training data has the following structure:
"Question":"How to create email account?"
"Answer":"Step1: xxxxxxxx Step2: xxxxxxxxxxxxx Step3: xxxxx xxx xxxxx"
"Question":"Who can view my contact list?"
"Answer":"xxxxxx xxxx xxxxxxxxxxxx x xxxxx xxx"

Train your model so that the question is the input and the answer is the output.
Then, when you send a question as the input to Predict, it can return the corresponding answer.
For a simple FAQ this will work well.
(And if you finish it in PHP, please help me out too.)
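For example, a training file for the data above could look like the rows below (a minimal sketch; the Prediction API's CSV layout puts the answer, i.e. the label, in the first column and the features, here the question text, in the columns after it):

    "Step1: xxxxxxxx Step2: xxxxxxxxxxxxx Step3: xxxxx xxx xxxxx","How to create email account?"
    "xxxxxx xxxx xxxxxxxxxxxx x xxxxx xxx","Who can view my contact list?"

Because the label column is a string, the resulting model is categorical, and Predict returns the best-matching answer category for a new question.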

In order to use the Prediction API, you must first train it against a set of training data. At the end of the training process, the Prediction API creates a model for your data set. Each model is either categorical (if the answer column is a string) or regression (if the answer column is numeric). The model remains until you explicitly delete it. The model learns only from the original training session and any Update calls; it does not continue to learn from the Predict queries that you send to it.
Training data can be submitted in one of the following ways:
A comma-separated value (CSV) file. Each row is an example consisting of a collection of data plus an answer (a category or a value) for that example, as you saw in the two data examples above. All answers in a training file must be either categorical or numeric; you cannot mix the two. After uploading the training file, you will tell the Prediction API to train against it.
Training instances embedded directly into the request. The training instances can be embedded into the trainingInstances parameter. Note: due to limits on the size of an HTTP request, this would only work with small datasets (< 2 MB).
Via Update calls. First an empty model is trained by passing in empty storageDataLocation and trainingInstances parameters into an Insert call. Then, the training instances are passed in using the Update call to update the empty model. Note: since not all classifiers can be updated, this may result in lower model accuracy than batch training the model on the entire dataset.
You can find more information in this Help Center article.
NB: Google Prediction API client library for PHP is still in Beta.
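For the embedded-training option, the request body could be assembled along these lines (a hedged sketch only; the field names "trainingInstances", "output" and "csvInstance" and the model id are from memory of the v1.6 reference and should be verified against the API documentation):

    import json

    # Build a small Insert request body that embeds the training examples directly,
    # mirroring the question/answer pairs from the post above.
    body = {
        "id": "faq-model",
        "trainingInstances": [
            {"output": "Step1: xxxxxxxx Step2: xxxxxxxxxxxxx Step3: xxxxx xxx xxxxx",
             "csvInstance": ["How to create email account?"]},
            {"output": "xxxxxx xxxx xxxxxxxxxxxx x xxxxx xxx",
             "csvInstance": ["Who can view my contact list?"]},
        ],
    }
    print(json.dumps(body, indent=2))  # this JSON would be sent as the body of the Insert call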

Related

Is it possible to keep training the same Azure Translate Custom Model with additional data sets?

I just finished training a Custom Azure Translate Model with a set of 10,000 sentences. I now have the option to review the results and test the data. While I already get a good result score, I would like to continue training the same model with additional data sets before publishing. I can't find any information regarding this in the documentation.
The only remotely close option I can see is to duplicate the first model and add the new data sets, but this would create a new model and not advance the original one.
Once the project is created, you can train different models on different datasets. However, once a dataset is uploaded and the model has been trained, you cannot modify the content of the dataset or update the trained model.
https://learn.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model
The above document can help you.

Tuned model with GroupKFold Cross-Validation requires Group parameter when Predicting

I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn Github repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column in the training set? If so, remove it and re-train. It looks like you are just using it to generate splits. If it isn't part of the input data you need at prediction time, it shouldn't be in the training set.
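A minimal scikit-learn sketch of that setup (the tiny dataset, the column names "group" and "y", and the choice of a regressor are all assumptions for illustration): the group is passed to fit() for the splits but dropped from the features, so nothing group-related is needed at prediction time.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, GroupKFold

    # Made-up data: 'group' marks rows that came from the same source.
    df = pd.DataFrame({
        "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "feat_b": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4],
        "group":  [1, 1, 2, 2, 3, 3, 4, 4],
        "y":      [1.2, 1.1, 2.3, 2.1, 3.0, 3.2, 4.1, 3.9],
    })

    groups = df["group"]                   # used only to build the CV splits
    X = df.drop(columns=["group", "y"])    # the group column is NOT a model feature
    y = df["y"]

    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [50, 100]},
        cv=GroupKFold(n_splits=4),
    )
    search.fit(X, y, groups=groups)        # groups go to fit(), not into X

    # At prediction time only the feature columns are needed; no group is required.
    X_new = pd.DataFrame({"feat_a": [2.5], "feat_b": [0.6]})
    print(search.best_estimator_.predict(X_new))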

How to design an LSTM network which accepts multiple inputs

Here is the scenario: I want to create a contextual chatbot, which means the bot will answer or reply based on context. For example:
Input :["text": "it was really nice", "topic":movie]
Output:["text": "indeed,it was an awesome movie","topic":movie]
When I only have to consider one thing about the input, the sentence itself, I can do it: all I need to do is tokenize the sentences and feed them into the input of the LSTM. But how can I also take the "topic" into account?
I have already prepared a dataset, in such a format.
I am using Keras to build such a bot.
I am not really sure what you want to build.
The first thing that comes to mind is a normal generative LSTM like this one:
https://keras.rstudio.com/articles/examples/lstm_text_generation.html
which generates text based on Nietzsche's works.
To use such a network you would need your training data in a question/answer format,
and you would need to set your question as the seed.
You do not need to load the topic separately, as the concept of a neural net is that it learns on its own to understand the data.
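That said, if you do want to feed the topic in explicitly, one simple option is to prepend it as a token to the text, so a single LSTM input carries both. A minimal Keras sketch of that idea, framed as picking from a fixed set of candidate answers (all names, sizes and the two-example dataset are illustrative, not from the post):

    import numpy as np
    from tensorflow.keras.layers import Dense, Embedding, LSTM
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import Tokenizer

    # Topic token prepended to the text; the label indexes a list of canned answers.
    samples = [
        ("movie it was really nice", 0),
        ("movie the plot was boring", 1),
    ]
    texts = [t for t, _ in samples]
    labels = np.array([a for _, a in samples])

    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(texts)
    X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=20)

    model = Sequential([
        Embedding(input_dim=5000, output_dim=64),
        LSTM(64),
        Dense(2, activation="softmax"),    # one output unit per candidate answer
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(X, labels, epochs=3, verbose=0)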

Need to know how to properly regression test a Dialogflow agent - multiple, conflicting options

I've been working with Dialogflow for several months now - really enjoy the tool. However, it's very important to me to know if my changes (in training) are making intent mapping accuracy better or worse over time. I've talked to numerous people about the best way to do this using Dialogflow and I've gotten at least 4 different answers. Can you help me out here? Here are the 4 answers I've received on how to properly train/test your agent. Please note that ALL of these answers involve generating a confusion matrix...
"Traditional" - randomly split all your input data into 80/20% - use the 80% for training and the 20% for testing. Every time you train your agent (because you've collected new input data), start the process all over again - meaning randomly split all your input data (old and new) into 80/20%. In this model, a piece of training data for one agent iteration might be used as a piece of test data on another iteration - and vice-versa. I've seen variations on this option (KFold and Monte Carlo).
"Golden Set" - similar to the above except that the initial 80/20% training and testing sets that you use to create your first agent continue to grow over time as you add more input data. In this model, once a piece of input data has been tagged as training data, it will NEVER be used as testing data - and once a piece of input data has been tagged as testing data, it will NEVER be used as training data. The 2 initial sets of training and testing data just continue to grow over time as new inputs are randomly split into the existing sets.
"All Data is Training Data - Except for Fake Testing Data" - In this model, we use all the input data as training data. We then copy a portion of the input data (20%) and "munge it" - meaning that we inject characters or words into the 20% portion - and use that as test data.
"All Data is Training Data - and Some of it is Testing Data Also" - In this model, we use all the input data as training data. We then copy a portion of the input data (20%) and use it as testing data. In other words, we are testing our agent using a portion of our (unmodified) training data. A variation on this option is to still use all your inputs as training data but sort your inputs by "popularity/usage" and take the top 20% for testing data.
If I were creating a bot from scratch, I'd simply go with option #1 above. However, I'm using an off-the-shelf product (Dialogflow) and it isn't clear to me that traditional testing is required. Golden Set seems like it will (mostly) get me to the same place as "traditional" so I don't have a big problem with it. Option #3 seems bad - creating fake testing data sounds problematic on many levels. And option #4 is using the same data to test as it uses to train - which scares me.
Anyway, would love to hear some thoughts on the above from the experts!
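For reference, option #1 can be sketched in a few lines; detect_intent below is a hypothetical stand-in for whatever call sends an utterance to the trained agent and returns the matched intent name, and the tiny labelled dataset is made up:

    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Made-up labelled utterances; in practice these are your collected inputs.
    utterances = ["open my account", "reset my password", "close my account", "forgot password"]
    intents = ["account.open", "password.reset", "account.close", "password.reset"]

    def detect_intent(utterance):
        # Placeholder: in practice this would query the trained agent (e.g. via the
        # Dialogflow detectIntent endpoint) and return the matched intent name.
        return "password.reset"

    # Random 80/20 split: 80% becomes training phrases for the agent, 20% is held out.
    train_u, test_u, train_i, test_i = train_test_split(
        utterances, intents, test_size=0.2, random_state=42
    )

    predicted = [detect_intent(u) for u in test_u]
    print(confusion_matrix(test_i, predicted, labels=sorted(set(intents))))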

Train multiple models with various measures and accumulate predictions

So I have been playing around with Azure ML lately, and I have one dataset with multiple values I want to predict. All of them use different algorithms, and when I try to train multiple models within one experiment, it says the “train model can only predict one value”, and there are not enough input ports on the Train Model module to take in multiple values even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error as mentioned. How do I predict multiple values and later put the predicted columns together for the web service output so I don’t have to have multiple APIs?
What you want to do is train each model separately and save each one as a trained model.
So create a new experiment, train your models, and save them by right-clicking on each model; they will then show up in the left nav bar in the Studio. Now you can drag your saved models onto the canvas and have them score predictions, eventually combining them into the same output, as I have done in my example through the “Add columns” module. I made this example for Ronaldo (the Real Madrid CF player), predicting how he will perform in a match after a training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train multiple values, you can check out Raymond Langaeian's (MSFT) answer in the comment section at this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable that you are going to predict, then add all of the predicted columns together and return them as a single output for the web service.
The algorithms available in ML are only capable of predicting a single variable at a time based on the inputs they receive.
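Outside of the Studio drag-and-drop workflow, the same pattern (one model per target, predictions joined into a single output table) can be sketched in scikit-learn purely to illustrate the idea; the feature and target names below are made up:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression

    # Made-up features and two targets to predict.
    df = pd.DataFrame({
        "minutes_trained": [60, 90, 120, 45],
        "sleep_hours": [7, 8, 6, 9],
        "goals": [1, 2, 0, 1],                  # target 1
        "distance_km": [9.5, 10.2, 8.7, 9.9],   # target 2
    })
    X = df[["minutes_trained", "sleep_hours"]]

    # A different algorithm per target, as in the question.
    models = {
        "goals": GradientBoostingRegressor().fit(X, df["goals"]),
        "distance_km": LinearRegression().fit(X, df["distance_km"]),
    }

    # Join the per-model predictions into one output table (the "Add columns" step).
    predictions = pd.DataFrame({name: m.predict(X) for name, m in models.items()})
    print(pd.concat([X.reset_index(drop=True), predictions], axis=1))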
