Here is the scenario: I want to create a contextual chatbot, meaning the bot replies based on context. For example:
Input: {"text": "it was really nice", "topic": "movie"}
Output: {"text": "indeed, it was an awesome movie", "topic": "movie"}
When the only thing I have to consider about the input is the sentence itself, I can do it: all I need to do is tokenize the sentence and feed it into the LSTM. But how can I take "topic" into account?
I have already prepared a dataset, in such a format.
I am using Keras to build such a bot.
I am not really sure what you want to build.
The first thing that comes to mind is a normal generative LSTM like this one:
https://keras.rstudio.com/articles/examples/lstm_text_generation.html
which generates text based on Nietzsche's works.
To use such a network, you would need your training data in a question/answer format,
and you would need to set your question as the seed.
You do not need to load the topic separately, since the idea of a neural net is that it learns on its own to understand the data.
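One way to read "not loading the topic separately" is to fold the topic into the token sequence itself, so the network sees it as part of the input. The snippet below is only a sketch of that idea, not a full chatbot: the `<topic:...>` token format and the helper names are made up for illustration, and the resulting integer ids would still have to be fed into an embedding + LSTM model.

```python
# Hypothetical helpers: the "<topic:...>" token format is an assumption,
# not something Keras requires.

def build_input_tokens(text, topic):
    """Prepend the topic as a special token in front of the sentence tokens."""
    topic_token = "<topic:{}>".format(topic.lower())
    return [topic_token] + text.lower().split()

def build_vocab(token_lists):
    """Assign an integer id to every token; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    """Turn tokens into a fixed-length list of ids, padded with 0."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

samples = [
    {"text": "it was really nice", "topic": "movie"},
    {"text": "it was really nice", "topic": "restaurant"},
]
token_lists = [build_input_tokens(s["text"], s["topic"]) for s in samples]
vocab = build_vocab(token_lists)
encoded = [encode(t, vocab, max_len=8) for t in token_lists]
print(encoded)
```

The same sentence now produces a different input sequence per topic, which is what gives the model a chance to condition its reply on it.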
I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1000 or more samples including labels.
Pretraining, on the other hand, tries to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
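To make the masking objective concrete, here is a minimal sketch of how unlabeled text turns into training pairs. It is a simplification under stated assumptions: it splits on whitespace instead of using BERT's WordPiece tokenizer, and it always substitutes `[MASK]`, whereas BERT's actual recipe picks about 15% of tokens and sometimes swaps in a random token or leaves the token unchanged. The function name `mask_tokens` is made up for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a mask was placed and None
    elsewhere, so only the masked positions contribute to the loss.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Any raw domain text works as input; no human labels are needed.
tokens = "the patient was given ten milligrams of the drug".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)
print(labels)
```

The point is that the "labels" come from the text itself, which is why plain text files from your domain are enough for this step.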
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation" and "language modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, it provides fine-tuned models that are already capable of understanding semantic similarity between sentences. The "Continue Training on Other Data" section is what you want for further fine-tuning the model on your domain. You do have to prepare a training dataset that matches one of the available loss functions. E.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
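To illustrate what "a pair of texts and a label" means in practice, here is a small sketch of the data shape together with the generic textbook form of a contrastive loss. This is not the Sentence Transformers API: `PairExample` and `contrastive_loss` are hypothetical names, the example sentences are invented, and the library's own implementation may differ in details such as the distance metric or margin.

```python
from dataclasses import dataclass

@dataclass
class PairExample:
    text_a: str
    text_b: str
    label: int  # 1 = the two texts are similar, 0 = they are not

def contrastive_loss(distance, label, margin=0.5):
    """Generic contrastive loss for one pair, given an embedding distance.

    Similar pairs are penalized for being far apart; dissimilar pairs are
    penalized only while they are closer than the margin.
    """
    if label == 1:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2

# Invented domain examples, only to show the required structure.
train_pairs = [
    PairExample("How do I reset my password?", "I forgot my password", 1),
    PairExample("How do I reset my password?", "What are your opening hours?", 0),
]

# Distances would normally come from the model's embeddings; here they are
# hard-coded to show how the loss reacts.
for pair, distance in zip(train_pairs, [0.2, 0.1]):
    print(pair.label, contrastive_loss(distance, pair.label))
```

So the labeling effort for this loss is deciding, per pair, whether the two texts should end up close together or not.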
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training, and to get started you can take a look here.
I am a beginner in machine learning and NLP. I have to create a bot based on an FAQ dataset. Each FAQ Excel file contains two columns, "Questions" and "Answers".
E.g. a record from an Excel file (a question and its answer):
Question - What is RASA-NLU?
Answer - Rasa NLU is trained to identify intent and entities. Better the training, better the identification...
We have 3K+ Excel files, each with around 10K to 20K such records.
To implement the bot, I would have followed exactly this FAQ bot approach, which uses RASA-NLU, but RASA, Chatterbot, and Microsoft's QnA Maker are not allowed in my organization.
spaCy does the NER extraction perfectly for me, so I am looking to build the bot with spaCy, but I don't know how to proceed after extracting the entities. (IMHO, I will have to predict the exact question in the dataset, and its answer from the knowledge base, from the user's query to the bot.)
I don't know which NLP algorithm or ML process to use, or whether there is an easier way to create the FAQ bot from the extracted entities.
One way to achieve your FAQ bot is to transform the problem into a classification problem. You have questions and the answers can be the "labels". I suppose that you always have multiple training questions which map to the same answer. You can encode each answer in order to get smaller labels (for instance, you can map the text of the answer to an id).
Then, you can use your training data (the questions) and your labels (the encoded answers) and feed a classifier. After the training your classifier can predict the label of unseen questions.
Of course, this is a supervised approach, so you will need to extract features from your training sentences (the questions). In this case, you can use bag-of-words representations as features and even include the named entities.
An example of how to do text classification in spacy is available here: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
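The steps described in this answer (encode answers as ids, extract bag-of-words features, classify) can be sketched in plain Python as follows. Treat it as a toy under assumptions: the FAQ rows other than the Rasa one are invented, and the word-overlap scoring is only a stand-in for a real classifier such as the spaCy text categorizer from the linked tutorial.

```python
from collections import Counter

# Toy FAQ data; several questions map to the same answer.
faq = [
    ("What is Rasa NLU?", "Rasa NLU is trained to identify intent and entities."),
    ("What does Rasa NLU do?", "Rasa NLU is trained to identify intent and entities."),
    ("How do I reset my password?", "Use the reset link on the login page."),
    ("I forgot my password", "Use the reset link on the login page."),
]

# Step 1: encode each distinct answer text as a small integer label.
answer_to_id = {}
for _, answer in faq:
    answer_to_id.setdefault(answer, len(answer_to_id))
id_to_answer = {i: a for a, i in answer_to_id.items()}

def bag_of_words(text):
    """Lowercase, strip punctuation, and count the words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return Counter(cleaned.split())

# Step 2: "train" by accumulating word counts per label.
label_profiles = {}
for question, answer in faq:
    label = answer_to_id[answer]
    label_profiles.setdefault(label, Counter()).update(bag_of_words(question))

def predict(question):
    """Step 3: return the label whose training words overlap most."""
    features = bag_of_words(question)

    def score(label):
        profile = label_profiles[label]
        return sum(min(n, profile[w]) for w, n in features.items())

    return max(label_profiles, key=score)

print(id_to_answer[predict("password reset help")])
```

Named entities extracted by spaCy could be added to the feature counter in the same way as the words.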
I have string data in my dataset of the type :
AGF.SL.CA.LOSANG.15764
ABC.EMP.GOO.__._ME$.ZR_ME$ATR$GENERAL
SEM.JP.YOO.����_������_�����.ZC_NA:US::SANDO$GENERAL
Every record has a category associated with it, and given one such string, I have to use a Machine Learning or Deep Learning approach to identify the corresponding category.
I am confused as to what approach to follow in order to do this. My primary question is, should I keep the strings as is and use string similarity functions, or should I break up the strings into different words, and then do count vectorization on it, and then proceed from there?
Given this kind of data, with just one string to predict the class, what would be the best approach? I have to put this into production, so I need to look at something that will scale well. I am new to ML, so any suggestions would be appreciated. Thanks.
It seems to me that you can tackle this problem with an LSTM. Long short-term memory (LSTM) units (or blocks) are building units for the layers of a recurrent neural network (RNN).
LSTMs help capture sequential information and are generally used when we want to learn sequential patterns in the data.
You can approach this problem with a character-level LSTM.
For this, you pass every character of the text through the LSTM, one per time step, and at the last time step the network outputs a class prediction, which is trained against the true label.
You can use the cross-entropy loss function.
https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/
This will give you a complete idea.
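Before any character-level LSTM can be trained, each string has to become a fixed-length sequence of integer ids. The following is a rough sketch of just that preprocessing step, with made-up helper names and the assumption that id 0 is padding and id 1 stands for unseen characters; the model itself (embedding, LSTM, softmax over categories) is not shown.

```python
def build_char_vocab(strings):
    """Map every character seen in training to an id (0 = pad, 1 = unknown)."""
    chars = sorted(set("".join(strings)))
    return {ch: i + 2 for i, ch in enumerate(chars)}

def encode_chars(s, vocab, max_len):
    """Encode a string as a list of max_len ids, truncating or padding with 0."""
    ids = [vocab.get(ch, 1) for ch in s[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Two records taken from the question.
records = [
    "AGF.SL.CA.LOSANG.15764",
    "ABC.EMP.GOO.__._ME$.ZR_ME$ATR$GENERAL",
]
vocab = build_char_vocab(records)
max_len = max(len(r) for r in records)
encoded = [encode_chars(r, vocab, max_len) for r in records]

for row in encoded:
    print(row)
```

Each row can then be fed to the network one id per time step, with the category as the training target.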
Does build_vocab() extend my old vocabulary?
For example, my understanding is that when I use Doc2Vec to train a model, it builds the vocabulary just from the dataset. If I want to extend it, I need to use build_vocab().
Where should I use it? Should I put it after the Doc2Vec() call?
For example:
sentences = gensim.models.doc2vec.TaggedLineDocument(f_path)
dm_model = gensim.models.doc2vec.Doc2Vec(sentences, dm=1, size=300, window=8, min_count=5, workers=4)
dm_model.build_vocab()
You should follow working examples in gensim documentation/tutorials/notebooks or online tutorials to understand which steps are necessary and in what order.
In particular, if you provide your sentences corpus iterable on the Doc2Vec() initialization, it will automatically do both the vocabulary-discovery pass and all training – so you don’t then need to call either build_vocab() or train() yourself. And further, you would never call build_vocab() with no arguments. (No working example in docs or online will do what your code does – so don’t improvise new things until you’ve followed the examples and know why they do what they do.)
There is an optional update argument to build_vocab(), which purports to allow the expansion of a vocabulary from an earlier training session (in preparation for further training with the newer words). HOWEVER, it’s only been developed/tested with regard to Word2Vec models – there are reports it causes crashes when used with Doc2Vec. And even in Word2Vec, its overall effects and best-ways-to-use aren’t clear, across all training modes. So I don’t recommend its use except for experts who can read & interpret the source code, and many involved tradeoffs, on their own. If you receive a chunk of new texts, with new words, the best-grounded course of action, and easiest to evaluate/reason-about, is to re-train from scratch, using a combined corpus of all text examples.
I want to build an automated FAQ system where a user can ask questions and, based on the questions and answers in the training data, the application suggests a set of answers.
Can this be achieved via the Prediction API?
If yes, how should I create my training data?
I have tested the Prediction API for sentiment analysis, but I have doubts about using it as an FAQ/recommendation system.
My training data has the following structure:
"Question":"How to create email account?"
"Answer":"Step1: xxxxxxxx Step2: xxxxxxxxxxxxx Step3: xxxxx xxx xxxxx"
"Question":"Who can view my contact list?"
"Answer":"xxxxxx xxxx xxxxxxxxxxxx x xxxxx xxx"
Train your model with the question as the input and the answer as the output.
When you then send a question as input to predict, it can return the matching answer.
For a simple FAQ, this will work well.
If you get this working in PHP, please let me know how.
In order to use the Prediction API, you must first train it against a set of training data. At the end of the training process, the Prediction API creates a model for your data set. Each model is either categorical (if the answer column is string) or regression (if the answer column is numeric). The model remains until you explicitly delete it. The model learns only from the original training session and any Update calls; it does not continue to learn from the Predict queries that you send to it.
Training data can be submitted in one of the following ways:
A comma-separated value (CSV) file. Each row is an example consisting of a collection of data plus an answer (a category or a value) for that example, as you saw in the two data examples above. All answers in a training file must be either categorical or numeric; you cannot mix the two. After uploading the training file, you will tell the Prediction API to train against it.
Training instances embedded directly into the request. The training instances can be embedded into the trainingInstances parameter. Note: due to limits on the size of an HTTP request, this would only work with small datasets (< 2 MB).
Via Update calls. First an empty model is trained by passing in empty storageDataLocation and trainingInstances parameters into an Insert call. Then, the training instances are passed in using the Update call to update the empty model. Note: since not all classifiers can be updated, this may result in lower model accuracy than batch training the model on the entire dataset.
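For the CSV option, a categorical training file for the FAQ case could be prepared roughly as below. This is a sketch under assumptions: it puts the answer in the first column and the question text after it, which is how I understand the expected row layout, and it replaces each long answer with a short category label (the label names are invented) that your application would map back to the full answer text.

```python
import csv
import io

# Question text paired with a hypothetical short label for its answer.
faq = [
    ("How to create email account?", "create_email_account"),
    ("Who can view my contact list?", "contact_list_visibility"),
]

def to_training_csv(rows):
    """Write one example per row: answer label first, then the question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for question, answer_label in rows:
        writer.writerow([answer_label, question])
    return buffer.getvalue()

training_csv = to_training_csv(faq)
print(training_csv)
```

Because every answer label is a string, this would yield a categorical model, and adding several differently worded questions per label gives the model more to learn from.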
You can find more information in this Help Center article.
NB: Google Prediction API client library for PHP is still in Beta.