How to merge new model after training with the old one? - nlp

I have a model, "en-ner-organization.bin", which I downloaded from the Apache website. It works fine, but I would like to train it with my own organizations database to improve recognition quality. However, after I trained "en-ner-organization.bin" with my organization database, the model file became smaller than it was before. So it seems it was overwritten with my data.
I see that there is no way to retrain an existing model, but maybe there is a way to merge models?
If not, I guess I could add my training data to the .train file of the original model, so that the generated model would consist of the default data plus my data from the database. But I can't find such a file on the web.
So the main question is: how do I keep the existing model data and add new data to the model?
Thanks

As far as I know, it's not possible to merge different models, but it is possible to pass several model files to the finder.
From the synopsis:
$ bin/opennlp TokenNameFinder
Usage: opennlp TokenNameFinder model1 model2 ... modelN < sentences

Related

combine multiple spacy textcat_multilabel models into a single textcat_multilabel model

Problem: I have millions of records that need to be transformed using a bunch of spacy textcat_multilabel models.
import spacy

# Pseudocode: `models`, `records`, and `bulk_create_records` are defined elsewhere.
for model in models:
    nlp = spacy.load(model)
    for group_of_records in records:  # millions of records
        new_data = nlp.pipe(group_of_records)  # data is processed in bulk
        # process data
        bulk_create_records(new_data)
My current loop is as follows:
load a model
loop through records / transform data using model / save
As you can imagine, the more records I process and the more models I include, the longer this entire process takes. The idea is to build a single model and process my data just once, instead of (n * num_of_models) times.
Question: is there a way to combine multiple textcat_multilabel models created from the same spacy config, into a single textcat_multilabel model?
There is no basic feature to just combine models, but there are a couple of ways you can do this.
One is to source all your components into the same pipeline. This is very easy to do, see the double NER project for an example. The disadvantage is that this might not save you much processing time, since separately trained models will still have their own tok2vec layers.
You could combine your training data and train one big model. But if your models are actually separate that would almost certainly cause a reduction in accuracy.
If speed is the primary concern, you could train each of your textcats separately while freezing your tok2vec. That would result in decreased accuracy, though maybe not too bad, and it would allow you to then combine the textcat models in the same pipeline while removing a bunch of tok2vec processing. (This is probably the method I've listed with the best balance of implementation complexity, speed advantage, and accuracy sacrificed.)
One thing that I don't think has been tested is that you could try training separate textcat models at the same time with separate sets of labels by manually specifying the labels to each component in their configs. I am not completely sure that would work but you could try it.
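The first option above (sourcing every component into one pipeline) can be sketched roughly as follows. This is a minimal sketch, not the answerer's exact method: the helper name `combine_textcats` and the `textcat_multilabel_{i}` naming scheme are made up for illustration, and it assumes each source pipeline has a component called "textcat_multilabel" that carries its own tok2vec rather than listening to a shared one. It also assumes the models use distinct labels, since every component writes its scores into the same `doc.cats`.

```python
import spacy


def combine_textcats(pipelines, lang="en"):
    """Source each pipeline's textcat_multilabel component into one new pipeline.

    `pipelines` is a list of already loaded spaCy pipelines (hypothetical input).
    """
    combined = spacy.blank(lang)
    for i, source_nlp in enumerate(pipelines):
        combined.add_pipe(
            "textcat_multilabel",            # component name in the source pipeline
            source=source_nlp,               # copy the trained component from here
            name=f"textcat_multilabel_{i}",  # unique name in the combined pipeline
        )
    return combined
```

With trained models this would be called as `combined = combine_textcats([spacy.load(p) for p in model_paths])`, after which `combined.pipe(...)` runs over the records once. As the answer notes, this alone may not save much time, because every sourced component still runs its own tok2vec.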

Is it possible to keep training the same Azure Translate Custom Model with additional data sets?

I just finished training a Custom Azure Translate Model with a set of 10,000 sentences. I now have the option to review the result and test the data. While I already get a good result score, I would like to continue training the same model with additional data sets before publishing. I can't find any information about this in the documentation.
The only remotely close option I can see is to duplicate the first model and add the new data sets but this would create a new model and not advance the original one.
Once a project is created, you can train different models in it on different datasets. However, once a dataset has been uploaded and a model has been trained on it, you cannot modify the content of the dataset or upgrade it.
https://learn.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model
The above document can help you.

How do I retrain the model with a new set of data without losing the earlier model data?

For my current requirement, I have a dataset of 10k+ faces from 100 different people, on which I have trained a model for recognizing faces. The model was trained by getting the 128-dimensional vectors from the facenet_keras.h5 model and feeding those vector values to a Dense layer for classifying the faces.
But the issue I'm facing currently is this:
if I want to train on one more person's face, I have to retrain the whole model again.
How should I approach this challenge? I have read about a concept called transfer learning, but I have no idea how to implement it. Please give your suggestions on this issue. What are the possible solutions?
With transfer learning you would copy an existing pre-trained model and use it for a different, but similar, dataset from the original one. In your case this would be what you need to do if you want to train the model to recognize your specific 100 people.
If you already did this and you want to add another person to the database without having to retrain the complete model, then I would freeze all layers (set layer.trainable = False for all layers) except for the final fully-connected layer (or the final few layers). Then I would replace the last layer (which had 100 nodes) with a layer with 101 nodes. You could even copy the existing weights into the first 100 nodes and maybe freeze those too (I'm not sure if this is possible in Keras). In this case you would re-use all the trained convolutional layers etc. and teach the model to recognise the new face.
You can save your training results by saving your weights with:
model.save_weights('my_model_weights.h5')
And load them again later to resume your training after you added a new image to the dataset with:
model.load_weights('my_model_weights.h5')
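The step of replacing the 100-node layer with a 101-node layer while keeping the old weights can be sketched on the raw weight arrays. This is a minimal sketch under assumptions: the helper name `expand_dense_head` and the small random initialisation of the new unit are illustrative choices, not part of the answer. It relies on the layout Keras uses for a Dense layer, where `layer.get_weights()` returns a kernel of shape (n_inputs, n_classes) and a bias of shape (n_classes,).

```python
import numpy as np


def expand_dense_head(kernel, bias, n_new=1, scale=0.01, seed=0):
    """Return (kernel, bias) for a Dense layer with n_new extra output units.

    The weights of the existing classes are copied unchanged; the new
    columns get small random values and the new biases start at zero.
    """
    rng = np.random.default_rng(seed)
    n_inputs, n_classes = kernel.shape
    new_kernel = np.empty((n_inputs, n_classes + n_new), dtype=kernel.dtype)
    new_kernel[:, :n_classes] = kernel  # keep what was learned for the old classes
    new_kernel[:, n_classes:] = rng.normal(0.0, scale, size=(n_inputs, n_new))
    new_bias = np.concatenate([bias, np.zeros(n_new, dtype=bias.dtype)])
    return new_kernel, new_bias
```

In Keras the inputs would come from `old_layer.get_weights()` and the result would go into the new 101-node layer with `new_layer.set_weights([new_kernel, new_bias])`, after which only that layer is trained on the data that includes the new person.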

Using my saved ML model to work on a raw and unprocessed dataset

I have created a few ML models and saved them for future use in predicting outcomes. This time I have run into a scenario that is common but new to me.
I need to provide a model to someone else so they can test it on their dataset.
I removed a few redundant columns from my training data, trained a regression model on it, and saved it after validating it. However, when I give this model to someone to use on their dataset, how do I tell them to drop those columns? I could manually add the column list to the Python file that loads the saved model, but that does not seem very neat.
What is the best way to do this in general? Kindly share some input.
You can simply use the pickle library to save the column list and anything else you need along with the model. In the new session, use pickle again to load those objects back.
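A minimal sketch of that idea: store the model and the list of columns it was trained on in one pickled dictionary, so whoever loads it can select the right columns without being told separately. The function names, the `feature_columns` key, and the plain-dict stand-in for the model are all hypothetical; any picklable model object would go in the same place.

```python
import pickle


def save_bundle(path, model, feature_columns):
    """Pickle the model together with the columns it was trained on."""
    bundle = {"model": model, "feature_columns": list(feature_columns)}
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_bundle(path):
    """Load the dictionary written by save_bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)


def select_columns(record, feature_columns):
    """Keep only the training columns of one record, in training order."""
    return [record[name] for name in feature_columns]
```

The recipient then calls `bundle = load_bundle(path)` and passes each raw record through `select_columns(record, bundle["feature_columns"])` before predicting, so extra columns in their dataset are dropped automatically. Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.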

How to Add Training Data to Out-of-the-Box Parsey McParseFace Model

I am wondering how, if possible at all, one might train a new SyntaxNet model that uses the training data from the original, out-of-the-box "ready to parse" model included on the github page. What I want to do is add new training data to make a new model, but I don't want to make an entirely new and therefore entirely distinct model from the original Parsey McParseFace. So my new model would be trained on the data that the included model was trained on (Penn Treebank, OntoNotes, English Web Treebank), plus my new data. I don't have the money to buy from the LDC the treebanks the original model is trained on. Has anyone attempted this? Thanks very much.
