Is it possible to train a Spark word2vec model in batch mode? - apache-spark

I am wondering if it is possible to train Spark word2vec in batch mode, or, in other words, whether it is possible to update the vocabulary of a Spark word2vec model that has already been trained.
My application is:
My paragraphs are located in multiple files, and when I use gensim I can do
from gensim.models import Word2Vec

class MySentences(object):
    def __init__(self, file_list, folder):
        self.file_list = file_list
        self.folder = folder

    def __iter__(self):
        for file in self.file_list:
            if 'walk_' in file:
                print(file)
                with open(self.folder + file, 'r') as f:
                    for line in f:
                        yield line.split()

model = Word2Vec(MySentences(files, fileFolder), size=32, window=5, min_count=5, workers=15)
I can even do
for epoch in range(10):
    model.train(MySentences(files, fileFolder))
I am wondering how I can do similar things with Spark word2vec.
In Spark, I found I can only do an RDD union over multiple files:
from pyspark.mllib.feature import Word2Vec
from pyspark.sql import SQLContext
inp1 = sc.textFile("file1").map(lambda row: row.split('\t'))
inp2 = sc.textFile("file2").map(lambda row: row.split('\t'))
inp = sc.union([inp1,inp2])
word2vec = Word2Vec().setVectorSize(4).setMinCount(1)
model = word2vec.fit(inp)
Otherwise, if I train the model with inp1 and then with inp2, the words from inp1 will be gone.
If I cannot do the training in batch mode, how can I update a trained model with new paragraphs in the future?

I think you can:
for idx in range(1, 100, 1):
    model = word2vec.fit(data.sample(False, 0.01))
    model.save(sc, path)
I am not sure whether the sample function always draws unseen data in this example.
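As a side note on the multiple-files part of the question: sc.textFile also accepts comma-separated paths and glob patterns, so a sketch like the following (assuming all the files live in one folder and share the walk_ prefix) avoids building the union by hand:
from pyspark.mllib.feature import Word2Vec

# Read every matching file into a single RDD; glob patterns and
# comma-separated path lists are both accepted by textFile.
inp = sc.textFile("fileFolder/walk_*").map(lambda row: row.split('\t'))
word2vec = Word2Vec().setVectorSize(4).setMinCount(1)
model = word2vec.fit(inp)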

Related

Restore best checkpoint to an estimator tensorflow 2.x

Briefly, I put in place a data input pipeline using the TensorFlow Dataset API. Then I implemented a CNN model for classification using Keras, which I converted to an estimator. I fed my estimator Train and Eval Specs, with my input_fn providing input data for training and evaluation. As a final step I launched the model training with tf.estimator.train_and_evaluate.
def my_input_fn(tfrecords_path):
    dataset = (...)
    return batch_fbanks, batch_labels

def build_model():
    model = tf.keras.models.Sequential()
    model.add(...)
    model.compile(...)
    return model

model = build_model()

run_config = tf.estimator.RunConfig(model_dir, save_summary_steps=100, save_checkpoints_steps=1000)
estimator = tf.keras.estimator.model_to_estimator(model, config=run_config)

def serving_input_receiver_fn():
    inputs = {'Conv1_input': tf.compat.v1.placeholder(shape=[None, 11, 120, 1], dtype=tf.float32)}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

exporter = tf.estimator.BestExporter(serving_input_receiver_fn, name="best_exporter", exports_to_keep=5)

train_spec_dnn = tf.estimator.TrainSpec(input_fn=lambda: my_input_fn(train_data_path), hooks=[hook])
eval_spec_dnn = tf.estimator.EvalSpec(input_fn=lambda: my_eval_input_fn(eval_data_path), exporters=exporter, start_delay_secs=0, throttle_secs=15)

tf.estimator.train_and_evaluate(estimator, train_spec_dnn, eval_spec_dnn)
I save the 5 best checkpoints using tf.estimator.BestExporter as shown above. Once I finish training, I want to reload the best model and convert it to an estimator to re-evaluate the model and predict on a new dataset. However, my issue is restoring the checkpoint into an estimator. I tried several solutions, but each time I don't get the estimator object I need to run its evaluate and predict methods.
To be more specific, each of the best checkpoint directories is organised as follows:
./
    variables/
        variables.data-00000-of-00002
        variables.data-00001-of-00002
        variables.index
    saved_model.pb
So the question is: how can I get an estimator object from the best checkpoint, so that I can use it to evaluate my model and predict on new data?
Note: I found some proposed solutions relying on TensorFlow v1 features, which cannot solve my problem because I work with TF v2.
Thanks a lot, any help is appreciated.
You can use the class below, derived from tf.estimator.BestExporter.
What it does is, in addition to saving the best model (the .pb files, etc.), it also saves
the best exported model's checkpoint files to a different folder.
Below is the class:
import shutil, glob, os
import tensorflow as tf
# import tensorflow.logging as logging

# the path where all the checkpoints reside
BEST_CHECKPOINTS_PATH_FROM = 'PATH TO ALL CHECKPOINT FILES'
# the path where the best exporter checkpoint files will be saved
BEST_CHECKPOINTS_PATH_TO = 'PATH TO BEST EXPORTER CHECKPOINT FILES TO BE SAVED'

class BestCheckpointsExporter(tf.estimator.BestExporter):
    def export(self, estimator, export_path, checkpoint_path, eval_result, is_the_final_export):
        if self._best_eval_result is None or \
                self._compare_fn(self._best_eval_result, eval_result):
            # print('Exporting a better model ({} instead of {})...'.format(eval_result, self._best_eval_result))
            for name in glob.glob(checkpoint_path + '.*'):
                print(name)
                print(os.path.join(BEST_CHECKPOINTS_PATH_TO, os.path.basename(name)))
                shutil.copy(name, os.path.join(BEST_CHECKPOINTS_PATH_TO, os.path.basename(name)))
            # also save the text file used by the estimator api to find the best checkpoint
            with open(os.path.join(BEST_CHECKPOINTS_PATH_TO, "checkpoint"), 'w') as f:
                f.write("model_checkpoint_path: \"{}\"".format(os.path.basename(checkpoint_path)))
            self._best_eval_result = eval_result
        else:
            print('Keeping the current best model ({} instead of {}).'.format(self._best_eval_result, eval_result))
Example usage of the class:
You just replace the exporter by instantiating the class and passing the serving_input_receiver_fn.
def serving_input_receiver_fn():
    inputs = {'my_dense_input': tf.compat.v1.placeholder(shape=[None, 4], dtype=tf.float32)}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

exporter = BestCheckpointsExporter(serving_input_receiver_fn=serving_input_receiver_fn)

train_spec_dnn = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=5)
eval_spec_dnn = tf.estimator.EvalSpec(input_fn=input_fn, exporters=exporter, start_delay_secs=0, throttle_secs=15)

(x, y) = tf.estimator.train_and_evaluate(keras_estimator, train_spec_dnn, eval_spec_dnn)
At this point, it will save the best exported model's checkpoint files in the folder you have specified.
To load the checkpoint files, follow these steps:
Step 1: Rebuild your model instance
def build_model():
    model = tf.keras.models.Sequential()
    model.add(...)
    model.compile(...)
    return model

model = build_model()
Step 2: Use the model's load_weights API
Reference: https://www.tensorflow.org/tutorials/keras/save_and_load
ck_path = tf.train.latest_checkpoint('PATH TO BEST EXPORTER CHECKPOINT FILES')
model.load_weights(ck_path)

# From here you can call predict and evaluate on the trained model

# PREDICT
prediction = model.predict(x)

# EVALUATE
for features_batch, labels_batch in input_fn().take(1):
    model.evaluate(features_batch, labels_batch)
Note: all of this was simulated on Google Colab.

X has 7 features per sample; expecting 18282

I'm trying to build a text classification model with sklearn. I'm quite new to Python and also to sklearn. I already built the model with some training data and saved it, but there's an error when I try to reuse the model in another Python program/file.
I already looked at some similar problems here on Stack Overflow, but I couldn't find a solution that works for me.
I added some comments, so you can read the code more easily.
...
# load the dataset
data = codecs.open('C:/Users/baran/PycharmProjects/test/resource/CorpusMitLabelsPlusSonstige.txt', encoding='utf8',
                   errors='ignore').read()

# separate labels from text
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using the count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
...
And since I was training with different methods to evaluate which one was better, I made a train_model method.
...
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False, is_not_tfid=False,
                correct_model=False):
    # fit the training dataset on the classifier
    ...
    elif correct_model:
        classifier.fit(feature_vector_train, label)
        pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
        with open(pkl_filename, 'wb') as file:
            pickle.dump(classifier, file)
        # with open(pkl_filename, 'rb') as file:
        #     pickle_model = pickle.load(file)
        # joblib.dump(classifier, "C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # loaded_model = joblib.load("C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # result = loaded_model.score(feat)
        # print(pickle_model.predict(feature_vector_valid))
    ...
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    ...
    return metrics.accuracy_score(valid_y, predictions)
...
This is the "correct_model":
...
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, correct_model=True)
print("LR, Count Vectors: ", accuracy)
...
This model gives me something around 80% accuracy on the validation data.
So this is my test file, where I wanted to test whether I can load and reuse the model:
...
texts = []
texts.append("Der Bus hat nicht an der Haltestelle gehalten")

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the test data using the count vectorizer object
test_data = count_vect.transform(trainDF['text'])

# load the model
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

# reuse the model
test_load = joblib.load("C:/Users/baran/PycharmProjects/test/model.pkl")
print(test_load.predict(test_data))
...
Then I get this error:
...
ValueError: X has 7 features per sample; expecting 18282
What I expected is that it would give me "3" as a result, which is the encoding for a specific label. These predictions work in the same file where I also train the model, but somehow I cannot use new validation data.
I think I made some mistake when fitting and/or transforming the data.
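For what it's worth, this mismatch usually comes from fitting a new CountVectorizer on the single test sentence (7 features) instead of reusing the vectorizer fitted on the training corpus (18282 features). A minimal sketch of reusing it, assuming the training script above and a hypothetical vectorizer.pkl path:
import pickle

# In the training file: pickle the fitted vectorizer next to the model.
vec_filename = "C:/Users/baran/PycharmProjects/test/resources/vectorizer.pkl"  # hypothetical path
with open(vec_filename, 'wb') as file:
    pickle.dump(count_vect, file)

# In the test file: load it and only transform (never fit) the new text.
with open(vec_filename, 'rb') as file:
    count_vect = pickle.load(file)
test_data = count_vect.transform(trainDF['text'])
print(pickle_model.predict(test_data))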

How to increase the speed of this NER model trained from scratch on 1 million labeled sentences

I would like to use spaCy's NER model to train a model from scratch using 1 million sentences. The model has only two types of entities. This is the code I am using. Since I can't share the data, I created a dummy dataset.
My main issue is that the model is taking too long to train. I would appreciate it if you could highlight any error in my code or suggest other methods to speed up training.
TRAIN_DATA = [('Ich bin in Bremen', {'entities': [(11, 17, 'loc')]})] * 1000000

import spacy
import random
from spacy.util import minibatch, compounding

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('de')
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            batches = minibatch(TRAIN_DATA, size=compounding(100, 64.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
    return nlp

model = train_spacy(TRAIN_DATA, 20)
Maybe you can try this:
batches = minibatch(TRAIN_DATA, size=compounding(1, 512, 1.001))
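For context, spacy.util.compounding(start, stop, compound) yields batch sizes that start at start and grow by the factor compound each step, capped at stop, so this schedule lets the batch size grow toward 512 and reduces per-update overhead. A quick sketch to inspect the sizes it produces:
from itertools import islice
from spacy.util import compounding

# Print the first few batch sizes generated by the suggested schedule.
sizes = compounding(1.0, 512.0, 1.001)
print(list(islice(sizes, 5)))  # roughly 1.0, 1.001, 1.002, 1.003, 1.004
Larger, growing batches mean fewer nlp.update calls per pass over the data, which is usually a cheap way to cut training time.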

How to save best model in Keras based on AUC metric?

I would like to save the best model in Keras based on AUC, and I have this code:
def MyMetric(yTrue, yPred):
    auc = tf.metrics.auc(yTrue, yPred)
    return auc

best_model = [ModelCheckpoint(filepath='best_model.h5', monitor='MyMetric', save_best_only=True)]

train_history = model.fit([train_x],
                          [train_y], batch_size=batch_size, epochs=epochs, validation_split=0.05,
                          callbacks=best_model, verbose=2)
So my model runs, but I get this warning:
RuntimeWarning: Can save best model only with MyMetric available, skipping.
'skipping.' % (self.monitor), RuntimeWarning)
It would be great if anyone could tell me whether this is the right way to do it, and if not, what I should do.
You have to pass the metric you want to monitor to model.compile.
https://keras.io/metrics/#custom-metrics
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[MyMetric])
Also, tf.metrics.auc returns a tuple containing the tensor and update_op. Keras expects the custom metric function to return only a tensor.
def MyMetric(yTrue, yPred):
    import tensorflow as tf
    auc = tf.metrics.auc(yTrue, yPred)
    return auc[0]
After this step, you will get errors about uninitialized values. Please see these threads:
https://github.com/keras-team/keras/issues/3230
How to compute Receiving Operating Characteristic (ROC) and AUC in keras?
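One common workaround discussed in those threads is to initialize the local variables that tf.metrics.auc creates before training starts (a minimal sketch, assuming the TF1-style Keras session API):
import tensorflow as tf
from keras import backend as K

# tf.metrics.auc allocates local variables (true/false positive counters);
# run their initializer in the Keras session before fitting the model.
K.get_session().run(tf.local_variables_initializer())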
You can define a custom metric that calls TensorFlow to compute AUROC in the following way:
import functools
import tensorflow as tf
from keras import backend as K

def as_keras_metric(method):
    @functools.wraps(method)
    def wrapper(self, args, **kwargs):
        """ Wrapper for turning tensorflow metrics into keras metrics """
        value, update_op = method(self, args, **kwargs)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([update_op]):
            value = tf.identity(value)
        return value
    return wrapper

@as_keras_metric
def AUROC(y_true, y_pred, curve='ROC'):
    return tf.metrics.auc(y_true, y_pred, curve=curve)
You then need to compile your model with this metric:
model.compile(loss=train_loss, optimizer='adam', metrics=['accuracy',AUROC])
Finally: Checkpoint the model in the following way:
model_checkpoint = keras.callbacks.ModelCheckpoint(path_to_save_model, monitor='val_AUROC',
                                                   verbose=0, save_best_only=True,
                                                   save_weights_only=False, mode='auto', period=1)
Be careful though: I believe the validation AUROC is calculated batch-wise and averaged, so it might give some errors with checkpointing. A good idea might be to verify, after model training finishes, that the AUROC of the trained model's predictions (computed with sklearn.metrics) matches what TensorFlow reports while training and checkpointing.
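A minimal sketch of that sanity check, assuming a held-out validation set x_val/y_val and a binary output:
from sklearn.metrics import roc_auc_score

# Compare sklearn's exact AUROC on the validation set with the
# (batch-averaged) value Keras reported during training.
y_scores = model.predict(x_val).ravel()
print("sklearn AUROC:", roc_auc_score(y_val, y_scores))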
Assuming you use TensorBoard, you have a historical record, in the form of tfevents files, of all your metric calculations for all your epochs; in that case a tf.keras.callbacks.Callback is what you want.
I use tf.keras.callbacks.ModelCheckpoint with save_freq='epoch' to save the weights for each epoch as an h5 or tf file.
To avoid filling the hard drive with model files, write a new Callback's on_epoch_end implementation (or extend the ModelCheckpoint class's):
def on_epoch_end(self, epoch, logs=None):
    super(DropWorseModels, self).on_epoch_end(epoch, logs)
    if epoch < self._keep_best:
        return
    model_files = frozenset(
        filter(lambda filename: path.splitext(filename)[1] == SAVE_FORMAT_WITH_SEP,
               listdir(self._model_dir)))
    if len(model_files) < self._keep_best:
        return
    tf_events_logs = tuple(islice(log_parser(tfevents=path.join(self._log_dir,
                                                                self._split),
                                             tag=self.monitor),
                                  0,
                                  self._keep_best))
    keep_models = frozenset(map(self._filename.format,
                                map(itemgetter(0), tf_events_logs)))
    if len(keep_models) < self._keep_best:
        return
    it_consumes(map(lambda filename: remove(path.join(self._model_dir, filename)),
                    model_files - keep_models))
Appendix (imports and utility function implementations):
from itertools import islice
from operator import itemgetter
from os import path, listdir, remove
from collections import deque

import tensorflow as tf
from tensorflow.core.util import event_pb2

def log_parser(tfevents, tag):
    values = []
    for record in tf.data.TFRecordDataset(tfevents):
        event = event_pb2.Event.FromString(tf.get_static_value(record))
        if event.HasField('summary'):
            value = event.summary.value.pop(0)
            if value.tag == tag:
                values.append(value.simple_value)
    return tuple(sorted(enumerate(values), key=itemgetter(1), reverse=True))

it_consumes = lambda it, n=None: deque(it, maxlen=0) if n is None \
    else next(islice(it, n, n), None)

SAVE_FORMAT = 'h5'
SAVE_FORMAT_WITH_SEP = '{}{}'.format(path.extsep, SAVE_FORMAT)
For completeness, the rest of the class:
class DropWorseModels(tf.keras.callbacks.Callback):
    """
    Designed around making `save_best_only` work for arbitrary metrics
    and thresholds between metrics
    """

    def __init__(self, model_dir, monitor, log_dir, keep_best=2, split='validation'):
        """
        Args:
            model_dir: directory to save weights. Files will have format
                '{model_dir}/{epoch:04d}.h5'.
            split: dataset split to analyse, e.g., one of 'train', 'test', 'validation'
            monitor: quantity to monitor.
            log_dir: the path of the directory where to save the log files to be
                parsed by TensorBoard.
            keep_best: number of models to keep, sorted by monitor value
        """
        super(DropWorseModels, self).__init__()
        self._model_dir = model_dir
        self._split = split
        self._filename = 'model-{:04d}' + SAVE_FORMAT_WITH_SEP
        self._log_dir = log_dir
        self._keep_best = keep_best
        self.monitor = monitor
This has the added advantage of being able to save and delete multiple model files in a single Callback. You can easily extend it with different thresholding support, e.g., to keep all model files with an AUC within a threshold, or with TP, FP, TN, FN within a threshold.
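A rough usage sketch (the directory names, the monitored tag, and the datasets are placeholders, and a TensorBoard callback is assumed to write the tfevents files that DropWorseModels parses):
# Save a weights file per epoch, let TensorBoard write the metric logs,
# and let DropWorseModels prune everything but the best `keep_best` files.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=path.join('models', 'model-{epoch:04d}.h5'), save_freq='epoch')
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='logs')
drop_cb = DropWorseModels(model_dir='models', monitor='epoch_auc',
                          log_dir='logs', keep_best=2)

model.fit(train_ds, validation_data=val_ds, epochs=20,
          callbacks=[checkpoint_cb, tensorboard_cb, drop_cb])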

SVM classifier - save the trained model

I am new to machine learning. I am using SGDClassifier to classify my documents. I trained the model, and to persist the trained model I used pickle.
Code in classify.py for training the model:
corpus = df2.title_desc  # df2 is my dataframe with 2 columns: title_desc and category
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(corpus).todense()

variables = tfidf_matrix
labels = df2.category
variables_train, variables_test, labels_train, labels_test = train_test_split(variables, labels, test_size=0.1)

svm_classifier = linear_model.SGDClassifier(loss='hinge', alpha=0.0001)
svm_classifier = svm_classifier.fit(variables_train, labels_train)

with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(svm_classifier, fid)
After the model is dumped to a file, I created another .py file to test it.
test.py:
corpus_test = df_test.title_desc  # df_test is my dataframe with 2 columns: title_desc and category
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix_test = vectorizer.fit_transform(corpus_test).todense()

svm_classifier = linear_model.SGDClassifier(loss='hinge', alpha=0.0001)
with open('my_dumped_classifier.pkl', 'rb') as fid:
    svm_classifier = pickle.load(fid)

tfidf_matrix_test = vectorizer.transform(corpus_test).todense()
svm_predictions = svm_classifier.predict(tfidf_matrix_test)
I am not sure about the logic I have in test.py. On the line
svm_predictions = svm_classifier.predict(tfidf_matrix_test)
I get an error: 'ValueError: X has 249 features per sample; expecting 1050'.
Please give a solution.
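The error comes from fitting a new TfidfVectorizer on the test corpus (249 features) instead of reusing the one fitted on the training corpus (1050 features). One way to avoid this, sketched below using the question's dataframes, is to pickle a Pipeline so the fitted vectorizer travels with the classifier:
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model

# classify.py: fit vectorizer and classifier together, pickle the whole pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svm', linear_model.SGDClassifier(loss='hinge', alpha=0.0001)),
])
pipeline.fit(df2.title_desc, df2.category)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(pipeline, fid)

# test.py: load the pipeline and predict on raw text; no re-fitting needed
with open('my_dumped_classifier.pkl', 'rb') as fid:
    pipeline = pickle.load(fid)
svm_predictions = pipeline.predict(df_test.title_desc)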
