Since hyperparameter tuning seems to consist of training different models for the same task, I figured it would be a good idea to train them in parallel to save some time. However, my attempt was quite unsuccessful: multiple errors occurred while my code was running. I was wondering whether using Keras requires me to write the multithreading differently, or whether the problem lies elsewhere.
Here's what I wrote (I'm trying to calculate the effect of dropout on the minimal value of a custom metric):
from threading import Thread

class FitModel(Thread):
    def __init__(self, params):
        Thread.__init__(self)
        self.params = params

    def run(self):
        DC = DeltaCallback(verbose=0)  # custom metric
        model = keras.models.Sequential([
            keras.layers.Conv1D(64, 11, activation="relu", padding="SAME", input_shape=(700, 1)),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(128, 11, activation="relu", padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(256, 11, activation="relu", padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(512, 11, activation="relu", padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(512, 11, activation="relu", padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Flatten(),
            keras.layers.Dropout(self.params[0]),
            keras.layers.Dense(4096, activation="relu"),
            keras.layers.Dropout(self.params[1]),
            keras.layers.Dense(4096, activation="relu"),
            keras.layers.Dense(256, activation="softmax")
        ])
        model.compile(loss="sparse_categorical_crossentropy",
                      optimizer=keras.optimizers.RMSprop(learning_rate=0.00001),
                      metrics=["accuracy"])
        model.fit(X_train, y_train, batch_size=100, epochs=50,
                  validation_data=(X_valid, y_valid), callbacks=[DC], verbose=2)
        best_epoch = DC.deltas.index(min(DC.deltas))
        print(self.params, "epochs : ", best_epoch)
        print(self.params, "deltamin : ", DC.deltas[best_epoch])
        print(self.params, "Nval : ", DC.Nvals[best_epoch])

parameters_list = [[0, 0.1], [0.1, 0], [0.2, 0], [0.1, 0.2], [0.2, 0.1], [0.3, 0],
                   [0.3, 0.1], [0.3, 0.2], [0, 0.3], [0.1, 0.3], [0.2, 0.3]]

# create threads
THREADS = [FitModel(parameters) for parameters in parameters_list]

# start threads
for thread in THREADS:
    thread.start()

# wait for threads to finish
for thread in THREADS:
    thread.join()
The problem is that multiple Exceptions occur when I try to execute this code, as well as OOM errors. Any idea how to make this work?
Is it possible to implement the following scenario with TensorFlow:
In the first N batches, the learning rate should be increased from 0 to 0.001. Once this number of batches has been reached, the learning rate should slowly decrease from 0.001 to 0.00001 after each epoch.
How can I combine this in a single callback? TensorFlow offers tf.keras.callbacks.LearningRateScheduler as well as the callback hooks on_train_batch_begin() and on_train_batch_end(), but I can't figure out how to combine them.
Can someone suggest an approach for creating such a combined callback that depends on both the batch number and the epoch?
Something like this should work. I didn't test it and didn't try to perfect it, but the pieces are there so that you can get it working the way you like.
import tensorflow as tf
from tensorflow.keras.callbacks import Callback
import numpy as np

class LRSetter(Callback):
    def __init__(self, start_lr=0, middle_lr=0.001, end_lr=0.00001,
                 start_mid_batches=200, end_epochs=2000):
        super().__init__()
        self.start_mid_lr = np.linspace(start_lr, middle_lr, start_mid_batches)
        # Not exactly right since you'll have gone through a couple of epochs
        # by then, but you get the picture
        self.mid_end_lr = np.linspace(middle_lr, end_lr, end_epochs)
        self.start_mid_batches = start_mid_batches
        self.epoch_takeover = False

    def on_train_batch_begin(self, batch, logs=None):
        if batch < self.start_mid_batches:
            tf.keras.backend.set_value(self.model.optimizer.lr, self.start_mid_lr[batch])
        else:
            self.epoch_takeover = True

    def on_epoch_begin(self, epoch, logs=None):
        if self.epoch_takeover:
            tf.keras.backend.set_value(self.model.optimizer.lr, self.mid_end_lr[epoch])
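For completeness, here is how the callback would be attached; this is only a sketch, where the model, the training data, and the batch/epoch counts are placeholders rather than anything from the original question:

# Hypothetical usage: model, X_train and y_train are placeholders.
lr_setter = LRSetter(start_lr=0.0, middle_lr=0.001, end_lr=0.00001,
                     start_mid_batches=200, end_epochs=2000)
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss="mse")
model.fit(X_train, y_train, batch_size=32, epochs=2000, callbacks=[lr_setter])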
I have saved more than 1000 models, one for each item. Now I need to load all of these models into memory (a dataframe) to make predictions. If I just use a for loop to load the models, each load is about 3 seconds slower than the previous one, so I tried multiprocessing.pool (ThreadPool).
But, strangely, using ThreadPool causes the prediction to fail with "ValueError: Tensor Tensor ... is not an element of this graph", while with normal loading the prediction is fine.
I also tried threads and got the same error message.
# The following code leads to the ValueError
from multiprocessing.pool import ThreadPool as Pool

def load_model(stock):
    model_pred.at[0, stock] = keras.models.load_model(
        'C:/Users/chenp/Documents/rqpro/models/{}_model.h5'.format(stock))

pool = Pool(processes=16)
for stock in trade_stocks['stock']:
    pool.map(load_model, (stock,))

# Prediction
for stock in trade_stocks['stock']:
    model = model_pred.loc[0, stock]
    prediction = model.predict(pred_data)

# This raises the following message:
ValueError: Tensor Tensor("dense_9/Softmax:0", shape=(?, 2), dtype=float32) is not an element of this graph.

# Normal code, but too inefficient
for stock in trade_stocks['stock']:
    model_pred.at[0, stock] = keras.models.load_model(
        'C:/Users/chenp/Documents/rqpro/models/{}_model.h5'.format(stock))
This happens because Keras is not thread-safe. To solve this problem, call _make_predict_function() on each model before predicting. For a detailed answer, please check
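A minimal, untested sketch of that fix, assuming standalone Keras on the TF 1.x backend (where _make_predict_function() exists) and the same model_pred dataframe, trade_stocks list, and model path from the question:

import keras
from multiprocessing.pool import ThreadPool as Pool

def load_model(stock):
    model = keras.models.load_model(
        'C:/Users/chenp/Documents/rqpro/models/{}_model.h5'.format(stock))
    model._make_predict_function()  # build the predict function up front, in the loading thread
    model_pred.at[0, stock] = model

pool = Pool(processes=16)
pool.map(load_model, trade_stocks['stock'])  # map over the whole list in one call
pool.close()
pool.join()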
I'm trying to use PyTorch with a complex loss function. In order to accelerate the code, I was hoping to use the PyTorch multiprocessing package.
In the first trial, I put 10x1 features into the NN and get a 10x4 output.
After that, I want to pass the 10x4 parameters into a function to do some calculation. (The calculation will become more complex in the future.)
After the calculation, the function returns a 10x1 array in total. This array is set as NN_energy and used to compute the loss function.
Besides, I would also like to know whether there is another way to create a backward-able array to store the NN_energy values, instead of using
NN_energy = net(Data_in)[0:10,0]
Thanks a lot.
Full Code:
import torch
import numpy as np
from torch.autograd import Variable
from torch import multiprocessing

def func(msg, BOP):
    ans = (BOP[msg][0] + BOP[msg][1] / BOP[msg][2]) * BOP[msg][3]
    return ans

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
        super(Net, self).__init__()
        self.hidden_1 = torch.nn.Linear(n_feature, n_hidden_1)   # hidden layer
        self.hidden_2 = torch.nn.Linear(n_hidden_1, n_hidden_2)  # hidden layer
        self.predict = torch.nn.Linear(n_hidden_2, n_output)     # output layer

    def forward(self, x):
        x = torch.tanh(self.hidden_1(x))  # activation function for hidden layer
        x = torch.tanh(self.hidden_2(x))  # activation function for hidden layer
        x = self.predict(x)               # linear output
        return x

if __name__ == '__main__':  # apply_async
    Data_in = Variable(torch.from_numpy(np.asarray(list(range(0, 10))).reshape(10, 1)).float())
    Ground_truth = Variable(torch.from_numpy(np.asarray(list(range(20, 30))).reshape(10, 1)).float())

    net = Net(n_feature=1, n_hidden_1=15, n_hidden_2=15, n_output=4)  # define the network
    optimizer = torch.optim.Rprop(net.parameters())
    loss_func = torch.nn.MSELoss()  # this is for regression mean squared loss

    NN_output = net(Data_in)
    args = range(0, 10)
    pool = multiprocessing.Pool()
    return_data = pool.map(func, zip(args, NN_output))
    pool.close()
    pool.join()

    NN_energy = net(Data_in)[0:10, 0]
    for i in range(0, 10):
        NN_energy[i] = return_data[i]

    loss = torch.sqrt(loss_func(NN_energy, Ground_truth))  # must be (1. nn output, 2. target)
    print(loss)
Error messages:
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 126, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
First of all, the Torch Variable API has been deprecated for a very long time; just don't use it.
Next, torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() is wrong on many levels: np.asarray of a list is pointless since a copy will be performed anyway, and np.array takes a list as input by design. Then, np.arange is available to return a range as a NumPy array, and it also exists in Torch. Next, specifying both dimensions for reshape is unnecessary and error-prone; you could simply do reshape((-1, 1)), or even better unsqueeze(-1).
Here is the simplified expression: torch.arange(10, dtype=torch.float32, requires_grad=True).unsqueeze(-1).
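Applied to the two tensors built in the question, that construction would look like this (a sketch; requires_grad is only needed if you really want gradients with respect to these inputs):

import torch

Data_in = torch.arange(0, 10, dtype=torch.float32).unsqueeze(-1)        # shape (10, 1)
Ground_truth = torch.arange(20, 30, dtype=torch.float32).unsqueeze(-1)  # shape (10, 1)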
Using a multiprocessing pool is bad practice when batch processing is possible: batching is both far more efficient and more readable. Indeed, performing N small algebraic operations in parallel is always slower than one larger algebraic operation, and even more so on GPU. More importantly, computing the gradient is not supported across process boundaries, hence the error that you get. Actually, that is only partially true, as it has been supported for CPU tensors since 1.6.0; have a look at the official release changelog.
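For the func posted in the question, a batched replacement could be a single vectorized expression over the whole (10, 4) output, which stays inside autograd and removes the need for the pool entirely (a sketch, not tested against the more complex calculation you have in mind):

def func_batched(BOP):
    # BOP is the full (N, 4) network output; returns an (N,) tensor
    return (BOP[:, 0] + BOP[:, 1] / BOP[:, 2]) * BOP[:, 3]

NN_energy = func_batched(NN_output)                                # differentiable, shape (10,)
loss = torch.sqrt(loss_func(NN_energy, Ground_truth.squeeze(-1)))  # same RMSE as before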
Could you post a more representative example of what the func method could be, to make sure you really need it?
NB: distributed autograd, which seems to be what you are looking for, is now available in PyTorch as an experimental feature, in beta since 1.6.0. Have a look at the official documentation.
I am wondering if there is an easy way to trigger early stopping in Keras based on user input rather than on the monitoring of any particular metric.
I.e. I would like to send a keyboard signal to the process executing the training so that it gets out of the fit_generator function and executes the remaining code.
Any ideas?
EDIT: Based on @AnkurGoel's answer, I wrote this code:
# Monitors the SIGINT (ctrl + C) to safely stop training when it is sent
flag = False

class TerminateOnFlag(Callback):
    """Callback that terminates training when the flag is raised."""
    def on_batch_end(self, batch, logs=None):
        if flag:
            self.model.stop_training = True

def handler(signum, frame):
    logging.info('SIGINT signal received. Training will finish after this epoch')
    global flag
    flag = True

signal.signal(signal.SIGINT, handler)  # We assign a specific handler for the SIGINT signal

terminateOnFlag = TerminateOnFlag()
callbacks.append(terminateOnFlag)
Where callbacks is a list of callbacks I fed into fit_generator.
During training, when I send the SIGINT signal I do indeed get the message SIGINT signal received. Training will finish after this epoch, but when the epoch ends nothing happens. What is going on?
You can give some thought to the approach below:
Use one global variable, initialized to 0.
Use a signal handler:
When a signal (interrupt) is received by the Python process, the variable's value is changed from 0 to 1.
Use a custom callback in Keras to stop the training when this variable's value changes.
class TerminateOnFlag(Callback):
    """Callback that terminates training when flag=1 is encountered."""
    def on_batch_end(self, batch, logs=None):
        if flag == 1:
            self.model.stop_training = True
Original Callbacks are available at:
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L251
You still have to check whether it is possible to pass a custom callback to fit_generator instead of the standard callbacks.
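For what it's worth, custom Callback subclasses are accepted by fit_generator the same way as the built-in ones (the asker's EDIT above does exactly this); a minimal sketch, where the generator and the step counts are placeholders:

terminate_on_flag = TerminateOnFlag()
model.fit_generator(train_generator,
                    steps_per_epoch=100,
                    epochs=10,
                    callbacks=[terminate_on_flag])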
Here is the code for the signal handler:
For Windows:
import signal, os

flag = 0  # global flag checked by TerminateOnFlag

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    global flag
    flag = 1  # raise the flag instead of raising an exception

signal.signal(signal.CTRL_C_EVENT, handler)  # only in Python 3.2+
For Linux:
import signal, os

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    global flag
    flag = 1  # raise the flag checked by TerminateOnFlag

signal.signal(signal.SIGINT, handler)
A better and safer way is to use the mouse as input for stopping and for other interactive control.
For example, to stop Keras at the end of a batch when the mouse is moved to the left side of the screen (mouse_x < 10):
def queryMousePosition():
    from ctypes import windll, Structure, c_long, byref

    class POINT(Structure):
        _fields_ = [("x", c_long), ("y", c_long)]

    pt = POINT()
    windll.user32.GetCursorPos(byref(pt))
    return pt.x, pt.y  # %timeit queryMousePosition()

class TerminateOnFlag(keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        mouse_x, mouse_y = queryMousePosition()
        if mouse_x < 10:
            self.model.stop_training = True

callbacks = [keras.callbacks.ReduceLROnPlateau(), TerminateOnFlag()]

model.fit_generator(..., callbacks=callbacks, ...)
Not using a keyboard signal, but when running Keras in a Jupyter notebook I found it easiest to use a callback that stops training on the presence of a particular file.
TRAINING_POISON_PILL_FILE_NAME = 'stop-training'

class PoisonPillCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if os.path.exists(TRAINING_POISON_PILL_FILE_NAME):
            self.model.stop_training = True
            os.remove(TRAINING_POISON_PILL_FILE_NAME)
            print(f'poison pill file "{TRAINING_POISON_PILL_FILE_NAME}" detected, stopping training')

model.fit(..., callbacks=[PoisonPillCallback(), ...])
Then you can just create an (empty) file with this name in the current directory in the Jupyter UI and it will stop training after the current epoch.
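For example, the stop file can be created from another notebook cell (the filename is just the constant defined above):

# Run this in another cell (or `touch stop-training` in a terminal) to stop after the current epoch.
open(TRAINING_POISON_PILL_FILE_NAME, 'w').close()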
I want to be able to load an existing TensorFlow network from several processes at once, using the multiprocessing library, so that I can run inference on different cores simultaneously.
def spawn_process(x):
    g = tf.Graph()
    sess = tf.Session(graph=g)
    with g.as_default():
        meta_graph = tf.train.import_meta_graph('model.meta')
        meta_graph.restore(sess, tf.train.latest_checkpoint(chkp_dir))
        x_ph = tf.get_collection('x')[0]     # placeholder tensor that we use to pass in new x's to do inference on
        pred = tf.get_collection('pred')[0]  # tensor responsible for computing prediction given x
        prediction = sess.run(pred, feed_dict={x_ph: x})
    return prediction
This is basically the function I want to pass to Pool.map so that inference runs in parallel.
The above function works with plain map or a list comprehension, like this:
predictions = list(map(spawn_process, range(10)))
predictions = [spawn_process(x) for x in range(10)]
Both of the above work as expected.
But when I try to do this, it fails: each process just hangs right before the meta_graph.restore line, and I'm stumped.
p = multiprocess.Pool(4)
predictions = p.map(spawn_process, range(10))
p.close()
p.join()
I don't know why this isn't working with TensorFlow; this approach normally works for me for any sort of parallel computation. It stops right before the meta_graph.restore line and all the processes hang there.
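Not part of the original post, but one commonly suggested direction for this kind of hang is to avoid fork-based workers (a forked child inherits TensorFlow's internal state and can deadlock) and use the spawn start method instead, so that every worker starts a fresh interpreter and builds its own graph and session inside spawn_process. A rough sketch using the standard multiprocessing module:

import multiprocessing as mp

if __name__ == '__main__':
    # spawn_process must be defined at module level so the spawned workers can import it
    ctx = mp.get_context('spawn')
    with ctx.Pool(4) as p:
        predictions = p.map(spawn_process, range(10))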