Dear TensorFlow Community,
I'm training a classifier with tf.contrib.factorization.KMeansClustering,
but training is really slow and uses only 1% of my GPU.
Meanwhile, my 4 CPU cores are constantly sitting at about 35% usage.
Is it the case that K-Means is written more for the CPU than the GPU?
Is there a way I can shift more of the computation to the GPU, or some
other approach to speed up training?
Below is my script for training (Python3).
Thank you for your time.
import tensorflow as tf
def parser(record):
    features = {
        'feats': tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_single_example(record, features)
    feats = tf.convert_to_tensor(tf.decode_raw(parsed['feats'], tf.float64))
    return {'feats': feats}
def my_input_fn(tfrecords_path):
    dataset = (
        tf.data.TFRecordDataset(tfrecords_path)
        .map(parser)
        .batch(1024)
    )
    iterator = dataset.make_one_shot_iterator()
    batch_feats = iterator.get_next()
    return batch_feats
### SPEC FUNCTIONS ###
train_spec_kmeans = tf.estimator.TrainSpec(input_fn=lambda: my_input_fn('/home/ubuntu/train.tfrecords'), max_steps=10000)
eval_spec_kmeans = tf.estimator.EvalSpec(input_fn=lambda: my_input_fn('/home/ubuntu/eval.tfrecords'))
### INIT ESTIMATOR ###
KMeansEstimator = tf.contrib.factorization.KMeansClustering(
    num_clusters=500,
    feature_columns=[tf.feature_column.numeric_column(
        key='feats',
        dtype=tf.float64,
        shape=(377,),
    )],
    use_mini_batch=True)
### TRAIN & EVAL ###
tf.estimator.train_and_evaluate(KMeansEstimator, train_spec_kmeans, eval_spec_kmeans)
Best,
Josh
Here's my best answer so far, with timing information, building off of Eliethesaiyan's answer and the linked docs.
My original Dataset codeblock and performance:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn)
    .batch(1024)
)
real 1m36.171s
user 2m57.756s
sys 0m42.304s
Eliethesaiyan's answer (prefetch + num_parallel_calls):
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn, num_parallel_calls=multiprocessing.cpu_count())
    .batch(1024)
    .prefetch(1024)
)
real 0m41.450s
user 1m33.120s
sys 0m18.772s
From the docs using map_and_batch + num_parallel_batches + prefetch:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .apply(
        tf.contrib.data.map_and_batch(
            map_func=parse_fn,
            batch_size=1024,
            num_parallel_batches=multiprocessing.cpu_count()
        )
    )
    .prefetch(1024)
)
real 0m32.855s
user 1m11.412s
sys 0m10.408s
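For reference, newer TF 1.x releases (1.13+; I'm assuming you're on one of those) also expose tf.data.experimental.AUTOTUNE, which lets the tf.data runtime pick the parallelism and prefetch depth for you instead of hard-coding cpu_count() and a fixed buffer size. A minimal sketch, which I haven't timed:

# Sketch assuming TF >= 1.13, where tf.data.experimental.AUTOTUNE is available.
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(1024)
    .prefetch(tf.data.experimental.AUTOTUNE)
)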
One of the things I saw that increases GPU and CPU usage is using prefetch on the dataset. It keeps the dataset producer fetching data while the model is consuming the previous batch, thereby maximizing resource usage. Also, setting num_parallel_calls to the maximum number of your CPU cores speeds up the process.
I would restructure it this way:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parser, num_parallel_calls=multiprocessing.cpu_count())
    .batch(1024)
)

dataset = dataset.prefetch(1024)
Here is a nice guide to best practices when it comes to using TFRecords: here
Related
I am currently trying to optimize the hyperparameters of a gradient boosting method with the hyperopt library. When I was working on my own computer, I used the Trials class and was able to save and reload my results with the pickle library. This gave me a record of every set of parameters I tested. My code looked like this:
import os
import pickle as pkl

import xgboost as xgb
from hyperopt import SparkTrials, Trials, STATUS_OK, tpe, fmin
from sklearn.model_selection import cross_val_score

from LearningUtils.LearningUtils import build_train_test, get_train_test, mean_error, rmse, mae
from LearningUtils.constants import MAX_EVALS, CV, XGBOOST_OPTIM_SPACE, PARALELISM

if os.path.isdir(PATH_TO_TRIALS):  # we reload the past results
    with open(PATH_TO_TRIALS, 'rb') as trials_file:
        trials = pkl.load(trials_file)
else:  # we create the trials file
    trials = Trials()
# classic hyperparameter optimization
def objective(space):
    regressor = xgb.XGBRegressor(n_estimators=space['n_estimators'],
                                 max_depth=int(space['max_depth']),
                                 learning_rate=space['learning_rate'],
                                 gamma=space['gamma'],
                                 min_child_weight=space['min_child_weight'],
                                 subsample=space['subsample'],
                                 colsample_bytree=space['colsample_bytree'],
                                 verbosity=0)
    regressor.fit(X_train, Y_train)

    # Applying k-fold cross-validation
    accuracies = cross_val_score(estimator=regressor, X=X_train, y=Y_train, cv=5)
    CrossValMean = accuracies.mean()

    return {'loss': 1 - CrossValMean, 'status': STATUS_OK}
best = fmin(fn=objective,
            space=XGBOOST_OPTIM_SPACE,
            algo=tpe.suggest,
            max_evals=MAX_EVALS,
            trials=trials,
            return_argmin=False)

# Save the trials
pkl.dump(trials, open(PATH_TO_TRIALS, "wb"))
Now, I would like to make this code work on a remote server with more CPUs in order to allow parallelization and save time.
I saw that I can simply do that by using hyperopt's SparkTrials class instead of Trials. But SparkTrials objects cannot be saved with pickle. Do you have any idea how I could save and reload my trial results stored in a SparkTrials object?
So this might be a bit late, but after messing around a bit, I found a kind of hacky solution:
spark_trials = SparkTrials()

pickling_trials = dict()
for k, v in spark_trials.__dict__.items():
    if k not in ['_spark_context', '_spark']:
        pickling_trials[k] = v

pickle.dump(pickling_trials, open('pickling_trials.hyperopt', 'wb'))
The _spark_context and _spark attributes of the SparkTrials instance are the culprits that make the object unserializable. It turns out you don't need them if you want to reuse the object, because when you re-run the optimization a new Spark context is created anyway, so you can reuse the trials like this:
new_sparktrials = SparkTrials()
for att, v in pickling_trials.items():
    setattr(new_sparktrials, att, v)

best = fmin(loss_func,
            space=search_space,
            algo=tpe.suggest,
            max_evals=1000,
            trials=new_sparktrials)
voilà :)
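For anyone who wants the full round trip in one place, here is a minimal sketch combining the two halves of this answer: dump everything except the unpicklable Spark handles, then restore the saved attributes onto a fresh SparkTrials. The file name, loss_func and search_space are placeholders taken from the snippets above, not real values:

import pickle
from hyperopt import SparkTrials, tpe, fmin

spark_trials = SparkTrials()
# ... run fmin(loss_func, space=search_space, algo=tpe.suggest,
#              max_evals=1000, trials=spark_trials) here ...

# Save: copy every attribute except the unpicklable Spark handles
pickling_trials = {k: v for k, v in spark_trials.__dict__.items()
                   if k not in ['_spark_context', '_spark']}
with open('pickling_trials.hyperopt', 'wb') as f:
    pickle.dump(pickling_trials, f)

# Later run: restore the saved attributes onto a fresh SparkTrials
with open('pickling_trials.hyperopt', 'rb') as f:
    restored = pickle.load(f)
new_sparktrials = SparkTrials()
for att, v in restored.items():
    setattr(new_sparktrials, att, v)

# fmin picks up the saved history and continues from there
best = fmin(loss_func, space=search_space, algo=tpe.suggest,
            max_evals=1000, trials=new_sparktrials)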
I'm trying to run a simple BoostedTreesClassifier from the example on my dataset, but it seems to get stuck on the first step:
2019-06-28 11:20:31.658689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 84090 of 85873
2019-06-28 11:20:32.908425: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
I0628 11:20:34.904214 140220602029888 basic_session_run_hooks.py:262] loss = 0.6931464, step = 0
W0628 11:21:03.421219 140220602029888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W0628 11:21:05.555618 140220602029888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
The same dataset works fine when I pass it to other Keras-based models or an xgboost model.
Here's the relevant code:
def make_input_fn(self, X, y, shuffle=True, num_epochs=None):
    num_samples = len(self.y_train)

    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(num_samples).repeat(num_epochs).batch(self.batch_size)
        else:
            dataset = dataset.repeat(num_epochs).batch(self.batch_size)
        return dataset

    return input_fn
def ens_train(self):
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)
    train_input_fn = self.make_input_fn(self.X_train, self.y_train, num_epochs=self.epochs)
    self.model = tf.estimator.BoostedTreesClassifier(
        self.feature_columns,
        n_batches_per_layer=int(0.5 * len(self.y_train) / self.batch_size),
        model_dir=self.ofolder,
        max_depth=10,
        n_trees=1000)
    self.model.train(train_input_fn, max_steps=1000)
I was able to get a result by playing around with learning rates and the number of epochs. The "best" parameters obtained by hyperparameter tuning on xgboost do not give similar results with BoostedTreesClassifier; it took a large number of epochs to get around 84% accuracy (balanced dataset), whereas xgboost had given 95% without any hyperparameter tuning.
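For concreteness, here is a sketch of where those two knobs live in the code above. The values shown are placeholders, not the ones that actually produced the 84%; learning_rate is a documented parameter of tf.estimator.BoostedTreesClassifier, and num_epochs is already exposed by make_input_fn:

# Inside ens_train(): more passes over the data via num_epochs,
# plus an explicit learning_rate on the estimator itself.
train_input_fn = self.make_input_fn(self.X_train, self.y_train, num_epochs=200)  # placeholder value
self.model = tf.estimator.BoostedTreesClassifier(
    self.feature_columns,
    n_batches_per_layer=int(0.5 * len(self.y_train) / self.batch_size),
    model_dir=self.ofolder,
    learning_rate=0.05,   # placeholder value
    max_depth=10,
    n_trees=1000)
self.model.train(train_input_fn, max_steps=10000)  # more steps so training can progress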
I'm trying to use PyTorch with a complex loss function. To speed up the code, I hope I can use the PyTorch multiprocessing package.
In the first trial, I put 10x1 features into the NN and get a 10x4 output.
After that, I want to pass the 10x4 parameters into a function to do some calculations. (The calculation will be more complex in the future.)
After the calculation, the function returns a 10x1 array. This array is set as NN_energy and used to compute the loss function.
Besides, I also want to know if there is another method to create a backward-able array to store the NN_energy values, instead of using
NN_energy = net(Data_in)[0:10,0]
Thanks a lot.
Full Code:
import torch
import numpy as np
from torch.autograd import Variable
from torch import multiprocessing

def func(msg, BOP):
    ans = (BOP[msg][0] + BOP[msg][1] / BOP[msg][2]) * BOP[msg][3]
    return ans
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
        super(Net, self).__init__()
        self.hidden_1 = torch.nn.Linear(n_feature, n_hidden_1)    # hidden layer
        self.hidden_2 = torch.nn.Linear(n_hidden_1, n_hidden_2)   # hidden layer
        self.predict = torch.nn.Linear(n_hidden_2, n_output)      # output layer

    def forward(self, x):
        x = torch.tanh(self.hidden_1(x))   # activation function for hidden layer
        x = torch.tanh(self.hidden_2(x))   # activation function for hidden layer
        x = self.predict(x)                # linear output
        return x
if __name__ == '__main__':  # apply_async
    Data_in = Variable(torch.from_numpy(np.asarray(list(range(0, 10))).reshape(10, 1)).float())
    Ground_truth = Variable(torch.from_numpy(np.asarray(list(range(20, 30))).reshape(10, 1)).float())

    net = Net(n_feature=1, n_hidden_1=15, n_hidden_2=15, n_output=4)   # define the network
    optimizer = torch.optim.Rprop(net.parameters())
    loss_func = torch.nn.MSELoss()   # this is for regression mean squared loss

    NN_output = net(Data_in)

    args = range(0, 10)
    pool = multiprocessing.Pool()
    return_data = pool.map(func, zip(args, NN_output))
    pool.close()
    pool.join()

    NN_energy = net(Data_in)[0:10, 0]
    for i in range(0, 10):
        NN_energy[i] = return_data[i]

    loss = torch.sqrt(loss_func(NN_energy, Ground_truth))   # must be (1. nn output, 2. target)
    print(loss)
Error messages:
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 126, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
First of all, the torch Variable API has been deprecated for a very long time; just don't use it.
Next, torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() is wrong on many levels: np.asarray of a list is useless since a copy will be performed anyway, and np.array takes a list as input by design. Then, np.arange is available to return a range as a numpy array, and it also exists in torch. Next, specifying both dimensions for reshape is unnecessary and error-prone; you could simply do reshape((-1, 1)), or even better unsqueeze(-1).
Here is the simplified expression: torch.arange(10, dtype=torch.float32, requires_grad=True).unsqueeze(-1).
Using a multiprocessing pool is bad practice when batch processing is possible. Batching will be both more efficient and more readable. Indeed, performing N small algebraic operations in parallel is always slower than a single larger algebraic operation, and even more so on GPU. More importantly, computing the gradient is not supported by multiprocessing, hence the error that you get. Yet this is only partially true, because multiprocessing is supported for CPU tensors since 1.6.0; have a look at the official release changelog.
Could you post a more representative example of what the func method could be, to make sure you really need it?
NB: distributed autograd, which is what you are looking for, is now available in PyTorch as an experimental feature, in beta since 1.6.0. Have a look at the official documentation.
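To illustrate the batched alternative suggested above: the per-row func from the question can be written as a single vectorized expression on the network output, which stays inside autograd and needs no pool at all. A minimal sketch, reusing the Net class defined in the question:

import torch

# Same data as the question, built with the simplified expressions above
Data_in = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
Ground_truth = torch.arange(20, 30, dtype=torch.float32).unsqueeze(-1)

net = Net(n_feature=1, n_hidden_1=15, n_hidden_2=15, n_output=4)
loss_func = torch.nn.MSELoss()

NN_output = net(Data_in)                     # shape (10, 4)
# Vectorized func: (a + b / c) * d applied to every row at once
NN_energy = (NN_output[:, 0] + NN_output[:, 1] / NN_output[:, 2]) * NN_output[:, 3]

loss = torch.sqrt(loss_func(NN_energy.unsqueeze(-1), Ground_truth))
loss.backward()                              # gradients flow through the whole pipeline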
I'm computing learning curves from random forests using sklearn. I need to do it for a lot of different RFs, so I want to use a cluster and Dask to reduce the time of the RF fits.
Currently I implemented the following algorithm:
from sklearn.externals import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve
from dask.distributed import Client, LocalCluster

worker_kwargs = dict(memory_limit="2GB", ncores=4)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, **worker_kwargs)  # processes=False?
client = Client(cluster)

X, Y = ..., ...
estimator = RandomForestRegressor(n_jobs=-1, **rf_params)
cv = ShuffleSplit(n_splits=5, test_size=0.2)
train_sizes = [...]  # 20 different values

with joblib.parallel_backend('dask', scatter=[X, Y]):
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)
Here are 2 levels of parallelism:
One for the fitting of a RF (n_jobs=-1)
One for the looping over all the training set sizes (n_jobs=-1)
My problem is: if the backend is loky, then it takes around 23s.
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 22.8s finished
Now, if the backend is dask, then it takes more time:
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 30.3s finished
I know that Dask introduces overhead, but I don't expect that to explain all of the difference in running time.
Dask is being developed quickly, and I find a lot of different ways to do the same thing, without knowing which one is up to date.
I've been doing some adaptation to the code in this blog about CNNs for text classification:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
Everything works fine! But when I try to use the trained model to predict new instances, it consumes all available memory. It seems that it's not freeing any memory as it evaluates, and it loads the whole model again and again. As far as I know, memory should be freed after every sess.run call.
Here is the part of the code I'm working with:
with graph.as_default():
    session_conf = tf.ConfigProto(
        allow_soft_placement=FLAGS.allow_soft_placement,
        log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # Load the saved meta graph and restore variables
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)

        # Get the placeholders from the graph by name
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        # input_y = graph.get_operation_by_name("input_y").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]

        # Tensors we want to evaluate
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]

        # Add a vector for probas
        probas = graph.get_operation_by_name("output/scores").outputs[0]

        # Generate batches for one epoch
        print("\nGenerating Batches...\n")
        gc.collect()
        #mem0 = proc.get_memory_info().rss
        batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False)
        #mem1 = proc.get_memory_info().rss
        print("\nBatches done...\n")
        #pd = lambda x2, x1: 100.0 * (x2 - x1) / mem0
        #print "Allocation: %0.2f%%" % pd(mem1, mem0)

        # Collect the predictions here
        all_predictions = []
        all_probas = []

        for x_test_batch in batches:
            # Calculate probability of the prediction being good
            gc.collect()
            batch_probas = sess.run(tf.reduce_max(tf.nn.softmax(probas), 1), {input_x: x_test_batch, dropout_keep_prob: 1.0})
            batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])
            all_probas = np.concatenate([all_probas, batch_probas])

            # Add summary ops to collect data
            with tf.name_scope("eval") as scope:
                p_h = tf.histogram_summary("eval/probas", batch_probas)
                summary = sess.run(p_h)
            eval_summary_writer.add_summary(summary)
Any help will be much appreciated
Cheers
Your evaluation loop creates new TensorFlow operations (tf.reduce_max(), tf.nn.softmax() and tf.histogram_summary()) in each iteration, which leads to more memory being consumed over time. TensorFlow is most efficient when you run the same graph many times, because it can amortize the cost of optimizing the graph over multiple executions. Therefore, to get the best performance, you should revise your program so that you create each of these operations once, before the for x_test_batch in batches: loop, and then re-use the same operations in each iteration.
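A minimal sketch of that restructuring, using the variable names from your snippet. The placeholder for the histogram summary is something I'm adding so the summary op can be built once; tf.histogram_summary is kept only because your code already uses that old API:

# Build the extra ops once, before the loop
max_probas_op = tf.reduce_max(tf.nn.softmax(probas), 1)
with tf.name_scope("eval"):
    batch_probas_ph = tf.placeholder(tf.float32, [None], name="batch_probas")
    p_h = tf.histogram_summary("eval/probas", batch_probas_ph)

all_predictions = []
all_probas = []
for x_test_batch in batches:
    feed_dict = {input_x: x_test_batch, dropout_keep_prob: 1.0}
    # One sess.run for both tensors; no new ops are created inside the loop
    batch_probas, batch_predictions = sess.run([max_probas_op, predictions], feed_dict)
    all_predictions = np.concatenate([all_predictions, batch_predictions])
    all_probas = np.concatenate([all_probas, batch_probas])
    summary = sess.run(p_h, {batch_probas_ph: batch_probas})
    eval_summary_writer.add_summary(summary)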