I'm trying to build a Horovod Torch estimator for a Spark pipeline, but I'm getting an error when I try to fit the data, and I don't understand its cause.
I've left the full stack error here, but the final trace is as follows:
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[1,0]<stderr>: return _run_code(code, main_globals, None,
[1,0]<stderr>: File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
[1,0]<stderr>: exec(code, run_globals)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module>
[1,0]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main
[1,0]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/task/__init__.py", line 61, in task_exec
[1,0]<stderr>: result = fn(*args, **kwargs)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 432, in train
[1,0]<stderr>: 'train': _train(epoch)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 373, in _train
[1,0]<stderr>: inputs, labels, sample_weights = prepare_batch(row)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 306, in prepare_batch
[1,0]<stderr>: for col, shape in zip(feature_columns, input_shapes)]
[1,0]<stderr>:TypeError: 'NoneType' object is not iterable
Unfortunately, I don't really know how to give a minimal reproducible example, but I'll try to give the most relevant information needed to understand the context.
I'm working in a Google Colab notebook.
I created the estimator following the documentation, like so:
import horovod.spark.torch as hvd
from horovod.spark.common.store import DBFSLocalStore
import shutil
import uuid
from torch import optim

uuid_str = str(uuid.uuid4())
work_dir = "/dbfs/horovod_spark_estimator/" + uuid_str

# Setup store for intermediate data
store = DBFSLocalStore(work_dir)
optimizer = optim.Adam(model.parameters(), lr=1.0e-3)

torch_estimator = hvd.TorchEstimator(
    store=store,
    num_proc=1,
    model=model,
    optimizer=optimizer,
    feature_cols=['Windows'],
    label_cols=['Labels'],
    verbose=1)
The model is from this repo: a basic LSTM in PyTorch with some minor customizations. I also changed the requirements to use the latest library versions.
The DataFrame fed to the fit function looks like this:
+--------------------+------+---+
| Windows|Labels| id|
+--------------------+------+---+
|[0, 0, 0, 0, 1, 2...| 1| 0|
|[0, 0, 0, 1, 2, 2...| 1| 1|
|[0, 0, 1, 2, 2, 2...| 1| 2|
|[0, 1, 2, 2, 2, 2...| 0| 3|
|[1, 2, 2, 2, 2, 2...| 0| 4|
|[2, 2, 2, 2, 2, 2...| 0| 5|
|[2, 2, 2, 2, 2, 2...| 0| 6|
|[2, 2, 2, 2, 2, 2...| 0| 7|
+--------------------+------+---+
Each row holds an array/vector of 10 ids (the window) and one label for the whole window.
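For anyone trying to reproduce the setup, a toy DataFrame with the same schema can be built like this (the window values are truncated in the output above, so the trailing ids here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction: a 10-id window per row, a binary label
# for the whole window, and a row id.
rows = [
    ([0, 0, 0, 0, 1, 2, 2, 2, 2, 2], 1, 0),
    ([0, 0, 0, 1, 2, 2, 2, 2, 2, 2], 1, 1),
    ([0, 1, 2, 2, 2, 2, 2, 2, 2, 2], 0, 3),
]
df = spark.createDataFrame(rows, ['Windows', 'Labels', 'id'])
df.show()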
My attempts to solve this issue
My first hypothesis, looking at the error log, was a mismatch between the input data and the model definition. If so, I don't understand how the mismatch is possible, since the model expects the list of ids as features and a single 0/1 label for the whole window (list).
Thinking that it might be caused by the column names, I also tried changing them to exactly the name expected when using the normal torch dataloader, but with no success.
The second thing that stood out was the NoneType object, as if Horovod isn't finding the correct data or structure to pass to the model.
I looked around the web for similar situations, but Horovod doesn't seem to be widely used, so I found little to nothing useful.
Any help in finding a solution to this would be greatly appreciated, but I'm also open to any alternative to horovod that can integrate a torch model into a spark pipeline or just be fed a spark dataframe.
EDIT 1:
Thanks to @pSoLT I managed to get past the NoneType error by setting the input shapes in the constructor of hvd.TorchEstimator (in my case input_shapes=[[-1,10]]).
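For reference, the constructor with that fix applied (everything else unchanged from above):

torch_estimator = hvd.TorchEstimator(
    store=store,
    num_proc=1,
    model=model,
    optimizer=optimizer,
    input_shapes=[[-1, 10]],  # one shape per feature column: batches of 10-id windows
    feature_cols=['Windows'],
    label_cols=['Labels'],
    verbose=1)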
Unfortunately, I immediately stumbled upon another one:
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/task/__init__.py", line 61, in task_exec
[1,0]<stderr>: result = fn(*args, **kwargs)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 432, in train
[1,0]<stderr>: 'train': _train(epoch)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 374, in _train
[1,0]<stderr>: outputs, loss = train_minibatch(model, optimizer, transform_outputs,
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 468, in train_minibatch
[1,0]<stderr>: loss = loss_fn(outputs, labels, sample_weights)
[1,0]<stderr>: File "/usr/local/lib/python3.8/dist-packages/horovod/spark/torch/remote.py", line 351, in loss_fn
[1,0]<stderr>: loss = calculate_loss(outputs, labels, loss_weights, loss_fns, sample_weights)
[1,0]<stderr>:NameError: free variable 'loss_fns' referenced before assignment in enclosing scope
As I mentioned in the comments, I'll play around more if I have the chance, but I'll try switching to SparkTorch for the immediate future.
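For anyone else considering that route, a minimal SparkTorch sketch would look roughly like this, following the SparkTorch README (the criterion here is a placeholder, I haven't verified this end to end, and SparkTorch may require the feature column to be a Spark ML vector):

import torch
import torch.nn as nn
from sparktorch import serialize_torch_obj, SparkTorch

# Bundle the model with a loss and an optimizer class so the Spark
# workers can reconstruct it.
torch_obj = serialize_torch_obj(
    model=model,
    criterion=nn.BCEWithLogitsLoss(),  # placeholder loss
    optimizer=torch.optim.Adam,
    lr=1.0e-3
)

spark_model = SparkTorch(
    inputCol='Windows',
    labelCol='Labels',
    predictionCol='predictions',
    torchObj=torch_obj,
    iters=50,
    verbose=1
).fit(df)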
Related
When I call model.predict_proba(X) on my StackingClassifier model, the run crashes because the library calls a method assert_all_finite() to check whether my dataframe contains missing values.
Since the stacked estimators can handle missing values themselves, I don't see why this should happen, and I didn't find anything in the documentation saying that StackingClassifier requires data without missing values.
It's a bit hard for me to come up with a short reproducible snippet, given that the model comes from several layers of abstraction, but I can print out the model that raises the error.
p = model.predict_proba(X_loyal)
where model is:
StackingClassifier(estimators=[('ExtraTreesClassifier_117',
ExtraTreesClassifier(bootstrap=True,
class_weight={0: 1, 1: 5},
criterion='entropy',
max_depth=11,
max_features='log2',
max_samples=0.5946040593595099,
min_samples_leaf=2,
n_estimators=163,
random_state=117)),
('RandomForestClassifier_117',
RandomForestClassifier(class_weight={0: 1,
1: 5},
criterion='entropy',
max_depth=11,
max_features='log2',
max_samples=0.5946040593595099,
min_samples_leaf=2,
n_estimators=163,
random_state=117)),
('LGBMClassifier_117',
LGBMClassifier(class_weight={0: 1, 1: 1},
deterministic=True, max_depth=9,
n_estimators=183, num_leaves=3,
subsample=0.2986274713775564,
verbose=-1))])
Error
Traceback (most recent call last):
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-12-eafa75c49322>", line 1, in <module>
model.predict_proba(X_loyal)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 120, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 485, in predict_proba
return self.final_estimator_.predict_proba(self.transform(X))
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 522, in transform
return self._transform(X)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 215, in _transform
predictions = [
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 216, in <listcomp>
getattr(est, meth)(X)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 674, in predict_proba
X = self._validate_X_predict(X)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 407, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/base.py", line 421, in _validate_data
X = check_array(X, **check_params)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/validation.py", line 720, in check_array
_assert_all_finite(array,
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/validation.py", line 103, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Versions
sklearn.__version__
Out[6]: '0.24.2'
lightgbm.__version__
Out[8]: '3.2.1'
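While debugging something like this, a quick sanity check is to scan the input for non-finite values before calling predict_proba - a minimal sketch, assuming X_loyal is a numeric pandas DataFrame:

import numpy as np

num = X_loyal.select_dtypes('number')
print(num.isna().sum())     # NaN count per column
print(np.isinf(num).sum())  # +/- inf count per column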
I'm running Nicholas Renotte's TFODCourse.
When I execute the "Evaluate the model" code:
python Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --checkpoint_dir=Tensorflow\workspace\models\my_ssd_mobnet
this error occurs:
Traceback (most recent call last):
File "Tensorflow\models\research\object_detection\model_main_tf2.py", line 115, in <module>
tf.compat.v1.app.run()
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\absl\app.py", line 303, in run
_run_main(main, args)
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "Tensorflow\models\research\object_detection\model_main_tf2.py", line 82, in main
model_lib_v2.eval_continuously(
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\object_detection-0.1-py3.8.egg\object_detection\model_lib_v2.py", line 1151, in eval_continuously
eager_eval_loop(
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\object_detection-0.1-py3.8.egg\object_detection\model_lib_v2.py", line 928, in eager_eval_loop
for i, (features, labels) in enumerate(eval_dataset):
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 761, in __next__
return self._next_internal()
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 744, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2727, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\framework\ops.py", line 6897, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be 4-dimensional[1,1,371,300,3]
[[{{node ResizeImage/resize/ResizeBilinear}}]] [Op:IteratorGetNext]
I can't understand what input must be 4-dimensional[1,1,371,300,3] means.
I tried labeling again and downgrading TF to 2.4.0, but it still happens.
The ssd_mobilenet model expects the following input:
A three-channel image of variable size - the model does NOT support
batching. The input tensor is a tf.uint8 tensor with shape [1, height,
width, 3] with values in [0, 255]
In this case you are giving it 5-dimensional input [1,1,371,300,3] (the error message states the expected rank, 4, next to the offending shape).
Reshape your input data to [1,371,300,3].
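A minimal sketch of that fix:

import tensorflow as tf

# Dummy tensor with the shape reported in the error message.
image = tf.zeros([1, 1, 371, 300, 3], dtype=tf.uint8)

# Drop the redundant leading axis: [1, 1, 371, 300, 3] -> [1, 371, 300, 3].
image = tf.squeeze(image, axis=0)
# Equivalent: image = tf.reshape(image, [1, 371, 300, 3])
print(image.shape)  # (1, 371, 300, 3)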
Trying to use multiple TensorFlow models in parallel using pathos.multiprocessing.Pool
Error is:
multiprocess.pool.RemoteTraceback:
Traceback (most recent call last):
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\multiprocess\pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\multiprocess\pool.py", line 44, in mapstar
return list(map(*args))
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\pathos\helpers\mp_helper.py", line 15, in <lambda>
func = lambda args: f(*args)
File "c:\Users\Burge\Desktop\SwarmMemory\sim.py", line 38, in run
i.step()
File "c:\Users\Burge\Desktop\SwarmMemory\agent.py", line 240, in step
output = self.ai(np.array(self.internal_log).reshape(-1, 1, 9))
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1012, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\tensorflow\python\keras\engine\sequential.py", line 375, in call
return super(Sequential, self).call(inputs, training=training, mask=mask)
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 425, in call
inputs, training=training, mask=mask)
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 569, in _run_internal_graph
assert x_id in tensor_dict, 'Could not compute output ' + str(x)
AssertionError: Could not compute output KerasTensor(type_spec=TensorSpec(shape=(None, 1, 4), dtype=tf.float32, name=None), name='dense_1/BiasAdd:0', description="created by layer 'dense_1'")
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\Burge\Desktop\SwarmMemory\sim.py", line 78, in <module>
p.map(Sim.run, sims)
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\pathos\multiprocessing.py", line 137, in map
return _pool.map(star(f), zip(*args)) # chunksize
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\multiprocess\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "c:\users\burge\appdata\local\programs\python\python37\lib\site-packages\multiprocess\pool.py", line 657, in get
raise self._value
AssertionError: Could not compute output KerasTensor(type_spec=TensorSpec(shape=(None, 1, 4), dtype=tf.float32, name=None), name='dense_1/BiasAdd:0', description="created by layer 'dense_1'")
The creation of the pool is as follows:
# Imports below are assumed (they weren't in the original snippet);
# Sim is defined earlier in sim.py.
from multiprocessing import freeze_support

import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense
from pathos.multiprocessing import Pool

if __name__ == '__main__':
    freeze_support()

    model = Sequential()
    model.add(Input(shape=(1, 9)))
    model.add(LSTM(10, return_sequences=True))
    model.add(Dropout(0.1))
    model.add(LSTM(5))
    model.add(Dropout(0.1))
    model.add(Dense(4))
    model.add(Dense(4))

    models = []
    sims = []
    for i in range(6):
        models.append(tensorflow.keras.models.clone_model(model))
        sims.append(Sim(models[-1]))

    p = Pool()
    p.map(Sim.run, sims)
Basically, I am running a simulation using the model provided to the Sim class. After the sim has run, I can use a fitness function on the results and apply a genetic algorithm to them.
GitHub link for more information, under branch python-ver:
https://github.com/HarryBurge/SwarmMemory
EDIT:
In case anyone needs to know how to do this in the future: I used keras-pickle-wrapper to pickle the Keras model and just passed it to the run method.
models = []
sims = []

for i in range(6):
    models.append(KerasPickleWrapper(tensorflow.keras.models.clone_model(model)))
    sims.append(Sim())

p = Pool()
p.map(Sim.run, sims, models)
I'm the author of pathos. Whenever you see self._value in the error, what's generally happening is that something you tried to send to another processor failed to serialize. The error and traceback are a bit obtuse, admittedly. However, what you can do is check the serialization with dill and determine whether you need one of the serialization variants (like dill.settings['trace'] = True), or whether you need to restructure your code slightly to better accommodate serialization. If the class you are working with is something you can edit, then an easy thing to do is to add a __reduce__ method, or similar, to aid serialization.
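As a concrete starting point, here is a minimal sketch of that serialization check (pass it whatever goes into the pool, e.g. one of the Sim instances above):

import dill

def check_serializable(obj):
    # Round-trip the object through dill to see if it survives pickling.
    try:
        dill.loads(dill.dumps(obj))
        return True
    except Exception as e:
        print('not serializable:', repr(e))
        return False

# e.g. check_serializable(sims[0]) before calling p.map(Sim.run, sims)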
I am trying to create an HDF5 file with two datasets, 'data' and 'label'. When I tried to access the said file, however, I got an error as follows:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.4\helpers\pydev\pydevd.py", line 1664, in <module>
main()
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.4\helpers\pydev\pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.4\helpers\pydev\pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/pycharm/Input_Pipeline.py", line 140, in <module>
data_h5 = f['data'][:]
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "C:\Users\u20x47\PycharmProjects\PCL\venv\lib\site-packages\h5py\_hl\group.py", line 177, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 190, in h5py.h5o.open
ValueError: Not a location (invalid object ID)
Code used to create the dataset:
h5_file.create_dataset('data', data=data_x, compression='gzip', compression_opts=4, dtype='float32')
h5_file.create_dataset('label', data=label, compression='gzip', compression_opts=1, dtype='uint8')
data_x is an array of arrays; each element in data_x is a 3D array of 1024 elements.
label is an array of arrays as well; each element is a 1D array holding a single element.
Code to access the said file:
f = h5_file
data_h5 = f['data'][:]
label_h5 = f['label'][:]
print (data_h5, label_h5)
How can I fix this? Is this a syntax error or a logical one?
I was unable to reproduce the error.
Maybe you forgot to close the file, or you changed the content of your h5 file during execution.
You can also use print(list(h5_file.items())) to check the content of your h5 file.
Tested code:
import h5py
import numpy as np
h5_file = h5py.File('test.h5', 'w')
# bogus data with the correct size
data_x = np.random.rand(16,8,8)
label = np.random.randint(100, size=(1,1),dtype='uint8')
#
h5_file.create_dataset('data', data=data_x, compression='gzip', compression_opts=4, dtype='float32')
h5_file.create_dataset('label', data=label, compression='gzip', compression_opts=1, dtype='uint8')
h5_file.close()
h5_file = h5py.File('test.h5', 'r')
f = h5_file
print(list(f.items()))
data_h5 = f['data'][...]
label_h5 = f['label'][...]
print (data_h5, label_h5)
h5_file.close()
Produces
[('data', <HDF5 dataset "data": shape (16, 8, 8), type "<f4">), ('label', <HDF5 dataset "label": shape (1, 1), type "|u1">)]
(array([[[4.36837107e-01, 8.05664659e-01, 3.34415197e-01, ...,
8.89135897e-01, 1.84097692e-01, 3.60782951e-01],
[8.86442482e-01, 6.07181549e-01, 2.42844030e-01, ...,
[4.24369454e-01, 6.04596496e-01, 5.56676507e-01, ...,
7.22884715e-01, 2.45932683e-01, 9.18777227e-01]]], dtype=float32), array([[25]], dtype=uint8))
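Building on the point about forgetting to close the file, here is a sketch using context managers, which close (and flush) the file automatically and so avoid stale object IDs:

import h5py
import numpy as np

# Writing and reading in separate `with` blocks guarantees the file is
# closed before it is reopened.
with h5py.File('test.h5', 'w') as f:
    f.create_dataset('data', data=np.random.rand(16, 8, 8), dtype='float32')

with h5py.File('test.h5', 'r') as f:
    print(list(f.items()))
    data_h5 = f['data'][...]  # [...] copies the data into a numpy array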
I was trying to build a 3D convolutional layer using keras. It works fine, but when I added a subsample parameter it crashed. The code:
l_1 = Convolution3D(2, 10, 10, 10,
                    border_mode='same',
                    name='l_1',
                    activation='relu',
                    subsample=(5, 5, 5)
                    )(inputs)
the error is:
Traceback (most recent call last):
File "image_proc_09.py", line 244, in <module>
)(inputs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 635, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 166, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 1234, in call
filter_shape=self.W_shape)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 1627, in conv3d
dim_ordering, volume_shape, filter_shape)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 1686, in _old_theano_conv3d
assert(strides == (1, 1, 1))
AssertionError
I am using Theano 0.8.2.
Thanks
You cannot use the subsample parameter with border_mode='same'; use 'valid' or 'full' instead.
Check out the line of code where the assertion error happens.
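For example, this variant of the layer from the question should get past the assertion (a sketch under the same old Keras/Theano API the question uses; I haven't run it against Theano 0.8.2):

l_1 = Convolution3D(2, 10, 10, 10,
                    border_mode='valid',  # 'same' with subsample trips assert(strides == (1, 1, 1))
                    name='l_1',
                    activation='relu',
                    subsample=(5, 5, 5)
                    )(inputs)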