Scikit-learn StackingClassifier with missing values - scikit-learn

When I call model.predict_proba(X) on my StackingClassifier model the run execution crashes because the library is calling a method assert_all_finite() to check whether my dataframe contains missing values.
Since the estimators I stacked are able to handle missing values, I don't see the reason why this should happen and I didn't find anything in the documentation that says that the StackingClassifier requires data without missing values.
It's a bit hard for me to come up with a short reproducibile snippet of code given that it comes from several layers of model abstraction, but I can print out the model effectively raising the error call.
p = model.predict_proba(X_loyal)
where model is:
StackingClassifier(estimators=[('ExtraTreesClassifier_117',
ExtraTreesClassifier(bootstrap=True,
class_weight={0: 1, 1: 5},
criterion='entropy',
max_depth=11,
max_features='log2',
max_samples=0.5946040593595099,
min_samples_leaf=2,
n_estimators=163,
random_state=117)),
('RandomForestClassifier_117',
RandomForestClassifier(class_weight={0: 1,
1: 5},
criterion='entropy',
max_depth=11,
max_features='log2',
max_samples=0.5946040593595099,
min_samples_leaf=2,
n_estimators=163,
random_state=117)),
('LGBMClassifier_117',
LGBMClassifier(class_weight={0: 1, 1: 1},
deterministic=True, max_depth=9,
n_estimators=183, num_leaves=3,
subsample=0.2986274713775564,
verbose=-1))])
Error
Traceback (most recent call last):
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-12-eafa75c49322>", line 1, in <module>
model.predict_proba(X_loyal)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 120, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 485, in predict_proba
return self.final_estimator_.predict_proba(self.transform(X))
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 522, in transform
return self._transform(X)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 215, in _transform
predictions = [
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 216, in <listcomp>
getattr(est, meth)(X)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 674, in predict_proba
X = self._validate_X_predict(X)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 407, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/base.py", line 421, in _validate_data
X = check_array(X, **check_params)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/validation.py", line 720, in check_array
_assert_all_finite(array,
File "/home/mlpoc/miniconda3/envs/churn/lib/python3.8/site-packages/sklearn/utils/validation.py", line 103, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Versions
sklearn.__version__
Out[6]: '0.24.2'
lightgbm.__version__
Out[8]: '3.2.1'

Related

python Tensorflow 2.4.0 'input must be 4-dimensional[1,1,371,300,3]' ERROR

im running Nicholas Rennote's TFODCourse.
when i execute the Evaluate the model code:
python Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --checkpoint_dir=Tensorflow\workspace\models\my_ssd_mobnet
error occurs like this
Traceback (most recent call last):
File "Tensorflow\models\research\object_detection\model_main_tf2.py", line 115, in <module>
tf.compat.v1.app.run()
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\absl\app.py", line 303, in run
_run_main(main, args)
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "Tensorflow\models\research\object_detection\model_main_tf2.py", line 82, in main
model_lib_v2.eval_continuously(
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\object_detection-0.1-py3.8.egg\object_detection\model_lib_v2.py", line 1151, in eval_continuously
eager_eval_loop(
File "C:\Users\All_Nighter\miniconda3\envs\TF\lib\site-packages\object_detection-0.1-py3.8.egg\object_detection\model_lib_v2.py", line 928, in eager_eval_loop
for i, (features, labels) in enumerate(eval_dataset):
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 761, in __next__
return self._next_internal()
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 744, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2727, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\All_Nighter\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\framework\ops.py", line 6897, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be 4-dimensional[1,1,371,300,3]
[[{{node ResizeImage/resize/ResizeBilinear}}]] [Op:IteratorGetNext]
I can't understand what is input must be 4-dimensional[1,1,371,300,3] means.
i tried Labeling again, and downgrade TF to 2.4.0. but still happend.
ssd_mobilenet model expects input
A three-channel image of variable size - the model does NOT support
batching. The input tensor is a tf.uint8 tensor with shape [1, height,
width, 3] with values in [0, 255]
In this case you are giving 4-dimensional input[1,1,371,300,3],
Reshape your input data as [1,371,300,3].

How to create a custom keras generator to fit multiple outputs and use workers

I have one input, and multiple outputs, like a multilabel classification, but I chose to try another approach to see if I have any improvements.
I have these generators, I'm using flow_from_dataframe because I have a huge dataset (200k):
self.train_generator = datagen.flow_from_dataframe(
dataframe=train,
directory='dataset',
x_col='Filename',
y_col=columns,
batch_size=BATCH_SIZE,
color_mode='rgb',
class_mode='raw',
shuffle=True,
target_size=(HEIGHT,WIDTH))
self.test_generator = datatest.flow_from_dataframe(
dataframe=test,
directory='dataset',
x_col='Filename',
y_col=columns,
batch_size=BATCH_SIZE,
color_mode='rgb',
class_mode='raw',
target_size=(HEIGHT,WIDTH))
I'm passing to fit using this function:
def generator(self, generator):
while True:
X, y = generator.next()
y = [y[:,x] for x in range(len(columns))]
yield X,[y]
If I fit like this:
self.h = self.model.fit_generator(self.generator(self.train_generator),
steps_per_epoch=self.STEP_SIZE_TRAIN,
validation_data=self.generator(self.test_generator),
validation_steps=self.STEP_SIZE_TEST,
epochs=50,
verbose = 1,
workers = 2,
)
I get :
RuntimeError: Your generator is NOT thread-safe. Keras requires a thread-safe generator when `use_multiprocessing=False, workers > 1`.
Using multiprocessing=True:
self.h = self.model.fit_generator(self.generator(self.train_generator),
steps_per_epoch=self.STEP_SIZE_TRAIN,
validation_data=self.generator(self.test_generator),
validation_steps=self.STEP_SIZE_TEST,
epochs=50,
verbose = 1,
workers = 2,
use_multiprocessing=True,
)
Results in:
File "C:\ProgramData\Anaconda3\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\keras\utils\data_utils.py", line 877, in _run
with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\keras\utils\data_utils.py", line 867, in pool_fn
pool = get_pool_class(True)(
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 212, in __init__
self._repopulate_pool()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 326, in _repopulate_pool_static
w.start()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
File "C:\ProgramData\Anaconda3\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\keras\utils\data_utils.py", line 877, in _run
with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\keras\utils\data_utils.py", line 867, in pool_fn
pool = get_pool_class(True)(
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 212, in __init__
self._repopulate_pool()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 326, in _repopulate_pool_static
w.start()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Now I'm stuck, how to solve this?
According to documentation https://keras.io/api/preprocessing/image/
The argument class_mode can be set as "multi_output" so you don't need to create a custom generator:
class_mode: one of "binary", "categorical", "input", "multi_output", "raw", sparse" or None. Default: "categorical". Mode for yielding the targets:
- "binary": 1D numpy array of binary labels,
- "categorical": 2D numpy array of one-hot encoded labels. Supports multi-label output.
- "input": images identical to input images (mainly used to work with autoencoders),
- "multi_output": list with the values of the different columns,
- "raw": numpy array of values in y_col column(s),
- "sparse": 1D numpy array of integer labels,
- None, no targets are returned (the generator will only yield batches of image data, which is useful to use in model.predict()).
I am now being able to use workers > 1, but I am not having performance improvements.

PyTorch - Torchvision - BrokenPipeError: [Errno 32] Broken pipe

I'm trying to carry out the tutorial named "Training a classifier" with PyTorch.
WHen trying to debug this part of the code :
import matplotlib.pyplot as plt
import numpy as np
# functions to show an image
def imshow(img):
img = img / 2 + 0.5 # unnormalize
npimg = img.numpy()
plt.imshow(np.transpose(npimg, (1, 2, 0)))
# get some random training images
dataiter = iter(trainloader)
images, labels = dataiter.next()
# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))
I get this error message :
Files already downloaded and verified Files already downloaded and verified
Files already downloaded and verified Files already downloaded and verified Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\Anaconda\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\Anaconda\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\Anaconda\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\Anaconda\lib\multiprocessing\spawn.py", line 277, in
_fixup_main_from_path
run_name="__mp_main__")
File "D:\Anaconda\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\Anaconda\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\Anaconda\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "d:\Yggdrasil\Programmation\PyTorch\TutorialCIFAR10.py", line 36, in <module>
dataiter = iter(trainloader)
File "D:\Anaconda\lib\site-packages\torch\utils\data\dataloader.py", line 451, in __iter__
return _DataLoaderIter(self)
File "D:\Anaconda\lib\site-packages\torch\utils\data\dataloader.py", line 239, in __init__
w.start()
File "D:\Anaconda\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\Anaconda\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Anaconda\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Anaconda\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "D:\Anaconda\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "D:\Anaconda\lib\multiprocessing\spawn.py", line 136, in
_check_not_importing_main
is not going to be frozen to produce an executable.)
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "d:\Yggdrasil\Programmation\PyTorch\TutorialCIFAR10.py", line 36, in <module>
dataiter = iter(trainloader)
File "D:\Anaconda\lib\site-packages\torch\utils\data\dataloader.py", line 451, in __iter__
return _DataLoaderIter(self)
File "D:\Anaconda\lib\site-packages\torch\utils\data\dataloader.py", line 239, in __init__
w.start()
File "D:\Anaconda\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\Anaconda\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj) File "D:\Anaconda\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Anaconda\lib\multiprocessing\popen_spawn_win32.py", line 65, in
__init__
reduction.dump(process_obj, to_child)
File "D:\Anaconda\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
All the previous lines in the tutorial are working perfectly.
Does someone know how to solve this, please ?
Thanks a lot in advance
The question happened because Windows cannot run this DataLoader in 'num_workers' more than 0.
You can see where the trainloader come from.we can see
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
shuffle=True, num_workers=2)
We need to change the 'num_workers' to 0.like this:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
shuffle=True, num_workers=0)
Every trainloaders need to change like this.
Got the same error. The following workaround works for me:
def run():
# code goes here
if __name__ == '__main__':
run()
This doesn't look to be a PyTorch problem. Try executing the code in Jupyter notebooks and other environment troubleshooting.
you need to add a if-clause protection as stated in the pytorch docs:
https://pytorch.org/docs/stable/notes/windows.html#usage-multiprocessing

Pytorch model weight type conversion

I'm trying to do inference on FlowNet2-C model loading from file.
However, I met some data type problem. How can I resolve it?
Source code
FlowNet2-C pre-trained model
$ python main.py
Initializing Datasets
[0.000s] Loading checkpoint '/notebooks/data/model/FlowNet2-C_checkpoint.pth.tar'
[1.293s] Loaded checkpoint '/notebooks/data/model/FlowNet2-C_checkpoint.pth.tar' (at epoch 0)
(1L, 6L, 384L, 512L)
<class 'torch.autograd.variable.Variable'>
[1.642s] Operation failed
Traceback (most recent call last):
File "main.py", line 102, in <module>
main()
File "main.py", line 98, in main
summary(input_size, model)
File "main.py", line 61, in summary
model(x)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/notebooks/data/vinet/FlowNetC.py", line 75, in forward
out_conv1a = self.conv1(x1)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 282, in forward
self.padding, self.dilation, self.groups)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: Input type (CUDAFloatTensor) and weight type (CPUFloatTensor) should be the same
Maybe that is because your model and input x to the model has different data types. It seems that the model parameters have been moved to GPU, but your input x is on GPU.
You can try to use model.cuda() after line 94, which will put the model on the GPU. Then the error should disappear.

AttributeError : lower not found

I am doing Document Classification and obtained accuracy upto 76%. And while predicting the document category i did following one
doc_clf.predict(tf_idf.transform((count_vect.transform([r'document']))))
and i get the following error:
File "/usr/local/lib/python3.5/dist- packages/sklearn/utils/metaestimators.py", line 115, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/sklearn/pipeline.py", line 306, in predict
Xt = transform.transform(Xt)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 923, in transform
_, X = self._count_vocab(raw_documents, fixed_vocab=True)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
return lambda x: strip_accents(x.lower())
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 647, in __getattr__
raise AttributeError(attr + " not found")
How do i correct this error ? And any other way to improve the accuracy further?
I share link to review full code
Full Code
In your code, doc_clf is a pipeline. So the tf_idf.transform() and count_vect.transform() will be handled automatically by the pipeline.
You should only call
category = doc_clf.predict([r'document'])
As this document passes through the pipeline, it will be automatically transformed by the CountVectorizer and TfidfTransformer.

Resources