Randomly hitting Joblib exception in Sklearn on parallel Grid Search

I am running GridSearchCV in parallel with n_jobs > 1, but randomly hit the following crash in joblib:
TypeError: Cannot create a consistent method resolution order (MRO) for bases JoblibException, Exception
Here is the complete stack trace:
Traceback (most recent call last):
  File "example_sklearn.py", line 92, in <module>
    main()
  File "example_sklearn.py", line 76, in main
    ).fit(X_train, y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.py", line 372, in fit
    for clf_params in grid for train, test in cv)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 516, in __call__
    self.retrieve()
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 448, in retrieve
    exception_type = _mk_exception(exception.etype)[0]
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/my_exceptions.py", line 61, in _mk_exception
    __str__=JoblibException.__str__),
TypeError: Cannot create a consistent method resolution order (MRO) for bases JoblibException, Exception
Any pointers on what this really is and how I can debug it? Is this a known issue with sklearn?
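A minimal sketch of my setup (the estimator and parameter grid here are placeholders; the key part is n_jobs > 1):
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # old sklearn module path, matching the traceback

# Placeholder data and grid; the crash appears intermittently with n_jobs > 1.
X_train, y_train = make_classification(n_samples=200, n_features=10)
clf = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, n_jobs=4)
clf.fit(X_train, y_train)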

I had the exact same exception, also while using GridSearchCV.
If you look at the exception, it is complaining that it cannot work out how to order the two parent classes JoblibException and Exception. This is a bug in the joblib package: the inheritance is improper.
But beyond that, there is another problem, which is the source of the exception itself: a worker raises an exception during retrieve(), and while joblib re-raises that exception, you get this MRO error instead.
The second problem (the actual source of the exception) seems to be fixed in later versions of joblib, but scikit-learn is still bundling an old version (I will submit a pull request with the changed file soon).
A temporary workaround is to install your own version of joblib using
easy_install joblib
then go to the sklearn/externals folder, remove/rename the joblib folder, and create a symbolic link to your own joblib using:
ln -s /path/to/joblib joblib
EDIT: It seems somebody had already fixed the problem. My version was also old.
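To check which joblib copy is actually in use (a quick sanity check; sklearn.externals.joblib only exists on the old scikit-learn versions discussed here):
from sklearn.externals import joblib  # vendored copy (old scikit-learn only)
print(joblib.__version__)

import joblib as standalone_joblib  # your own installed joblib, if any
print(standalone_joblib.__version__)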

Related

RuntimeError: unexpected EOF, expected 3302200 more bytes. The file might be corrupted

I am trying to use the pretrained model from the following repository and need your assistance to rectify the error:
RuntimeError: unexpected EOF, expected 3302200 more bytes. The file might be corrupted.
I tried to run the pretrained CANNet model from the following repo using Google Colab and followed all the steps (prerequisites, cloning, data preparation, and testing):
https://github.com/gjy3035/NWPU-Crowd-Sample-Code.git
The detailed error is given below:
Traceback (most recent call last):
  File "test.py", line 118, in <module>
    main()
  File "test.py", line 46, in main
    test(lines, model_path)
  File "test.py", line 55, in test
    net.load_state_dict(torch.load(model_path))
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 779, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 3302200 more bytes. The file might be corrupted.
Check out this GitHub link: https://github.com/huggingface/transformers/issues/1491
It proposes using the force_download arg, which is equivalent to force_reload assuming you're using torch.hub.load to load the pretrained model. The other option proposed, applicable to Windows users, is to delete the downloaded model and download it again.
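For example, with torch.hub (a sketch; the repo and model names are illustrative, not the ones from the question):
import torch

# force_reload=True discards the cached download and fetches the weights
# again, which recovers from a truncated cache.
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True, force_reload=True)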
I have the same issue, but setting force_reload=True hasn't cleared it for me. I'm thinking I have space problems, but I think it's worth a shot on your end.
I also faced the same problem while I was evaluating my trained model on Google Colab. I found that the model was taking a long time to get fully uploaded to the machine, and I was testing with the incompletely uploaded model. Once I ensured the model had been fully uploaded and then ran the test, it worked.
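A quick way to catch an unfinished upload (model_path is a placeholder for your checkpoint):
import os

model_path = 'checkpoint.pth'  # placeholder
# Compare against the size of the source file; a smaller number means the
# upload to the Colab machine has not finished yet.
print(os.path.getsize(model_path), 'bytes')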

AzureML: Unable to unpickle LightGBM model

I am trying to run an Azure ML pipeline. This pipeline trains a model, saves it as a pickle file, and then tries to unpickle it in the next step. When unpickling, I hit the issue below in random runs:
Traceback (most recent call last):
  File "batch_scoring.py", line 199, in <module>
    clf = joblib.load(open(model_path, 'rb'))
  File "/azureml-envs/azureml_347514cea2002d6bd71b42aceb1e4eeb/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 595, in load
    obj = _unpickle(fobj)
  File "/azureml-envs/azureml_347514cea2002d6bd71b42aceb1e4eeb/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
    obj = unpickler.load()
  File "/azureml-envs/azureml_347514cea2002d6bd71b42aceb1e4eeb/lib/python3.6/pickle.py", line 1048, in load
    raise EOFError
EOFError
Has anyone faced this issue before?
I get the same error when I try to unpickle the model from the output folder / model registry. In my case the pkl was not properly formed during the experiment. Try to re-run the experiment (I did it without changing a single line and it worked for me). In my case even the bad pickle was smaller than the good one. Hope this helps :)
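A sketch of how the scoring step could fail with a clearer message (model_path is a placeholder; this only catches the truncated-file case described above):
import joblib

model_path = 'model.pkl'  # placeholder for the registered model file
try:
    clf = joblib.load(model_path)  # joblib.load also accepts a plain path
except EOFError:
    raise RuntimeError(
        model_path + ' appears truncated; re-run the training step that wrote it')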

TensorFlow 2.0 - Running on a TPU: AttributeError: 'NameError' object has no attribute 'op'

I'm running a script designed to work with TF 2.0 to generate predictions from a pre-trained BERT base model for an NLP task. I'm running it with Python 3.7 and TF 2.1 in a Google Colab notebook using a cloud-TPU hosted instance. I'm able to run the script successfully and generate predictions using a cloud GPU, but I get the following error output when I try to run it with a TPU (after enabling the TPU and pointing to the corresponding IP address for the TPU).
2020-02-09 01:17:36.155906: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-02-09 01:17:36.156040: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-02-09 01:17:36.156061: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From tf_kaggle_test.py:188: The name tf.estimator.tpu.InputPipelineConfig is deprecated. Please use tf.compat.v1.estimator.tpu.InputPipelineConfig instead.
WARNING:tensorflow:From tf_kaggle_test.py:189: The name tf.estimator.tpu.RunConfig is deprecated. Please use tf.compat.v1.estimator.tpu.RunConfig instead.
WARNING:tensorflow:From tf_kaggle_test.py:194: The name tf.estimator.tpu.TPUConfig is deprecated. Please use tf.compat.v1.estimator.tpu.TPUConfig instead.
WARNING:tensorflow:From tf_kaggle_test.py:212: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f55bb7600d0>) includes params argument, but params are not passed to Estimator.
FLAGS.predict_file data/simplified-nq-test.jsonl
***** Running predictions *****
Num orig examples = 346
Num split examples = 9409
Batch size = 8
Num split into 3 = 8
.
.
Num split into 187 = 1
output/eval.tf_record
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-02-09 01:18:52.589210: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:From /content/gdrive/My Drive/Capstone - Google QA/natural-questions-nlp-drive/tf2_0_baseline_w_bert.py:1112: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
2020-02-09 01:18:53.890592: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/util.py:1262: NameBasedSaverStatus.__init__ (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
WARNING:tensorflow:From /content/gdrive/My Drive/Capstone - Google QA/natural-questions-nlp-drive/tf2_0_baseline_w_bert.py:1057: The name tf.estimator.tpu.TPUEstimatorSpec is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimatorSpec instead.
The warnings above are all fine and the script runs past them; most of them are deprecation notices, since the original script was built for TF 1.0 and then translated to work with TF 2.0. The actual failure appears to occur in the tpu_estimator and error_handling scripts below, somewhere in the exception-catching process. I'm not sure what it's referring to with AttributeError: 'NameError' object has no attribute 'op' and name 'assignment_map' is not defined.
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3075, in predict
    rendezvous.record_error('prediction_loop', sys.exc_info())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 81, in record_error
    if value and value.op and value.op.type == _CHECK_NUMERIC_OP_NAME:
AttributeError: 'NameError' object has no attribute 'op'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tf_kaggle_test.py", line 267, in <module>
    predict_input_fn, yield_single_examples=True):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3072, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 626, in predict
    features, None, ModeKeys.PREDICT, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3394, in _model_fn
    scaffold = _get_scaffold(scaffold_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3749, in _get_scaffold
    scaffold = scaffold_fn()
  File "/content/gdrive/My Drive/Capstone - Google QA/natural-questions-nlp-drive/tf2_0_baseline_w_bert.py", line 994, in tpu_scaffold
    tf.compat.v1.train.init_from_checkpoint(init_checkpoint, assignment_map)
NameError: name 'assignment_map' is not defined
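As far as I can tell, the AttributeError is only noise on top of the real failure: error_handling.py inspects value.op on whatever exception it captured, and a plain NameError has no op attribute. A minimal illustration of what I think is happening:
# The rendezvous captures the real exception (the NameError raised in
# tpu_scaffold) and then touches value.op on it, which itself blows up.
try:
    raise NameError("name 'assignment_map' is not defined")
except NameError as exc:
    value = exc
    print(hasattr(value, 'op'))  # False: evaluating value.op raises AttributeError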
The notebook where I'm using the script (and it worked perfectly with a GPU/CPU) is located here:
https://www.kaggle.com/abhinand05/bert-for-humans-tutorial-baseline/data#Code-Implementation-in-Tensorflow-2.0
Is it something to do with Google Colab that I need to change, or are additional changes needed to run this with a TPU?

Python: Getting an issue with OCR while using the Python tesseract API interface

I used the Pytesseract module for OCR, but it seems to be a slow process, so I followed
Pytesseract is too slow. How can I make it process images faster?
and used the code mentioned in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/xvTFjYCDRQU/rCEwjZL3BQAJ but I am getting this error:
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 201
Segmentation fault (core dumped)
Then I checked some posts and found the suggestion to add locale.setlocale(locale.LC_ALL, "C") to my code.
After adding this, I got another error:
Traceback (most recent call last):
  File "master_doc_test3.py", line 107, in <module>
    tess = Tesseract()
  File "master_doc_test3.py", line 67, in __init__
    if self._lib.TessBaseAPIInit3(self._api, datapath, language):
ctypes.ArgumentError: argument 3: <class 'TypeError'>: wrong type
Can anyone give me an idea about this error? Or, does anyone know the fastest way to do OCR in Python?
You should convert every string parameter you pass to ctypes library calls to bytes:
self._lib.TessBaseAPIInit3(self._api, datapath, language)
Something like this is working for me:
self._lib.TessBaseAPIInit3(self._api, bytes(datapath, encoding='utf-8'), bytes(language, encoding='utf-8'))
I got the clue here.
Please take into consideration that the code you are using needs the same fine-tuning in other library calls, such as the following:
tess.set_variable(bytes("tessedit_pageseg_mode", encoding='utf-8'), bytes(str(frame_piece.psm), encoding='utf-8'))
tess.set_variable(bytes("preserve_interword_spaces", encoding='utf-8'), bytes(str(1), encoding='utf-8'))

Gensim Word2Vec object has no attribute vector_size when loading file

I'm trying to load an already trained word2vec model downloaded from here by using the following code, as suggested by the aforementioned website:
from gensim.models import Word2Vec
model=Word2Vec.load('wiki_iter=5_algorithm=skipgram_window=10_size=300_neg-samples=10.m')
When I try to execute that code, I get the following error:
UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "d:\DavideV\documents\visual studio 2017\Projects\tesi\tesi\tesi.py", line 112, in <module>
    model=Word2Vec.load('wiki_iter=5_algorithm=skipgram_window=10_size=300_neg-samples=10.m')
  File "C:\Users\admin\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 979, in load
    return load_old_word2vec(*args, **kwargs)
  File "C:\Users\admin\Anaconda3\lib\site-packages\gensim\models\deprecated\word2vec.py", line 155, in load_old_word2vec
    'size': old_model.vector_size,
AttributeError: 'Word2Vec' object has no attribute 'vector_size'
I suppose this is because the model was trained with a previous version of gensim, but I would prefer to avoid retraining it.
How can I solve this problem? Thanks.
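One thing I plan to try (a sketch, not a confirmed fix: the right version is whichever gensim produced the model; 3.8.3, the last 3.x release, is only a common compatibility fallback):
# pip install gensim==3.8.3
from gensim.models import Word2Vec

# Older gensim releases can often still read models saved by earlier versions.
model = Word2Vec.load('wiki_iter=5_algorithm=skipgram_window=10_size=300_neg-samples=10.m')
print(model.wv.most_similar('example'))  # sanity check; 'example' is a placeholder token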
