InternalError: Invalid variable reference. [Op:ResourceApplyAdam] on TensorFlow - python-3.x

I am currently working with eager execution in TensorFlow 1.7.0.
I get this error when running on the GPU:
tensorflow.python.framework.errors_impl.InternalError: Invalid variable reference. [Op:ResourceApplyAdam]
Unfortunately, I haven't been able to isolate the error, so I can't provide a snippet that reproduces it.
The error doesn't occur when running on the CPU. My code worked fine on the GPU until a recent update, and I don't think it is machine-related because it occurs on different machines.
I wasn't able to find anything relevant, so if you have any hints on what could cause this error, please let me know.
Complete traceback:
2018-07-19 17:52:32.393711: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at training_ops.cc:2507 : Internal: Invalid variable reference.
Traceback (most recent call last):
File "debugging_jules_usage.py", line 391, in <module>
mainLoop()
File "debugging_jules_usage.py", line 370, in mainLoop
raise e
File "debugging_jules_usage.py", line 330, in mainLoop
Kn.fit(train)
File "/home/jbayet/xai-link-prediction/xai_lp/temporal/models_temporal.py", line 707, in fit
self._train_one_batch(X_bis, i)
File "/home/jbayet/xai-link-prediction/xai_lp/temporal/models_temporal.py", line 639, in _train_one_batch
self.optimizer.minimize(batch_model_loss, global_step=tf.train.get_global_step())
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 409, in minimize
name=name)
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 564, in apply_gradients
update_ops.append(processor.update_op(self, grad))
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 161, in update_op
update_op = optimizer._resource_apply_dense(g, self._v)
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/adam.py", line 166, in _resource_apply_dense
grad, use_locking=self._use_locking)
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/gen_training_ops.py", line 1105, in resource_apply_adam
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Invalid variable reference. [Op:ResourceApplyAdam]

Related

Issues loading pytorch checkpoint -- storage error wrong size -- how to fix?

I got the following error:
Traceback (most recent call last):
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main2_distance_sl_vs_maml.py", line 790, in <module>
main_data_analyis()
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main2_distance_sl_vs_maml.py", line 597, in main_data_analyis
args.mdl2 = get_sl_learner(args)
File "/raid/projects/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/data_analysis/common.py", line 195, in get_sl_learner
model = load_original_rfs_ckpt(args, path_to_checkpoint=args.path_2_init_sl)
File "/raid/projects/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/data_analysis/common.py", line 168, in load_original_rfs_ckpt
ckpt = torch.load(path_to_checkpoint, map_location=args.device)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/serialization.py", line 794, in _legacy_load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 0 got 64
Why and how does one fix it?
related: https://discuss.pytorch.org/t/runtimeerror-storage-has-wrong-size/88109/4
This issue was resolved for me by simply re-uploading the model ckpt file to the cluster (the thread linked above offers another point of view).
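Since re-uploading the checkpoint fixed it, the file was most likely corrupted during transfer to the cluster. A minimal sketch for catching that earlier by comparing checksums of the local and uploaded copies (the helper and file paths are hypothetical, not from the code above):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB checkpoints need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: verify the uploaded copy against the original
# before pointing torch.load at it.
# assert file_md5("/raid/.../ckpt_on_cluster.pt") == file_md5("ckpt_local.pt")
```

rsync with --checksum performs the same verification during the transfer itself.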

Tensorflow KeyError for creating train.record for model

I'm currently facing this error when creating a train.record file for my model, following a guide from Medium.
Previously I had no issues, but today I got this error:
/content/gdrive/MyDrive/TensorFlow/scripts/preprocessing
Traceback (most recent call last):
File "generate_tfrecord.py", line 168, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "generate_tfrecord.py", line 158, in main
tf_example = create_tf_example(group, path)
File "generate_tfrecord.py", line 132, in create_tf_example
classes.append(class_text_to_int(row['class']))
File "generate_tfrecord.py", line 101, in class_text_to_int
return label_map_dict[row_label]
KeyError: '\\'
Successfully created the TFRecord file: /content/gdrive/MyDrive/TensorFlow/workspace/training_demo/annotations/test.record
I suspect the problem might be with one of the annotation .xml files in my train image folder, but I can't seem to find any other sources to confirm. Can someone please enlighten me?
Edit: Here is the generate_tfrecord.py for reference if required. Thank you
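KeyError: '\\' means class_text_to_int looked up an object name that is not in the label map, so one annotation almost certainly contains a stray '\\' as its class name. A minimal sketch for locating it, assuming Pascal VOC-style XML annotations like those the guide uses (the helper name is mine, not from generate_tfrecord.py):

```python
import xml.etree.ElementTree as ET

def find_unknown_labels(xml_strings, label_map_dict):
    """Return the set of <object><name> values in VOC-style annotations
    that are missing from the label map (each would raise KeyError)."""
    unknown = set()
    for xml_str in xml_strings:
        root = ET.fromstring(xml_str)
        for obj in root.iter("object"):
            name = obj.findtext("name", default="").strip()
            if name not in label_map_dict:
                unknown.add(name)
    return unknown
```

Running it over the contents of every .xml file in the train folder should point straight at the file carrying the bad label.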

Python library "Crypto" conflict

I'm trying to integrate two frameworks and am installing the requirements for both, but it seems the library 'Crypto' is used by both frameworks with different required versions, so when I install the requirements for one of the frameworks, I get the first error:
Traceback (most recent call last):
File "dapp_bdb.py", line 134, in <module>
main()
File "dapp_bdb.py", line 112, in main
blockchain = LevelDBBlockchain(settings.chain_leveldb_path)
File "/home/ubuntu/.local/lib/python3.6/site-packages/neo/Implementations/Blockchains/LevelDB/LevelDBBlockchain.py", line 190, in __init__
self.Persist(Blockchain.GenesisBlock())
File "/home/ubuntu/.local/lib/python3.6/site-packages/neo/Implementations/Blockchains/LevelDB/LevelDBBlockchain.py", line 691, in Persist
account = accounts.GetAndChange(output.AddressBytes, AccountState(output.ScriptHash))
File "/home/ubuntu/.local/lib/python3.6/site-packages/neo/Core/TX/Transaction.py", line 121, in AddressBytes
return bytes(self.Address, encoding='utf-8')
File "/home/ubuntu/.local/lib/python3.6/site-packages/neo/Core/TX/Transaction.py", line 111, in Address
return Crypto.ToAddress(self.ScriptHash)
File "/home/ubuntu/.local/lib/python3.6/site-packages/neocore/Cryptography/Crypto.py", line 103, in ToAddress
return scripthash_to_address(script_hash.Data)
File "/home/ubuntu/.local/lib/python3.6/site-packages/neocore/Cryptography/Helper.py", line 78, in scripthash_to_address
return base58.b58encode(bytes(outb)).decode("utf-8")
AttributeError: 'str' object has no attribute 'decode'
and with the second framework's requirements, I get this other error:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "dapp_bdb.py", line 95, in custom_background_code
put_bdb("Hello world")
File "dapp_bdb.py", line 68, in put_bdb
fulfilled_creation_tx = bdb.transactions.fulfill(prepared_creation_tx, private_keys=private_key)
File "/home/ubuntu/.local/lib/python3.6/site-packages/bigchaindb_driver/driver.py", line 270, in fulfill
return fulfill_transaction(transaction, private_keys=private_keys)
File "/home/ubuntu/.local/lib/python3.6/site-packages/bigchaindb_driver/offchain.py", line 346, in fulfill_transaction
signed_transaction = transaction_obj.sign(private_keys)
File "/home/ubuntu/.local/lib/python3.6/site-packages/bigchaindb_driver/common/transaction.py", line 823, in sign
PrivateKey(private_key) for private_key in private_keys}
File "/home/ubuntu/.local/lib/python3.6/site-packages/bigchaindb_driver/common/transaction.py", line 823, in <dictcomp>
PrivateKey(private_key) for private_key in private_keys}
File "/home/ubuntu/.local/lib/python3.6/site-packages/bigchaindb_driver/common/transaction.py", line 817, in gen_public_key
public_key = private_key.get_verifying_key().encode()
File "/home/ubuntu/.local/lib/python3.6/site- packages/cryptoconditions/crypto.py", line 62, in get_verifying_key
return Ed25519VerifyingKey(self.verify_key.encode(encoder=Base58Encoder))
File "/home/ubuntu/.local/lib/python3.6/site-packages/nacl/encoding.py", line 90, in encode
return encoder.encode(bytes(self))
File "/home/ubuntu/.local/lib/python3.6/site-packages/cryptoconditions/crypto.py", line 15, in encode
return base58.b58encode(data).encode()
AttributeError: 'bytes' object has no attribute 'encode'
Are there any ideas on how I can avoid this?
It looks like the cryptoconditions library is doing it wrong.
You should file a bug asking it to update the required version of base58 and review all the calls to it. The usual behavior in Python 3 is for some_encoder_library.encode() to return bytes and some_encoder_library.decode() to return str. New versions of the base58 module follow this rule, although base58-encoded objects never contain any special symbols, of course. cryptoconditions still depends on a previous version, where b58encode returned str.
Meanwhile, you can make local modifications to the installed library, or fork it and install your fork instead.
It is likely that everything will work OK with the encode() call removed from this line.
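The two failures are mirror images: with the newer base58, b58encode returns bytes and the old str-oriented call sites break ('bytes' object has no attribute 'encode'); with the older base58 it returns str and bytes-oriented call sites break the other way. While waiting for an upstream fix, a small normalizing shim (my own helper, not part of either library) can absorb the version difference:

```python
def ensure_str(value, encoding="utf-8"):
    """Normalize the return value of a base58-style encoder: newer
    versions return bytes, older ones return str. Calling .decode() on a
    str, or .encode() on bytes, raises exactly the AttributeErrors above."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Wrapping call sites as ensure_str(base58.b58encode(data)) then works with either installed version.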

PyYAML Error: TypeError: can't pickle _thread.RLock objects

I'm trying to dump a perhaps somewhat complex class with YAML and am seeing the following error. I don't know what pickle does here, but I'm not engaged in any multithreaded programming to my knowledge. This happens while running a pyunit unit test:
Any idea how to find the offending attribute?
ERROR: test_multi_level_needs (test_needs.needs_TestCase)
-----------------------------------------------------------
Traceback (most recent call last):
File "/Users/rsalemi/.../test_needs.py", line 240, in test_multi_level_needs
print(yaml.dump(test2_comp))
File ".../.../yaml/__init__.py", line 200, in dump
<snipped lots of stack trace>
File ".../.../yaml/representer.py", line 91, in represent_sequence
node_item = self.represent_data(item)
File ".../.../yaml/representer.py", line 51, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File ".../.../yaml/representer.py", line 341, in represent_object
'tag:yaml.org,2002:python/object:'+function_name, state)
File ".../.../yaml/representer.py", line 116, in represent_mapping
node_value = self.represent_data(item_value)
File ".../.../yaml/representer.py", line 51, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File ".../.../yaml/representer.py", line 315, in represent_object
reduce = data.__reduce_ex__(2)
TypeError: can't pickle _thread.RLock objects
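As the traceback shows, yaml.dump falls back to pickle's __reduce_ex__ for arbitrary objects, so one way to find the offending attribute is to walk the object graph and try to pickle each attribute individually. A rough sketch (the helper is hypothetical, not part of PyYAML):

```python
import pickle

def find_unpicklable(obj, path="root", seen=None):
    """Yield dotted attribute paths whose values fail pickle.dumps,
    recursing into attributes to narrow down the culprit."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return  # this subtree pickles fine, no need to descend
    except Exception:
        yield path
    for name, value in (vars(obj).items() if hasattr(obj, "__dict__") else ()):
        yield from find_unpicklable(value, "%s.%s" % (path, name), seen)
```

The deepest paths it yields (e.g. root.logger.lock) name the attribute to exclude from the dump, for instance via __getstate__ or a custom YAML representer.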

TypeError: can't pickle memoryview objects when running basic add.delay(1,2) test

Trying to run the most basic test of add.delay(1, 2) using Celery 4.1.0 with Python 3.6.4, and getting the following error:
[2018-02-27 13:58:50,194: INFO/MainProcess] Received task: exb.tasks.test_tasks.add[52c3fb33-ce00-4165-ad18-15026eca55e9]
[2018-02-27 13:58:50,194: CRITICAL/MainProcess] Unrecoverable error: SystemError(' returned a result with an error set',)
Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 570, in on_task_received
    callbacks,
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/strategy.py", line 145, in task_message_handler
    handle(req)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 221, in _process_task_sem
    return self._quick_acquire(self._process_task, req)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/semaphore.py", line 62, in acquire
    callback(*partial_args, **partial_kwargs)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 226, in _process_task
    req.execute_using_pool(self.pool)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/request.py", line 531, in execute_using_pool
    correlation_id=task_id,
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/base.py", line 155, in apply_async
    **options)
  File "/opt/myapp/lib/python3.6/site-packages/billiard/pool.py", line 1486, in apply_async
    self._quick_put((TASK, (result._job, None, func, args, kwds)))
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 813, in send_job
    body = dumps(tup, protocol=protocol)
TypeError: can't pickle memoryview objects

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
    blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/opt/myapp/lib/python3.6/site-packages/librabbitmq-2.0.0-py3.6-linux-x86_64.egg/librabbitmq/init.py", line 227, in drain_events
    self._basic_recv(timeout)
SystemError:  returned a result with an error set
I cannot find any previous evidence of anyone hitting this error. I noticed from the Celery site that only Python 3.5 is mentioned as supported; is that the issue, or is this something I am missing?
Any help would be much appreciated!
UPDATE: Tried with Python 3.5.5 and the problem persists.
UPDATE: Uninstalled librabbitmq and the problem stopped. This was seen after migrating from Python 2.7.5 / Django 1.7.7 to Python 3.6.4 / Django 2.0.2.
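For anyone who must keep librabbitmq installed for other projects, a hedged sketch of a guard that steers Celery to the pure-Python pyamqp transport whenever the C extension is importable (the helper and the default URL are assumptions for illustration, not part of Celery):

```python
import importlib.util

def pick_broker_url(host="localhost"):
    """Return a Celery broker URL, forcing the pure-Python 'pyamqp'
    transport when librabbitmq is installed (hypothetical helper)."""
    if importlib.util.find_spec("librabbitmq") is not None:
        # librabbitmq's C extension takes over the plain 'amqp' scheme and,
        # in this setup, fails to serialize memoryview objects on Python 3.
        return "pyamqp://guest@%s//" % host
    return "amqp://guest@%s//" % host
```

Passing the result as the broker argument (e.g. Celery('app', broker=pick_broker_url())) avoids the C transport without uninstalling the package.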
