I have the following error when loading model weights in Colaboratory:
IOPub data rate exceeded. The notebook server will temporarily stop
sending output to the client in order to avoid crashing it. To change
this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
Current values: NotebookApp.iopub_data_rate_limit=1000000.0
(bytes/sec) NotebookApp.rate_limit_window=3.0 (secs)
UnicodeDecodeError                        Traceback (most recent call last)
in ()
      6 if if_init_from_ckpt_file:
      7     print('load saved model from', ckpt_file)
----> 8     model.load_weights(ckpt_file)
      9
An IOPub error usually occurs when you try to print a large amount of data to the console. Check your print statements: if you're printing a file or data variable large enough to exceed the configured rate limit (1,000,000 bytes/sec in the message above), that's likely what caused the error. Try printing smaller portions of the file/data.
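For example, a minimal sketch (large_text here is a hypothetical stand-in for whatever you were printing) that sends only a preview to the console instead of the full contents:

# Hypothetical large object; substitute whatever you were printing.
large_text = open('some_large_file.txt').read()

# Print only a small slice so the notebook's IOPub rate limit isn't hit.
print(large_text[:1000])
print('... ({} characters total)'.format(len(large_text)))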
I am having trouble understanding the cause of a Snowflake error encountered during data processing. For context: we create an iterator from pandas read_sql_query() with a chunk size of 5000 records. We then process each chunk of 5000 records in an iteration, apply some manipulations, and store the result in reporting tables, repeating this until the iterator has no records left.
df_iterator = pd.read_sql_query(
    sql=query_statement, con=con_source, chunksize=5000
)

for df_source in df_iterator:
    # Perform record-level manipulation
    # Then store it into reporting tables
    # Repeat until df_iterator is empty
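Roughly, the full loop body looks like this; transform_chunk, con_target, and the reporting_table name are placeholders for our actual manipulation logic and destination:

for df_source in df_iterator:
    # Placeholder for the record-level manipulation applied to each chunk
    df_transformed = transform_chunk(df_source)

    # Append the transformed chunk to a reporting table
    # (con_target is a placeholder for the reporting database connection)
    df_transformed.to_sql(
        'reporting_table', con=con_target, if_exists='append', index=False
    )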
Now, after 8-9 hours of data processing, I suddenly started getting the error below:
ERROR:snowflake.connector.network:HTTP 403: Forbidden
Traceback (most recent call last):
  File "/home/ec2-user/python-venv/.venv/lib64/python3.6/dist-packages/snowflake/connector/network.py", line 652, in _request_exec_wrapper
    **kwargs)
  File "/home/ec2-user/python-venv/.venv/lib64/python3.6/dist-packages/snowflake/connector/network.py", line 913, in _request_exec
    raise err
  File "/home/ec2-user/python-venv/.venv/lib64/python3.6/dist-packages/snowflake/connector/network.py", line 824, in _request_exec
    raise RetryRequest(exi)
snowflake.connector.network.RetryRequest: HTTP 403: Forbidden
ERROR:snowflake.connector.chunk_downloader:Failed to fetch the large result set chunk 128/398
We have run this code for much longer durations before, so I am not sure what is going wrong now.
Some more information on the Python version and installed modules:
Python version : 3.6.10
snowflake-connector-python : 2.1.3
snowflake-sqlalchemy : 1.1.18
SQLAlchemy : 1.3.13
Can someone please assist? Any pointers on how to proceed would be appreciated; I am having a hard time understanding what's causing this. Thanks in advance.
I've successfully run an ML Pipeline experiment and published the Azure ML Pipeline without issues. When I run the following directly after the successful run and publish (i.e. I'm running all cells using Jupyter), the test fails!
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})
run_id = response.json()["Id"]
Here is the error in azureml-logs/70_driver_log.txt:
[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 48, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Here are the errors in logs/sys/warning.txt:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry
[...]
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:
(with the same URL as above).
Next...
When I wait a few minutes and rerun the following code/cell:
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})
run_id = response.json()["Id"]
It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference). Also, there is no user error here in the logs.
Any ideas as to what could be going on here?
Thanks in advance for any insights you might have, and happy coding! :)
========== UPDATE #1: ==========
Running on 1 file with ~300k rows. Sometimes the job works and sometimes it doesn't. We've tried many versions with different config settings, all of which fail from time to time. We changed the sklearn models to use n_jobs=1. We're scoring text data for NLP work.
default_ds = ws.get_default_datastore()

# output dataset
output_dir = OutputFileDatasetConfig(destination=(default_ds, 'model/results')).register_on_complete(name='model_inferences')

# location of scoring script
experiment_folder = 'model_pipeline'

# run invocation timeout, in seconds (24 hours)
rit = 60*60*24

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
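For context, this config feeds a ParallelRunStep roughly like the sketch below; the dataset name, step name, and Pipeline wiring are placeholders rather than our exact code:

from azureml.core import Dataset
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunStep

# Placeholder: the registered dataset of text records to score.
input_ds = Dataset.get_by_name(ws, name='text_to_score')

parallel_run_step = ParallelRunStep(
    name='batch-scoring',                      # placeholder step name
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input('input_data')],
    output=output_dir,
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])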
Our next test was going to be to put each row of data into its own file. I tried this with just 30 rows, i.e. 30 files each with 1 record for scoring, and I'm still getting the same error. This time I changed the error threshold to 1.
2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1.
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
And on the runs where it does complete, only some of the records are returned. One time the number of records returned was, I think, 25 or 23, and another time it was 15.
========== UPDATE #2: 12/17/2020 ==========
I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with 1 record, and the job sometimes completes, but it doesn't return 30 records. Other times it returns an error, and I'm still getting the "No temp file found" error.
I think I might have answered my own question. The issue seems to have been with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say that OutputFileDatasetConfig is still experimental.
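Concretely, the change was just swapping the output definition back to PipelineData; a rough sketch (the name is the one I use, and the rest of the ParallelRunConfig stayed the same):

from azureml.pipeline.core import PipelineData

# Pipeline-managed output location; replaces the OutputFileDatasetConfig
# definition shown earlier.
output_dir = PipelineData(name='model_inferences', datastore=default_ds)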
The thing I still don't understand is how we're supposed to pick up the results of an ML Studio Pipeline from a Data Factory Pipeline without OutputFileDatasetConfig. PipelineData outputs the results in a folder based on the child step's run id, so how is Data Factory supposed to know where to get the results? I would love to hear any feedback anyone might have. Thanks :)
== Update ==
For picking up results of an ML Studio Pipeline from a Data Factory Pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline
== Update #2 ==
https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789
Hi #yeamusic21 , thank you for your feedback, in current version,
OutputDatasetConfig can't work with ParallelRunStep, we are working on
fixing it.
I am training a CNN model to recognise images. However, I get an error when running this code:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
path, get_image_files(path), valid_pct=0.2, seed=42,
label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
error:
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
in
----> 1 learn.fine_tune(1)
RuntimeError: DataLoader worker (pid(s) 12456, 4440, 3268, 448) exited unexpectedly
The error happens at the last line (it was a longer error, but SO does not let me submit all of it).
I am not running on a GPU (as suggested on the internet) because I haven't really figured out how to tell Jupyter Notebook to do that.
Can you help?
Thanks, Luigi
You can add num_workers=0.
Example:
ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(224), num_workers=0)
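Applied to the code in the question, that looks roughly like this; num_workers=0 makes the DataLoader run in the main process, which is a common workaround when worker processes die on CPU-only setups:

from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()

# num_workers=0 disables multiprocessing for data loading, avoiding the
# "DataLoader worker exited unexpectedly" error.
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), num_workers=0)

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)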
I dumped a Jupyter Notebook session using dill.dump_session(filename), and at one point it told me that the disk storage was full. However, I made some space on the disk and tried again. Now, I am unable to load back the session using, dill.load_session(filename).
I get the following error:
~/.local/lib/python3.6/site-packages/dill/_dill.py in load_session(filename, main)
408 unpickler._main = main
409 unpickler._session = True
--> 410 module = unpickler.load()
411 unpickler._session = False
412 main.__dict__.update(module.__dict__)
EOFError: Ran out of input
And the file (i.e. filename) is about 30 GB in size.
How can I retrieve my data from the file?
BTW, I’m running all this on Google Cloud, and it’s costing me a fortune to keep the instance up and running.
I have tried using undill, and other unpickle methods.
For example, I tried this:

import pickle

def load_scores(file):
    open(file, 'a').close()
    try:
        with open(file, "rb") as Score_file:
            unpickler = pickle.Unpickler(Score_file)
            scores = unpickler.load()
            return scores
    except Exception:
        raise  # re-raise so I can see the full traceback
But got this error:
      6 with open(file, "rb") as Score_file:
      7     unpickler = pickle.Unpickler(Score_file)
----> 8     scores = unpickler.load()
      9
     10 return scores

ModuleNotFoundError: No module named '__builtin__'
I know this probably isn't the answer you want to hear, but... it sounds like you may have a corrupt pickle file. If that's the case, you can get the data back only if you edit it by hand, and can understand what the pickled strings are and how they are structured. Note that there are some very rare cases that an object will dump, but not load -- however, it's much more likely you have a corrupt file. Either way, the resolution is the same... a hand edit is the only way to potentially save what you have pickled.
Also, note that if you use dump_session, you really should use load_session (as it does a sequence of steps on top of a standard load, reversing what is done in dump_session) -- though that's really irrelevant here; the issue is likely an incomplete or corrupt pickle file.
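If you want to poke at the file to see how far the dump got before the disk filled up, the standard library's pickletools can disassemble the start of the stream; this is only an inspection sketch (the filename is a placeholder), not a recovery tool:

import pickletools

# Read only the first chunk of the (possibly truncated) 30 GB pickle.
with open('session.pkl', 'rb') as f:        # placeholder filename
    head = f.read(1_000_000)

try:
    # Prints the pickle opcodes it can decode, and stops with an error
    # where the stream becomes unreadable.
    pickletools.dis(head)
except Exception as err:
    print('Disassembly stopped:', err)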
So, I'm trying to get started on text classification and the like in TensorFlow and Python.
The initial step in such tasks is to build embeddings, so I started working with the GloVe pretrained vectors, the 840B 300d set (around 5 GB).
I ran into problems right away when trying to load the dataset itself. I run a CPU-based TensorFlow build (no NVIDIA GPU advantage :( ).
Loading gets stuck for an insanely long time and there seems to be no way around it, so I tried passing the input in stages, like the following.
The file has already been opened as file = open('glove.840B.300d.txt', 'r') and that works. Now...
def scanlin(n):
    for i in range(n, n+20):
        line = file.readline(i)
        if line == '':
            break
        splitline = line.split()
        word = splitline[0]
        embedding = [float(val) for val in splitline[1:]]
        model[word] = embedding
    print('Done')
It executes successfully, but when I call the function as scanlin(1), it says:

Traceback (most recent call last):
  File "", line 1, in
  File "", line 8, in scanlin
  File "", line 8, in
ValueError: could not convert string to float: '-'
I then tried using embedding.append() instead, but got the same error.
However, when I limit the amount read, it works up to about n=50000 words with some effort.
def loadset(file, n):
    embeddings_index = {}
    i = 1
    for line in file:
        if i > n:
            break
        values = line.strip().split(" ")
        word = values[0]
        coefs = [float(val) for val in values[1:]]
        embeddings_index[word] = coefs
        i = i + 1
    return coefs, embeddings_index
The above works and I can load the embeddings for each word, but how do I change it to sequentially process the input data on each call and integrate the results into a fully loaded dataset? This is Python 3.5 on Ubuntu 16.04 with TensorFlow, and I will need to work with much larger datasets soon.
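To make the "sequential processing" idea concrete, this is roughly the shape I'm aiming for: a generator that yields the embeddings in batches (the batch size is arbitrary and malformed lines are skipped rather than crashing), with the full index built up incrementally:

def load_batches(path, batch_size=50000):
    """Yield {word: vector} dicts of at most batch_size entries at a time."""
    batch = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            word = values[0]
            try:
                batch[word] = [float(val) for val in values[1:]]
            except ValueError:
                continue  # skip lines whose token itself contains spaces
            if len(batch) >= batch_size:
                yield batch
                batch = {}
    if batch:
        yield batch

# Build the full index incrementally, one batch per iteration.
embeddings_index = {}
for batch in load_batches('glove.840B.300d.txt'):
    embeddings_index.update(batch)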