I've successfully run an ML Pipeline experiment and published the Azure ML Pipeline without issues. When I run the following directly after the successful run and publish (i.e. I'm running all cells using Jupyter), the test fails!
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})
run_id = response.json()["Id"]
Here is the error in azureml-logs/70_driver_log.txt:
[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 48, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Here are the errors in logs/sys/warning.txt:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry
[...]
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: (same URL as above)
Next...
When I wait a few minutes and rerun the following code/cell:
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})
run_id = response.json()["Id"]
It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference). Also, there is no user error here in the logs.
Any ideas as to what could be going on here?
Thanks in advance for any insights you might have, and happy coding! :)
========== UPDATE #1: ==========
Running on 1 file with ~300k rows. Sometimes the job works and sometimes it doesn't. We've tried many versions with different config settings; all result in a failure from time to time. We changed the sklearn models to use n_jobs=1. We're scoring text data for NLP work.
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig

default_ds = ws.get_default_datastore()

# output dataset
output_dir = OutputFileDatasetConfig(destination=(default_ds, 'model/results')).register_on_complete(name='model_inferences')

# location of scoring script
experiment_folder = 'model_pipeline'

rit = 60*60*24  # run invocation timeout: 24 hours

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
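For context, here is a minimal sketch of how a config like this is typically wired into a ParallelRunStep and submitted (the input dataset name, step name, and experiment name are assumptions for illustration, not our actual code):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunStep

# hypothetical input; the real input dataset is not shown in this post
input_ds = ws.datasets['text_to_score'].as_named_input('input_ds')

parallel_step = ParallelRunStep(
    name="batch-scoring",                      # assumed step name
    parallel_run_config=parallel_run_config,
    inputs=[input_ds],
    output=output_dir,
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[parallel_step])
run = Experiment(ws, 'batch-scoring-exp').submit(pipeline)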
Our next test was going to be to put each row of data into its own file. I tried this with just 30 rows, i.e. 30 files, each with 1 record for scoring, and I'm still getting the same error. This time I changed the error threshold to 1.
2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1.
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
And on the rounds where it does complete, only some of the records are returned. One time the number of records returned was, I think, 25 or 23; another time it was 15.
========== UPDATE #2: 12/17/2020 ==========
I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with 1 record, and the job sometimes completes, but it doesn't return 30 records. Other times it returns an error, still with the "No temp file found" error.
I think I might have answered my own question. I think the issue was with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say that OutputFileDatasetConfig is still experimental.
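For reference, a minimal sketch of the PipelineData equivalent switched back to here (the output name is an assumption):

from azureml.pipeline.core import PipelineData

# PipelineData writes under a run-id-specific folder on the datastore
output_dir = PipelineData(name='inferences', datastore=default_ds)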
The thing I still don't understand is how we're supposed to pick up the results of an ML Studio Pipeline from a Data Factory Pipeline without OutputFileDatasetConfig. PipelineData outputs the results into a folder based on the child step run id, so how is Data Factory supposed to know where to get the results? I'd love to hear any feedback anyone might have. Thanks :)
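One pattern that can sidestep this (a sketch under assumptions, not necessarily the approach in the update below): add a final pipeline step that takes the PipelineData as input and re-uploads the results to a fixed, well-known datastore path that Data Factory can poll. All names here are hypothetical:

# copy_results.py - hypothetical final pipeline step
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--input_dir', type=str)  # mounted PipelineData folder
args = parser.parse_args()

run = Run.get_context()
datastore = run.experiment.workspace.get_default_datastore()
# copy the run-id-specific results to a stable path for Data Factory to read
datastore.upload(src_dir=args.input_dir,
                 target_path='model/results/latest',
                 overwrite=True)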
== Update ==
For picking up the results of an ML Studio Pipeline from a Data Factory Pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline.
== Update #2 ==
https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789
Hi @yeamusic21, thank you for your feedback. In the current version, OutputDatasetConfig can't work with ParallelRunStep; we are working on fixing it.
Related
I am facing an issue understanding the cause of a Snowflake error encountered during data processing. To give some context: we create an iterator from pandas read_sql_query() with a chunk size of 5000 records. We then process each batch of 5000 records in an iteration, make some manipulations, and store the results in reporting tables, repeating this process until the iterator has no records left.
import pandas as pd

df_iterator = pd.read_sql_query(
    sql=query_statement, con=con_source, chunksize=5000
)
for df_source in df_iterator:
    # perform record-level manipulation,
    # then store the results into reporting tables;
    # repeat until df_iterator is empty
    pass
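For illustration, a fuller sketch of the pattern described above; the manipulation and table name are made-up placeholders, and query_statement and con_source come from the surrounding context:

import pandas as pd

df_iterator = pd.read_sql_query(
    sql=query_statement, con=con_source, chunksize=5000
)
for df_source in df_iterator:
    # example manipulation (placeholder, not the real logic)
    df_source['processed_at'] = pd.Timestamp.utcnow()
    # store the chunk into a reporting table (placeholder name)
    df_source.to_sql('reporting_table', con=con_source,
                     if_exists='append', index=False)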
Now, after 8-9 hours of data processing, I am suddenly getting the below error:
ERROR:snowflake.connector.network:HTTP 403: Forbidden
Traceback (most recent call last):
  File "/home/ec2-user/python-venv/.venv/lib64/python3.6/dist-packages/snowflake/connector/network.py", line 652, in _request_exec_wrapper
    **kwargs)
  File "/home/ec2-user/python-venv/.venv/lib64/python3.6/dist-packages/snowflake/connector/network.py", line 913, in _request_exec
    raise err
  File "/home/ec2-user/python-venv/.venv/lib64/python3.6/dist-packages/snowflake/connector/network.py", line 824, in _request_exec
    raise RetryRequest(exi)
snowflake.connector.network.RetryRequest: HTTP 403: Forbidden
ERROR:snowflake.connector.chunk_downloader:Failed to fetch the large result set chunk 128/398
We have run this code for much longer durations before, but I'm not sure what's going wrong now.
Some more information on Python and the installed modules:
Python version : 3.6.10
snowflake-connector-python : 2.1.3
snowflake-sqlalchemy : 1.1.18
SQLAlchemy : 1.3.13
Can someone please assist? Any pointers on how to proceed would be appreciated; I'm having a hard time understanding what's causing this. Thanks in advance.
I was able to run the Flask app with yolov5 on a PC with an internet connection. I followed the steps mentioned in the yolov5 docs and used this file: yolov5/utils/flask_rest_api/restapi.py.
But I need to achieve the same offline (on a particular PC). The issue is that when I use the following:
model = torch.hub.load("ultralytics/yolov5", "yolov5s", force_reload=True)
It tries to download the model from the internet and throws an error:
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>
How can I get the same results offline?
Thanks in advance.
If you want to run detection offline, you need to have the model already downloaded.
So, download the model (for example yolov5s.pt) from https://github.com/ultralytics/yolov5/releases and store it, for example, in yolov5/models.
After that, replace
# model = torch.hub.load("ultralytics/yolov5", "yolov5s", force_reload=True) # force_reload to recache
with
model = torch.hub.load(r'C:\Users\Milan\Projects\yolov5', 'custom', path=r'C:\Users\Milan\Projects\yolov5\models\yolov5s.pt', source='local')
With this line, you can also run detection offline.
Note: When you start the app for the first time with the updated torch.hub.load, it will download the model if it is not present (so you do not need to fetch it manually from https://github.com/ultralytics/yolov5/releases).
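For completeness, a quick usage sketch with the locally loaded model (the image path is just an assumed example):

import torch

model = torch.hub.load(r'C:\Users\Milan\Projects\yolov5', 'custom',
                       path=r'C:\Users\Milan\Projects\yolov5\models\yolov5s.pt',
                       source='local')
results = model('test_image.jpg')  # hypothetical local image
results.print()                    # print a summary of the detections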
There is one more issue involved here: when this code is run on a machine that has no internet connection at all, you may face the following error.
Downloading https://ultralytics.com/assets/Arial.ttf to /home/<local_user>/.config/Ultralytics/Arial.ttf...
Traceback (most recent call last):
File "/home/<local_user>/Py_Prac_WSL/yolov5-flask-master/yolov5/utils/plots.py", line 56, in check_pil_font
return ImageFont.truetype(str(font) if font.exists() else font.name, size)
File "/home/<local_user>/.local/share/virtualenvs/23_Jun-82xb8nrB/lib/python3.8/site-packages/PIL/ImageFont.py", line 836, in truetype
return freetype(font)
File "/home/<local_user>/.local/share/virtualenvs/23_Jun-82xb8nrB/lib/python3.8/site-packages/PIL/ImageFont.py", line 833, in freetype
return FreeTypeFont(font, size, index, encoding, layout_engine)
File "/home/<local_user>/.local/share/virtualenvs/23_Jun-82xb8nrB/lib/python3.8/site-packages/PIL/ImageFont.py", line 193, in __init__
self.font = core.getfont(
OSError: cannot open resource
To overcome this error, you need to manually download the Arial.ttf file from https://ultralytics.com/assets/Arial.ttf and paste it into the following location. On Linux:
/home/<your_pc_user>/.config/Ultralytics
On Windows, paste Arial.ttf here:
C:\Windows\Fonts
The first line of the error message mentions the same thing. After this, the code runs smoothly in offline mode.
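If you prefer doing this step in code, a minimal sketch for the Linux case (assuming you have already transferred Arial.ttf to the offline machine):

import pathlib
import shutil

# copy a locally available Arial.ttf into the directory Ultralytics checks
dest_dir = pathlib.Path.home() / '.config' / 'Ultralytics'
dest_dir.mkdir(parents=True, exist_ok=True)
shutil.copy('Arial.ttf', dest_dir / 'Arial.ttf')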
Further, as mentioned at https://docs.ultralytics.com/tutorials/pytorch-hub/, any custom-trained model other than the ones uploaded to the PyTorch model hub can be accessed with this code:
path_hubconfig = 'absolute/path/to/yolov5'
path_trained_model = 'absolute/path/to/best.pt'
model = torch.hub.load(path_hubconfig, 'custom', path=path_trained_model, source='local') # local repo
With this code, object detection is carried out by the locally saved custom-trained model. Once the custom-trained model is saved locally, this piece of code accesses it directly, avoiding any need for the internet.
I have a program that runs in AWS and reads over 400k documents from a DB. It ran flawlessly until recently. I'm not sure what changed, but now I'm getting pymongo.errors.CursorNotFound: cursor id "..." not found.
I tried researching and it seems to be a connection issue to the DB, but I have not changed anything.
Below is the stack trace:
Text Analysis Started....
DB Connection init...
Traceback (most recent call last):
File "predict.py", line 8, in <module>
textanalyser.start()
File "/usr/src/app/text_analyser.py", line 100, in start
for row in table_data:
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 1156, in next
if len(self.__data) or self._refresh():
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 1093, in _refresh
self.__send_message(g)
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 955, in __send_message
address=self.__address)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1346, in _run_operation_with_response
exhaust=exhaust)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1464, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1340, in _cmd
unpack_res)
File "/usr/local/lib/python3.7/site-packages/pymongo/server.py", line 136, in run_operation_with_response
_check_command_response(first)
File "/usr/local/lib/python3.7/site-packages/pymongo/helpers.py", line 156, in _check_command_response
raise CursorNotFound(errmsg, code, response)
pymongo.errors.CursorNotFound: cursor id 3011673819761463104 not found
Any help you can provide would be greatly appreciated.
This is a very common issue in MongoDB. I will elaborate on the issue first, then provide possible workarounds.
Whenever you perform a find or aggregate operation in MongoDB, it returns a cursor with a unique cursor id assigned to it. That cursor has a deadline: the server deletes it after a few minutes of inactivity (10 minutes by default). This is done to save memory and CPU on the machine running MongoDB. Each batch returned through a cursor is also capped at 16MB (the maximum BSON document size) or the value set in the MongoDB config file.
Let's assume you perform a find operation returning 1000 records in batches of 100 against a MongoDB server configured with a 10-minute cursor idle timeout. If processing documents 300-400 takes more than 10 minutes, the cursor is terminated, and you won't be able to get the 400-500 batch since the server can no longer match that cursor id.
There are a few workarounds, though.
Workaround - 1:
You can set the no-cursor-timeout option, no_cursor_timeout=True, on find commands.
Note: don't forget to close the cursor at the end.
cursor = col.find({}, no_cursor_timeout=True)
for x in cursor:
print(x)
cursor.close() # <- Don't forget to close the cursor
Workaround - 2:
Additionally, you can limit the batch size to a smaller number, e.g. batch_size=1.
What this does is send documents in batches of 1, overriding the default, so the client fetches from the server far more often and the cursor never sits idle for long while you process each document.
cursor = col.find({}, no_cursor_timeout=True, batch_size=1)
for x in cursor:
print(x)
cursor.close() # <- Don't forget to close the cursor
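Building on the note about always closing the cursor, here is a sketch that guarantees the server-side cursor is released even if your processing raises an exception (process_document is a hypothetical placeholder for your per-document work):

cursor = col.find({}, no_cursor_timeout=True, batch_size=1)
try:
    for x in cursor:
        process_document(x)  # hypothetical per-document work
finally:
    cursor.close()  # always release the no-timeout cursor on the server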
I am using the ete3 (http://etetoolkit.org/) package in Python within a bioinformatics pipeline I wrote myself.
While running this script, I get the following error. I have used this script a lot on other datasets, which didn't have any issues and never produced errors. I am using Python 3.5 and Miniconda. Any fixes/insights to resolve this error would be appreciated.
[Error]
Traceback (most recent call last):
File "/Users/d/miniconda2/envs/py35/bin/ete3", line 11, in <module>
load_entry_point('ete3==3.1.1', 'console_scripts', 'ete3')()
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/tools/ete.py", line 95, in main
_main(sys.argv)
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/tools/ete.py", line 268, in _main
args.func(args)
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/tools/ete_ncbiquery.py", line 168, in run
collapse_subspecies=args.collapse_subspecies)
File "/Users/d/miniconda2/envs/py35/lib/python3.5/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 434, in get_topology
lineage = id2lineage[sp]
KeyError: 3
Continuing from the comment section for better formatting.
Assuming that sp contains 3, as suggested by the error message (do check this yourself), you can inspect the ete3 code (current version) and, following its definition, trace it to:
def get_lineage_translator(self, taxids):
    """Given a valid taxid number, return its corresponding lineage track as a
    hierarchically sorted list of parent taxids.
    """
So I went to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi and checked whether 3 is a valid taxid, and it appears that it is not:
# relevant section from the NCBI taxonomy browser
No result found in the Taxonomy database for taxonomy id 3
It appears to me that your only option is to trace how the 3 gets computed, because the root cause is simply that taxid 3 is not a valid taxid number, as required by the function.
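As a starting point, a small sketch to check which of your input taxids NCBI actually knows about (this assumes ete3's local NCBI taxonomy database is already installed; the taxid list is an example):

from ete3 import NCBITaxa

ncbi = NCBITaxa()
taxids = [3, 9606, 10090]  # example list; 3 is the suspect id
known = ncbi.get_taxid_translator(taxids)  # {taxid: scientific name}
unknown = [t for t in taxids if t not in known]
print('unknown taxids:', unknown)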
I'm trying to sort ~13,000 documents on my Mac's local CouchDB database by date, but it gets hung up on document 5407 each time. I've tried increasing the time-out tolerance on Futon but to no avail. This is the error message I'm getting:
for row in db.view('index15/by_date_time', startkey=start, endkey=end):
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 984, in iter
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 1003, in rows
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 990, in _fetch
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 880, in _exec
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 393, in get_json
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 374, in get
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 419, in _request
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 239, in request
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 205, in _try_request_with_retries
socket.error: 54
Incidentally, this is the same error message that is produced when I have a typo in my script.
I'm using couchpy to create the view as follows:
def dateTimeToDocMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt
    if doc.get('Date'):
        # [year, month, day, hour, min, sec]
        _date = list(dt.timetuple(parse(doc['Date']))[:-3])
        yield (_date, doc)
While this is running, I can open a Python shell and, using server.tasks(), see that the indexing is indeed taking place.
>>> server.tasks()
[{u'status': u'Processed 75 of 13567 changes (0%)', u'pid': u'<0.451.0>', u'task': u'gmail2 _design/index11', u'type': u'View Group Indexer'}]
But each time it gets stuck at 5407 of 13567 changes (it takes ~8 minutes to get that far). I have examined what I believe to be document 5407, and it doesn't appear to be anything out of the ordinary.
Incidentally, if I try to restart the process after it stops, I get this response from server.tasks():
>>> server.tasks()
[{u'status': u'Processed 0 of 8160 changes (0%)', u'pid': u'<0.1224.0>', u'task': u'gmail2 _design/index11', u'type': u'View Group Indexer'}]
In other words, CouchDB seems to have recognized that it has already processed the first 5407 of the 13567 changes and now has only 8160 left.
But then it almost immediately quits and gives me the same socket.error: 54.
I have been searching the internet for the last few hours to no avail. I have tried initiating the indexing from other locations, such as Futon. As I mentioned, one of my errors was an OS timeout error, and increasing the timeout thresholds in Futon's configuration seemed to help with that.
Please, if anyone could shed light on this issue, I would be very grateful. I'm wondering if there's a way to restart the process once it has already indexed 5407 documents, or better yet, a way to prevent the thing from quitting a third of the way through in the first place.
Thanks so much.
From what I gather, CouchDB builds your view contents by sending all documents to your couchpy view server, which runs your Python code on that document. If that code fails for any reason, CouchDB will be notified that something went wrong, which will stop the update of the view contents.
So, there is something wrong with document 5408 that causes your Python code to misbehave. If you need more help, I suggest you post that document here. Alternatively, look into the logs for your couchpy view server: they might contain information about how your code failed.
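If the culprit turns out to be an unparseable Date value, a defensive variant of the map function (a sketch, not a confirmed fix) would skip bad documents instead of crashing the view server:

def dateTimeToDocMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt
    date_str = doc.get('Date')
    if not date_str:
        return
    try:
        # [year, month, day, hour, min, sec]
        _date = list(dt.timetuple(parse(date_str))[:-3])
    except (ValueError, OverflowError):
        return  # unparseable date: emit nothing for this document
    yield (_date, doc)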