CouchDB stops with socket error 54 partway through first indexing - couchdb

I'm trying to sort ~13,000 documents in my local CouchDB database (on a Mac) by date, but the indexing gets hung up on document 5407 every time. I've tried increasing the timeout settings in Futon, but to no avail. This is the error message I'm getting:
for row in db.view('index15/by_date_time', startkey=start, endkey=end):
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 984, in iter
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 1003, in rows
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 990, in _fetch
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 880, in _exec
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 393, in get_json
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 374, in get
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 419, in _request
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 239, in request
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 205, in _try_request_with_retries
socket.error: 54
Incidentally, this is the same error message I get when there's a typo in my script.
I'm using couchpy to create the view as follows:
def dateTimeToDocMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt
    if doc.get('Date'):
        # [year, month, day, hour, min, sec]
        _date = list(dt.timetuple(parse(doc['Date']))[:-3])
        yield (_date, doc)
While this is running, I can open a Python shell and, using server.tasks(), see that the indexing is indeed taking place.
>>> server.tasks()
[{u'status': u'Processed 75 of 13567 changes (0%)', u'pid': u'<0.451.0>', u'task': u'gmail2 _design/index11', u'type': u'View Group Indexer'}]
But each time it gets stuck at "Processed 5407 of 13567 changes" (it takes ~8 minutes to get that far). I have examined what I believe to be document 5407, and nothing about it appears out of the ordinary.
Incidentally, if I try to restart the process after it stops, I get this response from server.tasks():
>>> server.tasks()
[{u'status': u'Processed 0 of 8160 changes (0%)', u'pid': u'<0.1224.0>', u'task': u'gmail2 _design/index11', u'type': u'View Group Indexer'}]
In other words, CouchDB seems to recognize that it has already processed the first 5407 of the 13567 changes and now has only 8160 left.
But then it almost immediately quits and gives me the same socket.error: 54.
I have been searching the internet for the last few hours to no avail. I have also tried initiating the indexing from other places, such as Futon. As I mentioned, one of my earlier errors was an OS timeout error, and increasing the timeout thresholds in Futon's configuration seemed to help with that.
If anyone could shed light on this issue, I would be very grateful. I'm wondering if there's a way to resume the process once it has already indexed 5407 documents, or better yet, a way to prevent it from quitting a third of the way through in the first place.
Thanks so much.

From what I gather, CouchDB builds your view contents by sending every document to your couchpy view server, which runs your Python map function on it. If that code fails for any reason, CouchDB is notified that something went wrong and stops updating the view.
So there is something wrong with document 5408 that causes your Python code to misbehave. If you need more help, I suggest you post that document here. Alternatively, look at the logs for your couchpy view server: they might contain information about how your code failed.
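If you want to pinpoint the offending document without stopping the indexer, one option is to make the map function defensive. This is only a sketch (the log path is a placeholder, and writing files from a view server is a debugging hack, not standard practice): wrap the body in a try/except, log the _id of any document that blows up, and skip it so indexing can finish.
def dateTimeToDocMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt
    try:
        if doc.get('Date'):
            # [year, month, day, hour, min, sec]
            _date = list(dt.timetuple(parse(doc['Date']))[:-3])
            yield (_date, doc)
    except Exception as e:
        # Record which document broke the view so it can be inspected later.
        with open('/tmp/bad_view_docs.log', 'a') as f:
            f.write('%s: %s\n' % (doc.get('_id'), e))
Once the log names a document _id, fetching that document and running parse(doc['Date']) on it in a shell should reproduce the failure directly.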

Related

pymongo.errors.CursorNotFound: cursor id "..." not found

I have a program that runs in AWS and reads over 400k documents from a DB. It ran flawlessly until recently. I'm not sure what changed, but now I'm getting pymongo.errors.CursorNotFound: cursor id "..." not found.
From my research it seems to be a connection issue with the DB, but I have not changed anything.
Below is the stack trace:
Text Analysis Started....
DB Connection init...
Traceback (most recent call last):
File "predict.py", line 8, in <module>
textanalyser.start()
File "/usr/src/app/text_analyser.py", line 100, in start
for row in table_data:
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 1156, in next
if len(self.__data) or self._refresh():
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 1093, in _refresh
self.__send_message(g)
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 955, in __send_message
address=self.__address)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1346, in _run_operation_with_response
exhaust=exhaust)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1464, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1340, in _cmd
unpack_res)
File "/usr/local/lib/python3.7/site-packages/pymongo/server.py", line 136, in run_operation_with_response
_check_command_response(first)
File "/usr/local/lib/python3.7/site-packages/pymongo/helpers.py", line 156, in _check_command_response
raise CursorNotFound(errmsg, code, response)
pymongo.errors.CursorNotFound: cursor id 3011673819761463104 not found
Any help you can provide would be greatly appreciated.
This is a very common issue with MongoDB. I will explain the cause first and then offer possible workarounds.
Whenever you perform a find or aggregate operation, MongoDB returns a cursor with a unique cursor id assigned to it. The server deletes that cursor after a few minutes of inactivity (10 minutes by default) to save memory and CPU on the machine running MongoDB. Each batch of documents returned through the cursor is also limited in size (roughly 16 MB, unless configured otherwise on the server).
Suppose you perform a find on 1000 records with a batch size of 100 against a server configured with a 10-minute cursor idle timeout. If processing the 300-400 batch takes more than 10 minutes, the cursor is terminated on the server, and fetching the 400-500 batch fails because the server can no longer match that cursor id.
There are a few workarounds, though.
Workaround - 1:
You can pass the no-cursor-timeout option no_cursor_timeout=True to find commands so the cursor is never reaped on the server.
Note: don't forget to close the cursor at the end.
cursor = col.find({}, no_cursor_timeout=True)
for x in cursor:
print(x)
cursor.close() # <- Don't forget to close the cursor
Workaround - 2:
Additionally, limit the batch size to a smaller number, e.g. batch_size=10.
What this does is fetch documents in batches of 10, overriding the default, so the cursor is refreshed on the server more frequently.
cursor = col.find({}, no_cursor_timeout=True, batch_size=10)
for x in cursor:
print(x)
cursor.close() # <- Don't forget to close the cursor
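A third pattern worth considering, not from the original answer but a common approach (a sketch; process() and the connection details are placeholders): catch CursorNotFound and resume from the last successfully processed _id, so a dropped cursor costs you a re-query rather than the whole run.
from pymongo import MongoClient
from pymongo.errors import CursorNotFound

col = MongoClient()["mydb"]["mycol"]  # assumed connection details

def process(doc):
    pass  # placeholder for the slow per-document work

last_id = None
while True:
    # Resume after the last document that was processed successfully.
    query = {"_id": {"$gt": last_id}} if last_id else {}
    cursor = col.find(query, no_cursor_timeout=True, batch_size=10).sort("_id", 1)
    try:
        for doc in cursor:
            process(doc)
            last_id = doc["_id"]
        break  # finished without losing the cursor
    except CursorNotFound:
        continue  # cursor died anyway; re-query starting from last_id
    finally:
        cursor.close()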

python-docx: Error opening file - "Bad magic number for file header" / "EOFError"

The company I work for distributes document assembly software that uses the python-docx library. The software runs a function on every generated document that opens the document and does a simple search and replace for characters that weren't escaped properly (namely "&amp;" -> "&").
FYI: the actual document assembly uses python-docx-template. However, the error happens after the document has already been assembled; it is triggered by the search-and-replace function, which only uses python-docx.
Recently, we've had a few cases where documents are failing to generate on client deployments. They're throwing an error on this line where the document object is instantiated:
doc = Document(docx=Path(doc_path))
We've seen two errors:
raise BadZipFile("Bad magic number for file header")
and
raise EOFError
The software is widely used and we've never had this issue before. We can't reproduce it in our test environments. The error has only started appearing in the past week but has shown up for several clients after they were updated. The software will fail to generate a particular document some number of times but will succeed after a few tries.
We've only seen it happen with one document in particular, but all documents use the same search-and-replace function, and as I said the error is only intermittent even with the problem document.
There have been no changes in code to this search and replace function and I can't think of any other meaningful difference to our doc assembly process that would explain this.
I'm having a lot of trouble finding info on what could cause this specifically with the python-docx library. Is this a sign that the generated document is corrupted? If anyone is able to shed some light on possible causes that would be very helpful!
Here's the stack trace for both errors:
Bad magic number...
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 104, in _walk_phys_parts
part_srels = PackageReader._srels_for(phys_reader, partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 83, in _srels_for
rels_xml = phys_reader.rels_xml_for(source_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 129, in rels_xml_for
rels_xml = self.blob_for(source_uri.rels_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1337, in read
with self.open(name, "r", pwd) as fp:
File "/usr/lib/python3.6/zipfile.py", line 1396, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
EOFError
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 110, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 105, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1338, in read
return fp.read()
File "/usr/lib/python3.6/zipfile.py", line 858, in read
buf += self._read1(self.MAX_N)
File "/usr/lib/python3.6/zipfile.py", line 940, in _read1
data += self._read2(n - len(data))
File "/usr/lib/python3.6/zipfile.py", line 975, in _read2
raise EOFError
EOFError
Both of these errors indicate that the specified file is not a valid zip archive. So I expect something is going wrong with the writing of the file (by the step prior to find-and-replace).
I would start by stopping the process after writing the file and seeing if the file is present on the filesystem and whether it can be opened manually using Word. This should bisect the problem and narrow it down to a writing problem or a reading problem.
It's possible that an error is raised during the write and not caught, leaving an empty or un-flushed (still open) file. So having a way to monitor that step is probably a good idea; writing to a log comes to mind as one way to manage that.
Inspecting the particular cases where there is a failure and managing to reproduce it are going to be critically important. If that's not possible, it's going to be a tough road of guesswork and disappointment on both sides.
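As a rough illustration of that monitoring idea (the logger name and this helper are not part of the project's code, just a sketch), the generated file could be sanity-checked before it is handed to python-docx, with anything suspicious written to a log:
import logging
import zipfile
from pathlib import Path

log = logging.getLogger("doc_assembly")

def check_generated_docx(doc_path):
    """Sanity-check a freshly written .docx before opening it with python-docx."""
    p = Path(doc_path)
    if not p.exists():
        log.error("Generated file is missing: %s", p)
        return False
    if p.stat().st_size == 0:
        log.error("Generated file is empty: %s", p)
        return False
    # A .docx is a zip archive; a truncated or partially written file fails this check.
    if not zipfile.is_zipfile(str(p)):
        log.error("Generated file is not a valid zip archive: %s", p)
        return False
    return True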
It turns out some code was added shortly before this started happening that effectively sent a duplicate request to the server to generate the document in question. These requests seem to run in parallel, which is surprising only in that I would have expected the conflict to happen much more often (the same template file is used and the generated documents are written to the same directory).
It seems that with a particular timing of the two requests, the find-and-replace operation of one request would run into the save operation of the other; in other words, one request was trying to open a document that was still in the process of being saved.
So I'm glad it's not something more obscure in the python-docx library, which would have been a lot harder to nail down.
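For anyone hitting the same race, one common mitigation (a sketch, not the project's actual fix; the helper name is made up) is to save the document to a temporary file and atomically rename it into place, so a concurrent reader never sees a half-written .docx:
import os
import tempfile

def save_docx_atomically(document, final_path):
    """Save a python-docx Document so readers never see a partially written file."""
    directory = os.path.dirname(final_path)
    # Write to a temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".docx", dir=directory)
    os.close(fd)
    try:
        document.save(tmp_path)
        os.replace(tmp_path, final_path)  # atomic replacement (Python 3.3+)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise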

Azure ML Studio ML Pipeline - Exception: No temp file found

I've successfully run an ML Pipeline experiment and published the Azure ML Pipeline without issues. When I run the following directly after the successful run and publish (i.e. I'm running all cells using Jupyter), the test fails!
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})
run_id = response.json()["Id"]
Here is the error in azureml-logs/70_driver_log.txt:
[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 48, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Here are the errors in logs/sys/warning.txt:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry
[...]
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:
with the same URL.
Next...
When I wait a few minutes and rerun the following code/cell:
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})
run_id = response.json()["Id"]
It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference). Also, there is no user error here in the logs.
Any ideas as to what could be going on here?
Thanks in advance for any insights you might have, and happy coding! :)
========== UPDATE #1: ==========
Running on 1 file with ~300k rows. Sometimes the job works and sometimes it doesn't. We've tried many versions with different config settings, all of which fail from time to time. We changed the sklearn models to use n_jobs=1. We're scoring text data for NLP work.
default_ds = ws.get_default_datastore()
# output dataset
output_dir = OutputFileDatasetConfig(destination=(def_file_store, 'model/results')).register_on_complete(name='model_inferences')
# location of scoring script
experiment_folder = 'model_pipeline'
rit = 60*60*24
parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
Our next test was to put each row of data into its own file. I tried this with just 30 rows, i.e. 30 files each with one record for scoring, and I'm still getting the same error. This time I changed the error threshold to 1.
2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1.
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
And on the runs that do complete, only some of the records are returned. One time the number of records returned was, I think, 25 or 23; another time it was 15.
========== UPDATE #2: 12/17/2020 ==========
I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with 1 record, and the job sometimes completes, but it doesn't return 30 records. Other times it fails, still with the "No temp file found" error.
I think I might have answered my own question. I think the issue was with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say that OutputFileDatasetConfig is still experimental.
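For reference, the swap looked roughly like this. This is only a sketch under the assumption that the step is a ParallelRunStep wired to the config above; input_dataset and the step name are placeholders, not the actual code.
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunStep

# PipelineData output instead of OutputFileDatasetConfig
output_dir = PipelineData(name="inferences", datastore=default_ds)

parallel_run_step = ParallelRunStep(
    name="score-model",                      # placeholder step name
    parallel_run_config=parallel_run_config,
    inputs=[input_dataset.as_named_input("input_data")],  # placeholder input dataset
    output=output_dir,
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])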
The thing I still don't understand is how we're supposed to pick up the results of an ML Studio pipeline from a Data Factory pipeline without OutputFileDatasetConfig. PipelineData writes the results to a folder based on the child step's run id, so how is Data Factory supposed to know where to get the results? I would love to hear any feedback anyone might have. Thanks :)
== Update ==
For picking up results of an ML Studio Pipeline from a Data Factory Pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline
== Update #2 ==
https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789
Hi @yeamusic21, thank you for your feedback. In the current version, OutputDatasetConfig can't work with ParallelRunStep; we are working on fixing it.

Pyomo breaks when solving iteratively

I am looping over years, and for each year I solve an optimization problem. Inside the loop I do:
#Optimization
opt = SolverFactory("ipopt")
results = opt.solve(model3, keepfiles=False, load_solutions=False)
model3.solutions.load_from(results)
The program works well, but sometimes (randomly) I get this error:
File "", line 47, in
results = opt.solve(model3 , keepfiles=False, load_solutions=False)
File "C:\Users\escriva\AppData\Local\Continuum\anaconda3\lib\site-packages\pyomo\opt\base\solvers.py", line 631, in solve
result = self._postsolve()
File "C:\Users\escriva\AppData\Local\Continuum\anaconda3\lib\site-packages\pyomo\opt\solver\shellcmd.py", line 282, in _postsolve
os.remove(self._soln_file)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\escriva\tmpc2aly83o.pyomo.sol'
Then I run it again and it works, but it breaks again (randomly) several years later. I think the next iteration of the optimization runs into problems because the previous one has not been fully cleaned up.
Any help?
Thanks so much in advance!
I think I solved my own question:
I was working in my Dropbox directory, and someone told me that this might be the cause of the slow response when deleting solver files.
So I moved my directory to my C: drive and now I don't have any problems.
Hope this is helpful!
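If moving the whole project isn't an option, another approach is to point Pyomo's temporary solver files at a local, non-synced directory. This is only a sketch; the import location of TempfileManager varies between Pyomo versions (older releases expose it via pyutilib.services), and the directory path is just an example.
from pyomo.environ import SolverFactory
from pyomo.common.tempfiles import TempfileManager  # older Pyomo: pyutilib.services.TempfileManager

# Keep solver scratch files out of synced folders such as Dropbox or OneDrive.
TempfileManager.tempdir = r"C:\pyomo_tmp"  # example local directory; create it first

opt = SolverFactory("ipopt")
results = opt.solve(model3, keepfiles=False, load_solutions=False)
model3.solutions.load_from(results)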

SQLite3 selects fail on UUIDs after registering converter

I am having an issue using the SQLite and UUID solution suggested by @Jonathan Reinhart here. I am using Python 3, and the error I get when doing selects from the DB is:
sqlite3.register_converter('guid', lambda b: UUID(bytes_le=b))
File "/usr/lib/python3.6/uuid.py", line 146, in __init__
raise ValueError('bytes_le is not a 16-char string')
ValueError: bytes_le is not a 16-char string
The values stored are proper UUIDs.
I'm a bit puzzled by the error, since in the original answer it seems to work. Can someone provide some clarification? I am able to make inserts into the DB; only the selects fail.
I can also run selects directly against the DB without any errors, just not through the Python sqlite3 cursor.
Thank you, much appreciated.
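For context, the setup from the linked answer looks roughly like this (a reconstruction, not the exact code). The ValueError usually means the converter is being handed something other than the 16-byte blobs this scheme stores, for example rows that were inserted as plain text before the adapter was registered:
import sqlite3
import uuid

# Store UUIDs as 16-byte little-endian blobs and read them back as uuid.UUID
sqlite3.register_adapter(uuid.UUID, lambda u: u.bytes_le)
sqlite3.register_converter('guid', lambda b: uuid.UUID(bytes_le=b))

# detect_types=PARSE_DECLTYPES is required for the 'guid' converter to fire on SELECT
conn = sqlite3.connect('example.db', detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute('CREATE TABLE IF NOT EXISTS items (id guid PRIMARY KEY, name TEXT)')

item_id = uuid.uuid4()
conn.execute('INSERT INTO items (id, name) VALUES (?, ?)', (item_id, 'example'))
conn.commit()

for row in conn.execute('SELECT id, name FROM items'):
    print(row)  # row[0] is a uuid.UUID only if the stored value really is a 16-byte blob
conn.close()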
