I have a program that runs in AWS and reads over 400k documents from a DB. It ran flawlessly until recently. I'm not sure what changed, but now I'm getting pymongo.errors.CursorNotFound: cursor id "..." not found.
I tried researching and it seems to be a connection issue with the DB, but I have not changed anything.
Below is the stack trace:
Text Analysis Started....
DB Connection init...
Traceback (most recent call last):
File "predict.py", line 8, in <module>
textanalyser.start()
File "/usr/src/app/text_analyser.py", line 100, in start
for row in table_data:
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 1156, in next
if len(self.__data) or self._refresh():
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 1093, in _refresh
self.__send_message(g)
File "/usr/local/lib/python3.7/site-packages/pymongo/cursor.py", line 955, in __send_message
address=self.__address)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1346, in _run_operation_with_response
exhaust=exhaust)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1464, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1340, in _cmd
unpack_res)
File "/usr/local/lib/python3.7/site-packages/pymongo/server.py", line 136, in run_operation_with_response
_check_command_response(first)
File "/usr/local/lib/python3.7/site-packages/pymongo/helpers.py", line 156, in _check_command_response
raise CursorNotFound(errmsg, code, response)
pymongo.errors.CursorNotFound: cursor id 3011673819761463104 not found
Any help you can provide would be greatly appreciated.
This is a very common issue in MongoDB. I will elaborate on the issue first and then provide possible workarounds.
Whenever you perform a find or aggregate operation in MongoDB, it returns a cursor with a unique cursor id assigned to it. The server deletes this cursor after a few minutes of inactivity to save memory and CPU on the machine running MongoDB; the default idle timeout is 10 minutes (configurable on the server via cursorTimeoutMillis), and each batch returned through the cursor is capped at the 16 MB maximum BSON document size.
Let's assume you perform a find operation matching 1000 documents, fetched in batches of 100, against a MongoDB server configured with a 10-minute cursor idle timeout. If processing documents 300 - 400 takes more than 10 minutes, the server terminates the cursor, and when the client requests the 400 - 500 batch it can no longer match that cursor id, so you get CursorNotFound.
There are a few workarounds, though.
Workaround - 1:
You can disable the idle timeout by passing no_cursor_timeout=True to find.
Note: don't forget to close the cursor when you are done, otherwise it stays open on the server.
cursor = col.find({}, no_cursor_timeout=True)
for x in cursor:
    print(x)
cursor.close()  # <- Don't forget to close the cursor
Workaround - 2:
Additionally, you can limit the batch size to a smaller number with the batch_size option, e.g. batch_size=1 as below.
What this does is make the server return documents in batches of the given size instead of the default, so the cursor is refreshed more frequently and is less likely to sit idle past the timeout.
cursor = col.find({}, no_cursor_timeout=True, batch_size=1)
for x in cursor:
    print(x)
cursor.close()  # <- Don't forget to close the cursor
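As a variant of the same idea (a minimal sketch, assuming a recent PyMongo version where cursors support the context-manager protocol), a with block guarantees the cursor is closed even if processing raises:

with col.find({}, no_cursor_timeout=True, batch_size=10) as cursor:
    for x in cursor:
        print(x)  # cursor.close() is called automatically on exit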
Related
I'm trying to train the research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8 using MultiWorkerMirroredStrategy (by setting --num_workers=2 in the invocation of model_main_tf2.py), across two workers (0 and 1), each with a single GPU. However, when I attempt this I get the following error, always on worker 1:
Traceback (most recent call last):
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 553, in __next__
return self.get_next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 610, in get_next
return self._get_next_no_partial_batch_handling(name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 642, in _get_next_no_partial_batch_handling
replicas.extend(self._iterators[i].get_next_as_list(new_name))
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 1594, in get_next_as_list
return self._format_data_list_with_options(self._iterator.get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\multi_device_iterator_ops.py", line 580, in get_next
result.append(self._device_iterators[i].get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 889, in get_next
return self._next_internal()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 819, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2922, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\framework\ops.py", line 7186, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 605, in train_loop
load_fine_tune_checkpoint(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 401, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 161, in _ensure_model_is_built
features, labels = iter(input_dataset).next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 549, in next
return self.__next__()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 555, in __next__
raise StopIteration
StopIteration
Worker 0 eventually fails after detecting that worker 1 has gone down.
This error happens regardless of the physical machines on which the two workers run. In other words, I see it whether I'm running both workers on a single machine (using localhost) or on different machines on the same network.
Based on the trace in the error messages, the error appears to be occurring whenever the training loop attempts to iterate over the training data generated by strategy.experimental_distribute_datasets_from_function. Note that if I change the strategy to MirroredStrategy it runs fine on a single machine (no other changes made). I'm not sure if I'm doing something wrong or if there is a bug in the object detection API.
My setup on both machines is identical (I basically followed the setup instructions on the object detection website):
Windows 10
Tensorflow 2.8.0
Cuda Toolkit 11.2
cudnn 8.1
Has anyone ever seen this error before? If so, is there a way around it?
Ok, I think I understand the issue. In the object detection library there is a file called dataset_builder.py that builds the training dataset from the TFRecord stored in the file specified in the pipeline.config file (in the input_path item of the tf_record_input_reader). The function that actually reads the TFRecord file is _read_dataset_internal. This function treats the input_path of the pipeline config as a LIST OF FILES and then applies a sharding function (passed as an argument) to divide the files between the replicas doing the training (one replica per worker). Since my input_path only specified a single TFRecord file it was assigned to the first replica and the other replicas were given empty filenames!! Thus only the first replica actually had an input dataset to work with, hence the crash.
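As a rough illustration (this is not the object detection code itself, just a sketch of the file-level sharding behaviour described above), sharding a dataset of file names that contains only one file leaves every worker except the first with nothing to read:

import tensorflow as tf

# One input file, two workers: worker 0 gets the file, worker 1 gets an empty dataset.
files = tf.data.Dataset.from_tensor_slices(["train.tfrecord"])
worker0_files = files.shard(num_shards=2, index=0)  # contains "train.tfrecord"
worker1_files = files.shard(num_shards=2, index=1)  # empty, so iteration ends immediately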
The solution was to split the training data across two files (two TFRecords) and then set the input_path in pipeline.config to be a list of paths rather than a single path. Once I did this it appears as though the model trained successfully (at least it didn't crash).
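For reference, here is a minimal sketch of how the training data could be split into two TFRecord shards (the file names are just examples); both resulting paths then go into the input_path list in pipeline.config:

import tensorflow as tf

src = "train.tfrecord"  # hypothetical path to the original single TFRecord
shards = ["train-00000-of-00002.tfrecord", "train-00001-of-00002.tfrecord"]

writers = [tf.io.TFRecordWriter(path) for path in shards]
for i, record in enumerate(tf.data.TFRecordDataset(src)):
    # Round-robin the serialized examples across the shard files.
    writers[i % len(shards)].write(record.numpy())
for w in writers:
    w.close()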
I'm not sure if this is a bug in the object detection code or not. I assumed that if I only had one training record (visible to both workers), both workers would use it and just batch the data accordingly. I'm just not sure if the assumption itself is wrong, or if the assumption is correct and the code is wrong.
Anyway, I hope this helps anyone who might be wrestling with the same issue.
The company I work for distributes document assembly software that uses the python-docx library. The software runs a function on every generated document that opens the document and does a simple search and replace for characters that weren't escaped properly (namely "&amp;" -> "&").
FYI The actual document assembly uses python-docx-template. However, the error happens after the document has already been assembled and the error is triggered by the search-and-replace function, which only uses python-docx.
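For context, a rough sketch of what such a find-and-replace step might look like with python-docx is below (this is a hypothetical reconstruction, not the actual translate_ampersands from core_da.py):

from pathlib import Path
from docx import Document

def fix_double_escaped_ampersands(doc_path):
    # Open the generated document, replace double-escaped ampersands in
    # every run, and save the result back in place.
    doc = Document(docx=Path(doc_path))
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            if "&amp;" in run.text:
                run.text = run.text.replace("&amp;", "&")
    doc.save(str(doc_path))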
Recently, we've had a few cases where documents are failing to generate on client deployments. They're throwing an error on this line where the document object is instantiated:
doc = Document(docx=Path(doc_path))
We've seen two errors:
raise BadZipFile("Bad magic number for file header")
and
raise EOFError
The software is widely used and we've never had this issue before. We can't reproduce it in our test environments. The error has only started appearing in the past week but has shown up for several clients after they were updated. The software will fail to generate a particular document some number of times but will succeed after a few tries.
We've only seen it happen with one document in particular, but all documents use the same search and replace function, and like I said the error is only intermittent with the problem document.
There have been no changes in code to this search and replace function and I can't think of any other meaningful difference to our doc assembly process that would explain this.
I'm having a lot of trouble finding info on what could cause this specifically with the python-docx library. Is this a sign that the generated document is corrupted? If anyone is able to shed some light on possible causes that would be very helpful!
Here's the stack trace for both errors:
Bad magic number...
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 104, in _walk_phys_parts
part_srels = PackageReader._srels_for(phys_reader, partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 83, in _srels_for
rels_xml = phys_reader.rels_xml_for(source_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 129, in rels_xml_for
rels_xml = self.blob_for(source_uri.rels_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1337, in read
with self.open(name, "r", pwd) as fp:
File "/usr/lib/python3.6/zipfile.py", line 1396, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
EOFError
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 110, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 105, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1338, in read
return fp.read()
File "/usr/lib/python3.6/zipfile.py", line 858, in read
buf += self._read1(self.MAX_N)
File "/usr/lib/python3.6/zipfile.py", line 940, in _read1
data += self._read2(n - len(data))
File "/usr/lib/python3.6/zipfile.py", line 975, in _read2
raise EOFError
EOFError
Both of these errors indicate that the specified file is not a valid zip archive. So I expect something is going wrong with the writing of the file (by the step prior to find-and-replace).
I would start by stopping the process after writing the file and seeing if the file is present on the filesystem and whether it can be opened manually using Word. This should bisect the problem and narrow it down to a writing problem or a reading problem.
It could be that an error is raised on the write and isn't being caught, leaving an empty or un-flushed (still open) file. So having a way to monitor that step is probably a good idea. Writing to a log comes to mind as how you might manage that.
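If it helps to instrument that step, here is a small sketch (the function name is just an example) that checks whether the freshly written file is at least a complete zip archive before handing it to python-docx:

import zipfile
from pathlib import Path

def looks_like_complete_docx(doc_path):
    # A .docx file is a zip archive, so a truncated or still-open file
    # will fail this check before python-docx ever sees it.
    path = Path(doc_path)
    if not path.exists() or path.stat().st_size == 0:
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # None means no corrupt members found
    except zipfile.BadZipFile:
        return False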
Inspecting the particular cases where there is a failure and managing to reproduce it are going to be critically important. If that's not possible, it's going to be a tough road of guesswork and disappointment on both sides.
It turns out some code was added recently, just before this started happening, which effectively sent a duplicate request to the server to generate the document in question. These requests seem to run in parallel, which is surprising because I would have expected the conflict to happen much more frequently (the same template file is used, and the generated documents are written to the same directory).
It seems that with a particular timing of the two requests, the "find-and-replace" operation of one request would run into the "save" operation of the other. In other words, I think one request was trying to open a document that was still in the process of being saved.
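One way to make that kind of race harmless (a sketch under the assumption that you control the save step; the helper name is hypothetical) is to save the finished document to a temporary file and atomically rename it into place, so a concurrent reader never sees a half-written archive:

import os
import tempfile

def save_docx_atomically(document, final_path):
    # Write to a temp file in the same directory, then atomically replace
    # the target, so readers only ever see a complete .docx archive.
    fd, tmp_path = tempfile.mkstemp(suffix=".docx", dir=os.path.dirname(final_path))
    os.close(fd)
    document.save(tmp_path)
    os.replace(tmp_path, final_path)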
So I'm glad it's not something more obscure with the python-docx library, which would have been a lot harder to nail down.
I have some float numbers to store in a big (500K x 500K) matrix. I am storing them in chunks, using arrays of variable sizes (according to some specific conditions).
I have parallelised code (Python 3.3 and h5py) that produces the arrays and puts them in a shared queue, and one dedicated process that pops from the queue and writes them one by one into the HDF5 matrix. It works as expected approximately 90% of the time.
Occasionally, I get write errors for specific arrays. If I run it multiple times, the faulty arrays change every time.
Here's the code:
import numpy
import h5py
from multiprocessing import Process, JoinableQueue
from queue import Empty

# producer(), consumer() and the `data` passed to the producer are defined
# elsewhere in the script and omitted here.

def writer(in_q):
    # Open HDF5 archive
    hdf5_file = h5py.File("./google_matrix_test.hdf5")
    hdf5_scores = hdf5_file['scores']
    while True:
        # Get some data
        try:
            data = in_q.get(timeout=5)
        except Empty:
            hdf5_file.flush()
            print('HDF5 archive updated.')
            break
        # Process the data
        try:
            hdf5_scores[data[0], data[1]:data[2]+1] = numpy.matrix(data[3:])
        except OSError:
            # Print faulty chunk's info
            print('E: ' + str(data[0:3]))
            in_q.put(data)  # <- doesn't solve
        in_q.task_done()

def compute():
    jobs_queue = JoinableQueue()
    scores_queue = JoinableQueue()
    processes = []
    processes.append(Process(target=producer, args=(jobs_queue, data,)))
    processes.append(Process(target=writer, args=(scores_queue,)))
    for i in range(10):
        processes.append(Process(target=consumer, args=(jobs_queue, scores_queue,)))
    for p in processes:
        p.start()
    processes[1].join()
    scores_queue.join()
Here's the error:
Process Process-2:
Traceback (most recent call last):
File "/local/software/python3.3/lib/python3.3/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/local/software/python3.3/lib/python3.3/multiprocessing/process.py", line 95, in run
self._target(*self._args, **self._kwargs)
File "./compute_scores_multiprocess.py", line 104, in writer
hdf5_scores[data[0], data[1]:data[2]+1] = numpy.matrix(data[3:])
File "/local/software/python3.3/lib/python3.3/site-packages/h5py/_hl/dataset.py", line 551, in __setitem__
self.id.write(mspace, fspace, val, mtype)
File "h5d.pyx", line 217, in h5py.h5d.DatasetID.write (h5py/h5d.c:2925)
File "_proxy.pyx", line 120, in h5py._proxy.dset_rw (h5py/_proxy.c:1491)
File "_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite (h5py/_proxy.c:1301)
OSError: can't write data (Dataset: Write failed)
If I insert a pause of two seconds (time.sleep(2)) between writing tasks, the problem seems solved (although I cannot waste 2 seconds per write, since I need to write more than 250,000 times). If I capture the write exception and put the faulty array back in the queue, the script never stops (presumably).
I am using CentOS (2.6.32-279.11.1.el6.x86_64). Any insight?
Thanks a lot.
When using the multiprocessing module with HDF5, the only big restriction is that you can't have any files open (even read-only) when fork() is called. In other words, if you open a file in the master process to write, and then Python spins off a subprocess for computation, there may be problems. It has to do with how fork() works and the choices HDF5 itself makes about how to handle file descriptors.
My advice is to double-check your application to make sure you're creating any Pools, etc. before opening the master file for writing.
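As a minimal sketch of that ordering (assuming the same writer/queue structure as in the question), the parent should spawn every subprocess before any h5py.File call, and the file itself should only be opened inside the writer process:

import h5py
from multiprocessing import Process, JoinableQueue

def writer(in_q):
    # The HDF5 file is opened here, inside the child, so no HDF5 handle
    # exists in the parent when fork() happens.
    with h5py.File("./google_matrix_test.hdf5", "a") as hdf5_file:
        scores = hdf5_file['scores']
        while True:
            data = in_q.get()
            if data is None:  # sentinel: producers are done
                in_q.task_done()
                break
            scores[data[0], data[1]:data[2]+1] = data[3:]
            in_q.task_done()

if __name__ == '__main__':
    scores_queue = JoinableQueue()
    # Create all worker processes before the parent touches the HDF5 file.
    w = Process(target=writer, args=(scores_queue,))
    w.start()
    # ... enqueue (row, start, end, values...) items here ...
    scores_queue.put(None)
    scores_queue.join()
    w.join()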
I am using Cassandra DB. Sometimes when I run a SELECT I get this exception:
Traceback (most recent call last):
File "bin/cqlsh", line 1001, in perform_statement_untraced
self.cursor.execute(statement, decoder=decoder)
File "bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cursor.py", line 81, in execute
return self.process_execution_results(response, decoder=decoder)
File "bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/thrifteries.py", line 131, in process_execution_results
raise Exception('unknown result type %s' % response.type)
Exception: unknown result type None
Can anyone explain why this exception occurs? I also get an "Internal application error".
What does this error message actually mean?
EDIT: I get this error the first time; from the next time onwards it runs correctly. I don't understand why that is.
//cql query via cqlsh
select * from event_logging limit 5;
I'm trying to sort ~13,000 documents in my Mac's local CouchDB database by date, but it gets hung up on document 5407 each time. I've tried increasing the time-out tolerance in Futon but to no avail. This is the error message I'm getting:
for row in db.view('index15/by_date_time', startkey=start, endkey=end):
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 984, in iter
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 1003, in rows
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 990, in _fetch
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/client.py", line 880, in _exec
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 393, in get_json
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 374, in get
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 419, in _request
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 239, in request
File "/Library/Python/2.6/site-packages/CouchDB-0.8-py2.6.egg/couchdb/http.py", line 205, in _try_request_with_retries
socket.error: 54
Incidentally, this is the same error message that is produced when I have a typo in my script.
I'm using couchpy to create the view as follows:
def dateTimeToDocMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt
    if doc.get('Date'):
        # [year, month, day, hour, min, sec]
        _date = list(dt.timetuple(parse(doc['Date']))[:-3])
        yield (_date, doc)
While this is running, I can open a Python shell and, using server.tasks(), see that the indexing is indeed taking place.
>>> server.tasks()
[{u'status': u'Processed 75 of 13567 changes (0%)', u'pid': u'<0.451.0>', u'task': u'gmail2 _design/index11', u'type': u'View Group Indexer'}]
But each time it gets stuck after processing 5407 of the 13567 changes (it takes ~8 minutes to get that far). I have examined what I believe to be document 5407 and it doesn't appear to be anything out of the ordinary.
Incidentally, if I try to restart the process after it stops, I get this response from server.tasks()
>>> server.tasks()
[{u'status': u'Processed 0 of 8160 changes (0%)', u'pid': u'<0.1224.0>', u'task': u'gmail2 _design/index11', u'type': u'View Group Indexer'}]
In other words, CouchDB seems to have recognized that it has already processed the first 5407 of the 13567 changes and now has only 8160 left.
But then it almost immediately quits and gives me the same socket.error: 54.
I have been searching the internet for the last few hours to no avail. I have tried initiating the indexing from other locations, such as Futon. As I mentioned, one of my errors was an OS timeout error, and increasing the time_out thresholds in Futon's configuration seemed to help with that.
Please, if anyone could shed light on this issue, I would be very grateful. I'm wondering if there's a way to restart the process once it's already indexed 5407 documents, or better yet if there's a way to prevent the thing from quitting a third of the way through in the first place.
Thanks so much.
From what I gather, CouchDB builds your view contents by sending every document to your couchpy view server, which runs your Python code on each one. If that code fails for any reason, CouchDB is notified that something went wrong, which stops the update of the view contents.
So, there is something wrong with document 5408 that causes your Python code to misbehave. If you need more help, I suggest you post that document here. Alternatively, look into the logs for your couchpy view server: they might contain information about how your code failed.
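If the problem is indeed a document your code can't handle (for example an unparseable Date value, which is only a guess), a defensive variant of the map function from the question would let the indexer skip it instead of dying:

def dateTimeToDocMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt
    try:
        if doc.get('Date'):
            # [year, month, day, hour, min, sec]
            _date = list(dt.timetuple(parse(doc['Date']))[:-3])
            yield (_date, doc)
    except (ValueError, OverflowError, TypeError):
        # Skip documents whose Date can't be parsed instead of crashing
        # the couchpy view server and aborting the index build.
        pass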