Google PubSub - restarting subscription after exception raised - python-3.x

Good day,
I am running some long-running async jobs using PubSub to trigger a function. Occasionally, the task may fail. In such cases, I simply want to log the exception, acknowledge the message, and restart the subscription to ensure that the subscriber is still pulling new messages after the failure has occurred.
I have placed some simplified code to demonstrate my current set up below:
try:
while True:
streaming_pull_future = workers.subscriber.subscribe(
subscription_path, callback=worker_task <- includes logic to ack() the message if it's failed before
)
print(f'Listening for messages on {subscription_path}')
try:
streaming_pull_future.result()
except Exception as e:
print(streaming_pull_future.cancelled()) #<-- this evaluates to false
streaming_pull_future.cancel() #<-- this results in RunTimeError: set_result can only be called once.
print(e)
except KeyboardInterrupt: # seems to be an issue as per Github PubSub issue #17. No keyboard interrupt
streaming_pull_future.cancel()
I keep seeing a RuntimeError: set_result can only be called oncewhen I execute the streaming_pull_future.cancel() in the exception handler. I checked whether perhaps the subscriber had already been cancelled but when I logged out the status it evaluated to False. Yet when I then call the cancel() method I get the error. I want to ensure that any threads are cleaned up before making a new subscription in the case where I could have several errors. Does anyone know why this is happening and a way around it?
I am running Python 3.7.4 with PubSub 1.2.0 and grpcio 1.27.1.
Update:
As per comments, please see a reproducible example. The stack trace raised is included:
Listening for messages on projects/trigger-web-app/subscriptions/load-job-sub
968432700946405
Top-level exception occurred in callback while processing a message
Traceback (most recent call last):
File "C:\..\lib\site-packages\google\cloud\pubsub_v1\subscriber\_protocol\streaming_pull_manager.py", line
71, in _wrap_callback_errors
callback(message)
File "test.py", line 19, in worker_task
a = 1/0 # cause an exception to be raised
ZeroDivisionError: division by zero
968424309156485
Top-level exception occurred in callback while processing a message
Traceback (most recent call last):
File "C:\...\lib\site-packages\google\cloud\pubsub_v1\subscriber\_protocol\streaming_pull_manager.py", line
71, in _wrap_callback_errors
callback(message)
File "test.py", line 19, in worker_task
a = 1/0 # cause an exception to be raised
ZeroDivisionError: division by zero
Traceback (most recent call last):
File "test.py", line 29, in main
streaming_pull_future.result()
File "C:...\lib\site-packages\google\cloud\pubsub_v1\futures.py", line 105, in result
raise err
File "C:\...\lib\site-packages\google\cloud\pubsub_v1\subscriber\_protocol\streaming_pull_manager.py", line
71, in _wrap_callback_errors
callback(message)
File "test.py", line 19, in worker_task
a = 1/0 # cause an exception to be raised
ZeroDivisionError: division by zero
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 35, in <module>
main()
File "test.py", line 31, in main
streaming_pull_future.cancel()
File "C:\...\lib\site-packages\google\cloud\pubsub_v1\subscriber\futures.py", line 46, in cancel
return self._manager.close()
File "C:\...\lib\site-packages\google\cloud\pubsub_v1\subscriber\_protocol\streaming_pull_manager.py", line
496, in close
callback(self, reason)
File "C:\...\lib\site-packages\google\cloud\pubsub_v1\subscriber\futures.py", line 37, in _on_close_callback
self.set_result(True)
File "C:\...\lib\site-packages\google\cloud\pubsub_v1\futures.py", line 155, in set_result
raise RuntimeError("set_result can only be called once.")
RuntimeError: set_result can only be called once.
import os
from google.cloud import pubsub_v1
subscriber = pubsub_v1.SubscriberClient()
project_id=os.environ['GOOGLE_CLOUD_PROJECT']
subscription_name=os.environ['GOOGLE_CLOUD_PUBSUB_SUBSCRIPTION_NAME']
subscription_path = f'projects/{project_id}/subscriptions/{subscription_name}'
def worker_task( message ):
job_id = message.message_id
print(job_id)
a = 1/0 # cause an exception to be raised
message.ack()
def main():
streaming_pull_future = subscriber.subscribe(
subscription_path, callback=worker_task
)
print(f'Listening for messages on {subscription_path}')
try:
streaming_pull_future.result()
except Exception as e:
streaming_pull_future.cancel() # if exception in callback handler, this will raise a RunTimError
print(e)
if __name__ == '__main__':
main()
Thank you.

Related

Python Multiprocessing - 'map_async' RunTimeError issue handling

Here is a sample code which I'm running.
def function(elem):
var1 = elem[0]
var2 = elem[1]
length = nx.shortest_path_length(G, source=var1, target=var2)
return length
p = mp.Pool(processes=4)
results = p.map_async(function, iterable=elements)
track_job(results)
p.close()
p.join()
The error which I'm facing is -
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 192, in accepter
t.start()
File "/usr/local/lib/python3.8/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Post this error entire process is halted/paused.
Two Questions:
Is this issue at the hardware level? Or can this be avoided?
How to handle this error, so that the other processes are still running? Also retry the faulty process?
TIA
I typically do stuff like this, maybe it will work for you:
import psutil
p = mp.Pool(psutil.cpu_count(logical=False))
with Pool((psutil.cpu_count()-1) or 1) as p:
try:
results = [r for r in p.map_async(function, elements)]
except Exception as e1:
print(f'{e1}')
finally:
p.close()
p.join()
For the track_job(results), you may want to multiprocess that as well depending on what it's doing.
The try / except / finally error handling could be used to help your program continue when it hits any exception as well, using Exception as the base exception case.

ThreadPoolExecutor keep waiting when exception happens

from asyncio import FIRST_EXCEPTION
from concurrent.futures.thread import ThreadPoolExecutor
from queue import Queue
from concurrent.futures import wait
import os
def worker(i: int, in_queue: Queue) -> None:
while 1:
data = in_queue.get()
if data is None:
in_queue.put(data)
print(f'worker {i} exit')
return
print(os.path.exists(data))
def main():
with ThreadPoolExecutor(max_workers=2) as executor:
queue = Queue(maxsize=2)
workers = [executor.submit(worker, i, queue) for i in range(2)]
for obj in [{'fn': '/path/to/sth'}, {}]:
fn = obj['fn'] # here is the exception
queue.put((fn,))
queue.put(None)
done, error = wait(workers, return_when=FIRST_EXCEPTION)
print(done, error)
main()
This program get stuck when exception happens.
From the log:
Traceback (most recent call last):
File "test.py", line 34, in <module>
main()
File "test.py", line 31, in main
print(done, error)
File "/.pyenv/versions/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 623, in __exit__
self.shutdown(wait=True)
File "/.pyenv/versions/3.7.4/lib/python3.7/concurrent/futures/thread.py", line 216, in shutdown
t.join()
File "/.pyenv/versions/3.7.4/lib/python3.7/threading.py", line 1044, in join
self._wait_for_tstate_lock()
File "/.pyenv/versions/3.7.4/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
It happens because wait function keep locked, but it's weird because the exception happens before the wait function. It should be returned when exception happens!!
Why it doesn't return immediately when exception happens?
Yes, I find the reason. When the exception happens, executor will exit, and in with statement, it calls self.shutdown(wait=True), so main thread waits for sub thread to exit, however sub thread keep running.
So the solution is that, shutdown the executor manually with wait=Falseļ¼š
try:
## code here
except Exception as e:
traceback.print_exc()
executor.shutdown(False)

Running Faust with kafka crashes with ConsumerStoppedError

I freshly installed Faust and ran a basic program to send and receive messages over Kafka.I used the sample code mentioned in (Faust example of publishing to a kafka topic) While running the program initially it connects to Kafka(which is also running on my machine). But then while trying to consume Kafka gets disconnected and the app crashes with the below exception
[2020-11-11 07:08:26,623] [76392] [ERROR] [^---Fetcher]: Crashed reason=ConsumerStoppedError()
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/mode/services.py", line 779, in _execute_task
await task
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/consumer.py", line 176, in _fetcher
await self._drainer
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/consumer.py", line 1039, in _drain_messages
async for tp, message in ait:
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/consumer.py", line 640, in getmany
records, active_partitions = await self._wait_next_records(timeout)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/consumer.py", line 676, in _wait_next_records
records = await self._getmany(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/consumer.py", line 1269, in _getmany
return await self._thread.getmany(active_partitions, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/drivers/aiokafka.py", line 805, in getmany
return await self.call_thread(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/mode/threads.py", line 436, in call_thread
result = await promise
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/mode/threads.py", line 383, in _process_enqueued
result = await maybe_async(method(*args, **kwargs))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/mode/utils/futures.py", line 134, in maybe_async
return await res
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/faust/transport/drivers/aiokafka.py", line 822, in _fetch_records
raise ConsumerStoppedError()
On debugging the reason for the consumer getting disconnected I see that in fetcher.py of alokafka consumer is the connection getting closed due to the below exception
Unable to display children:Error resolving variables Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_resolver.py", line 205, in resolve
def resolve(self, dict, key):
KeyError: 'Exception'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_comm.py", line 1227, in do_it
def do_it(self, dbg):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_vars.py", line 262, in resolve_compound_variable_fields
def resolve_compound_variable_fields(thread_id, frame_id, scope, attrs):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_vars.py", line 169, in getVariable
def getVariable(thread_id, frame_id, scope, attrs):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_resolver.py", line 205, in resolve
def resolve(self, dict, key):
AttributeError: 'dict' object has no attribute 'Exception'
The software versions are as given below
Mac OS : 10.15.4
Kafka : 2_12.2.1.1
Aiokafka: 1.1.6
Python : 3.9.0
Faust : 1.10.4
Please help here.
I think your problem is the same mine I was with python 3.9 and I changed to 3.8 now It works.

How to get the actual file name in exception message in Databricks?

I am trying to figure out, how to get the actual file/module name in exception message in Databricks.
import traceback
def bad():
print("hello")
a = 1/0
def main():
try:
bad()
except Exception as e:
print(traceback.format_exc())
main()
When I run this the exception message I get like -
hello
Traceback (most recent call last):
File "<command-162594828857243>", line 8, in main
bad()
File "<command-162594828857243>", line 4, in bad
a = 1/0
ZeroDivisionError: division by zero
The "<command-162594828857243>" doesn't help at the time of debugging. I want the actual file/module name there.

How do I catch an exception that occured within the target of a Process?

I am creating a process to execute a function. If the function raises an exception I am not able to catch it. The following is a sample code.
from multiprocessing import Process
import traceback
import time
class CustomTimeoutException(Exception):
pass
def temp1():
time.sleep(5)
print('In temp1')
raise Exception('I have raised an exception')
def main():
try:
p = Process(target=temp1)
p.start()
p.join(10)
if p.is_alive():
p.terminate()
raise CustomTimeoutException('Timeout')
except CustomTimeoutException as e:
print('in Custom')
print(e)
except Exception as e:
print('In exception')
print(e)
if __name__ == "__main__":
main()
When I run the above code the exception raised within temp1 does not get caught. Below is the sample output
In temp1
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "temp.py", line 12, in temp1
raise Exception('I have raised an exception')
Exception: I have raised an exception
I have also tried overwriting the run method of the Process class as mentioned in https://stackoverflow.com/a/33599967/9971556 Wasn't very helpful.
Expected output:
In exception
I have raised an exception

Resources