pyspark streaming: failing to execute rdd.count() on workers - apache-spark

I have a pyspark streaming job doing something along these lines:
def printrddcount(rdd):
    c = rdd.count()
    print("{1}: Received an RDD of {0} rows".format("CANNOTCOUNT", datetime.now().isoformat()))
and then:
...
stream.foreachRDD(printrddcount)
From what I understand, the printrddcount function will be executed on the workers.
And, yes, I know it's a bad idea to do a print() within the worker. But that's not the point.
I'm pretty sure this very code was working until very recently.
(and it looked slightly different, because the content of 'c' was actually printed in the print statement, rather than just computed and then thrown away...)
But now it seems that (all of a sudden?) rdd.count() has stopped working and is making my worker process die with:
UnpicklingError: NEWOBJ class argument has NULL tp_new
Full (well, Python-only) stack trace:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
return self.loads(obj)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 454, in loads
return pickle.loads(obj)
UnpicklingError: NEWOBJ class argument has NULL tp_new
The line where it fails is, indeed, the one saying rdd.count()
Any idea why rdd.count() would fail?
If something is supposed to be serialized, it should be the rdd, right?

Ok. I investigated a bit further.
There's nothing wrong with rdd.count().
The only thing wrong is that another transformation in the pipeline somehow 'corrupts' (closes? invalidates? something along those lines) the rdd.
So, when it gets to the printrddcount function, it cannot be serialized any more and raises the error.
The issue is in code that looks like this:
...
log = logging.getLogger(__name__)
...
def parse(parse_function):
    def parse_function_wrapper(event):
        try:
            log.info("parsing")
            new_event = parse_function(event)
        except ParsingFailedException as e:
            pass
        return new_event
    return parse_function_wrapper
and then:
stream = stream.map(parse(parse_event))
Now, the log.info call (I tried a lot of variations; in the beginning the logging was inside the exception handler) is the one creating the issue.
Which leads me to say that, most probably, it is the logger object that cannot be serialized, for some reason.
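(For reference, a common workaround for this kind of closure-capture problem is to look the logger up inside the wrapper instead of closing over the module-level one. A rough sketch, not necessarily the only fix:)
import logging

def parse(parse_function):
    def parse_function_wrapper(event):
        # Getting the logger here keeps the module-level logger object
        # out of the closure that Spark has to pickle.
        log = logging.getLogger(__name__)
        new_event = None
        try:
            log.info("parsing")
            new_event = parse_function(event)
        except ParsingFailedException:
            pass
        return new_event
    return parse_function_wrapper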
Closing this thread myself, as it actually has nothing to do with rdd serialization, and most probably not even with pyspark.

Related

Multiprocessing.Pool: can not iterate over IMapIterator object in AWS Batch because of PicklingError

I need to request a huge bulk of data from an API endpoint, and I want to use multiprocessing (vs. multithreading, due to company framework limitations).
I have a multiprocessing.Pool with predefined concurrency CONCURRENCY in a class called Batcher. The class looks like this:
class Batcher:
    def __init__(self, concurrency: int = 8):
        self.concurrency = concurrency

    def _interprete_response_to_succ_or_err(self, resp: requests.Response) -> str:
        if isinstance(resp, str):
            if "Error:" in resp:
                return "dlq"
            else:
                return "err"
        if isinstance(resp, requests.Response):
            if resp.status_code == 200:
                return "succ"
            else:
                return "err"

    def _fetch_dat_data(self, id: str) -> requests.Response:
        try:
            resp = requests.get(API_ENDPOINT)
            return resp
        except Exception as e:
            return f"ID {id} -> Error: {str(e)}"

    def _dispatch_batch(self, batch: list) -> dict:
        pool = MPool(self.concurrency)
        results = pool.imap(self._fetch_dat_data, batch)
        pool.close()
        pool.join()
        return results

    def _run_batch(self, id):
        return self._dispatch_batch(id)

    def start(self, id_list: list):
        """In real class, this function will create smaller
        batches from bigger chunks of data"""
        results = self._run_batch(id_list)
        print(
            [
                res.text
                for res in results
                if self._interprete_response_to_succ_or_err(res) == "succ"
            ]
        )
This class is called in a file like this:
if __name__ == "__main__":
    """
    the source of ids is a csv file in s3 with a single column
    that contains one id per line
    """
    id_list = boto3_get_object_body(my_file_name).decode().split("\n")  # custom function, works
    batcher = Batcher()
    batcher.start(id_list)
This script is part of an AWS Batch job that is triggered via the CLI. The same code runs perfectly on my local machine with the same environment as in AWS Batch. It throws
_pickle.PicklingError: Can't pickle <class 'boto3.resources.factory.s3.ServiceResource'>: attribute lookup s3.ServiceResource on boto3.resources.factory failed
in the line where I try to iterate over the IMapIterator object results that is generated by pool.imap().
Relevant Traceback:
for res in results
File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 870, in next
raise value
File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
put(task)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/local/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'boto3.resources.factory.s3.ServiceResource'>: attribute lookup s3.ServiceResource on boto3.resources.factory failed
I am wondering if I am missing something blatantly obvious, or whether this issue is related to the EC2 instance spun up by the Batch job. I'd appreciate any kind of lead for root cause analysis.
This error happens because multiprocessing could not import the relevant datatype for duplicating data or calling the target function in the new process it started. This usually happens when the object necessary for the target function to run is created someplace the child process does not know about (for example, a class created inside the if __name__ ==... block in the main module), or if the object's __qualname__ property has been fiddled with (you might see this when using something similar to functools.wraps, or with monkey-patching in general).
Therefore, to actually "fix" this, you need to dig into your code and see if the above is true. A good place to start is with the class that is raising the issue (in this case boto3.resources.factory.s3.ServiceResource): can you import it in the main module before the if __name__... block runs?
However, most of the time you can get away with simply reducing the data required to start the target function (less data = fewer chances for faults to occur). In this case, the target function you are calling in the pool is an instance method. To start this function in a new process, multiprocessing would need to pickle all the instance attributes, which might have their own instance attributes, and so on. Not only does this add overhead, it could also be that the problem lies in a particular instance attribute. Therefore, just as good practice, if your target function can run independently but is currently an instance method, change it to a staticmethod instead.
In this case, this would mean changing _fetch_dat_data to a staticmethod, and submitting it to the pool using type(self)._fetch_dat_data instead.
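Concretely, a rough sketch of that change against the class above (assuming requests, MPool and API_ENDPOINT are defined as in the question; everything else stays the same):
class Batcher:
    def __init__(self, concurrency: int = 8):
        self.concurrency = concurrency

    @staticmethod
    def _fetch_dat_data(id: str) -> requests.Response:
        # No reference to self, so pickling the task only involves the
        # function reference and the id, not the whole Batcher instance.
        try:
            return requests.get(API_ENDPOINT)
        except Exception as e:
            return f"ID {id} -> Error: {str(e)}"

    def _dispatch_batch(self, batch: list):
        pool = MPool(self.concurrency)
        # Submit the staticmethod via the class, not the bound instance.
        results = pool.imap(type(self)._fetch_dat_data, batch)
        pool.close()
        pool.join()
        return results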

Django model object as parameter for celery task raises EncodeError - 'object of type someModelName is not JSON serializable'

I'm working on a Django project (I'm pretty new to Django) and running into an issue passing a model object between my view and a Celery task.
I am taking input from a form which contains several ModelChoiceField fields, and using the selected objects in a Celery task. When I queue the task (from the post method in the view) using someTask.delay(x, y, z), where x, y and z are various objects from the form's ModelChoiceFields, I get the error 'Object of type <someModelName> is not JSON serializable'.
That said, if I create a simple test function and pass any of the same objects from the form into the function I get the expected behavior and the name of the object selected in the form is logged.
def test(object):
    logger.debug(object.name)
I have done some poking based on the above error and found Django serializers, which allow for a workaround: serializing the object using serializers.serialize('json', [template]) in the view before passing it to the Celery task.
I can then access the object in the Celery task by using template = json.loads(template)[0].get('fields') to get at its required bits as a dictionary. While this works, it does seem a bit inelegant, and I wanted to see if there is something I am missing here.
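In condensed form, the workaround currently looks roughly like this (task and variable names are just placeholders):
# tasks.py (sketch)
import json
from celery import shared_task

@shared_task
def some_task(payload):
    # payload is the JSON string produced by serializers.serialize()
    fields = json.loads(payload)[0].get('fields')
    # ... work with the model's fields as a plain dict ...

# views.py, in the POST handler (template is a model instance from the form)
from django.core import serializers

payload = serializers.serialize('json', [template])
some_task.delay(payload)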
I'm obviously open to any feedback/guidance here; however, my main questions are:
Why do I get the object...is not JSON serializable error when passing a model object into a celery task but not when passing to my simple test function?
Is the approach using django serializers before queueing the celery task considered acceptable/correct or is there a cleaner way to achieve this goal?
Any suggestions would be greatly appreciated.
Traceback:
(I tried to post the full traceback here as well; however, including it caused the post to get flagged as 'this looks like spam'.)
Internal Server Error: /build/
Traceback (most recent call last):
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/serialization.py", line 49, in _reraise_errors
yield
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/serialization.py", line 220, in dumps
payload = encoder(data)
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/utils/json.py", line 65, in dumps
return _dumps(s, cls=cls or _default_encoder,
File "/usr/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/kombu/utils/json.py", line 55, in default
return super().default(o)
File "/usr/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Template is not JSON serializable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tech/sandbox_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
Add these lines to settings.py:
# Project/settings.py
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
Then, instead of passing the object, send JSON with the id/pk. If you're using a model instance, call the task like this:
test.delay({'pk': 1})
A Django model instance is not available in the Celery environment, as the task runs in a different process.
How can you get the model instance inside the task then? Well, you can do something like this:
def import_django_instance():
    """
    Makes django environment available
    to tasks!!
    """
    import django
    import os
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'Project.settings')
    django.setup()

# task
@shared_task(name="simple_task")
def simple_task(data):
    import_django_instance()
    from app.models import AppModel
    pk = data.get('pk')
    instance = AppModel.objects.get(pk=pk)
    # your operation
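On the calling side (the view), you would then queue the task with just the pk, along these lines (the instance name is illustrative):
# Instead of simple_task.delay(instance):
simple_task.delay({'pk': instance.pk})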

Async version of built-in print (stdout)

I have a problem understanding some of the limitations of using print inside an async function. Basically, this is my code:
#!/usr/bin/env python
import sys
import asyncio
import aiohttp

async def amain(loop):
    session = aiohttp.ClientSession(loop=loop)
    try:
        # using session to fetch a large json file which is stored
        # in obj
        print(obj)  # for debugging purposes
    finally:
        await session.close()

def main():
    loop = asyncio.get_event_loop()
    res = 1
    try:
        res = loop.run_until_complete(amain(loop))
    except KeyboardInterrupt:
        # silence traceback when pressing ctrl+c
        pass
    loop.close()
    return res

if __name__ == '__main__':
    sys.exit(main())
If I execute this, the json object is printed to stdout and then it suddenly dies with this error:
$ dwd-get-sensor-file ; echo $?
Traceback (most recent call last):
File "/home/yanez/anaconda/py3/envs/mondas/bin/dwd-get-sensor-file", line 11, in <module>
load_entry_point('mondassatellite', 'console_scripts', 'dwd-get-sensor-file')()
File "/home/yanez/projects/mondassatellite/mondassatellite/mondassatellite/bin/dwd_get_sensor_file.py", line 75, in main
res = loop.run_until_complete(amain(loop, args))
File "/home/yanez/anaconda/py3/envs/mondas/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
return future.result()
File "/home/yanez/projects/mondassatellite/mondassatellite/mondassatellite/bin/dwd_get_sensor_file.py", line 57, in amain
print(obj)
BlockingIOError: [Errno 11] write could not complete without blocking
1
The funny thing is that when I execute my code redirecting stdout to a file like this
$ dwd-get-sensor-file > output.txt ; echo $?
0
the exception doesn't happen and the whole output is correctly redirected to output.txt.
For testing purposes I converted the json object to a string, and instead of doing print(obj) I do sys.stdout.write(obj_as_str); then I get this exception:
BlockingIOError: [Errno 11] write could not complete without blocking
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
I've searched for this BlockingIOError exception, but all the threads I find have something to do with network sockets or CI builds. However, I found one interesting GitHub comment:
The make: write error is almost certainly EAGAIN from stdout. Pretty much every command line tool expects stdout to be in blocking mode, and does not properly retry when in nonblocking mode.
So when I executed this
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'
I get 2048, which means blocking (or is this the other way round? I'm confused). After executing this
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'
I don't get the BlockingIOError exceptions anymore, though I don't like this solution.
So, my question is: how should we deal with writing to stdout inside an async function? If I know that I'm dealing with stdout, should I set stdout to non-blocking and revert it when my program exits? Is there a specific strategy for this?
Give aiofiles a try, using stdout FD as the file object.
aiofiles helps with this by introducing asynchronous versions of files that support delegating operations to a separate thread pool.
In terms of actually using aiofiles with an FD directly, you could probably extend the aiofiles.os module, using wrap(os.write).
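For example, a rough sketch of the FD approach (untested; it relies on the built-in open(), which aiofiles.open() delegates to a worker thread, accepting an integer file descriptor):
import sys
import asyncio
import aiofiles

async def aprint(text: str) -> None:
    # Wrap stdout's file descriptor; closefd=False so the real stdout
    # is not closed when the async wrapper is closed.
    async with aiofiles.open(sys.stdout.fileno(), mode="w", closefd=False) as out:
        await out.write(text + "\n")
        await out.flush()

asyncio.run(aprint("hello from an async function"))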

Why is this exception immediately raised from an asyncio Task?

My understanding from the documentation is that asyncio.Tasks, as an asyncio.Future subclass, will store exceptions raised in them and they can be retrieved at my leisure.
However, in this sample code, the exception is raised immediately:
import asyncio

async def bad_task():
    raise Exception()

async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    await task
    # I would expect to get here
    exp = task.exception()
    # but we never do because the function exits on line 3

loop = asyncio.get_event_loop()
loop.run_until_complete(test())
loop.close()
Example output (Python 3.6.5):
python3 ./test.py
Traceback (most recent call last):
File "./test.py", line 15, in <module>
loop.run_until_complete(test())
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "./test.py", line 9, in test
await task
File "./test.py", line 4, in bad_task
raise Exception()
Exception
Is this a quirk of creating & calling tasks when already within async code?
await will raise any exception thrown by the task, because it's meant to make asynchronous code look almost exactly like synchronous code. If you want to catch them, you can use a normal try...except clause.
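For example, a sketch based on the code above:
async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    try:
        await task
    except Exception:
        # The exception raised inside bad_task() propagates out of
        # 'await task' and can be handled like any synchronous exception.
        pass
    exp = task.exception()  # now reachable; returns the stored exception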
As Matti explained, exceptions raised by a coroutine are propagated to the awaiting site. This is intentional, as it ensures that errors do not pass silently by default. However, if one needs to do so, it is definitely possible to await a task's completion without immediately accessing its result/exception.
Here is a simple and efficient way to do so, by using a small intermediate Future:
async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    task_done = loop.create_future()  # you could also use asyncio.Event
    # Arrange for task_done to complete once task completes.
    task.add_done_callback(task_done.set_result)
    # Wait for the task to complete. Since we're not obtaining its
    # result, this won't raise no matter what bad_task() does...
    await task_done
    # ...and this will work as expected.
    exp = task.exception()

Getting Python error, "TypeError: 'NoneType' object is not callable" SOMETIMES

I'm not very new to programming or to Python, but I'm incredibly green when it comes to pyunit. I need to use it for my new job, and I keep getting this error, but only sometimes when the tests are run. My code is below.
import unittest
from nose_parameterized import parameterized
from CheckFromFile import listFileCheck, RepresentsFloat

testParams = listFileCheck()

class TestSequence(unittest.TestCase):
    @parameterized.expand(testParams)
    def test_sequence(self, name, a, b):
        if RepresentsFloat(a):
            self.assertAlmostEqual(a, b, 2)
        else:
            self.assertEqual(a, b)

if __name__ == '__main__':
    unittest.main()
What is happening here is that my test case is pulling a method, listFileCheck, from another file. What it does is read values from the serial port communicating with the control board and compare them with a calibration file. It puts the control board values in a multidimensional array along with the calibration file values. These values can be either str, int, or float.
I use the test case to compare the values to one another; however, I keep getting this error, but only sometimes. Every 3rd or so run it fails with this error:
Error
Traceback (most recent call last):
File "C:\Python34\lib\unittest\case.py", line 57, in testPartExecutor
yield
File "C:\Python34\lib\unittest\case.py", line 574, in run
testMethod()
TypeError: 'NoneType' object is not callable
Process finished with exit code 0
Anyone know why I might be getting this error on occasion?
