How does Spark load a Python package that depends on an external library? - apache-spark

I plan to mask the data in batches using a UDF. The UDF calls ECC and AES to mask the data; the concrete packages are (a sketch of such a UDF follows the list):
cryptography
eciespy
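A minimal sketch of the kind of masking UDF involved, assuming eciespy's documented encrypt helper and a hypothetical hex-encoded receiver public key (neither is taken from the original post):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

RECEIVER_PUBLIC_KEY_HEX = "..."  # hypothetical placeholder, not from the original post

@udf(returnType=StringType())
def mask_value(value):
    # imported inside the UDF, so the worker's Python environment must provide the package
    from ecies import encrypt
    return encrypt(RECEIVER_PUBLIC_KEY_HEX, value.encode("utf-8")).hex()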
I got the following error:
Driver stacktrace:
22/03/21 11:30:52 INFO DAGScheduler: Job 1 failed: showString at NativeMethodAccessorImpl.java:0, took 1.766196 s
Traceback (most recent call last):
File "/home/hadoop/pyspark-dm.py", line 495, in <module>
df_result.show()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 588, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'ecies'
I loaded the environment via archives:
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# spark session initialization
spark = SparkSession.builder \
    .config('spark.sql.hive.metastore.sharedPrefixes', 'com.amazonaws.services.dynamodbv2') \
    .config('spark.sql.warehouse.dir', 'hdfs:///user/spark/warehouse') \
    .config('spark.sql.catalogImplementation', 'hive') \
    .config("spark.archives", "pyspark_venv.tar.gz#environment") \
    .getOrCreate()
I packed the dependency libraries with venv-pack and uploaded the archive via spark-submit; the archive is unpacked as expected:
22/03/21 11:44:36 INFO SparkContext: Unpacking an archive pyspark_venv.tar.gz#environment from /mnt/tmp/spark-060999fd-4410-405d-8d15-1b832d09f86c/pyspark_venv.tar.gz to /mnt/tmp/spark-dc9e1f8b-5d91-4ccf-8f20-d85ed72e9eca/userFiles-1c03e075-1fb2-4ffd-a527-bb4d650e4df8/environment
When I executed the pyspark script in local mode, it worked well.

Build the archives
First, read the [doc](https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv) in detail.
Second, create a virtual environment.
Third, install the required packages.
Finally, pack the virtual environment with venv-pack (see the commands sketched below).
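A minimal command sequence for these steps, following the linked packaging guide (the environment name and package list are assumptions, not taken from the original post):

python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack cryptography eciespy
venv-pack -o pyspark_venv.tar.gz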
Modify the source code
conf = SparkConf()
conf.setExecutorEnv('PYSPARK_PYTHON', './environment/bin/python')
conf.set('spark.sql.hive.metastore.sharedPrefixes', 'com.amazonaws.services.dynamodbv2')
conf.set('spark.sql.warehouse.dir', 'hdfs:///user/spark/warehouse')
conf.set('spark.sql.catalogImplementation', 'hive')
conf.set("spark.archives", "pyspark_venv.tar.gz#environment")
Submit the job
spark-submit --archives pyspark_venv.tar.gz pyspark-dm.py
(Note that spark-submit options such as --archives must appear before the application script, and the archive name should match the one packed above.)

Related

multiprocess package -> BaseProcess: `TypeError: BaseProcess._Popen()`

I am a noob in Python multiprocessing and trying to learn it.
I am trying to run a class method in a separate multiprocessing Process. The class is initialized in the main process, and I start one of its methods in a new process.
def __initialise_strat_class_obj(self, *, strat_name: str, strat_config: StratInput):
    try:
        strat_instance = strat_class_dict[strat_name](strat_args=self.strat_args, strat_input=strat_config)  # dynamically fetching & calling the class
        strat_instance.init()
    except Exception as e:
        self._logger.error(f"unable to init strategy `{strat_config.id}`: {e}")
        return False
    strat_process = Process(target=strat_instance.run)
    strat_process.start()  # starting the class method in a separate process

def start(self):
    self.init()
    if self.is_active():
        self.run()
    else:
        self._logger.warning(f"{self.id} -> strat is inactive, returning from thread")
        return
However, I encountered the cannot pickle '_io.TextIOWrapper' object error. Below is the stack trace of the error.
Traceback (most recent call last):
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 237, in __init__
self.__initialise_strat_class_obj(strat_name=strat_name, strat_config=strat_inp)
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 263, in __initialise_strat_class_obj
strat_process.start()
File "C:\Program Files\Python311\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 94, in __init__
reduction.dump(process_obj, to_child)
File "C:\Program Files\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
To overcome the serialization issue, I made use of the multiprocess package which utilizes dill for serializing objects. This time I got a completely different error:
Traceback (most recent call last):
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 237, in __init__
self.__initialise_strat_class_obj(strat_name=strat_name, strat_config=strat_inp)
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 263, in __initialise_strat_class_obj
strat_process.start()
File "C:\PythonEnvironments\StockTradeBotV2\Lib\site-packages\multiprocess\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
TypeError: BaseProcess._Popen() takes 1 positional argument but 2 were given
I have tried calling the raw class without initializing it and passing the self argument, which didn't help. On changing the Process to a Thread, the code works as intended without any errors, but I want this to run in its own process for true parallelism, so that the additional cores on my PC are used.
I am doing all this on Python 3.11.1 with the multiprocess 0.70.14 package.
Edit: I had to use multiprocess.Process rather than `from multiprocess.process import BaseProcess as Process`, which solved the issue (see below).
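A minimal sketch of the fix described in the edit above: import Process from the top-level multiprocess package rather than BaseProcess from multiprocess.process.

from multiprocess import Process  # not: from multiprocess.process import BaseProcess as Process

def run():
    print("running in a separate process")

if __name__ == "__main__":
    p = Process(target=run)
    p.start()
    p.join()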

Running training using torch.distributed.launch

I'm trying to run training of the following model: https://github.com/Atten4Vis/ConditionalDETR
using the script conddetr_r50_epoch50.sh, as described in the README. It looks like this:
script_name1=`basename $0`
script_name=${script_name1:0:${#script_name1}-3}
python -m torch.distributed.launch \
--nproc_per_node=8 \
--use_env \
main.py \
--coco_path ../data/coco \
--output_dir output/$script_name
But I am getting the following errors:
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-16DB4TE]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-16DB4TE]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "C:\DETR\ConditionalDETR\main.py", line 258, in <module>
main(args)
File "C:\DETR\ConditionalDETR\main.py", line 116, in main
utils.init_distributed_mode(args)
File "C:\DETR\ConditionalDETR\util\misc.py", line 429, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\cuda\__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 55928) of binary: C:\ProgramData\Anaconda3\envs\conditional_detr\python.exe
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 195, in <module>
main()
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 191, in main
launch(args)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 176, in launch
run(args)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\run.py", line 753, in run
elastic_launch(
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I am very new to PyTorch, so I do not quite understand why I'm getting these errors or what I should do to fix them.
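As an aside, the deprecation warning in the output suggests replacing torch.distributed.launch with torchrun; an equivalent invocation (a sketch based on that warning, not a fix for the _cuda_setDevice error itself) would be roughly:

torchrun --nproc_per_node=8 \
    main.py \
    --coco_path ../data/coco \
    --output_dir output/$script_name

Note that --use_env is the default behaviour in torchrun, so that flag is dropped.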

Serialization error with Spark Pandas_UDF

I have a Python function that I converted to a pandas_udf, and it worked fine up until last week, but I have been getting the error below for the last few days. We tried a simple Python function with a pandas UDF and it does not throw this error, so I am not sure what exactly in my code is causing this. Has there been any change to the Spark environment? I am using Azure Databricks, if that helps.
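For reference, a "simple" pandas UDF of the kind mentioned above looks roughly like this (a hypothetical example, not the author's code):

from pyspark.sql.functions import pandas_udf

# trivial scalar pandas UDF (receives and returns a pandas Series)
@pandas_udf("double")
def plus_one(s):
    return s + 1.0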
Searching only turned up this link, but it is old.
Appreciate any pointers on how to fix this issue.
Thanks,
Yudi
SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 252, 172.17.69.7, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 669, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 875, in subimport
__import__(name)
ImportError: No module named '_pandasujson'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 394, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/databricks/spark/python/pyspark/worker.py", line 234, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type, runner_conf)
File "/databricks/spark/python/pyspark/worker.py", line 160, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/databricks/spark/python/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/databricks/spark/python/pyspark/serializers.py", line 183, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 669, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 875, in subimport
__import__(name)
ImportError: No module named '_pandasujson'

TypeError: can't pickle memoryview objects when running basic add.delay(1,2) test

Trying to run the most basic test of add.delay(1,2) using celery 4.1.0 with Python 3.6.4 and getting the following error:
[2018-02-27 13:58:50,194: INFO/MainProcess] Received task: exb.tasks.test_tasks.add[52c3fb33-ce00-4165-ad18-15026eca55e9]
[2018-02-27 13:58:50,194: CRITICAL/MainProcess] Unrecoverable error: SystemError(' returned a result with an error set',)
Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 570, in on_task_received
    callbacks,
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/strategy.py", line 145, in task_message_handler
    handle(req)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 221, in _process_task_sem
    return self._quick_acquire(self._process_task, req)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/semaphore.py", line 62, in acquire
    callback(*partial_args, **partial_kwargs)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 226, in _process_task
    req.execute_using_pool(self.pool)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/request.py", line 531, in execute_using_pool
    correlation_id=task_id,
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/base.py", line 155, in apply_async
    **options)
  File "/opt/myapp/lib/python3.6/site-packages/billiard/pool.py", line 1486, in apply_async
    self._quick_put((TASK, (result._job, None, func, args, kwds)))
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 813, in send_job
    body = dumps(tup, protocol=protocol)
TypeError: can't pickle memoryview objects

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
    blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/opt/myapp/lib/python3.6/site-packages/librabbitmq-2.0.0-py3.6-linux-x86_64.egg/librabbitmq/__init__.py", line 227, in drain_events
    self._basic_recv(timeout)
SystemError: returned a result with an error set
I cannot find any previous evidence of anyone hitting this error. I noticed from the Celery site that only Python 3.5 is mentioned as supported; is that the issue, or is this something I am missing?
Any help would be much appreciated!
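For context, the add task being exercised is presumably the canonical Celery first-steps example, something like this (a sketch; the module name and broker URL are assumptions):

from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

# invoked from another process with: add.delay(1, 2)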
UPDATE: Tried with Python 3.5.5 and the problem persists. Tried with Django 4.0.2 and the problem persists.
UPDATE: Uninstalling librabbitmq resolved the problem. This was seen after migrating from Python 2.7.5 / Django 1.7.7 to Python 3.6.4 / Django 2.0.2.

Spark cannot serialise recursive function, giving PicklingError

I am writing a PySpark program that contains a recursive function. When I execute this program, I get the error below.
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/Documents/repos/
File "/Users/Documents/repos/main.py", line 62, in main
run_by_date(dt.datetime.today() - dt.timedelta(days=1))
File "/Users/Documents/repos/main.py", line 50, in run_by_date
parsed_rdd.repartition(1).saveAsTextFile(save_path)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2058, in repartition
return self.coalesce(numPartitions, shuffle=True)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2075, in coalesce
jrdd = selfCopy._jrdd.coalesce(numPartitions, shuffle)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2439, in _jrdd
self._jrdd_deserializer, profiler)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2372, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2358, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/serializers.py", line 440, in dumps
return cloudpickle.dumps(obj, 2)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/cloudpickle.py", line 667, in dumps
cp.dump(obj)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/cloudpickle.py", line 111, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not pickle object as excessively deep recursion required.
I understand this might be because the recursion reaches a huge depth when it is serialised. So how is this problem usually handled?
Thank you so much.
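One commonly suggested workaround (an assumption on my part, not verified against this particular program) is to raise the driver's recursion limit before the action triggers serialisation, or, more robustly, to rewrite the recursive helper iteratively so the pickled closure stays shallow:

import sys

# cloudpickle hits Python's recursion limit while serialising the deeply
# recursive function; raising the limit can let the dump complete
sys.setrecursionlimit(10000)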
