I plan to mask the data in batches using a UDF. The UDF calls ECC and AES to mask the data; the concrete packages are:
cryptography
eciespy
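For illustration, a simplified sketch of this kind of masking UDF (the column name, public key, and output encoding below are placeholders, not the real job's code):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from ecies import encrypt  # provided by the eciespy package

RECEIVER_PUBLIC_KEY_HEX = "0x..."  # placeholder ECC public key

@udf(returnType=StringType())
def mask_value(value):
    # ECIES = ECC key agreement + AES payload encryption, handled inside encrypt()
    if value is None:
        return None
    return encrypt(RECEIVER_PUBLIC_KEY_HEX, value.encode("utf-8")).hex()

# df_result = df.withColumn("masked_col", mask_value("plain_col"))  # df is assumed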
I got the following error
Driver stacktrace:
22/03/21 11:30:52 INFO DAGScheduler: Job 1 failed: showString at NativeMethodAccessorImpl.java:0, took 1.766196 s
Traceback (most recent call last):
File "/home/hadoop/pyspark-dm.py", line 495, in <module>
df_result.show()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 588, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'ecies'
I loaded the environment via archives:
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# spark session initialization
spark = SparkSession.builder \
    .config('spark.sql.hive.metastore.sharedPrefixes', 'com.amazonaws.services.dynamodbv2') \
    .config('spark.sql.warehouse.dir', 'hdfs:///user/spark/warehouse') \
    .config('spark.sql.catalogImplementation', 'hive') \
    .config("spark.archives", "pyspark_venv.tar.gz#environment") \
    .getOrCreate()
I packed the dependency libraries with venv-pack and uploaded the archive via spark-submit.
22/03/21 11:44:36 INFO SparkContext: Unpacking an archive pyspark_venv.tar.gz#environment from /mnt/tmp/spark-060999fd-4410-405d-8d15-1b832d09f86c/pyspark_venv.tar.gz to /mnt/tmp/spark-dc9e1f8b-5d91-4ccf-8f20-d85ed72e9eca/userFiles-1c03e075-1fb2-4ffd-a527-bb4d650e4df8/environment
When I executed the pyspark script in local mode, it worked well.
Build the archives
First, read the [doc](https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv) in detail.
Second, create a virtual environment.
Third, install the required packages into it.
Finally, pack the virtual environment with venv-pack (e.g. venv-pack -o pyspark_venv.tar.gz).
Modify the source code
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setExecutorEnv('PYSPARK_PYTHON', './environment/bin/python')
conf.set('spark.sql.hive.metastore.sharedPrefixes', 'com.amazonaws.services.dynamodbv2')
conf.set('spark.sql.warehouse.dir', 'hdfs:///user/spark/warehouse')
conf.set('spark.sql.catalogImplementation', 'hive')
conf.set("spark.archives", "pyspark_venv.tar.gz#environment")
# build the session from the conf so executors pick up the packed environment
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Submit the job
spark-submit --archives pyspark_venv.tar.gz#environment pyspark-dm.py
Related
I am a noob in Python multiprocessing and trying to learn it.
I am trying to run a class method in a separate multiprocessing Process. The Python class is initialized in the main process, and I am trying to start one of its methods in a separate process:
def __initialise_strat_class_obj(self, *, strat_name: str, strat_config: StratInput):
    try:
        strat_instance = strat_class_dict[strat_name](strat_args=self.strat_args, strat_input=strat_config)  # dynamically fetching & calling the class
        strat_instance.init()
    except Exception as e:
        self._logger.error(f"unable to init strategy `{strat_config.id}`: {e}")
        return False
    strat_process = Process(target=strat_instance.run)
    strat_process.start()  # starting the class method in a separate process
def start(self):
    self.init()
    if self.is_active():
        self.run()
    else:
        self._logger.warning(f"{self.id} -> strat is inactive, returning from thread")
        return
However, I encountered the cannot pickle '_io.TextIOWrapper' object error. Below is the stack trace of the error.
Traceback (most recent call last):
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 237, in __init__
self.__initialise_strat_class_obj(strat_name=strat_name, strat_config=strat_inp)
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 263, in __initialise_strat_class_obj
strat_process.start()
File "C:\Program Files\Python311\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 94, in __init__
reduction.dump(process_obj, to_child)
File "C:\Program Files\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
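For context, a minimal standalone reproduction of this constraint (assuming the strategy object holds something like an open log file; on Windows, multiprocessing spawns a fresh interpreter, so the target method and everything it references must be picklable):
import multiprocessing as mp

class Strat:
    def __init__(self):
        self.log = open("strat.log", "w")  # an _io.TextIOWrapper, which cannot be pickled

    def run(self):
        self.log.write("running\n")

if __name__ == "__main__":
    s = Strat()
    p = mp.Process(target=s.run)
    p.start()  # raises TypeError: cannot pickle '_io.TextIOWrapper' object under spawn
    p.join()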
To overcome the serialization issue, I switched to the multiprocess package, which uses dill for serializing objects. This time I got a completely different error:
Traceback (most recent call last):
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 237, in __init__
self.__initialise_strat_class_obj(strat_name=strat_name, strat_config=strat_inp)
File "D:\repo\StockTradeBotV2\src\strategy\__init__.py", line 263, in __initialise_strat_class_obj
strat_process.start()
File "C:\PythonEnvironments\StockTradeBotV2\Lib\site-packages\multiprocess\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
TypeError: BaseProcess._Popen() takes 1 positional argument but 2 were given
I have tried calling the raw class without initializing it and passing the self argument, which didn't help. When I change Process to Thread, the code works as intended without any errors, but I want this to run in its own process rather than a thread so that the additional cores on my PC are used.
I am doing all this on Python 3.11.1 with the multiprocess 0.70.14 package.
Edit: switching to multiprocess.Process rather than from multiprocess.process import BaseProcess as Process solved the issue.
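In other words, a minimal sketch of the working import (the target function here is just a placeholder):
from multiprocess import Process  # not: from multiprocess.process import BaseProcess as Process

def work():
    # BaseProcess is the abstract base class; its _Popen is a stub that a concrete
    # start-method context normally overrides, which is why instantiating it directly
    # produced the TypeError above.
    print("running in a separate process")

if __name__ == "__main__":
    p = Process(target=work)
    p.start()
    p.join()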
I'm trying to run training of the following model: https://github.com/Atten4Vis/ConditionalDETR
using the script conddetr_r50_epoch50.sh, just as described in the README. The script looks like this:
script_name1=`basename $0`
script_name=${script_name1:0:${#script_name1}-3}
python -m torch.distributed.launch \
--nproc_per_node=8 \
--use_env \
main.py \
--coco_path ../data/coco \
--output_dir output/$script_name
But I am getting the following errors:
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-16DB4TE]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-16DB4TE]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "C:\DETR\ConditionalDETR\main.py", line 258, in <module>
main(args)
File "C:\DETR\ConditionalDETR\main.py", line 116, in main
utils.init_distributed_mode(args)
File "C:\DETR\ConditionalDETR\util\misc.py", line 429, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\cuda\__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 55928) of binary: C:\ProgramData\Anaconda3\envs\conditional_detr\python.exe
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 195, in <module>
main()
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 191, in main
launch(args)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 176, in launch
run(args)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\run.py", line 753, in run
elastic_launch(
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I am very new to PyTorch, so I do not quite understand why I am getting these errors or what I should do to fix this.
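For reference, the AttributeError on torch._C._cuda_setDevice usually indicates a CPU-only PyTorch build; a quick sanity check (run inside the same conditional_detr conda environment) is:
import torch

# A CPU-only build typically reports is_available() == False and device_count() == 0,
# in which case torch.cuda.set_device() cannot work regardless of the launch script.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.device_count())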
I have a Python function that I converted to a pandas UDF. It worked fine up until last week, but I have been getting the error below for the last few days. We tried a simple Python function with a pandas UDF and it does not throw this error, so I am not sure what exactly in my code is causing it. Has there been any change to the Spark environment? I am using Azure Databricks, if that helps.
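For context, a simplified example of the kind of pandas UDF in question (the column names and the transformation are placeholders, not the real function):
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def clean_text(s: pd.Series) -> pd.Series:
    # placeholder per-batch transformation standing in for the real logic
    return s.str.strip().str.lower()

# df_result = df.withColumn("clean_col", clean_text("raw_col"))  # df is assumed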
My search only turned up this link, but it is old.
Appreciate any pointers on how to fix this issue.
Thanks,
Yudi
SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 252, 172.17.69.7, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 669, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 875, in subimport
__import__(name)
ImportError: No module named '_pandasujson'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 394, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/databricks/spark/python/pyspark/worker.py", line 234, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type, runner_conf)
File "/databricks/spark/python/pyspark/worker.py", line 160, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/databricks/spark/python/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/databricks/spark/python/pyspark/serializers.py", line 183, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 669, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 875, in subimport
__import__(name)
ImportError: No module named '_pandasujson'
Trying to run the most basic test of add.delay(1,2) using celery 4.1.0 with Python 3.6.4 and getting the following error:
[2018-02-27 13:58:50,194: INFO/MainProcess] Received task: exb.tasks.test_tasks.add[52c3fb33-ce00-4165-ad18-15026eca55e9]
[2018-02-27 13:58:50,194: CRITICAL/MainProcess] Unrecoverable error: SystemError(' returned a result with an error set',)
Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 570, in on_task_received
    callbacks,
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/strategy.py", line 145, in task_message_handler
    handle(req)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 221, in _process_task_sem
    return self._quick_acquire(self._process_task, req)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/semaphore.py", line 62, in acquire
    callback(*partial_args, **partial_kwargs)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 226, in _process_task
    req.execute_using_pool(self.pool)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/request.py", line 531, in execute_using_pool
    correlation_id=task_id,
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/base.py", line 155, in apply_async
    **options)
  File "/opt/myapp/lib/python3.6/site-packages/billiard/pool.py", line 1486, in apply_async
    self._quick_put((TASK, (result._job, None, func, args, kwds)))
  File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 813, in send_job
    body = dumps(tup, protocol=protocol)
TypeError: can't pickle memoryview objects

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
    blueprint.start(self)
  File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/opt/myapp/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/opt/myapp/lib/python3.6/site-packages/librabbitmq-2.0.0-py3.6-linux-x86_64.egg/librabbitmq/__init__.py", line 227, in drain_events
    self._basic_recv(timeout)
SystemError: returned a result with an error set
I cannot find any previous evidence of anyone hitting this error. I noticed on the Celery site that only Python 3.5 is mentioned as supported; is that the issue, or is this something I am missing?
Any help would be much appreciated!
UPDATE: Tried with Python 3.5.5 and the problem persists. Tried with Django 4.0.2 and the problem persists.
UPDATE: Uninstalled librabbitmq and the problem stopped. This was seen after migration from Python 2.7.5, Django 1.7.7 to Python 3.6.4, Django 2.0.2.
After uninstalling librabbitmq, the problem was resolved.
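If removing librabbitmq is not an option, another way to sidestep its C extension (a sketch; the broker URL and task are placeholders) is to pin the pure-Python transport explicitly:
from celery import Celery

# "pyamqp://" forces the pure-Python py-amqp transport, whereas the plain
# "amqp://" alias prefers the librabbitmq C extension when it is installed.
app = Celery("exb", broker="pyamqp://guest@localhost//")

@app.task
def add(x, y):
    return x + y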
I am writing a PySpark program that contains a recursive function. When I execute this program, I get the error below.
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/Documents/repos/
File "/Users/Documents/repos/main.py", line 62, in main
run_by_date(dt.datetime.today() - dt.timedelta(days=1))
File "/Users/Documents/repos/main.py", line 50, in run_by_date
parsed_rdd.repartition(1).saveAsTextFile(save_path)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2058, in repartition
return self.coalesce(numPartitions, shuffle=True)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2075, in coalesce
jrdd = selfCopy._jrdd.coalesce(numPartitions, shuffle)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2439, in _jrdd
self._jrdd_deserializer, profiler)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2372, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2358, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/serializers.py", line 440, in dumps
return cloudpickle.dumps(obj, 2)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/cloudpickle.py", line 667, in dumps
cp.dump(obj)
File "/Users/Documents/tools/spark-2.1.0/python/pyspark/cloudpickle.py", line 111, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not pickle object as excessively deep recursion required.
I understand this might be because the recursion goes very deep when the function is serialised. How is this problem usually handled?
Thank you so much.
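For what it's worth, one common way to handle this is to keep the recursion out of whatever gets pickled with the RDD operation, for example by rewriting the driver-side recursion as a plain loop. A rough sketch under assumed structure (parse_for_date and save_path_for are hypothetical stand-ins for the real per-day logic):
import datetime as dt

def run_for_last_n_days(num_days):
    # Iterating instead of recursing keeps the closure that cloudpickle has to
    # serialise shallow, which avoids the "excessively deep recursion" error.
    for offset in range(1, num_days + 1):
        date = dt.datetime.today() - dt.timedelta(days=offset)
        parsed_rdd = parse_for_date(date)  # hypothetical per-day parsing step
        parsed_rdd.repartition(1).saveAsTextFile(save_path_for(date))  # hypothetical path helper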