Mismatch in number of rows imported into Cassandra table (COPY command)

I am trying to load a CSV file into a Cassandra table using the COPY command, but the number of rows in my CSV file and the number of rows in Cassandra are not consistent.
Number of rows in the CSV file: 49765 (excluding header)
Number of rows in the Cassandra table:
cqlsh:test_df> select Count(*) from test_table;
count
-------
46982
(1 rows)
Warnings:
Aggregation query used without partition key
COPY command:
COPY test_table (column1,column2,column3) from 'temp.csv' with delimiter = ',' and header = True;
Error:
Starting copy of test_df.test_bhavcopy with columns [symbol, instrument, expiry_dt, strike_pr, option_typ, open, high, low, close, settle_pr, contracts, val_inlakh, open_int, ch_in_oi, price_date, key].
ImportProcess-1, ImportProcess-2 and ImportProcess-3 each died during shutdown with the same traceback:
Traceback (most recent call last):
  File "X:\Anaconda\lib\multiprocessing\process.py", line 267, in _bootstrap
    self.run()
  File "X:\apache-cassandra-3.11.3\bin\..\pylib\cqlshlib\copyutil.py", line 2328, in run
    self.close()
  File "X:\apache-cassandra-3.11.3\bin\..\pylib\cqlshlib\copyutil.py", line 2332, in close
    self._session.cluster.shutdown()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\cluster.py", line 1259, in shutdown
    self.control_connection.shutdown()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\cluster.py", line 2850, in shutdown
    self._connection.close()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\io\asyncorereactor.py", line 373, in close
    AsyncoreConnection.create_timer(0, partial(asyncore.dispatcher.close, self))
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\io\asyncorereactor.py", line 335, in create_timer
    cls._loop.add_timer(timer)
AttributeError: 'NoneType' object has no attribute 'add_timer'
Processed: 49765 rows; Rate: 4193 rows/s; Avg. rate: 3906 rows/s
49765 rows imported from 1 files in 12.742 seconds (0 skipped).
Maybe it's due to this error.

Found a fix:
I edited asyncorereactor.py in
cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py
and changed AsyncoreConnection.create_timer() to self.create_timer(), as suggested in this post:
https://datastax-oss.atlassian.net/browse/PYTHON-862?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
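
For reference, a sketch of what that one-line edit looks like; the "before" line and the line number (373) are taken from the traceback above, so verify them against your own copy of the driver:

# In cassandra/io/asyncorereactor.py, inside AsyncoreConnection.close()
# before (around line 373 per the traceback above):
AsyncoreConnection.create_timer(0, partial(asyncore.dispatcher.close, self))
# after (the change suggested in PYTHON-862):
self.create_timer(0, partial(asyncore.dispatcher.close, self))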

Related

Azure Quantum: problems with job submission

I use pyquil for Azure Quantum, and submit jobs with the run_batch method of the AzureQuantumComputer class. For batches of up to 10 circuits there are no problems, but larger batches result in the error below.
Traceback (most recent call last):
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1436, in _deserialize
    found_value = key_extractor(attr, attr_desc, data)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1180, in rest_key_extractor
    return working_data.get(key)
AttributeError: 'str' object has no attribute 'get'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1509, in failsafe_deserialize
    return self(target_obj, data, content_type=content_type)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1376, in __call__
    return self._deserialize(target_obj, data)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1454, in _deserialize
    raise_with_traceback(DeserializationError, msg, err)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\exceptions.py", line 51, in raise_with_traceback
    raise error.with_traceback(exc_traceback)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1436, in _deserialize
    found_value = key_extractor(attr, attr_desc, data)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1180, in rest_key_extractor
    return working_data.get(key)
azure.core.exceptions.DeserializationError: ("Unable to deserialize to object: type, AttributeError: 'str' object has no attribute 'get'", AttributeError("'str' object has no attribute 'get'"))
Traceback (most recent call last):
  File "C:\Users\Enter\PycharmProjects\QREM_pipline_development\pyquil_experiments.py", line 209, in <module>
    unprocessed_results_now = pyquil_utilities.run_batches_parametric(backend_name=backend_name,
  File "C:\Users\Enter\PycharmProjects\QREM_SECRET_DEVELOPMENT_LOC\backends_support\pyquil\pyquil_utilities.py", line 415, in run_batches_parametric
    results = backend_instance.run_batch(executable,
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\pyquil_for_azure_quantum\__init__.py", line 141, in run_batch
    return qam.run_batch(executable, memory_map)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\pyquil_for_azure_quantum\__init__.py", line 336, in run_batch
    job = self._target.submit(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\target\rigetti\target.py", line 183, in submit
    return super().submit(input_data, name, input_params, **kwargs)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\target\target.py", line 141, in submit
    return Job.from_input_data(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\job\base_job.py", line 117, in from_input_data
    return cls.from_storage_uri(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\job\base_job.py", line 207, in from_storage_uri
    job.submit()
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\job\job.py", line 45, in submit
    job = self.workspace.submit_job(self)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\workspace.py", line 265, in submit_job
    details = client.create(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\_client\operations\_jobs_operations.py", line 387, in create
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Forbidden'
Content:
403 Forbidden
Microsoft-Azure-Application-Gateway/v2
I tried running different circuits; the failure seems to depend only on the number of circuits in a batch, not on the structure of a circuit. Programs are compiled to native Quil locally.
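Since the failure only appears above 10 circuits per batch, one workaround worth trying is to split each large batch into sub-batches of at most 10 and merge the results. This is my sketch, not from the original post; it assumes run_batch takes a memory map whose values are lists with one entry per circuit, as the qam.run_batch(executable, memory_map) call in the traceback suggests, and that it returns a list of per-circuit results:

# Hypothetical workaround: submit a large parametric batch in chunks of <= 10.
def run_batch_chunked(qam, executable, memory_map, chunk_size=10):
    # number of circuits = length of any parameter list in the memory map
    n = len(next(iter(memory_map.values())))
    results = []
    for start in range(0, n, chunk_size):
        sub_map = {name: values[start:start + chunk_size]
                   for name, values in memory_map.items()}
        results.extend(qam.run_batch(executable, sub_map))
    return results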

Write Shapefile to AWS S3 with geopandas in Glue Python Shell

I have successfully read a zipped shapefile from my S3 bucket with geopandas, but I get an error when I try to write the same GeoDataFrame back to the same bucket as a shapefile.
The code below is how I read the zip file, and it works nicely:
import boto3
import geopandas

## session for connecting to S3
session = boto3.session.Session(aws_access_key_id='MY-KEY-ID',
                                aws_secret_access_key='MY-KEY')
s3 = session.resource('s3')
bucket = s3.Bucket('my_bucket')

## read shapefile
TPG = bucket.Object(key='/shapefiles/grid.zip')
TPGrid = geopandas.read_file(TPG.get()['Body'])
But when I tried to output the same geodataframe like this:
TPGrid.to_file(filename='s3://my_bucket/output/TPGrid.zip', driver='ESRI Shapefile')
I get this error:
ERROR:fiona._env:Only read-only mode is supported for /vsicurl
ERROR:fiona._env:Only read-only mode is supported for /vsicurl
ERROR:fiona._env:Only read-only mode is supported for /vsicurl
ERROR:fiona._env:Unable to open /vsis3/my_bucket/output/TPGrid.zip/TPGrid.shp or /vsis3/my_bucket/output/TPGrid.zip/TPGrid.SHP.
Traceback (most recent call last):
  File "fiona/ogrext.pyx", line 1133, in fiona.ogrext.WritingSession.start
  File "fiona/_err.pyx", line 291, in fiona._err.exc_wrap_pointer
fiona._err.CPLE_AppDefinedError: Unable to open /vsis3/my_bucket/output/TPGrid.zip/TPGrid.shp or /vsis3/my_bucket/output/TPGrid.zip/TPGrid.SHP.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/runscript.py", line 211, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-c8krhm5u/test_to_file_geo.py", line 40, in <module>
  File "/glue/lib/installation/geopandas/geodataframe.py", line 1086, in to_file
    _to_file(self, filename, driver, schema, index, **kwargs)
  File "/glue/lib/installation/geopandas/io/file.py", line 328, in _to_file
    filename, mode=mode, driver=driver, crs_wkt=crs_wkt, schema=schema, **kwargs
  File "/glue/lib/installation/fiona/env.py", line 408, in wrapper
    return f(*args, **kwargs)
  File "/glue/lib/installation/fiona/__init__.py", line 274, in open
    **kwargs)
  File "/glue/lib/installation/fiona/collection.py", line 165, in __init__
    self.session.start(self, **kwargs)
  File "fiona/ogrext.pyx", line 1141, in fiona.ogrext.WritingSession.start
fiona.errors.DriverIOError: Unable to open /vsis3/my_bucket/output/TPGrid.zip/TPGrid.shp or /vsis3/my_bucket/output/TPGrid.zip/TPGrid.SHP.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/runscript.py", line 230, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-c8krhm5u/test_to_file_geo.py", line 40, in <module>
  File "/glue/lib/installation/geopandas/geodataframe.py", line 1086, in to_file
    _to_file(self, filename, driver, schema, index, **kwargs)
  File "/glue/lib/installation/geopandas/io/file.py", line 328, in _to_file
    filename, mode=mode, driver=driver, crs_wkt=crs_wkt, schema=schema, **kwargs
  File "/glue/lib/installation/fiona/env.py", line 408, in wrapper
    return f(*args, **kwargs)
  File "/glue/lib/installation/fiona/__init__.py", line 274, in open
    **kwargs)
  File "/glue/lib/installation/fiona/collection.py", line 165, in __init__
    self.session.start(self, **kwargs)
  File "fiona/ogrext.pyx", line 1141, in fiona.ogrext.WritingSession.start
fiona.errors.DriverIOError: Unable to open /vsis3/my_bucket/output/TPGrid.zip/TPGrid.shp or /vsis3/my_bucket/output/TPGrid.zip/TPGrid.SHP.
I have tried several variations, such as using '.csv' or '.shp' as the target, but none of them worked.
I am using Python 3.6 and the packages below; I hope this information helps:
geopandas-0.9.0
shapely-1.7.1
fiona-1.8.20
GDAL-3.2.3
I have been fighting with this problem all day.
Any help will be highly appreciated.
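
One direction that is often suggested for this situation (my sketch, not from the original post): the fiona/GDAL stack here only supports read-only access over /vsis3, so write the shapefile to local disk first and upload the result with the same boto3 session. The folder and key names below just mirror the question; the zip step assumes you want a single archive object like the input:

import os
import shutil
import tempfile

# A shapefile is several files, so write it into its own local folder first.
tmp_dir = tempfile.mkdtemp()
shp_dir = os.path.join(tmp_dir, 'TPGrid')
os.makedirs(shp_dir)
TPGrid.to_file(filename=os.path.join(shp_dir, 'TPGrid.shp'), driver='ESRI Shapefile')

# Zip the folder and upload the archive using the bucket from the read step.
archive = shutil.make_archive(os.path.join(tmp_dir, 'TPGrid'), 'zip', shp_dir)
bucket.upload_file(archive, 'output/TPGrid.zip')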

How do I append rows to an excel spreadsheet with openpyxl and then save the excel file?

I'm writing a script that adds new data to an existing Excel spreadsheet. Currently, the spreadsheet is 500k+ rows long. I've been using openpyxl to open the spreadsheet, since xlsxwriter doesn't currently have any editing capabilities. However, when I use the provided append() method as explained in this answer to a similar problem, the file fails to save.
I'm currently running Python 3.7.3 with openpyxl 2.6.2 on a Windows 7 computer.
from openpyxl import load_workbook, Workbook

records = [object list]  # this is just a list of objects
file_name = 'existing_excel_file.xlsx'

excel_workbook = load_workbook(file_name, read_only=False)
worksheet = excel_workbook.active

# build the rows to append
row_list = []
for record in records:
    row_list.append([
        str(record.weekno),
        str(record.date1),
        str(record.code),
        str(record.customer),
        str(record.date2),
    ])

for row in row_list:
    worksheet.append(row)

excel_workbook.save(file_name)
Obviously, it's supposed to save the file with the appended lines.
append() is working alright, but when I try to execute the save() method, I receive this error:
ValueError: I/O operation on closed file.
EDIT: At the suggestion of @CharlieClark, I grabbed the full traceback. I also noticed that there is a MemoryError that I simply didn't notice before (careless, I know), which might be the source of my issue; until this is resolved, I'm researching how to increase the memory available to openpyxl, as I'm sure that's probably the key. Regardless, here's the dump. Warning: it's a big, hairy traceback.
Traceback (most recent call last):
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 836, in _get_writer
    yield file.write
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 777, in write
    short_empty_elements=short_empty_elements)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 935, in _serialize_xml
    write(" %s=\"%s\"" % (qnames[k], v))
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "manage.py", line 21, in <module>
    main()
  File "manage.py", line 17, in main
    execute_from_command_line(sys.argv)
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\base.py", line 316, in run_from_argv
    self.execute(*args, **cmd_options)
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\base.py", line 353, in execute
    output = self.handle(*args, **options)
  File "C:\Users\davidm\projects\web_admin\web_admin\ezcorp\management\commands\codes.py", line 183, in handle
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\openpyxl\workbook\workbook.py", line 397, in save
    save_workbook(self, filename)
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 294, in save_workbook
    writer.save()
  File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 276, in save
    self.write_data()
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 76, in write_data
    self._write_worksheets()
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 216, in _write_worksheets
    self.write_worksheet(ws)
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 201, in write_worksheet
    writer.write()
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\worksheet\_writer.py", line 358, in write
    self.close()
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\worksheet\_writer.py", line 366, in close
    self.xf.close()
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\worksheet\_writer.py", line 297, in get_stream
    pass
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\et_xmlfile\xmlfile.py", line 50, in element
    self._write_element(el)
  File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\et_xmlfile\xmlfile.py", line 77, in _write_element
    xml = tostring(element)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 1136, in tostring
    short_empty_elements=short_empty_elements)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 777, in write
    short_empty_elements=short_empty_elements)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 836, in _get_writer
    yield file.write
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 511, in __exit__
    raise exc_details[1]
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 496, in __exit__
    if cb(*exc_details):
  File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 383, in _exit_wrapper
    callback(*args, **kwds)
ValueError: I/O operation on closed file.
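
If the MemoryError really is the root cause, one commonly suggested direction (a sketch under that assumption, not a confirmed fix for this exact script) is to avoid holding the whole 500k-row workbook in memory: stream the existing sheet with read_only=True and write everything, old rows plus new ones, to a fresh file with a write-only workbook:

from openpyxl import load_workbook, Workbook

# Stream the existing rows instead of loading the full workbook into memory.
source = load_workbook('existing_excel_file.xlsx', read_only=True)
src_ws = source.active

# A write-only workbook keeps only the row being written in memory.
target = Workbook(write_only=True)
tgt_ws = target.create_sheet()

# Copy the existing rows, then the new ones (row_list built as in the script above).
for row in src_ws.iter_rows(values_only=True):
    tgt_ws.append(row)
for row in row_list:
    tgt_ws.append(row)

target.save('existing_excel_file_new.xlsx')
source.close()

Note that a write-only workbook cannot modify the original file in place; it produces a new file, which you can then rename over the old one.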

How to read .hql file (to run hive query) in pyspark

I have a .hql file with a huge number of queries. It runs slowly in Hive, so I want to read and run the .hql file using pyspark/sparksql.
I tried count = sqlContext.sql(open("file.hql").read()).count(), but it gives the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"missing EOF at ';' near 'db'; line 1 pos 36"
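
The "missing EOF at ';'" error suggests sqlContext.sql() is being handed the whole file at once, but it accepts only a single statement. A minimal sketch (assuming the statements are separated by ';' with no semicolons inside string literals or comments) is to split the file and run the statements one by one:

# Run a multi-statement .hql file one statement at a time.
with open("file.hql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

result = None
for statement in statements:
    result = sqlContext.sql(statement)  # sqlContext as in the original snippet

# Assuming the last statement is the final query, count its rows.
count = result.count()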

Spark cannot serialise a recursive function, giving PicklingError

I am writing a pyspark program that contains a recursive function. When I execute the program, I get the error below.
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/Users/Documents/repos/
  File "/Users/Documents/repos/main.py", line 62, in main
    run_by_date(dt.datetime.today() - dt.timedelta(days=1))
  File "/Users/Documents/repos/main.py", line 50, in run_by_date
    parsed_rdd.repartition(1).saveAsTextFile(save_path)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2058, in repartition
    return self.coalesce(numPartitions, shuffle=True)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2075, in coalesce
    jrdd = selfCopy._jrdd.coalesce(numPartitions, shuffle)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2372, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/rdd.py", line 2358, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/serializers.py", line 440, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/cloudpickle.py", line 667, in dumps
    cp.dump(obj)
  File "/Users/Documents/tools/spark-2.1.0/python/pyspark/cloudpickle.py", line 111, in dump
    raise pickle.PicklingError(msg)
pickle.PicklingError: Could not pickle object as excessively deep recursion required.
I understand this might be because the recursion is very deep when the function is serialised. How is this problem usually handled?
Thank you so much.
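
For what it's worth, the usual advice for this error (a general sketch, not specific to the original program, whose code is not shown) is either to raise Python's recursion limit before the closure is pickled, or, more robustly, to rewrite the recursive function captured by the RDD closure as an iterative one with an explicit stack, so cloudpickle never needs deep recursion to serialise it:

import sys

# Option 1 (quick, sometimes sufficient): raise the recursion limit before
# the RDD action triggers pickling. A workaround, not a fix.
sys.setrecursionlimit(10000)

# Option 2 (more robust): replace self-recursion with an explicit stack.
# Hypothetical example: flattening arbitrarily nested lists iteratively.
def flatten(node):
    out, stack = [], [node]
    while stack:
        current = stack.pop()
        if isinstance(current, list):
            stack.extend(current)   # descend without a recursive call
        else:
            out.append(current)
    return out

parsed_rdd = raw_rdd.map(flatten)  # raw_rdd is a hypothetical input RDD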
