pyspark write failed with StackOverflowError - apache-spark

I was planning to convert fixed-width files to Parquet in AWS Glue. My data has around 1,600 columns and around 3,000 rows.
When I try to write the Spark DataFrame as Parquet, I get a StackOverflowError.
The issue also appears when I call count(), show(), etc.
I tried calling cache() and repartition(), but I still see the error.
The code works if I reduce the number of columns to 500.
Please help. Below is my code:
data_df = spark.read.text(input_path)
schema_df = pd.read_json(schema_path)
df = data_df
for r in schema_df.itertuples():
    df = df.withColumn(
        str(r.name), df.value.substr(int(r.start), int(r.length))
    )
df = df.drop("value")
df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)  # FAILING HERE
Stack trace below:
2021-11-10 05:00:13,542 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
File "/tmp/conv_fw_2_pq.py", line 148, in <module>
partition_ts=parsed_args.partition_timestamp,
File "/tmp/conv_fw_2_pq.py", line 125, in process_file
df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 839, in parquet
self._jwrite.parquet(path)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o7066.parquet.
: java.lang.StackOverflowError
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.expressions.Expression.references(Expression.scala:88)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$references$1.apply(Expression.scala:88)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$references$1.apply(Expression.scala:88)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.expressions.Expression.references(Expression.scala:88)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$references$1.apply(QueryPlan.scala:45)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$references$1.apply(QueryPlan.scala:45)
at scala.collection.immutable.Stream$$anonfun$flatMap$1.apply(Stream.scala:497)
at scala.collection.immutable.Stream$$anonfun$flatMap$1.apply(Stream.scala:497)

The official Spark documentation has the following description of withColumn:
This method introduces a projection internally. Therefore, calling it multiple times, for instance via loops in order to add multiple columns, can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with the multiple columns at once.
It is recommended that you first construct the list of column expressions and then call select once to build the new DataFrame.
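For example, here is a minimal sketch of that approach, reusing the names from the code in the question (spark, input_path, schema_path, output_path, and the schema's name/start/length fields):
import pandas as pd

data_df = spark.read.text(input_path)
schema_df = pd.read_json(schema_path)

# Build all column expressions first, then apply them in a single select()
# so the plan contains one projection instead of ~1600 nested ones.
cols = [
    data_df.value.substr(int(r.start), int(r.length)).alias(str(r.name))
    for r in schema_df.itertuples()
]

df = data_df.select(*cols)  # "value" is dropped implicitly because it is not selected
df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)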

Related

Sending PySpark results to MQTT broker with foreachRDD and paho

I'm trying to send a DStream with my computed results to an MQTT broker, but foreachRDD keeps crashing.
I'm running Spark 2.4.3 with Bahir for MQTT subscribe, compiled from git master. Everything works up to this point: before trying to publish my results with MQTT, I tried saveAsFiles(), and that worked (but isn't exactly what I want).
def sendPartition(part):
    # code for publishing with MQTT here
    return 0

mydstream = MQTTUtils.createStream(ssc, brokerUrl, topic)
mydstream = packets.map(change_format) \
    .map(lambda mac: (mac, 1)) \
    .reduceByKey(lambda a, b: a + b)
mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))  # line 56
The resulting error I get is this:
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/util.py", line 68, in call
r = self.func(t, *rdds)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 161, in <lambda>
func = lambda t, rdd: old_func(rdd)
File "/path/to/my/code.py", line 56, in <lambda>
mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 806, in foreachPartition
self.mapPartitions(func).count() # Force evaluation
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
Lots of Java errors follow, but I suspect the error is in my code.
Are you able to run other Spark commands? At the end of your stack trace, you see java.lang.IllegalArgumentException: Unsupported class file major version 55. This indicates that you are running Spark on an unsupported version of Java: class file major version 55 corresponds to Java 11.
Spark is not yet compatible with Java 11 (due to limitations imposed by Scala, I think). Try configuring Spark to use Java 8. The specifics vary a bit based on what platform you're on; you'll probably need to install Java 8 and change the JAVA_HOME environment variable to point to the new installation.
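One way to do this from a PySpark script is a quick sketch like the following; the Java 8 path is only an example and will differ on your system:
import os

# Point Spark at a Java 8 installation before the JVM/gateway is started.
# Example path; use wherever Java 8 is installed on your machine.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# ...then create the SparkContext / StreamingContext as usual.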

Error while collecting the data from dataframe column in Pyspark

I am using PySpark (Python 3.7 with Spark 2.4) and have a small piece of code to collect a date from one of the attributes in a DataFrame. I can run the same code from the pyspark command line, but in my production code it errors out.
Here is the line of code, where I read a DataFrame df and collect a date from the field job_id:
>>> run_dt = map( lambda r:r[0], df.filter((df['delivery_date'] == '2017-12-31')).select(max(substring(df['job_id'], 9, 10).cast("integer")).alias('last_run')).collect())[0]
>>> print(run_dt)
2017123101
The same line of code gives me an error in my production code during evaluation. The error message is:
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\dataframe.py", line 533, in collect
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.collectToPython.
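For reference, here is the same lookup written as a standalone snippet with explicit imports (a sketch only; it assumes max and substring are meant to come from pyspark.sql.functions rather than Python's builtins):
from pyspark.sql.functions import substring, max as sql_max

run_dt = (
    df.filter(df["delivery_date"] == "2017-12-31")
      .select(sql_max(substring(df["job_id"], 9, 10).cast("integer")).alias("last_run"))
      .collect()[0][0]
)
print(run_dt)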

Errors writing a dataframe to DB2 using Pandas to_sql

I am trying to load data from a pandas dataframe to an IBM DB2 Data Warehouse environment. The table already exists so I am just appending rows to the table. I have built the dataframe to mirror every field in the table exactly.
I am using the pandas to_sql method to try to load the dataframe into the table. I already know that I am connected to the database, but when I run the code I get the following error:
AttributeError: 'function' object has no attribute 'cursor'
I didn't see anything in the pandas documentation about having to define a cursor when using to_sql. Any help would be appreciated.
I tried writing a direct SQL INSERT statement rather than using to_sql, but couldn't get that to work properly either. I already have a to_csv call that writes the dataframe to a CSV file, so I would like to use the same dataframe to insert into the table.
I cannot add too much code as this is a company project, but the table has 15 columns with differing datatypes (decimal, character, timestamp).
This is my to_sql statement:
output_df.to_sql(name='PD5', con=self.db2_conn, schema='REBTEAM', if_exists='append', index=False)
I expect the table to be loaded with the rows. The test file I'm using has 880 rows, so I would expect the table to have 880 rows.
Here is the entire error message I'm getting:
Warning (from warnings module):
File "C:\Users\dt24358\lib\site-packages\pandas\core\generic.py", line 2531
dtype=dtype, method=method)
UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\dt24358\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "C:\Users\dt24358\Scripts\Pricing Tool\Rebate_GUI_SQL.py", line 100, in <lambda>
command= lambda: self.submit_click(self.path, self.fileName, self.save_location, self.request_var.get(), self.execution_var.get(),self.dt_user_id, self.rebateAggregator))
File "C:\Users\dt24358\Scripts\Pricing Tool\Rebate_GUI_SQL.py", line 210, in submit_click
output_df.to_sql(name='PD5', con=self.db2_conn, schema='REBTEAM', if_exists='append', index=False)
File "C:\Users\dt24358\lib\site-packages\pandas\core\generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 1546, in to_sql
table.create()
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 572, in create
if self.exists():
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 560, in exists
return self.pd_sql.has_table(self.name, self.schema)
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 1558, in has_table
return len(self.execute(query, [name, ]).fetchall()) > 0
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 1426, in execute
cur = self.con.cursor()
AttributeError: 'function' object has no attribute 'cursor'
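For reference, pandas expects con to be a SQLAlchemy connectable or a DBAPI connection object (something that exposes a .cursor() method), not a function. Here is a minimal sketch of the SQLAlchemy route, assuming the ibm_db_sa dialect is installed; the user, password, host, and database names are placeholders:
from sqlalchemy import create_engine

# Placeholder connection URL; the "db2+ibm_db" form assumes the ibm_db_sa package.
engine = create_engine("db2+ibm_db://user:password@hostname:50000/BLUDB")

output_df.to_sql(name="PD5", con=engine, schema="REBTEAM",
                 if_exists="append", index=False)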

spark java.lang.stackoverflow logistic regression fit with large dataset

I am trying to fit a logistic regression model for a data set with 470 features and 10 million training instances. Here is a snippet of my code.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula
formula = RFormula(formula = "label ~ .-classWeight")
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01
lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal, elasticNetParam=bestregAlphaVal,weightCol="classWeight")
pipeLineLr = Pipeline(stages = [formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight','label']])
I have also created a checkpoint directory,
sc.setCheckpointDir('checkpoint/')
as suggested here:
Spark gives a StackOverflowError when training using ALS
However, I still get an error; here is a partial trace:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
I would also like to note that the 470 feature columns were added to the Spark DataFrame iteratively using withColumn().
So the mistake I was making is that, when checkpointing the dataframe, I would only do:
mySparkDataFrame.checkpoint(eager=True)
The right way to do it is:
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
This is based on another question I had asked (and got an answer for) here:
pyspark rdd isCheckPointed() is false
Also, it is recommended to persist() the dataframe before the checkpoint and to count() it after the checkpoint, as in the sketch below.
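Putting those pieces together, a minimal sketch of the pattern (variable names taken from the question; eager=True forces the checkpoint to be written immediately):
sc.setCheckpointDir('checkpoint/')

# checkpoint() returns a new DataFrame; the result must be assigned back.
mySparkDataFrame = mySparkDataFrame.persist()
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
mySparkDataFrame.count()  # materialize the checkpointed lineage before fitting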

Spark 2.0 with Zeppelin 0.6.1 - SQLContext not available

I am running Spark 2.0 and zeppelin-0.6.1-bin-all on a Linux server. The default Spark notebook runs just fine, but when I try to create and run a new pyspark notebook using sqlContext, I get the error "py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist".
I tried running some simple code:
%pyspark
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
print type(wordsDF)
wordsDF.printSchema()
I get this error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 266, in
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 259, in
exec(code)
File "", line 1, in
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 299, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o48.createDataFrame. Trace:
py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
When I try the same code with "sqlContext = SQLContext(sc)" it works just fine.
I have tried setting the interpreter configuration "zeppelin.spark.useHiveContext false", but it did not work.
I must obviously be missing something, since this is such a simple operation. Please advise if there is any other configuration to be set or what I am missing.
I tested the same piece of code with Zeppelin 0.6.0 and it is working fine.
SparkSession is the default entry point in Spark 2.0.0, and it is mapped to spark in Zeppelin 0.6.1 (as it is in the Spark shell). Have you tried spark.createDataFrame(...)?
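That is, the snippet from the question using the SparkSession entry point instead of sqlContext:
%pyspark
wordsDF = spark.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat',)], ['word'])
wordsDF.show()
wordsDF.printSchema()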
