Error while collecting the data from dataframe column in Pyspark - apache-spark

I am using PySpark (Python 3.7 with Spark 2.4) and have a small piece of code that collects a date from one of the attributes in a DataFrame. I can run the same code from the pyspark command line, but in my production code it errors out.
Here is the line of code where I read a DataFrame "df" and collect the date from the field "job_id".
>>> run_dt = map( lambda r:r[0], df.filter((df['delivery_date'] == '2017-12-31')).select(max(substring(df['job_id'], 9, 10).cast("integer")).alias('last_run')).collect())[0]
>>> print(run_dt)
2017123101
The same line of code gives me an error in my production code when it is evaluated. The error message is:
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\dataframe.py", line 533, in collect
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.collectToPython.
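For reference, here is a self-contained sketch of the same aggregation with the imports it relies on made explicit. The column names come from the question; the SparkSession setup and input path are assumptions added to make the snippet runnable, not the asker's code.
# Minimal sketch of the aggregation from the question, with explicit imports.
# The SparkSession setup and input path are placeholders, not the asker's code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max, substring

spark = SparkSession.builder.appName("last-run-lookup").getOrCreate()
df = spark.read.parquet("path/to/input")  # placeholder source for df

run_dt = (
    df.filter(df["delivery_date"] == "2017-12-31")
      .select(spark_max(substring(df["job_id"], 9, 10).cast("integer")).alias("last_run"))
      .collect()[0][0]   # grab the single aggregated value directly, instead of map(...)[0]
)
print(run_dt)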

Related

Can't write anything in PySpark

I am trying to write a DataFrame to any kind of file format. I have reinstalled Spark several times in different ways and with different versions, but I receive the same error every time, even on another machine. I am currently using Spark 3.3.1 on Hadoop 2.7, locally on Windows 11:
data = [[1, 43, 41], [2, 43, 41], [3, 43, 4]]
x = spark.createDataFrame(data)
x.write.csv('qqq')
And I receive this error:
File "D:\venvs\spark2\spark_hw.py", line 77, in <module>
x.write.csv('qqq')
File "D:\venvs\spark2\lib\site-packages\pyspark\sql\readwriter.py", line 1240, in csv
self._jwrite.csv(path)
File "D:\venvs\spark2\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:\venvs\spark2\lib\site-packages\pyspark\sql\utils.py", line 190, in deco
return f(*a, **kw)
File "D:\venvs\spark2\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o44.csv.
: org.apache.spark.SparkException: Job aborted.
x.write.format("csv").save("path/where/file/should/go")
will write your DataFrame as CSV to the path specified in the save() method.
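For example, a slightly fuller version of that call might look like the sketch below; the write mode and header option are illustrative additions, not part of the original answer, and the session is assumed to be the spark object from the question.
# Hedged sketch: writing the DataFrame from the question as CSV.
# The output path, write mode, and header option are illustrative choices.
data = [[1, 43, 41], [2, 43, 41], [3, 43, 4]]
x = spark.createDataFrame(data)

(x.write
  .format("csv")
  .mode("overwrite")        # replace any existing output at this path
  .option("header", True)   # write column names as the first line
  .save("path/where/file/should/go"))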

pyspark write failed with StackOverflowError

I was planning to convert fixed-width files to Parquet in AWS Glue.
My data has around 1600 columns and around 3000 rows.
When I try to write the Spark DataFrame (as Parquet), I get a "StackOverflow" error.
The issue appears even when I call count(), show(), etc.
I tried calling cache() and repartition(), but I still see the error.
The code works if I reduce the number of columns to 500.
Please help. Below is my code:
import pandas as pd  # needed for pd.read_json below

data_df = spark.read.text(input_path)
schema_df = pd.read_json(schema_path)
df = data_df

# Carve one column per fixed-width field out of the raw "value" column.
for r in schema_df.itertuples():
    df = df.withColumn(
        str(r.name), df.value.substr(int(r.start), int(r.length))
    )

df = df.drop("value")
df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)  # FAILING HERE
Stack trace below:
2021-11-10 05:00:13,542 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
File "/tmp/conv_fw_2_pq.py", line 148, in <module>
partition_ts=parsed_args.partition_timestamp,
File "/tmp/conv_fw_2_pq.py", line 125, in process_file
df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 839, in parquet
self._jwrite.parquet(path)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o7066.parquet.
: java.lang.StackOverflowError
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.expressions.Expression.references(Expression.scala:88)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$references$1.apply(Expression.scala:88)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$references$1.apply(Expression.scala:88)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.expressions.Expression.references(Expression.scala:88)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$references$1.apply(QueryPlan.scala:45)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$references$1.apply(QueryPlan.scala:45)
at scala.collection.immutable.Stream$$anonfun$flatMap$1.apply(Stream.scala:497)
at scala.collection.immutable.Stream$$anonfun$flatMap$1.apply(Stream.scala:497)
The official Spark documentation has the following description of withColumn:
This method introduces a projection internally. Therefore, calling it multiple times, for instance via loops in order to add multiple columns, can generate big plans, which can cause performance issues and even a StackOverflowException. To avoid this, use select() with multiple columns at once.
It is recommended that you first construct the select list and then use the select method to build a new DataFrame, as in the sketch below.
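A minimal sketch of that approach, reusing the data_df, schema_df, and output_path names from the question (the r.name, r.start, and r.length fields also come from the asker's code):
# Build all column expressions first, then apply them in a single select().
# This produces one projection instead of ~1600 nested withColumn projections.
select_exprs = [
    data_df.value.substr(int(r.start), int(r.length)).alias(str(r.name))
    for r in schema_df.itertuples()
]
df = data_df.select(*select_exprs)
df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)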

Sending PySpark results to MQTT broker with foreachRDD and paho

I'm trying to send a DStream with my computed results to an MQTT broker, but foreachRDD keeps crashing.
I'm running Spark 2.4.3 with Bahir (compiled from git master) for the MQTT subscription. Everything works up to that point. Before trying to publish my results over MQTT, I tried saveAsFiles(), and that worked (but isn't exactly what I want).
def sendPartition(part):
    # code for publishing with MQTT here
    return 0

mydstream = MQTTUtils.createStream(ssc, brokerUrl, topic)
mydstream = packets.map(change_format) \
    .map(lambda mac: (mac, 1)) \
    .reduceByKey(lambda a, b: a + b)

mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))  # line 56
The resulting error I get is this:
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/util.py", line 68, in call
r = self.func(t, *rdds)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 161, in <lambda>
func = lambda t, rdd: old_func(rdd)
File "/path/to/my/code.py", line 56, in <lambda>
mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 806, in foreachPartition
self.mapPartitions(func).count() # Force evaluation
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.: java.lang.IllegalArgumentException: Unsupported class file major version 55
with lots of Java errors following, but I suspect the error is in my own code.
Are you able to run other Spark commands? At the end of your stack trace you see java.lang.IllegalArgumentException: Unsupported class file major version 55, which indicates that you are running Spark on an unsupported version of Java.
Spark is not yet compatible with Java 11 (due to limitations imposed by Scala, I think). Try configuring Spark to use Java 8. The specifics vary a bit depending on your platform: you will probably need to install Java 8 and change the JAVA_HOME environment variable to point at the new installation.
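As a rough sketch, one way to point a locally launched PySpark session at a Java 8 installation is to set JAVA_HOME before the JVM starts; the path and app name below are placeholders, not taken from the question.
# Hedged sketch: point PySpark at a Java 8 installation before the JVM is launched.
# The JAVA_HOME path is a placeholder; adjust it to where Java 8 lives on your machine.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # placeholder path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mqtt-stream").getOrCreate()  # JVM is launched here, picking up JAVA_HOME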

Errors writing a dataframe to DB2 using Pandas to_sql

I am trying to load data from a pandas dataframe to an IBM DB2 Data Warehouse environment. The table already exists so I am just appending rows to the table. I have built the dataframe to mirror every field in the table exactly.
I am using the pandas to_sql method to try to load the dataframe data into the table. I already know that I am connected to the database, but when I run the code I get the following error:
AttributeError: 'function' object has no attribute 'cursor'
I didn't see anything in the pandas documentation about having to define a cursor when using to_sql. Any help would be appreciated.
I tried writing a direct SQL INSERT statement rather than using to_sql, but I couldn't get that to work properly either. I already have a to_csv call that writes the dataframe to a CSV file, so I would like to use the same dataframe to insert into the table.
I cannot add too much code as this is a company project, but the table has 15 columns with differing datatypes (decimal, character, timestamp).
This is my to_sql statement:
`output_df.to_sql(name='PD5', con=self.db2_conn, schema='REBTEAM', if_exists='append', index=False)`
I expect the table to be loaded with the rows. The test file I'm using has 880 rows, so I would expect the table to have 880 rows.
Here is the entire error message I'm getting:
Warning (from warnings module):
File "C:\Users\dt24358\lib\site-packages\pandas\core\generic.py", line 2531
dtype=dtype, method=method)
UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\dt24358\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "C:\Users\dt24358\Scripts\Pricing Tool\Rebate_GUI_SQL.py", line 100, in <lambda>
command= lambda: self.submit_click(self.path, self.fileName, self.save_location, self.request_var.get(), self.execution_var.get(),self.dt_user_id, self.rebateAggregator))
File "C:\Users\dt24358\Scripts\Pricing Tool\Rebate_GUI_SQL.py", line 210, in submit_click
output_df.to_sql(name='PD5', con=self.db2_conn, schema='REBTEAM', if_exists='append', index=False)
File "C:\Users\dt24358\lib\site-packages\pandas\core\generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 1546, in to_sql
table.create()
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 572, in create
if self.exists():
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 560, in exists
return self.pd_sql.has_table(self.name, self.schema)
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 1558, in has_table
return len(self.execute(query, [name, ]).fetchall()) > 0
File "C:\Users\dt24358\lib\site-packages\pandas\io\sql.py", line 1426, in execute
cur = self.con.cursor()
AttributeError: 'function' object has no attribute 'cursor'
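For context, pandas' to_sql is typically handed a SQLAlchemy engine (or a raw sqlite3 connection) as its con argument. A minimal, hedged sketch of that wiring for DB2 is shown below; the ibm_db_sa URL, credentials, and sample data are placeholders, not taken from the question.
# Hedged sketch: calling to_sql with a SQLAlchemy engine for DB2.
# The connection URL, credentials, and sample frame are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("ibm_db_sa://user:password@host:50000/BLUDB")  # placeholder URL

output_df = pd.DataFrame({"EXAMPLE_COL": [1, 2, 3]})  # stand-in; the real frame has 15 columns
output_df.to_sql(name="PD5", con=engine, schema="REBTEAM",
                 if_exists="append", index=False)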

Why does PCA in pyspark run out of memory?

When I run PCA in PySpark, I run out of memory. This is PySpark 1.6.3, and the execution environment is a Zeppelin notebook. Here is an example. Let df be a PySpark DataFrame where 'vectors' is the desired input column (containing a SparseVector of data).
from pyspark.ml.feature import PCA
pca = PCA(k = 100, inputCol="vectors", outputCol = "pca").fit(df)
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2419389767585347468.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "/usr/hdp/current/spark-client/python/pyspark/ml/pipeline.py", line 69, in fit
return self._fit(dataset)
File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o222.fit.
: java.lang.OutOfMemoryError: Java heap space
But check this out:
import pandas as pd
import numpy as np
pandf = df.toPandas()
densevectors = [np.array(sparse.toArray()) for sparse in pandf['vectors']]
xtrain = np.vstack(densevectors)
from sklearn.decomposition import PCA as skPCA
skpca = skPCA(n_components=100).fit(xtrain)
skpca.components_.shape
(100, 41277)
Execution time is 14 seconds. There are no memory problems, of course, because the input dataset only has ~9000 rows of sparse vectors. In spark-defaults.conf, driver and executor memory are both set to 12g, and this is an 8-node cluster that should have 32g available per node. There is no way the entire input dataset takes up even 1 MB, not even in .csv format.
Why is pyspark's PCA implementation running out of memory?
