PySpark SQL error - ExternalAppendOnlyUnsafeRowArray

I am trying to perform a complex query through Spark. When I try to visualize the results (by calling the .show() method), the application hangs.
I tried to see what was going on in the Spark logs and noticed that the Spark job related to the show() method had a huge number of tasks and did not seem to be running at all.
In the logs, I noticed the following:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I tried to modify the spill threshold by injecting the following config into the Spark session:
("spark.shuffle.sort.bypassMergeThreshold", "8192")
But I got the same log message and outcome.
I am running the application on Spark 3.2.1.
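For reference, a minimal sketch of how that config injection typically looks at session build time (the app name is illustrative; the threshold value is the one from the question). As an aside, the quoted spill message is likely governed by the SQL buffer spill thresholds (e.g. spark.sql.windowExec.buffer.spill.threshold, default 4096) rather than by the shuffle bypass setting:

from pyspark.sql import SparkSession

# Sketch only: inject the config when building the session. The setting
# tried in the question is spark.shuffle.sort.bypassMergeThreshold, but the
# "Reached spill threshold" log line likely comes from the SQL buffer
# thresholds (e.g. spark.sql.windowExec.buffer.spill.threshold).
spark = (
    SparkSession.builder
    .appName("complex-query")  # illustrative name
    .config("spark.shuffle.sort.bypassMergeThreshold", "8192")
    .getOrCreate()
)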

Related

Azure Synapse Spark Pool not working properly for simple tasks

I am currently having issues with my Apache Spark pool in Azure Synapse Analytics using Python: a rather trivial task of displaying a pandas DataFrame of 200k rows is not possible, yet merging on 1.7 million rows and grouping is. This leaves me confused, as I have not seen this issue before.
Calling the dataframe with df_mara results in the following error:
LivyHttpRequestFailure: Something went wrong while processing your request. Please try again later. HTTP status code: 500. Trace ID: d3082022-0cd9-4f87-8818-9edf2718faf5.
For info, the shape of df_mara is ~200k rows and 17 columns, with a size of ~40 MB.
I have tried deleting the Spark pool and creating a new one. I've tried turning up the nodes and the intelligent cache parameters, with no success. And since I have worked with this same Spark pool for some time without issues, I don't expect increasing the number of cores will fix it.

PySpark job only partly running

I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call to toPandas() on a PySpark DataFrame, and afterwards I have various manipulations of the resulting DataFrame, finishing in a call to to_csv() to write the results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and after it, but only the entries before the call show up in the spark-submit console output. I thought that maybe this is due to the rest of the code being run in a separate executor process by Spark, so the logs don't show on the console. If that is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true, but this seems to show only internal events, not my actual application log statements.
Whether or not that assumption about executor logs is true, I don't see the CSV file written to the path I expect (/tmp). Furthermore, the history server says "No completed applications found!" when I start it, configured to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application; the only completed job listed there is for my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
When you use toPandas() to convert your Spark DataFrame to a pandas DataFrame, it is actually a heavy action, because it pulls all the records back to the driver.
Remember that Spark is a distributed computing engine doing parallel computation. Your data is therefore distributed across different nodes, which is completely different from a pandas DataFrame: pandas works on a single machine, while Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post: it actually covers two questions:
Why there are no logs after toPandas(): as mentioned above, Spark is a distributed computing engine. The event log only saves the job details that appear in Spark's computation DAG. Non-Spark output is not saved in the event log; if you really want those logs, you need an external library like logging to collect them on the driver.
Why there is no CSV saved in /tmp: you mention that the event log shows an incomplete application, not a failed one, so I believe your DataFrame is so large that the collection has not finished, and your pandas transformations have not even started. You can try collecting just a few records, say df.limit(20).toPandas(), to see whether it works (see the sketch below). If it does, that means the DataFrame you are converting to pandas is simply very large and takes time. If it does not work, maybe you can share more of the error traceback.
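A minimal sketch combining both points, assuming df is the PySpark DataFrame from the question (the logger name and output path are illustrative):

import logging

# Driver-side logging: these statements go to the driver's stdout/stderr,
# not to the Spark event log.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("my_app")  # hypothetical logger name

# Collect only a small sample first; if this returns quickly, the full
# toPandas() is likely just slow or too large, not broken.
sample_pdf = df.limit(20).toPandas()
log.info("collected %d rows", len(sample_pdf))
sample_pdf.to_csv("/tmp/sample.csv", index=False)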

Databricks notebook crashes on memory-heavy job

I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I noticed recently that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster; after upgrading to 12 nodes I started getting this, and I suspect it is a config problem.
Any help would be appreciated. I use the default Spark configuration with a partition count of 200, and I have 88 executors across my nodes (see the partition-count sketch after the stack trace below).
Thanks
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
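For reference, assuming the partition count of 200 mentioned above is the default spark.sql.shuffle.partitions, a minimal sketch of raising it before the aggregation runs (the value is illustrative, not a recommendation):

# Sketch only: with ~600 GB of input, 200 shuffle partitions means very
# large individual tasks; raising the count shrinks each task.
spark.conf.set("spark.sql.shuffle.partitions", "2000")  # illustrative value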
I'm not sure about the cost implications, but how about enabling the autoscaling option on the cluster and bumping up Max Workers? You could also try changing the Worker Type to one with better resources.
Just for other people facing a similar issue:
In my situation, the same error sometimes happened when there were multiple Spark actions in one cell of a Databricks notebook.
Surprisingly, splitting the cell before the code where the error occurred, or simply inserting time.sleep(5) there, worked for me. However, I'm not sure why it worked...
For example:
import time

df1.count()       # some Spark action
time.sleep(5)     # split the cell here, or insert a short sleep
pipeline.fit(df1) # another Spark action, where the error happened

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I am trying to analyze a 500 MB dataset in Databricks. The data is stored in an Excel file. The first thing I did was install the Spark Excel package com.crealytics.spark.excel from Maven (latest version, 0.11.1).
These are the parameters of the cluster: [cluster configuration screenshot not reproduced here]
Then I executed the following code in Scala notebook:
val df_spc = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .load("dbfs:/FileStore/tables/test.xlsx")
But I got an error about the Java heap size, and then another error: "java.io.IOException: GC overhead limit exceeded". Then I executed the code again and got a different error after 5 minutes of running:
The spark driver has stopped unexpectedly and is restarting. Your
notebook will be automatically reattached.
I do not understand why this happens. The dataset is quite small for distributed computing, and the cluster size should be fine to process this data. What should I check to solve it?
I also got stuck in the same situation, unable to process my 35,000-record xlsx file.
Below are the workarounds I tried:
With the free Azure subscription in 14-day pay-as-you-go mode, you can only process an xlsx with a small number of records; in my case, on the trial version, I had to cut it down to 25 records.
Also, downgrade the worker type to the Standard_F4S (8 GB memory, 4 cores, 0.5 DBU), 1 worker configuration.
Added the options below:
sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", "filename here...")
  .option("useHeader", "true")
  .option("treatEmptyValueAsNulls", "true")
  .option("maxRowsInMemory", 20)
  .option("inferSchema", "true")
  .load("filename here...")
I had this same issue. We reached out to Databricks, who provided us with this answer:
"In the past we were able to address this issue by simply restarting a cluster that has been up for a long period of time.
This issue occurs due to the fact that JVMs reuse memory locations too many times and start misbehaving."

spark wholeTextFiles fails for large data

I use PySpark version 1.5.0 with Cloudera 5.5.0. All my scripts run fine except when I use sc.wholeTextFiles, which gives this error:
Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max
However, I can't find the property spark.kryoserializer.buffer.max in the Spark web UI; it is not present under the Environment tab. The only mention of "kryo" on that page is the value org.apache.spark.serializer.KryoSerializer for the property spark.serializer.
Why can't I see this property? And how to fix the problem?
EDIT
It turns out that the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed!
The script now works correctly on a small portion of the data, but running it on all of the data (about 500 GB, 10,000 files) returns this error.
I tried passing in the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems to allow a slightly larger part of the data to be read, but it still ultimately fails on the full dataset. The error appears after 10-15 minutes of running.
The RDD is big, but the error is produced even when only doing .count() on it.
You should pass such a property when submitting the job; that is why it does not appear in the Cloudera UI.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html
In your case: --conf "spark.kryoserializer.buffer.max=64M" (for example; note there should be no spaces around the =).
Also, I'm not sure, but if you increase the Kryo buffer you may also want to increase the Akka frame size.
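Since the property has to be in place before the context starts, here is a minimal sketch of setting it programmatically instead of on the command line (the app name and input path are illustrative; this mirrors the --conf approach, not a different fix):

from pyspark import SparkConf, SparkContext

# Sketch only: spark.kryoserializer.buffer.max must be set before the
# SparkContext is created; it cannot be changed on a running context.
conf = (
    SparkConf()
    .setAppName("wholeTextFiles-job")  # illustrative name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryoserializer.buffer.max", "64m")
)
sc = SparkContext(conf=conf)

files = sc.wholeTextFiles("hdfs:///path/to/files")  # illustrative path
print(files.count())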
