Databricks notebook crashes on memory-intensive job (Azure)

I am running a few operations to aggregate a large amount of data (about 600 GB) on Azure Databricks. I recently noticed that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster. After upgrading to 12 nodes, I started getting this error, and I suspect it is a configuration problem.
Any help would be appreciated. I use the default Spark configuration with the number of partitions set to 200, and I have 88 executors on my nodes.
Thanks
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
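
For a sense of scale on the numbers in the question: with about 600 GB of data and 200 partitions, each partition would hold roughly 3 GB if the data were evenly distributed. The sketch below does that arithmetic in plain Python. The 128 MB target per partition is a commonly quoted rule of thumb and an assumption here, not something from the post, and the helper name is hypothetical. Shuffle size can also differ from input size, so treat the result as a rough estimate only.

```python
import math

def shuffle_partition_estimate(total_gb, partitions=200, target_mb=128):
    """Return (MB per partition at the current setting,
    partition count needed to reach target_mb per partition)."""
    total_mb = total_gb * 1024
    per_partition_mb = total_mb / partitions
    suggested = math.ceil(total_mb / target_mb)
    return per_partition_mb, suggested

# Numbers from the question: ~600 GB, 200 partitions
per_partition_mb, suggested = shuffle_partition_estimate(600)
print(f"{per_partition_mb:.0f} MB per partition with 200 partitions")
print(f"about {suggested} partitions for ~128 MB each")
```

A value like this could be tried through the spark.sql.shuffle.partitions setting. Whether partition size is related to the DriverClient error above is not established by the post.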

I'm not sure about the cost implications, but how about enabling the autoscaling option on the cluster and increasing Max Workers? You can also try changing the Worker Type to one with more resources.

Just for other people facing a similar issue.
In my situation, the same error sometimes happened when there were multiple Spark actions in one cell of a Databricks notebook.
Surprisingly, splitting the cell before the code where the error occurred, or simply inserting time.sleep(5) there, worked for me. However, I'm not sure why it worked.
For example:
df1.count() # some Spark action
# split the cell or insert `time.sleep(5)` here
pipeline.fit(df1) # another Spark action where the error happened

Related

PySpark SQL error - ExternalAppendOnlyUnsafeRowArray

I am trying to perform a complex query through Spark. When I try to visualize the results (by calling the .show() method), the application hangs.
I tried to see what was going on by looking at the Spark logs, and I noticed that the Spark job related to the show method had a huge number of tasks and did not seem to be running at all.
By looking at the logs, I noticed the following:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I tried to modify the spill threshold by injecting the following config in the spark session:
("spark.shuffle.sort.bypassMergeThreshold", "8192")
But I got the same log message and outcome.
I am running the application on Spark 3.2.1.
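
One thing worth noting: spark.shuffle.sort.bypassMergeThreshold controls the shuffle writer, not the buffer named in the log line. To my understanding of the Spark 3.x source, ExternalAppendOnlyUnsafeRowArray is sized by internal SQL settings of the form spark.sql.<operator>.buffer.in.memory.threshold and spark.sql.<operator>.buffer.spill.threshold, for the window, sort-merge join, and cartesian product operators. Which operator applies depends on the query plan, which the post does not show, so the sketch below simply builds all the candidate keys. The helper name is hypothetical, and 8192 is only the value the question tried.

```python
# Operators that, to my understanding, use ExternalAppendOnlyUnsafeRowArray
OPERATORS = ("windowExec", "sortMergeJoinExec", "cartesianProductExec")

def buffer_threshold_confs(rows, operators=OPERATORS):
    """Build the candidate Spark SQL buffer settings for the given row count."""
    confs = {}
    for op in operators:
        confs[f"spark.sql.{op}.buffer.in.memory.threshold"] = str(rows)
        confs[f"spark.sql.{op}.buffer.spill.threshold"] = str(rows)
    return confs

for key, value in buffer_threshold_confs(8192).items():
    print(key, "=", value)
```

Each pair could be passed to the session builder with .config(key, value). The log line is at INFO level, so raising these thresholds may not be what resolves the hang.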

GCS does not write all records in Spark3

I have seen several threads related to this, but most of them concern AWS S3 rather than Azure or GCS. I am running a Dataproc cluster and writing results to a Parquet table backed by a GCS bucket.
Now, the behavior of GCS so far has been inconsistent. It sometimes writes all records and sometimes misses a few records (records, not files). For example, if I am writing 43000 records, it will write about 42745. I say records because a correct write produces 100 files of equal size, and the incomplete writes still have all 100 files; if a single file were missing, about 4000 records would be missing. The data is evenly distributed. Also, when I rerun the job, it sometimes writes all records and sometimes writes a different number of records, for example 42985.
Every time this happens, I notice a stack trace like the one below in the Spark job for that specific hour. It doesn't cause the job to fail: the stack trace is logged, but the job status is reported as success after the spark-sql query.
22/11/22 00:59:13 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 55.0 in stage 2.0 (TID 255) (cluster-sample-w-3.c.network.internal executor 3): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: listStatus(hadoopPath: gs://<some_bucket>/hive/warehouse/<some_db>.db/<some_table>/data/_temporary/0/_temporary/attempt_202211220058563982258671276457664_0002_m_000055_255/dt=20221111/hr=01): 'gs://<some_bucket>/hive/warehouse/<some_db>.db/<some_table>/data/_temporary/0/_temporary/attempt_202211220058563982258671276457664_0002_m_000055_255/dt=20221111/hr=01' does not exist.
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:865)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:529)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:538)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:538)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:538)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:653)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:616)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:79)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:280)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
... 9 more
Caused by: java.io.FileNotFoundException: Item not found: gs://<somebucket>/hive/warehouse/<some_db>.db/<some_table>/data/_temporary/0/_temporary/attempt_202211220058563982258671276457664_0002_m_000055_255/dt=20221111/hr=01
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1039)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:856)
... 26 more
This is happening randomly and across multiple tables, which raises the question of whether GCS writes are consistent in Spark. I did read the part where it says Cloud Storage is not a drop-in replacement for HDFS, but then what is the alternative for solving such random behavior?
Environment:
GCS bucket:
Spark 3.1.3
Scala: 2.12.14
Dataproc Image: 2.0-rocky8
GCS Hadoop connector: gcs-connector-hadoop3-2.2.8.jar
Hadoop 3.2.3
Source code repository https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r c87f29d51bb88311d1adba1bc5bd7dfdfa345ebc
Compiled by bigtop on 2022-11-01T20:07Z
Compiled with protoc 2.5.0

EMR Notebook - Session isn't active

I'm using EMR Notebooks with PySpark and Livy.
I'm reading data from S3 (Parquet format, string values) into a PySpark DataFrame. There are approx. 2 million rows. When I do a join operation, I get a 400 "Session isn't active" error, even though I have already set the Livy timeout to 5h.
An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."
I had the same issue, and the reason for the timeout was the driver running out of memory. By default, the driver memory is 1000M when creating a Spark application through EMR Notebooks, even if you set a higher value through config.json. You can check this by executing the following from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory, run this in a cell:
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
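
As a small illustration of the step above, the body of the %%configure cell is plain JSON, so it can be generated instead of typed by hand. The sketch below is only that: the helper name is hypothetical, and the optional "conf" key is an assumption based on the session fields Livy accepts, not something the answer mentions.

```python
import json

def configure_cell(driver_memory="6000M", extra_conf=None):
    """Return the text of a %%configure -f cell as a string."""
    body = {"driverMemory": driver_memory}
    if extra_conf:
        # Assumption: extra Spark settings go under Livy's "conf" field
        body["conf"] = dict(extra_conf)
    return "%%configure -f\n" + json.dumps(body)

print(configure_cell())
```

The printed text can be pasted into the first cell of the notebook. After the session restarts, the spark.driver.memory check shown earlier should report the new value.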
You can try running your operation on a small amount of data first. Once it works end to end as expected, you can move on to the full data.

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I am trying to analyze a 500 MB dataset in Databricks. The data is stored in an Excel file. The first thing I did was install the Spark Excel package com.crealytics.spark.excel from Maven (latest version: 0.11.1).
These are the parameters of the cluster:
Then I executed the following code in Scala notebook:
val df_spc = spark.read
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.load("dbfs:/FileStore/tables/test.xlsx")
But I got an error about the Java heap size, and then another error: "java.io.IOException: GC overhead limit exceeded". I executed the code again and, after it had been running for 5 minutes, got another error:
The spark driver has stopped unexpectedly and is restarting. Your
notebook will be automatically reattached.
I do not understand why this happens. The dataset is quite small for distributed computing, and the cluster size should be sufficient to process it. What should I check to solve this?
I also got stuck in the same situation, where I was unable to process my 35000-record xlsx file.
These are the workarounds I tried:
With the free Azure subscription and 14-day pay-as-you-go mode, you can process xlsx files with a smaller number of records. In my case, with the trial version, I had to reduce it to 25 records.
I also downgraded the worker type to Standard_F4S (8 GB memory, 4 cores, 0.5 DBU) in a 1-worker configuration.
I added the options below:
sqlContext.read.format("com.crealytics.spark.excel")
.option("location", "filename here...")
.option("useHeader", "true")
.option("treatEmptyValueAsNulls", "true")
.option("maxRowsInMemory", 20)
.option("inferSchema", "true")
.load("filename here...")
I had this same issue. We reached out to Databricks, who gave us this answer:
"In the past we were able to address this issue by simply restarting a cluster that has been up for a long period of time.
This issue occurs due to the fact that JVMs reuse the memory locations too many times and start misbehaving."

Elasticsearch write using Spark

I'm creating a document collection in Spark as an RDD and using the Spark read/write library from Elasticsearch. The cluster that creates the collection is large, so when it writes to ES I get the errors below indicating ES is overloaded, which does not surprise me. This does not seem to fail the job. The tasks may be retried and eventually succeed. In the Spark GUI, the job is reported as having finished successfully.
Is there a way to throttle the ES writing library to avoid the retries (I can't change the cluster size)?
Do these errors mean that some data was not written to the index?
Here is one of many reported task failure errors, but again no job failure is reported:
2017-03-20 10:48:27,745 WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:102)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The lib I'm using is
"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.2"
Can you follow this link: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
In the Spark conf properties or in your Elasticsearch properties, you need to increase the maximum number of records that can be sent in a single post, and that should solve your problem.
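
The answer does not name the properties. As far as I know, the elasticsearch-hadoop configuration documents es.batch.size.entries, es.batch.size.bytes, es.batch.write.retry.count, and es.batch.write.retry.wait for bulk writes. The sketch below collects them into a dictionary. The helper name and all the values are placeholders, not recommendations, and whether a larger or smaller batch helps depends on the Elasticsearch cluster. The names should also be checked against the documentation for version 2.1.2.

```python
def es_write_settings(batch_entries=500, batch_bytes="1mb",
                      retry_count=6, retry_wait="30s"):
    """Collect bulk-write settings for the ES-Hadoop connector (placeholder values)."""
    return {
        "es.batch.size.entries": str(batch_entries),  # documents per bulk request
        "es.batch.size.bytes": batch_bytes,           # size limit per bulk request
        "es.batch.write.retry.count": str(retry_count),
        "es.batch.write.retry.wait": retry_wait,      # pause between retries
    }

settings = es_write_settings()
for key in sorted(settings):
    print(f"{key}={settings[key]}")
```

A dictionary like this could be merged into the Spark configuration or passed as the per-write settings map that the connector accepts.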
