Emr Notebook - Session isn't active - apache-spark

I'm using EMR Notebooks with pyspark and livy.
I'm reading the data from s3 which is in parquet format and string into a pyspark dataframe. there are approx. 2 million rows. when i do a join operation. I am getting 400 session isn't active. for which i have already set the livy timeout to 5h.
An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."

I had the same issue and the reason for the timeout is the driver running out of memory. By default the driver memory is 1000M when creating a spark application through EMR Notebooks even if you set a higher value through config.json. You can see that by executing the code from within a jupyter notebook
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory just do
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.

You can try working your operation on small amount of data first. Once it is working end to end as expected, you can move to large data.

Related

PySpark SQL error - ExternalAppendOnlyUnsafeRowArray

I am trying to perform a complex query through Spark. When I try to visualize the results (by calling the .show() method) the application stucks.
I tried to see what was going on by looking at the Spark logs and I noticed that the Spark Job related to the *show * method had a huge number of tasks and seemed to be not running at all.
By looking at the logs, I noticed the following:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I tried to modify the spill threshold by injecting the following config in the spark session:
("spark.shuffle.sort.bypassMergeThreshold", "8192")
But I got the same log message and outcome.
I am running the application on Spark 3.2.1.

Memory leak in Spark Structured Streaming driver

I am using Spark 3.1.2 with hadoop 3.2.0 to run Spark Structured Streaming (SSS) aggregation job, running on Spark K8S.
Theses job are reading files from S3 using SSS provided File Source input and also use S3 for checkpointing (with the directory output commiter).
What I noticed is that, after few days of running, the driver is having memory issue and crash.
As the driver is not doing many things (just calling Spark SQL functions and write the output to S3), I am wondering how to detect the source of these memory issues (memory leaking from hadoop/S3A library ?) and how I can fix them.
As shown on the screenshot, the driver take some time befoe using all the memory, and once it reached it, it seems to be able to call GC enough time. But after 1 week of running, it crash, as if the GC doesn't run enough/doesn't find something to clean.

Out of memory error while running spark submit

I am trying to load a 60gb table data onto a spark python dataframe and then write that into a hive table.
I have set driver memory, executor memory, max result size sufficiently to handle the data. But i am getting error when i run through spark submit with all the above said configs mentioned in command line.
Note: Through spark python shell (by specifying driver & executor memory while launching the shell), i am able to populate the target hive table.
Any thoughts??
Try using syntax:
./spark-submit --conf ...
For the memory-related configuration. What I suspect you're doing is - you are setting them, while initializing SparkSession - which becomes irrelevant, since kernel is already started by then. Same parameters, as you set for running shell will do.
https://spark.apache.org/docs/latest/submitting-applications.html

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I try to analyze a dataset of 500Mb in Databricks. These data are stored in Excel file. The first thing that I did was to install Spark Excel package com.crealytics.spark.excel from Maven (last version - 0.11.1).
These are the parameters of the cluster:
Then I executed the following code in Scala notebook:
val df_spc = spark.read
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.load("dbfs:/FileStore/tables/test.xlsx")
But I got error about the Java heap size and then I get another error "java.io.IOException: GC overhead limit exceeded". Then I executed this code again and got another error after 5 minutes running:
The spark driver has stopped unexpectedly and is restarting. Your
notebook will be automatically reattached.
I do not understand why it happens. In fact the data set is quite small for the distributed computing and the cluster size should be ok to process these data. What should I check to solve it?
I also got stuck in same situation where i am unable to process my 35000 record xlsx file.
Below solutions I tried to work around:
With the free azure subscription and 14 day pay as you go mode, you can process xlsx with less number of records.In my case with trial version, I have to change it to 25 records.
Also downgrade the worker type to Standard_F4S 8GB Memory 4core, 0.5DBU, 1 worker configuration.
Added below options:
sqlContext.read.format("com.crealytics.spark.excel").
option("location","filename here...").option("useHeader","true").option("treatEmptyValueAsNulls","true").option("maxRowsInMemory",20).option("inferSchema","true").load("filename here...")
I had this same issue. We reached out to DataBricks, who provided us this answer
"In the past we were able to address this issue by simply restarting a cluster that has been up for a long period of time.
This issue occurs due the fact that JVMs reuse the memory locations too many times and start misbehaving."

spark wholeTextFiles fails for large data

I use pyspark version 1.5.0 with Cloudera 5.5.0. All scripts are running fine except when I use sc.wholeTextFiles. Using this command gives an error:
Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max
However, I don't find the property spark.kryoserializer.buffer.max in the spark web UI; it is not present under the Environment tab in Spark web UI. The only "kryo" in this page is the value org.apache.spark.selializer.KryoSerializer of the name spark.serializer.
Why can't I see this property? And how to fix the problem?
EDIT
Turns out that the Kryo error was caused by a printing to the shell. Without printing, the error is actually java.io.IOExceptionL Filesystem closed!
The script now works correctly for a small portion of the data, but running it on all of the data (about 500GB, 10,000 files) returns this error.
I tried to pass in the propery --conf "spak.yarn.executor.memoryOverhead=2000", and it seems that it allows a slightly larger part of the data to be read, but it still ultimately fails on the full data. It takes 10-15 minutes of running before the error appears.
The RDD is big, but the error is produced even when only doing .count() on it.
You should pass such property when submitting a job. This is why it's not in Cloudera UI.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html
In your case: --conf "spark.kryoserializer.buffer.max = 64M" (for example)
Also, I'm not sure but it might happen that if you increase Kryo buffer you might want to increase akka frame size.

Resources