Spark off heap memory leak on Yarn with Kafka direct stream - apache-spark

I am running Spark Streaming 1.4.0 on YARN (Apache Hadoop 2.6.0) with Java 1.8.0_45 and a Kafka direct stream. I am also using Spark with Scala 2.11 support.
The issue I am seeing is that both the driver and executor containers gradually increase their physical memory usage until YARN kills the containers. I have configured up to 192 MB of heap and 384 MB of off-heap space for my driver, but it eventually runs out.
The heap memory appears to be fine, with regular GC cycles. No OutOfMemoryError is ever encountered in any of these runs.
In fact, I am not generating any traffic on the Kafka topics, yet this still happens. Here is the code I am using:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SimpleSparkStreaming extends App {
  val conf = new SparkConf()
  val ssc = new StreamingContext(conf, Seconds(conf.getLong("spark.batch.window.size", 1L)))
  ssc.checkpoint("checkpoint")

  val topics = Set(conf.get("spark.kafka.topic.name"))
  val kafkaParams = Map[String, String]("metadata.broker.list" -> conf.get("spark.kafka.broker.list"))
  val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

  // Print the message payload of every record in each micro-batch
  kafkaStream.foreachRDD { rdd =>
    rdd.foreach(x => println(x._2))
  }
  kafkaStream.print()

  ssc.start()
  ssc.awaitTermination()
}
I am running this on CentOS 7. The spark-submit command I use is the following:
./bin/spark-submit --class com.rasa.cloud.prototype.spark.SimpleSparkStreaming \
--conf spark.yarn.executor.memoryOverhead=256 \
--conf spark.yarn.driver.memoryOverhead=384 \
--conf spark.kafka.topic.name=test \
--conf spark.kafka.broker.list=172.31.45.218:9092 \
--conf spark.batch.window.size=1 \
--conf spark.app.name="Simple Spark Kafka application" \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 192m \
--executor-memory 128m \
--executor-cores 1 \
/home/centos/spark-poc/target/lib/spark-streaming-prototype-0.0.1-SNAPSHOT.jar
Any help is greatly appreciated
Regards,
Apoorva

Try increasing the number of executor cores. In your example the only core is dedicated to consuming the streaming data, leaving no cores to process the incoming data.

It could be a memory leak... Have you tried with conf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")?
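For what it's worth, here is a minimal sketch of how that option could be wired in, assuming it is set before the StreamingContext is created. The driver-side note is my own assumption: in yarn-cluster mode the driver JVM is already running by the time application code executes, so the equivalent flag would have to be passed on spark-submit rather than in code.
import org.apache.spark.SparkConf

// Sketch only: enable G1GC on the executors.
// For the driver in yarn-cluster mode, pass
//   --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
// on spark-submit instead.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")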

This is not a Kafka answer; it is isolated to Spark and how poorly its cataloguing system handles consistent persistence and large operations. If you are consistently writing to a persistence layer (i.e. in a loop, re-persisting a DF after a large operation and then running again) or running a large query (i.e. inputDF.distinct.count), the Spark job will begin placing some data into memory and not efficiently removing the objects that are stale.
This means that over time an operation that once ran quickly will steadily slow down until no memory remains available. For everyone at home, spin up an AWS EMR cluster with a large DataFrame loaded into the environment and run the query below:
var iterator = 1
val endState = 15
var currentCount = 0L  // count returns a Long
while (iterator <= endState) {
  currentCount = inputDF.distinct.count
  println("The number of unique records are : " + currentCount)
  iterator = iterator + 1
}
While the job is running, watch the Spark UI's memory management. If the DF is sufficiently large for the session, you will start to notice a drop in performance (longer run-times) with each subsequent run, mainly because blocks are becoming stale but Spark is unable to identify when to clean them.
The best solution I have found to this problem was writing my DF out, clearing the persistence layer and loading the data back in. It is a "sledge-hammer" approach to the problem, but for my business case it was the easiest solution to implement, and it cut run-time by roughly 90% for our large tables (from 540 minutes to around 40, with less memory).
The code I currently use is:
val interimDF = inputDF.action
val tempDF = interimDF.write.format(...).option("...","...").save("...")
spark.catalog.clearCache
val interimDF = spark.read.format(...).option("...","...").load("...").persist
interimDF.count
Here is a derivative for when you don't unpersist DFs in child sub-processes:
val interimDF = inputDF.action
val tempDF = interimDF.write.format(...).option("...","...").save("...")
for ((k, v) <- sc.getPersistentRDDs) {
  v.unpersist()
}
val interimDF = spark.read.format(...).option("...","...").load("...").persist
interimDF.count

Related

Spark: process data larger than available memory

I'm working with Apache Spark version 3.1.2 deployed on a cluster of 4 nodes, each with 24 GB of memory and 8 cores, i.e. ~96 GB of distributed memory. I want to read in and process about ~120 GB of compressed (gzip) JSON data.
The following is the generic code flow of my processing:
data = spark.read.option('multiline', True).json(data_path, schema=schema)
result = (
    data.filter(data['col_1']['col_1_1'].isNotNull() | data['col2'].isNotNull())
    .rdd
    .map(parse_json_and_select_columns_of_interest)
    .toDF(schema_of_interest)
    .filter(data['col_x'].isin(broadcast_filter_list))
    .rdd
    .map(lambda x: (x['col_key'], x.asDict()))
    .groupByKey()
    .mapValues(compute_and_add_extra_columns)
    .flatMap(...)
    .reduceByKey(lambda a, b: a + b)   # <--- OOM
    .sortByKey()
    .map(append_columns_based_on_key)
    .saveAsTextFile(...)
)
I have tried the following executor settings:
# Tiny executors
--num-executors 32
--executor-cores 1
--executor-memory 2g
# Fat executors
--num-executors 4
--executor-cores 8
--executor-memory 20g
However, for all of these settings, I keep getting out-of-memory errors, especially on .reduceByKey(lambda a, b: a + b). My question is: (1) regardless of performance, can I change my code flow to avoid getting OOM? Or (2) should I add more memory to my cluster? (I'd like to avoid this, since it may not be a sustainable solution in the long run.)
Thanks
I would actually guess it's the sortByKey causing the OOM and would suggest increasing the number of partitions you are using by passing an argument: sortByKey(numPartitions = X).
I'd also suggest trying to use the DataFrame API where possible.

pyspark spark 2.4 on EMR 5.27 - cluster stops processing after listing files

Given an application converting csv to parquet (from and to S3) with little transformation:
for table in tables:
    df_table = spark.read.format('csv') \
        .option("header", "true") \
        .option("escape", "\"") \
        .load(path)
    df_one_seven_thirty_days = df_table \
        .filter(
            (df_table['date'] == fn.to_date(fn.lit(one_day)))
            | (df_table['date'] == fn.to_date(fn.lit(seven_days)))
            | (df_table['date'] == fn.to_date(fn.lit(thirty_days)))
        )
    for i in df_one_seven_thirty_days.schema.names:
        df_one_seven_thirty_days = df_one_seven_thirty_days.withColumnRenamed(i, colrename(i).lower())
    df_one_seven_thirty_days.createOrReplaceTempView(table)
    df_sql = spark.sql("SELECT * FROM " + table)
    df_sql.write \
        .mode("overwrite").format('parquet') \
        .partitionBy("customer_id", "date") \
        .option("path", path) \
        .saveAsTable(adwords_table)
I'm facing a difficulty with Spark on EMR.
Locally with spark-submit, this runs without problems (140 MB of data) and quite fast.
But on EMR, it's another story.
The first "adwords_table" is converted without problems, but the second one stays idle.
I've gone through the Spark jobs UI provided by EMR, and I noticed that once this task is done:
Listing leaf files and directories for 187 paths:
Spark kills all executors,
and 20 minutes later nothing more happens. All the tasks are marked "Completed" and no new ones start.
I'm waiting for the saveAsTable to start.
My local machine has 8 cores and 15 GB of RAM, and the cluster is made of 10 r3.4xlarge nodes:
32 vCore, 122 GiB memory, 320 GB SSD storage, EBS storage: 200 GiB
The configuration uses maximizeResourceAllocation set to true, and I've only changed --num-executors / --executor-cores to 5.
Does anyone know why the cluster goes "idle" and doesn't finish the task? (It eventually crashes without errors 3 hours later.)
EDIT:
I made some progress by removing all Glue catalog connections and downgrading Hadoop to use hadoop-aws:2.7.3.
Now the saveAsTable works just fine, but once it finishes, I see the executors being removed and the cluster going idle, and the step doesn't finish.
Thus my problem is still the same.
What I found out after many tries and headaches is that the cluster is still running / processing.
It is actually trying to write the data, but only from the master node.
Surprisingly enough, this doesn't show up in the UI, which gives the impression that the cluster is idle.
The writing takes a few hours, no matter what I do (repartition(1), bigger cluster, etc.).
The main problem here is the saveAsTable; I have no clue what it is doing that takes so long or makes the writing so slow.
So I went for write.parquet("hdfs:///tmp_loc") locally on the cluster, and then used aws s3-dist-cp to copy from HDFS to the S3 folder.
The performance is outstanding: I went from a saveAsTable taking 3 to 5 hours to write 17k rows / 120 MB down to 3 minutes.
As the data / schema might change at some point, I just execute a Glue save from a SQL request.
I am also facing the same issue. Is it related to the new version, EMR 5.27?
For me the job also gets stuck on one executor for a very long time. It completes 99% of the executors, and this happens while reading the files.

Dynamic resource allocation for spark applications not working

I am new to Spark and trying to figure out how dynamic resource allocation works. I have a Spark Structured Streaming application which tries to read a million records at a time from Kafka and process them. My application always starts with 3 executors and never increases the number of executors.
It takes 5-10 minutes to finish the processing. I thought it would increase the number of executors (up to 10) and try to finish the processing sooner, but that is not happening. What am I missing here? How is this supposed to work?
I have set the properties below in Ambari for Spark:
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.initialExecutors = 3
spark.dynamicAllocation.maxExecutors = 10
spark.dynamicAllocation.minExecutors = 3
spark.shuffle.service.enabled = true
Below is what my submit command looks like:
/usr/hdp/3.0.1.0-187/spark2/bin/spark-submit --class com.sb.spark.sparkTest.sparkTest --master yarn --deploy-mode cluster --queue default sparkTest-assembly-0.1.jar
Spark code
//read stream
val dsrReadStream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", brokers) //kafka bokers
.option("startingOffsets", startingOffsets) // start point to read
.option("maxOffsetsPerTrigger", maxoffsetpertrigger) // no. of records per batch
.option("failOnDataLoss", "true")
/****
Logic to validate format of loglines. Writing invalid log lines to kafka and store valid log lines in 'dsresult'
****/
//write stream
val dswWriteStream =dsresult.writeStream
.outputMode(outputMode) // file write mode, default append
.format(writeformat) // file format ,default orc
.option("path",outPath) //hdfs file write path
.option("checkpointLocation", checkpointdir) location
.option("maxRecordsPerFile", 999999999)
.trigger(Trigger.ProcessingTime(triggerTimeInMins))
Just to clarify further,
spark.streaming.dynamicAllocation.enabled=true
works only for the DStreams API. See this Jira.
Also, if you set
spark.dynamicAllocation.enabled=true
and run a Structured Streaming job, the batch dynamic allocation algorithm kicks in, which may not be very optimal. See this Jira.
Dynamic resource allocation does not work with Spark Streaming.
Refer to this link.
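As a rough sketch (my own addition, not part of the answer above): since the posted spark-submit command doesn't pass any of these settings, they could also be supplied explicitly when the session is built, mirroring the Ambari values from the question. The shuffle service itself still has to be enabled on the YARN NodeManagers.
import org.apache.spark.sql.SparkSession

// Sketch only: the (batch) dynamic-allocation settings from the question,
// set explicitly instead of relying on cluster defaults.
val spark = SparkSession.builder()
  .appName("sparkTest")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "3")
  .config("spark.dynamicAllocation.minExecutors", "3")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()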

Spark Failure : Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: 5454002341

I am generating a hierarchy for a table, determining the parent-child relationships.
Below is the configuration used; even with these settings I still get the "too large frame" error:
Spark properties
--conf spark.yarn.executor.memoryOverhead=1024mb \
--conf yarn.nodemanager.resource.memory-mb=12288mb \
--driver-memory 32g \
--driver-cores 8 \
--executor-cores 32 \
--num-executors 8 \
--executor-memory 256g \
--conf spark.maxRemoteBlockSizeFetchToMem=15g
import org.apache.log4j.{Level, Logger};
import org.apache.spark.SparkContext;
import org.apache.spark.sql.{DataFrame, SparkSession};
import org.apache.spark.sql.functions._;
import org.apache.spark.sql.expressions._;
lazy val sparkSession = SparkSession.builder.enableHiveSupport().getOrCreate();
import spark.implicits._;
val hiveEmp: DataFrame = sparkSession.sql("select * from db.employee");
hiveEmp.repartition(300);
import org.apache.spark.sql.functions._;
val nestedLevel = 3;
val empHierarchy = (1 to nestedLevel).foldLeft(hiveEmp.as("wd0")) { (wDf, i) =>
  val j = i - 1
  wDf.join(hiveEmp.as(s"wd$i"), col(s"wd$j.parent_id".trim) === col(s"wd$i.id".trim), "left_outer")
}.select(
  col("wd0.id") :: col("wd0.parent_id") ::
    col("wd0.amount").as("amount") :: col("wd0.payment_id").as("payment_id") :: (
      (1 to nestedLevel).toList.map(i => col(s"wd$i.amount").as(s"amount_$i")) :::
      (1 to nestedLevel).toList.map(i => col(s"wd$i.payment_id").as(s"payment_id_$i"))
    ): _*);
empHierarchy.write.saveAsTable("employee4");
Error
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
... 3 more
Caused by: org.apache.spark.shuffle.FetchFailedException: Too large frame: 5454002341
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
Use this Spark config: spark.maxRemoteBlockSizeFetchToMem < 2g.
Since there are a lot of issues with partitions larger than 2 GB (they cannot be shuffled, cannot be cached on disk), Spark throws the FetchFailedException for the too-large frame.
Suresh is right. Here's a better documented & formatted version of his answer with some useful background info:
bug report (link to the fix is at the very bottom)
fix (fixed as of 2.2.0 - already mentioned by Jared)
change of config's default value (changed as of 2.4.0)
If you're on version 2.2.x or 2.3.x, you can achieve the same effect by setting the config to Int.MaxValue - 512, i.e. by setting spark.maxRemoteBlockSizeFetchToMem=2147483135. See here for the default value used as of September 2019.
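For illustration only (my sketch, not part of the answer above), the workaround value could be applied when the session is created, since the setting has to be in place before any shuffle runs; it could equally be passed as --conf spark.maxRemoteBlockSizeFetchToMem=2147483135 on spark-submit.
import org.apache.spark.sql.SparkSession

// Sketch: cap remote block fetch-to-memory just below 2 GB (Int.MaxValue - 512 = 2147483135).
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.maxRemoteBlockSizeFetchToMem", (Int.MaxValue - 512).toString)
  .getOrCreate()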
This means that the size of your dataset partitions is enormous. You need to repartition your dataset into more partitions.
You can do this using:
df.repartition(n)
Here, n is dependent on the size of your dataset.
I got the exact same error when trying to backfill a few years of data. It turns out it's because your partitions are larger than 2 GB.
You can either bump up the number of partitions (using repartition()) so that your partitions are under 2 GB (keep your partitions close to 128 MB to 256 MB, i.e. close to the HDFS block size),
or you can bump up the shuffle limit to > 2 GB as mentioned above (avoid this). Also, partitions with a large amount of data will result in tasks that take a long time to finish.
Note: repartition(n) will result in n part files per partition during the write to S3/HDFS.
Read this for more info:
http://www.russellspitzer.com/2018/05/10/SparkPartitions/
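As a rough illustration of the sizing guidance above (my own sketch; the input size is an assumed figure, not from the original post), one way to pick n is to target partitions near the HDFS block size:
// Sketch: pick a partition count targeting ~128 MB per partition.
// totalSizeBytes is assumed; in practice take it from the input files or the Spark UI.
val totalSizeBytes = 500L * 1024 * 1024 * 1024      // e.g. ~500 GB of input/shuffle data
val targetPartitionBytes = 128L * 1024 * 1024       // ~128 MB, close to the HDFS block size
val numPartitions = math.max(1, (totalSizeBytes / targetPartitionBytes).toInt)

// df stands for the DataFrame being written, as in the answer above.
val repartitioned = df.repartition(numPartitions)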
I was experiencing the same issue while working on a ~700 GB dataset. Decreasing spark.maxRemoteBlockSizeFetchToMem didn't help in my case. In addition, I wasn't able to increase the number of partitions.
Doing the following worked for me:
Increasing spark.network.timeout (default value is 120 seconds in Spark 2.3), which affects the following:
spark.core.connection.ack.wait.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout
spark.rpc.lookupTimeout
Setting spark.network.timeout=600s (default is 120s in Spark 2.3)
Setting spark.io.compression.lz4.blockSize=512k (default is 32k in Spark 2.3)
Setting spark.shuffle.file.buffer=1024k (default is 32k in Spark 2.3)
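Purely as a sketch (not from the original answer), those values could be bundled onto the session builder, or passed as --conf flags to spark-submit:
import org.apache.spark.sql.SparkSession

// Sketch: the timeout / compression / shuffle-buffer values listed above.
val spark = SparkSession.builder()
  .config("spark.network.timeout", "600s")
  .config("spark.io.compression.lz4.blockSize", "512k")
  .config("spark.shuffle.file.buffer", "1024k")
  .getOrCreate()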
The config below worked for me:
Keep spark.sql.shuffle.partitions and spark.default.parallelism at the same number.
Keep spark.maxRemoteBlockSizeFetchToMem < 2 GB.
Set spark.shuffle.spill.compress and spark.shuffle.compress to "true".
"spark.maxRemoteBlockSizeFetchToMem": "2147483135",
"spark.sql.shuffle.partitions": "3000",
"spark.default.parallelism": "3000",
"spark.shuffle.spill.compress": "true",
"spark.shuffle.compress": "true"

Spark java.lang.OutOfMemoryError : Java Heap space [duplicate]

This question already has answers here:
Spark java.lang.OutOfMemoryError: Java heap space
(14 answers)
Closed 1 year ago.
I am getting the above error when I run a model training pipeline with Spark:
val inputData = spark.read
  .option("header", true)
  .option("mode", "DROPMALFORMED")
  .csv(input)
  .repartition(500)
  .toDF("b", "c")
  .withColumn("b", lower(col("b")))
  .withColumn("c", lower(col("c")))
  .toDF("b", "c")
  .na.drop()
inputData has about 25 million rows and is about 2 GB in size. The model building phase happens like so:
val tokenizer = new Tokenizer()
  .setInputCol("c")
  .setOutputCol("tokens")

val cvSpec = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .setMinDF(minDF)
  .setVocabSize(vocabSize)

val nb = new NaiveBayes()
  .setLabelCol("bi")
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setSmoothing(smoothing)

new Pipeline().setStages(Array(tokenizer, cvSpec, nb)).fit(inputData)
I am running the above Spark job locally on a machine with 16 GB of RAM using the following command:
spark-submit --class holmes.model.building.ModelBuilder ./holmes-model-building/target/scala-2.11/holmes-model-building_2.11-1.0.0-SNAPSHOT-7d6978.jar --master local[*] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=2g --conf spark.rpc.message.maxSize=1024 --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=50g --driver-memory=12g
The OOM error is triggered by (at the bottom of the stack trace):
by org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:706)
Logs :
Caused by: java.lang.OutOfMemoryError: Java heap space
  at java.lang.reflect.Array.newInstance(Array.java:75)
  at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1897)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
  at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:706)
Any suggestions will be great :)
Things I would try:
1) Removing spark.memory.offHeap.enabled=true and increasing driver memory to something like 90% of the available memory on the box. You are probably aware of this since you didn't set executor memory, but in local mode the driver and the executors all run in the same process, which is controlled by driver-memory. I haven't tried it, but the offHeap feature sounds like it has limited value. Reference
2) An actual cluster instead of local mode. More nodes will obviously give you more RAM.
3a) If you want to stick with local mode, try using fewer cores. You can do this by specifying the number of cores to use in the master setting, e.g. --master local[4] instead of local[*], which uses all of them. Running with fewer threads simultaneously processing data means less data in RAM at any given time.
3b) If you move to a cluster, you may also want to tweak the number of executor cores for the same reason as mentioned above. You can do this with the --executor-cores flag.
4) Try more partitions. In your example code you repartitioned to 500 partitions; maybe try 1000, or 2000? More partitions mean each partition is smaller, which means less memory pressure (see the sketch below).
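To make point 4 concrete, here is an illustrative tweak of the read from the question (the partition count is a guess to try, not a tested value):
import org.apache.spark.sql.functions.{col, lower}

// Sketch only: bump the partition count so each partition holds less data.
// Combine with --master local[4] from point 3a to lower concurrent memory pressure.
val inputData = spark.read
  .option("header", true)
  .option("mode", "DROPMALFORMED")
  .csv(input)
  .repartition(2000)   // was 500; more, smaller partitions
  .toDF("b", "c")
  .withColumn("b", lower(col("b")))
  .withColumn("c", lower(col("c")))
  .na.drop()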
Usually, this error is thrown when there is insufficient space to allocate an object in the Java heap. In this case, the garbage collector cannot make space available to accommodate a new object, and the heap cannot be expanded further. This error may also be thrown when there is insufficient native memory to support the loading of a Java class. In rare instances, a java.lang.OutOfMemoryError may be thrown when an excessive amount of time is being spent doing garbage collection and little memory is being freed.
How to fix the error:
How to set Apache Spark Executor memory
Spark java.lang.OutOfMemoryError: Java heap space
