spark master goes down with out of memory exception - apache-spark

I have 1 Spark master and 2 slave nodes set up on AWS with 8 GB of memory each. I have set up a Spark job to run every hour; it reads that hour's records from a Cassandra database and processes them in Spark. There are around 5000 records every hour. My Spark master crashed in one of the runs saying
"15/12/20 11:04:45 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-4436] shutting down ActorSystem [sparkMaster]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.math.BigInt$.apply(BigInt.scala:82)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:16)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:793)
at org.apache.spark.deploy.master.Master.removeApplication(Master.scala:734)
at org.apache.spark.deploy.master.Master.org$apache$spark$deploy$master$Master$$finishApplication(Master.scala:712)
at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$28.apply(Master.scala:445)
at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$28.apply(Master.scala:445)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1.applyOrElse(Master.scala:445)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.deploy.master.Master.aroundReceive(Master.scala:52)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
"
Can you please let me know why the Spark master crashed with out of memory? This is my Spark setup:
_executorMemory=6G
_driverMemory=6G
and I create 8 partitions in my code.
Why does the master go down with out of memory?
Here is the code:
// create the Spark context
_sparkContext = new SparkContext(_conf)

// load the Cassandra table as a DataFrame
val tabledf = _sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load

// restrict to the current hour's window
val whereQuery = "addedtime >= '" + _from + "' AND addedtime < '" + _to + "'"
helpers.printnextLine("Where query to run on Cassandra : " + whereQuery)
val rdd = tabledf.filter(whereQuery)
rdd.registerTempTable("rdd")

val selectQuery = "lower(brandname) as brandname, lower(appname) as appname, lower(packname) as packname, lower(assetname) as assetname, eventtime, lower(eventname) as eventname, lower(client.OSName) as platform, lower(eventorigin) as eventorigin, meta.price as price"
val modefiedDF = _sqlContext.sql("select " + selectQuery + " from rdd")

// cache the DataFrame
modefiedDF.cache

// perform the groupBy operation
val grprdd = modefiedDF.groupBy("brandname", "appname", "packname", "eventname", "platform", "eventorigin", "price").count()

// write each aggregated row to the SQL Server table
// (con and insertQuery are created elsewhere in the job)
grprdd.foreachPartition { iter =>
  iter.foreach { element =>
    val statement = con.createStatement()
    try {
      statement.executeUpdate(insertQuery)
    } finally {
      if (con != null)
        con.close()
    }
  }
}

// clear the cache
_sqlContext.clearCache()

The problem may be that you are asking Spark to use 6 GB for the driver and another 6 GB for the executor (12 GB in total), while each machine only has 8 GB of RAM available.
Of those 8 GB you should also leave some memory for OS processes (say 1 GB). That leaves only about 7 GB of RAM for Spark (master and worker combined).
Set executorMemory and driverMemory accordingly.
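A minimal sketch of what that could look like, assuming the _executorMemory/_driverMemory values feed a SparkConf (the app name below is hypothetical, and the 3 GB figures are just one way to fit inside 8 GB per node):

import org.apache.spark.SparkConf

// sketch only: 3 GB each for driver and executor leaves roughly 2 GB
// per 8 GB node for the OS and the Spark daemons
val _conf = new SparkConf()
  .setAppName("hourly-cassandra-job")    // hypothetical app name
  .set("spark.executor.memory", "3g")
  .set("spark.driver.memory", "3g")      // only effective before the driver JVM starts (e.g. cluster mode); otherwise pass --driver-memory to spark-submit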

Related

Spark Out of memory issue : forEachPartition

We are processing a roughly 500 MB data file in EMR.
I am performing the following operations on the file.
Read the CSV:
val df = spark.read.format("csv").load(s3)
Aggregate by key and create the list:
val data = filteredDf.groupBy($"<key>")
.agg(collect_list(struct(cols.head, cols.tail: _*)) as "finalData")
.toJSON
Iterate through each partition, store the per-key aggregation to S3, and send the key to SQS:
data.foreachPartition(partition => {
partition.foreach(json => ......)
})
The data is skewed, with one account having almost 10M records (~400 MB). I am experiencing an out of memory issue during foreachPartition for that account.
Configuration:
1 driver : m4.4xlarge CPU Cores : 16 and Memory : 64GB
1 executor : m4.2xlarge CPU Cores : 8 and Memory : 32GB
driver-memory: 20G
executor-memory: 10G
Partitions : default 200 [ most of them don't do anything ]
Any help is much appreciated! thanks a lot in advance :)

Spark GraphFrames High Shuffle read/write

Hi, I have created a graph using vertex and edge files. The size of the graph is 600 GB. I am querying this graph using the motif feature of Spark GraphFrames.
I have setup an AWS EMR cluster for querying graph.
cluster details:- 1 master and 8 slaves
Master Node:
m5.xlarge
4 vCore, 16 GiB memory, EBS only storage
EBS Storage:64 GiB
Slave Node:
m5.4xlarge
16 vCore, 64 GiB memory, EBS only storage
EBS Storage:256 GiB (per instance)
I am facing very high shuffle read (3.4 TB) and write (2 TB); this is affecting performance and it takes around 50 minutes to execute only 10 queries. Is there any way to reduce such high shuffle?
Following is my Spark code:
val spark = SparkSession.builder.appName("SparkGraph POC").getOrCreate()
val g:GraphFrame = GraphFrame(vertexDf, edgeDf)
//queries
val q1 = g.find(" (a)-[r1]->(b); (b)-[r2]->(c)")
q1.filter(
" r1.relationship = 'knows' and" +
" r2.relationship = 'knows'").distinct()
.createOrReplaceTempView("q1table")
spark.sql("select a.id as a_id,a.name as a_name," +
"b.id as b_id,b.name as b_name," +
"c.id as c_id,c.name as c_name from q1table")
.write
.option("quote", "\"")
.option("escape", "\"")
.option("header","true")
.csv(resFilePath + "/q1")
spark.catalog.uncacheTable("q1table")
val q2 = g.find(" (a)-[r1]->(b); (b)-[r2]->(c); (c)-[r3]->(d); (d)-[r4]->(e)")
q2.filter(
" a.name = 'user1' and" +
" e.name = 'user4' and" +
" r1.relationship = 'knows' and" +
" r2.relationship = 'knows' and" +
" r3.relationship = 'knows' and" +
" r4.relationship = 'knows'").distinct()
.createOrReplaceTempView("q2table")
spark.sql("select a.id as a_id, a.name as a_name ," +
"e.id as e_id, e.name as e_name from q2table")
.write
.option("quote", "\"")
.option("escape", "\"")
.option("header","true")
.csv(resFilePath + "/q2")
spark.catalog.uncacheTable("q2table")
spark.stop()
The problem with the implementation of GraphFrames is that it self-joins the internal DataFrames as many times as there are edges in the motif. That means you will get more and more shuffle as the length of the chain increases.
You can see more details at https://www.waitingforcode.com/apache-spark-graphframes/motifs-finding-graphframes/read
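To illustrate, a two-hop motif is roughly an edge-to-edge self-join; here is a minimal sketch of the equivalent plain DataFrame join, assuming the usual GraphFrames edge schema (src, dst, relationship):

import org.apache.spark.sql.functions.col

// roughly what g.find("(a)-[r1]->(b); (b)-[r2]->(c)") does internally:
// one self-join of the edge DataFrame per additional hop, and each join is a shuffle
val r1 = edgeDf.alias("r1")
val r2 = edgeDf.alias("r2")
val twoHop = r1.join(r2, col("r1.dst") === col("r2.src"))   // 2-hop chain: one shuffle join
// the 4-hop motif in q2 adds r3 and r4, i.e. two more shuffle joins over the same 600 GB of edges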
I have also tried a similar approach and have seen that when the length of the chain is greater than 12, Spark becomes unresponsive and connections to executors are lost, even if I increased resources.
If you are trying to do that, I would recommend using a graph database instead.
Hope this helps

Optimization Spark job - Spark 2.1

My Spark job currently runs in 59 minutes. I want to optimize it so that it takes less time. I have noticed that the last step of the job takes a lot of time (55 minutes) (see the screenshots of the Spark job in the Spark UI below).
I need to join a big dataset with a smaller one, apply transformations on this joined dataset (creating a new column).
At the end, I should have a dataset repartitioned based on the column PSP (see snippet of the code below). I also perform a sort at the end (sort each partition based on 3 columns).
All the details (infrastructure, configuration, code) can be found below.
Snippet of my code :
spark.conf.set("spark.sql.shuffle.partitions", 4158)
val uh = uh_months
.withColumn("UHDIN", datediff(to_date(unix_timestamp(col("UHDIN_YYYYMMDD"), "yyyyMMdd").cast(TimestampType)),
to_date(unix_timestamp(col("january"), "yyyy-MM-dd").cast(TimestampType))))
.withColumn("DVA_1", date_format(col("DVA"), "dd/MM/yyyy"))
.drop("UHDIN_YYYYMMDD")
.drop("january")
.drop("DVA")
.persist()
val uh_flag_comment = new TransactionType().transform(uh)
uh.unpersist()
val uh_joined = uh_flag_comment.join(broadcast(smallDF), "NO_NUM")
.select(
uh.col("*"),
smallDF.col("PSP"),
smallDF.col("minrel"),
smallDF.col("Label"),
smallDF.col("StartDate"))
.withColumnRenamed("DVA_1", "DVA")
smallDF.unpersist()
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))
val uh_final = uh_joined.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
EDITED - Repartition logic
val sqlContext = spark.sqlContext
sqlContext.udf.register("randomUDF", (partitionCount: Int) => {
val r = new scala.util.Random
r.nextInt(partitionCount)
// Also tried with r.nextInt(partitionCount) + col("PSP")
})
val uh_to_be_sorted = uh_joined
.withColumn("tmp", callUDF("RandomUDF", lit("4158"))
.repartition(4158, col("tmp"))
.drop(col("tmp"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
smallDF is a small dataset (535MB) that I broadcast.
TransactionType is a class where I add a new column of string elements to my uh dataframe based on the value of 3 columns (MMED, DEBCRED, NMTGP), checking the values of those columns using regex.
I previously faced a lot of issues (job failing) because of shuffle blocks that were not found. I discovered that I was spilling to disk and had a lot of GC memory issues so I increased the "spark.sql.shuffle.partitions" to 4158.
WHY 4158?
partition_count = (stage input data) / (target size of your partition)
so shuffle partition_count = (shuffle stage input data) / 200 MB = 860000 MB / 200 MB = 4300
I have 16*24 - 6 = 378 cores available. So if I want to run every task in one go, I should divide 4300 by 378, which is approximately 11. Then 11*378 = 4158.
Spark Version: 2.1
Cluster configuration:
24 compute nodes (workers)
16 vcores each
90 GB RAM per node
6 cores are already being used by other processes/jobs
Current Spark configuration:
-master: yarn
-executor-memory: 26G
-executor-cores: 5
-driver memory: 70G
-num-executors: 70
-spark.kryoserializer.buffer.max=512
-spark.driver.cores=5
-spark.driver.maxResultSize=500m
-spark.memory.storageFraction=0.4
-spark.memory.fraction=0.9
-spark.hadoop.fs.permissions.umask-mode=007
How is the job executed:
We build an artifact (jar) with IntelliJ and then send it to a server. Then a bash script is executed. This script:
exports some environment variables (SPARK_HOME, HADOOP_CONF_DIR, PATH and SPARK_LOCAL_DIRS)
launches the spark-submit command with all the parameters defined in the Spark configuration above
retrieves the YARN logs of the application
Spark UI screenshots
DAG
#Ali
From the Summary Metrics we can say that your data is skewed (max duration: 49 min and max shuffle read size/records: 2.5 GB / 23,947,440, whereas on average a task takes about 4-5 minutes and processes less than 200 MB / 1.2 MM rows).
Now that we know the problem is likely skewed data in a few partitions, I think we can fix this by changing the repartition logic val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP")): choose something else to partition on (some other column, or add another column to PSP), as in the sketch below.
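A minimal salting sketch, assuming a hypothetical helper column "salt" (the salt range of 16 is arbitrary and would need tuning):

import org.apache.spark.sql.functions.{col, rand}

// spread the hot PSP keys over several partitions by repartitioning on (PSP, salt)
val salted = uh_joined.withColumn("salt", (rand() * 16).cast("int"))
val uh_to_be_sorted = salted.repartition(4158, col("PSP"), col("salt"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV")).drop("salt")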
A few links to refer to on data skew and possible fixes:
https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Hope this helps

Spark Dataframe leftanti Join Fails

We are trying to publish deltas from a Hive table to Kafka. The table in question is a single partition, single block file of 244 MB. Our cluster is configured for a 256M block size, so we're just about at the max for a single file in this case.
Each time that table is updated, a copy is archived, then we run our delta process.
In the function below, we have isolated the different joins and have confirmed that the inner join performs acceptably (about 3 minutes), but the two antijoin dataframes will not complete -- we keep throwing more resources at the Spark job, but are continuing to see the errors below.
Is there a practical limit on dataframe sizes for this kind of join?
private class DeltaColumnPublisher(spark: SparkSession, sink: KafkaSink, source: RegisteredDataset)
extends BasePublisher(spark, sink, source) with Serializable {
val deltaColumn = "hadoop_update_ts" // TODO: move to the dataset object
def publishDeltaRun(dataLocation: String, archiveLocation: String): (Long, Long) = {
val current = spark.read.parquet(dataLocation)
val previous = spark.read.parquet(archiveLocation)
val inserts = current.join(previous, keys, "leftanti")
val updates = current.join(previous, keys).where(current.col(deltaColumn) =!= previous.col(deltaColumn))
val deletes = previous.join(current, keys, "leftanti")
val upsertCounter = spark.sparkContext.longAccumulator("upserts")
val deleteCounter = spark.sparkContext.longAccumulator("deletes")
logInfo("sending inserts to kafka")
sink.sendDeltasToKafka(inserts, "U", upsertCounter)
logInfo("sending updates to kafka")
sink.sendDeltasToKafka(updates, "U", upsertCounter)
logInfo("sending deletes to kafka")
sink.sendDeltasToKafka(deletes, "D", deleteCounter)
(upsertCounter.value, deleteCounter.value)
}
}
The errors we're seeing seem to indicate that the driver is losing contact with the executors. We have increased the executor memory up to 24G, the network timeout as high as 900s, and the heartbeat interval as high as 120s.
17/11/27 20:36:18 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#596e3aa6,BlockManagerId(1, server, 46292, None))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at ...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at ...
Later in the logs:
17/11/27 20:42:37 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#25d1bd5f,BlockManagerId(1, server, 46292, None))] in 3 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at ...
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Could not find HeartbeatReceiver.
The config switches we have been manipulating (without success) are --executor-memory 24G --conf spark.network.timeout=900s --conf spark.executor.heartbeatInterval=120s
The option I failed to consider is to increase my driver resources. I added --driver-memory 4G and --driver-cores 2 and saw my job complete in about 9 minutes.
It appears that an inner join of these two files (or using the built-in except() method) puts memory pressure on the executors. Partitioning on one of the key columns seems to help ease that memory pressure, but increases overall time because there is more shuffling involved.
Doing the left-anti join between these two files requires that we have more driver resources. Didn’t expect that.
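For reference, a minimal sketch of the "partition on one of the key columns" variant mentioned above (assuming keys is the Seq[String] of join columns used in publishDeltaRun, and taking the first of them as the partitioning column):

import org.apache.spark.sql.functions.col

// pre-partition both sides on one of the join keys so each anti-join task
// handles less data, at the cost of an extra shuffle up front
val keyCol = keys.head
val currentByKey  = current.repartition(col(keyCol))
val previousByKey = previous.repartition(col(keyCol))
val inserts = currentByKey.join(previousByKey, keys, "leftanti")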

Spark Streaming - Same processing time for 4 cores and 16 cores. Why?

Scenario: I am doing some testing with Spark Streaming. A file with around 100 records comes in every 25 seconds.
Problem: The processing takes on average 23 seconds on a 4-core PC using local[*] in the program. When I deploy the same app to a server with 16 cores I was expecting an improvement in processing time; however, I see it still takes the same time on 16 cores (I also checked CPU usage in Ubuntu and the CPU is fully utilized). All the configurations are the defaults provided by Spark.
Question:
Shouldn't the processing time decrease as the number of cores available to the streaming job increases?
Code:
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getCanonicalName)
.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(25))
val sqc = new SQLContext(sc)
val gpsLookUpTable = MapInput.cacheMappingTables(sc, sqc).persist(StorageLevel.MEMORY_AND_DISK_SER_2)
val broadcastTable = sc.broadcast(gpsLookUpTable)
jsonBuilder.append("[")
ssc.textFileStream("hdfs://localhost:9000/inputDirectory/")
.foreachRDD { rdd =>
if (!rdd.partitions.isEmpty) {
val header = rdd.first().split(",")
val rowsWithoutHeader = Utils.dropHeader(rdd)
rowsWithoutHeader.foreach { row =>
jsonBuilder.append("{")
val singleRowArray = row.split(",")
(header, singleRowArray).zipped
.foreach { (x, y) =>
jsonBuilder.append(convertToStringBasedOnDataType(x, y))
// GEO Hash logic here
if (x.equals("GPSLat") || x.equals("Lat")) {
lattitude = y.toDouble
}
else if (x.equals("GPSLon") || x.equals("Lon")) {
longitude = y.toDouble
if (x.equals("Lon")) {
// This section is used to convert GPS Look Up to GPS LookUP with Hash
jsonBuilder.append(convertToStringBasedOnDataType("geoCode", GeoHash.encode(lattitude, longitude)))
}
else {
val selectedRow = broadcastTable.value
.filter("geoCode LIKE '" + GeoHash.subString(lattitude, longitude) + "%'")
.withColumn("Distance", calculateDistance(col("Lat"), col("Lon")))
.orderBy("Distance")
.select("TrackKM", "TrackName").take(1)
if (selectedRow.length != 0) {
jsonBuilder.append(convertToStringBasedOnDataType("TrackKm", selectedRow(0).get(0)))
jsonBuilder.append(convertToStringBasedOnDataType("TrackName", selectedRow(0).get(1)))
}
else {
jsonBuilder.append(convertToStringBasedOnDataType("TrackKm", "NULL"))
jsonBuilder.append(convertToStringBasedOnDataType("TrackName", "NULL"))
}
}
}
}
jsonBuilder.setLength(jsonBuilder.length - 1)
jsonBuilder.append("},")
}
sc.parallelize(Seq(jsonBuilder.toString)).repartition(1).saveAsTextFile("hdfs://localhost:9000/outputDirectory")
  }
}
It sounds like you are using only one thread; whether the application runs on a machine with 4 or 16 cores won't matter if that is the case.
It sounds like 1 file comes in, and that 1 file is 1 RDD partition with 100 rows. You iterate over the rows in that RDD and append to the jsonBuilder. At the end you call repartition(1), which makes the writing of the file single-threaded.
You could repartition your dataset to 12 RDD partitions after you pick up the file, to ensure that other threads work on the rows, as sketched below. But unless I am missing something you are lucky this isn't happening: what happens if two threads call jsonBuilder.append("{") at the same time? Won't they create invalid JSON? I could be missing something here.
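A minimal sketch of that repartition step (12 is just an example; anything greater than 1 and up to the number of available cores), assuming the shared jsonBuilder is first replaced with something thread-safe:

// spread the single input partition across 12 partitions so several
// executor threads can process rows in parallel
val rowsWithoutHeader = Utils.dropHeader(rdd).repartition(12)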
You could test whether I am correct about the single-threadedness of your application by adding logging like this:
scala> val rdd1 = sc.parallelize(1 to 10).repartition(1)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at repartition at <console>:21
scala> rdd1.foreach{ r => {println(s"${Thread.currentThread.getName()} => $r")} }
Executor task launch worker-40 => 1
Executor task launch worker-40 => 2
Executor task launch worker-40 => 3
Executor task launch worker-40 => 4
Executor task launch worker-40 => 5
Executor task launch worker-40 => 6
Executor task launch worker-40 => 7
Executor task launch worker-40 => 8
Executor task launch worker-40 => 9
Executor task launch worker-40 => 10
scala> val rdd3 = sc.parallelize(1 to 10).repartition(3)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at repartition at <console>:21
scala> rdd3.foreach{ r => {println(s"${Thread.currentThread.getName()} => $r")} }
Executor task launch worker-109 => 1
Executor task launch worker-108 => 2
Executor task launch worker-95 => 3
Executor task launch worker-95 => 4
Executor task launch worker-109 => 5
Executor task launch worker-108 => 6
Executor task launch worker-108 => 7
Executor task launch worker-95 => 8
Executor task launch worker-109 => 9
Executor task launch worker-108 => 10
