Spark process never ends when inserting into a Hive table - apache-spark

I am trying to append some rows (5 million rows / 2800 columns) to a Hive table through Spark/Scala, but the process seems to get stuck after running for many hours. The logs don't show any errors.
How can I be sure the process is really running?
Is there something to do to optimize the job?
My submit configs:
--driver-memory 15G
--executor-memory 30g
--num-executors 35
--executor-cores 5
Thanks!
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

// Keep the columns that exist in the source; fill the missing ones with 0.0.
def exprToAppend(myCols: Set[String], allCols: Set[String]): List[Column] =
  allCols.toList.map {
    case c if myCols.contains(c) => col(c)
    case c => lit(0d).as(c)
  }

// Build every target column, then reorder to match the table's column order.
val insert: DataFrame = tableFinal
  .select(exprToAppend(tableFinal.columns.toSet, historico.columns.toSet): _*)
  .select(historico.columns.map(x => col(x)): _*)

insert.write.mode("append")
  .format("parquet")
  .insertInto(s"${Configuration.SIGLA}${Configuration.TABLE_HIST}")

Related

TensorFlow Java uses too much memory with Spark on YARN

When using TensorFlow Java for inference, the amount of memory needed to make the job run on YARN is abnormally large. The job runs perfectly with Spark on my computer (2 cores, 16 GB of RAM) and takes 35 minutes to complete. But when I try to run it on YARN with 10 executors, 16 GB of memory and 16 GB of memoryOverhead, the executors are killed for using too much memory.
The prediction runs on a Hortonworks cluster with YARN 2.7.3 and Spark 2.2.1. Previously we used DL4J for inference and everything ran in under 3 minutes.
Tensors are correctly closed after use, and we use mapPartitions to do the prediction. Each task contains approximately 20.000 records (1 MB), so this makes an input tensor of 2.000.000x14 and an output tensor of 2.000.000 (5 MB).
Options passed to Spark when running on YARN:
--master yarn --deploy-mode cluster --driver-memory 16G --num-executors 10 --executor-memory 16G --executor-cores 2 --conf spark.driver.memoryOverhead=16G --conf spark.yarn.executor.memoryOverhead=16G --conf spark.sql.shuffle.partitions=200 --conf spark.tasks.cpu=2
This configuration may work if we set spark.sql.shuffle.partitions=2000, but it takes 3 hours.
UPDATE:
The difference between the local and cluster runs was in fact due to a missing filter: we were actually running the prediction on more data than we thought.
To reduce the memory footprint of each partition, you must create batches inside each partition (use grouped(batchSize)). This is faster than running predict for each row, and you allocate tensors of a predetermined size (batchSize). If you look at the code of the TensorFlowOnSpark Scala inference, this is what they did. Below is a reworked example of an implementation. This code may not compile, but you get the idea of how to do it.
import java.nio.FloatBuffer
import org.tensorflow.{SavedModelBundle, Tensor}

lazy val sess = SavedModelBundle.load(modelPath, "serve").session
lazy val numberOfFeatures = 1
lazy val laggedFeatures = Seq("cost_day1", "cost_day2", "cost_day3")
lazy val numberOfOutputs = 1

val predictionsRDD = preprocessedData.rdd.mapPartitions { partition =>
  // Score the partition in fixed-size batches instead of all at once.
  partition.grouped(batchSize).flatMap { batchPreprocessed =>
    val numberOfLines = batchPreprocessed.size
    val featuresShape: Array[Long] = Array(numberOfLines, laggedFeatures.size / numberOfFeatures, numberOfFeatures)
    // The buffer must hold every value of the tensor: rows x lagged features.
    val featuresBuffer: FloatBuffer = FloatBuffer.allocate(numberOfLines * laggedFeatures.size)
    for (
      featuresWithKey <- batchPreprocessed;
      feature <- featuresWithKey.features
    ) {
      featuresBuffer.put(feature)
    }
    featuresBuffer.flip()
    val featuresTensor = Tensor.create(featuresShape, featuresBuffer)
    val results: Tensor[_] = sess.runner
      .feed("cost", featuresTensor)
      .fetch("prediction")
      .run.get(0)
    val output = Array.ofDim[Float](results.numElements(), numberOfOutputs)
    val outputArray: Array[Array[Float]] = results.copyTo(output)
    // Release native memory as soon as the batch has been copied out.
    results.close()
    featuresTensor.close()
    outputArray
  }
}
spark.createDataFrame(predictionsRDD)
We use FloatBuffer and Shape to create the Tensor, as recommended in this issue.
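The batching idea on its own, without TensorFlow, can be sketched in plain Scala. This is only an illustration: `predictBatch` is a hypothetical stand-in for the model call (here it just sums each row), and `GroupedBatchSketch` is not part of any library. The point is that `grouped(batchSize)` on the partition iterator materialises one batch at a time, so the memory per task is bounded by the batch size rather than by the partition size.

```scala
object GroupedBatchSketch {
  // Hypothetical stand-in for a model call that scores a whole batch at once.
  def predictBatch(batch: Seq[Array[Float]]): Seq[Float] = batch.map(_.sum)

  // Consume one partition's iterator in fixed-size batches.
  // Only the current batch is held in memory; the last batch may be smaller.
  def predictPartition(rows: Iterator[Array[Float]], batchSize: Int): Iterator[Float] =
    rows.grouped(batchSize).flatMap(batch => predictBatch(batch))

  def main(args: Array[String]): Unit = {
    val rows = Iterator.tabulate(10)(i => Array(i.toFloat, 1f))
    println(predictPartition(rows, batchSize = 4).toList)
  }
}
```

In a Spark job the same function would be passed to mapPartitions, with the batch size tuned so that one input tensor plus one output tensor fits comfortably in the executor's memory overhead.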

Spark SQL insert into Hive external partitioned table takes a long time

I have a Spark SQL statement that inserts data into a Hive external partitioned table. The insert takes more than 30 minutes to complete for just 200k records.
I have tried increasing executor.memoryOverhead to 4096, but the insert statement still takes the same time.
These are the values given for the execution:
--executor-cores 4 --executor-memory 3G --num-executors 25 --conf spark.executor.memoryOverhead=4096 --driver-memory 4g
Spark Code:
Table_1.createOrReplaceTempView(tempViewName)
config = self.context.get_config()
insert_query = config['tables']['hive']['1']['insertStatement']
insertStatement = insert_query + tempViewName
self.spark.sql(insertStatement)
self.logger.info("************insert completed************")
repairTableQuery = config['tables']['hive']['training']['repairtable']
self.spark.sql(repairTableQuery)
self.logger.info("************repair completed************")
end = datetime.now()
Would doing a coalesce on the partitions before the insert statement help it execute faster?

Why does Spark partition all the data onto one executor?

I am working with Spark GraphX. I am building a graph from a file (around 620 MB, 50K vertices and almost 50 million edges). I am using a Spark cluster with 4 workers, each with 8 cores and 13.4 GB of RAM, plus 1 driver with the same specs. When I submit my .jar to the cluster, one of the workers, seemingly at random, loads all the data. All the tasks needed for the computation are sent to that worker, while the remaining three sit idle. I have tried everything and have not found anything that forces the computation to run on all of the workers.
When Spark builds the graph, the vertices RDD reports 5 partitions. If I repartition that RDD, for example into 32 partitions (the total number of cores), Spark loads the data on every worker, but the computation gets slower.
I am launching spark-submit this way:
spark-submit --master spark://172.30.200.20:7077 --driver-memory 12g --executor-memory 12g --class interscore.InterScore /root/interscore/interscore.jar hdfs://172.30.200.20:9000/user/hadoop/interscore/network.dat hdfs://172.30.200.20:9000/user/hadoop/interscore/community.dat 111
The code is here:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object InterScore extends App {
  val sparkConf = new SparkConf().setAppName("Big-InterScore")
  val sc = new SparkContext(sparkConf)
  val t0 = System.currentTimeMillis
  runInterScore(args(0), args(1), args(2))
  println("Running time " + (System.currentTimeMillis - t0).toDouble / 1000)
  sc.stop()

  def runInterScore(netPath: String, communitiesPath: String, outputPath: String) = {
    // (vertexId, communityId) pairs, one per tab-separated line
    val communities = sc.textFile(communitiesPath).map(x => {
      val a = x.split('\t')
      (a(0).toLong, a(1).toInt)
    }).cache
    val graph = GraphLoader.edgeListFile(sc, netPath, true)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .groupEdges(_ + _)
      .joinVertices(communities)((_, _, c) => c)
      .cache
    // For each vertex, count the edges that cross a community boundary
    val lvalues = graph.aggregateMessages[Double](
      m => {
        m.sendToDst(if (m.srcAttr != m.dstAttr) 1 else 0)
        m.sendToSrc(if (m.srcAttr != m.dstAttr) 1 else 0)
      }, _ + _)
    val communitiesIndices = communities.map(x => x._2).distinct.collect
    val verticesWithLValue = graph.vertices.repartition(32).join(lvalues).cache
    println("K = " + communitiesIndices.size)
    graph.unpersist()
    graph.vertices.unpersist()
    communitiesIndices.foreach(c => {
      //COMPUTE c
    })
  }
}

Structured Streaming with mapGroupsWithState causing GC and performance issues

In our application we use Structured Streaming with mapGroupsWithState, reading from Kafka.
After starting the application, performance is good during the initial batches: the Kafka lastProgress shows almost 65K per second. After a few batches, throughput drops sharply to around 2000 per second.
The mapGroupsWithState function basically performs an update and a comparison against the value from the state store (code snippet provided below).
Number of offsets from Kafka - 100000
A thread dump from one of the executors shows nothing suspicious except blocked threads in the Spark UI.
GC stats from one of the executors do not seem to show much difference after GC.
Code Snippet
case class MonitoringEvent(InternalID: String, monStartTimestamp: Timestamp, EndTimestamp: Timestamp, Stream: String, ParentID: Option[String])

val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", Config.uatKafkaUrl)
  .option("subscribe", Config.interBranchInputTopic)
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "true")
  .option("maxOffsetsPerTrigger", "100000")
  .option("request.required.acks", "all")
  .load()
  .selectExpr("CAST(value AS STRING)")

val me: Dataset[MonitoringEvent] = df.select(from_json($"value", schema).as("data")).select($"data.*").as[MonitoringEvent]

val IB = me.groupByKey(x => (x.ParentID.getOrElse(x.InternalID)))
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(IBTransformer.mappingFunctionIB _)
  .flatMap(x => x)

val IBStream = IB
  .select(to_json(struct($"*")).as("value"), $"InternalID".as("key"))
  .writeStream
  .format("kafka")
  .queryName("InterBranch_Events_KafkaWriter")
  .option("kafka.bootstrap.servers", Config.uatKafkaUrl)
  .option("topic", Config.interBranchTopicComplete)
  .option("checkpointLocation", Config.interBranchCheckPointDir)
  .outputMode("update")
  .start()

object IBTransformer extends Serializable {
  case class IBStateStore(InternalID: String, monStartTimestamp: Timestamp)

  def mappingFunctionIB(intrKey: String, intrValue: Iterator[MonitoringEvent], intrState: GroupState[IBStateStore]): Seq[MonitoringEvent] = {
    try {
      if (intrState.hasTimedOut) {
        intrState.remove()
        Seq.empty
      } else {
        val events = intrValue.toSeq
        if (events.map(_.Status).contains(Started)) {
          val tmp = events.filter(x => (x.Status == Started && x.InternalID == intrKey)).head
          val toStore = IBStateStore(tmp.InternalID, tmp.monStartTimestamp)
          intrState.update(toStore)
          intrState.setTimeoutDuration(1200000)
        }
        val IB = events.filter(_.ParentID.isDefined)
        if (intrState.exists && IB.nonEmpty) {
          val startEvent = intrState.get
          val IBUpdate = IB.map { x => x.copy(InternalID = startEvent.InternalID, monStartTimestamp = startEvent.monStartTimestamp) }
          IBUpdate.foreach(id => intrState.update((IBStateStore(id.InternalID, id.monStartTimestamp)))) // updates the state with new IDs
          IBUpdate
        } else {
          Seq.empty
        }
      }
    }
    catch
    .
    .
    .
  }
}
Number of executors used - 8
Executor memory - 8G
Driver memory - 8G
Java options and memory I provide in my spark-submit script:
--executor-memory 8G \
--executor-cores 8 \
--num-executors 4 \
--driver-memory 8G \
--driver-java-options "-Dsun.security.krb5.debug=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dconfig.file=configIB.conf -Dlog4j.configuration=IBprocessor.log4j.properties" \
I tried using G1GC in the Java options, but there was no improvement. The number of keys we hold is also smaller than the size provided, so I am not sure where it is going wrong.
Any suggestions to improve performance and eliminate the GC issues?

How to balance Spark task assignment?

My Spark is installed with CDH5 5.8.0 and runs its applications on YARN. There are 5 servers in the cluster. One server is the resource manager; the other four are node managers. Each server has 2 cores and 8 GB of memory.
The main logic of the Spark application is not complex: query a table from a Postgres DB, do some business processing for each record, and finally save the result to the DB. Here is the main code:
String columnName = "id";
long lowerBound = 1;
long upperBound = 100000;
int numPartitions = 20;
String tableBasic = "select * from table1 order by id";
DataFrame dfBasic = sqlContext.read().jdbc(JDBC_URL, tableBasic, columnName, lowerBound, upperBound, numPartitions, dbProperties);
// The RDD element type must match the function's output type (Result).
JavaRDD<Result> rddResult = dfBasic.javaRDD().flatMap(new FlatMapFunction<Row, Result>() {
    public Iterable<Result> call(Row row) {
        List<Result> list = new ArrayList<Result>();
        ........
        return list;
    }
});
DataFrame saveDF = sqlContext.createDataFrame(rddResult, Result.class);
saveDF = saveDF.select("id", "column 1", "column 2");
saveDF.write().mode(SaveMode.Append).jdbc(SQL_CONNECTION_URL, "table2", dbProperties);
I use this command to submit the application to YARN:
spark-submit --master yarn-cluster --executor-memory 6G --executor-cores 2 --driver-memory 6G --conf spark.default.parallelism=90 --conf spark.storage.memoryFraction=0.4 --conf spark.shuffle.memoryFraction=0.4 --conf spark.executor.memory=3G --class com.Main1 jar1-0.0.1.jar
There are 7 executors and 20 partitions. When the table is small, for example fewer than 200000 records, the 20 active tasks are spread evenly across the 7 executors.
(Image: tasks assigned evenly)
But when the table is large, for example 1000000 records, the tasks are not spread evenly. One executor always runs for a long time while the others finish quickly, and some executors are assigned no tasks at all.
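One thing worth checking, as a rough sketch rather than a diagnosis: in Spark's JDBC reader, lowerBound and upperBound only set the partition stride and do not filter rows, so every id above upperBound lands in the last partition. The plain-Scala model below is a simplification of that rule (`JdbcStrideSketch` and its methods are hypothetical names, not Spark API), and it assumes the ids are dense from 1 to the row count, which the question does not state.

```scala
object JdbcStrideSketch {
  // Simplified model of range partitioning on a numeric column:
  // the bounds only fix the stride; the first and last partitions are open-ended.
  def partitionIndex(id: Long, lower: Long, upper: Long, numPartitions: Int): Int = {
    val stride = upper / numPartitions - lower / numPartitions
    require(stride > 0, "bounds too close together for this many partitions")
    val idx = (id - lower) / stride
    math.max(0L, math.min(numPartitions - 1L, idx)).toInt
  }

  // Count how many ids in 1..maxId fall into each partition (assumes dense ids).
  def rowsPerPartition(maxId: Long, lower: Long, upper: Long, numPartitions: Int): Vector[Long] = {
    val counts = Array.fill(numPartitions)(0L)
    var id = 1L
    while (id <= maxId) {
      counts(partitionIndex(id, lower, upper, numPartitions)) += 1
      id += 1
    }
    counts.toVector
  }

  def main(args: Array[String]): Unit = {
    // Values from the question: bounds 1..100000, 20 partitions, 1000000 rows.
    val counts = rowsPerPartition(1000000L, 1L, 100000L, 20)
    counts.zipWithIndex.foreach { case (n, i) => println(s"partition $i: $n rows") }
  }
}
```

Under those assumptions the first 19 partitions get 5000 rows each and the last one gets the remaining 905000, which would match one long-running task while the others finish quickly. If that is the cause, setting upperBound close to the real maximum id should even out the partitions.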
