When using TensorFlow Java for inference, the amount of memory needed to make the job run on YARN is abnormally large. The job runs perfectly with Spark on my computer (2 cores, 16 GB of RAM) and takes 35 minutes to complete. But when I try to run it on YARN with 10 executors, 16 GB of memory and 16 GB of memoryOverhead, the executors are killed for using too much memory.
Prediction runs on a Hortonworks cluster with YARN 2.7.3 and Spark 2.2.1. Previously we used DL4J to do inference and everything ran in under 3 minutes.
Tensors are correctly closed after use and we use mapPartitions to do the prediction. Each task contains approximately 20,000 records (1 MB), so this will make an input tensor of 2,000,000x14 and an output tensor of 2,000,000 (5 MB).
Options passed to Spark when running on YARN:
--master yarn --deploy-mode cluster --driver-memory 16G --num-executors 10 --executor-memory 16G --executor-cores 2 --conf spark.driver.memoryOverhead=16G --conf spark.yarn.executor.memoryOverhead=16G --conf spark.sql.shuffle.partitions=200 --conf spark.tasks.cpu=2
This configuration may work if we set spark.sql.shuffle.partitions=2000, but it takes 3 hours.
UPDATE:
The difference between local and cluster was in fact due to a missing filter. We were actually running the prediction on more data than we thought.
To reduce the memory footprint of each partition, create batches inside each partition (use grouped(batchSize)). This is faster than running predict for each row, and you allocate tensors of a predetermined size (batchSize). If you look at the code of the TensorFlowOnSpark Scala inference, this is what they did. Below is a reworked example of an implementation. This code may not compile, but you get the idea of how to do it.
lazy val sess = SavedModelBundle.load(modelPath, "serve").session
lazy val numberOfFeatures = 1
lazy val laggedFeatures = Seq("cost_day1", "cost_day2", "cost_day3")
lazy val numberOfOutputs = 1
val predictionsRDD = preprocessedData.rdd.mapPartitions { partition =>
  // Process each partition in fixed-size batches to bound tensor memory
  partition.grouped(batchSize).flatMap { batchPreprocessed =>
    val numberOfLines = batchPreprocessed.size
    val featuresShape: Array[Long] = Array(numberOfLines, laggedFeatures.size / numberOfFeatures, numberOfFeatures)
    // The buffer must hold every feature value of the batch, not one value per line
    val featuresBuffer: FloatBuffer = FloatBuffer.allocate(numberOfLines * laggedFeatures.size)
    for (
      featuresWithKey <- batchPreprocessed;
      feature <- featuresWithKey.features
    ) {
      featuresBuffer.put(feature)
    }
    featuresBuffer.flip()
    val featuresTensor = Tensor.create(featuresShape, featuresBuffer)
    val results: Tensor[_] = sess.runner
      .feed("cost", featuresTensor)
      .fetch("prediction")
      .run.get(0)
    val output = Array.ofDim[Float](results.numElements(), numberOfOutputs)
    val outputArray: Array[Array[Float]] = results.copyTo(output)
    // Release native memory as soon as the values have been copied out
    results.close()
    featuresTensor.close()
    outputArray
  }
}
spark.createDataFrame(predictionsRDD)
We use FloatBuffer and Shape to create Tensor as recommended in this issue
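The batching idea can be tried without Spark or TensorFlow. The following is only a minimal sketch in plain Java: the class name Batching and both helpers are hypothetical, grouped imitates the lazy behaviour of Scala's Iterator.grouped, and tensorBytes assumes float32 features with a fixed number of values per row.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Spark or TensorFlow.
final class Batching {
    // Lazily splits an iterator into lists of at most batchSize elements,
    // similar to Scala's Iterator.grouped(batchSize).
    public static <T> Iterator<List<T>> grouped(Iterator<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return new Iterator<List<T>>() {
            @Override
            public boolean hasNext() {
                return rows.hasNext();
            }

            @Override
            public List<T> next() {
                if (!rows.hasNext()) {
                    throw new NoSuchElementException();
                }
                List<T> batch = new ArrayList<>(batchSize);
                while (rows.hasNext() && batch.size() < batchSize) {
                    batch.add(rows.next());
                }
                return batch;
            }
        };
    }

    // Bytes needed for one float32 input tensor of shape (rows, featuresPerRow).
    public static long tensorBytes(long rows, long featuresPerRow) {
        return rows * featuresPerRow * Float.BYTES;
    }

    public static void main(String[] args) {
        // One tensor for a whole 2,000,000 x 14 partition versus one per batch of 1,000 rows.
        System.out.println("whole partition: " + tensorBytes(2_000_000L, 14) + " bytes");
        System.out.println("batch of 1000:   " + tensorBytes(1_000L, 14) + " bytes");
    }
}
```

Only one batch is materialized at a time, so the peak size of the input tensor depends on batchSize rather than on the size of the partition.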
My Spark job currently runs in 59 minutes. I want to optimize it so that it takes less time. I have noticed that the last step of the job takes most of the time (55 minutes); see the screenshots of the job in the Spark UI below.
I need to join a big dataset with a smaller one, apply transformations on this joined dataset (creating a new column).
At the end, I should have a dataset repartitioned based on the column PSP (see snippet of the code below). I also perform a sort at the end (sort each partition based on 3 columns).
All the details (infrastructure, configuration, code) can be found below.
Snippet of my code:
spark.conf.set("spark.sql.shuffle.partitions", 4158)
val uh = uh_months
.withColumn("UHDIN", datediff(to_date(unix_timestamp(col("UHDIN_YYYYMMDD"), "yyyyMMdd").cast(TimestampType)),
to_date(unix_timestamp(col("january"), "yyyy-MM-dd").cast(TimestampType))))
"ddMMMyyyy")).cast(TimestampType)))
.withColumn("DVA_1", date_format(col("DVA"), "dd/MM/yyyy"))
.drop("UHDIN_YYYYMMDD")
.drop("january")
.drop("DVA")
.persist()
val uh_flag_comment = new TransactionType().transform(uh)
uh.unpersist()
val uh_joined = uh_flag_comment.join(broadcast(smallDF), "NO_NUM")
.select(
uh.col("*"),
smallDF.col("PSP"),
smallDF.col("minrel"),
smallDF.col("Label"),
smallDF.col("StartDate"))
.withColumnRenamed("DVA_1", "DVA")
smallDF.unpersist()
val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
EDITED - Repartition logic
val sqlContext = spark.sqlContext
sqlContext.udf.register("randomUDF", (partitionCount: Int) => {
val r = new scala.util.Random
r.nextInt(partitionCount)
// Also tried with r.nextInt(partitionCount) + col("PSP")
})
val uh_to_be_sorted = uh_joined
.withColumn("tmp", callUDF("randomUDF", lit(4158)))
.repartition(4158, col("tmp"))
.drop(col("tmp"))
val uh_final = uh_to_be_sorted.sortWithinPartitions(col("NO_NUM"), col("UHDIN"), col("HOURMV"))
uh_final
smallDF is a small dataset (535MB) that I broadcast.
TransactionType is a class where I add a new column of string elements to my uh dataframe based on the value of 3 columns (MMED, DEBCRED, NMTGP), checking the values of those columns using regex.
I previously faced a lot of issues (job failing) because of shuffle blocks that were not found. I discovered that I was spilling to disk and had a lot of GC memory issues so I increased the "spark.sql.shuffle.partitions" to 4158.
WHY 4158 ?
Partition_count = (stage input data) / (target size of your partition)
so Shuffle partition_count = (shuffle stage input data) / 200 MB = 860000/200=4300
I have 16*24 - 6 = 378 cores available. If I want every wave of tasks to use all of the cores, the partition count should be a multiple of 378: 4300 / 378 is approximately 11, and 11 * 378 = 4158.
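The arithmetic above can be written down as a small helper. This is only a sketch: the class and method names are hypothetical, and the inputs (860,000 MB of shuffle input, a 200 MB target partition size, 378 available cores) are the figures quoted in the question.

```java
// Hypothetical helper that reproduces the partition-count arithmetic from the question.
final class ShufflePartitions {
    // Rounds the raw partition count down to a whole number of task "waves",
    // so that each wave can occupy every available core.
    public static int partitionsForFullWaves(long shuffleInputMb, long targetPartitionMb, int availableCores) {
        long rawPartitions = shuffleInputMb / targetPartitionMb;  // 860000 / 200
        long waves = Math.max(1, rawPartitions / availableCores); // whole waves only
        return (int) (waves * availableCores);
    }

    public static void main(String[] args) {
        int cores = 16 * 24 - 6; // 24 nodes with 16 vcores each, 6 cores used elsewhere
        System.out.println(partitionsForFullWaves(860_000L, 200L, cores));
    }
}
```

Note that this only balances the number of tasks per wave. It does not help if the rows themselves are unevenly distributed across partitions.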
Spark Version: 2.1
Cluster configuration:
24 compute nodes (workers)
16 vcores each
90 GB RAM per node
6 cores are already being used by other processes/jobs
Current Spark configuration:
-master: yarn
-executor-memory: 26G
-executor-cores: 5
-driver memory: 70G
-num-executors: 70
-spark.kryoserializer.buffer.max=512
-spark.driver.cores=5
-spark.driver.maxResultSize=500m
-spark.memory.storageFraction=0.4
-spark.memory.fraction=0.9
-spark.hadoop.fs.permissions.umask-mode=007
How is the job executed:
We build an artifact (jar) with IntelliJ and then send it to a server. Then a bash script is executed. This script:
exports some environment variables (SPARK_HOME, HADOOP_CONF_DIR, PATH and SPARK_LOCAL_DIRS)
launches the spark-submit command with all the parameters defined in the Spark configuration above
retrieves the yarn logs of the application
Spark UI screenshots
DAG
@Ali
From the Summary Metrics we can say that your data is skewed: the max duration is 49 min and the max shuffle read size/records is 2.5 GB / 23,947,440, whereas an average task takes about 4-5 min and processes less than 200 MB / 1.2 million rows.
Now that we know the problem is probably data skew in a few partitions, I think we can fix it by changing the repartition logic val uh_to_be_sorted = uh_joined.repartition(4158, col("PSP")): choose a different key (some other column, or PSP combined with another column).
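To see why adding a second component to the key helps, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: String.hashCode modulo the partition count stands in for Spark's hash partitioning, the sample data (900 rows with key "A", 100 with key "B") is made up, and the salt is derived from the row index so that the result is deterministic, whereas in a real job it would more likely be a random value or another column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of key salting; not Spark code.
final class SaltedKeys {
    // Stand-in for a hash partitioner: maps a key to one of numPartitions buckets.
    public static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Number of rows in the largest partition when rows are placed by key.
    public static int maxPartitionSize(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            sizes[partitionOf(key, numPartitions)]++;
        }
        int max = 0;
        for (int size : sizes) {
            max = Math.max(max, size);
        }
        return max;
    }

    // Appends a salt in [0, saltBuckets) to every key so that one hot key
    // is spread over up to saltBuckets different partitions.
    public static List<String> salt(List<String> keys, int saltBuckets) {
        List<String> salted = new ArrayList<>(keys.size());
        for (int i = 0; i < keys.size(); i++) {
            salted.add(keys.get(i) + "#" + (i % saltBuckets));
        }
        return salted;
    }

    // Made-up skewed data: 1000 rows, every tenth row has key "B", the rest "A".
    public static List<String> skewedSample() {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i % 10 == 0 ? "B" : "A");
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = skewedSample();
        System.out.println("largest partition, key only:   " + maxPartitionSize(keys, 8));
        System.out.println("largest partition, salted key: " + maxPartitionSize(salt(keys, 8), 8));
    }
}
```

The trade-off is that rows with the same PSP no longer land in the same partition, so this only fits if the downstream steps do not rely on that co-location.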
A few links on data skew and how to fix it:
https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by
https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Hope this helps
I have the Spark job code below, which works fine on the cluster with the configuration given further down.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = spark.read()
.textFile(path)
.javaRDD()
.map(line -> {
return new SomeClass(line);
});
Dataset<Row> responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
.format("text")
.save(path + "processed");
However, if I read a binary file (same size as the text file), it takes much more time.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = sparkContext
.binaryRecords(path, 10000, new Configuration())
.toJavaRDD()
.map(line -> {
return new SomeClass(line);
});
Dataset responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
.format("text")
.save(path + "processed");
Below is my configuration.
driver-memory 8g
executor-memory 6g
num-executors 16
Time taken by first code with 150 MB file is 1.30 mins.
Time taken by second code with 150 MB file is 4 mins.
Also, first code was able to run on all 16 executors whereas second uses only one.
Any suggestions why it is slow?
I found the issue. The textFile() method was creating 16 partitions (you can check the number of partitions using the getNumPartitions() method on the RDD), whereas binaryRecords() created only 1 (the Java binaryRecords does not provide an overload that specifies the number of partitions to create).
I increased the number of partitions of the RDD created by binaryRecords() by calling repartition(NUM_OF_PARTITIONS) on it.
My Spark is installed on CDH5 5.8.0 and runs its applications on YARN. There are 5 servers in the cluster: one is the resource manager and the other four are node managers. Each server has 2 cores and 8 GB of memory.
The main logic of the Spark application is not complex: query a table from a Postgres database, do some business processing for each record, and finally save the result to the database. Here is the main code:
String columnName="id";
long lowerBound=1;
long upperBound=100000;
int numPartitions=20;
String tableBasic="select * from table1 order by id";
DataFrame dfBasic = sqlContext.read().jdbc(JDBC_URL, tableBasic, columnName, lowerBound, upperBound,numPartitions, dbProperties);
JavaRDD<Result> rddResult = dfBasic.javaRDD().flatMap(new FlatMapFunction<Row, Result>() {
public Iterable<Result> call(Row row) {
List<Result> list = new ArrayList<Result>();
........
return list;
}
});
DataFrame saveDF = sqlContext.createDataFrame(rddResult, Result.class);
saveDF = saveDF.select("id", "column 1", "column 2");
saveDF.write().mode(SaveMode.Append).jdbc(SQL_CONNECTION_URL, "table2", dbProperties);
I use this command to submit the application to YARN:
spark-submit --master yarn-cluster --executor-memory 6G --executor-cores 2 --driver-memory 6G --conf spark.default.parallelism=90 --conf spark.storage.memoryFraction=0.4 --conf spark.shuffle.memoryFraction=0.4 --conf spark.executor.memory=3G --class com.Main1 jar1-0.0.1.jar
There are 7 executors and 20 partitions. When the table is small, for example fewer than 200,000 records, the 20 active tasks are spread evenly across the 7 executors, like this:
(screenshot: tasks assigned evenly)
But when the table is large, for example 1,000,000 records, the tasks are not spread evenly. There is always one executor that runs for a long time while the others finish quickly, and some executors get no tasks at all. Like this:
(screenshot omitted)
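One thing worth checking here is how the partitioned JDBC read splits the id range. Spark's documentation states that lowerBound and upperBound only decide the partition stride and do not filter rows, so every id above upperBound ends up in the last partition. The sketch below is plain Java, not Spark code: the class and method names are hypothetical, the stride formula follows my understanding of how Spark computes it, and it assumes ids are contiguous from 1 to the row count.

```java
// Hypothetical model of how a partitioned JDBC read assigns ids to partitions.
final class JdbcStride {
    // The first partition takes every id below lowerBound + stride, the last
    // partition takes every id from its lower edge upwards, with no upper limit.
    public static int partitionOf(long id, long lowerBound, long upperBound, int numPartitions) {
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        if (stride <= 0 || id < lowerBound + stride) {
            return 0;
        }
        return (int) Math.min((id - lowerBound) / stride, numPartitions - 1);
    }

    // Counts rows per partition, assuming ids run contiguously from 1 to maxId.
    public static long[] rowsPerPartition(long maxId, long lowerBound, long upperBound, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (long id = 1; id <= maxId; id++) {
            counts[partitionOf(id, lowerBound, upperBound, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same bounds as in the question (1 to 100000, 20 partitions), but 1,000,000 rows.
        long[] counts = rowsPerPartition(1_000_000L, 1L, 100_000L, 20);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("partition " + i + ": " + counts[i] + " rows");
        }
    }
}
```

If this model matches the real table, raising upperBound to roughly the maximum id should spread the rows more evenly across the 20 partitions.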
I am trying to multiply against a large matrix that is stored in parquet format, so am being careful not to store the RDD in memory, but am getting an OOM error from the parquet reader:
15/12/06 05:23:36 WARN TaskSetManager: Lost task 950.0 in stage 4.0
(TID 28398, 172.31.34.233): java.lang.OutOfMemoryError: Java heap space
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
...
Specifically, the matrix is a 46752-by-54843120 dense matrix of 32-bit floats that is stored in parquet format (each row is about 1.7GB uncompressed).
The following code loads this matrix as a Spark IndexedRowMatrix and multiplies it by a random vector (the rows are stored with an associated string label, and the floats have to be converted to doubles because IndexedRows can only use doubles):
val rows = {
sqlContext.read.parquet(datafname).rdd.map {
case SQLRow(rowname: String, values: WrappedArray[Float]) =>
// DenseVectors have to be doubles
val vector = new DenseVector(values.toArray.map(v => v.toDouble))
new IndexedRow(indexLUT(rowname), vector)
}
}
val nrows : Long = 46752
val ncols = 54843120
val A = new IndexedRowMatrix(rows, nrows, ncols)
A.rows.unpersist() // doesn't help avoid OOM
val x = new DenseMatrix(ncols, 1, BDV.rand(ncols).data)
A.multiply(x).rows.collect
I am using the following options when running
--driver-memory 220G
--num-executors 203
--executor-cores 4
--executor-memory 25G
--conf spark.storage.memoryFraction=0
There are 25573 partitions to the parquet file, so the uncompressed Float values of each partition should be less than 4Gb; I expect this should imply that the current executor memory is much more than sufficient (I cannot raise the executor-memory setting).
Any ideas why this is running into OOM errors and how to fix it? As far as I can tell, there's no reason for the parquet reader to be storing anything.
I cannot process a graph with 230M edges.
I cloned Apache Spark, built it and then tried it on a cluster.
I use Spark Standalone Cluster:
-5 machines (each has 12 cores/32GB RAM)
-'spark.executor.memory' == 25g
-'spark.driver.memory' == 3g
The graph has 231359027 edges, and its file is 4,524,716,369 bytes.
Graph is represented in text format:
sourceVertexId destinationVertexId
My code:
object Canonical {
def main(args: Array[String]) {
val numberOfArguments = 3
require(args.length == numberOfArguments, s"""Wrong argument number. Should be $numberOfArguments .
|Usage: <path_to_graph> <partitioner_name> <minEdgePartitions> """.stripMargin)
var graph: Graph[Int, Int] = null
val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
val partitionerName = args(1)
val minEdgePartitions = args(2).toInt
val sc = new SparkContext(new SparkConf()
.setSparkHome(System.getenv("SPARK_HOME"))
.setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts ")
.setJars(SparkContext.jarOfClass(this.getClass).toList))
graph = GraphLoader.edgeListFile(sc, args(0), false, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
vertexStorageLevel = StorageLevel.MEMORY_AND_DISK, minEdgePartitions = minEdgePartitions)
graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))
println(graph.edges.collect.length)
println(graph.vertices.collect.length)
}
}
After running it I got a number of java.lang.OutOfMemoryError: Java heap space errors, and of course I did not get a result.
Do I have a problem in the code, or in the cluster configuration?
It works fine for relatively small graphs, but for this graph it has never worked. (And I do not think that 230M edges is that much data.)
Thank you for any advice!
RESOLVED
I had not given the driver program enough memory.
I've changed the cluster configuration to:
-4 workers (each has 12 cores/32GB RAM)
-1 master with the driver program (12 cores/32GB RAM)
-'spark.executor.memory' == 25g
-'spark.driver.memory' == 25g
It was also not a good idea to collect all vertices and edges just to count them. It is easier to do just this: graph.vertices.count and graph.edges.count
What I suggest is you do a binary search to find the maximal size of data the cluster can handle. Take 50% of the graph and see if that works. If it does, try 75%. Etc.
My rule of thumb is you need 20–30× the memory for a given size of input. For 4.5 GB this suggests the limit would be around 100 GB. You have exactly that amount. I have no experience with GraphX: it probably adds another multiplier to the memory use. In my opinion you simply don't have enough memory.
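Both suggestions can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, jobSucceeds stands for actually running the job on the given fraction of the graph, and the 20x and 30x multipliers are the rule of thumb quoted above, not a measured value.

```java
import java.util.function.DoublePredicate;

// Hypothetical helper for probing how much of the input a cluster can handle.
final class CapacityProbe {
    // Bisects the fraction of the input: tries 50% first, then 75% or 25%, and so on.
    // Returns the largest fraction that was observed to succeed.
    public static double largestWorkingFraction(DoublePredicate jobSucceeds, int steps) {
        double works = 0.0; // largest fraction known to succeed
        double fails = 1.0; // the full input is known to fail
        for (int i = 0; i < steps; i++) {
            double mid = (works + fails) / 2;
            if (jobSucceeds.test(mid)) {
                works = mid;
            } else {
                fails = mid;
            }
        }
        return works;
    }

    // Rule-of-thumb memory estimate: input size times a multiplier (20 to 30).
    public static long estimatedMemoryBytes(long inputBytes, int multiplier) {
        return inputBytes * multiplier;
    }

    public static void main(String[] args) {
        long inputBytes = 4_524_716_369L; // size of the edge list file from the question
        System.out.println("20x: " + estimatedMemoryBytes(inputBytes, 20) + " bytes");
        System.out.println("30x: " + estimatedMemoryBytes(inputBytes, 30) + " bytes");
        // Pretend the cluster copes with up to 60% of the graph.
        System.out.println(largestWorkingFraction(fraction -> fraction <= 0.6, 4));
    }
}
```

Each call to jobSucceeds is a full run of the job, so in practice a handful of steps is usually enough.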