My Spark is installed with CDH 5.8.0 and runs its applications on YARN. There are 5 servers in the cluster: one server is the resource manager, and the other four are node managers. Each server has 2 cores and 8 GB of memory.
The Spark application's main logic is not complex: query a table from a Postgres database, apply some business logic to each record, and finally save the results back to the database. Here is the main code:
// Partitioned JDBC read of table1 on the "id" column
String columnName = "id";
long lowerBound = 1;
long upperBound = 100000;
int numPartitions = 20;
String tableBasic = "select * from table1 order by id";
DataFrame dfBasic = sqlContext.read().jdbc(JDBC_URL, tableBasic, columnName,
        lowerBound, upperBound, numPartitions, dbProperties);

// Apply the business logic to each row, producing zero or more Result records
JavaRDD<Result> rddResult = dfBasic.javaRDD().flatMap(new FlatMapFunction<Row, Result>() {
    @Override
    public Iterable<Result> call(Row row) {
        List<Result> list = new ArrayList<Result>();
        ........
        return list;
    }
});

// Save the results to table2
DataFrame saveDF = sqlContext.createDataFrame(rddResult, Result.class);
saveDF = saveDF.select("id", "column 1", "column 2");
saveDF.write().mode(SaveMode.Append).jdbc(SQL_CONNECTION_URL, "table2", dbProperties);
I use this command to submit the application to YARN:
spark-submit --master yarn-cluster --executor-memory 6G --executor-cores 2 --driver-memory 6G --conf spark.default.parallelism=90 --conf spark.storage.memoryFraction=0.4 --conf spark.shuffle.memoryFraction=0.4 --conf spark.executor.memory=3G --class com.Main1 jar1-0.0.1.jar
There are 7 executors and 20 partitions. When the table is small, for example fewer than 200,000 records, the 20 active tasks are assigned to the 7 executors evenly, like this:
(screenshot: tasks assigned evenly across executors)
But when the table is large, for example 1,000,000 records, the tasks are not assigned evenly. There is always one executor that runs for a long time while the others finish quickly, and some executors are not assigned any tasks at all. Like this:
(screenshot: one executor runs far longer than the others)
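For reference, a rough sketch (assuming Spark's usual JDBC column partitioning, in which lowerBound and upperBound only decide the split points and do not filter rows) of how the 20 ranges fall out of the settings above; every id above upperBound lands in the last partition, which would keep a single executor busy:

// Illustrative only, not the exact Spark internals.
val lowerBound = 1L
val upperBound = 100000L
val numPartitions = 20
val stride = (upperBound - lowerBound) / numPartitions   // roughly 5,000 ids per partition

// The generated WHERE clauses look roughly like:
//   partition 0 : id < 5001                       (plus everything below lowerBound)
//   partition 1 : id >= 5001 AND id < 10001
//   ...
//   partition 19: id >= 95001                     (plus everything ABOVE upperBound)
// With 1,000,000 rows, ids 100,001 to 1,000,000 (about 90% of the data)
// all fall into partition 19 and are processed by a single task.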
I am running a Glue ETL transformation job. This job is supposed to read data from S3 and convert it to Parquet.
Below is the Glue source; sourcePath is the location of the S3 files.
In this location we have around 100 million JSON files, all of them nested into sub-folders.
So that is the reason I am applying an exclusionPattern to exclude files that do not start with a, and I believe that only the files starting with a (which are around 2.7 million files) will be processed.
val file_paths = Array(sourcePath)
val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""

glueContext
  .getSourceWithFormat(connectionType = "s3",
    options = JsonOptions(Map(
      "paths" -> file_paths,
      "recurse" -> true,
      "groupFiles" -> "inPartition",
      "exclusions" -> s"[$exclusionPattern]"
    )),
    format = "json",
    transformationContext = "sourceDF"
  )
  .getDynamicFrame()
  .map(transformRow, "error in row")
  .toDF()
After running this job with both the Standard and G2 worker types, I keep getting the error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27788"...
And in CloudWatch I can see that the driver memory utilisation reaches 100%, but executor memory usage is almost nil.
When running the job I am setting spark.driver.memory=10g and spark.driver.memoryOverhead=4096 via the --conf job parameter.
These are the details from the logs:
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000
--conf spark.hadoop.fs.defaultFS=hdfs://ip-myip.compute.internal:1111
--conf spark.hadoop.yarn.resourcemanager.address=ip-myip.compute.internal:1111
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.executor.memory=20g
--conf spark.executor.cores=16
--conf spark.driver.memory=20g
--conf spark.default.parallelism=80
--conf spark.sql.shuffle.partitions=80
--conf spark.network.timeout=600
--job-bookmark-option job-bookmark-disable
--TempDir s3://my-location/admin
--class com.example.ETLJob
--enable-spark-ui true
--enable-metrics
--JOB_ID j_111...
--spark-event-logs-path s3://spark-ui
--conf spark.driver.memory=20g
--JOB_RUN_ID jr_111...
--conf spark.driver.memoryOverhead=4096
--scriptLocation s3://my-location/admin/Job/ETL
--SOURCE_DATA_LOCATION s3://xyz/
--job-language scala
--DESTINATION_DATA_LOCATION s3://xyz123/
--JOB_NAME ETL
Any ideas what could be the issue?
Thanks
If you have too many files, you are probably overwhelming the driver. Try using useS3ListImplementation. This is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses.
Try adding:
"useS3ListImplementation" -> true
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/
As suggested by @eman...
I applied all three of groupFiles, groupSize and useS3ListImplementation, like below:
options = JsonOptions(Map(
  "path" -> sourcePath,
  "recurse" -> true,
  "groupFiles" -> "inPartition",
  "groupSize" -> 104857600, // 100 MB
  "useS3ListImplementation" -> true
))
And that is working for me. There is also an option of "acrossPartitions" if the data is not arranged properly.
When using TensorFlow Java for inference, the amount of memory needed to make the job run on YARN is abnormally large. The job runs perfectly with Spark on my computer (2 cores, 16 GB of RAM) and takes 35 minutes to complete. But when I try to run it on YARN with 10 executors, each with 16 GB of memory and 16 GB of memoryOverhead, the executors are killed for using too much memory.
Predictions run on a Hortonworks cluster with YARN 2.7.3 and Spark 2.2.1. Previously we used DL4J for inference and everything ran in under 3 minutes.
Tensors are correctly closed after usage and we use mapPartitions to do the prediction. Each task contains approximately 20,000 records (1 MB), so this makes an input tensor of 2,000,000x14 and an output tensor of 2,000,000 (5 MB).
Options passed to Spark when running on YARN:
--master yarn --deploy-mode cluster --driver-memory 16G --num-executors 10 --executor-memory 16G --executor-cores 2 --conf spark.driver.memoryOverhead=16G --conf spark.yarn.executor.memoryOverhead=16G --conf spark.sql.shuffle.partitions=200 --conf spark.tasks.cpu=2
This configuration may work if we set spark.sql.shuffle.partitions=2000, but then it takes 3 hours.
UPDATE:
The difference between local and cluster was in fact due to a missing filter; we were actually running the prediction on more data than we thought.
To reduce the memory footprint of each partition, you should create batches inside each partition (using grouped(batchSize)). This is faster than running the prediction for each row, and you allocate tensors of a predetermined size (batchSize). If you investigate the code of the TensorFlowOnSpark Scala inference, this is what they do. Below you will find a reworked example of such an implementation; this code may not compile, but you get the idea of how to do it.
import java.nio.FloatBuffer
import org.tensorflow.{SavedModelBundle, Tensor}

// Load the model once per executor (lazy, so it is initialised on the worker)
lazy val sess = SavedModelBundle.load(modelPath, "serve").session
lazy val numberOfFeatures = 1
lazy val laggedFeatures = Seq("cost_day1", "cost_day2", "cost_day3")
lazy val numberOfOutputs = 1

val predictionsRDD = preprocessedData.rdd.mapPartitions { partition =>
  partition.grouped(batchSize).flatMap { batchPreprocessed =>
    val numberOfLines = batchPreprocessed.size
    val featuresShape: Array[Long] = Array(numberOfLines, laggedFeatures.size / numberOfFeatures, numberOfFeatures)

    // Fill one buffer with every feature of the batch (rows x features floats)
    val featuresBuffer: FloatBuffer = FloatBuffer.allocate(numberOfLines * laggedFeatures.size)
    for (
      featuresWithKey <- batchPreprocessed;
      feature <- featuresWithKey.features
    ) {
      featuresBuffer.put(feature)
    }
    featuresBuffer.flip()

    // One tensor of predetermined size per batch instead of one per row
    val featuresTensor = Tensor.create(featuresShape, featuresBuffer)

    val results: Tensor[_] = sess.runner
      .feed("cost", featuresTensor)
      .fetch("prediction")
      .run.get(0)

    val output = Array.ofDim[Float](results.numElements(), numberOfOutputs)
    val outputArray: Array[Array[Float]] = results.copyTo(output)

    // Release the native memory held by the tensors explicitly
    results.close()
    featuresTensor.close()
    outputArray
  }
}

spark.createDataFrame(predictionsRDD)
We use FloatBuffer and Shape to create the Tensor, as recommended in this issue.
I have a Spark SQL statement that inserts data into a Hive external partitioned table. The insert takes more than 30 minutes to complete for just 200k records.
I have tried increasing spark.executor.memoryOverhead to 4096, but I still see the same time for the insert statement.
These are the values given for the execution:
--executor-cores 4 --executor-memory 3G --num-executors 25 --conf spark.executor.memoryOverhead=4096 --driver-memory 4g
Spark Code:
Table_1.createOrReplaceTempView(tempViewName)
config = self.context.get_config()
insert_query = config['tables']['hive']['1']['insertStatement']
insertStatement = insert_query + tempViewName
self.spark.sql(insertStatement)
self.logger.info("************insert completed************")
repairTableQuery = config['tables']['hive']['training']['repairtable']
self.spark.sql(repairTableQuery)
self.logger.info("************repair completed************")
end = datetime.now()
Would doing a coalesce on the partitions before the insert statement help speed up the execution?
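For what it's worth, a minimal sketch (in Scala, reusing the names above as plain variables; the target of 10 partitions is only an example) of what coalescing before the insert could look like:

// Hypothetical sketch: coalesce the DataFrame behind the temp view before
// running the same INSERT ... SELECT, so fewer and larger files are written.
val coalesced = Table_1.coalesce(10)            // reduce to 10 partitions / write tasks
coalesced.createOrReplaceTempView(tempViewName)
spark.sql(insertStatement)                      // unchanged insert statement

Whether this actually helps depends on how many small files the current plan produces; coalescing too aggressively can also leave executors idle during the write.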
I am trying to append some rows (5 million rows / 2,800 columns) to a Hive table through Spark/Scala, but the process seems to get stuck after long hours. The logs don't show any errors.
How can I be sure the process is really running?
Is there something to do to optimize the job?
My submit configs:
--driver-memory 15G
--executor-memory 30g
--num-executors 35
--executor-cores 5
Thanks!
// For every column in allCols, keep it if it exists in myCols,
// otherwise fill it with a literal 0.0 aliased to that column name.
def exprToAppend(myCols: Set[String], allCols: Set[String]) = {
  import org.apache.spark.sql.functions._
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(0d).as(x)
  }
}

// Align tableFinal to the schema of historico, then append to the Hive table
val insert: DataFrame = tableFinal
  .select(exprToAppend(tableFinal.columns.toSet, historico.columns.toSet): _*)
  .select(historico.columns.map(x => col(x)): _*)

insert.write.mode("append")
  .format("parquet")
  .insertInto(s"${Configuration.SIGLA}${Configuration.TABLE_HIST}")
I want to run spark-shell in YARN mode with a certain number of cores.
The command I use is as follows:
spark-shell --num-executors 25 --executor-cores 4 --executor-memory 1G \
--driver-memory 1G --conf spark.yarn.executor.memoryOverhead=2048 --master yarn \
--conf spark.driver.maxResultSize=10G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
-i input.scala
input.scala looks something like this
import java.io.ByteArrayInputStream
// Plaintext sum on 10M rows
def aggrMapPlain(iter: Iterator[Long]): Iterator[Long] = {
  var res = 0L
  while (iter.hasNext) {
    val cur = iter.next
    res = res + cur
  }
  List[Long](res).iterator
}

val pathin_plain = <some file>
val rdd0 = sc.sequenceFile[Int, Long](pathin_plain)
val plain_table = rdd0.map(x => x._2).cache
plain_table.count

0 to 200 foreach { i =>
  println("Plain - 10M rows - Run " + i + ":")
  plain_table.mapPartitions(aggrMapPlain).reduce((x, y) => x + y)
}
On executing this the Spark UI first spikes to about 40 cores, and then settles at 26 cores.
On the recommendation of this, I changed the following in my yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>101</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>101</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>102400</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value>
</property>
But I still cannot force Spark to use 100 cores, which I need as I am benchmarking against earlier tests.
I am using Apache Spark 1.6.1.
Each node on the cluster including the driver has 16 cores and 112GB of memory.
They are on Azure (hdinsight cluster).
2 driver nodes + 7 worker nodes.
I'm unfamiliar with Azure, but I guess YARN is YARN, so you should make sure that you have
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
in capacity-scheduler.xml.
(See this similar question and answer)
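For reference, in capacity-scheduler.xml that setting takes the same <property> form as the yarn-site.xml snippets above:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

The default DefaultResourceCalculator sizes containers by memory only, so vcore requests are largely ignored; DominantResourceCalculator takes CPU into account as well.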