Spark SQL insert into Hive external partitioned table takes more time - apache-spark

I have a Spark SQL statement that inserts data into a Hive external partitioned table. The insert takes more than 30 minutes to complete for just 200k records.
I have tried increasing spark.executor.memoryOverhead to 4096, but the insert still takes the same amount of time.
These are the values given for the execution:
--executor-cores 4 --executor-memory 3G --num-executors 25 --conf spark.executor.memoryOverhead=4096 --driver-memory 4g
Spark Code:
# Register the DataFrame as a temp view so it can be referenced from the insert SQL.
Table_1.createOrReplaceTempView(tempViewName)

# The insert statement template comes from configuration; the view name is appended to it.
config = self.context.get_config()
insert_query = config['tables']['hive']['1']['insertStatement']
insertStatement = insert_query + tempViewName

# Run the insert into the Hive external partitioned table.
self.spark.sql(insertStatement)
self.logger.info("************insert completed************")

# Repair the table so the Hive metastore picks up the new partitions.
repairTableQuery = config['tables']['hive']['training']['repairtable']
self.spark.sql(repairTableQuery)
self.logger.info("************repair completed************")
end = datetime.now()
Would doing a coalesce on the DataFrame's partitions before the insert statement help it execute faster?
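For reference, a minimal PySpark sketch of what repartitioning by the Hive partition column before the insert could look like; source_table, tmp_view, target_db.target_table and part_col are hypothetical names standing in for the real ones above:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-insert-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partition inserts (hypothetical table and column names).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df = spark.table("source_table")

# Repartition by the Hive partition column so each task writes to only a
# few partitions instead of every task opening files in every partition.
df.repartition("part_col").createOrReplaceTempView("tmp_view")

# Assumes part_col is the last column of tmp_view, since dynamic partition
# inserts expect the partition column(s) at the end of the select list.
spark.sql("""
    INSERT INTO TABLE target_db.target_table PARTITION (part_col)
    SELECT * FROM tmp_view
""")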

Related

TensorFlow Java uses too much memory with Spark on YARN

When using TensorFlow Java for inference, the amount of memory needed to make the job run on YARN is abnormally large. The job runs perfectly with Spark on my computer (2 cores, 16 GB of RAM) and takes 35 minutes to complete. But when I try to run it on YARN with 10 executors, 16 GB of memory and 16 GB of memoryOverhead, the executors are killed for using too much memory.
Predictions run on a Hortonworks cluster with YARN 2.7.3 and Spark 2.2.1. Previously we used DL4J for inference and everything ran in under 3 minutes.
Tensors are correctly closed after use and we use mapPartitions to do the prediction. Each task contains approximately 20,000 records (1 MB), which makes an input tensor of 2,000,000x14 and an output tensor of 2,000,000 (5 MB).
Options passed to Spark when running on YARN:
--master yarn --deploy-mode cluster --driver-memory 16G --num-executors 10 --executor-memory 16G --executor-cores 2 --conf spark.driver.memoryOverhead=16G --conf spark.yarn.executor.memoryOverhead=16G --conf spark.sql.shuffle.partitions=200 --conf spark.tasks.cpu=2
This configuration may work if we set spark.sql.shuffle.partitions=2000, but then it takes 3 hours.
UPDATE:
The difference between local and cluster was in fact due to a missing filter; we actually ran the prediction on more data than we thought.
To reduce the memory footprint of each partition you must create batches inside each partition (use grouped(batchSize)). This is faster than running predict for each row, and you allocate tensors of a predetermined size (batchSize). If you look at the TensorFlowOnSpark Scala inference code, this is what they do. Below is a reworked example of such an implementation; this code may not compile as-is, but you get the idea of how to do it.
import java.nio.FloatBuffer

import org.tensorflow.{SavedModelBundle, Tensor}

// Loaded lazily so each executor opens the saved model only once.
lazy val sess = SavedModelBundle.load(modelPath, "serve").session
lazy val numberOfFeatures = 1
lazy val laggedFeatures = Seq("cost_day1", "cost_day2", "cost_day3")
lazy val numberOfOutputs = 1

val predictionsRDD = preprocessedData.rdd.mapPartitions { partition =>
  // Work in fixed-size batches instead of one prediction per row.
  partition.grouped(batchSize).flatMap { batchPreprocessed =>
    val numberOfLines = batchPreprocessed.size
    val featuresShape: Array[Long] =
      Array(numberOfLines, laggedFeatures.size / numberOfFeatures, numberOfFeatures)

    // Buffer sized for the whole batch: rows * features per row.
    val featuresBuffer: FloatBuffer =
      FloatBuffer.allocate(numberOfLines * laggedFeatures.size)
    for (
      featuresWithKey <- batchPreprocessed;
      feature <- featuresWithKey.features
    ) {
      featuresBuffer.put(feature)
    }
    featuresBuffer.flip()

    val featuresTensor = Tensor.create(featuresShape, featuresBuffer)
    val results: Tensor[_] = sess.runner
      .feed("cost", featuresTensor)
      .fetch("prediction")
      .run.get(0)

    // Copy the predictions into a plain array, then release the native tensors.
    val output = Array.ofDim[Float](numberOfLines, numberOfOutputs)
    val outputArray: Array[Array[Float]] = results.copyTo(output)
    results.close()
    featuresTensor.close()
    outputArray
  }
}
spark.createDataFrame(predictionsRDD)
We use a FloatBuffer and a shape array to create the Tensor, as recommended in this issue.

Teradata extraction with pyspark2 is taking a long time

I am trying to extract the maximum date from a Teradata table through pyspark2. While the simple query runs in a few seconds on Teradata, in Spark it has not returned any answer after 1 hour of execution.
I am executing pyspark2 from the CLI, and I have already placed tdgssconfig.jar and terajdbc4.jar in the same location:
pyspark2 --conf spark.ui.port=45321 --jars tdgssconfig.jar,terajdbc4.jar
TD_QUERY = "(select max({a}) as max_date from {b}) as temp".format(a=Partition_Info, b=SOURCE_TABLE_VIEW)
df_td_date = spark.read\
    .format("jdbc")\
    .option("url", connection_url)\
    .option("driver", connection_driver)\
    .option("dbtable", TD_QUERY)\
    .option("user", user_name)\
    .option("password", pwd)\
    .load()
max_date_temp = df_td_date.collect()
Please let me know if I need to improve any part of this code.
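Not a fix in itself, but one way to see what actually gets pushed down to Teradata, and to tune the JDBC row transfer, is to print the physical plan and set an explicit fetchsize. A sketch reusing the same connection variables; the fetchsize value is only illustrative:

df_td_date = spark.read\
    .format("jdbc")\
    .option("url", connection_url)\
    .option("driver", connection_driver)\
    .option("dbtable", TD_QUERY)\
    .option("user", user_name)\
    .option("password", pwd)\
    .option("fetchsize", "10000")\
    .load()

# The plan shows the exact subquery sent to Teradata; the max() should
# appear inside the JDBC relation rather than as a Spark-side aggregate.
df_td_date.explain(True)

# The subquery returns a single row, so collect() is cheap once the
# query itself finishes on the Teradata side.
max_date_temp = df_td_date.collect()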

Spark process never ends when inserting into a Hive table

I am trying to append some rows (5 million rows / 2,800 columns) to a Hive table through Spark/Scala, but the process seems to get stuck after many hours. The logs don't show any errors.
How can I be sure the process is really running?
Is there something I can do to optimize the job?
My submit configs:
--driver-memory 15G
--executor-memory 30g
--num-executors 35
--executor-cores 5
Thanks!
// For each column of the target schema, keep the source column if it exists,
// otherwise fill it with a literal 0.0 under that column name.
def exprToAppend(myCols: Set[String], allCols: Set[String]) = {
  import org.apache.spark.sql.functions._
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(0d).as(x)
  })
}

// Build tableFinal with all of historico's columns (missing ones filled with 0.0),
// then select them in historico's column order before appending.
val insert: DataFrame = tableFinal
  .select(exprToAppend(tableFinal.columns.toSet, historico.columns.toSet): _*)
  .select(historico.columns.map(x => col(x)): _*)

insert.write.mode("append")
  .format("parquet")
  .insertInto(s"${Configuration.SIGLA}${Configuration.TABLE_HIST}")
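The Spark UI's stage and task view (or the YARN application UI) is the usual way to confirm that tasks are still progressing rather than hung. If the write stage is alive but slow, one commonly tried adjustment is to control the number of partitions written out. A minimal PySpark sketch of the idea; insert_df stands for the aligned DataFrame built above, the table name is hypothetical, and the partition count is just a value to tune:

# Hypothetical: insert_df plays the role of the 'insert' DataFrame above,
# already aligned to the target table's columns.
n_parts = 200  # arbitrary; tune toward (input size) / (desired file size)

(insert_df
    .repartition(n_parts)
    .write
    .mode("append")
    .insertInto("target_db.target_table"))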

How to effectively join large tables in SparkSql?

I am trying to improve the performance of a join involving two large tables using Spark SQL. From various sources, I figured that the RDDs need to be partitioned.
Source: https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by
However, when a DataFrame is loaded directly from a parquet file as shown below, I am not sure how it can be created as a paired (partitioned) RDD.
With Spark 2.0.1, using "cluster by" has no effect.
val rawDf1 = spark.read.parquet("file in hdfs")
rawDf1.createOrReplaceTempView("rawdf1")
val rawDf2 = spark.read.parquet("file in hdfs")
rawDf2.createOrReplaceTempView("rawdf2")
val rawDf3 = spark.read.parquet("file in hdfs")
rawDf3.createOrReplaceTempView("rawdf3")
val df1 = spark.sql("select * from rawdf1 cluster by key")
df1.createOrReplaceTempView("df1")
val df2 = spark.sql("select * from rawdf2 cluster by key")
df2.createOrReplaceTempView("df2")
val df3 = spark.sql("select * from rawdf3 cluster by key")
df3.createOrReplaceTempView("df3")
val resultDf = spark.sql("select * from df1 a inner join df2 b on a.key = b.key inner join df3 c on a.key = c.key")
Whether I use "cluster by" on the key or not, I still see the same query plan being generated by Spark. How can I create a paired RDD in Spark SQL so that the joins can use partitioned tables?
Without proper partitioning, a lot of shuffling happens, resulting in long delays.
Our configuration (5 worker nodes, 1 executor per node with 5 cores per executor; each node has 32 cores and 128 GB of RAM):
spark.cores.max 25
spark.default.parallelism 75
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.memory 60G
spark.rdd.compress True
spark.driver.maxResultSize 4g
spark.driver.memory 8g
spark.executor.cores 5
spark.executor.extraJavaOptions -Djdk.nio.maxCachedBufferSize=262144
spark.memory.storageFraction 0.2
To add more info: I am joining more than one table in the same select, using the same key across all tables, so it is not possible to create a dataframe first and call repartitionBy. I understand I can do this using the DataFrame API, but my question is how to accomplish it using plain Spark SQL.
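For completeness, the pre-shuffle can also be expressed in plain Spark SQL with DISTRIBUTE BY (or CLUSTER BY) plus spark.sql.shuffle.partitions, caching the distributed views before the join. Whether the planner then skips the extra exchange depends on the Spark version, so treat the following PySpark-wrapped SQL as a sketch to verify with explain(), not a guaranteed fix:

spark.conf.set("spark.sql.shuffle.partitions", "75")

spark.read.parquet("file in hdfs").createOrReplaceTempView("rawdf1")
spark.read.parquet("file in hdfs").createOrReplaceTempView("rawdf2")
spark.read.parquet("file in hdfs").createOrReplaceTempView("rawdf3")

# Shuffle each input by the join key once and cache the result, so the
# joins can potentially reuse the existing distribution.
for name in ("rawdf1", "rawdf2", "rawdf3"):
    clustered = spark.sql("select * from {} distribute by key".format(name))
    clustered.cache().createOrReplaceTempView(name.replace("raw", ""))

result = spark.sql(
    "select * from df1 a "
    "inner join df2 b on a.key = b.key "
    "inner join df3 c on a.key = c.key")
result.explain()  # check whether the Exchange nodes before the joins are gone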

How to assign Spark tasks in a balanced way?

My Spark is installed with CDH 5.8.0 and runs its applications on YARN. There are 5 servers in the cluster. One server is the resource manager; the other four are node managers. Each server has 2 cores and 8 GB of memory.
The Spark application's main logic is not complex: query a table from a Postgres DB, do some business processing for each record, and finally save the result back to the DB. Here is the main code:
// Read table1 from Postgres, splitting the read into 20 partitions on the "id" column.
String columnName = "id";
long lowerBound = 1;
long upperBound = 100000;
int numPartitions = 20;
String tableBasic = "select * from table1 order by id";
DataFrame dfBasic = sqlContext.read().jdbc(JDBC_URL, tableBasic, columnName,
        lowerBound, upperBound, numPartitions, dbProperties);

// Apply the per-record business logic.
JavaRDD<Result> rddResult = dfBasic.javaRDD().flatMap(new FlatMapFunction<Row, Result>() {
    public Iterable<Result> call(Row row) {
        List<Result> list = new ArrayList<Result>();
        ........
        return list;
    }
});

// Save the results back to the database.
DataFrame saveDF = sqlContext.createDataFrame(rddResult, Result.class);
saveDF = saveDF.select("id", "column 1", "column 2");
saveDF.write().mode(SaveMode.Append).jdbc(SQL_CONNECTION_URL, "table2", dbProperties);
I use this command to submit the application to YARN:
spark-submit --master yarn-cluster --executor-memory 6G --executor-cores 2 --driver-memory 6G --conf spark.default.parallelism=90 --conf spark.storage.memoryFraction=0.4 --conf spark.shuffle.memoryFraction=0.4 --conf spark.executor.memory=3G --class com.Main1 jar1-0.0.1.jar
There are 7 executors and 20 partitions. When the number of table records is small, for example less than 200,000, the 20 active tasks are assigned to the 7 executors in a balanced way, like this:
(screenshot: tasks assigned evenly across the executors)
But when the number of table records is huge, for example 1,000,000, the tasks are not assigned evenly. There is always one executor that runs for a long time while the other executors finish quickly, and some executors are not assigned any tasks at all, like this:
(screenshot: tasks assigned unevenly; one executor runs much longer than the others)
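The uneven assignment is consistent with how the JDBC source splits the column range: lowerBound and upperBound only define the partition stride and do not filter rows, so with upperBound=100000 every id above 100,000 falls into the last partition and becomes the single long-running task. A PySpark sketch of the same read with bounds that cover the real id range (variable names are hypothetical stand-ins for the Java ones above):

# Bounds only control the stride: stride = (upperBound - lowerBound) / numPartitions.
# Rows outside [lowerBound, upperBound] are NOT dropped; they all fall into the
# first or last partition, which is what creates the single slow task here.
df_basic = spark.read.jdbc(
    url=JDBC_URL,
    table="table1",
    column="id",
    lowerBound=1,
    upperBound=1000000,   # use the actual max(id) so the 20 partitions cover all rows
    numPartitions=20,
    properties=db_properties)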
