Retrieve the time it took for Spark to finish the job

I need to time some things in Spark, e.g. how long it takes to read my file, so I like to use sc.setLogLevel("INFO") to enable extra information printed to the screen. One message I find really useful looks like this:
2018-12-18 02:05:38 INFO DAGScheduler:54 - Job 2 finished: count at <console>:26, took 9.555080 s
because it tells me how long something took.
Is there any way to get this programmatically (preferably in Scala)? Right now I just copy the result and save it in a text file.

You can create something like:
import scala.concurrent.duration._

case class TimedResult[R](result: R, durationInNanoSeconds: FiniteDuration)

def time[R](block: => R): TimedResult[R] = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  val duration = t1 - t0
  TimedResult(result, duration.nanoseconds)
}
and then use it to time your code block:
val timedResult = time {
  someDataframe.count()
}
println(s"Count of records: ${timedResult.result}")
println(s"Time taken: ${timedResult.durationInNanoSeconds.toSeconds} s")

There are two ways to capture the logging from your Spark program.
a) You can redirect your console output to a desired file while using the spark-submit command.
spark-submit your_code_file > logfile.txt 2>&1
b) Separate log4j.properties files can be created for the driver and the executors; when issuing the spark-submit command, include them by passing their paths in the Java options for the driver and executors:
spark-submit --class MAIN_CLASS --driver-java-options "-Dlog4j.configuration=file:LOG4J_PATH" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:LOG4J_PATH" --master MASTER_IP:PORT JAR_PATH
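For option b), a minimal log4j.properties sketch that sends everything at INFO and above to a file could look like the following (the output path is only an example):
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark-driver.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n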

Related

Status of structured streaming query in PySpark

I am following the book Spark - The Definitive Guide and was writing a basic program that streams data. The book says I should use the awaitTermination() method to process the query correctly. When I run the code below, it runs indefinitely until I press Ctrl+C, and then it ends with an exception. My question is: how can I monitor the status of my streaming query, so that as soon as the stream completes my program exits after showing the output? In the example code below, as soon as it reads all the files and writes them to the console, it should have ended, but it didn't. I also tried inserting activityQuery.stop(), but that didn't work either. How can I achieve this? Any help would be appreciated.
from pyspark import SparkConf
from pyspark.sql import *
from pyspark.sql.functions import *
from time import sleep
conf = SparkConf()
spark = SparkSession.builder.config(conf=conf).appName('testapp').getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.streaming.schemaInference", "true")
static = spark.read.format("json").load("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
dataSchema = static.schema
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1).json("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
activityCounts = streaming.groupBy("gt").count()
activityQuery = activityCounts.writeStream.queryName("activity_counts").format("console").outputMode("complete").start()
activityQuery.awaitTermination()
for x in range(5):
    spark.sql("select * from activity_counts").show()
    sleep(1)

How to use spark to write to HBase using multi-thread

I'm using spark to write data to HBase, but at the writing stage, only one executor and one core are executing.
I wonder why my code is not writing in parallel, and what I should do to make it write faster.
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)
def hbaseWrite(ss: SparkSession, tableList: List[String], df: DataFrame): Unit = {
  val tableName = tableList(0)
  val rowKeyName = tableList(4)
  val rowKeyType = tableList(5)
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
  // write to HBase
  val sc = ss.sparkContext
  sc.hadoopConfiguration.addResource(hbaseConf)
  val columns = df.columns
  val result = df.rdd.mapPartitions(par => {
    par.map(row => {
      var rowkey: String = ""
      if ("String".equals(rowKeyType)) {
        rowkey = row.getAs[String](rowKeyName)
      } else if ("Long".equals(rowKeyType)) {
        rowkey = row.getAs[Long](rowKeyName).toString
      }
      val put = new Put(Bytes.toBytes(rowkey))
      for (name <- columns) {
        var value = row.get(row.fieldIndex(name))
        if (value != null) {
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(name), Bytes.toBytes(value.toString))
        }
      }
      (new ImmutableBytesWritable, put)
    })
  })
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You may not be able to control how many parallel executors write to HBase from this code. However, you can start multiple Spark jobs from a multi-threaded client program.
For example, you can have a shell script that triggers multiple spark-submit commands to induce parallelism. Each Spark job can work on its own set of data, independent of the others, and push it into HBase.
This can also be done with the Spark Java/Scala SparkLauncher API combined with the Java concurrency API (e.g. the Executor framework); see the sketch after the snippet below.
val sparkLauncher = new SparkLauncher
// Set Spark properties. Only basic ones are shown here; they will be overridden
// if the same properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")
// Launch the Spark application
val sparkLauncher1 = sparkLauncher.startApplication()
// Get the application id
val jobAppId = sparkLauncher1.getAppId
// Get the status of the launched application. This loop will continually print
// states like SUBMITTED, RUNNING, etc.
while (true) {
  println(sparkLauncher1.getState().toString)
}
However, the challenge is to track each of them for failures and automatic recovery. It can be tricky, especially when partial data has already been written into HBase, i.e. a job fails to process the complete set of data assigned to it. You may have to clean that data from HBase automatically before retriggering.
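As a rough illustration of combining SparkLauncher with the Executor framework, here is a sketch; the paths, main class, and chunk arguments are placeholders rather than working values, and each launched application is assumed to take the chunk it should process as its first argument:
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.launcher.SparkLauncher

// Launch one Spark application per input chunk and wait for all of them to finish.
val chunks = Seq("chunk-0", "chunk-1", "chunk-2")
val pool = Executors.newFixedThreadPool(chunks.size)

chunks.foreach { chunk =>
  pool.submit(new Runnable {
    override def run(): Unit = {
      val handle = new SparkLauncher()
        .setAppResource("/path/to/jar/to/be/executed")
        .setMainClass("MainClassName")
        .setMaster("yarn")
        .setDeployMode("cluster")
        .addAppArgs(chunk)               // tells this job which slice of data to write
        .startApplication()
      // Poll until the launched application reaches a terminal state.
      while (!handle.getState.isFinal) Thread.sleep(5000)
      println(s"$chunk finished with state ${handle.getState}")
    }
  })
}

pool.shutdown()
pool.awaitTermination(1, TimeUnit.HOURS)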

Spark streaming NetworkWordCount example creates multiple jobs per batch

I am running the basic NetworkWordCount program on a YARN cluster through spark-shell. Here is my code snippet -
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.socketTextStream("172.26.32.34", 9999, StorageLevel.MEMORY_ONLY)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
The output on the console and the stats on the Streaming tab are as expected.
But when I look at the Jobs tab, two jobs get triggered per 1-minute batch interval. Shouldn't it be one job per interval? Screenshot below -
Now when I look at the completed batches on the Streaming UI, I see exactly one batch per minute. Screenshot below -
Am I missing something? Also, I noticed that the start job has two stages with the same name that spawn a different number of tasks, as seen in the image below. What exactly is happening here?

How to control Spark application per task/stage/job attempt?

I'd like to stop Spark from retrying a Spark application when a particular exception is thrown. I only want to limit the number of retries in case certain conditions are met; otherwise, I want the default number of retries.
Note that there is only one Spark job which a Spark application runs.
I tried setting javaSparkContext.setLocalProperty("spark.yarn.maxAppAttempts", "1"); in the case of the exception, but it still retries the whole job.
I submit the Spark application as follows:
spark-submit --deploy-mode cluster theSparkApp.jar
I have a use case where I want to delete the output if it is created by the previous retry of the same job, but fail the job if the output folder is not empty (in 1st retry). Can you think of any other way to achieve this?
You can use TaskContext to control how your Spark job behaves given, say, the number of retries as follows:
import org.apache.spark.TaskContext

val rdd = sc.parallelize(0 to 8, numSlices = 1)

def businessCondition(ctx: TaskContext): Boolean = {
  ctx.attemptNumber == 0
}

val mapped = rdd.map { n =>
  val ctx = TaskContext.get
  if (businessCondition(ctx)) {
    println("Failing the task because business condition is met")
    throw new IllegalArgumentException("attemptNumber == 0")
  }
  println("It's ok to proceed -- business condition is NOT met")
  n
}
mapped.count

Getting parameters of Spark submit while running a Spark job

I am running a Spark job via spark-submit and using its --files parameter to load a log4j.properties file.
In my Spark job I need to get the value of this parameter:
object LoggerSparkUsage {

  def main(args: Array[String]): Unit = {
    // DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
my spark submit is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the properties using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
A correct key for files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can also pass these paths as application arguments and retrieve them from the main method's args, or use a custom key in spark.conf.
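Putting the two together, a small sketch (assuming the job was submitted with --files src/main/resources/log4j.properties as above):
import scala.util.Try
import org.apache.spark.SparkFiles

// The comma-separated list of files passed via --files, as recorded in the Spark configuration.
val submittedFiles = Try(sc.getConf.get("spark.files")).getOrElse("")
println(s"spark.files = $submittedFiles")

// The local path of the distributed copy on whichever node runs this code.
val localLog4j = SparkFiles.get("log4j.properties")
println(s"local copy of log4j.properties: $localLog4j")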
