Finding Scheduler Delay for Spark - apache-spark

I want to generate a table of per-task metrics, like the one the Spark UI shows when you open a particular stage.
One of the columns is Scheduler Delay, which I cannot find in any of the REST APIs provided by Spark.
All the other columns exist (when I browse /api/v1/applications/[app-id]/stages/[stage-id]/[attempt]/taskList).
How is scheduler delay calculated, and is there a way for me to pull that data out without scraping the Spark UI web page?

Correct, the scheduler delay is not provided by the history API. For the UI, it is calculated as follows:
private[ui] def getSchedulerDelay(
    info: TaskInfo, metrics: TaskMetricsUIData, currentTime: Long): Long = {
  if (info.finished) {
    val totalExecutionTime = info.finishTime - info.launchTime
    val executorOverhead = metrics.executorDeserializeTime + metrics.resultSerializationTime
    math.max(
      0,
      totalExecutionTime - metrics.executorRunTime - executorOverhead -
        getGettingResultTime(info, currentTime))
  } else {
    // The task is still running and the metrics like executorRunTime are not available.
    0L
  }
}
See https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala (around line 770).
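If you want to reproduce the column outside the UI, the same arithmetic can be sketched in Python. This is only an illustration of the formula above, not Spark code: the function name and parameters are made up here, all values are assumed to be in milliseconds, and how you obtain each input (for example from the launch time and task metrics in the taskList JSON) depends on what your Spark version actually returns.

```python
def scheduler_delay_ms(finished, launch_time, finish_time,
                       executor_run_time, executor_deserialize_time,
                       result_serialization_time, getting_result_time=0):
    """Mirror of the UI formula; every argument is in milliseconds.

    The parameter names are hypothetical and simply follow the Scala snippet.
    """
    if not finished:
        # Running tasks have no final metrics yet, so the UI reports 0.
        return 0
    total_execution_time = finish_time - launch_time
    executor_overhead = executor_deserialize_time + result_serialization_time
    # Whatever is left after run time, (de)serialization and result fetching
    # is attributed to the scheduler; clamp at zero as the UI does.
    return max(0, total_execution_time - executor_run_time
               - executor_overhead - getting_result_time)
```

For a finished task that took 1000 ms overall with 700 ms of run time, 150 ms of serialization overhead and 20 ms of result fetching, this attributes the remaining 130 ms to scheduler delay.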

At least for Spark 1.6, if you are looking for the scheduling delay of a Spark Streaming batch, you can look at the Spark Streaming UI source code.
It uses a class, BatchUIData, in which scheduling delay is defined:
/**
 * Time taken for the first job of this batch to start processing from the time this batch
 * was submitted to the streaming scheduler. Essentially, it is
 * `processingStartTime` - `submissionTime`.
 */
def schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)

Related

pyspark - How to submit Spark SQL queries in parallel?

Hi, I have more than 1200 SQL queries and want to submit several of them in parallel, storing each result in a CSV file.
Since Python has the GIL, how can I submit them in parallel?
The other demos I have seen are all Scala-based Spark apps.
# return about 61K records
SQL = """
SELECT * FROM TEMP_VIEW WHERE index>=1 and index<=10;
"""
# return about 60K records
SQL2 = """
SELECT * FROM TEMP_VIEW WHERE index>=11 and index<=20;
"""
....
# this will use for loop to submit
Any suggestion will be super helpful! Thanks in advance!
According to Scheduling Within an Application, you can submit jobs from different threads. Example:
from threading import Thread  # threads share the driver's single Spark session

def submit_job(query):
    ...

for job in jobs:
    Thread(target=submit_job, args=(job,)).start()
But keep in mind that, by default, Spark’s scheduler runs jobs in FIFO fashion, so you should consider switching the scheduler mode (spark.scheduler.mode) to FAIR so that multiple jobs can share resources and run in parallel.
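With 1200+ queries you probably want a bounded pool rather than one thread per query. The following is a minimal sketch under that assumption; `run_queries` and `run_one` are names invented for this example, and the Spark call in the trailing comment (including the output path) is only a suggestion of what `run_one` might do with an existing session called `spark`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_queries(queries, run_one, max_workers=8):
    """Call run_one(name, sql) for each entry of `queries` on a thread pool.

    Returns a dict mapping each name to its result, or to the exception that
    the call raised, so one failing query does not hide the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, name, sql): name
                   for name, sql in queries.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going; record the failure
                results[name] = exc
    return results


# Hypothetical usage with an existing SparkSession named `spark`:
# def run_one(name, sql):
#     spark.sql(sql).write.mode("overwrite").csv("/tmp/out/" + name)
#     return name
# run_queries({"q1": SQL, "q2": SQL2}, run_one)
```

The driver threads spend most of their time waiting on the JVM, so the GIL should rarely be the limiting factor here; `max_workers` caps how many Spark jobs are in flight at once.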

How to start multiple streaming queries in a single Spark application?

I have built a few Spark Structured Streaming queries to run on EMR. They are long-running ETL-type queries that need to run at all times. When I submit a job to the YARN cluster on EMR, I can submit a single Spark application, so that application has to contain multiple streaming queries.
I am confused about how to build and start multiple streaming queries programmatically within the same submission.
For example, I have this code:
case class SparkJobs(prop: Properties) extends Serializable {
  def run() = {
    Type1SparkJobBuilder(prop).build().awaitTermination()
    Type1SparkJobBuilder(prop).build().awaitTermination()
  }
}
I fire this in my main class with SparkJobs(new Properties()).run().
When I look at the Spark history server, only the first streaming job (Type1SparkJob) is running.
What is the recommended way to start multiple streaming queries programmatically within the same spark-submit? I could not find proper documentation either.
Since you're calling awaitTermination on the first query, it blocks until that query completes, so the second query is never started. Instead, start both queries and then use StreamingQueryManager.awaitAnyTermination.
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
In addition to the above, by default Spark uses the FIFO scheduler, which means the first query gets all resources in the cluster while it is executing. Since you are trying to run multiple queries concurrently, you should switch to the FAIR scheduler.
If some queries should get more resources than the others, you can also tune the individual scheduler pools.
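The ordering that matters here (start everything first, block once afterwards) can be sketched in Python. This is a hedged illustration rather than a recommended API: `start_all_then_wait` is a name invented for the example, each entry of `start_fns` is assumed to be a zero-argument function that ends in a non-blocking `writeStream...start()` call, and `streams` is assumed to behave like PySpark's `spark.streams` query manager.

```python
def start_all_then_wait(start_fns, streams):
    """Start every streaming query, then block once on the query manager.

    start_fns: zero-argument callables that each start one query
               (hypothetical, e.g. lambda: df.writeStream.start()).
    streams:   an object offering awaitAnyTermination(), such as spark.streams.
    """
    # start() does not block, so all queries are running after this line.
    queries = [start() for start in start_fns]
    # Block only now; this returns (or raises) when any one query terminates.
    streams.awaitAnyTermination()
    return queries
```

Calling awaitTermination() inside the loop instead would reproduce the original problem, because the first query would block before the second one is ever started.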
val query1 = ds.writeStream.{...}.start()
val query2 = ds.writeStream.{...}.start()
val query3 = ds.writeStream.{...}.start()
query3.awaitTermination()
awaitTermination() blocks your process until the query finishes, which never happens in a streaming app. Calling it only on your last query should fix your problem.

Building a thread pool in a Spark Streaming program

To avoid delays and speed up processing, I built a thread pool inside the Spark Streaming job. The main program is as follows:
stream.foreachRDD(rdd => {
  rdd.foreachPartition { rddPartition => {
    val client: Client = ESClient.getInstance.getClient
    var num = Random.nextInt()
    val threadPool: ExecutorService = Executors.newFixedThreadPool(5)
    val confs = new Configuration()
    rddPartition.foreach(x => {
      threadPool.execute(new esThread(x._2, num, client, confs))
    })
  }}
})
esThread first queries Elasticsearch, takes the query result, and finally writes the result to HDFS. However, we find that the result file in HDFS is missing a lot of data; only a little is left. I wonder whether we can build a thread pool inside Spark Streaming. Does the thread pool cause some data to go missing?
Thanks for your help.
Partitions are already processed by separate threads, and the stream won't proceed to the next batch until the previous one has finished. So a thread pool is unlikely to buy you anything, and it makes resource usage harder to track.
At the same time, as your code is implemented at the moment, you are likely to lose data. Since nothing waits for threadPool to finish (there is no shutdown followed by awaitTermination), the parent thread might exit before all data has been processed.
Overall, it is not a useful approach. If you want to increase throughput, you should tune the number of partitions and the amount of computing resources.
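To make the data-loss point concrete, here is a small Python sketch of a partition handler that does wait for its pool. It only illustrates the waiting behaviour and is not a translation of the Scala code above; `process_partition` and `handle` are names invented for the example. In the JVM version the corresponding step would be calling shutdown() and then awaitTermination(...) on the ExecutorService before the partition function returns.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(records, handle, workers=5):
    """Run handle(record) for every record and return only when all are done."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator, so every task has completed (and any
        # exception has been re-raised) before the function returns.
        results = list(pool.map(handle, records))
    # Leaving the with-block also shuts the pool down and waits for its threads.
    return results
```

Without that wait, the function handed to foreachPartition can return while tasks are still queued, which matches the symptom of a result file that contains only part of the data.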

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job.
sys.ShutdownHookThread {
  logger.info("Gracefully stopping Application...")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  logger.info("Application stopped gracefully")
}
It seems to work but does not look like the cleanest way to stop the job. Am I missing something here?
From a code perspective it may make sense but how do you use this in a cluster environment? If we start a spark streaming job (we distribute the jobs on all the nodes in the cluster) we will have to keep track of the PID for the job and the node on which it was running. Finally when we have to stop the process, we need to keep track which node the job was running at and the PID for that. I was just hoping that there would be a simpler way of job control for streaming jobs.
You can stop your streaming context in cluster mode by running the following command, without needing to send a SIGTERM. This stops the streaming context without you having to stop it explicitly from a shutdown hook.
$SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID
- $MASTER_REST_URL is the REST URL of the Spark master, i.e. something like spark://localhost:6066
- $DRIVER_ID is something like driver-20150915145601-0000
If you want spark to stop your app gracefully, you can try setting the following system property when your spark app is initially submitted (see http://spark.apache.org/docs/latest/submitting-applications.html on setting spark configuration properties).
spark.streaming.stopGracefullyOnShutdown=true
This is not officially documented, and I gathered this from looking at the 1.4 source code. This flag is honored in standalone mode. I haven't tested it in clustered mode yet.
I am working with spark 1.4.*
It depends on the use case and on how the driver can be used.
Consider the case where you want to collect N records (tweets) from Spark Structured Streaming, store them in PostgreSQL, and stop the stream once the count exceeds N records.
One way of doing this is to use an accumulator and Python threading:
Create a Python thread that receives the streaming query object and the accumulator, and stops the query once the count is exceeded.
When starting the streaming query, pass in the accumulator and update its value for each batch of the stream.
Sharing a code snippet for illustration:
import threading
import time

# Excerpt from a class: `self` holds the PostgreSQL connection settings,
# and print_info is a logging helper that is not shown here.
def check_n_stop_streaming(self, query, acc, num_records=3500):
    # Poll the accumulator once per second and stop the query past the threshold.
    while True:
        print_info(f"Number of records received so far {acc.value}")
        if acc.value > num_records:
            query.stop()
            break
        time.sleep(1)
...
count_acc = spark.sparkContext.accumulator(0)
...
def postgresql_all_tweets_data_dump(self,
                                    df,
                                    epoch_id,
                                    raw_tweet_table_name,
                                    count_acc):
    print_info("Raw Tweets...")
    df.select(["text"]).show(50, False)
    count_acc.add(df.count())  # runs on the driver once per micro-batch
    mode = "append"
    url = "jdbc:postgresql://{}:{}/{}".format(self._postgresql_host,
                                              self._postgresql_port,
                                              self._postgresql_database)
    properties = {"user": self._postgresql_user,
                  "password": self._postgresql_password,
                  "driver": "org.postgresql.Driver"}
    df.write.jdbc(url=url, table=raw_tweet_table_name, mode=mode, properties=properties)
...
query = tweet_stream.writeStream.outputMode("append"). \
    foreachBatch(lambda df, id:
                 self.postgresql_all_tweets_data_dump(df=df,
                                                      epoch_id=id,
                                                      raw_tweet_table_name=raw_tweet_table_name,
                                                      count_acc=count_acc)).start()
# Pass the accumulator so the arguments match check_n_stop_streaming's signature.
stop_thread = threading.Thread(target=self.check_n_stop_streaming,
                               args=(query, count_acc, num_records))
stop_thread.daemon = True
stop_thread.start()
query.awaitTermination()
stop_thread.join()
If all you need is to stop a running streaming application, the simplest way is via the Spark admin UI (you can find its URL in the startup logs of the Spark master).
There is a section in the UI that shows running streaming applications, with small (kill) links next to each application ID.
It is officially documented now; see the Apache documentation here:
http://spark.apache.org/docs/latest/configuration.html#spark-streaming

Spark SQL + Streaming issues

We are trying to implement a use case with Spark Streaming and Spark SQL that lets us run user-defined rules against some data (see below for how the data is captured and used). The idea is to use SQL to specify the rules and return the results to the users as alerts. Executing the query on each incoming event batch seems to be very slow. We would appreciate it if anyone could suggest a better approach to implementing this use case. We would also like to know whether Spark executes the SQL on the driver or on the workers. Thanks in advance. These are the steps we perform:
1) Load the initial dataset from an external database as a JDBCRDD
JDBCRDD<SomeState> initialRDD = JDBCRDD.create(...);
2) Create an incoming DStream (that captures updates to the initialized data)
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createStream(ssc, flumeAgentHost, flumeAgentPort);
JavaDStream<SomeState> incomingDStream = flumeStream.map(...);
3) Create a Pair DStream using the incoming DStream
JavaPairDStream<Object,SomeState> pairDStream =
incomingDStream.map(...);
4) Create a Stateful DStream from the pair DStream using the initialized RDD as the base state
JavaPairDStream<Object,SomeState> statefulDStream = pairDStream.updateStateByKey(...);
JavaRDD<SomeState> updatedStateRDD = statefulDStream.map(...);
5) Run a user-driven query against the updated state based on the values in the incoming stream
incomingDStream.foreachRDD(new Function<JavaRDD<SomeState>,Void>() {
    @Override
    public Void call(JavaRDD<SomeState> events) throws Exception {
        updatedStateRDD.count();
        SQLContext sqx = new SQLContext(events.context());
        DataFrame schemaDf = sqx.createDataFrame(updatedStateRDD, SomeState.class);
        schemaDf.registerTempTable("TEMP_TABLE");
        sqx.sql("SELECT col1 from TEMP_TABLE where <condition1> and <condition2> ...");
        //collect the results and process and send alerts
        ...
    }
});
The first step should be to identify which step takes most of the time.
Check the Spark Master UI to see which step or phase is taking the most time.
Here are a few best practices, plus my own observations, that you can consider:
- Use a singleton SQLContext. See the example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
- updateStateByKey can be a memory-intensive operation when there is a large number of keys. Check the size of the data processed by the updateStateByKey function and whether it fits in the available memory.
- How is your GC behaving?
- Are you really using "initialRDD"? If not, do not load it. If it is a static dataset, cache it.
- Check the time taken by your SQL query too.
Here are a few more questions/areas that can help:
- What is the StorageLevel for the DStreams?
- What are the size and configuration of the cluster?
- Which version of Spark are you using?
Lastly, foreachRDD is an output operation whose function is executed on the driver, but the RDD actions invoked inside that function are executed on the worker nodes.
For a better explanation of output operations, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
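The singleton SQLContext suggestion above boils down to creating the context once and reusing it in every batch, instead of constructing a new one inside each foreachRDD call. A minimal Python sketch of that lazy-singleton idea follows; `get_or_create` and `factory` are names invented for the example, and what the factory builds (for instance a SQLContext from the batch's SparkContext) is left to the caller.

```python
import threading

_instance = None
_lock = threading.Lock()


def get_or_create(factory):
    """Return the shared instance, calling factory() only on first use."""
    global _instance
    if _instance is None:
        with _lock:
            # Re-check inside the lock so two threads cannot both create it.
            if _instance is None:
                _instance = factory()
    return _instance


# Hypothetical usage inside a per-batch function:
# sql_context = get_or_create(lambda: SQLContext(rdd.context))
```

Every batch then pays only for a cheap lookup rather than for building a new context.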
I am facing the same issue too. Could you please let me know if you found a solution? I have described my detailed use case in the post below.
Spark SQL + Window + Streming Issue - Spark SQL query is taking long to execute when running with spark streaming

Resources